Search engine love: now they crawl me, now they don’t

Posted by admin on February 21, 2007 in Link Building, Link Theory

In “PageRank: Where it helps, where it doesn’t help, and other facts” I noted that “one of the classic crawling strategies that Google has used is the amount of PageRank on your pages” (Source: Matt Cutts). That, of course, is only part of the story.

Another part of the story is that Google does not use external PageRank (what you see in the Toolbar) for these crawling strategies.

Yet another part of the story is that, after Google’s Bigdaddy architecture took over completely on March 29, 2006, Matt warned people that “if you were getting crawled more before and you’re trading a bunch of reciprocal links, don’t be surprised if the new crawler has different crawl priorities and doesn’t crawl as much”.

On May 16, in a followup comment to that post, Matt shared the following insight into one visitor’s concern about whether Google was aware of his forum: “it’s by design in Bigdaddy that we crawl somewhat more than we index in Bigdaddy. If you index everything that you crawl, you never know what you might be missing by crawling a little more, for example. I see at least one indexed post from your forum, so the fact that we’ve been visiting those pages is a good indicator that we’re aware of those pages, and they may be incorporated in the index in the future” (emphasis is mine).

It’s curious that Matt said “if you index everything that you crawl, you never know what you might be missing by crawling a little more”. That really makes no sense, given that historically crawlers (spiders) have not been responsible for indexing content. However, the indexing process can only begin after the documents have been parsed (broken down) — so do the Google crawling servers handle the parsing tasks as well as page fetches?

In October Matt wrote: “[Bigdaddy] brought smarter Googlebot crawling, including tricks like full gzip support and a crawl caching proxy that means less bandwidth usage for site owners”. In the same post he quickly added, “We used the summer to swap in a completely new architecture for Supplemental Results. The core of that infrastructure is complete and fully deployed, but I’m sure we’ll see additional smaller changes (mostly making sure that queries off the beaten path such as site: do what people expect).”

That must be the same change that SE Roundtable noticed in September.

And later on in the blog post I just cited Matt said “PageRank is the primary factor determining whether a url is in the main web index vs. the supplemental results, so I’d concentrate on good backlinks more than worrying about varying page layouts, etc.” As I have mentioned elsewhere, Matt clarified for me that when he speaks of “PageRank” he is talking about Internal PageRank. Matt usually refers to the Toolbar PageRank as External PageRank.

For example, on October 2, 2006 he wrote, “our internal PageRank computations have many more degrees of resolution than the 0-10 values shown in the toolbar.” Further on in the same post he said, “By the time you see newer PageRanks in the toolbar, those values have already been incorporated in how we score/rank our search results. So while you may be happy to see that the Google Toolbar shows a little more PageRank for a given page, it’s not as if that causes a change in search results at that point….”

On May 10, 2006 (Google Press Day) Matt played the role of blogger/journalist while other Google representatives shared insight into the mega giant’s services. One early note follows Alan Eustace (V.P. of Engineering) “through the life of a query! He runs through the need to crawl, index, and then score relevant results. ‘Speed matters.’ With 8 billion pages, it would take 253 years … to fetch pages if you fetch one page per second. … Alan talks about duplicate pages… can vary from 30-50% of pages with a naive approach. Alan notes that you have to avoid infinite loops such as calendars. Freshness matters. Size matters, especially with long-tail queries” (emphasis is mine).
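The 253-year figure in that quote is easy to check. Here is a quick sketch of the arithmetic; the page count and fetch rate come from the quote, while the length of a year in seconds (365.25 days) is my own assumption.

```python
# Back-of-the-envelope check of the "253 years" figure quoted above.
PAGES = 8_000_000_000                      # 8 billion pages, from the quote
FETCHES_PER_SECOND = 1                     # one page per second, from the quote
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # assumption: average calendar year


def years_to_crawl(pages: int, fetches_per_second: float) -> float:
    """Return how many years a strictly sequential crawl would take."""
    return pages / fetches_per_second / SECONDS_PER_YEAR


years = years_to_crawl(PAGES, FETCHES_PER_SECOND)
print(f"{years:.1f} years")
```

The result lands between 253 and 254 years, which agrees with the number Alan Eustace gave.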

There is an art to getting a site crawled, and Google took a cue from classic spam hallway/doorway organization in suggesting that people submit multiple XML sitemaps for sections of their sites. That is, if you have 1,000,000 pages on your site, you cannot include them all in a single XML sitemap file anyway. Each XML sitemap is limited to 50,000 URLs.
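To illustrate the 50,000-URL limit, here is a minimal sketch of splitting a large URL list into several sitemap files plus an index file that points to them. The example.com URLs and the sitemap-N.xml file names are hypothetical, and the namespace is the one published with the sitemaps.org 0.9 protocol. Treat it as one possible approach, not as anything Google prescribes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit mentioned above


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_urlset(urls):
    """Build the <urlset> element for a single sitemap file."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return urlset


def build_index(sitemap_locations):
    """Build a <sitemapindex> element that lists each sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locations:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return index


# Hypothetical site with 120,000 page URLs.
urls = [f"https://www.example.com/page-{n}.html" for n in range(120_000)]
chunks = chunk_urls(urls)
index = build_index(
    f"https://www.example.com/sitemap-{i}.xml"   # hypothetical file names
    for i in range(1, len(chunks) + 1)
)
xml_text = ET.tostring(build_urlset(chunks[-1]), encoding="unicode")
print(len(chunks), "sitemap files")
```

You could just as easily chunk by site section instead of by position in the list, which is closer to the sectional approach described above.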

Vanessa Fox suggested at one point that people should divide their page URLs into content that is updated more often and content that is updated less often. Of course, many people have complained that their pages dropped out of Google after they submitted XML sitemap files. While I have never seen anyone prove a connection, it’s a widely expressed concern that holds many people back from submitting XML sitemap files to Google.

Note: It would be easy enough to test whether submitting a sitemap causes Google to drop pages from its index. Just remove the sitemap, let the pages come back, and then submit the sitemap again. If the pages drop out again, then you have shown a connection (although not a cause).

An alternative to the XML sitemap file is to include an HTML sitemap on your Web site. However, people often shoot themselves in the foot with HTML sitemaps.

Two of the most common mistakes people make with on-site HTML sitemaps are that they don’t have every page on the site link to the sitemap page and they don’t keep the sitemap page lean. The best HTML sitemap pages are provided as HTML list elements: no CSS, no Javascript, no descriptive text, just a list of static HTML links. Your sitemap page should be structured so that a person can quickly scan it to find what they are looking for. And that means using meaningful anchor text, not spammy “please pass value to my destination pages” anchor text that consists only of keywords. Your mileage may vary.

A tiered sitemap structure — in either XML or HTML format — is necessary for large content sites but is also helpful for sites with fewer than 50,000 page URLs anyway. If you have 10,000 pages on your site, you definitely need a tiered HTML sitemap structure. The neat thing about these kinds of mapping pages is that you can replicate portions of the data throughout the whole Web site without duplicating content. Each section of a site can have its own HTML sitemap page(s) and the master HTML sitemap page can link to all of those sitemap pages as well as to other portions of the site.
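Here is a rough sketch of what a tiered HTML sitemap generator might look like: one lean list page per section and a master page that links to each of them. The section names, URLs, and the sitemap-N.html naming scheme are all made up for illustration.

```python
from html import escape


def render_list(links):
    """Render (url, anchor_text) pairs as a bare HTML list of static links."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(text)}</a></li>'
        for url, text in links
    )
    return f"<ul>\n{items}\n</ul>"


def build_tiered_sitemap(sections):
    """Return {filename: html} for a master sitemap plus one page per section.

    `sections` maps a section title to a list of (url, anchor_text) pairs.
    """
    pages = {}
    master_links = []
    for n, (title, links) in enumerate(sections.items(), start=1):
        filename = f"sitemap-{n}.html"  # hypothetical naming scheme
        pages[filename] = render_list(links)
        master_links.append((filename, title))
    pages["sitemap.html"] = render_list(master_links)
    return pages


# Hypothetical two-section site.
sections = {
    "Widgets": [("/widgets/blue.html", "Blue widgets"),
                ("/widgets/red.html", "Red widgets")],
    "Gadgets": [("/gadgets/mini.html", "Mini gadgets")],
}
pages = build_tiered_sitemap(sections)
print(pages["sitemap.html"])
```

The output is deliberately nothing more than list markup, which you would drop into your normal page template so that the sitemap pages still carry the site-wide navigation links.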

The more interlinked your large content site is, the more often your most important content will be crawled. Simply relying upon a sitemap for crawling is inefficient. You should have two types of internal navigation links on every page: links to local content (usually in the same directory or sub-domain) and links to high-value content (such as the root URL, section front pages, sitemap pages, contact information pages, and special feature pages).

All too often people become obsessed with PageRank and place as few links on a page as possible so as to “preserve PageRank” or “concentrate PageRank”. This is a self-defeating strategy as it does not take into consideration the lag time that falls between the fetch of a random page and the fetch of a high-value page that may be (indirectly) linked to by the random pages.

Your internal linking strategy should focus on getting crawlers to high value pages as soon and as frequently as possible. If you have only 1,000 pages on your site, then aiming 1,000 internal navigation links at the root URL and another 1,000 links at the HTML sitemap page will increase the odds of those pages being crawled and reindexed considerably. If you only point 1-way links from your HTML sitemap to your section front pages it will take forever for that HTML sitemap page to be found and crawled. And it won’t help your visitors much, either.

If you verify your site with Webmaster Central’s Webmaster Tools (or with Yahoo! Site Explorer) you can see which links the search engine has found pointing toward your pages. The Google link report is presently better designed than Yahoo!’s link report, except in that Yahoo! lets you look at just the domains that link to a site (or page).

Looking at how many internal links point to any of your pages will tell you how strong your internal linkage is. A good rule of thumb is that every page should have at least two internal links pointing to it: one from the HTML sitemap page and one from a front page, either the root URL or a section front page. More important pages will naturally accrue higher link counts from child and sibling pages, as well as from special cross-promotion.
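If you have your own list of which pages link to which (from a crawl of your site, say), the two-link rule of thumb can be checked with a few lines of code. This is only a sketch: the link graph below is invented, and the function names are mine.

```python
from collections import Counter


def inbound_counts(link_graph):
    """Count distinct internal pages linking to each page.

    `link_graph` maps a page URL to the list of internal URLs it links to.
    Self-links and repeated links from the same page are not counted.
    """
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def weakly_linked(link_graph, min_links=2):
    """Return pages with fewer than `min_links` internal inbound links."""
    counts = inbound_counts(link_graph)
    return sorted(page for page, n in counts.items() if n < min_links)


# Hypothetical six-page site.
link_graph = {
    "/": ["/sitemap.html", "/widgets/", "/about.html"],
    "/sitemap.html": ["/", "/widgets/", "/widgets/blue.html", "/about.html"],
    "/widgets/": ["/", "/sitemap.html", "/widgets/blue.html", "/widgets/red.html"],
    "/widgets/blue.html": ["/", "/sitemap.html", "/widgets/"],
    "/widgets/red.html": ["/", "/sitemap.html", "/widgets/"],
    "/about.html": ["/", "/sitemap.html"],
}
weak = weakly_linked(link_graph)
print(weak)
```

In this made-up example the red widgets page is linked only from its section front page because the sitemap forgot it, so it is the one page that falls short of the rule.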

In fact, a secondary priority for your internal linking strategy should be to get the crawlers to your section front pages (and secondary HTML sitemap pages, if you have them) as quickly as possible. High value pages should link to intermediate value pages and intermediate value pages should link to low value pages. The more balanced (evenly distributed) your link tree is, the more likely all pages are to be crawled.
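One way to get a feel for how balanced your link tree is: count how many link hops it takes to reach each page from the root URL. The sketch below does that with a breadth-first walk. Using hop count as a stand-in for how quickly a crawler reaches a page is my own simplification, and the sample graph is invented.

```python
from collections import deque


def click_depths(link_graph, start="/"):
    """Return the minimum number of link hops from `start` to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: two sections, one deep page, one page nothing links to.
link_graph = {
    "/": ["/widgets/", "/gadgets/"],
    "/widgets/": ["/widgets/blue.html", "/widgets/red.html"],
    "/gadgets/": ["/gadgets/mini.html"],
    "/gadgets/mini.html": ["/gadgets/mini-specs.html"],
    "/orphan.html": ["/"],
}
depths = click_depths(link_graph)
unreachable = set(link_graph) - set(depths)
print(max(depths.values()), sorted(unreachable))
```

Pages that sit much deeper than their siblings, or that never show up in the walk at all, are the ones to look at first.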

But if you identify sections of your site that are not well-crawled, you can create new internal linking structures (such as a feature article that summarizes each of the unindexed pages) to help the crawlers find your content. These linking structures need to be treated as more important than random content pages, about equal with section front pages.

Your external link profile should tell you a little bit about which pages are most likely to be discovered by the search engines. You can set up a simple scoring system to weight the strength of your inbound links this way: each link has a value of 1 if it is an unimpaired static HTML link (“impaired” links are nofollowed, embedded in Javascript, or otherwise designed not to pass value). Each unimpaired linking page gets an additional point if all or most of its siblings appear in the search engine index. Each unimpaired linking page gets an additional point if you see strong in-site navigation.

The more 3-point external links your pages receive, the better.
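The scoring system is simple enough to keep in a spreadsheet, but here is a small sketch of it in code. The field names and sample links are hypothetical, and giving impaired links a score of 0 is my reading of the rule rather than something spelled out in it. The three yes/no judgments still have to come from you looking at each linking page.

```python
from dataclasses import dataclass


@dataclass
class InboundLink:
    source_url: str          # page that links to you
    target_url: str          # your page that receives the link
    impaired: bool           # nofollowed, in Javascript, or otherwise blocked
    siblings_indexed: bool   # most of the linking page's siblings are indexed
    strong_navigation: bool  # linking site has strong in-site navigation


def score_link(link: InboundLink) -> int:
    """Score a link from 0 to 3 using the point system described above."""
    if link.impaired:
        return 0  # assumption: impaired links earn nothing
    return 1 + int(link.siblings_indexed) + int(link.strong_navigation)


def three_point_targets(links):
    """Return your pages that receive at least one 3-point link."""
    return sorted({link.target_url for link in links if score_link(link) == 3})


# Hypothetical inbound links.
links = [
    InboundLink("http://a.example/review.html", "/widgets/blue.html", False, True, True),
    InboundLink("http://b.example/links.html", "/", False, False, True),
    InboundLink("http://c.example/forum?p=9", "/widgets/red.html", True, True, True),
]
scores = [score_link(link) for link in links]
print(scores, three_point_targets(links))
```

The list of 3-point targets is the list of pages where a few promotional links to other content on your site will do the most good.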

You can improve your site’s crawling by placing a few promotional links to other content on your site on your 3-point pages.

Looking at your site through a search engine’s eyes helps you see where your site relies on weak internal navigation. It can also tell you where your content has achieved high visibility outside of search. But most importantly you can begin to understand why it takes much longer for some sites to be crawled and indexed if you look for the natural crawling obstacles that Webmasters often implement in their fundamental design.

The search engines will love your Web site if you create a lot of unique content that is well-linked internally. Some people in the SEO community are starting to claim that smaller, leaner sites perform better. That’s not really the case. What is happening is that less content is appearing in Google’s main Web index, and we know that it’s important that pages be well-linked in order to be included in that index. If you paralyze yourself with the fear that dividing your PageRank 4 ways instead of 2 makes your links weaker, you’ll be assured of slow success because real search engine optimization begins with getting crawled and indexed.

Although Google says that PageRank is used to determine page crawling, you have the ability to boost the PageRank of your own content through internal navigation that also directs the flow of crawling toward your most important content. The robots won’t just go to the root URL of your domain. They’ll hit any URL they have in their index, and if you are getting deep links from other sites it’s imperative that you humble those deeply-linked pages so they share the love freely with the most valuable real estate on your site.

Remember: Google especially will crawl more than it indexes, so the more often you get your key pages crawled, the more often they’ll be indexed and the more frequently your lower-tier pages will be crawled and indexed. Keep the crawlers constantly flowing through your content by sending them back to link-rich pages.

2 Comments on Search engine love: now they crawl me, now they don’t

By Plymouth style week on February 21, 2007 at 4:46 pm

A brilliant and well balanced article.
The query I cannot find an answer to is that Google states in its guidelines not to have more than 100 links on any one page, so how can you reconcile that guideline with creating an HTML sitemap which lists all the URLs? Does it mean that you create a 5-tiered HTML sitemap if you have, for example, 500 URLs?

By Michael Martinez on February 22, 2007 at 3:23 am

Well, I have pages with as many as 200 links that all get crawled. Google’s recommendation is just that: a recommendation.

It’s okay to have a multi-page HTML sitemap. It’s okay to have a tiered HTML sitemap. Just understand that, for your visitors, the sitemap should only point toward useful content.

If you have a 40-page section on your site where the content is organized through a mini-directory — what I call a “front page” — then it’s okay if the HTML sitemap only links to the front page.

You have to find a good arrangement and that may take a little experimentation. My HTML sitemaps are constantly evolving as I add content.
