Large Web site design theory and crawl management

Posted by Michael Martinez on April 10, 2008 in Content Theory, Link Theory

Crawl refers to all aspects of search engine crawling. It includes:

  1. Crawl rate (how many pages are fetched in a given timeframe)
  2. Crawl frequency (how often a search engine initiates a new crawl)
  3. Crawl depth (how many clicks deep a search engine goes from a crawl initiation point)
  4. Crawl saturation (how many unique pages are fetched)
  5. Crawl priority (which pages are used to initiate crawls)
  6. Crawl redundancy (how many crawlers are used to crawl a site)
  7. Crawl mapping (creating paths for crawlers)
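
Several of these measures can be estimated from your own server logs. The following is a minimal sketch, assuming you have already extracted a crawler's fetches as (timestamp, URL) pairs. The `crawl_stats` function, the sample data, and the choice to express saturation as a fraction of a known page list are illustrative assumptions, not standard definitions.

```python
from collections import Counter
from datetime import datetime

def crawl_stats(fetches, known_pages):
    """Summarize crawler fetches given as (ISO timestamp, URL) pairs.

    known_pages is assumed to be a non-empty list of the site's URLs.
    """
    # Count fetches per calendar day to estimate crawl rate.
    per_day = Counter(datetime.fromisoformat(ts).date() for ts, _ in fetches)
    unique = {url for _, url in fetches}
    days = len(per_day) or 1
    return {
        "crawl_rate_per_day": len(fetches) / days,
        "unique_pages_fetched": len(unique),
        # Share of known pages that were fetched at least once.
        "saturation": len(unique & set(known_pages)) / len(known_pages),
    }

# Hypothetical sample data for illustration only.
fetches = [
    ("2008-04-09T01:00:00", "/"),
    ("2008-04-09T01:00:05", "/articles/"),
    ("2008-04-10T02:30:00", "/"),
    ("2008-04-10T02:30:04", "/articles/crawl-theory"),
]
known_pages = ["/", "/articles/", "/articles/crawl-theory", "/about"]
stats = crawl_stats(fetches, known_pages)
print(stats)
```

Running the same summary separately for each crawler's user agent would give a rough view of crawl redundancy as well.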

There are different types of crawlers. Search engines and spammers are not the only people out there crawling the Web. Other tools crawl your sites for a variety of reasons. Each tool is built around a unique set of assumptions and not every crawler looks at robots.txt for any sort of guidance.

Crawlers that do their own parsing and queuing have very limited capabilities. A search engine has the option of vetting the links it extracts from pages if it separates the crawling and queuing functions. Link vetting introduces efficiencies in crawling that Web site operators cannot influence. For example, a search engine that vets links may be able to eliminate a lot of redundant and unnecessary fetches if it already has a recent image of a page for which it finds a lot of links.

When you are designing a large Web site you want to improve crawl efficiencies as much as possible. Your internal linking will both help and hinder the process. For example, I have strongly advocated building a minimum of three links for every page on a site. However, if you build a 100,000 page site you’re creating at least 300,000 internal links if you follow that principle. Every link will have a saturation edge at least one link deep, perhaps two links deep. A saturation edge is the distance from a crawled page that a crawler will travel before giving up on parsing and queuing links.
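
The saturation edge can be pictured as a depth limit on a breadth-first walk of your internal links. The sketch below assumes a crawler that simply stops queuing links once it is `max_depth` clicks from its starting page. Real crawlers are more complicated, and the `pages_within_edge` function and the tiny link graph are hypothetical.

```python
from collections import deque

def pages_within_edge(links, start, max_depth):
    """Breadth-first walk that stops queuing links beyond max_depth clicks."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # saturation edge reached: do not queue this page's links
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": ["/deep"],
}
reached = pages_within_edge(links, "/", max_depth=2)
print(sorted(reached))

# Minimum internal links for a 100,000-page site at three links per page.
print(100_000 * 3)
```

In this example "/deep" sits three clicks from the starting page, so a crawl with a saturation edge of two never queues it unless another frequently crawled page links to it.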

Some of the technical literature suggests that on-site navigational links are not as important as other links because the on-site navigation (if done well) will permeate a site. By comparing linking structures on a selection of pages it would be possible to create a map for a site that identifies its navigational structures. Large Web sites, however, inevitably have to use multiple navigational layers. It’s impossible to incorporate 100,000 links into a human-usable navigation system.
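
A rough calculation shows why multiple layers are unavoidable. The sketch below assumes an idealized tree in which every navigation page carries the same number of links and only the deepest layer holds content pages. That is a simplification, and `layers_needed` is a hypothetical helper.

```python
def layers_needed(total_pages, links_per_page):
    """Smallest number of navigation layers whose deepest layer can hold total_pages."""
    if links_per_page < 2:
        raise ValueError("links_per_page must be at least 2")
    layers, capacity = 0, 1
    while capacity < total_pages:
        capacity *= links_per_page  # each extra layer multiplies reachable pages
        layers += 1
    return layers

for fanout in (10, 50, 100):
    print(fanout, layers_needed(100_000, fanout))
```

Under these assumptions, even 100 links on every navigation page leaves a 100,000-page site three clicks deep, and 10 links per page pushes it to five.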

The introduction of XML sitemaps unquestionably resolved some huge problems for both search engines and large Web site operators, but XML sitemaps have to be propagated either as part of Web site creation or as part of content creation. That is, you either create XML sitemaps that cover your entire site (up to 50,000 URLs per sitemap file) or you publish an XML sitemap as you create content. Blogs and forums publish XML sitemaps as new posts are released (but we call these sitemaps “RSS feeds”).
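
For a site larger than the 50,000 limit, the URL list has to be split across several sitemap files. The following is a minimal sketch using Python's standard library. The `build_sitemaps` function and the example.com URLs are illustrative assumptions, and a production sitemap would usually carry more than the bare `loc` element.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000

def build_sitemaps(urls, chunk_size=MAX_URLS_PER_FILE):
    """Return a list of sitemap XML strings, each holding at most chunk_size URLs."""
    documents = []
    for start in range(0, len(urls), chunk_size):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + chunk_size]:
            # Each entry is <url><loc>...</loc></url>.
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents

# Hypothetical URL list for a 120,000-page site.
urls = [f"https://www.example.com/item/{n}" for n in range(120_000)]
sitemaps = build_sitemaps(urls)
print(len(sitemaps))
```

Each string would be written to its own file, and the files could then be listed together in a sitemap index.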

Dynamically growing large content sites present different challenges from relatively static large content sites. If you’re just publishing an archive of old data that is not expected to grow, your site structure will be static. If you’re allowing people to add content to your site on a random basis, and if you add new content sections and features every month or two, your site structure will evolve well beyond whatever your original concept was.

An ecommerce site that publishes an inventory of 10 million items can maintain a relatively static structure while changing out content on a frequent, continuous basis. A Web forum, blog service, article distribution service, or news publishing service will start out with a fairly small tree structure and add new branches and leaves in different areas. The flatter the overall structure remains, the more difficult it becomes for a search engine to figure out how much content a large site may have. Tiered, hierarchical structures allow people and machines to project probable paths of content development.

Do the search engines project how sites should look? I don’t know, but if they’re studying and developing efficient crawl patterns they cannot help but notice how some large sites have more efficient tree-like structures. People, on the other hand, will quickly tire of clicking endlessly through irrelevant content sections and turn either to site search tools or on-site directory structures.

A large Web site has several options it can employ to assist with crawling but these options don’t help much with improving crawl efficiencies. For example:

  1. On-site directories provide both structure and prioritization but rarely provide deep coverage
  2. HTML sitemaps provide less categorized structure, no prioritization, and almost never provide deep coverage
  3. On-site search tools may or may not be indexable by search engines, may provide poor quality search results, and are subject to random keyword selection
  4. Featured link showcase pages help launch new content and provide focus on small regions of content
  5. Off-site promotional references may provide links to specialized content sections
  6. On-site cross-promotional references may provide links to specialized content sections

When you’re dealing with dual search indexes, simply being crawled and indexed is not enough because the odds favor most of your content being stored in the secondary index. Where Google’s Supplemental Results Index (SRI) is concerned, your “long tail visibility” is significantly diminished because Google does not fully index the contents of SRI pages. Nor does it appear to allow SRI links to pass anchor text (or, perhaps, it restricts how anchor text is passed between SRI pages).

There is no question that Google finds, follows, and indexes content through SRI-listed pages and their links. The Supplemental Results Index has its own crawler (according to Matt Cutts) and it can build itself without any help from XML sitemaps. But a large Web site does not obtain much visibility if most of its content is relegated to the Supplemental Results Index.

On the other hand, a large Web site offers no inherent reason for Google to place most of its content in the Main Web Index (MWI). Google is looking for PageRank, so a typical large Web site with many PageRank-passing external links will see more of its content in the MWI than a large site with a small number of PageRank-passing external links.

By developing link profiles for sub-sections of a large content site we enhance the site’s long tail visibility and improve crawl efficiency. But on-site navigation has to be supplemented by additional in-body link placements to help emphasize which pages on the site are most important. Through these additional links you can create link warehouses — pages that are frequently crawled and indexed — from which you can direct or redirect crawling priorities for the search engines. Link warehouses help you launch new content, help you bring older content back into visibility, and give your linking structures flexibility.

Most importantly, they give you a part to play in the process of managing the crawl for your large Web sites.

2 Comments on Large Web site design theory and crawl management

By SEO Ranter on April 10, 2008 at 9:04 am

Holy crap, you’re still writing these things, good work!

Early work from Cho found that the best way to order a crawl queue was by PageRank, based on his assumptions. I’m sure you can find the paper on Google Scholar (it was “Efficient ordering of a URL queue for crawling” or something along those lines). This might provide a completely different hierarchy to the one site designers intended, which implies that, maybe too obviously, your link prioritisation and structure ought to match the proposed site structure.

Hotlinking words to a glossary can work in some sections. A two-dimensional (as in, it’ll fit on a piece of paper) site structure is almost never going to work over a few thousand pages; there will either be junctures of too many outbound paths (read: pages with too many links on), or very long routes between nodes (pages will get buried and lost). Adding alternative navigation / directory structures over a site is really going to help things along. Good crossing points would be at the pages you’ve termed “link warehouses”.

Have you got any ideas of how to monitor crawl progress through a site’s structure, providing information that would lead to changes in site structure that improve crawl saturation & content discovery?

By Michael Martinez on April 10, 2008 at 9:55 am

SEO Ranter: “Have you got any ideas of how to monitor crawl progress through a site’s structure, providing information that would lead to changes in site structure that improve crawl saturation & content discovery?”

Michael: You have to map progress through site: queries on each search engine and compare them to fetch patterns. It’s a tedious process but I have found (and other people’s experiences may not agree with mine) that simply monitoring the fetches is insufficient.

If you see a range of pages being fetched but not appearing in a search engine’s index, you can take a look at the content and/or links and possibly decide to make some adjustments.
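
That comparison can be sketched as a set difference. The example below assumes you have already collected two URL lists by hand or by script: the pages a crawler fetched (from your logs) and the pages that show up in site: queries. The `fetched_not_indexed` function and the sample URLs are hypothetical.

```python
from collections import defaultdict

def fetched_not_indexed(fetched_urls, indexed_urls):
    """Group URLs that were fetched but are missing from the index by top-level section."""
    missing = set(fetched_urls) - set(indexed_urls)
    by_section = defaultdict(list)
    for url in sorted(missing):
        # The first path segment stands in for the site section.
        section = url.strip("/").split("/")[0] or "(root)"
        by_section[section].append(url)
    return dict(by_section)

# Hypothetical sample data for illustration only.
fetched = ["/", "/forum/t1", "/forum/t2", "/articles/a1", "/articles/a2"]
indexed = ["/", "/articles/a1", "/articles/a2", "/forum/t1"]
gaps = fetched_not_indexed(fetched, indexed)
print(gaps)
```

Sections that keep turning up in the result are the ones whose content or links are worth a closer look.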

An alternative approach (and probably something that affords a good sanity check) is to monitor long tail referral activity. If the referrals are distributed evenly across your content you’re doing okay but if they come from tightly clustered sections you’re most likely not accruing enough PageRank or crawl in other sections.
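
A simple way to check that distribution is to compute each section's share of search referrals. The sketch below assumes you have a list of landing-page URLs taken from search referrals in your logs. The 50% threshold for calling a section "clustered" is an arbitrary illustrative choice, and both function names are hypothetical.

```python
from collections import Counter

def referral_share(referral_urls):
    """Fraction of search referrals landing in each top-level section."""
    counts = Counter(
        url.strip("/").split("/")[0] or "(root)" for url in referral_urls
    )
    total = sum(counts.values())
    return {section: n / total for section, n in counts.items()}

def clustered_sections(shares, threshold=0.5):
    """Sections that receive more than the given share of all referrals."""
    return sorted(s for s, share in shares.items() if share > threshold)

# Hypothetical sample data for illustration only.
referrals = ["/forum/t1"] * 8 + ["/articles/a1"] + ["/products/p9"]
shares = referral_share(referrals)
print(shares)
print(clustered_sections(shares))
```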

About the Author

Michael Martinez is the Director of Search Strategies for 1st Query, an Internet Marketing firm offering organic SEO and PPC services.
