When URLs divide: Managing search engine crawling
Posted by Michael Martinez on June 19, 2007 in Link Building, Link Theory, SEO Theory
Canonicalization is now well-known and understood so people are generally very good about setting up 301 redirects and setting the canonical status of their domain names in Google Webmaster Central, in their server configurations, etc.
Many Webmasters (and SEOs) are still inconsistent about the linking formats they use for their internal links. Relative URLs don’t hurt you unless you’re intermixing secure pages (HTTPS) with non-secure pages (HTTP). The search engines will assume HTTP prefixes by default. I generally recommend the use of absolute URLs in all internal links to avoid such issues.
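If your templates already contain relative links, a small script can show you what each one resolves to before you rewrite it as an absolute URL. The following is only a sketch: the root URL, the page path, and the `absolutize` helper are made-up names for illustration, and it relies on Python's standard `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

# Hypothetical canonical root; substitute your own preferred host name.
CANONICAL_ROOT = "http://www.example.com/"

def absolutize(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into a full absolute URL."""
    return urljoin(page_url, href)

# The page the links were found on (also hypothetical).
page = CANONICAL_ROOT + "articles/seo/index.html"

for href in ["../contact.html", "/sitemap.html", "tips.html"]:
    print(href, "->", absolutize(page, href))
```

Writing the resolved form into the link itself removes any guesswork about which scheme and host name a crawler will attach to it.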
Quite often, however, when you update or add a page in a hurry you may leave the colon (:) out of a link so that it reads “http//www.example.com/”. Browsers and crawlers treat a link like that as a relative URL, so your domain name ends up prefixed to the broken URL. Your visitors will see only an Error 404 result.
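A missing colon is easy to catch mechanically. Here is a minimal sketch, assuming you have already collected the href values from a page into a list; the `find_missing_colon_links` helper and the sample links are hypothetical.

```python
import re
from urllib.parse import urljoin, urlsplit

# Matches hrefs such as "http//www.example.com/" or "https//..." (colon missing).
MISSING_COLON = re.compile(r"^https?//", re.IGNORECASE)

def find_missing_colon_links(hrefs):
    """Return the hrefs that look like absolute URLs with the colon left out."""
    return [h for h in hrefs if MISSING_COLON.match(h)]

hrefs = ["http://www.example.com/", "http//www.example.com/", "/about.html"]
bad = find_missing_colon_links(hrefs)
print(bad)

# Without the colon there is no scheme, so the href is resolved as a relative
# path under the current site instead of pointing where you intended.
resolved = urljoin("http://www.example.com/index.html", bad[0])
print(urlsplit(resolved).netloc, urlsplit(resolved).path)
```

The last two lines show where the broken link lands: the host stays your own, and the mistyped URL becomes part of the path. How the doubled slash in that path is treated may differ from one client to another.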
In the old days, before search engines like Yahoo! and Google began dictating to Webmasters how to design their Web sites, I advised people to modify their 404 error handling so that they served a valid HTML document to their visitors. This practice helped search engines find their way through Web sites that lose copy.
However, today’s search engines don’t like that trick. They may become confused and start indexing the same content under multiple URLs (which is not a good thing) or they may not allow you to authenticate your site because it doesn’t send a proper 404 error response.
Requiring proper 404 error handling from Webmasters is stupid because many Webmasters have no choice about how their accounts handle 404s. While many Webhosting ISPs may studiously set up the default behavior correctly, if a Web site operates on a broken server the Webmaster won’t be able to fix the problem for a search engine.
Nonetheless, I now feel strongly that canonical 404 handling is the way to go. It’s better to authenticate a site than not have access to all the data a search engine is willing to share with an authenticated site. So to compensate for the occasional outdated inbound link or broken on-site link I have been creating custom 404 documents that provide current links to my content.
These custom 404 documents act like miniature HTML site maps, informing people of my most important content and helping them find useful internal documents (like HTML site maps) that tell them more about what my sites have to offer.
The only drawback is that search engines will not crawl custom 404 documents (that is, not if the documents are served with the appropriate 404 response code). Search engine crawling thus has to be managed more efficiently through crawlable content.
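To make the idea concrete, here is a rough sketch of a custom 404 document that is still served with a real 404 response code. It is written as a bare WSGI application using nothing outside the Python standard library; the `PAGES` table, the HTML, and the paths are invented for illustration, and on most hosts you would set this up in the server configuration rather than in code.

```python
# Hypothetical site content: path -> HTML body.
PAGES = {
    "/": "<h1>Home</h1>",
    "/sitemap.html": "<h1>Site map</h1>",
}

# Custom 404 document that doubles as a miniature site map.
NOT_FOUND_HTML = (
    "<h1>Page not found</h1>"
    "<p>Try one of these instead:</p>"
    '<ul><li><a href="/">Home</a></li>'
    '<li><a href="/sitemap.html">Site map</a></li></ul>'
)

def app(environ, start_response):
    """Minimal WSGI app: known pages get 200, everything else a true 404."""
    path = environ.get("PATH_INFO", "/")
    if path in PAGES:
        status, body = "200 OK", PAGES[path]
    else:
        # Send the helpful document, but keep the 404 status so search
        # engines do not index it as a regular page.
        status, body = "404 Not Found", NOT_FOUND_HTML
    data = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```

To try it locally you could pass `app` to `wsgiref.simple_server.make_server("", 8000, app)` and call `serve_forever()` on the result. The point of the sketch is the pairing: visitors get useful links, while the status line still tells crawlers the requested URL does not exist.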
I strongly recommend that every site with more than 10-12 pages include an HTML sitemap as one of its pages. Many people balk at the idea of including HTML sitemaps. They either feel that the XML sitemaps they upload to search engines are sufficient or they are just so very confident in their on-page navigation they think the HTML sitemaps are useless.
Surfers disagree. HTML sitemaps receive a great deal of traffic, especially on large content sites. While many Web sites do offer site search tools, if you rely on Google or Yahoo! for your site search you’re excluding a lot of content from your search (although the new Google Custom Search capabilities may ameliorate that situation).
An HTML sitemap is a great usability feature, especially for Web sites that implement navigation through Javascript, Flash, or other functions that inhibit 100% accessibility. But HTML sitemaps also act as great crawl pages.
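An HTML sitemap does not need to be fancy to work as a crawl page; a plain list of ordinary links will do. As a sketch, assuming you can produce a list of (URL, title) pairs for your pages, something like the following would render one. The `build_html_sitemap` function and the sample pages are hypothetical.

```python
from html import escape

def build_html_sitemap(pages):
    """Render (url, title) pairs as a plain HTML list of crawlable links."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for url, title in pages
    )
    return f"<h1>Site Map</h1>\n<ul>\n{items}\n</ul>"

# Sample pages for illustration only.
pages = [
    ("http://www.example.com/", "Home"),
    ("http://www.example.com/articles/", "Articles"),
    ("http://www.example.com/faq.html", "Questions & Answers"),
]
print(build_html_sitemap(pages))
```

Because the output is nothing but standard anchor tags, it stays usable for visitors and crawlers even when the main navigation depends on Javascript or Flash.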
My philosophy is simple: Every page on a site should have at least two outbound links (one to the site’s root URL and one to the site’s HTML sitemap) and every page on a site should have at least two inbound links (one from the HTML sitemap and one from at least one other page). The more inbound links you provide yourself with, the more crawlable your site becomes (and the more link anchor text you can pass to your pages).
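The two-out, two-in rule is simple enough to check with a script. The sketch below assumes you already have a map of each page to the on-site pages it links to (from a crawl or from your CMS); the `audit_links` function, the paths, and the sample site are all made up. As a simplification, it counts distinct linking pages rather than checking that one of them is specifically the sitemap.

```python
from collections import defaultdict

# Assumed locations of the root URL and the HTML sitemap.
ROOT = "/"
SITEMAP = "/sitemap.html"

def audit_links(outlinks):
    """outlinks maps each page to the set of on-site pages it links to.

    Returns a dict of page -> list of problems under the two-out/two-in rule.
    """
    # Invert the map to find which pages link to each page.
    inbound = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            if target != source:  # a page linking to itself does not count
                inbound[target].add(source)

    problems = {}
    for page, targets in outlinks.items():
        issues = []
        if page != ROOT and ROOT not in targets:
            issues.append("no link to root")
        if page != SITEMAP and SITEMAP not in targets:
            issues.append("no link to HTML sitemap")
        if len(inbound[page]) < 2:
            issues.append("fewer than two inbound links")
        if issues:
            problems[page] = issues
    return problems

# A tiny sample site; /b.html breaks the rule in two ways.
site = {
    "/": {"/sitemap.html", "/a.html"},
    "/sitemap.html": {"/", "/a.html", "/b.html"},
    "/a.html": {"/", "/sitemap.html"},
    "/b.html": {"/sitemap.html"},
}
print(audit_links(site))
```

In this sample only `/b.html` is reported: it never links back to the root, and the sitemap is the only page that links to it.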
While it’s true that many people feel on-page navigation links may not be given as much weight as in-body text links, there is no reason why you cannot write up some text that tells people about what they can find on your site.
A site with dozens, hundreds, or thousands of pages should have section leader pages that act like category entry pages. These section leader pages should be linking to groups of related pages, which in turn should be linking back. The destination pages will thus have at least 3 outbound links: 1 to the root URL, 1 to the HTML sitemap, and 1 to the section leader page.
If you embed section navigation links on all pages then each of your deepest pages should have at least 3 inbound links. The more links you provide yourself, the better.
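You can verify the section leader arrangement the same way. The following sketch assumes two inputs that you would have to build yourself: a map of each section leader page to the pages in its group, and a map of each page to the pages it links to. The function name and all paths are hypothetical.

```python
def missing_section_links(sections, outlinks):
    """sections maps a section leader page to the pages in its group.

    Returns (source, target) pairs for links that should exist but do not:
    leader -> member and member -> leader.
    """
    missing = []
    for leader, members in sections.items():
        for member in members:
            if member not in outlinks.get(leader, set()):
                missing.append((leader, member))
            if leader not in outlinks.get(member, set()):
                missing.append((member, leader))
    return missing

# Sample data: the crawling page is not linked to or from its section leader.
sections = {"/seo/": ["/seo/links.html", "/seo/crawling.html"]}
outlinks = {
    "/seo/": {"/", "/sitemap.html", "/seo/links.html"},
    "/seo/links.html": {"/", "/sitemap.html", "/seo/"},
    "/seo/crawling.html": {"/", "/sitemap.html"},
}
print(missing_section_links(sections, outlinks))
```

Each pair in the result names a page and the link it still needs, which makes for a convenient to-do list when a section grows.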
And if you have at least a half dozen topic sections you should be providing your visitors with cross-promotional links outside of your on-site navigation. Never pass up an opportunity to tell people about your other content. If they like what they find on the first visit they’ll often come back, and they are more likely to come back if they know they can find more information about other topics.
Your root URL will be your most important page. It should always contain a link to your HTML sitemap. But you should also be using your root URL to link to your most important content. Remember that if you don’t make the effort to tell people about your content, no one else really has an incentive to do so, either.
There may be no optimum number of links you can place for a large content site but I like to ensure that each page has at least 2 on-site links pointing to it. That is sometimes a challenge but it definitely helps to spread the links across the site.
You can (and should) revisit your internal linkage from time to time. For example, this blog could remind you about SEO Theory’s Supplemental Pages category, or you can see that we have an SEO Case Studies category (of course, these links are offered in the navigation, too).
It doesn’t hurt in the least to tell people that you recently blogged about answers to nearly 100 SEO questions and more answers to SEO questions.
You can also talk about your most recent High Quality Links post. The possibilities for linking to yourself are endless, and the benefits are both real and worthwhile. People will follow your embedded links even if the search engines do not.
In the final analysis, the value you create for your visitors can be derived from what you say about your own content in a helpful, informative fashion just as easily as it can be derived from what you say about other people’s content.