The measured nonsense of SEO relevance

Posted by Michael Martinez on August 3, 2007 in Content Theory

Let’s assume for the sake of discussion that we want to quickly estimate how many expressions a document may be relevant to. We have to make some assumptions.

First, we’ll assume that duplication of words doesn’t matter.

Second, we’ll assume that we have to look at word combinations that proceed only in one direction (from beginning to end). In fact a search engine may not care about the proximity or order of words, but this is a useful exercise because we’re looking only for relevance to word combinations that may actually be used by people.

There are 996 possible 5-word combinations in a document consisting of 1,000 words. There are 997 possible 4-word combinations in that same document. There are 998 possible 3-word combinations in that document. And there are 999 possible 2-word combinations.

All told there are 3,990 possible word combinations (of 2 to 5 words each) for which that document may be relevant. In general, you can estimate the number of combinations a document is relevant to by counting the number of words in the document (from beginning to end), multiplying that word count by the highest number of words in a combination, and then subtracting the sum of the word count and the series of integers from 1 to 1 less than the highest number of words in a combination.

Sound complicated? Not really. Let’s play with some numbers. Let’s say we have a 1,000 word document and we want to calculate how many 2-to-5 word combinations it is relevant to. We’ll multiply 1000 by 5, which gives us 5,000. Then we’ll add up the numbers 1,2,3, and 4 (which gives us a total of 10). Then we’ll subtract 1010 from 5,000 and we’ll see that our 1,000 word document is relevant to 3,990 2-to-5 word combinations.
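For anyone who would rather let a script do the arithmetic, here is a minimal Python sketch of the same calculation. The function names are made up for illustration. The first function counts the starting positions for each combination length directly; the second applies the multiply-then-subtract shortcut described above, which assumes the document has at least as many words as the longest combination.

```python
def max_combinations(word_count, max_len, min_len=2):
    """Upper bound on contiguous word combinations of min_len..max_len words."""
    total = 0
    for n in range(min_len, max_len + 1):
        # A run of n consecutive words can start at word_count - n + 1 positions.
        total += max(word_count - n + 1, 0)
    return total


def shortcut(word_count, max_len):
    """Shortcut: N * k minus (N plus 1 + 2 + ... + (k - 1))."""
    return word_count * max_len - (word_count + sum(range(1, max_len)))


print(max_combinations(1000, 5))  # 3990
print(shortcut(1000, 5))          # 3990
```

Both approaches agree for a 1,000 word document, and the loop version also copes with documents shorter than the longest combination.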

Of course it is possible there are some duplicate combinations in our results set so our document is relevant to at most 3,990 2, 3, 4, and 5-word combinations. If you want a more realistic ballpark estimate without having to learn a lot of math, then divide your maximum result by 2. Our 1,000 word document is thus probably relevant to about 1,995 combinations of 2-to-5 words.
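If you have the text in hand you can skip the divide-by-2 rule of thumb and count the duplicates directly. The sketch below is only an illustration: the function name is invented, and the tokenizer (lowercase everything, keep letters, digits, and straight apostrophes) is a crude assumption, not anything a search engine is known to do.

```python
import re


def count_combinations(text, max_len=5, min_len=2):
    """Return (total, unique) counts of contiguous word runs in text."""
    # Crude tokenizer (an assumption): lowercase letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = 0
    seen = set()
    for n in range(min_len, max_len + 1):
        for start in range(len(words) - n + 1):
            total += 1
            seen.add(tuple(words[start:start + n]))
    return total, len(seen)


sample = "the horse ate the hay and the horse drank the water"
print(count_combinations(sample))  # (34, 33)
```

In this tiny sample only one combination ("the horse") repeats, so the unique count is barely below the maximum. How close a real page comes to the halfway mark depends on how repetitive its copy is.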

This is a guestimate, not an estimate. An estimate would be more precise, making fewer assumptions. We can work with guestimates in search engine optimization because it’s not possible to optimize for almost 2,000 expressions on a page.

Well, I haven’t figured out how to do it yet, and I haven’t quite given up hope, but I’m not ready to say that it can be done.

So it’s probably not possible to optimize for 2,000 expressions on a page.

But we don’t have to limit ourselves to 5-word expressions. We can increase that limit to 10 words. Now we’re looking at a maximum of 8,955 combinations of 2-to-10 words. Our guestimate says we’re probably looking at no more than 4,478 combinations to which our 1,000 word document is relevant.

And if we cannot optimize for 2,000 combinations on a single page then it follows that we cannot optimize for nearly 4,500 either.

Of course, most SEOs probably don’t put 1,000 words of copy on a page. There is an SEO tradition that says you should not place more than 500 words on a page. Some people won’t place more than 250 words on a page. If you’re working with 250 words, and allowing for up to 10 words per combination, the maximum is 2,205 combinations, so by the same guestimate you only have to be concerned about 1,103 combinations that your page may be relevant to.
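To see how the guestimate scales with page length, here is a short Python sketch that tabulates the maximum and the halved figure for 250, 500, and 1,000 word pages at up to 10 words per combination. The function names are invented for illustration, and halves are rounded up to match the figures used in this article.

```python
def max_combinations(word_count, max_len):
    # Shortcut: N * k - (N + 1 + 2 + ... + (k - 1)); assumes N >= k.
    return word_count * max_len - (word_count + sum(range(1, max_len)))


def guestimate(word_count, max_len):
    # Halve the maximum, rounding a half up (1102.5 becomes 1103).
    return (max_combinations(word_count, max_len) + 1) // 2


for words in (250, 500, 1000):
    print(words, max_combinations(words, 10), guestimate(words, 10))
# 250 2205 1103
# 500 4455 2228
# 1000 8955 4478
```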

Can you handle 1,100 word combinations? If you optimize one page of text for one keyword expression and figure you can turn out 10 such pages a day (by hand), it will take you 110 days (or almost 4 months) to optimize those 1,100 pages of text. If you could just pack all that optimization into one page, you’d be doing pretty good, wouldn’t you?

Traditional search engine optimization tends to associate any given page of text with 1, maybe 2, usually no more than 3 keyword expressions. A few brave souls (and perhaps many naive souls) will attempt to optimize for 5 to 10 expressions. And a lot of people just stuff all sorts of words into their keyword meta tags as if that will rock the universe.

The more you focus on keyword-based optimization, the less you actually optimize for search. If you create 250-word documents then each of your documents is probably relevant to about 1,100 expressions. Why are you not optimizing for all of those expressions? The reason is that it’s not only impractical, but also a very daunting prospect. Only a total perfectionist would attempt to optimize for over 1,000 expressions on a page. We’ll never see the page.

The abundance of relevant expressions that a relatively small number of words generate underscores just how pointless some highly cherished SEO beliefs truly are. For example, you may often see people talking about only linking from relevant documents to relevant documents.

Some people feel the search engines (oh, who are we kidding? They mean Google) somehow weight a link more favorably if it comes from a page with content similar to the destination page. So a page “A” about horses that links to a page “B” about horses is helping page “B”, but if page “A” also links to page “C” (which is about cats), then the link from “A” to “C” is not as helpful as the link from “A” to “B”.

Never mind the fact that both page “A” and page “B” are probably each relevant to more than 1,000 expressions that don’t use the word “horses”.

Some people also speak about “local link popularity” although they are really thinking of “topical link popularity”. Link popularity, of course, really has nothing to do with relevance. What they really mean is that some pages acquire considerable “similar anchor text popularity”, where “similar anchor text” is only anchor text coming from pages that are topically similar to the destination — a similarity that is largely in the minds of people.

Ask does offer ExpertRank, which identifies topics. Google identifies topics for its news search. But both search services will include a document in more than one topic community. Why? Because they understand that a single document may be relevant to more than 1,000 expressions.

Topic-sensitive linking is a waste of time because it really has nothing to do with relevance. Relevance to a search engine is very different from relevance to a human being. This entire article you’re reading is about search engine optimization and content theory — but to a search engine it’s relevant to “horses” because I keep using the word “horses” throughout the article.

Some people might say that is horse manure (to put it politely) and with no offense intended to horses and horse-lovers, this article is really not relevant to horses. So the search engines would be wrong to include this article in a list of articles about horses, even though we all know that if someone types in the right horse-centric queries this article will come up first, second, in the top ten, in the top twenty.

Like it or not, this article is relevant to horses even though it has nothing to do with horses. Get over it.

Search engine optimization has to respect the limitations of search algorithms and it needs to put a limit on the credibility of unreasonable expectations. Relevance is not determined by links but by text. The text may occur on the page or it may occur off the page, but the fact of the link is not a matter of relevance.

When you have a document that is relevant to 1,000 expressions, you shouldn’t be saying you’re optimizing for 1 expression or 3 or 5. You should be saying you’re increasing the importance of 1 expression to that document. By showing people which expressions are most important to your topic you show the search engines which expressions are most important to your document, and they tend to take that measured value into consideration.

That is, you can repeat your keywords through link anchor text but all you’re doing is picking one expression out of 1,000. If you have a choice between making 1 expression important to a page and making 100 expressions important to a page, which would you prefer to do?

Where do you think the most relevant traffic will come from?

2 Comments on The measured nonsense of SEO relevance

By SEO Ranter on August 4, 2007 at 1:34 am

I’m sure you’ll forcefully shoot me down here, so I’m going to add a disclaimer - I’m not suggesting this is how things actually do work; I’m saying that even the single simple method you mention is enough to get an idea of topic.

Where you say:
‘Never mind the fact that both page “A” and page “B” are probably each relevant to more than 1,000 expressions that don’t use the word “horses”.’

There’s a long tail with n-gram analysis (which is pretty much what you’re doing) when applied to natural language text; it’s probably fairly pronounced with a simple document about horses, and certainly easier to see with longer documents. You remember the part where you create a guestimate by dividing the total number of n-grams (expressions) by two? That’s where the length of the tail and the bulk of the head come from. Topical relevance can be defined by an intersection between two documents here.
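A minimal Python sketch of that intersection idea follows. It is an illustration only: the function names, the sample pages, the crude tokenizer, and the choice of Jaccard overlap as the score are all assumptions, and nothing here is a claim about what any search engine actually computes.

```python
import re


def ngram_set(text, max_len=3, min_len=2):
    """Set of contiguous word runs (n-grams) of min_len..max_len words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        tuple(words[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def overlap(doc_a, doc_b):
    """Jaccard overlap: shared n-grams divided by all n-grams in either doc."""
    a, b = ngram_set(doc_a), ngram_set(doc_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


page_a = "horses need fresh hay and clean water"
page_b = "feed your horses fresh hay every morning"
page_c = "cats sleep most of the day"

print(sorted(ngram_set(page_a) & ngram_set(page_b)))  # [('fresh', 'hay')]
print(overlap(page_a, page_c))                        # 0.0
```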

Further, if a thesaurus is used (nice simple word for LSI, that horrendously abused phrase) and synonyms replaced with symbols - so that e.g. “expensive”, “dear”, and “overpriced” all become equivalent - the total number of unique expressions will shrink again, leaving a fairly clear intersection of n-grams between any set of relevant documents.
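As a rough illustration of the shrinking effect, the Python sketch below replaces synonyms with a shared symbol before collecting 2-word expressions. The synonym table, the symbol name, and the sample sentence are hypothetical, and the lookup ignores how each word is actually being used.

```python
import re

# Hypothetical thesaurus: each synonym maps to one shared symbol.
SYNONYMS = {"expensive": "COSTLY", "dear": "COSTLY", "overpriced": "COSTLY"}


def bigrams(text, synonyms):
    """Unique 2-word expressions after replacing synonyms with symbols."""
    words = [synonyms.get(w, w) for w in re.findall(r"[a-z0-9']+", text.lower())]
    return set(zip(words, words[1:]))


text = "expensive saddles and overpriced saddles and dear saddles"
print(len(bigrams(text, {})))        # 6 unique expressions without a thesaurus
print(len(bigrams(text, SYNONYMS)))  # 3 once the synonyms are merged
```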

Of course, applying synonyms is no mean feat - you’d need to work out how words are being used in order to know that you have the right set of synonyms (see “word sense disambiguation”) - but it’s possible, documented, not new, and certainly feasible on the scale of the web.

Whether or not it actually happens, well, if you can think of a scientific enough method to test this given the mess that is The Web, let me know ;)

By Michael Martinez on August 5, 2007 at 6:08 am

Many have tried to show that today’s search engines are semantic. All have failed, at least where A9, AOL, Ask, Live, Google, and Yahoo! are concerned. In our blogroll we link to a couple of semantic search engines like Hakia.

Where Google specifically is concerned, if they were using semantic analysis comparable to Latent Semantic Indexing, then a query for “canine” would match queries for “tooth” or (“dog” or “man’s best friend”). But that doesn’t happen.

Also, documents are relevant to expressions that are neither grammatically nor syntactically meaningful. For example, this comment makes any page it appears on relevant to the expression “hairy stone appraising vestibularic synthos” — which has absolutely no meaning, either symbolically or literally.
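A toy Python sketch of that purely literal kind of matching: a page "matches" a query if every query word appears somewhere in its text, whether or not the query means anything. The function name and the sample page are invented for illustration, and this is a deliberately simplified model, not a description of how any real engine ranks pages.

```python
import re


def matches_all_terms(page_text, query):
    """True if every word of the query occurs somewhere in the page text."""
    page_words = set(re.findall(r"[a-z0-9']+", page_text.lower()))
    query_words = re.findall(r"[a-z0-9']+", query.lower())
    return bool(query_words) and all(w in page_words for w in query_words)


page = "This comment mentions hairy stone appraising vestibularic synthos in passing."
print(matches_all_terms(page, "hairy stone appraising vestibularic synthos"))  # True
print(matches_all_terms(page, "horses"))                                       # False
```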

That is the limit of the search technology we are optimizing for. It is best not to hope for a semantic approach to search engine optimization any time in the next few years.

About the Author

Michael Martinez is the Director of Search Strategies for Visible Technologies, Inc. A former moderator at SEO forums such as JimWorld and Spider-food, Michael has been active in search engine optimization since 1998 and Web site design and promotion since 1996. Michael was a regular contributor to Suite101 (1998-2003) and SEOmoz (2006).
