Why SEO pundits can’t explain search algorithms
Posted by Michael Martinez on February 21, 2008 in Advanced SEO
NOTE: Some edits have been made several days after this article was originally published.
There are people in the search engine optimization industry who don’t understand what an “algorithm” is. Technically, there are many definitions for ‘algorithm’ but the most universally useful definition could be phrased as “a set of rules to define how to solve a problem or perform a task”.
A search engine has more than one algorithm. Today’s major search engines use crawling algorithm(s), indexing algorithm(s), data management algorithm(s), Web service algorithm(s), inter-data center communication algorithm(s), query resolution algorithm(s), user-behavior evaluating algorithm(s), trend analysis algorithm(s), and so on and so forth.
People in the SEO industry assume they can explain how search engines work with a simple blog post or eBook. I seriously doubt that Matt Cutts and Vanessa Fox, if you put them into a room together, could really explain in less than a day or two how a search engine decides which pages to show in response to your query, and even then they would be skimming over areas that they cannot explain for lack of either time or knowledge.
Many people have written search tools through the decades. I used to write them all the time when I was a full-time programmer. I had to find ways to search data files, program source code, corrupted blocks of disk storage that were once complex multi-file systems, and even sophisticated word processing documents that contained embedded font and formatting code.
Search is a normal and necessary function in the computing environment and we’ve been using search all our lives. You’ve even used it when you called a telephone information operator, looking for a number or an address. The algorithms have changed over the years because of advances in data structure technology, because of the immense increase in indexable and searchable information, and because we are constantly devising new reasons for search.
Web search algorithms also have to contend with invasive marketing techniques — search engine optimization. All Web search spammers are SEOs. All SEOs use invasive marketing techniques. All of us.
In a completely natural Web document environment no one would know or care that search engines exist. No one would write title tags, page copy, or page names (URLs) with search engines in mind. No one would go looking for links to improve their PageRank or search engine rankings.
In a completely natural Web document environment a search engine would have to be stealthy — surreptitious — in order not to disturb the natural quality of the information stored on the Web. Once the search engine makes itself known, people are drawn to the potential reward of ranking well in that search engine’s results. It only takes one idiot to piss on everyone’s parade. Once the first idiot starts optimizing for search the Web document environment is no longer natural.
That doesn’t mean that search engine optimization is a bad thing or even that we’re hurting ourselves by optimizing for search. After all, search does not exist in a vacuum. Search exists because there is a wealth of information (provided by Web publishers) that we need to organize and use (as searchers). Search makes it possible for us to access that information and it’s incumbent upon us as Web publishers to make it easier for us as searchers to find what we’re looking for.
In the three-way symbiosis of the Web search ecosystem all three partners (publishers, indexers, and searchers) play complementary roles that — when cooperative — work to everyone’s advantage. If any one of the partners steps out of line, becomes more focused on personal needs than the group’s needs, then the symbiosis suffers a crisis.
Search engines and search optimizers destabilize the ecosystem all the time. Search engines are monetizing the process and in doing so they put personal gain ahead of the group’s best interests. Search optimizers are seeking fame, money, or to persuade people to change their minds. In doing so they also put personal interests ahead of the group’s best interests. Every day sees some part of the Web ecosystem enter into a state of crisis, but other parts remain stable or return to stability as regularly.
In the natural system of Web search everything flows like water toward the lowest point through the paths of least resistance. Despite all our best efforts to ensure that Wikipedia does (Google) or does not (SEOs) rank first in query results, the overall trend is toward an equilibrium that everyone has to live with. We have too many conflicting personal interests for all of them to be satisfied.
Nonetheless, search engines have to develop algorithms that respond to the behaviors of both Web publishers and searchers, and those algorithms have to interact with each other. Their interactions have both direct and indirect influences on how search results are organized. So do the publisher and searcher behaviors.
For example, let’s say you have a 1,000 page Web site that has been well-indexed for 4 years and you now decide to restructure it, changing most of the URLs. What happens to your listings?
Unless your site is recrawled on a daily basis, nothing should happen to your listings for at least a few days, possibly a few weeks. Eventually, however, the search engines will drop the old page URLs from their indexes and start adding new page URLs. But because search indexing and other algorithms do look at link relationships between documents, and the links that pointed to your old URLs do not automatically count for the new ones, you may actually lose some of your rankings in search results.
Today we know to use 301 redirects to minimize the damage that restructuring Web sites inflicts upon search visibility, and the 301 redirects also help to shorten the window (in some cases providing for a very smooth transition) in which search visibility declines. But that only works because search engine algorithms have been modified to look at 301 redirects and take them into consideration.
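A minimal sketch of the idea, with everything simulated in memory: the old-to-new URL map, the set of live paths, and the function names are all hypothetical, and no real search engine's handling of redirects is implied. It only shows a server answering 301 for moved pages and a crawler following the redirect to the new address.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical old-to-new URL map for a restructured site.
REDIRECTS: Dict[str, str] = {
    "/widgets.html": "/products/widgets/",
    "/about-us.html": "/company/about/",
}

# Hypothetical set of paths that exist after the restructuring.
LIVE_PATHS: Set[str] = {"/", "/products/widgets/", "/company/about/"}


def respond(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location value or None) for a requested path."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]  # moved permanently: point at the new URL
    if path in LIVE_PATHS:
        return 200, None
    return 404, None


def final_url(path: str, max_hops: int = 5) -> Optional[str]:
    """Follow 301s the way a crawler might, giving up on long chains."""
    for _ in range(max_hops):
        status, location = respond(path)
        if status == 200:
            return path
        if status != 301 or location is None:
            return None  # 404 or anything else: nothing to carry over
        path = location
    return None
```

Here `final_url("/widgets.html")` ends at the new address, while an old URL that was left out of the map simply returns nothing, which is the case where the listing is dropped and has to be earned again.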
The time between the change you make and the change you see in search results is a lag time and in search engine optimization (as well as search engine operation) there are lag times all over the place. You rarely see people in the SEO industry discuss lag times and most often they are only mentioned in passing or in a different context. The most frequent context used for lag time discussions is “How long will it be before my pages are indexed?” Another popular lag time question is “How long does it take to reach the top?”
Search engineers have to account for a multitude of lag times in their various algorithms. They operate their services according to super algorithms that stipulate which functions or processes occur in which order. One does not simply write a program to crawl the Web, store the data, extract the links, and resolve queries. Getting all that information, analyzing it, dissecting it, storing it, and retrieving it when the time is right requires a great deal of work. And it requires a lot of time.
If you put all the events that occur when a single query is typed into a search engine’s Web interface into a row and measured their timeframes sequentially, you’d be amazed at just how long it takes to find a reasonably good list of sites for “michael martinez seo”.
People today are babbling about “temporal link analysis”, blather, blither, creepy-poo stuff. These are concepts, not algorithms. The algorithms are not published, not shared openly, and no one in the search engine optimization industry has a clue as to what they are talking about when they try to describe “search engine algorithms”.
You can summarize important parts of the process but you cannot reverse engineer the algorithms. There are over 100 published methods for calculating PageRank. How many unpublished methods are there? Which methods is Google using? You don’t know. I don’t know. Rand Fishkin and Aaron Wall don’t know. Danny Sullivan knows better than to suggest it’s being calculated any particular way.
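To make the point concrete, here is one textbook formulation of PageRank, computed by simple repeated iteration over a tiny link graph. This is a sketch of a published method only. The damping value of 0.85, the iteration count, and the way pages without outbound links are handled are all assumptions chosen for illustration, and nothing here says what any search engine actually runs.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iterate the classic PageRank update over a small link graph.

    `links` maps each page to the pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads its
                # rank evenly over every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Change the damping value, the treatment of dead-end pages, or the number of iterations and you get different numbers from the same links, which is exactly why knowing the concept tells you so little about the production algorithm.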
The crawling process alone consumes huge resources. Have you ever tried to organize 100 pieces of paper? Just collecting them from various sources can be a real task. Think about what a teacher has to do to collect reports from 30 students in a class.
- First, the teacher has to tell the students to write the reports (crawler asks for a file).
- Then the teacher has to tell the students when the reports are due (crawler has a timeout deadline).
- Then the students write the reports (Web servers check to see if the file exists).
- Then the students hand their reports to the teacher (Web servers pass existing files to the crawler).
- The teacher then places the reports on a desk (crawler drops the retrieved files in a repository).
For the teacher, it’s a never-ending process of asking students to hand in their homework, and teachers usually teach more than one class of students. Some students have to do more homework than others. Some students are not very good about turning in homework.
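The teacher's routine maps onto a very small loop. The sketch below is an illustration only: `FAKE_WEB`, `fake_fetch`, and `crawl` are hypothetical names, the "Web" is a dictionary, and the timeout is passed along but never actually triggered by the stub.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical stand-in for the Web: URL -> file contents.
FAKE_WEB: Dict[str, str] = {
    "http://example.com/": "<html>home</html>",
    "http://example.com/report.html": "<html>report</html>",
}


def fake_fetch(url: str, timeout: float) -> Optional[str]:
    """Pretend server: return the file if it exists, otherwise None.

    A real fetcher would also give up after `timeout` seconds without a
    response; this stub always answers immediately.
    """
    return FAKE_WEB.get(url)


def crawl(urls: Iterable[str],
          fetch: Callable[[str, float], Optional[str]],
          timeout: float = 10.0) -> Dict[str, str]:
    """Ask for each file, keep whatever comes back, skip the rest."""
    repository: Dict[str, str] = {}
    for url in urls:                # the teacher asks each student for a report
        body = fetch(url, timeout)  # the due date is the timeout
        if body is not None:        # the student actually handed something in
            repository[url] = body  # the report goes on the desk
    return repository
```

Calling `crawl` with two real URLs and one missing one leaves two entries in the repository. The missing page is simply never seen, like homework that was never turned in.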
The crawling process is more complex than this metaphor shows it to be, however. Sometimes the crawlers perform what is called a “deep crawl”, where they start at the root URL and parse every page and extract links and request those files immediately. This is equivalent to the teacher rifling through every report just turned in and asking for more information from each student before passing on to the next.
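A deep crawl, as described above, can be sketched as a breadth-first walk from the root URL. The `SITE` dictionary and the `deep_crawl` function are hypothetical; a real crawler would make HTTP requests where this one does dictionary lookups, and it would not be limited to a single invented site.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin

# Hypothetical four-page site held in memory instead of on a server.
SITE: Dict[str, str] = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/c.html">C</a> <a href="/">home</a>',
    "http://example.com/b.html": "no links here",
    "http://example.com/c.html": '<a href="/missing.html">broken</a>',
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def deep_crawl(root: str) -> List[str]:
    """Start at the root, parse each page, and request its links right away."""
    seen = {root}
    queue = deque([root])
    fetched: List[str] = []
    while queue:
        url = queue.popleft()
        body = SITE.get(url)  # stands in for the HTTP request
        if body is None:
            continue          # the server has no such file
        fetched.append(url)
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.hrefs:
            link = urljoin(url, href)
            if link not in seen:  # never ask for the same URL twice
                seen.add(link)
                queue.append(link)
    return fetched
```

Notice there is no pause anywhere in that loop. Every link found is requested as fast as the loop can turn, which is how a deep crawl ends up hammering a server.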
Deep crawls historically have crashed servers because the crawlers can send requests so fast that the servers cannot resolve them. Users complain about slow performance. ISPs complain about bandwidth overages. I’ve been through more deep crawls than I care to remember and they have never been pleasant experiences.
Search engines have made adjustments to their crawling algorithms to help Webmasters (and hosting services) tolerate the demand for files. They incorporate lag times, distribute URLs to multiple crawling queues, cache files and ask servers if the files have been changed, etc. All that just so the search engines can speed up the process of collecting files and reduce the burden on Web publishers and their supporting providers (not to mention their user communities).
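Two of those adjustments are easy to sketch: a per-host delay, and asking the server whether a file has changed before downloading it again (HTTP's If-Modified-Since request and 304 Not Modified response). The class, the function, the delay value, and the `SERVER` dictionary below are hypothetical, and the clock is passed in as a plain number so the example runs without sleeping or touching a network.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit


class PoliteScheduler:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        self.next_allowed: Dict[str, float] = {}

    def wait_needed(self, url: str, now: float) -> float:
        """Seconds to wait before requesting `url`; also books that slot."""
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now


# Hypothetical server state: URL -> (last-modified timestamp, body).
SERVER: Dict[str, Tuple[int, str]] = {"http://a.example/page": (100, "v1")}


def conditional_get(url: str, cached_stamp: Optional[int]
                    ) -> Tuple[int, Optional[str], Optional[int]]:
    """Mimic If-Modified-Since: answer 304 with no body if nothing changed."""
    if url not in SERVER:
        return 404, None, None
    stamp, body = SERVER[url]
    if cached_stamp is not None and stamp <= cached_stamp:
        return 304, None, stamp  # crawler keeps its cached copy
    return 200, body, stamp
```

With a five-second delay, two back-to-back requests to the same host come out five seconds apart while a request to a different host goes out immediately. The 304 answer costs the server almost nothing compared with sending the whole file again.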
The crawling process a search engine uses may have a significant impact on your search results rankings because, if you don’t get crawled very often, the changes you implement today may not show up for three months or longer. Most people who complain about being indexed only once a month have no idea how lucky they are. On average I want my sites indexed once a week. If I can get that up to multiple times a week I feel like I’ve done something really worthwhile as an SEO.
But like everyone else I have sites that sit out there and wait for crawlers, and the crawlers take their time. When you add 100 pages to a site that is only recrawled once a quarter, the lag time is an agonizing eternity. Even submitting a sitemap to the search engines doesn’t guarantee that you’ll see any crawling activity there, while the same crawlers may come back and recrawl your 16-page site every 2-3 days.
Managing crawl is an important part of search engine optimization but most SEOs never even think about that. Or maybe they do now that the nofollow controversy has erupted, but regardless of how you feel about using nofollow on internal links, so-called “PageRank sculpting” is not crawl management. To manage your crawl you have to look at as many links as you possibly can.
You can create crawl-happy sites with only a handful of links. You can have crawl-starved sites with thousands of inbound links. I’ve seen both. I have both. It’s not all about links despite what other people may tell you. Nor is it really about link quality. It’s more about link placement and link structures than anything else.
If you want to read drivel about “temporal link analysis” go ahead but unless you factor in lag times you’re not going to learn anything useful. Lag times influence search results in so many ways you could write at least one book, perhaps several, that documents lag times and their effect on search. You don’t even need to fully understand the architectures behind search to understand that if event A has to happen before event B and if they are not triggered in sequence by the same controlling process then you have a lag time that has to be dealt with.
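Here is that last point as arithmetic. It is a toy model under loud assumptions: a crawler that visits on a fixed cycle, an index rebuild that runs on its own separate fixed cycle, and both cycles starting on day 0. The function names and the periods are invented, and real schedules are nowhere near this regular.

```python
def next_run(day: int, period: int) -> int:
    """First scheduled run on or after `day`, for runs on days 0, period, 2*period, ..."""
    return -(-day // period) * period  # integer ceiling of day/period, times period


def visible_on(change_day: int, crawl_period: int, index_period: int) -> int:
    """Day a change shows up: it must be crawled (A) before it is indexed (B)."""
    crawled = next_run(change_day, crawl_period)  # event A
    return next_run(crawled, index_period)        # event B waits for its own clock


def lag(change_day: int, crawl_period: int, index_period: int) -> int:
    """Days between making a change and seeing it in the index."""
    return visible_on(change_day, crawl_period, index_period) - change_day
```

With a 7-day crawl cycle and a 10-day index cycle, a change made on day 3 is crawled on day 7 and visible on day 10, a lag of 7 days. Make the same change on day 8 and it is crawled on day 14 and visible on day 20, a lag of 12 days. Nothing about the site or the algorithm changed between those two cases, only the timing.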
Jill Whalen likes to say that chasing search algorithms is a waste of time. As an algorithm chaser I have to agree with her to a certain extent. There are ways to observe the behavior of search engines and some people do a reasonable job of eyeballing the lag times without really thinking of them as lag times. Most SEO analysis and technique is actually based on managing lag times: “If I do this and wait a few weeks, I should see that.”
If you do this and don’t see that in the expected time frame then you go looking for possible explanations of why that didn’t happen as expected. People toss up ideas and eventually someone proposes something that is reasonable (excluding all the usual technical flaws we introduce through site construction, server management issues, etc.). At the point where someone proposes a reasonable possible explanation most SEOs conclude there must have been an algorithm change.
They may be right, they may be wrong, but with such an imprecise method of determining when algorithms change, is it any wonder the search engineers occasionally seem bemused by our conclusions? They don’t need to lie to us to protect their secrets; we’re doing a pretty good job as a community of just making up crap and passing it around.
It takes about two years to vindicate or fully discredit any proposed explanation in the search engine optimization community. In the past two years, Google introduced two major rewrites of its search engine (Bigdaddy and Searchology). AOL, Ask, Live, Yahoo!, and Google all began integrating search results from their vertical portals. Live implemented a six-month algorithmic update schedule (“Rome” is supposedly due this Spring, possibly April according to one speculative blogger). (NOTE: I’m trying to collect information about “Rome” in Spider-food’s MSN Forum if you want to know more or share information.)
Ask is now the flagship property for IAC, has increased its market share, and is serving an estimated 30 million search users per month. Live is serving an estimated 70 million search users per month. Yahoo! still gets more visitors than Google but Google has nearly overtaken Yahoo!. All that has happened and more in the past two years.
Dozens of additional search features and functions have been announced by these services. Many more have been released without fanfare.
So, if you think you can explain how search engines work in a blog post, feel free. Go ahead. My hat’s off to anyone who has the nerve to stand up and share what they think is happening behind the curtain, but I wouldn’t trust those explanations under any circumstances. They’re not worth the electrons they’re virtually printed on.