Google Broken: Supplemental Pages Not Being Parsed And Indexed

Posted by admin on February 16, 2007 in Supplemental Pages

In March 2006 I reported in several online discussions that my most recent estimates of Google’s index size put it at between 25 billion and 30 billion pages. That research was most likely performed on pre-Bigdaddy data centers, but I cannot find where I shared the actual numbers — which is very unusual for me, as I have often posted raw query results in forum discussions for a variety of reasons.

In September 2005, Yahoo! claimed it had surpassed Google’s index size, and Google fired back that it had an index about 3 times the size of Yahoo!’s. But then Google stopped reporting claimed indexed pages. People began running queries on common English words in futile attempts to analyze how many pages Google indexes.

Danny Sullivan introduced a bizarre negative query that gave some indication of index size, but such queries don’t work with the Bigdaddy architecture (which Google rolled out from December 2005 through March 2006, completely replacing the old Google). Dr. Edel Garcia, one of the few IR-credentialed people in the SEO industry, wrote that “Trying to estimate the size of a search engine from scores driven by queries is not a reliable approach and raises more questions than answers.”

While I agree with him completely, I nonetheless put together a series of queries that gave a rough basis for comparison between various search engine index sizes. I have performed these kinds of estimates for a variety of reasons, not necessarily to estimate index size. For example, in April 2005 I estimated a ratio for Google backlinks to Yahoo! backlinks by comparing a random sampling of queries to estimate approximate sizes of databases.

How accurate was that estimate? Probably not very accurate, but I made a (weak) case that Yahoo! may at that time have had a database about 92% the size of Google’s.

Last March I looked at the number of raw hits for ‘a’ across multiple search engines. Eleven months later, several of the engines report more raw hits, but Google and Yahoo! both show significant decreases. It’s not a valid search by itself, but coupled with other queries it shows an interesting pattern.

Around September 2005 and up through early 2006, people were fond of running queries on Google consisting of wildcards (such as “*.*” and “* * * * *”), but these queries no longer return any results. They probably worked on the pre-Bigdaddy Google but not on Bigdaddy. One popular query was to search for ‘the’, and in September 2005 many people reported seeing about 25 billion results in Google. I now see 5 billion. A query for ‘and’ presently returns about 4.75 billion results.

And yet, only a few months ago I still found myself estimating Google’s index size to be between 25 billion and 30 billion pages. It’s possible I did not run the full set of test queries (many more words than those I have shared) around then. I don’t know, because I cannot find any specific references to posted query numbers. This is odd because historically I have been able to find discussions where I have posted numbers. I occasionally try searches at Yahoo! but their results are so convoluted I just get frustrated and give up (hint to Yahoo!: your SERP snippets should show the selected query terms if they are actually on the page).

About the most detailed post I can find is my post-Bigdaddy update analysis that I published in April 2006. There I openly questioned the use of wildcard query tests. I did run some queries on reported pages from various top-level domains in October 2005.

Generally speaking, I am seeing Google report about 20% of its former raw hits. That is, if Google had at one time “reported” (through unreliable queries) that it indexed 25 billion pages of content, it now reports (through similar unreliable queries) that it is only indexing about 5 billion pages of content.
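The arithmetic behind that 20% figure is simple enough to sketch. The Python snippet below is only an illustration: the two hit counts are the rough numbers quoted above (about 25 billion then, about 5 billion now), and the function name retained_share is made up for this example. Raw hit counts are unreliable, so treat the result as a ratio of two estimates and nothing more.

```python
def retained_share(former_hits: int, current_hits: int) -> float:
    """Return the fraction of former raw hits that is still reported."""
    if former_hits <= 0:
        raise ValueError("former_hits must be positive")
    return current_hits / former_hits


# Rough figures taken from the unreliable raw-hit queries discussed above.
former_hits = 25_000_000_000   # about 25 billion, reported in September 2005
current_hits = 5_000_000_000   # about 5 billion, seen now

share = retained_share(former_hits, current_hits)
unreported = former_hits - current_hits

print(f"Share of former raw hits still reported: {share:.0%}")
print(f"Pages no longer counted in raw hits: {unreported:,}")
```

Plugging in the numbers for a different word (such as the 4.75 billion seen for ‘and’) only makes sense if you also have a comparable earlier count for that same word.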

However, I believe that Google still knows about and sort of mentions the other 20 billion pages…in its Supplemental Index. In January 2007, Matt Cutts said “supplemental results aren’t something to be afraid of”. In fact, he even wrote:

…I think going forward, you’ll continue to see the supplemental results get even fresher, and website owners may see more traffic from their supplemental results pages. To check out the current freshness of the supplemental results, I grabbed 20 supplemental pages from my site and checked out their crawl date using the “cache:” command and looking in the cached page header….

Well, Matt, I trusted your judgement and left it at that. I should have been a little more skeptical.

Keep in mind, folks, that I am not about to call Matt Cutts a liar. Far from it. I believe the guy has a lot of integrity but I think he has either been muzzled by Google policy or something has just completely slipped past him. You see, back in September, when I was doing more consulting and looking for a new job, I missed a very important post on SE Roundtable that noted Google’s Cache for Supplemental Results Does Not Highlight Query Words.

Matt saw that post and responded twice. On September 5 Matt wrote: “Probably just index churn. The Supplemental folks already had a fix for this ready, so it’s a matter of when some executables will be pushed. I’d expect highlighting to be working again within the next few weeks.”

Sorry, Matt. It’s still not working. In fact, last night as I wrote my previous SEO Theory blog article, I had to do some research on your blog. I became extremely frustrated because Google was not reporting results I knew should be there. Turns out you have many pages in the Supplemental Index now and they are not appearing in query results…not even when I search for specific expressions with Exact Find mode queries (where the expressions are placed in quotes).

We’ve known since Google Custom Search was first released that the custom search tools won’t report Supplemental Results Pages. I have specifically asked that Google make that possible (as have other people). Today, the only way I can see Supplemental Results pages in Google’s index is by performing site: queries in Google.

That’s a problem. That’s a big problem, especially if 80% of Google’s index is now Supplemental. It’s a problem even if most of Google’s index is not Supplemental because it looks like most of Google’s index is Supplemental.

And there are way too many people who are reporting that their pages have “gone Supplemental” for them all to have done so because of duplicate content issues. If you want to hear something from Google, then I encourage you to leave a comment for the Webmaster Central team asking why Google won’t return query results for Supplemental Pages.

I mean, pick a Supplemental Page from your own site, grab some unique text from that page, and run an Exact Find mode query for that text. Google will tell you it cannot find any results. And yet if you perform a site: query, you’ll see the page listed. If you look at the cache data for a Supplemental Page from your site, you’ll see that Google tells you the site name only appears in links pointing to the page — even if you use the site name on the page (note: only well-linked sites with good internal navigation report such results, so your mileage may vary).
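If you want to repeat that comparison for several pages, it helps to build the two queries the same way each time. The sketch below is a hedged illustration in Python: it only constructs the query strings and search URLs and does not fetch or parse any results. The helper names (exact_phrase_query, site_query, search_url), the sample text, and example.com are all placeholders made up for this example.

```python
from urllib.parse import quote_plus

# Standard Google web search endpoint; the query goes in the "q" parameter.
GOOGLE_SEARCH = "https://www.google.com/search?q="


def exact_phrase_query(text: str) -> str:
    """Wrap unique page text in quotes for an exact-phrase query."""
    return '"' + text.strip() + '"'


def site_query(domain: str) -> str:
    """Build a site: query that lists the pages known for a domain."""
    return "site:" + domain.strip()


def search_url(query: str) -> str:
    """URL-encode a query string and attach it to the search endpoint."""
    return GOOGLE_SEARCH + quote_plus(query)


# Placeholder values: substitute text and a domain from your own site.
unique_text = "some unique text copied from the page"
domain = "example.com"

phrase_url = search_url(exact_phrase_query(unique_text))
site_url = search_url(site_query(domain))

print(phrase_url)
print(site_url)
```

Open both URLs in a browser and compare by hand: the behavior described above is a page that appears under the site: query but not under the exact-phrase query.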

I’m not going to speculate on why Google is not parsing and indexing the words on Supplemental Results pages. Nor will I accept any insistence from Google that they are until they start showing indexed words from Supplemental Pages. I’m only interested in results at this point, not excuses, explanations, or promises.

Right now, Google has an index of about 5 billion pages. Right now, most Web pages won’t be found even for obscure, long-tail queries.

That’s a problem. That’s a big problem.

Yes, we can all go out and get more “quality links”, but most people are not in the position of being able to do that for every page on their sites.

That’s a problem. That’s a big problem for a lot of people.

It’s not just hurting the Webmasters. It’s also hurting the surfers who use Google and who rely upon Google to return the most useful, relevant results possible. You can hide the elephant in the living room by blowing smoke and tooting clown horns for only so long. Eventually, the kids are going to get bored with the smoke and noise and start asking why there is an elephant in the living room.

Google: I’m now asking. Are you listening?

P.S. By the way, there is a query format being passed around various SEO blogs that purportedly shows you all the Supplemental Pages in your site. These queries are structured as “site:example.com *** -djfd” or something close to that. The results from this query are misleading. I tested the query on a small content site with Supplemental Page results. I found that the pages being shown as Supplemental were also showing as normal, Main Index, non-Supplemental pages in an unmodified site: query. You cannot know how many of your pages are really only listed in the Supplemental Index, at least not by using that query or anything like it.

Right now, it doesn’t appear that nonsense negation works the way it did with the old, pre-Bigdaddy Google architecture.
