
Don't Use Google Search Estimates to Compare Terms


You are at a conference and the speaker compares two terms using the number of estimated hits returned by Google Search. A very common thing to do. Almost nobody seems to care that those numbers do not resemble the number of actual results.

When you search for something with Google Search and Google tells you that there are 11,346,000 results, this number is completely made up. An algorithm determines those numbers without looking at the complete set of data, and this algorithm is based on some (secret) ingredients. Google has enormous capabilities, but determining the number of actual hits in its search data cannot be done in real time.

Let's take a closer look.

Assumptions

Suppose that Google Search could really determine the real number of results. Following set theory and Boolean algebra, I defined the following assumptions for searching for two arbitrary search terms «foo» and «bar»:

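| Query | Number of Hits Returned |
|---|---|
| foo | A |
| bar | B |
| foo AND bar | C; less than A; less than B |
| foo OR bar | D; more than A; more than B |
| foo -bar | E; less than A |
| bar -foo | F; less than B |
| foo (second query) | A |
| bar (second query) | B |
| foo AND bar (second query) | C |
| pages of 10 for "foo AND bar" | C/10 |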

For the AND/OR/NOT queries, we can only give rough estimates since it is hard to determine better numbers without the whole data-set.
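
The assumptions follow directly from set semantics. As a minimal sketch, Python's built-in sets show the relations that exact counts would have to satisfy. The page IDs are invented for illustration, and the names `foo_pages` and `bar_pages` are hypothetical:

```python
# Hypothetical page IDs matching each term; invented purely for illustration.
foo_pages = {1, 2, 3, 4, 5, 6}
bar_pages = {5, 6, 7, 8}

a = len(foo_pages)               # foo
b = len(bar_pages)               # bar
c = len(foo_pages & bar_pages)   # foo AND bar
d = len(foo_pages | bar_pages)   # foo OR bar
e = len(foo_pages - bar_pages)   # foo -bar
f = len(bar_pages - foo_pages)   # bar -foo

# Relations that exact counts always satisfy.
assert c <= min(a, b)
assert d >= max(a, b)
assert e <= a and f <= b

# Inclusion-exclusion ties the counts together even more tightly.
assert d == a + b - c
assert e == a - c and f == b - c

print(a, b, c, d, e, f)
```

Strictly speaking, set theory only guarantees "at most" and "at least" rather than "less than" and "more than": the counts are equal when one term never appears without the other.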

Test Queries

In order to get numbers myself, I defined pairs of search terms and used Google Search (via Google.com) to query for the terms.

My software environment was Debian GNU/Linux Jessie with the most current Tor Browser. I think that using Tor Browser gives me less personalized search results because I am not logged in to any Google account and the Tor Browser does not expose a highly unique browser fingerprint:

Within our dataset of several hundred thousand visitors, only one in 2667.37 browsers have the same fingerprint as yours.
Currently, we estimate that your browser has a fingerprint that conveys 11.38 bits of identifying information.

Using Tor does not prevent localization at all: the Tor exit node I was using has a geographical location associated with it. However, I was using the same Tor connection for all queries. Therefore, the numbers should be comparable to each other even if the reproducibility of the exact numbers is questionable at best.

My Tor Browser identifies itself as Firefox 45.6.0 (Build identifier: Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0) and the time of queries was roughly 2017-01-15 7pm CET.

The first set of terms was «emacs» and «orgmode». Since Google Search is case-insensitive, I am only using lower case terms here.

| query | hits reported | should | differences |
|---|---|---|---|
| emacs | 6210000 | | |
| orgmode | 346000 | | |
| emacs AND orgmode | 251000 | <346000 | |
| emacs OR orgmode | 6370000 | >6210000 | |
| emacs -orgmode | 4760000 | <6210000 | |
| orgmode -emacs | 94900 | <346000 | |
| emacs (second query) | 6200000 | 6210000 | 10000 fewer hits |
| orgmode (second query) | 347000 | 346000 | 1000 additional hits |
| emacs AND orgmode (second query) | 252000 | 251000 | 1000 additional hits |
| actual pages | 56 | 25100 | contradiction |

The last row is not really about estimated numbers: I determined the number of actual result pages («actual pages») using the following algorithm: search for «foo AND bar», then follow the result pages using the «Next» button until the end is reached. For the terms above, it was this URL. Then, Google shows the following statement:

In order to show you the most relevant results we have omitted some entries very similar to the 150 already displayed. If you like you can repeat the search with the omitted results included.

I clicked on the link behind «search with the omitted results included» and followed the result pages until they did not yield any more results. For the terms above, the corresponding final URL I got was this. This result page shows a statement like «Page 56 of about 252,000 results (1.19 seconds)». There is clearly a discrepancy because 56 pages with ten results each amount to at most 560 results, while Google claims about 252000 results. That's only 0.22 percent of the estimated results.
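
The arithmetic behind that percentage, as a minimal sketch. The function name `reachable_share` is a placeholder of mine, and ten results per page is taken from the experiment described above:

```python
RESULTS_PER_PAGE = 10  # Google showed ten results per page in this experiment


def reachable_share(last_page: int, estimated_hits: int) -> float:
    """Upper bound on the share of estimated hits that could be paged through."""
    return last_page * RESULTS_PER_PAGE / estimated_hits


share = reachable_share(last_page=56, estimated_hits=252_000)
print(f"{share:.2%}")
```

The value is an upper bound because the last result page may hold fewer than ten entries.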

Except for this, there were some minor differences between the expected result numbers and the numbers returned by Google.

The second set of terms was «linux» and «torvalds»:

| query | hits reported | should | differences |
|---|---|---|---|
| linux | 396000000 | | |
| torvalds | 3070000 | | |
| linux AND torvalds | 472000 | <3070000 | |
| linux OR torvalds | 34600000 | >396000000 | contradiction: <10% of expected value |
| linux -torvalds | 292000000 | <396000000 | |
| torvalds -linux | 465000 | <3070000 | |
| linux (second query) | 397000000 | 396000000 | one million hits difference |
| torvalds (second query) | 3090000 | 3070000 | 20000 hits difference |
| linux AND torvalds (second query) | 491000 | 472000 | 19000 hits difference |
| actual pages | 60 | 47200 | contradiction |

The query for «linux OR torvalds» returned far fewer hits than were estimated for «linux» alone, which contradicts the assumptions.

When querying the terms and their AND-combination for the second time, Google returned more results than with the first query.

Once again, the number of actual result pages differs greatly from the number of hits shown by Google: only 0.13 percent of results could be navigated to.
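
Checking a set of estimates against the assumptions can be automated. The following is a rough sketch only: the function `find_contradictions` and the wording of its check labels are my own invention, and it tests "at most" and "at least" rather than strict inequalities:

```python
def find_contradictions(a, b, c, d, e, f):
    """Return the assumptions violated by the estimates for two terms.

    a, b: hits for each term; c: AND query; d: OR query;
    e: first term minus second; f: second term minus first.
    """
    checks = {
        "AND <= first term": c <= a,
        "AND <= second term": c <= b,
        "OR >= first term": d >= a,
        "OR >= second term": d >= b,
        "first -second <= first term": e <= a,
        "second -first <= second term": f <= b,
    }
    return [name for name, ok in checks.items() if not ok]


# First-round numbers for «linux» and «torvalds» from the table above.
linux_torvalds = find_contradictions(
    a=396_000_000, b=3_070_000, c=472_000,
    d=34_600_000, e=292_000_000, f=465_000,
)
print(linux_torvalds)
```

For these numbers, the only violated check is the one comparing the OR query with the first term.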

The next set of terms was «vienna» and «mozart»:

| query | hits reported | should | differences |
|---|---|---|---|
| vienna | 187000000 | | |
| mozart | 84500000 | | |
| vienna AND mozart | 1190000 | <84500000 | |
| vienna OR mozart | 271000000 | >187000000 | |
| vienna -mozart | 188000000 | <187000000 | contradiction |
| mozart -vienna | 84900000 | <84500000 | contradiction |
| vienna (second query) | 187000000 | 187000000 | |
| mozart (second query) | 85000000 | 84500000 | 500000 additional hits |
| vienna AND mozart (second query) | 1190000 | 1190000 | |
| actual pages | 55 | 119000 | contradiction |

Here it was interesting to see that vienna without mozart returned even more hits than vienna alone. This is a clear contradiction. The same holds for mozart without vienna, which returned more hits than mozart alone.

The second query for «mozart» returned 500000 more hits than the first one.

The number of actual result pages is only 55, which is dramatically less than the 119000 it should have been.

Now for the next set of terms: «trump» and «fake»:

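| query | hits reported | should | differences |
|---|---|---|---|
| trump | 885000000 | | |
| fake | 675000000 | | |
| trump AND fake | 81400000 | <675000000 | |
| trump OR fake | 900000000 | >885000000 | |
| trump -fake | 671000000 | <885000000 | |
| fake -trump | 209000000 | <675000000 | |
| trump (second query) | 885000000 | 885000000 | |
| fake (second query) | 675000000 | 675000000 | |
| trump AND fake (second query) | 81400000 | 81400000 | |
| actual pages | 57 | 8140000 | contradiction |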

It is interesting to see that with these result sets, the numbers do fulfill the assumptions with only one exception: the number of actual results is again only a tiny fraction of the number of hits stated by Google. The fewer than 570 results found are far fewer than the 81400000 estimated.

The next set of terms was chosen somewhat differently. The previous terms were rather general, resulting in millions of estimated search results. To compare those sets of terms with a set that is not likely to return millions of results, I chose «linuxtage» and «privatsphäre» (German for privacy):

| query | hits reported | should | differences |
|---|---|---|---|
| linuxtage | 47000 | | |
| privatsphäre | 40100000 | | |
| linuxtage AND privatsphäre | 12000 | <47000 | |
| linuxtage OR privatsphäre | 40200000 | >40100000 | |
| linuxtage -privatsphäre | 44600 | <47000 | |
| privatsphäre -linuxtage | 8420000 | <40100000 | |
| linuxtage (second query) | 47000 | 47000 | |
| privatsphäre (second query) | 40100000 | 40100000 | |
| linuxtage AND privatsphäre (second query) | 12000 | 12000 | |
| actual pages | 68 | 1200 | contradiction |

We still see a huge difference in the number of actual result pages. Overall, smaller numbers might indicate better result estimates.

The next terms are «treibsand» (German for quicksand) and «burggraben» (German for moat):

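| query | hits reported | should | differences |
|---|---|---|---|
| treibsand | 319000 | | |
| burggraben | 470000 | | |
| treibsand AND burggraben | 1130 | <319000 | |
| treibsand OR burggraben | 477000 | >470000 | |
| treibsand -burggraben | 317000 | <319000 | |
| burggraben -treibsand | 394000 | <470000 | |
| treibsand (second query) | 319000 | 319000 | |
| burggraben (second query) | 470000 | 470000 | |
| treibsand AND burggraben (second query) | 1130 | 1130 | |
| actual pages | 19 | 113 | contradiction |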

This result supports the guess that less general terms tend to return better estimates. The number of actual result pages is still way off the estimated value. However, the difference is much smaller than in the other examples.
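
To put the six term pairs side by side, here is a small sketch that computes which share of the estimated AND hits could actually be paged through. It uses the first-round estimates and the page counts from the tables above; the variable names are mine, and ten results per page is assumed throughout:

```python
RESULTS_PER_PAGE = 10

# (query, estimated hits for the AND query, last reachable result page),
# taken from the first-round numbers in the tables above.
measurements = [
    ("emacs AND orgmode", 251_000, 56),
    ("linux AND torvalds", 472_000, 60),
    ("vienna AND mozart", 1_190_000, 55),
    ("trump AND fake", 81_400_000, 57),
    ("linuxtage AND privatsphäre", 12_000, 68),
    ("treibsand AND burggraben", 1_130, 19),
]

# Upper bound on the share of estimated hits that can be reached by paging.
shares = {
    query: pages * RESULTS_PER_PAGE / estimated
    for query, estimated, pages in measurements
}

# Print from the largest estimate to the smallest.
for query, estimated, _ in sorted(measurements, key=lambda m: m[1], reverse=True):
    print(f"{query:<28} {estimated:>11,} {shares[query]:>9.4%}")
```

For these six pairs, the reachable share rises as the estimate shrinks: from well under 0.001 percent for «trump AND fake» to roughly 17 percent for «treibsand AND burggraben». Six data points are of course not enough to call this a rule.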

Many more things can be tested. For example, «subtracting» a less general term from a general term should lead to almost the same number as was estimated for the general term alone. And so on.
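
One such test can be sketched with the numbers already collected. With exact counts, the hits for «foo -bar» would have to equal the hits for «foo» minus the hits for «foo AND bar». The helper `exclusion_gap` below is a name I made up; it reports how far the estimate for the exclusion query deviates from that expectation, using first-round numbers from the tables above:

```python
# (term, excluded term, hits for term, hits for the AND query,
#  reported hits for "term -excluded"), first-round numbers from the tables.
rows = [
    ("emacs", "orgmode", 6_210_000, 251_000, 4_760_000),
    ("linux", "torvalds", 396_000_000, 472_000, 292_000_000),
    ("vienna", "mozart", 187_000_000, 1_190_000, 188_000_000),
    ("privatsphäre", "linuxtage", 40_100_000, 12_000, 8_420_000),
    ("burggraben", "treibsand", 470_000, 1_130, 394_000),
]


def exclusion_gap(term_hits, and_hits, reported):
    """Relative deviation of the reported exclusion count from term_hits - and_hits."""
    expected = term_hits - and_hits
    return (reported - expected) / expected


for term, excluded, term_hits, and_hits, reported in rows:
    gap = exclusion_gap(term_hits, and_hits, reported)
    print(f"{term} -{excluded}: {gap:+.1%}")
```

The most striking case is «privatsphäre -linuxtage»: removing a term with only 47000 estimated hits lowers the estimate by roughly 79 percent compared to what exact counts would give.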

Other Resources

The website SEO Chat has a nice article about this topic. They summarize the reason for these arbitrary numbers in an excellent way:

One of the reasons that engines like Google or Bing can find so many results is that they don’t bother to collect them for your use. The chances that you will need them are slim to none, and even if you did by chance require a deeper web page, you wouldn’t be able to find it. There are just too many to sift through. Instead, you would have to go back and narrow down your search terms to get a better list of choices, something we are all pretty used to doing by now.

I agree with this from my point of view. However, they also state that «those results still do exist», which my own experiment contradicts.

Yes, Google might have those hundreds of thousands of results in their back-end. But I cannot think of any reason why they would not show them to me while I navigate through the result set.

Wikipedia itself has a very interesting article about Google search and the notability of a subject. Unfortunately, they don't mention anything about the number of hits returned.

Here is an article that explains some discrepancies: when Google Search does a query for «foo», it does not look at as much data as it does for a query for «foo -bar». Therefore, the second query is able to find more results than the first one. Well, this might be a perfectly fine explanation, but it also underlines my basic premise: the number of estimated search results is a completely made-up, arbitrary number.

Maybe the estimated number of search results will vanish in the future.

In the meantime: don't use those numbers to compare terms.
