advice from a fake consultant

out-of-the-box thinking about economics, politics, and more... 

Friday, December 4, 2009

On Getting Found, Or, Search Engines: Is There A Difference?

I have a story today that comes from my predilection to “self-syndicate”, meaning that I post my stories far and wide, in the same way a newspaper columnist is syndicated nationally—or beyond.

After I post, I know others will also post my stories to their sites, a topic that was itself the subject of a recent conversation.

To keep track of it all, I use the Google...but I wondered if that’s actually the most effective tool for the job--or not--so as an experiment I recently challenged several search engines to go out and seek the same search term.

We find out today...and the results are, indeed, interesting.

So here are the rules of the game: on the afternoon and evening of November 29th, I posted my story "On Stimulating The Future, Or, "It's The Ytterbium, Stupid!"" on 27 sites. The next morning I conducted the searches you'll see referenced in this discussion, using as a search term the exact words of the title, in quotes, just as it appears above. During the course of writing this story, we'll revisit the same sites to see if the results have changed.
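For anyone who wants to repeat the experiment, here is a minimal sketch of how an exact-phrase query like this one might be built and percent-encoded for use in a URL. The function names and the `search.example` address are made up for illustration; each real search engine has its own address and parameter names, which you would need to look up.

```python
from urllib.parse import quote_plus


def exact_phrase_query(title):
    """Wrap a title in double quotes so it is searched as an exact phrase."""
    return f'"{title}"'


def encode_query(query):
    """Percent-encode a query string so it can be placed safely in a URL."""
    return quote_plus(query)


title = 'On Stimulating The Future, Or, "It\'s The Ytterbium, Stupid!"'
query = exact_phrase_query(title)

# Hypothetical address: a placeholder, not a real search engine.
url = "https://search.example/search?q=" + encode_query(query)
print(url)
```

Note that the encoding step turns characters such as quotation marks, commas, and exclamation points into `%`-codes, which is why the printed address looks nothing like the title.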

Let The Contest Begin!


So the first search was conducted on Google, which found 849 results.

That number is far larger than the 27 postings because the tags associated with (or the proper nouns that appear in) a story often trigger websites to place that material on pages with other stories with matching tags or names, as you can see from this example at RootsWire. (The story appears twice because it was updated after it was posted.)

This creates lots of iterations of the same title on the same site under different categories, a situation other search providers seek to reduce; this being one of the points behind all those recent ads for Microsoft's Bing search engine.
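To put a number on that kind of duplication, the results can be grouped by the site they came from. What follows is only a rough sketch: the addresses in the list are invented placeholders (not real search results), and "same hostname" is a simplifying assumption about what counts as the same site.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_site(urls):
    """Return a Counter mapping each hostname to how many results it contributed."""
    return Counter(urlsplit(u).hostname for u in urls)


def duplicate_count(urls):
    """Count every result beyond the first one from each site."""
    return sum(n - 1 for n in count_by_site(urls).values())


# Hypothetical result list: the same story filed under different tags.
results = [
    "http://site-a.example/tag/economics/story",
    "http://site-a.example/tag/politics/story",
    "http://site-b.example/story",
    "http://site-c.example/2009/12/story",
    "http://site-c.example/category/science/story",
]

print(count_by_site(results))
print(duplicate_count(results))
```

A search engine that "reduces clutter" is, in effect, trying to drive that second number toward zero.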

A quick note about "search consistency": searching for the same term at Google on multiple occasions will yield different results each time, even if the searches are conducted immediately after one another. For example, my search this morning found 852 results--and then, just a few minutes later, 653. (By the way, if you click on these links now, some other number of results will appear, which is its own comment on consistency.)

We next visit Bing, where 16 results were initially found. Interestingly, some of the links were the ones I placed, but 6 of the 16 were multiple iterations of the same story on three sites.

As with Google, visiting Bing today might yield 57 links--or 2530, or 152, or 26--and despite Bing's advertising claims that they make searching simpler by eliminating Internet "clutter", a huge number of the links I'm seeing here are links to the weather in virtually every city in Maine; all of these linked back to "Weather Underground" weather reporting pages...and all of those pages were from the same basic address: insert name here.wunderground.com.

Next was Yahoo!, reporting 887 results (and then, after clicking through a few pages, 1580). There was an interesting variation to the pattern of what they found, however: more results from the first 50 were links to the original 27 postings than appeared to be the case with either Google or Bing.

The search today found 860 results...four times in a row...which is by far the most consistent results reporting so far--even if the results from the other day were completely different.

Lycos found 67 iterations of the posting...or 50...and then 49...with roughly a dozen of the first 30 listings being "duplicative" entries, which is fairly consistent reporting. Returning to the site today, the search engine found 69 listings--and it was also able to do that four times in a row...which makes it at least the "consistency equal" of Yahoo!

Dogpile (a product of the fine folks at Infospace) aggregates results from Google, Yahoo!, Bing, and Ask.com into one set of results...and for some reason the first result on the second page was for a futures trading opportunity ("I'm shocked to discover there's gambling here...!").

That said, Dogpile "sniffed out" 40 results, with many of those being "duplicate" instances of the same story from the same site. Conducting the same search today yields 6 additional results--all of which appear to be duplicates of the previous 40.

On four further attempts to search, the original 40 results were found.

WebCrawler, another Infospace property, located 38 results; again, the results are highly duplicative. It is not possible to enter the entire search term at this site; instead, the term...

"On Stimulating The Future, Or, "It's The Ytterbium,


...was used.

Four additional searches were conducted today, with 38 results found each time.

(Because the page-naming conventions of both Dogpile and WebCrawler insert an ! into the page names upon which results are presented, they can't be linked here, and you'll just have to visit the pages on your own.)

Remember Altavista?

Altavista found 904 iterations of the story, then 17,800 on today's search. There is an option to search either "Worldwide" or "USA"; the Worldwide search, conducted immediately after today's USA search, found, oddly enough, 2460 results--and for at least the first several pages, which was as far as I looked, the results were the same as for the USA search.

Four additional searches, conducted today, located the same 17,800 results.

One strange idiosyncrasy of the site is that it won't actually display those 17,800 results: instead, it only displays the first several pages of results (in this case, 7 pages), and then just stops, with no additional pages made available beyond that point. There is an "advanced settings" page available, but it does not offer any solution for this problem.

Ask.com displays 240 results on the first search--and they were the only site to report the listing on the Times of India site right there on the front page (which, if you return to the site, is no longer the case)--but on the down side, 1/3 of the results on that first page were "sponsored results".

After the first page, ½ of all results are "sponsored", and the results are highly duplicative. By page 10 of the results, as few as 2 of the 13 results on the page are not sponsored.

Today's searches located 452 links, then 449, then 452, and then (take a guess...) 449.

Ever heard of Duck Duck Go? Neither had I before this story. They feature an unusual format that displays some results, and then, when you click "more results", displays those below the first results on the same page; a pattern that continues until all results are displayed.

The Duck located 35 results on the first attempt, with no duplicates. The "wunderground" domain was represented--but only once.

Apparently recognizing that their searches are not going to give every result, the site encourages you to also search at YouTube, flickr, twitter, amazon, and Google.


Turning off the "safe search" feature yields 43 results, including BhamLinks.com (a news aggregator from Birmingham, Alabama), Pshcye's Links ("Esoteric Subjects on the Web"), and the "Li-Ion" page from the Journalism that matters site. (Li-Ion, by the way, is the abbreviation used to describe lithium ion batteries.)

Conducting additional searches on the site today yields the same 43 results.

Finally, Cuil. I had never heard of this site before...and apparently, they've never heard of me, either, with zero results reported for my query. Searching the "127 billion web pages" they purport to scan today provided no results again during four additional checks--which makes this site the most consistent of the search engines I examined.

I conducted a test search for pizza (with no quotes). 809,000,000 results were found...but only two were displayed on the "All Results" tab: one for Pizza Hut, one for the Wikipedia entry for pizza (which featured the story of how pizza was introduced into Pakistan, of all things). Even more odd: on the same page you can look up "Pizza franchises" and other pizza-related result "categories", and there's a "Timeline for Pizza" with entries like: "2004 Melbourne, Australia" and "1993 Pizza was".

All of this appears to be at odds with the intent of the site's operators:

"Popularity is useful, but has dominated search results so heavily that it gets harder and harder to find the page you want, especially if your search is a complex one. Cuil respects popular pages and recognizes that for many simple searches, popularity is an easy answer to your question. But for a deeper search, establishing relevancy is more than a numbers game. Cuil prefers to find all the pages with your keyword or phrase and then analyze the rest of the content on those pages..."


And The Winner Is...

Those are the results: so, what about conclusions?

The first conclusion we can reach about all of this is that the number of results that any search engine locates on any particular visit is highly variable--and so much so that the number of results presented appears to be virtually random (with the notable exception of Cuil, which seems to be consistently unable to find anything).
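One rough way to compare that variability across engines is to ask how far the lowest count falls below the highest, as a fraction of the highest. The sketch below uses only counts reported earlier in this story; the "relative spread" measure itself is my own simple choice for illustration, not any kind of standard yardstick.

```python
def relative_spread(counts):
    """Return (max - min) / max for a list of result counts, or 0.0 if all are zero."""
    highest = max(counts)
    if highest == 0:
        return 0.0
    return (highest - min(counts)) / highest


# Result counts as reported in this story, in the order they were observed.
reported = {
    "Google": [849, 852, 653],
    "Bing": [16, 57, 2530, 152, 26],
    "Ask.com": [240, 452, 449, 452, 449],
    "Cuil": [0, 0, 0, 0, 0],
}

for engine, counts in reported.items():
    print(f"{engine}: {relative_spread(counts):.2f}")
```

A value near 0 means the engine reported roughly the same count every time; a value near 1 means the counts were all over the map.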

With that said, if I were quickly looking for this particular story, it appears that some of the odd search engines might be the best choices, including Lycos, Duck Duck Go, and Ask.com.

On the other hand, if the idea was to determine how far a story has been distributed, Google seems to be the winner.

There is another reason to use search engines, that being to find information about a topic that you currently don't know enough about; this test is not well suited to answer the question of which search engine is best for that purpose...and it's a test that we'll save for another day.

So that's today's story: we visit quite a few search engines, we learn that the results you get are almost always entirely unpredictable, and, in what might be the most important lesson of the day, we learn that deifying Tiger Woods can backfire on you, big time.
