Tuesday, September 16, 2008

Google Evaluates Search Quality

In a recent blog post, Scott Huffman, Google’s engineering director for search evaluation, elaborates on Google’s search evaluation process, which effectively measures the quality of its search results and of its users’ experience.

“Evaluating search is difficult for several reasons.

· First, understanding what a user really wants when they type a query -- the query's "intent" -- can be very difficult. For highly navigational queries like [ebay] or [orbitz], we can guess that most users want to navigate to the respective sites. But how about [olympics]? Does the user want news, medal counts from the recent Beijing games, the IOC's homepage, historical information about the games, ... ? This same exact question, of course, is faced by our ranking and search UI teams. Evaluation is the other side of that coin.
· Second, comparing the quality of search engines (whether Google versus our competitors, Google versus Google a month ago, or Google versus Google plus the "letter T" hack) is never black and white. It's essentially impossible to make a change that is 100% positive in all situations; with any algorithmic change you make to search, many searches will get better and some will get worse.
· Third, there are several dimensions to "good" results. Traditional search evaluation has focused on the relevance of the results, and of course that is our highest priority as well. But today's search-engine users expect more than just relevance. Are the results fresh and timely? Are they from authoritative sources? Are they comprehensive? Are they free of spam? Are their titles and snippets descriptive enough? Do they include additional UI elements a user might find helpful for the query (maps, images, query suggestions, etc.)? Our evaluations attempt to cover each of these dimensions where appropriate.
· Fourth, evaluating Google search quality requires covering an enormous breadth. We cover over a hundred locales (country/language pairs) with in-depth evaluation. Beyond locales, we support search quality teams working on many different kinds of queries and features. For example, we explicitly measure the quality of Google's spelling suggestions, universal search results, image and video searches, related query suggestions, stock oneboxes, and many, many more.

To get at these issues, we employ a variety of evaluation methods and data sources:
· Human evaluators. Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways. We sometimes show evaluators whole result sets by themselves or "side by side" with alternatives; in other cases, we show evaluators a single result at a time for a query and ask them to rate its quality along various dimensions.
· Live traffic experiments. We also make use of experiments, in which small fractions of queries are shown results from alternative search approaches. Ben Gomes talked about how we make use of these experiments for testing search UI elements in his previous post. With these experiments, we are able to see real users' reactions (clicks, etc.) to alternative results.”
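To make the idea of a live traffic experiment concrete, here is a minimal sketch of how a small fraction of queries could be routed to an alternative approach. It is not Google's implementation: the function name `assign_arm`, the `experiment_fraction` parameter, and the use of a SHA-256 hash for bucketing are all assumptions chosen for illustration.

```python
import hashlib


def assign_arm(query_id: str, experiment_fraction: float = 0.01) -> str:
    """Deterministically place a query in 'experiment' or 'control'.

    Hypothetical helper: hashing the identifier means the same query_id
    always lands in the same arm, and roughly `experiment_fraction` of
    all identifiers land in the experiment arm.
    """
    digest = hashlib.sha256(query_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "experiment" if bucket < experiment_fraction else "control"


if __name__ == "__main__":
    arms = [assign_arm(f"query-{i}", experiment_fraction=0.1) for i in range(10000)]
    share = arms.count("experiment") / len(arms)
    print(f"share of queries in the experiment arm: {share:.3f}")
```

Reactions such as clicks could then be compared between the two arms; how that comparison is actually done is not described in the quoted post.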
Scott Huffman went on to explain that one of the biggest drawbacks in evaluating keywords versus synonyms is the language barrier. Apparently, only human evaluation can be trusted at this level.
“Choosing an appropriate sample of queries to evaluate can be subtle. When evaluating a proposed search improvement, we consider not only whether a given query's results are changed by the proposal, but also how much impact the change is likely to have on users. For instance, a query whose first three results are changed is likely much higher impact than one for which results 9 and 10 are swapped. In Amit Singhal's previous post on ranking, he discussed synonyms. Recently, we evaluated a proposed update to make synonyms more aggressive in some cases. On a flat (non-impact-weighted) sample of affected queries, the change appeared to be quite positive. However, using an evaluation of an impact-weighted sample, we found that the change went much too far. For example, in Chinese, it synonymized "small" (小) and "big" (大)... not a good idea!”
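The difference between a flat and an impact-weighted evaluation can be illustrated with a small sketch. Everything here is an assumption made for illustration, not Google's method: the `QueryResult` record, the made-up ratings from -1 (worse) to +1 (better), and the 1/rank weighting that makes changes near the top of the results count more. A weighted average stands in for drawing an impact-weighted sample.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    query: str
    changed_positions: List[int]  # 1-based ranks whose results changed
    rating: float                 # hypothetical evaluator score in [-1, 1]


def impact_weight(changed_positions: List[int]) -> float:
    """Hypothetical impact: a change at rank r contributes 1/r."""
    return sum(1.0 / rank for rank in changed_positions)


def flat_score(results: List[QueryResult]) -> float:
    """Every affected query counts equally."""
    return sum(r.rating for r in results) / len(results)


def impact_weighted_score(results: List[QueryResult]) -> float:
    """Queries whose top results changed count more."""
    weights = [impact_weight(r.changed_positions) for r in results]
    weighted_sum = sum(w * r.rating for w, r in zip(weights, results))
    return weighted_sum / sum(weights)


# Made-up data: two low-impact improvements and one high-impact regression.
sample = [
    QueryResult("query a", [9, 10], 1.0),
    QueryResult("query b", [9, 10], 1.0),
    QueryResult("query c", [1, 2, 3], -1.0),
]

if __name__ == "__main__":
    print("flat:", round(flat_score(sample), 3))
    print("impact-weighted:", round(impact_weighted_score(sample), 3))
```

With this invented data the flat score is positive while the impact-weighted score is negative, which mirrors the situation the quote describes: a change can look good on a flat sample and still hurt the queries that matter most.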

An affiliate online marketer or home business owner should keep monitoring Google’s ever-fluctuating search quality agenda when promoting an online business, and hope to meet whatever criteria Google applies to today’s searches.

Working For You at Sbhummer4u
James McCanless

James McCanless now writes on a variety of subjects related to Internet, article, and affiliate marketing. For more information visit http://www.sbhummer4u.com. His wife Betty keeps him healthy at Healthy Betty's Site, and keeps his information sensors full at Betty Jane Online.
