Saturday, July 05, 2008

NLP based search

I have become increasingly interested in Natural Language Processing (NLP) lately. NLP can enable better human-computer interfaces, more powerful search engines, and more. One of the search startups in this area that I have been following is www.powerset.com, which was recently acquired by Microsoft. A good introduction to Powerset, with a rough technical overview, is at http://www.slate.com/id/2193837/.

Powerset's NLP technology breaks a sentence into smaller entities (nouns, verbs, adjectives, etc.) and establishes relationships between them, e.g., "eiffel tower was built in 1889" gets recorded as "eiffel tower" (noun), "built" (verb), "1889" (noun). Each such relationship (called a "fact") is recorded and represents a single quantum of information derived from the web page. A search query is translated into a similar but incomplete fact, e.g., "when was eiffel tower constructed?" would become "eiffel tower" (noun), "constructed" (verb), and "when/year/time/date" (noun/adjective). The search algorithm then matches the "factualized" query to the most closely resembling fact and fills in the missing detail (the year 1889 in this case).
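To make the idea concrete, here is a minimal sketch of storing facts as triples and answering an incomplete query against them. This is only my own toy illustration, not Powerset's actual implementation: the `Fact` type, the `VERB_SYNONYMS` table, and the `answer` function are all hypothetical names, and the "parsing" step is skipped entirely (the facts and the query are already split into parts by hand).

```python
from typing import NamedTuple, Optional, Sequence


class Fact(NamedTuple):
    """One (subject, verb, object) relationship extracted from a page."""
    subject: str
    verb: str
    obj: str


# Hypothetical synonym table so that "constructed" matches "built".
# A real system would need far richer linguistic knowledge than this.
VERB_SYNONYMS = {"constructed": "built", "erected": "built"}


def normalize_verb(verb: str) -> str:
    """Lower-case the verb and map known synonyms to one canonical form."""
    verb = verb.lower()
    return VERB_SYNONYMS.get(verb, verb)


def answer(facts: Sequence[Fact], subject: str, verb: str) -> Optional[str]:
    """Fill in the missing object of an incomplete (subject, verb, ?) query.

    Returns the object of the first stored fact whose subject and
    normalized verb match the query, or None if nothing matches.
    """
    wanted_subject = subject.lower()
    wanted_verb = normalize_verb(verb)
    for fact in facts:
        if fact.subject == wanted_subject and normalize_verb(fact.verb) == wanted_verb:
            return fact.obj
    return None


# Facts as they might be recorded from web pages (hand-written here).
FACTS = [
    Fact("eiffel tower", "built", "1889"),
    Fact("mario puzo", "wrote", "the godfather"),
]

# "when was eiffel tower constructed?" -> ("eiffel tower", "constructed", ?)
print(answer(FACTS, "Eiffel Tower", "constructed"))
```

The interesting part is that the query says "constructed" while the page said "built"; a plain keyword match on the verb would miss that, whereas the normalized fact lookup does not.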

The cool thing about converting content and queries to facts is that the search engine can identify and return relationships that are not explicitly stated in the content, unlike keyword-based search. However, the answers to most popular queries are stated explicitly in a single sentence somewhere on the web, so NLP seems less useful for searching popular content, where Google already does a pretty good job.

The real promise of NLP based search seems to be in the "long tail" of search - searches that are frequently not explicitly answered on any single web page. As the web continues to grow and many different kinds of content come online (blogs, books, emails, etc.), the long tail of web search will continue to increase its share of the total search volume. Most of us have experienced that the answers to unpopular searches are often not found on any single web page; instead, the user has to scan multiple web pages before finding what they want. Keyword-based search cannot make things any better here, since the keywords may be spread out across web pages or may simply be absent (e.g., "dog" and "tommy" can be related if tommy is the name of a dog - a fact that keyword-based search cannot discover). This is where NLP can really make a difference. It can identify facts from across web pages and save users the valuable time they spend scanning different pages trying to piece together an answer to their query.
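As a rough sketch of what "identifying facts from across web pages" could look like, the example below joins facts from different pages on a shared entity, using the dog/tommy example. Everything here is made up for illustration: the facts, the `page-*.example` source names, and the helper functions `entities_of_type` and `what_did` are my own assumptions, not anything Powerset has described.

```python
# Each fact is (subject, verb, object, source_page). All values are invented.
FACTS = [
    ("tommy", "is", "dog", "page-a.example"),
    ("tommy", "won", "race", "page-b.example"),
    ("felix", "is", "cat", "page-c.example"),
    ("felix", "won", "lottery", "page-c.example"),
]


def entities_of_type(facts, kind):
    """Return every subject that some page describes as being a `kind`."""
    return {subj for subj, verb, obj, _page in facts if verb == "is" and obj == kind}


def what_did(facts, kind, verb):
    """Answer "what did a <kind> <verb>?" by joining facts on the subject.

    Step 1 finds entities of the requested kind (possibly on one page);
    step 2 finds what those entities did (possibly on another page).
    """
    names = entities_of_type(facts, kind)
    return [
        (subj, obj, page)
        for subj, v, obj, page in facts
        if subj in names and v == verb
    ]


# "what did a dog win?" - no single page contains both "dog" and "won".
for name, thing, page in what_did(FACTS, "dog", "won"):
    print(f"{name} won {thing} (from {page})")
```

Note that the word "dog" never appears on the page that mentions the race, so a keyword search for "dog won" would have nothing to match; the answer only emerges once the two facts are linked through "tommy".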

So, very roughly speaking, if you can find the answer to your query in one Google search and after scanning 1-2 returned web pages, then NLP will not make things any better for you. If it takes more than one search and visiting 5+ search results to answer a given query, and if your query and its potential response can be formed into a fact, then NLP might be useful.

Another way to think about the applicability of NLP is in terms of the information density of a web page. NLP will be more useful for finding content in web pages with low information density. By converting the text to facts, NLP is in a way performing "semantic compression" of the content. "NLP compressed facts", owing to their increased information density, are better suited to answering user queries. These "low information density" web pages may be the ones with lower page ranks on Google. Other examples of low-density content might be casual chat sessions, email threads, etc.

Unfortunately, in my experience Powerset doesn't seem to do a good job of identifying complex facts. It does a decent job with obvious or simple facts, but based on some examples I tried, not so well with complex ones. For example, if you search for "who was the author of the godfather", you get the answer "Mario Puzo". But Google also gives you the same answer fairly easily when you search for "Godfather author" or "Godfather writer". And if you query "how many years did Mario Puzo take to write the godfather", Powerset doesn't seem to offer any useful results.

I also wonder whether their algorithm can really connect information across different websites, or even across different paragraphs of the same web page (the latter, I think, should be supported).

I'd conclude that for NLP search to be really useful, it should target the long tail of searches - searches which individually are an insignificant part of the total search volume but together make up a major chunk of it. Powerset's NLP search doesn't seem to be there yet, and quite likely neither are other existing NLP based search engines.

1 comment:

Anonymous said...

Thanks for writing this.