Tuesday, February 15, 2011

Deep Question Answering - The AI techniques behind IBM's Watson

IBM's Jeopardy-playing system Watson has been hailed recently as the greatest advance in AI since their Deep Blue system mastered chess. Deep Blue was criticized by many for taking a brute-force approach, relying on a giant opening book and sheer computational power to search the game tree, rather than the intuitive approach apparently taken by human players. So, what about Watson? Is it just a little robot in a Chinese room, blindly handing us back the correct answer?

In computer science parlance, the specific field responsible for a problem like Jeopardy is Question Answering (QA), which is sort of a subfield of Information Retrieval (IR). IR deals with all methods of returning relevant information or documents from a store of knowledge, while QA specifically deals with returning information in response to natural language questions (who, where, when, what, etc.). QA also focuses on returning specific answers (just a  word or phrase), while IR commonly returns whole documents (as search engines do). IBM title their system DeepQA, presumably a nod to Deep Blue.

This paper (Ferucci et al 2010, AI magazine), outlines Watson's approach to Question Answering in detail. It's 20 pages long, but it's not incredibly technical, and is quite readable to the determined and interested layman. (Footnote: I had assumed the system was named for Sherlock Holmes' assistant; in fact, it is named after Thomas J Watson, the founder of IBM)

If you want to succeed, double your failure rate - Thomas J. Watson

The paper begins by talking about Jeopardy and the problem's relevance, interestingly noting:

 "we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess." 

The quirky 'question as answer' format of Jeopardy clues doesn't seem to be a technical problem:

" note that while the Jeopardy! game requires that answers are delivered in the form of a question ...  this transformation is trivial and for purposes of this paper we will just show the answers themselves"

The time during which the question is read aloud by the host is the time used to compute answers to the question and a confidence level; this time varies between one and three seconds, averaging three seconds. The category of the question may be straightforwardly useful for determining the domain of the answer, or it may be misleading or punny.

If the clue contains more than one statement, Watson decomposes it into subclues. There are examples, but details on how this is done automatically are kind of sketchy here. The next step is to identify what kind of answer the system is looking for. In the paper this is called the Lexical Answer Type (LAT); basically this means figuring out which word in the clue that represents the answer. For example, if the question begins "In this classic film", the Lexical Answer Type is 'film'. Often there is no clear type in the question, only a pronoun like 'it' or 'these'.

Ok, so if things have gone well at this stage, Watson knows the type of the answer it is looking for (film, city, person), one or more clues relating to the answer, and the category the clue is in. So where does it go to look for answers? Well, one place it doesn't go is directly to the internet. Although some QA competitions have allowed a live web connection, the Jeopardy challenge doesn't. Instead, the DeepQA system builds a large corpus of web documents which are then stored and searched offline during the live Jeopardy game.

An initial set of resources (example questions, thesauri, encyclopedias, ebooks) are used as seeds to generate web searches used to retrieve relevant web snippets. These web snippets, together with the initial resources and many other downloadable knowledge bases (dbpedia, freebase, wordnet, yago) form the "extended corpus" that is Watson's knowledge base.  Despite using so many structured knowledge sources, the paper notes that Watson can only directly look up the answer in less than 2% of cases, which shows the importance of the question analysis step.

The next step is to analyse the clue further in order to find out how to search the knowledge base. The question analysis section involves intensive Natural Language Processing techniques. The clue fragments are parsed, first syntactically and then semantically, to extract a logical form. The semantic roles in the clue are identified. Semantic role labeling is a major breakthrough in NLP, giving us algorithms that can identify the type of the subject and object in a parsed sentence.

Coreference resolution detects which entities pronouns refer to, named entity recognition detects terms that are proper nouns referring to people or organizations, relation extraction identifies the parts of the clue that make up predicates or relations between nouns or entities in the clue. The paper doesn't discuss the exact methods employed for these tasks, but a quick google scholar search shows the amount of research going on at the cutting edge of these tasks.

The process of searching the extended corpus is divided into two steps - at first the focus is on recall - the system generates several hundred answers in the hope that the correct answer is somewhere in this list. It does this by querying its corpus in varied ways (Lucene and SPARQL are mentioned), and returning small passages of text and individual candidate answers. This list of possible answers is then scored by a small army of standalone algorithms, each of which employs state-of-the-art computational lingusitics techniques to examine a particular aspect of the answer.

"Each of the scorers implemented in Watson, how they work, how they interact, and their independent impact on Watson's performance deserves its own research paper."

The range and variation in the knowledge incorporated at this stage is astounding. There's a component that can compare dates and times, a bit for figuring out geographic coordinates, a doohickey for matching the syntactic form of the clue with passages retrieved from web documents, and loads more scorers that can influence the confidence judgement for a particular answer. The focus here is on precision - if the system decides to buzz in, it must have a high confidence that its top answer is correct. A lot of the discussion in the paper is about precision vs. recall.

I'm not going to discuss the hardware architecture (lots of parallel computation using Hadoop, UIMA to manage component integration), and there's not a whole lot more detail about the implementation in this fairly high-level paper (they have a lot to squeeze in, in fairness).

However, there are a few papers floating about from IBM researchers that discuss methods employed in the Jeopardy challenge, even if the papers don't mention Jeopardy directly, and I just want to mention one that caught my eye: Large Scale Relation Detection . The paper describes methods for detecting relations between entities in a large corpus (over a billion words of text).

In my opinion, this paper gives a much more concrete idea of exactly how Watson goes about extracting information from unstructured text than the high-level discussion in the AI magazine article. The paper discusses how patterns of words in text such as "appeared together at the premier of " can be used to predict other, more useful relations like "co-star in". The authors emphasize the need for a large number of built-in, detailed relations between concepts that can be populated using patterns of words in text.

Watson is reading documents and understanding the relations between predicates, subjects, and objects in all kinds of noisy and sometimes contradictory data. Although I don't know enough about DeepBlue to say for sure, I think Watson is less susceptible to accusations of shallow brute-force computing than the chess machine was. Watson behaves as if it understands the clues. Does it literally understand the clues, or is it merely simulating the ability to understand them by retrieving stored information?

Given the multitude of AI components interacting to solve all the little tasks that must be accomplished to parse a question and relate it to knowledge gained from previously seen documents, I think that Watson's method is much more human than a man who doesn't understand Chinese comparing strange squiggles on pieces of card.


  1. "Does it literally understand the clues, or is it merely simulating the ability to understand them by retrieving stored information?"

    Some can say this is how we function based on our usage of search engines like google. For that we've become more efficient IMO. So is it really simulation, or is it acting more human?

  2. Just found IBM's research blog: there are interesting posts her about the details of some aspects of it's jeopardy play: