Tuesday, February 15, 2011

Deep Question Answering - The AI techniques behind IBM's Watson

IBM's Jeopardy-playing system Watson has been hailed recently as the greatest advance in AI since their Deep Blue system mastered chess. Deep Blue was criticized by many for taking a brute-force approach, relying on a giant opening book and sheer computational power to search the game tree, rather than the intuitive approach apparently taken by human players. So, what about Watson? Is it just a little robot in a Chinese room, blindly handing us back the correct answer?

In computer science parlance, the specific field responsible for a problem like Jeopardy is Question Answering (QA), which is sort of a subfield of Information Retrieval (IR). IR deals with all methods of returning relevant information or documents from a store of knowledge, while QA specifically deals with returning information in response to natural language questions (who, where, when, what, etc.). QA also focuses on returning specific answers (just a word or phrase), while IR commonly returns whole documents (as search engines do). IBM title their system DeepQA, presumably a nod to Deep Blue.

This paper (Ferrucci et al. 2010, AI Magazine) outlines Watson's approach to Question Answering in detail. It's 20 pages long, but it's not incredibly technical, and is quite readable for the determined and interested layman. (Footnote: I had assumed the system was named after Sherlock Holmes' assistant; in fact, it is named after Thomas J. Watson, the founder of IBM.)

If you want to succeed, double your failure rate - Thomas J. Watson

The paper begins by talking about Jeopardy and the problem's relevance, interestingly noting:

 "we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess." 

The quirky 'question as answer' format of Jeopardy clues doesn't seem to be a technical problem:

" note that while the Jeopardy! game requires that answers are delivered in the form of a question ...  this transformation is trivial and for purposes of this paper we will just show the answers themselves"

The time during which the host reads the clue aloud is the time Watson has to compute candidate answers and a confidence level; this varies from clue to clue, averaging around three seconds. The category of the question may be straightforwardly useful for determining the domain of the answer, or it may be misleading or punny.

If the clue contains more than one statement, Watson decomposes it into subclues. There are examples, but details on how this is done automatically are kind of sketchy here. The next step is to identify what kind of answer the system is looking for. In the paper this is called the Lexical Answer Type (LAT); basically this means figuring out which word in the clue indicates the type of the answer. For example, if the question begins "In this classic film", the Lexical Answer Type is 'film'. Often there is no clear type in the question, only a pronoun like 'it' or 'these'.
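
The paper doesn't spell out the LAT detection algorithm either, so just to make the idea concrete, here is a toy heuristic sketch. The find_lat helper and its "take the noun phrase after this/these" rule are invented for illustration and are far cruder than whatever Watson actually does:

```python
# Toy heuristic for spotting a Lexical Answer Type (LAT) in a clue.
# This is an illustration only; Watson's real LAT detection uses full
# syntactic and semantic analysis of the clue.

def find_lat(clue):
    """Return a candidate LAT word from the clue, or None if there isn't one."""
    words = clue.lower().replace(",", " ,").split()
    for i, word in enumerate(words):
        if word in ("this", "these"):
            # Collect the phrase that follows, stopping at punctuation.
            phrase = []
            for nxt in words[i + 1:]:
                if nxt in (",", ".", ";", ":"):
                    break
                phrase.append(nxt.strip(".,;:"))
            if phrase:
                # Crude assumption: the last word of the phrase is the head noun.
                return phrase[-1]
    return None  # clues whose focus is just a pronoun get no LAT from this toy

print(find_lat("In this classic film, a boy befriends an alien."))   # film
print(find_lat("He presided over the Louisiana Purchase."))           # None
```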

Ok, so if things have gone well at this stage, Watson knows the type of the answer it is looking for (film, city, person), one or more clues relating to the answer, and the category the clue is in. So where does it go to look for answers? Well, one place it doesn't go is directly to the internet. Although some QA competitions have allowed a live web connection, the Jeopardy challenge doesn't. Instead, the DeepQA system builds a large corpus of web documents which are then stored and searched offline during the live Jeopardy game.

An initial set of resources (example questions, thesauri, encyclopedias, ebooks) is used as a seed to generate web searches, which retrieve relevant web snippets. These web snippets, together with the initial resources and many other downloadable knowledge bases (dbpedia, freebase, wordnet, yago), form the "extended corpus" that is Watson's knowledge base. Despite using so many structured knowledge sources, the paper notes that Watson can directly look up the answer in fewer than 2% of cases, which shows the importance of the question analysis step.

The next step is to analyse the clue further in order to find out how to search the knowledge base. The question analysis section involves intensive Natural Language Processing techniques. The clue fragments are parsed, first syntactically and then semantically, to extract a logical form. The semantic roles in the clue are identified. Semantic role labeling is a major breakthrough in NLP, giving us algorithms that can identify the roles (agent, patient, and so on) that phrases play with respect to the predicates in a parsed sentence.

Coreference resolution detects which entities pronouns refer to; named entity recognition detects terms that are proper nouns referring to people or organizations; relation extraction identifies the parts of the clue that express predicates or relations between nouns or entities in the clue. The paper doesn't discuss the exact methods employed for these tasks, but a quick Google Scholar search shows the amount of research going on at the cutting edge of these tasks.
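
The paper doesn't say which tools perform these steps. As a rough illustration of what off-the-shelf named entity recognition looks like in practice, here is a sketch using the NLTK library (my choice for the example, not necessarily anything IBM uses):

```python
# Sketch of off-the-shelf named entity recognition with NLTK -- an illustration
# of the kind of analysis Watson performs, not IBM's actual code.
# Requires the NLTK data packages: 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', and 'words'.
import nltk

clue = "This Irish author of Dracula worked as manager for the actor Henry Irving."

tokens = nltk.word_tokenize(clue)   # split the clue into words
tagged = nltk.pos_tag(tokens)       # assign part-of-speech tags
tree = nltk.ne_chunk(tagged)        # group tokens into named entity chunks

for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
# Output depends on the bundled models, but should pick out "Henry Irving".
```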

The process of searching the extended corpus is divided into two steps. At first the focus is on recall: the system generates several hundred candidate answers in the hope that the correct answer is somewhere in this list. It does this by querying its corpus in varied ways (Lucene and SPARQL are mentioned), returning small passages of text and individual candidate answers. This list of possible answers is then scored by a small army of standalone algorithms, each of which employs state-of-the-art computational linguistics techniques to examine a particular aspect of the answer.
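
The paper mentions Lucene and SPARQL only in passing. To make the "query a structured source" idea concrete, here is a rough sketch of a SPARQL lookup against DBpedia using the SPARQLWrapper library; the particular query and property are my own illustration, not anything taken from the paper:

```python
# Illustration of one kind of primary-search query a DeepQA-style system
# might issue: a SPARQL lookup against DBpedia. The query below is my own
# example, not one from the paper.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?director WHERE {
        <http://dbpedia.org/resource/Casablanca_(film)>
            <http://dbpedia.org/ontology/director> ?director .
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["director"]["value"])   # candidate answers to feed the scorers
```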

"Each of the scorers implemented in Watson, how they work, how they interact, and their independent impact on Watson's performance deserves its own research paper."


The range and variation in the knowledge incorporated at this stage is astounding. There's a component that can compare dates and times, a bit for figuring out geographic coordinates, a doohickey for matching the syntactic form of the clue with passages retrieved from web documents, and loads more scorers that can influence the confidence judgement for a particular answer. The focus here is on precision - if the system decides to buzz in, it must have a high confidence that its top answer is correct. A lot of the discussion in the paper is about precision vs. recall.
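
The paper says the scorer outputs are merged, with learned weights, into a single confidence for each candidate answer. Here is a minimal sketch of that idea; logistic regression is my own stand-in, since the paper doesn't commit to a specific model at this point, and the feature values are invented:

```python
# Minimal sketch of merging scorer outputs into a single confidence.
# Logistic regression here is an assumption, not necessarily IBM's model.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [date_score, geo_score, passage_match_score, type_match_score]
# for one candidate answer; the label is 1 if that candidate was correct.
X_train = np.array([
    [0.9, 0.1, 0.8, 0.7],
    [0.2, 0.0, 0.3, 0.4],
    [0.7, 0.9, 0.6, 0.9],
    [0.1, 0.2, 0.1, 0.0],
])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

candidate = np.array([[0.8, 0.5, 0.7, 0.9]])
confidence = model.predict_proba(candidate)[0, 1]
print("confidence:", confidence)   # only buzz in if this clears a threshold
```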

I'm not going to discuss the hardware architecture (lots of parallel computation using Hadoop, UIMA to manage component integration), and there's not a whole lot more detail about the implementation in this fairly high-level paper (they have a lot to squeeze in, in fairness).

However, there are a few papers floating about from IBM researchers that discuss methods employed in the Jeopardy challenge, even if the papers don't mention Jeopardy directly, and I just want to mention one that caught my eye: Large Scale Relation Detection. The paper describes methods for detecting relations between entities in a large corpus (over a billion words of text).

In my opinion, this paper gives a much more concrete idea of exactly how Watson goes about extracting information from unstructured text than the high-level discussion in the AI magazine article. The paper discusses how patterns of words in text such as "appeared together at the premiere of" can be used to predict other, more useful relations like "co-star in". The authors emphasize the need for a large number of built-in, detailed relations between concepts that can be populated using patterns of words in text.
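
To make the pattern idea concrete, here is a toy sketch that counts the word patterns occurring between entity pairs that are already known to stand in a relation. The sentences and the "co-star" seed pairs are invented for illustration and bear no resemblance to the authors' billion-word corpus or actual method:

```python
# Toy illustration of pattern-based relation detection: count the word
# patterns that appear between pairs of entities, using pairs with a known
# relation to find which patterns signal it. All data below is invented.
from collections import defaultdict

sentences = [
    ("Bogart", "appeared together at the premiere with", "Bergman"),
    ("Bogart", "co-starred alongside", "Bergman"),
    ("Hepburn", "appeared together at the premiere with", "Tracy"),
    ("Hepburn", "shared top billing with", "Tracy"),
]

known_costars = {("Bogart", "Bergman"), ("Hepburn", "Tracy")}

pattern_counts = defaultdict(int)
for left, pattern, right in sentences:
    if (left, right) in known_costars:
        pattern_counts[pattern] += 1

# Patterns seen often between known co-stars become evidence for the
# "co-star in" relation when they appear between new pairs of names.
for pattern, count in sorted(pattern_counts.items(), key=lambda kv: -kv[1]):
    print(count, pattern)
```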

Watson is reading documents and understanding the relations between predicates, subjects, and objects in all kinds of noisy and sometimes contradictory data. Although I don't know enough about Deep Blue to say for sure, I think Watson is less susceptible to accusations of shallow brute-force computing than the chess machine was. Watson behaves as if it understands the clues. Does it literally understand the clues, or is it merely simulating the ability to understand them by retrieving stored information?

Given the multitude of AI components interacting to solve all the little tasks that must be accomplished to parse a question and relate it to knowledge gained from previously seen documents, I think that Watson's method is much more human than a man who doesn't understand Chinese comparing strange squiggles on pieces of card.

Monday, January 31, 2011

The Swimming Submarine

"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim." - Edsger Dijkstra
One can imagine a long, heated and intricate debate among fish philosophers (in schools of philosophy), on the question of whether it could ever be possible for a submarine to truly swim. While some animals (for example humans and jellyfish) can achieve a certain amount of propulsion through water, fish-philosophers have long believed that only a fish can move with the power, speed and agility underwater that constitutes true swimming. However, recent engineering advances in Artificial Swimming (AS) have produced devices that have extraordinary swimming power by some measures. Some of them can even outpace fish over longer distances. Most fish-philosophers, however, are dismissive of AS. They say:

"Why, these 'submarines' aren't swimming! They don't even have proper beating tails! They are just hunks of metal with tiny revolving tails and minuscule fins at the back! They don't swim like we swim."

Swimming competitions are held. The submarines win. The fish-philosophers simply laugh and say:

"Ok, so you can beat us at swimming in a straight line, but what about swimming together in a shoal while being chased by a predator?"

As advances in AS encroach further on the definition of swimming that fish-philosophers hold, new, tougher tests are introduced, such as the Tuna test, in which a submarine has to convince a fish, while swimming alongside it, that it is actually another fish. After a half century of work in AS, simple, cheap, readily available submarines can defeat a fish in almost every test of swimming that was traditionally used in swimming competitions between fish; yet still, fish-philosophers refuse to accept that a submarine really has that ineffable, fishy-swimming quality that makes the way fish swim distinct from how machines swim. Some fish-philosophers even maintain that it would be possible for an organism to exist that was physically identical to a fish, behaved just like a fish, moved through the water exactly like a fish, but was still incapable of true swimming. This idea had previously only been suggested in low-budget fish horror movies.

One can imagine endless debates among birds about featherism. Elephants using artificial trunks. Arachnid science-fiction writers concerned that spider-engineers will construct web-spinning machines so advanced that they turn against their creators and catch all their flies. When a skill is considered to be the defining characteristic of a species, attempts to artificially engineer a machine to emulate this skill will be considered an affront to the very existence of the species. So it is with humans and thought.

This hubris must end. Machines think. Machines learn. Machines are intelligent. There does not exist a computer that thinks exactly as I do, but in terms of the ability to "think", as that word has been used historically, modern computer systems think better than modern humans. In this blog, I want to highlight pieces of AI and Machine Learning research that provide evidence of Machine Intelligence that is often ignored in other communities, mainly because Computer Science and Psychology/Philosophy don't always communicate very well. Occasionally I will also try to make a joke. Sorry about that.

My attitude to the philosophical debates on strong AI is summed up by the Dijkstra quotation. If you're interested in his views on computing, he wrote about them from his own "personal perspective":
Selected Writings on Computing: A Personal Perspective

Sunday, January 23, 2011

An AI system that can hear what you're typing

Imagine this problem: you are blindfolded and sat next to a person typing on a standard keyboard. You listen to the person type for 10 minutes. Click-clack, tip-tap. Your task is to learn to recognize what they are typing by the sound of fingertips striking keys. Sounds pretty tricky, but some researchers at Berkeley developed an AI system that can do just that.

The original research paper, written by Li Zhuang, Feng Zhou, and J. D. Tygar with the somewhat obscure title of "Keyboard Acoustic Emanations Revisited", is available from the authors' website (number 24).

How does it work? This is where things get even more impressive. The input to the system is a 10 minute sound recording of a person typing. The algorithm uses this recording to build a model of the sounds generated by certain keys being struck. Amazingly, there is almost no knowledge built into the system from the start. The initial model does not know how many keys there are or where the boundaries between keystrokes are.

Most machine learning systems begin by using data that has been labeled by people - this is called "supervised learning". The Berkeley system doesn't need an annotated dataset. It doesn't even need to know the alphabet being used. To think of the equivalent task being performed by a human, you have to imagine something like a blindfolded Russian listening to someone type English, although even that doesn't go far enough, since Russian and English share some semantic and syntactic linguistic universals that an AI system isn't privy to.

[Image caption: Jean Baptiste Joseph Fourier looking skeptical about cepstra]
The raw acoustic signal is broken up into keystrokes by identifying peaks of energy in the signal (loud noises). Each 'tap' lasts about one-tenth of a second. From here on, the system uses a stack of varied Machine Learning techniques, beginning with cepstrum analysis. "Cepstrum" is a word constructed by some hilarious signal analysts by reversing the first four letters of "spectrum". Accordingly, possible operations on cepstra include quefrency alanysis, liftering, or cepstral analysis. What a bunch of krods.

A cepstrum is a kind of second-order Fourier transform. A Fourier transform is a mathematical method of decomposing a signal into a sum of waves (or oscillations), each with its own frequency and amplitude. You get a (power) cepstrum basically by taking the Fourier transform of the signal, squaring its magnitude, taking the log, then taking the Fourier transform of THAT and squaring the magnitude again. Interestingly, the cepstral features that were found to be useful in the paper are also useful for speech recognition. The features are fed to a clustering technique, which clumps similar-sounding taps together into a certain number of groups.
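
If you want to play with the idea, here is a rough sketch of computing a power cepstrum for each keystroke segment and then clustering the results. The frame length, the number of clusters, and the feature cut-off are my own guesses, not the values used in the paper:

```python
# Rough sketch of the feature-extraction-and-clustering idea: compute a
# cepstrum for each keystroke segment, then cluster the cepstra so that
# similar-sounding taps end up in the same group. All parameters here are
# placeholders, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

def cepstrum(frame):
    """Power cepstrum of one keystroke frame: FT -> log power -> FT -> power."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    log_spectrum = np.log(spectrum + 1e-12)       # avoid log(0)
    return np.abs(np.fft.rfft(log_spectrum)) ** 2

# keystroke_frames: one array of audio samples (~0.1 s each) per detected tap.
# Random noise stands in here for real recordings.
keystroke_frames = [np.random.randn(4410) for _ in range(200)]

features = np.array([cepstrum(frame)[:40] for frame in keystroke_frames])
clusters = KMeans(n_clusters=30, n_init=10).fit_predict(features)
print(clusters[:10])   # cluster label ("which key group") for the first ten taps
```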

The model is tuned by taking advantage of the structured elements of typing - we type language, and in language, symbols follow other symbols with regular probabilities. We also type on standard keyboards, which means that things like the delay between certain keystrokes and the probability of mistyping a letter are regular. The tool used to model these regularities is a Hidden Markov Model.
[Image caption: Andrei Markov thinking about how cool it will be to have a model named after him.]

For any system that moves from one state to another in a probabilistic (rather than a known or predetermined) way, a Markov model is a way of representing the set of possible states and the probability of moving to each state given the current one. Applied to keystroke recognition, this is actually quite intuitive: if I have typed "Andrei is a badass nam", you can predict that the next letter I will type will be "e". Even knowing only that the previous letter was 'm' narrows the choice down significantly. Here is Wikipedia's excellent concrete example of a Markov model, with Python code.
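
As a tiny illustration of the "previous letter narrows the choice" point, here is a sketch that estimates letter-to-letter transition probabilities from a scrap of text. The corpus is a placeholder, and note that the real system's HMM works over unlabeled keystroke classes, not letters it already knows:

```python
# Tiny illustration of the Markov idea: estimate, from sample text, how
# likely each letter is to follow the letter before it. The text below is
# a placeholder corpus, not anything from the paper.
from collections import Counter, defaultdict

text = "andrei is a badass name and markov models are a badass idea".replace(" ", "_")

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    bigram_counts[prev][nxt] += 1

def next_letter_probs(prev):
    """Probability distribution over the letter that follows `prev`."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}

print(next_letter_probs("m"))   # after 'm': a small distribution over 'a', 'e', 'o'
```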


[Image caption: Markov hated shopping]
Note that the system doesn't have to start off with a knowledge of which letters are more likely to follow others - it just listens to someone typing away and begins to tune its probabilities (by Expectation Maximization, math fans). Once the keystroke classes have been identified, resolving them to text is just solving a simple substitution cipher.

A couple of other techniques are used, including a spell checker and an n-gram language model. I've left out a lot of detail here obviously, but I'd encourage you to have a glance over the paper. It's quite accessible in parts, and it's a lovely example of a complex task that is intuitively very difficult, and probably impossible for humans (prove me wrong!).  Maybe the most similar task humans have to perform is when we are about 18 months old and our mothers start babbling to us, pointing, smiling, imitating, and tuning the little parameters in our own little model.