Sunday, January 23, 2011

An AI system that can hear what you're typing

Imagine this problem: you are blindfolded and sat next to a person typing on a standard keyboard. You listen to the person type for 10 minutes. Click-clack, tip-tap. Your task is to learn to recognize what they are typing by the sound of fingertips striking keys. Sounds pretty tricky, but some researchers at Berkeley developed an AI system that can do just that.

The original research paper, written by Li Zhuang, Feng Zhou, and J. D. Tygar, with the somewhat obscure title "Keyboard Acoustic Emanations Revisited", is available from the authors' website (number 24 in the list).

How does it work? This is where things get even more impressive. The input to the system is a 10 minute sound recording of a person typing. The algorithm uses this recording to build a model of the sounds generated by certain keys being struck. Amazingly, there is almost no knowledge built into the system from the start. The initial model does not know how many keys there are or where the boundaries between keystrokes are.

Most machine learning systems begin by using data that has been labeled by people - this is called "supervised learning". The Berkeley system doesn't need an annotated dataset. It doesn't even need to know the alphabet being used. To think of the equivalent task being performed by a human, you have to imagine something like a blindfolded Russian listening to someone type English, although even that doesn't go far enough, since Russian and English share some semantic and syntactic linguistic universals that an AI system isn't privy to.

Jean Baptiste Joseph Fourier looking
skeptical about cepstra
The raw acoustic signal is broken up into keystrokes by identifying peaks of energy in the signal (loud noises). Each 'tap' lasts about one-tenth of a second. From here on, the system uses a stack of varied Machine Learning techniques, beginning with cepstrum analysis. "Cepstrum" is a word constructed by some hilarious signal analysts by reversing the first four letters of "spectrum". Accordingly, possible operations on cepstra include quefrency alanysis, liftering, or cepstral analysis. What a bunch of krods.
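That segmentation step can be sketched in a few lines of NumPy. Everything here (the window size, the threshold, the synthetic "taps") is made up for illustration, not taken from the paper:

```python
import numpy as np

def keystroke_onsets(signal, rate, window_ms=10, threshold=0.1):
    """Guess keystroke start times (in samples) from energy peaks.

    Illustrative only: window size and threshold are invented values,
    not the ones used in the paper.
    """
    win = int(rate * window_ms / 1000)
    n = len(signal) // win
    # Energy of each short window, normalized to [0, 1]
    energy = np.array([np.sum(signal[i*win:(i+1)*win]**2) for i in range(n)])
    energy = energy / energy.max()
    # An onset is a window whose energy crosses the threshold from below
    above = energy > threshold
    return [i * win for i in range(1, n) if above[i] and not above[i-1]]

# Synthetic demo: one second of silence with two 100 ms noise bursts ("taps")
np.random.seed(0)
rate = 8000
sig = np.zeros(rate)
sig[1000:1800] = np.random.randn(800)
sig[4000:4800] = np.random.randn(800)
print(keystroke_onsets(sig, rate))  # → [960, 4000]
```

The onsets land on 10 ms window boundaries, which is plenty of precision when each tap lasts around 100 ms.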

A cepstrum is a kind of second-order Fourier Transform. A Fourier Transform is a mathematical method that decomposes a signal (like a square or triangle wave) into the simple sine waves that make it up, each with its own frequency and amplitude. You get a (power) cepstrum basically by taking the Fourier transform of the signal, taking the squared magnitude, taking the log, then taking the Fourier Transform of THAT and the squared magnitude again. Interestingly, the cepstral features that the paper found useful are the same kind used in speech recognition. The features are fed to a clustering technique, which clumps similar-sounding taps together into a certain number of groups.
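In NumPy terms, that recipe looks like this. It's a minimal sketch of the textbook power-cepstrum definition; the paper works with derived cepstral features (like those used in speech recognition) rather than this raw form:

```python
import numpy as np

def power_cepstrum(frame):
    """FT -> squared magnitude -> log -> FT -> squared magnitude."""
    power_spectrum = np.abs(np.fft.fft(frame)) ** 2
    log_spectrum = np.log(power_spectrum + 1e-12)  # avoid log(0)
    return np.abs(np.fft.fft(log_spectrum)) ** 2   # "spectrum of a spectrum"

# A ~100 ms frame, roughly the length of one keystroke tap
rate = 8000
t = np.arange(800) / rate
frame = np.sin(2 * np.pi * 440 * t)
ceps = power_cepstrum(frame)
```

The cepstrum's axis is "quefrency" (see the word-reversal jokes above): a peak at quefrency q means the log-spectrum itself repeats with period q, which is what makes cepstra good at capturing the timbre of a sound.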

The model is tuned by taking advantage of the structured elements of typing - we type language, and in language, symbols follow other symbols with regular probabilities. We also type on standard keyboards, which means things like the delay between certain keystrokes and the probability of mistyping a letter are regular. The tool used to model these regularities is a Hidden Markov Model.
Andrei Markov thinking about
how cool it will be to have a
model named after him.

In any system that goes from one state to another in a probabilistic (rather than known or predetermined) way, a Markov model represents the set of possible states and the probability of moving to each state from any other. Applied to keystroke recognition, this is actually quite intuitive: if I have typed "Andrei is a badass nam", you can predict that the next letter I will type will be "e". Even knowing only that the previous letter was 'm' narrows the choice down significantly. Wikipedia has an excellent concrete example of a Markov model, with Python code.
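Here's a toy version of that intuition: estimating letter-transition probabilities from a sample string. This is a plain first-order Markov model; the paper's Hidden Markov Model is the version where the letters themselves are unobserved and only the sound classes are seen:

```python
from collections import defaultdict, Counter

def letter_transitions(text):
    """Estimate P(next letter | current letter) from sample text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

model = letter_transitions("andrei is a badass name")
# In this tiny sample, 'a' is followed by 'n', ' ', 'd', 's', 'm' once each
print(model['a']['n'])  # → 0.2
```

A real model would be trained on a large corpus, where distributions like P(next | 'q') become extremely peaked - exactly the regularity the system exploits.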

Markov hated shopping
Note that the system doesn't have to start off with any knowledge of which letters are more likely to follow others - it just listens to someone typing away and begins to tune its probabilities (by Expectation Maximization, math fans). Once you identify the classes, resolving them to text is just solving a simple substitution cipher.
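A toy illustration of that last step, with hypothetical cluster labels and candidate mappings (none of these names or values come from the paper, and the real system scores candidates with its language model rather than a wordlist):

```python
def decode(labels, mapping):
    """Apply a candidate label -> letter mapping: a substitution cipher."""
    return "".join(mapping[l] for l in labels)

def score(text, wordlist=frozenset({"hello"})):
    """Toy language model: fraction of words found in a wordlist."""
    words = text.split()
    return sum(w in wordlist for w in words) / len(words)

# Hypothetical cluster labels for five keystrokes
labels = [3, 1, 4, 4, 2]
candidates = [
    {3: 'h', 1: 'e', 4: 'l', 2: 'o'},
    {3: 'x', 1: 'q', 4: 'z', 2: 'k'},
]
best = max(candidates, key=lambda m: score(decode(labels, m)))
print(decode(labels, best))  # → hello
```

In practice there are far too many mappings to enumerate, so the search is guided by the language-model probabilities rather than brute force.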

A couple of other techniques are used, including a spell checker and an n-gram language model. I've left out a lot of detail here obviously, but I'd encourage you to have a glance over the paper. It's quite accessible in parts, and it's a lovely example of a complex task that is intuitively very difficult, and probably impossible for humans (prove me wrong!).  Maybe the most similar task humans have to perform is when we are about 18 months old and our mothers start babbling to us, pointing, smiling, imitating, and tuning the little parameters in our own little model.


  1. Omgomgomgomgomgomgomgomgomgomgomgomgomgomgomgomgomg

  2. Please use this on all movies and TV shows in which a character is ostensibly typing something into a computer and see what they were actually typing.

  3. Also passwords that people have typed while being recorded :/