The original research paper, written by Li Zhuang, Feng Zhou, and J. D. Tygar and bearing the somewhat obscure title "Keyboard Acoustic Emanations Revisited", is available from the authors' website (number 24).
How does it work? This is where things get even more impressive. The input to the system is a 10-minute sound recording of a person typing. The algorithm uses this recording to build a model of the sounds generated by certain keys being struck. Amazingly, there is almost no knowledge built into the system from the start. The initial model does not know how many keys there are or where the boundaries between keystrokes are.
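Curious what that first step might look like? Here's a rough sketch of pulling individual keystrokes out of a raw recording by watching for bursts of energy. To be clear, this function and its window size, threshold, and gap parameters are my own invention for illustration - the paper's actual detection procedure differs in its details.

```python
import numpy as np

def find_keystrokes(audio, rate, win_ms=10, threshold=4.0, gap_ms=100):
    """Rough keystroke detector: flag short windows whose energy jumps
    well above the recording's background level. All parameters here
    are illustrative guesses, not the paper's values."""
    win = int(rate * win_ms / 1000)
    n = len(audio) // win
    # Short-time energy of each window
    energy = np.array([np.sum(audio[i * win:(i + 1) * win] ** 2) for i in range(n)])
    background = np.median(energy)
    loud = energy > threshold * background
    # Merge loud windows closer together than gap_ms into one keystroke
    gap = int(gap_ms / win_ms)
    starts, last = [], -gap - 1
    for i in np.nonzero(loud)[0]:
        if i - last > gap:
            starts.append(i * win)  # sample index where a keystroke begins
        last = i
    return starts
```

The nice property of a simple energy threshold is that it needs no labels at all, which fits the paper's no-supervision setup.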
Most machine learning systems begin with data that has been labeled by people - this is called "supervised learning". The Berkeley system doesn't need an annotated dataset. It doesn't even need to know the alphabet being used. To imagine the equivalent task being performed by a human, picture something like a blindfolded Russian speaker listening to someone type English - although even that doesn't go far enough, since Russian and English share some semantic and syntactic linguistic universals that an AI system isn't privy to.
[Image: Jean Baptiste Joseph Fourier looking skeptical about cepstra]
A cepstrum is a kind of second-order Fourier Transform. A Fourier Transform is a mathematical method of re-expressing a signal (like the sound wave coming off a keyboard) as a combination of simple oscillations, each with its own frequency and amplitude. You get a (power) cepstrum basically by taking the Fourier Transform of the signal, squaring its magnitude, taking the log, and then taking the Fourier Transform of THAT and squaring its magnitude again. Interestingly, the cepstral features that the paper found useful are also the standard features used in speech recognition. The features are fed into a clustering technique, which clumps similar-sounding taps together into a fixed number of groups.
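Here's roughly what that looks like in NumPy, with a clustering step bolted on via scikit-learn's k-means. The 40 cepstral coefficients and the 50 clusters are my assumptions for the sketch, not necessarily the paper's exact choices, and the random "keystrokes" are obviously stand-ins for real audio snippets.

```python
import numpy as np
from sklearn.cluster import KMeans

def power_cepstrum(frame):
    """Fourier transform -> log of the squared magnitude -> transform back."""
    spectrum = np.fft.rfft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-10)  # epsilon avoids log(0)
    return np.abs(np.fft.irfft(log_power))

# One fixed-length audio snippet per detected keystroke; random stand-in data.
keystroke_frames = np.random.randn(200, 1024)

# Keep the low-order coefficients, much as speech systems do with MFCCs.
features = np.array([power_cepstrum(f)[:40] for f in keystroke_frames])

# Clump similar-sounding taps into groups; 50 is an illustrative guess
# (roughly the number of keys, plus some slack).
labels = KMeans(n_clusters=50, n_init=10).fit_predict(features)
```

Each keystroke now carries a cluster label - a noisy, unnamed stand-in for "which key was that?" - which is exactly what the next stage works with.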
The model is tuned by taking advantage of the structured nature of typing: we type language, and in language, symbols follow other symbols with regular probabilities. We also type on standard keyboards, which means that things like the delay between certain keystrokes and the probability of mistyping a letter are regular too. The tool used to model these regularities is a Hidden Markov Model.
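To make that concrete (skipping slightly ahead - Markov models get a gentler introduction below), here's a toy hidden Markov model over a three-key alphabet, decoded with the Viterbi algorithm: the hidden states are the keys actually pressed, the observations are the acoustic cluster labels, and every probability number is invented for illustration.

```python
import numpy as np

# Toy HMM: hidden states are keys, observations are cluster IDs.
# All probabilities below are invented for illustration.
keys = ['t', 'h', 'e']
start = np.array([0.5, 0.2, 0.3])          # P(first key)
trans = np.array([[0.1, 0.6, 0.3],         # P(next key | current key)
                  [0.2, 0.1, 0.7],
                  [0.5, 0.3, 0.2]])
emit = np.array([[0.7, 0.2, 0.1],          # P(cluster | key)
                 [0.1, 0.8, 0.1],
                 [0.2, 0.1, 0.7]])

def viterbi(obs):
    """Most likely key sequence given a sequence of cluster observations."""
    n, k = len(obs), len(keys)
    logp = np.zeros((n, k))                # best log-probability so far
    back = np.zeros((n, k), dtype=int)     # backpointers for the best path
    logp[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, n):
        for j in range(k):
            scores = logp[t - 1] + np.log(trans[:, j])
            back[t, j] = np.argmax(scores)
            logp[t, j] = scores[back[t, j]] + np.log(emit[j, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(n - 1, 0, -1):          # walk the backpointers home
        path.append(back[t, path[-1]])
    return [keys[i] for i in reversed(path)]

print(viterbi([0, 1, 2]))  # -> ['t', 'h', 'e'] with these made-up numbers
```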
[Image: Andrei Markov thinking about how cool it will be to have a model named after him.]
In any system that moves from one state to another in a probabilistic (rather than known or predetermined) way, a Markov model is a way of representing the set of possible states and the probability of moving to any state given the current one. Applied to keystroke recognition, this is actually quite intuitive: if I have typed "Andrei is a badass nam", you can predict that the next letter I will type will be "e". Even knowing only that the previous letter was "m" narrows the choice down significantly. Wikipedia has an excellent concrete example of a Markov model, with Python code.
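In the same spirit, here's a toy next-letter predictor built from bigram counts. The one-line "corpus" is a stand-in for real English text - train it on a few megabytes and the predictions become genuinely useful.

```python
from collections import Counter, defaultdict

# A stand-in corpus; a real system would count over lots of English text.
corpus = "andrei is a badass name and markov models are neat"

# Count letter bigrams: P(next | current) is proportional to these counts.
bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

def predict_next(letter, k=3):
    """The k most likely letters to follow `letter`, by bigram frequency."""
    total = sum(bigrams[letter].values())
    return [(nxt, count / total) for nxt, count in bigrams[letter].most_common(k)]

print(predict_next('m'))  # 'e', 'a', and 'o' each follow 'm' once in this corpus
```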
[Image: Markov hated shopping]
A couple of other techniques are used to polish the output, including a spell checker and an n-gram language model (a toy version of the spell-checking step is sketched below). I've left out a lot of detail here, obviously, but I'd encourage you to have a glance over the paper. It's quite accessible in parts, and it's a lovely example of a complex task that is intuitively very difficult, and probably impossible for humans (prove me wrong!). Maybe the most similar task humans have to perform is when we are about 18 months old and our mothers start babbling to us, pointing, smiling, imitating, and tuning the little parameters in our own little model.
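As a postscript, here's a toy version of the spell-checking step: generate every string within one edit of a decoded word and keep the most frequent dictionary word among the candidates. The tiny dictionary and its frequency counts are invented for illustration; the paper's actual correction procedure is more involved.

```python
# Toy spell corrector. The dictionary and its frequency counts are invented.
WORDS = {'the': 100, 'keyboard': 5, 'acoustic': 3, 'model': 10, 'name': 8}
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """All strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    subs = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + subs + inserts)

def correct(word):
    """Return the most frequent dictionary word within one edit, if any."""
    if word in WORDS:
        return word
    candidates = edits1(word) & WORDS.keys()
    return max(candidates, key=WORDS.get) if candidates else word

print(correct('keyboafd'))  # -> 'keyboard'
```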