Get a computer’s attention and start a conversation
With the rise of voice-enabled virtual assistants over the years, the sight of people talking to various electronic devices in public and in private has become rather common. While these voice interfaces are definitely useful in a variety of situations, they also come with complications. One of them is the trigger phrase, or wake word, that a voice assistant listens for while it is asleep. Just like in Star Trek, where saying “Computer” would get the computer’s attention, we have our “Siri”, “Cortana” and a range of custom trigger phrases that activate the voice interface.
Unlike in Star Trek, however, our virtual assistants don’t know when we really want to interact. Unable to make out the context, they will happily respond to someone on TV mentioning their trigger phrase. This may be followed by a silly purchase order or other mischief. What this makes clear is how complex voice-based interfaces are, even though they still lack any sense of self-awareness or intelligence.
Another problem is that the speech recognition process itself is very resource-intensive, which limits the amount of processing that can be done on the local device. This typically leads voice assistants like Siri, Alexa, Cortana and others to process recorded voices in a data center, with obvious privacy implications.
Just say my name
The idea of a trigger word that activates a system is old, with one of the earliest known practical examples dating back about a hundred years. This came in the form of a toy called Radio Rex, which featured a robot dog that sat in its small doghouse until its name was called. At that moment, he would jump outside to greet the person who had called him.
The way this was implemented was simple and rather limited, owing to the technologies available in the 1910s and 1920s. It essentially relied on the acoustic energy of a formant roughly corresponding to the vowel [eh] in “Rex”. As some have noted, one problem with Radio Rex is that it was tuned to about 500 Hz, which is where the [eh] vowel falls when pronounced by an average adult male voice.
This tragically meant that for children and women, Rex would generally refuse to leave his doghouse unless they used a different vowel, one that landed in the 500 Hz range for their voice. Even then, they were likely to run into the other major issue with this toy, namely the sound pressure required. In practice, some shouting might be needed to get Rex moving.
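To make the limitation concrete, here is a minimal digital sketch of Rex's behavior. The real toy was a mechanical contraption, not software, so everything here is an assumption made for illustration: the 8 kHz sample rate, the loudness threshold, and the function names `tone`, `amplitude_at` and `rex_jumps`. It uses the Goertzel algorithm to measure how much energy a sound has at 500 Hz and lets Rex jump only if that energy is high enough.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumption for this sketch


def tone(freq_hz, amplitude, duration_s=0.1, sample_rate=SAMPLE_RATE):
    """Generate a pure sine tone as a list of float samples."""
    count = round(duration_s * sample_rate)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(count)]


def amplitude_at(samples, target_hz, sample_rate=SAMPLE_RATE):
    """Estimate the amplitude of the component near target_hz (Goertzel)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest frequency bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Squared magnitude of bin k; clamp tiny negative rounding errors.
    power = max(s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2, 0.0)
    # A sine of amplitude A that sits exactly on bin k has magnitude A * n / 2.
    return 2.0 * math.sqrt(power) / n


def rex_jumps(samples, threshold=0.5):
    """Rex leaves the doghouse only for enough energy at 500 Hz."""
    return amplitude_at(samples, 500.0) >= threshold


print(rex_jumps(tone(500, 0.8)))  # loud, right pitch
print(rex_jumps(tone(500, 0.1)))  # right pitch, but too quiet
print(rex_jumps(tone(700, 0.8)))  # loud, but the energy sits elsewhere
```

Only the first call wakes Rex. The other two mirror the two complaints above: a voice that is not loud enough, and a voice whose vowel energy falls outside the 500 Hz band.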
What’s interesting about this toy is that in many ways old Rex isn’t too different from how Siri and her friends work today. The trigger word that wakes them from sleep mode is interpreted in a less crude way, using a microphone and signal processing hardware and software rather than a mechanical contraption, but the effect is the same. In low-power trigger search mode, the assistant software continuously compares the formants of incoming sound samples for a match with the sound signature of the predefined trigger word(s).
Once a match has been detected and the mechanism kicks in, the assistant leaves its digital doghouse and switches to its full voice processing mode. At this point, a standalone assistant, as can be found in older cars, for example, can use a simple Hidden Markov Model (HMM) to try to piece together the user’s intent. Such a model is usually trained on a fairly simple vocabulary, and it will be specific to a particular language, and often to a regional accent and/or dialect, to increase accuracy.
Too big for the kennel
While it would be nice to run the whole natural language processing routine on the same system, the fact is that speech recognition is still very resource-intensive. Not just in terms of processing power, as even an HMM-based approach has to sift through thousands of probabilistic paths per utterance, but also in terms of memory. Depending on the assistant’s vocabulary, the in-memory model can range from tens of megabytes to several gigabytes or even terabytes. This would obviously be rather impractical on the latest whizbang gadget, smartphone or smart TV, which is why this processing is usually moved to a data center.
When accuracy is considered an even higher priority, such as with the Google Assistant when it is asked a complex query, the HMM approach is usually abandoned in favor of the newer Long Short-Term Memory (LSTM) approach. Although LSTM-based RNNs handle longer sentences much better, they also come with much higher processing and memory requirements.
With the current state of the art in speech recognition evolving towards increasingly complex neural network models, it seems unlikely that these system requirements will be overtaken by technological progress.
As a benchmark of what a basic low-end system at the level of a single-board computer (SBC) like a Raspberry Pi might be able to do with speech recognition, we can look at a project like CMU Sphinx, developed at Carnegie Mellon University. The version for embedded systems is called PocketSphinx and, like its larger siblings, uses an HMM-based approach. The Sphinx FAQ explicitly mentions that large vocabularies will not work on SBCs like the Raspberry Pi due to the limited RAM and CPU power on these platforms.
When you limit the vocabulary to around a thousand words, however, the model can fit in RAM and the processing will be fast enough to appear instantaneous to the user. This is fine if you want the voice interface to have only decent accuracy, within the limits of the training data, while offering only limited interaction. In the case where the objective is, for example, to allow the user to turn on or off a handful of lights, this may be sufficient. On the other hand, if this interface is called ‘Siri’ or ‘Alexa’, the expectations for such an interface are much higher.
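As a rough illustration of how little such a limited interface needs to understand, the sketch below handles the light switch case. It assumes a recognizer has already turned the audio into text. The word lists and the `parse_command` function are invented for this example and are not part of PocketSphinx or any other library.

```python
# Hypothetical vocabulary for a light-switch interface (illustration only).
ACTIONS = {"on", "off"}
LIGHTS = {"kitchen", "bedroom", "hallway"}
FILLER = {"turn", "switch", "the", "light", "lights", "please"}
VOCABULARY = ACTIONS | LIGHTS | FILLER


def parse_command(text):
    """Return (light, action) for a recognized phrase, or None if unclear."""
    words = text.lower().split()
    # Anything outside the tiny vocabulary is rejected outright.
    if any(word not in VOCABULARY for word in words):
        return None
    actions = [word for word in words if word in ACTIONS]
    lights = [word for word in words if word in LIGHTS]
    # Require exactly one light and one action to avoid guessing.
    if len(actions) != 1 or len(lights) != 1:
        return None
    return lights[0], actions[0]


print(parse_command("turn on the kitchen light"))
print(parse_command("where is the nearest bike shop"))
```

The first phrase maps cleanly to the kitchen light being switched on, while the second is simply refused. That is acceptable for a light switch, but it is exactly the behavior nobody would accept from an assistant with a name.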
Essentially, these virtual assistants are meant to act as if they understand natural language and the context in which it is used, and to respond in a manner consistent with how average civilized human interaction should occur. Unsurprisingly, this is a tough challenge. Offloading the speech recognition part to a remote data center and using recorded speech samples to further train the model are natural consequences of this demand.
No intelligence, just good guesses
Something that we humans are naturally quite good at, and get nagged about further during our school years, is called “part-of-speech tagging”, also known as grammatical tagging. This is where we break a sentence down into its grammatical constituents, including nouns, verbs, articles, adjectives, etc. This is essential for understanding a sentence, as the meaning of words can change wildly depending on their grammatical classification, especially in languages like English with its common use of nouns as verbs and vice versa.
By using grammatical tagging, we can then understand the meaning of the sentence. Yet that is not what these virtual assistants do. Using a Viterbi algorithm (for HMMs) or an equivalent RNN approach, they instead determine the probability that the given input matches a specific subset of the language model. As most of us no doubt know, this is an approach that feels almost magical when it works, and makes you realize that Siri is as dumb as a bag of bricks when it fails to get a proper match.
As the demand for “smart” voice interfaces increases, engineers will no doubt work tirelessly to find more ingenious methods to improve the accuracy of the current system. The reality for the foreseeable future seems to remain that of sending voice data to data centers, where powerful server systems can perform the required probability curve fitting to figure out that you are asking “Hey Google” where the nearest glacier is. Never mind that you were actually asking for the nearest bike shop, but that is technology for you.
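For the curious, the Viterbi step can be sketched in a few lines. The toy model below is an assumption made purely for illustration: two hidden states (“vowel” and “fricative”), three coarse acoustic labels, and probabilities that are invented rather than taken from any real speech model. The `viterbi` function finds the most probable sequence of hidden states for a series of observations.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state that preceded s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Start from the most probable final state and walk backwards.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Invented toy model, for illustration only.
states = ["vowel", "fricative"]
start_p = {"vowel": 0.6, "fricative": 0.4}
trans_p = {
    "vowel": {"vowel": 0.7, "fricative": 0.3},
    "fricative": {"vowel": 0.4, "fricative": 0.6},
}
emit_p = {
    "vowel": {"voiced": 0.5, "mixed": 0.4, "hiss": 0.1},
    "fricative": {"voiced": 0.1, "mixed": 0.3, "hiss": 0.6},
}

path, probability = viterbi(["voiced", "mixed", "hiss"],
                            states, start_p, trans_p, emit_p)
print(path, probability)
```

For this input the best path is vowel, vowel, fricative, with a probability of roughly 0.015. Note that nothing here resembles understanding: the algorithm only picks the path with the highest score, which is why a near miss in the scores can produce a confidently wrong answer.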
Perhaps the slightly ironic thing about the whole experience of natural language and computer interaction is that text-to-speech is more or less a solved problem. As early as the 1980s, the Texas Instruments TMS (of Speak & Spell fame) and General Instrument SP0256 Linear Predictive Coding (LPC) voice chips used a fairly crude approximation of the human vocal tract in order to synthesize a human-sounding voice.
In the years that followed, LPC became increasingly refined for use in speech synthesis, while also finding use in speech encoding and transmission. By using the voice of a real human as the basis for an LPC vocal tract model, virtual assistants can also switch between voices, allowing Siri, Cortana, etc. to sound like whichever gender and ethnicity most appeals to an end user.
Hopefully in the next few decades we can make speech recognition work as well as speech synthesis, and maybe even give these virtual assistants a modicum of real intelligence.
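The core idea behind LPC synthesis fits in a short sketch: a simple excitation signal (here a pulse train standing in for the vocal cords) is fed through a recursive filter that plays the role of the vocal tract. This is the textbook direct form, not the circuit of any particular chip. The two coefficients, the 500 Hz resonance, the 100 Hz pitch, and the function names are all assumptions chosen for illustration. Real speech chips use more coefficients and update them many times per second.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumption for this sketch


def pulse_train(n_samples, period):
    """Crude stand-in for the vocal cords: one pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator_coeffs(freq_hz, radius=0.95, sample_rate=SAMPLE_RATE):
    """Two predictor coefficients giving a single formant-like resonance."""
    omega = 2.0 * math.pi * freq_hz / sample_rate
    return [2.0 * radius * math.cos(omega), -radius * radius]


def lpc_synthesize(coeffs, excitation):
    """All-pole filter: y[n] = e[n] + sum over k of coeffs[k] * y[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


# A 100 Hz "voice" pitch exciting a single 500 Hz resonance.
coeffs = resonator_coeffs(500.0)
samples = lpc_synthesize(coeffs, pulse_train(800, period=80))
print(len(samples), max(abs(s) for s in samples))
```

Swapping in a different set of coefficients changes the resonances, and with them the vowel or the voice that comes out, without touching the rest of the machinery.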