Finding meaning behind the words
By Ryan Olson

If you need to call Sprint to pay your bill or ask a question about service plans, a computer on the other end of the phone greets you with the following message:
“You have reached PCS Customer Solutions. Briefly tell me how I can help you.”
“Bill payment,” you reply.
Silence.
“Got it. I’ll connect you to our payment system.”
Seems easy enough. If you say a word like “pizza,” however, the system responds with the following:
“Here are some popular choices. When you hear the one you want, just say it back to me.”

This simple dialogue paints a fairly accurate picture of modern speech recognition technology. The key problem is that the software cannot find the meaning behind spoken words. Right now it’s only useful in customer service applications, where calls can be routed based on a computer’s ability to pick out key words in a person’s speech and match them to desired outcomes. But the ultimate goal of many researchers is a system that can actually converse with human beings.
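To make that routing step concrete, here is a minimal sketch of the kind of keyword matching such systems rely on. The keywords, route names and fallback menu are invented for illustration, not Sprint’s actual configuration.

# Minimal sketch of keyword-based call routing, as described above.
# The keywords and route names are illustrative, not any carrier's real system.

ROUTES = {
    "bill": "payment system",
    "payment": "payment system",
    "plan": "service plans",
    "upgrade": "sales",
}

FALLBACK_MENU = ["bill payment", "service plans", "technical support"]

def route_call(transcribed_utterance: str) -> str:
    """Match recognized words against known keywords; fall back to a menu."""
    for word in transcribed_utterance.lower().split():
        if word in ROUTES:
            return f"Got it. I'll connect you to our {ROUTES[word]}."
    # No keyword matched ("pizza"): read back the popular choices.
    options = ", ".join(FALLBACK_MENU)
    return f"Here are some popular choices: {options}. Say the one you want."

print(route_call("Bill payment"))   # routes to the payment system
print(route_call("pizza"))          # falls back to the menu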

As speakers, we know that there is far more to language than a strict literal translation. Try explaining the intricacies of meaning to a computer that can fundamentally only tell the difference between ones and zeros, black and white, or yes and no. Throughout our lives we continually learn and refine our methods of communication, but today’s computers can’t adapt and learn languages in the same way. Many scientists are therefore becoming linguists, asking fundamental questions about the meanings of words and how we find and interpret those meanings, so they can tell a computer how to do it. For a computer to understand language, it must be able to interpret meaning in context, accounting for variations in the speed or tone of a speaker’s voice. Speech technology has come a long way, but many researchers believe the most dramatic breakthroughs are on the immediate horizon.

Imagine missing a meeting at work and wanting to catch up on the important details. Instead of fast forwarding and rewinding a tape until you find the spot where an important conversation took place, you could use a computer with speech recognition software to find what you need. By extracting meaning from human speech, the computer could record and process the conversations from the meeting, then take you straight to the part of the discussion you wanted to hear. The same idea applies to verbal “documents” like radio or television broadcasts: advanced speech recognition software that finds meaning in spoken words could summarize them in electronic form.
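As a rough sketch of that idea: once a meeting has been transcribed with timestamps, “taking you straight to the part you wanted to hear” can be approximated by searching the text segments. The transcript and timings below are invented for the example.

# Hypothetical sketch of the meeting-search idea: search timestamped
# transcript segments for a topic and return where to start listening.

transcript = [
    (0.0,    "okay let's get started with the status updates"),
    (412.5,  "the budget review is next so let's look at the numbers"),
    (1180.0, "finally the hiring plan for next quarter"),
]

def find_segment(query: str, segments):
    """Return the start time (in seconds) of the first segment mentioning the query."""
    query = query.lower()
    for start, text in segments:
        if query in text.lower():
            return start
    return None

print(find_segment("budget", transcript))  # 412.5 -- jump the "tape" to about 6:52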

But a great deal of work lies ahead. The main thing today’s speech recognition looks for is consistency, which can be a problem with human speech, says B.H. “Fred” Juang, a researcher at Georgia Tech. Consistency in tone and speed is particularly important because from those variables the computer forms a framework for understanding how someone speaks. It can then identify words as they are being said. Juang worked at AT&T Bell Labs for 40 years and played a role in developing the current generation of speech recognition software, typically found in computer systems that handle operator-assisted telephone calls. “You obviously speak to [the computer] like you speak to another person,” he says. But when people try to clarify what they are saying to another person, they frequently change the way they speak, talking louder or more slowly, for example, and this confuses the software.

“Computers hate sloppy speech, but humans love it,” says Li Deng, a senior researcher at Microsoft’s speech technology group. Try transcribing a chunk of recorded speech and you will get the idea. In addition to fluctuations in speed and tone, human speech mixes with ambient noise, is filled with “ums and uhs,” half-completed sentences and sometimes even completely fictitious words. This is one reason that dictation software – the other common use for speech recognition technology – is far from perfect.

Current generations of dictation software still struggle against the stigma of previous generations, which required users to train themselves to speak in ways a computer could more easily recognize. Computerized understanding of speech is presently built on a database of “average” speakers, an amalgamation of voice samples from thousands of people that tries to cover the kinds of variation in speech a computer may encounter. Modern versions continue to improve on the traditional technique of forcing users to speak in a slow, consistent tone, but software that makes people adapt their speech to the computer’s requirements, rather than the other way around, will not be adopted quickly. Scientists want people to be able to speak to computers in the same way they would talk to another human being.

Spoken words contain important clues about their meaning, such as their location in a sentence. A sentence like “They’re going to their house over there,” or homophones like karat, caret and carrot, are good examples of how identical sounds can carry different meanings depending on context.
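A toy illustration of how context can settle which word an identical sound refers to, assuming hand-picked context cues for each spelling (the cue lists here are invented for the example):

# Pick among homophones by comparing the surrounding words against
# typical context cues for each spelling. The cue lists are invented.

HOMOPHONES = {
    "karat":  {"gold", "jewelry", "ring"},
    "caret":  {"cursor", "text", "symbol"},
    "carrot": {"eat", "vegetable", "rabbit", "soup"},
}

def disambiguate(context_words: set) -> str:
    """Pick the spelling whose typical context overlaps most with the sentence."""
    scores = {word: len(cues & context_words) for word, cues in HOMOPHONES.items()}
    return max(scores, key=scores.get)

# The same sound means different things in different sentences.
print(disambiguate({"the", "rabbit", "ate", "a"}))    # -> "carrot"
print(disambiguate({"move", "the", "cursor", "to"}))  # -> "caret"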

“There is a great deal of importance in how you say something versus what you say,” says Mari Ostendorf, a professor of electrical engineering at the University of Washington. One of her research projects aims to develop systems that can locate the invisible punctuation in speech and use it to help determine meaning. Paragraphs, commas and question marks are all basic elements of the written word, but their location in a spoken sentence can be difficult to discern. “We also need to be able to detect emotion and stress,” she says.
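One simple, hypothetical way to approximate the “invisible punctuation” idea is to treat long pauses between words as clause or sentence boundaries; real systems also use pitch, energy and the words themselves. The word timings and thresholds below are made up for illustration.

# Rough sketch: insert punctuation based on the length of the pause
# between one word's end and the next word's start (times in seconds).

words = [("so", 0.0, 0.2), ("we", 0.35, 0.5), ("met", 0.55, 0.8),
         ("yesterday", 0.85, 1.4), ("and", 2.1, 2.25), ("it", 2.3, 2.4),
         ("went", 2.45, 2.7), ("well", 2.75, 3.0)]

def punctuate(timed_words, comma_gap=0.3, period_gap=0.6):
    """Add commas and periods where the silence between words is long enough."""
    out = []
    for i, (word, start, end) in enumerate(timed_words):
        out.append(word)
        if i + 1 < len(timed_words):
            gap = timed_words[i + 1][1] - end
            if gap >= period_gap:
                out[-1] += "."
            elif gap >= comma_gap:
                out[-1] += ","
    return " ".join(out) + "."

print(punctuate(words))  # "so we met yesterday. and it went well."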

Researchers believe that the best way to find and incorporate meaning is by using increasingly powerful computers to simultaneously detect and record more variables in speech. “The number of neurons in the brain devoted to speech is far more than the number of parameters in software,” Juang says.

Perhaps the most significant way to improve current software is with an increase in computing power, says Bishnu Atal, also at the University of Washington. Current technology analyzes approximately 20 or 30 features of speech every few milliseconds, but researchers hypothesize that software may need to analyze several thousand features in the same span to gather enough information to determine meaning. The best way to accomplish that is with more powerful computers.
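To illustrate what “a few dozen features every few milliseconds” means in practice, here is a minimal frame-based analysis sketch. It uses a synthetic test tone and simple band energies rather than the richer features real recognizers compute; the frame sizes and band count are illustrative choices.

# Slice audio into short frames (a new frame every ~10 ms) and compute a
# small feature vector per frame. The signal here is a synthetic 440 Hz tone.

import numpy as np

sample_rate = 16000                   # samples per second
signal = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)  # 1 s tone

frame_len = int(0.025 * sample_rate)  # 25 ms analysis window
hop = int(0.010 * sample_rate)        # step forward 10 ms each time

features = []
for start in range(0, len(signal) - frame_len, hop):
    frame = signal[start:start + frame_len]
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame_len)))
    # A couple dozen numbers per frame: log energy in each frequency band.
    bands = np.array_split(spectrum, 24)
    features.append([np.log(band.sum() + 1e-10) for band in bands])

print(len(features), "frames,", len(features[0]), "features per frame")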

Another key to extracting meaning from speech is successfully finding “phones,” the most elementary characteristics of speech, according to the University of Washington’s Signal, Speech and Language Interpretation Lab. The phones of the world’s languages can be described and uniquely identified by a compact set of approximately 30 features; the sound we make for the letter “d,” for example, is a universal one that involves pressing the tongue against the palate. But merely add a single vowel and you have additional variability: pronounce “di” and “du” and you can hear the difference. Software that identifies speech by breaking it down into its smallest component parts would make it much easier to write recognition software for specific languages from the ground up.
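As a simplified sketch of describing phones by a compact feature set, the toy inventory below assigns a handful of features to a few phones. Real phonetic feature sets, including the roughly 30 features mentioned above, are larger and more precise; these entries are only illustrative.

# Toy inventory: each phone is a small bundle of elementary features.

PHONE_FEATURES = {
    "d": {"voiced", "stop", "alveolar"},       # tongue against the ridge behind the teeth
    "t": {"stop", "alveolar"},                 # same gesture, but unvoiced
    "i": {"voiced", "vowel", "high", "front"},
    "u": {"voiced", "vowel", "high", "back", "rounded"},
}

def shared_features(a: str, b: str) -> set:
    """Show which elementary features two phones have in common."""
    return PHONE_FEATURES[a] & PHONE_FEATURES[b]

# "di" and "du" start with the same consonant, but the vowel that follows
# changes the overall sound -- the added variability mentioned above.
print(shared_features("d", "t"))   # {'stop', 'alveolar'}
print(shared_features("i", "u"))   # {'voiced', 'vowel', 'high'}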

“Meaning is something that we as humans take for granted but is not easy to define, much less capture,” says Alex Acero, a senior researcher at Microsoft and manager of the company’s speech technology group. Before we have robots that understand whom we’re talking about when we say “her,” or that can tell when we’re being sarcastic, software will need to identify and understand many variables. Filtering out ambient noise, not having to define a word’s meaning every time it is used, or simply having a computer say “I don’t understand” when meaning is uncertain are just a few of the hurdles to overcome.