Finding meaning behind the words

...what you need. By extracting meaning from human speech, the computer could record and process the conversations from the meeting, and could then take you straight to the part of the discussion you wanted to hear. The same thing applies to verbal “documents” like radio or television broadcasts – advanced speech recognition software that finds meaning in spoken words could summarize them in electronic form.

But a great deal of work lies ahead. The main thing today’s speech recognition looks for is consistency, which can be a problem with human speech, says B.H. “Fred” Juang, a researcher at Georgia Tech. Consistency in tone and speed is particularly important because from those variables the computer forms a framework for understanding how someone speaks. It can then identify words as they are being said. Juang worked at AT&T Bell Labs for 40 years and played a role in the development of the current generation of speech recognition software, typically found in computer systems that take operator-assisted telephone calls. “You obviously speak to [the computer] like you speak to another person,” he says. But when people try to clarify what they are saying to another person, they frequently change the way they are speaking – by talking louder or more slowly, for example – and this confuses the software.

“Computers hate sloppy speech, but humans love it,” says Li Deng, a senior researcher at Microsoft’s speech technology group. Try transcribing a chunk of recorded speech and you will get the idea. In addition to fluctuations in speed and tone, human speech mixes with ambient noise, is filled with “ums and uhs,” half-completed sentences and sometimes even completely fictitious words. This is one reason that dictation software – the other common use for speech recognition technology – is far from perfect.

Current generations of dictation software still struggle against the stigma attached to previous generations: that users must train themselves to speak in ways that a computer can more easily recognize. Computerized understanding of speech is presently built upon a database of “average” speakers – an amalgamation of voice samples from thousands of people that tries to include the kinds of variation in speech that a computer may encounter.