Finding meaning behind the words
...what
you need. By extracting meaning from human speech, the computer
could record and process the conversations from the meeting, and
could then take you straight to the part of the discussion you
wanted to hear. The same thing applies to verbal “documents”
like radio or television broadcasts – advanced speech recognition
software that finds meaning in spoken words could summarize them
in electronic form.
But a great deal of work lies ahead. The main thing today’s
speech recognition looks for is consistency, which can be a problem
with human speech, says B.H. “Fred” Juang, a researcher
at Georgia Tech. Consistency in tone and speed is particularly
important because from those variables the computer forms a framework
for understanding how someone speaks. It can then identify words
as they are being said. Juang worked at AT&T Bell Labs for
40 years and played a role in the development of the current generation
of speech recognition software, typically found in computer systems
that take operator-assisted telephone calls. “You obviously
speak to [the computer] like you speak to another person,”
he says. But when people try to clarify what they are saying to
another person, they frequently change the way they are speaking
– by talking louder or more slowly, for example –
and this confuses the software.
“Computers hate sloppy speech, but humans love it,”
says Li Deng, a senior researcher at Microsoft’s speech
technology group. Try transcribing a chunk of recorded speech
and you will get the idea. In addition to fluctuations in speed
and tone, human speech mixes with ambient noise, is filled with
“ums and uhs,” half-completed sentences and sometimes
even completely fictitious words. This is one reason that dictation
software – the other common use for speech recognition technology
– is far from perfect.
Current generations of dictation software still struggle against the stigma
of previous generations, which required users to train themselves to
speak in ways that allow their words to be more easily recognized
by a computer. Computerized understanding of speech is presently
built upon a database of “average” speakers –
an amalgamation of voice samples from thousands of people that
tries to include samples of the types of variations in speech
that a computer may encounter.