If you need to call Sprint to pay your bill or ask a question
about service plans, a computer on the other end of the phone greets you with
the following message:
“You have reached PCS Customer Solutions. Briefly tell me how I can help
you.”
“Bill payment,” you reply.
Silence.
“Got it. I’ll connect you to our payment system.”
Seems easy enough. If you say a word like “pizza,” however, the
system responds with the following:
“Here are some popular choices. When you hear the one you want, just say
it back to me.”
This simple dialogue paints a fairly accurate picture of modern speech recognition
technology. The key problem is that the software cannot extract meaning from
spoken words. Right now the technology is useful mainly in customer service
applications, where calls can be routed based on a computer’s ability to
pick out key words in a caller’s speech and match them to desired outcomes.
But the ultimate goal of many researchers is a system that can actually converse
with human beings.
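To make the routing in the opening exchange concrete, here is a minimal Python sketch of keyword-based call routing. It assumes the recognizer has already turned the caller’s speech into text; the keyword list, routing labels and the route_call function are invented for illustration, not Sprint’s actual system.

```python
# Minimal sketch of keyword-based call routing, the kind of matching the
# PCS system above performs. Keywords and routing labels are hypothetical;
# a real system scores recognized words against grammars, not plain strings.
ROUTES = {
    "bill": "payment_system",
    "payment": "payment_system",
    "plan": "service_plans",
    "upgrade": "sales",
}

def route_call(transcript: str) -> str:
    """Return a destination for the recognized utterance, or fall back to a menu."""
    for word in transcript.lower().split():
        if word in ROUTES:
            return ROUTES[word]
    return "popular_choices_menu"   # "Here are some popular choices..."

print(route_call("Bill payment"))   # -> payment_system
print(route_call("pizza"))          # -> popular_choices_menu
```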
As speakers, we know that there is far more to language than a strict literal
translation. Try explaining the intricacies of meaning to a computer that can
fundamentally only tell the difference between ones and zeros, black and white,
or yes and no. Over the course of our lives we continually learn and refine
our methods of communication, but today’s computers can’t adapt
and learn languages in the same way. Many scientists are therefore becoming
linguists, asking fundamental questions about the meanings of words and how
we find and interpret those meanings, so they can tell a computer how to do it.
For a computer to understand language, it must be able to interpret meaning in
context and cope with variations in the speed and tone of a speaker’s voice.
Speech technology has come a long way, but many
researchers believe the most dramatic breakthroughs are on the immediate horizon.
Imagine missing a meeting at work and wanting to catch up on the important details.
Instead of listening to a recording of the entire meeting, fast-forwarding and
rewinding until you found the spot where an important conversation took place,
you could use a computer with speech recognition software to find what you need.
By extracting meaning from human speech, the computer could record and process
the conversations from the meeting and take you straight to the part of the
discussion you wanted to hear. The same applies to spoken “documents”
like radio or television broadcasts: advanced speech recognition software
that finds meaning in spoken words could summarize them in electronic form.
But a great deal of work lies ahead. The main thing today’s speech recognition
looks for is consistency, which can be a problem with human speech, says B.H.
“Fred” Juang, a researcher at Georgia Tech. Consistency in tone
and speed is particularly important because from those variables the computer
forms a framework for understanding how someone speaks. It can then identify
words as they are being said. Juang worked at AT&T Bell Labs for 40 years
and played a role in the development of the current generation of speech recognition
software, typically found in computer systems that take operator-assisted telephone
calls. “You obviously speak to [the computer] like you speak to another
person,” he says. But when people try to clarify what they are saying
to another person, they frequently change the way they are speaking –
by talking louder or more slowly, for example – and this confuses software.
“Computers hate sloppy speech, but humans love it,” says Li Deng,
a senior researcher at Microsoft’s speech technology group. Try transcribing
a chunk of recorded speech and you will get the idea. In addition to fluctuations
in speed and tone, human speech mixes with ambient noise and is filled with “ums
and uhs,” half-completed sentences and sometimes even completely made-up
words. This is one reason that dictation software – the other common use
for speech recognition technology – is far from perfect.
Current generations of dictation software still struggle against the stigma
of previous generations, which required users to train themselves to speak in
ways a computer could more easily recognize. Computerized understanding of
speech is presently built upon a database of “average” speakers: an
amalgamation of voice samples from thousands of people, intended to cover the
kinds of variation in speech that a computer may encounter. Modern versions
improve on the traditional technique of forcing users to speak in a slow,
consistent tone, but software that still requires people to adapt their speech
to the computer’s requirements, rather than the other way around, will not
be adopted quickly.
Scientists want people to be able to speak to computers in the same way that
they would talk to another human being.
Spoken words contain important clues about their meaning, such as their position
in a sentence. A sentence like “They’re going to their house over there,”
or homonyms like karat, caret and carrot, shows how identical sounds
can carry different meanings depending on context.
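As a toy illustration of how context can separate those homonyms, the following Python sketch picks the spelling whose typical neighboring words best match the rest of the sentence. The pronunciation string, context-word sets and disambiguate function are invented for illustration; real recognizers rely on statistical language models trained on large text corpora.

```python
# Simplified sketch of using context to disambiguate identical sounds.
# The context words below are illustrative, not drawn from any real language model.
HOMOPHONES = {
    "k ae r ah t": {            # one pronunciation, three spellings/meanings
        "carrot": {"eat", "vegetable", "rabbit"},
        "karat": {"gold", "jewelry", "ring"},
        "caret": {"cursor", "insert", "text"},
    }
}

def disambiguate(pronunciation: str, context_words: set[str]) -> str:
    """Pick the spelling whose typical context overlaps most with the sentence."""
    candidates = HOMOPHONES[pronunciation]
    return max(candidates, key=lambda word: len(candidates[word] & context_words))

print(disambiguate("k ae r ah t", {"the", "rabbit", "ate", "a"}))       # -> carrot
print(disambiguate("k ae r ah t", {"an", "eighteen", "gold", "ring"}))  # -> karat
```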
“There is a great deal of importance in how you say something versus what
you say,” says Mari Ostendorf, a professor of electrical engineering at
the University of Washington. One of her research projects aims to develop systems
that can locate the invisible punctuation in speech and use it to help determine
meaning. Paragraph breaks, commas and question marks are all basic elements of the
written word, but their location in spoken language can be difficult to discern.
“We also need to be able to detect emotion and stress,” she says.
Researchers believe that the best way to find and incorporate meaning is by
using increasingly powerful computers to simultaneously detect and record more
variables in speech. “The number of neurons in the brain devoted to speech
is far more than the number of parameters in software,” Juang says.
Perhaps the most significant way to improve current software is with an increase
in computing power, says Bishnu Atal, also at the University of Washington.
Current technology analyzes approximately 20 or 30 features of the speech signal
every few milliseconds, but researchers hypothesize that software may need to
analyze several thousand features in the same interval to gather enough
information to determine meaning. The best way to accomplish that is with
more powerful computers.
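To give a sense of what analyzing “20 or 30 features every few milliseconds” looks like in practice, here is a rough Python sketch of frame-based analysis: the audio is sliced into short overlapping windows and each window is reduced to a small feature vector. The sample rate, frame sizes and the crude log-spectrum filterbank are assumptions for illustration; production systems typically compute mel-frequency cepstral coefficients or similar features.

```python
# Rough sketch of frame-based feature extraction: a new frame every 10 ms,
# each reduced to NUM_FEATURES numbers (on the order of 20-30 per frame).
import numpy as np

SAMPLE_RATE = 16000          # samples per second (assumed)
FRAME_LEN = 400              # 25 ms analysis window
FRAME_STEP = 160             # new frame every 10 ms
NUM_FEATURES = 26            # features kept per frame

def extract_features(audio: np.ndarray) -> np.ndarray:
    window = np.hanning(FRAME_LEN)
    frames = []
    for start in range(0, len(audio) - FRAME_LEN, FRAME_STEP):
        frame = audio[start:start + FRAME_LEN] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
        bands = np.array_split(spectrum, NUM_FEATURES)        # crude filterbank
        frames.append(np.log([band.sum() + 1e-10 for band in bands]))
    return np.array(frames)   # shape: (num_frames, NUM_FEATURES)

one_second = np.random.randn(SAMPLE_RATE)       # stand-in for real audio
print(extract_features(one_second).shape)       # -> (98, 26)
```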
Another key to the extraction of meaning from speech is successfully finding
“phones,” the most elementary characteristics of speech, according
to the University of Washington’s Signal, Speech and Language Interpretation
Lab. Phones of the world’s languages can be described and uniquely identified
by a compact set of approximately 30 features. The sound we make for the letter
“d,” for example, is a universal one that involves pressing the tongue
against the palate. But add a single vowel and you have additional variability:
pronounce “di” and “du” and you can hear the difference.
Software that can identify speech by breaking it down into its smallest
component parts would make it much easier to write recognition software for
specific languages from the ground up.
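A small data-structure sketch shows what describing phones by a compact feature set might look like. The feature names and values below are a simplified illustration drawn from standard articulatory phonetics, not the lab’s actual inventory.

```python
# Toy sketch: phones described by a handful of articulatory features,
# along the lines of the compact ~30-feature inventory mentioned above.
PHONE_FEATURES = {
    "d": {"place": "alveolar", "manner": "stop", "voiced": True},
    "t": {"place": "alveolar", "manner": "stop", "voiced": False},
    "i": {"height": "high", "backness": "front", "rounded": False},
    "u": {"height": "high", "backness": "back", "rounded": True},
}

def describe(syllable: list[str]) -> None:
    """Print the feature bundle for each phone in a syllable."""
    for phone in syllable:
        print(phone, PHONE_FEATURES[phone])

# The same "d" followed by different vowels -- "di" versus "du" -- shares its
# consonant features but differs in the vowel's backness and rounding.
describe(["d", "i"])
describe(["d", "u"])
```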
“Meaning is something that we as humans take for granted but is not easy
to define, much less capture,” says Alex Acero, a senior researcher at
Microsoft and manager of the company’s speech technology group. Before we
have robots that understand whom we’re talking about when we say “her,”
or that can recognize sarcasm, software will need to identify and understand
many variables. Filtering out ambient noise, not having to define a word’s
meaning every time it is used, and simply getting a computer to say
“I don’t understand” when meaning is uncertain are just a
few of the hurdles to overcome.