Model Demos

The DIVA model is a neural network that learns to control the movements of a computer-simulated vocal tract. The model’s articulator movements and acoustic signal can be captured in videos. The samples below illustrate several interesting aspects of speech in developing infants and adults.


In the first 6-8 months of life, infants produce a wide range of vocalizations, but very few of these are well-formed syllables. According to the DIVA model, these early, largely random vocalizations allow the infant’s brain to tune up neural mappings representing the relationships between articulator movements and their acoustic consequences, independent of any particular speech sounds. These mappings will later be needed to guide the production of speech sounds learned at subsequent stages of development.

Demo: A sample of random babbling movements used to tune the model’s representation of the relationship between articulator movements and their acoustic consequences. Random_Babbling
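
The role the model assigns to random babbling can be sketched in a few lines of Python. This is a toy illustration only, not the DIVA implementation: the `vocal_tract` function, the articulator names, the formant values, and the nearest-neighbor lookup are all invented stand-ins for the model’s synthesizer and learned neural mappings.

```python
import random

# Toy sketch (not the DIVA implementation): random babbling produces paired
# (movement, sound) samples, which later let the system look up a movement
# for a desired sound. Articulator names and formant values are made up.

def vocal_tract(articulators):
    """Stand-in synthesizer: maps two articulator positions to (F1, F2)."""
    jaw, tongue = articulators
    return (500 + 300 * jaw, 1500 + 700 * tongue)

random.seed(0)
memory = []                      # the tuned "mapping": movement -> sound
for _ in range(200):             # 200 random babbles
    command = (random.uniform(0, 1), random.uniform(0, 1))
    memory.append((command, vocal_tract(command)))

def command_for(target):
    """Invert the mapping: the nearest remembered sound gives a movement."""
    return min(memory,
               key=lambda m: (m[1][0] - target[0]) ** 2
                           + (m[1][1] - target[1]) ** 2)[0]

# After babbling, a sound goal can be turned into a movement command.
print(vocal_tract(command_for((650, 1850))))
```

The point mirrored here is the one in the text: the babbling phase itself is random, but the stored movement-sound pairs are what later make goal-directed production possible.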

At around 8-10 months, most infants enter a new stage of babbling, referred to as reduplicated or canonical babbling. This stage is characterized by the fairly sudden onset of reasonably well-formed syllables, often surprising parents with (perhaps accidental) productions of first words like “mama”. In the DIVA model, reduplicated babbling can be generated by simply moving the jaw up and down in an oscillatory fashion. Note how the model’s productions in the following video sound much more like syllables and words than in the previous video, which involved purely random articulator movements.

Demo: Reduplicated babbling, leading to frequent production of acceptable syllables. Reduplicated_Babbling

Shortly after they enter the reduplicated babbling stage, infants move on to variegated babbling, characterized by changes in the phonemes from syllable to syllable within a babble. In the model this can be induced by adding random articulator movements onto the oscillatory jaw movements used during reduplicated babbling.

Demo: Variegated babbling, in which random movements of the tongue and lips coupled with oscillatory jaw movements produce variations in the phonemic content of each babbled syllable. Variegated_Babbling
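
The contrast between the two babbling modes can be caricatured in a short script (names, scales, and rates are illustrative, not DIVA’s actual articulator commands): a sinusoidal jaw cycle alone gives reduplicated babbling, and adding random tongue jitter on top of the same cycle gives variegated babbling.

```python
import math
import random

# Toy sketch (not the DIVA implementation): reduplicated babbling is an
# oscillatory jaw cycle; variegated babbling adds random articulator noise
# on top of the same cycle. All names and scales are illustrative.

def reduplicated(n_steps, rate=0.2):
    """Jaw opens and closes sinusoidally; the tongue stays put."""
    return [(0.5 + 0.5 * math.sin(rate * t), 0.5) for t in range(n_steps)]

def variegated(n_steps, jitter=0.3, seed=1):
    """Same jaw cycle, plus a random tongue perturbation per time step."""
    rng = random.Random(seed)
    return [(jaw, min(1.0, max(0.0, tongue + rng.uniform(-jitter, jitter))))
            for jaw, tongue in reduplicated(n_steps)]

frames = variegated(100)
print(frames[:3])
```

The shared jaw oscillation is what keeps both modes syllable-like; only the added jitter changes the phonemic content from syllable to syllable.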

According to the model, words and syllables of the native language are represented in the brain as auditory target regions that the speech production mechanism must attain in order to correctly produce the word/syllable. The transition from babbling to first words in the model can be carried out simply by reducing the sizes of the auditory target regions from large and diffuse regions for babbling to smaller regions that more accurately characterize how a particular word is supposed to sound. The following videos represent this progression from babbling to correct production of the word “baby”, carried out by progressively shrinking the auditory target region with each attempt.

Baby_1 Baby_2 Baby_3 Demo: Three attempts to produce the word “baby”, with the auditory target region for the word shrinking with each attempt from a very diffuse target to a very precise target.
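
The shrinking-target idea can be made concrete with a small sketch. Assume, for illustration only (the numbers are invented), that a target is a box in formant space centered on the ideal sound:

```python
# Toy sketch (invented numbers): an auditory target as a box in formant
# space. Early targets are diffuse, so almost any babble-like production
# counts as correct; shrinking the box demands increasingly accurate speech.

def target_region(center, radius):
    """A square region of 'acceptable' sounds around the ideal formants."""
    return tuple(c - radius for c in center), tuple(c + radius for c in center)

def on_target(sound, region):
    lo, hi = region
    return all(l <= s <= h for s, l, h in zip(sound, lo, hi))

center = (700, 1200)       # idealized (F1, F2) for one sound; made up
attempt = (650, 1350)      # a somewhat inaccurate production

for radius in (400, 200, 80):      # target shrinks with each attempt
    print(radius, on_target(attempt, target_region(center, radius)))
```

With the diffuse target the inaccurate attempt already counts as correct; only the precise target rejects it and thereby demands a better production.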

Learning New Words

According to the DIVA model, there are two ways our brains can control speech movements: (1) using auditory feedback of our own speech to correct any differences between our auditory target for a word and our actual speech, and (2) using “feedforward” commands, which are articulatory commands that were learned from past productions. Early productions of a new word rely heavily on auditory feedback control, since accurate feedforward commands for the word are not yet available. With each attempt, however, the feedforward commands improve. The videos below show the model’s first few attempts to produce a new phrase, “good doggie”.

Good_doggie_1 Good_doggie_2 Good_doggie_3 Good_doggie_4 Demo: DIVA’s first few attempts to produce the phrase “good doggie”. With each iteration the model improves the feedforward commands for the phrase, resulting in better and better productions.
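
The improvement across attempts can be sketched with a scalar toy model (a single number stands in for an entire articulatory trajectory, and the learning rate is invented): on each attempt, the correction computed by feedback control is partly folded into the stored feedforward command.

```python
# Toy sketch (not the DIVA implementation): a scalar stands in for the whole
# feedforward command. On each attempt the feedback controller computes a
# correction, and a fraction of it is folded into feedforward memory, so
# later attempts need less and less online correction.

target = 1.0              # idealized auditory goal for the new phrase
feedforward = 0.0         # nothing learned yet for the phrase
learning_rate = 0.5       # invented value

errors = []
for attempt in range(6):
    correction = target - feedforward   # what feedback control must supply
    errors.append(abs(correction))      # reliance on feedback this attempt
    feedforward += learning_rate * correction

print(errors)  # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Early attempts are carried almost entirely by feedback control; as the feedforward command absorbs the corrections, productions become fast and accurate without it, matching the progression in the four videos.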

Motor Equivalence in Speech

Humans are amazingly adept at producing intelligible speech in novel ways, for example while holding their jaws clamped on a pipe or a bite block. Clenching the jaw requires a complex reorganization of the commands to the tongue and lips in order to achieve the auditory targets for words, yet we do this almost effortlessly and without needing to practice. This ability, called motor equivalence, occurs in the DIVA model via the feedback control system, which maps sensory error signals into corrective movement commands to cancel out the effects of constraints on the articulators.

Demo: DIVA producing the vowels in “bet”, “beet”, “bat”, “but”, and “boot”. diva normal
Demo: DIVA producing the same vowels while the jaw is held fixed. The model was never trained with a blocked jaw, which requires very different movements from those used in the unblocked case. diva fixed jaw
Demo: DIVA producing the same vowels with 2/3 of tongue mobility removed. diva fixed tongue
Demo: DIVA producing the same vowels without using its jaw, lips, or larynx. The model produces sounds reasonably close to the targets, much like an amateur ventriloquist. diva demo4
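
The motor-equivalence demos can be caricatured with a toy linear “vocal tract” (the weights and articulator names are invented): the feedback loop distributes corrections across whichever articulators remain free, so the same acoustic goal is reached with or without the jaw.

```python
# Toy sketch (not the DIVA implementation): a linear plant where each
# articulator contributes to a single acoustic value. The feedback loop
# pushes each *free* articulator in proportion to its acoustic effect;
# clamped articulators drop out, and the others take up the slack.

weights = {"jaw": 0.5, "tongue": 0.3, "lips": 0.2}   # invented

def acoustics(pos):
    return sum(weights[a] * pos[a] for a in weights)

def reach(target, clamped=(), gain=0.5, steps=200):
    pos = {a: 0.0 for a in weights}
    for _ in range(steps):
        error = target - acoustics(pos)          # sensory error signal
        for a in weights:
            if a not in clamped:                 # constraint on articulators
                pos[a] += gain * weights[a] * error
    return pos

free = reach(1.0)
jaw_blocked = reach(1.0, clamped=("jaw",))
print(round(acoustics(free), 4), round(acoustics(jaw_blocked), 4))
```

No retraining is involved: the same error-to-movement rule handles both cases, which is the sense in which the feedback control system cancels out the effects of constraints on the articulators.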

Communication Disorders

The model allows us to investigate the neural bases of various communication disorders: we can damage model components and observe the effect on the model’s speech. The following videos demonstrate stuttering in the model, which can occur if the model relies too heavily on feedback control.

Good_doggie_stuttering Demo: The model stuttering when trying to produce the phrase “good doggie”.
Rala_stuttering Demo: The model stuttering when trying to produce the utterance “ra-la”.
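
One way to see why over-reliance on feedback can produce this kind of breakdown is that auditory feedback arrives only after a delay. A toy control loop (illustrative only, not DIVA’s circuitry; the delay and gain values are invented) shows the effect: a modest feedback gain tracks the target smoothly, while an excessive gain acting on stale error over-corrects and oscillates.

```python
# Toy sketch (not the DIVA implementation): a scalar tracking loop whose
# error signal is delayed, as auditory feedback is. A modest gain converges;
# an excessive gain over-corrects on stale information and oscillates.

def track(gain, delay=3, steps=40, target=1.0):
    history = [0.0] * (delay + 1)
    for _ in range(steps):
        stale_error = target - history[-(delay + 1)]   # delayed feedback
        history.append(history[-1] + gain * stale_error)
    return history

modest = track(gain=0.1)
excessive = track(gain=0.6)
print(round(modest[-1], 3))       # ends close to the target
print(round(max(excessive), 3))   # large overshoot, repeated corrections
```

The oscillatory over-correction in the high-gain case is loosely analogous to the repeated, self-interrupting corrections heard in the stuttering videos; a model guided mainly by well-learned feedforward commands avoids this failure mode.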