2018 Friday Poster 6696

Friday, November 2, 2018 | Poster Session I, Metcalf Small | 3pm

Phonetic learning without phonetic categories
T. Schatz, N. Feldman, S. Goldwater, E. Dupoux

Infant speech perception becomes specialized to the native language(s) during the first year of life. For example, between 6-8 months and 10-12 months, infants hearing Japanese become worse at distinguishing American English (AE) [ɹ] and [l], while infants hearing AE become better [1]. This widely documented phenomenon has been dubbed phonetic category acquisition and has largely been taken as evidence that children form phonetic categories, like [ɹ] and [l]. We present evidence that calls this interpretation into question.

We show that a distributional learning model trained on raw, unsegmented speech can successfully predict infants’ changes in discrimination for [ɹ] and [l], but that it does so by learning units which are not ‘phonetic categories’ in any meaningful sense.

Our distributional learner is a Gaussian mixture model that takes MFCCs as input, i.e. moderate-dimensional descriptors (d=13) of the shape of the short-term spectrum extracted automatically from the waveform every 10 ms, as well as their first and second time derivatives [2]. We train an ‘English native’ and a ‘Japanese native’ model on spontaneous conversational speech from [3,4]. Our learner automatically determines the number of Gaussians that best fits the distribution of sounds (DPGMM; [5]). At test, it computes how likely [ɹ] and [l] test stimuli from a different AE corpus [6] are to belong to each Gaussian in the mixture. These posterior probability distributions over Gaussians, obtained for each test stimulus, are then used in a machine ABX discrimination task [7] to measure the model’s ability to distinguish [ɹ] and [l]. We obtained test stimuli from a different AE corpus to ensure that observed results are due to training language rather than channel effects.
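The test procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the poster: the mixture is assumed to have diagonal covariances, frame-level dissimilarity is assumed to be a symmetrised KL divergence, and sequences are aligned with plain, unnormalised dynamic time warping. All function names (`posteriors`, `sym_kl`, `dtw_cost`, `abx_score`) and the one-dimensional toy mixture are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x (assumed form)."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given one frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def sym_kl(p, q, eps=1e-10):
    """Symmetrised KL divergence between two posterior vectors (assumed metric)."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def dtw_cost(seq_a, seq_b, frame_dist=sym_kl):
    """Cumulative cost of the best DTW alignment between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def abx_score(a, b, x):
    """Score one ABX triplet in which X belongs to the same category as A.

    Returns 1.0 if X is closer to A than to B, 0.5 on a tie, 0.0 otherwise.
    """
    d_ax, d_bx = dtw_cost(a, x), dtw_cost(b, x)
    if d_ax < d_bx:
        return 1.0
    if d_ax == d_bx:
        return 0.5
    return 0.0

# Toy usage with a hypothetical two-component, one-dimensional mixture.
weights, means, variances = [0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]]

def to_posteriorgram(frames):
    """Map a sequence of feature frames to a sequence of posterior vectors."""
    return [posteriors(f, weights, means, variances) for f in frames]

a = to_posteriorgram([[0.1], [0.2], [-0.1]])  # stands in for one category
b = to_posteriorgram([[3.9], [4.2]])          # stands in for the other
x = to_posteriorgram([[0.0], [0.3]])          # same category as a
score = abx_score(a, b, x)
```

Averaging such scores over many triplets gives a discrimination accuracy, so a model whose units do not separate two sounds would score near 0.5 under this scheme.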

Mirroring empirical observations in infants, [ɹ]-[l] discrimination is worse for the Japanese model than for the AE model (Figure 1). This shows that a distributional learning mechanism can predict perceptual patterns accurately. The Gaussians learned by our models, however, do not resemble phonetic categories. We illustrate this point by taking the [ɹa] and [la] stimuli from [1] and plotting the activation profile (i.e. the posterior probability) of the Gaussian units learned by the ‘English model’ as a function of time for these stimuli (Figure 2). [ɹ] and [l] activate distributed, partially overlapping sets of units, each unit being active for a duration much too short to constitute a proper phonetic segment.
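One hypothetical way to put a number on "much too short" is to measure how long each unit stays the most probable one. The sketch below is not the analysis behind Figure 2; it assumes a posteriorgram given as a list of per-frame posterior vectors and uses the 10 ms frame step stated above. The function name `dominant_unit_runs` and the toy posteriorgram are illustrative only.

```python
from itertools import groupby

FRAME_SHIFT_MS = 10  # frame step stated in the abstract

def dominant_unit_runs(posteriorgram, frame_shift_ms=FRAME_SHIFT_MS):
    """Return (unit_index, duration_ms) for each run of consecutive frames
    that share the same most-probable Gaussian unit."""
    winners = [max(range(len(frame)), key=frame.__getitem__)
               for frame in posteriorgram]
    return [(unit, len(list(run)) * frame_shift_ms)
            for unit, run in groupby(winners)]

# Toy posteriorgram: 4 frames, 3 units (made-up values, not model output).
toy = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
]
runs = dominant_unit_runs(toy)
```

Runs lasting only a few frames would be consistent with units that are shorter than a phonetic segment, although a winner-take-all summary ignores the distributed activation that the profiles show.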

Our results provide a way to reconcile previous work that found it difficult to learn phonetic categories using distributional learning mechanisms (e.g. [8]) with the hypothesis that early perceptual changes are driven by distributional learning. This is also the first time a phonetic learning model has been used to directly predict infants’ changes in discrimination. More broadly, our results challenge the view, commonly held for several decades, that perceptual changes in infancy constitute evidence for the early formation of phonetic categories, and invite us to reconsider the very nature of early linguistic knowledge. Our model provides one possible alternative to phonetic categories, but many others can and should be investigated [9].

References

  1. Kuhl, P. K., et al. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2).
  2. Mermelstein, P. (1976). Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence, 116, 374-388.
  3. Pitt, M. A., et al. (2005). The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication, 45(1), 89-95.
  4. Maekawa, K. (2003). Corpus of Spontaneous Japanese: Its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition.
  5. Chang, J., & Fisher III, J. W. (2013). Parallel sampling of DP mixture models using sub-cluster splits. In Proceedings of NIPS.
  6. Paul, D. B., & Baker, J. M. (1992, February). The design for the Wall Street Journal-based CSR corpus. In Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics.
  7. Schatz, T., et al. (2013). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. In Proceedings of INTERSPEECH.
  8. Adriaans, F., & Swingley, D. (2012). Distributional learning of vowel categories is supported by prosody in infant-directed speech. In Proceedings of the Annual Meeting of the Cognitive Science Society.
  9. Versteegh, M., et al. (2015). The zero resource speech challenge 2015. In Proceedings of INTERSPEECH.