Supplement to Auditory Perception

Speech Perception: Empirical and Theoretical Considerations

What are the objects of speech perception? Speaking involves the production of meaningful streams of sounds. At the physical level, a spectrogram reveals the patterns of frequency and amplitude that ground audible features. The stream sounds like a complex acoustic structure involving patterns of audible qualities over time. The stream, however, auditorily appears to be segmented (speech in an unfamiliar language often seems like an unsegmented stream). The most salient segments are words, the meaningful units. Also discernible in the stream are segments that correspond to something like syllables. These units or segments are not ascribed meaning, but instead combine to form words in a way loosely analogous to the way words combine to form sentences. Even syllables, however, comprise perceptually distinguishable sound types. For instance, though ‘dough’ has one syllable, it includes the sounds of /d/ and /O/ (or /oʊ/). The sound of the one-syllable spoken word ‘bad’ includes /b/, /æ/, and /d/. Those of ‘bat’ and ‘bash’ differ because the former contains /t/ and the latter contains /ʃ/. Such perceptible units, or phonemes, whose patterns form the basis for recognizing and distinguishing words, have been one primary focus of research into speech perception. Phonemes form a sort of “sound alphabet” from which audible words are built (Appelbaum 1999 critiques the “alphabetic” conception).

What is a phoneme? First, consider the universal class of phones, which contains all of the possibly distinguishable types of speech sounds that may mark a semantic difference in some world language. In contrast, phonemes are specific to a particular language. Phonemes also may be understood in terms of equivalence classes of sounds. Phonemes are semantically significant sound types that constitute the spoken words in a given language. The boundaries between phonemes in a language mark sound differences that may be semantically significant for that language.

Phonemes thus may differ across languages. For instance, though certain phonemes are shared, the class of English phonemes differs from that of Japanese. English, for example, treats the phones [l] and [r] as distinct phonemes, while Japanese does not. Instead, Japanese treats them as allophones, or variants of a common phoneme. Conversely, Standard Chinese treats as distinct phonemes two sounds, the aspirated [pʰ] and the unaspirated [p], that are allophones of the single English phoneme /p/. It is noteworthy that infants prior to language learning distinguish phones that are later subsumed under a single phonemic equivalence class (see, e.g., Werker 1995, Kuhl 2000 for review and commentary). In addition, certain languages make use of novel sounds, such as clicks, that others do not. So, when compared with each other, distinct languages may differ in which sounds they include or omit among their respective phonemes, and they may differ in which sound pairs they treat as distinct phonemes or as allophonic.
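The idea that phonemes are language-specific equivalence classes of phones can be made concrete with a small sketch. The Python fragment below is illustrative only: the tables cover just the handful of phones mentioned above and are simplified assumptions, not complete phonological inventories, and the names PHONEME_OF, same_phoneme, and phoneme_classes are hypothetical.

```python
from typing import Dict, Set

# Toy illustration: a phoneme as a language-specific equivalence class of phones.
# Each table maps a phone to the phoneme that the language files it under.
# These entries are simplified assumptions based on the examples in the text.
PHONEME_OF: Dict[str, Dict[str, str]] = {
    "English": {"l": "/l/", "r": "/r/", "p": "/p/", "pʰ": "/p/"},
    # "/r/" is used here as a conventional label for the single Japanese liquid.
    "Japanese": {"l": "/r/", "r": "/r/"},
    "Standard Chinese": {"p": "/p/", "pʰ": "/pʰ/"},
}


def same_phoneme(language: str, phone_a: str, phone_b: str) -> bool:
    """Return True if the language groups both phones under one phoneme."""
    table = PHONEME_OF[language]
    return table[phone_a] == table[phone_b]


def phoneme_classes(language: str) -> Dict[str, Set[str]]:
    """Invert the phone-to-phoneme table into equivalence classes of phones."""
    classes: Dict[str, Set[str]] = {}
    for phone, phoneme in PHONEME_OF[language].items():
        classes.setdefault(phoneme, set()).add(phone)
    return classes
```

On these toy tables, same_phoneme("English", "l", "r") is false while same_phoneme("Japanese", "l", "r") is true, and the pattern reverses for the aspirated and unaspirated stops in English and Standard Chinese. The same phones are partitioned differently by different languages, which is all that the talk of equivalence classes requires.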

The central puzzle of speech perception is that there is no obvious direct, consistent correspondence between the surface properties of a physical acoustic signal and the phonemes perceived when listening to speech.

This is manifested in a number of ways. Pioneers of speech perception research aimed initially to develop an automated reading machine for the blind that worked by replacing individual letters with specific sounds. The project failed miserably: at the rates of normal speech, listeners were unable to resolve the sequence of individual sounds required to detect words (see Liberman 1996).

Most importantly, there is no clear invariant property of a sound signal that corresponds to a given phoneme. What sounds like a single phoneme might have very different acoustic correlates depending not just upon the speaker or the speaker’s mood, but also upon the phonemic context. For instance, /di/ and /du/ audibly share the /d/ phoneme. However, the acoustic signal corresponding to /d/ differs greatly in these cases (see Liberman et al. 1967, 435, fig. 1). While /di/ includes a formant that begins at a higher frequency and rises, /du/ includes a formant that begins at a lower frequency and drops. Acoustically, nothing straightforward in the signal corresponds to the /d/ sound one auditorily experiences in both cases. Two different audible phonemes also might share acoustic correlates, again depending on context. The acoustic signal that corresponds to /p/ is nearly identical to that of /k/ in the contexts /pi/ and /ka/ (Cooper et al. 1952). Prima facie, phonemes thus are not identical with distinctive invariant acoustic structures.
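The two failures of invariance just described amount to the failure of any context-free lookup between acoustic patterns and phonemes. The sketch below is a toy model of that point, not an analysis of real signals: the cue labels are hypothetical qualitative stand-ins for the patterns described above, and the function names are likewise assumptions made for illustration.

```python
# Toy model of the lack of invariance. Each observation records a qualitative
# acoustic cue, the vowel context, and the consonant listeners report hearing.
# The cue labels are hypothetical stand-ins, not measured acoustic data.
OBSERVATIONS = [
    ("rising_transition", "i", "d"),   # /di/
    ("falling_transition", "u", "d"),  # /du/
    ("shared_signal", "i", "p"),       # /pi/
    ("shared_signal", "a", "k"),       # /ka/
]


def cues_for(phoneme):
    """All cues that accompany a given perceived phoneme, across contexts."""
    return {cue for cue, _, label in OBSERVATIONS if label == phoneme}


def phonemes_for(cue):
    """All phonemes reported for a given cue when context is ignored."""
    return {label for c, _, label in OBSERVATIONS if c == cue}


def is_invariant():
    """True only if cues and phonemes stand in one-to-one correspondence."""
    phonemes = {label for _, _, label in OBSERVATIONS}
    cues = {cue for cue, _, _ in OBSERVATIONS}
    one_cue_per_phoneme = all(len(cues_for(p)) == 1 for p in phonemes)
    one_phoneme_per_cue = all(len(phonemes_for(c)) == 1 for c in cues)
    return one_cue_per_phoneme and one_phoneme_per_cue


def perceived(cue, vowel):
    """Context-sensitive lookup: the cue together with its vowel context."""
    matches = {label for c, v, label in OBSERVATIONS if (c, v) == (cue, vowel)}
    return matches.pop() if len(matches) == 1 else None
```

In this toy data, one phoneme (/d/) pairs with two different cues and one cue pairs with two different phonemes (/p/ and /k/), so is_invariant() is false. Only when the vowel context is supplied, as in perceived("shared_signal", "i"), does the lookup settle on a single phoneme.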

Lack of invariance stems in large part from coarticulation. In contrast to how things seem auditorily, how a speaker articulates a given phoneme depends upon what precedes or follows that phoneme. Being followed by /i/ rather than /u/ impacts how one pronounces /d/, and being preceded by /d/ impacts the vowel. When pronouncing ‘dab’, the effects of pronouncing both /d/ and /b/ are evident in the acoustic signature of /æ/. The articulatory consequences of phonemic context change the acoustic features of the signal and confound attempts to map phonemes to signals (which presents a difficulty for artificial speech production and recognition). Furthermore, due to coarticulation, the signal lacks the clear segmentation of categorically perceived phonemes, which have been likened to beads on a string (Bloomfield 1933). In effect, speakers pronounce two or more phonemes at a time, and transitions are fluid rather than discrete (see, e.g., Liberman 1970, 309, fig. 5, Diehl et al. 2004).

One response to this, compatible with realism about perceptible phonological features, is to search for more complex acoustic structures or for higher-order acoustical properties that correspond to apparent phonemes (see, e.g., Blumstein and Stevens 1981, Diehl et al. 2004, Holt and Lotto 2008 for the general auditory approach). On the other hand, some philosophers instead conclude that phonological features are mere intentional objects, or ‘intentional inexistents’ (see Rey 2012). Pautz (2017, 27–28), for instance, maintains that differences in acoustical features cannot account for apparent categorical differences between phonemes.

Another type of realist approach appeals to aspects of the gestures used to pronounce phonemes (ways of moving one’s throat and mouth and tongue), which are reasonably invariant across contexts. For instance, pronouncing /d/ involves placing the tip of the tongue on the alveolar ridge directly behind the teeth. The alveolar consonants /d/ and /t/ differ from each other in voicing: /d/ is voiced, or accompanied by vocal fold vibration, while /t/ is not. Whether you say /di/ or /du/, your tongue touches the alveolar ridge and you voice the consonant. But, while you articulate the gestures associated with /d/, you anticipate and begin to articulate those associated with /i/ or /u/. This alters the overall acoustic signature of the gestures associated with /d/. Gestures, rather than the complex acoustic signals they produce, on this view make intelligible the perceptual individuation of phonemes. Some therefore hold that perceiving phonemes involves recovering information about articulatory gestures from the acoustic signal. The motor theory (Liberman et al. 1967, Liberman and Mattingly 1985) and direct realism (Fowler 1986) are very different versions of this approach. Articulatory gestures thus make plausible candidates for objects of phoneme perception. They are, however, imperfect candidates, since they do not entirely escape worries about the context dependence and lack of discrete segmentation stemming from fluid coarticulation (Appelbaum 1996, Remez and Trout 2009).

Nonetheless, the claim that articulatory gestures are the objects of phoneme perception is supported by the surprising finding that visual processes impact the auditory experience of speech. For instance, the McGurk effect includes one instance in which seeing video of a speaker pronouncing /ga/ dubbed with audio of /ba/ leads to hearing as of the /da/ phoneme (McGurk and Macdonald 1976). If perceiving speech involves perceiving gestures, it is not surprising that the visual evidence for articulatory gestures should be weighed against auditory evidence.

Some researchers who hold that intended or actual gestures are the best candidates for the objects of phoneme perception argue that speech perception therefore is special. That is, speech perception’s objects differ in kind from the sounds and acoustic structures we hear in general audition (Liberman et al. 1967, Liberman and Mattingly 1985). Liberman and Mattingly (1985), furthermore, use the claim that speech perception has distinctive objects to motivate the claim that it therefore involves distinctive perceptual processes. They even argue that although speech perception shares an end organ with auditory perception, it constitutes a functionally distinct modular perceptual system (Liberman and Mattingly 1985, 7–10, 27–30, see also 1989). Part of the motivation for their motor theory of speech perception, against auditory theories, is to integrate explanations of speech perception and speech production (1985, 23–5, 30–1, see also Matthen 2005, ch 9, which uses the Motor Theory to support a Codependency Thesis linking the capacities to perceive and produce phonemes, 221). On this account, a single modular system is responsible for both the production and perception of speech. This purported link between capacities for production and perception suggests that humans are unique in possessing a speech perception system. Humans, but not other creatures, are capable of discerning speech for many of the same reasons they are capable of producing the articulatory gestures that correspond to perceived phonemes. Other animals presumably hear just sounds (Liberman et al. 1967, Liberman and Mattingly 1985).

One might accept that perceived phonemes should be identified with articulatory gestures but reject that this makes speech special (see, e.g., Fowler 1986, Mole 2009). If auditory perception generally implicates environmental happenings or sound sources, then the gestures and activities associated with speech production are not entirely distinctive among objects of audition. If hearing even sounds is not merely a matter of hearing features of acoustic signals or structures, and if it is part of the function of auditory perception to furnish information about distal events on the basis of their audible characteristics, then speech is not entirely unique among things we hear (see also Rosenblum 2004, O’Callaghan 2015).

The processes associated with speech perception therefore need not be understood as entirely distinct in function or in kind from those devoted to general audition, as Liberman and Mattingly contend. Given this, it is not surprising to learn that good evidence suggests humans are not special in possessing the capacity to perceptually individuate the sounds of speech (see, e.g., Lotto et al. 1997 for details).

Even so, the processes associated with speech need not be entirely continuous with those of general audition. The overall claim is compatible with higher acuity or sensitivity for speech sounds, and it allows for special selectivity for speech sounds. Even if hearing speech marshals perceptual resources continuous with those devoted to hearing other sounds and events in one’s environment, it would be very surprising to discover that there were no processes and resources devoted to the perception of speech. Research in fact supports a special status for speech among the things we auditorily perceive. First, evidence suggests that human neonates prefer sounds of speech to non-speech (Vouloumanos and Werker 2007). Second, adults are able to distinguish speech from non-speech based on visual cues alone (Soto-Faraco et al. 2007). Third, infants can detect and distinguish different languages auditorily (Mehler et al. 1988, Bosch et al. 1997). Finally, infants aged approximately 4–6 months can detect, based on visual cues alone, when a speaker changes from one language to another, though all but those in bilingual households lose that ability by roughly 8 months (Weikum et al. 2007).

To review, no obvious acoustic correlates exist for phonetic segments heard in speech. Complex acoustic cues therefore must trigger perceptual experiences of phonemes. Articulatory gestures, however, are good (though imperfect) candidates for objects of speech perception. This does not imply that speech perception involves entirely different kinds of objects or processes from ordinary non-linguistic audition, nor does it imply that speech perception is a uniquely human capacity. Nevertheless, speech clearly is special for humans, in that we have special sensitivity for speech sounds. Speech perception promises to reward additional philosophical attention (see O’Callaghan 2015 for further development).

Copyright © 2020 by Casey O'Callaghan
