Supplement to Auditory Perception

Speech Perception: Empirical and Theoretical Considerations

What are the objects of speech perception? Speaking involves the production of meaningful streams of sounds. At the physical level, a spectrogram reveals the patterns of frequency and amplitude that ground audible features. The stream sounds like a complex acoustic structure involving patterns of audible qualities over time. The stream, however, auditorily appears to be segmented (speech in an unfamiliar language often seems like an unsegmented stream). The most salient segments are words, the meaningful units. Also discernible in the stream are segments that correspond to something like syllables. These units or segments are not ascribed meaning, but instead combine to form words in a way loosely analogous to the way words combine to form sentences. Even syllables, however, comprise perceptually distinguishable sound types. For instance, though 'dough' has one syllable, it includes the sounds of /d/ and /O/ (or /oʊ/). The sound of the one-syllable spoken word 'bad' includes /b/, /æ/, and /d/. Those of 'bat' and 'bash' differ because the former contains /t/ and the latter contains /ʃ/. Such perceptible units, or phonemes, whose patterns form the basis for recognizing and distinguishing words, have been one primary focus of research into speech perception. One answer to the question, “What are the objects of speech perception?” is, “Phonemes.”

What is a phoneme? First, phonemes are language-specific. Phonemes are perceptual equivalence classes drawn from a universal class of phones, which contains all the possibly distinguishable speech sounds. The class of English phonemes, for instance, differs from that of Japanese, though certain phonemes are shared. English, for example, treats the [l] and [r] sounds (phones) as distinct phonemes, while Japanese does not and treats them as allophones, or variants of a common phoneme. Chinese distinguishes several phonemes that correspond to allophones of the English phoneme /p/. Infants prior to language learning distinguish phones that are later subsumed under a single phonemic equivalence class (see, e.g., Werker 1995, Kuhl 2000 for review and commentary).
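The idea that a phoneme is a language-specific equivalence class of phones can be made concrete with a toy sketch. The Python below is purely illustrative and makes several assumptions: the phone inventory is a tiny hand-picked subset, the names `PHONEME_OF`, `same_phoneme`, and `allophones` are hypothetical, the Japanese liquid is labeled /r/ by convention only, and the Chinese entry assumes the contrast in question is between unaspirated and aspirated [p] (written `p` and `ph` here).

```python
# Toy model: each language maps phones (speech sounds) to phonemes,
# i.e. language-specific equivalence classes. Illustrative subset only.
# "ph" is an ASCII stand-in for the aspirated phone.
PHONEME_OF = {
    "English": {"l": "/l/", "r": "/r/", "p": "/p/", "ph": "/p/"},
    "Japanese": {"l": "/r/", "r": "/r/"},
    "Chinese": {"p": "/p/", "ph": "/ph/"},
}


def same_phoneme(language, phone_a, phone_b):
    """Return True if the language groups both phones under one phoneme."""
    table = PHONEME_OF[language]
    return table[phone_a] == table[phone_b]


def allophones(language, phoneme):
    """Return the sorted phones that the language assigns to a phoneme."""
    table = PHONEME_OF[language]
    return sorted(phone for phone, label in table.items() if label == phoneme)


if __name__ == "__main__":
    print(same_phoneme("English", "l", "r"))   # distinct phonemes in English
    print(same_phoneme("Japanese", "l", "r"))  # allophones in Japanese
    print(allophones("English", "/p/"))        # both phones fall under /p/
```

On this picture the same pair of phones can be one class in one language and two classes in another, which is all that "language-specific" requires.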

Intuitively, phonemes comprise the smallest set of perceptually equivalent, semantically significant sound types that constitute the spoken words in a given language. Phonemes seem to form a sound alphabet from which audible words are built. Writing, after all, is naturally understood as an innovation that involved translating into written form the audible sounds of speech, and we teach children to “sound out” written words (Appelbaum 1999 critiques the “alphabetic” conception). Early researchers in speech perception aimed initially to develop an automated reading machine for the blind that worked by replacing individual letters with specific sounds. The project failed miserably: listeners were unable, at the rates of normal speech, to resolve the sequence of individual sounds required to detect words (see Liberman 1996).

The central puzzle of speech perception is that there is no obvious direct, consistent correspondence between the surface properties of a physical acoustic signal and the phonemes perceived when listening to speech. This is manifested in a number of ways. Most importantly, there is no clear invariant property of a sound signal that corresponds to a given phoneme. What sounds like a single phoneme might have very different acoustic correlates depending not just upon the speaker or the speaker's mood, but also upon the phonemic context. For instance, /di/ and /du/ audibly share the /d/ phoneme. However, the acoustic signal corresponding to /d/ differs greatly in these cases (see Liberman et al. 1967, 435, fig. 1). While /di/ includes a formant that begins at a higher frequency and rises, /du/ includes a formant that begins at a lower frequency and drops. Acoustically, nothing straightforward in the signal corresponds to the /d/ sound you auditorily experience in both cases. Two different audible phonemes also might share acoustic correlates, again depending on context. The acoustic signal that corresponds to /p/ is nearly identical to that of /k/ in the contexts /pi/ and /ka/ (Cooper et al. 1952). Prima facie, phonemes thus are not identical with distinctive invariant acoustic structures.
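The two failures of invariance just described can be restated as a toy lookup table. In the Python sketch below, the cue labels are symbolic placeholders rather than measurements: `high_onset_rising` and `low_onset_falling` paraphrase the formant descriptions above, and `shared_burst` is a hypothetical name standing in for the nearly identical signal reported for /pi/ and /ka/. The function names are likewise assumptions. The sketch shows only that the relation between phonemes and acoustic cues is one-to-one in neither direction.

```python
from collections import defaultdict

# (consonant, following vowel) -> symbolic label for the acoustic cue.
# The labels are placeholders for illustration, not measured values.
CUE_FOR = {
    ("d", "i"): "high_onset_rising",
    ("d", "u"): "low_onset_falling",
    ("p", "i"): "shared_burst",
    ("k", "a"): "shared_burst",
}


def cues_by_phoneme(table):
    """Collect the cue labels associated with each consonant across contexts."""
    grouped = defaultdict(set)
    for (consonant, _vowel), cue in table.items():
        grouped[consonant].add(cue)
    return dict(grouped)


def phonemes_by_cue(table):
    """Collect the consonants associated with each cue label across contexts."""
    grouped = defaultdict(set)
    for (consonant, _vowel), cue in table.items():
        grouped[cue].add(consonant)
    return dict(grouped)


if __name__ == "__main__":
    # One phoneme, several cues: /d/ has no single acoustic correlate.
    print(cues_by_phoneme(CUE_FOR)["d"])
    # One cue, several phonemes: the same signal is heard as /p/ or /k/.
    print(phonemes_by_cue(CUE_FOR)["shared_burst"])
```

Because the context (here, the following vowel) is part of the key, no table indexed by phoneme alone or by cue alone can reproduce these entries.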

Lack of invariance stems primarily from coarticulation. In contrast to how things seem auditorily, how a speaker articulates a given phoneme depends upon what precedes or follows that phoneme. Being followed by /i/ rather than /u/ affects how one pronounces /d/, and being preceded by /d/ affects the vowel. When pronouncing 'dab', the effects of pronouncing both /d/ and /b/ are evident in the acoustic signature of /æ/. The articulatory consequences of phonemic context change the acoustic features of the signal and confound attempts to map phonemes to signals (which presents a difficulty for artificial speech production and recognition). Furthermore, due to coarticulation, the signal lacks the clear segmentation of categorically perceived phonemes, which have been likened to beads on a string (Bloomfield 1933). In effect, speakers pronounce two or more phonemes at a time, and transitions are fluid rather than discrete (see, e.g., Liberman 1970, 309, fig. 5, Diehl et al. 2004).

One response to this is to search for more complex acoustic structures that correspond to perceived phonemes (see, e.g., Blumstein and Stevens 1981, Diehl et al. 2004, Holt and Lotto 2008 for the general auditory approach).

Another approach appeals to aspects of the gestures used to pronounce phonemes (ways of moving one's throat and mouth and tongue), which are reasonably invariant across contexts. For instance, pronouncing /d/ involves placing the tip of the tongue on the alveolar ridge directly behind the teeth. The alveolar consonants /d/ and /t/ differ from each other in voicing: /d/ is voiced, or accompanied by vocal fold vibration, while /t/ is not. Whether you say /di/ or /du/, your tongue touches the alveolar ridge and you voice the consonant. But, while you articulate the gestures associated with /d/, you anticipate and begin to articulate those associated with /i/ or /u/. This alters the overall acoustic signature of the gestures associated with /d/. Gestures, rather than the complex acoustic signals they produce, on this view make intelligible the perceptual individuation of phonemes. Some therefore hold that perceiving phonemes involves recovering information about articulatory gestures from the acoustic signal. The motor theory (Liberman et al. 1967, Liberman and Mattingly 1985) and direct realism (Fowler 1986) are very different versions of this approach. Articulatory gestures thus make plausible candidates for objects of phoneme perception. They are, however, imperfect candidates, since they do not entirely escape worries about the context dependence and lack of discrete segmentation stemming from fluid coarticulation (Appelbaum 1996, Remez and Trout 2009).

Nonetheless, the claim that articulatory gestures are the objects of phoneme perception is supported by the surprising finding that visual processes affect the auditory experience of speech. For instance, the McGurk effect includes one instance in which seeing video of a speaker pronouncing /ga/ dubbed with audio of /ba/ leads to hearing as of the /da/ phoneme (McGurk and Macdonald 1976). If perceiving speech involves perceiving gestures, it is not surprising that the visual evidence for articulatory gestures should be weighed against auditory evidence.

Some researchers who hold that intended or actual gestures are the best candidates for the objects of phoneme perception argue that speech perception therefore is special. That is, speech perception's objects differ in kind from the sounds and acoustic structures we hear in general audition (Liberman et al. 1967, Liberman and Mattingly 1985). Liberman and Mattingly (1985), furthermore, use the claim that speech perception has distinctive objects to motivate the claim that it therefore involves distinctive perceptual processes. They even argue that although speech perception shares an end organ with auditory perception, it constitutes a functionally distinct modular perceptual system (Liberman and Mattingly 1985, 7-10, 27-30, see also 1989). Part of the motivation for their motor theory of speech perception, against auditory theories, is to integrate explanations of speech perception and speech production (1985, 23-5, 30-1, see also Matthen 2005, ch 9, which uses the Motor Theory to support a Codependency Thesis linking the capacities to perceive and produce phonemes, 221). On this account, a single modular system is responsible for both the production and perception of speech. This purported link between capacities for production and perception suggests that humans are unique in possessing a speech perception system. Humans, but not other creatures, are capable of discerning speech for many of the same reasons they are capable of producing the articulatory gestures that correspond to perceived phonemes. Other animals presumably hear just sounds (Liberman et al. 1967, Liberman and Mattingly 1985).

One might accept that perceived phonemes should be identified with articulatory gestures but reject that this makes speech special (see, e.g., Fowler 1986, Mole 2009). If auditory perception generally implicates environmental happenings or sound sources, then the gestures and activities associated with speech production are not entirely distinctive among objects of audition. If hearing even sounds is not merely a matter of hearing features of acoustic signals or structures, and if it is part of the function of auditory perception to furnish information about distal events on the basis of their audible characteristics, then speech is not entirely unique among things we hear (see also Rosenblum 2004).

The processes associated with speech perception therefore need not be understood as entirely distinct in function or in kind from those devoted to general audition, as Liberman and Mattingly contend. Given this, it is not surprising to learn that good evidence suggests humans are not special in possessing the capacity to perceptually individuate the sounds of speech (see, e.g., Lotto et al. 1997 for details).

Still, the processes associated with speech need not be entirely continuous with those of general audition. The claim that speech perception is not wholly distinct from general audition is compatible with higher acuity or sensitivity for speech sounds, and it allows for special selectivity for speech sounds. Even if hearing speech marshals perceptual resources continuous with those devoted to hearing other sounds and events in one's environment, it would be very surprising to discover that there were not processes and resources devoted to the perception of speech. Research in fact supports a special status for speech among the things we auditorily perceive. First, evidence suggests that human neonates prefer sounds of speech to non-speech (Vouloumanos and Werker 2007). Second, adults are able to distinguish speech from non-speech based on visual cues alone (Soto-Faraco et al. 2007). Third, infants can detect and distinguish different languages auditorily (Mehler et al. 1988, Bosch et al. 1997). Finally, infants aged approximately 4-6 months can detect, based on visual cues alone, when a speaker changes from one language to another, though all but those in bilingual households lose that ability by roughly 8 months (Weikum et al. 2007).

To review, no obvious acoustic correlates exist for phonetic segments heard in speech. Complex acoustic cues therefore must trigger perceptual experiences of phonemes. Articulatory gestures, however, are good (though imperfect) candidates for objects of speech perception. This does not imply that speech perception involves entirely different kinds of objects or processes from ordinary non-linguistic audition, nor does it imply that speech perception is a uniquely human capacity. Nevertheless, speech clearly is special for humans, in that we have special sensitivity for speech sounds. Speech perception promises to reward future philosophical attention.

Copyright © 2009 by
Casey O'Callaghan <casey.ocallaghan@rice.edu>
