Quantal theory of speech

Updated on Jan 31, 2026

The quantal theory of speech is a phonetic answer to one of the fundamental questions of phonology, specifically: if each language community is free to arbitrarily select a system of phonemes or segments, then why are the phoneme inventories of different languages so similar? For example, almost all languages have the stop consonants /p/, /t/, /k/, and almost all have the vowels /a/, /i/, and /u/. Other phonemes differ considerably among languages, but not nearly as much as they would if each language were free to choose arbitrarily.

Proposed by Ken Stevens at MIT, quantal theory formalizes the intuition that some speech sounds are easier to produce than others. Sounds that are easier to reliably produce, in the formal way described below, are more common among the languages of the world; those that are harder to reliably produce are less common.

The Quantal Nature of Speech

Let Y = f(X), where X is any particular articulatory parameter (tongue tip position, for example), and Y is any particular perceptual parameter (perceived frequency of the peak in the acoustic spectrum, for example). Like any nonlinear relation, f(X) has regions of low slope (|df/dX| small) and regions of high slope (|df/dX| large). Values of Y drawn from a high-slope region are unstable, in the sense that a small change in X causes a large change in Y; values of Y drawn from a low-slope region are conversely stable, in that they are little perturbed even by large changes in X. Stevens proposed in 1968 that the stability of low-slope regions makes them more likely to be chosen as discrete linguistic units (phonemes) by the languages of the world, and that the distinction between any pair of phonemes similarly tends to fall across an unstable high-slope boundary region. The sections that follow give examples from consonants and vowels.
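The low-slope versus high-slope distinction can be illustrated with a minimal sketch. The logistic curve used for f(X), the sampling grid, and the slope threshold of 0.5 are all assumptions chosen for illustration; none of them comes from Stevens's work or models a real articulator.

```python
import math

def f(x):
    """Hypothetical articulatory-to-perceptual mapping: a steep logistic curve.

    Chosen only because it has two flat plateaus joined by a steep
    transition; it is not a model of any real articulator.
    """
    return 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))

def slope(func, x, h=1e-5):
    """Central-difference estimate of df/dX at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

def classify_regions(func, xs, threshold=0.5):
    """Label each X as 'stable' (|df/dX| < threshold) or 'unstable'."""
    return ["stable" if abs(slope(func, x)) < threshold else "unstable"
            for x in xs]

xs = [i / 10 for i in range(11)]  # X sampled from 0.0 to 1.0 (arbitrary units)
labels = classify_regions(f, xs)
for x, label in zip(xs, labels):
    print(f"X={x:.1f}  Y={f(x):.3f}  {label}")
```

Under these assumptions the two plateaus at either end of the range come out as stable and the transition between them as unstable, which is the pattern the theory associates with a pair of contrasting phonemes separated by a boundary.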

Consonant Place of Articulation

Alveolar versus palatal. The hard palate is horizontal for up to 1 cm behind the teeth, before suddenly opening upward in a feature known as the alveolar ridge. By moving the tongue a few millimeters in front of or behind the alveolar ridge, therefore, it is possible to change the acoustic spectrum dramatically, resulting in the distinction between "sip" and "ship".

Palatal versus retroflex. The tongue is flexible about 1.5 cm behind its tip, permitting the tip to fold back on itself. If the tongue tip is close to the palate when this action is performed, the air cavity under the tongue is suddenly lengthened from 2.5 cm to 4 cm, resulting in the change from "chip" to "trip," or from "you" to "rue."

Consonant Manner

Plosive versus fricative versus glide. Producing turbulence in the vocal tract requires a very careful adjustment: the minimum constriction cross-section must typically be less than 1.5 mm, but greater than 0 mm. If the tongue (for example) closes all the way against the palate, then releases again, the result is a plosive (as in "tip"). If the tongue closes most of the way, but the constriction stays wider than the 1.5 mm boundary, the result is a glide (as in "yip"). If the tongue reaches a minimum constriction width between 0 and 1.5 mm, the resulting sound is a fricative (as in "ship"). Despite the high degree of control required, most languages maintain a three-way contrast between glides, fricatives, and plosives, because of the large acoustic difference so achieved.
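The three-way split described above can be written as a simple threshold rule. This is only a sketch: the function name and parameter names are hypothetical, the 1.5 mm limit is the typical figure quoted in the text, and real speech depends on airflow and other factors that a single width cannot capture.

```python
def manner_of_articulation(min_constriction_mm, turbulence_limit_mm=1.5):
    """Map a minimum constriction width (mm) to a manner class.

    Assumes the simplified rule from the text:
      0 mm (full closure)           -> plosive
      between 0 and the limit       -> fricative (turbulent airflow)
      at or above the limit         -> glide (no turbulence)
    """
    if min_constriction_mm < 0:
        raise ValueError("constriction width cannot be negative")
    if min_constriction_mm == 0:
        return "plosive"
    if min_constriction_mm < turbulence_limit_mm:
        return "fricative"
    return "glide"

for width in (0.0, 0.8, 3.0):  # illustrative widths, not measurements
    print(f"{width} mm -> {manner_of_articulation(width)}")
```

The narrow fricative band between two wide neighbours is what makes the adjustment demanding: a small error in either direction lands in a different category.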

Plosive versus nasal. If the passage between the mouth and nose is opened by even 1 mm during the /b/ closure of "bug," the word becomes "mug." Further opening of the soft palate (2 mm, 5 mm, even 20 mm) has almost no effect on the acoustics; most languages distinguish /b/ from /m/, but few, if any, distinguish different degrees of soft palate opening.

Strident versus nonstrident. When a fricative is produced, the turbulent jet of air can either be directed against an obstacle (e.g., in the word "sin," the jet strikes the lower teeth), or directed straight out of the mouth (as in the word "thin"). A jet directed against an obstacle makes much more noise (sound power is typically ten times greater), so many languages use this distinction to enhance an otherwise tiny place of articulation difference.
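To put the tenfold figure on the scale usually used for sound levels, a power ratio can be converted to decibels with the standard formula 10 log10(ratio). The helper name below is hypothetical; only the factor of ten comes from the text.

```python
import math

def power_ratio_to_db(power_ratio):
    """Convert a ratio of two sound powers to a level difference in dB."""
    if power_ratio <= 0:
        raise ValueError("power ratio must be positive")
    return 10.0 * math.log10(power_ratio)

# "Typically ten times greater" sound power for a strident fricative.
strident_gain_db = power_ratio_to_db(10.0)
print(f"{strident_gain_db:.1f} dB")  # prints "10.0 dB"
```

A tenfold increase in power therefore corresponds to a 10 dB level difference between the strident and nonstrident fricative.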

Vowels

Lehiste demonstrated that when the peak frequencies in a vowel spectrum (the so-called "formants") are closer together than about half an octave, listeners respond as if the two peaks were merged into a single peak. Many vowel distinctions straddle this half-octave threshold, e.g., the first two formants of "bought" are closer than half an octave, while those of "but" are not; the second and third formants of "bit" are closer than half an octave, while those of "bet" are not.
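The half-octave criterion is easy to check numerically, since half an octave corresponds to a frequency ratio of the square root of 2 (about 1.41). The sketch below assumes the threshold is applied as a strict cutoff on the log2 frequency ratio; the function names are hypothetical and the example frequencies are arbitrary illustrative values, not measured formants of any vowel.

```python
import math

HALF_OCTAVE = 0.5  # threshold from the text, expressed in octaves

def octave_distance(freq_a_hz, freq_b_hz):
    """Distance between two spectral peaks in octaves (log2 of the ratio)."""
    if freq_a_hz <= 0 or freq_b_hz <= 0:
        raise ValueError("frequencies must be positive")
    return abs(math.log2(freq_b_hz / freq_a_hz))

def peaks_merge(freq_a_hz, freq_b_hz, threshold=HALF_OCTAVE):
    """True if the two peaks are closer than the threshold, so that
    listeners would be expected to treat them as a single peak."""
    return octave_distance(freq_a_hz, freq_b_hz) < threshold

# Arbitrary illustrative peak frequencies (Hz), not vowel measurements.
print(peaks_merge(600.0, 800.0))   # ratio 1.33, under half an octave
print(peaks_merge(600.0, 1200.0))  # ratio 2.0, a full octave apart
```

Two vowels whose formant pairs fall on opposite sides of this cutoff would, on this account, be separated by a perceptual boundary rather than by a gradual difference.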

Enhancement Features

Quantal theory is supported by a theory of language change, developed in collaboration with Jay Keyser, which postulates the existence of redundant or enhancement features.

It is quite common, in language, to find a pair of phonemes that differ in two features simultaneously. In English, for example, "thin" and "sin" differ both in the place of articulation of the fricative (teeth versus alveolar ridge) and in its loudness (nonstrident versus strident). Similarly, "tell" and "dell" differ both in the voicing of the initial consonant and in its aspiration (the /t/ in "tell" is immediately followed by a puff of air, like a short /h/ between the plosive and the vowel). In many cases, native speakers have strong but mistaken intuitions about the relative importance of the two distinctions, e.g., speakers of English believe that "thin" versus "sin" is a place of articulation difference, even though the loudness difference is more perceptible. Stevens, Keyser and Kawasaki proposed that such redundant features evolve as an enhancement of an otherwise weak acoustic distinction, in order to improve the robustness of the language's phonological system.

References

Quantal theory of speech Wikipedia

(Text) CC BY-SA