In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.
In order to state the theorem, a few notions from algebra and formal language theory are needed.
A power series over N is an infinite series of the form
f = f ( x ) = ∑ k = 0 ∞ a k x k = a 0 + a 1 x 1 + a 2 x 2 + a 3 x 3 + ⋯ with coefficients a k in N . The multiplication of two formal power series f and g is defined in the expected way as the convolution of the sequences a n and b n :
f ( x ) ⋅ g ( x ) = ∑ k = 0 ∞ ( ∑ i = 0 k a i b k − i ) x k . In particular, we write f 2 = f ( x ) ⋅ f ( x ) , f 3 = f ( x ) ⋅ f ( x ) ⋅ f ( x ) , and so on. In analogy to algebraic numbers, a power series f ( x ) is called algebraic over Q ( x ) , if there exists a finite set of polynomials p 0 ( x ) , p 1 ( x ) , p 2 ( x ) , … , p n ( x ) each with rational coefficients such that
p 0 ( x ) + p 1 ( x ) ⋅ f + p 2 ( x ) ⋅ f 2 + ⋯ + p n ( x ) ⋅ f n = 0. A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.
Chomsky–Schützenberger theorem. If
L is a
context-free language admitting an unambiguous context-free grammar, and
a k := | L ∩ Σ k | is the number of words of length
k in
L , then
G ( x ) = ∑ k = 0 ∞ a k x k is a power series over
N that is algebraic over
Q ( x ) .
Proofs of this theorem are given by Kuich & Salomaa (1985), and by Panholzer (2005).
The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Gruber, Lee & Shallit (2012): the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules
S →
M |
UM → 0
M1
M |
εU → 0
S | 0
M1
U.
To obtain an algebraic representation of the power series G ( x ) associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:
S =
M +
UM =
M²
x² + 1
U =
Sx +
MUx²
In this system of equations, S, M, and U are functions of x, so one could also write S ( x ) , M ( x ) , and U ( x ) . The equation system can be resolved after S, resulting in a single algebraic equation:
x ( 2 x − 1 ) S 2 + ( 2 x − 1 ) S + 1 = 0 .
This quadratic equation has two solutions for S, one of which is the algebraic power series G ( x ) . By applying methods from complex analysis to this equation, the number a n of words of length n generated by G can be estimated, as n grows large. In this case, one obtains a n ∈ O ( 2 + ϵ ) n but a n ∉ O ( 2 − ϵ ) n for each ϵ > 0 . See (Gruber, Lee & Shallit 2012) for a detailed exposition.
In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language L G over the alphabet { a , b } consists of the words a n 1 b a n 2 b ⋯ a n p b with p ≥ 1 , n i > 0 for i ∈ { 1 , 2 , … , p } , and n j ≠ j for some j ∈ { 1 , 2 , … , p } .
It is comparably easy to show that the language L G is context-free (Berstel & Boasson 1990). The harder part is to show that there does not exist an unambiguous grammar that generates L G . This can be proved as follows: If g k denotes the number of words of length k in L G , then for the associated power series holds G ( x ) = ∑ k = 0 ∞ g k x k = 1 − x 1 − 2 x − 1 x ∑ k ≥ 1 x k ( k + 1 ) / 2 − 1 . Using methods from complex analysis, one can prove that this function is not algebraic over Q ( x ) . By the Chomsky-Schützenberger theorem, one can conclude that L G does not admit an unambiguous context-free grammar. See (Berstel & Boasson 1990) for detailed account.