Lecture 24: Pumping lemma

Reading: Pass and Tseng, Limits of Automata, MCS 15.8 The pigeonhole principle
last semester's notes
Proof that \(\{0^n1^n \mid n \in \mathbb{N}\}\) is unrecognizable
Pumping lemma
Review exercises:
- use the pumping lemma to prove that the set of strings of balanced parentheses is not recognizable
- prove the pumping lemma

A non-recognizable set

Let \(L = \{0^k1^k \mid k \in \mathbb{N}\} = \{ε, 01, 0011, 000111, \dots\}\).

Claim: \(L\) is not recognizable.

Proof: by contradiction. Suppose \(L\) were recognizable. Then there is some DFA \(M\) with \(L = L(M)\). Let \(n\) be the number of states of \(M\), and let \(x = 0^n1^n\). Clearly \(x \in L\), so \(M\) must accept \(x\).

Let's consider what happens while \(M\) is processing \(x\). While processing the first \(n\) characters, \(M\) must pass through \(n+1\) states \(q_0, q_1, \dots, q_n\). Since there are only \(n\) states to choose from, the pigeonhole principle says that two of these states must be the same: there is a loop; \(q_i = q_j\) for some \(i \lt j \leq n\).

Let \(u\) be the part of \(x\) that transitions from \(q_0\) to \(q_i\), let \(v\) be the part that transitions from \(q_i\) to \(q_j\), and let \(w\) be the rest of \(x\), which transitions from \(q_j\) to the state \(M\) ends in after reading all of \(x\) (which, remember, is a final state, since \(M\) accepts \(x\)). Note that since the loop happens within the first \(n\) characters, \(u\) and \(v\) can consist only of 0's.

Now consider what happens if we plug the string \(uvvw\) into \(M\). \(M\) will transition to \(q_i\) on \(u\), and then go around the loop twice, ending up back at \(q_i = q_j\). It will then process \(w\), which takes it from \(q_j\) to the same final state as before, so \(uvvw\) is accepted. Therefore \(uvvw \in L(M)\).

However, since \(i \lt j\), \(v\) consists of one or more 0's, so \(uvvw\) has more 0's than 1's, and hence \(uvvw \notin L\). This contradicts the assumption that \(L(M) = L\), completing the proof.

The pumping lemma

This same argument can be applied to many languages, and can be generalized into the so-called "pumping lemma":

Claim (pumping lemma): If \(L\) is a DFA-recognizable language, then there exists some \(n\) (often called the pumping length), such that for all \(x \in L\) with \(len(x) \geq n\), there exist strings \(u\), \(v\), and \(w\) such that

\(x = uvw\),
\(len(uv) \leq n\),
\(len(v) > 0\), and
for all \(k \geq 0\), \(uv^kw \in L\).

The proof is just like the proof above; we give it below.
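It can help to write out the order of the quantifiers explicitly, since that order determines who gets to choose what in a proof. The display below is only a restatement of the claim above in symbols, followed by its contrapositive; no new assumptions are introduced.

\[
\exists n \;\; \forall x \in L \text{ with } len(x) \geq n \;\; \exists u, v, w \;\; \bigl( x = uvw \;\wedge\; len(uv) \leq n \;\wedge\; len(v) > 0 \;\wedge\; \forall k \geq 0,\; uv^kw \in L \bigr)
\]

\[
\Bigl( \forall n \;\; \exists x \in L \text{ with } len(x) \geq n \;\; \forall u, v, w \;\; \bigl( x = uvw \;\wedge\; len(uv) \leq n \;\wedge\; len(v) > 0 \bigr) \Rightarrow \exists k \geq 0,\; uv^kw \notin L \Bigr) \;\Longrightarrow\; L \text{ is not DFA-recognizable}
\]

Read as a recipe: we do not get to pick \(n\) or the split \(u, v, w\), but we do get to pick \(x\) (depending on \(n\)) and \(k\) (depending on the split).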

This lemma is used to prove that languages are not DFA-recognizable. For example, we can use it to rewrite the proof above:

Claim: \(L = \{0^n1^n \mid n \in \mathbb{N}\}\) is not DFA-recognizable.

Proof: by contradiction, assume that \(L\) is DFA-recognizable. Then there exists some \(n\) as in the pumping lemma. Let \(x = 0^n1^n\). Clearly \(x \in L\) and \(len(x) \geq n\), so we can write \(x\) as \(uvw\) as in the pumping lemma. Since \(len(uv) \leq n\), \(v\) can only consist of 0's (the first \(n\) characters of \(x\) are 0's). It must have at least one 0, since \(len(v) > 0\). The pumping lemma tells us that \(uv^2w \in L\), but this is a contradiction, because \(uv^2w\) has more 0's than 1's. Therefore \(L\) is not DFA-recognizable.
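As a sanity check on the proof, the following sketch enumerates, for a few small values of \(n\), every split \(x = uvw\) that the pumping lemma allows and confirms that \(uv^2w\) always falls outside \(L\). The function names (`in_L`, `decompositions`, `pumping_fails_everywhere`) are made up for this illustration. Checking finitely many \(n\) is of course not a proof; it only illustrates the argument.

```python
def in_L(s: str) -> bool:
    """Return True if s has the form 0^k 1^k for some k >= 0."""
    k = len(s) // 2
    return s == "0" * k + "1" * k


def decompositions(x: str, n: int):
    """Yield every (u, v, w) with x = uvw, len(uv) <= n, and len(v) > 0."""
    for i in range(0, n + 1):            # i = len(u)
        for j in range(i + 1, n + 1):    # j = len(uv), so len(v) = j - i > 0
            if j <= len(x):
                yield x[:i], x[i:j], x[j:]


def pumping_fails_everywhere(n: int) -> bool:
    """Check that no allowed split of 0^n 1^n survives pumping with k = 2."""
    x = "0" * n + "1" * n
    return all(not in_L(u + v * 2 + w) for u, v, w in decompositions(x, n))
```

For instance, `pumping_fails_everywhere(3)` looks at the six allowed splits of 000111; in each one \(v\) lies inside the leading block of 0's, which is exactly the observation the proof relies on.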

Here is another example:

Claim: Let \(L\) be the set of strings of digits and the symbols \(+\) and \(=\) that represent equations that are true. For example, "\(1+1=2\)" is in \(L\), while "\(3+5=9\)" is not. \(L\) is not DFA-recognizable.

Proof: by contradiction, assume that \(L\) is DFA-recognizable. Then there exists some \(n\) as in the pumping lemma. Let \(x = "1^n+0=1^n"\), where \(1^n\) denotes the digit 1 repeated \(n\) times (for \(n = 3\), \(x\) is "\(111+0=111\)"). Clearly \(x \in L\) and \(len(x) \geq n\), so we can write \(x\) as \(uvw\) as in the pumping lemma. Since \(len(uv) \leq n\), \(v\) can only consist of 1's (the first \(n\) characters of \(x\) are 1's). It must have at least one 1, since \(len(v) > 0\). The pumping lemma tells us that \(uv^0w = uw \in L\), but this is a contradiction, because \(uw\) has a smaller number on the left hand side of the equation than on the right hand side (or, if \(v\) is all \(n\) of the 1's, no first number at all), and therefore is not in \(L\). Thus, \(L\) is not DFA-recognizable.
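The same kind of finite check can be run for this language. The notes do not pin down the exact syntax of an equation, so the sketch below assumes one: exactly one \(=\), with each side a sum of one or more nonempty digit strings. The names `is_true_equation` and `every_split_breaks` are hypothetical and exist only for this illustration.

```python
DIGITS = "0123456789"


def is_true_equation(s: str) -> bool:
    """Assumed syntax: sum = sum, where a sum is digit strings joined by '+'."""
    sides = s.split("=")
    if len(sides) != 2:
        return False
    totals = []
    for side in sides:
        terms = side.split("+")
        # Every term must be a nonempty string of decimal digits.
        if not all(t != "" and all(c in DIGITS for c in t) for t in terms):
            return False
        totals.append(sum(int(t) for t in terms))
    return totals[0] == totals[1]


def every_split_breaks(n: int) -> bool:
    """Check that x = 1^n+0=1^n is in L but uw is not, for every allowed split."""
    x = "1" * n + "+0=" + "1" * n
    if not is_true_equation(x):
        return False
    for i in range(n + 1):             # i = len(u)
        for j in range(i + 1, n + 1):  # j = len(uv), so v is nonempty
            u, w = x[:i], x[j:]
            if is_true_equation(u + w):
                return False
    return True
```

Under this assumed syntax, deleting \(v\) either shrinks the left-hand number or removes it entirely (leaving a string that starts with \(+\)), and in both cases `is_true_equation` rejects the result.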

Proof of the pumping lemma: This proof is almost the same as the special case given above. Assume \(L\) is DFA-recognizable. Then there is some machine \(M\) that recognizes \(L\). Let \(n\) be the number of states of \(M\). Now, if \(x\) is an arbitrary string in \(L\) with length greater than or equal to \(n\), then while processing the first \(n\) characters, \(M\) passes through \(n+1\) states, so by the pigeonhole principle it must visit some state \(q\) at least twice.

Let \(u\) be the portion of \(x\) that transitions \(M\) from the start state to \(q\). Let \(v\) be the portion of \(x\) that transitions from \(q\) back to \(q\), and let \(w\) be the remainder of \(x\); \(w\) transitions \(M\) from \(q\) to some final state (since \(x \in L\), \(\hat{δ}(q_0,uvw)\) must be a final state).

Clearly \(x = uvw\). \(len(uv) \leq n\) since the loop must occur within the first \(n\) characters of \(x\). \(len(v) \gt 0\) because otherwise the loop is not a loop. Finally, while processing \(uv^kw\), \(M\) transitions to \(q\) on \(u\), then back to \(q\) on each iteration of \(v\), and finally from \(q\) to an accepting state on \(w\), and thus \(M\) accepts \(uv^kw\). Therefore \(uv^kw \in L(M) = L\), completing the proof.
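The proof is constructive, and the construction can be sketched in code. The example below assumes a DFA is given as a transition dictionary keyed by (state, character); the particular machine (`DELTA`, `START`, `FINAL`, accepting strings with an even number of 1's) and the helper names are hypothetical choices made for illustration, not part of the notes.

```python
# Hypothetical two-state DFA over {0, 1}: accepts strings with an even number of 1's.
DELTA = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
START = "even"
FINAL = {"even"}


def run(delta, start, s):
    """Return the states visited while reading s: q_0, q_1, ..., q_len(s)."""
    states = [start]
    for c in s:
        states.append(delta[(states[-1], c)])
    return states


def accepts(delta, start, final, s):
    """Return True if the machine ends in a final state after reading s."""
    return run(delta, start, s)[-1] in final


def pump_split(delta, start, x):
    """Split x into (u, v, w) around the first state that is visited twice."""
    seen = {}  # state -> position at which it was first visited
    for pos, q in enumerate(run(delta, start, x)):
        if q in seen:
            i, j = seen[q], pos  # q_i == q_j with i < j
            return x[:i], x[i:j], x[j:]
        seen[q] = pos
    raise ValueError("no repeated state: x is shorter than the number of states")
```

For example, `pump_split(DELTA, START, "1001")` returns a split whose middle part \(v\) labels a loop in the machine, so `accepts(DELTA, START, FINAL, u + v * k + w)` holds for every \(k\) one cares to try. Because the first repeat must occur within the first \(n\) characters, the split also satisfies \(len(uv) \leq n\).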