# Lecture 24: Pumping lemma

• Reading: Pass and Tseng, Limits of Automata, MCS 15.8 The pigeonhole principle

• Previous semester's notes

• Finish closure under union

• Pumping lemma

• Review exercises:
• prove that the intersection of DFA-recognizable sets is recognizable. It's a good exercise to do it using only the hints in the "Looking back at the proof" section below.
• use the pumping lemma to prove that the set of strings of balanced parentheses is not recognizable
• prove the pumping lemma

## Closure under union

Claim: If $$L_1$$ and $$L_2$$ are DFA-recognizable, then so is $$L_1 \cup L_2$$.

Proof: Since $$L_1$$ and $$L_2$$ are recognizable, there are machines $$M_1 = (Q_1, Σ, δ_1, q_{01}, F_1)$$ and $$M_2 = (Q_2, Σ, δ_2, q_{02}, F_2)$$ that recognize them. We want to construct a machine $$M$$ that recognizes $$L_1 \cup L_2$$.

What would such a machine need to know while processing $$x$$? If it knew what states $$M_1$$ and $$M_2$$ were in, it would know whether to accept or not. So this suggests that a state of $$M$$ should correspond to a pair of states, one from $$M_1$$ and one from $$M_2$$. This is the construction we use.

Let $$M = (Q_1 \times Q_2, Σ, δ, q_0, F)$$, where $$δ$$, $$q_0$$ and $$F$$ are defined as follows.

To define $$δ$$, we first note the domain and codomain: $$δ : (Q_1 \times Q_2) \times Σ → (Q_1 \times Q_2)$$. We want $$M$$ to be in state $$(q_1, q_2)$$ if $$M_1$$ is in state $$q_1$$ and $$M_2$$ is in state $$q_2$$. If we then see another character $$a$$, we would want to step $$M_1$$ to $$δ_1(q_1,a)$$ and $$M_2$$ to $$δ_2(q_2,a)$$. This suggests the following definition: $δ((q_1,q_2), a) ::= (δ_1(q_1,a), δ_2(q_2, a))$

Where should $$M$$ start? If we process the empty string, $$M_1$$ would be in state $$q_{01}$$, and $$M_2$$ would be in state $$q_{02}$$, so let's choose $$q_0 = (q_{01},q_{02})$$.

What about the final states? We want $$M$$ to accept if either $$M_1$$ would or $$M_2$$ would. That suggests $F = \{(q_1,q_2) \mid q_1 \in F_1\text{ or } q_2 \in F_2\}$
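The construction can be carried out mechanically. Here is a minimal sketch in Python; the encoding of a DFA as a tuple (states, alphabet, transition dict, start, finals) and the two example machines are our own assumptions, chosen only to exercise the code:

```python
# A sketch of the product construction. We assume a DFA is a tuple
# (states, alphabet, delta, start, finals), with delta a dict mapping
# (state, character) to a state.

def product_union(M1, M2):
    """Build the product DFA M with L(M) = L(M1) ∪ L(M2)."""
    Q1, sigma, d1, q01, F1 = M1
    Q2, _, d2, q02, F2 = M2
    states = [(q1, q2) for q1 in Q1 for q2 in Q2]
    # delta((q1, q2), a) = (delta1(q1, a), delta2(q2, a))
    delta = {((q1, q2), a): (d1[(q1, a)], d2[(q2, a)])
             for (q1, q2) in states for a in sigma}
    start = (q01, q02)
    # accept if either component machine would accept
    finals = {(q1, q2) for (q1, q2) in states if q1 in F1 or q2 in F2}
    return (states, sigma, delta, start, finals)

def accepts(M, x):
    """Run M on x and report whether it ends in an accepting state."""
    _, _, delta, q, finals = M
    for a in x:
        q = delta[(q, a)]
    return q in finals

# Hypothetical examples: M1 accepts strings with an even number of 0's,
# M2 accepts strings ending in 1.
M1 = ({'e', 'o'}, '01',
      {('e', '0'): 'o', ('o', '0'): 'e', ('e', '1'): 'e', ('o', '1'): 'o'},
      'e', {'e'})
M2 = ({'n', 'y'}, '01',
      {('n', '0'): 'n', ('y', '0'): 'n', ('n', '1'): 'y', ('y', '1'): 'y'},
      'n', {'y'})
M = product_union(M1, M2)
```

The `states` list is not needed to run the machine, but it matches the tuple $$(Q_1 \times Q_2, Σ, δ, q_0, F)$$ in the proof.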

We now want to show that $$L(M) = L(M_1) \cup L(M_2)$$. We start by showing that $$M$$ works properly. Let's write down a specification and prove it.

Let $$P(x)$$ be the statement $$\hat{δ}(q_0,x) = (\hat{δ}_1(q_{01},x), \hat{δ}_2(q_{02}, x))$$ (informally, $$M$$ correctly simulates $$M_1$$ and $$M_2$$). We will prove $$∀x, P(x)$$ by induction on $$x$$.

To see $$P(ε)$$, note that $$\hat{δ}(q_0, ε) = q_0 = (q_{01},q_{02})$$ by definition of $$\hat{δ}$$ and $$q_0$$. On the other side, we have $$(\hat{δ}_1(q_{01},ε), \hat{δ}_2(q_{02}, ε)) = (q_{01}, q_{02})$$ by definition of $$\hat{δ}_1$$ and $$\hat{δ}_2$$. Since these are the same, we are done.

To see $$P(xa)$$, first inductively assume $$P(x)$$. We compute

$$\begin{aligned} \hat{δ}(q_0, xa) &= δ(\hat{δ}(q_0, x), a) && \text{by definition of } \hat{δ} \\ &= δ(\hat{δ}((q_{01}, q_{02}), x), a) && \text{by definition of } q_0 \\ &= δ\left(\left(\hat{δ}_1(q_{01}, x), \hat{δ}_2(q_{02}, x)\right), a\right) && \text{by } P(x) \\ &= (δ_1(\hat{δ}_1(q_{01}, x), a), δ_2(\hat{δ}_2(q_{02}, x), a)) && \text{by definition of } δ \\ &= (\hat{δ}_1(q_{01},xa), \hat{δ}_2(q_{02},xa)) && \text{by definition of } \hat{δ}_1 \text{ and } \hat{δ}_2 \end{aligned}$$
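The two defining clauses of the extended transition function $$\hat{δ}$$ drive both cases of this induction. As a sketch (the dict-based encoding of $$δ$$ is our own assumption), the function can be written so that its clauses mirror the base case and inductive step:

```python
# delta_hat(delta, q, x): the extended transition function. `delta` is
# assumed to be a dict mapping (state, character) to a state.
def delta_hat(delta, q, x):
    if x == '':                          # delta_hat(q, ε) = q
        return q
    x_prefix, a = x[:-1], x[-1]          # view the input as xa
    # delta_hat(q, xa) = delta(delta_hat(q, x), a)
    return delta[(delta_hat(delta, q, x_prefix), a)]

# Hypothetical example: a two-state DFA tracking the parity of the 1's seen.
parity = {('even', '0'): 'even', ('even', '1'): 'odd',
          ('odd', '0'): 'odd', ('odd', '1'): 'even'}
```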

Now that we know that $$M$$ simulates $$M_1$$ and $$M_2$$, it is easy to prove that $$L(M) = L(M_1) \cup L(M_2)$$. As with the rest of the proof, we just keep plugging in definitions:

$$\begin{aligned} L(M) &= \{x \mid \hat{δ}(q_0, x) \in F\} && \text{by definition of } L \\ &= \{x \mid (\hat{δ}_1(q_{01},x), \hat{δ}_2(q_{02}, x)) \in F\} && \text{by } P(x)\text{, which we just proved} \\ &= \{x \mid \hat{δ}_1(q_{01}, x) \in F_1 \text{ or } \hat{δ}_2(q_{02}, x) \in F_2\} && \text{by definition of } F \\ &= \{x \mid \hat{δ}_1(q_{01}, x) \in F_1\} \cup \{x \mid \hat{δ}_2(q_{02}, x) \in F_2\} && \text{by definition of } \cup \\ &= L(M_1) \cup L(M_2) && \text{by definition of } L \end{aligned}$$

### Looking back at the proof

This proof looks intimidating. It isn't. The summary is: build a machine that simulates $$M_1$$ and $$M_2$$, and use induction. Everything else is just plugging in definitions or inductive hypotheses.

## A non-recognizable set

Let $$L = \{0^k1^k \mid k \in \mathbb{N}\} = \{ε, 01, 0011, 000111, \dots\}$$.

Claim: $$L$$ is not recognizable.

Proof: by contradiction. Suppose $$L$$ were recognizable. Then there is some $$M$$ with $$L = L(M)$$. Let $$n$$ be the number of states of $$M$$, and let $$x = 0^n1^n$$. Clearly $$x \in L$$, so $$M$$ must accept $$x$$.

Let's consider what happens while $$M$$ is processing $$x$$. While processing the first $$n$$ characters, $$M$$ must pass through $$n+1$$ states $$q_0, q_1, \dots, q_n$$. Since there are only $$n$$ states to choose from, the pigeonhole principle tells us that two of these states must be the same: there is a loop; $$q_i = q_j$$ for some $$i \lt j \leq n$$.

Let $$u$$ be the part of $$x$$ that transitions $$M$$ from $$q_0$$ to $$q_i$$, let $$v$$ be the part that transitions from $$q_i$$ to $$q_j$$, and let $$w$$ be the rest of $$x$$, which transitions $$M$$ from $$q_j$$ to the state reached after processing all of $$x$$ (which, remember, is an accepting state, since $$M$$ accepts $$x$$). Note that since the loop happens within the first $$n$$ characters, $$u$$ and $$v$$ can consist only of 0's.

Now consider what happens if we feed the string $$uvvw$$ into $$M$$. $$M$$ will transition to $$q_i$$ on $$u$$, then go around the loop twice, ending up back at $$q_j$$ (recall $$q_i = q_j$$). It will then process $$w$$, which takes it from $$q_j$$ to the same accepting state as before. Therefore $$uvvw \in L(M)$$.

However, since $$v$$ consisted of one or more 0s, $$uvvw$$ has more 0's than 1's, so $$uvvw \notin L$$. This contradicts the assumption that $$L(M) = L$$, completing the proof.
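This argument can be run as an experiment: whatever transition table we pick, the loop found while reading $$0^n$$ yields a pumped string that the machine cannot distinguish from $$0^n1^n$$. A sketch, under the assumption that a DFA's transition function is a dict keyed by (state, character); the particular 3-state table at the end is arbitrary:

```python
def state_after(delta, q, x):
    """The state the machine reaches from q after reading x."""
    for a in x:
        q = delta[(q, a)]
    return q

def confusable_pair(delta, start, n):
    """For an n-state machine, return strings x in L and y not in L
    (L = {0^k 1^k}) that drive the machine to the same state."""
    trace = [start]                      # q_0, q_1, ..., q_n while reading 0^n
    for _ in range(n):
        trace.append(delta[(trace[-1], '0')])
    for i in range(n + 1):               # pigeonhole: some q_i = q_j, i < j
        for j in range(i + 1, n + 1):
            if trace[i] == trace[j]:
                x = '0' * n + '1' * n            # uvw, in L
                y = '0' * (n + j - i) + '1' * n  # uvvw: extra 0's, not in L
                return x, y

# An arbitrary 3-state transition table over {0, 1}.
delta = {(q, a): (q + (a == '0')) % 3 for q in range(3) for a in '01'}
x, y = confusable_pair(delta, 0, 3)
```

Because the machine ends in the same state on $$x$$ and $$y$$, it accepts both or rejects both; either way it fails to recognize $$L$$.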

## The pumping lemma

This same argument can be applied to many languages, and can be generalized into the so-called "pumping lemma":

Claim (pumping lemma): If $$L$$ is a DFA-recognizable language, then there exists some $$n$$ (often called the pumping length) such that for all $$x \in L$$ with $$len(x) \geq n$$, there exist strings $$u$$, $$v$$, and $$w$$ such that

1. $$x = uvw$$,
2. $$len(uv) \leq n$$,
3. $$len(v) > 0$$, and
4. for all $$k \geq 0$$, $$uv^kw \in L$$.

The proof is just like the proof above; we give it below.

This lemma is used to prove that languages are not DFA-recognizable. For example, we can use it to rewrite the proof above:

Claim: $$L = \{0^k1^k \mid k \in \mathbb{N}\}$$ is not DFA-recognizable.

Proof: by contradiction, assume that $$L$$ is DFA-recognizable. Then there exists some $$n$$ as in the pumping lemma. Let $$x = 0^n1^n$$. Clearly $$x \in L$$ and $$len(x) \geq n$$, so we can write $$x$$ as $$uvw$$ as in the pumping lemma. Since $$len(uv) \leq n$$, $$v$$ can only consist of 0's (the first $$n$$ characters of $$x$$ are 0's), and it must contain at least one 0, since $$len(v) > 0$$. The pumping lemma tells us that $$uv^2w \in L$$, but this is a contradiction, because $$uv^2w$$ has more 0's than 1's. Therefore $$L$$ is not DFA-recognizable.

Here is another example:

Claim: Let $$L$$ be the set of strings of digits and the symbols $$+$$ and $$=$$ that represent equations that are true. For example, "$$1+1=2$$" is in $$L$$, while "$$3+5=9$$" is not. $$L$$ is not recognizable.

Proof: by contradiction, assume that $$L$$ is DFA-recognizable. Then there exists some $$n$$ as in the pumping lemma. Let $$x = "1^n+0=1^n"$$ (that is, the numeral written with $$n$$ 1's, then "+0=", then the same numeral). Clearly $$x \in L$$ and $$len(x) \geq n$$, so we can write $$x$$ as $$uvw$$ as in the pumping lemma. Since $$len(uv) \leq n$$, $$v$$ can only consist of 1's (the first $$n$$ characters of $$x$$ are 1's), and it must contain at least one 1, since $$len(v) > 0$$. The pumping lemma tells us that $$uv^0w = uw \in L$$, but this is a contradiction: $$uw$$ has a smaller number on the left-hand side of the equation than on the right, and therefore is not in $$L$$. Thus, $$L$$ is not DFA-recognizable.
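The two strings in this argument can be checked mechanically. In the sketch below, the pumping length $$n = 5$$ and the choice of $$v$$ as the single leading 1 are illustrative assumptions; the lemma only guarantees that some nonempty $$v$$ of leading 1's exists:

```python
# A checker for the equation language: strings of digits, '+' and '='
# that state a true sum.
def is_true_equation(s):
    left, right = s.split('=')
    return sum(int(term) for term in left.split('+')) == int(right)

n = 5                                    # stands in for the pumping length
x = '1' * n + '+0=' + '1' * n            # "11111+0=11111", a true equation
uw = '1' * (n - 1) + '+0=' + '1' * n     # pump v = "1" out: left side shrinks
```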

Proof of the pumping lemma: This proof is almost the same as the special case given above. Assume $$L$$ is DFA-recognizable. Then there is some machine $$M$$ that recognizes $$L$$. Let $$n$$ be the number of states of $$M$$. Now, if $$x$$ is an arbitrary string in $$L$$ with length greater than or equal to $$n$$, then while processing the first $$n$$ characters, $$M$$ must visit some state $$q$$ at least twice (it visits $$n+1$$ states, but has only $$n$$ to choose from).

Let $$u$$ be the portion of $$x$$ that transitions $$M$$ from the start state to $$q$$. Let $$v$$ be the portion of $$x$$ that transitions from $$q$$ back to $$q$$, and let $$w$$ be the remainder of $$x$$; $$w$$ transitions $$M$$ from $$q$$ to some final state (since $$x \in L$$, $$\hat{δ}(q_0,uvw)$$ must be a final state).

Clearly $$x = uvw$$. $$len(uv) \leq n$$, since the loop must occur within the first $$n$$ characters of $$x$$. $$len(v) \gt 0$$, since the two visits to $$q$$ are separated by at least one character. Finally, while processing $$uv^kw$$, $$M$$ transitions to $$q$$ on $$u$$, then back to $$q$$ on each iteration of $$v$$, and finally from $$q$$ to an accepting state on $$w$$; thus $$M$$ accepts $$uv^kw$$. Therefore $$uv^kw \in L(M) = L$$, completing the proof.
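The proof is constructive: given the machine, we can actually compute a split satisfying conditions 1–4. A sketch, assuming a DFA given as a transition dict keyed by (state, character), a start state, and a set of accepting states; the two-state "even number of 1's" machine at the end is a hypothetical example:

```python
def pump_split(delta, start, n, x):
    """Split an accepted x with len(x) >= n into u, v, w as in the lemma."""
    seen = {start: 0}                    # first position at which each state occurred
    q = start
    for pos, a in enumerate(x[:n], start=1):
        q = delta[(q, a)]
        if q in seen:                    # pigeonhole: a repeat within the first n steps
            i, j = seen[q], pos
            return x[:i], x[i:j], x[j:]  # u, v (the loop), w
        seen[q] = pos
    raise AssertionError("n+1 states drawn from n states must contain a repeat")

def accepts(delta, start, finals, x):
    q = start
    for a in x:
        q = delta[(q, a)]
    return q in finals

# Hypothetical example: a 2-state machine accepting strings with an
# even number of 1's, so n = 2.
delta = {('e', '0'): 'e', ('e', '1'): 'o', ('o', '0'): 'o', ('o', '1'): 'e'}
u, v, w = pump_split(delta, 'e', 2, '0110')
```

Conditions 1–3 hold by construction, and condition 4 can be confirmed by running the machine on $$uv^kw$$ for several values of $$k$$.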