Lecture 26: Regular expressions

Regular expressions

Regular expressions are patterns that match certain strings. They give a way to define a language: the language of a regular expression is the set of all strings that match the pattern.

There are six ways to construct regular expressions. Formally, the set of regular expressions is formed by the following grammar:

\[ r ∈ RE ::= ∅ ~~|~~ ε ~~|~~ a ~~|~~ r_1r_2 ~~|~~ (r_1+r_2) ~~|~~ r^* \]

\(∅\) matches no strings. \(L(∅) = ∅\).

\(ε\) matches only the empty string. \(L(ε) = \{ε\}\).

\(a\) (for any single character \(a\)) matches only the one-character string "a". \(L(a) = \{a\}\).

\(r_1r_2\) (the concatenation of \(r_1\) and \(r_2\)) matches any string that can be broken into two parts \(x\) and \(y\), with \(x\) matching \(r_1\) and \(y\) matching \(r_2\). \(L(r_1r_2) = \{xy ~~|~~ x \in L(r_1), y \in L(r_2)\}\).

\(r_1+r_2\) (the alternation of \(r_1\) and \(r_2\)) matches any string that matches either \(r_1\) or \(r_2\). Formally, \(L(r_1+r_2) = L(r_1) \cup L(r_2)\). Note that some sources write \(r_1|r_2\) or \(r_1 \cup r_2\) for the alternation.

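\(r^*\) (the Kleene star or Kleene closure of \(r\)) matches the concatenation of any number (including 0) of strings, each of which matches \(r\). Formally, \(L(r^*) = \{x_1x_2\cdots x_n ~~|~~ n \geq 0 \text{ and each } x_i \in L(r)\}\); taking \(n = 0\) gives the empty string, so \(ε \in L(r^*)\) for every \(r\).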
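The six definitions above can be read almost directly as a recursive membership test. The following Python sketch is one possible encoding, not something from the lecture: the class names (`Empty`, `Eps`, `Chr`, `Cat`, `Alt`, `Star`) and the function `matches` are made up for illustration. It tries every split of the string, so it is slow, but it mirrors the definitions of \(L(r)\) closely.

```python
from dataclasses import dataclass


class RE:
    """Base class for the six forms of regular expression (hypothetical encoding)."""


@dataclass(frozen=True)
class Empty(RE):      # ∅
    pass


@dataclass(frozen=True)
class Eps(RE):        # ε
    pass


@dataclass(frozen=True)
class Chr(RE):        # a single character a
    a: str


@dataclass(frozen=True)
class Cat(RE):        # r1 r2
    r1: RE
    r2: RE


@dataclass(frozen=True)
class Alt(RE):        # r1 + r2
    r1: RE
    r2: RE


@dataclass(frozen=True)
class Star(RE):       # r*
    r: RE


def matches(r: RE, s: str) -> bool:
    """Return True if s is in L(r), following the definitions directly."""
    if isinstance(r, Empty):
        return False
    if isinstance(r, Eps):
        return s == ""
    if isinstance(r, Chr):
        return s == r.a
    if isinstance(r, Cat):
        # Try every way of splitting s into x and y.
        return any(matches(r.r1, s[:i]) and matches(r.r2, s[i:])
                   for i in range(len(s) + 1))
    if isinstance(r, Alt):
        return matches(r.r1, s) or matches(r.r2, s)
    if isinstance(r, Star):
        # Zero pieces give the empty string; otherwise peel off a
        # non-empty first piece so the recursion always terminates.
        return s == "" or any(
            matches(r.r, s[:i]) and matches(r, s[i:])
            for i in range(1, len(s) + 1))
    raise TypeError(f"not a regular expression: {r!r}")


ab_star = Star(Cat(Chr("a"), Chr("b")))   # (ab)*
print(matches(ab_star, ""))        # True
print(matches(ab_star, "abab"))    # True
print(matches(ab_star, "aba"))     # False
```

In the `Star` case, requiring the first piece to be non-empty loses nothing: empty pieces contribute nothing to the concatenation, so they can always be dropped.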

Important note: Many programming languages add other kinds of regular expressions, such as \(r^+\) to denote one or more \(r\)'s, or \(r?\) to denote 0 or 1 \(r\)'s. While convenient for programming, these additional forms make the theory more complicated without adding any expressive power. For this class, the six forms above are the only forms of regular expressions. You can achieve the same effect as most of these extensions by translating them to our basic regular expressions. For example, you can use \(rr^*\) to denote one or more repetitions of \(r\), and \((r + ε)\) to denote 0 or 1 repetitions.
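As a quick sanity check of these two translations, the sketch below uses Python's `re` module, which writes alternation as `|` rather than `+` and allows an empty alternative to play the role of \(ε\). The helper `same_language` is a made-up name; it only compares two patterns on short strings, so agreement is evidence, not a proof.

```python
import re
from itertools import product


def same_language(p1, p2, alphabet="ab", max_len=6):
    """Compare two patterns on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p1, s)) != bool(re.fullmatch(p2, s)):
                return False
    return True


# r+ versus rr*, with r = ab
print(same_language(r"(ab)+", r"(ab)(ab)*"))   # True
# r? versus (r + ε), with r = ab
print(same_language(r"(ab)?", r"(ab|)"))       # True
# r+ and r* differ on the empty string
print(same_language(r"(ab)+", r"(ab)*"))       # False
```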


Numbers: Let \(Σ\) be the ASCII character set (including upper- and lower-case letters, digits, and punctuation). Let \(D ::= 0 + 1 + 2 + \cdots + 9\). Then \(D\) matches any single digit (e.g. \(D\) matches 0, \(D\) matches 1, etc.). \(D^*\) matches strings of digits of any length, including the empty string. \(DD^*\) matches any string of one or more digits, that is, any natural number written with at least one digit. \(('-'+ε)DD^*(ε+'.'DD^*)\) matches numbers that contain at least one digit, optionally start with a '-' character, and are optionally followed by a decimal point and one or more digits.
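To try the number expression on real strings, it can be transcribed into Python's `re` syntax. This is a sketch under a few assumptions: `|` stands in for the lecture's `+`, an empty alternative stands in for \(ε\), `(?:...)` is plain grouping, and the names `D`, `NUMBER`, and `is_number` are chosen here for illustration.

```python
import re

# D ::= 0 + 1 + ... + 9, written with Python's | for alternation.
D = "(?:" + "|".join("0123456789") + ")"

# ('-' + ε) D D* (ε + '.' D D*)
NUMBER = f"(?:-|){D}{D}*(?:|\\.{D}{D}*)"


def is_number(s: str) -> bool:
    """True if the whole string s matches the number expression."""
    return re.fullmatch(NUMBER, s) is not None


for s in ["42", "-3.14", "007", "", "-", "3.", ".5", "1.2.3"]:
    print(repr(s), is_number(s))
```

The first three strings are accepted and the rest are rejected. Note that "007" is accepted: the expression does not rule out leading zeros.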

Dates: With \(D\) defined as above, \(DD/DD\) represents dates of the form \(mm/dd\). We can be more specific: Let \(N ::= 1 + 2 + \cdots + 9\) be a regular expression matching non-zero digits. We can represent numbers between 1 and 12 with the expression \(Mo ::= (1(0+1+2) + N)\). Then we can build regular expressions for dates that start with valid month numbers using the regular expression \(Mo/DD\). We could similarly restrict days to be numbers between 1 and 31. We could even match the number of days to the month, for example by writing

\[\begin{aligned} Date &::= 2/(N + 1D + 2D) && \text{29 day months} \\ &+ (4+6+9+11)/(N + 1D + 2D + 30) && \text{30 day months}\\ &+ (1+3+5+7+8+10+12)/(N+1D+2D+30+31) && \text{31 day months} \\ \end{aligned}\]
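The \(Date\) expression can be transcribed the same way. The sketch below assumes Python's `re` syntax (`|` for alternation, `(?:...)` for grouping); the names `D`, `N`, `DATE`, and `is_date` are illustrative only.

```python
import re

D = "(?:" + "|".join("0123456789") + ")"   # any digit
N = "(?:" + "|".join("123456789") + ")"    # non-zero digit

DATE = (
    f"(?:2/(?:{N}|1{D}|2{D})"                           # 29 day months
    f"|(?:4|6|9|11)/(?:{N}|1{D}|2{D}|30)"               # 30 day months
    f"|(?:1|3|5|7|8|10|12)/(?:{N}|1{D}|2{D}|30|31))"    # 31 day months
)


def is_date(s: str) -> bool:
    """True if the whole string s matches the Date expression."""
    return re.fullmatch(DATE, s) is not None


for s in ["2/29", "11/30", "12/31", "1/5", "2/30", "4/31", "13/1", "02/14"]:
    print(s, is_date(s))
```

The first four strings are accepted and the last four are rejected. In particular "02/14" is rejected, because months and days are written without leading zeros in this expression.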

We could optionally add years:

\[DateWithOptionalYear ::= Date(ε + /DD + /19DD + /20DD)\]

Even number of 0's: We could write a RE for binary strings with an even number of 0's and any number of 1's by noticing that every 0 must be paired with another 0. We could start with a pattern matching an even number of 0's with no 1's: \((00)^*\). We could then allow ourselves to add any number of 1's between the 0's: \((01^*0)^*\). We can also add 1's at the beginning of the string, or between repetitions of the 0's: \(1^*(01^*01^*)^*\). Some experimentation shows that this is sufficient, although you could explicitly allow 1's everywhere: \(1^*(1^*01^*01^*)^*1^*\).
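To see what each stage of this construction gains, the four expressions can be run against a few sample strings. The sketch below uses Python's `re` module; the names `STAGES`, `SAMPLES`, and `accepted` are made up for illustration. The first three samples have an even number of 0's and the last has an odd number.

```python
import re

# The four stages from the text, in Python's notation.
STAGES = [
    r"(00)*",
    r"(01*0)*",
    r"1*(01*01*)*",
    r"1*(1*01*01*)*1*",
]

SAMPLES = ["0000", "0110", "1001", "101"]


def accepted(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]


for p in STAGES:
    print(p, accepted(p))
```

The first stage accepts only "0000", the second adds "0110", and the last two also accept "1001". None of them accepts "101".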

Checking your work: When writing regular expressions, it's always good to test your answer by coming up with a variety of strings that should match and a variety of strings that shouldn't, and testing them out.
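For small alphabets, this kind of testing can be automated by trying every short string. The sketch below is one way to do it in Python, under the assumption that you can state the intended language as an ordinary predicate; `first_disagreement` and `even_zeros` are hypothetical helper names. Passing such a check up to some length does not prove a regular expression correct, but a failure gives a concrete counterexample.

```python
import re
from itertools import product


def first_disagreement(pattern, predicate, alphabet="01", max_len=8):
    """Return the first string (shortest first) on which the pattern and
    the predicate disagree, or None if they agree up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (re.fullmatch(pattern, s) is not None) != predicate(s):
                return s
    return None


def even_zeros(s):
    """The intended language: binary strings with an even number of 0's."""
    return s.count("0") % 2 == 0


print(first_disagreement(r"1*(01*01*)*", even_zeros))   # None
print(first_disagreement(r"(01*0)*", even_zeros))       # 1
```

The second call reports the string "1": it has an even number of 0's (namely none), but \((01^*0)^*\) cannot match a string that starts with a 1.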