CS412/413 Homework 1 Solutions

CS412/413
Introduction to Compilers and Translators
Spring 1999

Homework 1: Lexical Analysis

Due: February 5, 1999 at the beginning of class

Stats -- Min: 10, Max: 19, Mean: 14.2, Std.Dev.: 2.5

[Histogram]

Notes: each problem was worth either 1 or 2 points, depending on whether or not it was reasonable to award partial credit for the problem. As might be expected, two-pointers left room for partial credit; one-pointers were deemed simple enough to warrant "all-or-nothing" grading. For two-pointers, I allowed three misses (i.e., anything that on its own would warrant a point deduction) before marking a 0.

A large population of the class does not quite understand the concept of regular expressions, characters, and metacharacters. This manifested itself in problems #1 and #2. Basically, any character is interpreted as a literal unless it is a regular expression metacharacter (those characters shown in Lecture 2). In particular, there is nothing special about double-quotation marks; they are a character just like any other that might be scanned. In general, I was lenient. But these must be learned for future homeworks and programming assignments.

Each solution is prefixed by a bracketed expression denoting the value of the problem, the range of scores earned on the problem, and the average and standard deviation of scores earned on the problem.

[2 points, 0-2, 0.9±0.4] One possible description of a floating point constant is as follows: it contains a numeric part followed by an optional exponent. The numeric part may be a positive or negative integer, or a number containing a decimal point and at least one digit on the left- or right-hand side. The numeric part may not start with leading zeros unless the zero is the only digit to the left of the decimal point. The exponent starts with either the character "e" or "E", followed by either a positive or negative integer. Write a concise regular expression that captures this definition.

We begin by constructing pieces of our regular expression:
```
        NonZeroInteger     [1-9][0-9]*
        Dot                .
        SignedInteger      -? NonZeroInteger
        Integer            SignedInteger | 0
        Digits             [0-9]+
        Numeric            SignedInteger | Integer Dot | 
                           Dot Digits | Integer Dot Digits
        Exponent           [Ee] SignedInteger
        Float              Numeric Exponent?
```
Expanding, we have:
```
        (-? [1-9][0-9]* | (-? [1-9][0-9]* | 0) . | 
         . [0-9]+ | (-? [1-9][0-9]* | 0) . [0-9]+)
        ([Ee] -? [1-9][0-9]*)
```
Common mistakes for which points were not deducted:
- Using extraneous " marks in the regular expression. Technically these would be interpreted by some lexer as "read a literal double-quote character". Everything in a regular expression is either a literal or a metacharacter. The double-quote character is not a metacharacter. Refer to Lecture 2 for the set of standard metacharacters.
Common mistakes for which points were deducted:
- Incorrect use of regular expression metacharacters. In particular, "[()]" were often misused as nesting characters. In regular expressions, "[]" and "()" mean very different things. (Square brackets denote classes of characters; round parentheses denote groupings of sub-regular expressions. (If you want to nest parentheses, then use multiple layers of parentheses.))
- Not matching strings like "0.3", "3.", ".3" or "-0.14159".
- Matching "0" and other leading-zero strings. (No leading zeros unless ... decimal point.) This case was specifically mentioned in the FAQ, and enough people put in the work to get it right to distinguish themselves from the people who got it wrong.
- Matching "3E-" or "[Ee]0" for the exponent (the exponent was to be positive or negative).
- Requiring an exponent in the string. The exponent is an optional part of a floating point constant.
[2 points, 0-2, 0.8±0.6] A string in C begins and ends with the double quotation character ". In the middle of the string, any printable character may occur. The sequence \" denotes the occurrence of a double quotation mark within the string, the sequence \n denotes a newline character, and the sequence \\ indicates the occurrence of a single backlash character. Other escape sequences are allowed, but we will ignore them here. Write a regular expression that recognizes strings of exactly this sort, using NEWLINE to represent "a new line" (since \n will be interpreted as the literal sequence \n). In practice, strings and comments usually are tokenized by special-case code that is invoked once the initial " is seen. Can you see why?

Using the metacharacters ( ) [ ^ ] * given in Lecture 2 yields (the first regular expression accepts embedded newlines, the second one does not):
```
        " ( [^"\] | NEWLINE | \["n\] )* "

        " ( [^"\] | \["n\] )* "
```
Breaking up the parenthesized expression within the double quotes:
- [^"\]. Any character that is not a double quote or backslash.
- NEWLINE. An embedded newline character. For example:
```
"Hello,
world!"
```
  Really, NEWLINE is included by [^"\] since it is "any character that is not a double quote or backslash".
- \["n\]. A backslash followed by a double-quote, `n', or backslash.
Strings and comments are special-cased because of their arbitrary length. A long comment or string would overrun a typical buffer allocated for a token in a typical lexer (such as lex). The AT&T lex token buffer is 200 characters; other implementations allocate as little as 100 characters. Modern lexers are better prepared for long strings and comments; flex uses 8k buffers by default.

There were some inconsistencies regarding the treatment of NEWLINE in the FAQ and in some email; answers that either matched or rejected embedded newline characters were accepted.

Basically, misuse (or lack of use) of regular expression syntax earned a one-point deduction. Missing another thing resulted in a 0. Common mistakes:
- Using [0-9a-zA-Z] instead of [^"\]. The first expression excludes stuff like !@#$%...
- Misuse of the " character. (These tended to be the same people that misused " for #1.)
- Forgetting to exclude the " character from acceptable "string" characters.
- Not expressing the solution in regular expression syntax.
Appel, problem 2.2 (a,b).

(a) [1 point, 0-1, 0.7±0.5] A regular expression is equivalent to a DFA. Recognizing the strings containing more a's than b's would require a DFA of unbounded size, since there would have to be one DFA state for each difference in the number of a's and b's, which itself is unbounded.

(b) [1 point, 0-1, 0.6±0.5] The language of palindromes would also require a DFA of unbounded size, since it would need a state to represent the "beginning" and "end" instances of a character in the palindrome. Since palindromes are of arbitrary length, the DFA would be of arbitrary size.
Appel, problem 2.3.

(a) [1 point, 1-1, 1±0.0] This automata recognizes all four-character strings consisting of "0"'s and "1"'s except for "0110".

(b) [1 point, 0-1, 0.9±0.3] This automata recognizes strings consisting of 5n + 1 consecutive `a' characters, where n is a non-negative integer.

(c) [1 point, 0-1, 0.6±0.5] This automata recognizes strings accepted by the regular expressions
```
(0* (1 (0 1* 0)* 1)* )*
```
or
```
( (1 (0 1* 0)* 1)* 0*)*
```
or
```
(0 | 1 (0 1* 0)* 1)*
```
Note that in general, (a | b)* will generate and recognize the same strings as (a* b*)*. The point is that you should be able to algorithmically convert between DFAs and regular expressions.
The cuter solution was "all binary representations of multiples of 3".

Appel, problem 2.5.

(a) [2 points, 0-2, 1.8±0.5]

               y
         1234 --> [67]
          |^
         x||z
          v|
         [567]

(b) [2 points, 0-2, 1.0±0.9] The leaves have transitions between themselves. They are all final states, because of state 6. Yes, there were 32 states. Those who stuck it out to the end would have seen the binary tree right away. Those who used only a state table missed out on this insight. Note the exponential increase in the number of states (32 = 2⁵) from the original 6 states, and the balanced binary tree reflecting this 2ⁿ increase:

                                   1----
                                   |\__/ b
                                  a|
                                   12
  +--------------------------------+--------------------------------+
 a|                                                                 |b
 123                                                                13
  +-----------------------------+     +-----------------------------+
 a|                             |b   a|                             |b
 1234                          134   124                            14
  +-----------+     +-----------+     +-----------+     +-----------+
 a|           |b   a|           |b   a|           |b   a|           |b
 12345      1345   1245        145   1235        135   125          15
  +---+   +---+     +---+   +---+     +---+   +---+     +---+   +---+
 a|   |b a|   |b   a|   |b a|   |b   a|   |b a|   |b   a|   |b a|   |b
 123 134 124  14   123  13  12 156   123  13 12  146   12  136 126  16
 456  56 56   56   56   56  56       46   46 46        36

                   c               a
      {1,5,10,14} --> {2,6,11,15} --> {3,7,12,16}
                                           +
                         +-----------------+-------+
                        r|                         |t
                     [{13,17}]                  [{4,8}]
                         +                         +
                        s|                         |s
                        [18]                      [9]

[2 points, 0-2, 1.8±0.5] Appel, problem 2.6.

One may readily represent this state diagram as a table:

          transition
            0   1
state   1   2   6
        2   7   3
        3   1   3
        4   3   7
        5   8   6
        6   3   7
        7   7   5
        8   7   3

Note that the transitions leaving states {2,8} are the same. What this means is that once state 2 or 8 has been entered, the visitor of the DFA cannot distinguish between states 2 or 8. From the conditions for equivalent states given at the end of Appel section 2.4, it follows states 2 and 8 are equivalent. The same conditions also hold for states {4,6}.

Once states {2,8} and {4,6} have been minimized, you will see that states {1,5} can also be minimized. This yields the following table and DFA:

          transition
            0   1                +------------+   + 
state   15  28  46               v  0     1   |0 /|1
        28  7   3              ->15--->28--->[3]<-+
         3  15  3                | ^   |      ^
        46  3   7               1|  \1 |0 +   |
         7  7   15               v   \ v /|0  |
                                 46--->7<-+   |
                                0|            |
                                 +------------+

[3 points, 0-3, 2.1±1.1] Appel, problem 2.9.

The lexical specification given by Appel represents the following DFA:

>{1,7,10}---b-->[{11}]
    |a
    v
[{2,8,11}]--b-->{3,8}--a-->[{4}]--a-->{5}
    |a            |b          ^       /
    v             v            \     / b
  [{9}]<----a----{8}-+        a \{6}v
                  ^ /
                  |/ b
                  +

From the DFA, we can generate the edges and final tables:

+-----------------+--------------------+---------+
|                 |        Edges       |  Final  |
+-----------------+--------------------+---------+
|                 |     a         b    |  Action |
|       {1,7,10}  | [{2,8,11}]  [{11}] |    0    |
|      [{2,8,11}] |   [{9}]     {3,8}  |    3    |
|        {3,8}    |   [{4}]      {8}   |    0    |
|        [{4}]    |    {5}             |    1    |
|State    {5}     |              {6}   |    0    |
|         {6}     |   [{4}]            |    0    |
|         {8}     |   [{9}]     [{8}]  |    0    |
|        [{9}]    |                    |    2    |
|        [{11}]   |                    |    3    |
+-----------------+--------------------+---------+

CS412/413 Introduction to Compilers and Translators Spring 1999

Homework 1: Lexical Analysis

CS412/413
Introduction to Compilers and Translators
Spring 1999