Problem Set 1: Lexical Analysis

due: Monday, September 9

Update 09/02: Clarified chess field format. It should be a letter and a digit

Update 09/06: Clarified intermediate steps for DFA simplification. Those should be the intermediate state partitions as shown in the lecture

You may find the Dot and Graphviz packages helpful for generating drawings of DFAs and other graphs. You can get these packages for multiple OSes from the Graphviz downloads page. An example of a DFA drawn using Graphviz may be useful.

1. Design of finite automata

Living cells on earth encode their genetic information in a chemical code made from repetitions of four basic compounds abbreviated as A, C, G, and T.

These constituents are arranged in long linear sequences of triplets (such as GCT, or TGA) referred to as codons. The meaning of a short sub-strip of such a sequence is often not immediately clear, since there are different frames in which to interpret the information.

For instance, the partial sequence CTTCTCCGCAATTTAC might specify the codons

• CTT CTC CGC AAT TTA C??, or
• ?CT TCT CCG CAA TTT AC?, or
• ??C TTC TCC GCA ATT TAC

where the question marks indicate missing data. Certain codons have a special meaning, which fortunately resolves the reading frame ambiguity: for example, the initiation codon ATG denotes the beginning of a gene-encoding region, and one of the three termination codons TAA, TAG, and TGA mark the end of such a region.

A commonly asked question is whether a strip of a sequence acquired from a biological sample contains one or more complete gene-coding regions. Design a nondeterministic finite automaton that can be used to identify such a subsequence based on an arbitrary reading frame. Ensure that the initiation and termination codons are aligned with respect to each other.

2. Create a deterministic version of the following finite automaton. Make sure you show the initial and terminal states. Label each DFA state with the set of NFA states to which it corresponds.

3. Regular Expressions

A popular way of encoding chess games is the Portable Game Notation (PGN), which is essentially a numbered sequence of moves in algebraic format with optional comments between moves. Write a regular expression that describes this format as closely as possible using the regular expression syntax discussed in class. You may ignore spaces and newlines. Since the resulting regular expression may become large, you should feel free to define pieces of it and later reference them. (be careful to ensure that your language is regular in this case.)

A (simplified) format description of PGN is given by the following set of rules:

1. Game: A game consists of a sequence of turns and ends with one of three result identifiers: "1-0", "0-1", or "1/2-1/2"
2. Turns Every turn is preceded by a decimal number (which starts with a number other than zero) followed by a period. It consists of one or two moves.
3. Moves Moves are given by the piece that moved, an optional x when there was a capture, and the letter+digit identifier of the target square of the move. Pieces are denoted as K (king), Q (queen), R (rook), B (bishop), and N (knight). Pawns do not use a piece prefix.
4. Ambiguity resolution: In rare occasions, this type of specification is ambiguous, since there may be several pieces that could do the indicated move. In this case, either the rank (i.e. the row) or file (i.e. the column) identifier or both of them are inserted following the piece identifier (or at the beginning of the move, if the piece is a pawn).
5. Captures: An optional x after the piece identifier and any ambiguity resolution indicates a capture.
6. Castling: Kingside castling is denoted by O-O, and queenside castling is indicated by O-O-O.
7. Check and checkmate: A plus sign + is appended to checking moves. The number sign # is appended to checkmating moves.
8. Comments: Comments can optionally be inserted after a move. They contain arbitrary descriptions between curly braces. Comments never nest.

For clarification purposes, here is a PGN encoding of a game between Robert Fischer and Boris Spassky.

```1. e4 e5 2. Nf3 Nc6 3. Bb5 {This opening is called the Ruy Lopez.} a6
4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3 O-O 9. h3 Nb8  10. d4 Nbd7
11. c4 c6 12. cxb5 axb5 13. Nc3 Bb7 14. Bg5 b4 15. Nb1 h6 16. Bh4 c5 17. dxe5
Nxe4 18. Bxe7 Qxe7 19. exd6 Qf6 20. Nbd2 Nxd6 21. Nc4 Nxc4 22. Bxc4 Nb6
23. Ne5 Rae8 24. Bxf7+ Rxf7 25. Nxf7 Rxe1+ 26. Qxe1 Kxf7 27. Qe3 Qg5 28. Qxg5
hxg5 29. b3 Ke6 30. a3 Kd6 31. axb4 cxb4 32. Ra5 Nd5 33. f3 Bc8 34. Kf2 Bf5
35. Ra7 g6 36. Ra6+ Kc5 37. Ke1 Nf4 38. g3 Nxh3 39. Kd2 Kb5 40. Rd6 Kc5 41. Ra6
Nf2 42. g4 Bd3 43. Re6 1/2-1/2
```
4. DFA Simplification (CS5120 only)

Simplify the following DFA using the method presented in class. Show the minimized version, as well as any intermediate partitions of states as shown in class.