Problem Set 3: Ordered Sets

Due March 2, 11:59:59 pm.
This assignment is to be done in groups of two students.

READ THESE INSTRUCTIONS: A major purpose of this problem set is to demonstrate the importance of designing your data structures wisely. If you read through the entire problem set and think about all of the functionality you will need for your data structure, you should be able to write one versatile data structure that you will be able to use through all of the problems to simplify your life immensely and achieve the efficiency you need in each of your solutions. The data structure should be an ordered set implementation, though you may choose to add functionality beyond that of an ordinary ordered set. Design your data structure with care. When you hand in your problem set, you will hand in one file with your data structure and big-O analyses of any operations that you will rely on the efficiency of in a later part of the problem set. That is: if your efficiency analysis of your query engine lookup for part 2 relies on the efficiency of your data structure's "find" functionality, you must include an analysis of that "find" function's efficiency in your data structure file. You will also hand in six files with your work for the problem set -- one each for parts one through four, two for part five. Use the stub files available on CMS through the "source" link.

Regarding data structure design: you may not use, within your data structure, any data structures within the SML library except for SML basic types (int, char, string, real, bool, tuple, record) and the Vector structure. Put effort into designing your datatype with as much functionality as you need to make the other 4 parts of the problem set as easy on yourself as possible. Furthermore, the efficiency of your solutions counts in this problem set, which means the efficiency of the operations your data structure supports counts. You will not receive full credit for solutions with sub-optimal running times. Note that the insertion time into an unbalanced tree is O(n), not O(log n).

This problem set will be completed in groups. Each of your six files should have both of your names, netIDs, and sections in a comment at the top. As always, all programs that you submit must compile. Programs that do not compile will receive an automatic zero. If you are having trouble getting your assignment to compile, please visit consulting hours. If you run out of time, it is better to comment out the parts that do not compile and hand in a file that compiles, rather than handing in a more complete file that does not compile. Also, please continue to pay attention to style. Refer to the CS 312 style guide and lecture notes. Take the extra time to think out the problems and find the most elegant solutions before coding them up. As before, it is important that you do not change the names of the functions or structures.

When submitting your implementations for the following problems, you should be sure to include all of the code that you used to test your implementation. The completeness of these test cases will be a part of your grade. Even if your solution to a problem is well-written and correct, lack of exhaustive test cases could cost you points.

Part 1: Warm-ups (8 points)
Stub files available on CMS through the "source" link.
This section is here to get you started thinking about your data structure and to give you hints as to what direction you should go with your design. Most of the problems in this section should be very few lines, and should be optimally efficient, if you have implemented a good ordered set data structure.
- 2 Points: Write three set operations -- intersect, union, and difference -- over in-order lists of integers. Each list can be assumed to contain no duplicates. Your output must be ordered. For some examples of output:
```
intersect([1,2,3,4,5],[3,4,7,8]) = [3,4]
union([1,2,3,4,5],[3,4,7,8]) = [1,2,3,4,5,7,8]
difference([1,2,3,4,5],[3,4,7,8]) = [1,2,5]
```
  Optimal efficiency: O(m+n) in all cases, where m and n are the respective lengths of the lists.
- 3 Points: Given an in-order list of integers, return a function that takes in an integer and returns the boolean of whether or not that integer was in the original list. For example:
```
fun fieldQueries(x:int list):(int -> bool) =
   ...

- val x = fieldQueries([1,4,27,119]);
- x(27);
val it = true
- x(31);
val it = false
```
  Optimal efficiency: O(n) for initial call, O(log n) for all subsequent calls to output. (Hint: Look at Vectors in the Basis library.)
- 3 Points: Given a list of (integer,string) pairs, sort them by integer. If the list given in input has two pairs with the same string, discard the pair with the higher integer.
```
fun tupleSort(l:(int * string) list) = 
   ...

- tupleSort([(12,"Faceman"),(47,"Barrackus"),(7,"Hannibal"),(82,"Faceman")]);
val it = [(7,"Hannibal"),(12,"Faceman"),(47,"Barrackus")]
```
  Optimal efficiency: O(n log n)
Part 2: Query Engine (10 points)
Stub files available on CMS through the "source" link.
A small company wants you to write a function that allows people to search its local intra-company network. People in different departments have access to different sets of files, so they want the following functionality: they would like to be able to pass in a list of files. You may assume that this has no duplicates. Your program will read through each of these files, ignoring whitespace, punctuation, and capitalization, and keep track of how often words appear in each of the files. Upon reading all of the files, you will produce the following:
- A function that takes in a query and returns relevant results. Queries consist of lists of QueryTerms. There are two types of QueryTerms:
  - AND_NOT of (string list * string list)
  - OR_NOT of (string list * string list)
  With an AND_NOT query, you will return every page that contains each of the words in the first list and none of the words in the second list. You will also return, for each page, the total number of times that a word in the first list appears on the page. The pages you will return will be ranked by this score, in decreasing order.
  
  With an OR_NOT query, you will return every page that contains at least one of the words in the first list and none of the words in the second. Again, you will return, with each page, the total number of times that a word from the first list appears on the page, and you will rank your results by this score in decreasing order.
- You will also return a piece of data, type of your choosing, to be passed to...
- The third thing you return, which will be a merge function. Merging works as follows: When you create a localized search engine, you pass back both one function and one piece of data. When the company applies the function from one engine to the piece of data from another, the result is a merged search engine; that is, a search engine that contains all of the information that both of the search engines did, but does NOT count any duplicated information twice; that is, if both search engines count 2 occurrences of the word "ATeam" in file "awesome.txt," the result should count only 2 occurrences of "ATeam" in "awesome.txt" as well. Of course, the form that the merged search engine will take is another (search function, merge data, merge function) triple.
There is one difficulty here: the spec, so far, is untypable. Why? Because the type of makeEngine's third return value is the same as the type of makeEngine, so we have a recursive type. Of course, SML can handle recursive types, we just have to make a custom datatype. We have done so for you:

If your code works properly, you should have something like this:
```
- datatype merger = MERGER of RelevanceMap -> (engine * RelevanceMap * merger)

- val (eng1,mergeData1,MERGER mergeFn1) = makeEngine(["awesome.txt",
                                                      "20x6.txt"]);
  (* eng1 now fields queries based on the words in awesome.txt and
   * 20x6.txt *)

- val (eng2,mergeData2,MERGER mergeFn2) = makeEngine(["20x6.txt",
                                                      "ps3sols.txt"]);
  (* eng2 now fields queries based on the words in 20x6.txt and
   * ps3sols.txt *)

- val (eng3,mergeData3,MERGER mergeFn3) = mergeFn1 mergeData2;
  (* eng3 now fields queries based on the words in all three of the
   * above textfiles *)

- val (eng3,mergeData3,MERGER mergeFn3) = mergeFn2 mergeData1;
  (* Effectively equivalent to the line above *)

- val (eng3,mergeData3,MERGER mergeFn3) = makeEngine(["awesome.txt",
                                                      "20x6.txt",
                                                      "ps3sols.txt"]);
  (* Effectively equivalent to the line above *)

- eng3 (AND_NOT(["birdhouse","argonauts"],["flying"]));
val it = [("soul.txt",12),("20x6.txt",2)]
  (* In this example, both "birdhouse" and "argonauts" appear in 
   * 20x6.txt once each, and they both appear in soul.txt
   * at least once each, but for a total of 12 times. "flying"
   * appears in neither file. ps3sols.txt either doesn't contain
   * one or both of "pity" and "fool," or it contains the word
   * "flying". *)
```
Optimal efficiency: Reading in the files will clearly be linear in the number of characters read. Beyond that, creating an engine should be O(n log n) in the number of words parsed given a constant number of files. Merging engines should be O((m + n)(o + p)) where m and n are the numbers of words in the two engines and o and p are the numbers of files in the two engines.
Part 3: Automata (12 points)
Stub files available on CMS through the "source" link.
- Deterministic Finite Automata : 4 points
  
  Automata model very simple computation, and can be used to describe sets. A very simple type of automata are deterministic finite automata (henceforth, DFAs). A DFA consists of:
  - A set of states (for this discussion, designated Q)
  - A "start state" within the above set (for this discussion, s)
  - A subset of the set of states, designated the "accept states" (for this discussion, A)
  - A set input symbols (for this discussion, I)
  - A transition function T:(Q*I)->Q that, given an input symbol and the current state, designates the next state of the computation
  The basic idea is this: DFAs operate on strings of input symbols and return true or false. They are therefore used to describe sets of input strings; a set is describable by a DFA if we can define a DFA that returns true only on strings within that set. At the beginning of operation, we consider the DFA to be "in" its start state s. At each step of operation, the DFA takes in the next symbol in the input string, and its current state, and plugs those into its transition function to determine its new state. If, when the input string terminates, the DFA is in a state that is designated an "accept state" (within A), the DFA returns true; otherwise, it returns false.
  
  If this idea seems strange, it may help to see an example of a DFA that describes a simple set we're familiar with. Here is a simple DFA that describes all inputs with length divisible by 4:
  - Q = {"0","1","2","3"}
  - s = "0"
  - A = {"0"}
  - I = {"x"}
  - T("0","x") = "1", T("1","x") = "2", T("2","x") = "3", T("3","x") = "0"
  As we can see, the state is updated each phase to represent the number of input symbols so far, modulus 4. For something a little more interesting, convince yourself that the following DFA describes the set of binary numbers divisible by 3. Be careful: there is some overlap here between the naming of states and the naming of input symbols. Remember that the transition function input takes the state first, then the input.
  - Q = {"0","1","2"}
  - s = "0"
  - A = {"0"}
  - I = {"0","1"}
  - T("0","0") = "0", T("0","1") = "1", T("1","0") = "2", T("1","1") = "0", T("2","0") = "1", T("2","1") = "2"
  What you are going to write is a function that takes in a description of a DFA -- very similar to that above -- and gives out the DFA itself; that is, you will return two functions: transition, and evaluate. The function transition will take in a state and an input symbol and return the state that results from transitioning from that state with that input symbol. The function evaluate will take in an input string and will return the boolean value of whether or not the DFA accepts the string, and also the path along the DFA from the start state to the state that it ended up in at the end of the string. So, for the DFA immediately above, we would get:
```
fun makeDFA(info:DFAInput):DFA = 
   ...

- val {evaluate:string list -> bool * string list,
       transition:string * string -> string} = makeDFA(info:DFAInput);
- transition("2","1");
val it = "2"
- evaluate(["1","0","0","1","1"]);
val it = (false,["0","1","2","1","0","1"])
```
  Note that the transition of DFAs is a complete mapping; you can assume that for every (state,input symbol) pair, you will be given a resulting state in your transition function description. In general, you do not have to worry about illegal or inconsistent DFA description inputs.
  
  Types DFA and DFAInput are both defined in the stubfile. There are a few efficiency concerns to worry about here. Let "x" denote |Q| * |I| -- that is, the product of the sizes of the set of states and the set of input symbols. Your transition function should, optimally, be O(log x). Your evaluate function should, then, be O(n *(log x)), where n is the length of the input string. Your createDFA function should be O(x log x).
- Nondeterministic Finite Automata: 8 points
  
  Now we come to something more difficult. Nondeterministic finite automata (henceforth NDFAs) work a little bit different from DFAs. The idea is this: instead of describing absolute behaviors like DFAs, NDFAs describe possible behaviors -- and to determine whether or not an NDFA accepts a string, instead of looking at whether we will end up in an accept state, we look at whether we could end up in a start state. More formally, an NDFA is just like a DFA, except it can have multiple start states, and it can have multiple -- or zero -- transitions for every (state,symbol) pair. Multiple transitions work just like you'd think: when the transition function has both (0,"A")->0 and (0,"A")->1, it means that, if we're in state 0, and we see input symbol "A", we either move to state 0 or 1. We don't know which, so we just have to keep track of the fact that we could be in either one. Zero transitions are also pretty intuitive: if no transitions exist for (1,"B"), then we can't get anywhere from state 1 at all if the input symbol is "B". (No, we can't just "stay in B"; in this case, if "B" is the only state we might be in, we end computation and conclude that our string is not accepted.)
  
  To simulate an NDFA, at every timestep, you keep track of the entire set of states that you might be in. (Note that, because there might not be any transitions from our current states with our current input symbols, this set can become empty.) Before the first timestep, this is the set of start states. To advance one timestep, you would take the first input symbol -- call it i -- and for every start state q, you would find the set of states that (q,i) can lead to according to the NDFA's transition function. You would union all these sets together, and the result would be all the sets you could possibly be in at the next timestep.
  
  A quick example: let's make a DFA that describes all strings that end in "A","B","B", plus the empty string.
  - Q = {0,1,2,3}
  - Starting States = {0,3}
  - A = {3}
  - I = {"A","B"}
  - T(0,"A") = 0, T(0,"B") = 0, T(0,"A") = 1, T(1,"B") = 2, T(2,"B") = 3
  Let's quickly look at how this would operate on the string "ABBB". At timestep 0, we could be in state 0 or state 3. In our first transition, we can't go anywhere from state 3, but we can go to states 0 or 1 from state 0; so now, our set of possibilities is {0,1}. At the next timestep, 0 can only go to 0, and 1 can go to 2; now our set is {0,2}. After one more timestep, it's {0,3}. And finally, we can't go anywhere from 3, and with a "B" can only go to 0 from 0, so we end with the set of possibilities {0}. Since no accept state is within our end of possible outcome states, we would return false for this input.
  
  Now let's look at your task. Like before, you will be given a description of an NDFA. Again, you will need to return a transition function, only this time, transition will take in a state q and an input symbol i and will return the set (in the form of a list) of all states reachable by (q,i). You will need to return a function evaluate that takes a string and returns whether or not that string can possibly finish its computation in an accept state. Your evaluate function will also return a path option; that is, if it is possible for the computation to end in an accept state, you will return one possible path it could have followed to do so. If it is not possible, you will return NONE. Finally, you will return a function that takes in a string of inputs and outputs the list of all possible states that the NDFA could end computation in at the end of the string. NOTE: the lists returned by "possible" and "transition" must be ordered according to the order that the states appear in the original list states (here, Q). The specifics of the format of your output are made clear in the stub file.
  
  Efficiency is, of course, a concern again. Let x denote the number of transitions, and m denote the number of states. Creating your NDFA should take O(x log x + m log m). (We can actually show a slightly tighter bound in the case that a certain inequality holds; find the inequality and the tighter bound for 2 extra points.)
Optional Part 4: Cryptography (10 points, extra credit)
Stub files available on CMS through the "source" link.
- Shift ciphers: 4 points (extra credit)
  
  A shift cipher is the simplest cipher there is: the key is simply a constant number that we shift each letter by. That is, if the key is 1, then the string "ABCD" becomes, encoded, "BCDE." A shift cipher is also simple to detect computationally. The method is this: given an encoded message, count the frequency of occurrence of each letter in the alphabet. In your stub file you will find the expected frequency of occurrence of the letters in normal English. Find the amount that you need to shift the observed frequencies of letter appearances by in order to make it most closely resemble the expected frequency vector; if you shift each of the letters in the encrypted text by this amount as well, you will have decrypted the message.
  
  Let's give a simpler example, more explicitly. Say we have a language with only four letters, A-D, and expected frequencies (0.6,0.1,0.2,0.1). If we find an encoded message with observed frequencies (0.05,0.55,0.15,0.25), then it is a fair bet that we can decode the message by shifting every letter back by 1. The only piece left to make explicit is the "distance" function. It seems obvious to us that the best shift above is (0.55,0.15,0.25,0.05), because the numbers match up well, but we need a formula to quantify that. For this exercise, we will calculate vector distance by summing the squared distances of the individual frequencies.
  
  You will write a function that will take in an encoded string, analyze it for the most likely shift key, and decode it with that key. You will ignore punctuation and whitespace in your input, but will not delete it from your output. (The same is true for capitalization.) As a test, the most common letter in English is "e", so any string that consists of only one letter should come out in all "e"s:
```
- decodeShift("A. A! aaaaa   ....");
val it = "E. E! eeeee   ...."
```
  When you write your own test cases, note that your decode function cannot be expected to work reliably on very short strings.
- Vigenere ciphers: 6 points (extra credit)
  
  A Vigenere cipher is a slight advancement on a shift cipher. The idea is that you have a key that consists of multiple shifts that you apply in order. If your encoded string, for whatever reason, was "AAA . bbb", and your key was [1,2,3], then your output would be "BCD. cde." (Note that, again, we preserved punctuation, whitespace, and capitalization.) We crack these in two steps: first, we find the key length, and then, we solve for the keys.
  
  Finding the key length involves a neat trick involving the index of coincidence. In gibberish over a language of 26 letters, the chance of any two randomly picked letters being identical would be 1/26 (or roughly 0.038.). In an English paragraph, the chance is notably higher (roughly 0.065) because of the imbalanced frequencies of letters. (If you think about this, it makes sense; in a totally imbalanced language, where one letter is used much more frequently than others, the chance would be very high.) When English is corrupted with a Vigenere cipher, the chance goes much lower because the imbalanced frequencies are spread out. For instance, E is the most likely letter, but with the Vigenere cipher key [1,2,3], the frequent occurrences of E would be spread out over the letters F, G, and H. This is the trick to finding key length: to test if the key might have length x, we split it up into x segments of every xth letter, and calculate the index of coincidence over each one. When we hit the right key length, the indices will be close to 0.065; when we don't, they will be closer to 0.038.
  
  A few notes, in case this is unclear: if the string is "ABCDEF", and we think the key length is two, we will split it up into "ACE" and "BEF" and find the index of coincidence of both. To calculate the index of coincidence, we find the number of each letter (call this n), and the number of letters total (call this t), and we sum up the term (n/t) * ((n-1)/(t-1)) over all letters. This should make sense; (n/t) is the chance that the first letter of two that we pick is this letter, and (n-1)/(t-1) is the chance that the second will be also.
  
  When you have a guess at a key length, and a corresponding vector of indices of coincidence, how do you compute that into a "score" of how good of a guess it is? Our "score" will be "distance" from 0.065, and our distance will be a squared, biased difference: that is, if the index is less than 0.065, we will square the difference; if it is greater, we will square half of the difference. (That is: (diff / 2) ^ 2). We will average these distances across the vector of indices for our "score" for this key length guess.
  
  There's one more hitch: If the key length is 2, for instance, and we guess 4, we're going to get a good score anyway. So, when we are comparing scores and finding the proper key length, if we are comparing two lengths where one is a factor of the other, we will bias in favor of the smaller length by a factor of two. That is, unless the score of the smaller is twice as large (large = bad) as the score of the larger key size, we will favor it.
  
  So, to complete our algorithm: given a string, we will try key lengths 1 through 10 and evaluate which one we deem best. Once we have a key length, we will break the string up into the appropriate parts and analyze the shifts of the different parts as above with the shift cipher. We will return the decoded string, as well as the suspected key length.
  
  The running time of both of these should just be O(n). The only efficiency concerns are with sorting constant-size lists of possible keys, which ideally should be done in O(n log n), since you've presumably already written a sort algorithm for part 1, problem 3 that works in that time.
Part 5: Wrap-ups (12 points)
- 4 points: Consider the following SML code fragment:
```
let fun f x y =
   let fun f n = (y n) + n
   in
       f ( x y )
   end
in
   f (fn x => x 42) (fn z => 4 * z)
end
```
  - What is the type of the outermost f? In a few sentences, justify your answer. (2 points)
  - Show the steps taken by the SML interpreter to evaluate this program, using the evaluation rules described in lecture, ending with the returned value. (2 points)
- 4 points: Use CVS for this problem set non-trivially between you and your partner. This means real commit messages and incremental commits. Submit your logs for review by the course staff. This means, if you are using Unix, displaying all of your CVS logs (look into the "cvs log" command) and piping them into a file; if you are using some Windows brand of CVS, it may involve copying and pasting. Submit these in the "cvslog" file.
- 4 points: Meet with the consultants the week after the problem sets are graded to explain your work.

Problem Set 3: Ordered Sets

Due March 2, 11:59:59 pm.This assignment is to be done in groups of two students.

Due March 2, 11:59:59 pm.
This assignment is to be done in groups of two students.