Problem Set 3: Ordered Sets
Due March 2, 11:59:59 pm.
This assignment is to be done in
groups of two students.
READ THESE INSTRUCTIONS: A major purpose of this
problem set is to demonstrate the importance of designing your data structures
wisely. If you read through the entire problem set and think about all of the
functionality you will need for your data structure, you should be able to write
one versatile data structure that you will be able to
use through all of the problems to simplify your life immensely and achieve the
efficiency you need in each of your solutions. The data structure should be an
ordered set implementation, though you may choose to add functionality beyond
that of an ordinary ordered set. Design your data structure with care. When you
hand in your problem set, you will hand in one file with your data structure and
big-O analyses of any operations that you will rely on the efficiency of in a
later part of the problem set. That is: if your efficiency analysis of your
query engine lookup for part 2 relies on the efficiency of your data structure's
"find" functionality, you must include an analysis of that "find" function's
efficiency in your data structure file. You will also hand in six files with
your work for the problem set -- one each for parts one through four, two for
part five. Use the stub files available on CMS through
the "source" link.
Regarding data structure design: you may not use, within your data structure,
any data structures within the SML library except for SML basic types (int,
char, string, real, bool, tuple, record) and the Vector structure. Put effort
into designing your datatype with as much functionality as you need to make the
other 4 parts of the problem set as easy on yourself as possible. Furthermore,
the efficiency of your solutions counts in this problem
set, which means the efficiency of the operations your data structure
supports counts. You will not receive full credit for solutions with sub-optimal
running times. Note that the insertion time into an unbalanced tree is O(n), not
O(log n).
This problem set will be completed in groups. Each
of your six files should have both of your names, netIDs, and sections in a
comment at the top. As always, all programs that you submit must compile. Programs that do not compile will receive an automatic
zero. If you are having trouble getting your assignment to compile,
please visit consulting hours. If you run out of time, it is better to comment
out the parts that do not compile and hand in a file that compiles, rather than
handing in a more complete file that does not compile. Also, please continue to
pay attention to style. Refer to the CS 312 style guide and lecture notes. Take
the extra time to think out the problems and find the most elegant solutions
before coding them up. As before, it is important that you do not change the
names of the functions or structures.
When submitting your implementations for the following problems, you should
be sure to include all of the code that you used to test your implementation.
The completeness of these test cases will be a part of your grade. Even if your
solution to a problem is well-written and correct, lack of exhaustive test cases
could cost you points.
- Part 1: Warm-ups (8 points)
Stub files available on CMS
through the "source" link.
This section is here to get you
started thinking about your data structure and to give you hints as to what
direction you should go with your design. Most of the problems in this section
should be very few lines, and should be optimally efficient, if you have
implemented a good ordered set data structure.
- 2 Points: Write three set operations -- intersect, union, and
difference -- over in-order lists of integers. Each list can be assumed to
contain no duplicates. Your output must be ordered. For some examples of
output:
intersect([1,2,3,4,5],[3,4,7,8]) = [3,4]
union([1,2,3,4,5],[3,4,7,8]) = [1,2,3,4,5,7,8]
difference([1,2,3,4,5],[3,4,7,8]) = [1,2,5]
Optimal efficiency: O(m+n) in all cases, where m and n are the
respective lengths of the lists.
- 3 Points: Given an in-order list of integers, return a function
that takes in an integer and returns the boolean of whether or not that
integer was in the original list. For example:
fun fieldQueries(x:int list):(int -> bool) =
...
- val x = fieldQueries([1,4,27,119]);
- x(27);
val it = true
- x(31);
val it = false
Optimal efficiency: O(n) for initial call, O(log n) for all subsequent
calls to output. (Hint: Look at Vectors in the Basis library.)
- 3 Points: Given a list of (integer,string) pairs, sort them by
integer. If the list given in input has two pairs with the same string,
discard the pair with the higher integer.
fun tupleSort(l:(int * string) list) =
...
- tupleSort([(12,"Faceman"),(47,"Barrackus"),(7,"Hannibal"),(82,"Faceman")]);
val it = [(7,"Hannibal"),(12,"Faceman"),(47,"Barrackus")]
Optimal efficiency: O(n log n)
- Part 2: Query Engine (10 points)
Stub files available on CMS
through the "source" link.
A small company wants you to write a
function that allows people to search its local intra-company network. People
in different departments have access to different sets of files, so they want
the following functionality: they would like to be able to pass in a list of
files. You may assume that this has no duplicates. Your program will read
through each of these files, ignoring whitespace, punctuation, and
capitalization, and keep track of how often words appear in each of the files.
Upon reading all of the files, you will produce the following:
- A function that takes in a query and returns relevant results. Queries
consist of lists of QueryTerms. There are two types of QueryTerms:
- AND_NOT of (string list * string list)
- OR_NOT of (string list * string list)
With an
AND_NOT query, you will return every page that contains each of the words in
the first list and none of the words in the second list. You will also
return, for each page, the total number of times that a word in the first
list appears on the page. The pages you will return will be ranked by this
score, in decreasing order.
With an OR_NOT query, you will return
every page that contains at least one of the words in the first list and
none of the words in the second. Again, you will return, with each page, the
total number of times that a word from the first list appears on the page,
and you will rank your results by this score in decreasing order.
- You will also return a piece of data, type of your choosing, to be
passed to...
- The third thing you return, which will be a merge function. Merging
works as follows: When you create a localized search engine, you pass back
both one function and one piece of data. When the company applies the
function from one engine to the piece of data from another, the result is a
merged search engine; that is, a search engine that contains all of the
information that both of the search engines did, but does NOT count any
duplicated information twice; that is, if both search engines count 2
occurrences of the word "ATeam" in file "awesome.txt," the result should
count only 2 occurrences of "ATeam" in "awesome.txt" as well. Of course, the
form that the merged search engine will take is another (search function,
merge data, merge function) triple.
There is one difficulty here:
the spec, so far, is untypable. Why? Because the type of makeEngine's third
return value is the same as the type of makeEngine, so we have a recursive
type. Of course, SML can handle recursive types, we just have to make a custom
datatype. We have done so for you:
If your code works properly, you
should have something like this:
- datatype merger = MERGER of RelevanceMap -> (engine * RelevanceMap * merger)
- val (eng1,mergeData1,MERGER mergeFn1) = makeEngine(["awesome.txt",
"20x6.txt"]);
(* eng1 now fields queries based on the words in awesome.txt and
* 20x6.txt *)
- val (eng2,mergeData2,MERGER mergeFn2) = makeEngine(["20x6.txt",
"ps3sols.txt"]);
(* eng2 now fields queries based on the words in 20x6.txt and
* ps3sols.txt *)
- val (eng3,mergeData3,MERGER mergeFn3) = mergeFn1 mergeData2;
(* eng3 now fields queries based on the words in all three of the
* above textfiles *)
- val (eng3,mergeData3,MERGER mergeFn3) = mergeFn2 mergeData1;
(* Effectively equivalent to the line above *)
- val (eng3,mergeData3,MERGER mergeFn3) = makeEngine(["awesome.txt",
"20x6.txt",
"ps3sols.txt"]);
(* Effectively equivalent to the line above *)
- eng3 (AND_NOT(["birdhouse","argonauts"],["flying"]));
val it = [("soul.txt",12),("20x6.txt",2)]
(* In this example, both "birdhouse" and "argonauts" appear in
* 20x6.txt once each, and they both appear in soul.txt
* at least once each, but for a total of 12 times. "flying"
* appears in neither file. ps3sols.txt either doesn't contain
* one or both of "pity" and "fool," or it contains the word
* "flying". *)
Optimal efficiency: Reading in the files will clearly be linear in the
number of characters read. Beyond that,
creating an engine should be O(n log n) in the
number of words parsed given a constant number of files.
Merging engines should be O((m + n)(o + p)) where m
and n are the numbers of words in the two engines and o and p are the numbers
of files in the two engines.
- Part 3: Automata (12 points)
Stub files available on CMS
through the "source" link.
- Deterministic Finite Automata : 4 points
Automata model
very simple computation, and can be used to describe sets. A very simple
type of automata are deterministic finite automata (henceforth, DFAs). A DFA
consists of:
- A set of states (for this discussion, designated Q)
- A "start state" within the above set (for this discussion, s)
- A subset of the set of states, designated the "accept states" (for
this discussion, A)
- A set input symbols (for this discussion, I)
- A transition function T:(Q*I)->Q that, given an input symbol and
the current state, designates the next state of the computation
The basic idea is this: DFAs operate on strings of input
symbols and return true or false. They are therefore used to describe sets
of input strings; a set is describable by a DFA if we can define a DFA that
returns true only on strings within that set. At the beginning of operation,
we consider the DFA to be "in" its start state s. At each step of operation,
the DFA takes in the next symbol in the input string, and its current state,
and plugs those into its transition function to determine its new state. If,
when the input string terminates, the DFA is in a state that is designated
an "accept state" (within A), the DFA returns true; otherwise, it returns
false.
If this idea seems strange, it may help to see an example of
a DFA that describes a simple set we're familiar with. Here is a simple DFA
that describes all inputs with length divisible by 4:
- Q = {"0","1","2","3"}
- s = "0"
- A = {"0"}
- I = {"x"}
- T("0","x") = "1", T("1","x") = "2", T("2","x") = "3", T("3","x") = "0"
As we can see, the state is updated each phase to represent
the number of input symbols so far, modulus 4. For something a little more
interesting, convince yourself that the following DFA describes the set of
binary numbers divisible by 3. Be careful: there is some overlap here
between the naming of states and the naming of input symbols. Remember that
the transition function input takes the state first, then the input.
- Q = {"0","1","2"}
- s = "0"
- A = {"0"}
- I = {"0","1"}
- T("0","0") = "0", T("0","1") = "1", T("1","0") = "2", T("1","1") =
"0", T("2","0") = "1", T("2","1") = "2"
What you are going
to write is a function that takes in a description of a DFA -- very similar
to that above -- and gives out the DFA itself; that is, you will return two
functions: transition, and evaluate. The function
transition will take in a state and an input symbol and return the
state that results from transitioning from that state with that input
symbol. The function evaluate will take in an input string
and will return the boolean value of whether or not the DFA accepts the
string, and also the path along the DFA from the start state to the state
that it ended up in at the end of the string. So, for the DFA immediately
above, we would get: fun makeDFA(info:DFAInput):DFA =
...
- val {evaluate:string list -> bool * string list,
transition:string * string -> string} = makeDFA(info:DFAInput);
- transition("2","1");
val it = "2"
- evaluate(["1","0","0","1","1"]);
val it = (false,["0","1","2","1","0","1"])
Note that the transition of DFAs is a complete mapping; you can assume
that for every (state,input symbol) pair, you will be given a resulting
state in your transition function description. In general, you do not have
to worry about illegal or inconsistent DFA description inputs.
Types
DFA and DFAInput are both defined in the stubfile. There are a few
efficiency concerns to worry about here. Let "x" denote |Q| * |I| -- that
is, the product of the sizes of the set of states and the set of input
symbols. Your transition function should, optimally, be O(log x). Your
evaluate function should, then, be O(n *(log x)), where n is the length of
the input string. Your createDFA function should be O(x log x).
- Nondeterministic Finite Automata: 8 points
Now we come to
something more difficult. Nondeterministic finite automata (henceforth
NDFAs) work a little bit different from DFAs. The idea is this: instead of
describing absolute behaviors like DFAs, NDFAs describe possible behaviors
-- and to determine whether or not an NDFA accepts a string, instead of
looking at whether we will end up in an accept state, we look at whether we
could end up in a start state. More formally, an NDFA is just like a
DFA, except it can have multiple start states, and it can have multiple --
or zero -- transitions for every (state,symbol) pair. Multiple transitions
work just like you'd think: when the transition function has both
(0,"A")->0 and (0,"A")->1, it means that, if we're
in state 0, and we see input symbol "A", we either move to state 0 or 1. We
don't know which, so we just have to keep track of the fact that we could be
in either one. Zero transitions are also pretty intuitive: if no transitions
exist for (1,"B"), then we can't get anywhere from state 1 at all if the
input symbol is "B". (No, we can't just "stay in B"; in this case, if "B" is
the only state we might be in, we end computation and conclude that our
string is not accepted.)
To simulate an NDFA, at every timestep, you
keep track of the entire set of states that you might be in. (Note that,
because there might not be any transitions from our current states with our
current input symbols, this set can become empty.) Before the first
timestep, this is the set of start states. To advance one timestep, you
would take the first input symbol -- call it i -- and for every start state
q, you would find the set of states that (q,i) can lead to according to the
NDFA's transition function. You would union all these sets together, and the
result would be all the sets you could possibly be in at the next
timestep.
A quick example: let's make a DFA that describes all
strings that end in "A","B","B", plus the empty string.
- Q = {0,1,2,3}
- Starting States = {0,3}
- A = {3}
- I = {"A","B"}
- T(0,"A") = 0, T(0,"B") = 0, T(0,"A") = 1, T(1,"B") = 2, T(2,"B") = 3
Let's quickly look at how this would operate on the string
"ABBB". At timestep 0, we could be in state 0 or state 3. In our first
transition, we can't go anywhere from state 3, but we can go to states 0 or
1 from state 0; so now, our set of possibilities is {0,1}. At the next
timestep, 0 can only go to 0, and 1 can go to 2; now our set is {0,2}. After
one more timestep, it's {0,3}. And finally, we can't go anywhere from 3, and
with a "B" can only go to 0 from 0, so we end with the set of possibilities
{0}. Since no accept state is within our end of possible outcome states, we
would return false for this input.
Now let's look at your task. Like
before, you will be given a description of an NDFA. Again, you will need to
return a transition function, only this time, transition
will take in a state q and an input symbol i and will return the set
(in the form of a list) of all states reachable by (q,i). You will need to
return a function evaluate that takes a string and returns whether
or not that string can possibly finish its computation in an accept state.
Your evaluate function will also return a path option; that is, if it is
possible for the computation to end in an accept state, you will return one
possible path it could have followed to do so. If it is not possible, you
will return NONE. Finally, you will return a function that takes in a string
of inputs and outputs the list of all possible states that the NDFA could
end computation in at the end of the string. NOTE: the lists returned by
"possible" and "transition" must be ordered according to
the order that the states appear in the original list states (here,
Q). The specifics of the format of your output are made clear in the
stub file.
Efficiency is, of course, a concern again. Let x denote
the number of transitions, and m denote the number of states. Creating your
NDFA should take O(x log x + m log m). (We can actually show a slightly
tighter bound in the case that a certain inequality holds; find the
inequality and the tighter bound for 2 extra points.)
- Optional Part 4:
Cryptography (10 points, extra credit)
Stub files available on CMS
through the "source" link.
- Part 5: Wrap-ups (12 points)