# A3: Search
**Deadline:** Wednesday, 3/28/18, 11:59 pm
*This assignment may be done as an individual or with one partner. The
use of a partner is optional. If you partner with someone outside your
recitation, please alert both TAs in advance, so that they can
coordinate on who is grading your submission. Sharing of code is
permitted only between you and your partner; sharing among larger groups
is prohibited. Your recitation TA will devote some recitation time to
helping connect you with potential partners.*
**Warning about academic integrity.** This assignment involves
implementing some data structures. In the vast expanses of the
Internet, there might be an implementation of every known data
structure in every known programming language. So it's conceivable
you could just Google for the solutions to certain parts of this
assignment. Well, we've already done that ourselves and put the
results in MOSS, the plagiarism detection software we use. If you do
steal code from somewhere and fail to cite it, that will be regarded
as an especially heinous violation of academic integrity. If you do
cite, the grade penalty will be severe. (Though of course not as
severe as violating the code of academic integrity.) In previous
years, there were over a dozen AI violations on this assignment,
mostly because students Googled for balanced tree implementations in
OCaml and re-used code they found in a functional programming course
at another Ivy-League university. *Don't search for solutions. Just
don't.* Once you see someone else's solution, even just some types or
signatures of helper functions, it's impossible to unsee it; you will
almost certainly write code that is too similar to theirs even if you
do not directly copy it.
**Table of Contents:**
* [Assignment information](#info)
* [Part -1: Git](#git)
* [Part 0: Makefile](#makefile)
* [Part 1: Read the release code](#release)
* [Part 2: List engine](#list)
* [Part 3: Test harness](#test)
* [Part 4: Tree engine](#tree)
* [What to turn in](#turnin)
In this assignment, you will develop a search engine for text documents.
The prototypical kind of document you should have in mind is a book
from [Project Gutenberg][pg]. Though that project offers books in
many formats, the ones you should look at are the Plain Text versions,
such as the one labeled "Plain Text UTF-8" of
[*Alice's Adventures in Wonderland*][alice] by Lewis Carroll.
Your engine will *crawl* through a directory on a local disk looking
for documents. When it finds them, it will *index* the words that
appear in those documents. It will then answer *queries* posed by
users. A query will involve words that might appear in the documents,
and the response to a query will be a list of documents that involve
those words.
The index maintained by your search engine will make use of lists,
trees, dictionaries, and sets. To get your engine working at first,
you will implement a simple dictionary based on association
lists. That implementation will be too inefficient to index large
documents, so you will later implement a dictionary using *AVL trees*.
You will likewise implement sets at first with an inefficient list
representation, then later with an efficient AVL tree representation.
The queries that users pose will have one of two forms. Abstractly
those two forms are as follows, in which the NOT clause is always
optional:
* "and-not" queries: AND (*w1*, *w2*, ..., *wn*),
NOT (*u1*, *u2*, ... *um*)
* "or-not" queries: OR (*w1*, *w2*, ..., *wn*),
NOT (*u1*, *u2*, ... *um*)
For example, "AND (far, away)" would return all documents that contain
both the words "far" and "away", whereas "AND (far, away), NOT (day,
night)" would return all documents that do contain both "far" and
"away" but contain neither "day" nor "night". Likewise, "OR (all,
nothing)" would return any document that contains "all" or "nothing"
(or both), and "OR (all, nothing), NOT (between)" would return any
document that contains "all" or "nothing" (or both) but does not
contain "between".
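As a rough sketch of these semantics (hypothetical types and function names, not the release code's interface), each kind of query can be evaluated by filtering a document collection:

```ocaml
(* Hypothetical sketch of query semantics. A document is represented
   here as a (name, word list) pair; the release code's actual types
   and function names differ. *)

(* [contains doc w] is true if word list [doc] contains the word [w]. *)
let contains doc w = List.mem w doc

(* [and_not docs ws us] is the list of documents containing every word
   in [ws] and none of the words in [us]. *)
let and_not docs ws us =
  List.filter (fun (_, words) ->
      List.for_all (contains words) ws
      && not (List.exists (contains words) us))
    docs

(* [or_not docs ws us] is the list of documents containing at least one
   word in [ws] and none of the words in [us]. *)
let or_not docs ws us =
  List.filter (fun (_, words) ->
      List.exists (contains words) ws
      && not (List.exists (contains words) us))
    docs
```

For example, `and_not docs ["far"; "away"] ["day"]` keeps exactly the documents whose word lists contain both "far" and "away" but not "day".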
You will also build a test suite for your search engine, and as part
of that test suite, you will build a *test harness* that could be used
to test any dictionary, not just your own. The course staff will run
your test harness against some faulty dictionary implementations to
see whether your harness detects the faults in them.
All of the above tasks will require extensive use of the OCaml module
system. The release code mostly just contains a few interface files;
your job is to write implementation files for those interfaces.
## Assignment information
**Objectives.**
* Use the OCaml module system, including structures, signatures, and functors.
* Write a test suite to detect faults in implementations other than your own.
* Implement an efficient data structure using balanced search trees.
* Practice writing programs in the functional style using immutable data.
* Use Git, a distributed version control system.
**Useful resources.**
* [Lectures and recitations 7–13][web]
* [Handout on AVL trees][avltree]
**Requirements.**
1. You must implement the document crawler, indexer, and querier.
2. You must implement two dictionary data structures, one with
association lists and the other with AVL trees.
3. You must implement producing a set data structure out of an
arbitrary dictionary data structure.
4. You must build a test harness for dictionaries.
5. Your code must be written with good style and be well documented.
All functions, except for nested helper functions, must have
specification comments. Representation types for data structures
must have a documented abstraction function, and any
representation invariants must also be documented.
6. Your code must avoid duplication of implementations. There is some
duplication of type declarations that is unavoidable; see Part 1 for
an explanation.
7. You must submit an [overview document][overview].
8. You must use Git as part of your development process.
**What we provide.**
In the release code on the course website you will find these files:
* Interface files `data.mli` and `engine.mli` that declare the modules,
names, and types you must implement. There are template implementations,
`data.ml` and `engine.ml`, provided that you will need to complete.
* A template for testing spread across two compilation units,
`Test_data` and `Test_engine`, and a driver file `test_main.ml`.
You will need to complete the `.ml` files.
* A couple test directories, `test1/` and `test2/`, containing some
sample documents to index.
* A file `.ocamlinit` that will automatically load some code into `utop`
each time you launch it. This should help with interactive testing.
Feel free to modify this file.
* Some additional scripts for compiling your code.
**Software note.**
The `Makefile` relies on the QuickCheck package. You can install it
by issuing the following command from the terminal:

```
opam install qcheck
```
* **Late submissions:** Carefully review the course policies on
[submission and late assignments][late]. Verify before the deadline on CMS
that you have submitted the correct version.
* **Environment, names, and types:** You are required to adhere to the names and
types of the functions and modules specified in the release code. Your
solution must pass the `make check` described below in Part 0.
Otherwise, your solution will receive minimal credit.
* **Code style:** Refer to the [CS 3110 style guide][style].
Ugly code that is functionally correct will nonetheless be penalized.
Take extra time to think and find elegant solutions.
**Prohibited OCaml features.**
You may not use imperative data structures, including refs, arrays,
mutable fields, and the `Bytes` and `Hashtbl` modules. Strings are
allowed, but the deprecated (mutable) functions on them in the `String`
module are not. Your solutions may not require linking any additional
libraries/packages beyond those in the provided `_tags` file. You may
and in fact must use I/O functions provided by the `Pervasives` module
and the `Unix` module, even though they cause side effects, to implement
your engine. All functions in the `Str` library (which is
different than the `String` module) may be used, even though some of
them do cause side effects. Your implementations of dictionaries and
sets may not use the standard library `Map` or `Set` implementations.
Your test suite is free to use them, if you wish, for the purpose of
comparing the correctness of your implementations to a "known good"
implementation.
## Part -1: Git
You are required to use [Git][git], a distributed version
control system, whether you are working as an individual or
with a partner. We recommend the following:
1. Do this [Git tutorial][git-tutorial]. Although that
tutorial covers branches, they are an advanced feature
that you do not need to use. For the most part,
the only commands you need to know are `pull`, `add`,
`commit`, and `push`. Atom has built-in features to support Git.
2. Use what you have learned to create a **private** git repo on the
[Cornell CIS GitHub][cisgithub]. Throughout your development of
A3, commit your changes to it. Use those checkins
to provide checkpoints, in case you need to restore
your development to a previous point.
3. Sync with a remote repo to communicate code between you
and your partner, or simply to backup your development
if you are working as an individual.
**You are strongly encouraged to use the [Cornell CIS
GitHub][cisgithub]** rather than the public GitHub (whose URL we
deliberately omit here). The main reason for this is that the CIS
GitHub allows unlimited private repositories.
**Private repos are of the utmost importance.** A public repo would
share your code with the entire world, including your classmates, thus
violating the course policy on academic integrity. Therefore we require
that you keep all your CS 3110 related code in private repos. To create
a private repository, make sure you select the "Private" radio button
when creating a new repository.
## Part 0: Makefile
As usual, we provide a makefile.
* `make check` will check your OCaml environment and the names and types of your required
functions. Run that early and often and always before submitting your solution.
* `make test` or simply `make` will build your OUnit test suite in `test_main.ml` and
  run it.
* `make compile` will compile your `Data` and `Engine` compilation units. This is
meant to be run before starting `utop`. You could also use it when editing
with Atom to re-compile all your source code, so that Atom will pick up on changes
you've made to modules.
* `make zip` will build the zip file containing your source code for submission to CMS.
* `make zipcheck` will check your zip file for a common mistake that people sometimes
make if they create the zip with their OS's file browser instead of with `make zip`.
* `make clean` will delete generated files produced by the make and build systems.
Although testing interactively in `utop` is often quite helpful, you
should be aware that when you `#use` a `.ml` file, that is only
textual inclusion of the file as if you had typed it into the toplevel
yourself. The file is not treated as part of a compilation unit in
that case, so it is not checked against any `.mli` file. You must
compile the file with `make` (or at least with `ocamlbuild`) in order
for the implementation to be checked against the interface. That's
why the provided `.ocamlinit` has `#load` directives rather than
`#use` directives. You will therefore need to recompile your code with
`make compile` each time before launching `utop`.
## Part 1: Read the release code
Your next task is to familiarize yourself with the structure of the
release code we have shipped to you. Read each of the files in
detail, noting carefully the specifications that are provided. A good
order to do that would be `engine.mli`, `engine.ml`, `data.mli`,
`data.ml`, `test_main.ml`, `test_data.mli`, `test_data.ml`,
`test_engine.mli`, `test_engine.ml`. There are close to 500 lines of
code and comments in those files. Make sure to start reading all of
them right away! Don't put this off until the last minute.
**You may not change the names and types in the provided `.mli` files.**
They are the interface you must implement, and they are the interface
against which the course staff will test your submission. You may
add new declarations and definitions. If you're in doubt about whether
a change is permitted or not, just run `make check`: it will tell you
whether you have changed the interface in a prohibited way.
One thing you will notice is that there is a certain amount of copying
of a particular kind of code between interfaces and implementations:
any type declaration must be repeated in both files. This is because
if a `.mli` file contains a type declaration, the `.ml` must contain
that declaration as well to implement the interface—just like if
the `.mli` file declares a value, the `.ml` must define it. Though
this might seem weird, it enables separate compilation of interfaces
from implementations, which is actually a good thing.
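For instance, a hypothetical pair of files (the names and type are invented for illustration) would repeat the declaration like this:

```ocaml
(* In the interface file day.mli (shown here as a comment):

     type day = Mon | Tue
     val next : day -> day

   The implementation file day.ml must repeat the type declaration
   verbatim, in addition to defining the declared value: *)
type day = Mon | Tue

let next = function Mon -> Tue | Tue -> Mon
```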
## Part 2: List engine
As discussed above, the job of your search engine is to index a
directory of files and answer queries about the words appearing in
those files.
For purposes of this assignment, we define a *word* as follows:
* A *whitespace character* is any space, tab, or newline character
(i.e., carriage return or line feed).
* A *preword* is any maximal length sequence of characters in a file
that does not contain any whitespace.
* A *boundary character* is any lowercase letter, uppercase letter,
  or digit.
* A *word* is the maximal length subsequence of a preword that begins
and ends with a boundary character. In between those there may be
any number of any other characters. A preword that contains no boundary
characters does not correspond to any word.
Yes, there will no doubt be some weird corner cases resulting from
this definition of words. But we need a standard definition; this one
is relatively simple for us all to use, and it gets many common cases
right.
For example, given a file containing the following text:
"I would found an institution where
any person can find instruction in
any study." ---E. Cornell (b. 1807)
The words in that file would be: "1807", "an", "any", "b", "can",
"Cornell", "E", "find", "found", "I", "in", "institution",
"instruction", "person", "study", "where", "would".
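Under this definition, a sketch of word extraction might look as follows. This is one possible approach, not the required one; it assumes boundary characters are exactly ASCII letters and digits, and for brevity it splits prewords on spaces only (a full solution must also split on tabs and handle newlines):

```ocaml
(* [is_boundary c] is true if [c] is an ASCII letter or digit. *)
let is_boundary c =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')

(* [word_of_preword pw] is [Some w], where [w] is the maximal
   subsequence of [pw] that begins and ends with a boundary character,
   or [None] if [pw] contains no boundary character. *)
let word_of_preword pw =
  let n = String.length pw in
  let rec first i =
    if i >= n then None
    else if is_boundary pw.[i] then Some i else first (i + 1) in
  let rec last i =
    if i < 0 then None
    else if is_boundary pw.[i] then Some i else last (i - 1) in
  match first 0, last (n - 1) with
  | Some i, Some j -> Some (String.sub pw i (j - i + 1))
  | _ -> None

(* [words_of_line line] is the list of words in [line], in order,
   splitting prewords on spaces only (a simplification). *)
let words_of_line line =
  let prewords =
    List.filter (fun s -> s <> "") (String.split_on_char ' ' line) in
  List.rev (List.fold_left (fun acc pw ->
      match word_of_preword pw with
      | Some w -> w :: acc
      | None -> acc) [] prewords)
```

For example, the preword `(b.` yields the word `b`, and a preword such as `---` yields no word at all.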
Searches must be case insensitive. For example, searching for either
"Cornell" or "cornell" should return the file in the example immediately
above.
**Directories and files.**
* Anywhere you need to use a path separator in your code, use
`Filename.dir_sep` rather than hard-coding a forward or backward slash;
this will ensure that your code works on Windows and Unix platforms.
* Never change the case of any file or directory names that the user
passes to your engine or that the operating system returns to your
engine. Some underlying file systems are case sensitive whereas
others are case insensitive, so your code should preserve whatever
case it receives.
* From the engine's perspective, the directory name is part of the
file name. For example, if the name of the directory passed to
`index_of_dir` is `../myTests`, and if that directory contains a file
named `Atest.txt`, and that file contains the word `abracadabra`, then a
search for `abracadabra` should return `../myTests/Atest.txt`. It would
be incorrect to return just `Atest.txt`.
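For instance (a small sketch, with `file_in_dir` being an invented helper name), `Filename.concat` builds such paths portably:

```ocaml
(* [file_in_dir dir file] is the path to [file] inside [dir], joined
   with the platform's separator ([Filename.dir_sep]) rather than a
   hard-coded slash. *)
let file_in_dir dir file = Filename.concat dir file

(* On Unix this is "../myTests/Atest.txt". *)
let path = file_in_dir "../myTests" "Atest.txt"
```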
**What to do.**
Implement `Data.MakeListDictionary`, `Data.MakeSetOfDictionary`, and
`Engine.ListEngine`. Note that you might not need every one of the
functions in `Dictionary` or `Set` to implement your engine, but they
are required functions nonetheless. Also implement `Test_data.tests`
to provide OUnit tests for your dictionary and set data structures,
and `Test_engine.tests` to provide OUnit tests for your search
engine. When you are done you will have a complete, working search
engine. But its performance will be rather slow, because the data
structures are not very efficient. That's okay for now.
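For `Data.MakeSetOfDictionary`, the usual trick is to represent a set as a dictionary whose values carry no information. Here is a minimal sketch with invented signatures (the release code's `Dictionary` signature differs):

```ocaml
(* Hypothetical minimal dictionary signature, for illustration only. *)
module type DICT = sig
  type ('k, 'v) t
  val empty : ('k, 'v) t
  val insert : 'k -> 'v -> ('k, 'v) t -> ('k, 'v) t
  val member : 'k -> ('k, 'v) t -> bool
end

(* A set of ['a] is a dictionary from ['a] to [unit]: the keys are the
   elements, and the values are uninformative. *)
module MakeSet (D : DICT) = struct
  type 'a t = ('a, unit) D.t
  let empty = D.empty
  let add x s = D.insert x () s
  let mem x s = D.member x s
end
```

The payoff of this design is that any improvement to the dictionary (say, swapping lists for AVL trees) immediately improves the set as well.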
Here are some implementation hints:
* For processing directories, you will find the [`Unix` module][unix]
  helpful, in particular the `opendir`, `readdir`, and `closedir`
  functions.
* For processing files, you will find the [`Pervasives` module][pervasives]
helpful, in particular the `open_in`, `input_line`, and `close_in`
functions. The `input_line` function will handle newlines for you,
  so that you don't have to worry about stripping out those particular
  characters.
* You may assume that the files you process are encoded in ASCII.
As a source of test files, start off by creating your own small files.
Later, we recommend [Project Gutenberg][pg], from which you can
download large text files. Those are often encoded in UTF-8. In most
Unix systems, including the 3110 VM, you can convert UTF-8 to ASCII
with the following command:
  `iconv -f UTF-8 -t ASCII -c in.txt >out.txt`
where `in.txt` is the name of the input UTF-8 file and
`out.txt` is the name of the output ASCII file. We have already
  performed that conversion on the files we released to you in the `test2/`
  directory.
* Given the kinds of files that motivate this assignment (plain text books
from Project Gutenberg), it would be reasonable to assume that lines
of files are not overly long, but that there might be many lines in a file.
* To parse a line of a file into its component words, you can either
directly implement it yourself using the `String` module's (non-deprecated)
functions, or you can implement it with regular expressions using the
`Str` module. The latter is recommended but certainly not required.
Regular expressions are a kind of small programming language; you would
  find it immensely useful to learn them if you haven't picked them up
  somewhere already. Here's a good tutorial: <http://regexone.com/>
* Several of the signatures specify a `format` function suitable for
use with the toplevel's `#install_printer` directive; see the
[notes on modules][modules-notes] for more information. The release code
provides trivial implementations of these. You are not required to
improve those trivial implementations. In particular, the course
staff will not test your implementations of these `format`
functions. But you would find it exceedingly helpful to implement
them anyway. Finishing those `format` functions will enable the
toplevel to print useful representations of values of abstract
types, making it far easier to do interactive testing and debugging.
* The release code signatures require you to provide `rep_ok` and
`to_list` functions for each data abstraction. Use these to your
advantage when developing and debugging your code! You can think of
`to_list` as implementing a kind of abstraction function, as does `format`.
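For instance, the accumulate-until-`End_of_file` pattern for reading a file with `Pervasives` looks roughly like this (the `Unix` directory functions `opendir`, `readdir`, and `closedir` follow the same pattern, with `readdir` likewise raising `End_of_file` when the directory is exhausted):

```ocaml
(* [lines_of_file file] is the list of lines of [file], in order.
   [input_line] strips each trailing newline for us, and reading stops
   when [input_line] raises [End_of_file]. *)
let lines_of_file file =
  let ic = open_in file in
  let rec loop acc =
    match input_line ic with
    | line -> loop (line :: acc)
    | exception End_of_file -> close_in ic; List.rev acc
  in
  loop []
```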
## Part 3: Test harness
As part of developing your data structures, you naturally will be
constructing test cases that demonstrate the (in)correctness of your
dictionaries. Let's take that one step further. The course staff will
develop several faulty implementations of `Dictionary`. Your task is
to construct a suite of test cases that finds all our faults (i.e.,
bugs). The faults we introduce will be in one or more of `insert`,
`find`, and `remove`. Of course, your tests are free to exercise
other functions (which will be implemented correctly) in an effort to
find the faults. The faults we introduce will be related to
correctness, not performance. They will cause incorrect values to be
returned; they will not raise exceptions.
The test cases you write should be black-box tests against the
`Dictionary` interface. You don't know what representation types we
might use for our faulty implementations, so focus instead on the
behavior specified in the comments of that interface.
The functor `Test_data.DictTester` is where you should implement your
test harness. The course staff will instantiate that functor on our
own incorrect `Dictionary` implementations, as well as some correct
implementations. The functor will produce your list of OUnit test
cases. The staff's correct implementations must pass all those cases,
and the staff's incorrect implementations must each fail at least one
of them.
Your test harness should not rely on the presence of any directories
of documents, including `test1/` and `test2/`: those are for testing
of your engine, not of your dictionaries per se.
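Structurally, a black-box harness is a functor over the dictionary signature. Here is a minimal sketch using an invented signature and plain assertions; your `Test_data.DictTester` will instead use the release code's `Dictionary` signature and produce OUnit test cases:

```ocaml
(* Hypothetical minimal dictionary signature, for illustration only. *)
module type DICT = sig
  type t
  val empty : t
  val insert : int -> string -> t -> t
  val find : int -> t -> string option
  val remove : int -> t -> t
end

(* The harness knows nothing about the representation type: it checks
   only behavior observable through the interface. *)
module MakeTester (D : DICT) = struct
  let run () =
    assert (D.find 1 D.empty = None);
    assert (D.find 1 (D.insert 1 "a" D.empty) = Some "a");
    (* a later insert for the same key shadows the earlier binding *)
    assert (D.find 1 (D.insert 1 "b" (D.insert 1 "a" D.empty)) = Some "b");
    assert (D.find 1 (D.remove 1 (D.insert 1 "a" D.empty)) = None)
end
```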
To ease your mind, you need not worry about maliciously incorrect
`Dictionary` implementations. By that we mean that our incorrect
implementations will be of the sort that a programmer might create by
accident or through misunderstanding, rather than being coded to fail only
on negligibly-likely inputs—for example, only when the number
`42` is inserted.
## Part 4: Tree engine
Tackle this part of the assignment only after you have completed the
previous parts.
The search engine you have implemented now works, but it is
inefficient: you will notice that trying to index and search the
provided `test2/` directory, which is not overly large, takes a
significant amount of time. The problem is that a list-based
implementation of dictionaries and sets, while being relatively easy
to implement (compared to what you're about to do next), offers poor
performance: many operations are linear time.
*Balanced search trees* offer much better performance. Their
operations are generally logarithmic time. An *AVL tree* is a kind of
balanced search tree in which, for every node, the heights of its two
subtrees differ by at most one. Let's use AVL trees
to implement an efficient dictionary.
To learn about AVL trees, read [this handout][avltree] by Prof. Lyn
Turbak at Wellesley College. **Read the handout. Seriously.**
Following that article, implement `Data.MakeTreeDictionary` with an
AVL tree data structure.
In `data.mli`, we provide a type `('k,'v) avl` to represent AVL trees.
This must be your representation type for `MakeTreeDictionary`,
as already coded in `data.ml`. We provide this type not out of a desire
to restrict your creativity, but because in the past students have
been led astray by inventing far more complicated representation
types. Really, this type is all you need to represent an AVL tree.
The function `Data.Dictionary.expose_tree` exposes the representation
of a dictionary as an AVL tree to clients of that data abstraction. This
is not the best possible design choice for a library, but we made it
so that the course staff could directly examine the trees your
algorithms create. There are various places in the AVL tree
algorithms where nondeterminism is possible: for example, there may be
more than one valid tree that could result from inserting a new binding. It is up
to the course staff to build a test suite that allows any of those
possible choices; you are free to choose any one of them.
The `insert` operation can be implemented by directly following the
algorithm described in the handout, using rotations to rebalance the
tree as needed after an insertion.
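For intuition, here is what a single rebalancing rotation looks like on a hypothetical tree type (the release code's `('k,'v) avl` type differs; this sketch stores each node's height in the node):

```ocaml
(* Hypothetical AVL node: a leaf, or Node (height, left, key, value, right). *)
type ('k, 'v) tree =
  | Leaf
  | Node of int * ('k, 'v) tree * 'k * 'v * ('k, 'v) tree

let height = function Leaf -> 0 | Node (h, _, _, _, _) -> h

(* Smart constructor that recomputes the stored height. *)
let node l k v r = Node (1 + max (height l) (height r), l, k, v, r)

(* [rotate_left t] rebalances a node whose right subtree is too tall by
   promoting the right child; an unrotatable tree is returned unchanged. *)
let rotate_left = function
  | Node (_, l, k, v, Node (_, rl, rk, rv, rr)) ->
    node (node l k v rl) rk rv rr
  | t -> t
```

A symmetric `rotate_right` handles the left-heavy case, and the double rotations described in the handout are compositions of these two.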
The `remove` operation is the hardest part of implementing AVL trees,
so you could make that the very last thing you finish. Even though
this operation accounts for about 20% of the code you might need to
write, we expect `remove` on AVL trees to end up being worth only
about 5% of your score on the entire assignment. So even though it is
the hardest part of the assignment, you don't have to do it at all to
get a high score. You must implement the `remove` algorithm that is
described in the AVL tree handout, which has logarithmic time
performance. Linear-time algorithms (e.g., convert the tree to a
list, remove the element from the list, then create a new tree out of
that list) will receive zero points.
When your AVL tree implementation is complete, try running your search
engine on many large documents. How efficient is it? As a point of
comparison, one (not particularly optimized) staff solution is able to
index `test2/` in about 2 minutes with the list engine, and in about
2.5 seconds with the tree engine.
## What to turn in
Submit files with these names on [CMS][cms]:
* `a3src.zip`, containing your solution code.
* `overview.pdf`, containing your overview document. We also accept `.txt` and
  `.md` formats.
* `gitlog.txt`, containing your Git log.
The sizes of the above files are limited in CMS to 1 MB, which is driven
in part by the fact that there are 300+ students in this course each
submitting many files. Please stay within the size allotted. In
particular, the purpose of `a3src.zip` is **not** for you to submit test
cases with large files. Feel free to describe such test cases in your
overview document, but do not submit them. Rest assured that the course
staff will test your submission on our own large files!
**To produce gitlog.txt for submission**: Run the following
command in the directory containing your source code (e.g., `test_main.ml`):
$ git log --stat > gitlog.txt
The course staff will assess the following aspects of your solution.
The percentages shown are the approximate weight we expect each to receive.
* **Correctness.** [50%] Does your solution compile against and correctly pass
the course staff's test cases?
As a rough guide, here is a further breakdown of the approximate
weight we expect to assign to the parts of our test suite: search
engine: 40%; list dictionary: 15%; tree dictionary: 35%; sets: 10%.
* **Testing.** [10-15%] Does your test harness (i.e., `Test_data.DictTester`)
work properly? That is, do all of its test cases pass when given a
correct dictionary implementation? And does at least one test case
fail when given an incorrect implementation? Beyond your dictionary
test harness, we will not assess the rest of your unit testing in this
assignment. But please do not take that as an invitation to stop testing:
automated unit tests are the best way to ensure that your implementation
is correct, thereby maximizing the partial credit you get for Correctness.
* **Code quality.** [10-15%] How readable and elegant is your source code? How
readable and informative are your specification comments? All functions,
including helper functions, must have specification comments.
* **Scalability.** [5-10%] Does your engine scale well enough to index and search
a larger collection of books than those provided in the release
code—for example, a few megabytes? Both your tree and list
engines should use stack space that is at most \\(O(\log w)\\), where
\\(w\\) is the number of words being indexed. Your list engine will
naturally perform worse than your tree engine, because many operations
on lists will require \\(O(w)\\) time, whereas those same operations on
trees will instead require \\(O(\log w)\\) time.
* **Modular programming.** [5-10%] Did you document an AF and RI for
each data abstraction? Did you implement `rep_ok`, and does it check
the entire RI? It's okay if you don't actually apply `rep_ok`
defensively inside your operations for the sake of efficiency, but we
do want to see that you implemented it and hence could use it for
debugging if necessary.
* **Git.** [<5%] Did you submit a git log as required?
* **Overview.** [10%] Did you complete the overview document? Does it provide
all the required information?
**Karma.**
How well does your search engine scale to large inputs? Make your
implementation as efficient as possible while still being correct. We'll
test on increasingly larger directories to see how quickly your engine
indexes them. If there's a clear winner, we'll award up to 10% bonus
points (beyond mere karma). Last year, the winner didn't attempt
anything unusual: they just wrote really clean code that avoided doing
any unnecessary work.
The search engine you built in this assignment doesn't have its own
interface; instead, it relies on the toplevel. Build your own interface
instead. It might be a simple textual interface or a graphical user
interface, the latter being possible with [LablGtk]. It is fine for
the purpose of implementing this functionality to provide additional
instructions to the grader about packages to install, additional build
commands to run, etc. But the code you submit must still pass `make
check` and obey the original assignment specifications, so that your
non-karma functionality can be graded with the rest of the submissions.
(Sorry; in a 300+ student class, automation of grading is a requirement.)
* * *
**Acknowledgement:** Adapted from Prof. Greg Morrisett, Dean of Cornell CIS.