A3: Index

# A3: Search **Deadline:** Wednesday, 10/05/16, 11:59 pm *This assignment may be done as an individual or with one partner. Sharing of code is permitted only between you and your partner; sharing among larger groups is prohibited. You can use the "Search for Teammates" feature on Piazza to find a partner.* **Late submissions.** Fall Break begins on 10/08/15. You are in no way required to work on A3 over Fall Break. But you may continue to work through the Saturday and Sunday of Fall Break with the usual late penalties if you so choose. **Warning about academic integrity.** This assignment involves implementing some data structures. In the vast expanses of the Internet, there might be an implementation of every known data structure in every known programming language. So it's conceivable you could just Google for the solutions to certain parts of this assignment. Well, we've already done that ourselves and put any results in MOSS, the plagiarism detection software we use. If you do steal code from somewhere and fail to cite it, that will be regarded as an especially heinous violation of academic integrity. If you do cite, the grade penalty will be severe. (Though of course not as severe as violating the code of academic integrity.) **Begin by studying this writeup.** [Read every word][read]. At this point in your programming education, it's important that you take ownership of interpreting specifications. If the writeup does not specify some functionality that you have a question about, and if the specifications in the release code also do not, then that functionality is *unspecified*: you should choose a reasonable answer to your question, record that choice in your overview document, and implement something appropriate. If the writeup or the release code does specify some functionality, then you should know that; although the course staff is happy to clarify things you don't understand, it is **your** responsibility to read and be aware of all the information in the writeup and the release code. [read]: readAllTheWords.jpg ## Overview In this assignment, you will develop a search engine for text documents. The prototypical kind of document you should have in mind is a book from [Project Gutenberg][pg]. Though that project offers books in many formats, the ones you should look at are the Plain Text versions, such as the one labeled "Plain Text UTF-8" of [*Alice's Adventures in Wonderland*][alice] by Lewis Carroll. [pg]: https://www.gutenberg.org/ [alice]: http://www.gutenberg.org/files/11/11.txt Your engine will *crawl* through a directory on a local disk looking for documents. When it finds them, it will *index* the words the appear in those documents. Then it will answer *queries* posed by users. A query will involve words that might appear in the documents, and the response to a query will be a list of documents that involve those words. The index maintained by your search engine will make use of lists, trees, dictionaries, and sets. To get your engine working at first, you will implement a simple dictionary based on association lists. That implementation will be too inefficient to index large documents, so you will later implement a dictionary using a data structure called a *2-3 tree*. A 2-3 tree is a *balanced search tree*, meaning that the tree automatically reorganizes itself to keep its height small, thus improving the performance of operations on the tree. The 2-3 tree data structure was invented by **our own Prof. John Hopcroft** in 1970. You will likewise implement sets at first with an inefficient list representation, then later with an efficient 2-3 tree representation. The queries that users pose will have one of two forms. Abstractly those two forms are as follows, in which the NOT clause is always optional: * "and-not" queries: AND (*w1*, *w2*, ..., *wn*), NOT (*u1*, *u2*, ... *um*) * "or-not" queries: OR (*w1*, *w2*, ..., *wn*), NOT (*u1*, *u2*, ... *um*) For example, "AND (far, away)" would return all documents that contain both the words "far" and "away", whereas "AND (far, away), NOT (day, night)" would return all documents that do contain both "far" and "away" but do not contain "day" nor "night". Likewise, "OR (all, nothing)" would return any document that contains "all" or "nothing" (or both), and "OR (all, nothing), NOT (between)" would return any document that contains "all" or "nothing" (or both) but does not contain "between". You will also build a test suite for your search engine, and as part of that test suite, you will build a *test harness* that could be used to test any dictionary, not just your own. The course staff will run your test harness against some buggy dictionary implementations to see whether your harness detects the bugs in them. All of the above tasks will require extensive use of the OCaml module system. The release code mostly just contains a few interface files; your job is to write implementation files for those interfaces. **You may not change the provided interface files**, as they constitute an extended specification of the names and types your source code must define. ## Objectives * Use the OCaml module system, including structures, signatures, and functors. * Focus on testing by writing a test suite for implementations other than your own. * Implement an efficient data structure using balanced search trees. * Practice writing programs in the functional style using immutable data. * Use Git, a distributed version control system. ## Recommended reading * [Lectures and recitations 7–12][web] (as they become available) * [Handout on 2-3 trees][23tree] (also available from [CMS][cms]) * [Git tutorial][git-tutorial] [web]: http://www.cs.cornell.edu/Courses/cs3110/2016fa/ ## Requirements The primary requirements are the following: 1. You must implement the document crawler, indexer, and querier. 2. You must implement two dictionary data structures, one with association lists and the other with 2-3 trees. 3. You must implement producing a set data structure out of an arbitrary dictionary data structure. 4. You must build a test harness for dictionaries. 5. Your code must be written with good style and be well documented. All functions, except for nested helper functions, must have specification comments. Representation types for data structures must have a documented abstraction function, and any necessary representation invariants must also be documented. 6. Your code must avoid duplication of implementations. There is some duplication of type declarations that is unavoidable; see Part 0 below. 7. You must submit an [overview document][overview]. 8. You must use Git as part of your development process. [overview]: http://www.cs.cornell.edu/Courses/cs3110/2016fa/handouts/overview.html ## What we provide In the release code on the course website you will find these files: * Interface files `data.mli` and `engine.mli` that declare the modules, names, and types you must implement. There are template implementations, `data.ml` and `engine.ml`, provided as well. * A template for testing spread across two compilation units, `Test_data` and `Test_engine`, and a driver file `test_main.ml`. You will need to complete the `.ml` files. * A couple test directories, `test1/` and `test2/`, containing some sample documents to index. * A file `.ocamlinit` that will automatically load some code into `utop` each time you launch it. This should help with interactive testing. Feel free to modify this file. * Some additional scripts for compiling your code. ## What to turn in Submit files with these names on [CMS][cms]: * `data.ml`, `engine.ml`, `test_data.ml`, `test_engine.ml`, and `test_main.ml`, containing your solution code. * `overview.pdf` or `overview.txt`, containing your overview document. * `gitlog.txt`, containing your Git log. * `test.zip`, an optional file, which may contain subdirectories of documents as test cases. The sizes of the above files are limited in CMS to 1 MB, which is driven in part by the fact that there are 350 students in this course each submitting many files. Please stay within the size alloted. In particular, the purpose of `test.zip` is **not** for you to submit test cases with large files. Feel free to describe such test cases in your overview document, but do not submit them. Rest assured that the course staff will test your submission on our own large files! [cms]: https://cms.csuglab.cornell.edu/ **To produce gitlog.txt for submission**: Run the following command in the directory containing your source code (e.g., `test_main.ml`): ``` $ git log --stat > gitlog.txt ``` ## Git You are required to use [Git][git], a distributed version control system, whether you are working as an individual or with a partner. To get started... 1. Do this [Git tutorial][git-tutorial]. Although that tutorial covers branches, they are an advanced feature that you do not need to use. 2. Use what you have learned to create a git repo on the [Cornell CIS GitHub][cisgithub]. Throughout your development of A3, commit your changes to it. Use those checkins to provide checkpoints, in case you need to restore your development to a previous point. 3. Synch with a remote repo to communicate code between you and your partner, or simply to backup your development if you are working as an individual. **You are strongly encouraged to use the [Cornell CIS GitHub][cisgithub]** rather than the public GitHub (whose URL we deliberately omit here). The main reason for this is that the CIS GitHub allows unlimited private repositories. **Private repos are of the utmost importance.** A public repo would share your code with the entire world, including your classmates, thus violating the course policy on academic integrity. Therefore we require that you keep all your CS 3110 related code in private repos. To create a private repository, make sure you select the "Private" radio button when creating a new repository. [git]: https://git-scm.com/ [git-tutorial]: https://try.github.io/ [cisgithub]: https://github.coecis.cornell.edu/ ## Grading issues * **Late submissions:** Carefully review the course policies on [submission and late assignments][late]. Verify before the deadline on CMS that you have submitted the correct version. * **Environment, names, and types:** Your solution must compile and run under OCaml 4.03.0. You are required to adhere to the names and types provided in the released `.mli` files. Your solution must pass the `make check` described below in Part 0. Otherwise, your solution will receive minimal credit. * **Code style:** Refer to the [CS 3110 style guide][style]. Ugly code that is functionally correct will nonetheless be penalized. Take extra time to think and find elegant solutions. [late]: http://www.cs.cornell.edu/Courses/cs3110/2016fa/syllabus.php#late [style]: http://www.cs.cornell.edu/Courses/cs3110/2016fa/handouts/style.html ## Prohibited OCaml features You may not use imperative data structures, including refs, arrays, mutable fields, and the `Bytes` and `Hashtbl` modules. Strings are allowed, but the deprecated (mutable) functions on them in the `String` module are not. Your solutions may not require linking any additional libraries/packages beyond OUnit, Str, and Unix. You may and in fact must use I/O functions provided by the `Pervasives` module and the `Unix` module, even though they cause side effects, to implement your engine. Your implementations of dictionaries and sets may not use the standard library `Map` or `Set` implementations. Your test suite is free to use them, however, for the purpose of comparing the correctness of your implementations to a "known good" implementation. ## Part 0: Sanity check Download the release code. Run `make check` from the folder containing the release code. Read the entire output carefully. You should not get any warnings about your environment or your `.mli` files at this point. As you complete the assignment, you should periodically re-run `make check` to ensure that you haven't changed any of the required names or types. Although that script does its best to determine whether your names and types are correct, in the end it's your responsibility to make sure that you've adhered to the required names and types as declared in the released `.mli` files. Running `make test` or simply `make` will build your OUnit test suite in `test_main.ml` and run it. Running `make clean` will delete generated files produced by the make and build systems. Although testing interactively in utop is often quite helpful, you should be aware that when you `#use` a `.ml` file, that is only textual inclusion of the file as if you had typed it into the toplevel yourself. The file is not treated as part of a compilation unit in that case, so it is not checked against any `.mli` file. You must compile the file with `make` (or at least with `ocamlbuild`) in order for the implementation to be checked against the interface. That's why the provided `.ocamlinit` has `#load` directives rather than `#use` directives. You will therefore need to recompile your code each time before launching utop. ## Part 1: Understand the codebase Your preliminary task is to familiarize yourself with the structure of the release code we have shipped to you. Read each of the files in detail, noting carefully the specifications that are provided. A good order to do that would be `engine.mli`, `engine.ml`, `data.mli`, `data.ml`, `test_main.ml`, `test_data.mli`, `test_data.ml`, `test_engine.mli`, `test_engine.ml`. There are close to 500 lines of code and comments in those files. Make sure to start reading all of them right away! Don't put this off until the last minute. **You may not change the provided `.mli` files.** They are the interface you must implement, and they are the interface against which the course staff will test your submission. Do not change even a single character of those files unless you are comfortable with the risk of getting a zero on your correctness score for the assignment. One thing you will notice is that there is a certain amount of copying of a particular kind of code between interfaces and implementations: any type declaration must be repeated in both files. This is because if a `.mli` file contains a type declaration, the `.ml` must supply that declaration as well to implement the interface—just like if the `.mli` file declares a value, the `.ml` must define it. Though this might seem weird, it enables separate compilation of interfaces from implementations, which is actually a good thing. ## Part 2: Implement the search engine with inefficient data structures As discussed above, the job of your search engine is to index a directory of files and answer queries about the words appearing in those files. For purposes of this assignment, we define a *word* as follows; make sure to **read this definition carefully**: * A *whitespace character* is any space, tab, or newline character (i.e., carriage return or line feed). * A *preword* is any maximal length sequence of characters in a file that does not contain any whitespace. * A *boundary character* is any lowercase letter, uppercase letter, or digit. * A *word* is the maximal length subsequence of a preword that begins and ends with a boundary character. In between those there may be any number of any other characters. <span style="color:green;"> [CLARIFICATION 09/30/16] A preword that contains no boundary characters does not correspond to any word.</span> Yes, there will no doubt be some weird corner cases resulting from this definition of words. But we need a standard definition; this one is relatively simple for us all to use, and it gets many common cases right. For example, given the following file: ``` "I would found an institution where any person can find instruction in any study." ---E. Cornell (b. 1807) ``` The words in that file would be: "1807", "an", "any", "b", "can", "Cornell", "E", "find", "found", "I", "in", "institution", "instruction", "person", "study", "where", "would". **Searching.** Searches must be case insensitive. For example, searching for either "Cornell" or "cornell" should return the file in the example immediately above. **Directories and files.** Note that directory names and filenames are case sensitive, and your implementation must get that case right both in how it interacts with the filesystem and in the results your search engine returns. For example, if the name of the directory passed to `index_of_dir` is `../myTests`, and if that directory contains a file named `Atest.txt`, and that file contains the word `abracadabra`, then a search for that word should return the file named `../myTests/Atest.txt`. The directory name is a required component of the file name. It would be incorrect to return `Atest.txt` or `mytests/atest.txt`. Anywhere you need to use a path separator in your code, use `Filename.dir_sep` rather than hard-coding a forward or backward slash; this will ensure that your code works on Windows and Unix platforms. **What to do.** Implement `Data.MakeListDictionary`, `Data.MakeSetOfDictionary`, and `Engine.ListEngine`. Note that you might not need every one of the functions in `Dictionary` or `Set` to implement your engine, but they are required functions nonetheless. Also implement `Test_data.tests` to provide OUnit tests for your dictionary and set data structures, and `Test_engine.tests` to provide OUnit tests for your search engine. When you are done you will have a complete, working search engine. But its performance will be rather slow, because the data structures are not very efficient. That's okay for now. Just make sure that your implementations of any functions on lists are tail recursive, so that your solution is capable of processing large inputs albeit slowly. Here are some implementation hints: * For processing directories, you will find the [`Unix` module][unix] helpful, in particular the `opendir`, `readdir`, and `closedir` functions. * For processing files, you will find the [`Pervasives` module][pervasives] helpful, in particular the `open_in`, `input_line`, and `close_in` functions. The `input_line` function will handle newlines for you, so that you don't have to worry about stripping out those particular whitespace characters. * You may assume that the files you process are encoded in ASCII. As a source of test files, start off by creating your own small files. Later, we recommend [Project Gutenberg][pg], from which you can download large text files. Those are often encoded in UTF-8. In most Unix systems, including the 3110 VM, you can convert UTF-8 to ASCII with the following command: ```iconv -f UTF-8 -t ASCII -c in.txt >out.txt``` where `in.txt` is the name of the input UTF-8 file and `out.txt` is the name of the output ASCII file. We have already performed that conversion on the files we released to you in the `test2/` directory. * Given the kinds of files that motivate this assignment (plain text books from Project Gutenberg), it would be reasonable to assume that lines of files are not overly long, but that there might be many lines in a file. * To parse a line of a file into its component words, you can either directly implement it yourself using the `String` module's (non-deprecated) functions, or you can implement it with regular expressions using the `Str` module. The latter is recommended but certainly not required. Regular expressions are a kind of small programming language; you would find it immensely useful to learn them if you haven't picked it up somewhere already. Here's a good tutorial: [http://regexone.com/]() * Several of the signatures specify a `format` function suitable for use with the toplevel's `#install_printer` directive; see the [notes on modules][modules-notes] for more information. The release code provides trivial implementations of these. You are not required to improve those trivial implementations. In particular, the course staff will not test your implementations of these `format` functions. But you would find it exceedingly helpful to implement them anyway. Finishing those `format` functions will enable the toplevel to print useful representations of values of abstract types, making it far easier to do interactive testing and debugging. * The release code signatures require you to provide `rep_ok` and `to_list` functions for each data abstraction. Use these to your advantage when developing and debugging your code! You can think of `to_list` as implementing a kind of abstraction function, as does `format`. [unix]: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Unix.html [pervasives]: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Pervasives.html [modules-notes]: http://www.cs.cornell.edu/Courses/cs3110/2016fa/l/07-modules/notes.html [regex-tutorial]: http://regexone.com/ ## Part 3: Test harness As part of developing your data structures, you naturally will be constructing test cases that demonstrate the (in)correctness of your dictionaries. Let's take that one step further. The course staff will develop several buggy implementations of `Dictionary`. Your task is to construct a suite of test cases that finds all our bugs. The bugs we introduce will be in one or more of `insert`, `find`, and `remove`. Of course, your tests are free to exercise other functions in an effort to find the bugs. The test cases you write should be black box tests against the `Dictionary` interface. You don't know what representation types we might use for our buggy implementations, so focus instead on the behavior specified in the comments of that interface. The functor `Test_data.DictTester` is where you should implement your test harness. The course staff will instantiate that functor on our own incorrect `Dictionary` implementations, as well as some correct implementations. The functor will produce a list of OUnit test cases. Correct implementations must pass all those cases, and incorrect implementations must fail at least one test case. Note that your test harness should not rely on the presence of any of the additional documents you submit in `test.zip`: those documents are for testing of your engine, not of your dictionaries per se. To ease your mind, you need not worry about maliciously incorrect `Dictionary` implementations. By that we mean that our incorrect implementations will be of the sort that a programmer might create on accident or by misunderstanding, rather than being coded to fail only on negligibly-likely inputs—for example, only when the number `42` is inserted. ## Part 4: Implement efficient data structures Tackle this part of the assignment only after you have completed the previous parts. The search engine you have implemented now works, but it is inefficient: you will notice that trying to index and search the provided `test2/` directory, which is not overly large, takes a significant amount of time. The problem is that a list-based implementation of dictionaries and sets, while being relatively easy to implement (compared to what you're about to do next), offers poor performance: many operations are linear time. *Balanced search trees* offer much better performance. Their operations are generally logarithmic time. A *2-3 tree* is a kind of balanced search tree data structure in which nodes may have 2 or 3 children. Let's use them to implement an efficient dictionary. To learn about 2-3 trees, read [this handout][23tree] from Prof. Lyn Turbak at Wellesley College. You'll find a cached copy on [CMS][cms]. **Read every word of that handout. Seriously.** Following that handout, implement `Data.MakeTreeDictionary` with a 2-3 tree data structure. [23tree]: http://cs.wellesley.edu/~cs230/fall04/2-3-trees.pdf When your 2-3 tree implementation is complete, try running your search engine on many large documents. How big can you go within about a minute? Report some results in your overview document. Note that if you find the assignment sufficiently challenging that you have trouble completing this part of it, here are some hints to improve your partial credit: * A working engine probably doesn't need all the functions provided in the `Dictionary` interface. * The `remove` operation is the hardest part of implementing 2-3 trees, so you could make that the very last thing you finish. ## Karma <img src="camel_orange.png" height="40" width="40"> <img src="camel_black.png" height="40" width="40"> <img src="camel_black.png" height="40" width="40"> **Scale up:** How well does your search engine scale to large inputs? Make your implementation as efficient as possible while still being correct. We'll test against progressively larger input directories and files with a time limit of about 60 seconds to do the indexing. If there's a clear winner, we'll award up to 10% bonus points (beyond mere karma). <img src="camel_orange.png" height="40" width="40"> <img src="camel_black.png" height="40" width="40"> <img src="camel_black.png" height="40" width="40"> — <img src="camel_orange.png" height="40" width="40"> <img src="camel_orange.png" height="40" width="40"> <img src="camel_orange.png" height="40" width="40"> **Interface:** The search engine you built in this assignment doesn't have its own interface; instead, it relies on the toplevel. Build your own interface instead. It might be a simple textual interface or a graphical user interface, the latter being possible with [LablGtk][]. It is fine for the purpose of implementing this functionality to provide instructions to the grader of the following form: "install certain OPAM packages, delete engine.mli, uncomment certain lines of engine.ml, and run a modified build command." But the code you submit must still pass `make check` before any of those modifications and obey the original assignment specifications, so that your non-karma functionality can be graded with the rest of the submissions. (Sorry; in a 350-student class, automation of grading is a requirement.) [LablGtk]: http://lablgtk.forge.ocamlcore.org/ * * * **Acknowledgement:** Adapted from Prof. Greg Morrisett, Dean of Cornell CIS.