A4: Search
In this assignment and the next, you will develop a search engine for text documents. Your engine will crawl through a directory on a local disk looking for documents. When it finds them, it will index the words that appear in those documents. Then it will answer queries posed by users.
In A4, you will build an working but inefficient version of the search engine. In A5, you will upgrade the data structures involved to make the engine scale up to large collections of documents.
This assignment is more difficult than A3. On a similar assignment last year, the mean hours worked was 13.2, and the standard deviation was 6.6. Please get started right away and make steady progress each day. Please track the time that you spend. We will ask you to report it.
What you’ll do: Implement some functors, and use a testing tool called Bisect.
Objectives:
- Learn more about the OCaml module system.
- Implement two data structures, dictionaries and sets.
- Practice documenting and implementing abstraction functions and representation invariants.
- Gain experience with glass-box testing.
- Optionally, learn about regular expressions.
Table of contents:
- Step 1: Form a CMS Partnership
- Step 2: Get Started
- Step 3: Dictionaries
- Step 4: Sets
- Step 5: Index
- Step 6: Query
- Step 7: Bisect
- Rubric
- Submission
Step 1: Form a CMS Partnership
We recommend that you have a partner, but it is not required. Having a partner is not needed to complete the assignment: it is definitely do-able by one person. Nonetheless, working with another person is useful because it gives you a chance to bounce ideas off another person, and to get their help with fixing faults in your code. See the discussion of “pair programming”, below, for more details.
→ If you do have a partner, you must read the Partner Policy. ←
The deadline to form a CMS partnership is Friday, October 4, at 11:59 pm. There will be a short grace period. Then the assignment will close for a few hours. On Saturday, October 5, we will re-open the assignment, but you will no longer be able to form new partnerships. Instead, you will have to email one of your section TAs, cc’ing your intended partner. The TA can then manually put the two of you into a CMS partnership. However, there will be a penalty for late partner formation: -5 points on Saturday and Sunday, and -20 points on Monday. That penalty applies to both partners (no exceptions). Starting on Tuesday at 12:01 am, no new partnerships may be formed. When we re-open the assignment on Saturday, it will be changed to have the usual Wednesday 5 pm deadline with the usual 48-hour automatic extension policy.
Why have this early partner formation deadline, and the penalties? It’s to make sure you are working with your partner on the entire assignment, as required by the Partner Policy, rather than joining forces part way through.
If you want to split up with your partner before the final assignment deadline,
you must acknowledge their influence on your work in the Authors
interface.
You may form a new partership, subject to the deadlines and penalties stated
above.
Pair Programming. If you have a partner, you should be pair programming with them. Pair programming is a specific way for two people to write code together, and for both of them to own the result. Please watch this video, which explains the driver and navigator model of pair programming:
If you’d optionally like to read more about the benefits of pair programming, Strengthening the Case for Pair Programming by Williams et al. (2000) is a good place to start. It includes this quote:
For years, programmers in industry have claimed that by working collaboratively, they have produced higher-quality software products in shorter amounts of time. But their evidence was anecdotal and subjective: “It works” or “It feels right.” To validate these claims, we have gathered quantitative evidence showing that pair programming—two programmers working side by side at one computer on the same design, algorithm, code, or test—does indeed improve software quality and reduce time to market. Additionally, student and professional programmers consistently find pair programming more enjoyable than working alone. Yet most who have not tried and tested pair programming reject the idea as a redundant, wasteful use of programming resources: “Why would I put two people on a job that just one can do? I can’t afford to do that!” But we have found, as Larry Constantine wrote, that “Two programmers in tandem is not redundancy; it’s a direct route to greater efficiency and better quality.”
Step 2: Get Started
Download the release code. There are many files, and you might need to read most of them before the assignment is over, but you don’t have to do so yet.
Create a new git repo for this assignment. Make sure the repo is private. Add the release code to your repo. Refer back to the instructions in A1 if you need help with that. Make sure you unzip and copy the files correctly, including hidden dotfiles, as described in the A1 handout.
If you are using a partner, grant them access to the repo:
- Go to the repo’s webpage on the COECIS Github.
- Click “Settings”; on the settings page click “Collaborators”, and type your
partner’s NetID into the search box. Click “Add collaborator”. - Your partner should now click on the GitHub icon on the top left in their own browser. They should see the repo you just created listed on the left as a repo to which they have access.
In the release code, there is a makefile provided with the usual targets: build, test, check, finalcheck, zip, docs, and clean.
We are back to an autograded assignment, hence your submission must pass
make check
. Now that you’re used to working with .mli
files, we have
omitted the comments about what you are allowed to change. Any names
and types that we provide in an interface may not be changed, but you
may of course add new declarations and definitions. If you’re ever in
doubt about whether a change is permitted or not, just run make check
:
it will tell you whether you have changed the interface in a prohibited
way.
Step 3: Dictionaries
The most important data structure that your search engine will need is a dictionary—that is, a mapping from keys to values. In this assignment, we’ll keep it simple and implement a dictionary with an association list (which you can read about in the textbook). In A5, we’ll upgrade to a more efficient implementation.
First, read the interfaces in dictionary.mli
. There are a few
advanced OCaml features used in them that you might not recognize,
so we briefly describe them here:
-
The
Formattable.format
function is a custom printer intended for use by the toplevel. Custom printers are discussed in the textbook in the section on abstract types under the heading “Custom Printers”. -
Some signatures have code like
include Sig with type t := t
. This is a destructive substitution. You don’t need to know how to use this yourself. Essentially, it means that the typet
insideSig
should be replaced with the typet
from the surrounding signature. For more information, see the OCaml manual. -
DictionaryMaker
uses module sharing constaints; we previously saw type sharing constraints in the textbook and in recitation. There’s nothing fundamentally different between those two kinds of sharing constraints: both refine a signature by specifying equations that must hold.
Second, read the interface in listDictionary.mli
. All it does is
include the Dictionary
interface and declare a functor.
Next, complete the code in listDictionary.ml
, following the specifications
and comments provided in the starter code. Using test-driven development, also
implement unit tests for ListDictionary
in test.ml
. We will not be assessing
how many test cases you have until the excellent scope, below, so see there for
details.
Be extra careful to use the comparison operator provided by the KeySig
, not
Stdlib.compare
(nor the built-in comparisons <
, <=
, =
, etc.), when
comparing keys. Otherwise your dictionary will not process keys in the right
order, and you will lose points.
The file exampleDictionary.ml
contains an example of how to create a
dictionary. That file is intended to be loaded in utop with #use
, not to be
compiled with ocamlbuild
.
Note that you need to document and implement an abstraction function and any
representation invariants. The documentation goes above the representation
type. The implementations go in format
and rep_ok
, respectively:
-
There is a helper function
format_assoc_list
provided for you in the starter code; it should be helpful in implementingformat
. You may improve its output in any way you wish. -
How to implement
rep_ok
is discussed in the textbook in the section on implementing representation invariants. You are not required to insert calls torep_ok
throughout yourListDictionary
implementation, but you might find it useful for debugging purposes.
You do not need to have OUnit tests for format
or rep_ok
. Indeed,
it would be hard or perhaps even impossible to write such tests.
Step 4: Sets
Your search engine will also need a data structure for sets. In this
assignment, we’ll use a dictionary to represent a set. The core idea of
this representation is that the elements of the set are the keys in the
dictionary. The values in the dictionary are thus irrelevant, so they
might as well simply be ()
, the unit value. Although there is a bit of
extra space overhead using this idea, because of storing the unit value,
it nonetheless provides a way to turn any dictionary data structure
into a set data structure. We will profit from that in A5, when we will
automatically get an improvement in the performance of our set data
structures by upgrading our dictionary data structures.
Use that idea to implement dictionarySet.ml
, specifically the Make
functor,
following the specifications and comments provided in the starter code. Also
implement unit tests for it in test.ml
. You will want to carefully
read the specifications in dictionarySet.mli
.
This is the stopping point for a satisfactory solution.
Step 5: Index
Now that we have the data structures we need, it’s time to implement the search engine. We’ll start with indexing the words found in files.
Begin by carefully reading the specifications in engine.mli
. Then open
engine.ml
and implement the functions index_of_dir
, words
, to_list
, and
format
in Engine.Make
. It is a functor that produces a search engine out of
dictionaries and sets. The ListEngine
module, which you don’t need to modify,
uses that functor to create a search engine based on the association list
dictionaries you created earlier. Implement tests for ListEngine
in
test.ml
. We provide more instructions and hints, below. Please read all the
instructions below before beginning.
Avoid removal. Don’t use the remove
function from the Dictionary
interface in your implementation of the search engine, neither in this
step nor the next step of the handout. You won’t get points off if you
use it, but, we’re warning you that using it in A4 could make your
A5 implementation much harder.
Files. The prototypical kind of file you should have in mind is a book stored as an ASCII-encoded plain text file, such as this edition of Alice’s Adventures in Wonderland. It would be reasonable to assume that the individual lines of files are not overly long, but that there might be many lines in a file.
We have provided two test directories for you in the release code. One is
alice
, which contains just that story. The other is preamble
, which
contains a couple very short files.
As a source for other test files, we recommend Project Gutenberg. Project Gutenberg files are often encoded in UTF-8 instead of ASCII. On most Unix systems, including the 3110 VM, you can convert UTF-8 to ASCII with the following command:
iconv -f UTF-8 -t ASCII -c in.txt >out.txt
where in.txt
is the name of the input UTF-8 file and
out.txt
is the name of the output ASCII file.
Words: For purposes of this assignment, we define a word as follows:
-
A whitespace character is any space, tab, or newline character (i.e., carriage return or line feed).
-
A preword is any maximal length sequence of characters in a file that does not contain any whitespace.
-
A boundary character is any lowercase letter, uppercase letter, or digit.
-
A word is the maximal length subsequence of a preword that begins and ends with a boundary character. In between those there may be any number of any other characters. A preword that contains no boundary characters does not correspond to any word.
There will no doubt be some weird corner cases resulting from this definition of words. But we need a standard definition; this one is relatively simple for us all to use, and it gets many common cases right.
For example, given a file containing the following text:
"I would found an institution where
any person can find instruction in
any study." ---E. Cornell (b. 1807)
The words in that file would be: 1807, an, any, b, can, Cornell, E, find, found, I, in, institution, instruction, person, study, where, would.
Hint: there are 3755 words in Alice’s Adventures in Wonderland, and after converting them all to lowercase, only 3278 distinct words remain.
Paths and filenames:
-
Anywhere you need to use a path separator in your code, use
Filename.dir_sep
rather than hard-coding a forward or backward slash; this will ensure that your code works on Windows and Unix platforms. -
Never change the case of any file or directory names that the user passes to your engine or that the operating system returns to your engine. Some underlying file systems are case sensitive whereas others are case insensitive, so your code should preserve whatever case it receives.
-
When implementing
index_of_dir
, pay close attention to its specification. Also, do not modify the argumentd
in any way: don’t add any kind of path or punctuation to the beginning or end of it, don’t change its case, don’t add a file extension. Just leave it alone. Otherwise, the autograder will be unable to grade your submission, and you will lose many points. Callingindex_of_dir "alice"
ought to index the Alice In Wonderland story that is provided in that directory of the release code. Likewise, callingindex_of_dir "preamble"
ought to index both the files in that directory of the release code.
Library hints:
-
For processing directories, you will find the
Unix
module helpful, in particular theopendir
,readdir
, andclosedir
functions. -
For processing files, you will find the
Stdlib
module helpful, in particular theopen_in
,input_line
, andclose_in
functions. Theinput_line
function will handle newlines for you, so that you don’t have to worry about stripping out those particular whitespace characters. -
To parse a line of a file into its component words, you can either directly implement it yourself using the
String
module’s functions, or you can implement it with regular expressions using theStr
module. The latter is recommended but certainly not required. Regular expressions are a kind of small programming language; you would find it immensely useful to learn them if you haven’t picked it up somewhere already. Here’s a good tutorial: http://regexone.com/.
The modules mentioned above are specifically permitted by this assignment even if they have side effects or are imperative. When dealing with I/O, side effects are unavoidable.
Step 6: Query
The queries that users pose will have one of two forms. Abstractly those two forms are as follows, in which the NOT clause is always optional:
-
“and-not” queries: AND (w1, w2, …, wn), NOT (u1, u2, … um)
-
“or-not” queries: OR (w1, w2, …, wn), NOT (u1, u2, … um)
For example, “AND (far, away)” would return all files that contain both the words “far” and “away”, whereas “AND (far, away), NOT (day, night)” would return all files that do contain both “far” and “away” but do not contain “day” nor “night”. Likewise, “OR (all, nothing)” would return any file that contains “all” or “nothing” (or both), and “OR (all, nothing), NOT (between)” would return any file that contains “all” or “nothing” (or both) but does not contain “between”.
Queries must be case insensitive, which is the behavior you expect from Google. That means queries for the words “far”, “Far”, and “FAR” should all return the same results. Likewise, a file containing “far”, “Far”, or “FAR” would be returned by any of those queries.
Implement and_not
and or_not
in engine.ml
, and implement unit
tests in test.ml
.
This is the stopping point for a good solution. Your A4 grade will be based on what you complete by the time A4 is due. A5 will then ask you to add new functionality. Any functionality mentioned above that you leave incomplete in A4 will become part of the Satisfactory scope of A5. But, the excellent scope of A4 will not be part of A5. So you do not ever have to do the excellent scope of A4. It is perfectly fine to stop right here.
Step 7: Bisect
The textbook contains a tutorial on a tool called Bisect, which is a code coverage testing tool. Do that tutorial.
Now run make bisect
on your solution. Open report/index.html
and
examine the code coverage you have achieved in listDictionary.ml
,
dictionarySet.ml
, and engine.ml
. It’s unlikely you are at
100%, and probably it’s impossible to achieve that. For example, unit
tests are unlikely to ever exercise (all or any of) your format
and
rep_ok
functions. But outside of those, look for any lines colored
red by the Bisect report. Do your best to invent new unit tests that
will cause those lines to be executed. Add those tests to test.ml
.
How high do you need to get your coverage? In an ideal world you would
cover every line that you know it’s possible to cover, or at least that
is feasible to write unit tests to cover. With modest effort, the staff
solution to this assignment was able to achieve 90-100% code coverage in
those three files, excluding format
and rep_ok
implementations.
To exclude those, follow the instructions at the end of the textbook tutorial in
the section titled “Ignoring uncoverable code” regarding BISECT-IGNORE
comments. Depending on when you first read the tutorial, you might not have
seen that section; it was added just before this assignment released.
You may not use BISECT-IGNORE
to unfairly increase your code coverage
percentage. If you do, there will be significant penalties and perhaps even
an Academic Integrity case, because you are falsifying scientific data.
In limited cases beyond format
and rep_ok
it might be fair to use
BISECT-IGNORE
. For example, if there is some defensive code that checks a
precondition and raises an exception if it is violated, and it turns out to be
impossible or infeasible to write a unit test to trigger that exception, then
you should add additional source code comments to explain why it is reasonable
to ignore that code in the Bisect report.
Rubric
- 25 points: submitted and compiles
- 25 points: satisfactory scope
- 25 points: good scope
- 5 points: efficiency of search engine
- 10 points: code quality
- 10 points: excellent scope
Note that there is no testing component to the rubric. Rather, it is absorbed
into the excellent scope. The graders will not attempt to run make test
on
your submission.
Efficiency Rubric. Most of the test cases we run on your search engine will involve relatively small directories. But 5 points will be reserved for running your search engine on larger directories of up to 1 MB. That entails a high enough number of words that you will want to be careful about tail recursion. Of course, since the data structures are lists, the running time will likely be rather slow. For reference, the staff solutions take about 2–3 minutes to index that size directory.
Code Quality Rubric.
This will be assessed the same as in A2, including documentation.
Graders will read your extracted public and private HTML documentation,
not the original source code comments. The make docs
command must
succeed, or you will lose all points on the documentation. We’re going
to grade the top-level components of a module, not the nested parts.
For example, when we grade listDictionary.ml
’s documentation we won’t
click on Make
. The same holds for dictionarySet.ml
and Make
. So
you don’t have to worry about the fact that ocamldoc
doesn’t include
the specification comments for Dictionary
as part of
ListDictionary.Make
’s HTML: we won’t penalize you for those being
missing.
Excellent Scope Rubric. We’ll make listDictionary.ml
and engine.ml
worth
4 points each, and dictionarySet.ml
worth 2 points. We’ll look at your
code-coverage percentages per-file and round any that are at least 90% to up a
full 100%. Then you’ll get a number of points that is your percent code coverage
on a file times the number of points the file is worth, rounded to the nearest
integer.
Submission
Make sure your NetIDs are in authors.mli
, and set the
hours_worked
variable at the end of authors.ml
. Note that
variable is now must be a list. Order does not matter in it.
Run make zip
to construct the ZIP file you need to submit on CMS. Our
autograder needs to be able to find the files you submit inside that ZIP without
any human assistance, so:
→ DO NOT use your operating system’s graphical file browser to construct the ZIP file. ←
Use only the make zip
command we provide. Mal-constructed ZIP files
will receive a penalty of 15 points because of the extra human
processing they will require. If CMS says your ZIP file is too large,
it’s probably because you did not use make zip
to construct it; the
file size limit in CMS is plenty large for properly constructed ZIP
files.
If you look closely, you will notice that make zip
excludes all test
directories. That is intended behavior. We don’t want to clutter CMS
with your large tests. So, please do not go back into your ZIP and
manually add your test directories to it. The course staff is not going
to try and run make test
or make bisect
on your submission, so it is
fine for your tests to be omitted. As for the Bisect report, make zip
automatically produces it and includes it in your submission. So again,
it will be fine that your tests are omitted.
Ensure that your solution passes make finalcheck
. Submit your
zipfile on CMS. Double-check before the deadline that you
have submitted the intended version of your file.
Congratulations! Your search is off to a good start.
Acknowledgement: Adapted from Prof. Greg Morrisett, Dean and Vice Provost of Cornell Tech.