# Problem Set 5: MapReduce

### Last Updated: Wed Apr 10 09:22 EDT 2013

• Fixed the example DNA sequences
• Clarified design meeting: discussing MapReduce, not Part 1 (the written problem).

## Part 1: Written Problem (5 points)

Problem Set 7 is due in 10 minutes and your group is struggling to finish on time.

Your partner Gary has given you this interface for a module he's writing. You do not have access to Gary's module, but you know that he is representing the tree as an array where the root is stored at index 0 and the left and right children of any node at index i are stored at indices `2i + 1` and `2i + 2`.

In order to finish the assignment, you need a function `map f t` that traverses a BinaryTree `t` in order and applies the function `f` to each node. Unfortunately, Gary is taking his 2110 prelim and cannot be reached. Write the map function inside your own module using only the BinaryTree interface or write a README explaining to the course staff why this task is impossible.

```
module type BinaryTree = sig
(* Represented as an array in the implementation *)
type 'a tree

(* [add x t] adds an element [x] to the tree [t] *)
val add : 'a tree -> 'a -> 'a tree
(* [remove x t] removes element [x] from the tree [t] *)
val remove : 'a tree -> 'a -> 'a tree
(* [member x t] returns [true] if [x] is an element of
* tree [t] and [false] otherwise *)
val member : 'a tree -> 'a -> bool
(* [size t] returns the number of elements in the tree [t] *)
val size : 'a tree -> int
end
```

Write your group's solution to this problem in `part1.txt` and submit it to the appropriate slot on CMS. We will grade these solutions by hand, so don't worry about making your solution compile!

## Part 2: MapReduce (60 points)

In this problem set, you will be implementing a simplified version of Google's MapReduce. Prior to starting this problem set, read the short paper on MapReduce to familiarize yourself with the basic architecture. This writeup will assume that you have read sections 1–3.1 of that paper.

### Overview

#### Map and Reduce Functions

The map and reduce functions have the following OCaml types:

```
val map : 'a -> 'b -> ('c * 'd) list
val reduce : 'c -> 'd list -> 'e list
```
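For concreteness, here is a word_count-shaped instantiation of these types (illustrative only, runnable locally without any of the distributed machinery), with `'a` = document id, `'b` = body, `'c` = word, and `'d`/`'e` = counts encoded as strings:

```ocaml
(* Illustrative word_count-style map and reduce matching the types above,
   instantiated with every type variable at string. *)
let map (_id : string) (body : string) : (string * string) list =
  String.split_on_char ' ' body
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (w, "1"))

let reduce (_word : string) (ones : string list) : string list =
  [ string_of_int (List.length ones) ]
```

For example, `map "doc1" "ocaml map reduce"` yields `[("ocaml", "1"); ("map", "1"); ("reduce", "1")]`.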

These functions will be computed in a distributed fashion by several agents running concurrently. The agents communicate with each other using a messaging protocol defined in shared/protocol.ml. However, that protocol only allows for the transmission of strings, so you must use OCaml's built-in marshalling and unmarshalling facilities to transmit values of different types. We will explain this more thoroughly below.

For a given MapReduce application `foo`, the application-specific code is stored in the directory apps/foo/. For standard MapReduce applications having one map function and one reduce function, store the code for the mappers in apps/foo/mapper.ml and the code for the reducers in apps/foo/reducer.ml. There is also a controller in the file apps/foo/foo.ml, which handles initialization of and communication between mappers and reducers. This controller must be named in this manner because our application assumes this convention to dynamically load apps.

The mappers and reducers receive their input by calling `Program.get_input` and communicate their results by calling `Program.set_output`. The specific mechanisms for these operations are described below.

#### Basic Execution

Execution is divided into five phases:

1. Pre-Map:

`controller.exe` is invoked with the name of a MapReduce application on the command line along with other application-specific information (e.g., input file name, any other parameters as required by the application). This information is passed to the `main` method of the controller in the application directory. This method preprocesses the program input into key-value pairs for map, and then manages the rest of the MapReduce procedure.

2. Map

The controller sends each (key, value) pair to an available mapper. The mapper applies the mapping function on this (key, value) pair, resulting in a list of (key, value) pairs (the types of the input and output do not have to be the same). The resulting list of (key, value) pairs is sent back to the controller. The controller continues to send inputs to the next available mapper until all pairs have been mapped.

3. Combine

For each key produced by the mappers, all of its corresponding values are collected into a single list, resulting in a (key, value list) pair for that key.

4. Reduce

The controller connects to worker servers and initializes reducers on these servers. The (key, value list) pairs produced in the Combine phase are then sent to available reducers. The reducers perform the reduce operation and send the results back to the controller.

5. Post-Reduce

The controller collects the results of the Reduce phase and outputs them in a suitable manner.
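Ignoring distribution, the Map, Combine, and Reduce phases amount to the following sequential sketch (the function names here are hypothetical, not part of the framework):

```ocaml
(* Sequential sketch of phases 2-4; the real framework farms the work
   out to remote mappers and reducers. *)
let map_phase mapper kv_pairs =
  List.concat_map (fun (k, v) -> mapper k v) kv_pairs

let combine_phase pairs =
  (* Group values by key; quadratic, but fine for a sketch. *)
  List.fold_left
    (fun acc (k, v) ->
      match List.assoc_opt k acc with
      | Some vs -> (k, vs @ [v]) :: List.remove_assoc k acc
      | None -> (k, [v]) :: acc)
    [] pairs
  |> List.rev

let reduce_phase reducer groups =
  List.map (fun (k, vs) -> (k, reducer k vs)) groups
```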

#### Client Execution Example

Consider the canonical MapReduce example of counting the number of occurrences of each word in a set of documents. The input is a set of (document id, body) pairs. In the Map phase, each word that occurs in the body of a file is mapped to the pair (word, "1"), indicating that the word has been seen once. In the Combine phase, all pairs with the same first component are collected to form the pair (word, ["1"; "1"; ...; "1"]), one such pair for each word. Then in the Reduce phase, for each word, the list of ones is summed to determine the number of occurrences of that word.

For simplicity, our framework accepts only a single data file containing all the input documents. Each document is represented in the input file as an (id, document name, body) triple. See data/reuters.txt, a collection of Reuters articles from the 1980s, or the shorter files data/test1.txt and data/test2.txt for examples of the formatting. `Map_reduce.map_reduce` prepares a document file by first calling `Util.load_documents` and then formatting the documents into (key, value) pairs for the mapper.

We have included a full implementation of this application in apps/word_count. Here is a detailed explanation of the sequence of events.

1. The controller application is started using the command `controller.exe word_count filename`. It immediately calls the `main` method of `apps/word_count/word_count.ml`, passing it the argument list. The `main` method calls `controller/Map_reduce.map_reduce`, which is common controller code for performing a simple one-phase MapReduce on documents. Other more involved applications have their own controller code.
2. The documents in `filename` are read in and parsed using `Util.load_documents` which splits the collection of documents into {id; title; body} triples. These triples are converted into (id, body) pairs.
3. The controller calls `Map_reduce.map kv_pairs "apps/word_count/mapper.ml"`.
4. `Map_reduce.map` initializes a mapper worker manager using `Worker_manager.initialize_mappers "apps/word_count/mapper.ml"`. The `Worker_manager` loads the list of available workers from the file named `addresses`. Each line of this file contains a worker address of the form `ip_address:port_number`. Each of these workers is sent the mapper code. The worker creates a mapper with that code and sends back the id of the resulting mapper. This id is combined with the address of the worker to uniquely identify that mapper. If certain workers are unavailable, this function will report this fact, but will continue to run successfully.
5. `Map_reduce.map` then sends individual unmapped (id, body) pairs to available mappers until it has received results for all pairs. Free mappers are obtained using `Worker_manager.pop_worker()`. Mappers should be released once their results have been received using `Worker_manager.push_worker`. Read controller/worker_manager.mli for more complete documentation. Once all (id, body) pairs have been mapped, the new list of (word, "1") pairs is returned.
6. `Map_reduce.map_reduce` receives the results of the mapping. If desired, `Util.print_map_results` can be called here to display the results of the Map phase for debugging purposes.
7. The list of (word, "1") pairs is then combined into (word, ["1"; ...; "1"]) pairs for each word by calling `Map_reduce.combine`. The results can be displayed using `Util.print_combine_results`.
8. `Map_reduce.reduce` is then called with the results of `Map_reduce.combine` and the name of the reducer file `apps/word_count/reducer.ml`.
9. `Map_reduce.reduce` initializes the reducer worker manager by calling the appropriate `Worker_manager` function, which retrieves worker addresses in the same manner as for mappers.
10. `Map_reduce.reduce` then sends the unreduced (word, count list) pairs to available reducers until it has received results for all input. This is performed in essentially the same manner as the map phase. When all pairs have been reduced, the new list of (word, count) tuples is returned. In this application, the key doesn't change, so `Worker_manager.reduce` and the reduce workers only calculate and return the new value (in this case, count), instead of returning the key (in this case, word) and count.
11. The results are returned to the `main` method of `word_count.ml`, which displays them using `Util.print_reduce_results`.

#### Worker Execution Example

1. Multiple workers can be run from the same directory as long as they listen on different ports. A worker server is started using `worker_server.exe port_number`, where `port_number` is the port the worker listens on.
2. `Worker_server` receives a connection and spawns a thread, which calls `Worker.handle_request` to handle it.
3. `Worker.handle_request` determines the request type. If it is an initialization request, then the new mapper or reducer is built using `Program.build`, which returns either the new worker id or the compilation error. See worker_server/program.ml for more complete documentation. If it is a map or reduce request, the worker id is verified as referencing a valid mapper or reducer, respectively. If the id is valid, then `Program.run` is called with that id and the provided input. This runs the relevant worker, which receives its input by calling `Program.get_input()`. Once the worker terminates, having set its output using `Program.set_output`, these results are returned by `Program.run`. If the request is invalid, then the appropriate error message, as defined in shared/protocol.ml, is prepared.
4. Once the results of either the build or map/reduce are completed, then `Worker.send_response` is called, which is responsible for sending the result back to the client. If the response is sent successfully, then `Worker.handle_request` simply recurses, otherwise it returns unit.

### Code Structure

In this section we describe the organization of the MapReduce code.

#### `controller/`

• `main.ml`

This is the main module that is called to start up an application. It does nothing except parse the command-line arguments to determine which application to start, then call the `main` method of that application, passing it the command-line arguments.

• `map_reduce.ml`

This module provides functionality for performing the actual map, combine, and reduce operations. These methods can be called by the individual applications. See the `map_reduce.mli` for additional documentation.

The method `map_reduce` in this module contains controller code common to a number of applications that manipulate documents. It can be used with different mappers and reducers to achieve different results. It loads and parses the documents from the input file, then calls `Map_reduce.map` to run the mappers. Next, the list of (key, value) pairs is combined using `Map_reduce.combine`. The combined values are then reduced using `Map_reduce.reduce`. These results are passed back to the individual application controllers for output.

• `worker_manager.ml`

This module is the interface for communicating with workers. It includes functions to initialize or kill mappers and reducers, select inactive workers, assign a map/reduce task to a worker, and return workers that have completed their task to the pool of inactive workers. Detailed descriptions of each function in the module are in `worker_manager.mli`.

If a request fails due to some sort of network or file descriptor error, then the `Worker_manager` will automatically retry a fixed number of times. It is therefore acceptable for workers to intermittently output errors as long as they are still able to service requests.

#### `worker_server/`

• `program.ml`

This module provides the ability to build and run mappers/reducers. It includes the functions `get_input` and `set_output`, which mappers and reducers must use.

• `worker_server.ml`

This module is responsible for initializing the server and spawning threads to handle client connection requests. These threads simply invoke `Worker.handle_request`.

• `worker.ml`

This module is responsible for managing communication between the clients and the mappers/reducers that they build and then run. A more thorough description is contained in the Your Tasks section below.

### Communication Protocol

The protocol that the controller application and the workers use to communicate is stored in shared/protocol.ml.

#### Requests

• `InitMapper code`

Initialize a mapper. Each element in the `code` list is a line in mapper.ml, which will be executed sequentially. Note that if you wish to `open` a module, you must include double-semicolons before accessing its contents. For example, `open Util;;` must appear in the `code` list before a call to `load_documents`. The semicolons indicate the end of a statement and force evaluation of the expression.

• `InitReducer (code)`

Initialize a reducer, same as above.

• `MapRequest (worker_id, key, value)`

Execute mapper with id `worker_id` with `(key, value)` as input.

• `ReduceRequest (worker_id, key, value list)`

Execute reducer with id `worker_id` with `(key, value list)` as input.
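As an illustration, an `InitMapper` payload is just a list of source lines. A hypothetical mapper (the helper `split_words` is invented here for the example, not a documented `Util` function) might be shipped as:

```ocaml
(* Hypothetical payload for an InitMapper request: each string is one
   line of mapper.ml. Note the double semicolon after the open, which
   forces evaluation before anything from Util is referenced. *)
let code = [
  "open Util;;";
  "let (key, value) = Program.get_input () in";
  "Program.set_output (List.map (fun w -> (w, \"1\")) (split_words value))";
]
```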

#### Responses

• `Mapper (worker_id option, error)`

`Mapper (Some id, _)` indicates the requested mapper was successfully built, and has the returned id. `Mapper (None, error)` indicates compilation of the mapper failed with the returned error.

• `Reducer (worker_id option, error)`

Same as above, except for a reducer.

• `InvalidWorker (worker_id)`

Either a map request was made, and the `worker_id` does not correspond to a mapper, or a reduce request was made, and the `worker_id` does not correspond to a reducer.

• `RuntimeError (worker_id, error)`

Worker with id `worker_id` output the runtime error(s) stored in `error`. Also used to indicate that the worker didn't return any output, meaning `Program.run` returned `None`.

• `MapResults (worker_id, (key, value) list)`

Map request sent to `worker_id` resulted in output of `(key, value) list`.

• `ReduceResults (worker_id, value list)`

Reduce request sent to `worker_id` resulted in output of `value list`.

#### Marshalling

Only string data can be communicated between agents. Values can be converted to and from strings explicitly using `Util.marshal` and `Util.unmarshal`, which call the OCaml built-in `Marshal.to_string` and `Marshal.from_string` functions, respectively. You can send strings via communication channels without marshalling, but other values need to be marshalled. Your mapper or reducer must also convert the input it receives from strings back to the appropriate type it can operate on, and once it has finished, it needs to convert its output into strings to communicate it back.

Note that marshalling is not type safe. OCaml cannot detect misuse of marshalled data during compilation. If you unmarshal a string and treat it as a value of any type other than the type it was marshalled as, your program will compile, run, and crash. You should therefore take particular care that the types match when marshalling/unmarshalling. Make sure that the type of marshalled input sent to your mapper matches the type that the mapper expects, and that the type of the marshalled results that the mapper sends back matches the type expected. The same is true for reducers using the reducer messages. Recall the syntax for type annotations:

```
let x : int = Util.unmarshal my_data
```

Annotated variable declarations will help pinpoint unmarshalling errors.
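For example, a round trip through OCaml's built-in `Marshal` module (which, per the description above, `Util.marshal` and `Util.unmarshal` wrap) looks like this; note the annotation on the decoding side, since `Marshal.from_string` cannot infer its result type:

```ocaml
(* Marshal a value to a string and decode it back. The annotation on
   [decoded] is what keeps the types honest; get it wrong and the
   program still compiles, then crashes at runtime. *)
let payload : (string * int) list = [ ("ocaml", 2); ("map", 1) ]
let wire : string = Marshal.to_string payload []
let decoded : (string * int) list = Marshal.from_string wire 0
```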

### Your Tasks

All code you submit must adhere to the specifications defined in the respective `.mli` files. As always, do not change the `.mli` files.

You must implement functions in the following files:

• `shared/hashtable.ml`

You are required to implement a hashtable according to the specifications in shared/hashtable.mli. This data structure will be useful in your MapReduce implementation. Note: You may not use the OCaml Hashtbl library functions in your hashtable implementation. The only exception is that you may use `Hashtbl.hash` as your hashing function.

However, you are encouraged to use the OCaml `Hashtbl` module throughout the rest of your solutions. You may use your own `Hashtable`, but we discourage this for the following reasons:

• All functions in the `Hashtbl` module are guaranteed to work
• You cannot marshal a `Hashtable`, because it contains a function.
• Using the OCaml module helps reduce the scope of possible bugs. The course staff will be grateful!

• `controller/map_reduce.ml`

The code in this module is responsible for performing the actual map, combine, and reduce operations by initializing the mappers/reducers, sending them input that hasn't been mapped/reduced yet, and then returning the final list of results.

The `map` and `reduce` functions must be implemented according to the specifications in controller/map_reduce.mli. The functionality should mirror the above example execution description. Both functions must make use of available workers simultaneously and be able to handle worker failure. However, you don't need to handle the case when all workers fail when there is no coding error in your worker code (i.e. complete network failure). Additionally, there should only be one active request per worker at a time, and if a worker fails, it should not be returned to the `Worker_manager` using the `push_worker` function.

Both `map` and `reduce` share the same basic structure. First, the workers are initialized through a call to the appropriate `Worker_manager` function. Once the workers have been initialized, the input list should be iterated over, sending each available element to a free worker, which can be accessed using the `pop_worker` function of `Worker_manager`. A mapper is invoked using `Worker_manager.map` providing the mapper's id and the (key, value) pair. Each mapper (and therefore invocation of `Worker_manager.map`) receives an individual (key, value) pair, and outputs a new (key, value) list, which is simply added onto the list of all previous map results. A reducer is invoked similarly using `Worker_manager.reduce`. These functions block until a response is received, so it is important that they are invoked using a thread from the included `Thread_pool` module. Additionally, this requires that these spawned threads store their results in a shared data structure, which must be accessed in a thread-safe manner.

Once all of the input has been iterated over, it is important that any input that remains unfinished, either due to a slow or failed worker, is re-submitted to available workers until all input has been processed. Finally, close connections to the workers with a call to `Worker_manager.clean_up_workers` and return the results.

Note: It is acceptable to use `Thread.delay` with a short sleep period (0.1 seconds or less) when looping over unfinished data to send to available workers in order to prevent a flooding of workers with unnecessary work.

The `combine` function must be implemented according to specifications, but does not need to make use of any workers. It should combine the provided (key, value) pair list into a list of (key, value list) pairs in linear time such that each key in the provided list occurs exactly once in the returned list, and each value list for a given key in the returned list contains all the values that key was mapped to in the provided list.
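To pin down the required semantics (not the required performance), here is a naive quadratic reference for `combine`; your submission must produce the same grouping in linear time, e.g. with a hash table:

```ocaml
(* Quadratic reference for combine's semantics only. Each key appears
   once, in order of first occurrence; each key's value list preserves
   the order of the input pairs. Do NOT submit this version. *)
let combine_spec (pairs : ('a * 'b) list) : ('a * 'b list) list =
  let keys =
    List.fold_left
      (fun ks (k, _) -> if List.mem k ks then ks else ks @ [k])
      [] pairs
  in
  List.map
    (fun k ->
      (k, List.filter_map (fun (k', v) -> if k' = k then Some v else None) pairs))
    keys
```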

• `worker_server/worker.ml`

The code in this module is responsible for handling communication between the clients that request work and the mapper/reducers that perform the work.

The `handle_request` function must be implemented according to the mli specifications and the above description, and must be thread-safe. This function receives a client connection as input and must retrieve the `worker_request` from that connection.

If the request is for a mapper or reducer initialization, then the code must be built using `Program.build`, which returns `(Some id, "")` where id is the worker id if the compilation was successful, or `(None, error)` if the compilation was not successful. If the build succeeds, the worker id should be stored in the appropriate mapper or reducer set (depending on the initialization request), and the id should be sent back to the client using `send_response`. If the build fails, then the error message should be sent back to the client using `send_response`.

If the request is to perform a map or reduce, then the worker id must be verified to be of the correct type by looking it up in either the mapper or reducer set before the mapper or reducer is invoked using `Program.run`. The return value is `Some v`, where `v` is the output of the program, or `None` if the program failed to provide any output or generated an error.

Note that the mapper and reducer sets are shared between all request handler threads spawned by the `Worker_server`, and therefore access to them must be thread-safe.

Note: Code in shared/util.ml is accessible to all workers by default.

#### Building and running your code

We have provided a `Makefile` and a build script `build.bat` that you can use to build your project.

• `worker_server.exe port_number`: starts up a worker server listening on `port_number` for requests.
• `controller.exe word_count filename`: performs a MapReduce using application `word_count` with data file `filename`.

## Part 3: Using MapReduce (27 points)

You will use the simplified MapReduce that you created in Part 2 to implement three applications: `inverted_index`, `game_of_life`, and `dna_sequencing`.

Each application will require a mapper `apps/xxx/mapper.ml` and reducer `apps/xxx/reducer.ml`, where `xxx` is the name of the application, as well as a controller `apps/xxx/xxx.ml`. We have supplied a full implementation of the `word_count` application as an example to get you started.

### Inverted Index (5 points)

An inverted index is a mapping from words to the documents in which they appear. For example, if these were your documents:

Document 1:

`OCaml map reduce`

Document 2:

`fold filter ocaml`

The inverted index would look like this:

| word   | document |
|--------|----------|
| ocaml  | 1 2      |
| map    | 1        |
| reduce | 1        |
| fold   | 2        |
| filter | 2        |

#### Implementation

To implement this application, you should take a dataset of documents (such as `data/reuters.txt`) as input and use MapReduce to produce an inverted index. This can be done in a way very similar to `word_count`. In particular, you can use `Map_reduce.map_reduce`. Print your final results using `Util.print_reduce_results`.

Write the mapper, reducer, and controller for `inverted_index`. Your mapper and reducer code should go in apps/inverted_index/mapper.ml and apps/inverted_index/reducer.ml, respectively, and your controller code should go in apps/inverted_index/inverted_index.ml.

### Conway's Game of Life (8 points)

Mathematician John Horton Conway, in addition to his significant contributions to theoretical mathematics, single-handedly created the field of recreational mathematics. He invented thousands of mathematical games and published a number of books and papers on the subject. The most famous of his mathematical games is the simply titled Game of Life. A number of variations have gained popularity, but the original Life is a zero-player game that models a cellular automaton. Players define an initial state for the world, and then the cells live and die according to a few simple rules, creating fascinating patterns and sequences. "The world" is a two-dimensional toroid (read: a matrix that wraps around at the edges) divided into grid squares. Each square is either alive or dead, and the current state of the board at any instant defines the next state of each cell/grid square. The game progresses in a series of turns, where each square of the entire grid is updated simultaneously according to the following rules:

1. Any live cell with more than three live neighbors dies of overcrowding
2. Any live cell with two or three live neighbors lives on to the next generation
3. Any live cell with fewer than two live neighbors dies of loneliness
4. Any dead cell with exactly three live neighbors becomes a live cell. It's the miracle of birth!

To reiterate: in one turn, every cell gets updated. One game is a series of turns. Each turn is like a snapshot of the world. Playing the snapshots in sequence animates life on our little planet.
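The four rules collapse to a simple per-cell function. Here is a sketch using the 1/0 cell encoding this app uses for the board (not required code):

```ocaml
(* One cell's next state under Conway's rules: alive next turn iff it
   has exactly three live neighbors, or is alive now with exactly two. *)
let next_state (alive : int) (live_neighbors : int) : int =
  match alive, live_neighbors with
  | 1, 2 | 1, 3 -> 1  (* rule 2: survival *)
  | 0, 3 -> 1         (* rule 4: birth *)
  | _ -> 0            (* rules 1 and 3: overcrowding or loneliness *)
```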

### App Overview

You will implement Life using the MapReduce framework. Represent the board as an array of 1s and 0s, marking a live cell with a 1 and a dead cell with a 0. For I/O purposes, we will store initial states and generated games as strings, but we have supplied functions to handle the conversions from string to array for you. At a high level, your app must read in a board from a file and run it for a given number of turns, saving a transcript of the board state before every update. This transcript can be opened by the included GUI `conway.jar` so you can view your simulation. Simply double-click the jar file or run the command `java -jar conway.jar` in the `gui` directory.

Implement `game_of_life.ml` and the corresponding `mapper` and `reducer`. You are free to write the map and reduce functions however you like, but the controller must be implemented such that `./controller.exe game_of_life my_board.txt 10 my_board.out`:

1. Reads in the string representation of a board stored in `my_board.txt` as the initial state
2. Records the board dimensions at the top of the transcript file
3. Runs the simulation for 10 turns according to Conway's rules
4. Records the 11 board states (initial + 10 turns) in the transcript

Note that we have handled steps 1 and 2 for you already in `game_of_life.ml`. You may reorganize this code if you like, but it is complete. Additionally, you may find the supplied function `write_epoch : int array array -> out_channel -> unit` useful. It takes a board and an output channel and writes a string representation of the board to the output.

### Genetic Sequence Alignment (14 points)

In this exercise you will implement the computationally intensive part of one variant of genetic sequence alignment. Sequence alignment is a collection of problems related to finding approximate matches between two (or more) sequences of genetic information. This exercise was inspired by this paper, which you are welcome to read, though you won't need to in order to complete this application.

For our purposes, DNA sequences are simply strings of the letters {G,C,A,T}. The context for our problem is that we have some piece of DNA (which we will call the 'sample') that we would like to analyze to get its exact sequence of letters. Unfortunately, it is quite hard to directly read the sequence of letters from a long piece of DNA. Don't ask me why; I'm not a biochemist.

What the biochemists can do is make a ton of copies of the sample DNA and then randomly chop up all the copies into relatively short pieces (on the order of 100 letters long). They call these pieces 'short reads' or just 'reads'. Now we have a puzzle on our hands. How can we put the short reads back together to recreate the whole sample? Without any more information, this is a very challenging problem.[1] However, if we have a reference sequence that we expect to be highly similar to the sample sequence, we can use it to combine the short reads. This solution is commonly used in practice; for example, a reference sequence for one set of short reads might come from a different individual of the same species.

Your goal in this section is to implement (most of) the computationally intensive part of putting the short reads back together as two map-reduce passes.[2] At the end of the two passes, your application should identify all the pieces of short reads that appear exactly somewhere in the reference sequence. For example, with this reference sequence:

```
            1         2         3         4
01234567890123456789012345678901234567890123456
GATCTCTATGCAAAATACGTATTTGTACGTCCACCCTCGGAGTGGTG
```
And these short reads:
```
            1         2
01234567890123456789012
ATGCAAAATACGTATT
GTCCACCCTCGGA
CGTATTTGTACATCCACCCTCGG
GGTTGCTCTAATATATATGGCG
```
The output should be:
```
ATGCAAAATACGTATT        length 16 match at index  7 of reference 1 and index 0 of read
GTCCACCCTCGGA           length 11 match at index 30 of reference 1 and index 2 of read
CGTATTTGTACATCCACCCTCGG length 11 match at index 17 of reference 1 and index 0 of read
                        length 11 match at index 29 of reference 1 and index 12 of read
```

#### Implementation

1. The input files will be a sequence of lines, where each line looks like:
`  ID@{READ|REF}@{A|C|G|T}*`
ID is a unique number. This format is set up so that the utility function `Util.load_documents` is reasonably useful.
2. First Map/Reduce Step. Chop up all the short reads and the reference sequence into even shorter "k-mers". A "mer" is simply a short sequence of DNA information; for example "AGTTCA" is a 6-mer. The input to each map instance should be either one short read or the reference sequence.
• A reasonable value for "k" is 10. That means that for a short read of length 15, like "AGCTAGCTCAGTACC", you should end up with 6 10-mers:
```
AGCTAGCTCA
GCTAGCTCAG
CTAGCTCAGT
TAGCTCAGTA
AGCTCAGTAC
GCTCAGTACC
```
For the output from the first map instance, the key should be the k-mer itself and the value should be a tuple that includes:
• A flag to indicate reference versus short read
• A unique id
• The index of this k-mer within its short read or reference
• The combine step will group together all of the identical mers, which is exactly what we are after in this step. In particular, we are interested in the mers that appear in both the reference sequence and a short read.
• In the reduce step, find the mers that have a match in both a short read and the reference; we'll call them "shared mers". The key is still the k-mer itself and the value should be a tuple that includes:
• The short read id
• The reference id (Not entirely necessary because we just have one reference)
• Index from the short read
• Index from the reference
3. Second Map/Reduce Step. Coalesce runs of adjacent shared k-mers into longer "seeds". In this step we do not tolerate any mismatched letters. If we have two shared mers from the previous step whose index is off by one, we merge them into a single larger shared mer.
• The map function for this step is mostly a no-op. It just needs to rearrange the data a little. The key in the input is the k-mer text itself. For the output of the map step, the key should be the short read id and the value will be a tuple with the following information:
• The reference id
• Index from the short read
• Index from the reference
• The combine function will group together all the "shared mers" for the same short read. This is why we made it the key in the previous step.
• The reduce function should sort all the shared mer tuples by short read index. Then we look for runs of consecutive shared mers. We coalesce these into seeds. The key is still the short read id. The values are similar to the previous step, but we add the length of the seed.
• The reference id (Not entirely necessary because we just have one reference)
• Index from the short read
• Index from the reference
• Seed length
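The sort-then-coalesce logic of this reduce step can be sketched as follows. Here a shared mer for one short read is represented as a tuple `(ref_id, read_idx, ref_idx)`, and the seed length is counted in mers (a run of n overlapping k-mers covers n + k - 1 letters); both choices are illustrative, not prescribed:

```ocaml
(* Sketch of the second reduce step: given the shared mers of one short
 * read as (ref_id, read_idx, ref_idx) tuples, sort by short-read index
 * and coalesce runs where both indices advance by one. Each resulting
 * seed is (ref_id, read_idx, ref_idx, length_in_mers). *)
let coalesce mers =
  let sorted =
    List.sort (fun (_, i1, _) (_, i2, _) -> compare i1 i2) mers in
  let step seeds (rid, ri, fi) =
    match seeds with
    | (rid', ri0, fi0, n) :: rest
      when rid' = rid && ri = ri0 + n && fi = fi0 + n ->
        (* adjacent in both the read and the reference: grow the seed *)
        (rid', ri0, fi0, n + 1) :: rest
    | _ ->
        (* gap or mismatch: start a new seed of one mer *)
        (rid, ri, fi, 1) :: seeds in
  List.rev (List.fold_left step [] sorted)
```

For example, `coalesce [(0, 2, 10); (0, 3, 11); (0, 5, 20)]` merges the first two mers into one seed of two mers starting at read index 2, and leaves the third as its own seed of one mer.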

The result of these two map/reduce steps is a set of aligned short reads (with at most a small number of mismatched letters) that can be stitched together into a full sequence. Don't worry about producing the final combination. If you get to this step with a set of aligned short reads, throw up the victory flags and congratulate yourself!

Implement the two-stage DNA sequencing application according to the above specification. We have supplied you with a complete controller script `dna_sequencing.ml`, so all you need to do is implement the mappers (`step_1_mapper.ml`, `step_2_mapper.ml`) and reducers (`step_1_reducer.ml`, `step_2_reducer.ml`). Enjoy!

1. This short read alignment problem, with no additional information available, is exactly what researchers studying genetic material from uncharted places like the deep sea have to deal with.
2. It is actually common to use several reference sequences and reads from several samples, but we will stick with one to keep things simple.
###### Remarks

Note that for the above applications, your MapReduce code will probably run a lot slower than a non-MapReduce implementation of the same algorithm. This slowdown comes from the framework's considerable overhead, and our simplified MapReduce framework is not very well optimized. Performance gains are only realized when running at massive scale.

### To submit

Simply zip up and submit your entire ps5 folder. Be sure to double-check your app names and delete all compiled files before submitting. In particular, remove the working directories generated by the worker server. If you have GNU make installed, simply run `make clean` and you'll be ready to zip and submit. Otherwise, use the script `clean.bat` and zip what's left. If you're paranoid about build scripts and want to prepare everything by hand, manually delete the `*.cmo` and `*.cma` files and the `worker_XXXXX` directories generated by `worker_server`. Be careful not to delete the `worker_server` directory itself, though!

## Part 4: Source Control (3 points)

You are required to use a source control system like Git or SVN. Submit the log file that describes your activity. We will provide you with an SVN repository, hosted by CSUG, or you may use a private repository from a provider like xp-dev, Bitbucket, or GitHub. Do not post your code in a public repository; that would be an academic integrity violation.

For information on how to get started with your provided SVN account, read Using Subversion in the CSUGLab.

Note: Your repository name is `cs3110_<netid(s)>`. For example, `cs3110_njl36`, or if you have a partner: `cs3110_dck10_njl36`. Notice that the netids of groups are in alphabetical order. This repository name is what you put in place of `project1` in the directions in the provided link.

If you use Windows and are unfamiliar with the command line or Cygwin, there is a GUI-based SVN client called TortoiseSVN that provides convenient context-menu access to your repo.

For Debian/Ubuntu, a GUI-based solution to explore would be RapidSVN. To install:

```
apt-get install rapidsvn
```

Mac users looking for a front-end can try SC Plugin, which claims to provide functionality similar to TortoiseSVN. RapidSVN might work as well.

There is also a plug-in for Eclipse called Subclipse that integrates everything for you nicely.

Note that all of these options are simply graphical mimics of the extremely powerful terminal commands, and are by no means necessary to successfully utilize version control.

## Part 5: Design Review Meeting (5 points)

We will be holding 15-minute design review meetings. Your group is expected to meet with one of the course staff during their office hours to discuss this assignment. A sign-up schedule is available on CMS.

Be prepared to give a short presentation on how you and your partner plan to implement MapReduce, including how you will divide the work, and bring any questions you may have concerning the assignment. Staff members will ask questions to gauge your understanding of various aspects of the assignment, ranging from the high-level outline to specific design pitfalls. This is not a quiz; you are not expected to be familiar with the intricacies of the project during your design review. Rather, you are expected to have outlined your design and prepared thoughtful questions. Your design review should focus on your approach to implementing the MapReduce framework, specifically `controller/map_reduce.ml`, but you may spend a few minutes at the end discussing the MapReduce applications.

Design reviews are for your benefit. Meet with your partner prior to the review and spend time outlining and implementing MapReduce. This is a large and difficult assignment. The review is an opportunity to ensure your group understands the tasks involved and has a feasible design early on.