CS 2110 Homework Assignment 2
Due: Tue 9/22 11:59pm
Description
This assignment builds on your solution to Assignment 1, but unlike Assignment 1, you have permission to work in teams of two if you so desire. We don't actually recommend doing so; the project is designed to be done by individuals and there is no real need for teams. But we also know that some people strongly prefer to work with a partner and since many people in the computing industry do work in teams throughout their careers, we're allowing that. Both team members must identify themselves as part of a team and should hand in the identical solution.
We are making available Ken's solution to Homework 1 but won't release until the morning after Assignment 1 was due. You may use that as a starting point if your own version of Assignment 1 was hopelessly broken. However, if you do so, you must indicate that you based your code on Ken's solution.
Assignment 2 is about calculating gene distance. The closer two genes are, the more likely it is that one evolved from or into the other, or that they share a recent common ancestor. In our algorithm, matching sections of the two genes that contribute nothing ot the distance calculation, and a heuristic is used to determine a score for differing sections. The higher the total score, the more distant the two genes are considered to be.
A Simplified Algorithm
For ease of exposition, let us begin with an oversimplified algorithm—not the one you will finally implement—that provides some intuition for the more sophisticated approach. Consider two genes, Ga and Gb, where Ga is of the form GprefixG0Gsuffix and Gb is of the form GprefixG1Gsuffix. Gprefix and Gsuffix are the same for both genes, and may even be zero-length. Thus G0 and G1 differ only in their middle sections G0 and G1. Assume the prefix and suffix are maximal, such that G0 and G1 do not have any characters in common at either end.
Scoring
To determine the score for these two gene fragments G0 and G1, we first count the instances of each character. For example, if G0 = AOOACIAIOCOO and G1 = OCIIOOOIIOOOOOI, then we count for G0 3 A's, 2 C's, 2 I's and 5 O's. Likewise, for G1 we have 0 A's, 1 C's, 5 I's and 9 O's.
Now we compare these numbers. We use three constants—SCOST, DCOST, and DDCOST— in our score calculation.
- If a character is in one fragment but not the other, add DDCOST for each instance of that character. Example: There are three As is in G0 but none in G1, so we add DDCOST*3 to the score.
- If a character appears in both fragments, add SCOST for each common occurrence, and DCOST for each difference. Character I appears twice in G0 and five times in G1. Thus, I contributes SCOST*2 + DCOST*3 to the total score. Likewise, C contributes SCOST + DCOST and O contributes SCOST*5 + DCOST*4.
The nominal constant values are SCOST=15, DCOST=8, and DDCOST=25, giving a total score of SCOST * 8 + DCOST * 8 + DDCOST * 3 = 259 for the example.
This scoring function is a heuristic: a rule of thumb that will suffice for our purposes in CS2110. It may seem a bit ad-hoc, but in fact the Smith-Wasserman and BLAST algorithms use similar heuristics. In assignment 3, you will get even closer to the Smith-Wasserman algorithm. The reason for using heuristics of this sort is that computing the exact way that some gene evolved into another is such a hard problem—there are too many possibilities to explore, and in general, more than one possibility might lead to the right place. Worse, one can't really know that evolution followed the shortest path. So this is a case in which there isn't any way to know the "true" answer. A good rule of thumb may be the best that can be done!
The Gene Distance Algorithm
The above algorithm is too simple, because there might yet be large parts of G0 and G1 that are identical, and the algorithm should detect those parts and only score the parts that are actually different. The result will become recursive, breaking the overall genes Ga and Gb down into matching and non-matching regions, ultimately calling the scoring function above for each of the non-matching regions. The final distance will be the sum of all of those scores.
Trimming Common Affixes
As in the simplified algorithm, we will ignore common prefixes and suffixes in Ga and Gb. That is, the first thing to do with your two input genes is find the longest strings Gprefix and Gsuffix such that Ga is of the form GprefixG0Gsuffix, and Gb is of the form GprefixG1Gsuffix. Throw away Gprefix and Gsuffix; the result of the trimming operation is just G0 and G1. Note that they necessarily differ in both first and last characters, or the prefix or suffix could have been longer; actually either or both can be empty as well—if both are empty, it is because Ga and Gb are identical, with distance zero.
In some cases, the order in which you trim the prefix and suffix matters. For this assignment, trim the prefix first.
Sorting Genes
In order to ensure that the algorithm described below is deterministic, and everybody gets the same results, it is important to make sure the computation always happens the same way. We will define a sort order for genes as follows:
- If G0 is shorter (has fewer characters) than G1, it comes first.
- If G0 and G1 are the same length, and G0 comes first alphabetically, it comes first.
- Otherwise, G1 comes first.
In other words, sort first by length (increasing), and sub-sort alphabetically.
After trimming the common prefix and suffix from our inputs, we compare the genes using this sort order. If G0 is greater than G1, we will swap them so that G0 is the smaller of the two.
Finding a Match
Given two trimmed, sorted genes G0 and G1, our distance algorithm will recursively identify and compare segments that differ between the genes.
The first step is finding a match. We will use the constant MIN_MATCH = 4.
We'll take a "window" of MIN_MATCH characters (4, as defined below) within G0 and "slide" it along G1 looking for a match. For example, if G0 was ACIOCAO and G1 was CIAACIOCCC, then we can slide ACIO (the first 4 characters of G0) along G1: first we would check it against CIAA, then IAAC, then AACI and then (hooray!) ACIO. Obviously, we got lucky here and found a match at the very start of G0. In general we would have needed to slide our window to the right: after checking ACIO we would have checked CIOC, then IOCA, then OCAO. We didn't need to do those checks in this particular example, but if we didn't get a match on the first try this is what we would have done next.
To illustrate:
G0 ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G1 OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Slide ...
G0 ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G1 OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Slide ...
G0 ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G1 OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Slide ...
G0 ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G1 OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Match!
If you reach the end of G1 and still haven't found a match, slide the window in G0 and start again from the beginning in G1, comparing G01–4 to G00–3, and so forth. This is just a nested loop, comparing all possible 4-character substrings of G0 and G1 until a match is found; the reason you must make the sliding G1 window the inner loop and G0 the outer is, once again, to ensure that everyone implements the same algorithm and thus gets the same answers.
If a match is found, you know that G0 and G1 have 4 characters in common at that point—but they might have more. "Grow" the match to include any other characters that match in a contiguous segment.
G0 ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G1 OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Growing the match
Using the Match to Compute Distance
If there is no match, then the two genes (really, fragments of the top-level genes Ga and Gb) have no significant portions in common, and the distance between them is determined directly by the scoring algorithm described in the simplified algorithm.
Otherwise, we will split each gene into three segments— everything to the left of the match, the match itself, and everything to its right. The left and right segments are possibly empty. We now run the score calculation on the left side segments of G0 and G1 and add this to our total distance. The match segment is discarded. And we run the distance algorithm recursively on the right segment.
AOOCICOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
OOCACAIOIOAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Green segments are scored and added to the total. Yellow match segments are ignored. The distance algorithm is run recursively on the red segments.
The final distance, of course, is simply the sum of all of the scores computed for the non-matching regions. Here, at last, is pseudo-code for the distance algorithm:
Note: The trim and sort steps are not included in the recursion.
You will find that when genes are reasonably similar, the distance tends to be 750 or less (often much less). When genes are very different the score rises sharply.
Notice the pains that have been taken to ensure that the entire gene comparison algorithm is completely specified, so there is actually a single correct solution to these problems. If two different versions read in the same Animal data files they should (a) find the same genes, (b) give any given gene the same number that any other version would pick, (c) compute the same gene-to-gene distances, and hence (d) arrive at the identical table of distances.
Our testing procedure expects this and tests for it.
Implementation
In this assignment, you will write a program that prints out a comparison matrix of genes. Each cell (i,j) in the comparison matrix contains the computed distance between Gi and Gj, where genes are numbered according to their sort order. The output will look something like this:
You will accomplish this by filling in the stub methods for the GeneComparison class,
These are the steps your main method should follow:
- Parse each file in args using your SpeciesReader from assignment 1. Find DNA by searching for the DNA command, and extract the genes from it using your DNAParser.
- Collect the genes from all species in a single collection. Be sure to not include duplicates in the collection if the same gene is read twice.
- After you've finished reading files, sort the genes according to the sort order described earlier. We will name the genes G0, G1, G2, ... , GN according to this order.
- Efficiently compute the table of all-to-all gene comparisons. Our gene comparison rule is commutative, in that the genetic distance from G3 to G9 is identical to the distance from G9 to G3. You should exploit this symmetry to do less computation.
- Optional, for extra credit: Include JUnit tests for the major classes in your solution. Each test should include comments explaining it tests.
Tip: At this point, you may want to wrap gene strings in a custom Gene class. This is optional, but a Gene class is an intuitive location for gene comparison methods.
You should use one or more Java Collection classes to collect and sort genes. Each type of collection has different strengths and weaknesses; read the descriptions of each to learn what they do. Notably, some are good for sorting, and others are good for culling duplicates:
ArrayList,
HashMap,
HashSet,
TreeMap,
TreeSet
These documents refer to natural sort order, which you can define for your class by implementing the Comparable interface. You can also impose an external sort order on objects by passing a Comparator instance to the Collections.sort method.
Matrix print format
Your program should first print a header that lists all of the genes in sorted order, along with their names starting with G0. The name and value of the gene must be separated by an equals sign.
Example:
Next, print a blank line, followed by a line of labels for the columns of the matrix starting with G0.
Then print the matrix, one row per line. The top row should contain comparisons with G0, such that the upper-left cell is a comparison of G0 with itself. Row elements should be separated by whitespace (spaces and tabs). You don't need to "pretty-print" your matrix into even columns, although doing so may help you test your code. Our parser will ignore any extra whitespace before the first or after the last element in a row. The matrix should contain only integers.
Tip: Use System.out.printf to format your output. For example, System.out.printf("%5d ", 137) prints the integer 137 into a right-justified five-character area followed by another space.
Matrix Comments
You can leave comments in your output using //, just like Java. Our parser will ignore the // operator and everything following it on the same line. Our examples use comments to label matrix rows; you don't need to do this, but it is useful.
Our parser will also ignore blank lines.
What to turn in
You will submit a .jar file containing your source code and compiled classes. The name of the .jar file is unimportant, but it must contain your GeneComparison class and all of your source code.
All of your classes should belong to the package netId.assignment2, where netId is your Net ID.
Note: Because you are implementing a runnable program, and your main method is important, you will want to (and we will test, so you must) include a "Manifest file" in your jar. This is a bit of extra information that indicates which class contains the main method, so that anybody can run your jar file. In Eclipse, when exporting your project as a Jar, you must click "Next" (rather than "Finish" immediately) until you see options about the manifest; you want Eclipse to "Generate a manifest" and you must set your "Main class" or "Main method" to be your GeneComparison class. If you do not do this, your code will not be runnable as a jar, and the test suite will catch this.
Can I put other classes in the .jar? What other classes should I have?
Yes! You are free to create whatever classes you want; this is software engineering practice. Your .jar should contain all of the classes you use for your program. As usual, leave comments that describe the reasoning behind your classes and methods.
If you use your SpeciesReader and DNAParser implementations from assignment 1, please copy them into your netId.assignment2 package.
Tip: Classes can be copied and pasted between packages using the Package Explorer tab in Eclipse.
If your SpeciesReader or DNAParser do not work or you do not wish to use them, you may request the source code for the official assignment 1 solution from a TA. It is OK to use this source code for assignment 2, but you must have submitted assignment 1.
What should my program do with malformed input?
Your program's behavior under bad input is unspecified. We will not test against bad data.
If your program won't run against our test we may ask you to repair the formatting problem that caused the issue and resubmit. In such cases you will receive a penalty.
If you created JUnit tests for the main classes in your solution, you will receive extra credit. This won't take you over the maximum possible credit for Assignment 2, but could compensate for points lost elsewhere in the assignment. Extra credit doesn't carry over to other assignments.
How we plan to test your code
We will test your program against several simple gene comparisons and a large set of randomly generated genes. In each case, we will test both your printed matrix output and the result of your geneDistance function.
A test suite is available for download under the CMS page for this assignment, named A2ComplianceTool.jar. The test suite will verify that your .jar contains the GeneComparison class, source code for that and any other classes you provide, a correct manifest file (see above), and will test the output format of your code and some simple sanity checks on the matrix you produce.
The test suite for this assignment, unlike assignment 1, expects additional arguments. As before, specify the jar file containing your submission as the first argument; then provide the arguments to your main method (e.g. "A0.dat" or "A1.dat A2.dat") after. Thus, the run configuration for your test suite invocation might have the arguments "assignment2.jar A1.dat A2.dat". The test suite will test your program under the arguments you provide, so you can see it for whatever input files you are using or have devised.
Because your program prints to System.out, you must be very careful to format your printed matrix as described above. The test suite must be able to parse your output.
Results for a sample situation
To help you debug your code, we implemented a solution and ran our own version a few times. First we executed it using a data set where you can check the gene comparison algorithm by hand and make sure it is doing the right thing. For this purpose, we created a file FakeData.dat and included a DNA string with the following eight genes, obtaining the eight-by-eight distances matrix shown below. If you are getting different results something is either wrong with your program, or with ours!
Here are some larger examples, using files from AnimalData.zip.
Downloads
- FakeData.dat - a small test file
- AnimalData.zip - test data set