CS 2110 Homework Assignment 2

Due: Tue 9/22 11:59pm

Description

This assignment builds on your solution to Assignment 1, but unlike Assignment 1, you have permission to work in teams of two if you so desire. We don't actually recommend doing so; the project is designed to be done by individuals and there is no real need for teams. But we also know that some people strongly prefer to work with a partner and since many people in the computing industry do work in teams throughout their careers, we're allowing that. Both team members must identify themselves as part of a team and should hand in the identical solution.

We are making available Ken's solution to Homework 1 but won't release until the morning after Assignment 1 was due. You may use that as a starting point if your own version of Assignment 1 was hopelessly broken. However, if you do so, you must indicate that you based your code on Ken's solution.

Assignment 2 is about calculating gene distance. The closer two genes are, the more likely it is that one evolved from or into the other, or that they share a recent common ancestor. In our algorithm, matching sections of the two genes that contribute nothing ot the distance calculation, and a heuristic is used to determine a score for differing sections. The higher the total score, the more distant the two genes are considered to be.

A Simplified Algorithm

For ease of exposition, let us begin with an oversimplified algorithm—not the one you will finally implement—that provides some intuition for the more sophisticated approach. Consider two genes, G_a and G_b, where G_a is of the form G_prefixG₀G_suffix and G_b is of the form G_prefixG₁G_suffix. G_prefix and G_suffix are the same for both genes, and may even be zero-length. Thus G₀ and G₁ differ only in their middle sections G₀ and G₁. Assume the prefix and suffix are maximal, such that G₀ and G₁ do not have any characters in common at either end.

Scoring

To determine the score for these two gene fragments G₀ and G₁, we first count the instances of each character. For example, if G₀ = AOOACIAIOCOO and G₁ = OCIIOOOIIOOOOOI, then we count for G₀ 3 A's, 2 C's, 2 I's and 5 O's. Likewise, for G₁ we have 0 A's, 1 C's, 5 I's and 9 O's.

Now we compare these numbers. We use three constants—SCOST, DCOST, and DDCOST— in our score calculation.

If a character is in one fragment but not the other, add DDCOST for each instance of that character. Example: There are three As is in G₀ but none in G₁, so we add DDCOST*3 to the score.
If a character appears in both fragments, add SCOST for each common occurrence, and DCOST for each difference. Character I appears twice in G₀ and five times in G₁. Thus, I contributes SCOST*2 + DCOST*3 to the total score. Likewise, C contributes SCOST + DCOST and O contributes SCOST*5 + DCOST*4.

The nominal constant values are SCOST=15, DCOST=8, and DDCOST=25, giving a total score of SCOST * 8 + DCOST * 8 + DDCOST * 3 = 259 for the example.

Tip: Store the constants as static final variables in a Constants class. Put any other constants you use here as well.

public class Constants {
    public static final int SCOST = 15;   // Score for each common instance of a character
    public static final int DCOST = 8;    // Score for difference in number of instances
    public static final int DDCOST = 25;  // Score for characters appearing only in one gene fragment
    public static final int MIN_MATCH = 4;  // Sliding window minimum match length
}

This scoring function is a heuristic: a rule of thumb that will suffice for our purposes in CS2110. It may seem a bit ad-hoc, but in fact the Smith-Wasserman and BLAST algorithms use similar heuristics. In assignment 3, you will get even closer to the Smith-Wasserman algorithm. The reason for using heuristics of this sort is that computing the exact way that some gene evolved into another is such a hard problem—there are too many possibilities to explore, and in general, more than one possibility might lead to the right place. Worse, one can't really know that evolution followed the shortest path. So this is a case in which there isn't any way to know the "true" answer. A good rule of thumb may be the best that can be done!

The Gene Distance Algorithm

The above algorithm is too simple, because there might yet be large parts of G₀ and G₁ that are identical, and the algorithm should detect those parts and only score the parts that are actually different. The result will become recursive, breaking the overall genes G_a and G_b down into matching and non-matching regions, ultimately calling the scoring function above for each of the non-matching regions. The final distance will be the sum of all of those scores.

Trimming Common Affixes

As in the simplified algorithm, we will ignore common prefixes and suffixes in G_a and G_b. That is, the first thing to do with your two input genes is find the longest strings G_prefix and G_suffix such that G_a is of the form G_prefixG₀G_suffix, and G_b is of the form G_prefixG₁G_suffix. Throw away G_prefix and G_suffix; the result of the trimming operation is just G₀ and G₁. Note that they necessarily differ in both first and last characters, or the prefix or suffix could have been longer; actually either or both can be empty as well—if both are empty, it is because G_a and G_b are identical, with distance zero.

In some cases, the order in which you trim the prefix and suffix matters. For this assignment, trim the prefix first.

Sorting Genes

In order to ensure that the algorithm described below is deterministic, and everybody gets the same results, it is important to make sure the computation always happens the same way. We will define a sort order for genes as follows:

If G₀ is shorter (has fewer characters) than G₁, it comes first.
If G₀ and G₁ are the same length, and G₀ comes first alphabetically, it comes first.
Otherwise, G₁ comes first.

In other words, sort first by length (increasing), and sub-sort alphabetically.

After trimming the common prefix and suffix from our inputs, we compare the genes using this sort order. If G₀ is greater than G₁, we will swap them so that G₀ is the smaller of the two.

Finding a Match

Given two trimmed, sorted genes G₀ and G₁, our distance algorithm will recursively identify and compare segments that differ between the genes.

The first step is finding a match. We will use the constant MIN_MATCH = 4.

We'll take a "window" of MIN_MATCH characters (4, as defined below) within G₀ and "slide" it along G₁ looking for a match. For example, if G₀ was ACIOCAO and G₁ was CIAACIOCCC, then we can slide ACIO (the first 4 characters of G₀) along G₁: first we would check it against CIAA, then IAAC, then AACI and then (hooray!) ACIO. Obviously, we got lucky here and found a match at the very start of G₀. In general we would have needed to slide our window to the right: after checking ACIO we would have checked CIOC, then IOCA, then OCAO. We didn't need to do those checks in this particular example, but if we didn't get a match on the first try this is what we would have done next.

To illustrate:

G₀ ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G₁ OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Slide ...
G₀ ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G₁ OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Slide ...
G₀ ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G₁ OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Slide ...
G₀ ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G₁ OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Match!

If you reach the end of G₁ and still haven't found a match, slide the window in G₀ and start again from the beginning in G₁, comparing G_{0_1–4} to G_{0_0–3}, and so forth. This is just a nested loop, comparing all possible 4-character substrings of G₀ and G₁ until a match is found; the reason you must make the sliding G₁ window the inner loop and G₀ the outer is, once again, to ensure that everyone implements the same algorithm and thus gets the same answers.

If a match is found, you know that G₀ and G₁ have 4 characters in common at that point—but they might have more. "Grow" the match to include any other characters that match in a contiguous segment.

G₀ ICICCOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
G₁ OOCICICCOAAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI
Growing the match

Using the Match to Compute Distance

If there is no match, then the two genes (really, fragments of the top-level genes G_a and G_b) have no significant portions in common, and the distance between them is determined directly by the scoring algorithm described in the simplified algorithm.

Otherwise, we will split each gene into three segments— everything to the left of the match, the match itself, and everything to its right. The left and right segments are possibly empty. We now run the score calculation on the left side segments of G₀ and G₁ and add this to our total distance. The match segment is discarded. And we run the distance algorithm recursively on the right segment.

AOOCICOAAIOAIICOCIOACCAIAIAIOCAAOAOACCOI
OOCACAIOIOAIIOIACOOIACOCOCAOACOAAIOOOOIAOOAAAI

Green segments are scored and added to the total. Yellow match segments are ignored. The distance algorithm is run recursively on the red segments.

The final distance, of course, is simply the sum of all of the scores computed for the non-matching regions. Here, at last, is pseudo-code for the distance algorithm:

distance(G_a, G_b): G₀, G₁ = sort (trim (G_a, G_b)) return distance'(G₀, G₁) distance'(G₀, G₁) G_{0_left}, G_{1_left}, G_{0_right}, G_{1_right} = findMatch (G₀, G₁) return simpleScore(G_{0_left}, G_{1_left}) + distance'(G_{0_right}, G_{1_right})

Note: The trim and sort steps are not included in the recursion.

You will find that when genes are reasonably similar, the distance tends to be 750 or less (often much less). When genes are very different the score rises sharply.

Notice the pains that have been taken to ensure that the entire gene comparison algorithm is completely specified, so there is actually a single correct solution to these problems. If two different versions read in the same Animal data files they should (a) find the same genes, (b) give any given gene the same number that any other version would pick, (c) compute the same gene-to-gene distances, and hence (d) arrive at the identical table of distances.

Our testing procedure expects this and tests for it.

Implementation

In this assignment, you will write a program that prints out a comparison matrix of genes. Each cell (i,j) in the comparison matrix contains the computed distance between G_i and G_j, where genes are numbered according to their sort order. The output will look something like this:

G0=A G1=OCA G2=AACA G3=OOCA G4=AAACA G5=OOOCA G6=AAAOOCA G7=AIOCAAAOOAIOCAIOCCA G0 G1 G2 G3 G4 G5 G6 G7 0 50 75 75 100 100 150 450 // G0 50 0 75 25 100 50 100 347 // G1 75 75 0 100 25 125 75 322 // G2 75 25 100 0 125 25 75 354 // G3 100 100 25 125 0 150 50 329 // G4 100 50 125 25 150 0 100 361 // G5 150 100 75 75 50 100 0 300 // G6 450 347 322 354 329 361 300 0 // G7

You will accomplish this by filling in the stub methods for the GeneComparison class,

public class GeneComparison { public static int geneDistance(String gene1, String gene2) { /* Returns the computed distance between two genes */ ... } public static void main(String[] args) { /* Prints all-to-all comparison matrix of genes extracted from files given in args */ ... } }

These are the steps your main method should follow:

Parse each file in args using your SpeciesReader from assignment 1. Find DNA by searching for the DNA command, and extract the genes from it using your DNAParser.

Tip: At this point, you may want to wrap gene strings in a custom Gene class. This is optional, but a Gene class is an intuitive location for gene comparison methods.

Collect the genes from all species in a single collection. Be sure to not include duplicates in the collection if the same gene is read twice.
After you've finished reading files, sort the genes according to the sort order described earlier. We will name the genes G0, G1, G2, ... , GN according to this order.

You should use one or more Java Collection classes to collect and sort genes. Each type of collection has different strengths and weaknesses; read the descriptions of each to learn what they do. Notably, some are good for sorting, and others are good for culling duplicates: ArrayList, HashMap, HashSet, TreeMap, TreeSet

These documents refer to natural sort order, which you can define for your class by implementing the Comparable interface. You can also impose an external sort order on objects by passing a Comparator instance to the Collections.sort method.

Efficiently compute the table of all-to-all gene comparisons. Our gene comparison rule is commutative, in that the genetic distance from G₃ to G₉ is identical to the distance from G₉ to G₃. You should exploit this symmetry to do less computation.
Optional, for extra credit: Include JUnit tests for the major classes in your solution. Each test should include comments explaining it tests.

Matrix print format

Your program should first print a header that lists all of the genes in sorted order, along with their names starting with G0. The name and value of the gene must be separated by an equals sign.

Example:

G0=A G1=OCA G2=AACA G3=OOCA G4=AAACA G5=OOOCA G6=AAAOOCA G7=AIOCAAAOOAIOCAIOCCA

Next, print a blank line, followed by a line of labels for the columns of the matrix starting with G0.

G0 G1 G2 G3 G4 G5 G6 G7

Then print the matrix, one row per line. The top row should contain comparisons with G0, such that the upper-left cell is a comparison of G0 with itself. Row elements should be separated by whitespace (spaces and tabs). You don't need to "pretty-print" your matrix into even columns, although doing so may help you test your code. Our parser will ignore any extra whitespace before the first or after the last element in a row. The matrix should contain only integers.

0 50 75 75 100 100 150 450 50 0 75 25 100 50 100 347 75 75 0 100 25 125 75 322 75 25 100 0 125 25 75 354 100 100 25 125 0 150 50 329 100 50 125 25 150 0 100 361 150 100 75 75 50 100 0 300 450 347 322 354 329 361 300 0

Tip: Use System.out.printf to format your output. For example, System.out.printf("%5d ", 137) prints the integer 137 into a right-justified five-character area followed by another space.

Matrix Comments

You can leave comments in your output using //, just like Java. Our parser will ignore the // operator and everything following it on the same line. Our examples use comments to label matrix rows; you don't need to do this, but it is useful.

Our parser will also ignore blank lines.

What to turn in

You will submit a .jar file containing your source code and compiled classes. The name of the .jar file is unimportant, but it must contain your GeneComparison class and all of your source code.

All of your classes should belong to the package netId.assignment2, where netId is your Net ID.

Note: Because you are implementing a runnable program, and your main method is important, you will want to (and we will test, so you must) include a "Manifest file" in your jar. This is a bit of extra information that indicates which class contains the main method, so that anybody can run your jar file. In Eclipse, when exporting your project as a Jar, you must click "Next" (rather than "Finish" immediately) until you see options about the manifest; you want Eclipse to "Generate a manifest" and you must set your "Main class" or "Main method" to be your GeneComparison class. If you do not do this, your code will not be runnable as a jar, and the test suite will catch this.

Can I put other classes in the .jar? What other classes should I have?

Yes! You are free to create whatever classes you want; this is software engineering practice. Your .jar should contain all of the classes you use for your program. As usual, leave comments that describe the reasoning behind your classes and methods.

If you use your SpeciesReader and DNAParser implementations from assignment 1, please copy them into your netId.assignment2 package.

Tip: Classes can be copied and pasted between packages using the Package Explorer tab in Eclipse.

If your SpeciesReader or DNAParser do not work or you do not wish to use them, you may request the source code for the official assignment 1 solution from a TA. It is OK to use this source code for assignment 2, but you must have submitted assignment 1.

What should my program do with malformed input?

Your program's behavior under bad input is unspecified. We will not test against bad data.

If your program won't run against our test we may ask you to repair the formatting problem that caused the issue and resubmit. In such cases you will receive a penalty.

If you created JUnit tests for the main classes in your solution, you will receive extra credit. This won't take you over the maximum possible credit for Assignment 2, but could compensate for points lost elsewhere in the assignment. Extra credit doesn't carry over to other assignments.

How we plan to test your code

We will test your program against several simple gene comparisons and a large set of randomly generated genes. In each case, we will test both your printed matrix output and the result of your geneDistance function.

A test suite is available for download under the CMS page for this assignment, named A2ComplianceTool.jar. The test suite will verify that your .jar contains the GeneComparison class, source code for that and any other classes you provide, a correct manifest file (see above), and will test the output format of your code and some simple sanity checks on the matrix you produce.

The test suite for this assignment, unlike assignment 1, expects additional arguments. As before, specify the jar file containing your submission as the first argument; then provide the arguments to your main method (e.g. "A0.dat" or "A1.dat A2.dat") after. Thus, the run configuration for your test suite invocation might have the arguments "assignment2.jar A1.dat A2.dat". The test suite will test your program under the arguments you provide, so you can see it for whatever input files you are using or have devised.

Because your program prints to System.out, you must be very careful to format your printed matrix as described above. The test suite must be able to parse your output.

Results for a sample situation

To help you debug your code, we implemented a solution and ran our own version a few times. First we executed it using a data set where you can check the gene comparison algorithm by hand and make sure it is doing the right thing. For this purpose, we created a file FakeData.dat and included a DNA string with the following eight genes, obtaining the eight-by-eight distances matrix shown below. If you are getting different results something is either wrong with your program, or with ours!

// File(s): FakeData.dat G0=A G1=OCA G2=AACA G3=OOCA G4=AAACA G5=OOOCA G6=AAAOOCA G7=AIOCAAAOOAIOCAIOCCA G0 G1 G2 G3 G4 G5 G6 G7 0 50 75 75 100 100 150 450 // G0 50 0 75 25 100 50 100 347 // G1 75 75 0 100 25 125 75 322 // G2 75 25 100 0 125 25 75 354 // G3 100 100 25 125 0 150 50 329 // G4 100 50 125 25 150 0 100 361 // G5 150 100 75 75 50 100 0 300 // G6 450 347 322 354 329 361 300 0 // G7

Here are some larger examples, using files from AnimalData.zip.

// File(s): A0.dat G0=ACAAOOCOCOOIAOOOCOACICCACIOACIIOAOAIACCOOAOIAICOCCCOOOAIAAICOACIIOCOAIICOC G1=AIAAAAAICCCCOIIOOOACIOAAIOOIAOACOAICIOCCOAAOIOCCACAIAOACAAAIIIOCIIOAICCCOC G2=AIOCCICCAOIICCOCCCOACOAIIOACCAAOAICOAAOOOICICOCCIAAOIOCCCICICCCCICCOCACAACICOCC G3=AOACAAOACOIAAAACCCAACICICIAOIIACIOAIAACCIAOIOOAIACIOCCCCIIAIACOCACACIIOCIAIICOIC G4=AICICICOAAAIICOCIOACCAIAICOOCCOCIIOAAIAOAACIAOOACICOCCIIOCAOICICIAACIOAAAOICCIOACCICOCAAOOCAOAOACCOIAC G5=AOOCACAOOIAAIIOAOIAIICIAOOCOCCIIOOAAACCAOAOAOOAIAOIOOOCCCOOIICCAOAOAICOCAICOACOAAICOOOCIAOOOCOCCOOAOIOOACC G6=ACCOOIOOIOAIIIIOAACIAAOOCACAIOIOAIIOIACOOIAOACOAAOOOIAOOAAAIOOCCIICOIOIACOIOCOCAACOICIAAOOCCCOIACIACCOCOCCAOOAC G0 G1 G2 G3 G4 G5 G6 0 1334 2081 1853 2978 2389 1851 // G0 1334 0 1935 1458 3116 2935 2279 // G1 2081 1935 0 2121 2038 2378 3482 // G2 1853 1458 2121 0 3316 2410 1999 // G3 2978 3116 2038 3316 0 2381 2416 // G4 2389 2935 2378 2410 2381 0 3400 // G5 1851 2279 3482 1999 2416 3400 0 // G6

// File(s): A1.dat, A2.dat G0=ACAAOOCAOOOOCOCOOIAOOOCOACICCACIOACIIOCOOAOIAIAICOCCCOCCCACIOOCOOOOACIIOCOAIICOC G1=ACAAOOCOCOOIAOOOCOACICCACACAAICIOACIIOAOAIACCOOAOIAICOCCCOOOAIAAICOACIIOCOAIICOC G2=AIAAAAAICCCCOACCOICIIOOOACIOAAIOOIOAAOIOCCACAIAOACAAAIIIACAAACOCIAAOAACIOAICCCOC G3=AIAAAAAICCCCOIIOOOACIOAAAOOIOCIOOIAOACOAICIOCCOAAOIOCCACAIAOACAAAIIIOCIIOAICCCOC G4=AIOCAIIOOCCICCOCCCOACOAIIOAOOIACACCOAAOOOAOACACICICOCCICCICICCCCICCOCAAIICCCCAACICOCC G5=AIOCCICCAOIICCOACIAICCCCOACOAIIOACCAAOAICOAAOOOICICOCCIAAOIOCCCICICCCCICCOCACAACICOCC G6=AOACAAOACOIAAAACCCAACICICIAOIIACIOAIAACCIAOIOOAIACIOCCAIIOICCCIIAIACOCACACIIOCIAIICOIC G7=AOAIOICCACAAAOAIACOACOIACAACICICIIOAIAACCIAAAAAICOIOOAIACAICCACIOCCCCIIAIACIIOCIAIICOIC G8=AICICICOAAAIICOCIOACCAIAICOOCCOCIIOAAIAOAACIAAOAIACOOACICOCCIIOCAOICICIAACIOAAAOICCIOACCICOCAAOOCAOAOACCOIAC G9=AICICICOAAIOACCAICOCIIOAAIAOAACIAOOACAOIIOCIAAAIOCCOCCIIOCAOICICIAACIOAAAOICCIOACCICOCAAAOACACOOCAOAOACCOIAC G10=AOOCACAOOIAAIIOAOIAIICIAOOCOCCIIOOAAACCAOAOAOOAIAOIOOOCCCICICCOOIICCAOAOAICOCAICOACOAAICOOOCIAOOOCOCCOOAOIOOACC G11=AAOOIAAIIOAOIAIICIAOOCOCCIIOOAAACCAOAOOOOCAACCCCCOAOOAIAOIOOOCCCOOIICCAOOCAICOACOAAICOOOCIAOOOCOCCOOAIOCACAOIOOACC G12=ACCOOIOAIIOCCOIOAIIIIOAACIAAOOCACAIOIOAIIOIACOOIAOACOAAOOOIAOOAAAIOOCCIICOIOIACOIOCOCAACOICIAAOOCCCOIACIACCOCOCCAOOAC G13=ACCOOIOOIOAIIIIOAACIAAOOCACAIOIOAIIOIACOOIACOCOCAOACOAAOOOIAOOAAAIOOCCIICOIOIACOIOCOCAACOICIAAOOCCCOIACIACCOCOCCAOOAC G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 0 905 1624 1505 1869 2018 1981 2215 2866 2061 3226 2437 2465 2170 // G0 905 0 1860 1555 2089 2347 2519 1968 3227 1458 2664 2186 2043 1956 // G1 1624 1860 0 877 2221 1656 1770 1775 2102 2073 3037 3075 2844 2878 // G2 1505 1555 877 0 2131 1208 1768 1950 3382 3278 3176 3251 2426 2460 // G3 1869 2089 2221 2131 0 2778 1872 1325 2062 1256 3118 3173 3116 2810 // G4 2018 2347 1656 1208 2778 0 2247 2478 2232 2162 2445 2520 3612 3748 // G5 1981 2519 1770 1768 1872 2247 0 1510 3548 2788 2617 2058 2273 2239 // G6 2215 1968 1775 1950 1325 2478 1510 0 2797 2792 2708 4070 2773 2790 // G7 2866 3227 2102 3382 2062 2232 3548 2797 0 2486 2739 2366 2613 2630 // G8 2061 1458 2073 3278 1256 2162 2788 2792 2486 0 2933 2752 2431 2448 // G9 3226 2664 3037 3176 3118 2445 2617 2708 2739 2933 0 2872 3675 3624 // G10 2437 2186 3075 3251 3173 2520 2058 4070 2366 2752 2872 0 3862 3828 // G11 2465 2043 2844 2426 3116 3612 2273 2773 2613 2431 3675 3862 0 983 // G12 2170 1956 2878 2460 2810 3748 2239 2790 2630 2448 3624 3828 983 0 // G13

Downloads

FakeData.dat - a small test file
AnimalData.zip - test data set