CS 2110 Homework Assignment 2
Due: Thursday 9/23 11:59pm
Description
Assignment 2 is about calculating gene distance. The closer two genes are, the more likely it is that one evolved from or into the other, or that they share a recent common ancestor. In our algorithm, matching sections in the two genes contribute nothing to the distance calculation, and a heuristic is used to determine a score for differing sections. The higher the total score, the more distant the two genes are considered to be.
This assignment builds on your solution to Assignment 1, but unlike Assignment 1, you have permission to work in teams of two if you so desire. We don't actually recommend doing so; the project is designed to be done by individuals and there is no real need for teams. But we also know that some people strongly prefer to work with a partner and since many people in the computing industry do work in teams throughout their careers, we're allowing that. Both team members must identify themselves as part of a team and should hand in the identical solution.
We are making available Ken's solution to Assignment 1. You may use
that as a starting point if your own version of Assignment 1 was
hopelessly broken. However, if you do so, you must indicate that you
based your code on Ken's solution.
A Simplified Gene Distance Algorithm
For ease of exposition, let us begin with an oversimplified algorithm—not the one you will finally implement—that provides some intuition for the more sophisticated approach. Consider two genes, Ga and Gb, where Ga is of the form GprefixG0Gsuffix and Gb is of the form GprefixG1Gsuffix. Gprefix and Gsuffix are the same for both genes, and may even be zero-length. Thus G0 and G1 differ only in their middle sections G0 and G1. Assume the prefix and suffix are maximal, such that G0 and G1 do not have any characters in common at either end.
Scoring
To determine the score for these two gene fragments G0 and G1, we first count the instances of each character. For example, if G0 = IAAIYTITAYAA and G1 = AYTTAAATTAAAAAT then we count for G0 3 I's, 2 Y's, 2 T's and 5 A's. Likewise, for G1 we have 0 I's, 1 Y's, 5 T's and 9 A's.
Now we compare these numbers. We use three constants—COMMON_COST, DIFF_NUM_COST, and ONLY_IN_ONE_COST— in our score calculation.
- If a character is in one fragment but not the other, add ONLY_IN_ONE_COST for each instance of that character. Example: There are three I's is in G0 but none in G1, so we add ONLY_IN_ONE_COST*3 to the score.
- If a character appears in both fragments, add COMMON_COST for each common occurrence, and DIFF_NUM_COST for each difference. Character T appears twice in G0 and five times in G1. Thus, T contributes COMMON_COST*2 + DIFF_NUM_COST*3 to the total score. Likewise, Y contributes COMMON_COST + DIFF_NUM_COST and A contributes COMMON_COST*5 + DIFF_NUM_COST*4.
The nominal constant values are COMMON_COST=15, DIFF_NUM_COST=8, and ONLY_IN_ONE_COST=25, giving a total score of COMMON_COST * 8 + DIFF_NUM_COST * 8 + ONLY_IN_ONE_COST * 3 = 259 for the example.
This scoring function is a heuristic: a rule of thumb that will suffice for our purposes in CS2110. It may seem a bit ad-hoc, but in fact the Smith-Wasserman and BLAST algorithms use similar heuristics. In assignment 3, you will get even closer to the Smith-Wasserman algorithm. The reason for using heuristics of this sort is that computing the exact way that some gene evolved into another is such a hard problem—there are too many possibilities to explore, and in general, more than one possibility might lead to the right place. Worse, one can't really know that evolution followed the shortest path. So this is a case in which there isn't any way to know the "true" answer. A good rule of thumb may be the best that can be done!
The Complete Gene Distance Algorithm
The above algorithm is too simple, because there might yet be large
parts of G0 and G1 that are identical, and the
algorithm should detect those parts and only score the parts that are
actually different. The result will become recursive, breaking the
overall genes Ga and Gb down into matching and
non-matching regions, ultimately calling the scoring function above for
each of the non-matching regions. The final distance will be the sum of
all of those scores.
Step 1: Trim Common Affixes
As in the simplified algorithm, we will ignore common prefixes and suffixes in Ga and Gb. That is, the first thing to do with your two input genes is find the longest strings Gprefix and Gsuffix such that Ga is of the form GprefixG0Gsuffix, and Gb is of the form GprefixG1Gsuffix. Throw away Gprefix and Gsuffix; the result of the trimming operation is just G0 and G1. Note that they necessarily differ in both first and last characters, or the prefix or suffix could have been longer. Either or both of G0 and G1 can be empty as well—if both are empty, it is because the two genes are identical, with distance zero.
In some cases, the order in which you trim the prefix and suffix
matters. For this assignment, trim the prefix first.
Step 2: Sort the Trimmed Genes
In order to ensure that the algorithm described below is deterministic, and everybody gets the same results, it is important to make sure the computation always happens the same way. We will define a sort order for genes as follows:
- If G0 is shorter (has fewer characters) than G1, it comes first.
- If G0 and G1 are the same length, and G0 comes first alphabetically, it comes first.
- Otherwise, G1 comes first.
In other words, sort first by length (increasing), and sub-sort alphabetically.
After trimming the common prefix and suffix from our inputs, we
compare the genes using this sort order. If G0 is greater
than G1 , swap them.
This sort order will be used elsewhere, so it is worthwhile to code
it in a re-usable way. One way to do this would be to
implement Comparable<Gene>
in your Gene class.
Step 3: Sliding Window Matching
Given two trimmed, sorted genes G0 and G1, our
distance algorithm will recursively identify and compare segments that
differ between the genes. Although we have trimmed common affixes,
there might be internal fragments of DNA that are identical; this
"sliding window" algorithm is designed to identify identical segments
of DNA.
The first step is finding a match. We will use the constant MIN_MATCH = 4.
We'll take a "window" of MIN_MATCH characters within G0 and "slide" it from left to right along G1 looking for a match.
To illustrate:
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Slide ...
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Slide ...
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Slide ...
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Match!
If you reach the end of G1 and still haven't found a match, slide the window in G0 to the right by one character and repeat. This is just a nested loop, comparing all possible four-character substrings of G0 and G1 until a match is found; the reason you must make the sliding G1 window the inner loop and G0 the outer is to ensure that everyone implements the same algorithm and thus gets the same answers.
If a match is found, you know that G0 and G1
have four characters in common at that point—but they might have more.
"Grow" the match to include all other characters that match in a
contiguous segment. In the illustration below, the window is expanded
to include five additional matching characters before it reaches
non-matching characters.
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Growing the match
Step 4: Using the Match to Compute Distance
If there is no match, then the two genes (really, fragments of the top-level genes Ga and Gb) have no significant portions in common, and the distance between them is determined directly by the scoring algorithm described in the simplified algorithm.
Otherwise, we will split each gene into three segments—everything to the left of the match, the match itself, and everything to its right. The left and right segments are possibly empty. We first run the score calculation (as described in the simplified algorithm) on the left side segments of G0 and G1 and add this to our total distance. Second, the match segment is discarded. Third, we run the distance algorithm recursively on the right segment.
IAAYTYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
AAYIYITATAITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Green segments are scored. The distance algorithm is run recursively on the red segments. Yellow (matching) segments are discarded. The total distance is the sum of the green score and the recursive red score.
The final distance is the sum of all of the scores computed for the non-matching regions. Here, at last, is pseudo-code for the distance algorithm:
Note: The trim and sort steps are not included in the recursion.
Implementation
To begin, make a new package "cs2110.netid.assignment2" in your Eclipse project. Copy your classes (MyGene, MySpeciesReader, etc) from the first assignment into this package.
The first step of this assignment is to implement the gene distance
algorithm in your MyGene class:
Next, you will write a program that prints out a comparison matrix of genes. Each cell (i,j) in the comparison matrix contains the computed distance between Gi and Gj. Your program must sort the genes first according to the sorting rules, such that G0 is the least gene. The output will look something like this:
Do this in a class named Main. You
can
use
the
Main.java
we provided for the first assignment as a
starting point.
Main's main method should do the following:
- Read the .dat files specified in arguments array
- Collect a single set of all the genes belonging to all the input species. Be sure to eliminate duplicates.
- Sort the genes according to the sort order described above.
- Compute and print the gene comparison matrix. Format described
below.
Note: You should
use one or
more Java Collection classes to collect and sort genes. Each type of
collection has different strengths and weaknesses; read the
descriptions of each to learn what they do. Notably, some are good for
sorting, and others are good for culling duplicates. See: ArrayList,
HashMap,
HashSet,
TreeMap,
TreeSet
These documents refer to "natural sort order", which you can
define for your Gene class by implementing the Comparable
interface. You can also impose an external sort order on objects by
passing a Comparator
instance to the Collections.sort
method.
Tip: Decompose
your main method into smaller methods that perform specific tasks.
Matrix print format
Your program should first print a header that lists all of the genes in sorted order, along with their names starting with G0. The name and value of the gene must be separated by an equals sign.
Example:
Next, print a blank line, followed by a line of labels for the columns of the matrix starting with G0.
Then print the matrix, one row per line. The top row should contain comparisons with G0, such that the upper-left cell is a comparison of G0 with itself. Row elements should be separated by whitespace (spaces and tabs). You don't need to "pretty-print" your matrix into even columns, although doing so may help you test your code. Our parser will ignore any extra whitespace before the first or after the last element in a row. The matrix should contain only integers. Comments at the end (as shown below) are optional.
Tip: Use System.out.printf
to format your output. For example, System.out.printf("%5d ", 137)
prints the integer 137 into a right-justified five-character area
followed by another space. The String.format
method can be used to format strings without printing them.
Matrix Comments
You can leave comments in your output using //, just like Java. Our parser will ignore the // operator and everything following it on the same line. Our examples use comments to label matrix rows; you don't need to do this, but it is useful.
What to Turn in
You will submit a .jar file containing your source code and compiled
classes. The name of the .jar file is unimportant. Export all of the source code in your src directory, including your
assignment1 and the classes in cs2110.assignment1.
Be sure to check the box to
include source files in the export dialog.
As with the first assignment, you will also hand in a README file in
.txt or PDF format, containing a few sentences about your
implementation of the assignment.
Please follow the code style
guidelines.
Can I put other classes in the .jar? What other classes should I have?
You are free to create additional classes as you see fit. As usual, write comments that describe the reasoning behind your classes and methods.
Tip: Classes can be copied and pasted between packages using the Package Explorer tab in Eclipse.
What should my program do with malformed input?
Your program's behavior under bad input is unspecified. We will not test against bad data.
Sample Results
To help you debug your code, we implemented a solution and ran our own version a few times. First we executed it using a data set where you can check the gene comparison algorithm by hand and make sure it is doing the right thing. For this purpose, we created a file FakeData.dat and included a DNA string with the following eight genes, obtaining the eight-by-eight distances matrix shown below. If you are getting different results something is either wrong with your program, or with ours!
Here are some larger examples, using files from SpeciesData.zip.
Extra Credit
Extra credit will be awarded for meaningful JUnit tests. Be sure to mention them in your README file. As usual, extra credit points aren't added directly to your assignment score, but are tallied in a separate CMS column and considered at the end of semester when computing your final grade.Downloads
- FakeData.dat - a small test file
- SpeciesData.zip - same as for
assignment 1