CS 2110 Homework Assignment 2

Due: Thursday 9/23 11:59pm


Description

Assignment 2 is about calculating gene distance. The closer two genes are, the more likely it is that one evolved from or into the other, or that they share a recent common ancestor. In our algorithm, matching sections in the two genes contribute nothing to the distance calculation, and a heuristic is used to determine a score for differing sections. The higher the total score, the more distant the two genes are considered to be.

This assignment builds on your solution to Assignment 1, but unlike Assignment 1, you have permission to work in teams of two if you so desire. We don't actually recommend doing so; the project is designed to be done by individuals and there is no real need for teams. But we also know that some people strongly prefer to work with a partner and since many people in the computing industry do work in teams throughout their careers, we're allowing that. Both team members must identify themselves as part of a team and should hand in the identical solution.

We are making available Ken's solution to Assignment 1. You may use that as a starting point if your own version of Assignment 1 was hopelessly broken. However, if you do so, you must indicate that you based your code on Ken's solution.


A Simplified Gene Distance Algorithm

For ease of exposition, let us begin with an oversimplified algorithm—not the one you will finally implement—that provides some intuition for the more sophisticated approach. Consider two genes, Ga and Gb, where Ga is of the form GprefixG0Gsuffix and Gb is of the form GprefixG1Gsuffix. Gprefix and Gsuffix are the same for both genes, and may even be zero-length. Thus G0 and G1 differ only in their middle sections G0 and G1. Assume the prefix and suffix are maximal, such that G0 and G1 do not have any characters in common at either end.

Scoring

To determine the score for these two gene fragments G0 and G1, we first count the instances of each character. For example, if G0 = IAAIYTITAYAA and G1 = AYTTAAATTAAAAAT then we count for G0 3 I's, 2 Y's, 2 T's and 5 A's. Likewise, for G1 we have 0 I's, 1 Y's, 5 T's and 9 A's.

Now we compare these numbers. We use three constants—COMMON_COST, DIFF_NUM_COST, and ONLY_IN_ONE_COST— in our score calculation.

The nominal constant values are COMMON_COST=15, DIFF_NUM_COST=8, and ONLY_IN_ONE_COST=25, giving a total score of COMMON_COST * 8 + DIFF_NUM_COST * 8 + ONLY_IN_ONE_COST * 3 = 259 for the example.

Tip: It's a good idea to store the constants as static final variables in your Gene class instead of hard-coding "magic numbers" into your code.
// Score for each common instance of a character public static final int COMMON_COST = 15; // Score for difference in number of instances public static final int DIFF_NUM_COST = 8; // Score for characters appearing only in one gene fragment public static final int ONLY_IN_ONE_COST = 25; // Sliding window minimum match length public static final int MIN_MATCH = 4;

This scoring function is a heuristic: a rule of thumb that will suffice for our purposes in CS2110. It may seem a bit ad-hoc, but in fact the Smith-Wasserman and BLAST algorithms use similar heuristics. In assignment 3, you will get even closer to the Smith-Wasserman algorithm. The reason for using heuristics of this sort is that computing the exact way that some gene evolved into another is such a hard problem—there are too many possibilities to explore, and in general, more than one possibility might lead to the right place. Worse, one can't really know that evolution followed the shortest path. So this is a case in which there isn't any way to know the "true" answer. A good rule of thumb may be the best that can be done!

The Complete Gene Distance Algorithm

The above algorithm is too simple, because there might yet be large parts of G0 and G1 that are identical, and the algorithm should detect those parts and only score the parts that are actually different. The result will become recursive, breaking the overall genes Ga and Gb down into matching and non-matching regions, ultimately calling the scoring function above for each of the non-matching regions. The final distance will be the sum of all of those scores.

Step 1: Trim Common Affixes

As in the simplified algorithm, we will ignore common prefixes and suffixes in Ga and Gb. That is, the first thing to do with your two input genes is find the longest strings Gprefix and Gsuffix such that Ga is of the form GprefixG0Gsuffix, and Gb is of the form GprefixG1Gsuffix. Throw away Gprefix and Gsuffix; the result of the trimming operation is just G0 and G1. Note that they necessarily differ in both first and last characters, or the prefix or suffix could have been longer. Either or both of G0 and G1 can be empty as well—if both are empty, it is because the two genes are identical, with distance zero.

In some cases, the order in which you trim the prefix and suffix matters. For this assignment, trim the prefix first.

Step 2: Sort the Trimmed Genes

In order to ensure that the algorithm described below is deterministic, and everybody gets the same results, it is important to make sure the computation always happens the same way. We will define a sort order for genes as follows:

  1. If G0 is shorter (has fewer characters) than G1, it comes first.
  2. If G0 and G1 are the same length, and G0 comes first alphabetically, it comes first.
  3. Otherwise, G1 comes first.

In other words, sort first by length (increasing), and sub-sort alphabetically.

After trimming the common prefix and suffix from our inputs, we compare the genes using this sort order. If G0 is greater than G1 , swap them.

This sort order will be used elsewhere, so it is worthwhile to code it in a re-usable way.  One way to do this would be to implement  Comparable<Gene> in your Gene class.

Step 3: Sliding Window Matching

Given two trimmed, sorted genes G0 and G1, our distance algorithm will recursively identify and compare segments that differ between the genes. Although we have trimmed common affixes, there might be internal fragments of DNA that are identical; this "sliding window" algorithm is designed to identify identical segments of DNA.

The first step is finding a match. We will use the constant MIN_MATCH = 4.

We'll take a "window" of MIN_MATCH characters within G0 and "slide" it from left to right along G1 looking for a match. 

To illustrate:

G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Slide ...
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Slide ...
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Slide ...
G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Match!

If you reach the end of G1 and still haven't found a match, slide the window in G0 to the right by one character and repeat. This is just a nested loop, comparing all possible four-character substrings of G0 and G1 until a match is found; the reason you must make the sliding G1 window the inner loop and G0 the outer is to ensure that everyone implements the same algorithm and thus gets the same answers.

If a match is found, you know that G0 and G1 have four characters in common at that point—but they might have more. "Grow" the match to include all other characters that match in a contiguous segment. In the illustration below, the window is expanded to include five additional matching characters before it reaches non-matching characters.

G0 TYTYYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
G1 AAYTYTYYAIITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT
Growing the match

Step 4: Using the Match to Compute Distance

If there is no match, then the two genes (really, fragments of the top-level genes Ga and Gb) have no significant portions in common, and the distance between them is determined directly by the scoring algorithm described in the simplified algorithm.

Otherwise, we will split each gene into three segments—everything to the left of the match, the match itself, and everything to its right. The left and right segments are possibly empty. We first run the score calculation (as described in the simplified algorithm) on the left side segments of G0 and G1 and add this to our total distance. Second, the match segment is discarded. Third, we run the distance algorithm recursively on the right segment.

IAAYTYAIITAITTYAYTAIYYITITITAYIIAIAIYYAT
AAYIYITATAITTATIYAATIYAYAYIAIYAIITAAAATIAAIIIT

Green segments are scored. The distance algorithm is run recursively on the red segments. Yellow (matching) segments are discarded. The total distance is the sum of the green score and the recursive red score.

The final distance is the sum of all of the scores computed for the non-matching regions. Here, at last, is pseudo-code for the distance algorithm:

int distance(Ga, Gb):   G0, G1 = sort (trim (Ga, Gb))   return distanceRecursive(G0, G1) int distanceRecursive(G0, G1):   G0left, G1left, G0right,   G1right = findMatch (G0, G1)   return simpleScore(G0left, G1left) + distanceRecursive(G0right, G1right)

Note: The trim and sort steps are not included in the recursion.

Implementation

To begin, make a new package "cs2110.netid.assignment2" in your Eclipse project. Copy your classes (MyGene, MySpeciesReader, etc) from the first assignment into this package. 

The first step of this assignment is to implement the gene distance algorithm in your MyGene class:

public int getDistance(Gene other) {   // Compute distance between this and other   ... }

Next, you will write a program that prints out a comparison matrix of genes. Each cell (i,j) in the comparison matrix contains the computed distance between Gi and Gj. Your program must sort the genes first according to the sorting rules, such that G0 is the least gene. The output will look something like this:

G0=I G1=AYI G2=AAYI G3=IIYI G4=AAAYI G5=IIIYI G6=IIIAAYI G7=ITAYIIIAAITAYITAYYI G0 G1 G2 G3 G4 G5 G6 G7 0 50 75 75 100 100 150 450 // G0 50 0 25 75 50 100 100 347 // G1 75 25 0 100 25 125 75 354 // G2 75 75 100 0 125 25 75 322 // G3 100 50 25 125 0 150 100 361 // G4 100 100 125 25 150 0 50 329 // G5 150 100 75 75 100 50 0 300 // G6 450 347 354 322 361 329 300 0 // G7

Do this in a class named Main. You can use the Main.java we provided for the first assignment as a starting point.

Main's main method should do the following:

  1. Read the .dat files specified in arguments array
  2. Collect a single set of all the genes belonging to all the input species. Be sure to eliminate duplicates.
  3. Sort the genes according to the sort order described above.
  4. Compute and print the gene comparison matrix. Format described below.

Note: You should use one or more Java Collection classes to collect and sort genes. Each type of collection has different strengths and weaknesses; read the descriptions of each to learn what they do. Notably, some are good for sorting, and others are good for culling duplicates. See: ArrayList, HashMap, HashSet, TreeMap, TreeSet

These documents refer to "natural sort order", which you can define for your  Gene class by implementing the Comparable interface. You can also impose an external sort order on objects by passing a Comparator instance to the Collections.sort method.

Tip: Decompose your main method into smaller methods that perform specific tasks.

Matrix print format

Your program should first print a header that lists all of the genes in sorted order, along with their names starting with G0. The name and value of the gene must be separated by an equals sign.

Example:

G0=I G1=AYI G2=AAYI G3=IIYI G4=AAAYI G5=IIIYI G6=IIIAAYI G7=ITAYIIIAAITAYITAYYI

Next, print a blank line, followed by a line of labels for the columns of the matrix starting with G0.

G0 G1 G2 G3 G4 G5 G6 G7

Then print the matrix, one row per line. The top row should contain comparisons with G0, such that the upper-left cell is a comparison of G0 with itself. Row elements should be separated by whitespace (spaces and tabs). You don't need to "pretty-print" your matrix into even columns, although doing so may help you test your code. Our parser will ignore any extra whitespace before the first or after the last element in a row. The matrix should contain only integers. Comments at the end (as shown below) are optional.

0 50 75 75 100 100 150 450 // G0 50 0 25 75 50 100 100 347 // G1 75 25 0 100 25 125 75 354 // G2 75 75 100 0 125 25 75 322 // G3 100 50 25 125 0 150 100 361 // G4 100 100 125 25 150 0 50 329 // G5 150 100 75 75 100 50 0 300 // G6 450 347 354 322 361 329 300 0 // G7

Tip: Use System.out.printf to format your output. For example, System.out.printf("%5d ", 137) prints the integer 137 into a right-justified five-character area followed by another space. The String.format method can be used to format strings without printing them.

Matrix Comments

You can leave comments in your output using //, just like Java. Our parser will ignore the // operator and everything following it on the same line. Our examples use comments to label matrix rows; you don't need to do this, but it is useful.

What to Turn in

You will submit a .jar file containing your source code and compiled classes. The name of the .jar file is unimportant.  Export all of the source code in your src directory, including your assignment1 and the classes in cs2110.assignment1. Be sure to check the box to include source files in the export dialog.

As with the first assignment, you will also hand in a README file in .txt or PDF format, containing a few sentences about your implementation of the assignment.

Please follow the code style guidelines.

Can I put other classes in the .jar? What other classes should I have?

You are free to create additional classes as you see fit.  As usual, write comments that describe the reasoning behind your classes and methods.

Tip: Classes can be copied and pasted between packages using the Package Explorer tab in Eclipse.

What should my program do with malformed input?

Your program's behavior under bad input is unspecified. We will not test against bad data.

Sample Results

To help you debug your code, we implemented a solution and ran our own version a few times. First we executed it using a data set where you can check the gene comparison algorithm by hand and make sure it is doing the right thing. For this purpose, we created a file FakeData.dat and included a DNA string with the following eight genes, obtaining the eight-by-eight distances matrix shown below. If you are getting different results something is either wrong with your program, or with ours!

// File(s): FakeData.dat G0=I G1=AYI G2=AAYI G3=IIYI G4=AAAYI G5=IIIYI G6=IIIAAYI G7=ITAYIIIAAITAYITAYYI G0 G1 G2 G3 G4 G5 G6 G7 0 50 75 75 100 100 150 450 // G0 50 0 25 75 50 100 100 347 // G1 75 25 0 100 25 125 75 354 // G2 75 75 100 0 125 25 75 322 // G3 100 50 25 125 0 150 100 361 // G4 100 100 125 25 150 0 50 329 // G5 150 100 75 75 100 50 0 300 // G6 450 347 354 322 361 329 300 0 // G7

Here are some larger examples, using files from SpeciesData.zip.

// File(s): A0.dat G0=ITIIIIITYYYYATTAAAIYTAIITAATIAIYAITYTAYYAIIATAYYIYITIAIYIIITTTAYTTAITYYYAY G1=IYIIAAYAYAATIAAAYAIYTYYIYTAIYTTAIAITIYYAAIATITYAYYYAAAITIITYAIYTTAYAITTYAY G2=ITAYYTYYIATTYYAYYYAIYAITTAIYYIIAITYAIIAAATYTYAYYTIIATAYYYTYTYYYYTYYAYIYIIYTYAYY G3=IAIYIIAIYATIIIIYYYIIYTYTYTIATTIYTAITIIYYTIATAAITIYTAYYYYTTITIYAYIYIYTTAYTITTYATY G4=ITYTYTYAIIITTYAYTAIYYITITYAAYYAYTTAIITIAIIYTIAAIYTYAYYTTAYIATYTYTIIYTAIIIATYYTAIYYTYAYIIAAYIAIAIYYATIY G5=IAAYIYIAATIITTAIATITTYTIAAYAYYTTAAIIIYYIAIAIAAITIATAAAYYYAATTYYIAIAITYAYITYAIYAIITYAAAYTIAAAYAYYAAIATAAIYY G6=IYYAATAATAITTTTAIIYTIIAAYIYITATAITTATIYAATIAIYAIIAAATIAAIIITAAYYTTYATATIYATAYAYIIYATYTIIAAYYYATIYTIYYAYAYYIAAIY G0 G1 G2 G3 G4 G5 G6 0 2162 1935 1458 3116 2935 2279 // G0 2162 0 2081 1853 2978 2389 1851 // G1 1935 2081 0 2121 2038 2378 3482 // G2 1458 1853 2121 0 3316 2410 1999 // G3 3116 2978 2038 3316 0 2381 2416 // G4 2935 2389 2378 2410 2381 0 3400 // G5 2279 1851 3482 1999 2416 3400 0 // G6
// File(s): A1.dat, A2.dat G0=ITIIIIITYYYYAIYYATYTTAAAIYTAIITAATAIIATAYYIYITIAIYIIITTTIYIIIYAYTIIAIIYTAITYYYAY G1=ITIIIIITYYYYATTAAAIYTAIIIAATAYTAATIAIYAITYTAYYAIIATAYYIYITIAIYIIITTTAYTTAITYYYAY G2=IYIIAAYAYAATIAAAYAIYTYYIYIYIITYTAIYTTAIAITIYYAAIATITYAYYYAAAITIITYAIYTTAYAITTYAY G3=IYIIAAYIAAAAYAYAATIAAAYAIYTYYIYTAIYTTAYAAIATITITYAYYYAYYYIYTAAYAAAAIYTTAYAITTYAY G4=ITAYITTAAYYTYYAYYYAIYAITTAIAATIYIYYAIIAAAIAIYIYTYTYAYYTYYTYTYYYYTYYAYIITTYYYYIIYTYAYY G5=ITAYYTYYIATTYYAIYTITYYYYAIYAITTAIYYIIAITYAIIAAATYTYAYYTIIATAYYYTYTYYYYTYYAYIYIIYTYAYY G6=IAIYIIAIYATIIIIYYYIIYTYTYTIATTIYTAITIIYYTIATAAITIYTAYYITTATYYYTTITIYAYIYIYTTAYTITTYATY G7=IAITATYYIYIIIAITIYAIYATIYIIYTYTYTTAITIIYYTIIIIITYATAAITIYITYYIYTAYYYYTTITIYTTAYTITTYATY G8=ITYTYTYAIIITTYAYTAIYYITITYAAYYAYTTAIITIAIIYTIIAITIYAAIYTYAYYTTAYIATYTYTIIYTAIIIATYYTAIYYTYAYIIAAYIAIAIYYATIY G9=ITYTYTYAIITAIYYITYAYTTAIITIAIIYTIAAIYIATTAYTIIITAYYAYYTTAYIATYTYTIIYTAIIIATYYTAIYYTYAYIIIAIYIYAAYIAIAIYYATIY G10=IAAYIYIAATIITTAIATITTYTIAAYAYYTTAAIIIYYIAIAIAAITIATAAAYYYTYTYYAATTYYIAIAITYAYITYAIYAIITYAAAYTIAAAYAYYAAIATAAIYY G11=IIAATIITTAIATITTYTIAAYAYYTTAAIIIYYIAIAAAAYIIYYYYYAIAAITIATAAAYYYAATTYYIAAYITYAIYAIITYAAAYTIAAAYAYYAAITAYIYIATAAIYY G12=IYYAATAATAITTTTAIIYTIIAAYIYITATAITTATIYAATIYAYAYIAIYAIIAAATIAAIIITAAYYTTYATATIYATAYAYIIYATYTIIAAYYYATIYTIYYAYAYYIAAIY G13=IYYAATAITTAYYATAITTTTAIIYTIIAAYIYITATAITTATIYAATIAIYAIIAAATIAAIIITAAYYTTYATATIYATAYAYIIYATYTIIAAYYYATIYTIYYAYAYYIAAIY G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 0 877 1708 1393 2221 1656 1770 1775 2102 2073 3037 3075 2878 2844 // G0 877 0 2275 2017 2131 1208 1768 1950 3382 3278 3176 3251 2460 2426 // G1 1708 2275 0 655 2089 2347 2519 1968 3227 1458 2664 2186 1956 2043 // G2 1393 2017 655 0 1869 2018 1981 2215 2866 2061 3226 2437 2170 2465 // G3 2221 2131 2089 1869 0 2778 1872 1325 2062 1256 3118 3173 2810 3116 // G4 1656 1208 2347 2018 2778 0 2247 2478 2232 2162 2445 2520 3748 3612 // G5 1770 1768 2519 1981 1872 2247 0 1510 3548 2788 2617 2058 2239 2273 // G6 1775 1950 1968 2215 1325 2478 1510 0 2797 2792 2708 4070 2790 2773 // G7 2102 3382 3227 2866 2062 2232 3548 2797 0 2486 2739 2366 2630 2613 // G8 2073 3278 1458 2061 1256 2162 2788 2792 2486 0 2933 2752 2448 2431 // G9 3037 3176 2664 3226 3118 2445 2617 2708 2739 2933 0 2872 3624 3675 // G10 3075 3251 2186 2437 3173 2520 2058 4070 2366 2752 2872 0 3828 3862 // G11 2878 2460 1956 2170 2810 3748 2239 2790 2630 2448 3624 3828 0 300 // G12 2844 2426 2043 2465 3116 3612 2273 2773 2613 2431 3675 3862 300 0 // G13

Extra Credit

Extra credit will be awarded for meaningful JUnit tests.  Be sure to mention them in your README file.  As usual, extra credit points aren't added directly to your assignment score, but are tallied in a separate CMS column and considered at the end of semester when computing your final grade.

Downloads