CS 2110 Homework Assignment 3

Due: Thu 10/7 11:59pm


Description

This assignment builds on your solution to assignment 2, and again you may work in teams of two if you so desire. Both team members must identify themselves as part of a group on CMS, and will jointly hand in one solution. Choose one partner's netID to use when conforming to the package conventions of the course.

We will make a solution to assignment 2 available on CMS to all students who have submitted a solution and received a grade. You may use that as a starting point if your own version of Assignment 2 was hopelessly broken. However, if you do so, you must indicate that you based your code on our solution.

There are three basic objectives for this assignment:

  1. Recursive gene distances: We're going to implement a better gene distance algorithm, but one that was perhaps a bit too fancy for you to do in A2 when you weren't yet familiar with recursion.
  2. Species-to-species distances: We'll use these better genes to compute species-to-species distances. Basically, these are computed as the sum of gene to gene distances, which each gene in animal A is matched against the closest gene in animal B, and then vice versa for B against A.
  3. Graphical user interface: With these animal-to-animal scores, we build a nice graphical interface that shows all the animals and color codes them based which ones are closest neighbors and which are further away, and also displays (for any given animal), it and its closest match off on the right hand side, with a text string labeling the pictures. See the example picture on page 3. If there is a tie for closest match, show the animal with the alphabetically smaller name.

Command-line arguments

Your program will take two arguments (passed to your main method in its args array). The first argument will be either "--console" or "--gui" (note the two dashes in front), which will determine your program's behavior. You can use whichever default you like if neither of these is specified; our testing will always supply this argument. The second argument is the path to a directory containing species .dat and image files.

For both GUI mode and console mode, your program will read all of the .dat files in the directory provided, calculating the all-to-all distance matrix for all genes and species encountered.

We will not test against malformed data. All species .dat files will contain an ImageFilename attribute that gives the species' image filename, along with a Name and DNA. All image files are in the same directory as the .dat files.

Hint: You should only let your SpeciesReader read the .dat files in the specified directory, and not any of the .png files that will be in the same directory! Use the File class to scan the directory, and the endsWith method of the String class to check a file's extension

The behavior of your program for the two different modes of operation is defined below.

Recursive gene distances

You'll start with the gene comparison algorithm from assignment 2. Recall that a key step in that assignment was to take the two genomic strings and to put them in a specific order: sorted by size and then with ties broken alphabetically. After that we had G0 and G1 and ran a matching algorithm that broke each into three parts: a non-matching gene sequence, a matched section 4 or more characters in length, and then the remainders of G0 and G1 respectively. In assignment 2 we employed a "simple distance" computation counting the letters in the non-matching sections on the left, ignored the matching sections, and repeated the comparison procedure on the remainders on the right. But we kept the genes in the same order!

The change we want you to make is to compute the minimum genetic distance where you run this computation first on G0 and G1 but then a second time on G1 and G0 (with the longer string "on top" and the shorter one "underneath"). That is, while we still use a sorting order to number the genes, we don't use a sorting step at all in the distance computation. Instead, we try both ways and just take the smaller of the computed distances.

As was the case for homework A2, the computation itself is repeated. Given our two genes, we trim off any identical prefix and suffix. Now treat the remaining strings as a sequence consisting of codons that match followed by codons that differ, perhaps repeated many times. For each of these sections, we break off and discard the matching portion, compute the distance score on the differing pieces (G0left and G1left), and then repeat the calculation on the remaining portions which come after the section that matches. Again, you need to try this two ways: G0right, G1right and then G1right, G0right taking the minimum computed distance.

Here's pseudo-code for how that might look:

distance(Ga, Gb): G0, G1 = trim (Ga, Gb) return min(distanceRecursive(G0, G1), distanceRecursive(G1, G0)) distanceRecursive(G0, G1) G0left, G1left, G0right, G1right = findMatch (G0, G1) scoreLeft = simpleScore(G0left, G1left) scoreRight = min(distanceRecursive(G0right, G1right), distanceRecursive(G1right, G0right)) return scoreLeft + scoreRight

The computation lends itself to a recursive procedure. With respect to the version you created in A2, you can keep the genetic character counting code and the window-matching code, which won't change, but you'll replace the distance and distanceRecursive methods you coded for A2 with code that looks somewhat similar to the code above. Obviously, the code above isn't real Java. We left a bit of work for you to do!

This recursive procedure will give you different (better) genetic distances than you obtained with the version we used in assignment 2. At the end of this assignment we recomputed the gene-to-gene distance tables for A1-A2 and even in this case you'll see a few changes compared to the ones obtained in assignment 2.

Species-to-species distances

Recall that in previous assignments we numbered genes as G0, G1, G2... based on our rules for sorting genes. You'll do the same with species: Species will be numbered S0, S1, S2... sorted alphabetically by their name ("Armored Snapper", "Asian Boxing Lobster", etc.) such that S0 is the first animal alphabetically, S1 is the next, and so on.

As in previous assignments, each animal has DNA and by now we have a list of the genes that each animal's DNA contained. Thus if we have animal A and animal B, we can make lists of the genes that each animal had in its DNA: perhaps G1, G7 and G8 for A, and G7, G8, G9 and G11 for B.

Our basic scheme will be to first compare the genes in A to the genes in B, and then do the reverse, comparing the genes in B with the genes in A, and then to average the two results.

To compare two species (A and B), we'll maintain a running sum initialized to zero. For each gene in animal A, find the closest matching gene in animal B, and add the distance to the running score (exact matches will have genetic distance zero, and hence won't contribute to the distance score). We'll do this twice: first iterating over the A genes and comparing them against the ones in B, and then iterating over the B genes and comparing these against the genes in A. The computed score will be the average of the two. For example, perhaps A contains genes G1, G2 and G3 and B contains G4 and G5. So first you'll look at G1 relative to G4 and G5 and figure out which it was closest to, perhaps G4. Next you'll look at G2, perhaps this is also closest to G4. Add the score to the score for G1 versus G4. Finally, you'll look at G3. Maybe this is closest to G5; again, add the score. That gives a total score for A versus B; now repeat for B versus A, average the two, and that's our answer.

Note:We want species-to-species distances to be integers, like the gene-to-gene distances. No need to use floating point for these averages (yes, the distances will round down if the total happened to be odd, but that's just fine).

Console Mode

Console mode will operate much like assignment 2. You will print a list of all the genes and a gene-to-gene comparison matrix exactly like in assignment 2, but using the new gene distance computation. Then, you will print an animal-to-animal comparison matrix. First, print a line that describes each species and its genes:

S0=Armored Snapper: Genes [0, 1, 2, 3, 12, 13, 17]

Genes are referred to by their positions in the sorted global pool of genes (17 is G17, etc.).

Then print an animal-to-animal comparison matrix using your species distance calculations.

As in assignment 2, // can be used to leave comments that will be ignored. Blank lines will be ignored, and matrix columns do not need to line up perfectly.

Example output is at the bottom of this specification.

The Graphical User Interface (GUI)

Similar to assignment 2, your program for assignment 3 will have the option of printing a matrix of comparisons. However, we will also add a graphical component. Your main method's first argument will be either "--console" or "--gui". If the "--gui" flag is given, your program will launch into a graphical display mode. Here's how the display will look (note: the data used for this example was deliberately incorrect: the closest relative of Darwin's Tortle is not actually Fuzzy Trible, nor are the other distances real--the coloring is nonsense).

GUI Screenshot

To help you build this fancy GUI, we have provided an empty "shell". You will populate it with images and teach it how to respond to user events.

  1. Download our Swing-based GUI shell, a3.jar and add it to your Eclipse build path. You can do this several ways:
    • Import it into your project just as you did with a1.jar: right-click (control-click on Macs) the src directory, and choose Import. Import an Archive file (under General), then find a3.jar and import it.
    • Alternatively, since this jar contains compiled .class files and you will simply be extending them, you may add it to your build path. Right click on your project, then choose Build Path-->Configure Build Path. Under the Libraries tab, choose to add a JAR (if you placed a3.jar in your 2110 project's directory) or an external JAR (if you didn't). Choose to add a3.jar to the build path. Now you may use and extend classes in the jar file.
    • Whichever method you choose, you should not need to modify these files, and therefore will not need to turn these files back in to us, as we will have them already.
  2. Refer to the API reference.
  3. Create a subclass of cs2110.assignment3.ComparisonGUI
  4. Create a main method that takes two arguments--the first is "--console" or "--gui", and the second is the path to a directory containing species .dat files and images. If the "--gui" option is passed:
    1. Read all of the species .dat files in the given directory. Store Species information in whatever data structure you think is best.
    2. Compute the animal-to-animal distance for all pairs of species, as described above. Store this distance in whatever structure you like (a two-dimensional array is the obvious choice). You should pre-compute all distances because we'll be using them a lot, so computing them on-the-fly would be prohibitive.
    3. Create an instance of your ComparisonGUI and populate it with images using the setCellImage method. Species should be sorted by name alphabetically, such that cell 0 contains the alphabetically first species.
    4. Call the run method to start the GUI.
  5. Override the onSelectCell method in your subclass. This is called by the GUI when the user clicks on a cell in the table. Implement the following behavior for when a cell is clicked:
    1. The cell that is clicked becomes the "selected cell"
    2. Change the color of all cells to reflect their species' distance to the selected cell's species. You may choose your own color scheme, but please explain your color scheme in the comments of your ComparisonGUI class. The example uses red to indicate the selected cell, yellow to indicate closely-related species, and green to indicate more distantly related species. But any color scheme you like will be just fine.
    3. Call setSelectedInfo and setClosestRelativeInfo to set the name and image for the larger displays to the right of the table. These displays show the name and picture of the selected species and its closest relative.

What to turn in

Export a source JAR containing source code and compiled classes for your cs2110.netid.assignment3 package along with your previous assignment packages if you re-used them. Your main method can be in any class you choose, but it is very important that you specify this class as your main class when you export the source JAR. This is done on the third page of the "Export JAR" dialog.

Testing

We plan to test your animal-to-animal distances computation in much the same way that we did for Assignment 2 with the gene-to-gene distances. You should produce a formatted output containing an animal-to-animal distance table like the ones shown below (from our in-house solution). We don't need you to print the genes this time, but you'll probably want to do so to debug your code. Our solution does so; from the example below you can confirm that the program created a single table of genes numbered 0... in "sorted order" with the shortest genes getting the lowest numbers, and with ties broken alphabetically.

Extra credit

Add either a "slider" or "scale" widget, with values ranging from 0 to 5000. If the user slides the bar, for example to d, find all "families" of animals such that for each family F, if Animals A, B are both in F, then distance(A, B) < d. Thus if d=1000, we would find some number of families, perhaps 15, such that each family is a set of animals that are all within distance 1000 of every other member of that same family. Color each family with a distinct color. As one slides the slider, dynamically recompute the families and refresh the display accordingly.

How is this done? In the "real world", you might use something like a k-means clustering algorithm to determine the families, but this is a bit too involved for this course, so we suggest this algorithm: iterate over the list of animals in increasing index order: A0, A1, etc. Figure out if Ai could go into some existing family according to the rule; if so, add it. If an animal could be in two families, put it in the family where its average distance from existing members is smaller. If no existing family is suitable, create a new family for Ai and then move on.

Remarks

What's in a Name?

There are several "names" for each animal in our application. One name is associated with the data file: A0.dat or A17.dat (the name being "A0" or "A17"). A second name is the common-language name in the "Name command", like "Name="Freddy Frog". And yet a third name is the fake scientific latin name, like "Frogus Fredius".

Our assignment A3 makes use of two of these names. The GUI uses the common-language names and asks you to sort animals into alphabetical order by name. The animal-to-animal distance matrix should use the name from the data file: A17 or A6. That matrix has its rows and columns labeled by the Axx name, in increasing order by number (the "xx" part), as seen in the example below.

Why use different names in different situations? This is common in computer science and relates to the fact that humans tend to be happiest with names that make sense in English, like Freddy Frog; scientists like the precision of scientific terminology, and yet computers often find it easier to work with very compact and regular names like A11 or A19. So: different naming styles for different uses!

In A5 this will all come together in our "dendroscope" pictures, which will be labeled using several kinds of information all at once. But of course we won't tackle that challenge for a few more weeks.

Hint for A4: Neighbors aren't ancestors

This paragraph is aimed towards people who read the master assignment page and are already thinking ahead about A4. In a phylogenetic tree, animals that are close together will often be from the same "generation". For example, two modern species of ducks would presumably be more similar to one another than either would be to a velociraptor, even if birds are descendents of that branch of dinosaurs (as some believe to be the case). Here in assignment 3, we're finding closest relatives and hence will tend to "group" animals by generation--the ducks will resemble other birds, the lobsters will resemble other shellfish, etc. As we move on to assignment 4, we'll need to find a different way of computing our animal-to-animal distances in order to highlight ancestor to descendent relationships and to deemphasize "sibling" relationships.

Example Output

// File(s): A0.dat, A1.dat, A2.dat G0=ITIIIIITYYYYATTAAAIYTAIITAATIAIYAITYTAYYAIIATAYYIYITIAIYIIITTTAYTTAITYYYAY G1=IYIIAAYAYAATIAAAYAIYTYYIYTAIYTTAIAITIYYAAIATITYAYYYAAAITIITYAIYTTAYAITTYAY G2=ITAYYTYYIATTYYAYYYAIYAITTAIYYIIAITYAIIAAATYTYAYYTIIATAYYYTYTYYYYTYYAYIYIIYTYAYY G3=IAIYIIAIYATIIIIYYYIIYTYTYTIATTIYTAITIIYYTIATAAITIYTAYYYYTTITIYAYIYIYTTAYTITTYATY G4=ITIIIIITYYYYAIYYATYTTAAAIYTAIITAATAIIATAYYIYITIAIYIIITTTIYIIIYAYTIIAIIYTAITYYYAY G5=ITIIIIITYYYYATTAAAIYTAIIIAATAYTAATIAIYAITYTAYYAIIATAYYIYITIAIYIIITTTAYTTAITYYYAY G6=IYIIAAYAYAATIAAAYAIYTYYIYIYIITYTAIYTTAIAITIYYAAIATITYAYYYAAAITIITYAIYTTAYAITTYAY G7=IYIIAAYIAAAAYAYAATIAAAYAIYTYYIYTAIYTTAYAAIATITITYAYYYAYYYIYTAAYAAAAIYTTAYAITTYAY G8=ITAYITTAAYYTYYAYYYAIYAITTAIAATIYIYYAIIAAAIAIYIYTYTYAYYTYYTYTYYYYTYYAYIITTYYYYIIYTYAYY G9=ITAYYTYYIATTYYAIYTITYYYYAIYAITTAIYYIIAITYAIIAAATYTYAYYTIIATAYYYTYTYYYYTYYAYIYIIYTYAYY G10=IAIYIIAIYATIIIIYYYIIYTYTYTIATTIYTAITIIYYTIATAAITIYTAYYITTATYYYTTITIYAYIYIYTTAYTITTYATY G11=IAITATYYIYIIIAITIYAIYATIYIIYTYTYTTAITIIYYTIIIIITYATAAITIYITYYIYTAYYYYTTITIYTTAYTITTYATY G12=ITYTYTYAIIITTYAYTAIYYITITYAAYYAYTTAIITIAIIYTIAAIYTYAYYTTAYIATYTYTIIYTAIIIATYYTAIYYTYAYIIAAYIAIAIYYATIY G13=IAAYIYIAATIITTAIATITTYTIAAYAYYTTAAIIIYYIAIAIAAITIATAAAYYYAATTYYIAIAITYAYITYAIYAIITYAAAYTIAAAYAYYAAIATAAIYY G14=ITYTYTYAIIITTYAYTAIYYITITYAAYYAYTTAIITIAIIYTIIAITIYAAIYTYAYYTTAYIATYTYTIIYTAIIIATYYTAIYYTYAYIIAAYIAIAIYYATIY G15=ITYTYTYAIITAIYYITYAYTTAIITIAIIYTIAAIYIATTAYTIIITAYYAYYTTAYIATYTYTIIYTAIIIATYYTAIYYTYAYIIIAIYIYAAYIAIAIYYATIY G16=IAAYIYIAATIITTAIATITTYTIAAYAYYTTAAIIIYYIAIAIAAITIATAAAYYYTYTYYAATTYYIAIAITYAYITYAIYAIITYAAAYTIAAAYAYYAAIATAAIYY G17=IYYAATAATAITTTTAIIYTIIAAYIYITATAITTATIYAATIAIYAIIAAATIAAIIITAAYYTTYATATIYATAYAYIIYATYTIIAAYYYATIYTIYYAYAYYIAAIY G18=IIAATIITTAIATITTYTIAAYAYYTTAAIIIYYIAIAAAAYIIYYYYYAIAAITIATAAAYYYAATTYYIAAYITYAIYAIITYAAAYTIAAAYAYYAAITAYIYIATAAIYY G19=IYYAATAATAITTTTAIIYTIIAAYIYITATAITTATIYAATIYAYAYIAIYAIIAAATIAAIIITAAYYTTYATATIYATAYAYIIYATYTIIAAYYYATIYTIYYAYAYYIAAIY G20=IYYAATAITTAYYATAITTTTAIIYTIIAAYIYITATAITTATIYAATIAIYAIIAAATIAAIIITAAYYTTYATATIYATAYAYIIYATYTIIAAYYYATIYTIYYAYAYYIAAIY G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20 0 1019 1550 1074 727 150 1052 1032 1430 1738 1121 1544 1267 2083 1315 2030 1975 1959 2072 2092 2058 // G0 1019 0 1134 903 1445 1229 150 505 1113 1294 951 1068 2172 1733 2220 1307 1824 1851 1869 1914 2001 // G1 1550 1134 0 2121 1255 1020 1233 1366 856 150 2476 1499 2038 1844 2188 2118 1761 1580 1637 1628 1616 // G2 1074 903 2121 0 1450 1448 879 1066 1307 2169 150 1006 2159 1979 2224 1647 2053 1765 1683 1813 1847 // G3 727 1445 1255 1450 0 877 1537 1225 1638 1443 1770 1471 1340 2661 1388 1538 2087 2212 2549 2345 2311 // G4 150 1229 1020 1448 877 0 1262 1263 1070 1208 1614 1609 1615 2131 1663 2146 2074 1947 1772 2080 2046 // G5 1052 150 1233 879 1537 1262 0 655 1470 1751 927 1151 2271 2131 2319 1458 2256 1893 2081 1956 2043 // G6 1032 505 1366 1066 1225 1263 655 0 1506 1203 1181 1170 2716 1508 2866 1384 1563 2062 1695 1732 2212 // G7 1430 1113 856 1307 1638 1070 1470 1506 0 1314 1872 1325 1719 1740 1835 1256 1780 1529 2078 2315 1527 // G8 1738 1294 150 2169 1443 1208 1751 1203 1314 0 2247 1581 2082 1926 2232 2162 1843 1662 1719 1710 1698 // G9 1121 951 2476 150 1770 1614 927 1181 1872 2247 0 1156 2084 2027 2149 1788 2101 1829 1731 1877 1911 // G10 1544 1068 1499 1006 1471 1609 1151 1170 1325 1581 1156 0 1836 1871 1887 1983 1911 1801 2091 1819 1934 // G11 1267 2172 2038 2159 1340 1615 2271 2716 1719 2082 2084 1836 0 2381 150 965 2827 2179 1984 2223 1605 // G12 2083 1733 1844 1979 2661 2131 2131 1508 1740 1926 2027 1871 2381 0 2248 2443 125 2372 700 2450 2522 // G13 1315 2220 2188 2224 1388 1663 2319 2866 1835 2232 2149 1887 150 2248 0 1115 2738 2339 2049 2338 1653 // G14 2030 1307 2118 1647 1538 2146 1458 1384 1256 2162 1788 1983 965 2443 1115 0 2933 2010 1868 2289 1648 // G15 1975 1824 1761 2053 2087 2074 2256 1563 1780 1843 2101 1911 2827 125 2738 2933 0 2240 825 2339 2390 // G16 1959 1851 1580 1765 2212 1947 1893 2062 1529 1662 1829 1801 2179 2372 2339 2010 2240 0 3054 150 150 // G17 2072 1869 1637 1683 2549 1772 2081 1695 2078 1719 1731 2091 1984 700 2049 1868 825 3054 0 2732 2769 // G18 2092 1914 1628 1813 2345 2080 1956 1732 2315 1710 1877 1819 2223 2450 2338 2289 2339 150 2732 0 300 // G19 2058 2001 1616 1847 2311 2046 2043 2212 1527 1698 1911 1934 1605 2522 1653 1648 2390 150 2769 300 0 // G20 S0=Armored Snapper: Genes [0, 1, 2, 3, 12, 13, 17] S1=Asian Boxing Lobster: Genes [5, 6, 9, 10, 14, 16, 19] S2=Ballards Hooting Crane: Genes [4, 7, 8, 11, 15, 18, 20] S0 S1 S2 0 1025 4909 // S0 1025 0 6062 // S1 4909 6062 0 // S2

Downloads