CS 2110 Homework Assignment 3.  Due: Fri 10/9 11:59pm

Frequently Asked Questions

New! The test suite has been released.

This assignment builds on your solution to Assignment 2, and again, you have permission to work in teams of two if you so desire.  Both team members must identify themselves as part of a team and should hand in the identical solution.

We will make Ken’s solution to Homework 2 available the morning after Assignment 2 was due.  You may use that as a starting point if your own version of Assignment 1 was hopelessly broken.  However, if you do so, you must indicate that you based your code on Ken’s solution.

Here are the basic objectives for the assignment: 

1.       We’re going to implement a better gene-distance algorithm, but one that was perhaps a bit too fancy for you to do in A2 when you weren’t yet familiar with recursion.

2.       We’ll use these better genes to compute animal-to-animal distances.  Basically these are computed as the sum of gene to gene distances, which each gene in animal A is matched against the closest gene in animal B, and then vice versa for B against A.

3.       With these animal-to-animal scores, we build a nice graphical interface that shows all the animals and color codes them based which ones are closest neighbors and which are further away, and also displays (for any given animal), it and its closest match off on the right hand side, with a text string labeling the pictures.  See the example picture on page 3.   If there is a tie for closest match, show the animal with the alphabetically smaller name.

Here’s how we’ll build it.  We start with Assignment 2 but using slightly different animal picture and data files (in a PNG format that uses transparent backgrounds; the animal data files are actually the same as from A2, but they use the new animal picture names) and then:

1.       We’ll enhance the gene-to-gene distance algorithm using a slightly fancier form of recursion to get more accurate scoring.

2.       Using our more accurate gene distances, we’ll compute an array of animal-to-animal distances.

3.       Your program will take two arguments: The first string passed to your main method will be either “console” or “gui”, and this will determine your program’s behavior.  You can use whichever default value you like; our testing will always supply this argument.   The second string will be the path to a directory containing species .dat and image files.

a.       In console mode, your application should print a table of animal-to-animal distances in the format we describe below.

b.      In GUI mode you should display a table of animal images.  When the user clicks on one of the images, the program will highlight the image the user clicked by re-coloring the background of that picture, and then will show how far each of the other animals was using other colors on a spectrum – from green (closest) to yellow (furthest) in our example later in this assignment. 

c.       Still for the GUI case, your program should also display the picture of the animal and the picture of its closest relative to the right of the table of animal pictures.

Note: in A2 we also provided file names as arguments.  In A3 we want you to search the directory for .dat files and load all the ones you find.  See below for details.

Here are the rules for the improved genetic comparison.  You’ll start with the gene comparison algorithm from assignment 2, which we won’t reproduce here.  Recall that a key step in that assignment was to take the two genomic strings and to put them in a specific order: sorted by size and then with ties broken alphabetically.  After that we had g0 and g1 (notation from the assignment 2 handout) and ran a matching algorithm that broke each into three parts: a non-matching  gene sequence (call these a and a’), a matched section 4 or more characters in length (call it b) and then the remainders of g0 and g1 respectively (call these d and d’).  In assignment 2 we added in the difference computation for (a, a’), ignored (b, b’) and repeated the comparison procedure with g0=d and g1=d’.  But we didn’t resort the genes.  This is why we ended up with a method called Distance (in which we sorted the genes) and a second recursive one called Distance’ (in which we did the comparison).

The change we want you to make is to compute the minimum genetic distance where you run this computation first on g0,g1  but then a second time on g1,g0, namely with the longer string “on top” and the shorter one “underneath”.   That is, we still use a sorting order to number the genes, but we don’t use a sorting step, at all, in the distance computation.  Instead, we try both ways and just take the smaller of the computed distances.

As was the case for homework A2, the computation itself is repeated.  Given our two genes, we trim off any identical suffix.  Now treat the remaining strings as a sequence consisting of codons that match followed by codons that differ,  perhaps repeated many times.  For each of these sections, we break off and discard the matching portion, then break off the chunk that “differs”.  We compute the distance score on the differing pieces (which may be of differing lengths), and then repeat the calculation on the remaining portions, which start with a section that matches.  Again, you need to try this two ways: g0’,g1’ and then g1’,g0’, taking the minimum computed distance.  Finally, note that in the discussion above, any of these strings might be empty – we can run out of characters in g0, g1 or both in the process of doing this computation.

Here’s pseudo-code for how that might look:

Distance(String g0, String g1)
{
            g0, g1 = TrimIdenticalSuffix(g0, g1);                       // Remove matching suffix,  if any
            return min(Distance’(g0,g1), Distance’(g1,g0));
}

Distance’(String g0, String g1)
{
            g0, g1 = TrimIdenticalPrefix(g0, g1);                       // Remove matching prefix,  if any
            break g0, g1 into a differing portion, and two remainders that start with a match
            dist = GeneticDistance(g0DifferingPortion, g1DifferingPortion);
            return dist +
                       min(Distance’(g0Remainder, g1Remainder), Distance’(g1Remainder,g0Remainder));
}

The computation lends itself to a recursive procedure.  With respect to the version you created in A2, you can keep the genetic  character counting code, which won’t change, but you’ll replace the Distance and Distance’ methods you coded for A2 with code that looks somewhat similar to the code above.  Obviously, the code above isn’t real Java.  We left a bit of work for you to do!  

This recursive procedure will give you different (better) genetic distances than you obtained with the version we used in homework 2.   At the end of this assignment we recomputed the gene-to-gene distance tables for A1-A2 and even in this case you’ll see a few changes compared to the ones obtained in Assignment 2.  We won’t be testing your gene-to-gene distance tables this time, but the Animal-to-Animal distances depend on the gene-to-gene distances, so we’ll have an indirect way to see if you got them right!

This improved comparison scheme works best with slightly changed values of the gene distance constants.  Here are the values we want you to use:

public class Constants {
         public static final int SCOST = 5;    // Gene: Cost for same characters in different order

         public static final int DCOST = 10;   // Gene: Cost for different counts but of the same characters

         public static final int DDCOST = 20;  // Gene: Cost for completely different characters

public static final int MIN_MATCH = 4;// Sliding window minimum match length
}

So this leads to the Animal-to-Animal distance computation.  As explained in the homework overview, each animal has DNA and by now we have a list of the genes that each animal’s DNA contained.  Thus if we have animal A and animal B, we can make lists of the genes that each animal had in its DNA: perhaps G1, G7 and G8 for A, and G7, G8, G9 and G11 for B.  

Our basic scheme will be to first compare the genes in A to the genes in B, and then do the reverse, comparing the genes in B with the genes in A, and then to average the two results.  In our test data sets, most animals have approximately 7 genes, but you should not assume that this is the case.  Still, it may help in debugging or thinking about the solution to have those numbers in mind.

To compare two sets of genes (A to B), we’ll maintain a running sum, score, initialized to zero.  For each gene in animal A, find the closest matching gene in animal B, and add the distance to the running score (exact matches will have genetic distance zero, and hence won’t count).   We’ll do this twice: first iterating over the A genes and comparing them against the ones in B, and then iterating over the B genes and comparing these against the genes in A.  The computed score will be the average of the two.  For example, perhaps A contains genes G1, G2 and G3 and B contains G4 and G5.  So first you’ll look at G1 relative to G4 and G5 and figure out which it was closest to, perhaps G4.   Next you’ll look at G2, perhaps this is also closest to G4.  Add the score to the score for G1 versus G4.  Finally, you’ll look at G3.  Maybe this is closest to G5; again, add the score.  That gives a total score for A versus B; now repeat for B versus A, average the two, and that’s our answer.

Hint: We want animal-to-animal distances to be integers, like the gene-to-gene distances.  No need to use floating point for these averages (yes, the distances will “round down”, but that’s just fine).

The Graphical User Interface (GUI)

Similar to assignment 2, your program for assignment 3 will have the option of printing a matrix of comparisons.  However, we will also add a graphical component.  Your main method’s first argument will be either “console” or “gui”.  If the “gui” flag is given, your program will launch into a graphical display mode.  Here are the basic things you’ll do to create the clickable display and the box where you’ll show the animal-to-animal distance information.  Here’s how the display will look (note: the data used for this example was deliberately incorrect:  the closest relative of the striped salamander is actually not Darwin’s Tortle, nor are the other distances real.  Thus the coloring is nonsense).

As noted earlier, to get transparent backgrounds, you will need to use PNG files instead of the BMP files we supplied for A1.  The link is here in case you missed it above.

gui_beta_example.jpg

To help you build this fancy GUI, we have provided an empty “shell”.  You will populate it with images and teach it how to respond to user events. 

1.       Download our Swing-based GUI shell,  cs2110_a3_swing.jar and add it to your Eclipse build path.  The API reference is here: cs2110.assignment3.swing.  Note: An earlier version of this assignment used SWT rather than Swing.  You can still use the SWT version (API reference) if you prefer. To switch to the Swing version, simply change your imports from cs2110.assignment3 to cs2110.assignment3.swing.

2.       Create a subclass of cs2110.assignment3.swing.ComparisonGUI

3.       The subclass should have a main method that takes two arguments---the first is “console” or “gui”, and the second is the path to a directory containing species .dat files and images.  The main method should behave as follows when the “gui” flag is given:

a.       Read all of the species .dat files in the given directory.  Store species information in whatever data structure you think is best; you may want to create a Species class for this purpose.   

                                   i.      Note: We will not test against malformed data.  All species .dat files will contain an Image command that gives the species’ image filename, along with a Name and DNA.  All image files are in the same directory as the .dat files.

                             ii.       Hint: Use the File class to scan the directory, and the endsWith method of the String class to check a file’s extension

b.      Compute the animal-to-animal distance for all pairs of species, as described above.  Store this distance in whatever structure you like (a two-dimensional array is the obvious choice).  You should pre-compute all distances because we’ll be using them a lot, so computing them on-the-fly would be prohibitive.

c.       Create an instance of your ComparisonGUI and populate it with images using the setCellImage method.  Species should be sorted by name alphabetically, such that cell 0 contains the alphabetically first species.

d.      Call the run method to start the GUI!

4.       Override the onSelectCell method in your subclass.  This is called by the GUI when the user clicks on a cell in the table.  Implement the following behavior for when a cell is clicked:

a.       The cell that is clicked becomes the “selected cell”

b.      Change the color of all cells to reflect their species’ distance to the selected cell’s species.  You may choose your own color scheme, but please explain your color scheme in the comments of your ComparisonGUI class.  The example uses red to indicate the selected cell, yellow to indicate closely-related species, and green to indicate more distantly related species.  But any color scheme you like will be just fine.

c.       Call setSelectedInfo and setClosestRelativeInfo to set the name and image for the larger displays to the right of the table.  These displays show the name and picture of the selected species and its closest relative.

What’s in a Name?

There are several “names” for each animal in our application.  One name is associated with the data file: A0.dat or A17.dat (the name being “A0” or “A17”).  A second name is the common-language name in the “Name command”, like “Name=”Freddy Frog”.  And yet a third name is the fake scientific latin name, like “Frogus Fredius”.  


Our assignment A3 makes use of two of these names.  The GUI uses the common-language names and asks you to sort animals into alphabetical order by name.  The animal-to-animal distance matrix should use the name from the data file: A17 or A6.  That matrix has its rows and columns labeled by the Axx name, in increasing order by number (the “xx” part),  as seen in the example below.


Why use different names in different situations?  This is common in computer science and relates to the fact that humans tend to be happiest with names that make sense in English, like Freddy Frog; scientists like the precision of scientific terminology, and yet computers often find it easier to work with very compact and regular names like A11 or A19.  So:  different naming styles for different uses!

 

In A5 this will all come together in our “dendroscope” pictures, which will be labeled using several kinds of information all at once.  But of course we won’t tackle that challenge for a few more weeks.

 

Hint for A4: Neighbors aren’t ancestors

This paragraph is aimed towards people who read the master assignment page and are already thinking ahead about A4.  In  a phylogenetic tree, animals that are close together will often be from the same “generation”.  For example, two modern species of ducks would presumably be more similar to one-another than either would be to a velociraptor, even if birds are descendents of that branch of dinosaurs (as some believe to be the case).  Here in assignment 3, we’re finding closest relatives and hence will tend to “group” animals by generation – the ducks will resemble other birds, the lobsters will resemble other shellfish, etc.  As we move on to assignment 4, we’ll need to find a different way of computing our animal-to-animal distances in order to highlight ancestor to descendent relationships and to deemphasize “sibling” relationships. 

 

What to turn in

Before submitting your assignment, run the test suite. This ensures we will be able to parse your program’s output for grading.

Export a source JAR containing source code and compiled classes for your netid.assignment3 package along with your previous assignment packages if you re-used them.  Your main method can be in any class you choose, but it is very important that you specify this class as your main class when you export the source JAR.  This is done on the third page of the “Export JAR” dialog.   We will release a test suite before assignment submission is enabled on CMS… you should run the test suite prior to turn-in to make sure your output formatting is correct.

What  we’ll test
We plan to test your animal-to-animal distances computation in much the same way that we did for Assignment 2 with the gene-to-gene distances.  You should produce a formatted output containing an animal-to-animal distance table like the ones shown below (from our in-house solution).  We don’t need you to print the genes this time, but you’ll probably want to do so to debug your code.  Our solution does so; from the example below you can confirm that the program created a single table of genes numbered 0… in “sorted order” with the shortest genes getting the lowest numbers, and with ties broken alphabetically. 

Notes:  Our animal distances table, below, has the distances “left justified” but you can right-justify them if you prefer--- it isn’t strictly necessary that your columns line up, but it will help you check your output against ours.  Also:  please use our animal numbering:  If you renumber the animals, our test program will fail.  So: our  A17.dat defines animal A17, etc.

From this table you can see that the Armored Snapper and the Asian Boxing Lobster are fairly closely related.  Ballards Hooting Crane is clearly a more distant relative; it is slightly closer to the Armored Snapper than to the Asian Boxing Lobster, but the match isn’t very good in either case.

 

Extra Credit (Extra credit only is applied up to the maximum possible for A3.)

 

Add either a "slider" or “scale” widget, with values ranging from 0 to 5000.  If the user slides the bar, for example to d, find all "families" of animals such that " Family F, Animal A,B Î F, AnimalDist(A,B) < d.  Thus if d=1000, we would find some number of families, perhaps 15, such that each family is a set of animals that are all within distance d of every other member of that same family.  Color each family with a distinct color.  As one slides the slider, dynamically recompute the families and refresh the display accordingly.

 

Hint: To create families, iterate over the list of animals in increasing index order: A0, A1, etc.  Figure out if Ai could go into some existing family according to the rule; if so, add it.  If an animal could be in two families, put it in the family where its average distance from existing members is smaller.  If no existing family is suitable, create a new family for Ai and then move on. 

 

Console Mode Output Format

 

An example of the console mode output format is below.  The first part consists of a gene-to-gene comparison matrix exactly like in assignment 2, but using the new gene distance computation.  The second part is an animal-to-animal comparison matrix.  First, print a line that describes each animal and its genes, e.g.,

 

A0=Armored_Snapper: Genes [0 1 2 7 12 13 16]

 

Genes are referred to by their positions in the sorted global pool of genes (7 is G7, etc.).  Alternate formats  A0: Armored_Snapper {Genes 0 1 2 7 12 13 16} and  A0: Armored_Snapper [Genes 0 1 2 7 12 13 16] are also acceptable.

 

Then print an animal-to-animal comparison matrix using your species distance calculations. 

 

As in assignment 2, // and  | can be used to leave comments that will be ignored.  Blank lines will be ignored, and matrix columns do not need to line up perfectly.

 


//Input files: A0.dat A1.dat A2.dat

G0=ACAAOOCOCOOIAOOOCOACICCACIOACIIOAOAIACCOOAOIAICOCCCOOOAIAAICOACIIOCOAIICOC

G1=AIAAAAAICCCCOIIOOOACIOAAIOOIAOACOAICIOCCOAAOIOCCACAIAOACAAAIIIOCIIOAICCCOC

G2=AIOCCICCAOIICCOCCCOACOAIIOACCAAOAICOAAOOOICICOCCIAAOIOCCCICICCCCICCOCACAACICOCC

G3=ACAAOOCAOOOOCOCOOIAOOOCOACICCACIOACIIOCOOAOIAIAICOCCCOCCCACIOOCOOOOACIIOCOAIICOC

G4=ACAAOOCOCOOIAOOOCOACICCACACAAICIOACIIOAOAIACCOOAOIAICOCCCOOOAIAAICOACIIOCOAIICOC

G5=AIAAAAAICCCCOACCOICIIOOOACIOAAIOOIOAAOIOCCACAIAOACAAAIIIACAAACOCIAAOAACIOAICCCOC

G6=AIAAAAAICCCCOIIOOOACIOAAAOOIOCIOOIAOACOAICIOCCOAAOIOCCACAIAOACAAAIIIOCIIOAICCCOC

G7=AOACAAOACOIAAAACCCAACICICIAOIIACIOAIAACCIAOIOOAIACIOCCCCIIAIACOCACACIIOCIAIICOIC

G8=AIOCAIIOOCCICCOCCCOACOAIIOAOOIACACCOAAOOOAOACACICICOCCICCICICCCCICCOCAAIICCCCAACICOCC

G9=AIOCCICCAOIICCOACIAICCCCOACOAIIOACCAAOAICOAAOOOICICOCCIAAOIOCCCICICCCCICCOCACAACICOCC

G10=AOACAAOACOIAAAACCCAACICICIAOIIACIOAIAACCIAOIOOAIACIOCCAIIOICCCIIAIACOCACACIIOCIAIICOIC

G11=AOAIOICCACAAAOAIACOACOIACAACICICIIOAIAACCIAAAAAICOIOOAIACAICCACIOCCCCIIAIACIIOCIAIICOIC

G12=AICICICOAAAIICOCIOACCAIAICOOCCOCIIOAAIAOAACIAOOACICOCCIIOCAOICICIAACIOAAAOICCIOACCICOCAAOOCAOAOACCOIAC

G13=AOOCACAOOIAAIIOAOIAIICIAOOCOCCIIOOAAACCAOAOAOOAIAOIOOOCCCOOIICCAOAOAICOCAICOACOAAICOOOCIAOOOCOCCOOAOIOOACC

G14=AICICICOAAAIICOCIOACCAIAICOOCCOCIIOAAIAOAACIAAOAIACOOACICOCCIIOCAOICICIAACIOAAAOICCIOACCICOCAAOOCAOAOACCOIAC

G15=AICICICOAAIOACCAICOCIIOAAIAOAACIAOOACAOIIOCIAAAIOCCOCCIIOCAOICICIAACIOAAAOICCIOACCICOCAAAOACACOOCAOAOACCOIAC

G16=ACCOOIOOIOAIIIIOAACIAAOOCACAIOIOAIIOIACOOIAOACOAAOOOIAOOAAAIOOCCIICOIOIACOIOCOCAACOICIAAOOCCCOIACIACCOCOCCAOOAC

G17=AOOCACAOOIAAIIOAOIAIICIAOOCOCCIIOOAAACCAOAOAOOAIAOIOOOCCCICICCOOIICCAOAOAICOCAICOACOAAICOOOCIAOOOCOCCOOAOIOOACC

G18=AAOOIAAIIOAOIAIICIAOOCOCCIIOOAAACCAOAOOOOCAACCCCCOAOOAIAOIOOOCCCOOIICCAOOCAICOACOAAICOOOCIAOOOCOCCOOAIOCACAOIOOACC

G19=ACCOOIOAIIOCCOIOAIIIIOAACIAAOOCACAIOIOAIIOIACOOIAOACOAAOOOIAOOAAAIOOCCIICOIOIACOIOCOCAACOICIAAOOCCCOIACIACCOCOCCAOOAC

G20=ACCOOIOOIOAIIIIOAACIAAOOCACAIOIOAIIOIACOOIACOCOCAOACOAAOOOIAOOAAAIOOCCIICOIOIACOIOCOCAACOICIAAOOCCCOIACIACCOCOCCAOOAC

 

G0     G1     G2     G3     G4     G5     G6     G7     G8     G9    G10    G11    G12    G13    G14    G15    G16    G17    G18    G19    G20   

0      785    830    405    120    1005   970    730    810    930    790    830    1945   1380   2005   925    1450   1360   1485   1570   1515   

785    0      1365   810    840    575    120    725    1105   1430   770    1375   1030   1785   1090   1770   1580   1820   1720   1670   1690   

830    1365   0      1055   1045   1155   765    1775   650    120    2085   1150   1585   1445   1705   1635   1170   1410   1430   1270   1230   

405    810    1055   0      525    865    1065   825    1470   890    910    855    2340   1195   2460   1320   1525   1250   1530   1645   1210   

120    840    1045   525    0      990    930    715    1120   1505   775    940    2035   1670   2095   1330   1420   1770   1605   1540   1485   

1005   575    1155   865    990    0      695    1325   1265   1280   1545   1260   1270   2100   1330   1330   1795   1760   2090   1885   1905   

970    120    765    1065   930    695    0      1150   855    890    1205   1445   1340   1845   1400   1870   1475   1910   1580   1565   1585   

730    725    1775   825    715    1325   1150   0      1105   1835   120    805    1820   1845   1890   1545   1415   1915   1695   1495   1475   

810    1105   650    1470   1120   1265   855    1105   0      1060   1545   1195   1455   1605   1555   1130   1175   1655   1775   1160   1925   

930    1430   120    890    1505   1280   890    1835   1060   0      1905   1230   1585   1525   1705   1635   1250   1490   1510   1350   1310   

790    770    2085   910    775    1545   1205   120    1545   1905   0      925    1750   1905   1820   1675   1470   1975   1450   1550   1530   

830    1375   1150   855    940    1260   1445   805    1195   1230   925    0      1475   1480   1580   1595   1400   1530   1700   1510   1400   

1945   1030   1585   2340   2035   1270   1340   1820   1455   1585   1750   1475   0      2045   120    765    1765   2375   1850   1445   1855   

1380   1785   1445   1195   1670   2100   1845   1845   1605   1525   1905   1480   2045   0      1965   2095   1970   100    560    2090   1955   

2005   1090   1705   2460   2095   1330   1400   1890   1555   1705   1820   1580   120    1965   0      885    1920   2295   1830   1505   2010   

925    1770   1635   1320   1330   1330   1870   1545   1130   1635   1675   1595   765    2095   885    0      1510   2425   1580   1430   1910   

1450   1580   1170   1525   1420   1795   1475   1415   1175   1250   1470   1400   1765   1970   1920   1510   0      1915   2385   120    120    

1360   1820   1410   1250   1770   1760   1910   1915   1655   1490   1975   1530   2375   100    2295   2425   1915   0      660    2035   2005   

1485   1720   1430   1530   1605   2090   1580   1695   1775   1510   1450   1700   1850   560    1830   1580   2385   660    0      2495   2140   

1570   1670   1270   1645   1540   1885   1565   1495   1160   1350   1550   1510   1445   2090   1505   1430   120    2035   2495   0      240    

1515   1690   1230   1210   1485   1905   1585   1475   1925   1310   1530   1400   1855   1955   2010   1910   120    2005   2140   240    0 

 

A0=Armored_Snapper: Genes [0 1 2 7 12 13 16]

A1=Asian_Boxing_Lobster: Genes [4 6 9 10 14 17 20]

A2=Ballards_Hooting_Crane: Genes [3 5 8 11 15 18 19]

A0      A1      A2

0       820     3880  // A0

820     0       4795  // A1

3880    4795    0     // A2