CS 2110 Homework Assignment 4

Due: Thu 11/4 11:59pm

Description

This assignment builds on your solution to assignment 3, and again you may work in teams of two if you so desire. Both team members must identify themselves as part of a group on CMS, and will jointly hand in one solution. Choose one partner's netID to use when conforming to the package conventions of the course.

We will make a solution to assignment 3 available on CMS to all students who have submitted a solution and received a grade. You may use that as a starting point if your own version of assignment 3 was hopelessly broken. However, if you do so, you must indicate that you based your code on our solution.

Objectives for this assignment:

Sibling species: We will modify the animal-to-animal distance formula so that "siblings" will be pushed far part. With this change, ancestor-descendant relationships will now be the ones with the smallest distances in our table.
Phylogenetic trees: We will construct a phylogenetic tree of all the species in the CS 2110 bestiary. We'll first designate a root node that represents a common ancestor of all the other animals. Then we'll add nodes, representing descendent animals, one by one.

Species Distance Algorithm Improvements

Siblings vs. Descendants

Recall from A3 that the animal closest to some selected animal will often be its evolutionary brother or sister. For example, you might imagine some ancestral kind of horse, and two modern-day descendents: a Budweiser Clydesdale and a Tapir. The Tapir is closely related to a horse, and hence the animal distance between Clydesdale and the Tapir might be smaller than the distance between either animal and the ancestral species from which it evolved!

To build our evolutionary tree, we need to push that Clydesdale away from the Tapir. It turns out that these sibling relationships always involve at least one exactly shared gene. For example, the Tapir and the Clydesdale might both have identical copies of gene G17. In contrast, over evolutionary periods of time between generations of species, no gene remains totally unchanged. As a result, even if a modern animal is quite similar to its ancestor, there will always be some changes to every gene it inherited.

Modifying the Species Class

Modify your animal-to-animal distance computation so that if two animals share a gene, the distance between the two is infinite (or, as computer scientists define infinite, a number so large that any two animals which are not siblings will definitely have a smaller score, in our case 100,000... a number small enough to print without using too much space, but big enough to exceed any possible animal-to-animal distance).

Note: Recall from assignment 3 that a species had 0 distance against itself. Yet, a species is certainly its own sibling, so under these new rules it's unclear whether a species should have 0 or 100,000 distance from itself. You may resolve this conflict either way: A species may have 0 or 100,000 distance from itself. However, a species should definitely be its own sibling, as a sibling certainly shares a gene with itself.

To help in this process, we are adding a new method to the Species class, isSibling(Species other). This method will check to see if two animals are indeed siblings, returning true or false.

A Phylogenetic Tree

Our goal is to construct an evolutionary tree. First, your program will read in the animal data just as was done in A3. The user will specify which animal should be used as the root of the tree. With this information, you can build your evolutionary tree. Each node will contain an animal as well as children which are its evolutionary descendents. Note that unlike a binary tree, in this more general tree, there could be more than two children for a given node.

The constructor for our PhylogenyTree class will take two arguments, a set of all the animals (including the root animal), and then the root animal itself. After the constructor runs, you should have the root node of a fully generated phylogenetic tree. We'll be explaining how to build this tree in the following sections of this specification. Also, you should not modify the set of animals passed in to the constructor.

In addition, our class will have three more methods, which return the animal stored in this node, the parent node, and a list of child nodes. Finally, there will be one more method, find(), which takes an animal as input. It searches the node and all its descendants and returns the node containing that animal, or null if no such node is found.

Constructing the Phylogenetic Tree

Here is a high-level explanation of the basic algorithm we will use to build the tree.

We'll start with the root animal as the only animal in the tree, then add descendant animals one by one until we've found the complete evolutionary tree. There are many possible evolutionary trees we could generate this way, but we want to get the best one possible, so each step of the way, we'll have to add the animal that is closest to an animal already in the tree. In other words, if we went through each animal not yet in the tree and examined the distance from that animal to each of the animals already in the tree, we would select the pair of animals that minimizes the distance, over all the cases examined. But if we did this with a brute-force search every time, we'd be performing a lot of repeated distance computations, so we want to be slightly smarter about this process. To go into detail about how we do that, we'll first need to discuss two new concepts: Prim's Algorithm and Priority Queues.

Prim's Algorithm, and Relation to Graphs

You will soon be learning about graphs in this class. This problem can be viewed as a graph, where the nodes or vertices are the animals, and weighted edges exist between any two animals which are not siblings. The weight of the edge is the distance between the two animals.

Essentially, this problem is constructing a minimum spanning tree for this graph. This is the tree with the smallest total weight on its edges that connects every node in the graph. Prim's algorithm is a greedy algorithm that efficiently finds this tree. The algorithm starts with one node already in the tree (the root, in our case), and slowly adds other nodes and edges to the tree. In each step, it chooses the cheapest edge that connects to a node not yet connected to the tree. More technically, it chooses the cheapest edge e connecting vertices u and v such that u is in the tree and v is not. It adds that edge to the tree, and then repeats the process until all the nodes are in the tree.

You'll see that the algorithm described in the high-level description actually corresponds exactly with Prim's algorithm.

Priority Queues

We'll need to employ Java's PriorityQueue to make full use of Prim's algorithm. This is a special queue where elements can have a priority associated with them, and after adding various elements to the queue, the first element that comes out is the one with the highest priority. As you can guess, to find our evolutionary tree, we'll define the edges in the tree with a small distance to have a higher priority. Just like with TreeSets, we do this by passing a Comparator to the PriorityQueue's constructor, or making the elements in the priority queue implement the Comparable interface. The elements that are smaller according to the Comparator or the Comparable interface will come out first when taking elements from the PriorityQueue (via methods such as peek, remove, or poll).

Putting it all together

We'll have a PriorityQueue of Edges that could possibly be added to the tree. Initially, this will just be all edges from the root animal to every other animal. When we want to add a new animal, we won't have to search every single possible pair of animals; we'll simply get the first element from the PriorityQueue, and it is sure to be the edge with the lowest distance. This edge will connect one animal (call it newParent) already in the tree and another animal (call it newChild) not yet in the tree. We'll add newChild to the tree as a child of newParent. But since we just added another node to the tree, in the next step we'll have more possible edges to consider for addition! So we'll have to add all edges between newChild and all the remaining animals to the priority queue. We keep going until the tree has all the animals in it.

To ensure that everyone produces consistent answers, we will need to do two things: First, if two pairs of animal are tied for their distance, we will tiebreak these pairs based on a String comparison of the names (e.g. "Biscuit" or "Armored Snapper") of the animals in nodesToAdd, adding the animal with the lexicographically smallest name. Also, the children of each node will be stored as a list. When adding a new child, add it to the end of the list of children.

Here's pseudo-code for how the complete algorithm might look:

public class PhylogenyTree { public PhylogenyTree(Set<Species> species, Species root) { // The animal at this node will be the root animal. this.animal = root; PriorityQueue<Edge> edgesToAdd = ???; // edges from root to other animals while ( ??? /* there are still nodes to be added to the tree*/ ) { // Get first edge from edgesToAdd, which connects two Species. // newParent is the Species already in the tree // and newChild is the Species not yet in the tree. // If both are already in the tree, discard the edge and get the next one. // Append newChild to list of children of newParent. // Move newChild from nodesToAdd to nodesInTree. // Add edges connecting newChild and the rest of the animals to the priority queue. } } private static class Edge { private Species species1; private Species species2; private int distance; // You define the rest! } }

Program I/O

Keeping I/O Separate

Note that our discussion of the animals and the phylogeny tree did not mention anything about the command line or a GUI! While we need some way to provide input to our program so it can create the animals, and likewise our program needs to display some sort of output for the tree, we should still separate the animal and tree themselves from details about we read in the animals or display the tree.

For example, suppose we change the format of our .dat files, or we add in a GUI to select these files instead of specifying them on the command line. Since the constructor for the Species just needs the name, Latin name, image filename, and DNA, we do not need to change our Species class at all. All we changed is how our program finds these four pieces of information need to construct a species.

Likewise, since we have a separate class to print out the phylogeny tree, if we want to change how we print the tree, we do not need to change the code for the PhylogenyTree class. In fact, in A5, we will be printing out the tree in a different format called the Newick format, which can be read in by the program Dendroscope.

Input: Command Line Arguments

Your program will take the following arguments (passed to your main method in its args array). The first argument will be either "--console" or "--gui" (note the two dashes in front), just like in A3. For A4, we will not be working with a GUI, so the first argument can be ignored for now. We will be working with a GUI once again in A5.

The second argument is of the form "--root=Axx", which will specify the file containing the root animal. This will not contain the .dat extension. (e.g. "--root=A39" or "--root=A17".

Note: In assignment 3, we simply specified the directory of animals to be analyzed. In assignment 4, we want to once again have the flexibility to analyze only a few animals at a time, so we have gone back to the same style as assignment 2, where the files are listed one by one on the command-line.

Output: Printing the Tree

Note: this is just an example of printing a tree; the example tree is not the correct tree.

Finally, our program will need to print out the tree, so that it will have something to show. We have provided the stub of a class called TreePrinter, which has a function to do this. When we display the tree, we first start out with the root animal, which has no tabs.

Spotted Ghila

For each level that you descend in the tree, indent by one tab before printing the descendents of this node. In this example, Spotted Ghila has one child, Striped Salamander, which also has its own child, Gray Floop.

Spotted Ghila Striped Salamander Gray Floop

If a node has multiple children, you should print out each child node and all of that child's descendants before moving on to the next child. For example, suppose that Spotted Ghila has two more children, Parmesanian and then Jelly Belly. The Parmesanian also has two children of its own, Pink Ziffer and then Globe Floater. Additionally, Pink Ziffer has its own child, Ballards ProtoDuck. Here is how the tree would like:

Spotted Ghila Striped Salamander Gray Floop Parmesanian Pink Ziffer Ballards ProtoDuck Globe Floater Jelly Belly

Writing the code

Downloading and Submitting the Assignment

Download the file a4.jar, which contains the code provided for this assignment. Similarly to the previous assignments, import the jar into your project. Note that you will need to rename the packages from cs2110.netID.phylogeny to reflect your actual netid. Eclipse's right click (or control-click for Macs) -> Refactor -> Rename will be useful for this.

When you submit your assignment, you will create a source jar just like in the past. Preferably, all your code will be in the cs2110.netID.phylogeny package (or one of its subpackages). Note that you no longer need to extend/implement the abstract classes and interfaces we provided for assignment 1.

Completing the Assignment

The code contains three class stubs: Species, PhylogenyTree, and TreePrinter. A good software engineering practice is to write stubs, classes which contain a bunch of public functions with an empty body. This helps you design and think through your code before you write any significant code, and it also gives you a basic program that compiles.

Wherever you see an empty function body with a TODO comment, you should write the code for the function. (You can go to Window -> Show View -> Tasks in Eclipse to get a list of these tasks). JavaDoc is provided to help remind you what each function does. Most of these tasks have been previously described in the assignment, and any that have not are rather trivial functions to write.

Your previous code will have to conform to the stubs provided for the Species class, but since the class is only a stub, you have a lot of freedom in how you do that. However, you should not rename, delete, or change the signature of the public functions we give you. Feel free to modify the contents of the main() method if you want to also print some helpful debugging output.

If you need to create a method with a different signature than ours (for example, a printPhylogenyTree() function with an extra argument), then you should add it as a private method and have the public method call it.

In software engineering, one often restructures code not only to improve it, but to make it conform to a specification others can use (in this case, one our JUnit tests can use). When you deviate from that specification, you can break the code of others who consume your code. Thus, if you ignore our warnings, you would probably break our JUnit tests.

Recap

Put all your code in the cs2110.netID.phylogeny package or one of its subpackages.
You should put only tests in the cs2110.netID.phylogeny.test package; your code should still be able to run if we deleted all your tests.
Do not change the signatures of the public functions we provide for this assignment.
Submit a jar of your phylogeny package to CMS.

Testing

JUnit Tests

In this assignment, we will be using the JUnit testing framework. This allows us to dig a little deeper (but not too deep) into the code, testing individual classes and public functions inside them. This provides an additional level of granularity in diagnosing problems.

Additionally, you'll be able to step through the execution of test cases in Eclipse's debugger. If test cases are unexpectedly failing, you will want to do this to figure out what exactly is causing the test case to fail.

Even though the code is not complete, we have provided tests for it. In the real world, you can write the tests before you even write the code. Or, if you are working with a partner, your partner can write some tests while you write the code.

Testable Units of Work

If the animal distance tests are failing, that would explain why the phylogeny tree tests are failing. If the animal tests are passing (and the tests don't miss any bugs), then your failing tests for the phylogeny tree are likely caused by incorrect code in the phylogeny tree. We have broken our code into several testable units of work. In practice, this technique tends work better than giving someone the code for the entire program, saying it does not work, and then asking what is wrong in those thousands of lines of code.

Tests Provided with the Code

We have provided some basic JUnit test cases. Passing these tests will not guarantee you 100%; they are just smoke tests to catch some basic flaws. There many ways to create incorrect solutions that could still sneak past all these tests. When we grade your code, we will use a different test suite which will only pass if the results are correct.

While you should not modify the tests we have already given you, you are encouraged to write additional tests of your own (e.g.: find two animals who are siblings and two who are not, and write some sibling tests with those two pairs.)

Running the tests

You will first need to add JUnit to your build path. In Eclipse, if you see an error on an import org.junit.*; statement, just use the quick fix option and select "Fix project setup..." You will then be given the option to add JUnit to your build path. Also, if you create a new JUnit Test Case, Eclipse will ask you if you want to add JUnit to the build path.

To run the tests, you will need to create a new JUnit Run/Debug Configuration in Eclipse. The test class you want is phylogeny.test.PhylogenyTestSuite. The way the tests are written, the AnimalData folder, which contains all 40 .dat files, should be the working directory. Recall that the working directory can be changed in the Arguments tab.

If you successfully run the test suite, the test suite should be running tests from 3 classes. These 3 classes will collectively be running 8 basic tests.

What We Will Test

The new animal distance algorithm, implemented in getDistance()
Sibling test for animals, isSibling()
The phylogeny tree generated by your constructor to PhylogenyTree
The find() method in your phylogenetic tree
Printing the tree, printPhylogenyTree()

What We Won't Test

Parsing .dat files and the gene distance algorithm. These have not changed from A3, but if your gene algorithm is wrong, your animal distance algorithm will be wrong.
getName(), getLatinName(), getImageSource() in Species; getAnimal(), getParent(), getChildren() in PhylogenyTree. They are trivial, but once again bugs in them will cause other things to fail.
Invalid or weird .dat files.

Extra Credit

Java's PriorityQueue has a small flaw that prevents us from making the best use of it in our implementation of Prim's algorithm: you cannot increase the priority of an element already inside the queue (if you try to do it, you'll find that it hasn't changed position in the queue at all!). This crucial missing operation is known as decrease-key. Without this operation, we were forced to add all the edges in the graph to the priority queue, and this uses up extra memory.

If you implement your own priority queue that supports decrease-key, you will only need to add one element per Species into your priority queue. You'll set its initial priority to be the distance from the root animal, and when you add an animal to the tree you'll simply use decrease-key on the nodes whose distances need to be updated, instead of adding a whole new edge to your priority queue. Implement such a priority queue and use it in the tree-building algorithm for extra credit points.

Ideally, your queue should also decrease the running time of your program. In general, this means inserting an element and removing the smallest element should happen in O(log n) time or better. Decrease-key should happen in better than O(n) time.

Implementing such a priority queue could be as simple as making an array-backed binary heap. There are alternatives as well, such as a pairing heap. And for the truly daring, there is the fibonacci heap, which supports decrease-key in O(1) amortized time. There may be other useful priority queue implementations as well: you are free to explore, experiment, and implement whatever you wish.

Examples

We've run our solution on two sets of animals, so that you can compare your results against ours. In both cases, we've shown you the tree your program should print, in addition to an animal distance matrix just like the in A3.

Note that your final submission should not print the animal distance matrix; the only thing it should print is the tree. The animal distance matrix is provided for your convenience to allow you to ensure that your animal distances are being calculated correctly.

Example 1

Arguments: --console --root=A17 A4.dat A14.dat A15.dat A17.dat

Jelly Belly Green Herring Globe Floater Gray Floop

S0=Globe Floater: Genes [8, 9, 10, 12, 18, 20, 21] S1=Gray Floop: Genes [8, 9, 10, 11, 18, 20, 21] S2=Green Herring: Genes [4, 5, 6, 7, 15, 17, 19] S3=Jelly Belly: Genes [0, 1, 2, 3, 13, 14, 16] S0 S1 S2 S3 0 100000 1950 2075 // S0 100000 0 1950 2075 // S1 1950 1950 0 1025 // S2 2075 2075 1025 0 // S3

Example 2

Arguments: --console --root=A17 A4.dat A14.dat A15.dat A17.dat A23.dat A25.dat A30.dat A32.dat A35.dat A37.dat

Jelly Belly Green Herring Paradise Rockfish Toothy Ballonfish Pink Ziffer Globe Floater Gray Floop Striped Salamander Shy Frecklepuss Spotted Ghila

S0=Globe Floater: Genes [12, 13, 16, 18, 28, 31, 36] S1=Gray Floop: Genes [12, 13, 16, 17, 28, 31, 36] S2=Green Herring: Genes [4, 6, 7, 9, 21, 24, 30] S3=Jelly Belly: Genes [0, 1, 2, 3, 19, 20, 23] S4=Paradise Rockfish: Genes [11, 13, 14, 18, 27, 31, 34] S5=Pink Ziffer: Genes [4, 6, 8, 9, 22, 24, 29] S6=Shy Frecklepuss: Genes [10, 12, 16, 18, 27, 32, 33] S7=Spotted Ghila: Genes [10, 12, 15, 17, 26, 32, 36] S8=Striped Salamander: Genes [4, 5, 8, 9, 21, 25, 29] S9=Toothy Ballonfish: Genes [12, 13, 14, 18, 26, 31, 35] S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 0 100000 1950 2075 100000 1050 100000 100000 1734 100000 // S0 100000 0 1950 2075 100000 1050 100000 100000 1734 100000 // S1 1950 1950 0 1025 1050 100000 2225 3087 100000 1050 // S2 2075 2075 1025 0 2054 1025 2100 2098 1050 2075 // S3 100000 100000 1050 2054 0 1772 100000 3010 1856 100000 // S4 1050 1050 100000 1025 1772 0 1925 1767 100000 1614 // S5 100000 100000 2225 2100 100000 1925 0 100000 1050 100000 // S6 100000 100000 3087 2098 3010 1767 100000 0 1050 100000 // S7 1734 1734 100000 1050 1856 100000 1050 1050 0 1856 // S8 100000 100000 1050 2075 100000 1614 100000 100000 1856 0 // S9