CS 2110 Assignment 4

Due: Sun 11/8 11:59pm

This assignment builds on Assignment 3, and again you can work with a partner.We will make the solution to Assignment 3 available once the due date has passed for all students (including students with an extension).You may use code from that solution.

Objectives

1.      We will modify the animal-to-animal distance formula so that �siblings� will be pushed far part.With this change, ancestor-descendent relationships will now be the ones with the smallest distances in our table.

2.      We will construct a phylogenomic tree.The construction will start with a root node and then will add nodes, representing descendent animals, one by one.

Animal Distance Algorithm Improvements

Siblings vs. Descendants

Recall from A3 that the animal closest to some selected animal will often be its evolutionary brother or sister.For example, you might imagine some ancestral kind of horse, and two modern-day descendents: a Budweiser Clydesdale and a Tapir.The Tapir is closely related to a horse, and hence the animal distance between Clydesdale and the Tapir might be smaller than the distance between either animal and the ancestral specie from which it evolved!

To build our evolutionary tree, we need to push that Clydesdale away from the Tapir.It turns out that these sibling relationships always involve at least one exactly shared gene.For example, the Tapir and the Clydesdale might both have identical copies of gene G17.In contrast, over evolutionary periods of time, no gene remains totally unchanged.As a result, even if a modern animal is quite similar to its ancestor, there will always be some changes to every gene it inherited.

Modifying the Animal Class

Modify your animal-to-animal distance computation so that if two animals share a gene, the distance between the two is infinite (or, as computer scientists define infinite, a number so large that any two animals which are not siblings will have a smaller score, in our case 100,000� a number small enough to print without using too much space, but big enough to exceed any possible animal-to-animal distance).

Also, we�ll adding a new method to the Animal class, isSibling().This method will check to see if two animals are indeed siblings, returning true or false.

A Phylogenomic Tree

An Evolutionary Tree

Our goal is to construct an evolutionary tree.First, your program will read in the animal data just as was done in A3.The user will specify which animal should be used as the root of the tree.With this information, you can build your evolutionary or �phylogenomic� tree.��� Each node will contain an animal as well as children which are its evolutionary descendents.Note that unlike a binary tree, in this more general tree, there could be more than two children for a given node.Each node will also contain a parent unless it is the root.

To build the tree, you must first create a root node from the animal the user designated.Then the other animals in will be added to the tree to reflect their evolutionary relations.This will be done step by step, one animal at a time.At each step, we will add the closest evolutionary descendant we can find to the tree, selected from the animals not already in the tree.The tree construction will end when every animal has been added.

Constructing the Phylogenomic Tree

Here is an explanation of the basic algorithm we will use to build the tree.

First, we will create a new tree node for each animal.Each tree node will have no parent and no children.Next, we will split these nodes into two sets.The first set, nodes_in_tree, will contain nodes which have been added to the tree.Initially, this will contain only the node with root animal.The other set, nodes_to_add, will contain nodes which have not been added to the tree.Initially, this will contain all other nodes.

In this next part, we will select one node from nodes_in_tree and one node from nodes_to_add.We then add the latter node as a child of the first, making sure we link up the tree nodes properly.We keep adding nodes to the tree in this fashion until nodes_to_add is empty.

Here is how we select the two nodes in each step.For each animal in nodes_in_tree, we examine the distance from that animal to each animal in nodes_to_add.��� We select the node from nodes_in_tree and the node from nodes_to_add which minimize the distance between them, over all the cases examined.By linking these two nodes, we add the closest evolutionary descendant to the tree.

To ensure that everyone produces consistent answers, we will need to do two things.First, if two pairs of animal are tied for their distance, we will tiebreak these pairs based on a String comparison of the names of the animals in nodes_to_add, adding the animal with the lexicographically smallest name.Here we mean the name from the Name=�whatever� command in the .dat file.Also, the children of each node will be stored as a list.When adding a new child, add it to the end of the list of children.

Hint:To compare two pairs of animals, where the second animal is in nodes_to_add, it may help to create a small inner class implementing the Comparable interface.

Here�s a sketch of the tree-building algorithm.

buildTree(set of animals, root animal) {

���� For each animal, create a tree node with no parents and no children

����� nodes_in_tree = {node with root animal}

����� nodes_to_add = all other nodes

����� while (nodes_to_add is not empty) {

����������� Find node1 in nodes_in_tree, node2 in nodes_to_add with smallest distance

����������� Append node2 to the list of children for node1

����������� Move node2 from nodes_to_add to nodes_in_tree

����� }

����� Return node with root animal

}����

Implementing the PhylogenyTree Class

The constructor for our PhylogenyTree class will take two arguments, a set of all the animals (including the root animal), and then the root animal itself.After the constructor runs, you should have the root node of a fully generated phylogenemic tree.Also, you should not modify the set of animals passed in to the constructor.

In addition, our class will have three more methods, which return the animal stored in this node, the parent node, and a list of child nodes.Finally, there will be one more method, find(), which takes an animal as input.It searches the node and all its descendants and returns the node containing that animal, or null if no such node is found.

Note for A5:Improving the Algorithm

You will later learn about graphs in this class.This problem can be viewed as a graph, where the nodes or vertices are the animals, and weighted edges exist between any two animals which are not siblings.The weight of the edge is the distance between the two animals.

Essentially, this problem is constructing a minimum spanning tree for this graph.In A4, we are using a na�ve algorithm to accomplish this.In A5, we will be using a better algorithm, Prim�s algorithm, to do this.This should both decrease the running time of our algorithm and probably its space usage as well.

If you cache animal distances or compute a distance matrix ahead of time in A4, you will probably not notice a difference between the two algorithms.But in real-life contexts, where one could be working with thousands of animals, the difference will become evident.

Program I/O

Keeping I/O Separate

Note that our discussion of the animals and the phylogeny tree did not mention anything about the command line or a GUI.While we need some way to provide input to our program so it can create the animals, and likewise our program needs to display some sort of output for the tree, we should still separate the animal and tree themselves from details about we read in the animals or display the tree.

For example, suppose we change the format of our .dat files, or we add in a GUI to select these files instead of specifying them on the command line.Since the constructor for the Animal just needs the name, Latin name, image source, and DNA, we do not need to change our Animal class at all.All we changed is how our program finds these four pieces of information need to construct an animal.

Likewise, since we have a separate class to print out the phylogeny tree, if we want to change how we print the tree, we do not need to change the code for the PhylogenyTree class.In fact, in A5, we will be printing out the tree in a different format called the Newick format, which can be read in by the program Dendroscope.

Input:Command Line Interface

Our program�s first argument will the -cmd or -gui, just like in A3.In fact, we will not be working with a GUI in A4 (but we will in A5).For A4, you should ignore this argument.

Our second argument will be the name of the file with the root animal.This will not contain the extension (e.g.: A38).The remaining arguments will be the .dat files for all 40 animals.Typing in all 40 filenames could be tedious, so we will use a little trick.

Suppose the AnimalData folder is inside the program�s working directory.To select all the .dat files, we can provide the argument AnimalData/*.dat.Before the program runs, this will be expanded to AnimalData/A0.dat AnimalData/A1.dat � AnimalData/A39.dat.Note that this expansion is done automatically by the operating system, which understands this �*� (or �wildcard�) notation.Some people have reported issues with this, perhaps because they are running on an operating system we haven�t run into before.If the newsgroup and/or office hours does not solve this problem, you can also list out each argument.

Output:Printing the Tree

Note: this is just an example of printing a tree; the example tree is not the correct tree.

Finally, our program will need to print out the tree, so that it will have something to show.We have provided the stub of a class called TreePrinter, which has a function to do this.When we display the tree, we first start out with the root animal, which has no tabs.

Spotted_Ghila

For each level that you descend in the tree, indent by one tab before printing the descendents of this node.In this example, Spotted Ghila has one child, Striped Salamander, which also has its own child, Gray Floop.

Spotted_Ghila
����������� Striped_Salamander
����������������������� Gray_Floop

If a node has multiple children, you should print out each child node and all of that child�s descendants before moving on to the next child.For example, suppose that Spotted Ghila has two more children, Parmesanian and then Jelly Belly.The Parmesanian also has two children of its own, Pink Ziffer and then Globe Floater.Additionally, Pink Ziffer has its own child, Ballards ProtoDuck.Here is how the tree would like:

Spotted_Ghila
����������� Striped_Salamander
����������������������� Gray_Floop

����������� Parmesanian

����������������������� Pink_Ziffer

���������������������������������� Ballards_ProtoDuck

����������������������� Globe_Floater

����������� Jelly_Belly

Writing the Code

Downloading and Submitting the Assignment

You can find a file named phylogeny.zip on CMS or here, which contains the code provided for this assignment.Place the phylogeny folder in the directory where your source code goes (likely named src in Eclipse).In Java, the package structure of the classes and the structure of the files must match (e.g.: a Java class in the phylogeny.util package must be in the phylogeny/util folder, and the file should have the same name as the class).

When you submit this in CMS, zip up the phylogeny folder.Basically, you are submitting the code in the same form you received it.Thus, all your files should go either in the phylogeny package or one of its subpackages.Also, do not put any code besides tests in the phylogeny.test package.Not only does this intuitively make sense, but we will wipe the contents of that folder and replacing it with a new test suite used for grading.

Completing the Assignment

The code contains three class stubs:Animal, PhylogenyTree, and TreePrinter.A good software engineering practice is to write stubs, classes which contain a bunch of public functions with an empty body.This helps you design and think through your code before you write any significant code, and it also gives you a basic program that compiles.

Wherever you see an empty function body with a TODO comment, you should write the code for the function. (You can go to Window → Show View → Tasks in Eclipse to get a list of these tasks).JavaDoc is provided to help remind you what each function does.�� Most of these tasks have been previously described in the assignment, and any that have not are rather trivial functions to write.

Your previous code will have to conform to the stubs provided for the Animal class, but since the class is only a stub, you have a lot of freedom in how you do that.However, you should not rename, delete, or change the signature of the public functions we give you.Also, do not modify the DatParser class, although you should have no need for that in the first place.Feel free to modify the contents of the main() method if you want to also print some helpful debugging output.

If you need to create a method with a different signature than ours (for example, a printPhylogenyTree() function with an extra argument), then you should add it as a private method and have the public method call it.

In software engineering, one often restructures code not only to improve it, but to make it conform to a specification others can use (in this case, one our jUnit tests can use).When you deviate from that specification, you can break the code of others who consume your code.Thus, if you ignore our warnings, you would probably break our jUnit tests.

Recap

         Put all your code in the phylogeny package or one of its subpackages.

         You should put only tests in the phylogeny.test package; your code should still be able to run if we deleted all your tests.

         Do not change the signatures of the public functions we provide for this assignment.

         Submit a .zip of the phylogeny folder to CMS.

Testing

jUnit Tests

In this assignment, we will be using the jUnit testing framework.This allows us to dig a little deeper (but not too deep) into the code, testing individual classes and public functions inside them.�� This provides an additional level of granularity in diagnosing problems.

Additionally, you�ll be able to step through the execution of test cases in Eclipse�s debugger.If test cases are unexpected failing, you will want to do this to figure out what exactly is causing the test case to fail.

Even though the code is not complete, we have provided tests for it.In the real world, you can write the tests before you even write the code.Or, if you are working with a partner, your partner can write some tests while you write the code.

Testable Units of Work

If the animal distance tests are failing, that would explain why the phylogeny tree tests are failing.If the animal tests are passing (and the tests don�t miss any bugs), then your failing tests for the phylogeny tree are likely caused by incorrect code in the phylogeny tree.We have broken our code into several testable units of work.In practice, this technique tends work better than giving someone the code for the entire program, saying it does not work, and then asking what is wrong in those thousands of lines of code.

Tests Provided with the Code

We have provided some basic jUnit test cases. You will not get a 100% if you pass these tests; they are just smoke tests to catch some basic flaws.There many ways to create incorrect solutions that could still squeek past all these tests.When we grade your code, we will use a different test suite which will only pass if the results are correct.

While you should not modify the tests we have already given you, you are encouraged to write additional tests of your own (e.g.: find two animals who are siblings and two who are not, and write some sibling tests with those two pairs.)

Running the Tests

You will first need to add jUnit to your build path.In Eclipse, if you see an error on an import org.junit.*; statement, just use the quick fix option and select �Fix project setup��You will then be given the option to add jUnit to your build path.Also, if you create a new jUnit Test Case, Eclipse will ask you if you want to add jUnit to the build path.

To run the tests, you will need to create a new jUnit Run/Debug Configuration in Eclipse.The test class you want is phylogeny.test.PhylogenyTestSuite. The way the tests are written, the AnimalData folder, which contains all 40 .dat files, should be the working directory.Recall that the working directory can be changed in the Arguments tab.(There is a way to avoid this type of limitation in Microsoft�s Visual Studio, but not in Eclipse.)

If you successfully run the test suite, the test suite should be running tests from 3 classes.These 3 classes will collectively be running 6 basic tests.

Note:In version 3.5 of Eclipse (Galileo), some people have experienced a bug where Eclipse will not add jUnit to the build path.To solve this, you can either upgrade to version 3.5.1, which fixes this bug, or you can downgrade to 3.4.2 (Ganymede).The latter is recommended, as the cutting edge version of any software product tends to be buggy.

What We Will Test

         The new animal distance algorithm, implemented in computeDistance()

         Sibling test for animals, isSibling()

         The phylogeny tree generated by your constructor to PhylogenyTree

         The find() method in your phylogenomic tree

         Printing the tree, printPhylogenyTree()

What We Won�t Test

         Parsing .dat files and the gene distance algorithm.These have not changed from A3, but if your gene algorithm is wrong, your animal distance algorithm will be wrong.

         getName(), getLatinName(), getImageSource() in Animal; getAnimal(), getParent(), getChildren() in PhylogenyTree.They are trivial, but once again bugs in them will cause other things to fail.

         We will not run tests with invalid or weird .dat files.


 


Examples

Example 1

We ran our solution with the args:

-cmd A17 A4.dat A14.dat A15.dat A17.dat

Here is the printed tree

Jelly_Belly

����������� Green_Herring

����������������������� Globe_Floater

����������������������� Gray_Floop

Here is the animal distance matrix.It�s not printed by the program; we�ve provided it for your convenience:

0������ 1000001560��� 1660���� Globe_Floater{Genes 10 11 15 18 28 31 33 }

������� 0������ 1560��� 1660���� Gray_Floop{Genes 10 11 15 17 28 31 33 }

0    �����820����� Green_Herring{Genes 4 5 8 13 21 24 29 }

����������������������� 0������� Jelly_Belly{Genes 0 1 2 6 19 20 23 }

Example 2

-cmd A17 A4.dat A14.dat A15.dat A17.dat A23.dat A25.dat A30.dat A32.dat A35.dat A37.dat

Jelly_Belly

����������� Green Herring

����������������������� Paradise_Rockfish

����������������������� Toothy_Ballonfish

����������� Pink_Ziffer

����������������������� Globe_Floater

����������������������� Gray_Floop

����������� Striped Salamander

����������������������� Shy_Frecklepuss

����������������������� Spotted_Ghila

 

 

0������ 1000001560��� 100000840���� 1000001000001370��� 1000001660���� Globe_Floater{Genes 10 11 15 18 28 31 33 }

������� 0������ 1560��� 100000840���� 1000001000001370��� 1000001660���� Gray_Floop{Genes 10 11 15 17 28 31 33 }

��������������� 0������ 840���� 1000001780��� 2432��� 100000840���� 820����� Green_Herring{Genes 4 5 8 13 21 24 29 }

����������������������� 0������ 1390��� 1000002370��� 1440��� 1000001640���� Paradise_Rockfish{Genes 10 12 16 18 27 31 35 }

������������������������������� 0������ 1540��� 1400��� 1000001250��� 820����� Pink_Ziffer{Genes 4 5 7 13 22 24 30 }

�������� �������������������������������0������ 100000840���� 1000001680���� Shy_Frecklepuss{Genes 9 11 15 18 27 32 36 }

����������������������������������������������� 0������ 840���� 1000001675���� Spotted_Ghila{Genes 9 11 14 17 26 32 33 }

���������������� ���������������������������������������0������ 1440��� 840����� Striped_Salamander{Genes 3 5 7 13 21 25 30 }

��������������������������������������������������������������� 0������ 1660���� Toothy_Ballonfish{Genes 10 11 16 18 26 31 34 }

������������������ �����������������������������������������������������0������� Jelly_Belly{Genes 0 1 2 6 19 20 23 }