CS 2110 Assignment 4
Due: Sun 11/8 11:59pm
This assignment builds on Assignment 3, and again you can
work with a partner.� We will make the
solution to Assignment 3 available once the due date has passed for all
students (including students with an extension).� You may use code from that solution.
1.
We
will modify the animal-to-animal distance formula so that �siblings� will be
pushed far part.� With this change, ancestor-descendent
relationships will now be the ones with the smallest distances in our table.
2.
We
will construct a phylogenomic tree.� The construction will start with a root node
and then will add nodes, representing descendent animals, one
by one.
Recall from A3 that
the animal closest to some selected animal will often be its evolutionary
brother or sister.� For example, you
might imagine some ancestral kind of horse, and two modern-day descendents: a
Budweiser Clydesdale and a Tapir.� The
Tapir is closely related to a horse, and hence the animal distance between
Clydesdale and the Tapir might be smaller than the distance between either animal and the ancestral specie from which it evolved!
To build our
evolutionary tree, we need to push that Clydesdale away from the Tapir.� It turns out that these sibling relationships
always involve at least one exactly
shared gene.� For example, the Tapir
and the Clydesdale might both have identical copies of gene G17.� In contrast, over evolutionary periods of
time, no gene remains totally unchanged.�
As a result, even if a modern animal is quite similar to its ancestor,
there will always be some changes to every gene it inherited.
Modify your
animal-to-animal distance computation so that if two animals share a gene, the
distance between the two is infinite (or, as computer scientists define
infinite, a number so large that any two animals which are not siblings will have
a smaller score, in our case 100,000� a number small enough to print without
using too much space, but big enough to exceed any possible animal-to-animal
distance).
Also, we�ll adding a
new method to the Animal class, isSibling().� This method will check
to see if two animals are indeed siblings, returning true or false.
Our goal is to
construct an evolutionary tree.� First,
your program will read in the animal data just as was done in A3.� The user will specify which animal should be
used as the root of the tree.� With this
information, you can build your
evolutionary or �phylogenomic� tree.��� Each node will contain an animal as well as
children which are its evolutionary descendents.� Note that unlike a binary tree, in this more
general tree, there could be more than two children for a given node.� Each node will also contain a parent unless
it is the root.
To build the tree,
you must first create a root node from the animal the user designated.� Then the other animals in will be added to
the tree to reflect their evolutionary relations.� This will be done step by step, one animal at
a time.� At each step, we will add the
closest evolutionary descendant we can find to the tree, selected from the
animals not already in the tree.� The
tree construction will end when every animal has been added.
Here is an explanation of the basic algorithm we will use to build the
tree.
First, we will create a new tree node for each animal.� Each tree node will have no parent and no
children.� Next, we will split these
nodes into two sets.� The first set, nodes_in_tree, will contain nodes which have been
added to the tree.� Initially, this will
contain only the node with root animal.�
The other set, nodes_to_add,
will contain nodes which have not been added to the tree.� Initially, this will contain all other nodes.
In this next part, we will select one node from nodes_in_tree and one node from nodes_to_add.� We then add the latter node as a child of the
first, making sure we link up the tree nodes properly.� We keep adding nodes to the tree in this
fashion until nodes_to_add
is empty.
Here is how we select the two nodes in each step.� For each animal in nodes_in_tree, we examine the distance from that
animal to each animal in nodes_to_add.��� We
select the node from nodes_in_tree
and the node from nodes_to_add
which minimize the distance between them, over all the cases examined.� By linking these two nodes, we add the
closest evolutionary descendant to the tree.
To ensure that everyone
produces consistent answers, we will need to do two things.� First, if two pairs of animal are tied for
their distance, we will tiebreak these pairs based on a String comparison of
the names of the animals in nodes_to_add, adding
the animal with the lexicographically smallest name.� Here we mean the name from the
Name=�whatever� command in the .dat file.� Also, the children of each node will be
stored as a list.� When adding a new
child, add it to the end of the list of children.
Hint:� To compare two pairs of animals, where the
second animal is in nodes_to_add,
it may help to create a small inner class implementing the Comparable
interface.
Here�s a sketch of the tree-building algorithm.
buildTree(set of animals, root animal) {
����� For
each animal, create a tree node with no parents and no children
����� nodes_in_tree =
{node with root animal}
����� nodes_to_add =
all other nodes
����� while (nodes_to_add is not empty) {
����������� Find node1 in nodes_in_tree,
node2 in nodes_to_add with smallest distance
����������� Append node2 to the list of children
for node1
����������� Move node2 from nodes_to_add
to nodes_in_tree
����� }
����� Return node with root animal
}����
The constructor for our PhylogenyTree
class will take two arguments, a set of all the animals (including the root
animal), and then the root animal itself.�
After the constructor runs, you should have the root node of a fully
generated phylogenemic tree.� Also, you should not modify the set of
animals passed in to the constructor.
In addition, our class will have three more methods,
which return the animal stored in this node, the parent node, and a list of
child nodes.� Finally, there will be one
more method, find(), which takes an animal as
input.� It searches the node and all its
descendants and returns the node containing that animal, or null if no such
node is found.
You will later learn about graphs in this class.� This problem can be viewed as a graph, where
the nodes or vertices are the animals, and weighted edges exist between any two
animals which are not siblings.� The
weight of the edge is the distance between the two animals.
Essentially, this problem is constructing a minimum
spanning tree for this graph.� In A4, we
are using a na�ve algorithm to accomplish this.�
In A5, we will be using a better algorithm, Prim�s algorithm, to do
this.� This should both decrease the running
time of our algorithm and probably its space usage as well.
If you cache animal distances or compute a distance
matrix ahead of time in A4, you will probably not notice a difference between
the two algorithms.� But in real-life
contexts, where one could be working with thousands of animals, the difference
will become evident.
Note that our discussion of the animals and the
phylogeny tree did not mention anything about the command line or a GUI.� While we need some way to provide input to
our program so it can create the animals, and likewise our program needs to
display some sort of output for the tree, we should still separate the animal
and tree themselves from details about we read in the animals or display the
tree.
For example, suppose we change the format of our .dat files, or we add in a GUI to select these files instead
of specifying them on the command line.�
Since the constructor for the Animal just needs the name, Latin name,
image source, and DNA, we do not need to change our Animal class at all.� All we changed is how our program finds these
four pieces of information need to construct an animal.
Likewise, since we have a separate class to print out
the phylogeny tree, if we want to change how we print the tree, we do not need
to change the code for the PhylogenyTree class.� In fact, in A5, we will be printing out the
tree in a different format called the Newick format,
which can be read in by the program Dendroscope.
Our program�s first
argument will the -cmd
or -gui,
just like in A3.� In fact, we will not be
working with a GUI in A4 (but we will in A5).�
For A4, you should ignore this argument.
Our second argument
will be the name of the file with the root animal.� This will not contain the extension (e.g.:
A38).� The remaining arguments will be
the .dat files for all 40 animals.� Typing in all 40 filenames could be tedious,
so we will use a little trick.
Suppose the AnimalData folder is inside the program�s working
directory.� To select all the .dat files, we can provide the argument AnimalData/*.dat.� Before the program
runs, this will be expanded to AnimalData/A0.dat AnimalData/A1.dat � AnimalData/A39.dat.� Note that this expansion is done automatically by the operating system,
which understands this �*� (or �wildcard�) notation.� Some people have reported issues with this, perhaps because they are running on an operating
system we haven�t run into before.� If
the newsgroup and/or office hours does not solve this
problem, you can also list out each argument.
Note: this is
just an example of printing a tree; the example tree is not the correct tree.
Finally, our program will need to print out the tree,
so that it will have something to show.�
We have provided the stub of a class called TreePrinter,
which has a function to do this.� When we
display the tree, we first start out with the root animal, which has no tabs.
�Spotted_Ghila
For each level that you descend in the tree, indent by
one tab before printing the descendents of this node.� In this example, Spotted Ghila
has one child, Striped Salamander, which also has its own child, Gray Floop.
Spotted_Ghila
����������� Striped_Salamander
����������������������� Gray_Floop
If a node has multiple children, you should print out
each child node and all of that child�s descendants before moving on to the
next child.� For example, suppose that Spotted
Ghila has two more children, Parmesanian
and then Jelly Belly.� The Parmesanian also has two children of its own, Pink Ziffer and then Globe Floater.� Additionally, Pink Ziffer
has its own child, Ballards ProtoDuck.� Here is how the tree would like:
Spotted_Ghila
����������� Striped_Salamander
����������������������� Gray_Floop
����������� Parmesanian
����������������������� Pink_Ziffer
���������������������������������� Ballards_ProtoDuck
����������������������� Globe_Floater
����������� Jelly_Belly
You can find a file named phylogeny.zip on CMS or here, which contains the code provided for
this assignment.� Place the phylogeny
folder in the directory where your source code goes (likely named src in Eclipse).� In
Java, the package structure of the classes and the structure of the files must
match (e.g.: a Java class in the phylogeny.util
package must be in the phylogeny/util folder, and the
file should have the same name as the class).
When you submit this in CMS, zip up the phylogeny
folder.� Basically, you are submitting
the code in the same form you received it.�
Thus, all your files should go either in the phylogeny package or one of
its subpackages.�
Also, do not put any code besides tests in the phylogeny.test
package.� Not only does this intuitively
make sense, but we will wipe the contents of that folder and replacing it with
a new test suite used for grading.
The code contains three class stubs:�
Animal, PhylogenyTree, and TreePrinter.� A good software engineering practice is to
write stubs, classes which contain a bunch of public functions with an empty
body.� This helps you design and think
through your code before you write any significant code, and it also gives you
a basic program that compiles.
Wherever you see an empty function body with a TODO comment, you should
write the code for the function. �(You
can go to Window → Show View → Tasks in Eclipse to get a list of
these tasks).� JavaDoc
is provided to help remind you what each function does.�� Most of these tasks have been previously
described in the assignment, and any that have not are rather trivial functions
to write.
Your previous code will have to conform to the stubs provided for the
Animal class, but since the class is only a stub, you have a lot of freedom in
how you do that.� However, you should not rename, delete, or change the signature of the public
functions we give you.� Also, do not
modify the DatParser class, although you should have
no need for that in the first place.�
Feel free to modify the contents of the main()
method if you want to also print some helpful debugging output.
If you need to create a method with a different signature than ours (for
example, a printPhylogenyTree() function with an extra argument), then you should add it
as a private method and have the public method call it.
In software engineering, one often restructures code not only to improve
it, but to make it conform to a specification others can use (in this case, one
our jUnit tests can use).� When you deviate from that specification, you
can break the code of others who consume your code.� Thus, if you ignore our warnings, you would
probably break our jUnit tests.
�
Put all your code in the phylogeny
package or one of its subpackages.
�
You should put only tests in the phylogeny.test package; your code should still be able to
run if we deleted all your tests.
�
Do not change the signatures of the
public functions we provide for this assignment.
�
Submit a .zip of the phylogeny folder
to CMS.
In this assignment, we will be using the jUnit testing framework.�
This allows us to dig a little deeper (but not too deep) into the code,
testing individual classes and public functions inside them.�� This provides an additional level of
granularity in diagnosing problems.
Additionally, you�ll be able to step through the
execution of test cases in Eclipse�s debugger.�
If test cases are unexpected failing, you will want to do this to figure
out what exactly is causing the test case to fail.
Even though the code is not complete, we have provided
tests for it.� In the real world, you can
write the tests before you even write the code.�
Or, if you are working with a partner, your partner can write some tests
while you write the code.
If the animal distance tests are failing, that would
explain why the phylogeny tree tests are failing.� If the animal tests are passing (and the
tests don�t miss any bugs), then your failing tests for the phylogeny tree are
likely caused by incorrect code in the phylogeny tree.� We have broken our code into several testable
units of work.� In practice,
this technique tends work better than giving someone the code for the entire
program, saying it does not work, and then asking what is wrong in those
thousands of lines of code.
We have provided some basic jUnit
test cases. �You will not get a 100% if
you pass these tests; they are just smoke tests to catch some basic flaws.� There many ways to create incorrect solutions
that could still squeek past all these tests.� When we grade your code, we will use a
different test suite which will only pass if the results are correct.
While you should not modify the tests we have already
given you, you are encouraged to write additional tests of your own (e.g.: find
two animals who are siblings and two who are not, and write some sibling tests
with those two pairs.)
You will first need to add jUnit
to your build path.� In Eclipse, if you
see an error on an import org.junit.*; statement, just use the quick fix option
and select �Fix project setup��� You will
then be given the option to add jUnit to your build
path.� Also, if you create a new jUnit Test Case, Eclipse will ask you if you want to add jUnit to the build path.
To run the tests, you will need to create a new jUnit Run/Debug Configuration in Eclipse.� The test class you want is phylogeny.test.PhylogenyTestSuite.
The way the tests are written, the AnimalData folder,
which contains all 40 .dat files, should be the
working directory.� Recall that the
working directory can be changed in the Arguments tab.� (There is a way to avoid this type of
limitation in Microsoft�s Visual Studio, but not in Eclipse.)
If you successfully run the test suite, the test suite
should be running tests from 3 classes.�
These 3 classes will collectively be running 6 basic tests.
Note:� In version 3.5 of Eclipse (Galileo), some
people have experienced a bug where Eclipse will not add jUnit
to the build path.� To solve this, you
can either upgrade to version 3.5.1, which fixes this bug, or you can downgrade
to 3.4.2 (Ganymede).� The latter is
recommended, as the cutting edge version of any software product tends to be
buggy.
�
The new animal distance algorithm,
implemented in computeDistance()
�
Sibling test for animals, isSibling()
�
The phylogeny tree generated by your
constructor to PhylogenyTree
�
The find() method in your phylogenomic tree
�
Printing the tree, printPhylogenyTree()
�
Parsing .dat
files and the gene distance algorithm.� These
have not changed from A3, but if your gene algorithm is wrong, your animal
distance algorithm will be wrong.
�
getName(), getLatinName(),
getImageSource() in Animal; getAnimal(),
getParent(), getChildren()
in PhylogenyTree.�
They are trivial, but once again bugs in them will cause other things to
fail.
�
We will not run tests with invalid or
weird .dat files.
We ran our solution with the args:
-cmd A17 A4.dat A14.dat
A15.dat A17.dat
Here is the printed tree
Jelly_Belly
����������� Green_Herring
����������������������� Globe_Floater
����������������������� Gray_Floop
Here is the animal distance matrix.� It�s not printed by the program; we�ve provided
it for your convenience:
0������ 100000� 1560���
1660���� Globe_Floater{Genes
10 11 15 18 28 31 33 }
������� 0������
1560��� 1660���� Gray_Floop{Genes 10 11 15 17 28 31 33 }
0
�����820�����
Green_Herring{Genes 4 5 8 13 21 24 29 }
����������������������� 0������� Jelly_Belly{Genes
0 1 2 6 19 20 23 }
�-cmd A17 A4.dat A14.dat A15.dat A17.dat A23.dat A25.dat
A30.dat A32.dat A35.dat A37.dat
Jelly_Belly
����������� Green Herring
����������������������� Paradise_Rockfish
����������������������� Toothy_Ballonfish
����������� Pink_Ziffer
����������������������� Globe_Floater
����������������������� Gray_Floop
����������� Striped Salamander
����������������������� Shy_Frecklepuss
����������������������� Spotted_Ghila
0������ 100000� 1560���
100000� 840���� 100000�
100000�
1370��� 100000� 1660����
Globe_Floater{Genes 10 11 15 18 28 31 33 }
������� 0������
1560��� 100000� 840����
100000� 100000� 1370���
100000� 1660���� Gray_Floop{Genes
10 11 15 17 28 31 33 }
��������������� 0������ 840����
100000� 1780��� 2432���
100000� 840���� 820�����
Green_Herring{Genes 4 5 8 13 21 24 29 }
����������������������� 0������ 1390���
100000� 2370��� 1440���
100000� 1640���� Paradise_Rockfish{Genes
10 12 16 18 27 31 35 }
������������������������������� 0������ 1540���
1400��� 100000� 1250���
820����� Pink_Ziffer{Genes
4 5 7 13 22 24 30 }
�������� �������������������������������0������ 100000� 840����
100000� 1680���� Shy_Frecklepuss{Genes
9 11 15 18 27 32 36 }
�����������������������������������������������
0������ 840���� 100000� 1675���� Spotted_Ghila{Genes
9 11 14 17 26 32 33 }
���������������� ���������������������������������������0������ 1440���
840����� Striped_Salamander{Genes 3 5 7 13 21
25 30 }
���������������������������������������������������������������
0������ 1660���� Toothy_Ballonfish{Genes 10 11 16 18
26 31 34 }
������������������
�����������������������������������������������������0������� Jelly_Belly{Genes 0 1 2 6 19 20 23
}