Assignment 5: Cluster Analysis
Due to CMS by Tuesday, November 19th at 11:59 pm.
Big Data is a major computer science topic
these days. As computers become both ubiquitous and more powerful, many applications —
from science to business to entertainment — are generating huge amounts of data. This
data is too large to process by conventional means. That means we need new tools (such as new
programming languages) and new algorithms to handle it all. The CS department has
several upper level courses (such as CS 4780: Machine Learning) dedicated to this topic.
In this assignment you will work with one of the most popular (though not necessarily
most efficient) tools for analyzing large amounts of data: k-means clustering. In
k-means clustering, you organize the data into a small number (k) of clusters of similar
values. For example, in the picture to the right, there are a lot of points in 3-dimensional
space. These points are color-coded into five clusters, with all the points in a cluster being
near to one another.
If you look at the wikipedia entry
for k-means clustering, you can see that it has an untold number of applications. It has been
used in market analysis, recommender systems (e.g. how Netflix recommends new movies to you),
and computer vision, just to name a few. It is popular because it is relatively easy to implement.
It is simple enough that you, a CS 1110 student, should be able do it.
A Mild Warning
This assignment is brand new. It has never been used in this class before. We have
beta-tested this assignment on other students, but that will not catch all of the
errors and misunderstandings that will inevitably pop up. Furthermore, while we
have some idea of how long the assignment should take, that is different than how
long it will actually take you. The survey for this
assignment is incredibly important.
We recommend that you start early, and work on a little bit each day. In
particular, we have indicated several suggested mini-deadlines
throughout the assignment. These are not enforced, but if you follow them, this
assignment should be very manageable. The most important thing is to complete
Part I before the first prelim, as this assignment
covers almost every major topic (lists, for-loops, classes) on that prelim.
You should check this site and Piazza regularly for updates and clarifications.
If we hear enough complaints that this assignment (or parts of this assignment)
are too hard, we will make modifications to adjust the difficulty.
IMPORTANT NOTE FOR WINDOWS USERS
There appears to be an issue with the old version of CornellExtensions on Windows
only. The matplotlib module is not working correctly. To test whether your
installation has problems, down load the test file scatter3d.py
and run it as a script. You should get a window that looks like the one below:
If your CornellExtensions is not working properly, you will get the following error
instead:
To solve this problem, you need to download and reinstall a new copy of
CornellExtensions
before you start this assignment. Please see a
staff member if you
are continuing to have problems.
For Macintosh users, you should be fine as long as you have not installed
Mavericks. For you Mavericks users, we are trying to get a test machine
to fix you, but it may still be a while.
Learning Objectives
This assignment has several important objectives.
-
It gives you practice with writing your own classes.
-
It exposes you to classes with complex invariants.
-
It gives you practice with using data encapsulation in classes.
-
It gives you practice with both both 1-dimensional and 2-dimensional lists.
-
It gives you experience with designing helper functions to structure your code properly.
-
It introduces you to a powerful machine learning tool.
Table of Contents
Authors: W. White, D. Murphy, T. Westura.
Academic Integrity and Collaboration
Academic Integrity
This is a brand new assignment, so we are less concerned about your copying
off of someone from a previous year. With that said, the usual warnings about
academic integrity stand. Do not share your code with others, in this semester
or with students in future semesters. We prohibit this behavior because engaging
in it so would help no one in the long run.
In this assignment, it is highly unlikely that your code for this
assignment will look exactly like someone else's. We will be using
Moss to check for instances of cheating and plagiarism. Anyone caught
using files that are obviously identical to other students will be
prosecuted, with the end result perhaps being to fail the course.
Collaboration Policy
You may do this assignment with one other person. If you are going to work
together, then form your group on CMS as soon as possible. If you do this
assignment with another person, you must work together. It is against the
rules for one person to do some programming on this assignment without the
other person sitting nearby and helping.
With the exception of your CMS-registered partner, you may not look at anyone
else's code or show your code to anyone else, in any form what-so-ever.
K-Means Clustering
In this section, we describe what cluster analysis is, and how we use k-mean
clustering to implement it. There is no Python code in this section: it is
all math. It is possible to do this assignment without completely understanding
this math (so long as you follow the function/method specifications). However,
you will do better if you have some intuition for what you are doing.
In this section, we also talk about the classes
that you are creating. We have found that the biggest trouble students
have with this assignment is figuring out how to connect the k-means algorithm
to the two classes: Database and Cluster. We have structured
the assignment that anyone can do Parts A and B without really understanding
anything at all. But unless you know what k-means is doing, the assignment
gets a lot harder after this point.
Cluster Analysis
Cluster analysis is the act of grouping together similar points into small
groups. When we say point, it could be a 2-dimensional point, a
3-dimensional point, a 4-dimensional point, or even higher dimensional.
In fact, for cluster analysis, a point is really just a list of numbers.
Those numbers can either be ints or floats; we do not care.
For example, suppose we work in the candy industry. We decide to rate
every candy product (using the scale 0 to 1) on how sweet it is, how
sour it is, how many nuts are included, and how smooth it is. A sample
candy bar might be labeled as follows:
Billy Bonker's Honey Nut Nonpariels (sweetness=0.73,sourness=0.37,nuttiness=0.74,texture=0.30)
We have just turned a piece of candy into a 4-dimensional point.
Cluster analysis would allow us to to group candies that are "close" to one
another. We might do this for market analysis, to make
a candy that is similar to some of the best selling brands, or to make a candy
that is very different from the rest of the market.
When we cluster points, "close" means according to the
Euclidean Distance.
You know how to compute the distance between two points in 2-dimensions, or
even three dimensions. In general, suppose you have points
\(\textbf{p} = (p_1, p_2, \ldots, p_n)\) and \(\textbf{q} = (q_1, q_2, \ldots, q_n)\).
Then we would define the distance between those two points to be
\[
d(\textbf{p},\textbf{q}) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\cdots+(p_n-q_n)^2}
\]
Distance calculations are always between points of the same dimension. For example,
suppose we have the points \(\textbf{p} = [0.1,0.2,0.3,0.4]\) and \(\textbf{q} = [0.0,0.2,0.3,0.2]\)
\[
d(\textbf{p},\textbf{q}) = \sqrt{(0.1-0.0)^2+(0.2-0.2)^2+(0.3-0.3)^2+(0.4-0.2)^2} = 0.7071\ldots
\]
When we create a cluster, we always have a centroid. The centroid
is the center of mass of the cluster (e.g. the average of all of the points
in the cluster). The centroid could be one of the points in the cluster,
but it does not have to be. For example, the picture to the right is an
example of a 4 point cluster with a centroid that is not one of the
points in the cluster.
To compute a centroid, we average. Remember that to average, you add up a
collection of things and divide by the total. The same is true for points;
we add up all the points to get a very large point, and then divide that
point by the total number of points.
To add two points, we just add the coordinates individually:
\[
\textbf{p}+\textbf{q} = (p_1+q_1, p_2+q_2, \ldots, p_n+q_n)
\]
We we divide a point by a number, we just divide each coordinate, as follows:
\[
\textbf{p}/a = (p_1/a, \>p_2/a,\> \ldots,\> p_n/a)
\]
As you can see, this is a natural thing to do with a for-loop.
K-Means Clustering
One of the hardest part of clustering is figuring out how many clusters you
should have. K-means simplifies this by starting out with a guess at
the number of clusters (this is the meaning of the number k).
All you have to do is group the data into that many clusters.
While that may sound like an unsatisfactory solution, it works well in
practice. A user usually has a good idea ahead of time about the number
of clusters. The user can run the program for several different values
of k, and pick the one that gives the best results.
To implement k-means clustering, we break it down until several steps.
1. Pick k Centroids
We start with the number k. The next step is to guess k centroids. Throughout
these instructions, we will refer to the centroids as \(\textbf{m}_1,\textbf{m}_2,\ldots \textbf{m}_k\).
The letter m indicates that each centroid will represent a type of
mean of our data points.
Remember that the centroids are themselves points. These points could have
any dimension, so long as they all have the same dimension. We will use the
number n to refer to the dimension. Hence, for any mean \( \textbf{m}_j \)
\[
\textbf{m}_j = (m_{j,1},m_{j,2},\ldots,m_{j,n})
\]
So each \(m_{j,i}\) represents the ith coordinate of the jth centroid.
To pick k centroids, all we do is chose k points from our original data set.
We chose these points at random; we have provided you will a helper function
in the assignment to make this step easier.
This is referred to as the
Forgy Method,
and it is one of the most popular ways to start k-means clustering.
2. Partition the Dataset
Now that we have k centroids, we need to assign every point in the database
to a centroid. This is know as partitioning; no two different centroids
will have the same set of points. The points for each centroid will form
a cluster.
To represent this mathematically, we create the sets
\(S_1, S_2, \ldots, S_n\). The set \(S_j\) (which we will refer to as a
cluster is the points associated with centroid \(\textbf{m}_j \).
The partition step involves puting points from our data set into the
sets \(S_j\). For each point p in the data set we
-
Find the closest centroid \(\textbf{m}_j\). In the unlikely event of ties,
we chose the one with j least.
-
Put p in the cluster \(S_j\).
3. Recompute the Mean and Repartition
Once we have put points in each cluster \(S_j\), we need to compute a new
mean \(\textbf{m}_j\). This new mean is just the average of all of the elements
in the set. Suppose that
\[
S_j = \lbrace \textbf{p}_{j,1},\textbf{p}_{j,2},\ldots,\textbf{p}_{j,c_j}\rbrace
\]
So \(S_j\) has \(c_j\) points inside of it. Then the new mean is
\[
\textbf{m}_j = \frac{1}{c_j}(\textbf{p}_{j,1}+\textbf{p}_{j,2}+\cdots+\textbf{p}_{j,c_j})
\]
The formula above uses the rules for adding points and dividing points by numbers
that we discussed above. If you do not understand this
formula, please talk to a staff member
immediately.
Now that you have new means, the points might be in the wrong cluster. We have
to repartition the data set. To do this, empty out every set \(S_j\) and refill
it with the points that are closest to the new centroid. In refilling the sets,
you continue exactly as you did in step 2.
4. Repeat Step 3 Until You Converge
The clustering algorithm continues this process: recompute the means and
repartition the data set.
We do not want to do this forever, so we have to stop some time. The right
time to stop is when the means stop changing. In other words, if all the
means are unchanged after step 3, it is okay to stop. When this happens,
we say that the algorithm has converged.
Sometimes it can take a very long time to converge — thousands of
thousands of steps. Therefore, when we implement k-means clustering
we often add a "time-out factor". This is a maximum number of times
to repeat step 3. If it does not converge before this maximum
number of times, we stop anyway, and use the clusters that we do have.
Implementing K-Means with Classes
Classes are ideal for representing complex mathematical objects. For
example, we saw in class how to use classes to represent fractions.
There are several interesting mathematical objects in the description
above (points, means, sets/clusters) and we could use classes to
represent all of these.
To cut down on the amount of code that you have to write, we will not
be using classes for points and means. In particular, you should not
use the Point class in tuple3d. That class only
supports 3-dimensional points. As we saw in the candy example, we need
to support 4-dimensional points, or even higher.
Throughout this assignment, a point will be represented as a list of
numbers; both ints and floats are allowed. This is true for means as well.
For example, here is an example of a possible mean if we were working with
4-dimensional points.
\[
\textbf{m}_j = [0.1,0.5,0.3,0.7]
\]
In this example \(m_{j,2}\) (using the notation from the previous
section is 0.5, while \(m_{j,4}\) is 0.7. Note that our mathematical
notation starts at 1, while list indices start at 0.
Class Cluster
The first interesting mathematical type is the clusters \(S_j\).
To support this new type, you will implement the class Cluster.
Objects of this class will actually hold two different values:
the mean \(\textbf{m}_j\) and the set \(S_j\). We do this because
the mean and cluster are associated with one another, so it makes
sense to store them in a single object.
Right away, we see that we are going to need two attributes:
-
_centroid: A list of numbers representing the point \(\textbf{m}_j\)
-
_contents: A list of points inside of the cluster \(S_j\)
In order for this class to work, we need a very strong invariant: all of the points
in \(S_j\) must have the same dimension has \(\textbf{m}_j\). To keep track of our
invariant, we add a third attribute:
-
_dimension: The dimension of all points in this cluster
Notice that all of the attributes have underscores before their names. This means
that the attributes are meant to be hidden. A large part of this assignment is writing
getters and setters for all these attributes. We will provide the specifications
for these, but you must implement them.
One of the trickier part of this class is working with the attribute _contents.
This is a list of points, and each point is a list of numbers (int or float).
That means that _contents is a 2-dimensional list, just like we
talked about in lecture.
For example, if a cluster contains the points (0.1,0.5,0.3,0.7) and (0.2,0.0,0.2,0.1),
then the attribute _contents is
[[0.1,0.5,0.3,0.7], [0.2,0.0,0.2,0.1]]
The class Cluster needs some methods other than getters and setters.
To see what methods we need, we look at the description of the k-means algorithm,
and use that as motivation. We are going to need
-
A method to find the distance between a point and a cluster (distance)
-
A method to update the mean, as in step 3 of the algorithm (update)
As a word of warning, you will notice a few other attributes and methods in
the class Cluster provided in the assignment source code.
Those are added to make the market analysis activity
easier. They have nothing to do with k-means clustering, and so we will ignore
them for now.
Class Database
While describing the k-means algorithm, there was always this mysterious
"data set" in the background. This was the set of all points that we were clustering.
When we partition in step 2, we are looping over this data set to find all of
our points. This means we need another class to represent this data set.
That is the purpose of the class Database.
The class Database will not only keep track of the points in the data set;
it will also manage the k-means algorithm itself. That means it needs the following
attributes:
-
_contents: A list of points inside this data set
-
_ksize: The k-value for k-means clustering
-
_clusters: The list current clusters \(S_1, \ldots, S_k\)
As with class Cluster, we are going to add one more attribute to help us
keep track of our invariants:
-
_dimension: The dimension of all points in this data set
Once again, all of these attributes are hidden, and they need getters and setters.
As in class Cluster, the attribute _contents is a 2-dimensional
list of numbers (so you will find yourself writing the same code in both
Cluster and Database).
However, attribute _clusters is list of Cluster objects, and not
a 2-dimensional list. Therefore, its getter (and setter) will be a little different.
Fortunately, we will give you directions for all of these.
The other methods of class Database all come from looking at the k-means
clustering algorithm. We are going to need
-
A method to perform step 1 and initialize algorithm (actually handled by the setter setKSize)
-
A method to perform step 2 and partition the points (partition)
-
A helper method for partition to find the nearest cluster to a point (nearest)
-
A method to perform step 3 in the algorithm (step)
Note that there is no method for step 4 to repeat the algorithm. We could have
added such a method, but we chose not to (because it complicated the class a bit more).
Step 4 is going to be handled by our testing programs and
additional functions.
That is it. We now know all we need to know about the classes Database and
Cluster. Of course, we are going to give you a lot more hints, but
an upper-level CS student could probably stop reading the directions at this point.
How to Work on the Assignment
As we said, this assignment is longer and more difficult than ones in the past.
We do believe that you can do it. The trick is to pace
yourself and make use of all of the unit test
and the visualizer that we provide.
Assignment Source Code
Before you do anything else, you should download the zip archive
A5.zip from this link. You should unzip it and
put the contents in a new directory. This time, you will
see that this directory contains a lot of files:
a5.py
-
This is the only file that you will need to modify. It is the Python file
that you will submit for a grade.
a5test.py
-
This file is a unit test to provide you with even more help with this assignment.
Is not a 100% complete unit test, but you are not required to finished it. We
describe this in more detail below.
cornelltest.py
-
This is a brand new version of cornelltest. In testing, we found that
all the lists made it difficult to use cornelltest. We have made several improvements
(used in a5test.py) to make errors easier to read.
visualizer.py
-
This is a GUI that will help you see what is going on. We describe how it works below.
data.a5d
-
This is a sample data set to get you started with the visualizer. It contains
only 3-dimensional points.
candy.txt
-
This is a set of candy data to use in the market analysis
part of the assignment. Because candy is 4-dimensional, it cannot be used in
the visualizer.
small_candy.txt
-
This is a smaller set of candy data. It is used by the unit test
a5test.py to verify that your market analysis
function is working correctly.
Pacing Yourself
You should start as this assignment soon as possible. The second prelim is due
the Thursday before this is due. If you wait until after the prelim to start this
assignment, you will have a hard time completing it. If you do one part of it
every few days, you will be more relaxed and get it done on time.
In addition, if you start this assignment early, you will do a lot better on the
exam. This assignment covers a lot of things that are on the second prelim,
such as lists (particularly 2-dimensional lists), for-loops, and classes.
At the end of each part, we have a suggested completion date. While this is not
enforced, we recommend that you try to hit these deadlines. If you cannot make
this deadlines, it might be a sign that you are having difficulty and need some
extra help.
Using the Unit Test
Ideally, you should always write your own tests. However, writing test cases for
this assignment is quite difficult. Test cases are often multiple lines, since
you have to initialize the object, and check the results. One of the primary
methods relies on a random number generator, and that means that you are not
guaranteed to have the same output every time. Finally, some of the tests require
that we violate coding conventions and access hidden attributes directly.
Because of this we have written a unit test for you, in the file
a5test.py . There is a separate
test procedure for each part of the assignment. In addition, there are print
statements inside of the test procedures, so you can determine exactly what
part is failing at any given time.
This unit test script is quite long, and can be a bit complicated at points.
However, if you learn to read what this script is doing, you will understand
exactly what is going on in this assignment. You are encouraged to sit down
with a TA or consultant to talk about the unit test, to help you understand
it better. Note that there are a lot of new testing functions in cornelltest!
You might want to use the help() function to look at them all.
With that said, this unit test is not complete. It does not have full coverage
of all the major cases. It does not check that you are enforcing the preconditions
and invariants (something we will check while grading). If you want to add
additional tests for these, that is fine. However, we do not want you to submit
the file a5test.py when you are done, even if you made modifications
to the file.
Using the Visualizer
The visualizer is designed to be run as a script. To get it working, navigate to
the assignment directory and type
python visualizer.py
When you start the visualizer for the first time, it will look like the window
shown below. You will see a empty plot of 3-dimensional axis, as well as some
buttons and menus to the right. If you click on the window with the plot, you can
move the axes. You will find this useful when you are trying to look at
three-dimensional data.
This GUI is not going to do anything until you start implementing the various
parts of this assignment. Once you have most of the assignment completed (up
to Part E), you can use the visualizer to
perform a k-means clustering on a three-dimensional data set.
If you have everything working, click on the button to select a file and load
the file data.a5d. Your initialization step will pick three random
centroids and partition the data. The result will look something like the
following:
Note that each cluster is a different color. The points in the cluster are crosses,
while the solid circle is the centroid.
You can use the pull-down menu to change the k-value. There is a limit on the number
of k values you can choose because there are only so many different colors that we
could think of. Whenever you change the k, it will reinitialize the centroids. It
will also reinitialize the centroids when you hit the reset button.
Each time you hit the step button, it will perform step 3 of the
k-means algorithm. You can watch the centroids change as you hit the button. When
the algorithm converges, you will see the word "True" after Finished and
the step button will no longer do anything. Below is a picture of a finished
computation with final versions of the clusters.
Part I: Class Definitions
In the file a5.py you will notice definitions for two classes:
Database and Cluster. These definitions have all of
the methods that you will (eventually) implement. However, right now they
are just stubs. The first part of the assignment is to complete these
Class definitions. You will use these classes later.
Your implementation of these classes is broken into six tasks.
Each task has a stopping point that allows you to test what you have
done so far with the unit test script. You can also see what you
have done with the visualizer.
We have given you a guideline of how to pace yourself on this assignment.
As a general rule, you should spend two days on each section, finishing
all of Part I before the first exam. If you are having difficulty, you
can delay Part F until after the exam,
but any slower than that is a sign that you are having difficulty and
need more hints. In that case, you should see a
staff member
as soon as possible.
Part A: The Database Class
In the specification for Database, you will notice the following instance
attributes:
_dimension: the point dimension for this database
[int > 0. Value NEVER changes after initialization]
_ksize: the k value for k-means clustering [int >= 0]
_contents: the database contents
[either a 2D list of numbers or empty list,
the number of columns is _dimension]
_clusters: the Cluster objects for the k-means algorithm
[list of Clusters or empty list,
len(_clusters) = _ksize]
All of the attributes start with an underscore.
That indicates that they are meant to be hidden. They can be used
in class methods, but outside of this class definition the user should
only access them via getters and setters (However, you will note that
we violate this rule in a5test.py
for testing purposes).
You need to implement the initializer for this class, as well as the
getters and setters.
Initializer
The initializer needs to create all four of the instance attributes
above and assign them a value. We have provided the header and specification
for the initializer, but it is wrong and needs to be modified.
You will notice that the specification calls for a default value for
the parameter contents. You need to modify the header to
provide this default value.
Getters and Setters
The method headers and specifications for the getter and setters are
provided in the module. For this part of the assignment, you are to
implement the following methods:
def getDimension(self):
"""Return: The point dimension of this database"""
def getKSize(self):
"""Return: The k value for k-means clustering"""
def getContents(self):
"""Returns: The database contents"""
def appendContents(self,point):
"""Adds a COPY of point at the end of _contents.
This method does not add the point directly. It adds
a copy of the point.
Precondition: point is a list of numbers (int or float),
len(point) = _dimension."""
def clearContents(self):
"""Deletes all the database contents, making
that attribute _contents into an empty list."""
def getCluster(self,pos):
"""Returns: The cluster at position pos in _clusters
Precondition: pos is an int in 0..k-1,
where k is the k-value"""
Asserting Preconditions
As a last step, you should go back to every method with preconditions and
assert those preconditions. This includes both the initializer and
the setters (and in one case, a getter). Recall that preconditions
are how we preserve the invariants, so asserting the preconditions
prevents the invariants from being violated.
You will note that some of these preconditions are quite complex.
For example, the initializer requires that contents is a 2D list
of either ints or floats. There are a several things to check here:
-
contents is a list.
-
Every element of contents is a list of all numbers (int or float)
-
Every element of contents has the same length.
As in the previous assignment, you are going to need a helper function
(or helper method) to assert this precondition. But we did not provide
the helper this time. It is up to you to define the helper function
(or method) to use in the assert statement.
When you write your helpers, you need to remember to write a specification
for each helper. The helpers should not have any preconditions.
They should be prepared for any argument. It is not very useful to check
preconditions with function or method that has its own preconditions.
Testing it Out
Once you complete the getters and setters for the Database class,
you can run the unit test script. If you see a print statement that
says
Part A appears correct
then you can be pretty confident about this part. However, there is
on caveat: we cannot check getCluster() yet. We have to
wait until you complete the Cluster class in Part B to
test this method.
You can also try your code in the visualizer. Run the visualizer
and load the file data.a5d. You will see all points in
the data set as black crosses in the plot window, as shown below.
None of the other buttons will work, however. In addition, you will
see some error messages in the terminal saying VISUALIZATION FAILED.
This is perfectly normal. These error messages are the visualizer
letting us know that it was not able to form clusters yet and that
it was shifting down to do Part A only.
Part B: The Cluster Class
The specification for the Cluster class mentions the following instance attributes:
_centroid: the cluster centroid
[list of number (int or float)]
_dimension: the point dimension of this cluster
[int > 0. len(_centroid) = _dimension
Value NEVER changes after initialization]
_contents: the current data in the cluster
[either a 2D list of numbers or empty list,
the number of columns is _dimension]
Once again, all of the attributes start with an underscore. You need
to implement the initializer for this class, as well as the
getters and setters.
Initializer
The initializer header and specification is provided in the class
definition. The header is correct this time; you just need to
implement it.
You will notice that cluster has two attributes that we have not
mentioned: _srating and _crating. These attributes
are not needed at this time. You will not use them until
Part F. If you want to leave
them out, that is okay. It is also okay to put them in the
initializer if you wish.
Getters and Setters
The method headers and specifications for the getter and setters is
given below. Again, you will notice that not all instance attributes
have setters; some only have getters. We do this when we do not want
an attribute to change after initialization.
def getDimension(self):
"""Return: The point dimension of this cluster"""
def getCentroid(self):
"""Returns: The cluster centroid"""
def setCentroid(self,point):
"""Sets the centroid to be a COPY of point.
This method must copy point. You cannot assign it to
the attribute _centroid directly
Precondition: point a list of numbers w/ length _dimension"""
def getContents(self):
"""Returns: The current data in the cluster."""
def appendContents(self,row):
"""Adds a COPY of point at the end of _contents.
This method does not add the point directly. It adds
a copy of the point.
Precondition: point is a list of numbers (int or float)"""
def clearContents(self):
"""Deletes all the cluster contents, making
that attribute _contents into an empty list."""
You will notice that a lot of these methods are very similar to the
those in Part A. Just write the same code again.
Asserting Preconditions
As with the database class, you should look at every method with preconditions
and assert those preconditions. This includes both the initializer and
the setters. Once again, you will need to use helpers.
Testing it Out
Once you complete the getters and setters for the Cluster class, run the
unit test. If everything is working, you will get the appropriate message.
Now that Cluster has an initializer, it will also test out
your method getCluster() from Part A.
If you run the visualizer, you will see that it has even more functionality.
When you load the file data.a5d, you will see all points in the data
set as blue crosses (not black) in the plot window. In addition,
you will see a blue circle representing an initial centroid, as shown below.
There will still get the error message VISUALIZATION FAILED.
This is normal. However, if you still see black instead of blue for
your points, something is very wrong. In that case, you should look
at your error message to find out what is wrong.
Try to finish both Parts A and B by Tuesday, November 5.
Part C: Partitioning the Data Set
Both the initialization step (step 2) and the partition step (step 3)
in k-means clustering require you to partition the data. Remember
that partitioning involves assigning each point to its nearest cluster.
To do this, you need to implement several methods specified below.
Cluster Methods
Which cluster a point is assigned to depends on how far it is away from
the centroid. This is the purpose of the method distance() in
the class Cluster. Read the provided specification and
implement the method.
When implementing distance(), remember what we said about
Euclidean distance above. That definition
had some ellipses in it. As we saw in class, this means you need a for-loop.
Database Methods
You need to implement two new methods in class Database. This
first method is nearest; read the specification of this method.
Using distance as a helper, this method computes the distance
from the point to each cluster, and picks the one that is closest. It
returns None if the attribute _clusters is empty.
The second, and more important, method to implement is partition.
This method implements step 2 in the k-means clustering algorithm. Between
the description above and the method
specification, you should have enough information to implement this method.
Asserting Preconditions
All three methods have preconditions. You must assert them. Note that
the precondition for partition is a precondition about an
attribute, not a parameter.
Testing it Out
The unit test for Part C tests each of the methods seperately. Therefore you
can (and are encouraged to) run the unit test after you complete each method
above. The only catch is that they are tested in the order listed (first
distance, then nearest and then partition), so
you must complete them in that order for the test to work.
You can also test the part in the visualizer. When you load the file
data.a5d, it will have three clusters colored as in the picture below.
If you have done it correctly, each point should have the same color as its
nearest centroid.
Try to finish this part by Thursday, November 7.
Part D: Initializing K-Means
When ever you set the k-value (e.g. the attribute _ksize in
Database), you must reinitialize the k-means algorithm.
Therefore, step 1 in the algorithm actually takes place in the
setter setKSize.
Obviously this method is a setter, just like the ones that you
created in Parts A and B. However, it does a lot of things
before it sets the attribute _ksize. Read the
specification to see what you should do. In addition,
you might want to review step 1
above.
The one tricky step involves picking the centroids to start.
You want to pick k different points in the database. To do
that, pick k random numbers between 0 and len(_contents);
those are the (positions of) the points to use for the centroid.
The points must be different, so the positions must all be
different. To help you, we have provided the function
random_range at the top of a5.py.
This function guarantees to return a list of distinct random
integers. See the specification of this function for more.
Important (November, 12): There seems to be some
confusion with the function specification. In this problem,
you are not supposed to clear the contents of the existing
clusters (like you did in partition). Instead, you are
supposed clear the cluster list
(attribute _clusters) and make brand new clusters.
Asserting Preconditions
This method has a precondition. Assert it.
Testing it Out
Once again, you should test your code using the unit test
provided in a5test.py .
The visualizer should also work without error now. When you
load the data file, it will create three clusters
and color your points as before (even with the test code removed).
More importantly, if you change the k value, it will change
the number of clusters and their colors. The step button
will not work, however.
Keep in mind that the choice of clusters is random. If you
hit the reset button, you will see different clusters
each time.
Try to finish this part by Friday, November 8.
Part E: The Update Step
You have implemented steps 1 and 2 of k-means clustering. It is
now time to implement step 3 (step 4 is handled automatically
for you in the visualizer). To do this, you need to implement
methods in both Database and Cluster.
Cluster Method
In step 3, you need to update the centroid of each cluster via
the update method. To understand what to do in this
method, read the method specification, as well as
step 3 in the description above.
Notice the return value for this method. Before changing the
cluster centroid, you need to compare the new centroid with
the old centroid. To best way to do this is with the function
numpy.allclose
in the numpy module. This compares two lists of numbers
(either int or float) and returns True if they are close enough
(and False otherwise). For example
numpy.allclose([1.0,2.0],[1.0,2.0])
evaluates to True while
numpy.allclose([1.0,2.0],[1.0,2.1])
evaluates to False.
Database Method
The final method to implement for k-means clustering is the step
method. This method specification has been provided. Once again, you should
note the return value. You need to check the result of update on
all clusters. If they are all true, then this method returns True.
But if you just one cluster returns False on update, then this
method should return False as well. When step return True, we
know that the algorithm is finished and our clusters are done.
Testing it Out
Once more, you should test your methods with the unit test. Both methods
update and step are tested separately. Ff there is
an error, this will help you determine which method has the error.
At this point, the visualizer will work completely. Play with it and search
for clusters. Remember, because of the random step at initialization,
you may not get the same result each time.
Try to finish this part by Sunday, November 10.
Part F: Extending Class Cluster
Congratulations, you have now completed the basic k-means clustering algorithm.
The next step is to figure out how to apply this algorithm to do interesting things.
This is what we do in Part II, which can be left after
the exam. However, in testing Part II, we discovered that we can make it a lot
easier if we add some extra features to class Cluster.
Normally, this would be an ideal place to make a subclass. We have the base
features in class Cluster, which is necessary for all applications
of k-means clustering. Since our extra features are unique to a specific
application of clustering, we would put those in a subclass. However, you
have already done enough work, so we are just going to put the new features
in Cluster. Besides, you will get a lot of experience with
subclasses in the next assignment.
Cluster Ratings
To use k-means in market analysis, we need to assign each cluster a rating.
The rating of a cluster is the average of the rating of all the points in
that cluster. Therefore, we need methods to get the current (average)
rating, and to include a new rating in the average when we add a new point.
Averages cannot be stored in a single attribute. To compute an average,
you need to keep track of both the sum of all the ratings, as well as
the number of ratings so far. That is the purpose of the attributes
_srating and _crating that we saw in
Part B. If you did not assign them
to 0 in the initializer, you should do so now.
Once you have done that, you should implement the methods getRating,
addRating, and clearRating. They are fully
specifed in a5.py. Note that addRating has a precondition
that must be specified.
String Representations
We have provided you almost every method header for you in Part I.
However, there are two methods that we have not added to the file. We
want you to add str and backquotes support. We talked about the
methods to do this
in lecture.
In writing the method to support str, it should represent
each cluster as its centroid. For example:
>>> c = Cluster([1.0,1.2,3.5])
>>> str(c)
'[1.0, 1.2, 3.5]'
The body of this method should just be one line.
The spaces are added automatically by Python when you convert
a list to a string.
For backquotes, you should have the same output as str,
but have the class name in front. For example:
>>> c = Cluster([1.0,1.2,3.5])
>>> `c`
"<class 'a5soln.Cluster'>[1.0, 1.2, 3.5]"
Both methods need to be completely specified.
Testing it Out
There is a test procedure to test out Part F in the unit test. As theses
are additional features, there is nothing to test them with in the visualizer.
You will not need the visualizer for the rest of the assignment.
Try to finish this part by Monday, November 11. However, if you are
pressed for time, this part can be left until after the exam.
Part II: Market Analysis
Now that you have implemented k-means clustering, we wanted to give you an
opportunity to apply what you have learned. In particular, we wanted a
problem that was a little more open-ended and gave you some creative
flexibility. With that said, in our testing we discovered that this was
making the assignment really long.
We have tried to simplify this assignment as much as possible, in order
to give you a chance to work on Part II. But this where we really need
to know how the class is doing. If the class is really struggling on
Part II, we will give hints, or drastically shorten this part of the
assignment. Of course that means that you must make a good faith
effort to complete Part I before the second prelim. If you wait
until after the prelim to complete the classes, we cannot help you.
Pay attention to Piazza for possible hints or updates on Part II
Analyzing Candy
In this problem, you will assume that you are a CEO of a candy company.
You want to create a new product. However, you are fairly conservative, so
you do not want to make a radically new product. Instead, you want to make
a product that is similar to other best-selling candies.
The way to do this is simple. You look at all of the best-selling candies
(not just any candies, but the best-selling ones). You then cluster them
by properties like sweetness, sourness, and texture. You pick one of the
centroids of this cluster (preferably the cluster with the absolute best
sellers), and that is the candy that you should make.
To help you with this process, we have already done the market research
for you. We have provided you with a file called candy.txt.
Each row specifies a candy and its properties. For example, the
first row of this file is
Hearsay Sour Semisweet Chocolate Mints,0.32,0.87,0.14,0.68,3.04
This is six items separated by commas. The items are (in order)
-
The name of the candy. It will almost always have spaces.
-
The candy sweetness. This is a float between 0 and 1
(where 0 = not sweet, 1 = extremely sweet).
-
The candy sourness. This is a float between 0 and 1
(where 0 = not sour, 1 = extremely sour).
-
The candy nuttiness. This is a float between 0 and 1
(where 0 = no nuts, 1 = all nuts).
-
The candy texture. This is a float between 0 and 1
(where 0 = smooth, 1 = hard and crunchy).
-
The rating. This is a float between 0 and 5 to measure its popularity (e.g the "stars")
Your market analysis should cluster candies by their four attributes.
Do not include rating when performing your cluster analysis. We will
come back to rating later.
The Analysis Function
In performing your market analysis, you should implement the function
best_cluster. The stub and specification of this function
can be found at the end of the file a5.py. The specification
is quite long, but it is complete. But to make it a bit more clear,
here is an outline of what you should be doing:
-
You need to create a new Database instance.
-
You need to read each line from the file, and turn it into a 4-dimensional point.
-
For each point, you add it the database only if the candy rating is above the cutoff.
(The purpose of the cutoff is to ignore poor-selling candies and make analysis faster)
-
If no candy was above the cutoff, stop the function immediately and return None.
-
You need to run k-means clustering on this database for the given k.
-
Now that you have your final clusters, you need to compute a rating for each cluster:
-
For each candy (including those below the cutoff), find the nearest cluster
-
Add that rating to the cluster using the methods from Part F
-
Find the Cluster object with the highest rating and return it.
Note that this is a lot of things to do. This is going to be a massive
function if you do it all one place. Therefore, we are adding an additional
rule:
Do not write any function in this module with more than 30
lines (not counting specification).
If you find yourself writing
more than 30 lines of code for this function (or any other function)
you need to pull out some of that code and put it into a helper function.
We are not going to tell you what helper functions to create. That
is up to you. Remember to write a specification for each helper
function (including preconditions).
To help you with the function best_cluster, here are some
additional hints and guidelines.
Assert the Preconditions
You should assert the preconditions in the function best_cluster.
However, we are giving you a choice for your helper functions. If your
helper functions are hidden (e.g. the name starts with an underscore), then
you do not need to assert the preconditions. Asserts are to protect your
code from outside users, and outside users should not be accessing hidden
functions.
On the other hand, if you do not hide your helper functions, then we
expect you to assert the preconditions for them as well.
Reading the Data from a File
Reading data from a file is easy. There is a built-in function open
to open a file and create a file object. This object works a lot like a list;
you can put it in a for-loop, and each line of the file will be assigned to
the loop variable.
For example, suppose you wanted to write a Python script that prints out
the name of all the candies. You write the following code:
f = open('candy.txt')
for line in f:
pos = line.find(',')
print line[:pos]
f.close()
Note the close method. You should always close the file when done.
If you want more information about using files in Python, see Chapter 14 of
your textbook.
Running K-Means Until Convergence
In Part I, the visualizer took care of Step 4 in
k-means clustering. It is up to you now. You need to repeat the process until
step returns true, or you have gone onfor maxstep steps, which
ever comes first.
You cannot do this with a for-loop. While the for-loop could run for maxstep
steps, it cannot (easily) finish early if step returns true. You could
do it with recursion. However, you might get a call where maxstep = 1000,
which is more than the number of call frames that Python allows.
The best solution is to use a while-loop. A while-loop is similar to a
for-loop except that it has the form
while <boolean-expression>
body
A while-loop repeats the body as long as the boolean expresion is true.
For example, if you wanted to print out 'Hello World!'
10 times, you could do it with the following Python code:
count = 0
while count < 10:
print 'Hello World!'
count = count+1
For more information on how to use a while-loop, see Chapter 7 of your textbook.
Completing the Analysis
Now that you have completed the analysis function, it is time to do your market
analysis. Open up the interactive shell and try the function best_cluster
on the data file candy.txt. You need to come up with the values k,
maxstep and cutoff. We are not going to tell you want to use;
it entirely is up to you.
As a word of warning, candy.txt has almost 20,000 candies in it. If your
cutoff is too low, it will take a long, long time to run (we gave up after a few
minutes in early testing). If your cutoff is too high, you will not get enough
candies to make a cluster. So you have to experiment a bit.
To speed things up, you might want to turn assertions off. All those
assert statements enforcing the preconditions are really slowing things down.
If you start Python with command
python -O
it will ignore all assert statements, regardless of whether they are True or
not (type assert False after starting Python this way and see what
happens). This will speed up your code significantly, though the cutoff
still matters.
When you finally get an answer, the result will be a Cluster
object. Use the str function to see the centroid of this cluster.
That is the candy you want to make. You can round off the sweetness,
sourness, etc. but your candy should be reasonably close to those values.
A Short Essay
In the docstring for the file kmeans.py, we want you
to write a short essary describing what you did. In this essay,
we want
-
The description of your candy.
-
Your choice of k and the rating cutoff to decide on this candy.
-
Some justification for why you chose that k and cutoff.
When decribing the candy tell us the sweetness, sourness, nuttiness and texture.
Do not give a rating; that is up to your customers. In addition, you should
make up a name. It can be anything you want, so long as it is not obviously
offensive.
Finishing the Assignment
Once you have everything working you should go back and make sure that your program
meets the class coding conventions. In particular, you should check that the
following are all true:
-
There are no tabs in the file, only spaces (recall, tabs look like arrows if whitespace is visible)
-
Functions are each separated by two blank lines.
-
Lines are short enough (80 chars) that horizontal scrolling is not necessary.
-
The specifications for all of the functions are complete and are docstrings.
-
Specifications are immediately after the function header and indented.
-
All method preconditions are fully checked and asserted.
At the top of a5py.py you should have three single line comments with
(1) the module name, (2) your name(s) and netid(s), and (3) the date you finished the
assignment. Finally, the module docstring should contain your
short essay.
Upload the file a5.py to CMS by the due date: Tuesday, November 19th at 11:59 pm.
Do not submit any files with the extension/suffix .pyc . It will
help to set the preferences in your operating system so that extensions always
appear.
Survey
As usual, we ask that you complete the survey posted in CMS.
The survey will ask about things such as how long you spent on the assignment, your
impression of the difficulty, and what could be done to improve it. The
survey is particularly important this time because it is the first time
that we have ever given this assignment. We want to know whether we should
keep it, or look for a new assignment next year.
|