T-Th 9:05
or
T-Th 11:15
in Olin 155

CS 1110: Introduction to Computing Using Python

Fall 2013

Assignment 5:
Cluster Analysis

Due to CMS by Tuesday, November 19th at 11:59 pm.

Big Data is a major computer science topic these days. As computers become both ubiquitous and more powerful, many applications — from science to business to entertainment — are generating huge amounts of data. This data is too large to process by conventional means. That means we need new tools (such as new programming languages) and new algorithms to handle it all. The CS department has several upper level courses (such as CS 4780: Machine Learning) dedicated to this topic.

In this assignment you will work with one of the most popular (though not necessarily most efficient) tools for analyzing large amounts of data: k-means clustering. In k-means clustering, you organize the data into a small number (k) of clusters of similar values. For example, in the picture to the right, there are a lot of points in 3-dimensional space. These points are color-coded into five clusters, with all the points in a cluster being near to one another.

If you look at the wikipedia entry for k-means clustering, you can see that it has an untold number of applications. It has been used in market analysis, recommender systems (e.g. how Netflix recommends new movies to you), and computer vision, just to name a few. It is popular because it is relatively easy to implement. It is simple enough that you, a CS 1110 student, should be able do it.


A Mild Warning

This assignment is brand new. It has never been used in this class before. We have beta-tested this assignment on other students, but that will not catch all of the errors and misunderstandings that will inevitably pop up. Furthermore, while we have some idea of how long the assignment should take, that is different than how long it will actually take you. The survey for this assignment is incredibly important.

We recommend that you start early, and work on a little bit each day. In particular, we have indicated several suggested mini-deadlines throughout the assignment. These are not enforced, but if you follow them, this assignment should be very manageable. The most important thing is to complete Part I before the first prelim, as this assignment covers almost every major topic (lists, for-loops, classes) on that prelim.

You should check this site and Piazza regularly for updates and clarifications. If we hear enough complaints that this assignment (or parts of this assignment) are too hard, we will make modifications to adjust the difficulty.


IMPORTANT NOTE FOR WINDOWS USERS

There appears to be an issue with the old version of CornellExtensions on Windows only. The matplotlib module is not working correctly. To test whether your installation has problems, down load the test file scatter3d.py and run it as a script. You should get a window that looks like the one below:

If your CornellExtensions is not working properly, you will get the following error instead:

To solve this problem, you need to download and reinstall a new copy of CornellExtensions before you start this assignment. Please see a staff member if you are continuing to have problems.

For Macintosh users, you should be fine as long as you have not installed Mavericks. For you Mavericks users, we are trying to get a test machine to fix you, but it may still be a while.


Learning Objectives

This assignment has several important objectives.

  • It gives you practice with writing your own classes.
  • It exposes you to classes with complex invariants.
  • It gives you practice with using data encapsulation in classes.
  • It gives you practice with both both 1-dimensional and 2-dimensional lists.
  • It gives you experience with designing helper functions to structure your code properly.
  • It introduces you to a powerful machine learning tool.

Table of Contents

Authors: W. White, D. Murphy, T. Westura.


Academic Integrity and Collaboration

Academic Integrity

This is a brand new assignment, so we are less concerned about your copying off of someone from a previous year. With that said, the usual warnings about academic integrity stand. Do not share your code with others, in this semester or with students in future semesters. We prohibit this behavior because engaging in it so would help no one in the long run.

In this assignment, it is highly unlikely that your code for this assignment will look exactly like someone else's. We will be using Moss to check for instances of cheating and plagiarism. Anyone caught using files that are obviously identical to other students will be prosecuted, with the end result perhaps being to fail the course.

Collaboration Policy

You may do this assignment with one other person. If you are going to work together, then form your group on CMS as soon as possible. If you do this assignment with another person, you must work together. It is against the rules for one person to do some programming on this assignment without the other person sitting nearby and helping.

With the exception of your CMS-registered partner, you may not look at anyone else's code or show your code to anyone else, in any form what-so-ever.


K-Means Clustering

In this section, we describe what cluster analysis is, and how we use k-mean clustering to implement it. There is no Python code in this section: it is all math. It is possible to do this assignment without completely understanding this math (so long as you follow the function/method specifications). However, you will do better if you have some intuition for what you are doing.

In this section, we also talk about the classes that you are creating. We have found that the biggest trouble students have with this assignment is figuring out how to connect the k-means algorithm to the two classes: Database and Cluster. We have structured the assignment that anyone can do Parts A and B without really understanding anything at all. But unless you know what k-means is doing, the assignment gets a lot harder after this point.


Cluster Analysis

Cluster analysis is the act of grouping together similar points into small groups. When we say point, it could be a 2-dimensional point, a 3-dimensional point, a 4-dimensional point, or even higher dimensional. In fact, for cluster analysis, a point is really just a list of numbers. Those numbers can either be ints or floats; we do not care.

For example, suppose we work in the candy industry. We decide to rate every candy product (using the scale 0 to 1) on how sweet it is, how sour it is, how many nuts are included, and how smooth it is. A sample candy bar might be labeled as follows:

Billy Bonker's Honey Nut Nonpariels (sweetness=0.73,sourness=0.37,nuttiness=0.74,texture=0.30)

We have just turned a piece of candy into a 4-dimensional point. Cluster analysis would allow us to to group candies that are "close" to one another. We might do this for market analysis, to make a candy that is similar to some of the best selling brands, or to make a candy that is very different from the rest of the market.

When we cluster points, "close" means according to the Euclidean Distance. You know how to compute the distance between two points in 2-dimensions, or even three dimensions. In general, suppose you have points \(\textbf{p} = (p_1, p_2, \ldots, p_n)\) and \(\textbf{q} = (q_1, q_2, \ldots, q_n)\). Then we would define the distance between those two points to be

\[ d(\textbf{p},\textbf{q}) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\cdots+(p_n-q_n)^2} \]

Distance calculations are always between points of the same dimension. For example, suppose we have the points \(\textbf{p} = [0.1,0.2,0.3,0.4]\) and \(\textbf{q} = [0.0,0.2,0.3,0.2]\)

\[ d(\textbf{p},\textbf{q}) = \sqrt{(0.1-0.0)^2+(0.2-0.2)^2+(0.3-0.3)^2+(0.4-0.2)^2} = 0.7071\ldots \]

When we create a cluster, we always have a centroid. The centroid is the center of mass of the cluster (e.g. the average of all of the points in the cluster). The centroid could be one of the points in the cluster, but it does not have to be. For example, the picture to the right is an example of a 4 point cluster with a centroid that is not one of the points in the cluster.

To compute a centroid, we average. Remember that to average, you add up a collection of things and divide by the total. The same is true for points; we add up all the points to get a very large point, and then divide that point by the total number of points. To add two points, we just add the coordinates individually:

\[ \textbf{p}+\textbf{q} = (p_1+q_1, p_2+q_2, \ldots, p_n+q_n) \]

We we divide a point by a number, we just divide each coordinate, as follows:

\[ \textbf{p}/a = (p_1/a, \>p_2/a,\> \ldots,\> p_n/a) \]

As you can see, this is a natural thing to do with a for-loop.


K-Means Clustering

One of the hardest part of clustering is figuring out how many clusters you should have. K-means simplifies this by starting out with a guess at the number of clusters (this is the meaning of the number k). All you have to do is group the data into that many clusters.

While that may sound like an unsatisfactory solution, it works well in practice. A user usually has a good idea ahead of time about the number of clusters. The user can run the program for several different values of k, and pick the one that gives the best results.

To implement k-means clustering, we break it down until several steps.

1. Pick k Centroids

We start with the number k. The next step is to guess k centroids. Throughout these instructions, we will refer to the centroids as \(\textbf{m}_1,\textbf{m}_2,\ldots \textbf{m}_k\). The letter m indicates that each centroid will represent a type of mean of our data points.

Remember that the centroids are themselves points. These points could have any dimension, so long as they all have the same dimension. We will use the number n to refer to the dimension. Hence, for any mean \( \textbf{m}_j \)

\[ \textbf{m}_j = (m_{j,1},m_{j,2},\ldots,m_{j,n}) \]

So each \(m_{j,i}\) represents the ith coordinate of the jth centroid.

To pick k centroids, all we do is chose k points from our original data set. We chose these points at random; we have provided you will a helper function in the assignment to make this step easier. This is referred to as the Forgy Method, and it is one of the most popular ways to start k-means clustering.

2. Partition the Dataset

Now that we have k centroids, we need to assign every point in the database to a centroid. This is know as partitioning; no two different centroids will have the same set of points. The points for each centroid will form a cluster.

To represent this mathematically, we create the sets \(S_1, S_2, \ldots, S_n\). The set \(S_j\) (which we will refer to as a cluster is the points associated with centroid \(\textbf{m}_j \).

The partition step involves puting points from our data set into the sets \(S_j\). For each point p in the data set we

  • Find the closest centroid \(\textbf{m}_j\). In the unlikely event of ties, we chose the one with j least.
  • Put p in the cluster \(S_j\).

3. Recompute the Mean and Repartition

Once we have put points in each cluster \(S_j\), we need to compute a new mean \(\textbf{m}_j\). This new mean is just the average of all of the elements in the set. Suppose that

\[ S_j = \lbrace \textbf{p}_{j,1},\textbf{p}_{j,2},\ldots,\textbf{p}_{j,c_j}\rbrace \]

So \(S_j\) has \(c_j\) points inside of it. Then the new mean is

\[ \textbf{m}_j = \frac{1}{c_j}(\textbf{p}_{j,1}+\textbf{p}_{j,2}+\cdots+\textbf{p}_{j,c_j}) \]

The formula above uses the rules for adding points and dividing points by numbers that we discussed above. If you do not understand this formula, please talk to a staff member immediately.

Now that you have new means, the points might be in the wrong cluster. We have to repartition the data set. To do this, empty out every set \(S_j\) and refill it with the points that are closest to the new centroid. In refilling the sets, you continue exactly as you did in step 2.

4. Repeat Step 3 Until You Converge

The clustering algorithm continues this process: recompute the means and repartition the data set. We do not want to do this forever, so we have to stop some time. The right time to stop is when the means stop changing. In other words, if all the means are unchanged after step 3, it is okay to stop. When this happens, we say that the algorithm has converged.

Sometimes it can take a very long time to converge — thousands of thousands of steps. Therefore, when we implement k-means clustering we often add a "time-out factor". This is a maximum number of times to repeat step 3. If it does not converge before this maximum number of times, we stop anyway, and use the clusters that we do have.


Implementing K-Means with Classes

Classes are ideal for representing complex mathematical objects. For example, we saw in class how to use classes to represent fractions. There are several interesting mathematical objects in the description above (points, means, sets/clusters) and we could use classes to represent all of these.

To cut down on the amount of code that you have to write, we will not be using classes for points and means. In particular, you should not use the Point class in tuple3d. That class only supports 3-dimensional points. As we saw in the candy example, we need to support 4-dimensional points, or even higher.

Throughout this assignment, a point will be represented as a list of numbers; both ints and floats are allowed. This is true for means as well. For example, here is an example of a possible mean if we were working with 4-dimensional points.

\[ \textbf{m}_j = [0.1,0.5,0.3,0.7] \]

In this example \(m_{j,2}\) (using the notation from the previous section is 0.5, while \(m_{j,4}\) is 0.7. Note that our mathematical notation starts at 1, while list indices start at 0.

Class Cluster

The first interesting mathematical type is the clusters \(S_j\). To support this new type, you will implement the class Cluster. Objects of this class will actually hold two different values: the mean \(\textbf{m}_j\) and the set \(S_j\). We do this because the mean and cluster are associated with one another, so it makes sense to store them in a single object.

Right away, we see that we are going to need two attributes:

  • _centroid: A list of numbers representing the point \(\textbf{m}_j\)
  • _contents: A list of points inside of the cluster \(S_j\)

In order for this class to work, we need a very strong invariant: all of the points in \(S_j\) must have the same dimension has \(\textbf{m}_j\). To keep track of our invariant, we add a third attribute:

  • _dimension: The dimension of all points in this cluster

Notice that all of the attributes have underscores before their names. This means that the attributes are meant to be hidden. A large part of this assignment is writing getters and setters for all these attributes. We will provide the specifications for these, but you must implement them.

One of the trickier part of this class is working with the attribute _contents. This is a list of points, and each point is a list of numbers (int or float). That means that _contents is a 2-dimensional list, just like we talked about in lecture. For example, if a cluster contains the points (0.1,0.5,0.3,0.7) and (0.2,0.0,0.2,0.1), then the attribute _contents is

     [[0.1,0.5,0.3,0.7], [0.2,0.0,0.2,0.1]]

The class Cluster needs some methods other than getters and setters. To see what methods we need, we look at the description of the k-means algorithm, and use that as motivation. We are going to need

  • A method to find the distance between a point and a cluster (distance)
  • A method to update the mean, as in step 3 of the algorithm (update)

As a word of warning, you will notice a few other attributes and methods in the class Cluster provided in the assignment source code. Those are added to make the market analysis activity easier. They have nothing to do with k-means clustering, and so we will ignore them for now.

Class Database

While describing the k-means algorithm, there was always this mysterious "data set" in the background. This was the set of all points that we were clustering. When we partition in step 2, we are looping over this data set to find all of our points. This means we need another class to represent this data set. That is the purpose of the class Database.

The class Database will not only keep track of the points in the data set; it will also manage the k-means algorithm itself. That means it needs the following attributes:

  • _contents: A list of points inside this data set
  • _ksize: The k-value for k-means clustering
  • _clusters: The list current clusters \(S_1, \ldots, S_k\)

As with class Cluster, we are going to add one more attribute to help us keep track of our invariants:

  • _dimension: The dimension of all points in this data set

Once again, all of these attributes are hidden, and they need getters and setters. As in class Cluster, the attribute _contents is a 2-dimensional list of numbers (so you will find yourself writing the same code in both Cluster and Database). However, attribute _clusters is list of Cluster objects, and not a 2-dimensional list. Therefore, its getter (and setter) will be a little different. Fortunately, we will give you directions for all of these.

The other methods of class Database all come from looking at the k-means clustering algorithm. We are going to need

  • A method to perform step 1 and initialize algorithm (actually handled by the setter setKSize)
  • A method to perform step 2 and partition the points (partition)
  • A helper method for partition to find the nearest cluster to a point (nearest)
  • A method to perform step 3 in the algorithm (step)

Note that there is no method for step 4 to repeat the algorithm. We could have added such a method, but we chose not to (because it complicated the class a bit more). Step 4 is going to be handled by our testing programs and additional functions.

That is it. We now know all we need to know about the classes Database and Cluster. Of course, we are going to give you a lot more hints, but an upper-level CS student could probably stop reading the directions at this point.


How to Work on the Assignment

As we said, this assignment is longer and more difficult than ones in the past. We do believe that you can do it. The trick is to pace yourself and make use of all of the unit test and the visualizer that we provide.


Assignment Source Code

Before you do anything else, you should download the zip archive A5.zip from this link. You should unzip it and put the contents in a new directory. This time, you will see that this directory contains a lot of files:

a5.py
This is the only file that you will need to modify. It is the Python file that you will submit for a grade.
a5test.py
This file is a unit test to provide you with even more help with this assignment. Is not a 100% complete unit test, but you are not required to finished it. We describe this in more detail below.
cornelltest.py
This is a brand new version of cornelltest. In testing, we found that all the lists made it difficult to use cornelltest. We have made several improvements (used in a5test.py) to make errors easier to read.
visualizer.py
This is a GUI that will help you see what is going on. We describe how it works below.
data.a5d
This is a sample data set to get you started with the visualizer. It contains only 3-dimensional points.
candy.txt
This is a set of candy data to use in the market analysis part of the assignment. Because candy is 4-dimensional, it cannot be used in the visualizer.
small_candy.txt
This is a smaller set of candy data. It is used by the unit test a5test.py to verify that your market analysis function is working correctly.

Pacing Yourself

You should start as this assignment soon as possible. The second prelim is due the Thursday before this is due. If you wait until after the prelim to start this assignment, you will have a hard time completing it. If you do one part of it every few days, you will be more relaxed and get it done on time.

In addition, if you start this assignment early, you will do a lot better on the exam. This assignment covers a lot of things that are on the second prelim, such as lists (particularly 2-dimensional lists), for-loops, and classes.

At the end of each part, we have a suggested completion date. While this is not enforced, we recommend that you try to hit these deadlines. If you cannot make this deadlines, it might be a sign that you are having difficulty and need some extra help.


Using the Unit Test

Ideally, you should always write your own tests. However, writing test cases for this assignment is quite difficult. Test cases are often multiple lines, since you have to initialize the object, and check the results. One of the primary methods relies on a random number generator, and that means that you are not guaranteed to have the same output every time. Finally, some of the tests require that we violate coding conventions and access hidden attributes directly.

Because of this we have written a unit test for you, in the file a5test.py. There is a separate test procedure for each part of the assignment. In addition, there are print statements inside of the test procedures, so you can determine exactly what part is failing at any given time.

This unit test script is quite long, and can be a bit complicated at points. However, if you learn to read what this script is doing, you will understand exactly what is going on in this assignment. You are encouraged to sit down with a TA or consultant to talk about the unit test, to help you understand it better. Note that there are a lot of new testing functions in cornelltest! You might want to use the help() function to look at them all.

With that said, this unit test is not complete. It does not have full coverage of all the major cases. It does not check that you are enforcing the preconditions and invariants (something we will check while grading). If you want to add additional tests for these, that is fine. However, we do not want you to submit the file a5test.py when you are done, even if you made modifications to the file.


Using the Visualizer

The visualizer is designed to be run as a script. To get it working, navigate to the assignment directory and type

   python visualizer.py

When you start the visualizer for the first time, it will look like the window shown below. You will see a empty plot of 3-dimensional axis, as well as some buttons and menus to the right. If you click on the window with the plot, you can move the axes. You will find this useful when you are trying to look at three-dimensional data.

This GUI is not going to do anything until you start implementing the various parts of this assignment. Once you have most of the assignment completed (up to Part E), you can use the visualizer to perform a k-means clustering on a three-dimensional data set.

If you have everything working, click on the button to select a file and load the file data.a5d. Your initialization step will pick three random centroids and partition the data. The result will look something like the following:

Note that each cluster is a different color. The points in the cluster are crosses, while the solid circle is the centroid.

You can use the pull-down menu to change the k-value. There is a limit on the number of k values you can choose because there are only so many different colors that we could think of. Whenever you change the k, it will reinitialize the centroids. It will also reinitialize the centroids when you hit the reset button.

Each time you hit the step button, it will perform step 3 of the k-means algorithm. You can watch the centroids change as you hit the button. When the algorithm converges, you will see the word "True" after Finished and the step button will no longer do anything. Below is a picture of a finished computation with final versions of the clusters.


Part I: Class Definitions

In the file a5.py you will notice definitions for two classes: Database and Cluster. These definitions have all of the methods that you will (eventually) implement. However, right now they are just stubs. The first part of the assignment is to complete these Class definitions. You will use these classes later.

Your implementation of these classes is broken into six tasks. Each task has a stopping point that allows you to test what you have done so far with the unit test script. You can also see what you have done with the visualizer.

We have given you a guideline of how to pace yourself on this assignment. As a general rule, you should spend two days on each section, finishing all of Part I before the first exam. If you are having difficulty, you can delay Part F until after the exam, but any slower than that is a sign that you are having difficulty and need more hints. In that case, you should see a staff member as soon as possible.


Part A: The Database Class

In the specification for Database, you will notice the following instance attributes:

  _dimension: the point dimension for this database 
              [int > 0. Value NEVER changes after initialization]
              
  _ksize:     the k value for k-means clustering [int >= 0]
  
  _contents:  the database contents
              [either a 2D list of numbers or empty list,
              the number of columns is _dimension]
              
  _clusters:  the Cluster objects for the k-means algorithm
              [list of Clusters or empty list, 
               len(_clusters) = _ksize]

All of the attributes start with an underscore. That indicates that they are meant to be hidden. They can be used in class methods, but outside of this class definition the user should only access them via getters and setters (However, you will note that we violate this rule in a5test.py for testing purposes).

You need to implement the initializer for this class, as well as the getters and setters.

Initializer

The initializer needs to create all four of the instance attributes above and assign them a value. We have provided the header and specification for the initializer, but it is wrong and needs to be modified. You will notice that the specification calls for a default value for the parameter contents. You need to modify the header to provide this default value.

Getters and Setters

The method headers and specifications for the getter and setters are provided in the module. For this part of the assignment, you are to implement the following methods:

  def getDimension(self):
      """Return: The point dimension of this database"""
  def getKSize(self):
      """Return: The k value for k-means clustering"""
  def getContents(self):
      """Returns: The database contents"""
  def appendContents(self,point):
      """Adds a COPY of point at the end of _contents.
        
        This method does not add the point directly. It adds
        a copy of the point.

        Precondition: point is a list of numbers (int or float), 
        len(point) = _dimension."""

  def clearContents(self):
      """Deletes all the database contents, making
        that attribute _contents into an empty list."""
  def getCluster(self,pos):
      """Returns: The cluster at position pos in _clusters

        Precondition: pos is an int in 0..k-1, 
        where k is the k-value"""

Asserting Preconditions

As a last step, you should go back to every method with preconditions and assert those preconditions. This includes both the initializer and the setters (and in one case, a getter). Recall that preconditions are how we preserve the invariants, so asserting the preconditions prevents the invariants from being violated.

You will note that some of these preconditions are quite complex. For example, the initializer requires that contents is a 2D list of either ints or floats. There are a several things to check here:

  • contents is a list.
  • Every element of contents is a list of all numbers (int or float)
  • Every element of contents has the same length.

As in the previous assignment, you are going to need a helper function (or helper method) to assert this precondition. But we did not provide the helper this time. It is up to you to define the helper function (or method) to use in the assert statement.

When you write your helpers, you need to remember to write a specification for each helper. The helpers should not have any preconditions. They should be prepared for any argument. It is not very useful to check preconditions with function or method that has its own preconditions.

Testing it Out

Once you complete the getters and setters for the Database class, you can run the unit test script. If you see a print statement that says

   Part A appears correct

then you can be pretty confident about this part. However, there is on caveat: we cannot check getCluster() yet. We have to wait until you complete the Cluster class in Part B to test this method.

You can also try your code in the visualizer. Run the visualizer and load the file data.a5d. You will see all points in the data set as black crosses in the plot window, as shown below.

None of the other buttons will work, however. In addition, you will see some error messages in the terminal saying VISUALIZATION FAILED. This is perfectly normal. These error messages are the visualizer letting us know that it was not able to form clusters yet and that it was shifting down to do Part A only.


Part B: The Cluster Class

The specification for the Cluster class mentions the following instance attributes:

  _centroid:  the cluster centroid
              [list of number (int or float)]

  _dimension: the point dimension of this cluster
              [int > 0. len(_centroid) = _dimension
               Value NEVER changes after initialization]

  _contents:  the current data in the cluster 
              [either a 2D list of numbers or empty list,
               the number of columns is _dimension]

Once again, all of the attributes start with an underscore. You need to implement the initializer for this class, as well as the getters and setters.

Initializer

The initializer header and specification is provided in the class definition. The header is correct this time; you just need to implement it.

You will notice that cluster has two attributes that we have not mentioned: _srating and _crating. These attributes are not needed at this time. You will not use them until Part F. If you want to leave them out, that is okay. It is also okay to put them in the initializer if you wish.

Getters and Setters

The method headers and specifications for the getter and setters is given below. Again, you will notice that not all instance attributes have setters; some only have getters. We do this when we do not want an attribute to change after initialization.

  def getDimension(self):
      """Return: The point dimension of this cluster"""
  def getCentroid(self):
      """Returns: The cluster centroid"""
  def setCentroid(self,point):
      """Sets the centroid to be a COPY of point.
        
        This method must copy point. You cannot assign it to
        the attribute _centroid directly
        
        Precondition: point a list of numbers w/ length _dimension"""
  def getContents(self):
      """Returns: The current data in the cluster."""
  def appendContents(self,row):
      """Adds a COPY of point at the end of _contents.
        
        This method does not add the point directly. It adds
        a copy of the point.
        
        Precondition: point is a list of numbers (int or float)"""
  def clearContents(self):
      """Deletes all the cluster contents, making
        that attribute _contents into an empty list."""

You will notice that a lot of these methods are very similar to the those in Part A. Just write the same code again.

Asserting Preconditions

As with the database class, you should look at every method with preconditions and assert those preconditions. This includes both the initializer and the setters. Once again, you will need to use helpers.

Testing it Out

Once you complete the getters and setters for the Cluster class, run the unit test. If everything is working, you will get the appropriate message. Now that Cluster has an initializer, it will also test out your method getCluster() from Part A.

If you run the visualizer, you will see that it has even more functionality. When you load the file data.a5d, you will see all points in the data set as blue crosses (not black) in the plot window. In addition, you will see a blue circle representing an initial centroid, as shown below.

There will still get the error message VISUALIZATION FAILED. This is normal. However, if you still see black instead of blue for your points, something is very wrong. In that case, you should look at your error message to find out what is wrong.

Try to finish both Parts A and B by Tuesday, November 5.

Part C: Partitioning the Data Set

Both the initialization step (step 2) and the partition step (step 3) in k-means clustering require you to partition the data. Remember that partitioning involves assigning each point to its nearest cluster. To do this, you need to implement several methods specified below.

Cluster Methods

Which cluster a point is assigned to depends on how far it is away from the centroid. This is the purpose of the method distance() in the class Cluster. Read the provided specification and implement the method.

When implementing distance(), remember what we said about Euclidean distance above. That definition had some ellipses in it. As we saw in class, this means you need a for-loop.

Database Methods

You need to implement two new methods in class Database. This first method is nearest; read the specification of this method. Using distance as a helper, this method computes the distance from the point to each cluster, and picks the one that is closest. It returns None if the attribute _clusters is empty.

The second, and more important, method to implement is partition. This method implements step 2 in the k-means clustering algorithm. Between the description above and the method specification, you should have enough information to implement this method.

Asserting Preconditions

All three methods have preconditions. You must assert them. Note that the precondition for partition is a precondition about an attribute, not a parameter.

Testing it Out

The unit test for Part C tests each of the methods seperately. Therefore you can (and are encouraged to) run the unit test after you complete each method above. The only catch is that they are tested in the order listed (first distance, then nearest and then partition), so you must complete them in that order for the test to work.

You can also test the part in the visualizer. When you load the file data.a5d, it will have three clusters colored as in the picture below.

If you have done it correctly, each point should have the same color as its nearest centroid.

Try to finish this part by Thursday, November 7.

Part D: Initializing K-Means

When ever you set the k-value (e.g. the attribute _ksize in Database), you must reinitialize the k-means algorithm. Therefore, step 1 in the algorithm actually takes place in the setter setKSize.

Obviously this method is a setter, just like the ones that you created in Parts A and B. However, it does a lot of things before it sets the attribute _ksize. Read the specification to see what you should do. In addition, you might want to review step 1 above.

The one tricky step involves picking the centroids to start. You want to pick k different points in the database. To do that, pick k random numbers between 0 and len(_contents); those are the (positions of) the points to use for the centroid. The points must be different, so the positions must all be different. To help you, we have provided the function random_range at the top of a5.py. This function guarantees to return a list of distinct random integers. See the specification of this function for more.

Important (November, 12): There seems to be some confusion with the function specification. In this problem, you are not supposed to clear the contents of the existing clusters (like you did in partition). Instead, you are supposed clear the cluster list (attribute _clusters) and make brand new clusters.

Asserting Preconditions

This method has a precondition. Assert it.

Testing it Out

Once again, you should test your code using the unit test provided in a5test.py. The visualizer should also work without error now. When you load the data file, it will create three clusters and color your points as before (even with the test code removed). More importantly, if you change the k value, it will change the number of clusters and their colors. The step button will not work, however.

Keep in mind that the choice of clusters is random. If you hit the reset button, you will see different clusters each time.

Try to finish this part by Friday, November 8.

Part E: The Update Step

You have implemented steps 1 and 2 of k-means clustering. It is now time to implement step 3 (step 4 is handled automatically for you in the visualizer). To do this, you need to implement methods in both Database and Cluster.

Cluster Method

In step 3, you need to update the centroid of each cluster via the update method. To understand what to do in this method, read the method specification, as well as step 3 in the description above.

Notice the return value for this method. Before changing the cluster centroid, you need to compare the new centroid with the old centroid. To best way to do this is with the function numpy.allclose in the numpy module. This compares two lists of numbers (either int or float) and returns True if they are close enough (and False otherwise). For example

    numpy.allclose([1.0,2.0],[1.0,2.0])

evaluates to True while

    numpy.allclose([1.0,2.0],[1.0,2.1])

evaluates to False.

Database Method

The final method to implement for k-means clustering is the step method. This method specification has been provided. Once again, you should note the return value. You need to check the result of update on all clusters. If they are all true, then this method returns True. But if you just one cluster returns False on update, then this method should return False as well. When step return True, we know that the algorithm is finished and our clusters are done.

Testing it Out

Once more, you should test your methods with the unit test. Both methods update and step are tested separately. Ff there is an error, this will help you determine which method has the error.

At this point, the visualizer will work completely. Play with it and search for clusters. Remember, because of the random step at initialization, you may not get the same result each time.

Try to finish this part by Sunday, November 10.

Part F: Extending Class Cluster

Congratulations, you have now completed the basic k-means clustering algorithm. The next step is to figure out how to apply this algorithm to do interesting things. This is what we do in Part II, which can be left after the exam. However, in testing Part II, we discovered that we can make it a lot easier if we add some extra features to class Cluster.

Normally, this would be an ideal place to make a subclass. We have the base features in class Cluster, which is necessary for all applications of k-means clustering. Since our extra features are unique to a specific application of clustering, we would put those in a subclass. However, you have already done enough work, so we are just going to put the new features in Cluster. Besides, you will get a lot of experience with subclasses in the next assignment.

Cluster Ratings

To use k-means in market analysis, we need to assign each cluster a rating. The rating of a cluster is the average of the rating of all the points in that cluster. Therefore, we need methods to get the current (average) rating, and to include a new rating in the average when we add a new point.

Averages cannot be stored in a single attribute. To compute an average, you need to keep track of both the sum of all the ratings, as well as the number of ratings so far. That is the purpose of the attributes _srating and _crating that we saw in Part B. If you did not assign them to 0 in the initializer, you should do so now.

Once you have done that, you should implement the methods getRating, addRating, and clearRating. They are fully specifed in a5.py. Note that addRating has a precondition that must be specified.

String Representations

We have provided you almost every method header for you in Part I. However, there are two methods that we have not added to the file. We want you to add str and backquotes support. We talked about the methods to do this in lecture.

In writing the method to support str, it should represent each cluster as its centroid. For example:

>>> c = Cluster([1.0,1.2,3.5])
>>> str(c)
'[1.0, 1.2, 3.5]'

The body of this method should just be one line. The spaces are added automatically by Python when you convert a list to a string.

For backquotes, you should have the same output as str, but have the class name in front. For example:

>>> c = Cluster([1.0,1.2,3.5])
>>> `c`
"<class 'a5soln.Cluster'>[1.0, 1.2, 3.5]"

Both methods need to be completely specified.

Testing it Out

There is a test procedure to test out Part F in the unit test. As theses are additional features, there is nothing to test them with in the visualizer. You will not need the visualizer for the rest of the assignment.

Try to finish this part by Monday, November 11. However, if you are pressed for time, this part can be left until after the exam.


Part II: Market Analysis

Now that you have implemented k-means clustering, we wanted to give you an opportunity to apply what you have learned. In particular, we wanted a problem that was a little more open-ended and gave you some creative flexibility. With that said, in our testing we discovered that this was making the assignment really long.

We have tried to simplify this assignment as much as possible, in order to give you a chance to work on Part II. But this where we really need to know how the class is doing. If the class is really struggling on Part II, we will give hints, or drastically shorten this part of the assignment. Of course that means that you must make a good faith effort to complete Part I before the second prelim. If you wait until after the prelim to complete the classes, we cannot help you.

Pay attention to Piazza for possible hints or updates on Part II


Analyzing Candy

In this problem, you will assume that you are a CEO of a candy company. You want to create a new product. However, you are fairly conservative, so you do not want to make a radically new product. Instead, you want to make a product that is similar to other best-selling candies.

The way to do this is simple. You look at all of the best-selling candies (not just any candies, but the best-selling ones). You then cluster them by properties like sweetness, sourness, and texture. You pick one of the centroids of this cluster (preferably the cluster with the absolute best sellers), and that is the candy that you should make.

To help you with this process, we have already done the market research for you. We have provided you with a file called candy.txt. Each row specifies a candy and its properties. For example, the first row of this file is

Hearsay Sour Semisweet Chocolate Mints,0.32,0.87,0.14,0.68,3.04

This is six items separated by commas. The items are (in order)

  • The name of the candy. It will almost always have spaces.
  • The candy sweetness. This is a float between 0 and 1 (where 0 = not sweet, 1 = extremely sweet).
  • The candy sourness. This is a float between 0 and 1 (where 0 = not sour, 1 = extremely sour).
  • The candy nuttiness. This is a float between 0 and 1 (where 0 = no nuts, 1 = all nuts).
  • The candy texture. This is a float between 0 and 1 (where 0 = smooth, 1 = hard and crunchy).
  • The rating. This is a float between 0 and 5 to measure its popularity (e.g the "stars")

Your market analysis should cluster candies by their four attributes. Do not include rating when performing your cluster analysis. We will come back to rating later.


The Analysis Function

In performing your market analysis, you should implement the function best_cluster. The stub and specification of this function can be found at the end of the file a5.py. The specification is quite long, but it is complete. But to make it a bit more clear, here is an outline of what you should be doing:

  1. You need to create a new Database instance.
  2. You need to read each line from the file, and turn it into a 4-dimensional point.
    • For each point, you add it the database only if the candy rating is above the cutoff.
      (The purpose of the cutoff is to ignore poor-selling candies and make analysis faster)
    • If no candy was above the cutoff, stop the function immediately and return None.
  3. You need to run k-means clustering on this database for the given k.
  4. Now that you have your final clusters, you need to compute a rating for each cluster:
    • For each candy (including those below the cutoff), find the nearest cluster
    • Add that rating to the cluster using the methods from Part F
  5. Find the Cluster object with the highest rating and return it.

Note that this is a lot of things to do. This is going to be a massive function if you do it all one place. Therefore, we are adding an additional rule:

Do not write any function in this module with more than 30 lines (not counting specification).

If you find yourself writing more than 30 lines of code for this function (or any other function) you need to pull out some of that code and put it into a helper function. We are not going to tell you what helper functions to create. That is up to you. Remember to write a specification for each helper function (including preconditions).

To help you with the function best_cluster, here are some additional hints and guidelines.

Assert the Preconditions

You should assert the preconditions in the function best_cluster. However, we are giving you a choice for your helper functions. If your helper functions are hidden (e.g. the name starts with an underscore), then you do not need to assert the preconditions. Asserts are to protect your code from outside users, and outside users should not be accessing hidden functions.

On the other hand, if you do not hide your helper functions, then we expect you to assert the preconditions for them as well.

Reading the Data from a File

Reading data from a file is easy. There is a built-in function open to open a file and create a file object. This object works a lot like a list; you can put it in a for-loop, and each line of the file will be assigned to the loop variable.

For example, suppose you wanted to write a Python script that prints out the name of all the candies. You write the following code:

  f = open('candy.txt')
  for line in f:
      pos = line.find(',')
      print line[:pos]
  f.close()

Note the close method. You should always close the file when done.

If you want more information about using files in Python, see Chapter 14 of your textbook.

Running K-Means Until Convergence

In Part I, the visualizer took care of Step 4 in k-means clustering. It is up to you now. You need to repeat the process until step returns true, or you have gone onfor maxstep steps, which ever comes first.

You cannot do this with a for-loop. While the for-loop could run for maxstep steps, it cannot (easily) finish early if step returns true. You could do it with recursion. However, you might get a call where maxstep = 1000, which is more than the number of call frames that Python allows.

The best solution is to use a while-loop. A while-loop is similar to a for-loop except that it has the form

  while <boolean-expression>
      body

A while-loop repeats the body as long as the boolean expresion is true. For example, if you wanted to print out 'Hello World!' 10 times, you could do it with the following Python code:

  count = 0
  while count < 10:
      print 'Hello World!'
      count = count+1

For more information on how to use a while-loop, see Chapter 7 of your textbook.


Completing the Analysis

Now that you have completed the analysis function, it is time to do your market analysis. Open up the interactive shell and try the function best_cluster on the data file candy.txt. You need to come up with the values k, maxstep and cutoff. We are not going to tell you want to use; it entirely is up to you.

As a word of warning, candy.txt has almost 20,000 candies in it. If your cutoff is too low, it will take a long, long time to run (we gave up after a few minutes in early testing). If your cutoff is too high, you will not get enough candies to make a cluster. So you have to experiment a bit.

To speed things up, you might want to turn assertions off. All those assert statements enforcing the preconditions are really slowing things down. If you start Python with command

    python -O

it will ignore all assert statements, regardless of whether they are True or not (type assert False after starting Python this way and see what happens). This will speed up your code significantly, though the cutoff still matters.

When you finally get an answer, the result will be a Cluster object. Use the str function to see the centroid of this cluster. That is the candy you want to make. You can round off the sweetness, sourness, etc. but your candy should be reasonably close to those values.


A Short Essay

In the docstring for the file kmeans.py, we want you to write a short essary describing what you did. In this essay, we want

  • The description of your candy.
  • Your choice of k and the rating cutoff to decide on this candy.
  • Some justification for why you chose that k and cutoff.

When decribing the candy tell us the sweetness, sourness, nuttiness and texture. Do not give a rating; that is up to your customers. In addition, you should make up a name. It can be anything you want, so long as it is not obviously offensive.


Finishing the Assignment

Once you have everything working you should go back and make sure that your program meets the class coding conventions. In particular, you should check that the following are all true:

  1. There are no tabs in the file, only spaces (recall, tabs look like arrows if whitespace is visible)
  2. Functions are each separated by two blank lines.
  3. Lines are short enough (80 chars) that horizontal scrolling is not necessary.
  4. The specifications for all of the functions are complete and are docstrings.
  5. Specifications are immediately after the function header and indented.
  6. All method preconditions are fully checked and asserted.

At the top of a5py.py you should have three single line comments with (1) the module name, (2) your name(s) and netid(s), and (3) the date you finished the assignment. Finally, the module docstring should contain your short essay.

Upload the file a5.py to CMS by the due date: Tuesday, November 19th at 11:59 pm. Do not submit any files with the extension/suffix .pyc. It will help to set the preferences in your operating system so that extensions always appear.


Survey

As usual, we ask that you complete the survey posted in CMS. The survey will ask about things such as how long you spent on the assignment, your impression of the difficulty, and what could be done to improve it. The survey is particularly important this time because it is the first time that we have ever given this assignment. We want to know whether we should keep it, or look for a new assignment next year.


Course Material Authors: D. Gries, L. Lee, S. Marschner, & W. White (over the years)