Assignment 6 - Spotify

CS 1110 Spring 2019

Due to CMS by Tuesday, April 16, 2019 @ 11:59pm


Overview

Music. We listen while studying, partying, running, coding, and dancing. We have tracks on our playlists to dance to at parties, and also guilty pleasure songs that we dance to when no one is watching. There are over 30 million songs on Spotify and yet often Spotify can recommend the near-perfect song. Spotify and Apple Music seem to know you better than you know yourself. How can this be? How can it be that Spotify knows what songs are pop and what songs are indie? Spotify and Apple Music don’t hire millions of people whose sole purpose is to categorize songs, so how is this possible? Companies use machine learning and analysis of big data to make applications smarter. It is the reason that Amazon knows what to recommend and the reason that Spotify can have a tailored playlist for you every Monday.

In this assignment you will work with one of the most popular (though not necessarily the most efficient) tools for analyzing large amounts of data: k-means clustering. In k-means clustering, you organize the songs (though it can be any data) into a small number (k) of clusters of similar songs. If k is large (i.e., there are lots of clusters), clustering could distinguish between early and late Beatles songs. But if k is small (e.g., k=3), it could put all your pop, country, and folk music in the same cluster. By the end of this assignment you will be able to classify music into different genres, recommend songs to yourself, and cluster your own playlist.

If you look at the Wikipedia entry for k-means clustering, you can see that it has many, many applications beyond music. It has been used in market analysis, recommender systems (e.g., how Netflix recommends new movies to you), and computer vision, just to name a few. It is an essential tool to have if you are working with data. With the skills and knowledge you have been acquiring in CS 1110, you are now able to implement it yourself!


Learning Objectives

This assignment is designed to give you practice with the following skills:

  • It gives you practice with writing your own classes.
  • It gives you practice with writing code spread across many files.
  • It illustrates the importance of class invariants.
  • It gives you practice with using data encapsulation in classes.
  • It gives you practice with both 1-dimensional and 2-dimensional lists.
  • It gives you experience with structuring your code with helper functions.
  • It demonstrates how to implement a powerful data-analysis algorithm on real data.
  • It gives you an idea of how companies like Spotify and Netflix make recommendations.

Before You Get Started

Note on Spring Break Overlap:

Unfortunately both A5 and A6 overlap with spring break. While you should start working on these the week of 3/25, we strongly urge you to take spring break off. Your mental health is way more important than a Python assignment. So put Python on the back burner for a week, and hang out with your family and friends, sleep a full night's rest, binge Netflix (Arrested Development is great), or go to the beach! Your Python knowledge will be cached away for when you get back.

-Have a Great Spring Break

The CS 1110 Course Staff

Academic Integrity

With the exception of your CMS-registered partner, we ask that you do not look at anyone else's code, seek solutions online or from students of past offerings of this course, or show your code to anyone else (except a CS 1110 staff member) in any form whatsoever. This is in keeping with the Academic Integrity Policy for CS 1110. It is okay to post error messages on Piazza, but not code. If we need to see your code, we will do that in person.

It is highly unlikely that your code for this assignment will look exactly like someone else's; we will be comparing submissions, looking for instances of plagiarism. We also ask that you not enable violations of academic policy: do not post your code to Pastebin, GitHub, or any other publicly accessible site.

Partner Policy

You may do this assignment with one other person. Form your group on CMS as soon as possible. This can only be done after the assignment is released on CMS, but it must be completed before you submit the assignment. Both people must do something to form the group. The first person proposes, and then the other accepts. CMS does not allow you to form groups once grades are released. Once you have grouped on CMS, only one person needs to submit the files.

If you do this assignment with a partner, you must work together. It is against the rules for one person to do some programming on this assignment without the other person sitting nearby and helping. Furthermore, this assignment is not designed for you to split the work cleanly between partners. Sit together and take turns "driving" (using the keyboard and mouse) and "navigating" (guiding the code that's being typed, spotting errors, keeping track of what's done and what's still on your todo list).


Development Environment

To do this assignment, Python must be set up properly. If you have not already done this, follow the installation instructions to set it up on your computer. Alternatively, you can just work in the ACCEL lab.

You should also create a folder on your hard drive that is dedicated to this assignment and this assignment only. Every time that you work on a new assignment, we want you to make a new folder, to keep things organized and avoid problems with naming collisions. Make sure that the command shell and Atom Editor are both open in the current folder before you start.

Assignment Help

If you do not know where to start, if you do not understand testing, or if you are completely lost, please see someone immediately. This can be the course instructor, a TA, or a consultant. Do not wait until the last minute. A little in-person help can do wonders. See the staff page for more information.


K-Means Clustering

In this section, we describe what cluster analysis is and how we use k-means clustering to implement it. Cluster analysis is the act of grouping similar points together. For example, suppose you are clustering your playlist, and that your songs are represented by points in 2 dimensions: words per minute and tempo. Clustering this data into groups could put all the rap music together, since rap songs tend to have a high words-per-minute count.

When we say point, it could be a 2-dimensional point, a 3-dimensional point, a 4-dimensional point, and so on. In fact, we could have points of arbitrary dimension, particularly if the data is not spatial. For example, Spotify has at least 11 dimensions for describing your songs.

Distances between Points

When we cluster points, we use the Euclidean Distance to measure the distance between them. You may already know how to compute the distance between two 2-dimensional points:

$d(\mathbf{p},\mathbf{q}) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$

or between 3-dimensional points:

$d(\mathbf{p},\mathbf{q}) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}$

These are special cases of the general definition: given two n-dimensional points $\textbf{p} = (p_1, p_2, \ldots, p_n)$ and $\textbf{q} = (q_1, q_2, \ldots, q_n)$ the Euclidean distance between them is:

$d(\textbf{p},\textbf{q}) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\cdots+(p_n-q_n)^2}$

For example, suppose we have two points: $\textbf{p} = (0.1,0.2,0.3,0.4)$ and $\textbf{q} = (0.0,0.2,0.3,0.2)$. Then:

$d(\textbf{p},\textbf{q}) = \sqrt{(0.1-0.0)^2+(0.2-0.2)^2+(0.3-0.3)^2+(0.4-0.2)^2} = \sqrt{0.05} \approx 0.2236$
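The general formula translates directly into a short Python function. The sketch below is only illustrative (the names distance, p, and q are ours); in the assignment itself, this computation belongs in the Cluster class.

```python
import math

def distance(p, q):
    """Return the Euclidean distance between two points.

    p and q are lists of numbers with the same length (dimension)."""
    total = 0
    for i in range(len(p)):
        total += (p[i] - q[i]) ** 2   # add up the squared differences
    return math.sqrt(total)

print(distance([0, 0], [3, 4]))  # 5.0 (the classic 3-4-5 right triangle)
```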

Cluster Centroids

Given a set of points, we can identify its centroid. The centroid is the “center of mass” of the cluster, which is to say, the average, or mean, of all of the points in the cluster. The centroid might happen to be one of the points in the cluster, but it does not have to be. For example, the picture to the right is an example of a 4-point cluster (black circles) with a centroid (red circle) that is not one of the points in the cluster.

To compute a centroid, we average. Remember that to average, you add up a collection of things and divide by the total. The same is true for points; we add up all the points, and then “divide” the result by the total number of points. Adding two points results in a new point that is the coordinate-wise addition of the two input points:

$\textbf{p}+\textbf{q} = (p_1+q_1, p_2+q_2, \ldots, p_n+q_n)$

When we divide a point by a number, the division is done coordinate-wise, as follows:

$\textbf{p}/a = (p_1/a, \>p_2/a,\> \ldots,\> p_n/a)$
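These two rules give us a simple recipe for computing a centroid in Python. Here is a minimal sketch (the function name centroid is just for illustration; in the assignment this computation is part of the Cluster class):

```python
def centroid(points):
    """Return the centroid (coordinate-wise mean) of a nonempty list of points."""
    n = len(points[0])              # the dimension of the points
    total = [0] * n
    for point in points:            # add up all the points, coordinate-wise
        for i in range(n):
            total[i] += point[i]
    return [coord / len(points) for coord in total]  # divide by the count

print(centroid([[1, 2], [3, 4], [5, 0]]))  # [3.0, 2.0]
```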

The Algorithm

The idea of k-means is quite simple: each point should belong to the cluster whose mean (centroid) it is closest to. But, the centroid of a cluster depends on which points you put into it. This creates a kind of chicken-and-egg problem, which is solved using an iterative approach: first assign points to clusters, then compute new centroids for these clusters, then re-assign points using the new centroids, and so on.

To make this general idea into a precise algorithm we can implement, we break down k-means clustering into several steps. Throughout, let n be the dimension of the points.

Step 1: Pick k Centroids

The first step is to guess k centroids. Throughout these instructions, we will refer to the centroids as $\textbf{m}_1,\textbf{m}_2,\ldots, \textbf{m}_k$. (The letter m stands for “mean”, or centroid, which is where the algorithm's name comes from.)

Remember that the centroids are themselves n-dimensional points. Hence, for any mean $\textbf{m}_j$, where $1 \leq j \leq k$, we have

$\textbf{m}_j = (m_{j,1},m_{j,2},\ldots,m_{j,n})$

So each $m_{j,i}$ represents the ith coordinate of the jth centroid.

To pick k centroids, all we do is choose k points from our original dataset. We will choose these points at random, as if drawing cards from a deck. This is referred to as the Forgy Method, and it is one of the most popular ways to start k-means clustering.

Step 2: Partition the Dataset

Now that we have k centroids, we assign every point in the data set to a centroid. We do this by matching each point to the nearest centroid $\textbf{m}_j$, and then have the points assigned to each centroid form a new cluster. This is known as partitioning because each point will belong to one and only one cluster.

If there happens to be a tie between which centroids are closest to a given point, we arbitrarily break the tie in favor of the $\textbf{m}_j$ with smallest j. If we put the centroids in a list, this would be the centroid that comes first in the list.

There will then be k sets $S_1, S_2, \ldots, S_k$ where set $S_j$ is the cluster consisting of all the points associated with centroid $\textbf{m}_j$.
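A partition step in this spirit can be sketched in plain Python. The assignment organizes this logic differently (across the Cluster and Algorithm classes), so treat the names below as illustrative only. Note how a strict < comparison automatically breaks ties in favor of the smallest j:

```python
import math

def nearest_index(point, centroids):
    """Return the index of the centroid closest to point (smallest index wins ties)."""
    best = 0
    best_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, centroids[0])))
    for j in range(1, len(centroids)):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, centroids[j])))
        if d < best_dist:           # strict <, so a tie keeps the earlier centroid
            best, best_dist = j, d
    return best

def partition(points, centroids):
    """Return k lists of points, one per centroid, forming the sets S_1,...,S_k."""
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest_index(p, centroids)].append(p)
    return clusters

print(partition([[0, 0], [10, 10], [1, 0]], [[0, 0], [10, 10]]))
# [[[0, 0], [1, 0]], [[10, 10]]]
```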

Step 3: Recompute the Means

Once we have the sets $S_1, S_2, \ldots, S_k$, we need to compute a new mean $\textbf{m}_j$ for each $S_j$. This new mean is just the average of all of the points in that set. Let

$S_j = \lbrace \textbf{q}_{1},\textbf{q}_{2},\ldots,\textbf{q}_{c_j}\rbrace$

be the set of points associated with centroid $\textbf{m}_j$, where $c_j$ is the number of points inside of $S_j$. Then the new mean is the n-dimensional point

$\textbf{m}_j = \frac{1}{c_j}(\textbf{q}_{1}+\textbf{q}_{2}+\cdots+\textbf{q}_{c_j}).$

The formula above uses the rules for adding points and dividing points by numbers that we discussed above. If you do not understand this formula, please talk to a staff member immediately.

Now, because we have recomputed the means, some points might be closer to the centroid of a cluster they are not currently in than to the centroid of the cluster they are in. To fix this, we repartition the dataset by going back to step 2 and proceeding as before.

Step 4: Repeat Steps 2 and 3 Until Convergence

The clustering algorithm continues this process: recompute the means and repartition the dataset. We want to keep doing this until the means stop changing. If all the means are unchanged after step 3, we know that partitioning again will produce the same clusters, which will have the same means, so there is no point in continuing. When this happens, we say that the algorithm has converged.

Sometimes it can take a very long time to converge, perhaps many thousands of steps. Therefore, when we implement k-means clustering, we often add a “time-out factor”: a maximum number of times to repeat steps 2 and 3. If the algorithm does not converge within this limit, we stop anyway and use the clusters that we do have.
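Putting steps 1–4 together, the whole algorithm (with a time-out factor) can be sketched as one standalone function. This is only an outline of the idea, not the structure the assignment asks for; names like k_means and max_steps are our own.

```python
import math
import random

def k_means(points, k, max_steps=100):
    """Cluster points into k clusters and return the final list of centroids."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Step 1: the Forgy method -- pick k random points as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(max_steps):                  # the "time-out factor"
        # Step 2: partition -- assign every point to its nearest centroid
        # (min returns the first minimum, so ties favor the smallest j)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[j].append(p)
        # Step 3: recompute each mean as the average of its cluster
        new_centroids = []
        for j in range(k):
            if clusters[j]:
                n = len(clusters[j][0])
                new_centroids.append(
                    [sum(p[i] for p in clusters[j]) / len(clusters[j])
                     for i in range(n)])
            else:                               # keep an empty cluster's centroid
                new_centroids.append(centroids[j])
        # Step 4: stop once the means no longer change (convergence)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Two obvious groups of points converge to the two group means:
print(sorted(k_means([[0, 0], [0, 1], [10, 10], [10, 11]], 2)))
# [[0.0, 0.5], [10.0, 10.5]]
```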


Classes Used in Our Implementation

Classes are ideal for representing complex mathematical objects. For example, we saw in class how to use classes to represent portfolios or Cornell students. There are several interesting mathematical objects in the description above (points, centroids, sets/clusters) and we can use classes to represent all of these. Deciding which of these things to create classes for is the core of object-oriented software design. For this assignment we have made these decisions for you, since it is not always easy to hit the right balance of complexity against structure the first time, and we want you to spend your time implementing rather than redesigning. You will have plenty of time to create your own classes in the future.

For our implementation of k-means, we decided not to use classes to represent points and centroids, because Python's built-in lists serve well enough. In particular, you should not use the Point class from the introcs module. That class only supports 3-dimensional points, but we would like to support points of any dimension.

We have created classes for three important concepts in k-means. First, the entire dataset (all the songs in your playlist) is represented by an instance of the class Dataset. Second, clusters of data are represented by an instance of the class Cluster. This would represent a group of similar songs. Finally, the set of clusters created by k-means is represented by an instance of the class Algorithm.

Points are Lists

Throughout this assignment, a point will be represented as a list of numbers; both ints and floats are allowed. This is true for centroids as well. For example, here is an example of a possible centroid if we were working with 4-dimensional points:

$\textbf{m}_j \mbox{ is represented as the list } [0.1, 0.5, 0.3, 0.7].$

In this example, if x stores the name of the list, then $m_{j,2}$ (using the notation from a previous section) is x[1], or 0.5, while $m_{j,4}$ is x[3], or 0.7. Note that our mathematical notation starts at 1, while list indices start at 0.
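In code, that off-by-one between the math notation and Python's indexing looks like this:

```python
x = [0.1, 0.5, 0.3, 0.7]  # a 4-dimensional point (or centroid)

print(x[1])  # 0.5 -- the coordinate our math notation calls m_{j,2}
print(x[3])  # 0.7 -- the coordinate our math notation calls m_{j,4}
```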

Class Dataset

The k-means algorithm starts with a dataset—a collection of points/songs that will be grouped into clusters. This dataset is stored in an instance of the class Dataset. This class is very simple; it just holds the list of points, so its most important attribute is:

  • _contents: A list of points inside this dataset

For example, if a dataset contains the points (0.1,0.5,0.3,0.7) and (0.2,0.0,0.2,0.1), then the attribute _contents is:

[[0.1,0.5,0.3,0.7], [0.2,0.0,0.2,0.1]]

The points are all lists of numbers, and they all have to be the same length; remember that this length is the dimension of the data. We keep track of this dimension in a second attribute:

  • _dimension: The dimension of the points in this dataset (an int)

The fact that all the points should have the right dimension is expressed by a class invariant:

The number of columns in _contents is equal to _dimension.

That is, for every item _contents[i] in the list _contents, len(_contents[i]) == _dimension. You might argue that storing the dimension separately is redundant, since you could always look at the points in _contents to find it, but we would like the dimension to be defined even before we add any points to the dataset.
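In Python, checking this invariant is a short loop over the rows (or a call to the built-in all). Here is a sketch with illustrative names:

```python
def satisfies_invariant(contents, dimension):
    """Return True if every point in contents has exactly dimension coordinates."""
    return all(len(point) == dimension for point in contents)

print(satisfies_invariant([[0.1, 0.5, 0.3, 0.7], [0.2, 0.0, 0.2, 0.1]], 4))  # True
print(satisfies_invariant([[0.1, 0.5], [0.2]], 2))                           # False
```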

We may also want to use the ids of the songs, which are stored in the attribute _song_ids. You may also think of this attribute as storing a unique name for each point.

  • _song_ids: A list of ids of the songs in the dataset

Thus, every song (or data point) is described by two things: its point, stored at a certain index in _contents, and its song id, stored at the same index in _song_ids. If a song is the 5th song in the dataset, the 5th element of _contents is the point corresponding to the song id at the 5th position of _song_ids. Sometimes this attribute is unnecessary; perhaps we are clustering random points that are not songs. In that case _song_ids is just an empty list.

You might notice that these attribute names all start with an underscore. As we mentioned in class, this is a coding convention that indicates that these attributes should not be accessed directly from outside the Dataset class. In particular, neither Cluster nor Algorithm should access these attributes at all. When those two classes need to access attributes of Dataset, they should do so using getters and setters. You will implement these getters and setters.

Important: If a method of any class accesses the hidden attributes of an object of another class without a getter or setter, you will lose points.

Class Cluster

The next idea in the algorithm that we have defined a class for is a cluster $S_j$. To support this type, we defined the class Cluster. Objects of this class hold two different values: the centroid $\textbf{m}_j$ and the set $S_j$ (plus an optional name, described below). We do this because the centroid and cluster are associated with one another, so it makes sense to store them in a single object. We need two attributes corresponding to these ideas:

  • _centroid: A list of numbers representing the point $\textbf{m}_j$,
  • _indices: A list of integers that are the indices where the points of this cluster can be found in the dataset.

Note that we are not storing the points belonging to a cluster; we are only storing the indices of the points in the dataset. This simplifies Cluster because it does not need to worry about maintaining lists of lists. Code that works with clusters does need to be able to get to the points, however, so one of its instance variables is a reference to the dataset:

  • _dataset: The dataset of which this cluster is a subset.

We also have an attribute _name. Sometimes at the end of the algorithm, it would be nice to describe these clusters. For example, perhaps one of the clusters represents all of the country songs in the playlist. We could thus label this “Country”. If we do not have a name for this cluster we can leave it as the empty string.

  • _name: The name that describes the cluster

Once again, these attributes start with an underscore, so we want getters and setters if we are going to access them in the class Algorithm (it is okay if a method in Cluster accesses one of its own attributes with an underscore in it). So again, you need getters, like getCentroid and setters like addIndex. There is also a special getter getContents that returns a copy of the points to be used by the visualizers.

In addition to the getters and setters, the class Cluster is the logical place to put code that does computations involving a single cluster. Reading the description of the k-means algorithm above, we find that the two core operations with clusters are computing the distance from a point to a cluster's centroid and finding a new centroid for the cluster. These operations are handled by the two most important methods of the Cluster class:

  • A method to find the distance from a point to the cluster (distance).
  • A method to compute a new centroid for a cluster (update).

Finally, this class contains the special Python methods __str__ and __repr__. These methods are to help you with printing clusters more informatively while debugging.

Class Algorithm

The k-means algorithm needs to work with several clusters at once. Hence it does not make sense to put this code in the Cluster class. Instead, we put this code in a separate class called Algorithm. The data stored in this class includes a list of all clusters in the algorithm, and as with Cluster, a reference back to the dataset.

  • _dataset: The dataset we are clustering
  • _clusters: The list of current clusters S1,…,Sk

Once again, we have getters and setters for these attributes. But the important methods of class Algorithm are the core of the k-means algorithm.

  • _partition: A method to partition the dataset, implementing step 2 of the algorithm
  • _nearest: A helper method for _partition to find the nearest cluster to a point
  • _update: A method to update all the cluster centroids, implementing step 3

Finally there are the methods that orchestrate the whole process:

  • step: A method that executes one step of the process, updating the centroids and re-partitioning the data
  • run: A method that computes a clustering, from start to finish

There is also a method findTotalError, which sums up the error of all the clusters.

A user who wants to cluster a dataset into k clusters would create a Dataset, add all the points to it, create an Algorithm with k clusters, and then call run for the algorithm object.

You will notice that some of the methods have names starting with an underscore. The meaning is the same as with attributes: these methods are meant to be used internally rather than called directly by code outside the Algorithm class.


How to Work on the Assignment

This assignment involves implementing several pieces before you'll have the whole algorithm working. As always, the key to finishing is to pace yourself and make effective use of all of the unit tests and the visualizer that we provide.

Assignment Source Code

This assignment's code base is more complicated than in previous assignments. To make sure that you have everything, we have provided two zip files.

cluster.zip: The package for the assignment application code
datasets.zip: Sample data files for clustering

You should download the zip archive cluster.zip from the link above. Unzip it and put the contents in a new directory. This time, you will see that this directory contains a lot of files. Most of these files are not all that important; they are similar to a3app.py in that they drive a GUI application that you do not need to understand. In fact, you only need to pay attention to these files:

  • a6dataset.py: contains the Dataset class. You need to have completed it before you can start the visualizer.
  • a6cluster.py: contains the Cluster class. You will need to complete it to visualize a single cluster and its centroid.
  • a6algorithm.py: contains the Algorithm class. You will need to complete it to get all of the features of the visualizer.
  • a6checks.py: contains the helper functions for enforcing preconditions. Since you will need these functions for all three classes, they are put in their own separate file.
  • a6test.py: contains test cases for the assignment. You should read this, even though you don't have to modify it to complete the assignment. It is described in the next section below.
  • spotify.py (optional): contains the code that clusters your Spotify songs. You only need to modify the fields at the top with your username and the playlist you want to cluster.

These are not the only files, but these are the ones to read and pay attention to. Do not download the individual files. Download the zip files provided. That way you can guarantee you have everything.

Running the Application

Because there are so many files involved, this application is handled a little differently from previous assignments. To run the application, keep all of the files inside of the folder named cluster. Do not rename this folder. To run the program, change the directory in your command shell to just outside of the folder cluster and type

python cluster --view

In this case, Python will run the entire folder. What this really means is that it runs the script in __main__.py. This script imports each of the other modules in this folder to create a complex application.

The script __main__.py is actually quite sophisticated. If you look at the code in this file, you will see that it actually manages multiple applications under one roof. Try typing

python cluster --test

The application will run test cases (provided in a6test.py) on all of your classes. You will want to rely on this, because the visualizer only works on two-dimensional data.

Finally, if you type

python cluster file.csv 5

the application will load the CSV (comma separated values) file file.csv and perform k-means clustering with 5 clusters, and print out the results.

Using the Test Cases

As we said, executing the application with the --test option will run test cases on the classes Dataset, Cluster and Algorithm. These test cases are designed so that you should be able to test your code in the order that you implement it. However, if you want to "skip ahead" on a feature, you are allowed to edit a6test.py to comment out a test. Those tests are simply there for your convenience.

This unit test script is fairly long, but if you learn to read what this script is doing, you will understand exactly what is going on in this assignment and it will be easier to understand what is going wrong when there is a bug. However, one drawback of this script is that (unlike a grading program) it does not provide a lot of detailed feedback. You are encouraged to sit down with a staff member to talk about this test script in order to understand it better.

With that said, this unit test is not complete. It does not have full coverage of all the major cases, and it may miss some bugs in your code. It is just enough to ensure that basic features are working. You may lose points during grading even if you pass all the tests in this file (our grading program has a lot more tests). Therefore, you may want to add additional tests as you debug. However, we do not want you to submit the file a6test.py when you are done, even if you made modifications to the file.

Asserting Preconditions

Once again, we want you to get into the habit of asserting preconditions. Unlike the last assignment, we are not making a distinction between enforced and normal preconditions. If it is a precondition, you should enforce it with an assert statement. Some of the preconditions are quite complex this time. For example, the method addPoint in Dataset requires that point is a list of numbers. There are two different things to check here:

  • point is a list
  • Every element of point is a number (int or float)

The second part requires a for-loop (or equivalent) to check, and you cannot put a for-loop inside an assert statement. The best way to handle this precondition is to write a helper function that returns True if both conditions hold, and False if not. We have provided this helper, is_point, for you in a6checks.py. To use it in the method addPoint, you simply write

assert is_point(point)

As with Assignment 4, we are not requiring any error messages with your assert statements. Note that this is not the only part of the precondition in addPoint to enforce, but it is the hardest part.
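We have not reproduced the provided helper here, but a function in that spirit might look like the sketch below (the real is_point is in a6checks.py; this version is only an illustration):

```python
def is_point(value):
    """Return True if value is a list of numbers (ints or floats); False otherwise.

    Like all precondition helpers, this function has no preconditions of its
    own: it must be prepared for any argument."""
    if not isinstance(value, list):
        return False
    for item in value:
        if not isinstance(item, (int, float)):
            return False
    return True

print(is_point([0.1, 2, 3.5]))   # True
print(is_point('not a point'))   # False
print(is_point([1, 'two']))      # False
```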

As you work on this assignment, you may decide to write more helpers to enforce your preconditions. These helpers should go in the file a6checks.py. When you write your helpers, you need to remember to write a specification for each helper. The helpers should not have any preconditions. They should be prepared for any argument. It is not very useful to check preconditions using a function or method that has its own preconditions.


Optional: Using the Visualizer

K-means clustering is cooler when you can see what is actually going on. If you run the application with the --view option, you will get a visualizer that will help you see the application in action. It is much more limited than the unit test because you can only visualize 2-dimensional points. It will also only work properly once you have completed the entire assignment. However, even if you have not completed the assignment, it is still useful and will show some information.

When you start the visualizer for the first time, it will look like the window shown below. You will see an empty plot of a 2-dimensional axis, as well as some buttons and menus to the right. You will also get an error message telling you to finish Dataset.

As the error message says, this visualizer is not going to do anything until you start implementing the various parts of this assignment. Once you have most of the assignment completed (up to Step 3 of Algorithm), you can use the visualizer to perform a k-means clustering on a two-dimensional data set.

Once you have everything working, click on the drop-down menu to select a dataset, like sample1. These options correspond to the CSV files in the data directory. Your initialization step will pick three random centroids and partition the data. The result will look something like the following:

Note that each cluster is a different color. The points in the cluster are crosses, while the solid circle is the centroid.

You can use the pull-down menu to change the k-value. There is a limit on the number of k values you can choose because there are only so many different colors that we could think of. Whenever you change the k, it will reinitialize the centroids. It will also reinitialize the centroids when you hit the reset button.

The overlay option changes how the centroid is drawn. If it is False, the centroid is a simple, small circle. If it is True, it grows into a large transparent circle containing (most of) the cluster.

The overlay option requires that you finish the getRadius method in Cluster. However, it is not perfect, and you should not assume that your getRadius is wrong if a few points stray outside of the overlay.

Each time you hit the step button, it will perform one step of the k-means algorithm, updating the centroids and re-partitioning the data. You can watch the centroids change as you hit the button. When the algorithm converges, you will see the word "True" after Finished, and the step button will no longer do anything. Below is a picture of a finished computation with final versions of the clusters.

This visualizer will work with any valid CSV file (see below). You can load a new file by selecting <select file>. For example, try the file small.csv in datasets.zip. This is a small set of 2-dimensional points. For data files with more than two dimensions (like the candy data sets), it will only use the first two values of each point. For data sets that are one-dimensional (like flat.csv), it will use 0.5 for the y values, creating a line in the middle of the window.

Important: There is a minor bug that was introduced in MacOS with High Sierra. When you use the visualizer to load a file, you will get an error message that starts with Class FIFinderSyncExtensionHost is implemented in both... This is harmless. You can ignore it.

Using CSV Files

While working on this assignment, you might want to work with your own datasets. The cluster package can process any dataset you want, of any dimension you want, so long as it is in a proper CSV file. A CSV file is a comma separated value file, and these can be generated by most spreadsheet programs like Excel. CSV files represent simple tables. They have a header row with the names of each column, but then every row after the first only contains values. The files in datasets.zip are examples of CSV files.

The dataset CSV file must have a particular format. All of the values in the non-header rows must be numbers. The only exception is if you have a column named COMMENTS. You can write whatever you want in this column; it is a way of labeling your data. We have done that in the candy datasets.

To process a file, type the command

python cluster file.csv k

where k is a number and file.csv is the name of a CSV file. This will perform k-means clustering with k clusters on the file file.csv and print the result on your screen. If you would like to output the results to a new CSV file (which will have additional columns for the centroid and cluster number), you can do that with the --output option, like this:

python cluster file.csv k --output output.csv

You can also use this command with the --view option to open the visualizer on a custom dataset.


Recommended Implementation Strategy

In the files a6dataset.py, a6cluster.py, and a6algorithm.py we have provided the definitions for the classes discussed above. You have the class invariants and all method specifications, but almost all the method bodies are just stubs. Your task is to implement all the stubbed methods in these classes. We recommend starting with Dataset, since that is needed by the other classes. Then implement Cluster, and test that all its methods are working correctly. Finally implement and test Algorithm. When that is working correctly, your assignment is finished.

Task 1: The Dataset Class

This class has seven methods: the initializer, a method for standardization, four getters, and a setter. The specifications of all of these methods are included in the file a6dataset.py. The getters are the most straightforward; most of the time, you just need to return the attribute.

The only tricky getter is getPoint. Remember that when you return a list, you are returning the name of the folder. So, once we get a point from the Dataset, we could change its contents, potentially corrupting the dataset. getPoint always copies the point before it returns it, ensuring that we cannot accidentally corrupt the data set. The unit test a6test.py has a test to make sure that you did this correctly.

Similarly, when you implement the initializer, note that the specification says that the class is supposed to keep a copy of the parameter contents. In this context this means that you need to set the _contents attribute to a new list that contains copies of all the points in contents. Just writing contents[:] won't do it; that creates a new list that has references to the same points. A question to consider is: how do you copy all those individual lists? You may need a loop of one form or another.
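
To see why the distinction matters, here is a short sketch (the variable names are illustrative, not code from the assignment):

```python
original = [[1.0, 2.0], [3.0, 4.0]]

# A slice copies only the outer list; the inner points are shared.
shallow = original[:]
shallow[0][0] = 99.0
print(original[0][0])   # 99.0: the original data was corrupted

original = [[1.0, 2.0], [3.0, 4.0]]

# Copying each inner list as well keeps the original safe.
deep = [point[:] for point in original]
deep[0][0] = 99.0
print(original[0][0])   # 1.0: the original data is untouched
```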

Finally, when you implement addPoint, be sure that it also adds a copy of the provided point to the dataset.

Asserting Preconditions

As a last step, you should go back to every method with preconditions and assert those preconditions. This includes both the initializer and the setters. Recall that preconditions are how we preserve the invariants, so asserting the preconditions prevents the invariants from being violated.

As we explained above, you may want helper functions to assert the preconditions. For example, the initializer requires that contents is a 2D list of either ints or floats. There are three things to check here:

  • contents is a list.
  • Every element of contents is a point (list of numbers).
  • Every element of contents has the same length.

It would be best to write a helper function called is_point_list to check these three conditions. As we have said before, this helper should go in a6checks.py.
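
One possible shape for such a helper (a sketch under the three conditions above; the exact checks and error messages are up to you):

```python
def is_point_list(value):
    """Return True if value is a 2D list of ints/floats with rows of equal length."""
    if not isinstance(value, list):
        return False
    if not all(isinstance(row, list) for row in value):
        return False
    if not all(isinstance(x, (int, float)) for row in value for x in row):
        return False
    # Every row must have the same length as the first row
    return all(len(row) == len(value[0]) for row in value)
```

The initializer would then start with something like assert is_point_list(contents).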

Testing it Out

Once you complete the Dataset class, you can run the unit test script. If you see output that says

class Dataset appears correct

then you can be pretty confident about this part.

You can also try your code in the visualizer. Run the visualizer and load a data file as described above. The points will only be displayed as black crosses, and the algorithm controls will not work, but you will be able to see the data.

If you click on the Reset button, you will see some error messages in the terminal saying FAILED VISUALIZATION, or nothing may happen. This is perfectly normal. These error messages are the visualizer letting us know that it was not able to form clusters yet, so it shifted down to testing just the Dataset part of your code.

Task 2: The Cluster Class

The class Cluster is a little more involved than Dataset, so we have broken it up into two parts: Part A and Part B. Each part has its own test procedure in a6test.py, so you can fully test one part before moving on to the next.

Part A: Getters and Setters

The specification for the Cluster class mentions the following instance attributes:

"""
INSTANCE ATTRIBUTES:
	_dataset [Dataset]: the dataset this cluster is a subset of
	_indices [list of int]: the indices of this cluster's points in the dataset
	_centroid [list of numbers]: the centroid of this cluster
	_name [str]: an optional label for the centroid. Can be the empty string
EXTRA INVARIANTS:
	len(_centroid) == _dataset.getDimension()
	0 <= _indices[i] < _dataset.getSize(), for all 0 <= i < len(_indices)
"""

Once again, these attributes begin with an underscore, which means that you will need to implement getters and setters. Once again these are straightforward. The only unusual one is getContents. The return value is supposed to be a list containing (copies of) the points themselves, not the indices of those points. You will need to loop over the indices, creating copies of the corresponding points from the dataset and collecting those copies in a newly created list.

You will also need to implement the initializer in this part of the assignment. When you are done, remember to assert all your preconditions. Look very carefully at the precondition for dset. In particular, use isinstance, not type.

Part B: distance, getRadius, update, and findError

The last four methods are more complex and are necessary for the k-means algorithm. To implement distance, refer to the definition of euclidean distance given earlier in this document. The method getRadius should loop over all of the points in the cluster, compute their distance to the centroid, and return the largest value.
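
As a standalone sketch of the two computations (euclidean_distance and radius are hypothetical stand-ins; the real methods work on the cluster's centroid and contents):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def radius(points, centroid):
    """Largest distance from any of the points to the centroid."""
    return max(euclidean_distance(p, centroid) for p in points)
```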

The method update is the most complex in this class. You will need to first go through all the points, adding up their coordinates: the sum of the first coordinates of all the points in the cluster, the sum of all the second coordinates, and so on. After that, you will need to divide by the number of points in the cluster. Finally, you need to decide whether the centroid has changed. The function numpy.allclose is mentioned in the specification; it takes two lists of numbers and tells you whether the corresponding numbers from the two lists are all very close in value.
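
The averaging step, in isolation, looks something like this (mean_point is a hypothetical helper, not part of the assignment's interface):

```python
import numpy

def mean_point(points):
    """Coordinate-wise average of a nonempty list of points."""
    dimension = len(points[0])
    totals = [0.0] * dimension
    for point in points:
        for i in range(dimension):
            totals[i] += point[i]
    return [total / len(points) for total in totals]

old_centroid = [1.0, 2.0]
new_centroid = mean_point([[0.0, 1.0], [2.0, 3.0]])   # [1.0, 2.0]

# allclose reports whether the centroid (effectively) stayed put
unchanged = numpy.allclose(old_centroid, new_centroid)
```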

The method findError finds the error of the cluster: the sum of the squared distances from each point in the cluster to the cluster's centroid.
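
A sketch of that computation on plain lists (squared_error is a hypothetical stand-in for the method):

```python
def squared_error(points, centroid):
    """Sum of squared distances from each of the points to the centroid."""
    total = 0.0
    for p in points:
        total += sum((a - b) ** 2 for a, b in zip(p, centroid))
    return total
```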

When you are done, remember to assert all your preconditions.

Testing it Out

Once you complete each of these two parts, run the unit test. If everything is working, you will get the appropriate message.

Once you complete both parts, you will see that the visualizer has a lot more functionality. When you load up a data file and click the Reset button, you will see all points in the data set as blue markers in the plot window. In addition, you will see a small blue circle with a black border representing an initial centroid. Finally, you can turn the small blue circle into a big one by choosing the overlay option.

There will still be the error message FAILED VISUALIZATION printed out somewhere in the command shell. This is normal.

Task 3: The Algorithm Class

The last class to implement is the Algorithm class. As before, you need to implement the methods following their specifications. The methods in this class do a lot, but because of all the work you did in the lower-level classes, the code for these methods is reasonably short.

Once again, we have broken this assignment up into several parts. Each part has its own test procedure in a6test.py.

Part A: The Initializer

The initializer has the job of setting up the clusters to their initial states so that the first iteration of the algorithm can proceed. Each cluster needs its centroid initialized, but will start with no data points in it until the first partitioning step. You will need to create a list of Cluster objects, and each one needs to be created using a point for its centroid. We draw these points at random from the dataset, but we do not want to select the same point twice, so simply selecting k random indices into the dataset will not do it.

Fortunately, in the random package there is a function random.sample that selects a number of items from a list. The returned list has no duplicates in it. You should use this function to select which points from the dataset will become the seeds of clusters. The specification for the initializer also supports the caller specifying exactly which points to use as cluster seeds. This is primarily for testing. Your code will have to do something a bit different depending on whether the seeds parameter is supplied. Once again, remember to assert all your preconditions.
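
The seeding idea can be sketched like this (the data and k here are made up; the real initializer wraps each seed point in a Cluster):

```python
import random

data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
k = 3

# sample draws k distinct items, so no index is chosen twice
seed_indices = random.sample(range(len(data)), k)
seeds = [data[i][:] for i in seed_indices]   # copy each seed point
```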

There is only one getter to implement in Part A. Since Algorithm accesses other classes, but no classes access it, it does not need a lot of getters.

Part B: The Partitioning

In the next step, you implement the two methods necessary to partition the data set among the different clusters. When implementing the _nearest method, you should use a loop over the list of clusters, since they are the possible candidates. During the execution of your loop, you may want to keep track of both the nearest cluster found so far and the distance to the nearest cluster found so far. An interesting question to consider (with more than one good answer) is what you should initialize these values to.
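
The running-minimum pattern in isolation (a sketch with plain points standing in for Cluster objects; initializing the best distance to infinity is one reasonable choice):

```python
import math

def nearest(point, centroids):
    """Return the centroid closest to point."""
    best, best_distance = None, float('inf')
    for centroid in centroids:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, centroid)))
        if d < best_distance:
            best, best_distance = centroid, d
    return best
```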

The _partition method separates all the points into clusters. It can be very simple since the distance computation is already implemented in Cluster.

Part C: Update

The next two methods are used to perform a single step in the k-means algorithm. The _update method involves updating all the clusters, which is easy because the core operation is already implemented in Cluster. The only complication is keeping track of whether all the centroids remained unchanged. One approach is to do something similar to what you did for finding the nearest cluster: keep track of whether a changed centroid has been seen so far.
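
The bookkeeping can be sketched like this (update_one is a stand-in for Cluster.update, which returns True when a centroid did not move):

```python
def update_all(clusters, update_one):
    """Update every cluster; return True only if no centroid moved."""
    unchanged = True
    for cluster in clusters:
        if not update_one(cluster):
            unchanged = False   # do not stop early: every cluster must be updated
    return unchanged
```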

The step method is simple once you have _update and _partition. If you do it right, it is only two lines.

Part D: Run

The next method run involves calling step repeatedly. The only issue is that you have to know when to stop. You do not need a while-loop to do this part. Just loop maxstep times; if the computation converges early, exit the loop with a return statement.

This method takes an argument, so you should remember to assert your preconditions.
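
The control flow of run can be sketched like this (step_fn is a stand-in for the step method and is assumed to return True on convergence):

```python
def run(maxstep, step_fn):
    """Perform at most maxstep k-means steps, stopping early on convergence."""
    for _ in range(maxstep):
        if step_fn():
            return   # converged early; no while-loop needed
```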

Part E: Find Total Error

You are almost there. In this last method you have to add up the errors of all the clusters. Our implementation was around 4 lines. Once you have implemented this method, you are done with the assignment.

Testing it Out

Once you complete each part of Algorithm, run the unit test. If everything is working, you will get the appropriate message, just like with Cluster. You can also test this part in the visualizer. When you load a data file, you should be able to run the entire clustering process. You can see the initial clusters once the Algorithm initializer is working. Once Part C is done, you can see the process of refining clusters step by step. You should be able to visually confirm the operation of the algorithm; after each iteration, each point should have the same color as the nearest centroid.

You might find it interesting to run k-means clustering on the same dataset several times and see how the means move around. Note that if you have “too few” clusters, points that seem far apart end up in the same cluster. But if you have “too many” clusters, you get clusters that seem to lie “too close” to each other. Cluster analysis can be finicky.


Finishing the Assignment

Once you have everything working you should go back and make sure that your program meets the class coding conventions, including the following:

  • You have indented with spaces, not tabs (this is not an issue if using Atom Editor).
  • Functions are each separated by two blank lines.
  • Lines are short enough (~80 characters) that horizontal scrolling is not necessary.
  • The specifications for all of the functions are complete.
  • Function specifications are immediately after the function header and indented.
  • Your name(s) and netid(s) are in the comments at the top of the modules.

One of the things that you may have the biggest difficulty with is breaking up long lines. First, you may not be aware when your lines are too long. If you are using Atom Editor, you should see a vertical line on the right side of the window. This is the wrap guide. If you go past this, you have gone too far; it is time to break up your lines.

As for breaking up long lines, there are two solutions. First, Python allows you to "hit Return" within any expression inside of parentheses. So if you are adding several expressions together, like

  a = 'Hello ' + name + ', it is good to meet you'

you can break it up over several lines, using parentheses, as follows:

a = ('Hello ' + name +
     ', it is good to meet you')

Another solution is to use the backslash symbol \. Remember that this is the escape character for making special characters in strings. It also has a special effect outside of a string. If you type this symbol, immediately followed by a return, then Python will know to continue to the next line. So you can rewrite the addition above as

a = 'Hello ' + name + \
    ', it is good to meet you'

We only want you to upload the four files you modified: a6checks.py, a6dataset.py, a6cluster.py and a6algorithm.py. We do not want any other files, even if you added more tests to a6test.py. Upload these files to CMS by the due date listed at the top of this page.

Congrats! You're done!


Now that you’ve completed the assignment, you can cluster your own songs and other datasets.

How Spotify Stores Your Songs As Points

Every song on Spotify is rated according to 12 features: acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time signature, and valence. You can read more about them here. Together these 12 measures turn every song into a 12-dimensional point. We can then use the k-means clustering algorithm to cluster our playlists.

How to cluster your songs

1. Getting a Spotify Account

The first step of this part of the assignment is to get a Spotify user account. If you already have an account you can skip to part 2. You do not need a premium account for this assignment.

a. Go to https://www.spotify.com/us/ and hit Sign Up.

b. Fill out relevant information or signup with Facebook.

2. Find Your Spotify Username

There are multiple ways to find your Spotify username (while it’s called a username, it’s actually just a bunch of numbers -- you might find it in the form usr:1234567890, in which case you would just want 1234567890).

a. The easiest way is to go to https://www.spotify.com/us/account/set-device-password/ and copy the number after Your device username is. NOTE: You may be redirected to https://www.spotify.com/us/account/overview/. This is fine. You can copy your username from here as well.

b. Another way is to use the desktop app to copy your username. Just search for yourself, click the three-dot button, and then click Share -> Copy Spotify URI.

Copy that username and paste it into the spotify.py file where it says 'INSERT YOUR USERNAME ID HERE'.

3. Download Spotipy

Spotipy is a Python library for the Spotify Web API. Notice we said Spotipy and not Spotify. You can install it with:

pip install spotipy

4. Get Your App Registered

The next step is to get your application registered so you can use the API. Don’t worry, it is free as long as you don’t try to make money off your A6 project.

a. Go to https://developer.spotify.com/dashboard/ and log in with your personal Spotify account.

b. Click Create An App.

c. Fill out the first page. Click “I don’t know” under what are you building. You will get a warning message. This is fine as we are not trying to make money off this assignment.

d. Click through a bunch of boxes that no one ever reads.

e. You are now at the dashboard! Copy and paste your IDs into spotify.py!

f. Press Edit Settings.

g. Paste http://localhost/ into Redirect URIs and Press Add. Then Press Save.

5. Cluster!!!

a. Find a playlist that you want to cluster! This can be your own or anyone else’s.

b. Find the Spotify URI.

This can be done by clicking the three dots -> share -> Copy Spotify URI. OR three dots -> share -> Copy Spotify LINK.

c. Copy and paste this URI into the spotify.py where it says 'INSERT PLAYLIST ID HERE'. (You want the entire thing, starting with spotify:user: and ending with a bunch of random characters)

d. Navigate to your A6 folder. Run your application by typing python spotify.py into the terminal.

e. The first time you use the API, Spotify may ask you to copy and paste a URL into the terminal. This is fine.

f. Clustered! You can change the number of clusters if you want. Spotify will create new playlists on your account.

WARNING: Don’t set the number of clusters to something large like 500, or else Spotify will create 500 playlists on your account.

Example Cluster: Taylor Swift

We have provided some interesting playlists to cluster. For example artists who have changed style over the years.

Taylor Swift has changed her music style over the years. TSwift CS 1110: URI = "playlist:6f3t3Kc3vaHIcex5feh7Hf"

If you cluster her songs with k=4, you can see that the k-means algorithm puts them into three distinct categories:

A cluster for pop country songs

A cluster for just pop songs from her later albums

A cluster for slow/mellow songs

There is also another cluster with only one song in it.

Example Playlists

We also have some other playlists:

A playlist of some of your TA's favorite songs URI=playlist:2axHUCq6nyjxEKAKl9TXrB

Kanye West URI = playlist:37i9dQZF1DX7qQG2hCGiwy

Lady Gaga URI = playlist:37i9dQZF1DXc7FZ2VBjaeT

Steve Aoki (Slope Day is just around the corner!) URI = playlist:37i9dQZF1DWWVnzqQyu8XB

Cluster your own playlist as well! Let us know if you find any interesting cases.

Authors: W. White, D. Murphy, T. Westura, S. Marschner, L. Lee, N. Doyle