615 - Peer-to-Peer Systems

Spring 2006

Emin Gun Sirer

Homework 2

The hidden agenda behind the second homework is to write code that interoperates with existing protocols, to perform network measurements, and to deploy some measurement code on PlanetLab.

The concrete goal of the project is to compare the network properties of Gnutella peers to those of PlanetLab hosts.

The homework consists of two parts.

Part I: The Gnutella graph, and its peer properties

First, you should construct a Gnutella crawler that walks over the Gnutella network, discovers peers, discovers which files they are sharing, and performs some sample downloads from these peers to determine the download bandwidth available from that peer.

The Gnutella specification, as well as many reference implementations, is online. Your code needs to follow the spec in order to talk to existing Gnutella clients. You might want to start by developing (or adopting - feel free to reuse any of the existing Gnutella walkers and clients as a starting point) a program that will simply walk and output the Gnutella graph. Then modify it to query each node for the list of files it is sharing. Then query the sizes of the files, and if you find a suitable file in the 1 to 5 MB range, download it, discard all received data, and time how long the download takes to complete. Be sure to record all relevant information, including the time of day, size of file, download duration, and peer IP address.
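As a starting point for talking to existing clients, here is a minimal sketch of packing and unpacking a Gnutella descriptor header. It assumes the 23-byte layout from the v0.4 protocol document (16-byte descriptor ID, payload type, TTL, hops, little-endian payload length); check it against the spec version your target clients actually speak. The function names are made up for illustration.

```python
import os
import struct

# Assumed layout (Gnutella v0.4): 16-byte descriptor ID, 1-byte payload type,
# 1-byte TTL, 1-byte hops, 4-byte little-endian payload length = 23 bytes.
HEADER = struct.Struct("<16sBBBI")

# Payload type codes from the v0.4 protocol document.
PING, PONG, PUSH, QUERY, QUERY_HIT = 0x00, 0x01, 0x40, 0x80, 0x81


def pack_header(payload_type, ttl, hops, payload_len, guid=None):
    """Build a descriptor header; a random 16-byte ID is used if none is given."""
    if guid is None:
        guid = os.urandom(16)
    return HEADER.pack(guid, payload_type, ttl, hops, payload_len)


def unpack_header(data):
    """Parse the first 23 bytes of a received descriptor into a dict."""
    guid, payload_type, ttl, hops, payload_len = HEADER.unpack(data[:HEADER.size])
    return {
        "guid": guid,
        "payload_type": payload_type,
        "ttl": ttl,
        "hops": hops,
        "payload_len": payload_len,
    }
```

A walker would send a Ping header like `pack_header(PING, ttl=2, hops=0, payload_len=0)` after the connection handshake and parse whatever comes back with `unpack_header`. The handshake itself is not shown here.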

Be sure to write your code defensively. We do not want to bring down the Gnutella network. You must make sure that you do not leave any zombie processes behind that will perform pointless downloads forever.
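One way to keep a download from running forever is to bound it in both bytes and wall-clock time. The sketch below reads from any file-like object, discards the data, and returns the fields the assignment asks you to record. The function name, the default limits, and the record layout are assumptions for illustration, not requirements.

```python
import time
from datetime import datetime, timezone


def timed_discard(stream, max_bytes=5 * 1024 * 1024, max_seconds=120.0,
                  chunk=64 * 1024):
    """Read from `stream`, throw the data away, and time the transfer.

    Stops at end of stream, at max_bytes, or once max_seconds has elapsed
    (checked between reads), whichever comes first.
    """
    started_at = datetime.now(timezone.utc)  # time of day, for later analysis
    t0 = time.monotonic()
    received = 0
    complete = False
    while received < max_bytes:
        if time.monotonic() - t0 > max_seconds:
            break  # deadline passed: give up on this peer
        data = stream.read(min(chunk, max_bytes - received))
        if not data:
            complete = True  # the peer finished sending before we hit a cap
            break
        received += len(data)  # count the bytes, keep none of them
    return {
        "started_at": started_at,
        "bytes": received,
        "seconds": time.monotonic() - t0,
        "complete": complete,
    }
```

The deadline is only checked between reads, so the underlying socket still needs its own timeout (for example via `socket.settimeout`) to keep a single read from blocking indefinitely. The caller adds the peer IP address and advertised file size to the returned record.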

You should collect data from at least 300 peers. You should have at least 3 data points from each peer. Of course, it's always good to collect more data, say from millions of peers, but make sure that you do not exert undue stress on any given node. Treat them as you would like to have your home box treated.

This experiment should yield a CDF plot of bandwidth versus percentage of nodes.
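A minimal sketch of turning raw measurements into the points of that CDF follows. It assumes each peer's several samples are first collapsed to a single value; using the median for that is one reasonable choice, not something the assignment prescribes. The function names and the (peer, bandwidth) input format are illustrative.

```python
from statistics import median


def per_peer_median(measurements):
    """Collapse (peer_ip, bandwidth) samples into one median value per peer."""
    by_peer = {}
    for peer, bandwidth in measurements:
        by_peer.setdefault(peer, []).append(bandwidth)
    return {peer: median(values) for peer, values in by_peer.items()}


def empirical_cdf(values):
    """Return (bandwidth, percent of nodes with bandwidth <= that value) pairs."""
    xs = sorted(values)
    n = len(xs)
    return [(x, 100.0 * (i + 1) / n) for i, x in enumerate(xs)]
```

Feeding `empirical_cdf(per_peer_median(samples).values())` to any plotting tool as a step plot gives bandwidth on one axis and percentage of nodes on the other.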

Part II: The PlanetLab peers and their properties

PlanetLab is an incredibly valuable system for distributed systems experimentation that you should all be familiar with.

Sign up for a PlanetLab account. It'll take 24 hours for it to be enabled and propagated. Once that happens, you will have the ability to ssh into any of the nodes on the PlanetLab network, spanning 5 continents and several hundred sites.

There are two ways in which we will use PL hosts in this homework. First, we'll use them as participants in a direct experiment. Then we'll use them as an experimental platform from which we measure other nodes on the Internet.

First, measure the bandwidth from Ithaca to a few hundred PL hosts, extract the bandwidth CDF, and compare the results against Gnutella peers. Do PL hosts have more or less bandwidth available than Gnutella peers?

Second, use your time-of-day data to determine the variation between measurements taken during the day (noon-5pm) and in the evening (7pm-midnight). Use our timezone and ignore the remaining hours. This should yield four graphs, corresponding to (day, evening) and (Gnutella, PlanetLab). You will likely see some natural diurnal variation. Is the diurnal variation more or less pronounced on PL hosts? Answering this question (whose answer no one knows) should require nothing more than processing of data you collected earlier.
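The split into four data sets can be sketched as below. It assumes each record carries a network label, a timestamp already converted to local (Ithaca) time, and a bandwidth value; the record layout and function names are hypothetical.

```python
from datetime import datetime


def time_bucket(local_dt):
    """Classify a local-time timestamp as 'day', 'evening', or None (ignored)."""
    if 12 <= local_dt.hour < 17:   # noon up to 5pm
        return "day"
    if 19 <= local_dt.hour < 24:   # 7pm up to midnight
        return "evening"
    return None


def split_samples(records):
    """Group (network, local_dt, bandwidth) records by (network, bucket)."""
    groups = {}
    for network, local_dt, bandwidth in records:
        bucket = time_bucket(local_dt)
        if bucket is not None:     # drop measurements outside both windows
            groups.setdefault((network, bucket), []).append(bandwidth)
    return groups
```

Each of the up to four resulting lists can then be plotted as its own CDF.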

Finally, repeat your Gnutella tracing study, using different PlanetLab hosts. How different are the characteristics of the Gnutella hosts, when you use a different vantage point for your measurements?

Extra Credit:

Here are some things you can do for extra credit and to earn a hacker badge:
  • Draw the Gnutella graph. Everybody has one of these, so should you. Free graphing tools on the internet make it trivial to convert the output of your graph walker to a nice plot.
  • Examine the Gnutella graph for vulnerabilities. Draw a plot of how badly the graph would be impacted (% nodes disconnected) if certain nodes were taken out (% nodes failed). Is there a knee in this curve? If so, are there any common features to the nodes below the knee, i.e. the most valuable assets in the Gnutella network, which, if targeted, would bring the vast majority of the network down?
  • Draw a global bandwidth map. Use Octant to determine the approximate location of a Gnutella node on the global map. Octant will compute an estimated region and a point estimate for any peer that responds to ICMP ping requests (many peers do not, so expect that this approach will work for 1 out of 10 nodes or so). For simplicity, ignore the region, and assume that the node is at the estimated point. Divide the globe into grid squares, say 100x100 miles. Color the grid square that the node resides in based on your bandwidth measurement to that peer. Keep doing this until a fair portion of the map is colored in. Take the median when multiple nodes occupy the same grid square. This should yield a map of Earth's bandwidth achievable from Ithaca, NY. Do a nice job of plotting it, write your name in the lower right corner, and I'll get one framed copy for you and one for my office.
  • Draw other global bandwidth maps. Draw the graph as in the previous step, and repeat it from different PlanetLab nodes. How different are the bandwidth maps when they are collected from Boston, SF, Seattle, NYC, Ithaca, Urbana-Champaign, Toronto, and Cambridge, UK? Who has the best connection to the rest of the world?
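For the vulnerability plot in the list above, one possible sketch is shown here. It makes several assumptions that are not part of the assignment: the graph is an undirected adjacency dict in which every node appears as a key, nodes are removed in order of their original degree (highest first), and "disconnected" means the share of surviving nodes that fall outside the largest connected component. Other failure orders and other definitions are equally valid; state whichever you pick.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # iterative depth-first search
            node = stack.pop()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best


def removal_curve(adj, steps=10):
    """Return (% nodes failed, % surviving nodes disconnected) points."""
    # Highest-degree nodes fail first; degrees are not recomputed per step.
    order = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    total = len(order)
    curve = []
    for step in range(steps + 1):
        k = total * step // steps
        survivors = total - k
        if survivors == 0:
            disconnected = 0.0            # nothing left to disconnect
        else:
            largest = largest_component_size(adj, set(order[:k]))
            disconnected = 1.0 - largest / survivors
        curve.append((100.0 * k / total, 100.0 * disconnected))
    return curve
```

On a five-node star, `removal_curve(adj, steps=5)` jumps from 0% to 75% disconnected once the hub is removed, which is the kind of knee the question asks about.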