Introduction

In this problem set you will perform a number of tasks related to graphs. You will need to run your implementation on a graph with over 6 million edges, so efficiency is a prime concern. You are not allowed to use any mutable datatypes (like refs, arrays or hash tables). However, we have provided you with a partial implementation of red-black trees which, if properly used, will allow you do perform all the tasks required efficiently. You are strongly encouraged to read the whole problem set before you start working. A good design will simplify implementation immensely. In general, all of the tasks asked of you can be done in time at most O(M log M) where M is the number of edges. If you find yourself doing anything that is O(M2), you're in trouble, as 6 million squared is a very big number. Note that even with a good implementation, it will take a few minutes for some of the tasks to run, and will require at least 500 megs of memory.
If you feel that it is necessary, you are welcome to add additional functions to the signatures provided. However, you must maintain the same level of abstraction, using the opaque form. This means that any functions that work on graphs must be within the graph signature and any functions that work on rbtrees must be in the RBTreeFn functor and must be general to all red-black trees.

You will use the compilation manager (CM). The .sml and .sig files that CM manages are listed in sources.cm; do not modify this file. To compile the code with CM, change to the directory containing the files for this assignment, start SML, and run CM.make(). This will compile all of the source files. If you edit your files and want to re-compile, run CM.make() again. Before you submit your files, run CM.make() to make sure that all of your code compile without errors and warnings.

Parts 1 and 6 are due on Thursday, the 16th. The rest of the assignment is due on Thursday, the 30th.

Part 0 - Files

Zip file with all files
Zip file without large graph/communities

Part 1 (10 points) - Basic RB-Tree functions

The datatype and the code for insertions has already been written for you. However, you need to implement each of the following functions. You will likely find many of them useful when doing other parts of this problem set.
    (* returns the number of (non-Null) nodes in t*)
    fun size(t:rbtree):int

    (* returns the depth of t -- depth(Null) = 0*)
    fun depth(t:rbtree):int

    (* returns true if and only if t contains x*)
    fun contains(t:rbtree, x:elem):bool
    
    (* returns SOME element of the tree for which 
     * the compare function returns EQUAL, or returns 
     * NONE if no such element exists.*)
    fun find(t:rbtree, x:elem):elem option

    (* returns a tree containing the elements in 
     * either t1 or t2 (no duplicates though)*)
    fun union(t1:rbtree, t2:rbtree):rbtree

    (* returns a tree containing the elements in 
     * both t1 and t2*)
    fun intersection(t1:rbtree, t2:rbtree):rbtree

    (* returns an ordered list containing all elements of
     * t1. *)
    fun toList(t1:rbtree):elem list

    (* returns an ordered list containing all elements of
     * t1 whose value is between lo and hi, inclusive. *)
    fun sublist(t1:rbtree, lo: elem, hi: elem):elem list

Part 2 (20 points) - Reading the graph

A graph G = (V,E) is a set of vertices, V, and a set of edges, E. In this problem, the vertices will each be numbered, starting from 1. Thus, each directed edge will be given by a pair of integers: the starting point of the edge and its destination. Using the TextIO structure in SML, you should read a file containing a description of a graph. Each line of the file will contain two integers, u followed by v, indicating an edge from u to v. Its up to you how you represent the graph, but you will need to do it efficiently for the later parts of this problem. HINT: use red-black trees!

Once you have read the graph into an appropriate data structure, you should generate some simple statistics about the graph. The indegree of a vertex is the number of edges that go in to that vertex. The outdegree of a vertex is the number of edges that go out from that vertex. You should implement methods to calculate each of the following (see the signature for more details). Recall that any functions which you choose to add to the rbtree signature must be general to all red-black trees, and may not be specific to a particular type of red-black trees (like integer red-black trees). In the runtimes listed, M is the number of edges, and N is the number of vertices. You are not absolutely required to achieve the listed runtimes, but you should not be off by more than a logarithmic factor. Thus, if your algorithm were O(M log M) instead of O(M), that would be ok, but O(M*N) would not be ok.
  1. The number of vertices -- O(M)
  2. The number of edges -- O(M)
  3. The maximum indegree -- O(M)
  4. The maximum outdegree -- O(M)
  5. The indegree of a specific vertex -- O(log M + K), where K is the indegree
  6. The outdegree of a specific vertex -- O(log M + K), where K is the outdegree
  7. The number of nodes with indegree k, for every k -- O(M log N)
  8. The number of nodes with outdegree k, for every k -- O(M log N)
After you get your code working on small cases, you should try it on this large graph included in the zip file. It may take a while for you to read the contents of the file (many minutes) even for good implementations, so make sure everything works before you try this part. In a comment, you should report the results results of (1)-(4) above for the large graph. You should also report the results of (7) and (8) in a table for k<=10 (put these in a comment in your graph structure file).

Part 3 (10 points) - BFS

Once you have read a graph from a file, you can do a breadth first search to find the length of the shortest path from one vertex, s, to another vertex, t.

In this part, you will have to implement BFS to find the distance between s and t. You should write a function:
    (*returns the distance from s to t in the graph, or ~1 if no path exists*)
    fun distance(g:graph, s:int, t:int):int
Let's consider a small example with 5 vertices and 7 edges:
1 2
1 3
2 3
2 4
3 4
5 2
5 3
If you read in a file containing this graph, then you should be able to compute the distance from node 1 to node 4 as 2.

Once you have your BFS working properly, make sure that it works on the large graph from part 2. You want to be sure that your code works on small graphs before you start experimenting with the large graph. Once the file is read, you should be able to perform a BFS from any node to any other node in just a couple minutes.

Part 4 (10 points) - Using BFS

Using the BFS code from part 3 and the functions from part 1, you should be able to do all of the following:
  1. Find the distance from some vertex s to vertex t.
  2. Find all vertices that can be reached from vertex s.
  3. Find all vertices that can reach vertex s.
  4. Find all vertices t such that there is a path from vertex s to vertex t if all edges are treated as undirected. This is known as the weakly connected component of s.
  5. Find all vertices t such that there is a path from vertex s to vertex t and from vertex t to vertex s. This is known as the strongly connected component of s.
In addition to implementing the above functions, you should randomly select 10 (s,t) pairs of vertices. For each pair of vertices, use BFS to compute the distance from s to t. You should include a comment with the 10 pairs of vertices, and the distances you found between each pair.

Part 5 (10 points) - Communities

The communities file contains information about a number of communities in the large graph you downloaded. Each community is identified by an integer id. Each line of the file is formatted as:
<COMMUNITY> <VERTEX>
inidcating that <VERTEX> is a number of community <COMMUNITY>. You need to figure out how to read this file into an appropriate datatype to implement the rest of this problem.

We are interested in vertices that are one step away from communities. More specifically, for a particular community, we'd like to find all the vertices that are not in the community, but which have edges between them and vertices in the community. Furthermore, we can break these vertices into three disjoint sets: 1) vertices with edges going into the community and also coming out of the community, 2) vertices just with edges going into the community, 3) vertices just with edges coming out of the community. For this part, you should write a method:
    oneAway(g:graph, c:community_membership, communityId:int):(int*int*int)
The three elements of the returned tuple should correspond to the three types of vertices mentioned above. This operation should be very fast, once the graph and community_membership structures have been created. Even for the largest community (326), this shouldn't take more than a couple seconds.

Part 6 (8 points) - Induction

A tree is a graph with no cycles in it. Equivalently, it is a connected graph with exactly N-1 edges, where N is the number of nodes. The diameter of a tree is the maximum distance between some pair of two nodes. It turns out that there is a simple algorithm for finding the diameter of a tree:
  1. Pick an arbitrary node, call it u.
  2. Run BFS (or DFS) to find the nodes furthest from u. Of all the nodes furthest from u, pick one arbitrarily, call it v.
  3. Run BFS (or DFS) to find the nodes furthest from v. The distance from v to the furthest node from v is the diameter of the tree.
Prove that this algorithm correctly finds the diameter of a tree.

Part 7 (8 points) - Wrapup