CS410, Summer 1998
Lecture 25 Outline
Dan Grossman

Goals:
  * Finish up SCC
  * Do MST

SCC wrap-up:
  We were a bit rushed at the end of last class.  So we reviewed the
  proof.  See last lecture's notes.

MST:

Today G is a weighted undirected connected graph (with positive
weights, n vertices, and m edges).  

We want to choose a subset of the edges such that:
   * The graph using only those edges is connected.
   * The sum of the weights of the edges is less than or equal to the
     sum of the weights of any other such subset of edges.

Easy Claim: Such a subset is a tree of n-1 edges.
Proof: Can't be connected with fewer than n-1 edges.  If more edges, then
       there must be a cycle -- remove any edge of the cycle, 
       and the graph is still connected with lower total weight.

Another name for a connected graph with n-1 edges is a tree.  Hence
this is called the minimum spanning tree (MST) problem.

Naively it might seem hard -- we're building up an MST and then we
find some super cheap edge and have to undo some of our other edges.
However, we can actually develop algorithms that don't make mistakes
as they go along.  This is an example of a greedy algorithm.  The
general theory of greedy algorithms is covered in 482.

Here is a generic MST algorithm:

A, a set of edges, empty
while A's size is less than n-1
  find an edge such that A plus that edge is part of an MST
  add the edge to A
return A

So we just need to do the find and it's not at all clear we can do it
efficiently.

We need some definitions:
* A cut in a graph is a partition of the vertices into S and V-S.
* An edge (u,v) crosses a cut (S, V-S) if one of its endpoints is in S
  and the other is in V-S.
* An edge is a light edge for a cut if it crosses the cut and no other
  edge that crosses the cut has smaller weight.
* A set of edges respects a cut if no edge in the set crosses the cut.
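
To make these definitions concrete, here is a minimal Python sketch.
The (u, v, weight) edge representation, the helper names crosses,
light_edges, and respects, and the 4-vertex example graph are all
assumptions made for illustration; they are not from the lecture.

```python
def crosses(edge, S):
    """True if exactly one endpoint of the edge lies in S."""
    u, v, _ = edge
    return (u in S) != (v in S)


def light_edges(edges, S):
    """All minimum-weight edges among those crossing the cut (S, V-S)."""
    crossing = [e for e in edges if crosses(e, S)]
    if not crossing:
        return []
    best = min(w for _, _, w in crossing)
    return [e for e in crossing if e[2] == best]


def respects(A, S):
    """True if no edge of A crosses the cut (S, V-S)."""
    return not any(crosses(e, S) for e in A)


# Hypothetical example graph on vertices a, b, c, d.
EDGES = [("a", "b", 1), ("b", "c", 4), ("c", "d", 2),
         ("a", "d", 3), ("a", "c", 5)]
S = {"a", "b"}  # the cut is ({a, b}, {c, d})
```

With this cut, the crossing edges are (b,c), (a,d), and (a,c), so
light_edges(EDGES, S) picks out (a,d), and the edge set
{(a,b), (c,d)} respects the cut.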

The key theorem for proving various MST algorithms correct is the
following:

Theorem: If A is a subset of an MST, A respects a cut (S, V-S), and
(u,v) is a light edge for the cut, then A plus (u,v) is a subset of an
MST.

Proof: By assumption, A is included in an MST T.  Either (u,v) is in T
or it isn't.  If it is, we're done.  If it isn't, then consider the
simple path from u to v in T.  It must cross the cut at least once,
say along edge (x,y).  Since A respects the cut, (x,y) is not in A.
Since (u,v) is a light edge for the cut,
weight((x,y)) >= weight((u,v)).  So if we replace (x,y) with (u,v) our
weight has not increased.  Furthermore, we are still connected because
any path that used to use (x,y) can replace that edge with a sequence
of edges taking x to u, u-->v, v to y.  So we must still have an MST,
and it contains A plus (u,v) because the edge we removed was not in A.
(It also must be the case that weight((x,y)) == weight((u,v)).)
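
The exchange step in the proof can be sketched in Python as follows.
This is only an illustration under assumptions: edges are
(u, v, weight) tuples, the tree is a list of such edges, and the
function names tree_path and exchange and the small example are
hypothetical.

```python
def tree_path(tree, u, v):
    """Edges on the unique simple path between u and v in a spanning tree."""
    adj = {}
    for x, y, w in tree:
        adj.setdefault(x, []).append((y, (x, y, w)))
        adj.setdefault(y, []).append((x, (x, y, w)))
    # Depth-first search from u, recording the edge used to reach each vertex.
    parent = {u: None}
    stack = [u]
    while stack:
        node = stack.pop()
        for nxt, edge in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, edge)
                stack.append(nxt)
    # Walk back from v to u, collecting edges.
    path = []
    node = v
    while parent[node] is not None:
        node, edge = parent[node]
        path.append(edge)
    return path


def exchange(tree, S, light):
    """Replace a cut-crossing edge on the tree path u..v with the light edge."""
    u, v, _ = light
    for edge in tree_path(tree, u, v):
        x, y, _ = edge
        if (x in S) != (y in S):  # (x, y) crosses the cut (S, V-S)
            return [e for e in tree if e != edge] + [light]
    raise ValueError("light edge does not cross the cut")


# Hypothetical example: T is an MST of the 4-cycle a-b-c-d-a.
T = [("a", "b", 1), ("b", "c", 2), ("c", "d", 1)]
T2 = exchange(T, {"a", "b"}, ("a", "d", 2))
```

Here (b,c) is the tree edge that crosses the cut, so it is swapped for
(a,d); both have weight 2, and the total weight is unchanged.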

So on every iteration we just need to find a light edge for some cut
that A respects.  That sounds easier.

Corollary: Think of A as a forest over G.  (When we're done it will be
a tree, but in the middle it will be a multiple-tree forest.)  If an
edge is the lightest between one tree and everything else, then we can
add it.

Proof of the corollary: Make the cut be (the one tree, everything
else).

Now we will present Kruskal's algorithm for finding an MST:

Kruskal:
sort edges by weight
A empty
For edges in order: 
	if edge connects different trees of A, 
		add it (thus connecting two trees)
return A.

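Correctness: When we consider an edge (u,v) that connects two
different trees of A, every lighter edge has already been considered.
Each of those was either added to A or rejected because both of its
endpoints were already in the same tree.  So no lighter edge leaves
u's tree, and (u,v) is a lightest edge between that tree and
everything else.  By the corollary, we're safe.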

But for an edge (u,v) we need to know if u and v are currently in the
same tree.  Initially all vertices in separate trees.  When adding an
edge, two trees become one.  This is a perfect application for
Union-Find!

Kruskal:
sort edges by weight
A empty
initialize union-find with each vertex in its own set.
for each edge (u,v) in order
	if (find(u) != find(v))
	   then add (u,v) to A and union(u,v)
return A
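
As a rough illustration, the pseudocode above might look like this in
Python.  The assumptions here are mine, not the lecture's: vertices are
numbered 0..n-1, edges are (u, v, weight) tuples, and the union-find
uses union by rank with path halving (one common choice).  Unlike the
pseudocode, union is called on the two roots returned by find.

```python
def kruskal(n, edges):
    """Return the edges of an MST of a connected graph on vertices 0..n-1."""
    parent = list(range(n))  # union-find: each vertex starts in its own set
    rank = [0] * n

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(rx, ry):
        # Union by rank; rx and ry must be distinct roots.
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx
        if rank[rx] == rank[ry]:
            rank[rx] += 1

    A = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):  # sort edges by weight
        ru, rv = find(u), find(v)
        if ru != rv:  # u and v are in different trees of A
            A.append((u, v, w))
            union(ru, rv)
    return A


# Hypothetical example graph with 4 vertices and 5 edges.
EXAMPLE = [(0, 1, 1), (1, 2, 4), (2, 3, 2), (0, 3, 3), (0, 2, 5)]
MST = kruskal(4, EXAMPLE)
```

On this example the edges of weight 1, 2, and 3 are accepted and the
edges of weight 4 and 5 are rejected, giving a tree of total weight 6.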

Running time:
sort O(m log m)
union find with 2m finds and n-1 unions is O(m log* n)

So O(m log m) total.  Notice the sorting is the asymptotic bottleneck, so
constant factors probably matter more there.  Notice how all the hard
work in the analysis of the algorithm (correctness and running time)
was already done by more general arguments earlier in the course!

A second algorithm is Prim's algorithm.  We started discussing it, but
the full notes will appear in tomorrow's lecture.