Chapter 2

Graphs

Graphs—sets of nodes connected by edges—are extremely important mathematical objects that arise quite often in real-life applications. For instance:

In this chapter, we will discuss some basic properties of graphs, and present some simple algorithms for exploring them. (We return to some more advanced algorithms in chapter 6.)

2.1  Basic Definitions

Let us start by defining the notions of directed and undirected graphs. See Figure 2.1 for an illustration.

Figure 2.1: A basic example of directed and undirected graphs.

Definition 2.1. A directed graph, or simply graph, G, is a pair (V,E) where V is a set of vertices (or nodes), and E V × V is a set of edges. We say that G is an undirected graph if for every pair of nodes u,v V , (u,v) E if and only if (v,u) E (i.e., all edges are “bidirectional”).

Given an undirected graph G = (V,E), to simplify notation, we often represent the pair of edges (u,v),(v,u) E as a single unordered pair {u,v}; that is, for undirected graphs G = (V,E), E {{u,v}|u,v V }. Given an edge e = (u,v) in a graph G, we say that e is outgoing from u and incoming to v. Note that it can be the case that u = v in which case we refer to the edge e = (u,u) as a self-loop. When we refer to the size of a graph G = (V,E) (denoted |G|), we mean the number of vertices |V | in the graph; that is, |G| is defined as |V |. A complete graph G = (V,E) is a graph where E = V × V ; that is, we have an edge between any two nodes u,v V .

A subgraph of a graph G = (V,E) is a smaller graph on a subset of nodes from V consistent with the edges from E. More formally,

Definition 2.2. Given a graph G = (V,E), a subgraph of G is a graph G = (V ,E) where V V and E (V × V ) E. We use the notation G G to denote that G is a subgraph of G.

We turn to defining the notion of node degree:

Definition 2.3. In a graph G = (V,E), the in-degree of a node v V , denoted in-deg(v), is the number of edges incoming to v (i.e., edges e E of the form e = (u,v)); the out-degree of v, denoted out-deg(v), is the number of edges outgoing from v (i.e., edges e E of the form e = (v,u)). The degree of v, denoted deg(v), is the sum of the in-degree and the out-degree of v.

We refer to a node that has in-degree 0 (i.e., no incoming edges) as a source, and a node with out-degree 0 (i.e., no outgoing edges) as a sink. Pictorially, the degree of a vertex corresponds to the number of “lines” going in or out from the vertex. (Note that self-loops contribute twice to the degree of a vertex.)

2.2  Connectivity and Reachability

We turn to consider the notion of connectivity. We first define what it means for a node v to be reachable from (or connected to) a node v.

Definition 2.4. A path (or walk) in a graph G = (V,E) is a sequence of vertices (v1,v2,,vk) such that there exists an edge between any two consecutive vertices (i.e., (vi,vi+1) E for 1 i < k). We refer to k as the length of the path. If k > 1 and v1 = vk (i.e., the starting vertex is the same as the ending vertex), we refer to the path as a cycle. We say that a node v is reachable from v (or connected to v) if there exists a path from v to v.

A graph without cycles is called acyclic. We turn to defining what it means for a graph to be connected. For simplicity, we will restrict our attention to undirected graphs.

Definition 2.5. An undirected graph G = (V,E) is connected if there exists a path between any two nodes u,v V . (Note that a graph containing a single node v is considered connected via the length 0 path (v).) If an undirected graph is not connected, we say that it is disconnected.

When an undirected graph is disconnected, we may decompose the graph into smaller connected components.

Definition 2.6. A connected component of an undirected graph G = (V,E) is a maximal connected subgraph—that is, a subgraph H G that is connected, and any larger subgraph H (satisfying HH, H H G) must be disconnected.

See Figure 2.2 for an illustration of connected components of a graph.

Figure 2.2: Example of connected components.

Let us note an intersting application of connectivity to social networks. It has been experimentally observed that in social networks (e.g., Facebook), although not everyone is in the same connected component, most people typically are: we informally refer to this component as the giant component. One reason for why most people are part of the same giant component is that if we had two (or more) large components in the graph, then all it takes to “merge” two large components into a single larger component is for one of the people in each component to become friends—it is very unlikely that this does not happen for any such pair of people. (This argument can be formalized in Erdős and Renyi’s “random graph” model [ER59, Bol84]; we discuss the random graph model in more detail below.)

2.3  Breadth-first Search and Shortest Paths

How do we check if there is a path from v to v? One efficient way to do this is with breadth-first search (BFS). Breadth first search is a basic graph search algorithm that given a graph G and a starting node v proceeds as follows: visit all the neighbors v of v, then visit all the currently unvisited neighbors of those nodes v, and so on and so forth, while keeping track of which nodes have previously been visited. More precisely,

See Figure 2.3 for an illustration of the execution of the BFS algorithm.

Figure 2.3: An illustration of breadth-first search. Starting from node S, in the first step we visit all of the neighbors of S (marked with a 1), in the second step we visit all of the unvisited neighbors (marked with a 2) of each node visited in the first step, and so on. Notice that, in four steps, BFS visits every node except for the top-right node (unlabeled), which clearly has no path from S.

As we now show, the BFS algorithm can be used not only to determine whether there exists a path from v to v, but also to find some shortest path (i.e., a path of minimal length) between v and v.

Claim 2.1. For all vertices v,v, if there exists a path from v to v, the BFS algorithm starting from vertex v traverses some shortest path from v to v (and this shortest path can be retrieved by following the “back-pointers” from v).

Proof. Assume for contradiction that there exists a path from v to v, but that the BFS algorithm does not traverse any shortest path from v to v. Let (v0 = v,v1,v2,,vn = v) be a shortest path from v to v, and let vi be the first node on the path such that BFS does not traverse a path of length (at most) i from v to vi. Such a node must exist because, by assumption, BFS does not traverse a path of length n from v = v0 and v = vn; additionally i 1, since v0 = v is the starting point.

Since vi is the first node on the path for which BFS fails to traverse a path of length at most i, it must have traversed a path of length at most i 1 to get to vi1. Consequently, after visiting vi1, BFS will visit vi since it is a neighbor of vi1 that still is unvisited at this point (or else BFS would have traversed a path of length at most i 1 to get from v to vi). Thus, BFS traverses a path of length at most i from v to vi, which is a contradiction.

As a consequence of the above claim, and observing that BFS will only visit each node once (and thus its running time is polynomial in the number of nodes in the graphs), we have the following theorem.

Theorem 2.1. Shortest paths in a graph G can be found in polynomial time in |G|.

Since running BFS from a node v visits all the nodes that are reachable from v (by Claim 2.1), we also have the following result.

Theorem 2.2. Given any node v in G, the set of nodes that are reachable from v in G can be found in polynomial time in |G|.

We can also use BFS to find the connected components of an undirected graph G: Start at any node v and use BFS to find the set of nodes that are reachable from v; by the definition of reachability, the subgraph corresponding to those nodes is the maximal connected subgraph containing v and thus a connected component. Mark all those nodes as component 1. Next, take some currently unmarked node v, and again use BFS to find the set of nodes that are reachable from v; mark all those notes as component 2. Continue in the same manner until all nodes have been marked.

Some applications of BFS and shortest paths to social networks include the following:

To get some intuition for why small-world effects happen, let us consider Erdős and Renyi’s “random graph” model [ER59]:

The random graph model Consider a graph with n nodes, where every pair of nodes has an edge between them (determined independently and with no self-loops) with probability 1 2. To gain some intuition about how many nodes have “short paths” between them, we start by asking the following question: What is the probability that two vertices v and v have a path of length at most 2 between them?

Given any specific third vertex z, the probability of a path ((v,z),(z,v)) is only 1 2 1 2 = 1 4. However, since there are n 2 possible “third vertices” for this path, the probability that two nodes are not connected by any path of length exactly 2 is

(1 1 4)n2 =(3 4)n2.

By the Union Bound (see Corollary A.2 in Appendix A), we thus have that the probability that there exists some pair of nodes that is more than distance 2 apart is

Pr uvu, v has distance > 2 uv Pr[u, v has distance > 2] uv Pr[u, v has distance  2] = n(n 1) 2 3 4n2.

This quantity decreases very quickly as the number of vertices, n, increases. Therefore, it is extremely likely that every pair of nodes is at most distance 2 apart.

Needless to say, in a real social network, people are far less likely than probability 1 2 to be connected, but the number n of nodes is extremely large. Hence, similar arguments apply to show that the average separation between two nodes is relatively small.

Notes

We refer the reader to [CLRS09] and [KT05] for a more in-depth study of graph algorithms, and to [EK10] for connections between graph properties and social networks.