Excerpted from a paper at the 9th World Wide Web conference, 2000

Graph structure in the web


Andrei Broder1, Ravi Kumar2, Farzin Maghoul1, Prabhakar Raghavan2, Sridhar Rajagopalan2, Raymie Stata3, Andrew Tomkins2, Janet Wiener3

1: AltaVista Company, San Mateo, CA.
2: IBM Almaden Research Center, San Jose, CA.
3: Compaq Systems Research Center, Palo Alto, CA.


Abstract

The study of the web as a graph is not only fascinating in its own right, but also yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution.  We report on experiments on local and global properties of the web graph using two AltaVista crawls, each with over 200M pages and 1.5 billion links.  Our study indicates that the macroscopic structure of the web is considerably more intricate than suggested by earlier experiments on a smaller scale.

Keywords: graph structure, diameter, web measurement


Introduction

Consider the directed graph whose nodes correspond to static pages on the web, and whose arcs correspond to hyperlinks between these pages.  We study various properties of this graph including its diameter, degree distributions, connected components, and macroscopic structure.  There are several reasons for developing an understanding of this graph:
  1. Designing crawl strategies on the web [Cho and Garcia-Molina 2000].
  2. Understanding of the sociology of content creation on the web.
  3. Analyzing the behavior of web algorithms that make use of link information [Botafogo and Shneiderman 91, Mendelzon and Wood 95, Carriere and Kazman 97, Kleinberg 97, Brin and Page 98].  To take just one example, what can be said of the distribution and evolution of PageRank [Brin and Page 98] values on graphs like the web?
  4. Predicting the evolution of web structures such as bipartite cores [Kumar et al. (1) 99] and webrings, and developing better algorithms for discovering and organizing them.
  5. Predicting the emergence of new, yet unexploited phenomena in the web graph.
We detail a number of experiments on a web crawl of approximately 200 million pages and 1.5 billion hyperlinks; the scale of this experiment is thus five times larger than the largest previous study [Kumar et al. (1) 99] of structural properties of the web graph, which used a pruned data set from 1997 containing about 40 million pages.  Recent work ([Kumar et al. (1) 99] on the 1997 crawl, and [Barabasi and Albert 99] on the approximately 325K-node nd.edu subset of the web) has suggested that the distribution of degrees (especially in-degrees) follows a power law:
The power law for in-degree: the probability that a node has in-degree i is proportional to 1/i^x, for some x > 1.
We verify the power law phenomenon in current (considerably larger) web crawls, confirming it as a basic web property.
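As an illustration of the law just stated, one can draw in-degrees from a truncated version of the distribution and check that the frequency of degree 1 exceeds that of degree 2 by a factor of about 2^x. The sketch below uses x = 2.1, the exponent reported later in this paper; the support size and sample count are arbitrary illustrative choices, not crawl data.

```python
import bisect
import random
from collections import Counter

# Toy illustration: P(in-degree = i) proportional to 1/i**x on a
# truncated support 1..10_000 (both x = 2.1 and the support are
# assumptions for illustration, not the AltaVista data).
x = 2.1
weights = [i ** -x for i in range(1, 10_001)]
total = sum(weights)
cdf, acc = [], 0.0
for w in weights:
    acc += w / total
    cdf.append(acc)

def sample_in_degree():
    # inverse-CDF sampling over the finite support
    return bisect.bisect_left(cdf, random.random()) + 1

random.seed(0)
counts = Counter(sample_in_degree() for _ in range(200_000))

# Under the law, P(1)/P(2) = 2**x, roughly 4.3; the empirical ratio
# of observed frequencies should land close to that value.
ratio = counts[1] / counts[2]
```

With 200,000 samples the empirical ratio `counts[1] / counts[2]` should land near 2^2.1 ≈ 4.3, mirroring the way the power law is verified empirically from observed degree frequencies.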

In other recent work, [Albert, Jeong, and Barabasi 99] report the intriguing finding that most pairs of pages on the web are separated by a handful of links, almost always under 20, and suggest that this number will grow logarithmically with the size of the web. This is viewed by some as a "small world" phenomenon.  Our experimental evidence reveals a rather more detailed and subtle picture: significant portions of the web cannot be reached at all from other (significant) portions of the web, and there is a significant number of pairs that can be bridged, but only using paths going through hundreds of intermediate pages.

Our main results

We performed three sets of experiments on web crawls from May 1999 and October 1999.  First, we generated the in- and out-degree distributions, confirming previous reports on power laws; for instance, the fraction of web pages with i in-links is proportional to 1/i^2.1, the constant 2.1 being in remarkable agreement with earlier studies at varying scales [Kumar et al. (1) 99, Barabasi and Albert 99].  In our second set of experiments we studied the directed and undirected connected components of the web.  We show that power laws also arise in the distribution of sizes of these connected components.  Finally, we performed a number of breadth-first searches from randomly-chosen start nodes.  We detail these experiments in the section describing our experiments and results.

Our analysis reveals an interesting picture (Figure 9) of the web's macroscopic structure.  Most (over 90%) of the approximately 203 million nodes in our crawl form a single connected component if hyperlinks are treated as undirected edges.  This connected web breaks naturally into four pieces.  The first piece is a central core, all of whose pages can reach one another along directed hyperlinks -- this "giant strongly connected component" (SCC) is at the heart of the web.  The second and third pieces are called IN and OUT.  IN consists of pages that can reach the SCC, but cannot be reached from it -- possibly new sites that people have not yet discovered and linked to.  OUT consists of pages that are accessible from the SCC, but do not link back to it, such as corporate websites that contain only internal links.  Finally, the TENDRILS contain pages that cannot reach the SCC, and cannot be reached from the SCC.  Perhaps the most surprising fact is that the size of the SCC is relatively small -- it comprises about 56M pages.  Each of the other three sets contains about 44M pages -- thus, all four sets have roughly the same size.

[deletion]

In a sense, the web is much like a complicated organism: the local structure at a microscopic scale looks fairly regular, like a biological cell, while the global structure exhibits interesting morphological features (a body and limbs) that are not evident at the local level. Therefore, while it might be tempting to draw conclusions about the structure of the web graph from a local picture of it, such conclusions may be misleading.

Related prior work

[deletion]

Zipf-Pareto-Yule and Power laws.   Distributions with an inverse polynomial tail have been observed in a number of contexts. The earliest observations are due to Pareto [Pareto 1897] in the context of economic models. Subsequently, these statistical behaviors have been observed in the context of literary vocabulary [Yule 44], sociological models [Zipf 49], and even oligonucleotide sequences [Martindale and Konopka 96], among others.  Our focus is on the closely related power law distributions, defined on the positive integers, with the probability of the value i being proportional to 1/i^k for a small positive number k.  Perhaps the first rigorous effort to define and analyze a model for power law distributions is due to Herbert Simon [Simon 55].

More recently, power law distributions have been observed in various aspects of the web. Two lines of work are of particular interest to us. First, power laws have been found to characterize user behavior on the web in two related but dual forms:

  1. access statistics for web pages, which can be easily obtained from server logs (but for caching effects); see [Glassman 97, Huberman et al. 98, Adamic and Huberman (1) 99, Adamic and Huberman (2) 99].
  2. the number of times users at a single site access particular pages, which also follows a power law, as verified by instrumenting and inspecting logs from web caches, proxies, and clients (see [Barford et al. 99] and references therein, as well as [Lukose and Huberman 98]).
Second, and more relevant to our immediate context, is the distribution of degrees on the web graph.  In this context, recent work (see [Kumar et al. (1) 99, Barabasi and Albert 99]) suggests that both the in- and the out-degrees of vertices on the web graph follow power laws. The difference in scope in these two experiments is noteworthy. The first [Kumar et al. (1) 99] examines a web crawl from 1997 due to Alexa, Inc., with a total of over 40 million nodes. The second [Barabasi and Albert 99] examines web pages from the University of Notre Dame domain, *.nd.edu, as well as a portion of the web reachable from 3 other URLs. In this paper, we verify these power laws on more recent (and considerably larger) web crawls.  This collection of findings reveals an almost fractal-like quality for the power law in-degree and out-degree distributions: they appear as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels in between.

There is no evidence that users' browsing behavior, access statistics, and the linkage statistics on the web graph are related in any fundamental way, although it is very tempting to conjecture that this is indeed the case.  It is usually, though not always, the case that pages with high in-degree also have high PageRank [Brin and Page 98]. Indeed, one way of viewing PageRank is that it puts a number on how easy (or difficult) it is to find particular pages by a browsing-like activity. Consequently, it is plausible that the in-degree distribution induces a similar distribution on browsing activity and, consequently, on access statistics. [deletion]
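To make the connection between in-degree and PageRank concrete, here is a minimal power-iteration sketch on a toy graph. The graph, the damping factor d = 0.85, and the iteration count are illustrative assumptions, not data or parameters from [Brin and Page 98]; in this particular toy graph the node with the highest in-degree also ends up with the highest rank.

```python
# Minimal PageRank power iteration (a sketch, not the production
# algorithm): each node's rank is split among its out-links, with a
# (1 - d) teleportation share distributed uniformly.
def pagerank(out_links, d=0.85, iters=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, targets in out_links.items():
            if targets:
                share = d * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# Hypothetical toy graph: "hub" has the highest in-degree (3).
g = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a"]}
r = pagerank(g)
```

Here `r["hub"]` dominates, consistent with the intuition above that pages with many in-links tend to be easy to reach by browsing.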

In this section we formalize our view of the web as a graph; in this view we ignore the text and other content in pages, focusing instead on the links between pages.  Adopting the terminology of graph theory [Harary], we refer to pages as nodes, and to links as arcs.  In this framework, the web becomes a large graph containing several hundred million nodes, and a few billion arcs.  We will refer to this graph as the web graph, and our goal in this paper is to understand some of its properties.  Before presenting our model for web-like graphs, we begin with a brief primer on graph theory, and a discussion of graph models in general.

A brief primer on graphs and terminology

The reader familiar with basic notions from graph theory may skip this primer.

A directed graph consists of a set of nodes, denoted V, and a set of arcs, denoted E.  Each arc is an ordered pair of nodes (u,v) representing a directed connection from u to v. The out-degree of a node u is the number of distinct arcs (u,v1)...(u,vk) (i.e., the number of links from u), and the in-degree is the number of distinct arcs (v1,u)...(vk,u) (i.e., the number of links to u).  A path from node u to node v is a sequence of arcs (u,u1), (u1,u2), ..., (uk,v). One can follow such a sequence of arcs to "walk" through the graph from u to v.  Note that a path from u to v does not imply a path from v to u. The distance from u to v is one more than the smallest k for which such a path exists. If no path exists, the distance from u to v is defined to be infinity. If (u,v) is an arc, then the distance from u to v is 1.
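The distance just defined can be computed directly by breadth-first search. The following sketch (with a hypothetical toy arc list) returns the number of arcs on a shortest path, or infinity when no path exists, matching the definition above:

```python
from collections import deque

# Shortest directed distance (number of arcs) from u to v via BFS;
# returns float('inf') when no path exists.
def distance(arcs, u, v):
    adj = {}
    for (a, b) in arcs:
        adj.setdefault(a, []).append(b)
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, d = queue.popleft()
        if node == v:
            return d
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

# toy example: u -> w -> v, but no arcs back
arcs = [("u", "w"), ("w", "v")]
```

Note that `distance(arcs, "u", "v")` is finite while `distance(arcs, "v", "u")` is infinite, illustrating that a path from u to v does not imply a path from v to u.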

Given a directed graph, a strongly connected component (strong component for brevity) of this graph is a maximal set of nodes such that for any pair of nodes u and v in the set there is a path from u to v.  In general, a directed graph may have one or many strong components.  The strong components of a graph form disjoint sets of nodes. One focus of our studies will be in understanding the distribution of the sizes of strong components on the web graph.
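Strong components can be computed in time linear in the size of the graph; one standard approach is Kosaraju's two-pass depth-first search, sketched below on a hypothetical toy graph (this is an illustration of the general technique, not the implementation run on the crawl):

```python
# Kosaraju's algorithm: a first DFS records nodes in order of
# completion; a second DFS on the reversed graph, taken in reverse
# finishing order, peels off the strong components one at a time.
def strong_components(adj):
    order, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj.get(start, [])))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, []))))
                    advanced = True
                    break
            if not advanced:
                order.append(node)  # node is finished
                stack.pop()
    # build the reversed graph
    rev = {}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    comps, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        comp, stack = set(), [u]
        assigned.add(u)
        while stack:
            node = stack.pop()
            comp.add(node)
            for v in rev.get(node, []):
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

# toy graph: a <-> b form one strong component; c cannot return
g = {"a": ["b"], "b": ["a", "c"], "c": []}
```

On the toy graph this yields the two disjoint strong components {a, b} and {c}, consistent with the definition above.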

[deletion]
 

Experiments and results

[deletion]

In the experiments reported in this paper, CS2 [the database] was built from a crawl performed at AltaVista in May, 1999. The CS2 database contains 203 million URLs and 1466 million links (all of which fit in 9.5 GB of storage).  Some of our experiments were repeated on a more recent crawl (October, 1999) containing 271 million URLs and 2130 million links.

In general, the AltaVista crawl is based on a large set of starting points accumulated over time from various sources, including voluntary submissions.  The crawl proceeds in roughly a BFS manner, but is subject to various rules designed to avoid overloading web servers, avoid robot traps (artificial infinite paths), avoid and/or detect spam (page flooding), deal with connection timeouts, etc.  Each build of the AltaVista index is based on the crawl data after further filtering and processing designed to remove duplicates and near duplicates, eliminate spam pages, etc.  Then the index evolves continuously as various processes delete dead links, add new pages, update pages, etc.  The secondary filtering and the later deletions and additions are not reflected in the connectivity server. But overall, CS2's database can be viewed as a superset of all pages stored in the index at one point in time.  Note that due to the multiple starting points, it is possible for the resulting graph to have many connected components.

[deletion]

Degree sequences.  The first experiment we ran was to verify earlier observations that the in- and out-degree distributions on the web are distributed according to power laws.  We ran the experiment on both the May and October crawls of the web.   The results, shown in Figure 1, show remarkable agreement with each other, and with similar experiments from data that is over two years old [Kumar et al. (1) 99].  Indeed, in the case of in-degree, the exponent of the power law is consistently around 2.1, a number reported in [Kumar et al. (1) 99, Barabasi and Albert 99].  The anomalous bump at 120 on the x-axis is due to a large clique formed by a single spammer.  In all our log-log plots, straight lines are linear regressions for the best power law fit.
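The fitting step (a linear regression for the best power law fit in log-log coordinates) can be sketched as follows. For clarity the frequencies below are generated exactly from 1/i^2.1 rather than taken from the crawl, so the regression recovers the exponent exactly; on real degree data the tail is noisy and the fit is approximate.

```python
import math

# Estimate a power-law exponent by regressing log(frequency) on
# log(degree); the negated slope of the fitted line is the exponent.
def power_law_exponent(freq):
    xs = [math.log(i) for i in freq]
    ys = [math.log(f) for f in freq.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope  # slope is negative on a log-log plot

# Illustrative frequencies drawn exactly from the law 1/i**2.1
# (a synthetic stand-in, not the AltaVista degree counts).
freq = {i: 1_000_000 * i ** -2.1 for i in range(1, 1001)}
```

Applied to these synthetic frequencies, `power_law_exponent(freq)` recovers 2.1, the value the straight-line fits in Figure 1 produce for the in-degree data.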

[deletion]
 

Figures 1 and 2: In-degree and out-degree distributions subscribe to the power law. The law also holds if only off-site (or "remote-only") edges are considered.

[deletion]

[deletion]

Strongly connected components.  Motivated in part by the intriguing prediction of [Albert, Jeong, and Barabasi 99] that the average distance (referred to in their paper as the diameter) of the web is 19 (and thus that it should be possible to get from any page to any other in a small number of clicks), we turned to the strongly connected components of the web as a directed graph.  By running the strongly connected component algorithm, we find that there is a single large SCC consisting of about 56M pages; all other components are significantly smaller in size. This amounts to barely 28% of all the pages in our crawl. One may now ask: where have all the other pages gone?  The answer to this question reveals some fascinating detailed structure in the web graph; to expose this, and to further study the issues of diameter and average distance, we conducted a further series of experiments.

[deletion]

Zipf distributions vs power law distributions.  The Zipf distribution is an inverse polynomial function of ranks rather than magnitudes -- for example, if only in-degrees 1, 4, and 5 occurred, then a power law would be inversely polynomial in those values, whereas a Zipf distribution would be inversely polynomial in the ranks of those values:  i.e., inversely polynomial in 1, 2, and 3.  The in-degree distribution in our data shows a striking fit with a Zipf distribution (more so than with a power law); Figure 8 shows the in-degrees of pages from the May 1999 crawl plotted against both ranks and magnitudes (corresponding to the Zipf and power law cases).  The plot against ranks is virtually a straight line in the log-log plot, without the flare-out noticeable in the plot against magnitudes.
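The distinction can be made concrete on the small example from the text, in-degrees 1, 4, and 5 (here expanded into a hypothetical multiset so the frequencies differ): a power-law plot pairs each frequency with the degree value itself, while a Zipf plot pairs it with the rank of that value in decreasing order of frequency.

```python
from collections import Counter

# Hypothetical multiset of observed in-degrees, using the values
# 1, 4, and 5 mentioned in the text.
degrees = [1, 1, 1, 1, 4, 4, 5]
counts = Counter(degrees)

# power-law view: (degree value, frequency) pairs
by_value = sorted(counts.items())

# Zipf view: (rank, frequency) pairs, ranks 1, 2, 3 assigned in
# decreasing order of frequency
by_rank = [(r, f) for r, (_, f) in
           enumerate(sorted(counts.items(), key=lambda kv: -kv[1]),
                     start=1)]
```

Both views carry the same frequencies; only the x-coordinate changes (value vs rank), which is exactly why the two log-log plots in Figure 8 can fit differently.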
 

Figure 8: In-degree distributions plotted as a power law and as a Zipf distribution.

Interpretation and further work

Let us now put together the results of the connected component experiments with the results of the random-start BFS experiments.  Given that the set SCC contains only 56M of the 186M nodes in our giant weak component, we use the BFS runs to estimate the positions of the remaining nodes.  The starting points for which the forward BFS "explodes" are either in SCC, or in a set we call IN, that has the following property: there is a directed path from each node of IN to (all the nodes of) SCC.  Symmetrically, there is a set we call OUT containing all starting points for which the backward BFS "explodes"; there is a directed path from any node in the SCC to any node in OUT.  Thus a forward BFS from any node in either the SCC or IN will explode, as will a backward BFS from any node in either the SCC or OUT.  By analyzing forward and backward BFS from 570 random starting points, we can compute the number of nodes that are in SCC, IN, OUT or none of these.  Figure 9 shows the situation as we can now infer it.
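The classification logic just described can be sketched as follows, under the simplifying assumption that a node already known to lie in the SCC is available (the paper instead infers membership from whether BFS runs from random starts "explode"): a forward BFS from that node reaches SCC together with OUT, a backward BFS reaches SCC together with IN, and set operations separate the three regions. The graph below is a toy example, not the crawl.

```python
from collections import deque

def bfs(adj, start):
    # set of all nodes reachable from start along directed arcs
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(adj, scc_node):
    # reverse the graph for the backward BFS
    rev = {}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    forward = bfs(adj, scc_node)    # reaches SCC and OUT
    backward = bfs(rev, scc_node)   # reaches SCC and IN
    return {"SCC": forward & backward,
            "OUT": forward - backward,
            "IN": backward - forward}

# toy graph: in -> (s1 <-> s2) -> out
g = {"in": ["s1"], "s1": ["s2"], "s2": ["s1", "out"], "out": []}
```

On the toy graph, `bow_tie(g, "s1")` places s1 and s2 in SCC, the node "in" in IN, and "out" in OUT, mirroring the regions of Figure 9.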
 
Figure 9: Connectivity of the web: one can pass from any node of IN through SCC to any node of OUT.  Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC.  It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE -- a passage from a portion of IN to a portion of OUT without touching SCC.

We now give a more detailed description of the structure in Figure 9.  The sizes of the various components are as follows:
 
Region    SCC          IN           OUT          TENDRILS     DISC.        Total
Size      56,463,993   43,343,168   43,166,185   43,797,944   16,777,756   203,549,046

[deletion] Acknowledgment.  We thank Keith Randall for his insights into our SCC algorithm and implementation.


References