The Structure of Information Networks

Computer Science 685
Fall 2002

  • Instructor: Jon Kleinberg

  • MWF 1:25-2:15 pm.

  • Phillips 203.


    Information networks such as the World Wide Web are characterized by the interplay between heterogeneous content and a complex underlying link structure. This course covers recent research on algorithms for analyzing such networks, and models that capture their basic properties. Topics include methods for link analysis, centralized and decentralized search algorithms, probabilistic models for networks, and connections with work in the areas of social networks and citation analysis.

    The course pre-requisites include background in algorithms and graphs at the level of CS 482, as well as some familiarity with probability and linear algebra.

    The work for the course will consist of a mixture of reaction papers and a few problem sets, concluding with a project. The coursework is discussed in more detail here.

    Course Outline

    (1) Overview and Background.

    Several things laid the groundwork for the material in this course, but two stand out in particular: the increasing availability of network data across technological, social, and biological domains; and the rise of the Web as a central object of study in computer science. We begin by surveying this background material, with a particular emphasis on the World Wide Web.

    (2) What Can Link Structure Tell Us About Content?

    Link structure can be a powerful source of information about the underlying content in the network. In the context of the Web, we can try to identify high-quality information resources from the way in which other pages link to them; this idea has reflections in the analysis of citation data to find influential journals, and in the analysis of social networks to find important members of a community.

    (3) Inferring Communities from Link Topology

    We began the discussion of link analysis with the goal of searching for and ranking high-quality content. A lot of the methods to do this, however, uncover richer structure in the network --- densely connected communties focused on a particular topic. The problem of uncovering such community structure differs in some interesting ways from the heavily studied problems of clustering and graph partitioning (though there are clear relationships): rather than trying to decompose the full network into a few pieces, the goal is to identify small, densely connected regions that -- even put together -- may not account for much of the whole graph.

    (4) Rank Aggregation and Meta-Search

    We have now seen a number of different techniques for ranking Web pages according to various measures of quality. Are there principled methods for combining them, to get an even better `meta-ranking'? We begin this discussion with a celebrated result of Arrow from mathematical economics, suggesting some of the trade-offs inherent in such an approach.

    (5) Power-Law Distributions

    If we were to generate a random graph on n nodes by including each possible edge independently with some probability p, then the fraction of nodes with d neighbors would decrease exponentially in d. But for many large networks -- including the Web, the Internet, collaboration networks, and semantic networks -- it quickly became clear that the fraction of nodes with d neighbors decreases only polynomially in d; to put it differently, the distribution of degrees obeys a power law. What processes are capable of generating such power laws, and why should they be ubiquitous in large networks? The investigation of these questions suggests that power laws are just one reflection of the local and global processes driving the evolution of these networks.

    (6) Decentralized Search and the Small-World Phenomenon

    We now shift our focus to problems with a decentralized flavor: rather than assuming we have access to a central index of the network, we consider the issue of exploring a network `from the inside', without global knowledge. This could be for purposes of search or approximate measurement; it could also be for the purpose of constructing a complete index, in order to make centralized algorithms possible. We begin the discussion of this issue with one of the oldest results on decentralized search -- Stanley Milgram's `small-world' experiments from the 1960's, whose participants forwarded letters through chains of acquaintances to a designated target, and whose striking outcome established the `six degrees of separation' principle. What is a general framework for thinking about this type of result, and what are the general properties that make a network `searchable'?

    (7) Decentralized Search in Peer-to-Peer Networks

    Recently, decentralized search has been applied to the problem of sharing files in a peer-to-peer network without a global index. Each host in the system holds a subset of the content, and requests must be routed to the appropriate host in a decentralized fashion. As in the case of the small-world problem, the goal is a network that is easily searchable; but how should a distributed protocol shape the network topology so as to attain this goal?

    (8) Exploring the Web by Crawling, Focused Crawling, and Approximate Sampling

    In the Web, crawlers are the bridge between the decentralized and centralized worlds -- they gather the content so it can be centrally indexed. We consider some of the issues in scaling a crawler up to the scope of the full Web, as well as the problem of focused crawling, in which one simply wants to crawl a particular subset of the Web -- for example, to gather pages only on a particular topic. Finally, we consider methods based on random sampling and random walks for determining aggregate properties of the network and its indices.

    (9) Link-Based Classification

    Link analysis plays a role in a problem that is related to the ones we have been considering: the task of classifying Web pages into topic categories. Automated text analysis can give us an estimate of the topic of each page; but we also suspect that pages have some tendency to be similar to neighboring pages in the link structure. How should we combine these two sources of evidence? A number of probabilistic frameworks are useful for this task, including the formalism of Markov random fields, which -- for quite different applications -- has been extensively studied in computer vision.

    (10) Diffusion of Information Through Networks

    We can think of a network as a large circulatory system, through which information continuously flows. This diffusion of information can happen rapidly or slowly; it can be disastrous -- as in a panic or cascading failure -- or beneficial -- as in the spread of an innovation. Work in several areas has proposed models for such processes, and investigated when a network is more or less susceptible to their spread.

    (11) Epidemic Algorithms in Networks

    The type of diffusion or cascade process we discussed in the previous part can also be used as a design principle for network protocols. This leads to the idea of epidemic algorithms, also called gossip-based algorithms, in which information is propagated through a collection of distributed computing hosts, typically using some form of randomization. The result is a means of information dissemination that is often more simpler and more robust than regimented, centralized schemes.

    Network Datasets

    There are a number of interesting network datasets available on the Web; they form a valuable resource for trying out algorithms and models across a range of settings.

    Other Courses with Overlapping Content