Algorithms for Information Networks

Cornell University, Spring 2005

  • Instructor: Jon Kleinberg

  • Time: MW 3:00-4:20 pm.

  • http://www.cs.cornell.edu/home/kleinber/sp05course.html

    Overview

    Information networks such as the World Wide Web are characterized by the interplay between heterogeneous content and a complex underlying link structure. This course covers recent research on algorithms for analyzing such networks, and models that abstract their basic properties. Topics include combinatorial and probabilistic techniques for link analysis, centralized and decentralized search algorithms, network models based on random graphs, and connections with work in the social sciences.

    The course pre-requisites include background in algorithms and graphs, as well as some familiarity with probability and linear algebra.

    The work for the course will consist primarily of a short reaction paper and a more substantial project. The coursework is discussed in more detail here.

    Course Outline

    (1) Small-World Properties in Networks

    A major goal of the course is to illustrate how networks across a variety of domains exhibit common structure at a qualitative level. One area in which this arises is in the study of `small-world properties' in networks: many large networks have short paths between most pairs of nodes, even though they are highly clustered at a local level, and they are searchable in the sense that one can navigate to specified target nodes without global knowledge. These properties turn out to provide insight into the structure of large-scale social networks, and, in a different direction, to have applications to the design of decentralized peer-to-peer systems.
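
    As an illustration of decentralized search, here is a minimal sketch (in Python, not taken from the course readings) of greedy routing in a Kleinberg-style grid model: every node knows its grid neighbors plus one long-range contact chosen with probability proportional to d^(-r), and each node forwards the message to whichever contact is closest to the target. The grid size n and exponent r below are arbitrary illustrative values.

        import random

        def lattice_distance(u, v):
            # Manhattan distance between two grid positions
            return abs(u[0] - v[0]) + abs(u[1] - v[1])

        def build_long_range_links(n, r=2.0):
            # Give each node one long-range contact, chosen with probability ~ distance^(-r)
            nodes = [(i, j) for i in range(n) for j in range(n)]
            links = {}
            for u in nodes:
                others = [v for v in nodes if v != u]
                weights = [lattice_distance(u, v) ** (-r) for v in others]
                links[u] = random.choices(others, weights=weights, k=1)[0]
            return links

        def greedy_route(n, links, source, target):
            # Decentralized search: always forward to the known contact closest to the target
            current, steps = source, 0
            while current != target:
                contacts = [(current[0] + dx, current[1] + dy)
                            for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
                            if 0 <= current[0] + dx < n and 0 <= current[1] + dy < n]
                contacts.append(links[current])
                current = min(contacts, key=lambda v: lattice_distance(v, target))
                steps += 1
            return steps

        if __name__ == "__main__":
            n = 20
            links = build_long_range_links(n)
            print(greedy_route(n, links, (0, 0), (n - 1, n - 1)), "hops from corner to corner")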

    (2) Power-Law Distributions

    If we were to generate a random graph on n nodes by including each possible edge independently with some probability p, then the fraction of nodes with d neighbors would decrease exponentially in d. But for many large networks -- including the Web, the Internet, collaboration networks, and semantic networks -- it quickly became clear that the fraction of nodes with d neighbors decreases only polynomially in d; to put it differently, the distribution of degrees obeys a power law. What processes are capable of generating such power laws, and why should they be ubiquitous in large networks? The investigation of these questions suggests that power laws are just one reflection of the local and global processes driving the evolution of these networks.
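
    The contrast can be seen directly in a small simulation. The sketch below (a rough illustration, not course code) compares an Erdos-Renyi random graph, whose degrees concentrate near n*p, with a simple preferential-attachment process, whose degrees develop a heavy, power-law-like tail; the parameters n, p, and m are arbitrary choices.

        import random

        def gnp_max_degree(n, p):
            # Include each edge independently with probability p; degrees concentrate around n*p
            deg = [0] * n
            for i in range(n):
                for j in range(i + 1, n):
                    if random.random() < p:
                        deg[i] += 1
                        deg[j] += 1
            return max(deg)

        def preferential_attachment_max_degree(n, m=2):
            # Each new node attaches to m existing nodes chosen in proportion to their degree,
            # so early, well-connected nodes keep attracting links (a heavy-tailed degree law)
            targets = list(range(m))   # endpoint list: node ids appear once per incident edge
            deg = [0] * n
            for v in range(m, n):
                chosen = set()
                while len(chosen) < m:
                    chosen.add(random.choice(targets))
                for u in chosen:
                    deg[u] += 1
                    deg[v] += 1
                    targets.extend([u, v])
            return max(deg)

        if __name__ == "__main__":
            print("G(n,p) max degree:", gnp_max_degree(2000, 0.005))    # not far from n*p = 10
            print("Pref. attachment max degree:", preferential_attachment_max_degree(2000))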

    (3) Cascading Behavior in Networks

    We can think of a network as a large circulatory system, through which information continuously flows. This diffusion of information can happen rapidly or slowly; it can be disastrous -- as in a panic or cascading failure -- or beneficial -- as in the spread of an innovation. Work in several areas has proposed models for such processes, and investigated when a network is more or less susceptible to their spread. This type of diffusion or cascade process can also be used as a design principle for network protocols. This leads to the idea of epidemic algorithms, also called gossip-based algorithms, in which information is propagated through a collection of distributed computing hosts, typically using some form of randomization.
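
    A minimal sketch of one standard diffusion model, the independent cascade: each newly activated node gets a single chance to activate each of its inactive neighbors with probability p. The graph and the activation probability below are purely illustrative assumptions, not taken from the readings.

        import random

        def independent_cascade(graph, seeds, p):
            # graph: dict mapping each node to a list of its neighbors
            active = set(seeds)
            frontier = list(seeds)
            while frontier:
                next_frontier = []
                for u in frontier:
                    for v in graph[u]:
                        # each newly active node gets one chance to activate each neighbor
                        if v not in active and random.random() < p:
                            active.add(v)
                            next_frontier.append(v)
                frontier = next_frontier
            return active

        if __name__ == "__main__":
            # a small ring with chords, used only for illustration
            n = 100
            graph = {i: [(i - 1) % n, (i + 1) % n, (i + 7) % n] for i in range(n)}
            spread = independent_cascade(graph, seeds=[0], p=0.3)
            print(len(spread), "of", n, "nodes eventually activated")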

    (4) Nash Equilibria in Networks

    In order to model the interaction of agents in a large network, it often makes sense to assume that their behavior is strategic -- that each agent operates so as to optimize his/her/its own self-interest. The study of such systems involves issues at the interface of algorithms and game theory. A central definition here is that of a Nash equilibrium -- a state of the network from which no agent has an incentive to deviate -- and recent work has studied how well a system operates when it is in a Nash equilibrium.
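
    A toy way to make the definition concrete is best-response dynamics in a simple load-balancing game, sketched below: each agent picks a machine, its cost is the load on that machine, and agents keep switching to cheaper machines until no one can improve, i.e. until the assignment is a pure Nash equilibrium. The game and all parameter values are hypothetical illustrations.

        from collections import Counter

        def best_response_dynamics(num_agents=10, num_machines=3):
            # An agent's cost is the number of agents (including itself) on its machine;
            # it deviates whenever another machine would give it a strictly lower cost.
            choice = [0] * num_agents          # start with everyone on machine 0
            improved = True
            while improved:
                improved = False
                load = Counter(choice)
                for agent in range(num_agents):
                    current = choice[agent]
                    # cost of staying is load[current]; cost of moving to m is load[m] + 1
                    best = min(range(num_machines),
                               key=lambda m: load[m] + (0 if m == current else 1))
                    if best != current and load[best] + 1 < load[current]:
                        load[current] -= 1
                        load[best] += 1
                        choice[agent] = best
                        improved = True
            return choice   # a pure Nash equilibrium: no agent can lower its cost by moving

        if __name__ == "__main__":
            print(best_response_dynamics())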

    (5) Spectral Analysis of Networks

    A powerful approach to analyzing networks is to look at the eigenvalues and eigenvectors of their adjacency matrices. The connection between these parameters and the structure of the network is a subtle issue, and while many results have been established about this connection, it is still not fully understood.
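
    As a concrete starting point, the spectrum of a small graph can be computed directly. The sketch below (assuming NumPy is available) builds a symmetric adjacency matrix and returns its eigenvalues; the 6-cycle is used because its eigenvalues, 2*cos(2*pi*k/6), are known in closed form and make an easy sanity check.

        import numpy as np

        def adjacency_spectrum(edges, n):
            # Build the symmetric adjacency matrix and return its eigenvalues, largest first
            A = np.zeros((n, n))
            for u, v in edges:
                A[u, v] = A[v, u] = 1.0
            return np.sort(np.linalg.eigvalsh(A))[::-1]

        if __name__ == "__main__":
            cycle_edges = [(i, (i + 1) % 6) for i in range(6)]
            vals = adjacency_spectrum(cycle_edges, 6)
            # the top two eigenvalues of the 6-cycle are 2 and 1; the gap between them is one
            # of the quantities whose structural meaning is examined in this unit
            print(vals)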

    (6) Link Analysis for Web Search

    Link structure can be a powerful source of information about the underlying content in the network. In the context of the Web, we can try to identify high-quality information resources from the way in which other pages link to them; this idea has reflections in the analysis of citation data to find influential journals, and in the analysis of social networks to find important members of a community. From a methodological point of view, current approaches to link analysis on the Web make extensive use of methods based on eigenvalues and eigenvectors.
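
    One such eigenvector-based method is PageRank, which can be computed by power iteration. The sketch below is a simplified illustration (uniform teleportation, dangling pages spread uniformly, a hand-built toy link graph), not a description of any production system.

        def pagerank(links, damping=0.85, iterations=50):
            # links: dict mapping each page to the list of pages it links to
            pages = list(links)
            rank = {p: 1.0 / len(pages) for p in pages}
            for _ in range(iterations):
                new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
                for p in pages:
                    out = links[p]
                    if not out:
                        # dangling page: spread its score uniformly over all pages
                        for q in pages:
                            new_rank[q] += damping * rank[p] / len(pages)
                    else:
                        for q in out:
                            new_rank[q] += damping * rank[p] / len(out)
                rank = new_rank
            return rank

        if __name__ == "__main__":
            toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
            print(pagerank(toy_web))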

    (7) Clustering and Community Structures in Networks

    Clustering is one of the oldest and most well-established problems in data analysis; in the context of networks, it can be used to search for densely connected communities. A number of techniques have been applied to this problem, including combinatorial and spectral methods.
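
    One spectral heuristic is bisection by the Fiedler vector: split the nodes according to the sign of the eigenvector belonging to the second-smallest eigenvalue of the graph Laplacian L = D - A. The sketch below (assuming NumPy) applies it to two triangles joined by a single bridge edge, an illustrative graph chosen so the intended cut is obvious.

        import numpy as np

        def spectral_bisection(edges, n):
            # Partition nodes by the sign of the Fiedler vector of the Laplacian L = D - A
            A = np.zeros((n, n))
            for u, v in edges:
                A[u, v] = A[v, u] = 1.0
            L = np.diag(A.sum(axis=1)) - A
            vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
            fiedler = vecs[:, 1]
            side_a = [i for i in range(n) if fiedler[i] < 0]
            side_b = [i for i in range(n) if fiedler[i] >= 0]
            return side_a, side_b

        if __name__ == "__main__":
            # two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2,3)
            edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
            print(spectral_bisection(edges, 6))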

    (8) Labeling and Classification using Networks of Pairwise Relationships

    A task closely related to clustering is the problem of classifying the nodes of a network using a known set of labels. For example, suppose we wanted to classify Web pages into topic categories. Automated text analysis can give us an estimate of the topic of each page; but we also suspect that pages have some tendency to be similar to neighboring pages in the link structure. How should we combine these two sources of evidence? A number of probabilistic frameworks are useful for this task, including the formalism of Markov random fields, which -- for quite different applications -- has been extensively studied in computer vision.
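
    The sketch below is a deliberately simplified relaxation in the spirit of Markov random field labeling (an ICM-style update, not the formalism developed in the course): each node repeatedly picks the label that maximizes its own text-based evidence plus a bonus for agreeing with its neighbors' current labels. The coupling weight and the toy scores are assumptions made purely for illustration.

        def relational_labeling(graph, local_scores, coupling=0.5, sweeps=10):
            # graph: node -> list of neighbors; local_scores: node -> {label: evidence from text}
            labels = {v: max(local_scores[v], key=local_scores[v].get) for v in graph}
            for _ in range(sweeps):
                changed = False
                for v in graph:
                    def combined(lab):
                        # local evidence plus a reward for each neighbor sharing the label
                        agreement = sum(1 for u in graph[v] if labels[u] == lab)
                        return local_scores[v][lab] + coupling * agreement
                    best = max(local_scores[v], key=combined)
                    if best != labels[v]:
                        labels[v] = best
                        changed = True
                if not changed:
                    break
            return labels

        if __name__ == "__main__":
            graph = {0: [1], 1: [0, 2], 2: [1]}
            local_scores = {0: {"sports": 0.9, "music": 0.1},
                            1: {"sports": 0.4, "music": 0.6},   # ambiguous page
                            2: {"sports": 0.8, "music": 0.2}}
            # the ambiguous middle page is pulled toward "sports" by its two neighbors
            print(relational_labeling(graph, local_scores))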

    (9) The VC-Dimension and Some of Its Applications

    Random sampling is a useful technique in many of the types of problems we have been considering, and understanding the full power of sampling can be more subtle than it first appears. Here we try to understand the following phenomenon: when the underlying space being sampled has a certain kind of ``orderly'' structure, very small samples can provide extremely strong information. There is a general and powerful theory, the notion of VC-dimension, that explains this; after developing the concept of VC-dimension, we'll look at some of its applications in approximation algorithms, computational geometry, and efficient indexing.
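
    A small simulation makes the phenomenon concrete. Intervals on the line form a range space of VC-dimension 2, so a tiny random sample should already approximate the frequency of every interval; the sketch below checks the worst-case gap between true and sampled frequencies over a grid of intervals. The sample size and range family are arbitrary illustrative choices.

        import random

        def max_interval_discrepancy(points, sample, ranges):
            # Largest difference, over all intervals, between the true fraction of points
            # falling in the interval and the fraction of the sample falling in it
            worst = 0.0
            for lo, hi in ranges:
                true_frac = sum(lo <= x <= hi for x in points) / len(points)
                samp_frac = sum(lo <= x <= hi for x in sample) / len(sample)
                worst = max(worst, abs(true_frac - samp_frac))
            return worst

        if __name__ == "__main__":
            random.seed(0)
            points = [random.random() for _ in range(20000)]
            sample = random.sample(points, 400)     # a tiny sample of the ground set
            ranges = [(a / 20, b / 20) for a in range(20) for b in range(a + 1, 21)]
            print(max_interval_discrepancy(points, sample, ranges))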

    (10) Rank Aggregation and Meta-Search

    We have now seen a number of different techniques for ranking Web pages according to various measures of quality. Are there principled methods for combining them, to get an even better `meta-ranking'? We begin this discussion with a celebrated result of Arrow from mathematical economics, suggesting some of the trade-offs inherent in such an approach.
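
    One simple positional rule for combining rankings is the Borda count, sketched below on a hypothetical set of three input rankings; it is only one of many aggregation rules, and the trade-offs identified by Arrow's result apply to this setting as well.

        from collections import defaultdict

        def borda_aggregate(rankings):
            # rankings: list of rankings, each a list of items ordered from best to worst.
            # An item scores one point for every item ranked below it, summed over rankings.
            scores = defaultdict(int)
            for ranking in rankings:
                n = len(ranking)
                for pos, item in enumerate(ranking):
                    scores[item] += n - 1 - pos
            return sorted(scores, key=scores.get, reverse=True)

        if __name__ == "__main__":
            # e.g. the orderings produced by three different ranking functions
            rankings = [["a", "b", "c", "d"],
                        ["b", "a", "d", "c"],
                        ["a", "c", "b", "d"]]
            print(borda_aggregate(rankings))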

    Network Datasets

    There are a number of interesting network datasets available on the Web; they form a valuable resource for trying out algorithms and models across a range of settings.