Network Topology Discovery

Project Argus - Network topology discovery, monitoring, history, and visualization.

Table of Contents
Introduction
Long-term goals
Tools Available
Topology Discovery - Basic Algorithm
What about runtime?
Project Argus:
Obtaining Argus
Contact Information

Introduction

There are many situations in which we'd like to be able to determine the layout of a particular network. However, topology discovery algorithms based on SNMP and other protocols are often insufficient for reliably mapping the entire network. We want to be able to map the topology of a network while keeping assumptions about the network to a minimum. The goal is to develop algorithms for topology discovery that are:

Fast - the algorithms should be fast, to allow near-real-time mapping of network topology, and to ensure greater internal consistency of data. An algorithm that cannot map fast enough to keep up with changes in the network will be of limited use.
Complete - to be useful, the algorithm should correctly discover the majority of the hosts and routers within a particular network, with a minimum of errors.
Accurate - it should not make egregious mistakes.
Efficient - the algorithm should not consume an undue amount of network resources. It should be able to do its work with a minimum load on the network.

This project is part of the Cornell Network Research Group.

Long-term goals

To develop a set of network tools capable of:

Automatic topology discovery, in a wide variety of network environments.
Network status monitoring
Service monitoring - mail, ftp, web, snmp, dns.
Visualization of complex topologies
Statistical data gathering and correlation - able to correlate changes in network conditions (traffic, delay time, etc.) to any set of user-defined conditions (time of day, national holidays, Super Bowl, etc.)

Tools Available

Ping. Everybody supports ping. We can use this to determine if hosts are alive/dead. Broadcast ping is also useful, but not universally supported.
Traceroute. Discovers the route between the sender and the targeted host by sending packets with varying TTLs. Traceroute will discover the path accurately in most cases, except when a network manager chooses to hide internal topology by manipulating ICMP TTL-expired packets.
SNMP. One of the most powerful tools at our disposal, we can use this to discover almost anything we could want to know quickly and easily. However, it is frequently not supported, or restricted when it is supported.
DNS ls. Fast and accurate, but not always available.
NIC information.

Topology Discovery - Basic Algorithm

Start with a temporary set of possible valid addresses.
Using ping, validate these addresses and add valid addresses to a permanent set. Traceroute to determine connectedness.
Applying various heuristics to the permanent set, generate new addresses to add to the temporary set.
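
The three steps above can be sketched as a simple worklist loop. This is a minimal illustration, not the Argus code: the names discover, is_alive, expand, and max_probes are hypothetical, the probe and the heuristics are passed in as plain functions, and the traceroute step is left out.

```python
from typing import Callable, Iterable, Set


def discover(
    seeds: Iterable[str],
    is_alive: Callable[[str], bool],
    expand: Callable[[str], Iterable[str]],
    max_probes: int = 10_000,
) -> Set[str]:
    """Return the permanent set: every address that answered a probe."""
    temporary = list(dict.fromkeys(seeds))  # candidates not yet probed
    seen = set(temporary)                   # everything ever queued
    permanent: Set[str] = set()             # addresses confirmed alive
    probes = 0
    while temporary and probes < max_probes:
        addr = temporary.pop()
        probes += 1
        if not is_alive(addr):              # step 2: validate with a probe
            continue
        permanent.add(addr)
        for candidate in expand(addr):      # step 3: heuristics propose more
            if candidate not in seen:
                seen.add(candidate)
                temporary.append(candidate)
    return permanent
```

The max_probes cap is an assumption added here so that an aggressive heuristic cannot keep the loop running indefinitely.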

How do we add new addresses to the current set?

For a live host with address a.b.c.d, the subnet typically includes a "corresponding" address, one of a.b.c.1, a.b.c.65, a.b.c.129, or a.b.c.193, which is usually the router (these are the first host addresses under 24-, 25-, and 26-bit subnet masks). Most hosts that are alive within the same subnet also have "close" addresses; that is, if a.b.c.d exists, then there is a good chance that a.b.c.(d+1) exists as well.

Based on this, if we know a host is alive, we can generate additional addresses. We add the next N addresses to our list (where N is an "aggressiveness" or "persistence" factor that can be adjusted), and if the address ends in 1, 65, 129, or 193, then it may be a router and we also add N random addresses with the same prefix to our list.
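
A rough sketch of this address-generation rule follows. It assumes that "the same prefix" means the a.b.c (24-bit) prefix, and the function name generate_candidates and its parameters are made up for illustration.

```python
import ipaddress
import random
from typing import List, Optional

ROUTER_SUFFIXES = {1, 65, 129, 193}  # last octets that often belong to routers


def generate_candidates(
    addr: str, n: int, rng: Optional[random.Random] = None
) -> List[str]:
    """Propose new addresses to probe, given one address known to be alive."""
    rng = rng or random.Random()
    value = int(ipaddress.IPv4Address(addr))
    base, last = value & 0xFFFFFF00, value & 0xFF

    # Rule 1: the next N addresses, staying inside a.b.c.1 - a.b.c.254.
    suffixes = [last + k for k in range(1, n + 1) if last + k <= 254]

    # Rule 2: a likely router also gets N random addresses in the same prefix.
    if last in ROUTER_SUFFIXES:
        pool = [s for s in range(1, 255) if s != last]
        suffixes.extend(rng.sample(pool, min(n, len(pool))))

    unique = list(dict.fromkeys(suffixes))  # drop repeats, keep order
    return [str(ipaddress.IPv4Address(base + s)) for s in unique]
```

For example, generate_candidates("128.84.1.10", 3) returns the three addresses 128.84.1.11 through 128.84.1.13, while a call for 128.84.1.1 also returns up to three random addresses under 128.84.1.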

This describes the guessing/probing algorithm. It makes the fewest assumptions about the network and is thus applicable almost anywhere. There are other ways to grab data; read on.

How do we generate an initial set of addresses?

Normally the user supplies an initial list of addresses. Theoretically, all the program needs is one good address to get started, but completeness of the results is extremely sensitive to the initial set. In our testing against the cs.cornell.edu domain, we generate an initial set of all addresses matching 128.84.*.1 (of course, most of the addresses in this space do not correspond to actual hosts).
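
As a small illustration, a seed set of the 128.84.*.1 form could be built like this. The helper name initial_seeds is hypothetical, and the sketch assumes the wildcard ranges over every value 0-255 of the third octet.

```python
from typing import List


def initial_seeds(prefix: str = "128.84", host: int = 1) -> List[str]:
    """Build one candidate per third-octet value, e.g. 128.84.0.1 ... 128.84.255.1."""
    return [f"{prefix}.{third}.{host}" for third in range(256)]


seeds = initial_seeds()  # 256 candidate addresses, most of them probably dead
```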

What else can we use?

The guessing/probing algorithm described above makes the fewest assumptions about a given network. However, we may have other information available, and we'd like to be able to use it. In particular, we want to be able to use:

Broadcast ping. Where supported, this is a useful way to rapidly grab all the hosts in a subnet.
DNS ls, to determine all the nodes in a network.
SNMP. We could recursively use SNMP queries to download routing tables, etc. We can get a LOT of information this way.

What about runtime?

Initial attempts at this last year showed that these algorithms can be quite expensive in runtime, especially when they generate a lot of invalid addresses to check. Since then, we have made dramatic strides (orders of magnitude!) towards improving the runtime, to the point where it is no longer a terribly significant issue.

Fast Batch Pinging

Last semester we determined that the primary bottleneck in performance was in using ping to verify the existence of hosts. A ping to a live host returns fairly quickly, but a ping to a dead host, using the UNIX ping(1M), takes at least 1 second. In particular, in the cases where the algorithm was aggressively discovering the topology, and thus generating a large proportion of addresses corresponding to dead or non-existent hosts, the program could spend as much as 80-90% of its time waiting for pings to come back. We needed a way to minimize the time spent waiting for pings to time out.

We developed a version of ping that sends its ICMP echo-request packets in parallel at short intervals, allowing us to ping many hosts at the same time. Further, by using setitimer(2) and related routines, we can set timeouts of less than one second. This allows us, for instance, to ping 40 hosts with a timeout of 50ms.
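
The idea of staggered parallel probes with a sub-second timeout can be sketched as follows. This is not the Argus batch ping: it uses Python's asyncio timeouts in place of setitimer(2), and the actual probe is a caller-supplied coroutine (the hypothetical probe parameter), because sending real ICMP echo requests needs raw sockets and usually elevated privileges.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def batch_probe(
    hosts: Iterable[str],
    probe: Callable[[str], Awaitable[None]],
    timeout: float = 0.05,         # 50 ms per host, as in the example above
    send_interval: float = 0.001,  # assumed gap between successive sends
) -> Dict[str, bool]:
    """Probe all hosts concurrently; map each host to True if it answered in time."""

    async def one(host: str, delay: float) -> bool:
        await asyncio.sleep(delay)  # stagger the sends at short intervals
        try:
            await asyncio.wait_for(probe(host), timeout)
            return True
        except asyncio.TimeoutError:
            return False            # no reply within the timeout

    host_list = list(hosts)
    results = await asyncio.gather(
        *(one(h, i * send_interval) for i, h in enumerate(host_list))
    )
    return dict(zip(host_list, results))
```

With these assumed defaults, 40 hosts finish in roughly 40 x 1 ms + 50 ms even if every one of them is dead, instead of 40 sequential one-second timeouts.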

Fast Traceroute

The "traditional" traceroute determines the route taken by sending successive packets with increasing TTLs. After sending each packet, it must wait for the reply before sending the next. Our fast traceroute program sends out packets with different TTLs in parallel. The savings in time can become quite dramatic when tracerouting to distant locations over slow links.
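
A back-of-the-envelope model shows where the savings come from. It assumes one probe per hop and that every hop replies; the round-trip times and the function names are invented for illustration and are not measurements.

```python
from typing import Sequence


def sequential_time(hop_rtts: Sequence[float]) -> float:
    """Traditional traceroute: wait for each hop's reply before the next probe."""
    return sum(hop_rtts)


def parallel_time(hop_rtts: Sequence[float], send_interval: float = 0.0) -> float:
    """Parallel probes: finished when the slowest reply arrives."""
    return max(
        (i * send_interval + rtt for i, rtt in enumerate(hop_rtts)),
        default=0.0,
    )


# Hypothetical round-trip times in seconds for a five-hop path over a slow link.
rtts = [0.01, 0.02, 0.15, 0.30, 0.32]
slow = sequential_time(rtts)  # about 0.80 s: the sum of all round trips
fast = parallel_time(rtts)    # 0.32 s: the single longest round trip
```

In this model the sequential cost grows with the sum of the per-hop round trips, while the parallel cost is bounded by the slowest one, which is why distant targets benefit most.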

Both fast traceroute and batch ping can be obtained with the Argus alpha release.

In-Memory Database

We are re-implementing the topology database routines to do everything in memory (saving only on completion). We're still working on this; there are some puzzling bugs, but it will be done soon. Preliminary tests suggest it can cut runtime for the cornell.edu domain from 20 minutes down to as little as 7 minutes.

Further optimizations to the original code include elimination of redundant DNS lookups, caching of lookups, and general tweaking.

Overall, we have been able to cut runtime dramatically:

Domain            Before          After
cs.cornell.edu    58 minutes      1 minute 50 seconds
cornell.edu       1080 minutes    20 minutes
Project Argus

[Figure: argus.gif]

As you can see, within each autonomous system, there exists a local domain manager (brain). This brain contains a local topology database (including IP-level topology, history, statistics, whatever else we care to track) and a partial backbone topology containing information about the backbone topology "near" the AS.

Each brain can control a number of agents (eyes). The eye is capable of responding to requests from the brain for information, such as "check this set of addresses" or "traceroute to these hosts" or "watch that router and let me know if it goes down." In the event that the eye should lose its connection with the brain, it is capable of acting autonomously for a short period of time, storing information up to some user-specified limit, and retaining information in order of importance.
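
One way an eye could keep the most important observations within a fixed limit while it is disconnected is a bounded priority buffer. The sketch below is only an assumption about how that might look: the class EyeBuffer, its methods, and the integer importance scale are hypothetical, and the page does not describe how Argus implements this.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class EyeBuffer:
    """Keep at most `limit` observations, discarding the least important first."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._limit = limit
        self._heap: List[Tuple[int, int, Any]] = []  # min-heap on importance
        self._counter = itertools.count()            # tie-breaker: arrival order

    def record(self, importance: int, observation: Any) -> None:
        entry = (importance, next(self._counter), observation)
        if len(self._heap) < self._limit:
            heapq.heappush(self._heap, entry)
        else:
            # Full: add the entry, then drop whichever entry is least important
            # (among equally important entries, the oldest goes first).
            heapq.heappushpop(self._heap, entry)

    def drain(self) -> List[Any]:
        """Return stored observations, most important first, and empty the buffer."""
        items = sorted(self._heap, key=lambda e: (-e[0], e[1]))
        self._heap = []
        return [obs for _, _, obs in items]
```

When the connection to the brain returns, the eye would call drain() and report the surviving observations in order of importance.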

The backbone manager (mastermind) combines information from various brains to form a complete picture of the network.

With this structure, we can do fun things, such as:

Watch changes in network topology in near-real-time. Brains can tell eyes to periodically check up on various subnets, routers, etc., rediscovering if necessary. When the topology of the network changes, the eyes will detect and report the change. You can watch, for instance, how the network topology changes when a router is taken down for emergency maintenance, or watch how gradual upgrades are grafted onto the existing network. Used in conjunction with statistical information, this is a powerful network management tool.
Statistics. The eyes can monitor and collect statistics about traffic, delay, stability, uptime/downtime, services, and more, over selected links or particularly important hosts/routers. We can see from this which routers are unreliable, which hosts are behaving funny, what services are the most popular where, when and where delays suddenly jump, and more.
Correlation. We can track various statistics against various time criteria. We can answer such questions as "what times do people prefer to use email/web/ftp?", "is there a big drop in network traffic as students leave for break?", "do major sporting events have an impact on network traffic?", and more. We can investigate such speculation as "the business school likes to use the internet during the day, while computer science people prefer late evenings."
History. We can watch, for instance, the upgrade of the Cornell network to gigabit ethernet. We can watch as new routers are added, where they're connected, and when old routers go offline. Moreover, we have the aforementioned data on performance, so we can assess the real impact of any upgrade on actual performance. This is just one of many possible applications for history.
Visualization. We want to be able to visualize the network at varying levels of detail, with collapsible clouds, zoom, color-coding, etc.

Obtaining Project Argus source code

The alpha release of the source code is available as a gzipped tar archive. It should be straightforward to set up; we've tested it on Linux and Solaris. Get it here.

Contact Information

Email: Srinivasan Keshav, Cristian Estan, Haye Chan, Walter Chang

This page is maintained by Walter Chang. Send him questions, comments, and gripes.