Project Argus - Network topology discovery, monitoring, history, and visualization.
Table of Contents
Topology Discovery - Basic Algorithm
What about runtime?
There are many situations in which we'd like to be able to determine the layout of a particular network. However, topology discovery algorithms based on SNMP and other protocols are often insufficient for reliably mapping the entire network. We want to be able to map the topology of a network while keeping assumptions about the network to a minimum. The goal is to develop algorithms for topology discovery that are:
This project is part of the Cornell Network Research Group.
To develop a set of network tools capable of:
Topology Discovery - Basic Algorithm
How do we add new addresses to the current set?
For a live host with address a.b.c.d, the subnet typically includes a "corresponding" address of a.b.c.1, a.b.c.65, a.b.c.129, or a.b.c.193, which is usually the router (these correspond to 24-, 25-, and 26-bit subnet masks). Moreover, live hosts within the same subnet tend to have "close" addresses; that is, if a.b.c.d exists, then there is a good chance that a.b.c.(d+1) exists as well.
Based on this, if we know a host is alive, we can generate additional candidate addresses. We add the next N addresses to our list (where N is an adjustable "aggressiveness" or "persistence" factor), and if the address ends in 1, 65, 129, or 193, then it may be a router, so we also add N random addresses with the same prefix to our list.
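The heuristic above can be sketched in a few lines of Python. This is our illustration, not the Argus code; the function and variable names (guess_addresses, ROUTER_SUFFIXES) are ours.

```python
import random

# Addresses ending in these octets often belong to routers
# (24-, 25-, and 26-bit subnet masks).
ROUTER_SUFFIXES = {1, 65, 129, 193}

def guess_addresses(live_addr, n):
    """Given one live host a.b.c.d, return candidate addresses to probe.

    n is the "aggressiveness"/"persistence" factor N from the text.
    """
    a, b, c, d = (int(x) for x in live_addr.split("."))
    candidates = []
    # "Close" addresses: the next N hosts after d.
    for offset in range(1, n + 1):
        if d + offset <= 254:
            candidates.append(f"{a}.{b}.{c}.{d + offset}")
    # If this looks like a router address, also sample N random
    # hosts sharing the same prefix.
    if d in ROUTER_SUFFIXES:
        for _ in range(n):
            candidates.append(f"{a}.{b}.{c}.{random.randint(1, 254)}")
    return candidates
```

Each newly discovered live host is fed back through the same function, so one good address can seed the exploration of an entire subnet.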
This describes the guessing/probing algorithm. It makes the fewest assumptions about the network and is thus applicable almost anywhere. There are other ways to grab data; read on.
How do we generate an initial set of addresses?
Normally the user supplies an initial list of addresses. In principle, all the program needs is one good address to get started, but the completeness of the results is extremely sensitive to the initial set. In our testing against the cs.cornell.edu domain, we generate an initial set of all addresses matching 128.84.*.1 (of course, most of the addresses in this space do not correspond to actual hosts).
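As a sketch, the 128.84.*.1 initial set described above could be generated like this (the variable name is ours, for illustration):

```python
# All addresses matching 128.84.*.1 -- one candidate router
# address per possible /24 in the 128.84.0.0/16 space.
initial_set = [f"128.84.{third}.1" for third in range(256)]
```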
What else can we use?
The guessing/probing algorithm described above makes the fewest assumptions about a given network. However, we may have other information available, and we'd like to be able to use it. In particular, we want to be able to do:
What about runtime?
Initial attempts last year showed that these algorithms can be quite expensive at runtime, especially when they generate many invalid addresses to check. Since then, we have made dramatic strides (orders of magnitude!) in improving the runtime, to the point where it is no longer a significant issue.
Fast Batch Pinging
Last semester we determined that the primary performance bottleneck was using ping to verify the existence of hosts. A ping to a live host returns fairly quickly, but a ping to a dead host, using the stock UNIX ping(1M), takes at least one second. In particular, when the algorithm was aggressively discovering the topology, and thus generating a large proportion of addresses corresponding to dead or non-existent hosts, the program could spend as much as 80-90% of its time waiting for pings to come back. We needed a way to minimize the time spent waiting for pings to time out.
We developed a version of ping that sends its ICMP echo-request packets in parallel at short intervals, allowing us to ping many hosts at the same time. Further, by using setitimer(2) and related routines, we can set timeouts of less than one second. This allows us, for instance, to ping 40 hosts with a timeout of 50 ms.
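The scheduling idea behind batch ping can be sketched as follows. This is only our illustration of the concurrency pattern: probe() here is a stand-in oracle, whereas the real tool sends raw ICMP echo requests (which requires root) and uses setitimer(2) for sub-second timeouts.

```python
from concurrent.futures import ThreadPoolExecutor

def probe(addr, timeout_s=0.05):
    """Stand-in probe: in real code, send an ICMP echo request and
    wait up to timeout_s for the reply.  Here we fake the answer so
    the sketch is self-contained."""
    return addr.endswith(".1")  # pretend only .1 addresses answer

def batch_ping(addrs, workers=40):
    """Probe all addresses in parallel; return those that responded.

    With 40 workers and a 50 ms timeout, a batch of 40 dead hosts
    costs ~50 ms total instead of ~40 seconds sequentially.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        alive = pool.map(probe, addrs)
    return [a for a, ok in zip(addrs, alive) if ok]
```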
The "traditional" traceroute determines the route to a host by sending successive packets with increasing TTLs, waiting for each reply before sending the next. Our fast traceroute program sends out packets with different TTLs in parallel. The time savings can become quite dramatic when tracerouting to distant locations over slow links.
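The parallel-TTL idea can be sketched the same way as batch ping. Again this is our illustration only: probe_ttl() stands in for a raw-socket probe that would send a packet with the given TTL and read back the ICMP time-exceeded reply.

```python
from concurrent.futures import ThreadPoolExecutor

def probe_ttl(ttl):
    """Stand-in probe: pretend the router at hop `ttl` replied.
    Real code would send a UDP/ICMP packet with this TTL and
    parse the ICMP time-exceeded response."""
    return f"hop-{ttl}.example.net"

def fast_traceroute(max_hops=8):
    """Launch one probe per TTL concurrently and collect replies.

    Total wait is roughly one round-trip time, not the sum of
    max_hops round trips as in sequential traceroute.
    """
    with ThreadPoolExecutor(max_workers=max_hops) as pool:
        return list(pool.map(probe_ttl, range(1, max_hops + 1)))
```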
Both fast traceroute and batch ping can be obtained with the Argus alpha release.
Re-implementing the topology database routines to do everything in memory (saving to disk only on completion) is still in progress; there are some puzzling bugs, but preliminary tests suggest it can cut runtime for the cornell.edu domain from 20 minutes down to about 7 minutes.
Further optimizations to the original code include elimination of redundant DNS lookups, caching of lookups, and general tweaking.
Overall, we have been able to cut runtime dramatically:
    Domain              Before          After
    ------              ------          -----
    cs.cornell.edu      58 minutes      1 minute 50 seconds
    cornell.edu         1080 minutes    20 minutes
Within each autonomous system (AS), there is a local domain manager (brain). The brain maintains a local topology database (including IP-level topology, history, statistics, and whatever else we care to track) and a partial backbone topology containing information about the backbone "near" the AS.
Each brain can control a number of agents (eyes). The eye is capable of responding to requests from the brain for information, such as "check this set of addresses" or "traceroute to these hosts" or "watch that router and let me know if it goes down." In the event that the eye should lose its connection with the brain, it is capable of acting autonomously for a short period of time, storing information up to some user-specified limit, and retaining information in order of importance.
The backbone manager (mastermind) combines information from various brains to form a complete picture of the network.
With this structure, we can do fun things, such as:
Obtaining Project Argus source code
The alpha release of the source code is available as a gzipped tar archive. It should be straightforward to set up; we've tested it on Linux and Solaris. Get it here.
Email: Srinivasan Keshav, Cristian Estan, Haye Chan, Walter Chang
This page maintained by Walter Chang. Send him questions, comments, and gripes.