Argus - an extensible toolkit for efficient and adaptable network discovery and monitoring

There has been a significant amount of work on network topology discovery, visualization, and monitoring. This work can be divided into two distinct areas: work centered on the Internet and work centered on the enterprise network.

Existing tools for Internet discovery and monitoring

The work in this category has been done mainly by academics and small companies. Because most of the Internet is not owned by the party analyzing it, the tools can only observe the network. Discovery and monitoring are usually done by sending probes into the network (e.g., ping, traceroute). The problems addressed are usually data collection, visualization, and efficient correlation of the data gathered at different vantage points. Most of the work I mention here is in progress; few articles have been published on this topic.
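The core idea behind traceroute-style probing is to send packets with increasing TTL values and record which router answers each one. The sketch below illustrates this against a hypothetical, hard-coded route table (the hop names are invented); a real tool would send UDP or ICMP probes and read the ICMP Time Exceeded replies.

```python
# Sketch of TTL-based path discovery, in the style of traceroute.
# ROUTES stands in for the real network; hop names are hypothetical.

ROUTES = {  # destination -> ordered list of hops ending at the destination
    "host-b": ["r1", "r2", "r3", "host-b"],
}

def probe(dest, ttl):
    """Return the address that answers a probe sent toward dest with this TTL.

    The ttl-th hop decrements TTL to zero and replies; once the TTL exceeds
    the path length, the destination itself answers.
    """
    route = ROUTES[dest]
    return route[min(ttl - 1, len(route) - 1)]

def trace(dest, max_ttl=30):
    """Discover the hop sequence to dest by probing with TTL 1, 2, ..."""
    path = []
    for ttl in range(1, max_ttl + 1):
        hop = probe(dest, ttl)
        path.append(hop)
        if hop == dest:
            break
    return path

print(trace("host-b"))  # ['r1', 'r2', 'r3', 'host-b']
```

A real implementation also has to deal with routers that never reply and with routes that change between probes, which is one source of the inconsistent data discussed later in this document.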

Felix

The Felix project at Bellcore started in September 1997. Their work focuses on five main components. Because they measure delays and other network parameters by looking at packets circulating between their own monitors, they can cover only a very small part of the Internet. The focus of their approach is on developing linear decomposition algorithms that allow fast processing of the collected data into a database, but no results are publicly available.

CAIDA

CAIDA is an organization that focuses on Internet topology discovery and monitoring, and has developed several tools for this purpose. For backbone topology visualization they use graph layout code written by Bill Cheswick of Bell Labs and Hal Burch of CMU. The tool is based on an annealing algorithm; for a tree with 80,000 nodes, a typical layout run takes 24 CPU hours on a 400 MHz Pentium. The layout is non-geographical, and they plan to develop a 3D representation.
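The Cheswick/Burch layout code itself is not public, but the general annealing approach can be sketched as follows: perturb node positions at random, always accept moves that lower an energy function, and accept worse moves with a probability that shrinks as the "temperature" cools. The energy term and cooling schedule below are assumptions for illustration, not their actual algorithm.

```python
import math
import random

# Toy sketch of annealing-based graph layout.  The energy function
# (squared deviation of edge lengths from an ideal length) and the linear
# cooling schedule are illustrative assumptions.

def energy(pos, edges, ideal=1.0):
    """Sum of squared deviations of edge lengths from the ideal length."""
    e = 0.0
    for a, b in edges:
        (xa, ya), (xb, yb) = pos[a], pos[b]
        d = math.hypot(xa - xb, ya - yb)
        e += (d - ideal) ** 2
    return e

def anneal_layout(nodes, edges, steps=5000, t0=1.0, seed=0):
    rng = random.Random(seed)
    pos = {n: (rng.random(), rng.random()) for n in nodes}
    cur = energy(pos, edges)
    best, best_pos = cur, dict(pos)
    for i in range(steps):
        t = max(t0 * (1 - i / steps), 1e-9)   # linear cooling schedule
        n = rng.choice(nodes)
        old = pos[n]
        pos[n] = (old[0] + rng.gauss(0, 0.1), old[1] + rng.gauss(0, 0.1))
        new = energy(pos, edges)
        # Metropolis rule: always accept improvements, sometimes accept
        # worse moves while the temperature is still high.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best:
                best, best_pos = cur, dict(pos)
        else:
            pos[n] = old                       # reject the move
    return best_pos, best
```

The 24-CPU-hour figure quoted above gives a sense of how expensive this gets at 80,000 nodes: every perturbation requires re-evaluating the energy of the affected edges, and good layouts need many cooling steps.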

MIDS

Matrix Information and Directory Services is a company that claims to be the oldest in Internet analysis. They have products that provide visualization of Internet geography, using beacons that perform active monitoring. The information they collect is at a very high level, on the scale of countries. They are mostly concerned with backbone connectivity and its correspondence to geographical locations, as well as related statistics such as latencies. Based on this information, they can rate ISPs by latency, packet loss, overall throughput, reliability, and speed of repair. The process by which they generate their data is proprietary.

Octopus

This is a set of tools implementing heuristics for network discovery. It was developed at Cornell in 1998, and my work will be based on these tools. A detailed description of this project can be found in this paper.

Enterprise level discovery and monitoring

The work in this category is done mainly by large software companies. Their products are expensive and aim to address the problem of network management; discovery and monitoring is only a small part of their integrated solutions. They fully control the devices on the network using SNMP. I have never used any of their tools, but the opinion of those who have is that the tools do not work where SNMP is not deployed and are not suited for discovery outside one's own network. There are several significant current products addressing this problem.

Description of Argus project

I am working on this project together with two undergraduates, Walter Chang and Haye Chen, under the supervision of S. Keshav. Modifications to ping and traceroute made by Haye and Walter yielded dramatic performance improvements (from 1000 minutes to 10 for discovering Cornell). The purpose of this work is to further improve the existing implementation (Octopus) and add new functionality. An important point is that we will actively distribute our toolkit so that we can measure its behavior in other settings as well. The work can be conceptually divided into two parts, local domain discovery and backbone discovery, but many of the algorithms and methods will be common.
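The document does not spell out the ping/traceroute modifications, but one standard way to get speedups of that magnitude is to issue probes concurrently instead of sequentially, so that timeout waits for silent hosts overlap. The sketch below is a hypothetical illustration: `probe` is a stand-in for a real ping, and the host list and timings are invented.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical illustration of concurrent probing.  Sequentially, 20 probes
# at 50 ms each cost ~1 s; with 20 workers the waits overlap and the whole
# sweep costs roughly one probe's worth of wall-clock time.

HOSTS = ["10.0.0.%d" % i for i in range(1, 21)]

def probe(host):
    time.sleep(0.05)  # stand-in for a real round-trip / timeout wait
    # Pretend hosts whose address ends in 1, 3, or 5 answer; this is fake data.
    return host, host.endswith(("1", "3", "5"))

def sweep(hosts, workers=20):
    """Probe all hosts concurrently; return {host: answered?}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(probe, hosts))

alive = sweep(HOSTS)
```

The same idea applies to traceroute: probes for different TTLs (or different destinations) can be outstanding at the same time, which is likely where most of the 1000-to-10-minute improvement comes from.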

To achieve these goals we are currently implementing an easily extensible framework into which the current algorithms (now stand-alone scripts) will be integrated. This will also let us use the various algorithms interchangeably, choosing the one best suited for the task at hand. Another major aim is automating tasks that until now were done manually. An important problem that the algorithms must address, and that will receive much attention in the framework, is dealing with inconsistent data. The reasons for inconsistent data are: incorrect sources of information, rapid change in the network, and the heuristic nature of many of our algorithms.
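One simple way a framework could reconcile inconsistent observations is sketched below: each probe reports whether it saw a given link, along with a timestamp, and the framework keeps the majority verdict per link, breaking ties in favor of the most recent report. Both the policy and the data shapes here are illustrative assumptions, not the project's actual design.

```python
from collections import defaultdict

# Sketch of majority-vote reconciliation of conflicting link observations.
# A report is a (link, seen, timestamp) tuple; the tie-breaking policy
# (trust the newest report) is an assumption for illustration.

def reconcile(reports):
    """Collapse conflicting reports into one verdict per link."""
    by_link = defaultdict(list)
    for link, seen, ts in reports:
        by_link[link].append((seen, ts))
    verdict = {}
    for link, obs in by_link.items():
        yes = sum(1 for seen, _ in obs if seen)
        no = len(obs) - yes
        if yes != no:
            verdict[link] = yes > no            # clear majority
        else:
            # Tie: the network may have changed, so trust the newest report.
            verdict[link] = max(obs, key=lambda o: o[1])[0]
    return verdict
```

Weighting reports by the reliability of their source (a heuristic guess versus a direct probe) would be a natural refinement of this scheme.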

Besides the usual metrics for algorithms (memory usage and running time), there are several metrics specific to topology discovery and monitoring: accuracy of the output topology, completeness (the percentage of the target domain that was discovered), the amount of network traffic generated during discovery, and the adaptability of the algorithm to various network conditions and topologies.
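Given a ground-truth topology to compare against, the accuracy and completeness metrics above can be computed directly over edge sets. The precise definitions below are assumptions (completeness as the fraction of real edges found, accuracy as the fraction of reported edges that are real); the example topologies are invented.

```python
# Sketch of completeness/accuracy over edge sets; definitions are assumed.

def completeness(discovered, actual):
    """Fraction of the real topology that was discovered."""
    return len(discovered & actual) / len(actual)

def accuracy(discovered, actual):
    """Fraction of the reported topology that is actually real."""
    return len(discovered & actual) / len(discovered)

actual = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")}
discovered = {("a", "b"), ("b", "c"), ("b", "d")}   # one edge is spurious

print(completeness(discovered, actual))  # 0.5
print(accuracy(discovered, actual))      # 0.666...
```

In practice ground truth is only available for networks one administers (such as Cornell's), which is one reason measuring the toolkit in other settings matters.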

Improvements will be achieved by several means.

We will also add new functionality to the toolkit: tracking the history of changes in the topology, continuous monitoring of selected network elements, and possibly a better visualization tool. A more detailed description of the current status of the project is available.

Research question

The research question I want to address in the paper is the design of the framework. I will use performance measurements to determine the relative merits of the various algorithms and features of the framework. The main evaluation metrics will be time to completion, amount of traffic generated, accuracy, and completeness.