Argus - an extensible toolkit for efficient and adaptable network discovery and monitoring

There has been a significant amount of work on network topology discovery, visualization, and monitoring. This work can be divided into two distinct areas: work centered on the Internet and work centered on the enterprise network.

Existing tools for Internet discovery and monitoring

The work in this category has been done mainly by academics and small companies. Because most of the Internet is not owned by the party analyzing it, the tools can only observe the network. Discovery and monitoring are usually done by sending probes into the network (e.g., ping, traceroute). The problems addressed are usually data collection, visualization, and efficient correlation of the data gathered at different vantage points. Most of the work I mention here is in progress; few articles have been published on this topic.
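The core idea behind traceroute-style probing is to send packets with increasing TTL values and record which router answers each one. The sketch below illustrates this against a hypothetical, hard-coded route table (the hop names are invented); a real tool would send UDP or ICMP probes and read the ICMP Time Exceeded replies.

```python
# Sketch of TTL-based path discovery, in the style of traceroute.
# ROUTES stands in for the real network; hop names are hypothetical.

ROUTES = {  # destination -> ordered list of hops ending at the destination
    "host-b": ["r1", "r2", "r3", "host-b"],
}

def probe(dest, ttl):
    """Return the address that answers a probe sent toward dest with this TTL.

    The ttl-th hop decrements TTL to zero and replies; once the TTL exceeds
    the path length, the destination itself answers.
    """
    route = ROUTES[dest]
    return route[min(ttl - 1, len(route) - 1)]

def trace(dest, max_ttl=30):
    """Discover the hop sequence to dest by probing with TTL 1, 2, ..."""
    path = []
    for ttl in range(1, max_ttl + 1):
        hop = probe(dest, ttl)
        path.append(hop)
        if hop == dest:
            break
    return path

print(trace("host-b"))  # ['r1', 'r2', 'r3', 'host-b']
```

A real implementation also has to deal with routers that never reply and with routes that change between probes, which is one source of the inconsistent data discussed later in this document.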

Felix

The Felix project at Bellcore started in September 1997. Their work focuses on five main components. Because they measure delays and other network parameters by looking at packets circulating between their own monitors, they can cover only a very small part of the Internet. The focus of their approach is on developing linear decomposition algorithms that allow fast processing of the collected data into a database, but no results are publicly available.

CAIDA

CAIDA is an organization that focuses on Internet topology discovery and monitoring, and has developed several tools for this purpose. For backbone topology visualization they use graph layout code written by Bill Cheswick of Bell Labs and Hal Burch of CMU. The tool is based on an annealing algorithm; for a tree with 80,000 nodes, a typical layout run takes 24 CPU hours on a 400 MHz Pentium. The layout is non-geographical, and they plan to develop a 3D representation.
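The Cheswick/Burch layout code itself is not public, but the general annealing approach can be sketched as follows: perturb node positions at random, always accept moves that lower an energy function, and accept worse moves with a probability that shrinks as the "temperature" cools. The energy term and cooling schedule below are assumptions for illustration, not their actual algorithm.

```python
import math
import random

# Toy sketch of annealing-based graph layout.  The energy function
# (squared deviation of edge lengths from an ideal length) and the linear
# cooling schedule are illustrative assumptions.

def energy(pos, edges, ideal=1.0):
    """Sum of squared deviations of edge lengths from the ideal length."""
    e = 0.0
    for a, b in edges:
        (xa, ya), (xb, yb) = pos[a], pos[b]
        d = math.hypot(xa - xb, ya - yb)
        e += (d - ideal) ** 2
    return e

def anneal_layout(nodes, edges, steps=5000, t0=1.0, seed=0):
    rng = random.Random(seed)
    pos = {n: (rng.random(), rng.random()) for n in nodes}
    cur = energy(pos, edges)
    best, best_pos = cur, dict(pos)
    for i in range(steps):
        t = max(t0 * (1 - i / steps), 1e-9)   # linear cooling schedule
        n = rng.choice(nodes)
        old = pos[n]
        pos[n] = (old[0] + rng.gauss(0, 0.1), old[1] + rng.gauss(0, 0.1))
        new = energy(pos, edges)
        # Metropolis rule: always accept improvements, sometimes accept
        # worse moves while the temperature is still high.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best:
                best, best_pos = cur, dict(pos)
        else:
            pos[n] = old                       # reject the move
    return best_pos, best
```

The 24-CPU-hour figure quoted above gives a sense of how expensive this gets at 80,000 nodes: every perturbation requires re-evaluating the energy of the affected edges, and good layouts need many cooling steps.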

MIDS

Matrix Information and Directory Services is a company that claims to be the oldest in Internet analysis. They have products that provide visualization of Internet geography, using beacons that perform active monitoring. The information they collect is at a very high level, on the scale of countries. They are mostly concerned with backbone connectivity and its correspondence to geographical locations, as well as related statistics such as latencies. Based on this information, they can rate ISPs by latency, packet loss, overall throughput, reliability, and speed of repair. The process by which they generate their data is proprietary.

Octopus

This is a set of tools implementing heuristics for network discovery. It was developed at Cornell in 1998, and my work will be based on these tools. A detailed description of this project can be found in this paper.

Enterprise level discovery and monitoring

The work in this category is done mainly by large software companies. Their products are expensive and aim to address the problem of network management; discovery and monitoring is only a small part of their integrated solutions. They fully control the devices on the network using SNMP. I have never used any of their tools, but the opinion of those who have is that the tools do not work where SNMP is not deployed and are not suited for discovery outside one's own network. There are several significant current products addressing this problem.

Description of Argus project

I am working on this project together with two undergraduates, Walter Chang and Haye Chen, under the supervision of S. Keshav. Modifications to ping and traceroute made by Haye and Walter yielded dramatic performance improvements (from 1000 minutes to 10 for discovering Cornell). The purpose of this work is to further improve the existing implementation (Octopus) and add new functionality. An important point is that we will actively distribute our toolkit so that we can measure its behavior in other settings as well. The work can be conceptually divided into two parts, local domain discovery and backbone discovery, but many of the algorithms and methods will be common.
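The document does not spell out the ping/traceroute modifications, but one standard way to get speedups of that magnitude is to issue probes concurrently instead of sequentially, so that timeout waits for silent hosts overlap. The sketch below is a hypothetical illustration: `probe` is a stand-in for a real ping, and the host list and timings are invented.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical illustration of concurrent probing.  Sequentially, 20 probes
# at 50 ms each cost ~1 s; with 20 workers the waits overlap and the whole
# sweep costs roughly one probe's worth of wall-clock time.

HOSTS = ["10.0.0.%d" % i for i in range(1, 21)]

def probe(host):
    time.sleep(0.05)  # stand-in for a real round-trip / timeout wait
    # Pretend hosts whose address ends in 1, 3, or 5 answer; this is fake data.
    return host, host.endswith(("1", "3", "5"))

def sweep(hosts, workers=20):
    """Probe all hosts concurrently; return {host: answered?}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(probe, hosts))

alive = sweep(HOSTS)
```

The same idea applies to traceroute: probes for different TTLs (or different destinations) can be outstanding at the same time, which is likely where most of the 1000-to-10-minute improvement comes from.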

To achieve these goals we are currently implementing an easily extensible framework into which the current algorithms (now stand-alone scripts) will be integrated. This will also let us use the various algorithms interchangeably, choosing the one best suited for the task at hand. Another major aim is automating tasks that until now were done manually. An important problem that the algorithms must address, and that will receive much attention in the framework, is dealing with inconsistent data. The reasons for inconsistent data are: incorrect sources of information, rapid change in the network, and the heuristic nature of many of our algorithms.
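One simple way a framework could reconcile inconsistent observations is sketched below: each probe reports whether it saw a given link, along with a timestamp, and the framework keeps the majority verdict per link, breaking ties in favor of the most recent report. Both the policy and the data shapes here are illustrative assumptions, not the project's actual design.

```python
from collections import defaultdict

# Sketch of majority-vote reconciliation of conflicting link observations.
# A report is a (link, seen, timestamp) tuple; the tie-breaking policy
# (trust the newest report) is an assumption for illustration.

def reconcile(reports):
    """Collapse conflicting reports into one verdict per link."""
    by_link = defaultdict(list)
    for link, seen, ts in reports:
        by_link[link].append((seen, ts))
    verdict = {}
    for link, obs in by_link.items():
        yes = sum(1 for seen, _ in obs if seen)
        no = len(obs) - yes
        if yes != no:
            verdict[link] = yes > no            # clear majority
        else:
            # Tie: the network may have changed, so trust the newest report.
            verdict[link] = max(obs, key=lambda o: o[1])[0]
    return verdict
```

Weighting reports by the reliability of their source (a heuristic guess versus a direct probe) would be a natural refinement of this scheme.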

Besides the usual metrics for algorithms (memory usage and running time), there are several metrics specific to topology discovery and monitoring: accuracy of the output topology, completeness (the percentage of the target domain that was discovered), the amount of network traffic generated during discovery, and the adaptability of the algorithm to various network conditions and topologies.
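Given a ground-truth topology to compare against, the accuracy and completeness metrics above can be computed directly over edge sets. The precise definitions below are assumptions (completeness as the fraction of real edges found, accuracy as the fraction of reported edges that are real); the example topologies are invented.

```python
# Sketch of completeness/accuracy over edge sets; definitions are assumed.

def completeness(discovered, actual):
    """Fraction of the real topology that was discovered."""
    return len(discovered & actual) / len(actual)

def accuracy(discovered, actual):
    """Fraction of the reported topology that is actually real."""
    return len(discovered & actual) / len(discovered)

actual = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")}
discovered = {("a", "b"), ("b", "c"), ("b", "d")}   # one edge is spurious

print(completeness(discovered, actual))  # 0.5
print(accuracy(discovered, actual))      # 0.666...
```

In practice ground truth is only available for networks one administers (such as Cornell's), which is one reason measuring the toolkit in other settings matters.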

Improvements will be achieved by several means.

We will also add new functionality to the toolkit: tracking the history of changes in the topology, continuous monitoring of selected network elements, and possibly a better visualization tool. A more detailed description of the current status of the project is available.

Research question

The research question I want to address in the paper is the design of the framework. I will use performance measurements to determine the relative merits of the various algorithms and features of the framework. The main evaluation metrics will be time to completion, amount of traffic generated, accuracy, and completeness.