This file documents the internals of argus. The source files themselves
contain comments, some of them quite detailed; where those comments are
sufficient, this file does not duplicate the information.

List of files

checker.pl	- various IP address checkers
dbdisk.pl	- handles the disk image of the database
dbinterface.pl	- interface of the database with the workers
dboperations.pl	- operations on the database
dbstructure.pl	- a description of the data stored in the database
dns.pl		- interacting with DNS and DNS cache operations
h4-generator.pl	- generator of IP addresses by DNS listing
h5.pl		- the h5 heuristics
init.pl		- includes all files and defines the init function
inet.pl		- general functions for IP address manipulation
logging.pl	- deals with logging messages to the log file
serializer.pl	- functions for serializing and parsing data
shorties.pl	- short functions that call others (acronyms)


The internal structure of Argus's brains

The brain is structured into tasks and a scheduler. These can all run
in the same process or be distributed across several processes. The most
important point in designing these tasks is that by putting them in
separate processes we can reduce the overall running time by letting
some tasks work while others wait for answers from the network. Since
context switches are expensive, tasks are designed in such a way that
the communication among them is minimal. Batching is also employed to
reduce intertask communication, but it can reduce the accuracy of
results in some cases. This design gives us a considerable amount of
control over the parallelism we use. The ease of profiling the code also
played a role in these decisions.

The job of the scheduler is to control the other tasks. The tasks will
read inputs and produce outputs. These will be represented as queues in
the single process case and as pipes in the multiprocess one. Local data
will be stored in frames that can be explicitly manipulated so that
multiple instances of the same task don't interact in unwanted ways (when
they all run in the same process). In order to allow scheduling in both
single and multi-process settings, the tasks are written as small chunks
of processing that can be thought of as the bodies of the processing
loops.
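As an illustration, here is a minimal Python sketch of this design
(argus itself is written in Perl, and all names here are invented for
the example): tasks written as loop bodies, driven by a round-robin
scheduler, with queues as inputs and outputs and an explicit frame for
local state.

```python
from collections import deque

class Task:
    """One iteration of a processing loop: the scheduler calls step()
    repeatedly instead of the task owning its own loop."""
    def __init__(self, inbox, outbox):
        self.inbox = inbox    # input queue (would be a pipe in
        self.outbox = outbox  # the multiprocess case)
        self.frame = {}       # explicit local state, so that several
                              # instances of the same task can coexist
                              # in one process without interacting

    def step(self):
        if not self.inbox:
            return False      # nothing to do; scheduler moves on
        self.outbox.append(self.process(self.inbox.popleft()))
        return True

class Doubler(Task):          # a trivial concrete task
    def process(self, item):
        return item * 2

q_in, q_out = deque([1, 2, 3]), deque()
tasks = [Doubler(q_in, q_out)]
while any(task.step() for task in tasks):  # round-robin scheduler
    pass
print(list(q_out))  # → [2, 4, 6]
```

Because step() returns immediately when its input is empty, the same
task code runs unchanged whether the scheduler is a loop in a single
process or each task lives in its own process reading a pipe.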

The interactions with the database happen through normal function calls
that hide entirely whether the in-memory database is in the same process
or another one. The database manipulation routines ensure the inner
consistency of the database. The fact that a single process holds the
database gives a simple way of ensuring that no concurrent access to
the database occurs. When data is passed between processes it needs to
be serialized. Whenever feasible, heavyweight serialization (such as
that used for writing the database to disk) is avoided and replaced
with simpler methods. The hash with the visited IP addresses is stored
together with the database, but each worker process has its own DNS
cache.
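The lightweight path can be as simple as one record per line. This
Python sketch is purely illustrative (it is not the format serializer.pl
actually uses):

```python
# Lightweight line-based serialization for inter-process messages, in
# contrast to the heavyweight format used for the on-disk database.
# Assumes keys and values contain no tabs or newlines, which holds for
# IP addresses, DNS names and counters.
def pack(record):
    return "\t".join(f"{k}={v}" for k, v in sorted(record.items())) + "\n"

def unpack(line):
    return dict(f.split("=", 1) for f in line.rstrip("\n").split("\t"))

msg = {"ip": "128.84.1.1", "rtt_ms": "12"}
assert unpack(pack(msg)) == msg  # cheap round trip, no heavy machinery
```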

TODO: where to put the domain delay statistics?

A normal component of the discovery is generally accomplished by at least
two tasks. The central part is an IP address generator, helped by one or
more checkers. There will be one checker for each eye used (when we get to
put eyes on more hosts). If we use a two-step scheme to check IPs,
there are two checkers per eye: one doing probes with a small timeout
and the other doing large-timeout probes to make sure that no computers
are omitted. These checkers also do the traceroutes (or "partial
traceroutes") whenever needed.
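The two-step scheme can be sketched as follows (Python, with a toy
probe function standing in for the real ping machinery; the timeout
values are invented):

```python
def check(addresses, probe, small_timeout=0.5, large_timeout=5.0):
    """Two-step checking: a fast pass with a small timeout catches most
    hosts cheaply; the rest are retried with a large timeout so that
    slow or distant computers are not omitted."""
    alive, retry = [], []
    for addr in addresses:                 # first checker: small timeout
        (alive if probe(addr, small_timeout) else retry).append(addr)
    for addr in retry:                     # second checker: large timeout
        if probe(addr, large_timeout):
            alive.append(addr)
    return alive

# toy probe: a host answers only if the timeout exceeds its delay
delays = {"10.0.0.1": 0.1, "10.0.0.2": 2.0, "10.0.0.3": None}  # None = down
fake_probe = lambda addr, timeout: (delays[addr] is not None
                                    and delays[addr] < timeout)
print(check(delays, fake_probe))  # → ['10.0.0.1', '10.0.0.2']
```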

There are three types of generators
that offer the functionality of our present algorithms: one that will
act somewhat as IPmaker, one that will do DNS listings of single domains
and one that will do SNMP querying of a router. 

There are some obvious information flows: from the main program to the
generators, from the generators to their checkers, from the checkers to
the database and even from the generators to the database. Besides these
some less obvious flows are needed. Feedback from the checkers to the
generator is necessary when it guesses new IP addresses based on the ones
that proved to be valid. The generators will signal back to the main
program when they find indication that we might need to start another
generator (e.g. by finding a new subdomain or a new router). The
database manipulation routines have a secondary function too: as a side
effect they log to some special lists noteworthy events that might
result in starting new generators (e.g. an IP address is discovered
that was not covered by the set of initial networks for the domain).
There will also be some special tasks called scavengers that will give
the main program hints to start generators by looking at these special
lists and at additions to the database done by others. For performance
reasons it makes sense to keep the scheduler, the database and the
scavengers in the same process.

The generators and checkers can wait for the network, but the scheduler
and database process never does; it only gives up the processor when it
waits for information from the worker processes.


A new element in the database is the storage of delays. Delays are
stored for each host and then aggregated at the subnet and subdomain
levels. Besides being a very useful output, these aggregated times also
serve as the basis for determining the timeouts used by the checkers.
Probe points with similar expected delays are grouped into the same
batch ping to improve efficiency.
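In outline, the aggregation and the derived timeouts might look like
this (Python; the /24 grouping and the "twice the mean" rule are only
placeholders, not the actual estimator):

```python
# Per-host delays aggregated at the subnet level; the aggregate then
# feeds the checkers' timeouts.
host_delay_ms = {"128.84.1.1": 10, "128.84.1.2": 14, "128.84.2.1": 90}

def subnet_of(ip):                 # /24 grouping, for the example only
    return ip.rsplit(".", 1)[0]

subnet_delays = {}
for ip, delay in host_delay_ms.items():
    subnet_delays.setdefault(subnet_of(ip), []).append(delay)

# placeholder rule: timeout = twice the mean observed delay
timeouts = {net: 2 * sum(ds) / len(ds) for net, ds in subnet_delays.items()}
print(timeouts)  # → {'128.84.1': 24.0, '128.84.2': 180.0}
```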

Batching is a very useful method for reducing the overheads incurred by
interprocess communication. However, a very large batch size can hurt:
it can put too large bursts onto the network; it can reduce the accuracy
of the estimated timeouts, since the data obtained by probes from the
same batch cannot be used in the estimation; and it can even hurt the
accuracy of the round trip times of the batch pings themselves. The
batch sizes are easily changeable parameters of the flows the tasks use
to communicate with each other, and we should make measurements to see
where the batch sizes start to hurt.
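The batch size as a flow parameter can be sketched like this (Python;
a batch size of 1 degenerates to unbatched communication):

```python
def batches(items, size):
    """Group items into batches of at most `size` before sending them
    over an interprocess flow: one send per batch instead of per item."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly partial, batch

print(list(batches(range(5), 2)))  # → [[0, 1], [2, 3], [4]]
```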








A measurement of the performance of this implementation will be the
comparison between the single process case of argusbeta and argusalpha. 

When tracing, if we have data for the subnet we need only the last hop
(by default); if we have data about the domain, the last 3 hops; if we
only know about the parent domain, 4 hops; otherwise a full traceroute.
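That rule is small enough to state as code (Python sketch; the hop
counts are the ones given above, the predicate names are invented):

```python
def hops_needed(know_subnet, know_domain, know_parent_domain):
    """How deep to trace, given what we already know about the target.
    None means a full traceroute is required."""
    if know_subnet:
        return 1    # last hop only (the default)
    if know_domain:
        return 3    # last 3 hops
    if know_parent_domain:
        return 4
    return None     # no prior knowledge: full traceroute

assert hops_needed(True, True, True) == 1
assert hops_needed(False, True, False) == 3
assert hops_needed(False, False, False) is None
```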

If some computers remain untraced, they get picked up by the scavengers.

Each IP address belongs to at most one router. No repetitions occur in the
router's address list.



Parallelism in the brain

The brain can run in three modes: single process, forked multiprocess
and external multiprocess. In the single process mode, the brain is a
single process. In the multiprocess modes, there is a central process
holding the database and handing out jobs, and one or more workers
interacting with the network, mainly through the eye. The workers
interact with the central process through a Unix domain socket called
arguslink, located in the /tmp/ directory. The difference between the
forked and the external multiprocess modes is in how the workers are
created: in the forked case they are forked off by the central process,
and in the external case they are started by an external script. The
forked case has the advantage that all logging goes to the same log
file. The external case has the advantage that workers are started in
separate directories called worker1 .. workern, which allows profiling.

Command line options

--targetdns gives the DNS domain that will be targeted by the
discovery. If used multiple times, multiple domains will be targeted.
--targetip gives the IP address prefix to be targeted. The syntax is
address/netmask (both 128.84.0.0/255.255.0.0 and 128.84.0.0/16 are
accepted). If used multiple times, multiple prefixes will be targeted.
--limitdns gives the DNS domain that will limit the search. Computers
outside this domain will not be stored. Multiple domains can be given,
as with --targetdns. If not specified it defaults to the value of
--targetdns. If --targetdns isn't given either, it defaults to
accepting anything. It makes no sense to explicitly set this parameter
to a value that is more restrictive than --targetdns.
--limitip plays a similar role to --limitdns.
--strictdns instructs argus not to store addresses that do not have a
DNS name associated with them.
--database gives the name of the directory where the database will be
stored.
--children gives the number of children (workers) to be forked.
--identity in the external multiprocess case, gives the identity of
this process (e.g. central, worker4, worker7).
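Both --targetip syntaxes denote the same prefix. As an illustration
only (argus does its own parsing in Perl), Python's ipaddress module
accepts both forms directly:

```python
import ipaddress

def parse_targetip(spec):
    # Accepts both 128.84.0.0/255.255.0.0 and 128.84.0.0/16.
    return ipaddress.ip_network(spec, strict=False)

a = parse_targetip("128.84.0.0/255.255.0.0")
b = parse_targetip("128.84.0.0/16")
print(a == b, a)  # → True 128.84.0.0/16
```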

examples:

argus.pl --targetdns cornell.edu
discovers the cornell.edu domain

argus.pl --targetip 128.84.0.0/255.255.0.0 --targetip 128.253.0.0/16
--targetip 132.236.0.0/16
discovers the 128.84.0.0/16, 128.253.0.0/16 and 132.236.0.0/16
prefixes

argus.pl --targetip 141.210.0.0/14 --limitdns umich.edu
discovers the 141.210.0.0/14 prefix, but doesn't go outside the
umich.edu domain

argus.pl 
