A major concern of our architecture is the overhead of each layer.
Prior work on the x-kernel has demonstrated that modularization and layering do
not necessarily mean bad performance, and our initial experience confirms
this. To get an idea of the price of layering, we stacked the fragmentation
layer ten times and compared the performance to stacking it only once. We
found that the fragmentation layer adds 50 microseconds to the
one-way latency. We believe we can bring this down somewhat.
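The per-layer cost in a stacking experiment of this kind can be estimated by differencing the two measurements. The following is a minimal sketch, assuming one-way latencies measured with one copy and with ten copies of the layer; the function name and the sample values are hypothetical and are not measurements from this paper.

```python
def per_layer_overhead(latency_one, latency_many, copies_many=10):
    """Estimate the latency added by one extra copy of a layer.

    latency_one  -- one-way latency with a single copy of the layer
    latency_many -- one-way latency with `copies_many` copies stacked
    Both latencies must use the same unit; the result has that unit.
    """
    extra_copies = copies_many - 1
    if extra_copies <= 0:
        raise ValueError("copies_many must be greater than 1")
    # The difference between the two runs is caused by the extra copies only.
    return (latency_many - latency_one) / extra_copies


# Illustrative numbers only (milliseconds), not results from the paper.
example = per_layer_overhead(latency_one=2.0, latency_many=2.9)
```

Dividing by the number of extra copies rather than by the total number of copies matters here, because the single-copy run already includes one instance of the layer.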
In this section we present the
overall performance of Horus on a system of SUN Sparc10 workstations
running SunOS 4.1.3. The workstations communicate through a loaded
Ethernet consisting of multiple segments connected by a Powerhub
concentrator. When reporting performance for communication between
just two machines, the workstations reside on the same Ethernet segment.
We used two network layers: normal UDP, and UDP with the Deering hardware multicast extensions. Note, however, that in the second case Horus still uses normal UDP for point-to-point messages; in particular, if a group has only two members, only normal UDP is used. To highlight some of the performance numbers: we achieve a one-way latency of 1.2 milliseconds over the FAST:MBRSHIP:FRAG:NAK:COM:udp stack (we think we may be able to bring this down to less than a millisecond in the near future), and a throughput of 7,500 1-byte messages per second with the TOTAL:MBRSHIP:FRAG:NAK:COM:udp stack. With some help from the application, we can drive the total number of messages per second up to over 75,000 using the FC layer. We easily reach the Ethernet maximum bandwidth of 1007 Kbytes/second with a message size smaller than 1 kilobyte.
Our performance test program has each member do exactly the same thing:
in each round it sends a fixed number of messages of a given size and then
waits for a number of messages that depends on the number of members
(see figure 3). It runs this for a number of rounds and divides the results
by the number of rounds.
In the results that we present below, we have not used the FAST (message
acceleration) layer, as it does not yet provide full virtual
synchrony.
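The round structure of the test program described above can be sketched as a small in-process simulation. This is an illustration under stated assumptions, not the actual test program: in-memory queues stand in for the network, each member is assumed to wait for the messages sent by the other members, and all names (`run_rounds`, `msgs_per_round`, and so on) are hypothetical.

```python
import time
from collections import deque


def run_rounds(n_members, msgs_per_round, msg_size, rounds):
    """Simulate the send-then-wait rounds and average the cost per round."""
    inboxes = [deque() for _ in range(n_members)]
    payload = bytes(msg_size)
    received = 0
    start = time.perf_counter()
    for _ in range(rounds):
        # Send phase: every member multicasts its messages to all others.
        for sender in range(n_members):
            for _ in range(msgs_per_round):
                for dest in range(n_members):
                    if dest != sender:
                        inboxes[dest].append(payload)
        # Wait phase (assumption): each member waits for the messages
        # sent by the other members during this round.
        expected = msgs_per_round * (n_members - 1)
        for inbox in inboxes:
            for _ in range(expected):
                inbox.popleft()
                received += 1
    elapsed = time.perf_counter() - start
    return {
        "seconds_per_round": elapsed / rounds,  # results divided by rounds
        "messages_received": received,
    }


result = run_rounds(n_members=3, msgs_per_round=2, msg_size=1, rounds=5)
```

Because every member runs the same send-then-wait loop, the per-round figure obtained this way reflects the whole group rather than a single sender.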
Figure 3: The performance test protocol runs rounds in which each member
multicasts a fixed number of messages and subsequently waits for a number of
messages that depends on the number of members. Panels (a) and (b) show two
different parameter choices.
We use the terms
latency for the time between sending a message and receiving it,
bandwidth for the total number of bytes received per second,
and throughput for the total number of messages received per second.
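These three definitions can be made concrete with a short sketch that computes all of them from per-message records. It assumes each record is a hypothetical `(send_time, receive_time, size_in_bytes)` tuple with times in seconds, and it takes the measurement window to run from the first send to the last receive; that window is an assumption of this sketch, not something the definitions above specify.

```python
def summarize(records):
    """Compute latency, bandwidth, and throughput from message records.

    records -- list of (send_time, receive_time, size_in_bytes) tuples,
               one per received message, with times in seconds.
    """
    if not records:
        raise ValueError("need at least one record")
    # Latency: time between sending a message and receiving it (mean).
    latency = sum(recv - send for send, recv, _ in records) / len(records)
    # Measurement window (assumption): first send to last receive.
    window = max(recv for _, recv, _ in records) - min(
        send for send, _, _ in records
    )
    if window <= 0:
        raise ValueError("measurement window must be positive")
    total_bytes = sum(size for _, _, size in records)
    return {
        "latency_s": latency,
        "bandwidth_bytes_per_s": total_bytes / window,
        "throughput_msgs_per_s": len(records) / window,
    }


# Illustrative records only, not measurements from the paper.
stats = summarize([(0.0, 0.5, 100), (0.5, 1.0, 100), (1.0, 2.0, 200)])
```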
To measure latency, we choose
and
.
To measure bandwidth we choose a large message size (on the order of 4 Kbytes).
To measure throughput, we use 1-byte messages and a large number of messages
per round (on the order of 25).
Figure 4: The top figure compares the one-way latency of FIFO Horus messages
over straight UDP and UDP with the Deering hardware multicast extensions.
The bottom
figure compares the performance of total and FIFO order of Horus, both
over UDP multicast.
Figure 4 depicts the one-way communication latency of
Horus messages. As can be seen in the top graph, hardware multicast is
a big win, especially when the message size goes up. This is not
surprising, since without hardware multicast more messages need to
be sent to simulate the multicast, and hence quadratic behavior results.
In the bottom graph, we compare FIFO to totally ordered communication.
For small messages we get a FIFO one-way latency of about 1.5 milliseconds
and a totally ordered one-way latency of about 6.7 milliseconds. As
explained in section 4.7, the totally ordered layer is not
particularly efficient when all senders send at random and synchronously.
In case of only one sender, the one-way latency is 1.6 milliseconds for this
ordering. The next figure will show that the asynchronous message throughput
of totally ordered communication is excellent.
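The quadratic behavior noted above for simulated multicast follows from a simple packet count. The sketch below assumes that every member multicasts in each round and that, without hardware multicast, each multicast is simulated by one point-to-point packet per other member; the function and parameter names are hypothetical.

```python
def packets_per_round(n_members, msgs_per_member=1, hardware_multicast=False):
    """Count network packets sent in one round in which all members multicast."""
    if n_members < 1:
        raise ValueError("n_members must be at least 1")
    # With hardware multicast one packet reaches the whole group;
    # otherwise one copy is sent to each of the other members.
    copies = 1 if hardware_multicast else n_members - 1
    return n_members * msgs_per_member * copies


# Packet counts for a few group sizes: (size, simulated, hardware).
counts = [
    (n, packets_per_round(n), packets_per_round(n, hardware_multicast=True))
    for n in (2, 4, 8)
]
```

Under these assumptions the simulated case grows as n(n-1) packets per round while the hardware case grows linearly, and the gap in bytes on the wire widens further as the message size goes up.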
Figure 5: These graphs depict the message throughput for virtually synchronous,
FIFO ordered communication over normal UDP and Deering UDP, and for totally
ordered communication over Deering UDP.
Figure 5 shows the number of 1-byte messages per second that can be achieved for three cases. For normal UDP and Deering UDP the throughput is fairly constant, independent of the number of messages per round and going down slowly with each additional member. For totally ordered communication we see that the throughput becomes better if we send more messages per round (because of increased asynchrony). Perhaps surprisingly, the throughput also becomes better as the number of members in the group goes up. The reason for this is threefold. First, with more members there are more senders. Second, with more members it takes longer to order messages, and thus more messages can be packed together and sent out in single network packets. Last, our ordering protocol allows only one sender on the network at a time, thus introducing flow control and reducing collisions.
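The second reason, packing several pending messages into one network packet, can be illustrated with a greedy packing sketch. This is not the Horus packing code: the function name and the `max_packet_bytes` limit are hypothetical, per-message headers are ignored, and messages larger than a packet are rejected rather than fragmented.

```python
def pack_messages(messages, max_packet_bytes):
    """Greedily pack queued messages, in order, into as few packets as fit."""
    packets = []
    current = []
    current_size = 0
    for msg in messages:
        if len(msg) > max_packet_bytes:
            raise ValueError("message larger than a packet; fragmentation not modeled")
        # Start a new packet when the next message would overflow this one.
        if current and current_size + len(msg) > max_packet_bytes:
            packets.append(current)
            current = []
            current_size = 0
        current.append(msg)
        current_size += len(msg)
    if current:
        packets.append(current)
    return packets


# Ten queued 1-byte messages with a hypothetical 4-byte packet limit.
packed = pack_messages([b"x"] * 10, max_packet_bytes=4)
```

The longer messages wait to be ordered, the more of them are queued when a member gets to send, so the same number of messages travels in fewer packets.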