
Split-C Application Benchmarks

Split-C [4] is a simple parallel extension to C for programming distributed memory machines using a global address space abstraction. It is implemented on top of Generic Active Messages and is used here to demonstrate the impact of SP AM on applications written in a parallel language. Split-C has been implemented on the CM-5, the Intel Paragon, the Meiko CS-2, the Cray T3D, and a network of Sun SPARCs over U-Net/ATM, as well as on the IBM SP using both SP AM and MPL. A small set of application benchmarks is used here to compare the two SP versions of Split-C with each other and with the CM-5, Meiko CS-2, and U-Net cluster versions. Table 4 compares the machines with one another: the CM-5's processors are slower than the Meiko's and the U-Net cluster's, but its network has lower overheads and latencies. The CS-2 and the U-Net cluster have very similar characteristics. The SP has the fastest CPU and a network bandwidth comparable to the CS-2's, but a relatively high network latency.
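
To make the programming model concrete, the following fragment sketches the Split-C style these benchmarks are written in: a global pointer names memory on a remote processor, and split-phase assignment (:=) together with sync() overlaps the remote access with local work. This is an illustrative sketch following the syntax described in [4], not code from the benchmark suite; the function name accumulate is made up.

    /* Illustrative Split-C fragment (syntax as in [4]); not taken from the benchmarks. */
    void accumulate(double *global remote, double local_val)
    {
        double fetched;

        fetched := *remote;              /* split-phase get: issue the request, keep computing */
        /* ... independent local work can overlap the remote access here ... */
        sync();                          /* wait for all outstanding split-phase accesses */

        *remote := fetched + local_val;  /* split-phase put back to the remote location */
        sync();
        barrier();                       /* Split-C barrier across all PROCS processors */
    }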

Table 4: Comparison of the TMC CM-5, Meiko CS-2, U-Net ATM cluster, and IBM SP performance characteristics

The Split-C benchmark set used here consists of five programs: a blocked matrix multiply, a sample sort optimized for small messages, the same sort optimized to use bulk transfers, and two radix sorts optimized for small and large transfers, respectively. All the benchmarks have been instrumented to account separately for the time spent in local computation phases and in communication phases, such that the time spent in each can be related to the processor and network performance of the machines. The absolute execution times for runs on eight processors are shown in Table 5. Execution times normalized to the SP AM version are shown in Figure 4. A detailed explanation of the benchmarks can be found in [2].
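
The per-phase accounting mentioned above amounts to bracketing each computation and communication phase with a timer and summing into separate counters. A minimal sketch of that pattern, with a wall_seconds() helper and placeholder phase routines standing in for whatever each benchmark actually uses, is:

    #include <sys/time.h>

    /* Wall-clock helper (placeholder; each platform substitutes its own timer). */
    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    static double comp_time = 0.0, comm_time = 0.0;

    #define PHASE(accum, stmt) do {               \
            double t0_ = wall_seconds();          \
            stmt;                                 \
            (accum) += wall_seconds() - t0_;      \
        } while (0)

    /* Usage in a benchmark loop (local_sort and exchange_keys are placeholders):
     *     PHASE(comp_time, local_sort(keys, n));
     *     PHASE(comm_time, exchange_keys(keys, n));
     * comp_time and comm_time then give the breakdown reported in Figure 4. */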

Table 5: Absolute Execution Times (seconds)

Figure 4: Split-C benchmark results normalized to SP.

The two matrix multiply runs use matrices of 4 by 4 blocks with 128 by 128 double floats per block and of 16 by 16 blocks with 16 by 16 double floats per block, respectively. For large blocks, the performance of Split-C over SP AM and over MPL is the same, which can be explained by the comparable bandwidths of the two layers for large block transfers. The floating-point performance of the Power2 gives the SP an additional edge over the CM-5, the CS-2, and the U-Net/ATM cluster. For smaller blocks, however, the performance over MPL degrades significantly with respect to SP AM because of the higher message overheads. Notice that the results over SP AM exhibit a smaller network time than all the other machines. As long as the transfer sizes remain below 8064 bytes, flow control is not activated and thus overhead matters more than latency.
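
The block-size sensitivity follows from the communication structure of a blocked multiply: each remotely stored operand block is fetched as a single bulk transfer, so per-message overhead is paid once per block and is well amortized for 128 by 128 blocks but poorly for 16 by 16 blocks. The sketch below shows only that structure; bulk_get, remote_block_A/B, and matmul_block are hypothetical stand-ins for the Split-C bulk transfer and the benchmark's own routines.

    #include <stdlib.h>

    /* Hypothetical stand-ins, not the benchmark's real interface: */
    void bulk_get(void *dst, const void *remote_src, size_t nbytes);
    const void *remote_block_A(int i, int k);
    const void *remote_block_B(int k, int j);
    void matmul_block(double *c, const double *a, const double *b, int bsize);

    /* Schematic blocked multiply over n x n blocks of b x b doubles.  Each
     * bulk_get() moves one whole block, so smaller blocks mean more and smaller
     * messages (overhead dominates), while larger blocks amortize the overhead. */
    void blocked_matmul(int n, int b, double *C)
    {
        size_t block_bytes = (size_t)b * b * sizeof(double);
        double *abuf = malloc(block_bytes);
        double *bbuf = malloc(block_bytes);

        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++) {
                    bulk_get(abuf, remote_block_A(i, k), block_bytes);
                    bulk_get(bbuf, remote_block_B(k, j), block_bytes);
                    matmul_block(&C[((size_t)i * n + j) * b * b], abuf, bbuf, b);  /* local b^3 work */
                }

        free(abuf);
        free(bbuf);
    }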

For the radix and sample sorts, Figure 4 shows that the SP spends less time in the local computation phases because of its faster CPU. SP AM spends about the same amount of time in the communication phases as the CM-5 and the CS-2, if not less. Although the SP's round-trip latency is higher, SP AM combines low message overhead with high network bandwidth to achieve a higher message throughput. Again, the performance over MPL for small messages suffers from the high message overhead. For large messages (albeit not large enough to activate flow control), both SP AM and MPL outperform the other machines in both the computation and the communication phases.

The Split-C benchmark results show that SP AM's low message overhead and high throughput compensate for the SP's high network latency. The software overhead of MPL degrades the communication performance of fine-grain applications, allowing machines with slower processors (the CM-5) or even higher network latencies (the U-Net/ATM cluster) to outperform the SP.


