A Performance Study of Sequential I/O on Windows NT 4

by Erik Riedel

Notes by Daniel Switkin switkin@cs.cornell.edu

This paper is essentially a summary of sequential IO performance on standard hardware under NT. It benchmarks SCSI devices under a variety of configurations to determine where bottlenecks occur. Along the way it proposes a few tips for better performance on large sequential tasks.

Random Terminology

PAP - Peak advertised performance
RAP - Real application performance
Half-power point - The point at which the system delivers at least half of the theoretical maximum performance (PAP / 2)
WCE - Write Cache Enable, a setting which allows the disk to announce write completion as soon as the data is stored in its cache, and before the actual write to media
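
To make the half-power point concrete, here is a minimal sketch that picks it out of a set of throughput measurements. It assumes the "point" is expressed as a request size, and the function name and sample numbers are made up for illustration; they are not results from the paper.

```python
def half_power_point(measurements, pap_mbps):
    """Return the smallest request size whose throughput is at least PAP / 2.

    measurements: iterable of (request_size_bytes, throughput_mbps) pairs.
    Returns None if no measurement reaches half of PAP.
    """
    threshold = pap_mbps / 2
    for size, throughput in sorted(measurements):
        if throughput >= threshold:
            return size
    return None


# Hypothetical numbers, for illustration only.
samples = [(512, 0.8), (2048, 2.9), (8192, 5.4), (65536, 8.1)]
print(half_power_point(samples, pap_mbps=10.0))
```

With these made-up samples the threshold is 5 MBps, so the first request size that clears it is reported.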

Component Performance

The SCSI bus tends to saturate at 75-80% of its PAP due to bus protocol overheads
The PCI bus is capable of about half its PAP (72 out of 132 MBps), probably due to arbitration
A 200 MHz Pentium Pro saturates between 16 and 50 MBps for buffered IO, but is capable of 480 MBps unbuffered (other components will become the bottleneck long before this)
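
The arithmetic behind these figures can be sketched as follows. The PCI and CPU numbers come from the notes above; the 40 MBps SCSI bus PAP is an assumption chosen only to illustrate the 75-80% range, since the notes do not state which SCSI variant was measured.

```python
# Figures quoted in the notes above (MBps).
PCI_PAP = 132
PCI_RAP = 72
CPU_UNBUFFERED = 480

# Assumption: a 40 MBps SCSI bus PAP, picked only for illustration.
SCSI_PAP = 40
SCSI_SATURATION = (0.75, 0.80)

pci_fraction = PCI_RAP / PCI_PAP
scsi_rap = [SCSI_PAP * f for f in SCSI_SATURATION]


def bottleneck(limits):
    """Return the (name, MBps) pair of the slowest component in the path."""
    return min(limits.items(), key=lambda item: item[1])


# One unbuffered data path: CPU, PCI bus, and a single SCSI bus.
path = {
    "cpu (unbuffered)": CPU_UNBUFFERED,
    "pci bus": PCI_RAP,
    "scsi bus": scsi_rap[0],
}
print(f"PCI delivers {pci_fraction:.0%} of PAP")
print("bottleneck:", bottleneck(path))
```

Under that assumed SCSI figure, the bus is the limiting component well before the CPU, which matches the remark that other components give out long before 480 MBps.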

Findings and Conclusions

Prefetching at both the file system and drive levels is a significant boost for reads
WCE allows the disk to hide seek and media transfer times, and gives an enormous boost to writes at the risk of corruption
Small requests result in lower bandwidth and higher CPU utilization (gasps are heard from the crowd)
However, request sizes above 64k tend to suffer because they must be broken down into multiple 64k requests
Bypassing the file system cache reduces CPU overhead without affecting reads, but writes suffer badly (can be compensated for by enabling WCE)
Asynchronous IO allows simultaneous requests, which reduces idle time and approaches WCE speeds
Asynchronous IO is even more effective with striped disks which operate in round-robin fashion, because true parallelism can be achieved at the media level
While bandwidth increases as drives are added to a RAID configuration, the overhead of switching between drives takes up an increasing percentage of the SCSI bus compared to actual data transfer
The NT restriction that unbuffered writes be synchronous when creating new files or extending existing ones is a serious bottleneck, particularly on non-sequential data - the solution is to preallocate the space
Lack of alignment in RAID systems is terrible for large requests, because it forces each 64k chunk to span two drives - can be solved by formatting drives with 64k block sizes
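
The 64k splitting and the alignment problem can be illustrated with a small model. This is a sketch under simple assumptions, not NT's actual logic: a round-robin stripe set with a 64k chunk per drive, four drives, and a hypothetical 4096-byte misalignment. The function names are invented for the example.

```python
CHUNK = 64 * 1024  # 64k, both the request split size and the stripe chunk


def split_request(offset, length, max_size=CHUNK):
    """Split one request into consecutive pieces of at most max_size bytes."""
    pieces = []
    end = offset + length
    while offset < end:
        size = min(max_size, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


def drives_for_piece(offset, size, stripe_chunk=CHUNK, ndrives=4):
    """Return the drive indexes one piece touches in a round-robin stripe set."""
    first = offset // stripe_chunk
    last = (offset + size - 1) // stripe_chunk
    return sorted({chunk % ndrives for chunk in range(first, last + 1)})


# Compare an aligned 256k request with one shifted by 4096 bytes.
for start in (0, 4096):
    pieces = split_request(start, 256 * 1024)
    print(start, [drives_for_piece(o, s) for o, s in pieces])
```

In the aligned case every 64k piece lands on exactly one drive; once the start is shifted, every piece straddles a chunk boundary and touches two drives, which is the effect described above.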
The magic formula to reach the sum of the device media limits is:
1. Large request sizes
2. Deep asynchronous requests
3. WCE
4. Striping
5. Unbuffered IO
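
A rough, portable sketch of three of these ingredients (preallocation, large requests, and several requests in flight) is given below. It is not the paper's code: it swaps in a thread pool for NT's asynchronous IO, it does not bypass the file system cache, and WCE and striping are drive and volume settings that code like this cannot control. The queue depth of 8 and the helper names are assumptions. Note also that truncate() only sets the file length; whether space is physically reserved depends on the file system.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

REQUEST_SIZE = 64 * 1024  # large request size (64k)
QUEUE_DEPTH = 8           # requests kept in flight; an assumed value


def write_block(path, index, payload):
    """Write one request-sized block at its own offset in the file."""
    with open(path, "r+b") as f:
        f.seek(index * len(payload))
        f.write(payload)
    return len(payload)


def preallocated_parallel_write(path, total_blocks):
    """Preallocate the file, then fill it with QUEUE_DEPTH concurrent writers."""
    payload = b"x" * REQUEST_SIZE
    # Set the final length first so that no later write extends the file.
    with open(path, "wb") as f:
        f.truncate(total_blocks * REQUEST_SIZE)
    with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
        written = pool.map(
            lambda i: write_block(path, i, payload), range(total_blocks)
        )
        return sum(written)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "stream.dat")
    total = preallocated_parallel_write(target, total_blocks=16)
    final_size = os.path.getsize(target)
    print(total, final_size)
```

Opening the file once per block keeps the sketch short; a real benchmark would hold handles open and use the platform's native asynchronous and unbuffered IO facilities instead.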