A Performance Study of Sequential I/O on Windows NT 4
by Erik Riedel
Notes by Daniel Switkin switkin@cs.cornell.edu
This paper summarizes IO performance on standard hardware under NT. It
benchmarks SCSI devices under a variety of configurations to determine
where bottlenecks occur. Along the way it proposes a few tips for better
performance on large sequential tasks.
Random Terminology
- PAP - Peak advertised performance
- RAP - Real application performance
- Half-power point - The point at which the system delivers
at least half of the theoretical maximum performance (PAP / 2)
- WCE - Write Cache Enable, a setting which allows the disk
to announce write completion as soon as the data is stored in its cache,
and before the actual write to media
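To make the half-power point concrete, here is a minimal sketch that picks it out of a set of measurements. It assumes the "point" is measured along request size, and the sample numbers are made up for illustration (they are not from the paper). The function name half_power_point is hypothetical.

```python
def half_power_point(measurements, pap_mbps):
    """Return the smallest request size whose throughput is at least PAP / 2.

    measurements: iterable of (request_size_bytes, throughput_mbps) pairs.
    Returns None if no measurement reaches half of PAP.
    """
    threshold = pap_mbps / 2
    qualifying = [size for size, mbps in measurements if mbps >= threshold]
    return min(qualifying) if qualifying else None


# Hypothetical measurements for illustration only; not taken from the paper.
sample = [(512, 0.8), (2048, 2.9), (8192, 6.1), (65536, 8.7)]

# With an assumed PAP of 10 MBps the threshold is 5 MBps, first met at 8192 bytes.
hpp = half_power_point(sample, pap_mbps=10.0)
```

With real data, the RAP figures would come from the benchmark runs and the PAP from the device's data sheet.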
Component Performance
- The SCSI bus tends to saturate at 75-80% of its PAP due to bus protocol overheads
- The PCI bus is capable of about half its PAP (72 out of 132 MBps), probably
due to arbitration
- A 200 MHz Pentium Pro saturates between 16 and 50 MBps for buffered IO,
but is capable of 480 MBps unbuffered (other components will be a bottleneck
long before this)
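The numbers above amount to a simple rule: the IO path is only as fast as its slowest component. A rough sketch of that arithmetic follows. The PCI and CPU figures are the ones quoted above. The SCSI bus PAP of 40 MBps is an assumption added for illustration (these notes do not state it), and the helper names are hypothetical.

```python
def efficiency(rap_mbps, pap_mbps):
    """Fraction of peak advertised performance actually delivered."""
    return rap_mbps / pap_mbps


def bottleneck(limits_mbps):
    """Return (name, limit) of the slowest component in a serial IO path."""
    name = min(limits_mbps, key=limits_mbps.get)
    return name, limits_mbps[name]


# ASSUMPTION: a 40 MBps SCSI bus PAP, used only to make the example concrete.
ASSUMED_SCSI_PAP_MBPS = 40.0

pci_eff = efficiency(72, 132)  # PCI delivers 72 of its 132 MBps PAP

path = {
    "pci_bus": 72.0,                            # from the notes
    "cpu_unbuffered": 480.0,                    # from the notes
    "scsi_bus": 0.75 * ASSUMED_SCSI_PAP_MBPS,   # low end of the 75-80% range
}
slowest = bottleneck(path)
```

Under these assumptions the SCSI bus caps the path long before the CPU does, which matches the parenthetical remark about unbuffered IO.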
Findings and Conclusions
- Prefetching at both the file system and drive levels is a significant
boost for reads
- WCE allows the disk to hide seek and media transfer times, and gives
an enormous boost to writes at the risk of corruption
- Small requests result in lower bandwidth and higher CPU utilization
(gasps are heard from the crowd)
- However, request sizes above 64k tend to suffer because they must be
broken down into multiple 64k requests
- Bypassing the file system cache reduces CPU overhead without hurting
read throughput, but writes suffer badly (can be compensated for by enabling WCE)
- Asynchronous IO allows simultaneous requests, which reduces idle time
and approaches WCE speeds
- Asynchronous IO is even more effective with striped disks which operate
in round-robin fashion, because true parallelism can be achieved at the
media level
- While bandwidth increases as drives are added to a RAID configuration,
the overhead of switching between drives takes up an increasing percentage
of the SCSI bus compared to actual data transfer
- NT forces unbuffered writes to be synchronous when they create a new file
or extend an existing one, which is a serious bottleneck, particularly on
non-sequential data - the solution is to preallocate the space
- Lack of alignment in RAID systems is terrible for large requests, because
it forces each 64k chunk to span two drives - can be solved by formatting
drives with 64k block sizes
- The magic formula to reach the sum of the device media limits is:
- Large request sizes
- Deep asynchronous requests
- WCE
- Striping
- Unbuffered IO
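The striping and alignment points above can be illustrated with a small sketch of how a byte range maps onto round-robin striped drives. This is a simplified model, not how NT's striping driver is implemented. The function name split_request, the four-drive setup, and the 512-byte misalignment are all hypothetical choices for the example.

```python
CHUNK = 64 * 1024  # stripe chunk size in bytes, matching the 64k in the notes


def split_request(offset, length, num_drives, chunk=CHUNK):
    """Map a byte range on a striped volume to (drive, drive_offset, length) pieces.

    Chunks are assigned round-robin: chunk i lives on drive i % num_drives.
    """
    pieces = []
    end = offset + length
    while offset < end:
        chunk_index = offset // chunk
        within = offset % chunk
        # A piece never crosses a chunk boundary, so it stays on one drive.
        piece_len = min(chunk - within, end - offset)
        drive = chunk_index % num_drives
        drive_offset = (chunk_index // num_drives) * chunk + within
        pieces.append((drive, drive_offset, piece_len))
        offset += piece_len
    return pieces


# An aligned 64k request lands on exactly one drive.
aligned = split_request(0, CHUNK, num_drives=4)

# The same request shifted by a hypothetical 512 bytes spans two drives.
misaligned = split_request(512, CHUNK, num_drives=4)

# A large aligned request touches every drive once, so they can work in parallel.
large = split_request(0, 4 * CHUNK, num_drives=4)
```

In this model the misaligned request turns one drive operation into two, while the large aligned request spreads evenly across all four drives, which is the case where deep asynchronous requests pay off.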