Chapter 14 - Detecting Disk Bottlenecks
Disk space is a recurring problem. No matter how large a hard disk you buy, your software seems to consume it. But disk bottlenecks pertain to time, not space. When the disk becomes the limiting factor in your workstation, it is because the components involved in reading from and writing to the disk cannot keep pace with the rest of the system.
The parts of the disk that create a time bottleneck are less familiar than the megabytes or gigabytes of space. They include the I/O bus, the device bus, the disk controller, and the head stack assembly. Each of these components contributes to and, in turn, limits the performance of the disk configuration.
Performance Monitor measures different aspects of physical and logical disk performance. This chapter examines logical and physical disk performance, shows how to spot and eliminate disk bottlenecks, and describes some special strategies for tuning disk sets.
Tip "Disk and File System Basics," Chapter 17 of this book, provides a comprehensive introduction to the state-of-the-art disk terminology and technology. It is a useful foundation for the information in this chapter.
Prepare to monitor your disk configuration by logging the System, Logical Disk, and Memory objects for several days at an update interval of 60 seconds. If you suspect that slow disk response is periodic, for example, if it is exaggerated by downloads on certain days or certain times of day, log those times separately or place bookmarks in your general log.
Warning Performance Monitor will not monitor disk activity until you run Diskperf and restart the computer. For more information, see "Diskperf: Enabling the Disk Counters," later in this chapter.
Use the following Performance Monitor counters to measure the performance of physical and logical disks.
Disk Testing Tips
The following tips will help you test your disk configuration:
Diskperf: Enabling the Disk Counters
To use the Performance Monitor physical and logical disk counters, you must first run the Diskperf utility included in Windows NT. Once you run Diskperf and restart the computer, Performance Monitor can collect disk data. Otherwise, Performance Monitor displays zeros for all counter values for the disks.
Diskperf installs the Disk Performance Statistics Driver that collects data for Performance Monitor and a high-precision timer that times each disk transfer. The timer and the driver are omitted by default to avoid their overhead when you're not monitoring disk performance. This overhead has been measured at less than 1% on a 33-MHz 486 processor.
Note By default, Diskperf installs the Disk Performance Statistics Driver above the fault tolerant driver, Ftdisk.sys, in the I/O Manager's disk driver stack. To monitor the physical disks in disk configurations which include Ftdisk, use the diskperf -ye option. To determine if Ftdisk is started in your configuration, use the Devices Control Panel.
You must be a member of the Administrators local group on a computer to run Diskperf on it. Run Diskperf from a command prompt window. The counters remain enabled, even when you reboot, until you remove them by using the diskperf -n option. At the command prompt, type one of the following, then restart the computer:
c:\> diskperf -y
This enables the counters on a standard disk configuration.
To run diskperf on a remote computer, type one of the commands in the following table, followed by the computer name, then restart the computer. For example:
diskperf -y \\ComputerName
Diskperf -ye for Ftdisk Configurations
The diskperf -ye option is for disk configurations that use the fault tolerant disk driver, Ftdisk. This includes mirror sets, stripe sets with or without parity, and other combinations of noncontiguous physical disk space into a single logical disk.
Tip To determine if your configuration uses Ftdisk, find Ftdisk on the Devices Control Panel. Ftdisk will be marked as Started if it is used in the disk configuration.
Diskperf -ye places the Disk Performance Statistics Driver below the fault tolerant driver in the disk driver stack. In this position, the Disk Performance Statistics Driver can see physical instances of a disk before they are logically combined by Ftdisk. This lets Performance Monitor collect data on each physical disk in a disk set.
Hardware RAID configurations do not use Ftdisk. The physical disks are combined in the disk controller hardware, which is always below the Disk Performance Statistics Driver. Performance monitoring tools always see the drive set as a single physical disk. It does not matter whether you use diskperf -y or diskperf -ye.
At the command prompt, type diskperf -ye, then restart the computer. This installs or moves the Disk Performance Statistics Driver below the fault tolerant driver and installs a high-performance timer.
Note If you have already enabled disk collection using the default diskperf -y option, you can change it by typing diskperf -ye and restarting the computer.
The following figure shows the positioning of the Disk Performance Statistics Driver in the diskperf -y (default) and diskperf -ye (optional) configurations.
By using the optional configuration on software RAID, the physical disks in a software RAID set appear as separate physical instances in Performance Monitor and on other monitoring tools on Windows NT.
The Performance Monitor Disk Counters
Performance Monitor has many useful counters for measuring disk activity. This section
Understanding the Disk Counters
The following list describes the most commonly used logical disk counters in simple terms. (To see the complete list, scroll through the Physical Disk and Logical Disk counters listed in Performance Monitor and read the explanatory text for each counter.)
Troubleshooting the Disk Activity Counters
Sometimes, the disk activity counters just don't add up. %Disk Read Time and %Disk Write Time might sum to more than 100% even on a single disk, while %Disk Time, which represents their sum, is still 100%. Even worse, on a disk set, the set looks 100% busy even when some disks are idle.
Even the fanciest disk can't be more than 100% busy, but it can look that way to Performance Monitor. Several factors can cause this discrepancy and they are sometimes all happening at once:
Now that you understand how the disk counters work, you can use them more effectively.
New Disk Activity Counters
Performance Monitor for Windows NT 4.0 includes new counters for monitoring disk activity:
These counters tell how often the disk is busy during the sample interval. Despite their name, they don't count items in a queue. They use the same data as the %Disk Time counters, but they report the result in decimals, rather than percentages. This allows them to report values greater than 100%.
For example, if %Disk Read Time is 96%, then Avg. Disk Read Queue Length is 0.96.
The advantage of these counters is their ability to display values that exceed 100%.
For example, if %Disk Read Time is 90% and %Disk Write Time is 30%, %Disk Time cannot report the sum because it cannot exceed 100%. In this case, %Disk Time is 100% and Avg. Disk Queue Length is 1.2.
Still, you need to be cautious when interpreting these values, especially those that are sums. In a 3-disk set, if one disk is reading for 66% of the sample interval, another is reading for 70% of the interval, and the third is idle, the Avg. Disk Read Queue Length would be 1.36. This doesn't mean that the set is 136% busy; it means that it is at about 45% (1.36/3) capacity.
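The relationship between the percentage counters and the queue-length counters can be sketched in a few lines of Python. The function names are hypothetical, and the calculation only restates the arithmetic in the examples above; it does not claim to reproduce how Performance Monitor computes the counters internally.

```python
def avg_queue_length(read_time_pct, write_time_pct):
    """Sum of read and write busy time, reported as a decimal rather than a percentage."""
    return (read_time_pct + write_time_pct) / 100.0


def disk_time_pct(read_time_pct, write_time_pct):
    """%Disk Time reports the same sum but cannot exceed 100%."""
    return min(float(read_time_pct + write_time_pct), 100.0)


def set_utilization(per_disk_busy_pct):
    """Fraction of a disk set's total capacity in use, given each disk's busy percentage."""
    queue_length = sum(per_disk_busy_pct) / 100.0
    return queue_length / len(per_disk_busy_pct)


# Single disk: 90% reading plus 30% writing.
print(avg_queue_length(90, 30))
print(disk_time_pct(90, 30))

# Three-disk set: 66% busy, 70% busy, and idle.
print(round(set_utilization([66, 70, 0]), 2))
```

Dividing the summed value by the number of disks in the set is the step that keeps a reading such as 1.36 from being mistaken for 136% use.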
Monitoring Application Efficiency
It's not easy to measure disk use by applications, though it is important. To measure how efficiently your application is using the disks, chart the memory and cache counters.
Applications rarely read or write directly to disk. The file system first maps application code and data into the file system cache and copies it from the cache into the working set of the application. When the application creates or changes data, the data is mapped into the cache and then written back to disk in batches. The exceptions are when an application requests a single write-through to disk or tells the file system not to use the cache at all for a file, usually because it is doing its own buffering.
Fortunately, the same design characteristic that improves an application's use of cache and memory also reduces its transfers from the disk. This characteristic is locality of reference, that is, having a program's references to the same data execute in sequence or close in time. When references are localized, the data the program needs is more likely to be in its working set, or certainly in the cache, and is less likely to be paged out when needed. Also, when a program reads sequentially, the Cache Manager does read aheads, that is, it recognizes the data request pattern and reads in larger blocks of data on each transfer.
If your application is causing a lot of paging, run it under controlled circumstances while logging Cache: Copy Read Hits %, Cache: Read Ahead/sec, Memory: Pages Input/sec, and Memory: Pages Output/sec. Then try reorganizing or redesigning your data structures and repeat the test.
Also, use the file operation counters on the System object.
The relevant system counters are
These count file control and data operations for the whole system. Unlike the disk counters, they count read and write requests from the file system to devices and count time in the cache.
Recognizing Disk Bottlenecks
Disk bottlenecks appear as sustained rates of disk activity above 85% of a sample interval and as persistent disk queues greater than 2 per disk, while paging, as measured by Memory: Page Reads/sec and Memory: Page Writes/sec, remains at less than 5 per second, on average.
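As a rough illustration, the thresholds in this rule of thumb can be written as a simple check. This is a minimal sketch: the function name and parameters are hypothetical, the inputs are assumed to be sustained averages taken from a log, and the paging limit is assumed to apply to each paging counter separately.

```python
def looks_like_disk_bottleneck(disk_time_pct, avg_queue_length, disk_count,
                               page_reads_per_sec, page_writes_per_sec):
    """Apply the rule-of-thumb thresholds to sustained average counter values."""
    busy = disk_time_pct > 85.0                          # activity above 85% of the interval
    queued = (avg_queue_length / disk_count) > 2.0       # more than 2 queued per disk
    # Assumption: the "less than 5 per second" limit applies to each counter.
    low_paging = page_reads_per_sec < 5.0 and page_writes_per_sec < 5.0
    return busy and queued and low_paging


# Busy disk with a queue and almost no paging: a candidate disk bottleneck.
print(looks_like_disk_bottleneck(100, 3, 1, 1.0, 0.5))

# Same disk activity but heavy paging: investigate memory first.
print(looks_like_disk_bottleneck(100, 3, 1, 35.0, 2.0))
```

A result of False does not prove the disk is healthy; it only means the logged averages do not match this particular pattern.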
High use, by itself, is a sign of demand, not a problem. In general, a high-performance disk is capable of about 40 I/O operations per second. However, nearly constant use and lengthy queues are a cause for concern. When response is poor, when you hear the disk clicking, and when you see its light flashing, chart Logical Disk: Avg. Disk Queue Length and Memory: Pages/sec for all logical partitions on your workstation.
Note Sustained high disk use and persistent long queues typically are symptoms of a memory shortage, not a disk bottleneck. When physical memory is scarce, the system starts writing the contents of memory to disk and reading in smaller chunks more frequently. The less memory you have, the more the disk is used. Rule out a memory bottleneck before investing any money in disk technology. For more information, see the following section.
Disk Bottlenecks and Memory
The first step in diagnosing a disk bottleneck is distinguishing it from a memory bottleneck. Sustained paging, a symptom of a memory shortage, can look like a disk bottleneck.
Paging—moving pages of code and data between memory and disk—is a necessary part of the virtual memory system. Paging is expected, especially when the system or a process is starting. However, excessive paging that consumes the disk is a symptom of a memory shortage. The solution is to add memory, not disks.
To measure paging, chart the following counters:
The following graph shows an apparent, if transient, disk bottleneck.
In this example, the thick black stripe on the top is % Disk Time, at a sustained rate of 100%. The white line is Current Disk Queue Length, an instantaneous count of the items in the disk queue. There are up to 7 items in the queue to disk in this sample, and the average is nearly 3. It looks like a faster disk is needed.
However, the following graph reveals at least one element contributing to the queue.
This graph is the same as the previous one, except for the addition of Memory: Page Reads/sec (the white line) and Memory: Page Writes/sec (the thin line at the bottom of the graph). Current Disk Queue is now the thin, black line behind Page Reads/sec. The memory counters show how many times the disk was accessed to retrieve pages that were not found in memory or to write pages to free up memory for new data coming in from disk.
The average of 37 disk accesses per second—including 35 Page Reads/sec (as shown in the value bar) and nearly 2 Page Writes/sec—is probably the maximum for this older technology disk.
If this pattern persists beyond the startup of the system or a process, you have a memory bottleneck, not a disk bottleneck. However, before you add memory, make sure that the memory bottleneck is not caused by an inefficient application. For more information, see Chapter 12, "Detecting Memory Bottlenecks."
Interrupts and Disk Use
Just as a memory shortage can look like a disk problem, a disk bottleneck can look like a processor problem. This happens when the rate of interrupts caused by disk activity consumes the processor. Although different disk components use different strategies for transferring data to and from the disk, they all use the processor to some extent. You can measure the effect of the disk on the processor by charting
Many activities other than disk operations produce processor interrupts, even on a relatively idle system. To determine the number of processor interrupts attributable to disk activity, you need to subtract from your measurements those attributable to other causes. On an Intel 486 or later processor, the processor clock interrupts every 10 milliseconds, or 100 times per second. Network interrupts can produce 200–1000 interrupts/sec. Also, hardware errors, like failing drivers, can produce thousands of interrupts.
The following report shows an example of interrupts during a maximum throughput test on a controller that uses programmed I/O, a disk transfer method that uses the processor to tell the drive what sectors to read. The computer was disconnected from the network during the test.
In this example, there were an average of 426.5 interrupts per second. Subtracting 100 per second for the system clock leaves 326.5 from the disk activity, or 76.5% of interrupts. The processor was 98.3% busy on average, and 97.8% of it was in privileged mode, where interrupts are serviced. On average, 91.5% of processor time was consumed by interrupts. Since the disk was responsible for 76.5% of interrupts, it is likely to have generated about 70% of processor use (76.5% of 91.5%). This is substantial and, if sustained, could slow the whole system.
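The attribution arithmetic in this example can be sketched as follows. The function name and the constant are hypothetical, and the sketch assumes, as the example does, that the system clock is the only other source of interrupts because the computer was disconnected from the network.

```python
CLOCK_INTERRUPTS_PER_SEC = 100.0  # one clock interrupt every 10 milliseconds


def disk_processor_share(total_interrupts_per_sec, interrupt_time_pct,
                         other_interrupts_per_sec=CLOCK_INTERRUPTS_PER_SEC):
    """Estimate the processor time attributable to disk interrupts.

    Returns (disk interrupts/sec, disk fraction of all interrupts,
    estimated percentage of processor time caused by the disk).
    """
    disk_interrupts = total_interrupts_per_sec - other_interrupts_per_sec
    disk_fraction = disk_interrupts / total_interrupts_per_sec
    return disk_interrupts, disk_fraction, disk_fraction * interrupt_time_pct


per_sec, fraction, processor_pct = disk_processor_share(426.5, 91.5)
print(per_sec, round(fraction * 100, 1), round(processor_pct, 1))
```

On a networked computer, pass a larger value for other_interrupts_per_sec so that network interrupts are subtracted as well.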
Measuring Disk Efficiency
Each component of your disk assembly (the adapter bus, the device bus and cable, the disk controller, and the disk or disk cabinet) has a rate of maximum and expected throughput. The total configuration is limited to the maximum throughput of the slowest component, so it's important to find that value for your system. The booklets provided by the component manufacturer usually list maximum and expected transfer rates and throughput for each component.
The final components in your disk configuration are the applications that issue the I/O requests. They determine how the physical disks are used. In general, reading or writing a few large records is more efficient than reading or writing many small ones. This curve levels off when the disk is moving such large blocks of data that each transfer is slower, though its total throughput is quite high. Unfortunately, it is not always easy to control this factor. However, if your system is being used to transfer many small units of data, this inefficiency may help to explain, though not resolve, high disk use.
In this section, we'll show you how to test the efficiency of your disk configuration at reading and writing and at sequential versus random transfers. We'll also share some strategies for testing the maximum throughput of your disk, and point you to some files on the Windows NT Resource Kit 4.0 CD for testing disk throughput.
Reading and Writing
Some disks and disk configurations perform better when reading than when writing. You can compare the reading and writing capabilities of your disks by reading from a physical disk and then writing to the same physical disk.
To measure reading from and writing to disk, log the Logical and Physical Disk objects in Performance Monitor, then chart the following counters:
On standard disk configurations, you will find little difference in the time it takes to read from or write to disk. However, on disk configurations with parity, such as hardware RAID 5 and stripe sets with parity, reading is quicker than writing. When you read, you read only the data; when you write, you read, modify, and write the parity, as well as the data.
Mirror sets also are usually quicker at reading than writing. When writing, a mirror set writes all of the data twice. When reading, it reads simultaneously from all disks in the set. Magneto-optical devices (MOs), known to most of us as Read/Write CDs, also are quicker at reading than writing. When writing, they use one rotation of the disk just to burn a starting mark and then wait for the next rotation to begin writing.
Measuring Disk Reading
The following graph shows a test of disk-reading performance. A test tool is set to read 64K records sequentially from a 60-MB file on a SCSI drive. The reads are unbuffered, so the disk can be tested directly without testing the program's or system's cache efficiency. Performance Monitor is logging every two seconds.
Note The test tool used in this example submits all of its I/O requests simultaneously. This exaggerates the disk time and Avg. Disk sec/Transfer counters. If the tool submitted its requests one at a time, the throughput might be the same, but the values of counters that time requests would be much lower. It is important to understand your applications and test tools and factor their I/O methods into your analysis.
In this graph, the top line is Disk Reads/sec. The thick, black, straight line running right at 64K is Avg. Disk Bytes/Read. The white line is Disk Read Bytes/sec, and the lower thin, black line is Avg. Disk sec/Read. The scale of the counters has been adjusted to fit all of the lines on the graph, and the Time Window eliminates the starting and ending values from the averages.
In this example, the program is reading the 64K records from Logical Drive D and writing the Performance Monitor log to Logical Drive E on the same physical disk. The drive is doing just less than 100 reads per second and reading more than 6.2 MB per second. At the points where the heavy black and white lines meet, the drive is doing 100 reads per second. Note that reading 6.2 MB/sec is reading a byte every 0.00000016 of a second. That is fast enough to avoid a bottleneck under almost any circumstances.
However, Avg. Disk sec/Read is varying between 0.05 and 3.6 seconds per read, instead of the 16 milliseconds that would be consistent with the rest of the data (1 second/64K bytes). As noted above, the value of Avg. Disk sec/Read tells us more about the test tool and the Performance Monitor counters than about the disk. However, you might see something like this, so it's worth understanding.
Avg. Disk sec/Read times each request from submission to completion. If this consisted entirely of disk time, it would be in multiples of 16 milliseconds, the time it takes for one rotation of this 3600 RPM disk. The remaining time counted consists of time in the queue, time spent moving across the I/O bus, and time in transit. Since the test tool submits all of its I/O operations to the device at once, at a rate of 6.2 MB per second, the requests take 3 seconds, on average.
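Two of the figures quoted in this test, the rotation time of the disk and the time per byte at the measured transfer rate, follow from simple arithmetic. The sketch below uses hypothetical function names and assumes that 1 MB means 1,000,000 bytes for the purpose of the estimate.

```python
def rotation_time_ms(rpm):
    """Time for one full rotation of the disk, in milliseconds."""
    return 60_000.0 / rpm


def seconds_per_byte(bytes_per_sec):
    """Average time to transfer one byte at a given sustained rate."""
    return 1.0 / bytes_per_sec


# A 3600 RPM disk completes one rotation in roughly 16 to 17 milliseconds.
print(round(rotation_time_ms(3600), 2))

# At 6.2 MB/sec (assumed here to be 6,200,000 bytes/sec), each byte takes
# a small fraction of a microsecond.
print(f"{seconds_per_byte(6_200_000):.8f}")
```

Comparing the rotation time with the measured Avg. Disk sec/Read shows how much of each request's time is spent outside the disk itself.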
Measuring Writing while Reading
There are some noticeable dips in the curves of all three graphs. If Performance Monitor were logging more frequently, you could see that the disk stops reading briefly so that it can write to the log and update file system directories. It then resumes reading. Disks are almost always busy with more than one process, and the total capacity of the disks is spread across all processes. Although the competing process just happens to be Performance Monitor, it could be any other process.
The following graph shows the effect of writing on the efficiency of the reads.
In this graph, several lines are superimposed, because the values are nearly the same. The thick, black line is Physical Disk: Disk Reads/sec and Logical Disk: Disk Reads/sec for Drive D; the thick, white line is Physical Disk: Disk Writes/sec and Logical Disk: Disk Writes/sec for Drive E. The thin, black blips at the bottom of the graph are Disk Reads/sec on Drive E and Disk Writes/sec on Drive D, both magnified 100 times to make them visible.
Although Disk Writes/sec on Drive D are negligible, fewer than 0.05 per second, on average, Performance Monitor is writing its log to Drive E, the other logical partition on the physical disk. This accounts for the writing on Physical Drive 1. Although the logical partitions are separate, the disk has a single head stack assembly that needs to stop reading, however briefly, while it writes. The effect is minimal here, but it is important to remember that logical drives share a physical disk, especially because most disk bottlenecks are in shared physical components.
The report on this graph shows the average values, but averages obscure the real activity, which happens in fits and starts. The following figure shows an Excel spreadsheet to which the values of writing to Drive D have been exported.
Drive D is also writing, just to update file system directory information. It writes a page (4096 bytes), then a sector (512 bytes), the smallest possible transfer on this disk. You can multiply column B, Disk Bytes/Write by column C, Disk Writes/sec, to get column D, Disk Write Bytes/sec. Although the transfer rates aren't stellar here, we are reading very small records and have an even smaller sample.
The spreadsheet for Drive E follows.
This shows the wide variation of writes in this small sample. In general, Drive E is writing about a page at a time, but the transfer rate varies widely, from less than a page per second, up to 33.5 pages per second. However, this small amount of writing is enough to account for the dips in the main reading data.
Measuring Disk Writing
The graphs of writing to this simple disk configuration are almost the same as those of reading from it. The test tool is set to write sequential 64K records to a 60 MB file on a SCSI drive. The writes are unbuffered, so they bypass the cache and go directly to disk. Performance Monitor is logging once per second.
Note Disks cannot distinguish between writing a file for the first time and updating an existing file. Recognizing and writing only changes to a file would require much more memory than is practical. The writing tests in this chapter consist of writing new data to disk, but writing changes to data would produce the same results.
The following figure shows the reading and writing measures side by side. The top graph measures reading; the bottom, writing.
In these graphs, the lines (from top to bottom of each graph) represent
The actual values are almost identical or vary only within experimental error. The dips in the values represent the time the disk spent writing the Performance Monitor log to the other logical drive.
If you have enough disks, you can eliminate the variation caused by Performance Monitor logging. The following graph shows the test tool writing sequential 64K records to a 40 MB file. Because Performance Monitor is logging to a different physical drive, the logging does not interfere with the writing test.
As expected, the dips in the graph are eliminated. The overall transfer rate is also somewhat improved, although writing a log doesn't have that much overhead. Whenever possible, isolate your tests on a single physical drive. Also, if you have a high-priority task, or an I/O intensive application, designating a separate physical drive for the task will improve overall disk performance.
Random vs. Sequential Reading
It is much quicker to read records in sequence than to read them from different parts of the disk, and it is slowest to read randomly throughout the disk. The difference is in the number of required seeks, operations to find the data and position the disk head on it. Moving the disk head, a mechanical device, takes more time than any other part of the I/O process. The rest of the process consists of moving data electronically across circuits. However slow, it is thousands of times faster than moving the head.
The operating system, disk driver, adapter and controller technology all aim to reduce seek operations. More intelligent systems batch and sequence their I/O requests in the order that they appear on the disk. Still, the more times the head is repositioned, the slower the disk reads.
Tip Even when an application is reading records in the order in which they appear in the file, if the file is fragmented throughout the disk or disks, the I/O will not be sequential. If the disk-transfer rate on a sequential or mostly sequential read operation deteriorates over time, run a defragmentation utility on the disk and test again.
The following figure compares random to sequential reading to show how random reading affects disk performance. In the top report, the disk is reading 64K records randomly throughout a 40 MB file. Performance Monitor is writing its log to a different physical drive. In the bottom report, the same disk is reading 64K records in sequence from a 60 MB file, with Performance Monitor logging to a different logical partition on the same drive.
The difference is quite dramatic. The same disk configuration, reading the same size records, is 32% more efficient when the records read are sequential rather than random. The number of bytes transferred fell from 6.39 MB/sec to 4.83 MB/sec because the disk could only sustain 75 reads/sec, compared to 97.6 reads/sec for sequential records. Queue time, as measured by Avg. Disk sec/Read, was also 1/3 higher in the random reading test.
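The percentages quoted for this comparison can be checked with a short calculation. The helper name is hypothetical; the input values are the averages reported above.

```python
def pct_gain(slower, faster):
    """Percentage by which the faster value exceeds the slower one."""
    return (faster - slower) / slower * 100.0


# Throughput in MB/sec: random versus sequential reading.
print(round(pct_gain(4.83, 6.39), 1))

# Transfer rate in reads/sec: random versus sequential reading.
print(round(pct_gain(75.0, 97.6), 1))
```

Both ratios land near one-third, which is why the throughput and the transfer rate tell the same story when the record size is constant.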
The following figure shows the graphs of the two tests so you can see the differences in the shape of the curves. The top graph represents random reading; the bottom represents sequential reading.
In both graphs, the top line is Disk Reads/sec, the thick, black line is Avg. Disk Bytes/Read, the white line is Disk Read Bytes/sec, and the thin, black line near the bottom is Avg. Disk sec/Read. The time window was adjusted on both graphs to eliminate startup and shutdown values, and the counter values were scaled to get them on the chart. The counter scales are the same on both charts.
Also, Disk Reads/sec and Disk Read Bytes/sec are scaled so that their lines meet when the disk is reading an average of 100 bytes/sec, the norm for this disk configuration reading sequential records of a constant size. Space between the lines indicates that the disk is reading more or less than 100 bytes/sec. This is where the 1/3 efficiency gain in sequential reads is most pronounced.
The sequential test graph is less regular because the log is being written to the same drive. Nonetheless, the transfer rate curve is straighter on the sequential test, showing that the disk is spending more time reading. The attractive pattern on the random graph appears because the disk assembly must stop reading and seek between each read. Had Performance Monitor been able to measure at a much higher resolution, the graph would show the transfer rate dropping to zero and then spiking back to 100 reads/sec.
To examine the cause of the pattern in the random test, add Processor: %Processor Time to the graph. If you have a multiprocessor computer, substitute System: %Total Processor Time.
In this example, the Processor: % Processor Time is the white line superimposed upon the previous graph of the random test. The processor time follows the same pattern as the transfer rate, but is offset by about half of the read time. The processor, which is not particularly busy otherwise, is consumed for short periods while it locates the drive sector for the read operation. Once the sector is found, the read can begin, and the processor can resume its other threads until the read is complete, when it is again interrupted by the next seek request.
Although it is often impractical to read records sequentially in a real application, these tests demonstrate how much more efficient the same disk can be when reading sequentially. The following methods can improve disk efficiency:
If your disk is used to read data from many different files in different locations, it cannot be as efficient as it might otherwise be. Adjust the expected values for the disk based upon the work it is expected to do.
Random vs. Sequential Writing
Seek operations affect writing to disk as well as reading from it. Use the following counters to measure and compare the effects of writing sequential records to writing randomly throughout the disk:
Remember to defragment your disk before testing. If your disk is nearly full, the remaining free space is likely to be fragmented, and the disk must seek to find each sector of free space. The efficiencies won by writing sequential records will be lost in the added seek time.
The following figure compares random to sequential writing on the same disk. In the top graph, the disk is writing 64K records randomly throughout a 60 MB file. In the bottom graph, the same disk is writing the same size records to the same size file, but is writing sequentially. In both cases, Performance Monitor is logging to a different partition on the same physical disk.
In both graphs, the white line is Disk Writes/sec, the thick, black line is Avg. Disk Bytes/Write, the gray line is Disk Write Bytes/sec, and the thin, black line near the bottom is Avg. Disk sec/Write.
The pattern of seek and read that was evident on the graph of random reading does not appear in this graph of random writing. In fact, the shapes of the random and sequential writing counter curves are quite similar, but their values are very different. Disk Writes/sec and Disk Write Bytes/sec are both 50% higher on the sequential writing test, an even greater effect than on the reading test.
The following comparison of reports makes this more evident. The top report is of the random writing test; the bottom report is of the sequential writing test.
When writing throughout the disk, the transfer rate, as measured by Disk Writes/sec, drops from 96.335/sec to 62.703/sec on average. Disk Write Bytes/sec also drops by about one-third, from 6.3 MB/sec to 4.0 MB/sec on average.
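The same pair of measurements can be described as a drop of about one-third or as a gain of about one-half, depending on which test is taken as the baseline. The sketch below, with hypothetical helper names, shows both views of the reported averages.

```python
def pct_drop(before, after):
    """Percentage decrease, measured against the higher (sequential) value."""
    return (before - after) / before * 100.0


def pct_gain(low, high):
    """Percentage increase, measured against the lower (random) value."""
    return (high - low) / low * 100.0


sequential_writes_per_sec = 96.335
random_writes_per_sec = 62.703

# Random writing relative to sequential writing.
print(round(pct_drop(sequential_writes_per_sec, random_writes_per_sec), 1))

# Sequential writing relative to random writing.
print(round(pct_gain(random_writes_per_sec, sequential_writes_per_sec), 1))
```

When comparing reports, state which test is the baseline so that a figure near 35% and a figure near 54% are not mistaken for a contradiction.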
Reading Smaller vs. Larger Records
All other things being equal, it is quicker and more efficient to read a few large records than many small ones. Although this seems obvious, it is vital to disk performance. If your applications are efficient in their I/O strategy, in localizing data access, and in minimizing repeated I/O requests, the application, the disk, and the computer will function more efficiently.
You can test how your computer responds to reading and writing in smaller and larger units. The Windows NT Resource Kit 4.0 CD includes Diskmax, a Response Probe test of maximum throughput that reads 64K records, and Minread, a Response Probe test that reads 512-byte records. The tests are on the Windows NT Resource Kit 4.0 CD in the Performance Tools group in the \Probe subdirectory. Instructions for running the tests are in Diskmax.txt and Minread.txt.
Note The Minread tests use 512-byte records as the minimum record size because unbuffered reads must be done in sectors, and 512 bytes is a common disk sector size. If your disk has a different sector size, substitute that value for 512 in the RECORDSIZE parameter of the Minread.sct file.
To find the sector size of your disk, use Windows NT Diagnostics in the Administrative Tools group. Select the Drives tab, double-click the drive letter to open the Properties page, then select the General tab. Sector size is listed along with other useful information.
The following figure displays the extremes. It compares the results of the Minread and Diskmax tests run on the same drive of the same computer. Performance Monitor was writing its log to a different physical drive. Both tests show Response Probe doing unbuffered reads of sequential records from a 20 MB file.
The figure was created by superimposing two Performance Monitor reports of the same counters. The data in the first column shows Response Probe reading 512-byte records. The data in the second column shows Response Probe reading 64K records. Avg. Disk Bytes/Read, the size of each read from the disk, is set by the test. The other values vary with the efficiency of the system.
In this example, larger reads improved throughput substantially, but the transfer rate dropped as more of the disk time was consumed. While reading smaller records, the disk was only busy 50% of the time, so it could have been shared with another process. It managed 655 reads per second on average, at a quick 0.001 seconds per read. Reading the larger records, the disk was almost 96% busy, reading only 23.4 times/sec at 0.041 seconds per read.
Total throughput was much better for larger records. Disk Read Bytes/sec was 336K bytes per second on average for the small records and 1.5 MB/sec for the large records.
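The relationship among these counters can be checked with simple arithmetic: throughput is roughly the transfer rate multiplied by the record size. The following Python sketch applies that to the averages quoted above. The function name is illustrative only, and the results differ slightly from the reported counters because the inputs are rounded averages.

```python
def throughput_bytes_per_sec(reads_per_sec, bytes_per_read):
    """Approximate Disk Read Bytes/sec from Disk Reads/sec and Avg. Disk Bytes/Read."""
    return reads_per_sec * bytes_per_read


# Averages taken from the Minread (512-byte) and Diskmax (64K) reports.
small = throughput_bytes_per_sec(655, 512)
large = throughput_bytes_per_sec(23.4, 65536)

print(f"512-byte records: {small:,.0f} bytes/sec")
print(f"64K records:      {large:,.0f} bytes/sec")
print(f"Larger records move {large / small:.1f} times as much data per second")
```

Even though the disk completes far fewer reads per second with 64K records, each read carries 128 times as much data, so total throughput is several times higher.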
Interrupts/sec at 1124.466 were close to the expected 1 per sector for this disk, as shown in the following table. Note that although interrupts were high, they amounted to a small proportion of disk time. Some of the interrupts might not have been serviced.
In this system, 100 interrupts per second are generated by the processor clock and about 300 interrupts per second are generated by the network. Thus, 724 interrupts per second can be attributed to disk activity while reading smaller records or about 1 interrupt for every 463 bytes (336006.5 / 724) on average. For larger records, 3252 interrupts per second are likely to be caused by disk activity or 1 interrupt for every 470 bytes (1529675.125 / 3252).
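The interrupt accounting above can be reproduced as a short Python sketch. The clock and network figures are the estimates given for this particular system, and the function names are hypothetical. For the larger records, the disk-attributed figure of 3252 is taken directly from the text.

```python
CLOCK_INTERRUPTS = 100     # per second, from the processor clock (this system)
NETWORK_INTERRUPTS = 300   # per second, approximate network load (this system)


def disk_interrupts(total_interrupts_per_sec):
    """Interrupts/sec left over after subtracting clock and network activity."""
    return total_interrupts_per_sec - CLOCK_INTERRUPTS - NETWORK_INTERRUPTS


def bytes_per_interrupt(read_bytes_per_sec, interrupts_per_sec):
    """Average number of bytes transferred for each disk interrupt."""
    return read_bytes_per_sec / interrupts_per_sec


small = bytes_per_interrupt(336006.5, disk_interrupts(1124.466))
large = bytes_per_interrupt(1529675.125, 3252)

print(f"512-byte records: about {small:.0f} bytes per interrupt")
print(f"64K records:      about {large:.0f} bytes per interrupt")
```

Both results fall a little below the 512-byte sector size, which is consistent with roughly one interrupt per sector plus some interrupts from other sources.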
One important value, elapsed time, is not shown in the report, but can be calculated, at least roughly, from values that are shown. To read a 20 MB file in 512-byte chunks would take 40,000 reads. At about 655 disk reads per second, that would take longer than a minute ((20,480,000 / 512) / 655 = 61 seconds). To read the same file in larger records, even at the slower rate, would take only just over 13 seconds ((20,480,000 / 65536) / 23.4 = 13.34).
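The same estimate can be written as a small Python function so that you can substitute your own file size, record size, and transfer rate. This is a rough sketch: the function name is illustrative, and the last digit of the result depends on how the counter averages were rounded.

```python
def elapsed_seconds(file_bytes, record_bytes, reads_per_sec):
    """Estimate the time to read a file from its size, record size, and Disk Reads/sec."""
    reads_needed = file_bytes / record_bytes
    return reads_needed / reads_per_sec


FILE_BYTES = 20_480_000  # the 20 MB test file

small = elapsed_seconds(FILE_BYTES, 512, 655)      # Minread: 512-byte records
large = elapsed_seconds(FILE_BYTES, 65536, 23.4)   # Diskmax: 64K records

print(f"512-byte records: about {small:.0f} seconds")
print(f"64K records:      about {large:.1f} seconds")
```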
This test of the extremes of record size performance used sequential reading with no memory access. To test within and beyond this range, copy and edit the Diskmax and Minread files.
For more information on Response Probe, see "Response Probe" in Chapter 11, "Performance Monitoring Tools."
Reading Records of Increasing Size
Another interesting test is to read records of gradually increasing size. You can see how the system responds to the change in requirements.
In this test, a test tool was set up to do unbuffered, sequential reads from a 40 MB file. It did three reads each of 2K, 8K, 64K, 256K, 1024K, 4096K, and 8192K records with a 5-second interval between each cluster of three reads.
Note The Windows NT Resource Kit 4.0 CD includes all the files you need to use Response Probe to test the performance of your disk while reading records of increasing size. The Sizeread test is controlled by an MS-DOS batch file which runs a series of Response Probe tests. To run Sizeread, use Setup to install the Performance Tools group from the CD. The test files are in the Probe subdirectory. Instructions for running the test are in Sizeread.txt.
The following graphs show the data. The first two graphs show values for the smaller records, 2K, 8K, and 64K. Values for the larger records appear to stop at 100, but actually go off the top of the graph. The last graph in this section shows values for the larger records, 256K, 1024K, 4096K, and 8192K. In these graphs, values for the smaller record sizes run along the bottom of the graph. Throughout the test, Performance Monitor was logging to a different physical drive.
In this graph, the gray line is Disk Reads/sec, the black line is Avg. Disk Bytes/Read, and the white line is Disk Read Bytes/sec. As the record size (Avg. Disk Bytes/Read) increases, the throughput (Disk Read Bytes/sec) increases and the transfer rate (Disk Reads/sec) falls because it takes fewer reads to move the same amount of data. At 8K, the reading performance wobbles as the system runs short of memory, then recovers. Above 64K, the values are greater than 100 and go beyond the top of the graph.
The following graph shows the effect of the disk activity on the processor.
In this graph, Processor: % Processor Time (the white line) is added to the graph, along with Interrupts/sec. The processor time curve shows that the processor is used more frequently as throughput increases, but the amount of processor time decreases as the record size increases. This value is characteristic of the architecture of this disk, which interrupts for each read, not for each sector. On disks that interrupt at each sector, the pattern would be quite different.
The patterns seem to fall apart at record sizes greater than 64K bytes. The processor use begins to increase, and throughput rate hits a plateau and remains there.
This graph is designed to show the larger values. The counters are scaled quite small, and the vertical maximum on the graph is increased to 450. The thick, black line (Avg. Disk Bytes/Read) represents the record size. The white line is the throughput, in Disk Read Bytes/sec. The gray line is transfer rate, in Disk Reads/sec.
The scales are so small that the first few record size variations appear as values close to zero. The first noticeable bump is 64K, the next is the attempt at 256K, then 1024K, 4096K, and 8192K. The disk adapter cannot handle the larger record sizes, so the actual values are closer to 252K, 900K, then 6.5M for both 4096K and 8192K.
What is clear from this otherwise busy graph is that maximum throughput is reached at 64K and does not increase any further with record size, although the transfer rate continues to fall as the buses are loaded with larger and larger records.
The actual values are best shown on this Excel spreadsheet. It was prepared by using a single copy of Performance Monitor, with a graph of Avg. Disk Bytes/Read in Chart view and a report of the Logical Disk and Processor counters in Report view. In Chart view, the Time Window was adjusted to limit the values to a single record size segment, then the values were read from Report view and entered into the spreadsheet. The procedure was repeated for each record size segment of the chart.
This spreadsheet reveals the I/O strategy of this system. When transferring data blocks greater than 64K, it breaks the transfers into 64K chunks. Above 64K, the transfer rate drops sharply, and throughput sticks at 6.5 MB/sec. The buffer size appears to be at its maximum at an average record size of 2.8 MB, although the largest record transferred was 4.194 MB. (To determine the largest record size, use the time window to limit the graph to the single highest value on the chart, then read the Max value from the value bar.)
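The chunking behavior inferred from the spreadsheet can be modeled in a few lines of Python. This is only an illustrative model of what the measurements suggest, not the actual driver or adapter logic. The 64K limit is the value observed on this system, and the function name is hypothetical.

```python
MAX_TRANSFER = 64 * 1024  # chunk size observed on this system; others may differ


def split_transfer(record_bytes, max_transfer=MAX_TRANSFER):
    """Return the sizes of the chunks one request would be broken into."""
    full_chunks, remainder = divmod(record_bytes, max_transfer)
    chunks = [max_transfer] * full_chunks
    if remainder:
        chunks.append(remainder)
    return chunks


for size_kb in (2, 8, 64, 256, 1024):
    chunks = split_transfer(size_kb * 1024)
    print(f"{size_kb}K record -> {len(chunks)} transfer(s), largest {max(chunks)} bytes")
```

Under this model, no single transfer ever exceeds 64K, which would explain why throughput stops improving once the record size passes that point.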
Processor use and interrupts also appear to level off at 64K. The remaining variation is just as likely to be due to sampling. It is beyond the resolution of this tool.
This is just an example of what you can test. Remember to use different applications and test tools and combine all results in your analysis. Save the data to show long term trends in disk performance, especially if your workload changes or memory or disks are upgraded.
Use the same testing methods to compare the performance of different disks. Disk components vary in architecture and design philosophy, and they use different protocols. As expected, performance varies widely and is usually correlated with the price of the components. Most workstations will perform adequately with the most moderately priced disk components. However, if you have a disk bottleneck, you might want to evaluate different disk components.
The following figure compares the reading performance of two very different disk configurations side by side. As you might expect, the graph on the right represents the more expensive disk. That disk uses direct memory access (DMA), a method of transferring data to disk that minimizes the use of the processor. The graph on the left represents the performance of a more traditional disk design which uses programmed I/O, a disk transfer method that uses the processor to determine which disk sectors are read.
The disks were both reading 64K records from a 40 MB file on an otherwise idle computer. Performance Monitor was writing its log to a different physical drive.
In both graphs, the heavy black line is Avg. Disk Bytes/Read, the gray line is Disk Read Bytes/sec, the white line is Disk Reads/sec, and the thin, black line near the bottom is Avg. Disk sec/Read. Because the lines don't curve much, they can be shrunk to show them side by side without losing too much detail.
In summary, the gray and white lines, representing Disk Read Bytes/sec and Disk Reads/sec, respectively, are much lower in the graph on the left than in the graph on the right. On the same task and the same computer, throughput on the disk that is represented by the graph on the right is about 3.4 times as high as throughput on the disk represented by the graph on the left.
Because the lines are nearly straight, the averages shown in the following comparative reports are likely to represent disk performance accurately.
To produce a report like this one, superimpose two copies of Performance Monitor reports on the same counters for different disks. (You can also export the data to a spreadsheet, but this method is quicker.)
The reports are evidence of significant performance differences between the disks. In this example, Drive C uses programmed I/O; Drive D uses DMA. While reading from the C drive, the processor was nearly 100% busy, but because it was reading only 28.7 times per second, the throughput was just 1.88 MB/sec. When the same test was run on the D drive, the processor was only 53% busy, and it was reading nearly 100 times per second, for a total throughput of 6.5 MB/sec.
The difference in the strategies is revealed in the % Interrupt Time, which was 93.5% on the C drive and only 2% on the D drive. The C drive configuration uses the processor for disk access. The processor is interrupted for each 512-byte sector read. This amounts to 128 interrupts for each 65536-byte record read. The C drive is reading an average of about 1.88 million bytes/sec (the 1.88 MB/sec throughput noted above), so the processor is being interrupted about 3671 times each second. That is enough to consume all processor time.
In contrast, the more advanced D drive configuration interrupts the processor five times to set up the read operation, but doesn't issue any further interrupts for the remainder of the read. This strategy obviously benefits even more from larger records, whereas the C drive strategy produces ever more interrupts as the record size grows.
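The contrast between the two interrupt strategies can be put into numbers with a short Python sketch. It assumes one interrupt per 512-byte sector for programmed I/O and five interrupts per read for DMA, as described above, and it rounds the D drive rate to 100 reads per second. The function names are illustrative.

```python
SECTOR_BYTES = 512
RECORD_BYTES = 65536
DMA_INTERRUPTS_PER_READ = 5  # setup interrupts observed for Drive D


def pio_interrupts_per_sec(reads_per_sec, record_bytes=RECORD_BYTES):
    """Programmed I/O: one interrupt for every sector transferred."""
    return reads_per_sec * record_bytes / SECTOR_BYTES


def dma_interrupts_per_sec(reads_per_sec):
    """DMA: a fixed number of interrupts per read, whatever the record size."""
    return reads_per_sec * DMA_INTERRUPTS_PER_READ


pio = pio_interrupts_per_sec(28.7)   # Drive C, 28.7 reads/sec
dma = dma_interrupts_per_sec(100)    # Drive D, roughly 100 reads/sec

print(f"Interrupts per 64K record (programmed I/O): {RECORD_BYTES // SECTOR_BYTES}")
print(f"Drive C: about {pio:.0f} interrupts/sec")
print(f"Drive D: about {dma:.0f} interrupts/sec")
```

Doubling the record size doubles the programmed I/O interrupt count per read, while the DMA count per read stays the same.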
This test is just an example of the kinds of tests you can use to compare disk configurations. A complete test would use the same methods to test reading, writing, reading and writing randomly and sequentially, and reading and writing records of different sizes. A vital test, one to measure maximum throughput on your disk, is explained in the following section.
Testing Maximum Disk Throughput
Disk configurations vary widely between design technologies and manufacturers. If disk performance is important to your system, it's wise to assemble and test different disk components. A maximum throughput test will tell you one of the limits of your system.
The Windows NT Resource Kit 4.0 CD includes all the files you need to use Response Probe to test maximum throughput on your disks. Use Setup to install the Performance Tools group from the CD. The test files are in the Probe subdirectory. Instructions for running the test are in Diskmax.txt.
Warning Response Probe, like other Windows NT Resource Kit 4.0 CD tools, is not an officially supported tool. Be aware of this when using Response Probe and other tools on the CD.
To test how fast the disk can go, give it the best possible circumstances: Have it read large (but not excessively large) records sequentially from a large file. In this test, Response Probe reads 64K records sequentially from a 20 MB file of zeros. The reads are not buffered by the cache, there is no datapage access, and the same codepage function is read repeatedly to minimize the effect of codepage access.
While Response Probe is running, use Performance Monitor to log the System and Logical Disk objects once per second. Then chart the following counters:
Disk Read Bytes/sec is the essential throughput measurement; the other counters are included to help in interpreting its value.
The following report of Response Probe activity was generated by using the Diskmax test files on the CD.
In this example, the maximum throughput, as measured by Disk Read Bytes/sec, is 1.54 MB/sec, which is quite good for this disk configuration, although higher throughput is expected from more advanced disk technologies. The disk was reading 23.5 of the 64K records per second and each read took 0.04 seconds on average.
Total interrupts for this activity, 3676 per second, seem excessive, but the processor was busy only 79% of the time, and the interrupts generated only 7.65% of that activity.
Run the Diskmax test on your disk configuration. When you have run it a few times, test the response of your disk configuration to changes in the test. For example, increase the size of the workload file to 60 or even 100 MB. Increase the record size, too. After you determine the maximum throughput for your disk, you can adjust the load on your disk so it doesn't become a bottleneck.
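If Response Probe is not available, the general shape of a sequential read test can be sketched in Python. This is not the Diskmax test and is not a substitute for it: the function names are hypothetical, and Python cannot easily bypass the operating system file cache, so a freshly written file may be read from memory and report throughput well above what the disk can sustain. Treat the result as a rough upper bound.

```python
import os
import tempfile
import time

RECORD_SIZE = 64 * 1024  # 64K records, as in the Diskmax test


def make_workload_file(size_bytes):
    """Create a temporary file of zeros and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size_bytes)
    return path


def measure_sequential_read(path, record_size=RECORD_SIZE):
    """Read the file front to back; return (bytes per second, reads per second)."""
    total_bytes = 0
    reads = 0
    start = time.perf_counter()
    # buffering=0 turns off Python's own buffer only; the operating system
    # cache may still satisfy reads, unlike Response Probe's unbuffered reads.
    with open(path, "rb", buffering=0) as f:
        while True:
            record = f.read(record_size)
            if not record:
                break
            total_bytes += len(record)
            reads += 1
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return total_bytes / elapsed, reads / elapsed


if __name__ == "__main__":
    workload = make_workload_file(20 * 1024 * 1024)  # 20 MB file of zeros
    try:
        bytes_per_sec, reads_per_sec = measure_sequential_read(workload)
        print(f"{bytes_per_sec / 1e6:.2f} MB/sec, {reads_per_sec:.1f} reads/sec")
    finally:
        os.remove(workload)
```

To vary the test as suggested above, change the file size passed to make_workload_file and the record_size argument, and compare the results with the Performance Monitor counters logged during the run.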
Monitoring Disk Sets
Two heads are better than one. In fact, when it comes to disks, the more heads the better, if you can keep them all busy. Because each physical disk typically has its own head stack assembly, and the heads work simultaneously, performance on disk sets can be significantly better than on single disks. However, some disk configurations are designed for data security, with performance as a secondary concern.
The Windows NT 4.0 Workstation Disk Administrator supports many different disk configurations, including volume sets, which are logical combinations of multiple physical disks. Performance Monitor and other monitoring tools can be set up to measure and help you tune performance on volume sets.
Note Whenever you combine noncontiguous physical disk space into a logical partition, the Disk Administrator adds Ftdisk.sys, a fault tolerant driver, to your disk driver stack and starts the FTDISK service. To see if FTDISK is started in your computer, check the Devices Control Panel.
There are three main strategies for combining physical disks. The terms introduced here are used throughout this section:
The counters used for single disks can also be used on disk sets. However, two issues are of particular concern for disk sets:
Use the new disk counters. Avg. Disk Queue Length, Avg. Disk Read Queue Length, and Avg. Disk Write Queue Length display disk activity as a decimal, not a percentage, so that they can display values over 1.0 (100%). Then, remember to recalculate the values over the whole disk configuration. For more information, see "New Disk Activity Counters" earlier in this chapter.
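One plausible reading of recalculating over the whole disk configuration is to divide the combined queue length by the number of physical disks in the set. The following Python sketch assumes that interpretation. The function name and the sample values are hypothetical and are not taken from the test logs.

```python
def queue_length_per_disk(total_avg_queue_length, disk_count):
    """Spread a combined Avg. Disk Queue Length across the disks in a set."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    return total_avg_queue_length / disk_count


# Hypothetical example: a combined value of 3.0 on a four-disk set.
per_disk = queue_length_per_disk(3.0, 4)
print(f"Average activity per disk: {per_disk:.2f} ({per_disk:.0%} busy)")
```

A combined value well above 1.0 does not by itself indicate a bottleneck on a disk set; the per-disk figure is the one to compare against 1.0.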
Reading from Stripe Sets
Windows NT Workstation supports most hardware RAID configurations and stripe sets without parity. Testing the performance of these volume sets is much like testing a single disk. The Response Probe tests used to measure single disks can also be run on any disk in a stripe set and on the virtual volume that hardware RAID exposes to Windows NT.
To test your volume sets, use the following counters:
Note The equivalent counters for measuring writing (for example, Avg. Disk Write Bytes/sec) are used to test the performance of volume sets while writing to disk. The values for reading and writing in our tests were so similar that showing the writing test added little value. However, you can use the same methods to test writing to disk on your volume sets.
These reading tests were run on a stripe set of four physical disks. Disks 0, 1, and 2 are on a single disk adapter, and Disk 3 is on a separate adapter. Performance Monitor is logging to Disk 3. In each test, the test tool is doing unbuffered, sequential reads of 64K records from a 60 MB file on a FAT partition. The test begins with reading only from Disk 0. Another physical disk was added with each iteration of the test to end with 4 stripes. During the test, Performance Monitor was logging data to Stripe_read.log, which is included on the Windows NT Resource Kit 4.0 CD.
Tip The logs recorded during these tests are included on the Windows NT Resource Kit 4.0 CD in the Performance Tools group. The logs are Stripe_read.log (sequential reading), Stripe_rand.log (random reading), and Stripe_write.log (sequential writing). Use Performance Monitor to chart the logs and follow along with the discussion that follows. The logs include data for the Processor, Logical Disk, and Physical Disk objects, so you can add more counters than those shown here.
The following graph shows an overview of the test, and the disk time contributed by each disk to the total effort. In the first segment of the graph there is one disk; in the second, two disks; in the third, three disks; and in the fourth, four disks.
The graph consists of Physical Disk: % Disk Read Time for all disks in the stripe set. The thin gray line represents Disk 0, the white line is Disk 1, the heavy black line is Disk 2, and the heavy gray line is Disk 3. The striping strategy apportions the workload rather equally in this case, so the lines are superimposed upon each other. This graph is designed to show that, as new disks were added to the test, each disk needed to contribute a smaller portion of its time to the task.
The following table shows the average values of Avg. Disk Read Queue Length, a measure of disk time in decimals, for each disk during each segment of the test.
This table shows how FTDISK, the Windows NT fault tolerant disk driver, distributes the workload among the stripes in the set, so each disk requires less time. The exception is Disk 0, which has a disproportionate share of the activity during the last stage of the test.
The following graph shows the effect of their combined efforts in the total work achieved by the stripe set.
In this graph, the gray line is Disk Reads/sec: Total, the heavy black line is Avg. Disk Bytes/Read: Total, the white line is Disk Read Bytes/sec: Total and the thin, black line at the bottom is Avg. Disk sec/Read: Total. The vertical maximum on the graph is increased to 200 to include all values.
The following figure shows the average values for each segment of the test.
Tip To produce a figure like this, open four copies of Performance Monitor and chart the counters you want to see for all available instances. The first copy is used just to show the counter names. Use the time window to set each copy of Performance Monitor to a different time segment of the test. Then, you can scroll each copy to the instance you want to examine in that time segment. In this example, the Total instance is shown for all time segments.
The graph and reports show that the transfer rate (Disk Reads/Sec: Total) is most affected by adding stripes to the set. It increases from an average of 69 reads/sec on a single disk to an average of 179 reads per second with four stripes. Throughput (Disk Read Bytes/Sec: Total) increases from an average of 4.5 MB per second to 11.75 MB/sec with four stripes.
Note that there is almost no change in the values upon adding the third stripe, Disk 2, to the set. The total transfer rate increases significantly with the addition of the second disk, but not at all with the third disk. Throughput, which is 4.5 MB/sec with one disk, inches up to an average of 4.8 MB/sec, then stays there until the fourth disk is added.
We cannot measure it directly, but it appears that this plateau is caused by a bottleneck on the disk adapter shared by Disks 0, 1, and 2. Although each of the physical disks has a separate head stack assembly, they still share the adapter. Shared resource contention is one of the limits of scalability. Multiple computers share the network, multiple processors share memory, multiple threads share processors, and multiple disks share adapters and buses. Fortunately, we can measure its effects and plan for future equipment needs.
The following graph shows how each disk is affected when stripes are added to the set. While the totals go up, each disk does less work. Potentially, it has time available for other work.
This is a graph of the transfer rate, as measured by Disk Reads/sec. The gray line is Disk Reads/Sec: Total, the black line is Disk Reads/sec: Disk 0, and the white line is Disk Reads/sec: Disk 1. The lines for Disks 2 and 3 run along the bottom of the graph until they are added, and then they are superimposed on the line for Disk 1.
The average values are:
These averages are a fairly good representation of the strategy of the stripe set controller as it distributes the workload equally among the stripes in the set. Each disk does less work, and the total achieved increases two and a half times. Note that the transfer rate did not increase fourfold; the difference is likely to be related to sharing of resources.
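The gap between the measured gain and the ideal fourfold gain can be expressed as a scaling factor and an efficiency. The following Python sketch uses the single-disk and four-stripe totals reported earlier for the sequential reading test. The function name is illustrative.

```python
def scaling(single_disk_value, stripe_set_value, disk_count):
    """Return (speedup factor, fraction of the ideal linear speedup achieved)."""
    factor = stripe_set_value / single_disk_value
    return factor, factor / disk_count


# Disk Reads/sec: Total, one disk compared with four stripes.
rate_factor, rate_efficiency = scaling(69, 179, 4)
# Disk Read Bytes/sec: Total in MB/sec, one disk compared with four stripes.
tput_factor, tput_efficiency = scaling(4.5, 11.75, 4)

print(f"Transfer rate: {rate_factor:.2f} times, {rate_efficiency:.0%} of ideal")
print(f"Throughput:    {tput_factor:.2f} times, {tput_efficiency:.0%} of ideal")
```

An efficiency well below 100% points to a shared resource, such as the disk adapter in this configuration, that limits how much each added disk can contribute.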
The cause of the exceptional values of Disk 0, which appear in every test, is not entirely clear. They probably result from updates to the File Allocation Table. The tests were run on a FAT partition which was striped across all participating drives. In each case, the File Allocation Table is likely to be written to the first disk, Disk 0. Because the File Allocation Table is contiguous and sequential, Disk 0 can perform at maximum capacity. It appears that distributing the load to the other disks let Disk 0 double its productivity in the last sample interval. More research will be required to determine what happened.
The next graph shows that the same pattern holds for throughput. As more stripes are added, the total throughput increases, and the work is distributed across all four disks. This data also shows the disproportionate workload on Disk 0.
This is a graph of disk throughput, as measured by Disk Read Bytes/sec. The gray line is Disk Read Bytes/sec: Total, the black line is Disk Read Bytes/sec: Disk 0, the white line is Disk Read Bytes/Sec: Disk 1. Again, the lines for Disks 2 and 3 run along the bottom of the graph until they are added, and then they are superimposed on the line for Disk 1.
This table shows the average throughput, in megabytes, for each disk as the test progresses.
The pattern, quite reasonably, is very similar to that for the transfer rate. The workload is distributed evenly and the total throughput rate achieved increases by a factor of 2.6. Disk 0 is still doing a disproportionate share of the work (57%), which probably consists of its share of the read operations plus updating the FAT table.
Random Reading from Stripe Sets
Reading and writing randomly throughout a disk is about the most laborious process required of disks. The constant seek activity consumes the disk and the processor. These issues were discussed in detail in the earlier section, "Random vs. Sequential Reading." This section describes how to test the effect of random reading on your volume set.
Note As in the previous section, the relative values for reads and writes are nearly indistinguishable, so data for writing is not shown here. You can, however, use the same methods to explore random writing performance on your disks.
As before, these reading tests were run on a stripe set of four physical disks. Disks 0, 1, and 2 are on a single disk adapter, and Disk 3 is on a separate adapter. Also, Performance Monitor is logging to Disk 3. In each test, the test tool is doing random, unbuffered reads of 64K records from a 60 MB file. The test begins reading only from Disk 0, and adds a disk with each iteration of the test to end with four stripes.
During the test, Performance Monitor was logging to Stripe_rand.log. This log is included on the Windows NT Resource Kit 4.0 CD, so you can follow along and chart additional counters.
Stripe sets are known for their seeking efficiency. They are created by associating free space in multiple physical disks. Code and data written to the stripe set are distributed evenly across all disks. Because each disk in the set has its own head stack assembly, the heads on each disk in the set can seek simultaneously. Some performance loss is expected when the disk is reading randomly, but it should not be as pronounced as on single disks or on unassociated disk configurations.
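To see why data spread across a stripe set lets several heads seek at once, consider a simple round-robin layout, sketched here in Python. This is a common striping scheme used for illustration only. The stripe unit size and the function name are assumptions, and the sketch does not claim to reproduce the layout that FTDISK actually uses.

```python
STRIPE_UNIT = 64 * 1024  # hypothetical stripe unit size, for illustration only


def locate(offset, disk_count, stripe_unit=STRIPE_UNIT):
    """Map a byte offset in the stripe set to (disk number, offset on that disk)."""
    stripe_index = offset // stripe_unit
    disk = stripe_index % disk_count
    offset_on_disk = (stripe_index // disk_count) * stripe_unit + offset % stripe_unit
    return disk, offset_on_disk


# Four consecutive 64K records on a four-disk set land on four different disks.
for record in range(4):
    disk, disk_offset = locate(record * STRIPE_UNIT, disk_count=4)
    print(f"record {record} -> disk {disk}, offset {disk_offset}")
```

Under this kind of layout, requests scattered throughout the volume tend to fall on different physical disks, so their seeks can overlap instead of queuing behind one head stack assembly.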
The following graph shows an overview of random reading performance for the disk set.
In this graph, the gray line is Disk Reads/Sec: Total, the thick, black line is Avg. Disk Bytes/Read: Total, the white line is Disk Read Bytes/Sec: Total and the thin, black line is Avg. Disk sec/Read: Total. The vertical maximum has been increased to 250 to incorporate all values.
The trend is much like that for sequential reads. As more stripes are added, the transfer rate (Disk Reads/sec) and throughput (Disk Read Bytes/sec) increase, and the average time per read (Avg. Disk sec/Read) diminishes.
The following figure compares the performance graphs of random and sequential reading by stripe sets. The graph on the left is sequential reading; the graph on the right is random reading. Both graphs in the figure show values for the _Total instance, representing all physical disks in the stripe set.
In both graphs, the gray line is Disk Reads/sec: Total, the thick, black line is Avg. Disk Bytes/Read: Total, the white line is Disk Read Bytes/Sec: Total and the thin, black line is Avg. Disk sec/Read: Total. The vertical maximum on both graphs has been set to 250 to incorporate all values.
This figure shows that although the patterns are similar, the values are slightly different. The transfer rate (Disk Reads/sec: Total) increases to more than 215 reads/sec in the random test. Throughput (Disk Read Bytes/sec: Total — the white line) runs lower on the random graph through almost every stage of the test.
The following tables compare the average values for sequential and random reading on the stripe set. To find these values, open a copy of Performance Monitor for each sample interval on the graph. Then use the time window to limit each one to a single sample interval, and display the disk reading counters in a report. These values were taken from four such reports.
Note that the performance lost by reading randomly diminishes significantly as disks are added. On a single disk, throughput on random is 53% lower than on sequential reading. The difference drops to 15% with two disks and, on four disks, throughput is 17% greater on random reading than on sequential reading.
The effect on individual disks in the set follows the patterns evidenced by the sequential reads. The following graph shows the effect of adding disks to the set on the transfer rate of each disk.
Physical Disk: Disk Reads/sec is charted for the _Total instance and for Disks 0 through 3. The gray line is the _Total, the black line is Disk 0, the white line is Disk 1. The lines representing Disks 2 and 3 run along the bottom of the graph until they are added to the test. Then they are superimposed upon the line for Disk 1.
The pattern continues. The transfer rate of the disk set increases with each disk added, but the work is distributed evenly to all disks in the set. The proportion of transfers for each disk declines accordingly.
The average values are shown in the following table.
As observed in the sequential reads, the increased workload is distributed equally among all disks. There appears to be slightly more variation in the values, but it is too small to measure accurately without a much larger sample pool. Again, the transfer rate on Disk 0 increases significantly when the fourth disk is added. It is probably doing its share of the reading and also updating the FAT table.
The following graph shows the throughput values for random reading on a stripe set. The chart shows Disk Read Bytes/sec for all disks in the stripe set.
In this graph, the gray line is the _Total instance for all disks, which increases as more disks are added to the stripe set. The heavy, black line is Disk 0, and the white line is Disk 1. The lines representing Disks 2 and 3 run along the bottom of the graph until they are added to the test. Then, they are superimposed upon the line for Disk 1.
This table shows the average values.
As disks are added, total throughput for the disk set increases 3.36 times, from 2.95 MB/sec to 9.91 MB/sec, compared to a 2.6 times increase for sequential reading. FTDISK is clearly taking advantage of the stripe set.
It is clear from this data that stripe sets are a very efficient means of disk transfer, and that the difference is especially apparent on very seek-intensive tasks such as random reading.
Although it is not shown in these graphs, processor use remained at 100% for the duration of the sequential and random reading and writing tests on stripe sets. The improved productivity has a cost in processor time.
Resolving a Disk Bottleneck
The obvious solution to a disk bottleneck is to add another disk. That is an appropriate solution if you are out of space. However, if your disk is too slow, the addition of a new disk might not be the cheapest or most effective solution. Disk systems have many parts, and any one could be the limiting factor in your disk configuration.