A database management system (DBMS), such as SQL Server, relies on the timeliness of file input and output (I/O) operations. Any one of the following items may cause stuck or stalled I/O operations and adversely affect SQL Server responsiveness and performance:
Incorrectly configured hardware
Other conditions in the I/O path
These I/O problems may cause the following behavior to occur:
Latch contention and time-outs
Slow response time
Stretching of resource boundaries
Starting in Microsoft SQL Server 2000 Service Pack 4 (SP4), SQL Server includes logic that helps detect stalled and stuck conditions for database I/O reads and writes and log file I/O reads and writes. When an I/O operation has been pending for 15 seconds or longer, SQL Server performs the following steps:
Detects that the operation has been pending
Writes an informational message to the SQL Server error log
The text of the log message resembles the following:
2004-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to finish on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is: 0x00000000022000”.
Informational message explanation
The number of I/O requests that did not complete the read or the write operation in less than 15 seconds.
The complete file name, the database name, and the database identification (DBID) number.
The operating system handle of the file. You can use the operating system handle with debuggers or with other utilities to help track I/O request packet (IRP) requests.
The offset of the last stuck I/O operation or the last stalled I/O operation. You can use the offset with debuggers or with other utilities to help track IRP requests.
Note When the informational message is written to the SQL Server error log, the I/O operation may no longer be stuck or stalled.
This informational message indicates that the current load may be experiencing one of the following conditions:
The workload is exceeding the I/O path capabilities.
The workload is exceeding the current system capabilities.
The I/O path has malfunctioning software; perhaps a firmware or a driver issue.
The I/O path has malfunctioning hardware components.
For more information about SQL Server 2000 I/O patterns, go to the following Microsoft website:
Note This TechNet article also applies to Microsoft SQL Server 2005 and later versions.
Stuck I/O and Stalled I/O
Stuck I/O is defined as an I/O request that does not finish. Frequently, stuck I/O indicates a stuck IRP. To resolve a stuck I/O condition, you must usually restart the computer or perform a similar action. A stuck I/O condition typically indicates one of the following:
A bug in an I/O path component
Stalled I/O is defined as an I/O request that does finish or that takes excessive time to finish. Stalled I/O behavior typically occurs because of one of the following reasons:
The hardware configuration
The firmware settings
A filter driver issue that requires assistance from the hardware or the software vendor to trace and resolve
SQL Server stalled I/O and stuck I/O recording and reporting
Microsoft SQL Server Support handles many cases each year that involve stuck or stalled I/O problems. These I/O problems appear in different ways. I/O problems are some of the most difficult problems to diagnose and to debug, and they require significant time and resources for debugging from Microsoft and from the customer. The reporting features that have been added to SQL Server 2000 SP4 and later versions significantly reduce the time that is required to identify an I/O problem.
The reporting and the recording of I/O requests are designed on a per-file basis. The detection and the reporting of stalled and stuck I/O requests are two separate actions.
There are two moments when a record action occurs in SQL Server. The first is when the I/O operation actually finishes. If an I/O request takes more than 15 seconds to finish, a record operation occurs. The second moment is when the lazy writer runs. When the lazy writer runs, the lazy writer checks all the pending data and all the pending log file I/O requests. If the 15-second threshold has been exceeded, a record operation occurs.
Reporting occurs in intervals that are 5 minutes or more apart. Reporting occurs when the next I/O request is made on the file. If a record action has occurred and 5 minutes or more have passed since the last report occurred, the informational message that is mentioned in the "Summary" section is written to the SQL Server error log.
The 15-second threshold is not adjustable. However, you can disable stalled or stuck I/O detection by using trace flag 830, although we do not recommend that you do this.
To disable detection when SQL Server starts, use the -T830 startup parameter to disable detection every time that SQL Server is started. To disable detection for an instance of SQL Server that is currently running, use the following statement:
dbcc traceoff(830, -1)
This setting is effective only for the life of the SQL Server process.
Note An I/O request that becomes stalled or stuck is reported only one time. For example, if the message reports that 10 I/O requests are stalled, those 10 reports will not occur again. If the next message reports that 15 I/O requests are stalled, that means that 15 new I/O requests have become stalled.
Tracking the I/O request packet (IRP)
SQL Server uses the standard Microsoft Windows API calls to read and write data. For example, SQL Server uses the following functions:
The read or write request is handled by Windows as an I/O request packet (IRP). To determine the state of the IRP, use both of the following:
Microsoft Platform Support assistance
The kernel debugger
For more information about IRP and IRP tracing, go to the following Microsoft website and search on the keyword "IRP":
Note Kernel debugging can be an invasive process because kernel debugging can require you to stop the system to complete the debugging actions. We recommend that you check for any available updates for the following items:
Any other I/O path components
Contact your hardware vendors before you perform additional debugging actions. The debug session will likely involve a third-party driver, firmware, or filter driver component.
System performance and query plan actions
Overall system performance can play a key role in I/O processing. You should take the general health of the system into consideration when you are investigating reports of stalled or stuck I/O operations. Excessive loads can cause the overall system to be slow. This includes I/O processing. The behavior of the system at the time that the problem occurs can be a key factor in determining the root cause of the problem. For example, if CPU usage becomes high or if CPU usage remains high when the problem occurs, this behavior may indicate that a process on the system is using so much CPU that other processes are being adversely affected.
To monitor I/O performance, examine the following performance counters for specific I/O path information:
Average Disk Sec/Transfer
Average Disk Queue Length
Current Disk Queue Length
For example, the Average Disk Sec/Transfer time on a computer that is running SQL Server is typically less than 15 milliseconds. If the Average Disk Sec/Transfer value climbs, this indicates that the I/O subsystem is not optimally keeping up with the I/O demand.
Be careful when you use the performance counters because SQL Server takes full advantage of asynchronous I/O capabilities that push the disk queue lengths heavily. Therefore, longer disk queue lengths alone do not indicate a problem.
In Windows System Monitor, you can review the counters "Physical Disk: Disk Bytes/sec" for each affected disk and compare the rate of activity against the counters "Process: IO Data Bytes/Sec" and “Process: IO Other Bytes/sec” for each process to identify whether a specific set of processes is generating excessive I/O requests. There are various other I/O related counters available in the Process object that reveals more granular information. If you determine that a SQL Server instance is responsible for excessive IO load on the server, review the next section on “Indexes and Parallelism”. For a detailed discussion on detecting and resolving I/O bottlenecks, review the section “I/O Bottlenecks” in the MSDN whitepaper Troubleshooting Performance Problems in SQL Server 2008 or Troubleshooting Performance Problems in SQL Server 2005.
Indexes and parallelism
Frequently, bursts of I/O occur because an index is missing. This behavior can severely push the I/O path. A pass that uses the Index Turning Wizard (ITW) may help resolve I/O pressure on the system. If a query benefits from an index instead of a table scan, or perhaps if it uses a sort or hash, the system can gain the following advantages:
A reduction is made in the physical I/O that is required to complete the action that directly creates performance benefits for the query.
Fewer pages in the data cache have to be turned over. Therefore, those pages that are in the data cache remain relevant to active queries.
Sorts and hashes are used because an index may be missing or because statistics are out of date. You may reduce tempdb use and contention by adding one or more indexes.
A reduction is made in resources, parallel operations, or both. Because SQL Server does not guarantee parallel query execution, and because the load on the system is considered, it is best to optimize all queries for serial execution. To optimize a query, open Query Analyzer and set the sp_configure value of the max degree of parallelism option to 1. If all the queries are tuned to run promptly as a serial operation, parallel execution is often just a better result. However, many times parallel execution is selected because the amount of data is just large. For a missing index, a large sort may have to occur. Multiple workers that are performing the sort operation will create a quicker response. However, this action can dramatically increase the pressure on the system. Large read requests from many workers can cause an I/O burst together with increased CPU usage from multiple workers. Many times a query can be tuned to run faster and to use fewer resources if an index is added or if another tuning action occurs.
Practical examples from Microsoft SQL Server Support
The following examples have been handled by Microsoft SQL Server Support and Platforms Escalation Support. These examples are intended to provide a frame of reference and help set your expectations about stalled and stuck I/O situations and about how a system may be affected or may respond. There is no specific hardware or set of drivers that pose any specific risk or increased risk over another. All systems are the same in this respect.
Example 1: A log write that is stuck for 45 seconds
An attempt to write a SQL Server log file periodically becomes stuck for approximately 45 seconds. The log write does not finish in a timely manner. This behavior creates a blocking condition that causes 30-second client time-outs.
The application submitted a commit to SQL Server, and the commit became stuck as a log write pending. This behavior causes the query to continue holding locks and to block incoming requests from other clients. Then, other clients start to time out. This compounds the problem because the application does not roll back open transactions when a query time-out occurs. This creates hundreds of open transactions that are holding locks. Therefore, a severe blocking situation occurs.
For more information about transaction handling and blocking, see the following Microsoft Knowledge Base article:
The application services a website by using connection pooling. As more connections become blocked, the website creates more connections. These connections become blocked, and the cycle continues.
After approximately 45 seconds, the log write finishes. However, by this time, hundreds of connections are backed up. The blocking problems cause several minutes of recovery time for SQL Server and for the application. Combined with the application problems, the stalled I/O condition has a very negative effect on the system.
This problem was tracked to a stuck I/O request in a Host Bus Adapter (HBA) driver. The computer had multiple HBA cards with failover support. When one HBA was behind or was not communicating with the Storage Area Network (SAN), the "retry before failover" time-out value was configured to 45 seconds. When the time-out was exceeded, the I/O request was routed to the second HBA. The second HBA handled the request and quickly finished. To help prevent such stall conditions, the hardware manufacturer recommended a "retry before failover" setting of 5 seconds.
Example 2: Filter driver intervention
Many antivirus software programs and backup products use I/O filter drivers. These I/O filter drivers become part of the I/O request stack, and they have access to the IRP request. Microsoft Product Support Services has seen various issues from bugs that create stuck I/O conditions or stalled I/O conditions in a filter driver implementation.
One such condition was a filter driver for backup processing that allowed a backup of the files that were open when the backup occurred. The system administrator had included the SQL Server data file directory in the file backup selections. When the backup occurred, the backup tried to gather the correct image of the file at the time the backup started. Doing this delayed I/O requests. The I/O requests were allowed to finish only one at a time as they were handled by the software.
When the backup started, SQL Server performance dropped dramatically because the I/Os of SQL Server were forced to finish one at a time. To compound the issue, the "one at a time" logic was such that the I/O operation could not be performed asynchronously. Therefore, when SQL Server expected to post an I/O request and to continue, the worker was stuck in the read or the write call until the I/O request finished. Processing tasks such as a SQL Server read-ahead were effectively disabled by the actions of the filter driver. Additionally, another bug in the filter driver left the "one at a time" actions in process, even when the backup was finished. The only way to restore SQL Server performance was to close and then reopen the database, or to restart SQL Server so that the file handle was released and reacquired without the filter driver interaction.
To resolve this problem, the SQL Server data files were removed from the file backup process. The software manufacture also corrected the problem that left the file in "one at a time" mode.
Example 3: Hidden errors
Many higher-end systems have multichannel I/O paths to handle load balancing or similar activities. Microsoft Product Support has found problems with the load balancing software where an I/O request fails but the software does not handle the error condition correctly. The software can attempt infinite retries. The I/O operation becomes stuck and SQL Server cannot finish the specified action. Much like the log write condition that was described earlier, many poor system behaviors can occur after such a condition wedges the system.
To resolve this problem, restarting SQL Server is often required. However, sometimes you must restart the operating system to restore processing. We also recommend that you obtain a software update from the I/O vendor.
Example 4: Remote storage, mirroring, and raid drives
Many systems use mirroring or take similar steps to prevent data loss. Some systems that use mirroring are software-based and some are hardware-based. The situation that is typically discovered by Microsoft Support for these systems is increased latency.
An increase in the overall I/O time occurs when the I/O must finish to the mirror before the I/O is considered complete. For remote mirror installations, network retries can become involved. When drive failures occur and the raid system is rebuilding, the I/O pattern can also be interrupted.
Strict configuration settings are required to reduce latency to mirrors or to raid rebuild operations.
Microsoft does not support Microsoft SQL Server 7.0 or Microsoft SQL Server 2000 data files and log files on compressed drives. NTFS compression is not safe for SQL Server because NTFS compression breaks Write Ahead Logging (WAL) protocol. NTFS compression also requires increased processing for each I/O operation. Compression creates "one at a time" like behavior that causes severe performance issues to occur.
To resolve this problem, uncompress the data and the log files.
PAGEIOLATCH_* and writelog waits in sys.dm_os_wait_stats dynamic management views (DMV) are key indictors to investigate I/O path performance. If you see significant PAGEIOLATCH waits, this means that SQL Server is waiting on the I/O subsystem. A certain amount of PAGEIOLATCH waits is typical and expected behavior. However, if the average PAGEIOLATCH wait times are consistently greater than 10 milliseconds (ms), you should investigate why the I/O subsystem is under pressure. For more information, see the following documents:
SQL Server requires that systems support "guaranteed delivery to stable media" as outlined under the SQL Server I/O Reliability Program Requirements. For more information about the input and output requirements for the SQL Server database engine, go to the following Microsoft Knowledge Base article: