To learn more about the underlying concepts of the engines and about ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), see the following ACM Transactions on Database Systems document (under "Volume 17, Number 1, March 1992"):
We recommend that you read the following articles in the Microsoft Knowledge Base for more information about caching and alternative failure mode discussions:
|Battery-backed||Separate and localized battery backup facility directly available and controlled by the caching mechanism to prevent data loss. |
Note This is not an uninterruptible power supply (UPS). A UPS does not guarantee any write activities and can be disconnected from the caching device.
|Cache||Intermediary storage mechanism used to optimize physical I/O operations and improve performance.|
|Dirty Page||Page containing data modifications that have yet to be flushed to stable storage. For more information about dirty page buffers, see the "Writing Pages" topic in SQL Server Books Online. |
Note The content also applies to Microsoft SQL Server 2012 and later versions.
|Failure||Anything that might cause an unexpected outage of the SQL Server process. Examples include: power outage, computer reset, memory errors, other hardware issues, bad sectors, drive outages, system failures, and so on.|
|Flush||Forcing of a cache buffer to stable storage.|
|Latch||Synchronization object used to protect physical consistency of a resource.|
|Nonvolatile storage||Any medium that remains available across system failures.|
|Pinned page||Page that remains in data cache and cannot be flushed to stable storage until all associated log records are secured in a stable storage location.|
|Stable storage||Same as nonvolatile storage.|
|Volatile storage||Any medium that will not remain intact across failures.|
Write-Ahead Logging (WAL) protocol

The term protocol is an excellent way to describe WAL. It is a specific and defined set of implementation steps necessary to make sure that data is stored and exchanged correctly and can be recovered to a known state in the event of a failure. Just as a network relies on a defined protocol to exchange data in a consistent and protected manner, the WAL describes the protocol for protecting data.
The ARIES document defines the WAL as follows:
SQL Server and the WAL

SQL Server uses the WAL protocol. To make sure that a transaction is correctly committed, all log records that are associated with the transaction must be secured in stable storage.
To clarify this situation, consider the following specific example.
Note For this example, assume that there is no index and that the affected page is page 150.
INSERT INTO tblTest VALUES (1)
|BEGIN TRANSACTION||Written to the log cache area. However, it is not necessary to flush to stable storage, because SQL Server has not made any physical changes.|
|INSERT INTO tblTest VALUES (1)||Page 150 is retrieved into the SQL Server data cache. Appropriate log records are built in the log cache area, and the modification is made to the cached page. The page is latched, marked dirty, and pinned so that it cannot be flushed until the associated log records are secured in stable storage.|
Looking at the example in more detail, you might ask what happens when the LazyWriter or CheckPoint process runs. SQL Server issues all appropriate flushes to stable storage for the transaction log records that are associated with the dirty and pinned page. This makes sure that, per the WAL protocol, a data page can never be written to stable storage until the associated transaction log records have been flushed.
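The ordering rule described above can be illustrated with a small sketch. This is not SQL Server code: the class, the LSN scheme, and the page number are invented for illustration, assuming only the WAL rule that a dirty page cannot reach stable storage before its associated log records do.

```python
# Minimal sketch of the WAL ordering rule: a dirty page may be
# flushed only after every log record it depends on is stable.
# All names and structures here are illustrative, not SQL Server internals.

class MiniWal:
    def __init__(self):
        self.log_cache = []      # log records not yet on stable storage
        self.stable_log = []     # simulated stable storage for the log
        self.stable_pages = {}   # simulated stable storage for data pages
        self.dirty_pages = {}    # page_id -> (data, last LSN touching it)

    def log(self, record):
        lsn = len(self.stable_log) + len(self.log_cache)  # next LSN
        self.log_cache.append((lsn, record))
        return lsn

    def modify_page(self, page_id, data, record):
        lsn = self.log(record)                 # build the log record FIRST
        self.dirty_pages[page_id] = (data, lsn)

    def flush_log(self):
        self.stable_log.extend(self.log_cache)
        self.log_cache.clear()

    def flush_page(self, page_id):
        data, lsn = self.dirty_pages[page_id]
        # WAL rule: log records up to this page's LSN must be stable first.
        if any(l <= lsn for l, _ in self.log_cache):
            self.flush_log()                   # force the log out first
        self.stable_pages[page_id] = data
        del self.dirty_pages[page_id]

wal = MiniWal()
wal.modify_page(150, b"row: 1", "INSERT INTO tblTest VALUES (1)")
wal.flush_page(150)                            # triggers the log flush
print(len(wal.stable_log))                     # 1
print(150 in wal.stable_pages)                 # True
```

Flushing page 150 forces the pending log record to stable storage first, which is the behavior the LazyWriter and CheckPoint discussion above describes.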
SQL Server and stable storage

SQL Server enhances log and data page operations by including knowledge of disk sector sizes (commonly 512 or 4,096 bytes).
To maintain the ACID properties of a transaction, SQL Server must account for failure points. Many disk drive specifications guarantee only a limited amount of sector write activity during a failure. Most specifications guarantee completion of the single sector write that is in progress when a failure occurs.
SQL Server uses 8-KB data pages and flushes the log in multiples of the sector size. (Most disk drives use 512 bytes as the default sector size.) In the case of a failure, SQL Server accounts for write operations larger than a sector by employing log parity and torn write techniques.
Torn page detection

This option allows SQL Server to detect incomplete I/O operations caused by power failures or other system outages. When this option is enabled, a bit is flipped for each 512-byte sector in an 8-kilobyte (KB) database page whenever the page is written to disk. If a bit is in the wrong state when the page is later read by SQL Server, the page was written incorrectly, and a torn page is detected. Torn pages are usually detected during recovery, because any page that was written incorrectly is likely to be read by recovery.
Although SQL Server database pages are 8 KB, disks perform I/O operations by using a 512-byte sector. Therefore, 16 sectors are written per database page. A torn page can occur if the system fails (for example, because of a power failure) between the time the operating system writes the first 512-byte sector to disk and the completion of the 8-KB I/O operation. If the first sector of a database page is successfully written before the failure, the database page on disk appears to have been updated, although the write of the page as a whole did not succeed.
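As a rough illustration of this kind of detection, the following sketch stamps a token byte into each 512-byte sector at write time and flags a mismatch on read. The on-disk layout here is invented for the example; SQL Server's actual page format differs.

```python
# Illustrative sketch of torn-write detection on an 8-KB page split
# into sixteen 512-byte sectors (NOT SQL Server's actual on-disk format).

PAGE, SECTOR = 8192, 512

def write_page(data: bytes, token: int) -> bytes:
    """Stamp the same 1-byte token into the tail of every sector."""
    assert len(data) == PAGE
    out = bytearray(data)
    for start in range(0, PAGE, SECTOR):
        out[start + SECTOR - 1] = token
    return bytes(out)

def is_torn(page: bytes) -> bool:
    """A page is torn if its sectors carry mismatched tokens."""
    tokens = {page[start + SECTOR - 1] for start in range(0, PAGE, SECTOR)}
    return len(tokens) != 1

good = write_page(bytes(PAGE), token=7)
# Simulate a power failure after only the first 4 sectors of a
# rewrite (stamped with token 8) reached the disk:
torn = write_page(bytes(PAGE), token=8)[:4 * SECTOR] + good[4 * SECTOR:]
print(is_torn(good))   # False
print(is_torn(torn))   # True
```

On the later read, the mixed tokens reveal that only part of the 8-KB write completed, which is exactly the failure mode the paragraph above describes.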
By using battery-backed disk controller caches, you can make sure that data is successfully written to disk or not written at all. In this situation, do not set torn page detection to "true" because this is not necessary.
Note Torn page detection is not enabled by default in SQL Server. For more information, see the following MSDN website:
Log parity

Log parity checking is very similar to torn page detection. Each 512-byte sector contains parity bits. These parity bits are always written with the log record and evaluated when the log record is retrieved. By forcing log writes on a 512-byte boundary, SQL Server can make sure that commit operations are completely written to the physical disk sectors.
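A simplified sketch of per-sector parity follows, assuming (purely for illustration) a single XOR parity byte reserved at the end of each 512-byte sector. The real log-block layout is not documented here and is certainly different.

```python
# Rough sketch of per-sector parity on log writes. The layout (one XOR
# parity byte at the end of each 512-byte sector) is an assumption made
# for illustration, not SQL Server's actual log format.

SECTOR = 512

def stamp_parity(sector: bytes) -> bytes:
    """Fill the last byte of the sector with an XOR parity byte."""
    body = sector[:SECTOR - 1]
    parity = 0
    for b in body:
        parity ^= b
    return body + bytes([parity])

def parity_ok(sector: bytes) -> bool:
    """Re-evaluate the parity when the sector is read back."""
    parity = 0
    for b in sector[:SECTOR - 1]:
        parity ^= b
    return parity == sector[SECTOR - 1]

rec = stamp_parity(b"log record".ljust(SECTOR, b"\x00"))
print(parity_ok(rec))                    # True
corrupt = rec[:10] + b"X" + rec[11:]     # one damaged byte
print(parity_ok(corrupt))                # False
```

A sector that was never rewritten with the final log content fails the check on retrieval, which is the property the paragraph above relies on.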
Versions of SQL Server earlier than 7.0

Versions of SQL Server earlier than 7.0 did not provide log parity or torn bit detection facilities. In fact, those versions could write the same log page multiple times until the log records filled the 2-KB log page. This could expose transactions that had already successfully committed: if the log page was being rewritten during a failure, a sector that contained the committed transaction might not be rewritten correctly.
Performance impacts

All versions of SQL Server open the log and data files by using the Win32 CreateFile function, and they specify the FILE_FLAG_WRITE_THROUGH option in the dwFlagsAndAttributes parameter.
Instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
The FILE_FLAG_WRITE_THROUGH option makes sure that when a write operation returns successful completion, the data is correctly stored in stable storage. This aligns with the WAL protocol requirement that the data be secured in stable storage.
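SQL Server itself uses Win32 CreateFile with FILE_FLAG_WRITE_THROUGH. Purely as a cross-platform illustration of the same idea, a roughly analogous effect can be requested on POSIX systems with the O_SYNC open flag:

```python
# Cross-platform illustration ONLY: POSIX O_SYNC is roughly analogous
# to write-through, in that each write() returns only after the data
# reaches stable storage. SQL Server does not use this code path.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "minilog.dat")
flags = os.O_WRONLY | os.O_CREAT
if hasattr(os, "O_SYNC"):          # not available on every platform
    flags |= os.O_SYNC             # request synchronized writes
fd = os.open(path, flags, 0o600)
try:
    written = os.write(fd, b"COMMIT record")
finally:
    os.close(fd)
print(written)                     # 13
```

With the flag set, a successful return from the write means the bytes are on stable storage rather than pending in a volatile cache, which is the guarantee the WAL protocol depends on.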
Many hardware vendors provide battery-backed disk controller solutions. These controller caches can maintain the data in the cache for several days and even allow the caching hardware to be moved to a second computer. When power is correctly restored, the unwritten data is completely flushed before further data access is allowed. Many of these solutions allow a percentage of read versus write cache to be established for optimal performance. Some contain large memory storage areas. In fact, for a very specific segment of the market, some hardware vendors provide high-end battery-backed disk caching controller systems with 6 GB of cache. These can significantly improve database performance.
Advanced caching implementations will handle the FILE_FLAG_WRITE_THROUGH request by not disabling the controller cache because they can provide true rewrite capabilities in the event of a system reset, power failure, or other failure point.
I/O transfers performed without a cache can take significantly longer because of the mechanical time that is required to move the drive heads, spin rates, and other limiting factors.
Sector ordering

A common technique used to increase I/O performance is sector ordering. To avoid mechanical head movement, the read and write requests are sorted so that the head moves in a more consistent motion to retrieve or store data.
The cache can hold multiple log and data write requests at the same time. The WAL protocol and the SQL Server implementation of the WAL protocol require flushing of the log writes to stable storage before the page write can be issued. However, use of the cache might return success from a log write request without the data being written to the actual drive (that is, written to stable storage). This may lead to SQL Server issuing the data page write request.
With write cache involvement, the data is still considered to be in volatile storage. However, from the Win32 WriteFile API call (which is exactly how SQL Server sees the activity), a successful return code was obtained. SQL Server, or any process that uses the WriteFile API call, can only assume that the data has been correctly secured in stable storage.
For discussion purposes, assume that all sectors of the data page are sorted to write before the sectors of the matching log records. This immediately violates the WAL protocol. The cache is writing a data page before the log records. Unless the cache is fully battery-backed, a failure can cause catastrophic results.
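This hazard can be simulated in a few lines. The disk positions and the crash point below are invented; the point is only that position-sorted flushing can persist the data page before its log records, violating the WAL ordering.

```python
# Toy simulation of the hazard described above: a write cache that
# reorders queued writes by disk position can put a data page on disk
# before its log records, so a crash mid-flush violates WAL.
# All positions, sizes, and the crash point are invented for illustration.

queue = [
    ("log",  900, b"log sectors for page 150"),   # issued first, per WAL
    ("page", 150, b"data page 150"),              # issued second
]

# The cache sorts by disk position to minimize head movement:
reordered = sorted(queue, key=lambda write: write[1])

on_disk = []
for kind, pos, data in reordered:
    on_disk.append(kind)
    if kind == "page":      # power fails right after the page lands
        break

print(on_disk)              # ['page'] -> page persisted, log records lost
```

After the simulated failure, stable storage holds a modified data page with no log records describing the change, so recovery cannot bring the page back to a known state.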
When you evaluate performance for a database server, there are many factors to consider. The most important question is: "Does my system allow valid FILE_FLAG_WRITE_THROUGH capabilities?"
Note Any cache that you are using must fully support a battery-backed solution. All other caching mechanisms are susceptible to data corruption and data loss. SQL Server makes every effort to ensure the WAL by enabling FILE_FLAG_WRITE_THROUGH.
Testing has shown that many disk drive configurations enable write caching without appropriate battery backup. SCSI, IDE, and EIDE drives take full advantage of write caches. For more information about how SSDs work together with SQL Server, see the following CSS SQL Server Engineers Blog article:
In many configurations, the only way to correctly disable the write caching of an IDE or EIDE drive is by using a specific manufacturer utility or by using jumpers located on the drive itself. To make sure that the write cache is disabled for the drive itself, contact the drive manufacturer.
SCSI drives also have write caches. However, these caches can commonly be disabled by the operating system. If there is any question, contact the drive manufacturer for appropriate utilities.
Write Cache Stacking

Write Cache Stacking is similar to sector ordering. The following definition was taken directly from a leading IDE drive manufacturer's website:
A disk write task begins to store the host data to disk. Host write commands continue to be accepted and data transferred to the buffer until either the write command stack is full or the data buffer is full. The drive may reorder write commands to optimize drive throughput.
Automatic Write Reallocation (AWR)

Another common technique that is used to protect data is to detect bad sectors during data manipulation. The following explanation comes from a leading IDE drive manufacturer's website:
Data safety

There are several precautions that a database administrator should take to ensure the safety of the data.
- It is always a good idea to make sure that your backup strategy is sufficient to recover from a catastrophic failure. Offsite storage and other precautions are appropriate.
- Test the database restore operation in a secondary or test database on a frequent basis.
- Make sure that any caching devices can handle all failure situations (power outage, bad sectors, bad drives, system outage, lockups, power spike, and so on).
- Make sure that your caching device:
- Has integrated battery backup
- Can reissue writes on power-up
- Can be fully disabled if it is necessary
- Handles bad sector remapping in real time
- Enable torn page detection. (This has little effect on performance.)
- Configure RAID drives to allow hot swapping of a bad disk drive, if possible.
- Use newer caching controllers that let you add more disk space without restarting the OS. This can be an ideal solution.
Testing drives

To fully secure your data, you should make sure that all data caching is correctly handled. In many situations, this means that you must disable the write caching of the disk drive.
Note Make sure that an alternative caching mechanism can correctly handle multiple types of failure.
Microsoft has performed testing on several SCSI and IDE drives by using the SQLIOSim utility. This utility simulates heavy asynchronous read/write activity against a simulated data device and log device. Test statistics show an average of between 50 and 70 write operations per second for drives with write caching disabled and spindle speeds between 5,200 and 7,200 RPM.
For more information about the SQLIOSim utility, see the following article in the Microsoft Knowledge Base:
Data devices

Except during non-logged operations, SQL Server requires only the log records to be flushed to stable storage. When you perform non-logged operations, the data pages must also be flushed to stable storage, because there are no individual log records with which to regenerate the actions in the event of a failure.
The data pages can remain in cache until the LazyWriter or CheckPoint process flushes them to stable storage. Using the WAL protocol to make sure that the log records are correctly stored makes sure that recovery can recover a data page to a known state.
This does not mean that it is advisable to place data files on a cached drive. When the SQL Server flushes the data pages to stable storage, the log records can be truncated from the transaction log. If the data pages are stored on volatile cache, it is possible to truncate log records that would be used to recover a page in the event of a failure. Make sure that both your data and log devices accommodate stable storage correctly.
Increasing performance

The first question that might occur to you is: "I have an IDE drive that was caching. But when I disabled it, my performance became significantly less than expected. Why?"
Many of the IDE drives that are tested by Microsoft run at 5,200 RPM, and the SCSI drives at 7,200 RPM. When you disable the write caching of the IDE drive, the mechanical performance can become a factor.
To address the performance difference, the method to follow is very clear: "Address the transaction rate."
Many online transaction processing (OLTP) systems require a high transaction rate. For these systems, consider using a caching controller that can appropriately support a write cache and provide the desired performance boost while still ensuring data integrity.
To demonstrate the significant performance effect that drive caching can have on SQL Server, testing increased the transaction rate by using many small transactions.
Testing shows that high write activity using buffers that are smaller than 512 KB or larger than 2 MB can cause slow performance.
CREATE TABLE tblTest ( iID int IDENTITY(1,1), strData char(10))
SET NOCOUNT ON
INSERT INTO tblTest VALUES ('Test')
WHILE @@IDENTITY < 10000
INSERT INTO tblTest VALUES ('Test')
|SCSI (7,200 RPM), caching controller||15 seconds|
|IDE (5,200 RPM), drive cache enabled||14 seconds|
|IDE (5,200 RPM), drive cache disabled||160 seconds|
Wrapping the entire series of INSERT operations in a single transaction runs in approximately four seconds in all configurations. The difference is the number of log flushes that are required. If you do not use a single transaction, every INSERT is processed as a separate transaction, so all the log records for each transaction must be flushed individually. Each flush is 512 bytes in size, which requires significant mechanical drive intervention.
When a single transaction is used, the log records for the transaction can be bundled, and a single, larger write can be used to flush the gathered log records. This significantly reduces the mechanical intervention.
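The flush arithmetic behind this can be sketched in a back-of-the-envelope way, assuming (as a deliberate simplification) exactly one forced log flush per commit:

```python
# Sketch of why one larger transaction needs far fewer log flushes:
# each commit forces a log flush, so 10,000 autocommit INSERTs flush
# about 10,000 times, while one wrapping transaction bundles the log
# records into far fewer, larger writes. One flush per commit is a
# deliberate simplification for illustration, not a measured figure.

def flushes(rows: int, rows_per_txn: int) -> int:
    """One forced log flush per commit (simplified model)."""
    return -(-rows // rows_per_txn)   # ceiling division

print(flushes(10_000, 1))        # 10000 flushes: autocommit INSERTs
print(flushes(10_000, 10_000))   # 1 flush: one wrapping transaction
print(flushes(10_000, 10))       # 1000 flushes: batches of 10 rows
```

Batching by even a modest factor removes most of the small, mechanically expensive 512-byte flushes, which matches the timing difference shown in the table above.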
Warning We recommend that you do not simply increase your transaction scope. Long-running transactions can cause excessive and unwanted blocking and increased overhead. Use the SQL Server:Databases performance object counters to view the transaction log activity. Specifically, a high Log Bytes Flushed/sec value can indicate many small transactions that cause high mechanical disk activity.
Examine the statements that are associated with the log flush to determine whether the Log Bytes Flushed/sec value can be reduced. In the previous example, a single transaction was used. However, in many scenarios, this can cause undesired locking behavior. Examine the design of the transaction. You can use code similar to the following to commit in batches and reduce frequent, small log flush activity:
BEGIN TRAN
INSERT INTO tblTest VALUES ('Test')
WHILE @@IDENTITY < 50
BEGIN
    INSERT INTO tblTest VALUES ('Test')
    IF (0 = CAST(@@IDENTITY AS int) % 10)
    BEGIN
        COMMIT TRAN
        PRINT 'Commit tran batch'
        BEGIN TRAN
    END
END
COMMIT TRAN