Troubleshooting Multiple Cluster Symptoms on the Same SAN

Summary

This article describes the symptoms that can occur when multiple clusters share the same storage area network (SAN), certification requirements are not met, and the disks are visible to the nodes of more than one cluster. A multiple-cluster configuration is one in which more than one MSCS cluster is assigned to one or more fiber-attached host bus adapters (HBAs).

These same SAN devices can also be attached to mainframe computers or to computers that run UNIX operating systems. This can present challenges because of differences in SCSI command sets. These anomalies can be caused by firmware revisions or by the inability to properly zone or mask the bus resets that are used to control the disks with the MSCS Cluster service.

Without this protection in place (proper masking, zoning, or a combination of both), the following problems can occur:
  • Lost delayed-write errors in various forms.
  • Chkdsk.exe may run.
  • Data loss or damaged databases.
  • The same disk is visible from both servers.
  • Disk-reservation issues in which the disks are lost temporarily or permanently.
  • Error messages on a blue screen, such as the following error message:
    Stop 0x0000001e (0xc000009a, 0x80469ac6, 0x00000000, 0x00000000)
  • Mismatched disk sets from one node of the cluster to the other.
IMPORTANT: You should contact the SAN vendor for the specific technology to use.
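
The masking and zoning requirement above can be checked against an inventory of which hosts see which disks. The following is a minimal Python sketch, not a vendor tool: the function name, the LUN labels, the host names, and the two dictionaries are all hypothetical, and in practice the inventory would have to come from the SAN vendor's own management software. It flags any LUN that is visible to nodes of more than one cluster, or to a non-clustered host (for example, a UNIX system) together with cluster nodes.

```python
def find_cross_cluster_luns(lun_visibility, cluster_membership):
    """Return LUNs that are visible to hosts belonging to more than one owner.

    lun_visibility:     hypothetical map of LUN label -> set of host names that see it
    cluster_membership: hypothetical map of host name -> cluster name
    Hosts that are not in cluster_membership count as their own owner.
    """
    conflicts = {}
    for lun, hosts in lun_visibility.items():
        owners = {cluster_membership.get(host, f"unclustered:{host}") for host in hosts}
        if len(owners) > 1:
            conflicts[lun] = sorted(owners)
    return conflicts


# Hypothetical inventory, for illustration only.
lun_visibility = {
    "LUN0": {"NODE-A1", "NODE-A2"},             # one cluster only: acceptable
    "LUN1": {"NODE-A1", "NODE-B1"},             # seen by two clusters
    "LUN2": {"NODE-B1", "NODE-B2", "UNIX-01"},  # cluster plus a UNIX host
}
cluster_membership = {
    "NODE-A1": "ClusterA", "NODE-A2": "ClusterA",
    "NODE-B1": "ClusterB", "NODE-B2": "ClusterB",
}

conflicts = find_cross_cluster_luns(lun_visibility, cluster_membership)
for lun, owners in sorted(conflicts.items()):
    print(f"{lun} is visible to more than one owner: {owners}")
```

Any LUN that this kind of check reports is a candidate for the symptoms listed above and should be reviewed with the SAN vendor.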

More Information

The issues that are described in the "Summary" section of this article may appear to resolve themselves after Chkdsk.exe runs, but they may return several weeks later and repeat the same pattern on one or more clustered nodes. The events may be logged in any order or combination, but Event IDs 26, 50, and 51 are the most prevalent.

You may also see event warnings and error messages that are similar to the following:

Event ID: 51
Source: Disk
Description: An error was detected on device \Device\Harddisk9\DR9 during a paging operation.

Event ID: 50
Source: Disk
Description: {Lost Delayed-Write Data} The system was attempting to transfer file data from buffers to \Device\Harddisk\Volumex. The write operation failed, and only some of the data may have been written to the file.

Event ID: 26
Source: Application Popup
Description: Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file \Device\HarddiskVolumex\SQLDatabases\System\machine\LOG. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

Event ID: 9
Source: HBA Driver
Description: The device, \Device\Scsi\HBA driver, did not respond within the timeout period.

Event ID: 15
Source: Disk
Description: The device, \Device\Harddiskx\DRx, is not ready for access yet.

Event ID: 1066
Source: ClusSvc
Description: Cluster disk resource Disk x: is corrupt. Running ChkDsk /F to repair problems.

NOTE: These issues can also affect network adapters if bus contention exists throughout the system.

Event ID: 1123
Source: ClusSvc
Description: The node lost communication with cluster node 'machine' on network 'heartbeat'.

Event ID: 1122
Source: ClusSvc
Description: The node (re)established communication with cluster node 'machine' on network 'heartbeat'.
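
To see whether a node is showing this pattern, the events listed above can be tallied from an exported log. The following Python sketch assumes the log has been saved as plain text in the same "Event ID / Source / Description" layout that this article uses. That layout, the function names, and the sample text are assumptions made for illustration, not the output format of any particular export tool. Because the source name for Event ID 9 depends on the HBA driver, the sketch matches that event on its ID alone, which may over-match.

```python
from collections import Counter

# (source, event ID) pairs that this article associates with storage symptoms.
KNOWN_STORAGE = {
    ("Disk", 15), ("Disk", 50), ("Disk", 51),
    ("Application Popup", 26), ("ClusSvc", 1066),
}
# ClusSvc heartbeat events that this article associates with bus contention.
NETWORK_IDS = {1122, 1123}


def parse_events(text):
    """Parse 'Event ID / Source / Description' records into a list of dicts."""
    events, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        key = key.strip()
        if not sep:
            continue
        if key == "Event ID":
            if current:
                events.append(current)
            current = {"id": int(value.strip())}
        elif key in ("Source", "Description") and current:
            current[key.lower()] = value.strip()
    if current:
        events.append(current)
    return events


def classify(event):
    """Label an event as 'storage', 'network', or 'other'."""
    source, event_id = event.get("source", ""), event["id"]
    if source == "ClusSvc" and event_id in NETWORK_IDS:
        return "network"
    # Assumption: Event ID 9 from any source is treated as an HBA driver timeout.
    if (source, event_id) in KNOWN_STORAGE or event_id == 9:
        return "storage"
    return "other"


# Hypothetical sample in the layout used by this article.
sample = r"""Event ID: 51
Source: Disk
Description: An error was detected on device \Device\Harddisk9\DR9 during a paging operation.
Event ID: 26
Source: Application Popup
Description: Application popup: Windows - Delayed Write Failed
Event ID: 1123
Source: ClusSvc
Description: The node lost communication with cluster node 'machine' on network 'heartbeat'.
"""

summary = Counter(classify(event) for event in parse_events(sample))
print(dict(summary))
```

A node that repeatedly logs storage events together with heartbeat events is worth comparing against the other nodes that share the SAN.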

Article ID: 311081 - Last Review: Jan 7, 2008 - Revision: 1