Removing the HBA cable on a server cluster

Summary

This article describes how failover and failback operations work on server cluster resources if you remove the cable that connects the host bus adapter (HBA) to the shared disk on a server cluster. If you remove the cable to the HBA and failover does not occur, contact the hardware vendor. Additionally, if the physical disk resource does not fail back to the node after you reconnect the cable that you removed from that node, contact the hardware vendor.

Note This article applies to Fibre Channel solutions. In a small computer system interface (SCSI) solution, removing the cable may break termination and disrupt the SCSI bus.

More Information

If you remove the cable from the HBA to the shared disk (either intentionally or not), failover and failback work as expected only if the hardware supports both operations. The following items determine whether a physical disk resource can fail over if the cable is removed:
  • Switches
  • Storage controllers
  • HBAs
  • Drivers for the HBA
  • Related firmware revisions
  • Multiple-path software that may be installed
In some situations, the initial failover works, but a physical disk resource may not be able to fail back to the original node after you reattach the fibre cable.

When you remove the cable from the HBA, if the physical disk resource does not fail over automatically or if it fails over but does not fail back, contact the hardware vendor. The hardware controls the behavior of devices on a shared fibre bus (switches, storage controllers, HBAs, and other devices) when you remove the cable. The Cluster service operates accordingly if you remove and reinsert the cable to the HBA. The hardware vendor must determine and support the particular server cluster implementation and confirm that all devices (switches, storage controllers, HBAs, and other devices) handle the event properly. Typically, problems occur if HBAs, switches, and storage controllers are configured incorrectly or if you are using incorrect versions of drivers and firmware for these devices.

Microsoft Product Support Services (PSS) does not have access to all configurations of a fibre cluster (switches, storage controllers, HBAs, and other devices). The vendor who designed and implemented the fibre solution must have tested the solution. It is the hardware vendor's responsibility to verify that the physical disk will work as expected if you remove the fibre cable.

Hardware vendors use the following different types of drivers for HBAs:
  • A miniport driver: This driver contains the hardware-specific information for the fibre card and it interfaces with Scsiport.sys to communicate with Windows.
  • A port driver: This driver bypasses Scsiport.sys to communicate with Windows; the driver implements Scsiport.sys functionality internally.
Contact the hardware vendor to determine whether to use the vendor's miniport driver or the port driver.

There are two sets of connections between the server cluster nodes and the actual storage device. One connection is between the nodes and a switch; the second connection is between the switch and the storage controller. Therefore, there are two places for cable failure. The failure may occur between the HBA and the switch or between the switch and the storage controller. The following sections describe the behavior that occurs when you remove and reinsert cables for these kinds of connections.

Note For multiple-path scenarios, there are more connections that may be affected. See the "Removing and reinserting a cable in a multiple-path environment" section later in this article for more information.

Removing the cable between the HBA and the switch

If you remove a cable between the HBA and the switch (or if the cable fails), the HBA driver logs several events to Windows. The HBA miniport driver generates a "BusChangeDetected" notification, which indicates that a target device has been added to or removed from the bus. If the HBA miniport driver instead reports a "ResetDetected" generic status notification, the HBA has detected a bus reset on the SCSI bus. After this notification is generated, the HBA miniport driver is still responsible for completing any active requests. Port drivers do not issue "BusChangeDetected" notifications; instead, they process device removals internally. Port drivers typically generate "IoInvalidateDeviceRelations" notifications in Windows.

Regardless of the type of driver, if the HBA driver properly reports to Windows that a device has been removed, the clustering software detects that the disk is no longer available and fails the disks over to another node in the cluster. If a SCSI reserve that is issued by Clusdisk.sys (the cluster disk driver) fails, the failure is detected in approximately three seconds. The Cluster service also checks the physical disk resource by using the following routines:
  • LooksAlive: This routine is a cursory status check that runs every five seconds (by default). This routine checks that the disk status is not marked as "Failed," which indicates a loss of the periodic SCSI reserve.
  • IsAlive: This routine is a more complete check that occurs every 60 seconds (by default). This routine checks that the disk status is not marked as "Failed," which indicates periodic SCSI reserve failure. If the status is not marked as "Failed," FindFirstFile runs on the root of the disk to make sure the file system is still mounted and that the disk is accessible.
NOTE: If the HBA is holding on to input/output (I/O) because of a failed path to the storage controller, the LooksAlive routine may succeed and not cause a failure. In that case, the failure is detected by the next IsAlive routine, when the Cluster service checks the root of the shared disk. Contact the hardware vendor if this problem occurs.

For more information about how the Cluster service interacts with the shared bus, click the following article number to view the article in the Microsoft Knowledge Base:

309186 How the Cluster service reserves a disk and brings a disk online
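
The two routines described above can be modeled roughly as follows. This is only an illustrative sketch, not the Cluster service implementation: the DiskState class, its fields, and first_detection_time are hypothetical names, and the model assumes that the fault is already present at time zero and that both polls are aligned to that moment. The 5-second and 60-second intervals are the defaults stated in this article.

```python
from dataclasses import dataclass

LOOKS_ALIVE_INTERVAL_S = 5    # default interval stated in the article
IS_ALIVE_INTERVAL_S = 60      # default interval stated in the article


@dataclass
class DiskState:
    """Hypothetical stand-in for what the physical disk resource observes."""
    reserve_failed: bool = False   # disk marked "Failed" (periodic SCSI reserve lost)
    root_readable: bool = True     # root of the disk can still be enumerated


def looks_alive(disk: DiskState) -> bool:
    # Cursory check: only the "Failed" marker is inspected.
    return not disk.reserve_failed


def is_alive(disk: DiskState) -> bool:
    # Fuller check: the "Failed" marker first, then a probe of the disk root
    # (the article names FindFirstFile for this step).
    return not disk.reserve_failed and disk.root_readable


def first_detection_time(disk: DiskState, horizon_s: int = 120):
    """Return (seconds, routine) for the first failing poll, or None."""
    for t in range(1, horizon_s + 1):
        if t % LOOKS_ALIVE_INTERVAL_S == 0 and not looks_alive(disk):
            return t, "LooksAlive"
        if t % IS_ALIVE_INTERVAL_S == 0 and not is_alive(disk):
            return t, "IsAlive"
    return None
```

In this model, a lost reserve is caught by the first LooksAlive poll, whereas a disk whose reserve still appears healthy but whose root cannot be read is caught only by the next IsAlive poll. That is the situation described in the preceding NOTE.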

Removing the cable between the switch and the storage controller

If you disconnect or remove the cable from the switch to the storage controller, the HBA receives a Registered State Change Notification (RSCN) from the switch. When the HBA receives the RSCN, the HBA driver notifies Windows of any changes it detects. Detection is a complex operation with many external dependencies that the hardware vendor must verify. If the switch does not issue an RSCN or if the HBA driver does not notify Windows that the device has been removed, the LooksAlive routine or the IsAlive routine fails on the resource, and failover to the other node occurs (as described earlier).
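
The two ways in which the loss of the device can come to the attention of the Cluster service can be summarized in a small decision function. This is a simplified sketch under stated assumptions: DriverType, NOTIFICATION, and detection_path are hypothetical names used only for illustration, and the mapping of driver type to notification follows the typical behavior described in this article, which a particular vendor driver may not match.

```python
from enum import Enum


class DriverType(Enum):
    MINIPORT = "miniport"
    PORT = "port"


# Notification that each driver type typically raises, per this article.
NOTIFICATION = {
    DriverType.MINIPORT: "BusChangeDetected",
    DriverType.PORT: "IoInvalidateDeviceRelations",
}


def detection_path(driver: DriverType,
                   switch_sends_rscn: bool,
                   driver_notifies_windows: bool) -> str:
    """Describe how the device removal is expected to be detected."""
    if switch_sends_rscn and driver_notifies_windows:
        # The driver reports the removal to Windows directly.
        return NOTIFICATION[driver] + " notification"
    # Otherwise detection falls back to the periodic health checks.
    return "LooksAlive/IsAlive polling"
```

For example, detection_path(DriverType.MINIPORT, True, False) falls through to the polling path, because the driver did not pass the change on to Windows.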

Reinserting the cable between the HBA and the switch

The Cluster service allows you to reconnect or replace a cable between the HBA and the switch, and then allows for the node to take ownership of the physical disk resource. The HBA driver must consider several issues for a Plug and Play rescan to occur. The HBA miniport driver must issue a "BusChangeDetected" notification when the cable is reconnected (an HBA port driver issues an "IoInvalidateDeviceRelations" notification) so that Windows is notified that a change has been made to the shared bus. When Windows receives this notification, the disks are redetected. If the HBA driver does not properly notify Windows of the device insertion, you may be able to manually initiate the discovery of the disk. To do so, make Windows rescan the storage system by using the Disk Management utility. If you do so, Windows is forced to detect devices on the shared bus. If issues occur when you rescan the devices after the cable has been reinserted, contact the hardware manufacturer. Typically, issues occur with this functionality if the HBAs and the switches are configured incorrectly or if you are using incorrect drivers or firmware.
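
This article describes the manual rescan in terms of the Disk Management utility. As an assumption that goes beyond the article, the same rescan can usually be requested from a script through the diskpart command-line tool, whose "rescan" command looks for new disks. The following is a minimal sketch only: the function names are hypothetical, diskpart must be run with administrative rights on the Windows node, and you should confirm with the hardware vendor that a scripted rescan is appropriate for your configuration.

```python
import subprocess
import tempfile
from pathlib import Path


def build_rescan_command(script_dir: str) -> list:
    """Write a one-line diskpart script and return the command that runs it."""
    script = Path(script_dir) / "rescan.txt"
    script.write_text("rescan\n", encoding="ascii")
    # "diskpart /s <file>" runs the commands that are listed in the file.
    return ["diskpart", "/s", str(script)]


def rescan_disks() -> int:
    """Run the rescan on a Windows node and return diskpart's exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(build_rescan_command(tmp), check=False)
        return result.returncode
```

Calling rescan_disks() on the node after the cable has been reconnected requests the rescan. A nonzero return value suggests that diskpart reported a problem, in which case the guidance in this section to contact the hardware manufacturer applies.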

Reinserting the cable between the switch and the storage controller

When you reinsert the cable between the switch and the storage controller, the switch issues an RSCN so that the HBA miniport driver can notify Windows by using a "BusChangeDetected" notification (an HBA port driver issues an "IoInvalidateDeviceRelations" notification), which indicates that the devices are now available. If the driver does not notify Windows that the devices are available, you may be able to rescan the storage system by using the Disk Management utility. If you do so, Windows is forced to detect devices on the shared bus. If you experience issues rescanning the devices after you reinsert the cable, contact the hardware manufacturer.

Removing and reinserting a cable in a multiple-path environment

You can use multiple-path (also known as MPIO) software to add redundancy to the shared bus to help maintain a high level of availability. Some hardware vendors offer multiple-path software that enables you to use multiple HBAs to reach the shared disk, and you can use this software on a server cluster. If you remove the cable from one of the HBAs, all data can be rerouted through another HBA. However, if any problems seem to be related to multiple-path software, Microsoft PSS requires the hardware vendor to work on the issue. In extreme cases, Microsoft PSS may ask you to disable the multiple-path software temporarily to see if it is causing the problem.
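
The rerouting idea can be illustrated with a toy path selector. This sketch does not represent how any vendor's multiple-path software works; select_path and the HBA names are hypothetical, and real products apply their own policies for path health and load balancing.

```python
from typing import Dict, Optional


def select_path(path_up: Dict[str, bool], preferred: str) -> Optional[str]:
    """Return the HBA path to use, or None if no path is available.

    path_up maps a (hypothetical) HBA name to whether its cable path is up.
    """
    if path_up.get(preferred):
        return preferred
    # Preferred path is down: reroute through any other path that is up.
    for name, is_up in path_up.items():
        if is_up:
            return name
    return None
```

With two HBAs, pulling the cable from the preferred one makes select_path return the other HBA. Only when every path is down does the function return None, which corresponds to the single-path cable removal cases described in the earlier sections.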

WARNING You may experience problems if you disable the multiple-path software before you unplug all but one path to the storage. Contact the hardware manufacturer and see the following Microsoft Knowledge Base article for details:

293778 Multiple-path software may cause disk signature to change

Program behavior on a server cluster can vary with the program and the hardware platform. If you remove the cable when the program is in the middle of writing data, that data may be lost. Data corruption can occur if the cache on the HBA or the storage controller is flushed back to the disk when the disk is presented again to the node (when you reconnect the cable). This issue may occur if the disk was failed over to another node that mounted the disk and the program resumed writing data. Contact the hardware vendor for information about how to configure the hardware to keep this issue from occurring.


For more information about how to use the disk on the shared bus, click the following article numbers to view the articles in the Microsoft Knowledge Base:

309395 The Microsoft support policy for server clusters, the Hardware Compatibility List, and the Windows Server Catalog

318534 Best practices for drive-letter assignments on a server cluster

303431 Explanation of why server clusters do not verify that resources will work properly on all nodes

Properties

Article ID: 294173 – Last Review: November 12, 2009 – Revision: 1
