Network failure detection and recovery in Windows Server 2003 Clusters

Article ID: 286342
This article was previously published under Q286342

SUMMARY

A server cluster in Windows Clustering handles the loss of private, internal cluster (heartbeat) communication differently in Microsoft Windows Server 2003 than in Microsoft Windows 2000. In Windows 2000, if heartbeat communication between the nodes in a cluster was completely lost, the node that owned the Quorum resource took ownership of all resources. This article compares how a Windows 2000 cluster and a Windows Server 2003 cluster handle this situation.

MORE INFORMATION

Windows 2000

When a cluster node loses connection to all networks that are set for intra-cluster communication, the Cluster service must use the Quorum disk resource to arbitrate and determine which node should remain up and functioning because the nodes have no other way of communicating. The node that receives the ownership of the Quorum resource then brings all resources online, and the Cluster service takes all other nodes in the cluster offline.

Example

Node A owns the Quorum resource and loses all of its network connections:

When all of node A's network interfaces are disconnected, no LAN remains for private cluster communication. Node A can no longer detect whether node B is running, and node B can no longer detect whether node A is running. The two nodes arbitrate for the Quorum resource, and node A successfully defends its ownership. Node B removes itself from the cluster, and all of its resources fail over to node A.

Note: This type of double failure is extremely rare.

If node A no longer has any viable public network interfaces, it cannot receive service requests from clients, yet it owns all the resources, which eventually transition to a Failed state. At this point, no resources are available to external clients. Meanwhile, node B may have a perfectly viable public network interface, but it has been excluded from the cluster because it has no private network connectivity to the node that owns the Quorum resource.
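The outcome described above can be illustrated with a small Python model. This is only a sketch of the behavior as this article describes it, not the Cluster service's actual arbitration logic. The names Node, arbitrate_windows_2000, and clients_can_reach are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    owns_quorum: bool
    has_network: bool  # True if at least one network interface is still connected


def arbitrate_windows_2000(nodes: List[Node]) -> str:
    """Return the name of the node that stays in the cluster.

    Simplified assumption: after a total heartbeat loss, the current
    Quorum owner always defends its ownership, whether or not it still
    has a working network interface.
    """
    for node in nodes:
        if node.owns_quorum:
            return node.name
    raise ValueError("no node owns the Quorum resource")


def clients_can_reach(nodes: List[Node], survivor_name: str) -> bool:
    """Clients can be served only if the surviving node still has a network."""
    survivor = next(n for n in nodes if n.name == survivor_name)
    return survivor.has_network


# The scenario from the "Example" section: node A owns the Quorum
# resource and has lost every network connection.
nodes = [
    Node("A", owns_quorum=True, has_network=False),
    Node("B", owns_quorum=False, has_network=True),
]
survivor = arbitrate_windows_2000(nodes)
print(survivor, clients_can_reach(nodes, survivor))
```

In this model the isolated node A keeps every resource even though no client can reach it, which is the problem that the Windows Server 2003 behavior addresses.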

Windows Server 2003

Before arbitrating for the Quorum resource, a node checks whether at least one of its network interfaces that are enabled for cluster use is connected to a network. In this scenario, this would be any network enabled for client access (All Communications or Client Access Only). If the node finds no viable interfaces, it voluntarily drops out of Quorum resource arbitration and thereby removes itself from the cluster.

In the scenario from the "Example" section, node A determines that both of its networks are unavailable, and it declines to arbitrate. If the Quorum device responds and the node A reservation terminates quickly, node B wins the Quorum arbitration, and all resources switch to node B. Node B then makes the cluster resources available to clients.
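The pre-arbitration check can be sketched in Python as follows. This is a simplified model of the behavior described in this section, not the Cluster service's implementation. The names Node, should_arbitrate, and arbitrate_windows_2003 are hypothetical, and the tie-breaking rule among several challengers is an assumption made only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    owns_quorum: bool
    connected_interfaces: int  # cluster-enabled interfaces that are connected


def should_arbitrate(node: Node) -> bool:
    """A node takes part in arbitration only if at least one of its
    cluster-enabled interfaces is connected to a network."""
    return node.connected_interfaces > 0


def arbitrate_windows_2003(nodes: List[Node]) -> Optional[str]:
    """Return the name of the node that wins the Quorum resource,
    or None if every node declines to arbitrate."""
    candidates = [n for n in nodes if should_arbitrate(n)]
    if not candidates:
        return None
    # A current owner that still participates defends its ownership.
    for node in candidates:
        if node.owns_quorum:
            return node.name
    # Assumption for this sketch: otherwise the first remaining candidate wins.
    return candidates[0].name


# The scenario from the "Example" section: node A owns the Quorum
# resource but both of its networks are unavailable.
nodes = [
    Node("A", owns_quorum=True, connected_interfaces=0),
    Node("B", owns_quorum=False, connected_interfaces=1),
]
winner = arbitrate_windows_2003(nodes)
print(winner)
```

Because node A is filtered out before arbitration begins, node B wins in this model, and the resources end up on the node that clients can still reach.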

Node A cannot rejoin the cluster until it re-establishes network connectivity with node B and you restart the Cluster service on node A. For additional information, click the following article number to view the article in the Microsoft Knowledge Base:
242600 Network failure detection and recovery in a two-node Windows Server 2000 cluster

Properties

Article ID: 286342 - Last Review: March 1, 2007 - Revision: 7.3
APPLIES TO
  • Microsoft Windows Server 2003, Enterprise Edition (32-bit x86)
  • Microsoft Windows Server 2003, Datacenter Edition (32-bit x86)
Keywords: 
kbenv kbinfo kbnetwork KB286342
