The default group failover threshold value in the Windows Server 2008 Failover Cluster Management snap-in is incorrect
For example, in a five-node cluster, any highly available application or service resource group has a default failover threshold of five. By default, the Period (hours) setting is six hours. According to these displayed values, when a highly available service or application group experiences a failure of one or more resources in the group, the group tries to fail over to another node in the cluster up to five times in a six-hour period. After the fifth failover attempt, the service or application group remains in a "Failed" state.
In practice, a total of n - 1 failovers occurs in the six-hour period, where n is the number of nodes in the cluster. In a five-node cluster, therefore, four failovers occur. The failover process works correctly. However, the number that appears in the Failover Cluster Management snap-in is incorrect. In this example, the snap-in shows 5.
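The gap between the displayed value and the observed behavior can be expressed as simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical and are not part of any cluster API. It only restates the relationship described above (the snap-in shows n, while n - 1 failovers occur).

```python
def displayed_default_threshold(node_count: int) -> int:
    """Default value the snap-in shows for the failover threshold (n)."""
    return node_count


def observed_failovers(node_count: int) -> int:
    """Failovers that the article reports actually occur in the period (n - 1)."""
    return node_count - 1


# Compare the displayed value with the observed behavior for a few cluster sizes.
for n in (2, 3, 5):
    print(
        f"{n} nodes: snap-in shows {displayed_default_threshold(n)}, "
        f"observed failovers {observed_failovers(n)}"
    )
```

For the five-node cluster in the example, the sketch reports a displayed value of 5 and 4 observed failovers.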
Steps to reproduce the issue
1. In the Failover Cluster Management snap-in navigation pane, expand one of the managed clusters that has a highly available application or service configured.
2. Expand the Services and Applications category.
3. Right-click one of the groups, and then click Properties.
4. Click the Failover tab, and then view the Maximum failures in the specified period setting.
   The number that you see is equal to the number of nodes in the cluster.
5. To simulate the behavior, right-click a resource in the group, and then click Simulate Failure of this Resource under More Actions.
   The default restart behavior for a cluster resource is to try to restart the resource on the original owning node. Therefore, the failure that you simulated causes the resource to fail and then to come back online on the owning node.
6. Simulate a failure again. This causes the group to go offline, and then to move to another node in the cluster.
7. Repeat step 5 and step 6 until the resource remains in a "Failed" state. Make sure that you count the number of times that the group comes online on other nodes in the cluster. The final count is equal to n - 1, where n is the number of nodes in the cluster.
8. Select another service or application to configure.
9. Increase the Maximum failures in the specified period setting by one.
10. Right-click a resource in the group, and then click Simulate Failure of this Resource under More Actions.
11. Simulate a failure again.
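
The counting in the first part of the procedure (one simulated failure that restarts the resource on the owning node, followed by one that moves the group) can be modeled with a short Python sketch. This is a toy model under stated assumptions, not the cluster service's actual algorithm: the function name, the round-robin choice of the next node, and the event strings are all hypothetical. It only reproduces the count that the article reports (n - 1 moves before the group stays in a "Failed" state).

```python
def simulate_reproduction(node_count: int):
    """Toy model of repeated simulated failures.

    Assumptions (not taken from the cluster service itself):
    - the first failure on a node restarts the resource on that node;
    - the next failure moves the group to the next node (round robin);
    - after n - 1 moves, the next such failure leaves the group Failed.
    """
    effective_threshold = node_count - 1  # observed behavior per the article
    failovers = 0
    owner = 0
    restarted_on_owner = False
    events = []

    while True:
        if not restarted_on_owner:
            # First simulated failure: restart on the current owner.
            restarted_on_owner = True
            events.append(f"restart on node {owner}")
        elif failovers < effective_threshold:
            # Second simulated failure: the group moves to another node.
            owner = (owner + 1) % node_count
            failovers += 1
            restarted_on_owner = False
            events.append(f"failover to node {owner}")
        else:
            # Threshold reached: the group stays in the Failed state.
            events.append("group Failed")
            return failovers, events


failovers, events = simulate_reproduction(5)
print(f"failovers counted: {failovers}")
for event in events:
    print(event)
```

For a five-node cluster, the model counts four failovers before the group remains in the "Failed" state, which matches the n - 1 count described in the procedure.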
Article ID: 950804 - Last Review: 09/11/2010 10:12:00 - Revision: 2.1