In a cluster where there is an active Generic Script resource, the cluster may become unresponsive. Cluster Administrator and Cluster.exe appear to stop responding (hang). The cluster log shows blocked threads inside a Generic Script resource. For example:
000007c4.000007e4::2002/12/12-19:17:03.781 INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwQuoBlockingResources for resource f37f58fb-03ff-44b3-a4d7-086b0838d73d
The event log contains a message similar to either of the following:
A Generic Script resource script can cause the whole cluster to stop responding if any of the following conditions exists:
- The Generic Script resource script contains an infinite loop (and therefore never exits).
- The script calls certain cluster application programming interfaces (APIs). Some cluster APIs must not be called from within a resource DLL or resource script because they can cause a cluster-wide deadlock. The script may call these APIs directly, or it may start Cluster.exe as one of its steps (which may in turn call cluster APIs that must be avoided). For information about APIs that should not be called from a resource DLL or script, see “Function Calls to Avoid in Resource DLLs” in the Microsoft Platform SDK (PSDK).
- An action the Generic Script resource script is performing takes longer than the pending timeout value.
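The first and third conditions can both be avoided by giving every wait in the script an explicit time budget. The following is a minimal sketch, not a definitive implementation: it is written in a JScript-compatible subset of JavaScript, and `waitUntil`, `ONLINE_BUDGET_MS`, `serviceIsReady`, `shortPause`, and `currentTimeMs` are hypothetical names introduced only for illustration. It assumes that the script exposes an `Online` entry point that returns true on success and that the chosen budget is well below the resource's pending timeout.

```javascript
// Hypothetical helper, not part of any cluster API: polls `check` until it
// returns true or the time budget runs out, so the loop always terminates.
function waitUntil(check, budgetMs, now, pause) {
    var deadline = now() + budgetMs;
    while (now() < deadline) {
        if (check()) {
            return true;   // condition met within the budget
        }
        pause();           // caller-supplied delay between polls (assumption)
    }
    return false;          // budget exhausted: give up instead of looping forever
}

// Illustrative value only; assumed to be well below the pending timeout.
var ONLINE_BUDGET_MS = 60000;

function serviceIsReady() {
    return true; // placeholder: replace with the real readiness check
}

function shortPause() {
    // placeholder: how a delay is implemented depends on the script host
}

function currentTimeMs() {
    return new Date().getTime();
}

// Sketch of an Online entry point that fails fast rather than hanging.
function Online() {
    return waitUntil(serviceIsReady, ONLINE_BUDGET_MS, currentTimeMs, shortPause);
}
```

Passing the clock and the pause function in as parameters keeps the loop independent of the script host and makes the timeout path easy to exercise with a fake clock. The sketch deliberately makes no cluster API calls and does not start Cluster.exe.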
The Cluster Resource Monitor will not perform any additional operations on a Generic Script resource after any entry point has exceeded the pending timeout value, but the problematic thread will continue to run. To resolve the problem, disable the resource (that is, prevent it from coming online), stop the Cluster service (this terminates the problematic thread), fix the script problem, and then restart the Cluster service. Depending on the cause of this problem, you may want to increase the online or offline pending timeout value for this resource. For step-by-step instructions, see the “Recover and Restart the Cluster Service” section later in this article.
Changing Pending Timeout Values
Any cluster resource operation should complete execution well inside the range of the pending timeout. For this reason, do not change the timeout value without a thorough understanding of why your script entry point exceeds this period of time. Also, consider all the implications of increasing this value because the cluster will be unresponsive until the timeout value is exceeded.
Recover and Restart the Cluster Service
- Disable the resource (in this example, named MyScript) by typing the following command: cluster resource "MyScript" /properties PersistentState=0
- Stop the Cluster service on the node that currently owns this resource’s group by typing the following command in a console window: net stop clussvc
- Fix any problem that you identify in the script that causes it to stop responding, loop, or exceed the pending timeout value. You may determine that the appropriate thing to do is to increase the pending timeout value, but make sure that you carefully consider the implications of doing so.
- Restart the Cluster service by typing the following command: net start clussvc
- Bring the resource back online manually by using Cluster Administrator or Cluster.exe. To do so, type the following command: cluster resource "MyScript" /online
  Note that bringing the resource back online automatically sets PersistentState to 1, so there is no need for an additional command to change the value from 0.
Microsoft has confirmed that this is a bug in the Microsoft products that are listed at the beginning of this article.
Article ID: 811685 - Last Review: Jan 7, 2008 - Revision: 1