Troubleshooting gray agent states in System Center Operations Manager

Summary

This article describes how to troubleshoot problems in which an agent, a management server, or a gateway is unavailable or "grayed out" in System Center Operations Manager.

More Information

An agent, a management server, or a gateway can have one of the following states, as indicated by the color of the agent name and icon in the Monitoring pane.
  • Healthy (green check mark): The agent or management server is running normally.
  • Critical (red check mark): There is a problem on the agent or management server.
  • Unknown (gray agent name, gray check mark): The health service watcher on the root management server (RMS) that is watching the health service on the monitored computer is no longer receiving heartbeats from the agent. The health service watcher had been receiving heartbeats previously, and the health service was reported as healthy. This also means that the management servers are no longer receiving any information from the agent. This state may occur because the computer that is running the agent is not running or because there are connectivity issues. You can find more information in the Health Service Watcher view.
  • Unknown (green circle, no check mark): The status of the discovered item is unknown. There is no monitor available for this specific discovered item.

Causes of a gray state

An agent, a management server, or a gateway may become unavailable for any of the following reasons:
  • Heartbeat failure
  • Invalid configuration
  • System workflows failure
  • OpsMgr Database or data warehouse performance issues
  • RMS or primary MS or gateway performance issues
  • Network or authentication issues
  • Health service issues (service is not running)

Issue scope

Before you begin to troubleshoot the agent "grayed out" issue, you should first understand the Operations Manager topology, and then define the scope of the issue. The following questions may help you to define the scope of the issue:
  • How many agents are affected?
  • Are the agents experiencing the issue in the same network segment?
  • Do the agents report to the same management server?
  • How often do the agents enter and remain in a gray state?
  • How do you typically recover from this situation (for example, restart the agent health service, clear the cache, rely upon automatic recovery)?
  • Are the Heartbeat failure alerts generated for these agents?
  • Does this issue occur during a specific time of the day?
  • Does this issue persist if you fail over these agents to another management server or gateway?
  • When did this problem start?
  • Were any changes made to the agents, the management servers, or the gateway or management group?
  • Are the affected agents Windows clustered systems?
  • Is the Health Service State folder excluded from antivirus scanning?
  • In which environment is this occurring (OpsMgr 2007 SP1, OpsMgr 2007 R2, or OpsMgr 2012)?

Troubleshooting strategy

Your troubleshooting strategy is dictated by which component is unavailable, where that component falls within the topology, and how widespread the problem is.

Issue scenarios

Consider the following scenarios.

Scenario 1

Only a few agents are affected by the issue. These agents report to different management servers. Agents remain unavailable on a regular basis. Although you are able to clear the agent cache to help resolve the issue temporarily, the problem recurs after a few days.

Resolution 1

To resolve the issue in this scenario, follow these steps:
  1. Apply the appropriate hotfix to the affected operating systems.
  2. Exclude the Agent cache from antivirus scanning.
  3. Stop the Health service.
  4. Clear the Agent cache.
  5. Start the Health service.
Note We recommend that you proactively apply the hotfixes that are referenced in step 1 (and listed below) to all monitored systems. This includes the management servers. Additionally, exclude the agent or management server cache from antivirus scanning to prevent this issue from spreading to other systems. A scripted sketch of steps 3 through 5 appears after the reference list below.

For more information about these procedures, click the following article numbers to view the articles in the Microsoft Knowledge Base:
982018 An update that improves the compatibility of Windows 7 and Windows Server 2008 R2 with Advanced Format Disks is available

2553708 A hotfix rollup that improves Windows Vista and Windows Server 2008 compatibility with Advanced Format disks

981263 Management servers or assigned agents unexpectedly appear as unavailable in the Operations Manager console in Windows Server 2003 or Windows Server 2008

975931 Recommendations for antivirus exclusions that relate to MOM 2005 and to Operations Manager 2007
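
If you must repeat steps 3 through 5 on more than a few agents, the steps can be scripted. The following Python sketch is a minimal example only. It assumes that it runs locally on the agent with administrative rights, that the agent uses the default OpsMgr 2007 installation path, and that the service name is HealthService; adjust these assumptions for your environment.

    # Sketch: stop the Health service, clear the agent cache, and restart the service (steps 3-5).
    # Assumptions: run elevated on the agent; default OpsMgr 2007 agent installation path; service
    # name "HealthService". Adjust the path and service name for your environment.
    import shutil
    import subprocess
    from pathlib import Path

    # Default agent cache location; verify this path on your systems before you use it.
    CACHE_DIR = Path(r"C:\Program Files\System Center Operations Manager 2007\Health Service State")

    subprocess.run(["net", "stop", "HealthService"], check=True)    # Step 3: stop the Health service
    if CACHE_DIR.exists():
        shutil.rmtree(CACHE_DIR)                                    # Step 4: clear the agent cache
    subprocess.run(["net", "start", "HealthService"], check=True)   # Step 5: start the Health service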

Scenario 2

Only a few agents are affected by the issue. These agents report to different management servers. The agents remain unavailable constantly. Although you can clear the agent cache, this does not resolve the issue.

Resolution 2

To resolve the issue in this scenario, follow these steps:
  1. Determine whether the Health Service is turned on and is currently running on the management server or gateway. If the Health Service has stopped responding, generate an ADPlus dump of the service in hang mode to help determine the cause of the problem. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
    286350 How to use ADPlus.vbs to troubleshoot "hangs" and "crashes"
  2. Examine the Operations Manager Event log on the agent to locate any of the following events (a scripted way to query for these event IDs appears after this procedure):

    Event ID: 1102
    Event Source: HealthService
    Event Description:
    Rule/Monitor "%4" running for instance "%3" with id:"%2" cannot be initialized and will not be loaded. Management group "%1"

    Event ID: 1103
    Event Source: HealthService
    Event Description:
    Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them reached the failure limit that prevents automatic reload. Management group "%1". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

    Event ID: 1104
    Event Source: HealthService
    Event Description:
    RunAs profile in workflow "%4", running for instance "%3" with id:"%2" cannot be resolved. Workflow will not be loaded. Management group "%1"

    Event ID: 1105
    Event Source: HealthService
    Event Description:
    Type mismatch for RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

    Event ID: 1106
    Event Source: HealthService
    Event Description:
    Cannot access plain text RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

    Event ID: 1107
    Event Source: HealthService
    Event Description:
    Account for RunAs profile in workflow "%4", running for instance "%3" with id:"%2" is not defined. Workflow will not be loaded. Please associate an account with the profile. Management group "%1"

    Event ID: 1108
    Event Source: HealthService
    Event Description:
    An Account specified in the Run As Profile "%7" cannot be resolved. Specifically, the account is used in the Secure Reference Override "%6". %n%n This condition may have occurred because the Account is not configured to be distributed to this computer. To resolve this problem, you need to open the Run As Profile specified below, locate the Account entry as specified by its SSID, and either choose to distribute the Account to this computer if appropriate, or change the setting in the Profile so that the target object does not use the specified Account. %n%nManagement Group: %1 %nRun As Profile: %7 %nSecureReferenceOverride name: %6 %nSecureReferenceOverride ID: %4 %nObject name: %3 %nObject ID: %2 %nAccount SSID: %5

    Event ID: 4000
    Event Source: HealthService
    Event Description:
    A monitoring host is unresponsive or has crashed. The status code for the host failure was %1.

    Event ID: 21016
    Event Source: OpsMgr Connector
    Event Description:
    OpsMgr was unable to set up a communications channel to %1 and there are no failover hosts. Communication will resume when %1 is available and communication from this computer is allowed.

    Event ID: 21006
    Event Source: OpsMgr Connector
    Event Description:
    The OpsMgr Connector could not connect to %1:%2. The error code is %3(%4). Please verify there is network connectivity, the server is running and has registered its listening port, and there are no firewalls blocking traffic to the destination.

    Event ID: 20070
    Event Source: OpsMgr Connector
    Event Description:
    The OpsMgr Connector connected to %1, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration. Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

    Event ID: 20051
    Event Source: OpsMgr Connector
    Event Description:
    The specified certificate could not be loaded because the certificate is not currently valid. Verify that the system time is correct and re-issue the certificate if necessary%n Certificate Valid Start Time : %1%n Certificate Valid End Time : %2

    Event Source: ESE
    Event Category: Transaction Manager
    Event ID: 623
    Description: HealthService (<PID>) The version store for instance <instance> ("<name>") has reached its maximum size of <value>Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back. Possible long-running transaction:
    SessionId: <value>
    Session-context: <value>
    Session-context ThreadId: <value>.
    Cleanup:<value>
  3. If you locate the following specific events, follow these guidelines:
    • Events 1102 and 1103: These events indicate that some of the workflows failed to load. If these are the core system workflows, these events could cause the issue. In this case, focus on resolving these events.
    • Events 1104, 1105, 1106, 1107, and 1108: These events may cause Events 1102 and 1103 to occur. Typically, this occurs because of misconfigured "Run as" accounts. In OpsMgr R2, this typically occurs because the "Run as" accounts are configured to be used with the wrong class or are not configured to be distributed to the agent.
    • Event 4000: This event indicates that the Monitoringhost.exe process crashed. If this problem is caused by a DLL mismatch or by missing registry keys, you may be able to resolve the problem by reinstalling the agent. If the problem persists, try to resolve it by using the following methods:
      • Run a Process Monitor capture up to the point at which the process crashes. For more information, visit the following Microsoft Sysinternals website: Process Monitor v2.96
      • Generate an Adplus dump in crash mode. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
        286350 How to use ADPlus.vbs to troubleshoot "hangs" and "crashes"
      • If the agent is monitoring network devices, and the agent is running on Windows Server 2003, apply the hotfix that is described in KB 982501. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
        982501 The monitoring of SNMP devices may stop intermittently in System Center Operations Manager or in System Center Essentials
    • Event ID 21006: This event indicates that communication issues exist between the agent and the management server. If the agent uses a certificate for mutual authentication, verify that the certificate is not expired and that the agent is using the correct certificate. If Kerberos is being used, verify that the agent can communicate with Active Directory. If authentication is working correctly, the packets from the agent may not be reaching the management server or gateway. Try to establish a simple Telnet connection to port 5723 from the agent to the management server (a connection-test sketch appears after this procedure). Additionally, run a simultaneous network trace between the agent and the management server while you reproduce the communication failures. This can help you determine whether the packets are reaching the management server, and whether any device between the two components is trying to optimize the traffic or is dropping packets. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
      812953 How to use Network Monitor to capture network traffic
    • Event ID 623: This event typically occurs in a large Operations Manager environment in which a management server or an agent computer manages many workflows. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
      975057 One or more management servers and their managed devices are dimmed in the Operations Manager Console of Operations Manager
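
As mentioned in step 2, you can also query the Operations Manager event log for these event IDs from the command line. The following Python sketch wraps the built-in wevtutil utility; the set of event IDs and the number of returned events are examples only, so adjust them as needed.

    # Sketch: list recent Operations Manager events that match the event IDs discussed in step 2.
    # Assumptions: run on the affected agent; wevtutil.exe is available (Windows Server 2008 or later).
    import subprocess

    EVENT_IDS = [1102, 1103, 1104, 1105, 1106, 1107, 1108, 4000, 20051, 20070, 21006, 21016]

    # Build an XPath filter such as *[System[(EventID=1102 or EventID=1103 or ...)]]
    id_filter = " or ".join(f"EventID={event_id}" for event_id in EVENT_IDS)
    query = f"*[System[({id_filter})]]"

    # Return the 25 most recent matching events from the Operations Manager log as text.
    result = subprocess.run(
        ["wevtutil", "qe", "Operations Manager", f"/q:{query}", "/f:text", "/c:25", "/rd:true"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout or "No matching events found.")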
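
For Event ID 21006, the Telnet test that is mentioned above can be replaced by a short script if the Telnet client is not installed. The following Python sketch attempts a plain TCP connection to port 5723 on the management server; the server name shown is a placeholder.

    # Sketch: test whether the agent can open a TCP connection to the management server on port 5723.
    # Assumption: "MGMTSERVER01.contoso.com" is a placeholder; substitute your management server or gateway.
    import socket

    SERVER = "MGMTSERVER01.contoso.com"
    PORT = 5723   # Default OpsMgr agent-to-management server communication port

    try:
        with socket.create_connection((SERVER, PORT), timeout=10):
            print(f"TCP connection to {SERVER}:{PORT} succeeded.")
    except OSError as error:
        print(f"TCP connection to {SERVER}:{PORT} failed: {error}")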

Scenario 3

All the agents that report to a particular management server or gateway are unavailable.

Resolution 3

To resolve the issue in this scenario, follow these steps:
  1. Try to determine what kind of workloads the management server or gateway is monitoring. Such workloads might include network devices, cross-platform agents, synthetic transactions, Windows agents, and agentless computers.
  2. Determine whether the Health Service is running on the management server or gateway.
  3. Determine whether the management server is running in maintenance mode. If it is necessary, remove the server from maintenance mode.
  4. Examine the Operations Manager Event log on the agent for any of the events that are listed in Scenario 2. In the case of Event ID 21006, follow the same guidelines that are mentioned in Scenario 2. Additionally, in this case the event indicates that the management server or gateway cannot communicate with its parent server. In Operations Manager 2007 and R2, the parent server of a management server is the root management server (RMS); the parent server of a gateway may be any management server. (Refer to step 3 in the Scenario 2 resolution.)
  5. If the health service is monitoring network devices, and the management server is running on a Windows Server 2003 system, you may also want to apply the following KB 982501 hotfix. For more information, click the following article number to view the article in the Microsoft Knowledge Base:
    982501 The monitoring of SNMP devices may stop intermittently in System Center Operations Manager or in System Center Essentials
  6. Examine the Operations Manager Event log for the following events. These events typically indicate that performance issues exist on the management server or on the Microsoft SQL Server instance that is hosting the OperationsManager or OperationsManagerDW database (a sketch that counts these events appears after this procedure):

    Event ID: 2115
    Event Source: HealthService
    Event Description:
    A Bind Data Source in Management Group %1 has posted items to the workflow, but has not received a response in %5 seconds. This indicates a performance or functional problem with the workflow.%n Workflow Id : %2%n Instance : %3%n Instance Id : %4%n

    Event ID: 5300
    Event Source: HealthService
    Event Description:
    Local health service is not healthy. Entity state change flow is stalled with pending acknowledgement. %n%nManagement Group: %2 %nManagement Group ID: %1

    Event ID: 4506
    Event Source: HealthService
    Event Description:
    Data was dropped due to too much outstanding data in rule "%2" running for instance "%3" with id:"%4" in management group "%1".

    Event ID: 31551
    Event Source: Health Service Modules
    Event Description:
    Failed to store data in the Data Warehouse. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

    Event ID: 31552
    Event Source: Health Service Modules
    Event Description:
    Failed to store data in the Data Warehouse.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

    Event ID: 31553
    Event Source: Health Service Modules
    Event Description:
    Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

    Event ID: 31557
    Event Source: Health Service Modules
    Event Description:
    Failed to obtain synchronization process state information from Data Warehouse database. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1
  7. Event ID 3155X may also be logged because of incorrect "Run as" account configurations or missing permissions for the "Run as" accounts. For more information, see the following Microsoft TechNet blog, which includes a Microsoft Office Excel worksheet that lists the permissions for various accounts that are used by OpsMgr.
Note To troubleshoot management server or gateway performance and SQL Server performance, see the "Resolutions" section for the next scenarios.
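
For step 6, it can be useful to know how frequently these events are being logged before you examine SQL Server performance. The following Python sketch counts occurrences of the listed event IDs during the last hour by using the built-in wevtutil utility; the one-hour window and the event ID list are examples that you can adjust.

    # Sketch: count how many of the performance-related events from step 6 were logged in the last hour.
    # Assumptions: run on the affected management server; wevtutil.exe is available; the one-hour
    # window and the event ID list are examples only.
    import subprocess

    EVENT_IDS = [2115, 4506, 5300, 31551, 31552, 31553, 31557]
    ONE_HOUR_MS = 3600000   # TimeCreated window for the XPath filter, in milliseconds

    for event_id in EVENT_IDS:
        query = (f"*[System[(EventID={event_id}) and "
                 f"TimeCreated[timediff(@SystemTime) <= {ONE_HOUR_MS}]]]")
        result = subprocess.run(
            ["wevtutil", "qe", "Operations Manager", f"/q:{query}", "/f:xml", "/c:1000"],
            capture_output=True, text=True,
        )
        count = result.stdout.count("</Event>")
        print(f"Event ID {event_id}: {count} occurrence(s) in the last hour")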

Scenarios 4 and 5

Scenario 4
All the agents that report to a specific management server alternate intermittently between healthy and gray states.
Scenario 5
All the agents in the environment alternate intermittently between healthy and gray states.

Resolutions 4 and 5

To resolve the issue in either of these scenarios, first determine the cause of the issue. Common causes of temporary server unavailability include the following:
  • The parent server of the agents is temporarily offline.
  • Agents are flooding the management server with operational data, such as alerts, states, discoveries, and so on. This may cause an increased use of system resources on the OpsMgr database and on the OpsMgr servers.
  • Network outages caused a temporary communication failure between the parent server and the agents.
  • Management pack (MP) changes occurred in the OpsMgr console. These changes require an OpsMgr configuration update and an MP redistribution to the agents. If the changes affect a large agent base, this may cause increased use of system resources on the OpsMgr database and the OpsMgr servers.
The key to troubleshooting in these scenarios is to understand the duration of the server unavailability and the time of day during which it occurred. This will help you to quickly narrow the scope of the problem.

Troubleshooting management server and gateway performance

For OpsMgr 2007 and R2 - Root management server (RMS)

Configuration update bursts are caused by management pack imports and by discovery data. When system performance is slow, the most likely bottlenecks are, first, the CPU and, second, the OpsMgr installation disk I/O.

The RMS is responsible for generating and sending configuration files to all affected Health Services.

For workflow reloading (which is caused by a new configuration on the RMS), the most likely bottlenecks are the same: the CPU first, and the OpsMgr installation disk I/O second. The RMS is responsible for reading the configuration file, for loading and initializing all workflows that run on it, and for updating the RMS HealthService store when the configuration file is updated on the RMS.

For local workflow activity bursts (which occur when agents change their availability), the most likely bottleneck is the CPU. If you find that the CPU is not working at maximum capacity, the next most likely bottleneck is the hard disk. The RMS is responsible for monitoring the availability of all agents by using RMS local workflows. The RMS also hosts distributed dependency monitors that use the disk.

Management server

During a configuration update burst (caused by an MP import and discovery), the typical bottlenecks are, first, the CPU and, second, the OpsMgr installation disk I/O. The management server is responsible for forwarding configuration files from the RMS to the target agents.

For Operational data collection, bottlenecks are typically caused by the CPU. The disk I/O may also be at maximum capacity, but that is not as likely. The management server is responsible for decompressing and decrypting incoming operational data, and inserting it into the Operational Database. It also sends acknowledgements (ACKs) back to the agents or gateways after it receives operational data, and uses disk queuing to temporarily store these outgoing ACKs. Lastly, the management server will also forward monitor state changes (by using a disk queue) to the RMS for distributed dependency monitors.

Gateway

The gateway is both CPU-bound and I/O-bound. When the gateway is relaying a large amount of data, both the CPU and I/O operations may show high usage. Most of the CPU usage is caused by the decompression, compression, encryption, and decryption of the incoming data, and also by the transfer of that data. All data that the gateway receives from the agents is stored in a persistent queue on disk, to be read and forwarded to the management server by the gateway Health service. This can cause heavy disk usage. This usage can be significant when the gateway is taken temporarily offline and must then handle the accumulated data that the agents generated and tried to send while the gateway was offline.

To troubleshoot the issue in this situation, collect the following information for each affected management server or gateway:

Troubleshooting SQL Server Performance

Operational Database (OperationsManager)

For the OperationsManager database, the most likely bottleneck is the disk array. If the disk array is not at maximum I/O capacity, the next most likely bottleneck is the CPU. The database may experience occasional slowdowns during operational "data storms" (very high incidences of events, alerts, and performance data or state changes that persist for a relatively long time). A short burst typically does not cause any significant delay for an extended period of time.

During operational data insertion, the database disks are primarily used for writes. CPU use is usually caused by SQL Server churn. This may occur when you have large and complex queries, heavy data insertion, and the grooming of large tables (which, by default, occurs at midnight). Typically, the grooming of even large Events and Performance Data tables does not consume excessive CPU or disk resources. However, the grooming of the Alert and State Change tables can be CPU-intensive for large tables.

The database is also CPU-bound when it handles configuration redistribution bursts, which are caused by MP imports or by a large instance space change. In these cases, the Config service queries the database for new agent configuration. This usually causes CPU spikes to occur on the database before the service sends the configuration updates to the agents.

Data Warehouse (OperationsManagerDW)

For the OperationsManagerDW database, the most likely bottleneck is the disk array. This usually occurs because of very large operational data insertions. In these cases, the disks are mostly busy performing writes. Usually, the disks are performing few reads, except to handle manually-generated Reporting views because these run queries on the data warehouse.

CPU usage is usually caused by SQL Server churn. CPU spikes may occur during heavy partitioning activity (when tables become very large and are then partitioned), during the generation of complex reports, and when there are large numbers of alerts in the database with which the data warehouse must constantly synchronize.
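
One quick way to confirm whether the disk array or the CPU is the bottleneck on the SQL Server that hosts the OperationsManager or OperationsManagerDW database is to sample a few counters from the command line. The following Python sketch uses the built-in typeperf utility; the specific counters are common examples that are chosen for illustration, not an exhaustive or official list.

    # Sketch: sample a few commonly used CPU, memory, and disk counters on the SQL Server by using typeperf.
    # Assumptions: run on the SQL Server that hosts the OperationsManager or OperationsManagerDW database;
    # the counters below are illustrative examples only, and a named SQL instance uses different counter paths.
    import subprocess

    COUNTERS = [
        r"\Processor(_Total)\% Processor Time",               # overall CPU use
        r"\Memory\Available MBytes",                          # overall memory pressure
        r"\SQLServer:Buffer Manager\Page life expectancy",    # SQL Server memory pressure (default instance)
        r"\PhysicalDisk(*)\Avg. Disk sec/Read",               # read latency on the drives that hold data and log files
        r"\PhysicalDisk(*)\Avg. Disk sec/Write",              # write latency on the drives that hold data and log files
    ]

    # Collect 20 samples, one every 15 seconds, and write them to a CSV file for review.
    subprocess.run(
        ["typeperf"] + COUNTERS + ["-si", "15", "-sc", "20", "-f", "CSV", "-o", "sql_counters.csv", "-y"],
        check=True,
    )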

General troubleshooting

To troubleshoot the issue in this situation, collect the following information for each affected management server or gateway:
  • Counters to identify memory pressure
  • Counters to identify disk pressure
  • Physical Disk counters for all drives that contain SQL data or log files

The following links provide deeper insight into troubleshooting SQL Server performance:

OpsMgr Performance counters

The following sections describe the performance counters that you can use to monitor and troubleshoot OpsMgr performance.
Gateway server role
Management server role
Overall performance counters: These counters indicate the overall performance of the management server:
  • LogicalDisk(*)\Avg. Disk Queue Length
OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the management server:
  • Process(MonitoringHost*)\Working Set
OpsMgr-specific performance counters: These counters indicate the performance of specific aspects of OpsMgr on the management server:
Root management server role
Overall performance counters: These counters indicate the overall performance of the root management server:
  • LogicalDisk(*)\Avg. Disk Queue Length
OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the root management server:
  • Process(Microsoft.Mom.Sdk.ServiceHost)\Working Set
OpsMgr-specific performance counters: These counters indicate the performance of specific aspects of OpsMgr on the root management server:
Properties

Article ID: 2288515 - Last Review: Jul 9, 2012 - Revision: 1
