Troubleshooting gray agent states in Operations Manager 2012

What does this guide do?

Troubleshoots problems in which an agent, a Management Server or a Gateway is unavailable or "grayed out" (or greyed out) in System Center 2012 Operations Manager (OpsMgr 2012 or OpsMgr 2012 R2).

Who is it for?

Admins of System Center 2012 Operations Manager who help resolve issues with gray agent states.

How does it work?

We’ll begin by explaining the causes of the issue and helping you define the scope of the issue to come up with a troubleshooting strategy. Then we’ll take you through a series of steps to resolve your issue.

Estimated time of completion:

15-30 minutes.

Welcome to the guide

Background

An agent, a management server, or a gateway can have one of the following states, as indicated by the color of the agent name and icon in the Monitoring pane.

  • Healthy (green check mark): The agent or Management Server is running normally.
  • Critical (red check mark): There is a problem on the agent or Management Server.
  • Unknown (gray agent name, gray check mark): The health service watcher on the Root Management Server (RMS) that is watching the health service on the monitored computer is no longer receiving heartbeats from the agent. The health service watcher had been receiving heartbeats previously, and the health service was reported as healthy. This also means that the Management Servers are no longer receiving any information from the agent. This issue may occur because the computer that is running the agent is not running or there are connectivity issues. For more information, see the Health Service Watcher view.
  • Unknown (green circle, no check mark): The status of the discovered item is unknown. There is no monitor available for this specific discovered item.

Causes of a gray state

An agent, a Management Server or a Gateway may become unavailable for any of the following reasons:


  • Heartbeat failure
  • Invalid configuration
  • System workflows failure
  • OpsMgr Database or data warehouse performance issues
  • RMS or primary MS or gateway performance issues
  • Network or authentication issues
  • Health service issues (service is not running)

Determining the Scope

Before you begin to troubleshoot these kinds of agent issues, you should first understand the Operations Manager topology, and then define the scope of the issue. The following questions may help you to define the scope of the issue:

  • How many agents are affected?
  • Are the agents experiencing the issue in the same network segment?
  • Do the agents report to the same Management Server?
  • How often do the agents enter and remain in a gray state?
  • How do you typically recover from this situation (e.g. restart the agent health service, clear the cache, etc.)?
  • Are Heartbeat failure alerts generated for these agents?
  • Does this issue occur during a specific time of the day?
  • Does this issue persist if you failover these agents to another Management Server or Gateway?
  • When did the problem start?
  • Were any changes made to the agents, the Management Servers, or the Gateway or Management group?
  • Are the affected agents Windows clustered systems?
  • Is the Health Service State folder excluded from antivirus scanning?
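
To help answer the first few questions, the availability of all agents can be queried from the Operations Manager Shell. The following is a minimal sketch, assuming the OperationsManager PowerShell module is available on a management server; the IsAvailable property generally corresponds to the gray (unavailable) state:

    # Find agent-managed computers whose Health Service is currently unavailable (gray).
    Import-Module OperationsManager
    $agentClass = Get-SCOMClass -Name "Microsoft.SystemCenter.Agent"
    $grayNames  = Get-SCOMClassInstance -Class $agentClass |
                      Where-Object { -not $_.IsAvailable } |
                      ForEach-Object { $_.DisplayName }

    # Cross-reference with Get-SCOMAgent to see which management server each gray agent reports to.
    Get-SCOMAgent |
        Where-Object { $grayNames -contains $_.DisplayName } |
        Sort-Object PrimaryManagementServerName |
        Format-Table DisplayName, PrimaryManagementServerName, HealthState -AutoSize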

Troubleshooting Strategy

Your troubleshooting strategy will be dictated by which component is inactive, where that component falls within the topology, and how widespread the problem is. Consider the following conditions:

  • If the agents that report to a particular Management Server or Gateway are unavailable, troubleshooting should start at the Management Server or Gateway level.
  • If the Gateways that report to a particular Management Server are unavailable, troubleshooting should start at the Management Server level. 
  • For agentless systems, for network devices, and for UNIX/Linux computers, troubleshooting should start at the agent, Management Server or Gateway that is monitoring these objects.
  • If all the systems are unavailable, troubleshooting should start at the Root Management Server.
  • Troubleshooting typically starts at the level immediately above the unavailable component.

Regardless of where you start, the first step when encountering agents that are unavailable or in a gray state is to make sure they have been updated with all available updates. Once updated, restart the computer and see whether the agent begins to operate normally. If it does not, continue troubleshooting by using the methods below. 

While it is beyond the scope of this troubleshooter to cover every possible scenario and problem, the most common issues are examined and are categorized based on the overall agent symptoms observed in the environment. Select the category below that most closely matches the problem you are experiencing:

Clear the agent cache

If only certain agents appear grayed out or unavailable, make sure that the agent cache is excluded from antivirus scanning, then clear the agent cache on those computers. To clear the agent cache, complete the following:

On an agent-managed computer


  1. In the Monitoring workspace, expand Operations Manager and then expand Agent Details.
  2. Click Agent Health State.
  3. In Agent State, click an agent.
  4. In the Tasks pane, click Flush Health Service State and Cache.

On a Management Server


  1. In the Monitoring workspace, expand Operations Manager and then expand Management Server.
  2. Click Management Servers State.
  3. In Management Server State, click a management server.
  4. In the Tasks pane, click Flush Health Service State and Cache.

NOTE: Be aware that clearing the agent cache can cause loss of monitoring data from that system.

Clearing the cache by using the steps above does the following:

  1. Stops the System Center Management service.
  2. Deletes the health service store files.
  3. Resets the state of the agent, including all rules, monitors, outgoing data, and cached management packs.
  4. Starts the System Center Management service. 
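
If the console task cannot be used (for example, because the agent is gray and no longer accepts tasks), the cache can be cleared locally on the agent with the same steps. The following is a minimal PowerShell sketch; the Health Service State path shown is an assumption based on a default 2012 R2 agent installation and differs for other versions or custom install paths:

    # Run in an elevated PowerShell prompt on the affected agent.
    Stop-Service -Name HealthService

    # Delete the Health Service State folder (adjust the path to match the agent's install location).
    $stateFolder = "C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State"
    Remove-Item -Path $stateFolder -Recurse -Force

    Start-Service -Name HealthService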

Did this solve your problem? 

Check the Health Service on the affected computer

Verify that the Health Service is enabled and is currently running on the Management Server, Gateway and client computer. If the Health Service has stopped responding or will not start, troubleshoot this issue before proceeding.
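
A quick way to check this remotely is with Get-Service and WMI; the following is a minimal sketch ("AGENT01.contoso.com" is a placeholder computer name; the service name is HealthService, displayed as "System Center Management" or "Microsoft Monitoring Agent"):

    # Check the current state of the Health Service on the affected computer.
    Get-Service -Name HealthService -ComputerName "AGENT01.contoso.com"

    # Also check the startup type and logon account; a disabled service will never send heartbeats.
    Get-WmiObject -Class Win32_Service -ComputerName "AGENT01.contoso.com" -Filter "Name='HealthService'" |
        Select-Object Name, State, StartMode, StartName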


Is the Health Service started and running normally?

Examine the Operations Manager Event log on the agent for errors

Examine the Operations Manager Event log on the agent and look for any of the following events:

Event ID: 1102

Event Source: HealthService
Event Description: Rule/Monitor "%4" running for instance "%3" with id:"%2" cannot be initialized and will not be loaded. Management group "%1"

Event ID: 1103

Event Source: HealthService
Event Description: Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them reached the failure limit that prevents automatic reload. Management group "%1". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

SOLUTION: Events 1102 and 1103 typically indicate that some of the workflows failed to load. If these are the core system workflows, these events could cause the issue. Detailed troubleshooting of the various errors, symptoms, and workflows is beyond the scope of this troubleshooter; however, a quick Bing or TechNet search should provide clues as to the cause and resolution.

=====

Event ID: 1104 

Event Source: HealthService
Event Description: RunAs profile in workflow "%4", running for instance "%3" with id:"%2" cannot be resolved. Workflow will not be loaded. Management group "%1"

Event ID: 1105

Event Source: HealthService
Event Description: Type mismatch for RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

Event ID: 1106

Event Source: HealthService
Event Description: Cannot access plain text RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

Event ID: 1107

Event Source: HealthService
Event Description: Account for RunAs profile in workflow "%4", running for instance "%3" with id:"%2" is not defined. Workflow will not be loaded. Please associate an account with the profile. Management group "%1"

SOLUTION: Events 1104, 1105, 1106, 1107, and 1108 can themselves cause Event IDs 1102 and 1103 to occur. Typically this would occur because of misconfigured "Run as" accounts. Make sure that the "Run as" accounts are configured correctly, that they are configured to be used with the right class, and that they are configured to be distributed to the agent.

=====

Event ID: 4000

Event Source: HealthService
Event Description: A monitoring host is unresponsive or has crashed. The status code for the host failure was %1.

=====

Event ID: 21006

Event Source: OpsMgr Connector
Event Description: The OpsMgr Connector could not connect to %1:%2. The error code is %3(%4). Please verify there is network connectivity, the server is running and has registered its listening port, and there are no firewalls blocking traffic to the destination.

SOLUTION: Event ID 21006 indicates that communication issues exist between the agent and the Management Server. If the agent uses a certificate for mutual authentication, verify that the certificate is not expired and that the agent is using the correct certificate. If Kerberos is being used, verify that the agent can communicate with Active Directory. If authentication is working correctly, this may mean that the packets from the agent are not reaching the management server or gateway. To test this, try to establish a simple telnet connection to port 5723 from the agent to the management server. If this fails, there is probably a device on the network, such as a router or firewall, that is blocking these packets.
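
One way to collect all of the events above from the agent's Operations Manager log is with Get-WinEvent, and the port 5723 connectivity mentioned for event 21006 can be checked with Test-NetConnection (available on Windows 8.1 / Windows Server 2012 R2 or later). This is a minimal sketch; "MS01.contoso.com" is a placeholder management server name:

    # Pull the event IDs discussed above from the last 24 hours of the Operations Manager log.
    Get-WinEvent -FilterHashtable @{
        LogName   = 'Operations Manager'
        Id        = 1102, 1103, 1104, 1105, 1106, 1107, 4000, 21006
        StartTime = (Get-Date).AddDays(-1)
    } -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Format-Table TimeCreated, Id, ProviderName, Message -Wrap

    # For event 21006, verify that TCP 5723 is reachable from the agent
    # (on older systems, use "telnet MS01.contoso.com 5723" instead).
    Test-NetConnection -ComputerName "MS01.contoso.com" -Port 5723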


Did this solve your problem?

Determine whether the Management Server is running in Maintenance Mode

If all of the agents that report to a particular Management Server or Gateway are unavailable, verify whether the Management Server is running in Maintenance Mode. If it is, remove the server from Maintenance Mode and test.
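
The check can also be done from the Operations Manager Shell; the following is a minimal sketch (InMaintenanceMode is exposed on the monitored instances, and gateways should appear here as well because their class derives from the management server class):

    # List management servers and whether they are currently in maintenance mode.
    Import-Module OperationsManager
    $msClass = Get-SCOMClass -Name "Microsoft.SystemCenter.ManagementServer"
    Get-SCOMClassInstance -Class $msClass |
        Select-Object DisplayName, InMaintenanceMode, IsAvailable

    # If a server is in maintenance mode, end it from the console, or review the active entry
    # with Get-SCOMMaintenanceMode before changing it with Set-SCOMMaintenanceMode.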


Did this solve your problem?

Examine the Operations Manager Event log on the server for errors

Examine the Operations Manager Event log on the Management Server and Gateway and look for any of the following events:

Event ID: 1102

Event Source: HealthService
Event Description: Rule/Monitor "%4" running for instance "%3" with id:"%2" cannot be initialized and will not be loaded. Management group "%1"

Event ID: 1103

Event Source: HealthService
Event Description: Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them reached the failure limit that prevents automatic reload. Management group "%1". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

SOLUTION: Events 1102 and 1103 typically indicate that some of the workflows failed to load. If these are the core system workflows, these events could cause the issue. Detailed troubleshooting of the various errors, symptoms, and workflows is beyond the scope of this troubleshooter; however, a quick Bing or TechNet search should provide clues as to the cause and resolution.

=====

Event ID: 1104 

Event Source: HealthService
Event Description: RunAs profile in workflow "%4", running for instance "%3" with id:"%2" cannot be resolved. Workflow will not be loaded. Management group "%1"

Event ID: 1105

Event Source: HealthService
Event Description: Type mismatch for RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

Event ID: 1106

Event Source: HealthService
Event Description: Cannot access plain text RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

Event ID: 1107

Event Source: HealthService
Event Description: Account for RunAs profile in workflow "%4", running for instance "%3" with id:"%2" is not defined. Workflow will not be loaded. Please associate an account with the profile. Management group "%1"

SOLUTION: Events 1104, 1105, 1106, 1107, and 1108 can themselves cause Event IDs 1102 and 1103 to occur. Typically this would occur because of misconfigured "Run as" accounts. Make sure that the "Run as" accounts are configured correctly, that they are configured to be used with the right class, and that they are configured to be distributed to the agent.

=====

Event ID: 4000

Event Source: HealthService
Event Description: A monitoring host is unresponsive or has crashed. The status code for the host failure was %1.

=====

Event ID: 21006

Event Source: OpsMgr Connector
Event Description: The OpsMgr Connector could not connect to %1:%2. The error code is %3(%4). Please verify there is network connectivity, the server is running and has registered its listening port, and there are no firewalls blocking traffic to the destination.

SOLUTION: Event ID 21006 indicates that communication issues exist between the agent and the Management Server. If the agent uses a certificate for mutual authentication, verify that the certificate is not expired and that the agent is using the correct certificate. If Kerberos is being used, verify that the agent can communicate with Active Directory. If authentication is working correctly, this may mean that the packets from the agent are not reaching the management server or gateway. To test this, try to establish a simple telnet connection to port 5723 from the agent to the management server. If this fails, there is probably a device on the network, such as a router or firewall, that is blocking these packets.


Did you solve your problem?

Look for performance related events

Examine the Operations Manager Event log for the following events. These events typically indicate that performance issues exist on the Management Server or the Microsoft SQL Server that is hosting the OperationsManager or OperationsManagerDW database:

Event ID: 2115 

Event Source: HealthService
Event Description: A Bind Data Source in Management Group %1 has posted items to the workflow, but has not received a response in %5 seconds. This indicates a performance or functional problem with the workflow.%n Workflow Id : %2%n Instance : %3%n Instance Id : %4%n

Event ID: 5300

Event Source: HealthService
Event Description: Local health service is not healthy. Entity state change flow is stalled with pending acknowledgement. %n%nManagement Group: %2 %nManagement Group ID: %1

Event ID: 4506

Event Source: HealthService
Event Description: Operations Manager Data was dropped due to too much outstanding data in rule "%2" running for instance "%3" with id:"%4" in management group "%1".

Event ID: 31551

Event Source: Health Service Modules
Event Description: Failed to store data in the Data Warehouse. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

Event ID: 31552

Event Source: Health Service Modules
Event Description: Failed to store data in the Data Warehouse.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

Event ID: 31553

Event Source: Health Service Modules
Event Description: Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

Event ID: 31557

Event Source: Health Service Modules
Event Description: Failed to obtain synchronization process state information from Data Warehouse database. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1


Did you find any of the performance related events above?

Check the Health Service on the Management Server and Gateway

Verify that the Health Service is enabled and is currently running on the Management Server and Gateway. If the Health Service has stopped responding or will not start, troubleshoot this issue before proceeding.


Did this solve your problem?

Troubleshooting Management Server and Gateway performance

Operations Manager agents can alternate intermittently between healthy and gray states if there are performance related issues on the Management Server or the Gateway. For example, there could be configuration update bursts occurring in the environment caused by management pack imports and/or discovery data. 

The Management Server is responsible for generating and sending configuration files to all affected Health Services. For Workflow reloading (which is caused by new configuration), the most likely bottlenecks are the CPU first and OpsMgr installation disk I/O second. The Management Server is responsible for reading the configuration file, for loading and initializing all workflows that run on it, and for updating the HealthService store.

For local workflow activity bursts, which occur when agents change their availability, the most likely bottleneck is the CPU. If you find that the CPU is not working at maximum capacity, the next most likely bottleneck is the hard disk. The Management Server is responsible for monitoring the availability of all agents using local workflows and also hosts distributed dependency monitors that use the disk.

In short, when Management Server or Gateway system performance is slow, the most likely bottlenecks are the CPU followed by OpsMgr installation disk I/O.

OpsMgr Performance Counters 

The following are the performance counters that prove the most useful in monitoring and troubleshooting Operations Manager performance.

Gateway Server Role

Overall performance counters: These counters indicate the overall performance of the Gateway:

  • Processor(_Total)\% Processor Time 
  • Memory\% Committed Bytes In Use 
  • Network Interface(*)\Bytes Total/sec 
  • LogicalDisk(*)\% Idle Time 
  • LogicalDisk(*)\Avg. Disk Queue Length 

OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the gateway:


  • Process(HealthService)\%Processor Time 
  • Process(HealthService)\Private Bytes (depending on how many agents this gateway is managing, this number may vary and could be several hundred megabytes) 
  • Process(HealthService)\Thread Count 
  • Process(HealthService)\Virtual Bytes 
  • Process(HealthService)\Working Set 
  • Process(MonitoringHost*)\% Processor Time 
  • Process(MonitoringHost*)\Private Bytes 
  • Process(MonitoringHost*)\Thread Count 
  • Process(MonitoringHost*)\Virtual Bytes 
  • Process(MonitoringHost*)\Working Set 

OpsMgr specific performance counters: These counters are OpsMgr specific counters that indicate the performance of specific aspects of OpsMgr on the gateway:

  • Health Service\Workflow Count 
  • Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this gateway is handling. This represents the number of management pack files that are being uploaded to agents. If this value remains at a high level for a long time, and there is not much management pack importing at a given moment, this may indicate a problem that affects file transfer. 
  • Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue. If this value remains higher than 10 for a long time, and it does not drop, this indicates that the queue is backed up. This condition is caused by an overloaded OpsMgr system because the management server or database is too busy or is offline. 
  • OpsMgr Connector\Bytes Received: The number of network bytes received by the gateway – i.e., the amount of incoming bytes before decompression. 
  • OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the gateway – i.e., the amount of outgoing bytes after compression. 
  • OpsMgr Connector\Data Bytes Received: The number of data bytes received by the gateway – i.e., the amount of incoming data after decompression. 
  • OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the gateway – i.e. the amount of outgoing data before compression. 
  • OpsMgr Connector\Open Connections: The number of connections that are open on the gateway. This number should be the same as the number of agents or management servers that are directly connected to the gateway. 

Management Server Role

Overall performance counters: These counters indicate the overall performance of the management server:

  • Processor(_Total)\% Processor Time 
  • Memory\% Committed Bytes In Use 
  • Network Interface(*)\Bytes Total/sec 
  • LogicalDisk(*)\% Idle Time 
  • LogicalDisk(*)\Avg. Disk Queue Length 

OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the management server:

  • Process(HealthService)\% Processor Time 
  • Process(HealthService)\Private Bytes – Depending on how many agents this management server is managing, this number may vary, and it could be several hundred megabytes. 
  • Process(HealthService)\Thread Count 
  • Process(HealthService)\Virtual Bytes 
  • Process(HealthService)\Working Set 
  • Process(MonitoringHost*)\% Processor Time 
  • Process(MonitoringHost*)\Private Bytes 
  • Process(MonitoringHost*)\Thread Count 
  • Process(MonitoringHost*)\Virtual Bytes 
  • Process(MonitoringHost*)\Working Set 

OpsMgr specific performance counters: These counters are OpsMgr specific counters that indicate the performance of specific aspects of OpsMgr on the management server:
  • Health Service\Workflow Count: The number of workflows that are running on this management server. 
  • Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this management server is handling. This represents the number of management pack files that are being uploaded to agents. If this value remains at a high level for a long time, and there is not much management pack importing at a given moment, this may indicate a problem that affects file transfer. 
  • Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue. If this value remains higher than 10 for a long time, and it does not drop, this indicates that the queue is backed up. This condition is caused by an overloaded OpsMgr system because the OpsMgr system (for example, the root management server) is too busy or is offline. 
  • Health Service Management Groups(*)\Bind Data Source Item Drop Rate: The number of data items that are dropped by the management server for database or data warehouse data collection write actions. When this counter value is not 0, the management server or database is overloaded because it can’t handle the incoming data items fast enough or because a data item burst is occurring. The dropped data items will be resent by agents. After the overload or burst situation is finished, these data items will be inserted into the database or into the data warehouse. 
  • Health Service Management Groups(*)\Bind Data Source Item Incoming Rate: The number of data items received by the management server for database or data warehouse data collection write actions. 
  • Health Service Management Groups(*)\Bind Data Source Item Post Rate: The number of data items that the management server wrote to the database or data warehouse for data collection write actions. 
  • OpsMgr Connector\Bytes Received: The number of network bytes received by the management server – i.e., the size of incoming bytes before decompression. 
  • OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the management server – i.e., the size of outgoing bytes after compression. 
  • OpsMgr Connector\Data Bytes Received: The number of data bytes received by the management server – i.e., the size of incoming data after decompression. 
  • OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the management server – i.e., the size of outgoing data before compression. 
  • OpsMgr Connector\Open Connections: The number of connections open on the management server. This should be the same as the number of agents or root management servers that are directly connected to it. 
  • OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The average number of data items in batches received by database write action modules. If this number is 5,000, a data item burst is occurring. 
  • OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of seconds that a database write action module takes to insert a batch into the database. If this number is often greater than 60, a database insertion performance issue is occurring. 
  • OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The number of milliseconds for data warehouse write action to insert a batch of data items into a data warehouse. 
  • OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of data items or batches received by data warehouse write action modules. 
  • OpsMgr DW Writer Module(*)\Batches/sec: The number of batches received by data warehouse write action modules per second. 
  • OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items received by data warehouse write action modules per second. 
  • OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of data items dropped by data warehouse write action modules. 
  • OpsMgr DW Writer Module(*)\Total Error Count: The number of errors that occurred in a data warehouse write action module. 
NOTE: While troubleshooting performance issues is beyond the scope of this guide, there are tools available to make this job easier. One example is the Windows Performance Toolkit, which consists of two independent tools: Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA). WPR is a powerful recording tool that creates Event Tracing for Windows (ETW) recordings, while WPA is a powerful analysis tool that combines a very flexible UI with extensive graphing capabilities and data tables that can be pivoted and that have full text search capabilities. WPA also provides an Issues window to explore the root cause of any identified problem. Complete details as well as a download link can be found here. 
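
To sample a few of the key counters listed above on a management server or gateway without opening Performance Monitor, Get-Counter can be used. This is a minimal sketch, run locally on the server being examined (adjust the sample interval and count as needed):

    # Sample a subset of the OpsMgr and overall counters every 15 seconds, 20 times (about 5 minutes).
    Get-Counter -Counter @(
        '\Processor(_Total)\% Processor Time',
        '\LogicalDisk(*)\% Idle Time',
        '\Health Service\Workflow Count',
        '\Health Service Management Groups(*)\Send Queue % Used',
        '\OpsMgr Connector\Open Connections'
    ) -SampleInterval 15 -MaxSamples 20 |
        Select-Object -ExpandProperty CounterSamples |
        Format-Table Path, CookedValue -AutoSize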

Did this solve your problem? 

Troubleshooting SQL Server performance

Operational Database (OperationsManager) 

For the OperationsManager database, the most likely bottleneck is the disk array. If the disk array is not at maximum I/O capacity, the next most likely bottleneck is the CPU. 

It is normal for the database to experience occasional slowdowns and operational "data storms" (very high incidences of events, alerts, and performance data or state changes that persist for a length of time); however, a short burst typically does not cause any significant delay for an extended period of time.

During operational data insertion, the database disks are primarily used for writes, and CPU use is usually caused by SQL Server churn. This can occur when you have large and complex queries, heavy data insertion, and the grooming of large tables (which, by default, occurs at midnight). Typically, the grooming of even large Events and Performance Data tables does not consume excessive CPU or disk resources, however the grooming of the Alert and State Change tables can be CPU-intensive for large tables.

The database is also CPU-bound when it handles configuration redistribution bursts, which are caused by MP imports or by a large instance space change. In these cases, the Config service queries the database for new agent configuration which causes CPU spikes to occur on the database before the service sends the configuration updates to the agents.

Data Warehouse (OperationsManagerDW) 

For the OperationsManagerDW database, the most likely bottleneck is the disk array. This usually occurs in the case of very large operational data insertions. In these cases, the disks are mostly busy performing writes. Usually the disks are performing few reads, except to handle manually generated reporting views because these run queries on the data warehouse.

Excessive CPU utilization is typically caused by SQL Server churn. CPU spikes may occur during heavy partitioning activity (when tables become very large and then get partitioned), the generation of complex reports, or large amounts of alerts in the database, with which the data warehouse must constantly sync.

Basic checks

  1. If the OS is 32-bit and RAM is 4 GB or greater, check whether the /pae or /3gb switches exist in the Boot.ini file. These options could be configured incorrectly if the server was originally installed with 4 GB or less of RAM and the RAM was later upgraded. For 32-bit servers that have 4 GB of RAM, the /3gb switch in Boot.ini increases the amount of memory that SQL Server can address from 2 GB to 3 GB. For 32-bit servers that have more than 4 GB of RAM, the /3gb switch in Boot.ini could actually limit the amount of memory that SQL Server can address. For these systems, add the /pae switch to Boot.ini and then enable AWE in SQL Server. 
  2. On a multi-processor system, check the Max Degree of Parallelism (MAXDOP) setting. In SQL Server 2008 and in SQL Server 2005, this option is on the Advanced tab in the Properties dialog box for the server. The default value is 0, which means that all available processors will be used. A setting of 0 is fine for servers that have eight or fewer processors, however for servers that have more than eight processors, the time that it takes SQL Server to coordinate the use of all processors may be counterproductive. Therefore, for servers that have more than eight processors, you generally should set Max Degree of Parallelism to a value of 8. To do this, run the following command in SQL Query Analyzer:
    sp_configure 'show advanced options', 1
    GO
    RECONFIGURE WITH OVERRIDE
    GO
    sp_configure 'max degree of parallelism', 8
    GO
    RECONFIGURE WITH OVERRIDE
    GO
  3. Verify that antivirus software is configured to exclude SQL data and log files. Scanning these files can degrade SQL Server performance (one way to enumerate the file paths to exclude is shown in the sketch below).
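
To enumerate the data and log file paths that should be excluded, the file locations can be read from the SQL Server system catalog. A minimal sketch, assuming the SQL Server PowerShell cmdlets (Invoke-Sqlcmd) are installed and "SQL01" is a placeholder instance name:

    # List every database file path on the instance so the folders can be added to the antivirus exclusion list.
    Invoke-Sqlcmd -ServerInstance "SQL01" -Database "master" -Query "SELECT DB_NAME(database_id) AS DatabaseName, physical_name AS FilePath FROM sys.master_files ORDER BY DatabaseName;"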

Performance counters to identify memory pressure 

  • MSSQL$: Buffer Manager: Page Life expectancy – How long pages persist in the buffer pool. If this value is below 300 seconds it may indicate that the server could use more memory. It could also result from index fragmentation. 
  • MSSQL$: Buffer Manager: Lazy Writes/sec – Lazy writer frees space in the buffer by moving pages to disk. Generally, the value should not consistently exceed 20 writes per second. Ideally, it would be close to zero. 
  • Memory: Available Mbytes - Values below 100 MB may indicate memory pressure. Memory pressure is clearly present when this amount is less than 10 MB. 
  • Process: Private Bytes: _Total: This is the amount of memory (physical and page) being used by all processes combined. 
  • Process: Working Set: _Total: This is the amount of physical memory being used by all processes combined. If the value for this counter is significantly below the value for Process: Private Bytes: _Total, it indicates that processes are paging too heavily. A difference of more than 10% is probably significant. 
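
The same memory counters can be sampled with Get-Counter directly on the SQL Server. This is a minimal sketch; the counter paths assume a default SQL Server instance (for a named instance, the object name is MSSQL$<InstanceName>: Buffer Manager):

    # Sample SQL Server memory-pressure counters every 5 seconds, 6 times.
    Get-Counter -Counter @(
        '\SQLServer:Buffer Manager\Page life expectancy',
        '\SQLServer:Buffer Manager\Lazy writes/sec',
        '\Memory\Available MBytes'
    ) -SampleInterval 5 -MaxSamples 6 |
        Select-Object -ExpandProperty CounterSamples |
        Format-Table Path, CookedValue -AutoSize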

Performance counters to identify disk pressure 

Take a look at these Physical Disk counters for all drives that contain SQL data or log files:


  • % Idle Time: How much disk idle time is being reported. Anything below 50 percent could indicate a disk bottleneck. 
  • Avg. Disk Queue Length: This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high. You could use the following formulas based on the RAID level and number of disks in the RAID configuration:
      ◦ RAID 0: All of the disks are doing work in a RAID 0 set. Average Disk Queue Length <= # (Disks in the array) * 2. 
      ◦ RAID 1: Half the disks are “doing work”; therefore, only half of them can be counted toward the disk queue. Average Disk Queue Length <= # (Disks in the array / 2) * 2. 
      ◦ RAID 10: Half the disks are “doing work”; therefore, only half of them can be counted toward the disk queue. Average Disk Queue Length <= # (Disks in the array / 2) * 2. 
      ◦ RAID 5: All of the disks are doing work in a RAID 5 set. Average Disk Queue Length <= # (Disks in the array) * 2. 
  • Avg. Disk sec/Transfer: The number of seconds it takes to complete one disk I/O. 
  • Avg. Disk sec/Read: The average time, in seconds, of a read of data from the disk. 
  • Avg. Disk sec/Write: The average time, in seconds, of a write of data to the disk. 
    Note that the last three counters in this list should consistently have values of approximately .020 (20 ms) or lower and should never exceed .050 (50 ms). The following are the thresholds that are documented in the SQL Server performance troubleshooting guide:
      ◦ Less than 10 ms: very good
      ◦ Between 10 - 20 ms: okay 
      ◦ Between 20 - 50 ms: slow, needs attention
      ◦ Greater than 50 ms: serious I/O bottleneck
  • Disk Bytes/sec: The number of bytes being transferred to or from the disk per second. 
  • Disk Transfers/sec: The number of input and output operations per second (IOPS). 

When % Idle Time is low (10 percent or less), this means that the disk is fully utilized. In this case, the last two counters in this list (“Disk Bytes/sec” and “Disk Transfers/sec”) provide a good indication of the maximum throughput of the drive in bytes and in IOPS, respectively. The throughput of a SAN drive is highly variable and depends on the number of spindles, the speed of the drives and the speed of the channel. It’s best to check with the SAN vendor to find out how many bytes and IOPS the drive should be able to support. If % Idle Time is low, and the values for these two counters do not meet the expected throughput of the drive, engage the SAN vendor to troubleshoot.

The following links provide deeper insight into troubleshooting SQL Server performance:

Did you solve your problem?

Congratulations!

Your issue with gray agent states in Operations Manager is resolved.

Sorry

It appears that we are unable to resolve your issue by using this guide. For more help resolving this issue please see our TechNet support forum or contact Microsoft Support.

Properties

Article ID: 10129 - Last Review: March 7, 2016 - Revision: 54
