Troubleshooting the Database Replication Service in Microsoft Configuration Manager

What does this guide do?
This guide helps administrators diagnose and resolve Database Replication Service (DRS) problems in the following products:
  • Microsoft System Center 2012 Configuration Manager (ConfigMgr 2012)
  • Microsoft System Center 2012 R2 Configuration Manager (ConfigMgr 2012 R2)
  • Microsoft System Center Configuration Manager current branch (1502, 1602, etc.)
Who is it for?
Administrators charged with diagnosing and resolving Database Replication Service (DRS) problems.

How does it work?
This guide helps determine the type of DRS problem you are experiencing, explains how the processes work and offers troubleshooting suggestions for some of the most common problems.

Estimated time of completion:
15-30 minutes.
Getting Started
When faced with a Database Replication Service (DRS) problem in Microsoft Configuration Manager, the beginning investigative phase is the most critical point. Any type of change or fix should be made only after careful study and understanding of the problem at hand.
Start by gathering information related to the history of the problem. Many times DRS problems can ultimately be traced back to a recent change made in the environment. Keep in mind that you should not focus solely on Configuration Manager, as changes to Windows or SQL Server can cause DRS problems as well. Having a clear understanding of any recent changes in the environment can provide important clues as to the source of the problem.
Once you’ve investigated environmental changes and made sure that your updates are in order, the next thing you want to do is run the Replication Link Analyzer (RLA). To launch RLA, open the Monitoring workspace and click the Database Replication node, then right click the link that is having a problem and select Replication Link Analyzer, as shown below.
20033_Image1
 

RLA will check the following on both sites:

  • The SMS service is running
  • SMS Replication Configuration Monitor component is running
  • Ports required for SQL replication are enabled
  • SQL version is supported
  • Network is available between the two sites
  • There’s enough space for the SQL database
  • SQL Server Broker service configuration exists
  • SQL Server Broker service certificate exists
  • Known errors in SQL log files
  • Whether the replication queues are disabled
  • Time is in sync
  • Is the transmission of data stuck?
  • Does a Key Conflict exist?
If RLS finds known problems, it will offer to try to fix them for you.
The RLA output report is also very straight forward; it tells you what it checked and what rules were run in addition to whether they passed or failed. Here is an example:
20033_Image2
 

Getting Started
When faced with a Database Replication Service (DRS) problem in Microsoft Configuration Manager, the beginning investigative phase is the most critical point. Any type of change or fix should be made only after careful study and understanding of the problem at hand.
Start by gathering information related to the history of the problem. Many times DRS problems can ultimately be traced back to a recent change made in the environment. Keep in mind that you should not focus solely on Configuration Manager, as changes to Windows or SQL Server can cause DRS problems as well. Having a clear understanding of any recent changes in the environment can provide important clues as to the source of the problem.
Once you’ve investigated environmental changes and made sure that your updates are in order, the next thing you want to do is run the Replication Link Analyzer (RLA). To launch RLA, open the Monitoring workspace and click the Database Replication node, then right click the link that is having a problem and select Replication Link Analyzer, as shown below.
20033_Image1
 

RLA will check the following on both sites:

  • The SMS service is running
  • SMS Replication Configuration Monitor component is running
  • Ports required for SQL replication are enabled
  • SQL version is supported
  • Network is available between the two sites
  • There’s enough space for the SQL database
  • SQL Server Broker service configuration exists
  • SQL Server Broker service certificate exists
  • Known errors in SQL log files
  • Whether the replication queues are disabled
  • Time is in sync
  • Is the transmission of data stuck?
  • Does a Key Conflict exist?
If RLS finds known problems, it will offer to try to fix them for you.
The RLA output report is also very straight forward; it tells you what it checked and what rules were run in addition to whether they passed or failed. Here is an example:
20033_Image2
 

Getting details with SPDiagDRS
Assuming Replication Link Analyzer was unable to detect and resolve the problem, Run SPDiagDRS and see if it can offer any clues as to what may be failing.
To run SPDiagDRS, open SQL Management Studio and connect to the two servers on each side of the link having the problem. On each CM_xxx database, run the command Exec SPDiagDRS.
The following is a breakdown of the various SPDiagDRS sections and some common places to look for problems. A simple search for error messages and codes found here will often times guide you to the source of the problem.
20033_Image3
 

Section 1

  • SiteStatus: This tells us whether the site is replicating or not. Anything other than ACTIVE is not good. 
  •  CertificateThumbprint: The thumbprint of the certificate used for authentication that contains the site’s public key (Local DB trusts remote DB). 
20033_Image4
 

Section 2

  • IncomingMessageInQueue: This tells us the incoming backlog that a site has. If the backlog is high due to the number of sites reporting to it, you may see the links going to a degraded or failed state because the heartbeat synchronizations are not processed in time. 
  • OutgoingMessageInQueue: This tells us the backlog that has yet to clear as we wait for the sites to receive the messages. This generally fluctuates, however if it continues to grow then this can represent a problem. Further troubleshooting should be performed to determine which site is not getting the messages.
20033_Image5
 

Section 3

This is simply the detailed view of the "Initialization Detail" in the console.
20033_Image6
 
 

Section 4

This is the detailed view of "Replication Detail" in the console. This gives more information about the flow between each replication group.
20033_Image7
 

Section 5

This section has some important and useful information about the sites we are connecting TO. In this example we are on Primary site server 002 and 001 is the CAS. If we had a secondary site under 002 it would be shown here. On a CAS, all the Primaries would be reflected but not Secondaries.
Primary Site 002 example:
20033_Image8
 
CAS Site 001 example:
20033_Image9
 

Section 6

This provides the general information of the sites in the hierarchy, the SiteServerName and DBServer names, as well as the status and version. You can see here that a different primary site, 003, is showing as being in Maintenance Mode. On working systems, Section 6 should be identical between the CAS and all Primaries in the hierarchy.
20033_Image10
 

Section 7

The two bottom sections contain detailed information on the heartbeat or LastSentStatus for each group as well as conversationIDs, etc., and the built in replication options configured for each group. 

Check RCMCtrl.log for errors
Next you will want to check RCMCtrl.log on each site for errors, as this will often times provide valuable clues regarding the source of the problem. For example, you may find that replication is in a Failed state for a site and that replication has not occurred for some time. In this scenario, you may find that RCMCtrl.log contains entries similar to the following:
7/4/2016 1:25:36 PM: ReplicationLinkAnalysis Information: 1 : Completed replication link analysis thread.7/4/2016 1:25:37 PM: ReplicationLinkAnalysis Error: 1 : Unable to find SiteCode or SiteNumber7/4/2016 1:25:37 PM: ReplicationLinkAnalysis Error: 1 : Microsoft.ConfigurationManager.ManagedBase.LocalServerDataNotFoundException: Unable to find SiteCode or SiteNumber   at Microsoft.ConfigurationManager.ManagedBase.SiteData.Refresh()   at Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.ReplicationLinkAnalysisEngine.Initialize()   at Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.ReplicationLinkAnalysisEngine.RunRulesInBackground(Object sender, DoWorkEventArgs e)   at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)
If you see entries similar to these, make sure that the SMS Executive and the Site Component Manager services are running on the site in question. If not, this may be why replication is in a Failed state. If not running, start the SMS Executive and/or Site Component Manager services manually and troubleshoot the services if they fail to start.
Another example of an error you might find in RCMCtrl.log is the following:
07/04/2016 12:33:34 PM 6352 (0x18D0)CSqlBCP::ReadRowCount: Can't open file [F:\Program Files\Microsoft Configuration Manager\inboxes\rcm.box\42EEDF0E-17CF-4713-AA65-C0E66B51AD62\INSTALLED_EXECUTABLE_DATA.bcp.rowcount]. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) CSqlBCP::DRS_Init_BCPIN: ReadRowCount failed. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0)*** DRS_Init_BCPIN() failed SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) CBulkInsert::DRS_Init_BCPIN : Failed to BCP in SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) BCP in result is 2147500037. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) ERROR: Failed to BCP in for table INSTALLED_EXECUTABLE_DATA with error code 2147500037. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) ERROR: Failed to apply BCP for all articles in publication Hardware_Inventory_7. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) Will try to apply BCP files again on next run. 
What’s happening here is that while the .CAB file sent from the parent was unpacked by the despooler, the space on the drive was exhausted, thus it was only able to uncompress some of the files. If you view despool.log it will have a 2147024784 failure which refers to insufficient disk space. To resolve this type of issue simply free up disk space on the drive.

Check RCM.log for errors
After checking RCMCtrl.log you’ll next want to take a look at RCM.log.
You may find that replication is in a Failed state for a site and that replication has not occurred for some time. In this scenario, you may find that RCM.log contains entries similar to the following:
7/4/2016 1:25:36 PM: ReplicationLinkAnalysis Information: 1 : Completed replication link analysis thread.7/4/2016 1:25:37 PM: ReplicationLinkAnalysis Error: 1 : Unable to find SiteCode or SiteNumber7/4/2016 1:25:37 PM: ReplicationLinkAnalysis Error: 1 : Microsoft.ConfigurationManager.ManagedBase.LocalServerDataNotFoundException: Unable to find SiteCode or SiteNumber   at Microsoft.ConfigurationManager.ManagedBase.SiteData.Refresh()   at Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.ReplicationLinkAnalysisEngine.Initialize()   at Microsoft.ConfigurationManager.ReplicationLinkAnalyzer.ReplicationLinkAnalysisEngine.RunRulesInBackground(Object sender, DoWorkEventArgs e)   at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)
If you see entries similar to these, make sure that the SMS Executive and the Site Component Manager services are running on the site in question. If not, this may be why replication is in a Failed state. If not running, start the SMS Executive and/or Site Component Manager services manually and troubleshoot the services if they fail to start.
Another example of an error you might find in RCM.log is the following:

07/04/2016 12:33:34 PM 6352 (0x18D0)CSqlBCP::ReadRowCount: Can't open file [F:\Program Files\Microsoft Configuration Manager\inboxes\rcm.box\42EEDF0E-17CF-4713-AA65-C0E66B51AD62\INSTALLED_EXECUTABLE_DATA.bcp.rowcount]. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) CSqlBCP::DRS_Init_BCPIN: ReadRowCount failed. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0)*** DRS_Init_BCPIN() failed SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) CBulkInsert::DRS_Init_BCPIN : Failed to BCP in SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) BCP in result is 2147500037. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) ERROR: Failed to BCP in for table INSTALLED_EXECUTABLE_DATA with error code 2147500037. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) ERROR: Failed to apply BCP for all articles in publication Hardware_Inventory_7. SMS_REPLICATION_CONFIGURATION_MONITOR 07/04/2016 12:33:34 PM 6352 (0x18D0) Will try to apply BCP files again on next run.
What’s happening here is that while the .CAB file sent from the parent was unpacked by the despooler, the space on the drive was exhausted, thus it was only able to uncompress some of the files. If you view despool.log it will have a 2147024784 failure which refers to insufficient disk space. To resolve this type of issue simply free up disk space on the drive.

Check Despool.log for errors
After RCM.log, you should next check Despool.log for relevant information and error messages. For example, if you have packages that are not being distributed from a primary site to a secondary, you will typically see an error similar to the following in Despool.log:
Signature checked out OK for instruction coming from site P01, proceed with the instruction execution. SMS_DESPOOLER 04-07-2016 7.32.05 PM 2208 (0x08A0)Executing instruction of type MICROSOFT|SMS|MINIJOBINSTRUCTION|PACKAGE SMS_DESPOOLER 04-07-2016 7.32.05 PM 2208 (0x08A0)Received package P0100017 version 1. Compressed file –  C:\SMSPKG\P0100001.PCK.1 as C:\Program Files\Microsoft Configuration Manager\inboxes\despoolr.box\receive\ds_s916k.pkg SMS_DESPOOLER 04-07-2016 7.32.05 PM 2208 (0x08A0)This package[P0100001]'s information hasn't arrived yet for this version [1]. Retry later … SMS_DESPOOLER 04-07-2016 7.32.06 PM 2208 (0x08A0)Despooler failed to execute the instruction, error code = 12 SMS_DESPOOLER 04-07-2016 7.32.06 PM 2208 (0x08A0)
The key here is "Despooler failed to execute the instruction, error code = 12". This means that the package has reached the distribution point, however the instructions to process it were not received at the secondary site.
The easiest fix is to reinitialize the Secondary Site Data replication group by doing the following steps:
  1. Create a blank text file name 'Secondary Site Data.pub'
  2. Copy the file to the RCM.box of the secondary site server
This will force replication between the primary and secondary site for the secondary site data replication group.

Check for BCP Problems
If you still haven’t found the source of the problem, it could be that the replication process was simply interrupted because the bulk copy program (BCP) was going too slow. Is sender throttled to this site and perhaps this is slowing down the BCP transfer? To verify, open the console and go to Administration -> Overview -> Hierarchy Configuration -> File Replication, then right-click the site that would be sending the data. Verify that the schedule availability is set to "Open for all Priorities" and that Rate Limits is set to "Unlimited to this Site".
20033_Image11
 
If things are working but the data set from the BCP process is large and taking a long time to send, you can increase the number of sender threads to speed things up. The defaults are listed below. If your sender log is consistently advising "no more threads available" or "Using 5 or 5" or "Using 3 of 3", this is a good indication that you may want to increase the sender threads. 
20033_Image12
 
Also be aware that if you have a rate limit set to “Limited to specified maximum transfer rates by hour“ (as shown below), Configuration Manager will only use one sender thread at a time when transferring to that site regardless of what the number of sender threads are set to. The default setting of “Unlimited When Sending to this destination” will use all the configured sender threads. 
20033_Image13
 


Congratulations
Congratulations! Your Database Replication Service problem has been resolved. If you’d like more information regarding DRS please see the following:
You can also post a question in our Configuration Manager support forum here:

Visit our blog for all the latest news, information and tech tips on Microsoft System Center Configuration Manager.
Additional Information
For additional information regarding the  Database Replication Service please see the following:
You can also post a question in our Configuration Manager support forum here:

Visit our blog for all the latest news, information and tech tips on Microsoft System Center Configuration Manager.
Vlastnosti

ID článku: 20033 - Poslední kontrola: 31. 7. 2016 - Revize: 17

Váš názor