Slow VSS operations in huge Hyper-V environments

Applies to: Windows Server 2012 R2 Datacenter, Windows Server 2012 R2 Essentials, Windows Server 2012 R2 Standard

Symptoms


On very large clustered Hyper-V servers, you may experience slow backup operations. In some cases, writer time-outs are reported, and the backup fails. Additionally, when you run the vssadmin list writers command, writer enumeration either fails or takes much longer than expected to return results.
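To put a number on "much longer than expected," the command can be timed from a script. The following is a minimal sketch, not an official tool. The helper names time_command and count_writers are made up for illustration, the 600-second time-out is an arbitrary choice, and the writer count assumes English-language output in which each writer is introduced by a "Writer name:" line. The command itself generally has to run from an elevated prompt.

```python
import subprocess
import time


def time_command(argv, timeout_s=600):
    """Run a command and return (elapsed_seconds, stdout_text)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
    return time.perf_counter() - start, completed.stdout


def count_writers(vssadmin_output):
    """Count writers, assuming each one starts with a 'Writer name:' line."""
    return sum(
        1
        for line in vssadmin_output.splitlines()
        if line.strip().startswith("Writer name:")
    )


if __name__ == "__main__":
    try:
        elapsed, output = time_command(["vssadmin", "list", "writers"])
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # Not a Windows node, or enumeration exceeded the chosen time-out.
        print(f"Could not time vssadmin: {exc!r}")
    else:
        print(f"{count_writers(output)} writers listed in {elapsed:.1f} s")
```

Running the script on each node at a quiet time and again during a backup window gives a simple baseline for comparison.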

Cause


This issue can have several causes. The time required depends on the number of virtual machines on the cluster nodes, the number of checkpoints, the number of attached VHDs, the time it takes to retrieve those details from the VHD location, and other configuration details.

When a backup runs on a Cluster Shared Volume (CSV), every attached node is queried for its configuration data, regardless of whether its virtual machines are turned on. The virtual machine whose data takes the longest to enumerate contributes the most to prolonging the VSS process.

Workaround


To work around this issue, try to balance the workload across the nodes and reduce the limiting factors, such as the number of checkpoints. Also make sure that you have fast access to the storage locations in question (see the "Cause" section).

Note In Windows Server 2016, backup uses resilient change tracking (RCT) and no longer involves the writer, so these limiting factors are not an issue.

For more information about RCT, see the "New DPM 2016 features overview" section in What's new in DPM in System Center 2016.

To recognize and troubleshoot the issue that's described in the "Symptoms" section, capture a Process Monitor (procmon) trace on the nodes. This helps you determine the time spent on writer-metadata enumeration. You can also capture a VSS trace to learn who or what is consuming the most time. However, the VSS trace does not show the details behind the factors that extend this enumeration.

While you're running the vssadmin list writers command, run procmon on the nodes. To do this, follow these steps.

Note For more information about procmon, see Process Monitor.

    1. Load the traces.
    2. Use the following filters to isolate the VSS operations that took the most time:

      Include: Path ends with "(Leave)"
      Include: Path ends with "(Enter)"

      Also, filter to show only registry activity.
    3. You can now identify which writer comes last, as shown in the following screenshot:

      [Screenshot: Process Monitor with filter]

      Although the vssapi publisher acts as a wrapper around the enumeration of all the writers, the one that really matters at this point is "Hyper-V-Writer."

    4. Now you can filter on the thread ID (TID) to see more of what occurs while this enumeration is running.

      Note You also might want to exclude the events that occur before (Enter) and after (Leave).
    5. You'll notice that the information that the Hyper-V Writer gathers is metadata, which contains XML with configuration details, as well as virtual disk and differencing disk information. From this data, you may be able to identify which action is consuming the most time and then respond accordingly. Additionally, you can use the File Summary tool in Procmon to determine where the enumeration spent the most time, or the Stack Summary tool to see which functions are doing the work at that time (RetrieveVirtualMachineComponentMetadata).
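The pairing of (Enter) and (Leave) events described in these steps can also be done offline on a trace that was saved as CSV. The following is a minimal sketch under several assumptions: the export contains "Time of Day", "TID", and "Path" columns (the TID column must be enabled in Procmon before you export), the matching (Enter) and (Leave) events share a TID and a path prefix, and the trace does not cross midnight. The function names and the sample rows are invented for illustration and were not captured from a real system.

```python
import csv
import io

# Invented sample rows in the shape of a Procmon CSV export (not real data).
SAMPLE = r'''
"Time of Day","Process Name","PID","TID","Operation","Path","Result"
"10:00:00.0000000 AM","vmms.exe","100","11","RegSetValue","HKLM\Example\WriterA\IDENTIFY (Enter)","SUCCESS"
"10:00:00.5000000 AM","svchost.exe","200","22","RegSetValue","HKLM\Example\WriterB\IDENTIFY (Enter)","SUCCESS"
"10:00:01.0000000 AM","svchost.exe","200","22","RegSetValue","HKLM\Example\WriterB\IDENTIFY (Leave)","SUCCESS"
"10:00:42.2500000 AM","vmms.exe","100","11","RegSetValue","HKLM\Example\WriterA\IDENTIFY (Leave)","SUCCESS"
'''


def parse_time_of_day(text):
    """Convert a value such as '1:02:03.1234567 PM' to seconds since midnight."""
    clock, _, meridiem = text.strip().partition(" ")
    hours, minutes, seconds = clock.split(":")
    hours = int(hours)
    if meridiem.upper() == "PM" and hours != 12:
        hours += 12
    elif meridiem.upper() == "AM" and hours == 12:
        hours = 0
    return hours * 3600 + int(minutes) * 60 + float(seconds)


def enumeration_times(rows):
    """Return {(tid, path_prefix): seconds} for matched (Enter)/(Leave) pairs."""
    started = {}
    durations = {}
    for row in rows:
        path = row["Path"]
        if path.endswith("(Enter)"):
            suffix = "(Enter)"
        elif path.endswith("(Leave)"):
            suffix = "(Leave)"
        else:
            continue  # same effect as the Include filters in step 2
        key = (row["TID"], path[: -len(suffix)].rstrip())
        moment = parse_time_of_day(row["Time of Day"])
        if suffix == "(Enter)":
            started.setdefault(key, moment)  # keep the first Enter row
        elif key in started:
            elapsed = moment - started.pop(key)
            # If a pair repeats in the trace, keep the longest occurrence.
            durations[key] = max(durations.get(key, 0.0), elapsed)
    return durations


def slowest_first(durations):
    """Sort the measured pairs so that the longest enumeration comes first."""
    return sorted(durations.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    rows = csv.DictReader(io.StringIO(SAMPLE.lstrip()))
    for (tid, name), seconds in slowest_first(enumeration_times(rows)):
        print(f"TID {tid}: {name} took {seconds:.2f} s")
```

To use this on a real export, replace the sample with an opened file. The file may begin with a byte-order mark, in which case opening it with encoding="utf-8-sig" avoids a mangled first column name. The TID of the slowest entry is then the value to filter on in step 4.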