CSV failover time is longer than expected in Windows failover cluster

Applies to: Windows Server 2012 R2 Datacenter, Windows Server 2012 R2 Standard, Windows Server 2012 Datacenter

Symptoms


In a Windows failover cluster that uses Cluster Shared Volumes (CSV), the diff area that is allocated by Volsnap is large and fragmented. In this situation, you encounter the following issues:
  • The failover time on the CSV is longer than expected.
  • The time that Volsnap takes to mount or unmount snapshots is several minutes.

More Information


When an NTFS or ReFS volume is mounted or dismounted, Volsnap iterates through the diff area to mount or unmount the snapshots that belong to that volume. When the diff area allocation becomes large and fragmented, these mount or unmount operations can take several minutes. Because a CSV failover dismounts and remounts the volume, failover time can also be longer than expected.

Resolution


To prevent the issue, use the CSV Software snapshot life cycle management mechanism.

As part of update 2878635 and update 2903939, a CSV Software snapshot life cycle management mechanism was introduced. This mechanism is used to minimize the occurrence of this issue by restricting the Volsnap diff area and purging old snapshots.

Note CSV Software snapshots map one-to-one with Volsnap snapshots.

To configure CSV software snapshot life cycle management, two new cluster Physical Disk private properties were added: SnapshotDiffSize and SnapshotAgeLimit.
Note These properties do not apply to non-CSV disks.
SnapshotDiffSize
This property controls the maximum diff area size that can be consumed by Volsnap for a Physical Disk resource configured for CSV.

Units: In MB (DWORD)
Default Value: 0
Maximum Value: 1 TB
The Physical Disk resource must be taken offline and brought back online for changes to take effect.

When the property is set to the default value of 0, the cluster service calculates the maximum desirable diff area size as 20% of the CSV volume size, subject to a minimum of 320 MB and a maximum of 1 TB. When the diff area extends beyond the specified size because of many old snapshots or too much I/O churn, Volsnap automatically deletes snapshots on the corresponding volumes.
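The default sizing rule described above can be illustrated with a short calculation. The following is only a sketch: Get-DefaultSnapshotDiffSizeMB is a hypothetical helper name, not a cluster cmdlet, and the rounding behavior (rounding down to a whole MB) is an assumption, because the exact rounding used by the cluster service is not documented here.

```powershell
# Hypothetical helper (not part of the FailoverClusters module).
# Estimates the diff area size that applies when SnapshotDiffSize is 0:
# 20% of the CSV volume size, bounded by 320 MB and 1 TB.
function Get-DefaultSnapshotDiffSizeMB {
    param(
        [Parameter(Mandatory = $true)]
        [UInt64] $VolumeSizeMB
    )

    [UInt64] $minMB = 320
    [UInt64] $maxMB = 1TB / 1MB   # 1 TB expressed in MB (1048576)

    # 20% of the volume size, rounded down to a whole MB (assumed rounding).
    [UInt64] $candidateMB = [Math]::Floor([double] $VolumeSizeMB / 5)

    # Clamp the result to the documented minimum and maximum.
    return [Math]::Min([Math]::Max($candidateMB, $minMB), $maxMB)
}

# Example: a 10 GB CSV volume (10240 MB).
Get-DefaultSnapshotDiffSizeMB -VolumeSizeMB 10240
```

Under these assumptions, a very small volume is raised to the 320 MB minimum, and a volume larger than 5 TB is capped at the 1 TB maximum.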

This behavior may cause the backups that are associated with the deleted snapshots to fail. Additionally, when all the snapshots are deleted by Volsnap, an Event ID 5120 may be logged in the System log. If too much I/O churn causes snapshots to frequently exceed the automatic 20% diff area allocation, you can manually configure the SnapshotDiffSize property to a larger allocation.

To set this property, run the following command in an elevated Windows PowerShell console:
Get-ClusterSharedVolume  <Cluster Disk Name>  | Set-ClusterParameter snapshotdiffsize <Snapshot Diff Size in MB> 
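As an illustration only, the command might be filled in and verified as follows. The resource name 'Cluster Disk 1' and the size of 40960 MB are placeholder values chosen for this sketch, not recommendations. Substitute the values for your environment.

```powershell
# Placeholder values (assumptions): replace with your CSV resource name and size.
$csvName    = 'Cluster Disk 1'
$diffSizeMB = 40960   # 40 GB expressed in MB

# Set the private property on the CSV's Physical Disk resource.
Get-ClusterSharedVolume -Name $csvName |
    Set-ClusterParameter -Name SnapshotDiffSize -Value $diffSizeMB

# Read the property back to confirm the stored value.
Get-ClusterSharedVolume -Name $csvName |
    Get-ClusterParameter -Name SnapshotDiffSize
```

The stored value does not take effect until the Physical Disk resource has been taken offline and brought back online.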
SnapshotAgeLimit
This property is a Resource Type private property of the Physical Disk that controls the maximum age of a snapshot. Long-lived snapshots are a significant contributor to diff area fragmentation.

Units: In Days (DWORD)
Default Value: 7
Range: 1-60 
This is a global property that affects all Physical Disk resources. You do not have to take the resource offline or online for it to take effect.

The default setting is that any snapshots that are older than 7 days are deleted. Because a backup operation rarely spans more than 1 day, it is generally not necessary to increase the SnapshotAgeLimit value. Snapshots that are created on CSV by using Remote Volume Shadow Copy Service (RVSS) have a default lifetime of 1 day, which is enforced by RVSS. If a snapshot is deleted because a long-running backup operation exceeds the age limit (by default, 7 days), the corresponding backup operation may fail. The CSV snapshot age limit is enforced on all software snapshots, including non-persistent snapshots for backup purposes and persistent snapshots for non-backup purposes. When a snapshot exceeds the SnapshotAgeLimit and is deleted, an Event ID 5218 is logged in the System event log.

To set this property, run the following command in an elevated Windows PowerShell console:
Get-ClusterResourceType "physical disk" | Set-ClusterParameter snapshotagelimit  <Snapshot Age in Days>
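To check the result, the current value can be read back and the System log searched for the event IDs mentioned in this article. This is a minimal sketch. The limit of 20 events is an arbitrary choice, and the query is not filtered by event provider, so confirm the event source when you review the output.

```powershell
# Read the current age limit (in days) from the Physical Disk resource type.
Get-ClusterResourceType -Name 'Physical Disk' |
    Get-ClusterParameter -Name SnapshotAgeLimit

# List recent Event ID 5120 and 5218 entries from the System log.
# -ErrorAction SilentlyContinue suppresses the error that Get-WinEvent
# raises when no matching events exist.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 5120, 5218 } `
    -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message
```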