"Event 5120" with STATUS_IO_TIMEOUT c00000b5 after an S2D node restart on Windows Server 2016 May 2018 update or later

Applies to: Windows Server 2016 Datacenter

Symptoms


Consider the following scenario:

  • You have a Windows Server 2016 node.
  • The node has a Windows Server 2016 cumulative update installed that was released from May 8, 2018 (KB4103723) to October 9, 2018 (KB4462917).
  • The node is running Storage Spaces Direct (S2D).
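
As a rough way to check whether a node falls in this range, the following sketch compares the installed update build revision (UBR) against the revisions of the two updates named above. Test-AffectedBuild is a hypothetical helper, and the default revisions (2248 for KB4103723 and 2551 for KB4462917) are assumptions taken from the public Windows Server 2016 update history, so verify them before relying on the result.

```powershell
# Hypothetical helper: returns $true when a Windows Server 2016 (build 14393)
# update revision lies between the two cumulative updates named in the list above.
function Test-AffectedBuild {
    param(
        [Parameter(Mandatory)] [int] $Revision,
        [int] $FirstAffected = 2248,   # assumed revision of KB4103723 (May 8, 2018)
        [int] $LastAffected  = 2551    # assumed revision of KB4462917 (October 9, 2018)
    )
    return ($Revision -ge $FirstAffected -and $Revision -le $LastAffected)
}

# On a node, read the build and revision from the registry (skipped where the key does not exist).
$key = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion'
if (Test-Path $key) {
    $cv = Get-ItemProperty -Path $key
    if ($cv.CurrentBuild -eq '14393') {
        Test-AffectedBuild -Revision $cv.UBR
    }
}
```

A result of True only indicates that the installed update level is in the affected range. The node must also be running Storage Spaces Direct for the scenario to apply.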

In this scenario, when you restart the node, Event 5120 is logged in the System event log and includes one of the following error codes.

Event log | Event source | ID | Description
System | Microsoft-Windows-FailoverClustering | 5120 | Cluster Shared Volume 'CSVName' ('Cluster Virtual Disk (CSVName)') has entered a paused state because of 'STATUS_IO_TIMEOUT(c00000b5)'. All I/O will temporarily be queued until a path to the volume is reestablished.
System | Microsoft-Windows-FailoverClustering | 5120 | Cluster Shared Volume 'CSVName' ('Cluster Virtual Disk (CSVName)') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.


When an Event 5120 is logged, a live dump is generated to collect debugging information. Generating the live dump may cause additional symptoms or affect performance.

To write the dump file, the system pauses briefly while it takes a snapshot of memory. On systems that have lots of memory and are under stress, this pause may cause nodes to drop out of cluster membership and the following Event 1135 to be logged.

Event log | Event source | ID | Description
System | Microsoft-Windows-FailoverClustering | 1135 | Cluster node 'NODENAME' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
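
To check whether a node has logged these events, you can query the System event log with Get-WinEvent. The following is a minimal sketch. New-ClusterEventFilter is a hypothetical helper name, and the seven-day default window is an arbitrary assumption.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for Events 5120 and 1135
# from the Microsoft-Windows-FailoverClustering source in the System log.
function New-ClusterEventFilter {
    param([datetime] $Since = (Get-Date).AddDays(-7))   # assumed look-back window
    return @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Id           = 5120, 1135
        StartTime    = $Since
    }
}

# Run the query only where Get-WinEvent is available (Windows).
if (Get-Command Get-WinEvent -ErrorAction SilentlyContinue) {
    # SilentlyContinue suppresses the error that is raised when no events match.
    Get-WinEvent -FilterHashtable (New-ClusterEventFilter) -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, Message
}
```

The Message text of each Event 5120 contains the status code, which shows whether the pause was caused by STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED.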

 

Cause


In the May 8, 2018, cumulative update, a change was introduced to add SMB Resilient Handles for the Storage Spaces Direct intra-cluster SMB network sessions. This was done to improve resiliency to transient network failures and improve how RoCE handles network congestion.

These improvements also inadvertently increased the time that SMB connections spend trying to reconnect and waiting to time out when a node is restarted. These issues can affect a system that is under stress. During unplanned downtime, I/O pauses of up to 60 seconds have also been observed while the system waits for connections to time out.

Resolution


To fix this issue, install the October 18, 2018, cumulative update for Windows Server 2016 (KB4462928) or a later version.

Note This update aligns the CSV time-outs with SMB connection time-outs to fix this issue. It does not implement the changes to disable live dump generation mentioned in the Workaround section.

Important When you install the update to fix the issue, you may still experience the issue that is described in the "Symptoms" section. To reduce the chance of experiencing the issue, we recommend that you use the Storage Maintenance Mode procedure in the "Workaround" section to install the October 18, 2018, cumulative update for Windows Server 2016 (KB4462928) or a later version.

Workaround



Shutdown process flow

  1. Run the Get-VirtualDisk cmdlet, and make sure that the HealthStatus value is Healthy.
  2. Drain the node by running the following cmdlet:

    Suspend-ClusterNode -Drain
  3. Put the disks on that node in Storage Maintenance Mode by running the following cmdlet:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode
  4. Run the Get-PhysicalDisk cmdlet, and make sure that the OperationalStatus value is In Maintenance Mode.
  5. Run the Restart-Computer cmdlet to restart the node.
  6. After the node restarts, remove the disks on that node from Storage Maintenance Mode by running the following cmdlet:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode
  7. Resume the node by running the following cmdlet:

    Resume-ClusterNode
  8. Check the status of the resync jobs by running the following cmdlet:

    Get-StorageJob
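
The steps above can be sketched as two functions, one to run before the restart (steps 1 through 4) and one to run after it (steps 6 through 8). This is an illustration only, not a supported script. The function names are hypothetical, and the -Name and -Wait parameters on Suspend-ClusterNode and the -Name parameter on Resume-ClusterNode are additions that do not appear in the steps themselves.

```powershell
# Hypothetical wrapper for steps 1-4: run before restarting the node.
function Enter-NodeStorageMaintenance {
    param([Parameter(Mandatory)] [string] $NodeName)

    # Step 1: stop if any virtual disk is not healthy.
    $unhealthy = @(Get-VirtualDisk | Where-Object { $_.HealthStatus -ne 'Healthy' })
    if ($unhealthy.Count -gt 0) {
        throw "Virtual disks are not healthy. Do not restart $NodeName."
    }

    # Step 2: drain the node (-Wait, an addition, blocks until the drain finishes).
    Suspend-ClusterNode -Name $NodeName -Drain -Wait | Out-Null

    # Step 3: put the disks on that node in Storage Maintenance Mode.
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object { $_.FriendlyName -eq $NodeName } |
        Enable-StorageMaintenanceMode

    # Step 4: return the disks that report In Maintenance Mode for review.
    Get-PhysicalDisk | Where-Object { $_.OperationalStatus -eq 'In Maintenance Mode' }
}

# Hypothetical wrapper for steps 6-8: run after the node has restarted.
function Exit-NodeStorageMaintenance {
    param([Parameter(Mandatory)] [string] $NodeName)

    # Step 6: remove the disks on that node from Storage Maintenance Mode.
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object { $_.FriendlyName -eq $NodeName } |
        Disable-StorageMaintenanceMode

    # Step 7: resume the node.
    Resume-ClusterNode -Name $NodeName | Out-Null

    # Step 8: return the resync jobs so that their progress can be checked.
    Get-StorageJob
}
```

For example, you might run Enter-NodeStorageMaintenance -NodeName "<NodeName>", restart the node by using Restart-Computer (step 5), and then run Exit-NodeStorageMaintenance -NodeName "<NodeName>" after the node is back online.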


Disabling live dumps

To mitigate the effect of live dump generation on systems that have lots of memory and are under stress, you may additionally want to disable live dump generation. Three methods are provided below.


Method 1 (recommended in this scenario)

To completely disable all dumps system-wide, including live dumps, follow these steps:

  1. Create the following registry key:
     

    HKLM\System\CurrentControlSet\Control\CrashControl\ForceDumpsDisabled

  2. Under the new ForceDumpsDisabled key, create a REG_DWORD value named GuardedHost, and then set it to 0x10000000.
  3. Apply the new registry key to each cluster node.
  4. Restart the computer for the change to take effect.

After this registry key is set, live dump creation will fail and generate a "STATUS_NOT_SUPPORTED" error.
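
The registry change in steps 1 and 2 can also be scripted. The following is a minimal sketch that uses the built-in registry provider. Set-ForceDumpsDisabled is a hypothetical function name, and the function must be run in an elevated session on each cluster node.

```powershell
# Hypothetical helper: creates the ForceDumpsDisabled key and the GuardedHost value.
function Set-ForceDumpsDisabled {
    param(
        [string] $KeyPath = 'HKLM:\System\CurrentControlSet\Control\CrashControl\ForceDumpsDisabled'
    )

    # Step 1: create the key (-Force avoids an error if it already exists).
    New-Item -Path $KeyPath -Force | Out-Null

    # Step 2: create the REG_DWORD value GuardedHost and set it to 0x10000000.
    New-ItemProperty -Path $KeyPath -Name 'GuardedHost' -PropertyType DWord -Value 0x10000000 -Force | Out-Null

    # Return the stored value so that it can be verified.
    (Get-ItemProperty -Path $KeyPath -Name 'GuardedHost').GuardedHost
}
```

Defining the function changes nothing. Calling Set-ForceDumpsDisabled writes the value and returns it (268435456 in decimal). The node must still be restarted afterward.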

Method 2

By default, Windows Error Reporting allows only one LiveDump per report type per 7 days and only one LiveDump per computer per 5 days. You can change this behavior by setting the following registry values so that only one LiveDump is ever allowed on the computer.

Start Command Prompt as an administrator, and run the following commands:

reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v SystemThrottleThreshold /t REG_DWORD /d 0xFFFFFFFF /f
reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v ComponentThrottleThreshold /t REG_DWORD /d 0xFFFFFFFF /f

Note You must restart the computer for the change to take effect.

Method 3

To disable cluster generation of live dumps (such as when an Event 5120 is logged), run the following cmdlet:

(Get-Cluster).DumpPolicy = ((Get-Cluster).DumpPolicy -band 0xFFFFFFFFFFFFFFFE)

This cmdlet has an immediate effect on all cluster nodes without a computer restart.
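
The mask in the command above has every bit set except the lowest one, so the -band operation clears only that bit and leaves the rest of the DumpPolicy value unchanged. The following sketch shows the same arithmetic on plain numbers. Clear-LowestPolicyBit is a hypothetical helper, and the sample values are arbitrary and are not documented DumpPolicy defaults.

```powershell
# Hypothetical helper: clears bit 0 of a policy value and keeps all other bits.
function Clear-LowestPolicyBit {
    param([Parameter(Mandatory)] [long] $DumpPolicy)
    # -bnot 1 yields a mask with every bit set except bit 0.
    return ($DumpPolicy -band (-bnot [long]1))
}

Clear-LowestPolicyBit -DumpPolicy 1349   # odd value: bit 0 is cleared
Clear-LowestPolicyBit -DumpPolicy 1348   # even value: returned unchanged
```

To confirm the result on a cluster, read (Get-Cluster).DumpPolicy before and after the change and compare the two values.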