Microsoft has identified a critical issue that affects some Storage Spaces Direct (S2D) users who have hardware that’s based on the Intel P3x00 family of NVM Express (NVMe) devices. This includes P3500, P3600, and P3700 devices in all capacities.
This article provides steps to help you reduce the effect of the misbehaving devices. Until a resolution for this issue is available, we recommend that you update to the latest available firmware for your devices. Contact your hardware vendor to determine whether a more recent firmware version is available.
We are also providing a remedy to restore cluster functionality to affected virtual disks. If you are using one of these NVMe devices in a Storage Spaces Direct cluster, and you have experienced one of the symptoms that are mentioned in the "Symptoms" section, please immediately contact Microsoft Support and your hardware vendor to determine the appropriate remediation steps.
Symptoms
When this issue occurs, your cluster may experience any of the following symptoms:
- Slow workload performance.
- Virtual disks in the cluster that have an Operational Status value of Detached or No Redundancy.
- Physical disks that report a status of Lost Communication or IO Error.
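One way to check for the last two symptoms is to filter disk objects by their Operational Status. The following is a minimal sketch, not an official diagnostic. `Select-SymptomaticDisk` is a hypothetical helper name, and the sketch assumes that the status strings are reported exactly as they are written in the list above.

```powershell
# Sketch: pass through only the disks whose OperationalStatus matches a symptom.
# Select-SymptomaticDisk is a hypothetical helper, not a built-in cmdlet.
function Select-SymptomaticDisk {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [object] $Disk,

        [Parameter(Mandatory = $true)]
        [string[]] $Status
    )
    process {
        # OperationalStatus may hold one value or several; compare each as text.
        foreach ($current in @($Disk.OperationalStatus)) {
            if ($Status -contains "$current") {
                $Disk
                break   # emit each disk once, even if several statuses match
            }
        }
    }
}

# Usage on a cluster node (requires the Storage module):
# Get-VirtualDisk  | Select-SymptomaticDisk -Status 'Detached', 'No Redundancy'
# Get-PhysicalDisk | Select-SymptomaticDisk -Status 'Lost Communication', 'IO Error'
```

An empty result does not rule out the issue, because slow workload performance can occur while all disks still report a healthy status.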
Reducing the latency effect
To reduce the effect of long latencies on your clusters, increase the hardware time-out and retry limits by applying the following registry subkey and cluster settings.
Note: This registry update must be applied on all cluster nodes. The change requires a restart of each cluster node to take effect.
Value Name: HwTimeout
Default value: 0x00001770 as DWORD (6000 as decimal)
Suggested value: 0x00003E80 as DWORD (16000 as decimal)
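The registry change can also be scripted. The following is a minimal sketch under stated assumptions: `Set-HwTimeout` and its `-KeyPath` parameter are hypothetical names, and because this article does not state the registry key path, the caller has to supply it (confirm the path with Microsoft Support or your hardware vendor).

```powershell
# Sketch: write HwTimeout as a DWORD under a registry key that you supply.
# Set-HwTimeout and -KeyPath are hypothetical names used for illustration only.
function Set-HwTimeout {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param(
        [Parameter(Mandatory = $true)]
        [string] $KeyPath,      # registry key path, for example on the HKLM: drive

        [ValidateRange(1, 2147483647)]
        [int] $Value = 16000    # suggested value from this article (decimal)
    )

    if (-not (Test-Path -Path $KeyPath)) {
        throw "Registry key not found: $KeyPath"
    }

    if ($PSCmdlet.ShouldProcess("$KeyPath\HwTimeout", "Set DWORD to $Value")) {
        # -Type is a dynamic parameter supplied by the registry provider.
        Set-ItemProperty -Path $KeyPath -Name 'HwTimeout' -Value $Value -Type DWord
    }

    # Report the value in decimal and as an 8-digit hexadecimal DWORD.
    '{0} (decimal) = 0x{0:X8} (DWORD)' -f $Value
}
```

Running the function with `-WhatIf` previews the change without writing it. The function has to be run on every cluster node, and each node still needs a restart afterward.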
After you increase both the number of attempts and the time delay between attempts, the system will give the device more chances and time to recover and return to a functional state.
Note: This cluster setting has to be applied only one time to the cluster. The change takes effect immediately without requiring a restart.
Get-StorageSubSystem clus* | Set-StorageHealthSetting -Name "System.Storage.PhysicalDisk.Unresponsive.Reset.CountAllowed" -Value 15
Get-StorageSubSystem clus* | Set-StorageHealthSetting -Name "System.Storage.PhysicalDisk.Unresponsive.Reset.CountResetIntervalSeconds" -Value 30
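To confirm that both values were stored, the settings can be read back with `Get-StorageHealthSetting`. The following is a minimal sketch: `Test-HealthSetting` and its `-Reader` parameter are hypothetical names, and the sketch assumes that each returned setting exposes a `Value` property.

```powershell
# The two settings and values applied by the commands above.
$expectedSettings = [ordered]@{
    'System.Storage.PhysicalDisk.Unresponsive.Reset.CountAllowed'              = 15
    'System.Storage.PhysicalDisk.Unresponsive.Reset.CountResetIntervalSeconds' = 30
}

# Test-HealthSetting is a hypothetical helper, not a built-in cmdlet.
function Test-HealthSetting {
    param(
        [Parameter(Mandatory = $true)]
        [System.Collections.IDictionary] $Expected,

        # The default reader queries the cluster; it can be replaced for a dry run.
        [scriptblock] $Reader = {
            param($Name)
            (Get-StorageSubSystem -FriendlyName 'clus*' |
                Get-StorageHealthSetting -Name $Name).Value
        }
    )

    foreach ($settingName in @($Expected.Keys)) {
        $actual = & $Reader $settingName
        [pscustomobject]@{
            Name     = $settingName
            Expected = $Expected[$settingName]
            Actual   = $actual
            # Compare as text because the stored value may be returned as a string.
            Match    = ("$actual" -eq "$($Expected[$settingName])")
        }
    }
}

# Usage on a cluster node (requires the Storage module):
# Test-HealthSetting -Expected $expectedSettings | Format-Table -AutoSize
```

A row with `Match` set to `False` suggests that the corresponding `Set-StorageHealthSetting` command did not take effect and should be run again.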
Cause
Microsoft has observed reports of unexpectedly long tail latencies for the Intel P3x00 family of NVMe devices. In some cases, these latencies exceed 30 seconds. This can cause Windows to mark the device as unresponsive.
After multiple unsuccessful attempts to reuse the hardware, Windows stops using the device within the cluster. If enough devices reach the hardware time-out and retry limits, the availability of virtual disks can be affected. This may cause downtime for the cluster.