KB5020450: Node drain failures occur in large cluster scenarios in Azure Stack HCI, version 21H2 and 22H2

Summary

When you use the drain-roles feature in the Azure Stack HCI, version 21H2 or 22H2 operating system, a node drain might fail on large clusters (such as clusters of eight or more nodes) because of a time-out when storage is put into maintenance mode. This issue is most likely to occur when you update or upgrade the Azure Stack HCI operating system.

More information

To resolve the drain failure time-out, follow these steps:

1. Before you enable maintenance mode or run any operation that involves maintenance mode, such as a node drain or Cluster-Aware Updating, increase the Health Service physical disk scanning interval. To do this, change the health setting by running the following command:

   Get-StorageSubSystem Cluster* | Set-StorageHealthSetting -Name System.Storage.PhysicalDisk.CheckPeriodMs -Value 10800000

   Note: This example increases the value from fifteen minutes to three hours (10,800,000 milliseconds). Adjust the value so that it is longer than the expected duration of the workflow that involves maintenance mode.

2. Wait for any ongoing scans to finish. The exact duration depends on the environment. On a 16-node cluster, it might take forty to sixty minutes. To verify that all existing scans have finished, check the Health Service log on the owner node of the “SDDC Group” and search for the following pattern:

   'Maintenance Mode Event Interpreter' is interpreting Event Type - Origin 'Storage', EntityType 'SPACES_PhysicalDisk'.

   Note: If no such entry was logged within the last minute, all scans have finished. To retrieve the health log, run the following command:

   Get-ClusterLog -Destination . -TimeSpan 5 -UseLocalTime -Health

3. Run the maintenance mode operation or other workflow that involves maintenance mode.

4. Revert the health setting to its original value. This is important because a long scanning interval can delay some Health Service functionality, such as the reporting of physical disk errors or disk retirement. To revert the health setting, run the following command:

   Get-StorageSubSystem Cluster* | Remove-StorageHealthSetting -Name System.Storage.PhysicalDisk.CheckPeriodMs
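
The steps above can be combined so that the health setting is always reverted, even if the maintenance workflow fails partway through. The following is a minimal sketch, not a Microsoft-provided script. The function names ConvertTo-CheckPeriodMs and Invoke-WithExtendedDiskScanInterval are hypothetical helpers introduced here for illustration, as is the node name in the usage comment. The sketch does not automate the wait for ongoing scans to finish, so that check still has to be done before the actual drain starts.

```powershell
# Hypothetical helper: convert a duration into the millisecond value
# that System.Storage.PhysicalDisk.CheckPeriodMs expects.
function ConvertTo-CheckPeriodMs {
    param([Parameter(Mandatory)][timespan]$Duration)
    return [long]$Duration.TotalMilliseconds
}

# Hypothetical wrapper: raise the scan interval, run the supplied workflow,
# and remove the override afterward, whether or not the workflow succeeded.
function Invoke-WithExtendedDiskScanInterval {
    param(
        [Parameter(Mandatory)][timespan]$Interval,
        [Parameter(Mandatory)][scriptblock]$Workflow
    )
    $name = 'System.Storage.PhysicalDisk.CheckPeriodMs'
    $subsystem = Get-StorageSubSystem Cluster*

    # Increase the physical disk scanning interval.
    $subsystem | Set-StorageHealthSetting -Name $name -Value (ConvertTo-CheckPeriodMs -Duration $Interval)
    try {
        # The workflow should first confirm that ongoing scans have finished
        # (by checking the health log), and then run the maintenance operation.
        & $Workflow
    }
    finally {
        # Revert to the default interval by removing the override.
        $subsystem | Remove-StorageHealthSetting -Name $name
    }
}

# Example usage (requires a cluster; 'Node01' is a placeholder node name):
# Invoke-WithExtendedDiskScanInterval -Interval (New-TimeSpan -Hours 3) -Workflow {
#     Suspend-ClusterNode -Name 'Node01' -Drain -Wait
# }
```

Choose an interval that is longer than the whole workflow, including the time spent waiting for scans to finish, because the override stays in place until the finally block runs.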

References

Failover cluster maintenance procedures

Learn about the standard terminology that is used to describe Microsoft software updates.
