FIX: Intermittent Availability Group failover occurs as AG helper connection times out while connecting to SQL Server

Applies to: SQL Server 2017 Developer Linux, SQL Server 2017 Enterprise on Linux, SQL Server 2017 Enterprise Core on Linux

Symptoms


Assume that you have configured an Always On Availability Group by using Pacemaker for SQL Server 2017 or 2019 on Linux. You notice that intermittent Availability Group failovers occur because the AG helper connection times out while connecting to SQL Server.

Status


Microsoft has confirmed that this is a problem in the Microsoft products that are listed in the "Applies to" section. 

Resolution


This issue is fixed in the following cumulative update for SQL Server:
Cumulative Update 21 for SQL Server 2017

About cumulative updates for SQL Server:
Each new cumulative update for SQL Server contains all the hotfixes and all the security fixes that were included with the previous cumulative update.

More Information


Assume that you have configured an Availability Group (AG) by using Pacemaker for SQL Server 2017 or 2019 on Linux, and that the Pacemaker AG helper resource agent uses the following cluster configuration. For the health check, the AG helper uses an interval of 10 seconds, a connection timeout of 30 seconds, and a monitor timeout of 90 seconds.

      <master id="ha_cluster-master">
        <primitive class="ocf" id="ha_cluster" provider="mssql" type="ag">
          <instance_attributes id="ha_cluster-instance_attributes">
            <nvpair id="ha_cluster-instance_attributes-ha_name" name="ha_name" value="TEST_AG"/>
            <nvpair id="ha_cluster-instance_attributes-trace_ra" name="trace_ra" value="1"/>
          </instance_attributes>
          <operations>
            <op id="ha_cluster-demote-interval-0s" interval="0s" name="demote" timeout="300"/>
            <op id="ha_cluster-monitor-interval-60s" interval="60s" name="monitor" timeout="100"/>
            <op id="ha_cluster-monitor-interval-11" interval="10" name="monitor" role="Master" timeout="90"/>
            <op id="ha_cluster-monitor-interval-12" interval="12" name="monitor" role="Slave" timeout="60"/>
            <op id="ha_cluster-notify-interval-0s" interval="0s" name="notify" timeout="60"/>
            <op id="ha_cluster-promote-interval-0s" interval="0s" name="promote" timeout="60"/>
            <op id="ha_cluster-start-interval-0s" interval="0s" name="start" timeout="60"/>
            <op id="ha_cluster-stop-interval-0s" interval="0s" name="stop" timeout="300"/>
          </operations>
          <meta_attributes id="ha_cluster-meta_attributes">
            <nvpair id="ha_cluster-meta_attributes-timeout" name="timeout" value="30s"/>
            <nvpair id="ha_cluster-meta_attributes-failure-timeout" name="failure-timeout" value="60s"/>
          </meta_attributes>
        </primitive>
        <meta_attributes id="ha_cluster-master-meta_attributes">
          <nvpair id="ha_cluster-master-meta_attributes-notify" name="notify" value="true"/>
          <nvpair id="ha_cluster-master-meta_attributes-trace_ra" name="trace_ra" value="1"/>
        </meta_attributes>
      </master>
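
The three health check values described above can be read back out of the configuration. The following Python sketch is illustrative only: it parses a trimmed copy of the fragment with the standard library and assumes that the `timeout` meta attribute (30s) is the connection timeout and that the `role="Master"` monitor operation carries the interval and monitor timeout. The helper names `seconds` and `health_check_settings` are hypothetical and are not part of the resource agent.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration above; only the elements read below are kept.
CIB_FRAGMENT = """
<master id="ha_cluster-master">
  <primitive class="ocf" id="ha_cluster" provider="mssql" type="ag">
    <operations>
      <op id="ha_cluster-monitor-interval-60s" interval="60s" name="monitor" timeout="100"/>
      <op id="ha_cluster-monitor-interval-11" interval="10" name="monitor" role="Master" timeout="90"/>
      <op id="ha_cluster-monitor-interval-12" interval="12" name="monitor" role="Slave" timeout="60"/>
    </operations>
    <meta_attributes id="ha_cluster-meta_attributes">
      <nvpair id="ha_cluster-meta_attributes-timeout" name="timeout" value="30s"/>
      <nvpair id="ha_cluster-meta_attributes-failure-timeout" name="failure-timeout" value="60s"/>
    </meta_attributes>
  </primitive>
</master>
"""


def seconds(value):
    """Convert a value such as '30s' or '90' to an integer number of seconds."""
    return int(value.rstrip("s"))


def health_check_settings(xml_text):
    """Return the interval and timeouts used for the primary's health check."""
    root = ET.fromstring(xml_text)
    primitive = root.find("primitive")
    # Monitor operation that runs on the primary replica (role="Master").
    op = primitive.find("operations/op[@name='monitor'][@role='Master']")
    # Assumption: the meta attribute named "timeout" is the connection timeout.
    nv = primitive.find("meta_attributes/nvpair[@name='timeout']")
    return {
        "interval": seconds(op.get("interval")),
        "monitor_timeout": seconds(op.get("timeout")),
        "connection_timeout": seconds(nv.get("value")),
    }


settings = health_check_settings(CIB_FRAGMENT)
print(settings)
# → {'interval': 10, 'monitor_timeout': 90, 'connection_timeout': 30}
```

With these values, three 30-second connection attempts fit exactly within the 90-second monitor timeout.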

Prior to Cumulative Update 21 (CU21) for SQL Server 2017, if the AG health check connection timed out while connecting to SQL Server, a demote action was initiated, which led to failover of the AG to the secondary node.
 
From CU21 onwards, if a connection timeout occurs, the AG helper resource agent honors the monitor timeout of 90 seconds and attempts two more connections. If all three connection attempts fail, the AG helper resource agent declares SQL Server unresponsive and starts the demote action, which leads to failover of the Availability Group to the secondary node.
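
The difference between the two behaviors can be sketched as a small simulation. This is not the resource agent's actual implementation: `run_monitor` and `make_fake_connect` are hypothetical names, the connection is a stand-in that simply raises `TimeoutError`, and the default of three attempts is taken from the description above (3 × 30 seconds fits within the 90-second monitor timeout).

```python
def make_fake_connect(failures):
    """Hypothetical stand-in for a SQL Server connection that times out `failures` times."""
    state = {"left": failures}

    def connect(timeout):
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError(f"no response within {timeout}s")

    return connect


def run_monitor(connect, max_attempts=3, connection_timeout=30, monitor_timeout=90):
    """Return ('healthy', attempts_used) if any attempt succeeds, else ('demote', attempts_used)."""
    # All attempts must fit inside the monitor timeout.
    if max_attempts * connection_timeout > monitor_timeout:
        raise ValueError("connection attempts would exceed the monitor timeout")
    for attempt in range(1, max_attempts + 1):
        try:
            connect(timeout=connection_timeout)
            return "healthy", attempt
        except TimeoutError:
            continue  # try again while the monitor timeout still allows it
    return "demote", max_attempts


# CU21 and later: two timeouts are tolerated because the third attempt succeeds.
print(run_monitor(make_fake_connect(2)))                   # → ('healthy', 3)
# CU21 and later: three consecutive timeouts trigger the demote action.
print(run_monitor(make_fake_connect(3)))                   # → ('demote', 3)
# Before CU21 (modeled here as a single attempt): one timeout is enough to demote.
print(run_monitor(make_fake_connect(1), max_attempts=1))   # → ('demote', 1)
```

In this model, a single transient timeout no longer causes a failover. Only a SQL Server instance that stays unresponsive for the whole monitor window is demoted.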
