Saturday, July 11, 2015

Availability Groups Failover Scenarios


Having worked with Availability Groups (AG) and dealt with issues at cluster and AG levels, I felt the need to do a write up on different failover scenarios and remediation in each scenario. This is one article I take pride in sharing. Hopefully, someone will find it useful.
This article assumes that reader already has basic knowledge/understanding of AGs and Windows clustering.

Most of the scenarios presented have been tested on setup with Windows 2012 Enterprise edition and SQL 2014 Enterprise edition.

Brief on Availability Groups


Availability Groups use a combination of Mirroring (though a more efficient form) at SQL level and Clustering at Windows level to provide high availability/DR capability to SQL databases. Mirroring keeps data between primary and secondary databases of AG in sync. Windows Clustering keeps track of status of AGs and manages failover of AGs between nodes. SQL server is responsible for bringing the DBs online after Windows cluster fails over AG to a node. Multiple DBs can be part of one AG and they failover as one unit, either all or none can failover.

Databases of an AG failover; SQL instances do not fail over. In fact, you can install SQL instance first and setup clustering later.
There is option available to use a combination of traditional SQL clustering (Failover Clustered Instance FCI) and AGs between clustered instances. However, automatic failover of AG is not available in this option.

I have found that it is even more important for DBAs to understanding Windows Clustering now with AGs than it was with FCI. This helps in understanding issues, troubleshooting them, and better communication between DBA and Server teams. Let us map configuration of AG in SQL server to what happens in cluster at Windows level.

Availability Groups and Windows Cluster


Setup of Windows Cluster is a pre-requisite for setting up AGs in SQL.

Figure 1 shows screen shot of AG. An AG TestAG has been created with NodeA as primary replica and NodeB as secondary replica. An AG listener TestAGLSNR has been created with two IPs, one for subnet of NodeA and other for subnet of NodeB. The two IPs are shown in Figure 2.
 
Figure 1

Figure 2

Figure 3 shows properties of AG. The two replicas are setup with Synchronous Commit and Automatic Failover modes. Currently the AG databases are online on NodeA and so that is in Primary role. Note that role Primary and Secondary switch when AG is failed over to NodeB.

Figure 3

Figure 4 shows Failover Cluster Manager console. Let us see how AG configuration at SQL shows up as resources in cluster. The AG name TestAG shows up as a Role and a resource within role as Other Resources. AG listener TestAGLSNR shows up as Server Name, actually it’s a virtual network name. IP addresses assigned to TestAGLSNR show up as IPs under the virtual network name with one IP online. This is the IP that belongs to subnet of node where AG resides at the moment. The other IP is offline until AG fails over to the other node in which case the offline IP will become online and current online IP will go offline.

Figure 4

Automatic failover setting at AG level shows up as checkmarks in “Possible Owners” setting for AG resource “TestAG” at cluster level as in Figure 5. 



Figure 5

Both primary and secondary nodes are check marked as possible owners of AG resource in case of automatic failover.
In case of manual failover, only the node that is primary replica at the time is check marked as Possible Owner. In manual failover configuration if you failover AG to another node, the new primary replica node is check marked as Possible Owner and the previous primary node is unchecked for possible owner.

Requirements for an AG to failover

  1. AGs have to be setup with automatic failover.
  2. At the time of failure:
    • Windows Cluster should be up 
    • Cluster service should be running on secondary node. This can happen only if secondary node forms new quorum with FSW or odd node. 
    • Secondary node server remain part of cluster 
    • All Databases in AG should be in “SYNCHRONIZED” state on secondary replica. Note that for failover, status of databases on secondary side is important. Primary replica keeps showing its database as “SYNCHRONIZED” even when secondary replicas are not receiving any data and showing secondary databases as “NOT SYNCHRONIZING”. DBs could stop synchronizing due to
      • Network disconnect or secondary replicas not responding
      • Secondary replica DB running out of space
      • Primary or secondary DB goes offline
      • HADR endpoint issues
      • Other types of corruption in prod/DR
  3. As we will see soon, not all failures lead to failover
Note that in point 2, first three conditions are at cluster level and last condition is at SQL server level.

Availability Group Failover Scenarios


Setup for testing the scenarios is given in figure 6.


Figure 6

A Windows cluster with two nodes and a File Share Witness (FSW) has been setup. Each node and FSW is in different location and subnet. Each node can reach FSW independently. An AG is setup between primary and secondary replica with synchronous commit and automatic failover. Secondary replica is setup to be non-readable.

Let us look at some of the failure scenarios on this setup. Each scenario has a combination of state of synchronization of DBs at secondary replica and type of failure that happens.


 
#
 
Scenario
 
Result
 
Remediation
1
All DBs in AG are Synchronized
Primary SQL instance shuts down or Host goes down
DBs part of AG failover automatically to secondary replica
2
One or more DBs in AG are not Synchronized
Primary SQL instance Down
DBs fail back and forth between preferred replicas and settle in failed state on the replica where its maximum number of failovers is exhausted.
Force failover AG with data loss.
3
One or more DBs in AG are not Synchronized
Primary Host goes down
DBs fail over to preferred replica but fail to come up.
Force failover AG with data loss.
4
All DBs in AG are Synchronized
Primary node’s Cluster service shuts down
DBs part of AG failover automatically
5
One or more DBs in AG are not Synchronized
Primary node’s Cluster service shuts down
DBs fail over to preferred replica but, fail to come up.
Force failover AG with data loss
6
State of synchronization of DBs does not have significance in this case.
Primary or Secondary node’s HADR Endpoint goes down
No failover. DBs stop synchronizing.
(In mirroring, if endpoint goes down on principal and witness was setup, database would failover to mirror side )
Force failover AG with data loss
7
State of synchronization of DBs does not have significance in this case.
DB on Primary SQL instance goes offline or becomes unavailable due to one of its files being unavailable
No failover happens. DBs stop synchronizing
Force failover AG with data loss
8
DBs in sync but at some point stop synchronizing. Could be due to secondary running out of space.
Primary stops truncating log and synchronization also stop. DBs in Not Synchronizing state.
Resolve the cause of issue on secondary. If DBs don't start synchronizing, remove secondary DB from AG on secondary replica. Take full and tran log backup of primary replica and restore to secondary and add it back to AG.
9
DBs in Synchronized state.
Primary sends a commit transaction log record to Secondary. Secondary hardens the log record and sends acknowledgment back to Primary. Primary goes down before it could send commit message to client/application.
Auto-failover happens as secondary thinks it is in sync with primary. Applications that are resilient and rerun their transactions after failover of DBs will re-run the transaction on secondary unless they check for pre-transaction data.
Extremely rare scenario. Nothing can be done about it except for application to check pre transaction data before re-running the transaction again.
Let us discuss each of the scenarios in detail. I find following logic of thinking really useful in understanding what will happen after failure has occurred
  1. What happens to the windows cluster quorum on each node
  2. What actions cluster will take based on how it is setup
  3. Once cluster has taken action what actions SQL will take. Even if cluster fails over AGs, SQL on the failover node may not bring member DBs online.

Failure Scenario 1: Databases in AG are in SYNCHRONIZED state and primary node or primary SQL instance goes down.
This is a straight forward scenario. If primary SQL instance goes down, cluster service detects that AG is not available anymore and it will failover AG to secondary replica. If the primary node goes down, secondary node and file share witness still form quorum for cluster to operate. In this case as well, cluster service will failover AG to secondary node.

Since databases were in SYNCHRONIZED state on secondary replica before failure, SQL server will recover databases and bring them online on secondary replica and that becomes primary replica. Once the old primary replica comes up, Cluster service will let SQL know that the AGs are primary on the other node and so SQL on old primary will change itself to secondary replica keeping its databases unavailable for access but, able to receive updates from new primary replica.

Failure Scenario 2: One or more databases are not in SYNCHRONIZED state on secondary replica and primary SQL instance goes down. Primary node is still up and part of cluster.
Since the primary node is still up, cluster quorum has not changed at all. Cluster service will see that AG resources have gone down and it will start failover to secondary replica. SQL on secondary node will however fail to bring DBs online since the DBs were not in SYNCHRONIZED at the time of failover. Cluster service will see that the AG resources failed to come online on secondary node. It will fail the AG back to primary node, where AG cannot come up anyway since SQL itself is down. Cluster service will keep failing back and forth the AG until the maximum number of failures is exhausted. This is a setting in role definition of AG in cluster as in Figure 7. After maximum number of failures is exhausted, cluster service will stop failover of AG and AG will remain in failed state on the node where it was when this number got exhausted. If secondary DBs were setup with being readable, they will continue to be readable but will not be available for write.

Figure 7
If you want to bring databases online on secondary, force failover AGs with possibility of data loss.

Failure Scenario 3: One or more databases are not in SYNCHRONIZED state and primary node and in turn primary SQL instance also go down.
Secondary node will detect that primary node has gone down and will try to establish new quorum by trying to lock file in File Share Witness. It will be successful in doing so and maintaining quorum and keep the cluster up. Cluster service would now failover AG to secondary node where it will fail since SQL will not bring DBs up due to DBs being not in SYNCHRONIZED state when failure happened. If secondary DBs were setup with being readable, they will continue to be readable but will not be available for write.

If you want to bring databases online on secondary, force failover AGs with possibility of data loss.

Failure Scenario 4: Databases of AG in SYNCHRONIZED state and Primary node’s cluster service goes down for some reason.
Secondary node will detect that primary node’s cluster service is not responding and it will try to establish new quorum by trying to lock file in File Share Witness. It will be successful in doing so and maintaining quorum and keep the cluster up. Cluster service would now failover AG to secondary node where SQL will bring database up DBs were in SYNCHRONIZED state when failure happened. Primary SQL instance will make its databases that were part of this AG unavailable.

Failure Scenario 5: One or more databases of AG are not in SYNCHRONIZED state and Primary node’s cluster service goes down for some reason.
Secondary node will detect that primary node’s cluster service is not responding and it will try to establish new quorum by trying to lock file in File Share Witness. It will be successful in doing so and maintaining quorum and keep the cluster up. Cluster service would now failover AG to secondary node where SQL will fail to bring database online as DBs were not in SYNCHRONIZED state when failure happened. If secondary DBs were setup with being readable, they will continue to be readable but will not be available for write. Primary SQL instance will make its databases that were part of this AG unavailable.

Failure Scenario 6: Primary or Secondary node’s HADR Endpoint goes down
If HADR endpoint of primary or secondary SQL server goes down, databases stop synchronizing data but, primary databases do not stop working. They keep doing transactions without those transactions going to secondary replica. The AG remains up although in unhealthy state. Cluster service does not see any resource being down and so it does not failover resources. This illustrates that configuring synchronous commit with auto failover does not guarantee that secondary will always be in sync with primary. Once the endpoint on primary comes up, databases will start synchronizing data.

If you want to bring databases online on secondary replica, force failover AGs with possibility of data loss.
This scenario is interesting compared to mirroring with witness. If Mirroring endpoint on principal only fails, the mirror server will failover databases to itself and become principal. This is due to the fact that mirror and witness can both see each other(on endopints) but not the principal. All their checks happen on mirroring endpoints. After end point on old principal comes up, it will take its mirrored databases offline and become mirror to receive updates from new principal.

In case mirroring endpoint on mirror only fails, primary continues to run transactions and keep its databases up since it can see witness.

Failure Scenario 7: DBs part of AG on Primary SQL instance go offline or become unavailable due to one of its files being unavailable.
Surprisingly, failover does not happen in this case either. The problem databases go out of “SYNCHRONIZED” state on secondary while other databases keep doing transactions. AG resources do not go down and so the cluster service does not attempt failover of AGs.

If you want to failover AGs in this scenario, it will be a manual failover with possibility of data loss, which SQL will also tell you when you failover. SQL will also show which databases could lose data and which won’t.

Failure Scenario 8: DBs of AG were in sync but at some point stop synchronizing. This could be due to secondary running out of space.
Databases are obviously not synchronizing at this time but, primary keeps the databases running and transactions keep going through.

Failure Scenario 9: DBs in Synchronized state. Primary sends a commit transaction log record to Secondary. Secondary hardens the log record and sends acknowledgment back to Primary. Primary goes down before it could send commit message to client/application.
This is a rare scenario that I could not reproduce. But in my discussions with other SQL experts, they have expressed that this is very much possible.

You may be thinking that I did not cover network failures at site level in failure scenarios. Well, it was on purpose. I would like to cover that along with a perspective of where FSW can be placed in the next section. You could replace FSW with an odd node and all the points discussed below would still apply.

Placement of File Share Witness (FSW) and Impact on Failover


All scenarios below assume that databases are “SYNCHRONIZED” on secondary side when network disconnect happens.

Referring figure 8, FSW is placed at a third site with independent links from production and DR sites. Double sided arrows represent network connection. If production site loses network connection, secondary node and file share witness form new quorum and cluster fails over databases to secondary replica. Primary replica makes its databases unavailable. If DR site loses its network connection instead, primary replica and FSW will form new quorum and keep databases running on primary replica.


Figure 8

Now, let us consider that third site is not available for placing FSW on. This is the situation in several companies.
Referring figure 9, FSW is placed in DR site. If primary site loses network connection, secondary replica and FSW form new quorum and failover AGs to secondary replica.

Figure 9

Referring Figure 10, DR site loses network connection. Secondary replica can get lock on FSW and form new quorum. Cluster fails over AGs to secondary replica and SQL brings databases online. Primary replica is isolated, its cluster service stops, and SQL on primary replica takes its databases offline. This is a risk with keeping FSW on DR site. It could cause outage on production site even if the issue happens on DR site. If production site loses its network connection, databases will failover to DR site. It ensures automatic failover of AGs in case production site goes down.
In this scenario, if DR site outage extends and you want to bring up databases on production site, you would need to first force quorum on primary replica so the cluster can start on it and then force failover AG with data loss in SQL.

Figure 10
Now consider placing FSW at production site as in Figure 11. If production site loses network connection or the site itself is destroyed, Secondary node will be isolated as it cannot connect to both primary node and FSW. Cluster service on secondary node will go down and AGs will not failover to it. If it is just a network disconnect, AG will remain on primary replica in production site and DBs will also remain online there. If users cannot reach production site due to it losing network, it will cause outage. Placing FSW at production site removes the possibility of automatic failover of AGs to DR site. If production site network outage extends and you want to bring DBs online on DR site, you would need to first force quorum on secondary node so the cluster can start on it and then force failover AG with data loss in SQL.

Figure 11
Placing FSW at production site removes automatic failover to DR site in case of Disaster. Placing FSW at DR site introduces possibility of unnecessary outage even when there are no issues on production site. Ideal location for FSW or odd node is a third site with independent links to both production and DR.

As you can see, windows cluster and SQL each have an impact on failover of AGs between nodes. As a DBA, I consider it essential to understand how Windows cluster works before one can design and support AG infrastructure effectively.