Having worked with Availability Groups (AG) and dealt with issues at both the cluster and AG levels, I felt the need to do a write-up on the different failover scenarios and the remediation in each scenario. This is one article I take pride in sharing. Hopefully, someone will find it useful.
This article assumes that the reader already has a basic knowledge/understanding of AGs and Windows clustering. Most of the scenarios presented have been tested on a setup with Windows 2012 Enterprise edition and SQL 2014 Enterprise edition.
Brief on Availability Groups
Availability Groups use a combination of mirroring (in a more efficient form) at the SQL Server level and clustering at the Windows level to provide high availability/DR capability to SQL Server databases. Mirroring keeps the data between the primary and secondary databases of the AG in sync. Windows clustering keeps track of the status of AGs and manages failover of AGs between nodes. SQL Server is responsible for bringing the DBs online after the Windows cluster fails over the AG to a node. Multiple DBs can be part of one AG, and they fail over as one unit: either all of them fail over or none do.
Databases of an AG fail over; SQL Server instances do not. In fact, you can install the SQL Server instance first and set up clustering later.
There is an option to use a combination of traditional SQL Server clustering (Failover Clustered Instance, FCI) and AGs between clustered instances. However, automatic failover of the AG is not available in this configuration.
I have found that it is even more important for DBAs to understand Windows clustering now with AGs than it was with FCIs. This helps in understanding issues, troubleshooting them, and communicating better between the DBA and server teams. Let us map the configuration of an AG in SQL Server to what happens in the cluster at the Windows level.
Availability Groups and Windows Cluster
Setting up a Windows cluster is a prerequisite for setting up AGs in SQL Server.
Figure 1 shows a screenshot of the AG. An AG named TestAG has been created with NodeA as the primary replica and NodeB as the secondary replica. An AG listener TestAGLSNR has been created with two IPs, one for the subnet of NodeA and the other for the subnet of NodeB. The two IPs are shown in Figure 2.
Figure 1
Figure 2
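For orientation, here is a minimal T-SQL sketch of how an AG and listener like this one could be created. The database name, endpoint URLs, listener IPs, and subnet masks are hypothetical placeholders, not the actual values of this setup:

```sql
-- Run on the primary replica (NodeA). Database name, endpoint URLs,
-- listener IPs, and subnet masks are hypothetical placeholders.
CREATE AVAILABILITY GROUP TestAG
FOR DATABASE [TestDB]
REPLICA ON
    N'NodeA' WITH (
        ENDPOINT_URL      = N'TCP://NodeA.corp.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'NodeB' WITH (
        ENDPOINT_URL      = N'TCP://NodeB.corp.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC);
-- (The secondary must also join: run ALTER AVAILABILITY GROUP TestAG JOIN; on NodeB.)

-- One listener IP per subnet; the cluster brings online only the IP
-- that matches the current primary's subnet.
ALTER AVAILABILITY GROUP TestAG
ADD LISTENER N'TestAGLSNR' (
    WITH IP ((N'10.1.0.50', N'255.255.255.0'),
             (N'10.2.0.50', N'255.255.255.0')),
    PORT = 1433);
```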
Figure 3 shows the properties of the AG. The two replicas are set up with Synchronous Commit and Automatic Failover modes. Currently the AG databases are online on NodeA, so it is in the Primary role. Note that the Primary and Secondary roles switch when the AG is failed over to NodeB.
Figure 3
Figure 4 shows the Failover Cluster Manager console. Let us see how the AG configuration in SQL Server shows up as resources in the cluster. The AG name TestAG shows up as a Role, with a resource within the role under Other Resources. The AG listener TestAGLSNR shows up as a Server Name; it is actually a virtual network name. The IP addresses assigned to TestAGLSNR show up as IPs under the virtual network name, with one IP online. This is the IP that belongs to the subnet of the node where the AG resides at the moment. The other IP stays offline until the AG fails over to the other node, at which point the offline IP comes online and the currently online IP goes offline.
Figure 4
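You can see the same listener IP states from SQL Server; a small sketch against the listener catalog views (nothing here is specific to this setup):

```sql
-- Each listener IP and whether it is currently online; only the IP
-- in the current primary's subnet should show ONLINE.
SELECT l.dns_name,
       l.port,
       ip.ip_address,
       ip.state_desc
FROM sys.availability_group_listeners AS l
JOIN sys.availability_group_listener_ip_addresses AS ip
     ON ip.listener_id = l.listener_id;
```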
The automatic failover setting at the AG level shows up as checkmarks in the "Possible Owners" setting for the AG resource "TestAG" at the cluster level, as in Figure 5.
Figure 5
With automatic failover, both the primary and secondary nodes are check marked as possible owners of the AG resource.
With manual failover, only the node that is the primary replica at the time is check marked as a Possible Owner. In a manual failover configuration, if you fail over the AG to another node, the new primary replica node is check marked as a Possible Owner and the previous primary node is unchecked.
Requirements for an AG to failover
- AGs have to be set up with automatic failover.
- At the time of failure:
- The Windows cluster must be up.
- The cluster service must be running on the secondary node. This can happen only if the secondary node forms a new quorum with the FSW or an odd node.
- The secondary node must remain part of the cluster.
- All databases in the AG must be in the "SYNCHRONIZED" state on the secondary replica. Note that for failover, the status of the databases on the secondary side is what matters. The primary replica keeps showing its databases as "SYNCHRONIZED" even when secondary replicas are not receiving any data and show the secondary databases as "NOT SYNCHRONIZING". DBs could stop synchronizing due to:
- Network disconnect or secondary replicas not responding
- Secondary replica DB running out of space
- Primary or secondary DB going offline
- HADR endpoint issues
- Other types of corruption in prod/DR
As we will see soon, not all failures lead to failover. A quick way to check the synchronization requirement is sketched below.
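This is a minimal T-SQL sketch, run on the primary replica, joining the standard AG DMVs; the column names are standard and nothing here is specific to this setup:

```sql
-- Synchronization state of every AG database on every replica.
-- For automatic failover, every database on the secondary replica
-- must show SYNCHRONIZED.
SELECT ag.name                  AS ag_name,
       ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id;
```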
Availability Group Failover Scenarios
The setup for testing the scenarios is given in Figure 6.
Figure 6
A Windows cluster with two nodes and a File Share Witness (FSW) has been set up. Each node and the FSW are in different locations and subnets. Each node can reach the FSW independently. An AG is set up between the primary and secondary replicas with synchronous commit and automatic failover. The secondary replica is set up to be non-readable.
Let us look at some of the failure scenarios on this setup. Each scenario is a combination of the synchronization state of the DBs at the secondary replica and the type of failure that happens.
The table below summarizes each scenario, the result, and the remediation where one applies.
| # | Scenario | Result | Remediation |
| --- | --- | --- | --- |
| 1 | All DBs in the AG are Synchronized. Primary SQL instance shuts down or host goes down. | DBs that are part of the AG fail over automatically to the secondary replica. | |
| 2 | One or more DBs in the AG are not Synchronized. Primary SQL instance down. | DBs fail back and forth between preferred replicas and settle in a failed state on the replica where the maximum number of failovers is exhausted. | Force failover of the AG with data loss. |
| 3 | One or more DBs in the AG are not Synchronized. Primary host goes down. | DBs fail over to the preferred replica but fail to come up. | Force failover of the AG with data loss. |
| 4 | All DBs in the AG are Synchronized. Primary node's cluster service shuts down. | DBs that are part of the AG fail over automatically. | |
| 5 | One or more DBs in the AG are not Synchronized. Primary node's cluster service shuts down. | DBs fail over to the preferred replica but fail to come up. | Force failover of the AG with data loss. |
| 6 | The synchronization state of the DBs is not significant in this case. Primary or secondary node's HADR endpoint goes down. | No failover. DBs stop synchronizing. (In mirroring, if the endpoint went down on the principal and a witness was set up, the database would fail over to the mirror side.) | Force failover of the AG with data loss. |
| 7 | The synchronization state of the DBs is not significant in this case. A DB on the primary SQL instance goes offline or becomes unavailable due to one of its files being unavailable. | No failover happens. DBs stop synchronizing. | Force failover of the AG with data loss. |
| 8 | DBs in sync but at some point stop synchronizing. Could be due to the secondary running out of space. | Primary stops truncating its log and synchronization also stops. DBs are in the Not Synchronizing state. | Resolve the cause of the issue on the secondary. If DBs don't start synchronizing, remove the secondary DB from the AG on the secondary replica. Take full and transaction log backups of the primary replica, restore them to the secondary, and add it back to the AG. |
| 9 | DBs in the Synchronized state. Primary sends a commit transaction log record to the secondary. The secondary hardens the log record and sends an acknowledgment back to the primary. The primary goes down before it can send the commit message to the client/application. | Auto-failover happens, as the secondary thinks it is in sync with the primary. Applications that are resilient and rerun their transactions after failover will re-run the transaction on the secondary unless they check for pre-transaction data. | Extremely rare scenario. Nothing can be done about it except for the application to check pre-transaction data before re-running the transaction. |
Let us discuss each of the scenarios in detail. When working out what will happen after a failure has occurred, I find it useful to think through the following:
- What happens to the Windows cluster quorum on each node
- What actions the cluster will take based on how it is set up
- Once the cluster has taken action, what actions SQL Server will take. Even if the cluster fails over AGs, SQL Server on the failover node may not bring the member DBs online.
Failure Scenario 1:
Databases in AG are in SYNCHRONIZED state and primary node or primary SQL
instance goes down.
This is a straightforward scenario. If the primary SQL instance goes down, the cluster service detects that the AG is no longer available and fails over the AG to the secondary replica. If the primary node goes down, the secondary node and the file share witness still form quorum for the cluster to operate. In this case as well, the cluster service will fail over the AG to the secondary node.
Since the databases were in the SYNCHRONIZED state on the secondary replica before the failure, SQL Server will recover the databases and bring them online on the secondary replica, which then becomes the primary replica. Once the old primary replica comes back up, the cluster service will let SQL Server know that the AG is primary on the other node, so SQL Server on the old primary will change itself to a secondary replica, keeping its databases unavailable for access but able to receive updates from the new primary replica.
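To confirm which node holds which role after a failover, you can query the replica-state DMV on either instance; a small sketch:

```sql
-- Current role of each replica as reported by the local instance.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.operational_state_desc,
       ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = ars.replica_id;
```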
Failure Scenario 2:
One or more databases are not in SYNCHRONIZED state on secondary replica and
primary SQL instance goes down. Primary node is still up and part of cluster.
Since the primary node is still up, the cluster quorum has not changed at all. The cluster service will see that the AG resources have gone down and will start a failover to the secondary replica. SQL Server on the secondary node will, however, fail to bring the DBs online since the DBs were not SYNCHRONIZED at the time of failover. The cluster service will see that the AG resources failed to come online on the secondary node. It will fail the AG back to the primary node, where the AG cannot come up anyway since SQL Server itself is down. The cluster service will keep failing the AG back and forth until the maximum number of failures is exhausted. This is a setting in the role definition of the AG in the cluster, as in Figure 7. After the maximum number of failures is exhausted, the cluster service will stop failing over the AG, and the AG will remain in a failed state on the node where it was when this number got exhausted. If the secondary DBs were set up to be readable, they will continue to be readable but will not be available for writes.
Figure 7
If you want to bring the databases online on the secondary, force failover of the AGs with the possibility of data loss, as sketched below.
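The forced failover itself is a single statement, run on the secondary replica that should become the new primary (using the AG name from this setup):

```sql
-- Run on the secondary replica that should become the new primary.
-- This acknowledges that unsynchronized transactions may be lost.
ALTER AVAILABILITY GROUP TestAG FORCE_FAILOVER_ALLOW_DATA_LOSS;
```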
Failure Scenario 3: One or more databases are not in SYNCHRONIZED state, and the primary node (and in turn the primary SQL instance) goes down.
The secondary node will detect that the primary node has gone down and will try to establish a new quorum by locking the file on the File Share Witness. It will succeed in doing so, maintaining quorum and keeping the cluster up. The cluster service will now fail over the AG to the secondary node, where it will fail since SQL Server will not bring the DBs up because they were not in the SYNCHRONIZED state when the failure happened. If the secondary DBs were set up to be readable, they will continue to be readable but will not be available for writes.
If you want to bring the databases online on the secondary, force failover of the AGs with the possibility of data loss.
Failure Scenario 4: Databases of the AG are in SYNCHRONIZED state and the primary node's cluster service goes down for some reason.
The secondary node will detect that the primary node's cluster service is not responding and will try to establish a new quorum by locking the file on the File Share Witness. It will succeed in doing so, maintaining quorum and keeping the cluster up. The cluster service will now fail over the AG to the secondary node, where SQL Server will bring the databases up, since the DBs were in the SYNCHRONIZED state when the failure happened. The primary SQL instance will make its databases that were part of this AG unavailable.
Failure Scenario 5: One or more databases of the AG are not in SYNCHRONIZED state and the primary node's cluster service goes down for some reason.
The secondary node will detect that the primary node's cluster service is not responding and will try to establish a new quorum by locking the file on the File Share Witness. It will succeed in doing so, maintaining quorum and keeping the cluster up. The cluster service will now fail over the AG to the secondary node, where SQL Server will fail to bring the databases online, as the DBs were not in the SYNCHRONIZED state when the failure happened. If the secondary DBs were set up to be readable, they will continue to be readable but will not be available for writes. The primary SQL instance will make its databases that were part of this AG unavailable.
Failure Scenario 6:
Primary or Secondary node’s HADR Endpoint goes down
If the HADR endpoint of the primary or secondary SQL Server goes down, the databases stop synchronizing data, but the primary databases do not stop working. They keep processing transactions without those transactions going to the secondary replica. The AG remains up, although in an unhealthy state. The cluster service does not see any resource being down, so it does not fail over the resources. This illustrates that configuring synchronous commit with automatic failover does not guarantee that the secondary will always be in sync with the primary. Once the endpoint on the primary comes back up, the databases will start synchronizing data again.
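To diagnose this scenario, you can check the endpoint state and restart it; a sketch, assuming the default endpoint name Hadr_endpoint (yours may differ):

```sql
-- Check the state of the HADR (database mirroring) endpoint.
SELECT e.name, e.state_desc, t.port
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS t
     ON t.endpoint_id = e.endpoint_id;

-- Restart it if it shows STOPPED.
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;
```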
If instead you want to bring the databases online on the secondary replica, force failover of the AGs with the possibility of data loss.
This scenario is interesting compared to mirroring with a witness. If only the mirroring endpoint on the principal fails, the mirror server will fail over the databases to itself and become the principal. This is because the mirror and witness can both see each other (on their endpoints) but not the principal; all their checks happen on the mirroring endpoints. After the endpoint on the old principal comes up, it will take its mirrored databases offline and become the mirror, receiving updates from the new principal.
If only the mirroring endpoint on the mirror fails, the principal continues to run transactions and keeps its databases up, since it can still see the witness.
Failure Scenario 7: DBs that are part of the AG on the primary SQL instance go offline or become unavailable due to one of their files being unavailable.
Surprisingly, failover does not happen in this case either. The problem databases go out of the "SYNCHRONIZED" state on the secondary while the other databases keep processing transactions. The AG resources do not go down, so the cluster service does not attempt a failover of the AGs.
If you want to fail over the AGs in this scenario, it will be a manual failover with the possibility of data loss, which SQL Server will also tell you when you fail over. SQL Server will also show which databases could lose data and which won't.
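The same information is available from a DMV before you fail over; a small sketch, where is_failover_ready = 1 means that database would fail over to that replica without data loss:

```sql
-- Failover readiness per database per replica.
SELECT ar.replica_server_name,
       dcs.database_name,
       dcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = dcs.replica_id;
```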
Failure Scenario 8: DBs of the AG were in sync but at some point stopped synchronizing. This could be due to the secondary running out of space.
The databases are obviously not synchronizing at this time, but the primary keeps them running and transactions keep going through. As noted in the summary table, the primary also stops truncating its log until synchronization resumes. If synchronization does not resume after the cause is fixed, re-seed the secondary as sketched below.
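This is the remediation from the summary table, sketched in T-SQL; the database name and backup share are hypothetical placeholders:

```sql
-- 1) On the secondary replica: remove the database from the AG.
ALTER DATABASE [TestDB] SET HADR OFF;

-- 2) On the primary replica: take fresh full and log backups.
BACKUP DATABASE [TestDB] TO DISK = N'\\backupshare\TestDB.bak';
BACKUP LOG      [TestDB] TO DISK = N'\\backupshare\TestDB.trn';

-- 3) On the secondary replica: restore WITH NORECOVERY,
--    then rejoin the database to the AG.
RESTORE DATABASE [TestDB] FROM DISK = N'\\backupshare\TestDB.bak'
    WITH NORECOVERY, REPLACE;
RESTORE LOG [TestDB] FROM DISK = N'\\backupshare\TestDB.trn'
    WITH NORECOVERY;
ALTER DATABASE [TestDB] SET HADR AVAILABILITY GROUP = TestAG;
```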
Failure Scenario 9:
DBs in Synchronized state. Primary sends a commit transaction log record to
Secondary. Secondary hardens the log record and sends acknowledgment back to
Primary. Primary goes down before it could send commit message to
client/application.
This is a rare scenario that I could not reproduce, but in my discussions with other SQL Server experts, they have expressed that it is very much possible.
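Nothing on the SQL Server side prevents the duplicate re-run; the check has to live in the application. A hedged sketch of an idempotent retry, using a hypothetical Orders table keyed by a client-generated OrderId:

```sql
-- Hypothetical values carried over from the retried application call.
DECLARE @OrderId int = 42, @CustomerId int = 7, @Amount money = 100.00;

-- Idempotent retry: insert only if the first (possibly committed)
-- attempt did not already persist the row.
IF NOT EXISTS (SELECT 1 FROM dbo.Orders WHERE OrderId = @OrderId)
    INSERT INTO dbo.Orders (OrderId, CustomerId, Amount)
    VALUES (@OrderId, @CustomerId, @Amount);
```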
You may be thinking that I did not cover network failures at the site level in the failure scenarios. Well, that was on purpose. I would like to cover them along with a perspective on where the FSW can be placed, in the next section. You could replace the FSW with an odd node and all the points discussed below would still apply.
Placement of File Share Witness (FSW) and Impact on Failover
All scenarios below assume that databases are “SYNCHRONIZED” on secondary side when network disconnect happens.
Referring to Figure 8, the FSW is placed at a third site with independent links to the production and DR sites. Double-sided arrows represent network connections. If the production site loses its network connection, the secondary node and the file share witness form a new quorum and the cluster fails over the databases to the secondary replica. The primary replica makes its databases unavailable. If the DR site loses its network connection instead, the primary replica and the FSW will form a new quorum and keep the databases running on the primary replica.
Figure 8
Now, let us consider that a third site is not available for placing the FSW. This is the situation in several companies.
Referring to Figure 9, the FSW is placed at the DR site. If the primary site loses its network connection, the secondary replica and the FSW form a new quorum and the AGs fail over to the secondary replica.
Figure 9
Referring to Figure 10, the DR site loses its network connection. The secondary replica can get a lock on the FSW and form a new quorum. The cluster fails over the AGs to the secondary replica and SQL Server brings the databases online. The primary replica is isolated, its cluster service stops, and SQL Server on the primary replica takes its databases offline. This is the risk of keeping the FSW at the DR site: it can cause an outage on the production site even when the issue happens at the DR site. On the other hand, if the production site loses its network connection, the databases will fail over to the DR site, so this placement does ensure automatic failover of the AGs in case the production site goes down.
In this scenario, if the DR site outage extends and you want to bring the databases up on the production site, you would need to first force quorum on the primary replica so the cluster can start on it, and then force failover of the AG with data loss in SQL Server, as sketched below.
Figure 10
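A sketch of that recovery sequence; the first step runs at the OS level on the surviving node, not in T-SQL:

```sql
-- Step 1 (run in an elevated command prompt on the isolated node,
-- not in T-SQL): start the cluster service with forced quorum:
--     net start clussvc /forceQuorum
--
-- Step 2 (T-SQL, on the replica that should serve the databases):
-- force the AG online, accepting possible data loss.
ALTER AVAILABILITY GROUP TestAG FORCE_FAILOVER_ALLOW_DATA_LOSS;
```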
Now consider placing the FSW at the production site, as in Figure 11. If the production site loses its network connection or the site itself is destroyed, the secondary node will be isolated, as it can connect to neither the primary node nor the FSW. The cluster service on the secondary node will go down and the AGs will not fail over to it. If it is just a network disconnect, the AG will remain on the primary replica at the production site and the DBs will also remain online there; but if users cannot reach the production site because it lost its network, this still causes an outage. Placing the FSW at the production site removes the possibility of automatic failover of the AGs to the DR site. If the production site network outage extends and you want to bring the DBs online at the DR site, you would need to first force quorum on the secondary node so the cluster can start on it, and then force failover of the AG with data loss in SQL Server.
Figure 11
Placing the FSW at the production site removes automatic failover to the DR site in case of disaster. Placing the FSW at the DR site introduces the possibility of an unnecessary outage even when there are no issues at the production site. The ideal location for the FSW or an odd node is a third site with independent links to both production and DR.
As you can see, the Windows cluster and SQL Server each have an impact on the failover of AGs between nodes. As a DBA, I consider it essential to understand how the Windows cluster works before one can design and support AG infrastructure effectively.