mediumAlways On

HADR_AG_MUTEX Wait Type Explained

Fix SQL Server HADR_AG_MUTEX waits caused by Always On Availability Group DDL operations blocking. Diagnostic queries, fix scripts, and prevention strategies.

Quick Answer

HADR_AG_MUTEX waits occur when Always On Availability Group DDL operations (ADD DATABASE, REMOVE REPLICA, FAILOVER) or Windows clustering commands wait for exclusive access to the AG configuration metadata. This wait is normal during AG maintenance but becomes problematic when DDL operations hang due to blocking transactions or cluster connectivity issues.

Root Cause Analysis

The HADR_AG_MUTEX wait protects the internal availability group configuration structures in SQL Server's Always On subsystem. When any process attempts to modify AG topology (adding databases, changing replica properties, performing failovers), it must acquire an exclusive mutex on the AG's configuration object.

The mutex serializes access to critical AG metadata stored in sys.availability_groups, sys.availability_replicas, and sys.availability_databases_cluster. Multiple operations cannot simultaneously modify these structures because configuration changes must be atomically applied across all replicas through the Windows Server Failover Cluster (WSFC) layer.

Common blocking scenarios include: long-running transactions preventing database addition to AGs, active backup operations blocking database removal, network partitions causing WSFC communication delays, and synchronous commit replicas with high network latency stalling failover operations.

In SQL Server 2016 and later, the AG Health Detection improved timeout handling, reducing stuck DDL operations. SQL Server 2019 introduced Accelerated Database Recovery which shortens transaction rollback times, indirectly reducing HADR_AG_MUTEX contention. SQL Server 2022's improvements to cluster integration provide better timeout management for cross-replica operations.

The wait specifically occurs in the AlwaysOn_health extended events session and correlates with cluster state changes visible in the Windows Failover Cluster Manager logs.

AutoDBA checks Always On Availability Group configuration, replica health monitoring, and cluster connectivity validation across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.

Diagnostic Queries

-- Current AG-related waits and their duration
SELECT 
    session_id,
    wait_type,
    wait_time_ms,
    blocking_session_id,
    resource_description,
    command
FROM sys.dm_exec_requests
WHERE wait_type = 'HADR_AG_MUTEX'
OR session_id IN (
    SELECT session_id 
    FROM sys.dm_exec_requests 
    WHERE wait_type = 'HADR_AG_MUTEX'
);
-- Availability group DDL operations in progress
SELECT 
    ar.replica_server_name,
    ag.name AS ag_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.role_desc,
    ars.operational_state_desc,
    ars.synchronization_state_desc,
    ars.last_commit_time
FROM sys.availability_replicas ar
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id  
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
WHERE ars.operational_state_desc <> 'ONLINE'
OR ars.synchronization_health_desc IN ('NOT_HEALTHY', 'PARTIALLY_HEALTHY');
-- Active transactions that might block AG operations
SELECT 
    s.session_id,
    s.login_name,
    t.transaction_id,
    t.transaction_begin_time,
    DATEDIFF(SECOND, t.transaction_begin_time, GETDATE()) AS duration_seconds,
    t.transaction_state,
    t.transaction_type,
    dt.database_transaction_state,
    db.name AS database_name
FROM sys.dm_tran_active_transactions t
JOIN sys.dm_tran_session_transactions st ON t.transaction_id = st.transaction_id
JOIN sys.dm_exec_sessions s ON st.session_id = s.session_id
LEFT JOIN sys.dm_tran_database_transactions dt ON t.transaction_id = dt.transaction_id
LEFT JOIN sys.databases db ON dt.database_id = db.database_id
WHERE t.transaction_begin_time < DATEADD(MINUTE, -5, GETDATE())
ORDER BY t.transaction_begin_time;
-- AG database states that might prevent DDL operations
SELECT 
    ag.name AS ag_name,
    db.name AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.last_hardened_lsn,
    drs.last_redone_lsn,
    drs.redo_queue_size,
    drs.is_suspended
FROM sys.availability_databases_cluster adc
JOIN sys.availability_groups ag ON adc.group_id = ag.group_id
JOIN sys.databases db ON adc.database_name = db.name
LEFT JOIN sys.dm_hadr_database_replica_states drs ON db.database_id = drs.database_id
WHERE drs.is_suspended = 1 
OR drs.synchronization_state_desc <> 'SYNCHRONIZED'
OR drs.synchronization_health_desc <> 'HEALTHY';
-- Cluster connectivity and quorum status
SELECT 
    cluster_name,
    quorum_type_desc,
    quorum_state_desc,
    cluster_name,
    member_name,
    member_type_desc,
    member_state_desc,
    number_of_quorum_votes
FROM sys.dm_hadr_cluster 
CROSS JOIN sys.dm_hadr_cluster_members
WHERE member_state_desc <> 'UP';

Fix Scripts

Kill blocking long-running transactions

-- Identify and terminate transactions blocking AG DDL
-- WARNING: This will rollback active transactions, test impact first
DECLARE @session_id INT;
DECLARE blocking_cursor CURSOR FOR
    SELECT DISTINCT blocking_session_id
    FROM sys.dm_exec_requests 
    WHERE wait_type = 'HADR_AG_MUTEX' 
    AND blocking_session_id > 0
    AND blocking_session_id <> @@SPID;

OPEN blocking_cursor;
FETCH NEXT FROM blocking_cursor INTO @session_id;

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT 'Killing session ' + CAST(@session_id AS VARCHAR(10));
    EXEC('KILL ' + @session_id);
    FETCH NEXT FROM blocking_cursor INTO @session_id;
END;

CLOSE blocking_cursor;
DEALLOCATE blocking_cursor;

Resume suspended AG databases

-- Resume any suspended AG databases that might block DDL
DECLARE @sql NVARCHAR(MAX) = '';
SELECT @sql = @sql + 
    'ALTER DATABASE [' + database_name + '] SET HADR RESUME;' + CHAR(13)
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.databases db ON drs.database_id = db.database_id
WHERE drs.is_suspended = 1 
AND drs.is_local = 1;

IF @sql <> ''
BEGIN
    PRINT 'Resuming suspended AG databases:';
    PRINT @sql;
    -- EXEC sp_executesql @sql; -- Uncomment after review
END
ELSE
    PRINT 'No suspended AG databases found on this replica';

Restart AG health monitoring

-- Restart Always On health detection by toggling the AlwaysOn_health extended event session
-- This can help clear stuck monitoring operations
ALTER EVENT SESSION AlwaysOn_health ON SERVER STATE = STOP;
ALTER EVENT SESSION AlwaysOn_health ON SERVER STATE = START;

Emergency cluster failover

-- Force failover to break deadlocked AG DDL operations
-- WARNING: Only use when AG DDL operations are completely stuck
-- Replace 'YourAGName' and 'TargetReplica' with actual values
/*
DECLARE @ag_name SYSNAME = 'YourAGName';
DECLARE @target_replica SYSNAME = 'TargetReplica';

-- Verify target replica is available
IF EXISTS (
    SELECT 1 FROM sys.dm_hadr_availability_replica_states ars
    JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
    WHERE ar.replica_server_name = @target_replica
    AND ars.operational_state_desc = 'ONLINE'
)
BEGIN
    EXEC('ALTER AVAILABILITY GROUP [' + @ag_name + '] FAILOVER');
    PRINT 'Failover initiated to ' + @target_replica;
END
ELSE
    PRINT 'Target replica ' + @target_replica + ' is not available for failover';
*/

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Prevention

Configure AG operation timeouts appropriately using ALTER AVAILABILITY GROUP SET (SESSION_TIMEOUT = 15) to prevent indefinite waits. Set reasonable values (10-30 seconds) based on network latency between replicas.

Implement proper transaction management patterns. Keep transactions short when databases participate in AGs. Avoid long-running maintenance operations during business hours when AG topology changes are likely.

Monitor cluster health proactively using sys.dm_hadr_cluster_members and Windows cluster logs. Address network connectivity issues before they impact AG operations. Ensure all cluster nodes have stable network connections and proper DNS resolution.

Schedule AG maintenance operations during low-activity periods. Coordinate database additions, replica changes, and failovers to avoid conflicts. Use maintenance windows for operations that require exclusive AG configuration access.

Configure proper quorum settings for your cluster. Ensure sufficient voting members remain available during planned maintenance. Use file share witnesses or cloud witnesses for geographically distributed AGs to maintain quorum during site outages.

Need hands-on help?

Dealing with persistent hadr_ag_mutex issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.

Related Pages