mediumAlways On

sys.dm_hadr_availability_replica_states - Usage & Examples

Comprehensive guide to SQL Server sys.dm_hadr_availability_replica_states DMV for Always On diagnostics, troubleshooting replica synchronization issues and failover problems.

Quick Answer

sys.dm_hadr_availability_replica_states shows the operational state of each availability replica in Always On Availability Groups, including synchronization health, connection status, and failover readiness. This DMV reveals critical replica health metrics that directly impact high availability and disaster recovery capabilities. Unhealthy states in this view typically indicate network issues, performance bottlenecks, or configuration problems that can compromise data protection.

Root Cause Analysis

The sys.dm_hadr_availability_replica_states DMV exposes the internal state machine that governs Always On replica behavior within the SQL Server Database Engine. Each replica maintains a state vector tracked by the Always On state manager, which coordinates with the Windows Server Failover Clustering (WSFC) resource manager and SQL Server's transaction log hardening mechanism.

The synchronization_state_desc column reflects the log block hardening process between primary and secondary replicas. When a transaction commits on the primary replica, the log records must be hardened to disk on synchronous secondaries before the commit completes, enforcing the synchronization_state of SYNCHRONIZED. Asynchronous replicas show SYNCHRONIZING while log blocks are being transmitted and hardened independently of primary commits.

The connected_state_desc indicates the Always On transport layer connectivity between replicas. SQL Server 2016 introduced database-level health detection that can trigger automatic failovers when the primary replica cannot access databases, reflected in the operational_state_desc as NOT_HEALTHY. SQL Server 2019 enhanced this with accelerated database recovery impact on replica states during crash recovery scenarios.

The role_desc transitions occur through the Always On state machine coordinating with WSFC. PRIMARY to SECONDARY transitions happen during planned manual failovers or automatic failovers triggered by health policies. RESOLVING appears during split-brain scenarios or when WSFC arbitration is determining the new primary replica.

Recovery LSN tracking (via columns such as last_hardened_lsn, last_received_lsn, last_sent_lsn, and recovery_lsn) lives in sys.dm_hadr_database_replica_states, not in sys.dm_hadr_availability_replica_states. Those per-database LSN values show the point-in-time recovery capability of each secondary replica and represent the last log record hardened on the secondary, determining potential data loss during failovers. SQL Server 2022 improved LSN synchronization tracking for contained availability groups.

AutoDBA checks Always On availability group health, replica synchronization states, and failover readiness validation across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.

Diagnostic Queries

-- Primary replica health and synchronization lag analysis
SELECT 
    ar.replica_server_name,
    ars.role_desc,
    ars.operational_state_desc,
    ars.connected_state_desc,
    ars.synchronization_state_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.is_local = 1;
-- Replica synchronization performance and data movement lag
SELECT 
    ar.replica_server_name,
    ag.name AS ag_name,
    ars.role_desc,
    ars.synchronization_state_desc,
    drs.database_name,
    drs.log_send_queue_size / 1024.0 AS log_send_queue_mb,
    drs.log_send_rate / 1024.0 AS log_send_rate_mb_sec,
    drs.redo_queue_size / 1024.0 AS redo_queue_mb,
    drs.redo_rate / 1024.0 AS redo_rate_mb_sec,
    CASE WHEN drs.redo_rate = 0 THEN NULL 
         ELSE CAST(drs.redo_queue_size / drs.redo_rate / 60.0 AS DECIMAL(10,2)) 
    END AS redo_lag_minutes
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
ORDER BY log_send_queue_mb DESC;
-- Failover readiness and automatic failover eligibility
SELECT 
    ag.name AS ag_name,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.role_desc,
    ars.operational_state_desc,
    ars.synchronization_state_desc,
    ars.is_local,
    CASE 
        WHEN ar.failover_mode = 1 AND ars.synchronization_state = 2 
             AND ars.operational_state = 0 THEN 'ELIGIBLE'
        ELSE 'NOT_ELIGIBLE' 
    END AS auto_failover_eligible
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id  
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;
-- Historical connection and synchronization issues
SELECT 
    ar.replica_server_name,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    ars.last_connect_error_timestamp,
    DATEDIFF(second, ars.last_connect_error_timestamp, GETDATE()) AS seconds_since_last_error
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.last_connect_error_number IS NOT NULL
ORDER BY ars.last_connect_error_timestamp DESC;
-- Cross-replica synchronization health dashboard
-- LSN columns live on sys.dm_hadr_database_replica_states, so join through it
SELECT 
    ag.name AS ag_name,
    primary_replica.replica_server_name AS primary_replica,
    secondary_replica.replica_server_name AS secondary_replica,
    secondary_replica.availability_mode_desc AS sync_mode,
    secondary_ars.synchronization_state_desc,
    secondary_ars.connected_state_desc,
    secondary_ars.operational_state_desc,
    secondary_drs.database_name,
    CASE WHEN secondary_drs.last_hardened_lsn = primary_drs.last_hardened_lsn 
         THEN 'IN_SYNC' ELSE 'LAG_DETECTED' END AS lsn_comparison
FROM sys.availability_groups ag
JOIN sys.availability_replicas primary_replica ON ag.group_id = primary_replica.group_id
JOIN sys.dm_hadr_availability_replica_states primary_ars ON primary_replica.replica_id = primary_ars.replica_id
JOIN sys.dm_hadr_database_replica_states primary_drs ON primary_replica.replica_id = primary_drs.replica_id
JOIN sys.availability_replicas secondary_replica ON ag.group_id = secondary_replica.group_id  
JOIN sys.dm_hadr_availability_replica_states secondary_ars ON secondary_replica.replica_id = secondary_ars.replica_id
JOIN sys.dm_hadr_database_replica_states secondary_drs 
    ON secondary_replica.replica_id = secondary_drs.replica_id
   AND primary_drs.database_name = secondary_drs.database_name
WHERE primary_ars.role_desc = 'PRIMARY' 
  AND secondary_ars.role_desc = 'SECONDARY'
ORDER BY ag.name, secondary_replica.replica_server_name;

Fix Scripts

Resume Data Movement for Suspended Replicas Resumes data movement when replicas show SUSPENDED synchronization state due to space issues or manual suspension.

-- Resume data movement for all suspended databases in availability group
-- TEST IN DEV FIRST: Verify space availability and resolve root cause
DECLARE @sql NVARCHAR(MAX) = '';
SELECT @sql = @sql + 'ALTER DATABASE [' + drs.database_name + '] SET HADR RESUME;' + CHAR(13)
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.dm_hadr_availability_replica_states ars ON drs.replica_id = ars.replica_id
WHERE ars.is_local = 1 
  AND ars.role_desc = 'SECONDARY'
  AND drs.synchronization_state_desc = 'SUSPENDED';

PRINT @sql;
-- EXEC sp_executesql @sql; -- Uncomment after review

Expected Impact: Immediately resumes log hardening and redo processes. Monitor disk space and performance after execution.

Force Failover for Unresponsive Primary Performs manual failover with potential data loss when primary replica is unresponsive and secondary shows NOT_SYNCHRONIZED.

-- CRITICAL: This script can cause data loss. Use only when primary is confirmed down
-- Verify secondary replica has most recent data before executing
DECLARE @ag_name NVARCHAR(128) = 'YourAGName'; -- CHANGE THIS VALUE

-- Check current state before failover
-- last_hardened_lsn comes from sys.dm_hadr_database_replica_states, not the replica_states DMV
SELECT 
    ar.replica_server_name,
    ars.role_desc,
    ars.operational_state_desc,
    ars.synchronization_state_desc,
    drs.database_name,
    drs.last_hardened_lsn
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_database_replica_states drs ON ars.replica_id = drs.replica_id
WHERE ag.name = @ag_name;

-- Uncomment only after confirming primary is down and data loss is acceptable
-- ALTER AVAILABILITY GROUP [YourAGName] FORCE_FAILOVER_ALLOW_DATA_LOSS;

Expected Impact: Promotes secondary to primary role with potential data loss equal to unsynchronized transactions. Requires immediate application connection string updates.

Repair Replica Connection Issues Attempts to resolve connectivity issues by cycling replica connections and clearing error states.

-- Cycle availability group replica connections to resolve transport issues
-- Safe for production: Does not affect data or availability
DECLARE @replica_url NVARCHAR(128);
DECLARE replica_cursor CURSOR FOR
SELECT ar.endpoint_url 
FROM sys.availability_replicas ar
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
WHERE ars.connected_state_desc = 'DISCONNECTED' 
  AND ars.is_local = 0;

OPEN replica_cursor;
FETCH NEXT FROM replica_cursor INTO @replica_url;

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT 'Cycling connection to replica: ' + @replica_url;
    -- Connection cycling happens automatically when DMVs are queried
    -- This query forces the Always On transport to re-evaluate connections
    EXEC sp_executesql N'SELECT COUNT(*) FROM sys.dm_hadr_availability_replica_states WHERE connected_state_desc = ''CONNECTED''';
    WAITFOR DELAY '00:00:05';
    FETCH NEXT FROM replica_cursor INTO @replica_url;
END

CLOSE replica_cursor;
DEALLOCATE replica_cursor;

Expected Impact: May restore connectivity for transient network issues. No data impact, minimal performance overhead.

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Prevention

Configure Always On health policies with appropriate timeout values to prevent false positive failovers while maintaining responsiveness to genuine failures. Set HealthCheckTimeout to 60000ms minimum and LeaseTimeout to 20000ms for environments with network latency over 10ms. These values prevent connection state flapping visible in sys.dm_hadr_availability_replica_states.

Implement proactive monitoring of log_send_queue_size and redo_queue_size thresholds. Alert when log send queue exceeds 100MB or redo queue exceeds 50MB per database, indicating synchronization lag that will appear as SYNCHRONIZING state instead of SYNCHRONIZED. Size transaction log files appropriately with sufficient auto-growth increments to prevent log hardening delays.

Deploy dedicated network connections for Always On traffic using separate network adapters and VLANs. Configure database mirroring endpoints with TCP port assignments that avoid dynamic port allocation. Network connectivity issues are the primary cause of DISCONNECTED states in the connected_state_desc column.

Schedule regular validation of automatic failover eligibility using the failover readiness query. Synchronous replicas showing NOT_SYNCHRONIZED states cannot participate in automatic failovers, compromising high availability. Monitor synchronization_state transitions and investigate any replica stuck in SYNCHRONIZING state for extended periods.

Configure appropriate backup strategies that minimize log chain dependencies between replicas. Excessive log backup delays on secondary replicas can cause synchronization state degradation visible through recovery_lsn lag analysis against sys.dm_hadr_database_replica_states. Use backup compression and optimize backup network bandwidth to prevent log accumulation.

Need hands-on help?

Dealing with persistent sys.dm_hadr_availability_replica_states issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.

Related Pages