mediumAlways On

HADR_RECOVERY_WAIT_FOR_CONNECTION Wait Type Explained

Fix SQL Server HADR_RECOVERY_WAIT_FOR_CONNECTION wait type caused by Always On secondary replica connection issues. Includes diagnostic queries and solutions.

Quick Answer

HADR_RECOVERY_WAIT_FOR_CONNECTION occurs when an Always On secondary replica waits to establish connection with the primary replica before starting database recovery. This is normal during replica startup, failover, or network interruptions, but extended waits indicate connectivity or authentication issues between replicas.

Root Cause Analysis

This wait type manifests during the Always On availability group recovery process when the Database Engine's Always On availability groups thread manager cannot establish or maintain the required TCP connection to the primary replica. The secondary replica's recovery worker thread enters this wait state while the connection establishment logic attempts to reach the primary replica's database mirroring endpoint.

The wait occurs specifically in the CAlwaysOnSecondaryReplicaManager::WaitForPrimaryConnection() method, where the secondary replica's recovery process halts until the availability group transport layer confirms a stable connection to the primary. During this phase, the secondary database remains in a RESTORING state and cannot accept connections or process log records.

In SQL Server 2016 and later, the connection retry logic became more aggressive with exponential backoff, reducing unnecessary connection attempts during network instability. SQL Server 2019 introduced enhanced connection diagnostics that populate additional columns in sys.dm_hadr_availability_replica_states during connection establishment phases.

The wait duration directly correlates with network latency, DNS resolution time, authentication overhead (especially with Kerberos), and the configured connection timeout values on the database mirroring endpoints. Certificate-based authentication typically adds 50-200ms to initial connection establishment compared to Windows authentication.

SQL Server 2022 improved this mechanism by implementing connection pooling for availability group endpoints, reducing the frequency of complete connection re-establishment during minor network disruptions.

AutoDBA checks Always On availability group health, endpoint configuration, and replica connectivity monitoring across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.

Diagnostic Queries

-- Check availability group connection states and last connection errors
SELECT 
    ar.replica_server_name,
    ars.role_desc,
    ars.connected_state_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    ars.last_connect_error_timestamp
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_availability_replica_states ars 
    ON ar.replica_id = ars.replica_id;
-- Monitor current HADR_RECOVERY_WAIT_FOR_CONNECTION waits
SELECT 
    s.session_id,
    r.command,
    w.wait_type,
    w.wait_time_ms,
    w.resource_description,
    s.login_name,
    s.program_name
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
INNER JOIN sys.dm_os_waiting_tasks w ON r.session_id = w.session_id
WHERE w.wait_type = 'HADR_RECOVERY_WAIT_FOR_CONNECTION';
-- Check database mirroring endpoint configuration and status
SELECT 
    e.name,
    e.type_desc,
    e.state_desc,
    e.port,
    t.ip_address,
    t.port AS tcp_port
FROM sys.database_mirroring_endpoints e
INNER JOIN sys.tcp_endpoints t ON e.endpoint_id = t.endpoint_id
WHERE e.type = 4;
-- Examine Always On extended events for connection failures
SELECT 
    event_data.value('(event/@name)[1]', 'VARCHAR(50)') AS event_name,
    event_data.value('(event/@timestamp)[1]', 'DATETIME2') AS event_time,
    event_data.value('(event/data[@name="availability_group_name"]/value)[1]', 'VARCHAR(128)') AS ag_name,
    event_data.value('(event/data[@name="error_number"]/value)[1]', 'INT') AS error_number,
    event_data.value('(event/data[@name="error_message"]/value)[1]', 'VARCHAR(MAX)') AS error_message
FROM (
    SELECT CAST(target_data AS XML) AS target_data
    FROM sys.dm_xe_sessions s
    INNER JOIN sys.dm_xe_session_targets st ON s.address = st.event_session_address
    WHERE s.name = 'AlwaysOn_health'
) AS session_data
CROSS APPLY target_data.nodes('//event[data[@name="availability_group_name"]]') AS n(event_data)
WHERE event_data.value('(event/@name)[1]', 'VARCHAR(50)') LIKE '%connection%'
ORDER BY event_data.value('(event/@timestamp)[1]', 'DATETIME2') DESC;
-- Check network connectivity and latency between replicas
SELECT 
    ar.replica_server_name,
    ar.endpoint_url,
    ars.connected_state_desc,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,
    drs.redo_queue_size,
    drs.last_sent_time,
    drs.last_received_time
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
INNER JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE ar.availability_group_id IN (
    SELECT group_id FROM sys.availability_groups WHERE name = 'YourAGName'
);

Fix Scripts

Restart availability group endpoint - Forces re-establishment of endpoint connections when authentication or configuration issues persist.

-- Get endpoint name first, then restart it
DECLARE @EndpointName NVARCHAR(128);
SELECT @EndpointName = name 
FROM sys.database_mirroring_endpoints 
WHERE type = 4;

-- Stop and start the endpoint to force connection reset
EXEC('ALTER ENDPOINT [' + @EndpointName + '] STATE = STOPPED');
WAITFOR DELAY '00:00:05';
EXEC('ALTER ENDPOINT [' + @EndpointName + '] STATE = STARTED');

Test in development first. This briefly interrupts all availability group communication and may cause brief connection errors.

Increase endpoint connection timeout - Addresses timeout issues in high-latency environments or during authentication delays.

-- Increase connection timeout for database mirroring endpoint
-- Default is typically 10 seconds, increase to 20-30 for slow networks
DECLARE @EndpointName NVARCHAR(128);
SELECT @EndpointName = name 
FROM sys.database_mirroring_endpoints 
WHERE type = 4;

EXEC('ALTER ENDPOINT [' + @EndpointName + '] 
    FOR DATABASE_MIRRORING (
        AUTHENTICATION = WINDOWS,
        ENCRYPTION = REQUIRED ALGORITHM AES,
        CONNECTION_TIMEOUT = 30
    )');

Monitor for any impact on failover times, as longer timeouts can delay failure detection.

Force availability group failover to reset connections - Use when persistent connection issues exist and other methods fail.

-- Emergency failover to reset connection state
-- Execute on intended new primary replica
ALTER AVAILABILITY GROUP [YourAGName] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- After failover, verify all replicas reconnect properly
-- Execute this query periodically to monitor reconnection
SELECT 
    replica_server_name,
    connected_state_desc,
    recovery_lsn,
    truncation_lsn
FROM sys.dm_hadr_availability_replica_states;

WARNING: This causes data loss and should only be used when availability is more critical than data consistency. Always attempt graceful failover first.

Reset availability group database on secondary - Resolves corruption in availability group metadata causing persistent connection failures.

-- Remove and re-add database to availability group on secondary
-- Execute on secondary replica experiencing connection issues
ALTER DATABASE [YourDatabase] SET HADR OFF;

-- Re-add database to availability group
-- Requires fresh backup/restore if database was removed
ALTER AVAILABILITY GROUP [YourAGName] ADD DATABASE [YourDatabase];

This requires re-initializing the database from backup, which can be time-consuming for large databases.

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Prevention

Configure database mirroring endpoints with appropriate connection timeouts (20-30 seconds) for your network latency characteristics. Networks with RTT above 50ms typically require timeout adjustments from the default 10-second value.

Implement dedicated network connections or VLANs for availability group traffic to isolate replication from application workloads. Place availability group replicas on separate subnets only when necessary, as cross-subnet communication introduces additional latency and DNS resolution overhead.

Monitor the Always On health extended event session continuously and establish automated alerting when connection error patterns emerge. Connection failures often precede more serious availability group issues by several minutes.

Use Windows authentication instead of certificate-based authentication when possible, as certificate validation adds authentication overhead and introduces additional failure points. If certificates are required, ensure certificate revocation list (CRL) checking is optimized or disabled in disconnected environments.

Configure multiple IP addresses for availability group listeners in multi-subnet configurations, but avoid placing replicas across high-latency WAN connections where connection timeouts become frequent.

Need hands-on help?

Dealing with persistent hadr_recovery_wait_for_connection issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.

Related Pages