Quick Answer

HADR_RECOVERY_WAIT_FOR_CONNECTION occurs when an Always On secondary replica waits to establish connection with the primary replica before starting database recovery. This is normal during replica startup, failover, or network interruptions, but extended waits indicate connectivity or authentication issues between replicas.

Root Cause Analysis

This wait type manifests during the Always On availability group recovery process when the Database Engine's Always On availability groups thread manager cannot establish or maintain the required TCP connection to the primary replica. The secondary replica's recovery worker thread enters this wait state while the connection establishment logic attempts to reach the primary replica's database mirroring endpoint.

The wait occurs specifically in the CAlwaysOnSecondaryReplicaManager::WaitForPrimaryConnection() method, where the secondary replica's recovery process halts until the availability group transport layer confirms a stable connection to the primary. During this phase, the secondary database remains in a RESTORING state and cannot accept connections or process log records.

In SQL Server 2016 and later, the connection retry logic became more aggressive with exponential backoff, reducing unnecessary connection attempts during network instability. SQL Server 2019 introduced enhanced connection diagnostics that populate additional columns in sys.dm_hadr_availability_replica_states during connection establishment phases.

The wait duration directly correlates with network latency, DNS resolution time, authentication overhead (especially with Kerberos), and the configured connection timeout values on the database mirroring endpoints. Certificate-based authentication typically adds 50-200ms to initial connection establishment compared to Windows authentication.

SQL Server 2022 improved this mechanism by implementing connection pooling for availability group endpoints, reducing the frequency of complete connection re-establishment during minor network disruptions.

Diagnostic Queries

-- Check availability group connection states and last connection errors
SELECT 
    ar.replica_server_name,
    ars.role_desc,
    ars.connected_state_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    ars.last_connect_error_timestamp
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_availability_replica_states ars 
    ON ar.replica_id = ars.replica_id;

-- Monitor current HADR_RECOVERY_WAIT_FOR_CONNECTION waits
SELECT 
    s.session_id,
    r.command,
    w.wait_type,
    w.wait_time_ms,
    w.resource_description,
    s.login_name,
    s.program_name
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
INNER JOIN sys.dm_os_waiting_tasks w ON r.session_id = w.session_id
WHERE w.wait_type = 'HADR_RECOVERY_WAIT_FOR_CONNECTION';

-- Check database mirroring endpoint configuration and status
SELECT 
    e.name,
    e.type_desc,
    e.state_desc,
    e.port,
    t.ip_address,
    t.port AS tcp_port
FROM sys.database_mirroring_endpoints e
INNER JOIN sys.tcp_endpoints t ON e.endpoint_id = t.endpoint_id
WHERE e.type = 4;

-- Examine Always On extended events for connection failures
SELECT 
    event_data.value('(event/@name)[1]', 'VARCHAR(50)') AS event_name,
    event_data.value('(event/@timestamp)[1]', 'DATETIME2') AS event_time,
    event_data.value('(event/data[@name="availability_group_name"]/value)[1]', 'VARCHAR(128)') AS ag_name,
    event_data.value('(event/data[@name="error_number"]/value)[1]', 'INT') AS error_number,
    event_data.value('(event/data[@name="error_message"]/value)[1]', 'VARCHAR(MAX)') AS error_message
FROM (
    SELECT CAST(target_data AS XML) AS target_data
    FROM sys.dm_xe_sessions s
    INNER JOIN sys.dm_xe_session_targets st ON s.address = st.event_session_address
    WHERE s.name = 'AlwaysOn_health'
) AS session_data
CROSS APPLY target_data.nodes('//event[data[@name="availability_group_name"]]') AS n(event_data)
WHERE event_data.value('(event/@name)[1]', 'VARCHAR(50)') LIKE '%connection%'
ORDER BY event_data.value('(event/@timestamp)[1]', 'DATETIME2') DESC;

-- Check network connectivity and latency between replicas
SELECT 
    ar.replica_server_name,
    ar.endpoint_url,
    ars.connected_state_desc,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,
    drs.redo_queue_size,
    drs.last_sent_time,
    drs.last_received_time
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
INNER JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE ar.availability_group_id IN (
    SELECT group_id FROM sys.availability_groups WHERE name = 'YourAGName'
);

Fix Scripts

Restart availability group endpoint - Forces re-establishment of endpoint connections when authentication or configuration issues persist.

-- Get endpoint name first, then restart it
DECLARE @EndpointName NVARCHAR(128);
SELECT @EndpointName = name 
FROM sys.database_mirroring_endpoints 
WHERE type = 4;

-- Stop and start the endpoint to force connection reset
EXEC('ALTER ENDPOINT [' + @EndpointName + '] STATE = STOPPED');
WAITFOR DELAY '00:00:05';
EXEC('ALTER ENDPOINT [' + @EndpointName + '] STATE = STARTED');

Test in development first. This briefly interrupts all availability group communication and may cause brief connection errors.

Increase endpoint connection timeout - Addresses timeout issues in high-latency environments or during authentication delays.

-- Increase connection timeout for database mirroring endpoint
-- Default is typically 10 seconds, increase to 20-30 for slow networks
DECLARE @EndpointName NVARCHAR(128);
SELECT @EndpointName = name 
FROM sys.database_mirroring_endpoints 
WHERE type = 4;

EXEC('ALTER ENDPOINT [' + @EndpointName + '] 
    FOR DATABASE_MIRRORING (
        AUTHENTICATION = WINDOWS,
        ENCRYPTION = REQUIRED ALGORITHM AES,
        CONNECTION_TIMEOUT = 30
    )');

Monitor for any impact on failover times, as longer timeouts can delay failure detection.

Force availability group failover to reset connections - Use when persistent connection issues exist and other methods fail.

-- Emergency failover to reset connection state
-- Execute on intended new primary replica
ALTER AVAILABILITY GROUP [YourAGName] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- After failover, verify all replicas reconnect properly
-- Execute this query periodically to monitor reconnection
SELECT 
    replica_server_name,
    connected_state_desc,
    recovery_lsn,
    truncation_lsn
FROM sys.dm_hadr_availability_replica_states;

WARNING: This causes data loss and should only be used when availability is more critical than data consistency. Always attempt graceful failover first.

Reset availability group database on secondary - Resolves corruption in availability group metadata causing persistent connection failures.

-- Remove and re-add database to availability group on secondary
-- Execute on secondary replica experiencing connection issues
ALTER DATABASE [YourDatabase] SET HADR OFF;

-- Re-add database to availability group
-- Requires fresh backup/restore if database was removed
ALTER AVAILABILITY GROUP [YourAGName] ADD DATABASE [YourDatabase];

This requires re-initializing the database from backup, which can be time-consuming for large databases.

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

HADR_RECOVERY_WAIT_FOR_CONNECTION Wait Type Explained

Quick Answer

Root Cause Analysis

Diagnostic Queries

Fix Scripts

Prevention

Need hands-on help?

Related Pages

HADR_RECOVERY_WAIT_FOR_UNDO

HADR_AG_MUTEX

HADR_AR_CRITICAL_SECTION_ENTRY

HADR_BACKUP_BULK_LOCK

Always On AG Latency

sys.dm_hadr_availability_replica_states