mediumAlways On

HADR_RECOVERY_WAIT_FOR_UNDO Wait Type Explained

Fix HADR_RECOVERY_WAIT_FOR_UNDO waits in SQL Server Always On. Root cause analysis, diagnostic queries, and prevention strategies for availability group recovery delays.

Quick Answer

HADR_RECOVERY_WAIT_FOR_UNDO occurs during Always On availability group failover when the new primary database waits for the secondary to complete crash recovery undo operations. This wait happens when log records must be rolled back to reach the common recovery point, and duration depends on uncommitted transaction volume and IO subsystem performance.

Root Cause Analysis

When an Always On availability group failover occurs, SQL Server must synchronize both replicas to a consistent recovery point before the secondary can become the new primary. The HADR_RECOVERY_WAIT_FOR_UNDO wait specifically occurs during the undo phase of crash recovery on the former secondary database.

During normal operations, the secondary replica continuously applies log records from the primary in redo-only mode without rolling back uncommitted transactions. When failover triggers, the recovery process must complete three phases: analysis, redo, and undo. The undo phase rolls back all uncommitted transactions that were active at the time of failover to establish transactional consistency.

The wait manifests when the primary replica (now becoming secondary) waits for the new primary (former secondary) to complete its undo operations. The recovery manager coordinates this synchronization through the Always On state machine, which tracks recovery progress across both replicas.

SQL Server 2016 introduced improvements to parallel undo processing, reducing recovery times for databases with many concurrent transactions. SQL Server 2019 enhanced this further with accelerated database recovery (ADR), which can significantly reduce undo phase duration by deferring rollback operations for long-running transactions. However, ADR must be explicitly enabled and doesn't eliminate this wait type entirely.

The duration correlates directly with the amount of uncommitted work (measured in log records requiring rollback) and the IO subsystem's ability to handle random write patterns during undo operations. Large uncommitted transactions, particularly those involving extensive data modifications, extend recovery time substantially.

AutoDBA checks Always On availability group health, recovery performance, and failover readiness metrics across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.

Diagnostic Queries

-- Monitor current Always On recovery state and progress
SELECT 
    ar.replica_server_name,
    adc.database_name,
    drs.database_state_desc,
    drs.is_primary_replica,
    drs.synchronization_state_desc,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.log_send_queue_size
FROM sys.dm_hadr_database_replica_states drs
    INNER JOIN sys.dm_hadr_availability_replica_states ar 
        ON drs.group_id = ar.group_id AND drs.replica_id = ar.replica_id
    INNER JOIN sys.availability_databases_cluster adc 
        ON drs.group_id = adc.group_id AND drs.database_id = adc.database_id
WHERE drs.database_state_desc LIKE '%RECOVERY%';
-- Check for active HADR_RECOVERY_WAIT_FOR_UNDO waits
SELECT 
    session_id,
    wait_type,
    wait_time_ms,
    blocking_session_id,
    wait_resource,
    resource_description
FROM sys.dm_exec_requests
WHERE wait_type = 'HADR_RECOVERY_WAIT_FOR_UNDO'
    AND session_id > 50;
-- Examine recovery progress through database recovery percentages
SELECT 
    r.database_id,
    d.name AS database_name,
    d.recovery_model_desc,
    d.state_desc,
    r.percent_complete,
    r.start_time,
    r.status,
    r.command
FROM sys.dm_exec_requests r
    INNER JOIN sys.databases d ON d.database_id = r.database_id
WHERE r.command LIKE '%RECOVERY%'
    AND r.percent_complete > 0;
-- Monitor Always On extended events for recovery details
SELECT 
    event_data.value('(event/@name)[1]', 'varchar(50)') AS event_name,
    event_data.value('(event/@timestamp)[1]', 'datetime2') AS event_time,
    event_data.value('(event/data[@name="database_name"]/value)[1]', 'varchar(128)') AS database_name,
    event_data.value('(event/data[@name="recovery_phase"]/text)[1]', 'varchar(20)') AS recovery_phase,
    event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') AS duration_ms
FROM sys.fn_xe_file_target_read_file('AlwaysOn*.xel', NULL, NULL, NULL)
WHERE event_data.value('(event/@name)[1]', 'varchar(50)') LIKE '%recovery%'
    AND event_data.value('(event/@timestamp)[1]', 'datetime2') >= DATEADD(hour, -2, GETDATE())
ORDER BY event_time DESC;
-- Check for long-running transactions that could delay undo
SELECT 
    s.session_id,
    s.login_name,
    s.program_name,
    s.host_name,
    r.start_time,
    r.status,
    r.command,
    DATEDIFF(minute, r.start_time, GETDATE()) AS duration_minutes,
    r.cpu_time,
    r.logical_reads,
    t.text AS sql_text
FROM sys.dm_exec_sessions s
    INNER JOIN sys.dm_exec_requests r ON s.session_id = r.session_id
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.start_time <= DATEADD(minute, -5, GETDATE())
    AND s.is_user_process = 1
ORDER BY r.start_time;

Fix Scripts

Monitor and wait for recovery completion This script provides real-time monitoring during recovery operations. Run this to track progress rather than intervening, as recovery must complete naturally.

-- Monitor recovery progress with refresh loop
DECLARE @recovery_complete BIT = 0;
DECLARE @start_time DATETIME2 = GETDATE();

WHILE @recovery_complete = 0
BEGIN
    SELECT 
        DB_NAME(database_id) AS database_name,
        state_desc,
        recovery_model_desc,
        CASE 
            WHEN state_desc = 'ONLINE' THEN 1
            ELSE 0
        END AS is_online
    FROM sys.databases
    WHERE database_id IN (
        SELECT database_id 
        FROM sys.dm_hadr_database_replica_states 
        WHERE database_state_desc LIKE '%RECOVERY%'
    );
    
    -- Check if all AG databases are online
    IF NOT EXISTS (
        SELECT 1 FROM sys.databases d
        INNER JOIN sys.dm_hadr_database_replica_states drs 
            ON d.database_id = drs.database_id
        WHERE drs.database_state_desc LIKE '%RECOVERY%'
    )
    BEGIN
        SET @recovery_complete = 1;
        PRINT 'Recovery completed at: ' + CONVERT(VARCHAR(23), GETDATE(), 121);
        PRINT 'Total duration: ' + CAST(DATEDIFF(second, @start_time, GETDATE()) AS VARCHAR(10)) + ' seconds';
    END
    ELSE
    BEGIN
        PRINT 'Recovery in progress at: ' + CONVERT(VARCHAR(23), GETDATE(), 121);
        WAITFOR DELAY '00:00:05'; -- Wait 5 seconds between checks
    END
END;

Enable Accelerated Database Recovery (SQL Server 2019+) ADR reduces undo phase duration by deferring transaction rollback operations. Test thoroughly in development as ADR changes transaction log behavior.

-- Enable ADR on availability group databases
-- WARNING: Test in development first, requires SQL Server 2019+
DECLARE @sql NVARCHAR(MAX) = '';
DECLARE @database_name SYSNAME;

DECLARE db_cursor CURSOR FOR
SELECT adc.database_name
FROM sys.availability_databases_cluster adc
    INNER JOIN sys.databases d ON adc.database_name = d.name
WHERE d.is_accelerated_database_recovery_on = 0
    AND d.state_desc = 'ONLINE';

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @database_name;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = 'ALTER DATABASE [' + @database_name + '] SET ACCELERATED_DATABASE_RECOVERY = ON;';
    PRINT 'Enabling ADR on database: ' + @database_name;
    EXEC sp_executesql @sql;
    
    FETCH NEXT FROM db_cursor INTO @database_name;
END;

CLOSE db_cursor;
DEALLOCATE db_cursor;

-- Verify ADR status
SELECT 
    name,
    is_accelerated_database_recovery_on,
    accelerated_database_recovery_state_desc
FROM sys.databases
WHERE name IN (SELECT database_name FROM sys.availability_databases_cluster);

Optimize checkpoint frequency for AG databases More frequent checkpoints reduce recovery time by minimizing redo/undo work required during failover.

-- Configure recovery interval for shorter checkpoint intervals
-- Current setting check
SELECT 
    name,
    value_in_use,
    description
FROM sys.configurations
WHERE name = 'recovery interval (min)';

-- Reduce recovery interval to 1 minute for faster checkpoints
-- Default is 0 (auto), consider 1-5 minutes for AG databases
EXEC sp_configure 'recovery interval (min)', 1;
RECONFIGURE WITH OVERRIDE;

-- Verify change
SELECT 
    name,
    value_in_use,
    description
FROM sys.configurations
WHERE name = 'recovery interval (min)';

-- Note: This affects all databases on the instance
-- Monitor checkpoint frequency increase in error log

Force manual checkpoint on primary before planned failover Run this script on the primary replica before planned maintenance to minimize recovery time.

-- Force checkpoint on all AG databases before planned failover
-- This reduces the amount of log that needs to be processed during recovery
DECLARE @sql NVARCHAR(MAX);
DECLARE @database_name SYSNAME;

DECLARE db_cursor CURSOR FOR
SELECT database_name
FROM sys.availability_databases_cluster adc
    INNER JOIN sys.dm_hadr_database_replica_states drs 
        ON adc.group_id = drs.group_id
WHERE drs.is_primary_replica = 1
    AND drs.database_state_desc = 'ONLINE';

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @database_name;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = 'CHECKPOINT; -- Database: ' + @database_name;
    EXEC('USE [' + @database_name + ']; CHECKPOINT;');
    PRINT 'Checkpoint completed for database: ' + @database_name;
    
    FETCH NEXT FROM db_cursor INTO @database_name;
END;

CLOSE db_cursor;
DEALLOCATE db_cursor;

-- Wait a moment for log shipping to catch up
WAITFOR DELAY '00:00:10';
PRINT 'Manual checkpoints completed for all AG databases';

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Prevention

Configure aggressive checkpoint intervals using a recovery interval of 1-2 minutes instead of the default auto setting. This reduces the amount of log requiring processing during failover recovery operations.

Enable Accelerated Database Recovery on SQL Server 2019+ instances hosting availability groups. ADR significantly reduces undo phase duration by deferring rollback operations for long-running transactions, though it requires additional persistent version store space.

Implement transaction design patterns that avoid long-running uncommitted transactions spanning multiple minutes. Break large operations into smaller batches with intermediate commits, particularly for maintenance operations and data loads.

Monitor transaction log growth patterns and establish log backup frequencies that prevent excessive log accumulation. Large transaction logs increase recovery time even with frequent checkpoints.

Configure Always On dashboard monitoring with custom thresholds for recovery queue sizes and synchronization delays. Alert on redo queue sizes exceeding 100MB or synchronization delays over 30 seconds during normal operations.

Size tempdb appropriately on secondary replicas, as recovery operations may require substantial tempdb space for sorting and intermediate storage during undo processing. Plan for 20-30% of the largest AG database size as baseline tempdb sizing.

Establish maintenance windows for large transaction operations and coordinate with planned failovers to avoid recovery delays. Schedule batch operations and index maintenance during low-activity periods when failover likelihood remains minimal.

Test failover scenarios regularly under various transaction loads to establish baseline recovery times and identify transactions that consistently extend recovery duration beyond acceptable thresholds.

Need hands-on help?

Dealing with persistent hadr_recovery_wait_for_undo issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.

Related Pages