mediumAlways On

HADR_BACKUP_QUEUE Wait Type Explained

Learn how to diagnose and fix SQL Server HADR_BACKUP_QUEUE waits in Always On Availability Groups. Includes queries, scripts, and prevention strategies.

Quick Answer

HADR_BACKUP_QUEUE occurs when the Always On primary replica's backup background thread waits for secondary replicas to acknowledge bulk operation log releases. This happens during large operations like bulk inserts, index rebuilds, or backups when the primary holds BulkOp log records and needs secondary confirmation before releasing locks. Generally benign but can indicate network latency or secondary replica performance issues.

Root Cause Analysis

The Always On availability group architecture requires coordination between primary and secondary replicas for bulk operations that generate large amounts of transaction log. When bulk operations occur (BULK_INSERT, SELECT INTO, index operations with SORT_IN_TEMPDB, or certain backup operations), SQL Server generates BulkOp log records that contain metadata about the bulk operation rather than individual row changes.

The primary replica's backup background thread manages the release of these BulkOp log records. This thread must wait for confirmation from all synchronous secondary replicas (and asynchronous replicas within their timeout window) before releasing the bulk operation locks. The thread enters HADR_BACKUP_QUEUE state during this coordination phase.

SQL Server 2016 introduced improvements to bulk operation handling in Always On, reducing the frequency of these waits through better log record batching. SQL Server 2019 further optimized the backup thread's communication protocol with secondary replicas, particularly for scenarios involving multiple concurrent bulk operations.

The wait specifically occurs in the backup worker thread (not user threads performing the bulk operations). This thread runs continuously on the primary replica, processing a queue of bulk operation completion notifications. When the queue is empty or waiting for secondary acknowledgment, the thread enters this wait state.

Network latency, secondary replica disk I/O performance, or redo thread bottlenecks on secondary replicas directly impact the duration of these waits. The primary cannot release bulk operation resources until all configured replicas respond within their respective timeout thresholds.

AutoDBA checks Always On Availability Group health, replica synchronization status, and bulk operation impact analysis across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.

Diagnostic Queries

-- Check current HADR_BACKUP_QUEUE waits and duration
SELECT 
    session_id,
    wait_type,
    wait_time_ms,
    blocking_session_id,
    wait_resource,
    command
FROM sys.dm_exec_requests 
WHERE wait_type = 'HADR_BACKUP_QUEUE'
AND session_id > 50;
-- Analyze recent bulk operations and their impact on AG replicas
SELECT 
    ar.replica_server_name,
    ar.availability_mode_desc,
    ars.synchronization_health_desc,
    ars.last_sent_time,
    ars.last_received_time,
    ars.last_hardened_time,
    DATEDIFF(ms, ars.last_sent_time, ars.last_received_time) AS send_receive_latency_ms,
    DATEDIFF(ms, ars.last_received_time, ars.last_hardened_time) AS harden_latency_ms
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_availability_replica_states ars
    ON ar.replica_id = ars.replica_id
WHERE ar.replica_server_name != @@SERVERNAME
ORDER BY send_receive_latency_ms DESC;
-- Identify bulk operations currently running or recently completed
SELECT 
    s.session_id,
    r.command,
    r.percent_complete,
    r.estimated_completion_time,
    r.database_id,
    DB_NAME(r.database_id) AS database_name,
    s.program_name,
    r.wait_type,
    r.wait_time_ms,
    t.text
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.command IN ('BULK INSERT', 'SELECT INTO', 'DbccFilesCompact', 'BACKUP DATABASE', 'BACKUP LOG')
   OR t.text LIKE '%BULK%INSERT%' 
   OR t.text LIKE '%SELECT%INTO%'
   OR r.wait_type LIKE 'HADR%';
-- Check AG redo queue size and performance on primary
SELECT 
    ar.replica_server_name,
    drs.database_id,
    DB_NAME(drs.database_id) AS database_name,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.log_send_queue_size,
    drs.log_send_rate,
    CASE 
        WHEN drs.redo_rate > 0 THEN drs.redo_queue_size / drs.redo_rate 
        ELSE -1 
    END AS redo_queue_seconds_behind
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_database_replica_states drs
    ON ar.replica_id = drs.replica_id
WHERE ar.replica_server_name != @@SERVERNAME
  AND drs.redo_queue_size > 0
ORDER BY redo_queue_seconds_behind DESC;
-- Historical wait statistics for HADR_BACKUP_QUEUE
SELECT 
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
    CAST(100.0 * wait_time_ms / SUM(wait_time_ms) OVER() AS DECIMAL(5,2)) AS pct_of_waits
FROM sys.dm_os_wait_stats
WHERE wait_type = 'HADR_BACKUP_QUEUE'
  AND waiting_tasks_count > 0;

Fix Scripts

Force secondary replica synchronization catch-up

-- Check and potentially force log synchronization on lagging secondaries
-- WARNING: Run only during maintenance windows, can impact performance
DECLARE @replica_name NVARCHAR(256) = 'YOUR_SECONDARY_REPLICA_NAME';

-- First verify the replica is actually behind
SELECT 
    replica_server_name,
    redo_queue_size,
    log_send_queue_size
FROM sys.availability_replicas ar
INNER JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE replica_server_name = @replica_name;

-- If significantly behind, consider temporarily changing to async mode
-- ALTER AVAILABILITY GROUP [YourAGName] 
-- MODIFY REPLICA ON @replica_name
-- WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

Optimize bulk operations for Always On environments

-- Optimize backup and bulk operation performance in AG
-- Note: AG databases MUST remain in FULL recovery model

-- Enable backup compression to reduce data transfer to secondary replicas
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;

-- Use backup striping to parallelize backup I/O (example with 4 files)
/*
BACKUP DATABASE [YourDatabaseName]
TO DISK = 'D:\Backups\YourDB_1.bak',
   DISK = 'D:\Backups\YourDB_2.bak',
   DISK = 'D:\Backups\YourDB_3.bak',
   DISK = 'D:\Backups\YourDB_4.bak'
WITH COMPRESSION, FORMAT, INIT;
*/

-- For large bulk operations, consider batching
-- Example: Break large operations into smaller chunks
/*
DECLARE @BatchSize INT = 10000;
DECLARE @RowsProcessed INT = 0;

WHILE EXISTS (SELECT 1 FROM SourceTable WHERE ProcessFlag = 0)
BEGIN
    INSERT INTO TargetTable (Col1, Col2, Col3)
    SELECT TOP (@BatchSize) Col1, Col2, Col3
    FROM SourceTable 
    WHERE ProcessFlag = 0;
    
    UPDATE TOP (@BatchSize) SourceTable 
    SET ProcessFlag = 1 
    WHERE ProcessFlag = 0;
    
    SET @RowsProcessed = @RowsProcessed + @@ROWCOUNT;
    
    -- Allow log truncation opportunity
    CHECKPOINT;
    
    -- Brief pause to allow AG catch-up
    WAITFOR DELAY '00:00:01';
END
*/

Adjust Always On timeout settings

-- Increase session timeout for slow secondary replicas
-- Execute on primary replica
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryReplicaServerName'
WITH (SESSION_TIMEOUT = 20); -- Default is 10 seconds

-- For asynchronous replicas, verify timeout settings
SELECT 
    replica_server_name,
    availability_mode_desc,
    failover_mode_desc,
    session_timeout,
    primary_role_allow_connections_desc,
    secondary_role_allow_connections_desc
FROM sys.availability_replicas
WHERE replica_server_name != @@SERVERNAME;

Emergency: Suspend problematic replica temporarily

-- EMERGENCY USE ONLY: Temporarily suspend a replica that's causing excessive waits
-- This allows bulk operations to complete but breaks HA protection
USE master;
GO

-- Suspend data movement (run on primary)
ALTER DATABASE [YourDatabaseName] 
SET HADR SUSPEND;

-- Resume after bulk operations complete and replica catches up
-- ALTER DATABASE [YourDatabaseName] 
-- SET HADR RESUME;

-- Monitor replica status
SELECT 
    database_name,
    synchronization_state_desc,
    suspend_reason_desc
FROM sys.dm_hadr_database_replica_states drs
INNER JOIN sys.databases d ON drs.database_id = d.database_id
WHERE is_primary_replica = 0;

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Prevention

Schedule large bulk operations during maintenance windows when secondary replica latency is less critical. Configure dedicated network connections between AG replicas using separate NICs or VLAN isolation to reduce network-related delays.

Monitor secondary replica redo queue sizes and set up alerts when queues exceed 100MB or redo rates drop below expected thresholds. Implement redo thread performance monitoring on secondary replicas, particularly disk I/O latency to data and log drives.

For environments with frequent bulk operations, consider using asynchronous commit mode for secondary replicas that don't require immediate consistency. This eliminates the primary's dependency on secondary acknowledgment for bulk operation completion.

Size secondary replica hardware appropriately, ensuring disk subsystems can handle redo workload peaks. Secondary replicas often underperform on redo operations due to single-threaded log replay limitations, requiring faster storage than the primary for equivalent performance.

Implement bulk operation batching patterns in application code to reduce the size of individual BulkOp log records. Breaking large operations into smaller chunks allows better interleaving with regular OLTP operations and reduces lock hold times.

Configure AG session timeouts based on actual network latency measurements between replicas. Default 10-second timeouts may be insufficient for geographically distributed replicas or during network congestion periods.

Use SQL Server 2019 or later for improved bulk operation handling in Always On scenarios. These versions include optimizations specifically targeting HADR_BACKUP_QUEUE wait reduction through better log batching and secondary communication protocols.

Need hands-on help?

Dealing with persistent hadr_backup_queue issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.

Related Pages