mediumAlways On

HADR_DB_COMMAND Wait Type Explained

Diagnose and resolve HADR_DB_COMMAND waits in SQL Server Always On Availability Groups. Includes root cause analysis, diagnostic queries, and proven fixes.

Quick Answer

HADR_DB_COMMAND waits occur when SQL Server is waiting for responses to conversational messages in Always On Availability Groups, specifically for database-level commands like synchronization, state changes, and metadata operations. These waits typically indicate communication delays between primary and secondary replicas during database operations rather than performance problems.

Root Cause Analysis

HADR_DB_COMMAND waits manifest when the primary replica sends conversational messages to secondary replicas through the Always On transport layer and must wait for acknowledgments. The Database Engine uses a request-response messaging pattern for critical database operations including log block shipping confirmations, database state transitions (online/offline/recovering), metadata synchronization, and failover coordination.

The wait occurs in the Always On worker threads that manage replica communication. When a primary replica executes operations requiring replica coordination, it sends messages through the availability group transport using TCP endpoints. The sending thread enters HADR_DB_COMMAND wait state until receiving the expected response or timing out.

Common triggers include ALTER DATABASE statements on availability databases, automatic seeding operations, database state changes during failover, and log truncation coordination. The wait duration correlates directly with network latency between replicas and the processing time on secondary replicas.

SQL Server 2016 introduced automatic seeding which increased HADR_DB_COMMAND waits during initial synchronization. SQL Server 2017 added clusterless availability groups affecting message routing patterns. SQL Server 2019 improved message batching reducing wait frequency for log shipping operations. SQL Server 2022 enhanced the transport protocol with better compression and multiplexing capabilities.

The internal message queue depth and worker thread availability directly impact wait duration. Secondary replica resource contention, particularly I/O bottlenecks during redo operations, extends response times causing longer waits on the primary.

AutoDBA checks Always On replica health monitoring, synchronization lag detection, and transport timeout configurations across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.

Diagnostic Queries

-- Current HADR_DB_COMMAND waits and their context
SELECT 
    session_id,
    wait_type,
    wait_time_ms,
    blocking_session_id,
    wait_resource,
    command,
    status
FROM sys.dm_exec_requests 
WHERE wait_type = 'HADR_DB_COMMAND'
ORDER BY wait_time_ms DESC;
-- Availability group replica synchronization states
SELECT 
    ag.name AS availability_group,
    r.replica_server_name,
    rs.synchronization_state_desc,
    rs.last_commit_time,
    rs.log_send_queue_size,
    rs.redo_queue_size,
    rs.redo_rate
FROM sys.availability_groups ag
INNER JOIN sys.availability_replicas r ON ag.group_id = r.group_id
INNER JOIN sys.dm_hadr_database_replica_states rs ON r.replica_id = rs.replica_id
ORDER BY rs.log_send_queue_size DESC;
-- Check Always On replica connectivity and error states
SELECT 
    ar.replica_server_name,
    ars.connected_state_desc,
    ars.last_connect_error_description,
    ars.last_connect_error_number,
    ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states ars
INNER JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.connected_state_desc != 'CONNECTED'
   OR ars.last_connect_error_number IS NOT NULL;
-- Historical wait statistics for HADR operations
SELECT 
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) as avg_wait_time_ms
FROM sys.dm_os_wait_stats 
WHERE wait_type LIKE 'HADR%'
ORDER BY wait_time_ms DESC;
-- Active Always On worker threads and their current operations
SELECT 
    t.session_id,
    t.task_state,
    t.context_switches_count,
    s.host_name,
    s.program_name,
    r.command,
    r.wait_type,
    r.wait_time_ms
FROM sys.dm_os_tasks t
INNER JOIN sys.dm_exec_sessions s ON t.session_id = s.session_id
LEFT JOIN sys.dm_exec_requests r ON t.session_id = r.session_id
WHERE s.program_name LIKE '%Always On%'
OR r.wait_type LIKE 'HADR%';

Fix Scripts

Increase Always On Transport Timeout Values Extends message timeout periods to accommodate slower networks or overloaded secondary replicas.

-- Increase session timeout for availability group endpoints
-- Test thoroughly in development first
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryServerName'
WITH (SESSION_TIMEOUT = 30); -- Default is 10 seconds

-- Apply to all replicas in the availability group
DECLARE @sql NVARCHAR(MAX) = '';
SELECT @sql = @sql + 'ALTER AVAILABILITY GROUP [' + ag.name + '] 
MODIFY REPLICA ON ''' + r.replica_server_name + ''' 
WITH (SESSION_TIMEOUT = 30);' + CHAR(13)
FROM sys.availability_groups ag
INNER JOIN sys.availability_replicas r ON ag.group_id = r.group_id;
PRINT @sql;
-- Execute after review: EXEC sp_executesql @sql;

Caution: Higher timeouts mask underlying network or performance issues. Monitor replica lag after implementation.

Optimize Log Transport Performance Reduces message frequency by adjusting log block size and compression settings.

-- Enable compression on availability group transport
-- Requires SQL Server 2016 SP1 or later
ALTER AVAILABILITY GROUP [YourAGName]  
MODIFY REPLICA ON 'SecondaryServerName'
WITH (SEEDING_MODE = MANUAL, -- Prevents automatic seeding overhead
      BACKUP_PRIORITY = 50,
      SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY),
      PRIMARY_ROLE(ALLOW_CONNECTIONS = ALL));

-- Check current compression status
SELECT 
    replica_server_name,
    endpoint_url,
    is_compression_enabled
FROM sys.availability_replicas r
INNER JOIN sys.database_mirroring_endpoints e ON r.endpoint_url LIKE '%' + CAST(e.port AS VARCHAR(10)) + '%';

Impact: Reduces network bandwidth usage but increases CPU overhead for compression operations.

Force Log Truncation on Overloaded Secondary Prevents log growth when secondary replicas fall behind in log processing.

-- Identify databases with large unsent log queues
SELECT 
    database_name,
    log_send_queue_size,
    redo_queue_size,
    recovery_lsn,
    last_commit_time
FROM sys.dm_hadr_database_replica_states
WHERE is_primary_replica = 1
AND log_send_queue_size > 10000; -- 10MB threshold

-- Emergency log truncation (use with extreme caution)
-- This will break the AG synchronization temporarily
-- BACKUP LOG [YourDatabase] TO DISK = 'NUL:';
-- Then re-establish AG after addressing secondary performance

Warning: Breaking log chain requires full database re-seeding to secondary replicas. Use only in emergencies.

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Prevention

Configure appropriate session timeout values based on network latency measurements between replicas, typically 20-30 seconds for WAN connections. Monitor log send queue sizes and redo queue depths continuously, alerting when queues exceed 50MB or redo falls behind by more than 5 minutes.

Size secondary replica hardware appropriately for redo workload, ensuring I/O subsystem can handle peak transaction rates from primary. Configure secondary replicas with sufficient memory for buffer pool operations and enable read-only routing only after verifying performance impact.

Implement network monitoring between replica locations, tracking packet loss, latency, and bandwidth utilization. Use dedicated network connections for availability group traffic when possible, avoiding shared connections with application traffic.

Schedule maintenance operations during low-activity periods to minimize message queue buildup. Avoid running multiple ALTER DATABASE operations simultaneously on availability group databases. Configure automatic seeding only for initial setup, then switch to manual seeding for ongoing operations.

Establish baseline metrics for normal HADR_DB_COMMAND wait times and implement alerting for deviations exceeding 200% of baseline values. Regular health checks should verify replica synchronization states and identify communication bottlenecks before they impact application performance.

Need hands-on help?

Dealing with persistent hadr_db_command issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.

Related Pages