Quick Answer

HADR_DATABASE_FLOW_CONTROL occurs when Always On Availability Groups hit the maximum queued message limit (typically 11,000 messages) for log block transfers to secondary replicas. This indicates log generation is outpacing network transmission speed, creating a bottleneck that forces the primary replica to wait before sending additional log records.

Root Cause Analysis

The Always On log transport mechanism operates through a producer-consumer model where the log capture thread on the primary replica scans the transaction log and packages log blocks into messages for transmission to secondary replicas. Each secondary replica connection maintains a separate message queue with a hardcoded limit of 11,000 messages in SQL Server 2012 through 2022.

When log generation rate exceeds network throughput capacity, these per-replica queues fill completely. The log capture thread must then wait for acknowledgments from secondary replicas before queuing additional messages, manifesting as HADR_DATABASE_FLOW_CONTROL waits. This wait type specifically measures time spent in the flow control mechanism, not network latency itself.

SQL Server 2016 introduced improvements to message compression and batching efficiency, reducing the frequency of this wait under similar workload conditions. SQL Server 2019 enhanced the log transport protocol with better congestion detection algorithms. SQL Server 2022 maintains the same 11,000 message queue limit but optimizes message size calculations for better throughput.

The wait occurs in the context of the Always On log capture worker thread, which runs on a dedicated scheduler. Unlike typical user request waits, these waits directly impact transaction log hardening on the primary replica when synchronous commit mode is configured, potentially affecting user transaction response times.

Network bandwidth limitations, high network latency, secondary replica processing delays (due to redo bottlenecks), or sudden spikes in transaction log generation can all trigger sustained HADR_DATABASE_FLOW_CONTROL waits. The issue becomes critical when the primary replica cannot maintain its expected transaction commit rates due to backpressure from the flow control mechanism.

Diagnostic Queries

-- Check current Always On wait statistics and flow control frequency
SELECT 
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) as avg_wait_ms
FROM sys.dm_os_wait_stats 
WHERE wait_type = 'HADR_DATABASE_FLOW_CONTROL'
    AND waiting_tasks_count > 0;

-- Examine log send queue sizes and rates for each replica
SELECT 
    ar.replica_server_name,
    db_name(ars.database_id) as database_name,
    ars.log_send_queue_size,
    ars.log_send_rate,
    ars.redo_queue_size,
    ars.redo_rate,
    ars.last_sent_time,
    ars.last_received_time
FROM sys.dm_hadr_database_replica_states ars
INNER JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.is_local = 0
ORDER BY ars.log_send_queue_size DESC;

-- Monitor real-time log generation vs send rates
SELECT 
    db_name(database_id) as database_name,
    log_send_queue_size,
    log_send_rate,
    CASE 
        WHEN log_send_rate > 0 THEN log_send_queue_size / log_send_rate 
        ELSE NULL 
    END as estimated_catch_up_time_seconds
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1
    AND log_send_queue_size > 1000
ORDER BY log_send_queue_size DESC;

-- Check for secondary replica processing bottlenecks
SELECT 
    ar.replica_server_name,
    db_name(ars.database_id) as database_name,
    ars.redo_queue_size,
    ars.redo_rate,
    ars.last_redone_time,
    DATEDIFF(second, ars.last_redone_time, GETDATE()) as seconds_behind_redo
FROM sys.dm_hadr_database_replica_states ars
INNER JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.is_local = 0 
    AND ars.redo_queue_size > 0
ORDER BY ars.redo_queue_size DESC;

-- Identify active sessions generating high log volume
SELECT TOP 10
    s.session_id,
    s.login_name,
    s.program_name,
    s.host_name,
    dt.database_transaction_log_bytes_used,
    dt.database_transaction_log_bytes_reserved,
    r.command,
    r.percent_complete
FROM sys.dm_tran_database_transactions dt
INNER JOIN sys.dm_tran_session_transactions st ON dt.transaction_id = st.transaction_id
INNER JOIN sys.dm_exec_sessions s ON st.session_id = s.session_id
LEFT JOIN sys.dm_exec_requests r ON s.session_id = r.session_id
WHERE dt.database_transaction_log_bytes_used > 1048576  -- 1MB+
ORDER BY dt.database_transaction_log_bytes_used DESC;

Fix Scripts

Increase network bandwidth or optimize network path

-- Verify current network configuration and identify bottlenecks
-- This query helps identify which replicas have consistently high send queues
SELECT 
    ar.replica_server_name,
    ar.endpoint_url,
    ars.log_send_queue_size,
    ars.log_send_rate,
    CASE 
        WHEN ars.log_send_rate > 0 AND ars.log_send_queue_size > 10000 
        THEN 'Network bottleneck likely'
        ELSE 'Normal'
    END as status
FROM sys.dm_hadr_database_replica_states ars
INNER JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.is_local = 0;

Network infrastructure changes required outside SQL Server. Focus on replicas showing consistent high queue sizes with low send rates.

Temporarily switch to asynchronous commit mode during high volume periods

-- Switch problematic replicas to async mode (TEST IN DEV FIRST)
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON N'SecondaryServerName'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

-- Monitor for improvement, then switch back during low activity
-- ALTER AVAILABILITY GROUP [YourAGName]
-- MODIFY REPLICA ON N'SecondaryServerName'
-- WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

This reduces primary replica blocking but eliminates zero data loss guarantee. Plan the switchback during maintenance windows.

Optimize secondary replica redo performance

-- Enable multiple redo threads per database (SQL Server 2016+)
ALTER DATABASE [YourDatabase] SET ACCELERATED_DATABASE_RECOVERY = ON;

-- Check if trace flag 3459 helps with redo parallelism
-- DBCC TRACEON(3459, -1);  -- TEST THOROUGHLY FIRST
-- This trace flag enables parallel redo operations for Always On

Accelerated Database Recovery can significantly improve redo performance by reducing the need to roll back long-running transactions during recovery.

Monitor and batch large transactions

-- Identify and reschedule large data operations causing log spikes
-- Create a monitoring job to track transaction log usage patterns
DECLARE @LogUsageThreshold BIGINT = 100 * 1024 * 1024; -- 100MB

SELECT 
    s.session_id,
    dt.database_transaction_log_bytes_used / 1024 / 1024 as log_mb_used,
    t.text as current_statement,
    s.program_name,
    s.login_name
FROM sys.dm_tran_database_transactions dt
INNER JOIN sys.dm_tran_session_transactions st ON dt.transaction_id = st.transaction_id
INNER JOIN sys.dm_exec_sessions s ON st.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(s.most_recent_sql_handle) t
WHERE dt.database_transaction_log_bytes_used > @LogUsageThreshold;

Break large operations into smaller batches with commit points to reduce log generation spikes and give log transport time to catch up.

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

HADR_DATABASE_FLOW_CONTROL Wait Type Explained

Quick Answer

Root Cause Analysis

Diagnostic Queries

Fix Scripts

Prevention

Need hands-on help?

Related Pages

HADR_AG_MUTEX

HADR_AR_CRITICAL_SECTION_ENTRY

HADR_BACKUP_BULK_LOCK

Always On AG Latency

sys.dm_hadr_availability_replica_states