Quick Answer
HADR_TRANSPORT_FLOW_CONTROL occurs when Always On availability groups hit the unacknowledged message threshold between replicas, typically 1024 outstanding messages. This wait indicates network congestion, slow secondary replicas, or oversized transactions causing communication backpressure. Usually signals performance degradation rather than critical failure.
Root Cause Analysis
The Always On transport layer implements flow control to prevent memory exhaustion from unbounded message queues. Each availability replica maintains separate message channels for log records, metadata, and heartbeats. When the primary replica sends log records to secondaries, it tracks unacknowledged messages per replica connection.
The flow control threshold defaults to 1024 outstanding messages per replica. Once exceeded, the primary's log capture process blocks on HADR_TRANSPORT_FLOW_CONTROL until acknowledgments reduce the queue below threshold. This blocking affects the entire log hardening pipeline, potentially causing HADR_SYNC_COMMIT waits on synchronous replicas.
SQL Server 2016 introduced improvements to reduce false flow control triggers from network micro-disconnections. SQL Server 2019 added better telemetry through extended events. SQL Server 2022 enhanced the transport layer with optimized batching, reducing message overhead for large transactions.
The wait occurs in the context of system threads (not user connections), specifically the log capture thread and redo threads. Network latency amplifies this issue because acknowledgment round-trips take longer, keeping messages in the outstanding queue longer even under normal operation.
AutoDBA checks Always On configuration, transport latency monitoring, and replica performance metrics across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.
Diagnostic Queries
-- Check per-replica synchronization state and log send performance
-- Note: sys.dm_hadr_database_replica_states is the primary DMV for AG transport monitoring
SELECT
ar.replica_server_name,
ars.role_desc,
ars.connected_state_desc,
ars.last_connect_error_description,
DB_NAME(drs.database_id) AS database_name,
drs.synchronization_state_desc,
drs.log_send_queue_size,
drs.log_send_rate,
drs.redo_queue_size,
drs.redo_rate,
drs.last_sent_time,
drs.last_received_time,
drs.last_hardened_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
WHERE drs.is_local = 0
ORDER BY drs.log_send_queue_size DESC;
-- Identify databases with log send queue buildup
SELECT
ag.name AS availability_group_name,
ar.replica_server_name,
drs.database_name,
drs.log_send_queue_size,
drs.log_send_rate,
drs.redo_queue_size,
drs.redo_rate,
drs.last_sent_time,
drs.last_received_time,
drs.last_hardened_time,
drs.secondary_lag_seconds -- Available in SQL Server 2016 SP2+ / 2017 CU3+
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id
WHERE drs.log_send_queue_size > 10240 -- 10MB threshold
ORDER BY drs.log_send_queue_size DESC;
-- Monitor HADR wait statistics over time
SELECT
wait_type,
waiting_tasks_count,
wait_time_ms,
max_wait_time_ms,
signal_wait_time_ms,
wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'HADR%'
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
-- Check for large uncommitted transactions affecting log transport
SELECT
s.session_id,
s.login_name,
s.host_name,
s.program_name,
dt.database_transaction_begin_time,
dt.database_transaction_log_bytes_used,
dt.database_transaction_log_bytes_reserved,
at.transaction_begin_time,
DATEDIFF(second, dt.database_transaction_begin_time, GETDATE()) AS transaction_duration_seconds
FROM sys.dm_tran_active_transactions at
JOIN sys.dm_tran_database_transactions dt ON at.transaction_id = dt.transaction_id
JOIN sys.dm_exec_sessions s ON dt.database_id = DB_ID()
AND s.session_id IN (
SELECT session_id
FROM sys.dm_tran_session_transactions st
WHERE st.transaction_id = at.transaction_id
)
WHERE dt.database_transaction_log_bytes_reserved > 100000000 -- 100MB
ORDER BY dt.database_transaction_log_bytes_reserved DESC;
-- Extended events session to capture transport flow control details
CREATE EVENT SESSION [HADR_Transport_Analysis] ON SERVER
ADD EVENT sqlserver.hadr_transport_flow_control_action(
ACTION(sqlserver.database_name, sqlserver.session_id)
),
ADD EVENT sqlserver.hadr_transport_receive_log_block_timeout(
ACTION(sqlserver.database_name)
)
ADD TARGET package0.event_file(SET filename=N'C:\temp\hadr_transport.xel')
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS);
Fix Scripts
Increase transport timeout thresholds
-- Modify availability group to increase session timeout
-- This reduces false flow control triggers from brief network hiccups
-- TEST IN DEV: Monitor for increased failover times
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON 'SecondaryServerName'
WITH (SESSION_TIMEOUT = 20); -- Default is 10 seconds
Optimize log backup frequency
-- Implement frequent log backups to reduce log send queue pressure
-- This script creates a maintenance job for 2-minute log backups
-- WARNING: Adjust retention policy to handle increased backup volume
EXEC msdb.dbo.sp_add_job
@job_name = 'AG_Frequent_Log_Backup',
@enabled = 1;
EXEC msdb.dbo.sp_add_jobstep
@job_name = 'AG_Frequent_Log_Backup',
@step_name = 'Backup_Transaction_Log',
@command = '
DECLARE @sql NVARCHAR(MAX);
SELECT @sql = STRING_AGG(
''BACKUP LOG ['' + name + ''] TO DISK = ''''\\BackupServer\LogBackups\'' + name + ''_'' + FORMAT(GETDATE(), ''yyyyMMdd_HHmmss'') + ''.trn'''' WITH COMPRESSION;'',
CHAR(13)
)
FROM sys.databases
WHERE database_id > 4
AND state = 0
AND is_read_only = 0;
EXEC sp_executesql @sql;';
EXEC msdb.dbo.sp_add_schedule
@schedule_name = 'Every_2_Minutes',
@freq_type = 4,
@freq_interval = 1,
@freq_subday_type = 4,
@freq_subday_interval = 2;
EXEC msdb.dbo.sp_attach_schedule
@job_name = 'AG_Frequent_Log_Backup',
@schedule_name = 'Every_2_Minutes';
Force log record acknowledgment for stuck queues
-- Emergency script to force acknowledgment of pending log records
-- USE WITH EXTREME CAUTION: Can cause data loss if secondary is truly behind
-- Only use when secondary shows as synchronized but transport is stuck
DECLARE @database_name SYSNAME = 'YourDatabaseName';
DECLARE @sql NVARCHAR(MAX);
SET @sql = 'ALTER DATABASE [' + @database_name + '] SET HADR SUSPEND;
WAITFOR DELAY ''00:00:10'';
ALTER DATABASE [' + @database_name + '] SET HADR RESUME;';
-- Execute only after confirming secondary is healthy
PRINT 'WARNING: This will temporarily suspend data movement';
PRINT 'Confirm secondary replica is online and healthy before executing:';
PRINT @sql;
-- EXEC sp_executesql @sql; -- Uncomment after review
Network connection optimization
-- Configure TCP chimney offload and receive side scaling if available
-- This is a server-level configuration requiring restart
-- Check current network adapter settings first
EXEC sp_configure 'network packet size', 8192; -- Increase from default 4096
RECONFIGURE;
-- Document current settings for rollback
SELECT name, value, value_in_use, is_dynamic
FROM sys.configurations
WHERE name = 'network packet size';
AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.
Prevention
Configure dedicated network adapters for Always On traffic with RSS (Receive Side Scaling) enabled and interrupt moderation disabled. Set network packet size to 8192 bytes to reduce transport overhead for large log records. Monitor transport latency baselines during normal operations to establish thresholds.
Implement log backup frequencies based on log generation rate, targeting log send queue sizes under 100MB during peak periods. For high-throughput systems, consider 1-2 minute log backup intervals. Use compression for both log backups and Always On transport compression where SQL Server 2016+ Enterprise edition is available.
Size secondary replica hardware appropriately for redo rate requirements. Slow redo processing creates backpressure that manifests as transport flow control. Monitor redo queue sizes and ensure secondary storage can handle the redo workload, particularly for tempdb rebuilds and index maintenance operations.
Establish connection monitoring using availability group dashboard or custom scripts checking sys.dm_hadr_database_replica_states. Alert on transport latency exceeding 100ms consistently or outstanding message counts approaching 800 (80% of the 1024 threshold). Network path redundancy with NIC teaming helps eliminate single points of failure causing flow control events.
Need hands-on help?
Dealing with persistent hadr_transport_flow_control issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.