Quick Answer
Always On AG latency occurs when secondary replicas fall behind the primary replica in applying transaction log records, creating data lag that can impact read operations and failover capabilities. This is typically caused by network bottlenecks, resource constraints on secondary replicas, or synchronous commit modes blocking primary transactions. High latency degrades RTO/RPO objectives and can cause application timeouts.
Root Cause Analysis
Always On Availability Groups use a log transport mechanism where the primary replica sends transaction log records to secondary replicas through dedicated database mirroring endpoints. The log capture process on the primary reads committed transactions from the transaction log and queues them for transmission. Secondary replicas receive these log records into a receive queue, then the redo thread applies changes to the secondary database.
Latency manifests in three distinct areas: send queue buildup on the primary (indicating network or secondary processing bottlenecks), receive queue accumulation on secondaries (showing local processing delays), and redo queue growth (revealing disk I/O or CPU constraints during log application). SQL Server 2016 introduced direct seeding and enabled parallel redo on secondary replicas by default, significantly improving redo throughput. SQL Server 2019 added accelerated database recovery which can reduce redo times but may increase version store overhead.
The log transport architecture operates independently per database within an AG. Synchronous commit replicas must acknowledge log hardening before primary transactions can commit, directly impacting primary performance when latency exists. Asynchronous replicas don't block primary commits but accumulate lag that affects read workloads and increases potential data loss windows. The Always On health model monitors these queues and triggers alerts when thresholds exceed configured values.
Resource pressure on secondary replicas, particularly CPU and disk I/O contention, compounds latency issues. Read workloads against secondaries compete with redo threads for resources. Network packet loss or bandwidth limitations between replicas cause retransmissions and transport delays.
AutoDBA checks Always On AG replica health, queue sizes, and network transport performance across your entire SQL Server instance in 60 seconds. Download the free diagnostic script and see what else needs attention.
Diagnostic Queries
-- Monitor AG replica synchronization health and queue sizes
SELECT
r.replica_server_name,
r.availability_mode_desc,
rs.synchronization_health_desc,
drs.synchronization_state_desc,
DB_NAME(drs.database_id) AS database_name,
drs.log_send_queue_size,
drs.log_send_rate,
drs.redo_queue_size,
drs.redo_rate,
drs.last_sent_time,
drs.last_received_time,
drs.last_hardened_time,
drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = r.replica_id
ORDER BY drs.log_send_queue_size DESC;
-- Identify AG transport and redo thread wait statistics
SELECT
wait_type,
waiting_tasks_count,
wait_time_ms,
max_wait_time_ms,
signal_wait_time_ms,
wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (
'HADR_LOGCAPTURE_WAIT',
'HADR_NOTIFICATION_DEQUEUE',
'HADR_TIMER_TASK',
'HADR_WORK_QUEUE',
'HADR_TRANSPORT_SESSION',
'HADR_SYNC_COMMIT'
)
ORDER BY wait_time_ms DESC;
-- Check for blocking and resource contention on secondary replicas
SELECT
s.session_id,
s.status,
s.blocking_session_id,
s.wait_type,
s.wait_time,
s.cpu_time,
s.logical_reads,
s.writes,
t.text AS current_statement
FROM sys.dm_exec_sessions s
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) t
WHERE s.database_id = DB_ID()
AND s.is_user_process = 1
AND (s.blocking_session_id > 0 OR s.wait_time > 5000)
ORDER BY s.wait_time DESC;
-- Monitor per-replica send/redo performance via sys.dm_hadr_database_replica_states
SELECT
r.replica_server_name,
r.endpoint_url,
DB_NAME(drs.database_id) AS database_name,
drs.synchronization_state_desc,
drs.log_send_queue_size,
drs.log_send_rate,
drs.redo_queue_size,
drs.redo_rate,
drs.last_sent_time,
drs.last_received_time,
drs.last_hardened_time,
drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
ORDER BY drs.log_send_queue_size DESC;
-- Examine redo thread performance and blocking
SELECT
DB_NAME(database_id) AS database_name,
redo_queue_size,
redo_rate,
CASE
WHEN redo_rate = 0 THEN NULL
ELSE redo_queue_size / redo_rate
END AS estimated_recovery_time_sec,
last_redone_time,
DATEDIFF(second, last_redone_time, GETDATE()) AS seconds_behind_redo
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1
AND redo_queue_size > 0
ORDER BY redo_queue_size DESC;
Fix Scripts
Optimize secondary replica read workload resource usage This script helps reduce resource contention between read workloads and redo threads by implementing read-only routing and resource governor controls.
-- Configure read-only routing to distribute load
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON N'SecondaryServer1'
WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SecondaryServer1.domain.com:1433'));
-- Create resource pool for read workloads to prevent redo thread starvation
CREATE RESOURCE POOL ReadOnlyPool
WITH (
MIN_CPU_PERCENT = 0,
MAX_CPU_PERCENT = 70, -- Reserve 30% CPU for redo operations
MIN_MEMORY_PERCENT = 0,
MAX_MEMORY_PERCENT = 60
);
CREATE WORKLOAD GROUP ReadOnlyGroup
WITH (
REQUEST_MAX_MEMORY_GRANT_PERCENT = 25,
MAX_DOP = 4
)
USING ReadOnlyPool;
Caveats: Test resource limits in development. Monitor redo queue sizes after implementation to ensure redo threads have sufficient resources.
Enable AG endpoint compression Reduces network bandwidth utilization between replicas. Endpoint compression is supported in SQL Server 2016+.
-- Enable compression on the database mirroring endpoint used by the AG
ALTER ENDPOINT [YourAGEndpoint]
FOR DATABASE_MIRRORING (
COMPRESSION = ENABLED
);
-- Verify endpoint state
SELECT
name,
protocol_desc,
type_desc,
state_desc
FROM sys.endpoints
WHERE type = 4; -- DATABASE_MIRRORING endpoints
Expected impact: Meaningful reduction in network traffic on compressible workloads. Slight CPU overhead for compression. Test in development to validate network and CPU impact.
Enable Accelerated Database Recovery (ADR) on secondaries Parallel redo is already enabled by default on secondary replicas since SQL Server 2016. Enabling ADR on the database can further reduce recovery time and version store scan overhead for workloads with long-running transactions.
-- Enable ADR on the AG database (SQL Server 2019+)
ALTER DATABASE [YourDatabase]
SET ACCELERATED_DATABASE_RECOVERY = ON;
-- Observe redo progress from the secondary replica
SELECT
DB_NAME(drs.database_id) AS database_name,
drs.redo_queue_size,
drs.redo_rate,
drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1;
Caveats: ADR increases persistent version store usage inside the user database. Monitor PVS growth on each replica.
Configure synchronous replica timeouts and failover policies Prevents synchronous replicas from blocking primary when latency spikes occur.
-- Set reasonable session timeout for AG replicas
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON N'SyncSecondaryServer'
WITH (SESSION_TIMEOUT = 20); -- 20 seconds instead of default 10
-- Configure flexible failover policy
ALTER AVAILABILITY GROUP [YourAGName]
WITH (FAILURE_CONDITION_LEVEL = 3); -- Moderate failures trigger failover
-- Monitor timeout events
SELECT
DATEADD(ms, -1 * (ts_now - timestamp), GETDATE()) AS event_time,
object_name,
counter_name,
cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Availability Replica%'
AND counter_name LIKE '%Flow Control Time%'
ORDER BY cntr_value DESC;
Expected impact: Reduces primary blocking during temporary network issues. May increase data loss window for synchronous replicas during extended outages.
AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.
Prevention
Configure Always On dashboard monitoring with custom thresholds for send queue (>1GB), redo queue (>500MB), and RPO lag (>60 seconds). Implement dedicated network infrastructure for AG traffic with sufficient bandwidth, preferably 10Gbps or higher for production workloads. Use separate subnets or VLANs to isolate AG replication traffic from application traffic.
Size secondary replica hardware appropriately, particularly disk subsystem performance for redo operations. Secondary replicas require similar disk I/O capabilities to the primary for large transaction workloads. Configure tempdb properly on secondaries when using readable secondaries, as read workloads generate tempdb activity competing with redo operations.
Establish baseline performance metrics for send rates, redo rates, and network latency between replicas. Monitor these continuously through SQL Server Agent jobs or external monitoring solutions. Create alerts for queue size thresholds and sustained latency periods exceeding RTO objectives.
Consider geographic placement carefully for disaster recovery replicas. Cross-datacenter latency adds unavoidable delays to synchronous replication. Use asynchronous replicas for distant locations unless regulatory requirements mandate synchronous replication with corresponding performance trade-offs.
Implement maintenance procedures that account for AG topology. Index rebuilds and large data modifications generate substantial log activity that amplifies transport latency. Schedule maintenance during low-activity periods and consider temporarily modifying replication modes during intensive operations.
Need hands-on help?
Dealing with persistent always on ag latency issues across your environment? Samix Technology provides hands-on SQL Server performance consulting with 15+ years of production DBA experience.