Quick Answer

Always On AG latency occurs when secondary replicas fall behind the primary replica in applying transaction log records, creating data lag that can impact read operations and failover capabilities. This is typically caused by network bottlenecks, resource constraints on secondary replicas, or synchronous commit modes blocking primary transactions. High latency degrades RTO/RPO objectives and can cause application timeouts.

Root Cause Analysis

Always On Availability Groups use a log transport mechanism where the primary replica sends transaction log records to secondary replicas through dedicated database mirroring endpoints. The log capture process on the primary reads committed transactions from the transaction log and queues them for transmission. Secondary replicas receive these log records into a receive queue, then the redo thread applies changes to the secondary database.

Latency manifests in three distinct areas: send queue buildup on the primary (indicating network or secondary processing bottlenecks), receive queue accumulation on secondaries (showing local processing delays), and redo queue growth (revealing disk I/O or CPU constraints during log application). SQL Server 2016 introduced direct seeding and enabled parallel redo on secondary replicas by default, significantly improving redo throughput. SQL Server 2019 added accelerated database recovery which can reduce redo times but may increase version store overhead.

The log transport architecture operates independently per database within an AG. Synchronous commit replicas must acknowledge log hardening before primary transactions can commit, directly impacting primary performance when latency exists. Asynchronous replicas don't block primary commits but accumulate lag that affects read workloads and increases potential data loss windows. The Always On health model monitors these queues and triggers alerts when thresholds exceed configured values.

Resource pressure on secondary replicas, particularly CPU and disk I/O contention, compounds latency issues. Read workloads against secondaries compete with redo threads for resources. Network packet loss or bandwidth limitations between replicas cause retransmissions and transport delays.

Diagnostic Queries

-- Monitor AG replica synchronization health and queue sizes
SELECT 
    r.replica_server_name,
    r.availability_mode_desc,
    rs.synchronization_health_desc,
    drs.synchronization_state_desc,
    DB_NAME(drs.database_id) AS database_name,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.last_sent_time,
    drs.last_received_time,
    drs.last_hardened_time,
    drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = r.replica_id
ORDER BY drs.log_send_queue_size DESC;

-- Identify AG transport and redo thread wait statistics
SELECT 
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats 
WHERE wait_type IN (
    'HADR_LOGCAPTURE_WAIT',
    'HADR_NOTIFICATION_DEQUEUE', 
    'HADR_TIMER_TASK',
    'HADR_WORK_QUEUE',
    'HADR_TRANSPORT_SESSION',
    'HADR_SYNC_COMMIT'
)
ORDER BY wait_time_ms DESC;

-- Check for blocking and resource contention on secondary replicas
SELECT 
    s.session_id,
    s.status,
    s.blocking_session_id,
    s.wait_type,
    s.wait_time,
    s.cpu_time,
    s.logical_reads,
    s.writes,
    t.text AS current_statement
FROM sys.dm_exec_sessions s
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) t
WHERE s.database_id = DB_ID()
AND s.is_user_process = 1
AND (s.blocking_session_id > 0 OR s.wait_time > 5000)
ORDER BY s.wait_time DESC;

-- Monitor per-replica send/redo performance via sys.dm_hadr_database_replica_states
SELECT 
    r.replica_server_name,
    r.endpoint_url,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.last_sent_time,
    drs.last_received_time,
    drs.last_hardened_time,
    drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
ORDER BY drs.log_send_queue_size DESC;

-- Examine redo thread performance and blocking
SELECT 
    DB_NAME(database_id) AS database_name,
    redo_queue_size,
    redo_rate,
    CASE 
        WHEN redo_rate = 0 THEN NULL
        ELSE redo_queue_size / redo_rate 
    END AS estimated_recovery_time_sec,
    last_redone_time,
    DATEDIFF(second, last_redone_time, GETDATE()) AS seconds_behind_redo
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1 
AND redo_queue_size > 0
ORDER BY redo_queue_size DESC;

Fix Scripts

Optimize secondary replica read workload resource usage This script helps reduce resource contention between read workloads and redo threads by implementing read-only routing and resource governor controls.

-- Configure read-only routing to distribute load
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON N'SecondaryServer1'
WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SecondaryServer1.domain.com:1433'));

-- Create resource pool for read workloads to prevent redo thread starvation
CREATE RESOURCE POOL ReadOnlyPool
WITH (
    MIN_CPU_PERCENT = 0,
    MAX_CPU_PERCENT = 70,  -- Reserve 30% CPU for redo operations
    MIN_MEMORY_PERCENT = 0,
    MAX_MEMORY_PERCENT = 60
);

CREATE WORKLOAD GROUP ReadOnlyGroup
WITH (
    REQUEST_MAX_MEMORY_GRANT_PERCENT = 25,
    MAX_DOP = 4
) 
USING ReadOnlyPool;

Caveats: Test resource limits in development. Monitor redo queue sizes after implementation to ensure redo threads have sufficient resources.

Enable AG endpoint compression Reduces network bandwidth utilization between replicas. Endpoint compression is supported in SQL Server 2016+.

-- Enable compression on the database mirroring endpoint used by the AG
ALTER ENDPOINT [YourAGEndpoint]
FOR DATABASE_MIRRORING (
    COMPRESSION = ENABLED
);

-- Verify endpoint state
SELECT 
    name,
    protocol_desc,
    type_desc,
    state_desc
FROM sys.endpoints 
WHERE type = 4;  -- DATABASE_MIRRORING endpoints

Expected impact: Meaningful reduction in network traffic on compressible workloads. Slight CPU overhead for compression. Test in development to validate network and CPU impact.

Enable Accelerated Database Recovery (ADR) on secondaries Parallel redo is already enabled by default on secondary replicas since SQL Server 2016. Enabling ADR on the database can further reduce recovery time and version store scan overhead for workloads with long-running transactions.

-- Enable ADR on the AG database (SQL Server 2019+)
ALTER DATABASE [YourDatabase] 
SET ACCELERATED_DATABASE_RECOVERY = ON;

-- Observe redo progress from the secondary replica
SELECT 
    DB_NAME(drs.database_id) AS database_name,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1;

Caveats: ADR increases persistent version store usage inside the user database. Monitor PVS growth on each replica.

Configure synchronous replica timeouts and failover policies Prevents synchronous replicas from blocking primary when latency spikes occur.

-- Set reasonable session timeout for AG replicas
ALTER AVAILABILITY GROUP [YourAGName]
MODIFY REPLICA ON N'SyncSecondaryServer'
WITH (SESSION_TIMEOUT = 20);  -- 20 seconds instead of default 10

-- Configure flexible failover policy
ALTER AVAILABILITY GROUP [YourAGName]
WITH (FAILURE_CONDITION_LEVEL = 3);  -- Moderate failures trigger failover

-- Monitor timeout events
SELECT 
    DATEADD(ms, -1 * (ts_now - timestamp), GETDATE()) AS event_time,
    object_name,
    counter_name,
    cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Availability Replica%'
AND counter_name LIKE '%Flow Control Time%'
ORDER BY cntr_value DESC;

Expected impact: Reduces primary blocking during temporary network issues. May increase data loss window for synchronous replicas during extended outages.

AutoDBA generates fix scripts like these automatically, with impact estimates and rollback SQL included.

Always On AG Latency - Diagnosis & Resolution

Quick Answer

Root Cause Analysis

Diagnostic Queries

Fix Scripts

Prevention

Need hands-on help?

Related Pages

HADR_AG_MUTEX

HADR_AR_CRITICAL_SECTION_ENTRY

HADR_BACKUP_BULK_LOCK

HADR_BACKUP_QUEUE

HADR_COMPRESSED_CACHE_SYNC

HADR_DATABASE_FLOW_CONTROL