
The Wait Stats Everyone Ignores (And the Ones That Get Too Much Attention)

Evan Barke · DBA & Founder

Wait statistics are the first thing every DBA checks when investigating a performance problem. They should be. Waits tell you what SQL Server is spending its time waiting for, which is the most direct path to understanding what's actually slow.

But here's the dirty secret of wait stats analysis: most of the conventional wisdom about which waits matter is either oversimplified or outright wrong. Some waits that look alarming are completely harmless. Some waits that nobody pays attention to are silently destroying your throughput. And the way most people rank waits, by total wait time, hides more than it reveals.

Waits That Get Too Much Attention

CXPACKET. This is probably the most over-diagnosed wait type in SQL Server. Every blog post, every community forum, every monitoring tool flags CXPACKET as a problem. "You have high CXPACKET waits, you need to adjust your parallelism settings."

Here's the thing: CXPACKET is a coordination wait. It means a parallel query has threads that finished their work and are waiting for other threads to catch up. Some amount of CXPACKET is completely normal on any system that runs parallel queries. If your server has 64 cores and runs reporting workloads, CXPACKET will be near the top of your wait stats. That's expected behaviour, not a problem.

CXPACKET becomes a problem only when it's accompanied by other symptoms: queries running slow, CPU fully utilized, worker threads exhausted. By itself, high CXPACKET just means "parallel queries are running." The useful signal is the wait time per execution, not the total accumulated time.

SQL Server 2016 SP2 and 2017 CU3 split the old CXPACKET wait in two: CXPACKET now covers the actionable producer-side waits, while CXCONSUMER covers the benign waits of consumer threads sitting idle while producer threads feed them rows. If CXCONSUMER is high and CXPACKET is low, your parallel queries are almost certainly fine. If CXPACKET itself is high, that's when you investigate parallelism settings (MAXDOP and cost threshold for parallelism).
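To see how the two sides compare on your own instance, a quick check against sys.dm_os_wait_stats (a sketch; both wait types appear there by name on versions that have the split):

```sql
-- Compare producer-side (CXPACKET) vs consumer-side (CXCONSUMER) waits
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('CXPACKET', 'CXCONSUMER');
```

If CXCONSUMER dominates, parallelism is behaving normally and there is nothing to tune.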

ASYNC_NETWORK_IO. This wait means SQL Server has results ready to send to the client, but the client isn't consuming them fast enough. When you see this at the top of your wait list, the instinct is to blame the network. It's almost never the network.

The overwhelming majority of ASYNC_NETWORK_IO waits are caused by client applications doing row-by-row processing of result sets. The application opens a query that returns 100,000 rows and processes each row with application logic before fetching the next one. SQL Server has the next batch of rows ready, but the client is busy doing work on the current batch.

The fix isn't a network upgrade. The fix is client-side: fetch all the data first, then process it. Or better yet, do the processing in SQL. But this is an application problem, not a SQL Server problem, and no amount of server-side tuning will fix it.
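If you want to confirm which application is responsible, here's a sketch that lists sessions currently stuck on this wait, along with the client program and host they came from:

```sql
-- Sessions waiting on ASYNC_NETWORK_IO right now, with the offending client
SELECT
    r.session_id,
    r.wait_time AS wait_time_ms,
    s.program_name,
    s.host_name,
    t.text AS query_text
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.wait_type = 'ASYNC_NETWORK_IO';
```

The program_name and host_name columns usually point straight at the row-by-row consumer.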

SOS_SCHEDULER_YIELD. This wait indicates that a thread used its full quantum (4ms) of CPU time and yielded to let other threads run. It's often flagged as "CPU pressure." It can indicate CPU pressure, but it can also just mean that you have compute-intensive queries running on a busy server. The scheduler yield mechanism is SQL Server doing its job, ensuring no single thread monopolizes the CPU.

True CPU pressure shows up as a pattern: high SOS_SCHEDULER_YIELD combined with high signal wait times (the time a thread spends in the runnable queue waiting for CPU). If your signal waits are above 20-25% of total waits, you have CPU pressure. If SOS_SCHEDULER_YIELD is high but signal waits are low, you just have busy queries.
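Another way to confirm the distinction is to look at the runnable queue directly. A sketch using sys.dm_os_schedulers:

```sql
-- Tasks queued per scheduler; sustained non-zero runnable_tasks_count
-- across schedulers means threads are waiting for CPU
SELECT
    scheduler_id,
    current_tasks_count,
    runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';
```

A busy server with runnable_tasks_count at or near zero has busy queries, not CPU pressure.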


Waits That Deserve More Attention

RESOURCE_SEMAPHORE. This wait means a query requested a memory grant and couldn't get one. It's sitting in a queue waiting for memory to become available before it can start executing. This is one of the most impactful waits you can have because queries don't just run slowly while waiting, they don't run at all. They're queued.

The frustrating thing about RESOURCE_SEMAPHORE is that it doesn't show up as a high total wait time because it affects fewer queries. A server might have 100,000 CXPACKET waits per day and 50 RESOURCE_SEMAPHORE waits. But those 50 waits might represent queries that sat in the queue for 30 seconds each, which is 25 minutes of total user-facing delay. If those are customer-facing queries, that's a far worse problem than the CXPACKET.

Check for it specifically:

SELECT
    resource_semaphore_id,
    waiter_count,
    grantee_count,
    granted_memory_kb,
    available_memory_kb
FROM sys.dm_exec_query_resource_semaphores;

-- If any of these show non-zero waiter_count, you have active memory grant waits
-- Check what's consuming the grants:
SELECT
    mg.session_id,
    mg.requested_memory_kb / 1024 AS requested_mb,
    mg.granted_memory_kb / 1024 AS granted_mb,
    mg.wait_time_ms,
    t.text AS query_text
FROM sys.dm_exec_query_memory_grants mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) t
WHERE mg.grant_time IS NULL  -- waiting for grant
ORDER BY mg.wait_time_ms DESC;

The fixes for RESOURCE_SEMAPHORE are varied: increase max server memory, fix queries that request oversized grants (often caused by bad cardinality estimates), enable Resource Governor to limit grant sizes, or add memory to the server. But you can't fix what you don't notice, and most wait stats dashboards bury this one.
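For a single query that asks for an oversized grant, SQL Server 2012 SP3 and later also support capping the grant with a query hint. A hypothetical example (dbo.Orders and its columns are placeholders, not from this article):

```sql
-- Cap this query's memory grant at 10% of available workspace memory
SELECT CustomerID, SUM(TotalDue) AS revenue
FROM dbo.Orders
GROUP BY CustomerID
ORDER BY revenue DESC
OPTION (MAX_GRANT_PERCENT = 10);
```

This is a band-aid for a misbehaving estimate, not a substitute for fixing the estimate, but it can stop one query from starving the rest of the workload.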

WRITELOG. Every transaction commit requires a log write, and WRITELOG is the wait incurred during that write. On OLTP systems doing thousands of transactions per second, even small increases in WRITELOG wait time have an outsized impact because every single write operation is affected.

A healthy WRITELOG wait time is under 1ms average. When it climbs to 2-3ms, you probably won't notice. At 5-10ms, your batch operations start slowing down. At 20ms+, everything that writes data is in trouble.

The cause is almost always IO latency on the log drive. The fix is faster storage for your transaction log. Put the log on the fastest storage you have, ideally dedicated NVMe. No amount of query tuning helps if your log writes are slow because every query that modifies data waits for the log.

What makes WRITELOG tricky is that it's often masked by other waits. A slow log makes everything that writes data slower, which can manifest as higher latch waits, lock waits, or even CXPACKET waits as parallel insert operations wait for log flushes. You chase the downstream symptoms and miss the root cause.

-- Check current WRITELOG wait times
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';

-- Cross-reference with IO latency on the log file
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.name AS file_name,
    mf.type_desc,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
    ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
WHERE mf.type_desc = 'LOG'
ORDER BY avg_write_latency_ms DESC;

PAGEIOLATCH_SH / PAGEIOLATCH_EX. These waits actually do get attention, but they're often misdiagnosed. The standard interpretation is "you have slow storage." Sometimes that's true. Often the real problem is insufficient memory.

PAGEIOLATCH waits occur when SQL Server needs a data page that isn't in the buffer pool and has to read it from disk. If your buffer pool is too small for your working set, SQL Server constantly evicts pages and reads them back in. The wait type says IO, but the root cause is memory.

Before blaming your storage, check page life expectancy (PLE) and buffer pool hit ratio. If PLE is low (under 300 seconds per 4GB of buffer pool is a reasonable baseline) and your hit ratio is below 95%, you likely have a buffer pool sizing problem. Adding memory or increasing max server memory will reduce these waits more effectively than faster disks.
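PLE is exposed through the Buffer Manager performance counters. A quick check (on NUMA machines, also look at the per-node Buffer Node counters, since one starved node can hide behind a healthy average):

```sql
-- Page life expectancy in seconds, straight from the counters DMV
SELECT counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```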

LCK_M_S, LCK_M_X, LCK_M_U (lock waits). Lock waits are often brushed off as "that's just how databases work." But persistent lock waits usually indicate a design problem. Either transactions are held open too long, or access patterns are creating avoidable contention.

The one that bothers me most is LCK_M_S (shared lock waits). Shared locks should be compatible with other shared locks. If you're seeing high LCK_M_S waits, it means read queries are being blocked by write transactions. On most OLTP systems, this is fixable with read committed snapshot isolation (RCSI):

-- Requires exclusive access; run in a maintenance window, or append
-- WITH ROLLBACK IMMEDIATE to roll back open transactions instead of waiting
ALTER DATABASE [YourDatabase] SET READ_COMMITTED_SNAPSHOT ON;

With RCSI, readers don't block writers and writers don't block readers. The trade-off is a modest increase in tempdb usage for version storage. On modern hardware, that trade-off is almost always worth it. I've seen systems where enabling RCSI cut average query latency by 40% because readers were no longer queuing behind writers.

Yet I still encounter production systems every month where RCSI is disabled and lock waits are in the top 5. It's one of those changes that feels scary ("we're changing the isolation level!") but the risk is low and the benefit is often dramatic.
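Checking whether RCSI is enabled takes one query, so there's no excuse for not knowing where you stand:

```sql
-- RCSI status for every user database
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE database_id > 4;  -- skip the system databases
```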

The Ranking Problem

Most wait stats analysis starts with "show me the top waits by total wait time." This is a reasonable starting point, but it's biased toward high-frequency, low-impact waits. A wait that occurs a million times at 1ms each (total: 1,000 seconds) will rank above a wait that occurs 100 times at 5 seconds each (total: 500 seconds). But those 100 five-second waits probably represent user-visible timeouts while the million one-millisecond waits are invisible background noise.

A better approach is to look at waits from two angles:

Total impact (total wait time). This tells you where the server is spending time overall. Useful for capacity planning and identifying systemic issues.

Per-occurrence severity (average wait time). This tells you how bad each individual wait event is. High average wait times point to specific, acute problems that affect individual queries.

-- Both perspectives in one query
SELECT
    wait_type,
    waiting_tasks_count AS occurrences,
    wait_time_ms AS total_wait_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms,
    signal_wait_time_ms,
    signal_wait_time_ms * 100.0 /
        NULLIF(wait_time_ms, 0) AS signal_pct
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
    AND wait_type NOT IN (
        'SLEEP_TASK', 'BROKER_TO_FLUSH', 'SQLTRACE_BUFFER_FLUSH',
        'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT', 'LAZYWRITER_SLEEP',
        'CHECKPOINT_QUEUE', 'WAITFOR', 'XE_TIMER_EVENT',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'BROKER_TASK_STOP',
        'XE_DISPATCHER_WAIT', 'BROKER_EVENTHANDLER',
        'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'DIRTY_PAGE_POLL',
        'HADR_FILESTREAM_IOMGR_IOCOMPLETION', 'SP_SERVER_DIAGNOSTICS_SLEEP',
        'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', 'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP'
    )
ORDER BY avg_wait_ms DESC;

Sort by avg_wait_ms instead of total_wait_ms and you'll see a completely different picture. The waits at the top of this list are the ones causing individual queries to run slow. That's usually what your users are actually complaining about.


Signal Waits: The Metric Inside the Metric

Every wait stat has two components: resource wait time (waiting for the resource) and signal wait time (waiting for CPU after the resource became available). The signal_wait_time_ms column tells you how long threads spent in the runnable queue waiting for a CPU to become free.

If signal waits are consistently above 20% of total wait time, you have CPU pressure regardless of what the top wait types are. A server might show PAGEIOLATCH as the top wait, but if signal waits are 30%, the real bottleneck is CPU. Queries are waiting for IO, then when the IO completes, they're waiting again for a CPU to resume execution.

This is one of the most commonly missed insights in wait stats analysis. Everyone looks at the wait types but not at the signal component. Two servers can have identical wait type rankings but completely different bottlenecks depending on their signal wait percentages.
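The overall signal percentage takes one aggregate query (a sketch over sys.dm_os_wait_stats; for a cleaner number, exclude the same idle wait types filtered in the ranking query above):

```sql
-- Share of all wait time spent in the runnable queue waiting for CPU
SELECT
    SUM(signal_wait_time_ms) AS total_signal_ms,
    SUM(wait_time_ms) AS total_wait_ms,
    SUM(signal_wait_time_ms) * 100.0 /
        NULLIF(SUM(wait_time_ms), 0) AS signal_pct
FROM sys.dm_os_wait_stats;
```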

Stop Accumulating, Start Sampling

One more thing. The sys.dm_os_wait_stats view is cumulative since the last server restart (or the last time someone ran DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)). If your server has been running for six months, you're looking at six months of accumulated waits. Those weeks of accumulated CXPACKET wait time might have come from a one-off migration three months ago and haven't been relevant since.

To get actionable data, you need to sample. Capture wait stats at two points in time and look at the delta:

-- Snapshot 1: capture into a temp table
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #waits_before
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0;

-- Wait for your sample period (5 minutes, 1 hour, etc.)
-- Then take snapshot 2 and compare:
SELECT
    a.wait_type,
    b.waiting_tasks_count - a.waiting_tasks_count AS delta_count,
    b.wait_time_ms - a.wait_time_ms AS delta_wait_ms,
    (b.wait_time_ms - a.wait_time_ms) /
        NULLIF(b.waiting_tasks_count - a.waiting_tasks_count, 0) AS avg_wait_ms
FROM #waits_before a
JOIN sys.dm_os_wait_stats b ON a.wait_type = b.wait_type
WHERE b.waiting_tasks_count > a.waiting_tasks_count
ORDER BY delta_wait_ms DESC;

This gives you waits from the last N minutes, not the last N months. Much more useful for diagnosing current issues.

This is one of the things we do automatically in AutoDBA's analysis. The diagnostic captures a point-in-time snapshot and the AI correlates waits with other metrics from the same window: memory pressure, IO latency, CPU utilization, active queries. The result is wait stats in context, which is where the real insights are. Waits in isolation tell you what's happening. Waits in context tell you why.
