Amy P., a student in Learning Tree’s 535 Class, Developing High Performance SQL Server Databases, asks “There are so many waits and so many different wait types, I’m confused.” Well, actually, that’s a statement, not a question, but I know what you mean. You get the impression that SQL Server wait stats can be an important tool, but it’s not easy to know where to start.
All SQL Server admins know that SQL Server report a great many performance counters to the Windows performance monitor. Very often, recording individual values over time is less valuable than calculating aggregate results. Wait stats are simply the aggregated values of virtually all the times any SQL Server process has had to wait. Every time SQL Server waits for anything, the wait statistics are updated. You can query these values with the dynamic management view (DMV) sys.dm_os_wait_stats, and it’s reassuring to know that, since the statistics are being gathered anyway, checking these statistics puts no extra load on your system as does profiling. Since the wait stats are already there, you might as well use them.
Wait stats are not persistently stored; the values provided by sys.dm_os_wait_stats represents the data collected since the last time SQL Server was started.
If you wish, you can reset the values by invoking DBCC SQLPERF (‘sys.dm_os_wait_stats’, CLEAR). In the output, wait_time_ms is the total number of milliseconds spent waiting; signal_wait_time_ms is the time spent waiting for a thread of execution once a resource has become available. For those wait stats that involve waiting for resources, the total time spent waiting for the resource is the total time minus the signal wait time.
Signal wait time can therefore tell us about CPU usage. Because CPU usage is both important and complex, we will consider it as a separate topic in the follow-up blog. Here we will consider waiting for other resources, like memory and hard drives.
I’ve read widely varied estimates of how many different wait stat types there are. I count 403 in the MSDN documentation for sys.dm_os_wait_stats, but on my test SQL Server 2012 649 wait types are returned when I actually query sys.dm_os_wait_stats. Many of these wait stats are likely to be irrelevant to your work. If you are not using full-text search, you won’t have to worry about FT waits. Similarly, if you are not using Service Broker, SB waits will be of no concern. For some waits, including some associated with FT and SB, you will see large numbers that are essentially proportional to the amount of time your server has been up and running. These are generally background tasks that are waiting because they have nothing else to do.
The sql query files for this blog exclude waits for things like the service broker, but it is trivially easy to comment out those lines if you want to monitor SB.
As is usually the case with performance tuning, gathering the data is only step one. The real trick is interpreting the data in a way that directly leads to improvement. The first thing to appreciate is that it is not necessarily bad if SQL Server has to wait. In this sense, waits are somewhat analogous to blocks, which are much more familiar to most SQL Server DBAs. Waits, like blocks, are a natural part of the operation of any database. What we are looking for is not waits themselves, but waits that are too numerous or too long to be normal. We are seeking the pathological.
There are two sets of wait types that are unlikely to be helpful. One consists of the waits that occur during system startup and do not affect a running server. The second set are those processes that spend most of their time waiting simply because there is nothing for them to do. Many normal background tasks can spend a lot of time waiting for something to do; the checkpoint and the lazywriter are just two examples. If there is no need to flush pages out of the buffer,, then the lazywriter will spend all of its time waiting. We already know this and do not need to monitor it.
Clearly, we do not wish to clutter our output with waits that have occurred, but which will not lead us to any sort of performance-improving action. While we could filter these out directly in a WHERE clause, if we plan to run wait stats queries on a regular basis it would be beneficial to create a common table expression (CTE) that we can copy and paste, or, better yet, a view.
Both an example CTE and a CREATE VIEW statement is provided for download. This file also includes comments on many of the wait stats that are being ignored so you can fully appreciate exactly what is being ignored and why. Because the list of ignored wait types is quite long, we won’t show the SQL here. As mentioned, some processes like CHECKPOINT and LAZYWRITER normally spend much of their time waiting, so the CHECKPOINT_QUEUE and LAZYWRITER_SLEEP waits cansafely be ignored..
Some wait stats have been deprecated by Microsoft. These are unsupported and there is no point in monitoring them, so they are also excluded in the queries for this blog.
As is so often the case, our main concerns are IO, memory, transactions, and CPU time. Here we will focus on the first three and leave CPU waits until next time. Here a some of the most important wait types for you to be looking at.
Wait Type Relevant to Memory:
LOWFAIL_MEMMGR_QUEUE indicates waiting for memory to be made available. Clearly, this can reflect a system bound by insufficient RAM.
Wait Types Relevant to IO
PAGEIOLATCH% There a number of wait types that start with PAGEIOLATCH. Briefly, these waits indicate that a page is latched while SQL Server reads it from the hard drive into the RAM. Although this is an IO wait, a large number of waits may be an indication of inadequate RAM or poorly constructed queries rather than by a problem with the disk subsystem. In contrast, if the wait times are longer on average, the the disk subsystem may be inadequate.
IO_COMPLETION indicates a wait for IO to complete (obviously), but waits for data pages are generally represented as PAGEIOLATCH wait types. IO_COMPLETION indicates a wait for non-data-page IO.
Wait Types Relevant to the Transaction Log:
LOGMGR_RESERVE_APPEND indicates that SQL Server is waiting to see if free space can be made available in the log. This might indicate that the current log file size is inadequate.
LOGBUFFER indicates a wait for free space in the log buffer, which may mean that the disk subsystem for the transaction log may not be able to keep up with the workload
Wait Types Relevant to Locking:
LCK_M% The many lock waits are useful primarily to provide direction for more research, Since, as you know, waiting for locks is normal, it is not always easy to identify a potential problem from lock waits. As is often the case, noticing changes is important; are locks more numerous now or are they last longer than last month? It is also valuable to keep an eye on the maximum lock waits. Are there long-running queries that are holding locks for an excessive amount of time?
If better performance is our goal, we want to focus on long waits and pathological waits. For this reason, we will invariably want to sort our results in descending order by wait_time_ms. It might be very helpful to calculate the mean wait time by dividing the total wait time (wait_time_ms) by the number of wait events (waiting_tasks_count). Some authors calculate a percent of total wait in similar queries, but this value will not provide any additional information to help you decide which waits are the important ones to follow up.
Ironically, one of our conclusions is that we can draw no conclusions from wait stats. While it is hard to imagine a performance tuning approach that does not include watching wait stats, all we are really doing is keeping an eye out for unusual numbers. Spotting potential problems in the wait stats is the beginning of a tuning process, not the end; wait stats results are indicators about which system functions might need further investigation. In other words, wait stats are symptoms. As with all symptoms, a thorough examination is necessary before pronouncing a diagnosis and prescribing a treatment.
Stay tuned for my next post where we will look at how to interpret CPU-related wait times.