A stockbroker once told me he had made a poor decision because he had been misled by the numbers. A curious choice of words, indeed; it was as if he were an innocent bystander who had been deceived by the nefarious digits on some spreadsheet. In any sort of analysis, the responsibility for the correct interpretation of the data represented by the numbers lies firmly upon the analyst.
This is just as important in the analysis of SQL Server performance as it is in economic analysis, the study of medical research data, or financial markets. In this post we will look at how to correctly interpret CPU wait stats to see if they really do indicate a problem and if so, what action can be taken.
In the previous instalment on monitoring SQL Server wait stats we saw the output of sys.dm_os_wait_stats. We saw that signal_wait_time_ms represented the number of milliseconds tasks had to wait for a thread after whatever resources they had been waiting for became available. If a substantial amount of time is spent waiting for a thread, then we might easily imagine that the system CPU cores are not keeping up with demand and are a performance bottleneck. This is not necessarily the case.
Some authors have correctly suggested that the total of the signal wait times is less important than the fraction of the total wait time taken up by signal waits. This is only half the picture. If you aggregate signal wait times over all waits, you will see high ratios that do not represent any sort of performance bottleneck whatsoever.
The following query was taken from a blog discussing CPU-related waits.
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS signal_cpu_waits
FROM sys.dm_os_wait_stats;
If we run this query we might see a result like this:
According to the source, this would represent a moderately high percentage. Let’s see what happens if we look at the individual signal waits.
SELECT *, CAST(100.0 * signal_wait_time_ms / NULLIF(wait_time_ms, 0)
AS DECIMAL(5,2)) AS [signal wait percent]
FROM sys.dm_os_wait_stats
ORDER BY signal_wait_time_ms DESC;
We see that the first three rows include signal wait times that are not only large, they are exactly equal to the total wait times. This behavior is common for those waits that I have called the “nothing to do” waits. Since the wait times for nothing-to-do waits are often huge, their contribution to the calculated ratio seen in the above query is disproportionately large. This might cause someone to believe there is a CPU problem when in fact there is not. Perhaps worse, any actual CPU problem would make a less significant contribution to this inflated ratio and might go unnoticed.
Note that the fourth wait type on the list is SOS_SCHEDULER_YIELD. We will consider this shortly.
The overall ratio between signal wait times and total wait times is only useful when the nothing-to-do waits have all been filtered out. An example query for accomplishing this was provided in the previous instalment and may be downloaded here.
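As a minimal sketch of the idea, the filtered ratio might be computed as follows. The excluded wait types listed here are illustrative rather than exhaustive; the downloadable query from the previous instalment is more complete.

SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS signal_cpu_waits
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
AND wait_type NOT IN (N'BROKER_TASK_STOP', N'CHECKPOINT_QUEUE',
    N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
    N'SLEEP_TASK', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
    N'WAITFOR', N'XE_TIMER_EVENT');

With the nothing-to-do waits removed, a high result is far more likely to reflect genuine pressure on the schedulers.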
We don’t have time or space right now to discuss the details of user mode scheduling (UMS) of thread execution; too bad, because it is a fascinating topic. UMS is based on the concept that applications know better than the operating system when and how to shift execution to another thread, and can therefore reduce the costly kernel context switches required by the operating system. UMS did this by manipulating the priority of the waiting threads, so that the operating system would always give a time slice to the thread SQL Server wanted to run next. UMS was introduced in SQL Server 7, and for years SQL Server was pretty much the only program to use it. However, the Windows folks at Microsoft took notice, thought it was a great idea, and with the introduction of Windows 7 and Windows Server 2008 R2 the operating system itself included a new set of API functions to provide UMS to any program that wishes to use it.
For our purposes here, the most important fact is that SQL Server’s thread schedulers always dole out a uniform timeslice of 4 milliseconds called a quantum. The threads of all but the shortest tasks will therefore relinquish the CPU and, if the system is not busy, will almost immediately reacquire it. Thus a thread yielding and another task being assigned to a thread is a common occurrence, and a large count of SOS_SCHEDULER_YIELD waits tells us nothing by itself.
Since long-running queries will hit the 4 millisecond thread execution limit often, they contribute greatly to high SOS_SCHEDULER_YIELD wait counts. As long as everything is optimized appropriately there is nothing wrong with having long-running queries; some tasks simply require more time than others.
Note that in the query results above, the ratio of signal wait to total wait is very high, yet the average wait time, i.e. the total wait divided by the number of waits, is exceptionally low. CPU availability was not a problem here, illustrating another potential danger of relying solely on ratios.
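The average wait time can be computed directly from the same DMV. A minimal sketch, guarding against a zero task count:

SELECT wait_type,
    waiting_tasks_count,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'SOS_SCHEDULER_YIELD';

A large waiting_tasks_count combined with a tiny avg_wait_ms is the signature of threads yielding and promptly reacquiring the CPU, not of CPU starvation.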
SOS_SCHEDULER_YIELD waits are a good illustration of how, ultimately, many performance conditions cannot be diagnosed on the basis of single measurements from individual tools. Imagine that the Windows performance monitor is showing consistently high CPU activity, perhaps 80-90%. A problem? We cannot tell. There may simply be a lot to do, and the CPU is humming away performing those tasks in an orderly and efficient fashion. However, the same CPU percentage might be seen in a situation where there were lots of threads ready to go but not getting a CPU slice in a timely manner. In a situation like this, it would be the combination of high CPU usage from the performance monitor (or sys.dm_os_performance_counters if you prefer) and high signal waits from sys.dm_os_wait_stats that would be a primary clue.
A THREADPOOL wait should rarely be seen on a healthy system. A THREADPOOL wait occurs when there are no threads in the thread pool available for assignment to a runnable task. This may occur if the configured maximum number of worker threads is too small for the workload. Before adjusting the configuration value, however, you should examine whether this is a common condition or has only occurred during a rare period of exceptionally high usage. The maintenance of threads comes at a cost, and the thread maximum should not be adjusted for conditions that occur rarely.
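A quick sketch of that examination: compare the accumulated THREADPOOL waits against the current worker thread ceiling.

SELECT waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'THREADPOOL';

SELECT max_workers_count
FROM sys.dm_os_sys_info;

If waiting_tasks_count is near zero, or grows only during a known burst of activity, raising the worker thread maximum is unlikely to be worth its cost.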
CXPACKET waits can occur when SQL Server is trying to synchronize multiple threads for a query with a parallel execution plan. The response to CXPACKET waits depends on the query itself. Might the query benefit from additional indexes? Are there any flaws in the query that might cause problems, such as inappropriate data types? If the query seems OK and the indexes seem suitable, it may be possible to reduce CXPACKET waits by adjusting the maximum degree of parallelism.
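If adjusting parallelism does seem warranted, it can be done server-wide through sp_configure or, less invasively, per query with a hint. The value 4 below is purely illustrative; the right setting depends on the hardware and workload.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

For a single problematic query, the OPTION (MAXDOP 4) query hint limits parallelism without affecting the rest of the server, which is usually the safer first experiment.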
Wait stats relating to CPU usage rarely provide a direct indicator of a single problem. At best, CPU wait stats can call attention to the possibility of problems, but only further investigation of the queries involved and the server workload can determine whether problems actually exist and, if so, what action is to be taken.