“I know there are two functions for standard deviation and for variance in Transact-SQL, but I can never remember when to use which.”
You are so right, Amy. It is hard to know when to use STDEV or STDEVP for standard deviation, or VAR and VARP for variance, because the documentation just doesn’t explain it very clearly..
Here’s what we get if we go to the MSDN T-SQL documentation for VAR:
“Returns the statistical variance for all values in the specified expression.”
In my opinion, not only does this definition fail to shed light on the problem, strictly speaking it’s actually wrong. The MSDN description for VARP doesn’t add much more to our understanding:
“Returns the statistical variance for the population for all values in the specified expression.
Identical functions can be found in MDX and Microsoft Excel, although in Excel they are named VAR.S, VAR.P, STDEV.S and STDEV.P In the tooltip for MDX, the query language for the Analysis Services, it tells us that VARP: “Returns the variance of a numeric expression evaluated over a set, using a biased population.” A biased population? What is that supposed to mean?
STDEV and VAR do not actually return the standard deviation and the variance at all. The return estimates of the standard deviation and variance for a population. The concept of estimation is critical for understanding the difference!
Remember, the key word is estimation. There are situations where we want to calculate the standard deviation and variance of some data. As we will see, that is what STDEVP and VARP are for. There are other times where we would like the standard deviation or variance but cannot calculate it. We don’t have all the data! The best we can do is estimate the standard deviation and variance.
Let’s imagine that we decide to do a casual health study and obtain the weight of everyone at your company. Never mind how you managed to get everyone to step on the scales for you, you got the numbers. It is easy to calculate the arithmetic mean, which SQL Server calls the “average” and calculates using AVG. But which function will give us the appropriate standard deviation?
Let’s consider our population. We measured the weight of everyone at the company, every last one of them. Short, tall, chubby or thin, we got ’em all. We are not trying to estimate anything. We wanted the weight of everyone and now we have it. We are therefore directly calculating the standard deviation of the entire population of employees and will use STDEVP. The variance of our data is calculated with VARP.
A much more formidable problem. Obviously, the employees at your company may not be representative of tech employees in general, so gathering data will require some sort of plan to achieve a reliable and representative sample. Our goal here is not to discuss the intricacies of such an undertaking, but to emphasize that we cannot weigh everyone at all tech companies; we must rely on a sampling to estimate the information we want.
Once we have our weight data, we again calculate the average. Our best estimate of the average weight of the population turns out to be the same as the average of the employees we actually measured. The mean of our sample is said to be an unbiased estimator of the mean of the population.
This is not the case for the standard deviation. The standard deviation of the data we actually measured is likely to be just a tad smaller than the actual standard deviation of the population, that thing we cannot calculate exactly but are trying to estimate. Using the results from STDEVP to estimate the standard deviation of everybody’s weight would yield a biased estimate. For this, we use the T-SQL STDEV function, which is an unbiased estimator.
If things weren’t confusing enough, some authors have referred to STDEV and VAR as giving the standard deviation and variance of the “sample”. This is more than confusing; it is just plain wrong.
Rather than get confused in terminology, the most important consideration is whether your numbers represent a complete set, or whether they are being used to make estimates about a much larger set. If we are calculating the standard deviation of a sample to estimate the standard deviation of a much larger set, we are really not calculating the standard deviation of the sample at all, we are calculating an estimator. To get the best estimate for the standard deviation and variance of the population we need to calculate the unbiased estimators from the sample data. This is what we do when we use STDEV and VAR.
Widgets International is very concerned about product quality, and to help ensure product quality the widgets coming off the assembly line are exhaustively tested. Some physical measurements are made by a laser scanner and every last widget is scanned, bar none. The standard deviation of this data is calculated with STDEVP. Some of the tests are destructive, however. The resistance-to-sledge-hammering test is administered to only one widget out of twenty. The standard deviation for all widgets cannot be known exactly, but we estimate it from the measurements of smashed widgets using the STDEV function.
Non-math folks can skip to the bottom line if they wish.
Here are the formulas for the variance, standard deviation, and the corresponding unbiased estimators. Both the T-SQL function names and the Microsoft Excel function names are shown.
Note that VARP and STDEVP have “n” in their denominators, while VAR and STDEV have “n-1” . This means that the value given by VARP will always be a just a bit smaller than the value given by VAR. This also means that if the variance of a population is estimated using VARP, the estimate can be expected to be a bit less than the actual variance.
Just for fun, we can confirm this is T-SQL by querying a test table containing the integers from 1 to 100.
SELECT COUNT(X) AS Count, SUM(X) AS Sum, VAR(X) AS VAR, VARP(X) AS VARP FROM TestNumbers
We confirm that VARP is just a bit less than VAR.
What is the source of this shortfall?
To estimate the variance of a population, we first estimate its mean. Let’s assume that the mean is expected to be exactly 100. (Quite a coincidence!) There are twenty values in our sample that we are using to estimate the mean and the variance. In principle, the first value in our sample can be anything, as can the second. The mean of the first two is their sum divided in half which can also have any value. Similarly, the third sample value can be anything and the mean is that sum divided by three. In fact, the first 19 values in our sample can have any value. But there our freedom disappears. Say, for argument’s sake, that the first 19 values sum to 1950. The twentieth value must be 50; it cannot be anything else or the mean will not be 100. The twentieth value is exactly determined by the first 19 values once we have committed to a mean of 50.
Since only 19 numbers in our sample can vary, the variance of the sample will underestimate the variance of the population if we calculate using 20 rather than 19.
This concept surfaces often not only in descriptive statistics as we have seen here, but in hypothesis testing as well. Statisticians call it degrees of freedom.
The Bottom Line:
Use STDEVP and VARP when you have all the data and you want to calculate the standard deviation and variance of the data you have. Use STDEV and VAR when the data you have is a subset of the world of data you are interested in (a “sample” if you must use that word). STDEV and VAR provide unbiased estimations of the standard deviation and variance of the world of data.
If you have further questions our suggestions on using Standard Deviations, please post your comment here. I’m looking forward to hearing from you!