jan 18, 2015

**Linear Regression using Microsoft Excel – Part 3**

**Interpreting the Results of a Linear Regression**

At first glance, the summary report for an Excel Linear Regression might seem to be a hodgepodge of cryptic numbers. In fact, the basic Excel summary report is pretty typical.

The first thing we might take a look at is the value of the square of the R-value, in this case 0.36363. This tells us that of the variability in data, about 36% can be explained by the values of our independent variables.

In the ANOVA section, the teeny-tiny value for the significance of the F statistic tells us that we can bet next month’s rent on the fact that *something* is going on in our data. Unfortunately, at this point we don’t know what it is or how important it might be. It is the regression equation described in the table of coefficients, that begins to help us understand what the data might mean.

We observe, for instance, that for each additional year of education the wage increases, on average, about $0.57. Similarly, the females in the present study earned about $1.81 less per hour than the males. We might also note that the lower wage earned by nonwhites is not statistically significant. Can we conclude that race or ethnic group has no effect on earnings? Absolutely not. In the present data, about half the people were female while only about 10% were nonwhite. There may simply not be enough data for a significant difference in the population to be found to be *statistically* significant in this particular study. There’s an important lesson here: Excel provides the numbers, but the interpretation of those numbers is still up to you.

**How About Those Residuals?**

Part of any reasonable data analysis is examining the ways in which our analysis might have gone astray. When we do a regression, we are trying to find an equation to predict values. The differences between the predicted values and the actual observed values is what we mean when we talk about the * residuals*. Examining the residuals provides diagnostic information that might inform us about potential problems in our analysis. Excel provides some very basic information about the residuals to get us started.

**Interpreting a Residuals Plot**

Here we see Excel’s residuals plot for education.

In a residuals plot, the horizontal line at “0” on the Y-axis represents the predicted value. If our predictions were perfect, then all the actual data would lie on this line. Points above and below this line indicate that our prediction is not perfect. No big surprise there, so what, exactly are we looking for? Ideally, the errors illustrated by the residual plot should be * normally distributed*. Most points should be close to the line; further away from the line there should be fewer points. But a normal distribution is symmetric, so the residuals should also be symmetric about the zero line. We see that they are not, but this is to be expected. There is no set limit to hourly wage on the high end, but a fixed limit of nothing on the low end. Nobody pays their employer for the privilege of going to work. Nevertheless, it does mean that our errors are not normally distributed and we must show caution.

We also observe that the variability among the data point is greater the further to the right we look at the chart. In other words, the variance of the data increases with the years of education. This is what the statistics folks refer to as heteroskedasticity. (Why use a simple word when a complicated one will do?) The failure of our errors to be normally distributed and the failure of the variance to be constant across different levels of education are two ways in which our data violates the mathematical assumptions of linear regression.

There are other potential pitfalls such as collinearity, but Excel does not provide the necessary diagnostic information to determine if this is a problem. Ultimately, Excel provides a great way to get started with linear regression, but there is a bigger world of analysis still waiting to be explored.