Linear Regression using Microsoft Excel – Part 2.
Don’t be fooled by the “Part 2”. Since we took a moment to review loading an Excel add-in, Part 2 is actually the beginning!
The first concern is locating the data that you want to analyze. It’s generally easy to identify the dependent variable. It’s the thing you are interested in. For our example, we will look at the impact of some demographic variables on wages; the dependent variable is the value of the wages themselves. The independent variables may be a little more problematic, however. Unlike some statistical programs, Excel will demand that the dependent variables be contiguous. In other words, we can examine the effects on wages of the columns B through F, but we cannot examine just columns B, D, and F. If you wanted to analyze just these three columns, you would have to move or copy them to another location where they could all be next to each other.
If you would like to analyze the wage data yourself, you can download the Excel workbook here.
The wage1_table worksheet contains a table with 24 columns of data; we’ll focus our attention on the first six. Note that we do not need to have our data in a table; a regular Excel range will do nicely. I have selected the table format pretty much because I like the colors.
Just a quick aside: Note that column F is what analysts refer to as a “dummy” variable. Linear regression applies only to numbers so if we want to analyze the effects of gender we assign a value of 0 or 1. It’s generally not a good idea to call such a column “Gender”. Later on, somebody will forget whether “1” means male or female and may misinterpret some chart on some future report. If you call the column “female”, or better yet, “IsFemale”, then it is clear to most everyone that “1” means female and “0” means male.
I can’t wait any longer to examine the data so let’s do it. Click on the “Data Analysis” button on Excel’s Data ribbon tab and choose “Regresion”. When you click “OK” the regression dialog will appear.
Since we are analyzing the effects of other variables on wages, column A, the column containing the wage values, is our dependent variable, that is to say, the “Y Range”. Sadly, we cannot enter table column names; it just doesn’t work. There are many rows, so it is probably easiest just to type the range rather than attempt to highlight it. Similarly, columns B through F represent the variables that may or may not be found to influence wages, so we enter this data as the “X Range”.
Row 2 contains the table column names, so the Labels checkbox on the Regression dialog must be checked. Generally, you will want to put the results of your analysis on a separate worksheet, and this is the default. That’s all we need to do our analysis, but a statistician will want to look at the residuals and perhaps even spend some time exploring them. We therefore check “Residuals”, “Residual Plots”, and “Line Fit Plots”. We will discuss these further in Part 3. Standardized Residuals can be valuable too, but heck, this is a blog not a textbook.
When we click OK, Excel cranks a bit and we briefly see an hourglass (or maybe not so briefly if you have a lot of data). Then a fairly crowded output window appears containing, among other things, our regression summary.
A quick glance at the summary shows us some very low p-values, which means we might actually have some interesting things to discuss in Part 3.