okt 11, 2017

Logistic regression is one of the most important techniques in the toolbox of the statistician and the data miner. In contrast with multiple linear regression, however, the mathematics is a bit more complicated to grasp the first time one encounters it. We’re going to gain some insight into how logistic regression works by building a model in Microsoft Excel. It is important to appreciate that our goal here is to learn about logistic regression. If all you want are logistic regression results, there are tools, including the Excel Analysis ToolPack, that will take you there directly. We are going spend more time to obtain our logistic regression by brute force, and the reward for this extra effort will be some greater understanding.

Our problem will be the simplest of all categorization problems, a binary categorization. There are only two possibilities to decide between. In our hypothetical data, this will be admitted to a hypothetical university program. Either you get in or you don’t. (For nit-pickers out there, Hypothetical U. does not have the third possibility, putting you on the waiting list.)

The first few rows of our data look like this:

Our goal is to create an equation with coefficients b_{0} through b_{4} and then plug in values for gre, gpa, and rank to predict admission. So far, it looks a lot like a good old-fashioned linear regression. But there is one very important distinction. In a linear regression, an economist might be predicting wages or prices. Here, however, we are going to predict a probability value. Our input data represents events that have already occurred, and so 0 means “didn’t get in” and 1 means “got in”. The regression equation we hope to create cannot be linear since the permissible output values must fall in the range from zero to one. Close to one means it probably will get in. Close to zero means start looking at other schools. Of course exactly one means a certainty.

To create an equation that yields such output will require several calculations. The first is the “logit”. Briefly put, the logit is a function that takes a probability of an event as input and returns the logarithm of the odds of that event as output. Trouble is, this is the related to the results we want to predict; how can we enter an Excel formula for results we don’t have yet? This is where we create our regression equation. We will have four coefficients: one constant and one coefficient for each of our three input values, gre, gpa, and rank. We’ll give these coefficients some arbitrary value (in this example 0.001) with the understanding that these starting value will be replaced by optimized values. (Just like we can give coefficients to any old line in a linear regression, and then minimize the least squares difference later.) In this example, if we do not pick a pretty small value for the coefficients, the exponentiation to follow will make our values “blow-up” and the model won’t work. We’ll have more to say about that at the end.

=$F$2 + $F$3 * $E8 + $F$4 * $F8 + $F$5 * $G8

The next column is easy; we just raise e to the power of the logit.

We then calculate the probability of the event (i.e. getting admitted).

P(X) = e

^{L}/ (1 + e^{L})

Remember that L is the logit, and right now the logit is just being estimated by the totally wrong starting values we have put in b_{0} through b_{4} (cells F2:F5 here) as nothing more than guesses.

The next column will calculate the log-likelihood. Briefly, the likelihood function calculates a probability based on the values of the input variables. The overall likelihood will be the product of individual values for each row. Using calculate the log of the likelihood function we can sum over the rows. Our best estimate of the coefficients will be those that maximize the sum of the log-likelihoods over all the rows.

Here is the sum we wish to maximize:

The only values we can change are the guesses for the parameters b0 through b_{4}. Fortunately, we don’t have to optimize them ourselves, we can call upon Solver.

Of course, Solver is an Excel add-in. Although it comes with Excel, it is provided not by Microsoft but by Frontline Systems and is a subset of their more powerful Analytic Solver add-in. Assuming the Solver add-in is already loaded, we need only go to the Data tab on the Excel ribbon and click the Solver button.

The objective is to minimize the sum we just took of the log-likelihood column. We do this by changing the values in F2:F5, representing coefficients b_{0}-b_{4}. It is very important to uncheck the box labeled Make Unconstrained Variables Non-Negative. For many of the problems given to Solver, for example, the optimization of inventory or retail shelf usage, negative numbers make no sense, so this box is checked by default. However, for our purposes, we must allow negative values for the sum of the log-likelihood.

Notice also that we are not performing a linear optimization so we ask Solver to use the GRG algorithm.

After clicking “Solve” we immediately get new values for our parameters.

What do these numbers actually do for us? Just like a linear regression, we plug them into our regression equation to predict a value. But unlike a linear regression that predicts values like wages or consumer price index, the logistic regression equation predicts probabilities. Let’s imagine a student with a GRE score of 580 and a grade-point average of 3.81 who went to a rank 1 school. We plug those numbers into our equation

P = -3.450 + 0.00229 * 580 + 0.777 * 3.81 – 0.560

The calculation yields 0.190, which is a whole lot closer to 0 than to 1. For this individual, good grades at a top-notch school don’t compensate for mucking up the GRE.

As mentioned, our goal here was to illustrate that logistic regression has features in common with linear regression, but unlike linear regression, here we are building a model based on probabilities rather than on measured values like wages or industrial output. It cannot be stressed strongly enough that the method shown here is for learning, not for real statistical calculations.

When faced with a statistical problem, it is tempting to dust off your statistical textbook and look for the formula you want. Remember that a formula is not an algorithm! Formulas are in your book for mathematical correctness; the way you should actually calculate numbers on a computer is often quite different. This is especially true for problems like we saw here that involve exponentiation that can exacerbate the problems of roundoff error. Virtually all statistical packages have tools for logistic regression. When it’s a reliable answer you need, turn to professional statistical toolkits.

An Excel workbook containing both the data and a completed logistic regression worksheet can be downloaded here.