Basic Stats, Advanced Stats, and other stuff: Why is it important to get the 0 and 1 right in a logistic regression?

Return to FAMQ

(WARNING: This post gets into some advanced mathematics, and the average reader is not expected to understand everything in it. :-)

Which category gets assigned to 0 and which to 1 is sometimes important. Not for modeling purposes, but for interpretation of the final score. This is particularly true when logistic regression is the modeling tool. As experienced modelers know, the only effect of swapping the 0 and 1 definitions is that the sign of every term in the logit equation, including the constant, is reversed.
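The sign reversal can be checked numerically. The sketch below is only an illustration: it borrows the two-variable example coefficients used later in this post (where 1 = Good), and the helper name `sigmoid` is introduced here for convenience.

```python
import math

def sigmoid(z):
    """Convert a logit (log odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Example coefficients from this post, with 1 = Good account.
K, b1, b2 = 0.2, 0.8, -5.5
x1, x2 = 5, 1  # 5 trade lines, 1 delinquent account

logit_good = K + b1 * x1 + b2 * x2        # equation when 1 = Good
logit_bad = -K + (-b1) * x1 + (-b2) * x2  # same equation with 0 and 1 swapped

p_good = sigmoid(logit_good)
p_bad = sigmoid(logit_bad)

# The two probabilities are complements: swapping 0 and 1 only flips signs.
print(round(p_good, 3), round(p_bad, 3))  # prints 0.214 0.786
```

Nothing about the fit changes; only the direction in which the score is read.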

The usual objective in modeling a binary target is to estimate the probability that a 1 occurs. This is normally done using logistic regression, which estimates the probability of a 1 through the log odds (logit) equation. The logit is converted to an exponential form to produce the actual probability; however, the underlying equation is linear in the log of the odds. This makes it easy to convert the logit score to a predetermined relationship between score and log odds.

This relationship is usually defined by a specific odds value at some base score and a Points to Double the Odds (PDO) factor. Because the relationship is linear, the conversion amounts to multiplying the original logit score by a constant and adding a constant.
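One common way to write this scaling is sketched below. The labels Score, Factor, Offset, BaseScore and BaseOdds are introduced here for illustration and do not come from the original post.

```latex
\begin{align*}
\text{Score}  &= \text{Offset} + \text{Factor}\cdot\ln(\text{Odds}) \\
\text{Factor} &= \frac{\text{PDO}}{\ln 2} \\
\text{Offset} &= \text{BaseScore} - \text{Factor}\cdot\ln(\text{BaseOdds})
\end{align*}
```

With these definitions, doubling the odds adds Factor · ln 2 = PDO points, and an account at the base odds receives exactly the base score.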

Mathematically:

Mathematical Definition	Example	Description
P = Probability of a 1	P = .9	Probability of a 1 (Good account) is 90%, 0 (Bad account) is 10%.
Odds = P/(1-P)	Odds = .9/.1 =9	9 Good accounts for every Bad account
Logit = K + ∑(b_i*x_i)	Logit = 0.2+.8x₁-5.5x₂	Two variable equation, x₁ is Number of Trade Lines and x₂ is number of delinquent accounts
P = e^(K + ∑(b_i*x_i)) / (1 + e^(K + ∑(b_i*x_i)))	x₁ = 5, x₂ = 1; Logit = -1.3; P = e^-1.3/(1 + e^-1.3) = 0.214; Odds = 0.214/(1 - 0.214) = 0.272	An account with 5 trades and 1 delinquent trade has a probability of .214 of being Good, i.e. 21% of these accounts are Good. The resulting odds of being Good are .272 to 1, or about 27 Good accounts for every 100 Bad accounts.
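The table's worked example can be reproduced with a few lines of Python. This is a minimal sketch; the function names `logit`, `probability` and `odds` are chosen here for illustration.

```python
import math

# Coefficients of the example equation: Logit = 0.2 + 0.8*x1 - 5.5*x2
K, b1, b2 = 0.2, 0.8, -5.5

def logit(x1, x2):
    """Linear log-odds equation (x1 = trade lines, x2 = delinquent accounts)."""
    return K + b1 * x1 + b2 * x2

def probability(z):
    """P = e^z / (1 + e^z), the probability of a 1 (Good account)."""
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    """Odds = P / (1 - P)."""
    return p / (1.0 - p)

z = logit(5, 1)     # account with 5 trade lines and 1 delinquent account
p = probability(z)
o = odds(p)
print(f"logit={z:.1f}  P={p:.3f}  odds={o:.4f}")
```

Computed without intermediate rounding, the odds equal e^-1.3, about 0.2725; the table's 0.272 comes from dividing the already rounded values 0.214 and 0.786.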

In the above equation, if the number of Trade Lines changes from 5 to 6, the P value goes from .214 to .378 and the odds more than double, going from .272 to .607. So an increase of 1 in Trades multiplies the odds by e^0.8, or about 2.23. Because the logit is linear in x₁, going from 6 to 7 Trade Lines multiplies the odds by the same factor again.
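A short check of this constant-multiplier behaviour, using the example coefficients from the table. The helper name `odds_for` is an assumption made for this sketch.

```python
import math

K, b1, b2 = 0.2, 0.8, -5.5  # example coefficients from the table

def odds_for(x1, x2):
    """Odds of a Good account: e raised to the logit."""
    return math.exp(K + b1 * x1 + b2 * x2)

# Odds for 5, 6 and 7 trade lines, holding delinquent accounts at 1.
for trades in (5, 6, 7):
    print(trades, round(odds_for(trades, 1), 3))

ratio_5_to_6 = odds_for(6, 1) / odds_for(5, 1)
ratio_6_to_7 = odds_for(7, 1) / odds_for(6, 1)

# Both ratios equal e^b1, regardless of the starting number of trade lines.
print(round(ratio_5_to_6, 4), round(ratio_6_to_7, 4))
```

The multiplier depends only on the x₁ coefficient, which is why changing that coefficient changes how many trade lines it takes to double the odds.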

Mathematically, it is clear that if we adjust the x₁ coefficient (to ln 2, about 0.693), each additional Trade Line will produce an exact doubling of the odds. Likewise, adding a constant to the equation can produce a score that has a value of 100 for odds of 10:1.
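In practice the whole logit is usually rescaled rather than a single coefficient. The sketch below shows one common way to do this. The base score of 100 at 10:1 odds comes from this post; the PDO value of 20 is purely a hypothetical choice, since the post does not give one, and the names `FACTOR`, `OFFSET` and `score_from_logit` are illustrative.

```python
import math

BASE_SCORE = 100.0  # from the post: a score of 100 ...
BASE_ODDS = 10.0    # ... corresponds to odds of 10:1
PDO = 20.0          # ASSUMPTION: hypothetical points to double the odds

FACTOR = PDO / math.log(2)                          # multiplicative constant
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)  # additive constant

def score_from_logit(logit):
    """Scaled score = OFFSET + FACTOR * log odds."""
    return OFFSET + FACTOR * logit

def score_from_odds(odds):
    """Same scaling, starting from odds instead of log odds."""
    return score_from_logit(math.log(odds))

for odds in (5, 10, 20):
    print(odds, round(score_from_odds(odds), 1))

# The example account (5 trade lines, 1 delinquent account) has logit -1.3.
example_score = score_from_logit(-1.3)
print(example_score < BASE_SCORE)  # True: well below the 10:1 breakeven odds
```

Because the scaling is applied to the full logit, each coefficient and the constant are simply multiplied by `FACTOR`, with `OFFSET` added once.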

The reason this adjustment may be important is best explained with an example. For business purposes, let us assume that it takes 10 good accounts to generate enough profit to pay for 1 bad account. Thus, a score of 100 (10 to 1 odds) is the business breakeven point, and any account with a score below 100 is not profitable. A business person who knows the PDO and base odds can therefore look at a score and understand not only its business implications but also how the business is affected by changes in the score.
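Reading a score back into odds is what makes this interpretation possible. The following is a rough sketch: the breakeven score of 100 at 10:1 odds is taken from this post, while the PDO of 20 and the function names are hypothetical.

```python
BASE_SCORE = 100.0  # from the post: breakeven score
BASE_ODDS = 10.0    # from the post: 10 Good accounts per Bad account
PDO = 20.0          # ASSUMPTION: hypothetical, not given in the post

def odds_from_score(score):
    """Odds of Good implied by a scaled score: every PDO points doubles them."""
    return BASE_ODDS * 2.0 ** ((score - BASE_SCORE) / PDO)

def at_or_above_breakeven(score):
    """True when the implied odds are at least the breakeven odds."""
    return odds_from_score(score) >= BASE_ODDS

for score in (80, 100, 120):
    print(score, odds_from_score(score), at_or_above_breakeven(score))
```

Under these assumptions a drop of 20 points halves the odds, so a score of 80 implies only 5 Good accounts per Bad account, half of what breakeven requires.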

The point of this lengthy discussion is that having the correct 0, 1 definition at the beginning of model development makes this score adjustment more straightforward at implementation time.


Tuesday, January 8, 2013
