Friday, February 15, 2013

FAMQ - Frequently Asked Modeling Questions


This blog is a list of Modeling Questions and links to other posts that have detailed answers to these questions.


During the Model Development and Review process a number of issues may arise that may best be addressed proactively. This document is intended to address some of the more common issues that tend to come to the forefront during the model review process. Many of these items may be subject to the personal opinions of the individual reviewer; however, they are all valid issues. A resolution process should be developed to resolve any conflicts.

In addition, these questions can serve as a reference for new modelers who may just have some basic questions on the modeling process, such as how to get started or how to plan a project.



FAMQ:

Why are First Steps important? – Defining the Problem

How do you approach first steps if you are a client?

What kind of problem do you have?

What is the modeling target (dependent variable(s))?


What are predictive (independent) variables?

Can a predictive variable be used in implementation?

What are the time windows for both the target and predictive variables?

What is Performance Inference?
What is model “engineering”?

Why does parceling work?

                                                        Return to FAMQ
Parceling works because of the nature of logistic regression. Logistic regression models a linear relationship between the independent variables and the natural log of the odds of a 1 versus a 0. When observations are duplicated and given weights proportional to their “1-ness” or “0-ness,” the weighted 1/0 odds correspond to the ratio being modeled: the Loss Ratio, the % recovery, or the Revenue/Cost ratio.
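
A short derivation may make this more concrete. It is only a sketch, assuming a group of observations that all share the same predictor values x. The symbols g_i (Good weight of observation i), b_i (Bad weight of observation i) and p (fitted probability of a 1) are notation introduced here for illustration and do not come from the original discussion.

```latex
% Logistic regression models the log-odds as linear in the predictors:
\[
  \log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta^{\top} x
\]
% Weighted log-likelihood for a group sharing the same x,
% where each observation contributes a Good row (weight g_i)
% and a Bad row (weight b_i):
\[
  \ell(p) = \sum_i \bigl[\, g_i \log p + b_i \log(1 - p) \,\bigr]
\]
% Setting the derivative to zero gives the fitted odds:
\[
  \frac{d\ell}{dp} = \frac{\sum_i g_i}{p} - \frac{\sum_i b_i}{1 - p} = 0
  \quad\Longrightarrow\quad
  \frac{p}{1 - p} = \frac{\sum_i g_i}{\sum_i b_i}
\]
```

Under these assumptions the fitted odds are simply total Good weight over total Bad weight, so with premiums as the Good weights and losses as the Bad weights, the Bad-to-Good odds are the pooled Loss Ratio.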

One advantage of modeling this continuous target with logistic regression instead of linear regression is that it simplifies the modeling assumptions about the distribution of the target ratio. The kinds of ratios discussed here are often quite skewed, with a high percentage of 0 or small values and a few very large values. The technique also adds power to the “goodness” or “badness” of the true 0’s and 1’s, which may be ignored in a linear regression of a ratio.

For example:
  • Loan Collections – In a charged off loan with no recoveries, the recovery % is zero regardless of how much was owed. In the parceling technique, this “Bad” account is extra bad if the amount owed was high, but not nearly as bad if the amount owed was small.
  • Insurance Risk – When an insurance policy has no claims it is Good, but its Loss Ratio is 0 regardless of how much premium is paid. In the parceling technique, that account is extra “Good” if the premium paid is large and not so good when the premium paid is small. Likewise, when a large claim is paid, that account is extra “Bad.”
  • Profitability – If an account in any sort of business has no revenue, then the Revenue/Cost ratio is 0 regardless of how much that account has cost the company. With parceling, a zero revenue account is extra bad if its associated costs are high and not so bad if its costs are low.
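
The insurance example can be illustrated with a few made-up numbers. This is only a sketch: the three policies and their dollar amounts are hypothetical, and the "unweighted view" is a simple average of ratios standing in for what an unweighted regression on the ratio would see.

```python
# Hypothetical insurance policies as (premium_paid, losses); values are made up.
policies = [
    (100.0, 0.0),        # small policy, no claims
    (10_000.0, 0.0),     # large policy, no claims
    (1_000.0, 2_000.0),  # one policy with a large claim
]

# Per-policy Loss Ratios: both no-claim policies look identical (0.0),
# no matter how much premium was paid.
loss_ratios = [loss / premium for premium, loss in policies]

# Unweighted view: every policy counts the same.
mean_of_ratios = sum(loss_ratios) / len(loss_ratios)

# Parceled view: Good weight = premium paid, Bad weight = losses,
# so the large no-claim policy pulls much harder toward "Good".
good_weight = sum(premium for premium, _ in policies)
bad_weight = sum(loss for _, loss in policies)
parceled_odds = bad_weight / good_weight  # Bad-to-Good odds = pooled Loss Ratio

print(loss_ratios)
print(round(mean_of_ratios, 4), round(parceled_odds, 4))
```

In this toy data the simple average of the ratios is dominated by the one policy with a claim, while the parceled odds give the large no-claim policy credit for its premium.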

What is Parceling?

Parceling is a technique often used in analytical performance inference to help account for a “fuzzy” or probabilistic 0/1 outcome for a given observation. It is also used in model development as an alternative to linear regression, to convert a ratio variable to a 0/1 or binary target. For example:
Inference
The problem in inference is to determine how a specific “unknown” observation would have performed had it been in the known population.
  • Lending or credit
    • How would a rejected applicant have performed had it been accepted?
    • How would an “unbooked” (accepted but walked away) applicant have performed had they taken the loan?
  • Direct Mail
    • Would a potential customer have responded had they been mailed an offer?
    • If they had responded, would they have purchased something?
  • Fraud
    • Would a credit application have been identified as fraudulent had it been investigated?
    • Would an insurance claim have been identified as fraudulent had it been investigated?

In this situation, the analysis of the known population can be extrapolated (very carefully) into the unknown population to derive a probability of the target performance (for example 1=Good or 0=Bad). This probability is then used to divide an unknown observation into two separate observations, a Good observation and a Bad observation. The Good observation is given a weight equal to the probability of a 1 and the Bad observation is given a weight equal to the probability of a 0.
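
The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name parcel_unknown, the field names target and weight, and the example applicant with a probability of 0.7 are all hypothetical.

```python
def parcel_unknown(observation, p_good):
    """Split one unknown-performance observation into a Good and a Bad row.

    `observation` is a dict of predictor values. `p_good` is the estimated
    probability of a 1 (Good), for example from a model built on the known
    population. Both names are assumptions made for this sketch.
    """
    if not 0.0 <= p_good <= 1.0:
        raise ValueError("p_good must be a probability between 0 and 1")
    # Good copy carries weight P(1); Bad copy carries weight P(0) = 1 - P(1).
    good_row = dict(observation, target=1, weight=p_good)
    bad_row = dict(observation, target=0, weight=1.0 - p_good)
    return [good_row, bad_row]


# Hypothetical rejected applicant with an inferred 70% chance of being Good.
rejected_applicant = {"applicant_id": "R-001", "score_band": "C"}
rows = parcel_unknown(rejected_applicant, p_good=0.7)
for row in rows:
    print(row)
```

The two weights always sum to 1, so each unknown applicant still counts as one observation in total once the parceled rows are used in a weighted model.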
Ratios
This technique is only applicable in some very specific conditions. In general, if the target can have different degrees of “Badness” or “Goodness” then parceling can be used. For example:
  • Loan Collections – When a loan has been charged off, some or all of the money can be recovered. If none is recovered, it is Bad; if all is recovered, it is Good; and if a portion is recovered, it is partially Good and partially Bad. The ratio here is % recovered.
  • Insurance Risk – When an insurance policy has no claims it is Good; when it has claims it is somewhat Bad, depending on how large the claim is. The ratio here is Loss Ratio, or Loss/(Premiums Paid).
  • Profitability – An account in any sort of business could be classified as Good or Bad depending on the revenue generated from the account compared to the costs associated with the account. Those accounts with no costs are Good, those accounts with no revenue are Bad. The ratio here is Revenue/Cost.

Parceling is used in these ratio examples in a similar way to the inference solution. Each partial observation is duplicated, making a “Good” observation and a “Bad” observation. The Good observations are given weights proportional to their “Goodness” ($ recovered on the charged off loan, insurance premiums paid, revenue generated by the account) and the Bad observations are given weights proportional to their “Badness” ($ owed on the charged off loan, losses on the insurance due to claim(s), costs associated with the account).
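
As a rough sketch of this duplication step, using the insurance case (Good weight = premiums paid, Bad weight = losses): the function name parcel_ratio, the field names and the policy amounts below are hypothetical, and dropping zero-weight rows is a convenience chosen for this sketch.

```python
def parcel_ratio(observation, good_amount, bad_amount):
    """Duplicate one observation into weighted Good (1) and Bad (0) rows.

    `good_amount` and `bad_amount` are the dollar measures of "Goodness" and
    "Badness" (for insurance: premiums paid and losses). Zero-weight rows are
    dropped here because they would not contribute to a weighted fit.
    """
    if good_amount < 0 or bad_amount < 0:
        raise ValueError("amounts must be non-negative")
    rows = []
    if good_amount > 0:
        rows.append(dict(observation, target=1, weight=good_amount))
    if bad_amount > 0:
        rows.append(dict(observation, target=0, weight=bad_amount))
    return rows


# Hypothetical policy with $1,200 premium and $300 in losses (Loss Ratio 0.25).
policy_with_claim = parcel_ratio({"policy_id": "P-1"}, good_amount=1200.0, bad_amount=300.0)

# Hypothetical no-claims policy: only a Good row, weighted by its premium.
policy_no_claims = parcel_ratio({"policy_id": "P-2"}, good_amount=5000.0, bad_amount=0.0)

print(policy_with_claim)
print(policy_no_claims)
```

For the first policy the Bad-to-Good weight ratio is 300/1200, which is exactly its Loss Ratio. The no-claims policy produces a single Good row whose weight grows with the premium paid.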

For inference situations, parceling allows a single observation with unknown performance to be split into a “good” observation with a weight proportional to the estimated probability that the observation would have been “good” and a “bad” observation with a weight proportional to the estimated probability that the observation would have been “bad.” These parceled observations are then added to the known population to build a final model based on the full through-the-door (TTD) population.
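
To illustrate the final step, here is a small sketch that stacks known observations (weight 1) with parceled ones and fits a weighted logistic regression. Everything in it is an assumption made for illustration: the data are made up, there is a single binary predictor x, and the fit is a hand-rolled Newton-Raphson routine rather than any particular package. In practice a logistic regression procedure that accepts observation weights would normally be used instead.

```python
import math


def fit_weighted_logistic(rows, n_iter=25):
    """Fit log-odds = b0 + b1 * x by Newton-Raphson on weighted rows.

    Each row is a tuple (x, target, weight). Returns (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = w * (y - p)        # weighted residual (gradient term)
            v = w * p * (1.0 - p)  # weighted variance (information term)
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Known (booked) population, each observation with weight 1. Made-up data.
known = [(0, 1, 1.0), (0, 1, 1.0), (0, 0, 1.0),
         (1, 1, 1.0), (1, 0, 1.0), (1, 0, 1.0)]

# Two hypothetical rejected applicants, each parceled into Good and Bad rows.
parceled = [(0, 1, 0.8), (0, 0, 0.2),   # x = 0, inferred P(Good) = 0.8
            (1, 1, 0.4), (1, 0, 0.6)]   # x = 1, inferred P(Good) = 0.4

b0, b1 = fit_weighted_logistic(known + parceled)
p_x0 = 1.0 / (1.0 + math.exp(-b0))
p_x1 = 1.0 / (1.0 + math.exp(-(b0 + b1)))
print(round(p_x0, 4), round(p_x1, 4))
```

Because x is binary, the fitted probability in each group should match the group's share of Good weight (2.8 of 4.0 for x = 0, 1.4 of 4.0 for x = 1), which gives a simple check that the parceled rows are counted alongside the known population.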