The first step in sampling is to understand why you do it: if you want to make a broad statement about a population, you take a sample and calculate stats from that sample. What is crucial is that the sample be random and unbiased.

## What is a Random Sample

The classical definition of a random sample is that every member of the population has an equal chance of being chosen for the sample. In practice this is almost always impossible to achieve; the key to getting a reasonable sample is to consider every way the sampling process could contribute to bias.
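When you do have a complete list of the population, the classical definition is easy to sketch in code. The roster and the helper name `simple_random_sample` below are made up for illustration; this is a minimal sketch of the idealized case, not a recipe for real-world sampling.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement; every member is equally likely to be picked."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(list(population), k)

# Hypothetical roster of 200 students (names are invented).
students = [f"student_{i}" for i in range(1, 201)]

sample = simple_random_sample(students, 10, seed=42)
print(sample)
```

The hard part in practice is rarely the draw itself. It is getting a roster that really covers the whole population.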

## What is Bias?

Bias has to do with the sampling being non-random. I know, this seems circular, and it is. The key is to understand whether the non-randomness has an impact on the values being collected from the sample. This is best demonstrated by an example.

Consider an elementary school where we want to collect a sample to estimate the weight of the average child in the school. Let's say that you take one classroom, have the children line up by height, pick the first 10 children, and measure their weight.

First, if you select children from just one classroom, there will be an age bias (younger children are lighter). Second, sorting by height and taking the first 10 introduces a bias toward heavier children if the tallest stand first, since taller kids are usually heavier (or toward lighter children if the shortest stand first). Either way, the resulting estimate of average weight will be biased.

Now, if you were trying to measure the average hair length, this process might not be biased, unless sorting by height also tends to sort the children by gender.
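The weight example can be sketched as a small simulation. Everything here is invented for illustration: the school size, the height and weight formulas, and the assumption that the line starts with the tallest child. It is only meant to show the two biases separately, not to model real children.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical school: 6 grades, 30 children per grade. All numbers are invented.
school = []
for grade in range(1, 7):
    for _ in range(30):
        height_cm = rng.gauss(110 + 6 * grade, 5)              # older children are taller
        weight_kg = 0.5 * height_cm - 35 + rng.gauss(0, 1.5)   # taller children are heavier
        school.append({"grade": grade, "height": height_cm, "weight": weight_kg})

school_mean = statistics.mean(c["weight"] for c in school)

# Age bias: only one classroom (grade 1) is measured.
classroom = [c for c in school if c["grade"] == 1]
classroom_mean = statistics.mean(c["weight"] for c in classroom)

# Height bias: line up tallest first (an assumption) and take the first 10.
lineup = sorted(classroom, key=lambda c: c["height"], reverse=True)
tallest10_mean = statistics.mean(c["weight"] for c in lineup[:10])

# For comparison: 10 children drawn at random from the whole school.
random10_mean = statistics.mean(c["weight"] for c in rng.sample(school, 10))

print(f"whole school:        {school_mean:.1f} kg")
print(f"one classroom:       {classroom_mean:.1f} kg")
print(f"tallest 10 in class: {tallest10_mean:.1f} kg")
print(f"random 10 in school: {random10_mean:.1f} kg")
```

Note that the two biases pull in opposite directions here, so they can partly cancel. That does not make the sample good; it just makes the bias harder to see.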

## How to protect against bias?

This is where "local" knowledge comes into play. As a stats guy, I can sometimes identify sources of sample bias, but there is no substitute for "local" knowledge or what we call SME (Subject Matter Expertise).

In the above example, an elementary school teacher would be better able to judge whether height is related to gender in that school. As children age, males tend to be taller, but is that true of elementary students?

A typical example in the credit card industry is drawing a random sample to determine basic stats about a specific population of accounts. Often these accounts will be sorted by date of origination or payment. Does this contribute to a bias? Again, an SME experienced with the data would be helpful in identifying any potential bias. For example, it could be that accounts with a history of delinquency are sent out early in the billing cycle, or that accounts that have been on the books a long time have higher limits, since the "bad" accounts from that period have already been charged off.
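To see why sort order matters, here is a rough sketch comparing "take the first 100 rows" with a random draw from a file sorted by origination date. The portfolio, the field names, and the credit limit formula are all hypothetical; the only assumption that matters is that older accounts tend to have higher limits.

```python
import random
import statistics

# Hypothetical portfolio: 1,000 accounts, sorted by origination date (oldest first).
# The limit formula is invented so that limits grow with time on the books.
accounts = []
for i in range(1000):
    months_on_books = (1000 - i) // 10 + 1
    accounts.append({
        "account_id": i,
        "months_on_books": months_on_books,
        "credit_limit": 1000 + 50 * months_on_books,
    })

population_mean = statistics.mean(a["credit_limit"] for a in accounts)

# Convenience sample: the first 100 rows of the sorted file (the oldest accounts).
head = accounts[:100]
head_mean = statistics.mean(a["credit_limit"] for a in head)

# Simple random sample of the same size, ignoring the sort order.
rng = random.Random(7)
srs = rng.sample(accounts, 100)
srs_mean = statistics.mean(a["credit_limit"] for a in srs)

print(f"all accounts:   {population_mean:.0f}")
print(f"first 100 rows: {head_mean:.0f}")
print(f"random 100:     {srs_mean:.0f}")
```

Whether a real file behaves like this is exactly the kind of question an SME can answer; the code only shows what happens if the sort key is related to the value you are measuring.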

Identifying bias is an extremely important step in stats. It can influence everything that happens in the use of the sample. It goes back to the old acronym GIGO, Garbage In, Garbage Out.
