Statistics for Practical People (6)

In the previous parts of this series we have talked about what Standard Deviation (SD) and Standard Error (SE) really mean. The formulas for actually calculating them are not really important in this day and age. How to use them and what they mean becomes more important all the time.

A brief review:

SD is how spread out THINGS in the population are, and this is calculated (somehow) from the data in your sample. It is useful in describing the population itself.
SE is how spread out the SAMPLE MEAN will be around the true population mean. It is useful in describing how close your cruise will be to the right answer.

HOW ARE STANDARD DEVIATION AND STANDARD ERROR RELATED?
It so happens that there is a very simple relationship between SD and SE. You can calculate SE by the following formula:

SE = SD / n^0.5     (SD divided by the square root of n)

Now this is really quite a simple and beautiful little formula. It should be considered THE most important formula in all of statistics.

It starts out with the way the world IS (that's SD - how spread out the data are, and there is virtually NOTHING you can do about it).
It then talks about how hard you WORK (that's the sample size "n"), and you ARE in control of that. (Please note that is how hard you work, not how smart.)
It then tells you HOW GOOD your average is likely to be with that amount of effort (the Standard Error).

All this happens with a simple little formula that anybody can understand and remember. It only gets ugly and complicated looking when you have multiple layers of sampling or lots of strata mixed together. The IDEA is simple and easy to grasp.
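The formula is small enough to try out directly. The sketch below is only an illustration: the function name `standard_error` is made up for this example, and the SD of 25 pounds is borrowed from the complete example later in this part.

```python
import math

def standard_error(sd, n):
    """Standard error of the sample mean: SE = SD / n^0.5."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sd / math.sqrt(n)

# SD is the way the world is; n is how hard you work.
sd = 25.0  # assumed SD in pounds, taken from the example in this part
for n in (4, 16, 64, 256):
    print(f"n = {n:3d}  SE = {standard_error(sd, n):.2f} pounds")
```

Notice that each time the sample size is quadrupled, the standard error is only cut in half. That is the price of the square root in the formula.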

WHAT IS A "t-TABLE"?
We have talked about using a Z-table to tell how far to go when creating a confidence interval. You do this when you know what the standard deviation REALLY is. If you don't KNOW the standard deviation (and you hardly ever do), you can still estimate it from the data you gathered in your sample. The complication is that you won't get it quite right. In general you will slightly underestimate it, particularly if you have a small sample. Since this is the case, you need to go out just a little bit farther in each direction than a Z-table would tell you.
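If you would rather see the underestimate than take it on faith, a small simulation will show it. This is a rough sketch under stated assumptions, not a proof: it assumes a normally distributed population with a mean of 200 and a true SD of 25 (numbers borrowed from the example later in this part), and the function name `average_sample_sd` is invented for the illustration.

```python
import random
import statistics

def average_sample_sd(n, true_sd=25.0, trials=4000, seed=1):
    """Average of the SDs estimated from many random samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        # Assumed population: normal, mean 200, SD true_sd.
        sample = [rng.gauss(200.0, true_sd) for _ in range(n)]
        total += statistics.stdev(sample)  # the usual n-1 sample SD
    return total / trials

for n in (3, 5, 10, 30):
    avg = average_sample_sd(n)
    print(f"n = {n:2d}  average estimated SD = {avg:.1f}  (true SD = 25.0)")
```

The averages should come out below 25, with the biggest shortfall at the smallest sample sizes and very little by the time the sample reaches 30.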

Luckily, some nice person has figured this out, and published another table called the "t-table". It is very close to the Z-table except at very small sample sizes. In fact, after a sample size of about 30 or so there is virtually no difference (which just means that you are now getting a very good estimate of the standard deviation). You often hear in statistics that "after a sample size of 30 it is correct to use the Z-table". This isn't really true, but there is so little difference between the tables that nobody worries about doing it.

The t-table value depends on the sample size you have used to estimate the standard deviation. These tables sometimes use a special term for "the sample size minus 1" (n-1). They call this the "degrees of freedom". At any rate, the t-table just tells you how many standard deviations (or standard errors) to go, each way, when you are making a confidence interval. An example of such a t-table is shown below.

	t Value for Confidence Interval
Degrees of Freedom (n-1)	90%	95%
1	6.314	12.706
2	2.920	4.303
3	2.353	3.182
4	2.132	2.776
5	2.015	2.571
10	1.812	2.228
15	1.753	2.131
20	1.725	2.086
30	1.697	2.042
60	1.671	2.000
Infinite	1.645	1.960
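
Looking up a value in a table like this is easy to mimic in a few lines. The sketch below is one possible way to do it, not a standard routine: the names `T_TABLE` and `t_value` are invented here, the numbers are the standard published t values for the degrees of freedom listed in the table, and it only knows the rows and the two confidence levels shown.

```python
import math

# Degrees of freedom -> (90% t value, 95% t value).
# Standard published values for the rows listed in the table.
T_TABLE = {
    1: (6.314, 12.706),
    2: (2.920, 4.303),
    3: (2.353, 3.182),
    4: (2.132, 2.776),
    5: (2.015, 2.571),
    10: (1.812, 2.228),
    15: (1.753, 2.131),
    20: (1.725, 2.086),
    30: (1.697, 2.042),
    60: (1.671, 2.000),
    math.inf: (1.645, 1.960),  # same as the Z-table
}

def t_value(n, confidence):
    """Look up t for a sample of size n at 90 or 95 percent confidence."""
    df = n - 1  # degrees of freedom
    if df not in T_TABLE:
        raise KeyError(f"{df} degrees of freedom is not a row in this small table")
    column = {90: 0, 95: 1}[confidence]
    return T_TABLE[df][column]

print(t_value(21, 95))  # 20 degrees of freedom, 95% column: 2.086
print(t_value(21, 90))  # 20 degrees of freedom, 90% column: 1.725
```

Note that the function takes the sample size and subtracts 1 itself, so you have to be clear about whether you are handing it "n" or "degrees of freedom", just as you do when reading a printed table.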

A COMPLETE EXAMPLE
Suppose we have just done a sample of 21 weights, and we calculate that the mean is 200 pounds. We want to describe how spread out the population is, so we would calculate the STANDARD DEVIATION from the data and find it to be 25 pounds. Now if the population itself is normally distributed then we can make a confidence interval for the THINGS in the population. Let's say we want a 95% confidence interval. How many standard deviations do we go each way? We look in the t-table under sample size 21 (or 20 degrees of freedom depending on how the table is labeled) and get the t value of 2.086.

We now know that 95% of the things in the population are within ±2.086 standard deviations of the sample mean. What is that in pounds? 2.086 * 25 pounds = 52.15 pounds each way. The "confidence interval" is therefore 200 pounds ±52 pounds (between 148 and 252 pounds if you prefer to state the end points).
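
The same arithmetic can be written out as a short sketch. The function name `spread_interval` is made up for this illustration; the mean, SD, and t value are the ones used in the text, and the population is assumed to be normally distributed, as stated above.

```python
def spread_interval(mean, sd, t):
    """Interval for the THINGS in the population: mean +/- t * SD."""
    half_width = t * sd
    return half_width, mean - half_width, mean + half_width

mean = 200.0   # sample mean, pounds
sd = 25.0      # standard deviation estimated from the sample, pounds
t_95 = 2.086   # t-table, 20 degrees of freedom, 95% column

half_width, low, high = spread_interval(mean, sd, t_95)
print(f"{mean:.0f} +/- {half_width:.2f} pounds ({low:.0f} to {high:.0f} pounds)")
```

Note that the sample size shows up here only through the t value. The width of this interval is set by SD, not by SE.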

And how close is our SAMPLE MEAN to the true population mean? Well, even if the population were not normally distributed we can still use its SD to estimate how widely spread the sample means will be. We know that sample means are very nearly normally distributed once the sample is reasonably large. We need to calculate the STANDARD ERROR, and we do this using the SE formula: 25 / 21^0.5 = 5.46 pounds. Suppose we have decided to get a 90% confidence interval for the sample mean. We have to go out 1.725 standard errors each way according to the t-table, and in units this would be 1.725 * 5.46 = ±9.4 pounds. We can now estimate that the true population mean is 200 ±9.4 pounds (or 190.6 to 209.4 pounds).
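
Here is the second half of the example as a sketch, so you can change the numbers and watch what happens. The function name `mean_confidence_interval` is invented for this illustration, and the t value still has to be read from the t-table by hand.

```python
import math

def mean_confidence_interval(mean, sd, n, t):
    """Interval for the true population mean: mean +/- t * SE."""
    se = sd / math.sqrt(n)   # SE = SD / n^0.5
    half_width = t * se
    return se, mean - half_width, mean + half_width

# Numbers from the text: 21 weights, mean 200 pounds, SD 25 pounds,
# and t = 1.725 (20 degrees of freedom, 90% column).
se, low, high = mean_confidence_interval(200.0, 25.0, 21, 1.725)
print(f"SE = {se:.3f} pounds")
print(f"90% interval for the true mean: {low:.1f} to {high:.1f} pounds")
```

With the numbers from the text this gives the same 190.6 to 209.4 pound interval. Try it again with n = 84 (four times the work) and the SE is cut in half, although strictly speaking you would then need a slightly smaller t value as well.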

If you can follow the logic of this example you will be able to do the most practical parts of statistical analysis. It may take practice to do it quickly, but these are the main logical ideas you need to understand. When you read a statistics book there are a lot more terms you run into, but many of them are just slightly different ways of saying the same thing. Next time we will try to sort out a few of these so they don't get in your way. Once you see the pattern you will realize that SD and SE are really ALL you need to worry about. The business of how to create a confidence interval, and understanding standard deviation and standard error, are the longest and hardest part of this series.

From now on it gets easier. Remember -- this statistics business has to do with somebody's MONEY and SWEAT, and if you can understand some of the basics, you might save a lot of each. It's worth the effort.

PART VI - How are Standard Deviation and Standard Error related? Published July 1989
