Statistics for Practical People

Part III - Choosing a Sample and Why it Matters
Published October 1988


Even when you can get good advice, it's nice to know something about the important principles in statistics. It is hard to get past the details and math to reach the main ideas, and that's what this series of articles is about. We hope it will be useful to you.

The value of a sample hinges on the concept of a "representative" sample. Most of statistics looks at a population in 3 ways:

  1. The "center", usually the "mean" (simple average).
  2. The "spread", measured by the "standard deviation".
  3. The "shape", or "distribution type".

It would be nice if the sample gave a fair picture of all 3 items, but the most important is 1 (we'll discuss 2 and 3 in later issues). The sample mean (x-bar) should have an "expected value" equal to the population mean (μ). That Greek letter (μ) is pronounced "mew". If the sampling were repeated many times and the sample means (x-bar) "average out" (theoretically) to the true mean (μ), then the sampling process is called "unbiased". If repeated samples would average to the wrong answer, then the difference is called the system's "bias". If you have shown that a bias is small, you might not care about it, but often we don't know how bad a biased method might be, and therefore try to stay with unbiased methods as much as possible. There are lots of ways to create a bias large enough to worry about. The two biggest are bad sample selection (measuring the wrong thing) and poor measurement technique (measuring it poorly).
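
To make "unbiased" concrete, here is a small Python sketch (it is not part of the original column). The population of 100 made-up values, the function names, and the deliberately flawed "favors_large" selection method are all assumptions chosen for illustration. The sketch repeats each sampling process many times and compares the average of the sample means with the population mean.

```python
import random
import statistics


def average_of_sample_means(population, sample_size, repeats, pick, seed=1):
    """Repeat a sampling process and average the resulting sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(pick(population, sample_size, rng))
        for _ in range(repeats)
    ]
    return statistics.fmean(means)


def simple_random(population, n, rng):
    # Every unit has the same chance of being selected.
    return rng.sample(population, n)


def favors_large(population, n, rng):
    # Hypothetical flawed method: the chance of selection is proportional
    # to the value itself, so large units are picked too often.
    return rng.choices(population, weights=population, k=n)


# Made-up population: the values 1 through 100, so the true mean is 50.5.
population = [float(v) for v in range(1, 101)]
mu = statistics.fmean(population)

unbiased_avg = average_of_sample_means(population, 10, 5000, simple_random)
biased_avg = average_of_sample_means(population, 10, 5000, favors_large)

print(f"population mean       = {mu:.2f}")
print(f"simple random x-bar   = {unbiased_avg:.2f} (bias {unbiased_avg - mu:+.2f})")
print(f"favors-large x-bar    = {biased_avg:.2f} (bias {biased_avg - mu:+.2f})")
```

Any single sample can land well away from the true mean under either method. The difference is that the first method's misses cancel out over many repeats, while the second method's misses pile up on one side.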

Why do you hear so much about "random samples"? Properly drawn random samples are unbiased in the sense we just discussed, and the math has been worked out thoroughly -- partly because it is simpler to calculate than for more practical methods. With random sampling, many terms in the equations go to zero, which simplifies the math.

Are random samples necessary? No, there are lots of other ways to gather representative samples, and one of the best ones is a grid or systematic sample. Such a sample is unbiased, gives "better answers" (closer to the truth with the same sample size) and is more practical. The only drawback is theoretical -- the math people aren't sure how much better the answers are, and would rather simplify their lives by sticking to the study of random samples. In fact, there are theoretical ways of examining systematic samples quite correctly, but the myth lives on that "only random samples are really statistically sound". That's not true. When you treat a systematic sample as if it was random the effect is just to underestimate the quality of the answer -- a conservative approach. Systematic samples have a long history of good results, and are used widely.
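
The comparison between systematic and random samples can be tried out with a short Python sketch. Everything in it is an assumption made for illustration: a row of 1,000 plots whose volume rises steadily along a slope, the noise level, and a sample of every 50th plot. It looks at every possible systematic sample and at many random samples of the same size, and it also shows what happens when one systematic sample is run through the usual random-sample formula.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical stand: 1,000 plots in a row whose volume rises steadily
# along a slope, plus some plot-to-plot noise. These numbers are made up.
population = [100 + 0.2 * i + rng.gauss(0, 10) for i in range(1000)]
mu = statistics.fmean(population)

step = 50                        # take every 50th plot
n = len(population) // step      # 20 plots per sample


def rms_error(means, truth):
    """Root-mean-square distance of a set of sample means from the truth."""
    return (sum((m - truth) ** 2 for m in means) / len(means)) ** 0.5


# Every possible systematic sample: one for each starting plot 0..49.
sys_means = [statistics.fmean(population[start::step]) for start in range(step)]

# 2,000 simple random samples of the same size.
rand_means = [statistics.fmean(rng.sample(population, n)) for _ in range(2000)]

sys_rms = rms_error(sys_means, mu)
rand_rms = rms_error(rand_means, mu)

# Standard error from the usual random-sample formula, applied to one
# systematic sample as if it had been drawn at random.
one_sample = population[0::step]
naive_se = statistics.stdev(one_sample) / n ** 0.5

print(f"average of systematic means = {statistics.fmean(sys_means):.2f} (true {mu:.2f})")
print(f"typical error, systematic   = {sys_rms:.2f}")
print(f"typical error, random       = {rand_rms:.2f}")
print(f"random-formula error applied to a systematic sample = {naive_se:.2f}")
```

In this made-up population the systematic sample spreads itself across the whole slope, so its answers cluster closer to the truth, and the random-sample formula overstates its error. A population with a different pattern could behave differently, so treat this as one illustration rather than a general proof.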

You can mess up of course. You can always run North-South lines in an area with ridges that go North-South and slightly juggle the lines so they go up the roaded valley bottoms. Good production -- bad answers. Random samples would avoid that, but so would a random orientation of the grid.
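
The ridge-and-valley trap can be sketched in a few lines of Python. The terrain here is entirely hypothetical: a 100 by 100 area where volume depends only on east-west position and is assumed to be highest in the valley bottoms, which sit 20 units apart. The sketch compares only two fixed layouts, lines run up the valleys and lines run across the ridges. It does not try to model a randomly oriented grid.

```python
import math
import statistics

WIDTH, HEIGHT, RIDGE_SPACING = 100, 100, 20


def volume(x, y):
    # Hypothetical terrain: ridges run North-South, so volume changes only
    # with east-west position x. Valley bottoms (x = 0, 20, 40, ...) are
    # assumed to carry the most volume. y is unused for that reason.
    return 200 + 100 * math.cos(2 * math.pi * x / RIDGE_SPACING)


true_mean = statistics.fmean(
    volume(x, y) for x in range(WIDTH) for y in range(HEIGHT)
)

# North-South lines nudged onto the valley bottoms, a plot every 10 units.
ns_plots = [
    volume(x, y)
    for x in range(0, WIDTH, RIDGE_SPACING)
    for y in range(0, HEIGHT, 10)
]

# East-West lines crossing the ridges, with the same number of plots.
ew_plots = [
    volume(x, y)
    for y in range(0, HEIGHT, RIDGE_SPACING)
    for x in range(0, WIDTH, 10)
]

ns_mean = statistics.fmean(ns_plots)
ew_mean = statistics.fmean(ew_plots)

print(f"true mean          = {true_mean:.1f}")
print(f"lines up valleys   = {ns_mean:.1f}")
print(f"lines across ridges = {ew_mean:.1f}")
```

Both layouts use 50 plots, and both would look equally tidy on a map. Only the one that crosses the pattern in the ground sees the low-volume ridges at all.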


Are there any advantages to random samples? Yes. They sound a little better when written up in a scientific paper. They seem "more professional and scientific" -- but there are virtually no advantages beyond these (sometimes important) psychological ones. Besides, they are almost never done correctly.

Taking a random sample is something that has to be carefully done by someone who understands statistics. If you ever need to do it, then make sure such a person checks it. Then document exactly what you did, because nobody is ever likely to believe you did it right. They will believe you tried your best, or even that it didn't matter to the results, but unless you have a fully traceable method they won't believe it was correct. If you spend the money to do it right, then get the credit too.

Why should we care about all this? Because if you select a bad sample there is very seldom a way to fix the answer, and that usually depends on good documentation of the methods. You know how often that happens. A bad sample (systematic or random) will lose a court case, destroy your credibility, and lead you toward a wrong answer. How wrong? That depends on how lucky you are. I was once involved with a sampling system that selected the edge of the stand with twice the proper frequency, and involved thousands of already measured samples. Luckily we could compare a large number of plots from the inside and outside zone and found no overall difference in the item we were measuring, so it didn't matter that there was a bad sample. Did that mean the sampling system was "correct"? No, just that we were lucky that time -- what about next time? Sometimes a badly selected sample will give OK results, but it gives a chance of error and credibility loss that you don't need to take.

One last word about choosing a "representative line of plots" through a stand versus a random or systematic method. Research has shown that people aren't very good at doing that. But even if they could get a better answer for every stand in the forest, they might well get a worse overall answer.

An example might help here. Suppose we have 1,000 stands, each with 10 plots in it, and none of the answers is within 10% of the true stand volume. We now use "good judgement" to shift every answer closer to the true stand volume by 9%. Next, we add 4% to every answer. The net change to each stand's answer is either +13% (for a stand we had underestimated) or -5% (for one we had overestimated), and every stand is closer to its true value than before the adjustment. The problem is that the 4% we added to every stand does not average out, so we now have a forest level bias of +4%. The statistics for our forest inventory will probably indicate a sampling error (with 10,000 plots) of 1%, but the sampling bias alone is 4 times that amount.
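
The arithmetic in this example can be checked with a short Python sketch. Several things in it are assumptions the example does not spell out: the original errors are taken to be between 10% and 25% of stand volume, half of the stands are overestimated and half underestimated by matching amounts (so the original errors cancel at the forest level), and all stands are treated as equal in size.

```python
import random
import statistics

rng = random.Random(1988)

# Hypothetical original errors, in percent of true stand volume.
# 500 overestimates and 500 matching underestimates, all at least 10% off,
# so the errors cancel exactly when averaged over the forest.
magnitudes = [rng.uniform(10, 25) for _ in range(500)]
original = [m for m in magnitudes] + [-m for m in magnitudes]


def adjust(error, pull=9.0, drift=4.0):
    """Pull the answer 9 points toward the truth, then add 4 points."""
    toward_truth = error - pull if error > 0 else error + pull
    return toward_truth + drift


adjusted = [adjust(e) for e in original]

# Net change applied to each stand: -5 for overestimates, +13 for underestimates.
net_changes = sorted({round(a - o, 6) for a, o in zip(adjusted, original)})

every_stand_closer = all(abs(a) < abs(o) for a, o in zip(adjusted, original))
forest_bias_before = statistics.fmean(original)
forest_bias_after = statistics.fmean(adjusted)

print(f"net changes applied     = {net_changes}")
print(f"every stand closer?     = {every_stand_closer}")
print(f"forest bias before      = {forest_bias_before:+.2f}%")
print(f"forest bias after       = {forest_bias_after:+.2f}%")
```

Under these assumptions every individual stand improves, yet the forest total moves from no bias to a bias of +4%, because the stand-level errors used to cancel and the added 4% never does.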

The point here is that how you choose the sample locations can make a difference! You can always recalculate the data, fix a computer program or use a better volume table -- but there is virtually no recovery from a bad sample, and seldom any warning from the data itself. Statistics can only tell you what would be likely, assuming that you picked a correct sample. In the next installment we'll discuss the standard deviation and what it tells us about a population.

