Another Average Post

MATH TIME!!!!

(Originally, I put that as filler in the draft of this blog just so I’d remember what the title was supposed to refer to. But then I decided it was a much better first line than anything else I could come up with.)

So I’m sure everyone has heard, at some point, a “studies have shown” claim along these lines: 75% of mothers believe their kid is above average.

Generally, people laugh at facts like that. An average, we are often taught, is roughly the “normal” value of a certain variable among a population. Statements like “90% of people consider themselves to be of above average intelligence” are funny because they seem ridiculous. The humour or sense of personal superiority comes from the belief that it’s mathematically impossible for far more than half the population to be above the average value for that population. So at least some of the people in that 90% have to be self-deluding, wrong-headed, or arrogant.

Don’t they?

Well, obviously I wasn’t going to end on “yeah, they do, good-bye, see you next week”. But I might as well write it out anyway: if a study found out that 90% of people consider themselves to be above average intelligence, there’s no inherent reason that couldn’t be true.

First, there are the mundane reasons: it might just happen that the way subjects were recruited for the study somehow attracted subjects of above average intelligence (compared to the entire population – more on that later). But that’s not what I’m going to write a whole big long math-y (but don’t run away!) blog on.

It’s a common misconception that about 50% of people are above average and about 50% of people are below average. It can work out that way, but it’s perfectly plausible that 99% of people could be above average and 1% below. To show why, first I’ll review what an average is.

To find the average value of a particular set of data, you add together all the data points on your list and then divide that sum by the number of data points you used. So, if the test scores of the four students in my very small statistics class are 60, 73, 91, and 88, I get the class average by adding: 60+73+91+88=312 and dividing that sum by the number of scores, in this case, 4.

Since 312/4 = 78, the class average was 78.
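If you'd rather let a computer do the adding, here's a minimal Python sketch of the same recipe. The function name `average` is just something I made up for illustration; the scores are the four from my imaginary class.

```python
def average(scores):
    """Add up all the data points, then divide by how many there are."""
    return sum(scores) / len(scores)


first_test = [60, 73, 91, 88]

total = sum(first_test)              # 60 + 73 + 91 + 88
class_average = average(first_test)  # total divided by 4

print(total, class_average)
```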

In that case, the class did conform to our expectations: half the class scored above average and half the class scored below average. But what about this case?

Alice, Barbara, Carol, and Deedee, being pleased that they all scored between 60 and 91 on the previous test, went into their next midterm with high expectations. Thanks to their self-confidence, Barbara, Carol, and Deedee all scored 100. Poor Alice, however, was over-confident, and decided not to study at all. She got a 0. What’s the class average on the midterm?

As per the definition discussed above, it’s (100+100+100+0)/4 = 300/4 = 75. So the average midterm grade is 75. However, unlike last time, it’s not the case that half the class did better and half the class did worse. Three-quarters of the class – way more than half – did better than average, and only one-quarter scored below average.

Obviously, I’ve exaggerated this example for pedagogical purposes, but the main concept holds: if a lot of people have a very high score, and a smaller number of people have a very low score, it’s mathematically possible for more than half the population to score above average. Furthermore, making up data sets like that imaginary midterm, it’s pretty easy to construct scores such that any percentage of the population we want, no matter how high, is above average. (For instance, if you want to show that it’s possible for 99.9% of a population to be above average, just imagine a class of 1000 in which 999 students get 80 and one poor schmo fails with a 20.)
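To check that claim without doing the arithmetic by hand, here's a small Python sketch that counts what share of a class scored strictly above the class average. The helper name `share_above_average` is hypothetical, and the three lists are just the made-up classes from this post.

```python
def share_above_average(scores):
    """Return the fraction of scores that are strictly above the mean."""
    mean = sum(scores) / len(scores)
    above = sum(1 for s in scores if s > mean)
    return above / len(scores)


first_test = [60, 73, 91, 88]    # the original four scores
midterm = [100, 100, 100, 0]     # three perfect scores and Alice's 0
big_class = [80] * 999 + [20]    # 999 students at 80, one at 20

for name, scores in [("first test", first_test),
                     ("midterm", midterm),
                     ("big class", big_class)]:
    print(name, share_above_average(scores))
```

The three shares come out to one half, three quarters, and 999 out of 1000, matching the examples above.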

MATH WAY OF PUTTING IT: Arithmetical averages aren’t good descriptors of central tendency unless the distribution is reasonably symmetric.

BLOG WAY OF PUTTING IT: You can’t assume that 50% of people are above average and 50% of people are below average unless you have good reason to believe that people’s scores are spread out about evenly on either side of the middle.

There is a value designed to separate half the scores from the other half, but it’s not the average, it’s the median. And it’s exactly what it sounds like: put your data points in order from smallest to largest, and if you have an odd number of them, the median is the value exactly in the middle (for example, if your set is 4, 6, 31, 0, -3, then in order it’s -3, 0, 4, 6, 31, and your median is 4). If you have an even number of data points, then it’s the average of the two middle values (for example, if your set is 23, 2, 34, 1, then in order it’s 1, 2, 23, 34, and your median is (2+23)/2 = 12.5).
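Here's the median recipe as a short Python sketch, checked against Python's built-in `statistics.median`. The function name `middle_value` is my own invention for illustration; the data sets are the ones used in this post.

```python
from statistics import median


def middle_value(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: exact middle
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of middle pair


odd_set = [4, 6, 31, 0, -3]
even_set = [23, 2, 34, 1]
midterm = [100, 100, 100, 0]

print(middle_value(odd_set), middle_value(even_set), middle_value(midterm))
print(median(odd_set), median(even_set), median(midterm))
```

Notice that the midterm's median is 100 while its average is 75: Alice's 0 drags the average down but leaves the median alone.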

Another, better reason to be wary of assumptions about what “average” means is that you always need to ask: the average of what? Let’s go back to that confident statistics class in which Barbara, Carol, and Deedee all scored 100% and Alice got a 0. The average score of the class is 75%. There are statistical methods that one can use to take that average score and estimate the average score of every statistics student, of female statistics students, of students at that university, etc. We’d need to know some other information about those groups (how many people they contained, etc.), but we could still extrapolate these pieces of information in a logically and mathematically sound way, provided one thing.

That one thing is the most important thing to ask about any statistical claim: is the sample used a representative sample of the population? Is it (or is it enough like) a group selected at random from the entire pool of people we’re trying to study? Is it large enough compared to the total population that small weirdnesses in one subject won’t distort the entire study?

In the case of the statistics ladies, they probably aren’t a representative sample of any of the groups I listed, but I’ll go with “statistics students” to explain why. First, they weren’t selected randomly or close to it. They’re a self-selected group of people taking a particular statistics course at a particular time. Maybe there’s some hidden factor that makes them different from most other statistics students.

For instance, maybe this particular statistics course is a night course, and maybe these ladies are taking it because they’re mature students who have jobs during the daytime, while most of the statistics students enrolled in other courses are young people without jobs. Maybe the students who attend this university have a different socioeconomic status than ones who attend swankier schools. Maybe students in this geographic area are different than ones in other places. Maybe the four ladies really like the teaching style of the instructor who gives this course because of uncommon personality traits they share.

Second, given the total number of statistics students in the world is well into the hundreds of thousands, if not millions, four seems like an awfully small number of students to take as a sample. We can’t reasonably expect just four individuals to accurately represent the variation of such a large number of students. It’s like if the government decided to hold the next election by allowing just four people in the entire country to vote. There’s no way they could reflect the various political opinions held by the country as a whole.

Think about it like this: in the past five years, the Canadian Green Party has garnered between 4 and 7 percent of the popular vote in federal elections. However, there’s no way a four-person voting contingent could reflect this: either no one would vote Green (to be expected in a pool of this size), falsely indicating no support, or one person would vote Green, grossly inflating the party’s actual support.
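To put rough numbers on the four-voter problem, here's a Python sketch. It rests on two assumptions of mine that aren't in the election data: that true Green support is 5% (a value picked from inside the 4 to 7 percent range above), and that the four voters are drawn independently at random. The function name `chance_of_green_votes` is hypothetical.

```python
from math import comb


def chance_of_green_votes(k, voters=4, support=0.05):
    """Binomial probability that exactly k of the voters choose Green.

    Assumes each voter independently votes Green with probability `support`.
    """
    return comb(voters, k) * support**k * (1 - support)**(voters - k)


# The only Green vote shares a four-person electorate can ever report.
possible_shares = [k / 4 for k in range(5)]
print(possible_shares)

for k in range(5):
    print(k, "Green votes:", round(chance_of_green_votes(k), 4))
```

Under those assumptions, the chance that nobody in the group votes Green is 0.95 to the fourth power, or roughly 81%. And whatever happens, the reported share has to be 0%, 25%, 50%, 75%, or 100%, none of which is anywhere near 5%.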

So before anyone uses the statistics ladies’ results in an argument, he or she has to do the thinking to show they’re a good representation of the thing about which he or she is using the results to argue. It may be true that 75% of students surveyed in one statistics class got 100% on the midterm, but that tells us nothing about statistics students in general if we’re using a small sample. The numbers are only half the battle.

Mark Twain famously said that there are “lies, damned lies, and statistics,” but that’s unfair. There are lies, damned lies, and statements that take advantage of the fact that most people have never been properly trained to interpret statistics. Yes, it’s less catchy, and the faulty parallel structure is grammatically problematic, but it’s more accurate. Statistics are generally correct. The hard part is that the people consuming them often don’t know what they’re correct about, and those providing them hardly care to elaborate.

4 Replies to “Another Average Post”

  1. Hi Sarah. I enjoyed your post, and the way you used some nice examples to explain the different measures of central tendency. Problems like the one you describe, where a small number of individuals have much higher or lower values than others (statisticians call this skewness, which you probably knew), are the reason why data published on income will use median income, not average income. A few high income earners raise the arithmetic average income, so the median is a much better measure in this case. In your next post you might like to also mention another measure of central tendency called the mode, which is the most frequent value (e.g., 100 in your four-student example).
    Just a couple of minor quibbles – the term “representative sample” is statistically meaningless; what you really should be referring to is a probability sample – specifically, a sample where every individual in the population has a known, non-zero probability of being selected. And probability samples don’t need to be “representative” in the sense of being like a mirror of the population. Most surveys in real life oversample some groups and under sample others, but can still produce statistically valid results by properly weighting the sample. For example, you have a much higher chance of being selected in Canada’s Labour Force Survey (the survey that produces the unemployment rate each month) if you live in PEI than if you live in Ontario. And of course there is the issue of non-response – a good subject for another blog.
    Finally, your quote attributed to Mark Twain was not his originally. He himself attributed it to Benjamin Disraeli, the British Prime Minister – but the experts differ – see here: http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics.
    Thanks again.

  2. @ Diana – hey, did you get the package yet? If it arrives now, pretend it’s for Easter instead of Christmas :P

    @ Mr. R – glad you enjoyed the post, and thanks for the more rigorous definition of sound sampling. I’m definitely not an expert on the subject, and since I came to statistics through a math degree, my instruction often focussed on what to do after data has been obtained and glossed over the sampling procedures and theory. Maybe I will post more about that in the future.

    I debated for some time over whether to credit Disraeli or Twain, but finally decided that, as a historian of technology, I’d be consistent and credit the man who made it work rather than the originator ;)

  3. @Sarah – Yes, I did get the package! It was all gobbled up within the hour. Thank you very much Sarah!

    @Dad – Granny said you’re back at the office?!? Couldn’t stay away from those sexy statistics yearbooks eh?
