## Sunday, September 13, 2009

### Who's Lying With Statistics?*

In my wanderings, I came across this (at O'Reilly Radar):
How the UK Government Spun 136 People into 7 Million -- a radio show looked into the government's claim of 7 million illegal filesharers and discovered it came down to 136 people in a survey admitting they'd used it.

But when I read the original article at PC Pro, I was moved to reply:
Your headline is more misleading than what the government originally reported. The 136 was 11.6% of the responses to a "survey of 1,176 net-connected households". That's a good sample size for a survey.

[pinero50 says "Extrapolating 3.9 million from a sample of ~1000 odd still seems pretty suspect to me." Pinero, I know it seems strange, but that's a basic thing you learn when studying statistics. A sample size on the order of 1,000 gives very accurate results, if picked randomly.]

There were problems, including having an interested party be involved in the research, but a more sensible comparison would be 3.9 million (mentioned at the end of the article) vs 7 million.

Commenters are right to question the wording used in the survey. If it just mentioned file-sharing, all bets are off as to how many people share illegally.

Why is it possible (if you ask a well-worded question and pick people randomly) to accurately gauge what's happened in a large population by asking only 1,000 people to respond to a survey?

It's counter-intuitive at first. But try this thought experiment: Imagine a silo full of grains of corn. You want to pull some out and examine it to get a sense of the quality of the corn. Imagine that this is a high-tech silo, and the corn gets thoroughly mixed, so if you reach in and pull out a scoop, it will be a good random sample of the corn. Can you see that it doesn't really matter whether the silo is 10 feet tall or a hundred? The scoop of corn you pull out should give you a sense of what's in there, as long as there's not too much variability (i.e., as long as the kernels are all close to the same size, water content, etc.).
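If you'd rather see the silo idea in numbers, here is a minimal simulation sketch. Everything in it is my own illustration, not anything from the survey itself: the function name `survey_estimates`, the made-up "true" share of 11.6%, and the three population sizes are all assumptions chosen for the demonstration. It draws 1,176 people at random, over and over, from populations of very different sizes and looks at how much the survey result bounces around.

```python
import random
import statistics


def survey_estimates(population_size, true_share=0.116, sample_size=1176,
                     trials=200, seed=0):
    """Simulate repeated random surveys of one population.

    Members numbered 0..yes_count-1 are the "yes" people. Because each
    sample is drawn uniformly at random, it does not matter where in the
    range they sit (this is the thoroughly mixed silo).
    """
    rng = random.Random(seed)
    yes_count = round(true_share * population_size)
    estimates = []
    for _ in range(trials):
        # Sample without replacement, like scooping kernels from the silo.
        sample = rng.sample(range(population_size), sample_size)
        yes_in_sample = sum(1 for person in sample if person < yes_count)
        estimates.append(yes_in_sample / sample_size)
    return estimates


if __name__ == "__main__":
    for size in (100_000, 1_000_000, 30_000_000):
        est = survey_estimates(size)
        print(f"population {size:>11,}: "
              f"mean estimate {statistics.mean(est):.3f}, "
              f"spread (std dev) {statistics.pstdev(est):.4f}")
```

The spread of the estimates should come out at about the same size (a bit under one percentage point) for every population, which is the silo point: the accuracy depends on the size of the scoop, not the size of the silo.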

The people surveyed are like the kernels in the scoop. In statistics courses, you learn to put together something called a confidence interval, so you can come up with a precise way of talking about how accurately the sample reflects the population. You end up with something along the lines of: "We're 95% sure that 3.9 million people are doing illegal file-sharing, with an error margin of about 0.6 million."
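For the curious, the arithmetic behind a statement like that can be sketched in a few lines. This uses the textbook normal-approximation interval for a proportion, which is my choice for illustration; I don't know what method (if any) the original researchers used. The function name is made up, and the last step assumes the 3.9 million figure is simply the survey share scaled up in proportion.

```python
import math


def proportion_confidence_interval(yes, n, z=1.96):
    """Normal-approximation (Wald) interval for a survey proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    Returns the observed share and the margin of error around it.
    """
    p = yes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Figures quoted above: 136 "yes" answers out of 1,176 households.
p, margin = proportion_confidence_interval(136, 1176)
print(f"share saying yes: {p:.1%}, margin of error: +/- {margin:.1%}")

# Assumption: 3.9 million is this share scaled up linearly, so the same
# relative margin (margin / p) carries over to the headline number.
estimate_millions = 3.9
margin_millions = estimate_millions * margin / p
print(f"about {estimate_millions} million, "
      f"give or take {margin_millions:.1f} million")
```

With these inputs the margin works out to roughly 1.8 percentage points on the share, or roughly 0.6 million on the headline figure. That only covers random sampling error; it says nothing about question wording or who chose to answer.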

A more accurate (and less attention-grabbing) title? Perhaps How the UK Government Spun 3.9 Million People into 7 Million. But who'd buy your news if it weren't inflammatory?

*My title takes off from a charming little book called How to Lie with Statistics, by Darrell Huff, written in 1954, an enjoyable and educational read.

1. Funnily enough, I've just made a similar post on my blog about a billboard ad I saw today, and the 'survey' it references. My gripe was with the response section rather than the title or the question asked ('Does God exist' of all the possibilities!) My post is at http://bit.ly/14G0d3 for anyone who's interested!

TK

2. I'd have issues with the response rate and the selection criteria.

How random? What were the selection criteria, other than being net-connected?

Who in the family was asked? Age has a great deal to do with file-sharing, but the survey may have been filled out by the parents.

They've selected for net-connectedness -- broadband or dial-up? If there was a large group of dial-up, then I'm not surprised they've done little file sharing.

Participation bias? Does the 136 represent the number out of 1200 who answered "Yes" or is it the number who answered "yes" out of the group who responded (<1200)?

Self-reporting? What person is going to reply "Yes" to this question anyway? Only one who has no fear of reprisal when the music industry wants to prosecute -- we know how secret these ballots aren't.

What was the question? Was it too limited in its scope? The question "Have you ever participated in file-sharing?" is too limited as the respondent might have shared a CD he owned with one friend or made it publicly available to millions.

I'd also like to know how many did it once compared to the number who did it with thousands of songs.

You picked up on the conflict of interest and some other concerns. I'd be suspicious of this study and the report.

I don't have a problem with the extrapolation, though.