Statistical Error in Usenet Arbitron from Brian Reid

What error?

The statistical error in the raw arbitron was the result of Brian making a double correction for the propagation of newsgroups.

When he generated the estimated readership figures, he scaled the measured readership up by the ratio of the estimated total size of Usenet to his sample size. He then multiplied this number by the propagation of the group to get his final number. This last step was his error.

The readership is implicitly corrected for propagation by the very nature of the sampling technique: sampled sites that do not carry a group simply report no readers for it. By explicitly correcting for it a second time, he in fact throws the readership figures off by large factors.
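
To make the arithmetic concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration, the function names are mine, and exactly how the propagation factor was applied in the real arbitron scripts is an assumption on my part. The point is only that a second correction moves the estimate away from the plain scaled-up figure.

```python
def scaled_estimate(sampled_readers, sample_size, total_size):
    """Scale the readers counted in the sample up to the whole net.

    Sampled sites that do not carry the group contribute zero readers,
    so propagation is already reflected in sampled_readers.
    """
    return sampled_readers * total_size / sample_size


def double_corrected(sampled_readers, sample_size, total_size, propagation):
    """Apply a further propagation factor on top of the scaled estimate.

    This is the extra step described above (assumed form: a plain multiply).
    """
    return scaled_estimate(sampled_readers, sample_size, total_size) * propagation


# Hypothetical figures: 50 readers found in a sample of 1000,
# an estimated total population of 100000, and a propagation of 0.5.
print(scaled_estimate(50, 1000, 100000))        # 5000.0
print(double_corrected(50, 1000, 100000, 0.5))  # 2500.0
```

With these invented figures the second correction changes the estimate by a factor of two, and the distortion grows as a group's propagation moves further from 1.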

Why did he do this?

I'm not sure. The consensus in the news.admin.* groups was that he was probably trying to correct for another statistical error (specifically the one caused by some sites reporting purely local groups) and made a mistake.

As he has not publicly commented on it, and has not answered any e-mail on the subject, I can't really speculate beyond that.

Are there any other statistical errors in the stats?

Yes. The largest such error is the one I mentioned above, that some sites are reporting purely local groups. In particular, nyx.* and ott.* groups have wildly inaccurate estimated readerships because of that. Their true readerships are much closer to their raw sampled readership.
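
A rough sketch of why local groups come out so badly, again in Python with invented numbers and a helper name of my own: readers of a purely local group get scaled up as if the group were carried net-wide, even though nobody outside the reporting site can read it.

```python
def overstatement_factor(raw_local_readers, sample_size, total_size):
    """Return how many times the scaled estimate exceeds the raw count.

    Assumes the true readership of a purely local group is roughly
    its raw sampled readership, as argued above.
    """
    estimate = raw_local_readers * total_size / sample_size
    return estimate / raw_local_readers


# Hypothetical: 30 readers of a local group, a sample of 1000,
# and an estimated total population of 100000.
print(overstatement_factor(30, 1000, 100000))  # 100.0
```

The overstatement is just the total-to-sample ratio, so it is the same large factor no matter how few people actually read the group.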

The other major source of error in the statistics is that the sites sampled are unlikely to be truly statistically representative of the readership of the net. This has many causes, and it is an insoluble problem because the sites participating in Brian Reid's arbitron are self-selecting.

So the numbers are wrong?!

Yep. Sorry. But they are the best numbers we've got.



Benjamin "Snowhare" Franz / snowhare@netimages.com