A great post illustrating selection bias using the 2012 US Election and Twitter.

normally distributed

According to a Twitter-based analysis on the Guardian Data Blog, Obama is heading for electoral success.

It’s all very nice to see mapped out, and the use of geocoding is cool (though possibly flawed), but underlying the approach is a massive potential for selection bias.

The problem is quite simply this: if Democratic supporters use Twitter more frequently than Republican supporters (or are more likely to tweet about their political preferences), then the number of tweets supporting Obama over Romney is of course going to suggest that Obama is in the lead, whatever the true level of support. On the other hand, if Republican supporters are more Twitter-active than Democratic supporters, then the level of support for Obama could be underestimated. Essentially, we’ve got a reasonable estimate for a numerator (tweets in favour of each candidate), but no clue about the denominator (how many supporters of each candidate there are in total, tweeting or not).
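To make the arithmetic concrete, here is a minimal sketch in Python. It is not the Guardian's method. The function name `tweet_share` and all of the numbers are hypothetical, chosen only to illustrate the point. It assumes two candidates and that each supporter tweets about their preference with a fixed probability that depends only on which candidate they support.

```python
def tweet_share(true_support_a, tweet_rate_a, tweet_rate_b):
    """Expected share of political tweets that favour candidate A.

    true_support_a: fraction of the population supporting A (the rest support B).
    tweet_rate_a, tweet_rate_b: hypothetical probabilities that a supporter
    of A or B tweets about their preference.
    """
    tweets_a = true_support_a * tweet_rate_a
    tweets_b = (1.0 - true_support_a) * tweet_rate_b
    return tweets_a / (tweets_a + tweets_b)


if __name__ == "__main__":
    # True support is held at 50/50; only the tweeting rates change.
    scenarios = [
        ("A's supporters tweet twice as often", 0.2, 0.1),
        ("Both groups tweet equally often", 0.1, 0.1),
        ("B's supporters tweet twice as often", 0.1, 0.2),
    ]
    for label, rate_a, rate_b in scenarios:
        share = tweet_share(0.5, rate_a, rate_b)
        print(f"{label}: {share:.0%} of tweets favour A")
```

With the population split evenly, the tweet share for A moves from about 67% to 50% to about 33% purely because of who tweets more. The tweet counts alone cannot distinguish a real lead from a difference in Twitter activity.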

To answer a question well, the design of the study is crucial. It’s so…
