Sunday, March 3, 2013

Lies, damned lies, and statistics

In my experience, most PhD students in engineering and computer science must use quantitative data analysis and statistical techniques at some point in their research to evaluate and validate experimental data. Increasingly, social science researchers use these techniques as well, mainly through well-established software packages such as SPSS, R-Statistics, and others.

I have to say that most papers I read, even the relatively theoretical ones, have a strong empirical data analysis component. It is easy to lose sight of the problems associated with the statistical evaluation of empirical work, so I thought it would be a good idea to remind my readers and myself of some common pitfalls of statistical techniques:


  • Discarding unfavorable data

  • Loaded questions

  • Overgeneralization

  • Biased samples

  • Misreporting or misunderstanding of estimated error

  • False causality (demonstrated in the first sketch after this list)

  • Proof of the null hypothesis

  • Data dredging (demonstrated in the second sketch after this list)

  • Data manipulation
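
To make two of these pitfalls concrete, here are a couple of minimal Python sketches using NumPy and SciPy (the scenarios and variable names are purely illustrative, not drawn from any particular study). First, false causality: two independent series that merely share a time trend will look strongly correlated, even though neither influences the other.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Two independent yearly "measurements" that both trend upward,
    # e.g. ice-cream sales and drowning counts. Neither causes the other.
    n_years = 50
    trend = np.arange(n_years, dtype=float)
    series_a = trend + rng.normal(scale=5.0, size=n_years)
    series_b = trend + rng.normal(scale=5.0, size=n_years)

    r, p = stats.pearsonr(series_a, series_b)
    print(f"Raw correlation: r = {r:.2f} (p = {p:.1e})")  # strong and "significant"

    # Removing the shared trend (here by differencing) makes the spurious
    # relationship disappear: the year-to-year changes are unrelated.
    r_d, p_d = stats.pearsonr(np.diff(series_a), np.diff(series_b))
    print(f"After differencing: r = {r_d:.2f} (p = {p_d:.2f})")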
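
Second, data dredging: test enough hypotheses on pure noise and some will come out "significant" at the usual 0.05 level by chance alone.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # One hundred candidate variables, none of which truly differs
    # between the two groups: every value is pure noise.
    n_variables, n_per_group = 100, 30
    false_positives = 0
    for _ in range(n_variables):
        group_a = rng.normal(size=n_per_group)
        group_b = rng.normal(size=n_per_group)
        _, p_value = stats.ttest_ind(group_a, group_b)
        if p_value < 0.05:
            false_positives += 1

    print(f"'Significant' results on pure noise: {false_positives} of {n_variables}")
    # Roughly 5 spurious hits are expected at alpha = 0.05; a multiple-comparison
    # correction such as Bonferroni (alpha / n_variables) removes almost all of them.

Reporting only the handful of variables that cleared the 0.05 bar, while staying silent about the dozens of tests that did not, is precisely data dredging; correcting for multiple comparisons, or validating findings on a fresh dataset, guards against it.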

Wikipedia is a good source on each of these. A good statistics textbook, combined with some lighter reading, will also help.

There are also calls for researchers to make their code and datasets publicly available so that experiments can be repeated independently. This is increasingly becoming common practice, especially at high-profile journals and conferences, but releasing datasets and code bases still raises numerous practical issues.
