In machine learning research, the main approach is experimental, since analytic approaches are seldom able to cope with the complexities of modern algorithms and data sets. So we typically conduct computational experiments to compare, say, the predictive performance of competing algorithms across a set of problems represented by different datasets. Then, as the community pools and compares results from a metaphorical patchwork of experiments, we gain knowledge.
Recently, my co-researchers Ning Li and Yuchen Guo and I were studying machine learning experiments that compared supervised and unsupervised classifiers for software defect prediction [1]. In order to make comparisons between vastly differing experiments, we needed to reconstruct the confusion matrix, which for binary classification is a 2×2 table comprising counts of true positives, false positives, false negatives and true negatives.
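To give a flavour of what this entails, here is a minimal sketch in Python. The function and the particular reported metrics are illustrative assumptions, not the exact procedure from our paper: given a paper's reported recall, precision, dataset size and number of actual positive cases, the four cell counts can be recovered.

```python
# Illustrative sketch: recover a 2x2 confusion matrix from commonly
# reported metrics (recall, precision, dataset size n, number of
# actual positives pos). Hypothetical helper, not our actual code.
def reconstruct_confusion_matrix(recall, precision, n, pos):
    tp = recall * pos                      # recall = TP / (TP + FN)
    fn = pos - tp
    fp = tp * (1 - precision) / precision  # precision = TP / (TP + FP)
    tn = n - tp - fn - fp
    # The recovered counts should be non-negative near-integers
    # (allowing for rounding in the reported metrics); anything else
    # flags an inconsistency.
    return [round(x) for x in (tp, fp, fn, tn)]

print(reconstruct_confusion_matrix(recall=0.7, precision=0.5, n=1000, pos=100))
# -> [70, 70, 30, 830], i.e. TP=70, FP=70, FN=30, TN=830
```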
In the process of doing this, we became aware that our analysis didn’t always agree with that reported in the original papers. We therefore decided to look more systematically and recruited fellow researchers from the Brunel Software Engineering Lab (Mahir Arzoky, Andrea Capiluppi, Steve Counsell, Giuseppe Destefanis) and the Intelligent Data Analysis Group (Stephen Swift, Allan Tucker, and Leila Yousefi). The upshot is a paper we will shortly be presenting at the 20th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL) [2].
In total, we analysed 49 published experiments containing 2456 individual experimental results. Overall, we found that 22 out of 49 papers contained demonstrable errors or inconsistencies. Of these, 16 related to confusion matrix inconsistencies and 7 were statistical (one paper contained both classes of error). For example, the marginal probabilities of a confusion matrix must sum to one (allowing, of course, for rounding errors). Likewise, analyses that make multiple statistical inferences using null hypothesis significance testing (NHST) need to adjust the acceptance level alpha (conventionally set to 0.05). NB there is a debate about the wisdom of NHST, but for the purposes of error checking that’s a separate matter.
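Both kinds of check are straightforward to automate. A minimal sketch (the helper names are mine, and this is not our actual analysis code):

```python
# Two illustrative consistency checks, sketched rather than taken
# from our analysis pipeline.
def marginals_consistent(tp, fp, fn, tn, reported_n, tol=1):
    """The four cell counts must sum to the reported dataset size
    (equivalently, the marginal probabilities must sum to one)."""
    return abs((tp + fp + fn + tn) - reported_n) <= tol

def bonferroni_alpha(alpha=0.05, n_tests=1):
    """Simplest adjustment for multiple NHST inferences: divide alpha
    by the number of tests (corrections such as Holm's are less
    conservative but more involved)."""
    return alpha / n_tests

print(marginals_consistent(70, 70, 30, 830, reported_n=1000))  # True
print(bonferroni_alpha(0.05, n_tests=10))                      # 0.005
```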
This error rate (22/49 papers) sounds bad enough, but approximately a third of the results provided insufficient information for us even to check. Moreover, we focused on errors that are easy to detect. So, sadly, there are reasons to suspect the real error rate in computational experiments may actually be worse than this!
All the papers we analysed had undergone peer review; a number were from top journals. However, we speculate that there are multiple causes for these errors:
- Our experiments are becoming increasingly complex (the algorithms, the dataset pre-processing, and the experimental design, which often entails complex cross-validation strategies and parameter tuning).
- Many algorithms and methods are stochastic, leading to reproducibility challenges.
- It is often not obvious what a ‘valid’ answer should look like.
- Lastly, researchers are often under pressure to be productive, which frequently entails maximising the number of published papers per unit resource.
Nor is this problem unique to machine learning. Similar analyses of error rates in other disciplines, e.g., social psychology [3], have also revealed an abundance of problems such as simple arithmetic errors. Brown and Heathers reported that, of 71 testable experiments, around half (36/71) appeared to contain at least one inconsistent mean.
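Their GRIM check is simple enough to sketch in a few lines of Python, assuming the underlying data are integer-valued (e.g., Likert responses); the function name and examples are mine:

```python
import math

def grim_consistent(reported_mean, n, decimals=2):
    """GRIM check: with n integer-valued observations the mean can only
    take the values k/n, so test whether any achievable mean rounds to
    the reported value."""
    target = round(reported_mean, decimals)
    candidates = (math.floor(reported_mean * n), math.ceil(reported_mean * n))
    return any(round(k / n, decimals) == target for k in candidates)

print(grim_consistent(5.19, 28))  # False: no integer total yields a mean of 5.19
print(grim_consistent(5.21, 28))  # True: 146/28 = 5.2142... rounds to 5.21
```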
A major review by Nuijten et al. [4] (who provide an R package, statcheck, to assist in the checking of inferential statistics) found that, of 250,000 p-values from psychology experiments, half of all published papers contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. In 12% of papers the error was sufficient to potentially impact the conclusion.
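statcheck is written in R, but the core check is easy to reproduce. Here is a Python analogue for a two-tailed t-test (an illustrative sketch using scipy, not statcheck itself):

```python
from scipy import stats

def t_test_consistent(t_value, df, reported_p, tol=0.005):
    """Recompute the two-tailed p-value implied by a reported t statistic
    and its degrees of freedom, then compare with the reported p."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)
    return recomputed_p, abs(recomputed_p - reported_p) <= tol

# e.g. a paper reporting "t(28) = 2.20, p = .04":
print(t_test_consistent(2.20, 28, reported_p=0.04))
# -> (0.0363..., True): consistent to within rounding
```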
This all sounds rather disturbing. What can be done? Well, apart from general appeals to be more careful, there are two practical suggestions.
- There are consistency-checking tools available, such as David Bowes and David Gray’s excellent DConfusion tool, which inspired much of our analysis and is described in [5].
- We should wholeheartedly embrace the ideas of open science, so the whole community can provide “many eyes”. In this spirit, we have made our data and analysis available at http://tiny.cc/vvvqbz.
Perhaps I should end this blog by completing Alexander Pope’s aphorism “to err is human” with “to forgive, divine”. The last thing that is needed is some kind of scientific witch hunt, or even snarkiness. Most of us have made errors from time to time. Making an error is not the same as scientific malpractice. We should both highlight, and respond to, the identification of errors with courtesy and professionalism.
[1] Li, N., Shepperd, M., Guo, Y.: “A systematic review of unsupervised learning techniques for software defect prediction”, arXiv preprint arXiv:1907.12027 (2019)
[2] Shepperd, M., Guo, Y., Li, N., Arzoky, M., Capiluppi, A., Counsell, S., Destefanis, G., Swift, S., Tucker, A., Yousefi, L.: “The Prevalence of Errors in Machine Learning Experiments”, IDEAL 2019, Manchester, UK, Springer LNCS 11871 (2019)
[3] Brown, N., Heathers, J.: “The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology”, Social Psychological and Personality Science 8(4), 363–369 (2017)
[4] Nuijten, M., Hartgerink, C., van Assen, M., Epskamp, S., Wicherts, J.: “The prevalence of statistical reporting errors in psychology (1985–2013)”, Behavior Research Methods 48(4), 1205–1226 (2016)
[5] Bowes, D., Hall, T., Gray, D.: “DConfusion: a technique to allow cross study performance evaluation of fault prediction studies”, Automated Software Engineering 21(2), 287–313 (2014)