Some years back, my late colleague Prof. Qinbao Song, Dr Zhongbin Sun, Prof. Carolyn Mair and I wrote a paper on the subtleties and perils of using poor quality data sets in the domain of software defect prediction. We focused on a collection of widely used data sets, known as the NASA data sets, which are demonstrably problematic. For example, implied relational integrity constraints are violated: LOC TOTAL cannot be less than Commented LOC, since the former must subsume the latter, yet some records claim exactly that. Worse, inconsistent versions of the data sets are in circulation. These kinds of problems have particular impact when research is data-driven.
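To make the kind of integrity check concrete, here is a minimal sketch in Python. It is only an illustration of the single LOC rule mentioned above, not the cleaning algorithms from the paper. The column names `LOC_TOTAL` and `LOC_COMMENTS`, the helper functions, and the three toy records are all assumptions made up for the example; the real files may label things differently.

```python
# Hypothetical column names; check the headers of the files you actually use.
TOTAL_LOC = "LOC_TOTAL"
COMMENT_LOC = "LOC_COMMENTS"


def violates_loc_constraint(row):
    """Return True if total LOC is smaller than commented LOC (impossible)."""
    return float(row[TOTAL_LOC]) < float(row[COMMENT_LOC])


def split_by_integrity(rows):
    """Partition rows into (clean, suspect) using the LOC constraint only."""
    clean, suspect = [], []
    for row in rows:
        (suspect if violates_loc_constraint(row) else clean).append(row)
    return clean, suspect


# Toy records invented for illustration, not taken from the NASA data.
modules = [
    {"id": 1, "LOC_TOTAL": 120, "LOC_COMMENTS": 30},
    {"id": 2, "LOC_TOTAL": 10, "LOC_COMMENTS": 45},  # violates the rule
    {"id": 3, "LOC_TOTAL": 8, "LOC_COMMENTS": 8},    # equal counts are allowed
]

clean, suspect = split_by_integrity(modules)
print([m["id"] for m in suspect])
```

Flagging suspect rows rather than silently dropping them keeps the cleaning step visible, so a reader can see how many records a rule removes.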
In our paper we proposed two cleaning algorithms that generate data sets D’ and D”. We recommend D”. Unfortunately, the data sets were hosted on wikispaces, which first turned into a paid-for service and is now closing down altogether. Hence the data sets have not been accessible for a period of time. Belatedly (sorry!) I’ve now hosted the cleaned data on figshare, which should provide a more permanent solution.
Reflecting on the process I see three lessons.
- It’s important to choose a stable and accessible home for shared data sets; otherwise the goal of reproducible research is hindered.
- Data cleaning is often a good deal more subtle than we give credit for. Jean Petrić and colleagues have further refined our cleaning rules.
- I’m still of the view that researchers need to pay more attention to the quality and provenance of their data sets (or at least this process needs to be explicit). Otherwise how can we have confidence in our results?
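One lightweight way to make provenance explicit is to publish a fingerprint of the exact data used, so readers can tell which of the circulating versions a study relied on. The sketch below is an assumption about how one might do this, not an established convention for these data sets; the `fingerprint` helper and the toy rows are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """SHA-256 over the rows in sorted order, so row order does not matter."""
    digest = hashlib.sha256()
    for line in sorted(",".join(str(v) for v in row) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Toy rows invented for illustration.
copy_a = [(120, 30, 0), (10, 5, 1)]
copy_b = [(10, 5, 1), (120, 30, 0)]  # same rows, different order
copy_c = [(120, 30, 0), (10, 5, 0)]  # one value differs

print(fingerprint(copy_a) == fingerprint(copy_b))  # True
print(fingerprint(copy_a) == fingerprint(copy_c))  # False
```

Reporting such a digest alongside results costs a line in a paper and lets others confirm they are working from the same version.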
PS Presently, un-cleaned versions are also available from the PROMISE repository along with additional background information. However, I would caution researchers against using these versions due to their obvious errors and inconsistencies.
Shepperd, M., Song, Q., Sun, Z. and Mair, C., 2013. “Data quality: Some comments on the NASA software defect datasets.” IEEE Transactions on Software Engineering, 39(9), pp.1208-1215.
Petrić, J., Bowes, D., Hall, T., Christianson, B. and Baddoo, N., 2016. “The jinx on the NASA software defect data sets.” In Proceedings of the ACM 20th International Conference on Evaluation and Assessment in Software Engineering (EASE). Limerick, Ireland.
Liebchen, G. and Shepperd, M., 2016. “Data sets and data quality in software engineering: eight years on.” In Proceedings of The 12th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, Ciudad Real, Spain.