To err is human …

In machine learning research the dominant approach is experimental, since analytic approaches are seldom able to cope with the complexities of modern algorithms and data sets. So we typically conduct computational experiments to compare, say, the predictive performance of competing algorithms across a set of problems represented by different data sets. Then, as the community pools and compares results from a metaphorical patchwork of experiments, we gain knowledge.

Recently, my co-researchers Ning Li and Yuchen Guo and I were studying machine learning experiments that compared supervised and unsupervised classifiers for software defect prediction [1]. In order to make comparisons between vastly differing experiments we needed to reconstruct the confusion matrix, which for binary classification is a 2×2 table comprising counts of true positives, false positives, false negatives and true negatives.
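As an aside, the reconstruction itself is simple arithmetic once a few quantities are reported. Below is a minimal sketch in R (illustrative only, not the exact procedure from our paper): it recovers the four cell counts from recall, precision, the total number of instances and the number of actual positives; the function name is mine.

    # Sketch: recover TP, FP, FN, TN from commonly reported metrics, assuming
    # recall = TP / (TP + FN) and precision = TP / (TP + FP).
    reconstruct_confusion <- function(recall, precision, n, n_positive) {
      tp <- recall * n_positive      # true positives
      fn <- n_positive - tp          # false negatives
      fp <- tp / precision - tp      # false positives
      tn <- n - tp - fn - fp         # true negatives (whatever remains)
      round(matrix(c(tp, fp, fn, tn), nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("pos", "neg"),
                                   actual    = c("pos", "neg"))))
    }

    # Example: 1000 modules, 200 actually defective, recall 0.7, precision 0.5
    reconstruct_confusion(recall = 0.7, precision = 0.5, n = 1000, n_positive = 200)

In practice different papers report different subsets of metrics, so several such mappings are needed, which is exactly where inconsistencies start to surface.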

In the process of doing this we became aware that our analysis didn’t always agree with that contained in the original papers. Therefore, we decided to look more systematically and recruited fellow researchers from the Brunel Software Engineering Lab (Mahir Arzoky, Andrea Capiluppi, Steve Counsell, Giuseppe Destefanis) and the Intelligent Data Analysis Group (Stephen Swift, Allan Tucker, and Leila Yousefi). The upshot is a paper we will shortly be presenting at the 20th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL) [2].

In total, we analysed 49 published experiments containing 2456 individual experimental results. Overall we found that 22 out of 49 papers contain demonstrable errors or inconsistencies. Of these, 16 related to confusion matrix inconsistencies and 7 were statistical (one paper contained both classes of error). For example, the marginal probabilities of a confusion matrix must sum to one (allowing, of course, for rounding errors). Likewise, analyses that make multiple statistical inferences using null hypothesis significance testing (NHST) need to adjust the acceptance level alpha (conventionally set to 0.05). NB there is a debate about the wisdom of NHST, but for the purposes of error checking that’s a separate matter.
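Both kinds of check are easy to automate. The fragment below is a minimal sketch (not our actual analysis scripts): the first function flags a confusion matrix whose reported cell proportions fail to sum to one within a rounding tolerance, and the second uses base R’s p.adjust() to apply a Bonferroni correction across several tests.

    # Check 1: the cell proportions of a confusion matrix should sum to ~1.
    check_matrix_sums <- function(proportions, tol = 0.01) {
      abs(sum(proportions) - 1) <= tol    # TRUE if consistent, allowing rounding
    }
    check_matrix_sums(c(TP = 0.14, FP = 0.14, FN = 0.06, TN = 0.60))  # FALSE (sums to 0.94)

    # Check 2: multiple NHST comparisons need an adjusted alpha.
    p_values <- c(0.012, 0.030, 0.041, 0.200)
    p.adjust(p_values, method = "bonferroni")  # only 0.012 survives at alpha = 0.05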

This error rate (22 / 49 papers) sounds bad enough, but approximately a third of the results provided insufficient information for us to even check.  Plus we focused on errors that are easy to detect.  So sadly, there are reasons to suspect the real error rate in computational experiments may actually be worse than this!  

All the papers we analysed had undergone peer review; a number were from top journals. However, we speculate that there are multiple causes for these errors:

  1. Our experiments are becoming increasingly complex (the algorithms, the dataset pre-processing and the experimental design which often entails complex cross-validation strategies and parameter tuning).
  2. Many algorithms and methods are stochastic, leading to reproducibility challenges.
  3. It is not often obvious what a ‘valid’ answer should look like.
  4. Lastly, researchers are often under pressure to be productive, which frequently entails maximising the number of published papers per unit resource.

Nor is this unique to machine learning. Similar analyses of error rates in other disciplines, e.g., social psychology [3], have also revealed an abundance of problems such as simple arithmetic errors. They reported that of 71 testable experiments, half (36/71) appeared to contain at least one inconsistent mean.

A major review by Nuijten et al. [4] (who provide an R package, statcheck, to assist in checking inferential statistics) examined over 250,000 p-values from psychology experiments and found that half of all published papers contained at least one p-value that was inconsistent with its reported test statistic and degrees of freedom. In 12% of papers the error was sufficient to potentially impact the conclusion.
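The underlying check is straightforward: given the reported test statistic and degrees of freedom, recompute the p-value and compare it with the published one. A minimal base R sketch for a two-sided t-test (statcheck automates this kind of check at scale):

    # Recompute the two-sided p-value for a reported result, e.g. t(28) = 2.10, p = .04
    t_stat <- 2.10
    df     <- 28
    p_recomputed <- 2 * pt(-abs(t_stat), df)
    round(p_recomputed, 3)   # ~0.045, so a reported p = .04 is (just about) consistent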

This sounds rather disturbing. What can be done? Well, apart from general appeals to be more careful, there are two practical suggestions.

  1. There are consistency checking tools available, such as David Bowes and David Gray’s excellent DConfusion tool, which inspired much of our analysis and is described in [5].
  2. We should wholeheartedly embrace the ideas of open science, so the whole community can provide “many eyes”. In this spirit, we have made our data and analysis available at http://tiny.cc/vvvqbz.

Perhaps I should end this blog by completing Alexander Pope’s aphorism “to err is human” with “to forgive, divine”. The last thing that is needed is some kind of scientific witch hunt, or even snarkiness. Most of us have made errors from time to time. Making an error is not the same as scientific malpractice. We should both highlight, and respond to, the identification of errors with courtesy and professionalism.

References

[1] Li, N., Shepperd, M., Guo, Y.: “A systematic review of unsupervised learning techniques for software defect prediction”, arXiv preprint arXiv:1907.12027
[2] Shepperd, M., Guo, Y., Li, N., Arzoky, M., Capiluppi, A., Counsell, S., Destefanis, G., Swift, S., Tucker, A., and Yousefi, L.: “The Prevalence of Errors in Machine Learning Experiments”, IDEAL 2019, Manchester, UK, Springer LNCS 11871.

[3] Brown, N., Heathers, J.: “The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology”, Social Psychological and Personality Science 8(4), 363–369 (2017)
[4] Nuijten, M., Hartgerink, C., van Assen, M., Epskamp, S., Wicherts, J.: “The prevalence of statistical reporting errors in psychology (1985–2013)”, Behavior Research Methods 48(4), 1205–1226 (2016)

[5] Bowes, D., Hall, T., Gray, D.: “DConfusion: a technique to allow cross study performance evaluation of fault prediction studies”, Automated Software Engineering 21(2), 287–313 (2014)

Does it matter if we use student participants in our software engineering experiments?

Recently Davide Falessi and co-workers published a paper entitled “Empirical software engineering experts on the use of students and professionals in experiments” [1], in which they make some quite strong claims about the benefits of using students as opposed to professional participants in empirical software engineering experiments.  This has provoked considerable discussion in the research community and consequently there will be a series of commentaries on the original paper:
  • Dag Sjøberg and Gunnar Bergersen – “Comments on ‘Empirical Software Engineering Experts on the Use of Students and Professionals in Experiments’ by Falessi et al., EMSE 2018”
  • Per Runeson – “Realism in Software Engineering Experiments — A Search in Vain?”
  • myself (Martin Shepperd) – “Inferencing into the Void: Problems with Implicit Populations”
followed by a response from some of the original authors.  The complete article will appear in the December issue of the journal Empirical Software Engineering [2].
In the meantime, I will try to summarise my arguments and link to an early version of my comments on arXiv.
  • It’s important to make considered decisions about the types of participant we use and appreciate the potential threats to validity.
  • We should avoid dichotomising participants as professionals/students since a student might be a part-time or former professional and in any case their experience may be more or less relevant.
  • It’s important to be explicit about the type of population being investigated.
  • We also need to consider how we sample tasks, artefacts and settings (as well as participants).  Are these representative? Are there potential interactions?
  • Sometimes pragmatism wins the argument: using students may be the only option, but if that’s the case let’s be honest and say it’s a matter of expediency rather than claim it’s actually better than using professionals (if that is indeed the case).
  • In terms of advocacy, using professionals is more likely to be persuasive for getting new software engineering techniques adopted in practice.
So whilst I appreciate Falessi et al. [1] initiating a discussion around the choice of participant in our experiments, my fear is that some researchers may attempt to use this paper as a blanket justification for taking the easier path of using students when they are not representative of the population of interest.  Presently (18.10.2018), the paper has 14 citations.  After excluding one duplicate and one in Portuguese, 8/12 argue that it’s OK to use students because Falessi et al. say so.  A typical example is “Falessi et al. state that controlled experiments with students are as valid as experiments with experts”. Obviously there are occasions when this may be so, but to use this paper as blanket permission worries me.  In fact it worries me a good deal.
References:
[1] Falessi, D., Juristo, N., Wohlin, C.,Turhan, B., Münch, J., Jedlitschka, A., Oivo, M., “Empirical software engineering experts on the use of students and professionals in experiments”, Empirical Software Engineering 23(1), pp452–489 (2018).
[2] Feldt, R., Zimmermann, T., Bergersen, G., Falessi, D., Jedlitschka, A., Juristo, N., Münch, J., Oivo, M., Runeson, P., Shepperd, M., Sjøberg, D., Turhan, B., “Four commentaries on the use of students and professionals in empirical software engineering experiments”, Empirical Software Engineering 23(6), pp3801-3820 (2018).

Updated:

This blog was updated (25.10.2018) to reflect the new title and authorship of [2], which was changed at the request of the journal editors Robert Feldt and Tom Zimmermann to better reflect the content.
It was further updated (28.11.2018) to provide a link and full publication details for reference [2].

Replication studies considered harmful!

On June 1st I will be presenting a paper at the 40th International Conference on Software Engineering (#ICSE18) in Gothenburg as part of the New Ideas and Emerging Results track [8].  As the title suggests, I will be somewhat controversial.  The paper addresses two questions:
  1. How similar must a replication result be  to constitute confirmation?
  2. How effective is the process of replication for adding empirically-derived, software engineering knowledge?

In empirical software engineering, it is more or less a given that replication is the best way to test our confidence in an empirical result.  Consequently the number of replication studies has been growing.  For example, a mapping study found 16 replications in 2000–2002 and, a decade later (2010–2012), a total of 63 replications: almost a four-fold growth [3].

Defining replication

However, first things first.  What do we mean by replication?  In a meticulous review of definitions, Gómez et al. [5] found more than 70 different definitions, which they classified as:
  • Group 1: essentially a faithful replication of the original experiment
  • Group 2: some variation from the original experiment, e.g., measurement instruments, metrics, protocol, populations, experimental design or researchers
  • Group 3: shares the same constructs and hypotheses
It is not obvious to me how Groups 2 and 3 differ, so it seems easier to refer to Group 1 as a reproduction (as is commonly the case in other scientific disciplines [7]) and treat Groups 2 and 3 as replications on some continuum.

Q1: How similar must a replication result be to constitute confirmation?

Returning to our first question: how similar must a replication result be to the original experiment to constitute confirmation? Remarkably, we don’t seem to have explicitly addressed this question of #replicability in software engineering.  Perhaps the answer is so obvious it doesn’t need to be; but is it?  Clearly we wouldn’t expect identical results, unless we are dealing with #reproducibility.  So how much difference might be acceptable?
An obvious and common approach is to use p-values and null hypothesis significance testing (NHST). If the original study’s calculated p-value falls below a threshold (typically α = 0.05, possibly with correction for multiple tests) then the effect is deemed to be “statistically significant”, so one would expect a confirmatory replication also to be significant.  Unfortunately this is mistaken, particularly if the original study is underpowered [1, 2, 4]. Worse, if the null hypothesis is true then p becomes a random variable following a uniform distribution, which means all values of p are equally likely.
So, even if both studies sample the same population, the intervention is perfectly replicated and the measurement instrument is identical and without error, sampling error alone can cause differences in results, and these differences can be surprisingly large.
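This is easy to convince yourself of with a few lines of simulation: draw two samples from the same population, test them, repeat, and the p-values spread out roughly uniformly (a quick illustrative sketch, not part of the paper’s analysis).

    set.seed(1)
    # Under a true null (both samples from the same population) p is ~Uniform(0, 1)
    p_vals <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)
    hist(p_vals, breaks = 20, main = "p-values under the null", xlab = "p")
    mean(p_vals < 0.05)   # ~0.05, i.e. a 'significant' result about 1 time in 20 by chance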

Simulation

In my paper I explore this through simulation.
Suppose we have two treatments X and Y and we want to compare them experimentally. Each experiment has 30 units, where a unit might be a participant, a data set, and so forth.  This seems reasonable based on Jørgensen et al. [6] who found in their survey of software engineering experiments that 47% had a sample size of 25 or less.  Let’s also suppose the experimental design is extremely simple and that the two samples are independent, as opposed to paired.  We also assume the rather unlikely situation of no measurement errors and no publication bias.
We investigate two underlying population distributions: (i) normally distributed and (ii) a more realistic mixed-normal distribution (contamination level = 10 sd, mixing probability = 0.2) that yields a heavy-tailed but still symmetric distribution.
I simulate two conditions: (i) no effect, i.e., μ(X) = μ(Y) = 0, and (ii) a small effect (μ(X) − μ(Y) = 0.2). Note that small effect sizes dominate our research [6] and this is exacerbated by the tendency of under-powered studies to over-estimate the true effect, not to mention selective reporting and flexible analysis practices.
Then I simulate the replication process by randomly drawing pairs of studies, without replacement, and observing the difference in results. So for my simulation of 10,000 experiments this gives 5,000 replications.
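For the curious, the gist of the simulation looks something like the sketch below. It is a simplified version of the full R code mentioned under Results; the contaminated-normal generator and the effect-size helper are my own illustrative names.

    set.seed(42)
    n <- 30; n_experiments <- 10000; delta <- 0.2     # small true effect

    # Contaminated normal: with probability 0.2 the sd is inflated tenfold (heavy tails)
    rcnorm <- function(n, mean = 0, p_mix = 0.2, sd_wide = 10) {
      wide <- runif(n) < p_mix
      rnorm(n, mean, ifelse(wide, sd_wide, 1))
    }

    cohens_d <- function(x, y) {
      s_pool <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                     (length(x) + length(y) - 2))
      (mean(x) - mean(y)) / s_pool
    }

    # One simulated experiment: treatment X vs Y with heavy-tailed errors
    one_experiment <- function() cohens_d(rcnorm(n, mean = delta), rcnorm(n, mean = 0))
    d <- replicate(n_experiments, one_experiment())

    # 'Replication': pair experiments at random, without replacement, and compare
    pairs <- matrix(sample(n_experiments), ncol = 2)   # 5,000 original/replication pairs
    agree <- sign(d[pairs[, 1]]) == sign(d[pairs[, 2]])
    mean(agree)   # how often the 'replication' even agrees on the direction of effect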

Results

The R code, additional figures and associated materials can be downloaded.  However, in brief, there are three main findings:
  1. For the no effect condition we observe a surprisingly wide range of possible effect sizes, with only just over half the experiments finding negligible or no effect [−0.2, +0.2].
  2. Departures from normality greatly harm our ability to reliably detect effects.
  3. In the face of heavy tails, ~32% of replications agree in the correct direction, ~48% disagree and ~19% agree but in the opposite direction to the true effect.
Alternatively, if we ask what variability in results between the original and the replication studies we might expect solely due to sampling error, then we can construct a prediction interval [9].  In other words, how different can two results be and still be explained by nothing more than random differences between the two samples? Essentially, given the first n1 observations (from the original study), what might we expect from the next n2 (in the replication)?  To compute this we need both sample sizes, the statistic of interest from the original study (in our case the standardised mean difference, aka Cohen’s d) and the variance or standard deviation.  Unfortunately researchers are seldom in the habit of reporting this information, which greatly reduces the value of published results.
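A rough sketch of the computation is below, using the usual large-sample approximation for the variance of a standardised mean difference and a normal quantile; Spence and Stanley [9] give a more careful treatment, and the function names here are mine.

    # Approximate 95% prediction interval for a replication's Cohen's d,
    # given the original study's d and the sample sizes of both studies.
    var_d <- function(d, n1, n2) (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))

    pi_d <- function(d_orig, n1_orig, n2_orig, n1_rep, n2_rep, level = 0.95) {
      se <- sqrt(var_d(d_orig, n1_orig, n2_orig) + var_d(d_orig, n1_rep, n2_rep))
      z  <- qnorm(1 - (1 - level) / 2)
      c(lower = d_orig - z * se, upper = d_orig + z * se)
    }

    # Example: original d = 0.3 with 30 per group; replication also 30 per group
    pi_d(0.3, 30, 30, 30, 30)   # roughly (-0.42, 1.02): even the sign is not pinned down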
The following table shows three examples of studies that have been replicated. They were chosen simply because I have access to the variance of the results. There are two points to make. First, the good news: each study is confirmed by the replication.  Second, the bad news: the prediction intervals are all so broad it’s hard to conceive of a replication that wouldn’t ‘confirm’ the original study.  Consequently we learn relatively little, particularly when the effect size is small and the variance high. Note that for R1 and R2, a result in either direction would constitute a confirmation!  Even for R3 the prediction interval includes no effect.
[Table: prediction intervals for the three replicated studies R1–R3]

Q2: How effective is the process of replication?

So for the second question, how effective is the process of replication for adding empirically-derived, software engineering knowledge? My answer is hardly at all, hence the title of the paper and my blog.
To finish on a positive note, my strong recommendation is to use meta-analysis to combine results in order to estimate the population effect. To that end, below is a simple forest plot of R3 which, by pooling the results, narrows the confidence interval around the overall estimate.
[Figure: forest plot of the pooled R3 results]
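For anyone wanting to do likewise, the pooling itself only takes a few lines once you have the effect sizes and their variances. Here is a minimal sketch using the metafor package; the numbers are made up purely for illustration and are not the R3 data.

    library(metafor)   # install.packages("metafor") if needed

    # Illustrative standardised mean differences (yi) and their variances (vi)
    d <- data.frame(yi = c(0.35, 0.20, 0.50),
                    vi = c(0.070, 0.065, 0.080))

    m <- rma(yi = yi, vi = vi, data = d, method = "REML")  # random-effects meta-analysis
    summary(m)   # pooled estimate with a narrower confidence interval
    forest(m)    # forest plot of the individual and pooled estimates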

References:

[1] V. Amrhein, F. Korner-Nievergelt, and T. Roth, “The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research,” PeerJ 5, pp e3544, 2017.
[2] G. Cumming, “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better,” Perspectives on Psychological Science, 3(4), pp286–300, 2008.
[3] C. de Magalhães et al., “Investigations about replication of empirical studies in software engineering: A systematic mapping study,” Information and Software Technology, 64, pp76–101, 2015.
[4] A. Gelman and J. Carlin. “Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors,” Perspectives on Psychological Science 9(6), pp641–651, 2014.
[5] O. Gómez, N. Juristo, S. Vegas, “Understanding replication of experiments in software engineering: A classification,” Information and Software Technology 56 pp1033–1048, 2014.
[6] M. Jørgensen, T. Dybå, K. Liestøl, and D. Sjøberg. “Incorrect Results in Software Engineering Experiments: How to Improve Research Practices,” J. of Syst. & Softw. 116, pp133–145, 2016.
[7] R. Peng, “Reproducible Research in Computational Science,” Science, 334(6060), pp1226–1227, 2011.
[8] M. Shepperd, “Replication studies considered harmful,” in 40th ACM/IEEE International Conference on Software Engineering – New Ideas and Emerging Results, 2018.
[9] J. Spence, and D. Stanley, “Prediction Interval: What to Expect When You’re Expecting … A Replication,” PLoS ONE, 11(9), pp e0162874, 2016.


Experiments with software professionals

Recently I presented a paper at the 33rd ACM Symposium on Applied Computing (SAC’18) entitled “An Experimental Evaluation of a De-biasing Intervention for Professional Software Developers”, which was co-authored with Prof. Carolyn Mair (consultant and cognitive psychologist) and Prof. Magne Jørgensen (researcher in human aspects of software engineering).
 
Our goal was to examine how cognitive biases (specifically the anchoring bias) impact software engineers’ judgement and decision making and how it might be mitigated.

The anchoring bias is powerful and has been widely documented [1].  It arises from being influenced by initial information, even when it is totally misleading or irrelevant. This strong effect can be very problematic when making decisions or judgements.  Even extreme anchors, e.g., that the length of a blue whale is 900m (unreasonably high anchor) or 0.2m (unreasonably low anchor), influence people’s judgements about the length of whales. Jørgensen has been active in demonstrating that software engineering professionals are not immune from this bias (see his new book on time predictions: free download).

Therefore we decided to experimentally investigate whether de-biasing interventions, such as a 2-3 hour workshop, can reduce, or even eliminate, the anchoring bias.  Given the many concerns that have been expressed about under-powered studies and about being able to reliably identify small effects in the context of noisy data, we made four decisions.
  1. Use professional software engineers (there is an ongoing debate about the value of student participants, e.g., the article by Falessi et al. [2] in favour, which contrasts with the strong call for realism from Sjøberg et al. [3]; we side with realism).
  2. Use a large sample (n=410).
  3. Use robust statistics.
  4. Share our data and analysis (to improve the reproducibility of the study).
In brief, we used a 2×2 experimental design with high and low anchors combined with a de-biasing workshop and control.  Participants were randomly exposed to a high or low anchor and then asked to estimate their own productivity on the last software project they had completed (EstProd).  Some had previously undertaken our de-biasing workshop while others received no intervention, i.e., the control group.
The interaction plot below shows a large effect between high and low anchor.  It also shows that the workshop reduces the effect of the high anchor (the slope of the solid line is less steep) but has far less effect on the low anchor. However, it does not eliminate the bias.

[Figure: interaction plot of anchor (high/low) × intervention (workshop/control) on estimated productivity]
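For anyone wanting to reproduce this kind of figure, a minimal sketch with base R’s interaction.plot() is below. The data are simulated stand-ins that merely mimic the pattern described above; the real data and analysis are shared separately (decision 4 above).

    # Hypothetical data: anchor condition (high/low) crossed with intervention
    # (workshop/control); EstProd is the productivity estimate.
    set.seed(7)
    dat <- expand.grid(anchor = c("low", "high"),
                       workshop = c("control", "workshop"),
                       rep = 1:100)
    base <- ifelse(dat$anchor == "high", 45, 20)    # strong anchoring effect
    damp <- ifelse(dat$anchor == "high" & dat$workshop == "workshop", -10, 0)
    dat$EstProd <- rnorm(nrow(dat), mean = base + damp, sd = 5)  # workshop partly de-biases

    with(dat, interaction.plot(x.factor = anchor, trace.factor = workshop,
                               response = EstProd, fun = median,
                               ylab = "Median EstProd"))   # medians: more robust than means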

We conclude that:
  1. We show how professionals can be misled easily into making highly distorted judgements.
  2. This matters because despite all our tools and automation, software engineering remains a profession that requires judgement and flair.
  3. So try to avoid anchors.
  4. But it is possible to reduce bias.
  5. We believe there are many other opportunities for refining and improving de-biasing interventions.
Caveats are:
  1. We only considered one type of bias.
  2. We used a relatively simple de-biasing intervention based on a 2-3 hour workshop.
  3. We don’t know how long the de-biasing effect will last.

REFERENCES:

[1] Halkjelsvik, T. and Jørgensen, M. “From origami to software development: A review of studies on judgment-based predictions of performance time”, Psychological Bulletin, 138(2), pp238–271, 2012.
[2] Falessi, D. et al., “Empirical software engineering experts on the use of students and professionals in experiments”, Empirical Software Engineering, 23(1), pp452–489, 2018.
[3] Sjøberg, D. et al., “Conducting realistic experiments in software engineering”, IEEE International Symposium on Empirical Software Engineering, pp17–26, 2002.

The cleaned NASA MDP data sets

Some years back I, together with my late colleague Prof. Qinbao Song, Dr Zhongbin Sun and Prof. Carolyn Mair, wrote a paper [1] on the subtleties and perils of using poor quality data sets in the domain of software defect prediction.  We focused on a collection of widely used data sets, known as the NASA data sets, which were demonstrably problematic, e.g., implied relational integrity constraints are violated, such as LOC TOTAL being less than Commented LOC even though the former must subsume the latter.  Worse, inconsistent versions of the data sets are in circulation.  These kinds of problems have particular impact when research is data-driven.
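To give a flavour, a check for that particular constraint only needs a couple of lines of R. Note that column names vary between the different versions of the NASA data in circulation, so the names used here (LOC_TOTAL, LOC_COMMENTS) and the file name are purely illustrative.

    # Flag rows that violate an implied relational integrity constraint, e.g.
    # total LOC should never be smaller than the commented LOC it subsumes.
    check_loc_constraint <- function(df, total = "LOC_TOTAL", comments = "LOC_COMMENTS") {
      bad <- which(df[[total]] < df[[comments]])
      if (length(bad) > 0)
        warning(length(bad), " rows violate ", total, " >= ", comments)
      bad   # row indices of the violating cases
    }

    # Usage (illustrative file and column names):
    # kc1 <- read.csv("KC1.csv")
    # check_loc_constraint(kc1)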

In our paper we proposed two cleaning algorithms that generate data sets D’ and D”. We recommend D”.  Unfortunately, although the data sets were hosted on Wikispaces, this turned into a pay-for service and is now closing down altogether.  Hence the data sets have not been accessible for a period of time.  Belatedly (sorry!) I’ve now hosted the cleaned data on figshare, which should provide a more permanent solution.

Reflecting on the process I see three lessons.

  1. It’s important to choose a stable and accessible home for shared data sets, otherwise the goal of reproducible research is hindered.
  2. Data cleaning is often a good deal more subtle than we give credit for.  Jean Petrić and colleagues have further refined our cleaning rules [2].
  3. I’m still of the view that researchers need to pay more attention to the quality and provenance of their data sets (or at least this process needs to be explicit).  Otherwise how can we have confidence in our results?

PS Presently, un-cleaned versions are also available from the PROMISE repository along with additional background information; however, I would caution researchers against using these versions due to their obvious errors and inconsistencies.

References:

[1] Shepperd, M., Song, Q., Sun, Z. and Mair, C., 2013. “Data quality: Some comments on the NASA software defect datasets.” IEEE Transactions on Software Engineering, 39(9), pp.1208-1215.

[2] Petrić, J., Bowes, D., Hall, T., Christianson, B. and Baddoo, N., 2016, “The jinx on the NASA software defect data sets.” In Proceedings of the ACM 20th International Conference on Evaluation and Assessment in Software Engineering (EASE). Limerick, Ireland.

[3] Liebchen, G. and Shepperd, M., 2016. “Data sets and data quality in software engineering: eight years on.” In Proceedings of The 12th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, Ciudad Real, Spain.

FaceTAV 2017 Workshop at Facebook

On Monday and Tuesday (6-7th Nov, 2017) I was privileged to attend a two-day workshop on Testing and Verification at Facebook, London, organised by Mark Harman and Peter O’Hearn, along with about 90 other people.  Unusually for these kinds of events it was a very balanced mix of industry and academia.
The theme was testing and verification and some kind of rapprochement between the two groups. As it turned out this was hardly contentious as everyone seemed to agree that these two areas are mutually supportive.
As ever it seems a bit invidious to pick out the key lessons, but for me, I was most struck by:
  1. Google has ~2 billion LOC.  That is hard to appreciate!
  2. Practitioners strongly dislike false positives (i.e. predicted errors that do not exist); it was suggested that a rate of more than 10% is problematic.
  3. Testing multiple interleaved threads is complex and it’s possible to occasionally observe very unexpected behaviours.
  4. Testing costs are a key concern to developers but not something that researchers, yet, have a good handle on.
  5. The quality and sophistication of some automated testing tools e.g. DiffBlue and Sapienz (see the paper by  Ke Mao, Mark Harman and Yue Jia) and their ability to scale is seriously impressive!
So thank you speakers and organisers!

Why I disagree with double blind reviewing

As both an author and a member of a number of programme committees, I’ve been reflecting on the recent decision of various academic conferences, including ICSE 2018, to opt for double blinding of the review process. Essentially this means both the identity of the reviewer and the author are hidden; in the case of triple blinding, which has been mooted, even the identity of one’s fellow reviewers is hidden.

But there remain many potential revealing factors e.g. the use of British or US English, the use of a comma or period as a decimal point indicator, choice of word processor and the need to cite and build upon one’s own work. Why not quadruple or even quintuple blinding?!! Should the editor or programme chair be known? What about the dangers of well-established researchers declining to serve on PCs for second-tier conferences? Perhaps there should just be a pool of anonymous papers randomly assigned to anonymous reviewers that will be randomly allocated to conferences?

Personally, I’m strongly opposed to any blinding in the review process. And here’s why.

Instinctively I feel that openness and transparency lead to better outcomes than hiding behind anonymity. Be that as it may, let’s try to be a little more analytical. First, what kinds of bias are we trying to address? There seem to be five types of bias derived from:

– personal animosity
– characteristics of the author e.g. leading to misogynist or racist bias
– the alignment / proximity to the reviewer’s research beliefs and values
– citation(s) of the reviewer’s work
– the reviewer’s narcissism and the need for self-aggrandisement

It seems that only the first two biases could be addressed through blinding, and this of course assumes that the blinding is successful, which in small fields may be difficult. Although I would never seek to actively discover the identity of the authors of a blinded paper, in many cases I am pretty certain as to who the authors are. And it doesn’t matter.

In my opinion, double blinding is a distraction, but one with some negative side effects. The blinding process harms the paper. As an author I’m asked to withhold supplementary data and scripts because this might reveal who I am. Furthermore sections must be written in an extremely convoluted fashion so I don’t refer to my previous work, reference work under review or any of the other perfectly natural parts of positioning a new study. It promotes the idea that each piece of research is in some sense atomic.

Why not do the opposite and make reviewers accountable for their opinions by requiring them to disclose who they are? Journal editors or programme chairs are known, so why not the reviewers too?

Open reviewing would reduce negative and destructive reviews. It might also help deal with the situation where the reviewer demands additional references be added, all of which seem tangential to the paper but, coincidentally, are authored by the reviewer! The only danger I foresee might be that reviews become more anodyne as reviewers do not wish to be publicly controversial. But then this supposes I, as a reviewer, wish to have a reputation as a bland yes-person. I would be unlikely to want this, so I’m unconvinced by this argument.

So whilst I accept that the motivation for double blind reviewing is good, and that I seem to be in a minority (see the excellent investigation of attitudes in software engineering by Lutz Prechelt, Daniel Graziotin and Daniel Fernández), I still think it’s unfortunate.