FaceTAV 2017 Workshop at Facebook

On Monday and Tuesday (6-7th Nov, 2017) I was privileged to attend a two-day workshop on Testing and Verification at Facebook, London, organised by Mark Harman and Peter O’Hearn, along with about 90 other attendees.  Unusually for these kinds of events, it was a very balanced mix of industry and academia.
The theme was testing and verification and some kind of rapprochement between the two communities. As it turned out, this was hardly contentious, as everyone seemed to agree that the two areas are mutually supportive.
As ever, it seems a bit invidious to pick out key lessons, but I was most struck by:
  1. Google has ~2 billion LOC.  That is hard to appreciate!
  2. Practitioners strongly dislike false positives (i.e. reported errors that do not actually exist); it was suggested that a false positive rate above 10% becomes problematic.
  3. Testing multiple interleaved threads is complex and it’s possible to occasionally observe very unexpected behaviours.
  4. Testing costs are a key concern for developers, but not something that researchers yet have a good handle on.
  5. The quality and sophistication of some automated testing tools, e.g. DiffBlue and Sapienz (see the paper by Ke Mao, Mark Harman and Yue Jia), and their ability to scale are seriously impressive!
So thank you, speakers and organisers!

“Estimating software project effort using analogies” 20 years on


Almost 20 years ago, Chris Schofield and I published a paper entitled “Estimating software project effort using analogies” [1], which described the idea of using case-based (or analogical) reasoning to predict software project development effort from the outcomes of previous projects.  We tested the ideas on nine different data sets, used stepwise regression as a benchmark, and reported that in “all cases analogy outperforms” the benchmark.
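
For readers unfamiliar with the idea, here is a minimal illustrative sketch (in Python, and emphatically not the original ANGEL code) of analogy-based prediction: find the k completed projects most similar to the new one and use their known efforts as the basis of the estimate. The feature set and distance measure are only assumptions for the purpose of illustration.

```python
# Illustrative sketch of analogy-based (k-nearest-neighbour) effort prediction.
# Not the original ANGEL tool; features, distance and aggregation are assumptions.
import numpy as np

def predict_by_analogy(train_X, train_effort, new_project, k=3):
    """Estimate effort as the mean effort of the k most similar past projects.

    train_X      : (n, d) array of project features (e.g. size, team experience)
    train_effort : (n,) array of known project efforts
    new_project  : (d,) feature vector for the project to be estimated
    """
    train_X = np.asarray(train_X, dtype=float)
    train_effort = np.asarray(train_effort, dtype=float)
    q = np.asarray(new_project, dtype=float)

    # Normalise each feature to [0, 1] so no single dimension dominates the distance.
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    X = (train_X - lo) / scale
    q = (q - lo) / scale

    # Euclidean distance to every past project; average the k closest analogues.
    dists = np.sqrt(((X - q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(train_effort[nearest].mean())
```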

Today (27.10.2017) is a landmark: not only is our paper 20 years old, it has also (according to Google Scholar) reached 1000 citations.  So I thought it appropriate to take stock and offer some reflections.

Why has the paper been widely cited?

I think the citations are for four reasons.  First, the paper proposed a relatively new approach to an important but tough problem; actually, the idea wasn’t new, but the application was.  Second, we tried to be thorough in the experimental evaluation and to provide a meaningful comparator.  Third, the publication venue, the IEEE Transactions on Software Engineering, is highly visible to the community.  Finally, there is an element of luck in the citation ‘game’.  Timing is all-important, and once a paper becomes well known it garners citations simply because other writers can recall it more easily than alternatives that might be more recent or relevant but less well known.

What ideas have endured?

I see three aspects of our paper that I think remain important.  First, we used meaningful benchmarks against which to compare our prediction approach. We chose stepwise regression because it is well understood, simple and requires little effort.  If analogy-based prediction cannot ‘beat’ regression then it is not a competitive technique.  I think having such benchmarks is important; otherwise, showing that an elaborate technique is better than a slightly less elaborate technique isn’t practically very useful.  At its extreme, Steve MacDonell and I showed [2] that another study using regression to the mean and analogy [3] was actually worse than guessing, something I hadn’t realised at the time because I hadn’t used meaningful benchmarks.
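
To make the ‘guessing’ point concrete, here is an illustrative sketch of such a naive baseline, in the spirit of the random-guessing benchmark discussed in [2]; it is deliberately given the same signature as the predict_by_analogy sketch above so it can be plugged into the same validation loop. Any technique worth adopting should comfortably beat it over repeated runs.

```python
# Illustrative 'guessing' baseline: ignore the new project's features entirely
# and predict its effort by randomly sampling the effort of a past project.
# A sketch in the spirit of the baseline discussed in [2], not its exact definition.
import numpy as np

def random_guess(train_X, train_effort, new_project, rng=None):
    """Return the effort of a uniformly sampled past project as the 'prediction'."""
    rng = rng if rng is not None else np.random.default_rng()
    return float(rng.choice(np.asarray(train_effort, dtype=float)))
```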

Second, we used a cross-validation procedure, specifically leave-one-out cross validation (LOOCV). Although cross-validation is a complex topic, the underlying idea of trying to simulate the prediction of unseen cases (projects, in our study) is important.
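
Purely as an illustration, a LOOCV loop looks something like the sketch below; the prediction technique is passed in as a function, for example the predict_by_analogy or random_guess sketches above.

```python
# Illustrative leave-one-out cross validation (LOOCV): each project in turn is
# held out and predicted from the remaining ones, simulating the estimation of
# a genuinely unseen project.
import numpy as np

def loocv_absolute_errors(X, effort, predict, **kwargs):
    """Return the absolute error for each held-out project.

    predict(train_X, train_effort, new_project, **kwargs) -> estimated effort
    """
    X = np.asarray(X, dtype=float)
    effort = np.asarray(effort, dtype=float)
    errors = []
    for i in range(len(effort)):
        train_idx = np.delete(np.arange(len(effort)), i)  # every project except i
        pred = predict(X[train_idx], effort[train_idx], X[i], **kwargs)
        errors.append(abs(effort[i] - pred))
    return np.array(errors)
```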

Third, in terms of realism, we also noted that data sets grow one project at a time, so in that sense LOOCV is an unrealistic validation procedure. Unfortunately, our data did not include start and end dates, so we were unable to explore this question properly except through simple simulation.
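
An illustrative sketch of the more realistic, time-ordered alternative is below, assuming the projects are sorted by completion date (which is exactly the information our data sets lacked): each project is predicted using only the projects finished before it.

```python
# Illustrative 'growing portfolio' validation: with projects ordered by
# completion date, each project is predicted from its predecessors only.
# A sketch under the assumption that completion dates are available.
import numpy as np

def growing_window_errors(X, effort, predict, min_history=5, **kwargs):
    """Absolute errors when each project is predicted from earlier projects only."""
    X = np.asarray(X, dtype=float)
    effort = np.asarray(effort, dtype=float)
    errors = []
    for i in range(min_history, len(effort)):
        hist = np.arange(i)  # indices of projects completed before project i
        pred = predict(X[hist], effort[hist], X[i], **kwargs)
        errors.append(abs(effort[i] - pred))
    return np.array(errors)
```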

What would I do differently if we were to rewrite the paper today?

There are three areas that I would definitely try to improve if I were to re-do this study. The first, and it's quite embarrassing, is the fact that the results cannot be exactly reproduced. This is mainly because the analogy software was written in Visual Basic and ran on Windows NT; it also used some paid-for VBX components. We no longer have access to this environment and so cannot run exactly the same software. Likewise, the exact settings for the stepwise regression modelling are now lost, so I can only generate close, but not identical, results. A clear lesson is to properly archive scripts, raw data and intermediate results. However, this would still not address the problem of no longer being able to execute this early version of our Analogy tool (ANGEL).

Second, the evaluation was biased in that we optimised settings for the analogy-based predictions by exploring different values of k (the number of neighbours), while the regression modelling was used straight out of the box.
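
A fairer protocol would choose k from the training projects only, for example via an inner LOOCV. The sketch below (reusing the illustrative loocv_absolute_errors function from earlier) shows the general idea; it is certainly not the procedure we used in 1997.

```python
# Illustrative tuning of k on the training projects only (an inner LOOCV), so
# the analogy approach is not given an advantage that the benchmark lacks.
import numpy as np

def select_k(train_X, train_effort, predict, candidate_ks=(1, 2, 3, 4, 5)):
    """Pick the k with the lowest mean absolute error over the training projects."""
    best_k, best_err = candidate_ks[0], float("inf")
    for k in candidate_ks:
        errs = loocv_absolute_errors(train_X, train_effort, predict, k=k)
        if errs.mean() < best_err:
            best_k, best_err = k, errs.mean()
    return best_k
```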

Third and finally, we reported predictive performance in terms of problematic measures such as MMRE and pred(25). We did not consider effect sizes or the variability of the results.  Subsequent developments in this area have greatly improved research practice.
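
For reference, the two measures as they are conventionally defined are sketched below (illustrative code, not our original scripts). MMRE is the mean magnitude of relative error and pred(25) is the proportion of predictions within 25% of the actual effort; MMRE in particular is widely argued to favour techniques that under-estimate, which is one reason later work prefers absolute residuals, effect sizes and an analysis of variability.

```python
# Conventional definitions of MMRE and pred(25); relative error is measured
# with respect to the actual effort.
import numpy as np

def mmre(actual, predicted):
    """Mean magnitude of relative error: mean of |actual - predicted| / actual."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted) / actual))

def pred_at(actual, predicted, level=0.25):
    """pred(l): proportion of predictions whose relative error is at most l."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted) / actual <= level))
```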

References

[1] M. Shepperd, and C. Schofield, “Estimating software project effort using analogies”, IEEE Transactions on Software Engineering, vol. 23, no. 11, pp. 736-743, 1997.
[2] M. Shepperd, and S. MacDonell, “Evaluating prediction systems in software project estimation”, Information and Software Technology, vol. 54, no. 8, pp. 820-827, 2012.
[3] M. Shepperd, and M. Cartwright, “A replication of the use of regression towards the mean (R2M) as an adjustment to effort estimation models”, 11th IEEE International Software Metrics Symposium (METRICS 2005), Como, Italy, 2005.