The mutation score can be used to compare test suites with respect to mutant detection. However, it is unknown whether the mutation score, being a summary of the detection ratios of different mutation types, is a fair metric for such a comparison. In this paper, we present an empirical study on 10 open-source projects that compares developer-written and automatically generated test suites in terms of mutation score and in terms of the detection ratios of 7 mutation types. Our results indicate that the mutation score is a fair metric, but they also suggest equivalence among mutants generated by PIT with different mutation operators.
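For reference, the mutation score of a test suite T is commonly defined in the mutation testing literature (this formulation is the standard one, not a definition specific to this study) as the fraction of non-equivalent mutants that T kills:

$$\mathit{MS}(T) = \frac{|\text{mutants killed by } T|}{|\text{mutants generated}| - |\text{equivalent mutants}|}$$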
This page provides the experimental material and the statistical analysis used in this experiment.
EvoSuite’s Maven Plugin (version 1.0.6)
We used the argument -Duse_separate_classloader=false to avoid problems when measuring code coverage: running the generated tests in a separate classloader could otherwise conflict with PIT's bytecode instrumentation.
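As an illustration, the property can be passed on the command line when invoking the plugin. This is a minimal sketch, assuming the evosuite-maven-plugin is already configured in the project's pom.xml; the generate and export goals come from the plugin's documentation, but whether extra EvoSuite properties are forwarded this way may depend on the plugin version:

```
# Generate EvoSuite tests with the separate classloader disabled (sketch)
mvn evosuite:generate -Duse_separate_classloader=false

# Copy the generated tests into src/test/java so they run as regular tests
mvn evosuite:export
```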
Randoop (version 4.1.1)
Changed arguments: --flaky-test-behavior=DISCARD (to remove flaky tests) and --randomseed=x, to generate different test suites, since Randoop is deterministic by default. For each execution, x was drawn from a pseudorandom integer generator. All the values that we used as random seeds can be seen here.
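For reference, a Randoop run with these arguments looks roughly like the following. This is a sketch: the project jar and the class list file are placeholders, and 42 stands in for one of the generated seeds:

```
# Sketch of a Randoop 4.1.1 invocation (project jar and class list are placeholders)
java -classpath myproject.jar:randoop-all-4.1.1.jar randoop.main.Main gentests \
  --classlist=classes.txt \
  --flaky-test-behavior=DISCARD \
  --randomseed=42
```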
We used 10 projects from the Apache Commons repository. These projects already have test suites manually written by developers.
We manually removed all the tests that did not pass.