The mutation score can be used to compare test suites with respect to mutant detection. However, it is unknown whether the mutation score, being a summary of the detection ratios of different mutation types, is a fair metric for such a comparison. In this paper, we present an empirical study on 10 open-source projects that compares developer-written and automatically generated test suites in terms of mutation score and in terms of the detection ratios of 7 mutation types. Our results indicate that the mutation score is a fair metric, but they also suggest equivalence among mutants generated by PIT with different mutation operators.
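For reference, the mutation score of a test suite T is commonly defined in the mutation testing literature (this formulation is the standard one, not a definition specific to this study) as the fraction of non-equivalent mutants that T kills:

$$\mathit{MS}(T) = \frac{|\text{mutants killed by } T|}{|\text{mutants generated}| - |\text{equivalent mutants}|}$$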
This page provides the experimental material and the statistical analysis used in this experiment.
EvoSuite’s Maven Plugin (version 1.0.6)
We used the argument -Duse_separate_classloader=false to avoid problems when measuring code coverage: running the generated tests in a separate classloader could otherwise conflict with PIT's bytecode instrumentation.
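As an illustration, the property can be passed on the command line when invoking the plugin. This is a minimal sketch, assuming the evosuite-maven-plugin is already configured in the project's pom.xml; the generate and export goals come from the plugin's documentation, but whether extra EvoSuite properties are forwarded this way may depend on the plugin version:

```
# Generate EvoSuite tests with the separate classloader disabled (sketch)
mvn evosuite:generate -Duse_separate_classloader=false

# Copy the generated tests into src/test/java so they run as regular tests
mvn evosuite:export
```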
Randoop (version 4.1.1)
Changed arguments: --flaky-test-behavior=DISCARD (to remove flaky tests) and --randomseed=x, to generate different test suites, since Randoop is deterministic by default. For each execution, x was drawn from a pseudorandom integer generator. All the values that we used as random seeds can be seen here.
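For reference, a Randoop run with these arguments looks roughly like the following. This is a sketch: the project jar and the class list file are placeholders, and 42 stands in for one of the generated seeds:

```
# Sketch of a Randoop 4.1.1 invocation (project jar and class list are placeholders)
java -classpath myproject.jar:randoop-all-4.1.1.jar randoop.main.Main gentests \
  --classlist=classes.txt \
  --flaky-test-behavior=DISCARD \
  --randomseed=42
```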
We used 10 projects from the Apache Commons repository. These projects already have test suites manually written by developers.
We manually removed all the tests that did not pass.