Mutation testing with PIT and Descartes

Version 7.1 by Vincent Massol on 2017/09/28 14:25

Sep 28 2017

XWiki SAS is part of a European research project named STAMP. As part of this project I've been able to experiment a bit with Descartes, a mutation engine for PIT.

What PIT does is mutate the code under test and check if the existing test suite is able to detect those mutations. In other words, it checks the quality of your test suite.

Descartes plugs into PIT by providing a set of specific mutators. For example, one mutator replaces the output of methods with some fixed value (e.g. a method returning a boolean will always return true). Another removes the content of void methods. PIT then runs the tests against each mutation and generates a report.
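To make these two mutators concrete, here is a hand-written sketch of what such mutations amount to. The Account class, its methods and the tiny "suite" are hypothetical and only serve as an illustration; the mutants are written out as subclasses by hand, whereas PIT applies its mutations to the compiled bytecode.

```java
// Hypothetical class under test; the names are illustrative, not taken from XWiki.
class Account {
    protected int balance;

    Account(int balance) {
        this.balance = balance;
    }

    boolean isOverdrawn() {
        return balance < 0;
    }

    void reset() {
        balance = 0;
    }
}

// Hand-written equivalent of a "boolean method always returns true" mutant.
class AccountAlwaysTrueMutant extends Account {
    AccountAlwaysTrueMutant(int balance) {
        super(balance);
    }

    @Override
    boolean isOverdrawn() {
        return true;
    }
}

// Hand-written equivalent of a "void method body removed" mutant.
class AccountEmptyVoidMutant extends Account {
    AccountEmptyVoidMutant(int balance) {
        super(balance);
    }

    @Override
    void reset() {
        // body removed by the mutation
    }
}

class MutantDemo {
    // A tiny stand-in for a test suite: returns true when every check passes.
    // The account is expected to start with a positive balance.
    static boolean suitePasses(Account account) {
        boolean notOverdrawn = !account.isOverdrawn();
        account.reset();
        return notOverdrawn && account.balance == 0;
    }

    public static void main(String[] args) {
        System.out.println("original passes: " + suitePasses(new Account(10)));
        // A mutant is "killed" when the suite fails on it.
        System.out.println("always-true mutant killed: " + !suitePasses(new AccountAlwaysTrueMutant(10)));
        System.out.println("empty-void mutant killed: " + !suitePasses(new AccountEmptyVoidMutant(10)));
    }
}
```

A mutant that the suite fails on is said to be killed; one that the suite still passes on has survived, and points at a weakness in the tests.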

Here's an example of running Descartes on a module of XWiki:

report.png

You can see both the test coverage score (computed automatically by PIT) and the mutation score.

If we drill down to one class (MacroId.java) we can see for example the following report for the equals() method:

equals.png

What's interesting to note is that the test coverage says that the following code has been tested:

result =
    (getId() == macroId.getId() || (getId() != null && getId().equals(macroId.getId())))
    && (getSyntax() == macroId.getSyntax() || (getSyntax() != null && getSyntax().equals(
        macroId.getSyntax())));

However, mutation testing tells a different story. It says that if you negate the conditions in the equals() method (i.e. test for inequality instead), the test still passes.

If we check the test code:

@Test
public void testEquality()
{
    MacroId id1 = new MacroId("id", Syntax.XWIKI_2_0);
    MacroId id2 = new MacroId("id", Syntax.XWIKI_2_0);
    MacroId id3 = new MacroId("otherid", Syntax.XWIKI_2_0);
    MacroId id4 = new MacroId("id", Syntax.XHTML_1_0);
    MacroId id5 = new MacroId("otherid", Syntax.XHTML_1_0);
    MacroId id6 = new MacroId("id");
    MacroId id7 = new MacroId("id");

    Assert.assertEquals(id2, id1);
    // Equal objects must have equal hashcode
    Assert.assertTrue(id1.hashCode() == id2.hashCode());

    Assert.assertFalse(id3 == id1);
    Assert.assertFalse(id4 == id1);
    Assert.assertFalse(id5 == id3);
    Assert.assertFalse(id6 == id1);

    Assert.assertEquals(id7, id6);
    // Equal objects must have equal hashcode
    Assert.assertTrue(id6.hashCode() == id7.hashCode());
}

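We can indeed see that the test never exercises equals() for inequality: checks such as assertFalse(id3 == id1) compare object references, so they don't call equals() at all. Thus in practice, if we replace the body of the equals method with return true; the test still passes.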
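One possible way to close this gap is to assert inequality through equals() itself, for example with Assert.assertFalse(id3.equals(id1)). The sketch below demonstrates the difference under some loud assumptions: SimpleMacroId is a simplified stand-in for MacroId (the syntax is a plain String rather than XWiki's Syntax class), the "return true" mutant is written by hand as a subclass, and plain boolean checks replace JUnit so that the example runs on its own.

```java
import java.util.Objects;
import java.util.function.BiFunction;

// Simplified stand-in for MacroId (assumption: the syntax is a plain String here).
class SimpleMacroId {
    private final String id;
    private final String syntax;

    SimpleMacroId(String id, String syntax) {
        this.id = id;
        this.syntax = syntax;
    }

    @Override
    public boolean equals(Object object) {
        if (this == object) {
            return true;
        }
        if (!(object instanceof SimpleMacroId)) {
            return false;
        }
        SimpleMacroId other = (SimpleMacroId) object;
        return Objects.equals(id, other.id) && Objects.equals(syntax, other.syntax);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, syntax);
    }
}

// Hand-written equivalent of the mutant where equals() always returns true.
class AlwaysEqualMacroId extends SimpleMacroId {
    AlwaysEqualMacroId(String id, String syntax) {
        super(id, syntax);
    }

    @Override
    public boolean equals(Object object) {
        return true;
    }
}

class EqualityChecks {
    // Mirrors the original test: inequality is only checked with ==.
    static boolean referenceOnlyChecks(BiFunction<String, String, SimpleMacroId> factory) {
        SimpleMacroId id1 = factory.apply("id", "xwiki/2.0");
        SimpleMacroId id2 = factory.apply("id", "xwiki/2.0");
        SimpleMacroId id3 = factory.apply("otherid", "xwiki/2.0");
        return id2.equals(id1) && id1.hashCode() == id2.hashCode() && !(id3 == id1);
    }

    // Stricter variant: inequality goes through equals(), in both directions.
    static boolean equalsBasedChecks(BiFunction<String, String, SimpleMacroId> factory) {
        SimpleMacroId id1 = factory.apply("id", "xwiki/2.0");
        SimpleMacroId id2 = factory.apply("id", "xwiki/2.0");
        SimpleMacroId id3 = factory.apply("otherid", "xwiki/2.0");
        return id2.equals(id1) && id1.hashCode() == id2.hashCode()
            && !id3.equals(id1) && !id1.equals(id3);
    }

    public static void main(String[] args) {
        System.out.println("reference-only checks, original: " + referenceOnlyChecks(SimpleMacroId::new));
        System.out.println("reference-only checks, mutant:   " + referenceOnlyChecks(AlwaysEqualMacroId::new));
        System.out.println("equals-based checks, original:   " + equalsBasedChecks(SimpleMacroId::new));
        System.out.println("equals-based checks, mutant:     " + equalsBasedChecks(AlwaysEqualMacroId::new));
    }
}
```

With this stand-in, the always-true mutant survives the reference-only checks and is killed by the equals()-based ones, which is the behaviour the report was hinting at.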

That's interesting because that's something that test coverage didn't notice!

More generally the report provides a summary of all mutations it has done and whether they were killed or not by the tests. For example on this class:

mutations.png
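Such a summary can be condensed into a single number. A minimal sketch, assuming the mutation score is simply the percentage of generated mutations that were killed; the class name, the method name and the figures in main are made up for illustration, and treating "no mutations" as 100% is my own choice here.

```java
class MutationScore {
    // Percentage of mutations killed by the tests.
    // Assumption: a class with no mutations at all counts as 100%.
    static double percentKilled(int killed, int total) {
        if (killed < 0 || total < 0 || killed > total) {
            throw new IllegalArgumentException("expected 0 <= killed <= total");
        }
        if (total == 0) {
            return 100.0;
        }
        return 100.0 * killed / total;
    }

    public static void main(String[] args) {
        // Illustrative figures only, not taken from the XWiki report.
        System.out.println("mutation score: " + percentKilled(3, 4) + "%");
    }
}
```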

Here's what I learnt while trying to use Descartes on XWiki:

  • It's being actively developed
  • It's interesting to classify the results in 3 categories:
    • strong pseudo-tested methods: no matter the return values of a method, the tests still pass. This is the worst offender since it means the tests really need to be improved. This was the case in the example above.
    • weak pseudo-tested methods: the tests pass with at least one modified value. Not as bad as strong pseudo-tested but you may still want to check it out.
    • fully tested methods: the tests fail for all mutations and thus can be considered rock-solid!
  • So in the future, the generated report should provide this classification to help analyze the results and focus on important problems.
  • It would be nice if the Maven plugin were improved so that it could fail the build when the mutation score is below a certain threshold (as we do for test coverage).
  • Performance: It's quite slow compared to JaCoCo execution time, for example. In my example above it took 34 seconds to execute with all possible mutations (for a project with 14 test classes, 31 tests and 20 classes).
  • Big limitation: at the moment PIT (and/or Descartes) doesn't support being executed on a multi-module project. This means that right now you need to compute the full classpath for all modules and run all sources and tests as if they were a single module. This causes problems for all tests that depend on the filesystem and expect a given directory structure. It's also tedious and error-prone since the classpath order can have side effects.
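The three categories from the list above can be derived mechanically from the per-mutation results of a method. Here's a minimal sketch of that idea; the class, enum and method names are hypothetical and this is not how Descartes itself implements or reports the classification.

```java
import java.util.Arrays;
import java.util.List;

class PseudoTestedClassifier {
    enum Category { STRONG_PSEUDO_TESTED, WEAK_PSEUDO_TESTED, FULLY_TESTED }

    // Each entry tells whether one mutation of the method was killed (true)
    // or survived (false), i.e. whether the tests failed or still passed.
    static Category classify(List<Boolean> killedPerMutation) {
        if (killedPerMutation.isEmpty()) {
            throw new IllegalArgumentException("no mutations recorded for this method");
        }
        boolean anyKilled = killedPerMutation.contains(true);
        boolean anySurvived = killedPerMutation.contains(false);
        if (!anyKilled) {
            // The tests pass whatever the method returns.
            return Category.STRONG_PSEUDO_TESTED;
        }
        if (anySurvived) {
            // The tests pass for at least one modified value, but not for all.
            return Category.WEAK_PSEUDO_TESTED;
        }
        // The tests fail for every mutation.
        return Category.FULLY_TESTED;
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList(false, false)));
        System.out.println(classify(Arrays.asList(true, false)));
        System.out.println(classify(Arrays.asList(true, true)));
    }
}
```

Sorting methods this way makes it easy to look at the strong pseudo-tested ones first, since those are the ones where the tests would not notice any change in behaviour.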

Conclusion:

While PIT/Descartes is very nice, I feel that it's not bringing enough added value yet for the XWiki open source project to use it in an automated manner. Said differently, the test coverage reports we have already provide a lot of information about the code that is not tested at all, and if we have 5 hours to spend, we would probably rather spend them on adding tests than on further improving existing tests. YMMV. If you have a very strong suite of tests and you want to check its quality, then PIT/Descartes is your friend!

I'm also currently testing DSpot. DSpot uses PIT and Descartes but in addition it uses the results to generate new tests automatically. I feel that this is providing higher value (if it can work well enough). I'll post back when I've been able to run it on XWiki and learn more by using it.

Now the Descartes project could also use the information provided by line coverage to automatically generate tests to cover the spotted issues.

I'd like to thank Oscar Luis Vera Pérez who's actively working on Descartes and who's shown me how to use it and how to analyze the results. Thanks Oscar! I'll also continue to work with Oscar on improving Descartes and executing it on the XWiki code base. 

I'll continue blogging about my findings on DSpot and Descartes!

Created by Vincent Massol on 2017/09/28 13:35