Jun 15 2020

Docker and iptables

On the XWiki project, we use TestContainers to run our functional Selenium tests inside Docker containers, in order to test various configurations.

We've been struggling with test flickers caused by infra issues, and it took us a long time to find the cause. For example, from time to time TestContainers failed to start the Ryuk Docker container and we couldn't understand why. We also had plenty of other issues that were hard to track down.

After researching it we've found that the main issue was caused by the way we set up our iptables on our CI agents.

These agent machines have network interfaces exposed to the internet (on eth0 for example) and, to be safe, our Infra Admin had blocked all incoming and outgoing network traffic by default. The iptables config was like this:

-A INPUT -i lo -j ACCEPT
-A INPUT -i br0 -j ACCEPT
-A INPUT -i tap0 -j ACCEPT
-A INPUT -i eth0 ... specific rules
-A OUTPUT -o eth0 ... specific rules
-A OUTPUT -o br0 -j ACCEPT
-A OUTPUT -o tap0 -j ACCEPT

Note that since we block all interfaces by default, we need to add explicit rules to allow some (lo, br0, tap0 in our case).

The problem is that Docker doesn't add rules for the INPUT/OUTPUT chains! It only adds iptables NAT rules (see Docker and iptables for more details). And it creates new network interfaces such as docker0 and br*. Since by default we forbade INPUT/OUTPUT on all interfaces, connections between the containers and the host were refused in lots of cases.

So we changed it to:

-A INPUT -i eth0 ... specific rules
-A OUTPUT -o eth0 ... specific rules
-A INPUT -i eth0 -j DROP
-A OUTPUT -o eth0 -j DROP

This config allows all internal interfaces to accept incoming/outgoing connections (INPUT/OUTPUT) while making sure that eth0 exposes only the minimum to the internet.

This solved our docker network issues.
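
To make the difference between the two setups concrete, here is a toy Python model of how a chain is evaluated: rules are checked in order, the first match decides, and the default policy applies when nothing matches. It is only a sketch. The rule lists are simplified (the eth0 "specific rules" are left out), the new setup is assumed to have an ACCEPT default policy, and the `br-1a2b3c` interface name is a made-up example of a Docker-created bridge.

```python
# Toy model of iptables chain evaluation (not real iptables syntax).
# First matching rule wins; otherwise the chain's default policy applies.

def verdict(rules, policy, interface):
    """Return the action of the first rule matching `interface`, else the policy."""
    for rule_interface, action in rules:
        if rule_interface == interface:
            return action
    return policy

# Old setup: everything is dropped unless the interface is explicitly allowed.
old_rules = [("lo", "ACCEPT"), ("br0", "ACCEPT"), ("tap0", "ACCEPT")]
old_policy = "DROP"

# New setup (assumed ACCEPT default): only eth0 gets an explicit catch-all DROP.
new_rules = [("eth0", "DROP")]
new_policy = "ACCEPT"

# docker0 and br-1a2b3c stand for interfaces that Docker creates on its own.
for interface in ["lo", "docker0", "br-1a2b3c", "eth0"]:
    old = verdict(old_rules, old_policy, interface)
    new = verdict(new_rules, new_policy, interface)
    print(f"{interface}: old={old} new={new}")
```

In this model the Docker interfaces fall through to DROP in the old setup because nobody listed them, while in the new setup they fall through to ACCEPT and only eth0 is restricted.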

Feb 18 2020

STAMP, the end

The STAMP research project has been a boon for the XWiki open source project. It has allowed the project to gain a lot in terms of quality. There have been lots of areas of improvements but the main domains that have substantially benefited the quality are:

  • Increase of the test coverage. At the start of STAMP the XWiki project already had a substantial automated test suite covering 65.29% of the whole code base. Not only was the project able to increase it to over 71.4% (a major achievement for a large code base of 500K NCLOC such as XWiki), but it also improved the quality of the tests themselves by increasing their mutation score using PIT/Descartes.
    • XWiki’s build and CI now fail when either coverage or mutation score is reduced.
  • Addition of configuration testing. At the start of STAMP the XWiki project was automatically testing only a single configuration (latest HSQLDB + latest Jetty + latest Firefox). From time to time some testers ran manual tests on different configurations, but this was a very intensive process, and a random and ad hoc one. A substantial number of configuration-related bugs were being raised in the XWiki issue tracker. Thanks to STAMP the XWiki project has been able to cover all the configurations it supports and to execute its functional UI tests on all of them (every day, every week and every month based on different criteria). This is leading to a huge improvement in quality and in developer productivity, since developers don't need to manually set up multiple environments on their machines to test their new code (which was very time-consuming, and difficult when onboarding new developers).

XWiki has spread its own engineering practices to the other STAMP project members and has benefitted from the interactions with the students, researchers and project members to accrue and firm up its own practices.

Last but not least, the STAMP research project has given the XWiki project time to work on testing in general, on its CI/CD pipeline, and on new distribution packagings (such as the new Docker-based distribution, which has increased XWiki's Active Installs). These were prerequisites for STAMP-developed tools, but they have tremendous benefits by themselves even outside of pure testing. Overall, this has raised the bar for the project, placing it even higher in the category of projects with strong engineering practices and controlled quality. The net result is a better and more stable product with increased development productivity. STAMP has allowed a European software editor (XWiki SAS) and open source software (XWiki) to match and possibly even surpass non-European (American, etc.) software editors in terms of engineering practices.

So while it's the end of STAMP, its results continue to live on inside the XWiki project.

I'm personally very happy to have been part of STAMP and I have learnt a lot about the topics at hand but also about European research projects (it was my first project).

Jun 09 2019

Scheduled Jenkinsfile


On the XWiki project we use Jenkinsfiles in our GitHub repositories, along with "GitHub Organization" type of jobs, so that Jenkins automatically creates and destroys jobs based on the git branches in these repositories. This is very convenient and we have a pretty elaborate Jenkinsfile (using shared global libraries we developed) in which we execute about 14 different Maven builds, some in series and others in parallel, to validate different things, including the execution of functional tests.

We recently introduced functional tests that can be executed with different configurations (different databases, different servlet containers, different browsers, etc). Now that represents a lot of combinations and we can't run all of these every time there's a commit in GitHub. So we need to run some of them only once per day, others once per week and the rest once per month.

The problem is that Jenkins doesn't seem to support this feature out of the box when using a Jenkinsfile. In an ideal world, Jenkins would support several Jenkinsfiles to achieve this. Right now the obvious solution is to manually create new jobs to run these configuration tests. However, doing this removes the benefits of the Jenkinsfile, the main one being the automatic creation and destruction of jobs for branches. We started with this and after a few months it became too painful to maintain. So we had to find a better solution...

The Solution

Let me start by saying that I find this solution suboptimal as it's complex and fraught with several problems.

Generally speaking, the solution we implemented is based on the Parameterized Scheduler Plugin, but the devil is in the details.

  • Step 1: Make your job a parameterized job by defining a type variable that will hold what type of job you want to execute. In our case standard or docker-latest (to be executed daily), docker-all (to be executed weekly) and docker-unsupported (to be executed monthly). All the docker-* job types will execute our functional tests on various configurations. Also configure the parameterized scheduler plugin accordingly:
    private def getCustomJobProperties()
    {
        return [
            parameters([string(defaultValue: 'standard', description: 'Job type', name: 'type')]),
            // parameterizedCron is a trigger, so it is wrapped in pipelineTriggers.
            pipelineTriggers([
                parameterizedCron('''@midnight %type=docker-latest
    @weekly %type=docker-all
    @monthly %type=docker-unsupported''')
            ])
        ]
    }

    You set this in the job with:

        properties(getCustomJobProperties())


    Important note: The job will need to be triggered once before the scheduler and the new parameter are effective!

  • Step 2: Based on the type parameter value, decide what to execute. For example:
    if (params.type && params.type == 'docker-latest') { ... }
  • Step 3: You may want to manually trigger your job using the Jenkins UI and decide what type of build to execute (this is useful to debug some test problems for example). You can do it this way:
    def choices = 'Standard\nDocker Latest\nDocker All\nDocker Unsupported'
    def selection = askUser(choices)
    if (selection == 'Standard') {
    } else if (selection == 'Docker Latest') {
    } else ...

    In our case askUser is a custom pipeline library defined like this:

    def call(choices)
    {
        def selection

        // If a user is manually triggering this job, then ask what to build
        if (currentBuild.rawBuild.getCauses()[0].toString().contains('UserIdCause')) {
            echo "Build triggered by user, asking question..."
            try {
                timeout(time: 60, unit: 'SECONDS') {
                    selection = input(id: 'selection', message: 'Select what to build', parameters: [
                        choice(choices: choices, description: 'Choose which build to execute', name: 'build')
                    ])
                }
            } catch(err) {
                def user = err.getCauses()[0].getUser()
                if ('SYSTEM' == user.toString()) { // SYSTEM means timeout.
                    selection = 'Standard'
                } else {
                    // Aborted by user
                    throw err
                }
            }
        } else {
            echo "Build triggered automatically, building 'Standard'..."
            selection = 'Standard'
        }

        return selection
    }


While this may sound like a nice solution, it has a drawback: Jenkins's build history gets messed up, because you're reusing the same job name but running different builds. For example, the test failure age gets reset every time a different type of build is run. Note that at least the individual test history is kept.

Since different types of builds are executed in the same job, we also wanted the job history to visibly show when scheduled jobs are executed vs the standard jobs. Thus we added the following in our pipeline:

import com.cloudbees.groovy.cps.NonCPS
import com.jenkinsci.plugins.badge.action.BadgeAction

def badgeText = 'Docker Build'
def badgeFound = isBadgeFound(currentBuild.getRawBuild().getActions(BadgeAction.class), badgeText)
// Only add the summary once per build
if (!badgeFound) {
    manager.createSummary('green.gif').appendText("<h1>${badgeText}</h1>", false, false, false, 'green')
}

// Marked @NonCPS, as the otherwise unused import above suggests, so that each() runs as plain Groovy.
@NonCPS
private def isBadgeFound(def badgeActionItems, def badgeText)
{
    def badgeFound = false
    badgeActionItems.each() {
        if (it.getText().contains(badgeText)) {
            badgeFound = true
        }
    }
    return badgeFound
}

Visually this gives the following where you can see information icons for the configuration tests (and you can hover over the information icon with the mouse to see the text):

What's your solution to this problem? I'd be very eager to know if someone has found a better solution to implement this in Jenkins.

Feb 09 2019

Global vs Local Coverage


On the XWiki project, we've been pursuing a strategy of failing our Maven build automatically whenever the test coverage of each Maven module is below a threshold indicated in the pom.xml of that module. We're using Jacoco to measure this local coverage.

We've been doing this for over 6 years now and we've been generally happy about it. This has allowed us to raise the global test coverage of XWiki by a few percent every year.

More recently, I joined the STAMP European Research Project and one of our KPIs is the global coverage, so I got curious and wanted to look at precisely how much we're gaining every year.

I realized that, even though we've been generally increasing our global coverage (computed using Clover), there are times when we actually reduce it or increase very little, even though at the local level all modules increase their local coverage...
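
One way this can happen is sketched below in Python, with made-up line numbers for a hypothetical module. The local coverage only counts lines hit by the module's own unit tests, while the global coverage also counts lines hit by functional tests located elsewhere. Removing code that only the functional tests were exercising makes the local figure go up and the global figure go down.

```python
# Hypothetical module: all line numbers below are invented for illustration.

def coverage(covered, existing):
    """Ratio of existing lines that are covered."""
    return len(covered & existing) / len(existing)

all_lines = set(range(1, 101))           # the module has 100 lines
unit_covered = set(range(1, 61))         # lines 1-60: hit by the module's unit tests
functional_only = set(range(61, 91))     # lines 61-90: hit only by functional tests elsewhere

def report(existing):
    """Return (local coverage, global coverage) for the lines that exist."""
    local_cov = coverage(unit_covered, existing)
    global_cov = coverage(unit_covered | functional_only, existing)
    return local_cov, global_cov

before = report(all_lines)
# Remove lines 61-80, which only the functional tests were exercising.
after = report(all_lines - set(range(61, 81)))

print(before)  # (0.6, 0.9)
print(after)   # (0.75, 0.875)
```

A per-module threshold check would pass here (60% becomes 75%), yet the module's share of covered code seen globally drops from 90% to 87.5%.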


So I implemented a Jenkins pipeline script that uses OpenClover, runs every day, gets the raw Clover data and generates a report. This report shows how the global coverage evolves, Maven module by Maven module, along with the contribution of each module to the global coverage.
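
As a rough idea of what such a report computes, here is a small Python sketch. It assumes that a module's contribution is its number of covered elements divided by the element count of the whole code base, so that the contributions add up to the global coverage. That definition, the module names and the figures are assumptions made for illustration, not the actual pipeline code.

```python
# Hypothetical figures: (covered elements, total elements) per Maven module.
before = {"module-a": (800, 1000), "module-b": (150, 500)}
after = {"module-a": (790, 1000), "module-b": (165, 500)}

def contributions(modules):
    """Share of the global coverage brought by each module.

    Covered elements of the module divided by the element count of the whole
    code base, so that the contributions sum to the global coverage.
    """
    grand_total = sum(total for _, total in modules.values())
    return {name: covered / grand_total for name, (covered, _) in modules.items()}

def contribution_deltas(old, new):
    """Per-module change in contribution between two reports."""
    old_c, new_c = contributions(old), contributions(new)
    names = sorted(set(old_c) | set(new_c))
    return {name: new_c.get(name, 0.0) - old_c.get(name, 0.0) for name in names}

deltas = contribution_deltas(before, after)
for name, delta in deltas.items():
    flag = "LOWERED" if delta < 0 else "ok"
    print(f"{name}: {delta:+.4%} {flag}")
```

Modules flagged as LOWERED would correspond to the lines shown in red in the report.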

Here's a recent example report comparing global coverage from 2019-01-01 to 2019-01-08, i.e. just 8 days.

The lines in red are modules that have had changes lowering the global coverage (even though the local coverage for these modules didn't change or even increased!).


Analyzing a difference

So once we find that a module has lowered the global coverage, how do we analyze where it's coming from?

It's not easy though! What I've done is to take the 2 Clover reports for both dates, compare all packages in the module and pinpoint the exact code where the coverage was lowered. Then it's about knowing the code base and the existing tests to find out why those places are no longer executed by the tests. Note that Clover helps a lot, since its reports can tell you which tests contribute to coverage for each covered line!

If you're interested, check for example a real analysis of the xwiki-commons-job coverage drop.


Here are some reasons I analyzed that can cause a module to lower the global coverage even though its local coverage is stable or increases:

  1. Some functional tests exercise (directly or indirectly) code lines in this module that are not covered by its unit tests.
    1. Then some of this code is removed because a) it's no longer necessary, or b) it's deprecated, or c) it's moved to another module. Since there are no unit tests that cover it in the module, the local coverage doesn't change, but the global one for the module does and it's lowered. Note that the full global coverage may not change if the code is moved to another module which is itself covered by unit or functional tests.
    2. It could also happen that the code line was hit because of a bug somewhere. Not a bug that throws an Exception (since that would have failed the test) but a bug that results in some IF path being entered and, for example, generating a warning log. Then the bug is fixed, and thus the functional tests don't enter this IF anymore and the coverage is lowered... :) (FTR this is what happened for the xwiki-commons-job coverage drop in the report shown above)
  2. Some new module is added and its local coverage is below the average coverage of the other modules.
  3. Some module is removed and it had a higher than average global coverage.
  4. Some tests have failed, and especially some functional tests are flickering. This will reduce the coverage of all module code lines that are only tested through tests located in other modules. It's thus important to check the test successes before "trusting" the coverage.
  5. The local coverage is computed by Jacoco and we use the instructions ratio, whereas the global coverage is computed using Clover, which uses the TPC formula. There are cases where the covered instructions would stay stable but the TPC value would decrease. For example, if a method is split into 2 methods, the covered bytecode instructions remain the same but the TPC will decrease, since the number of covered methods will stay fixed but the total number of methods will increase by 1...
  6. Rounding errors. We should ignore differences that are too small, because it's possible that the local coverage would seem to remain the same (for example we round it to 2 digits) while the global coverage decreases (we round it to 4 digits in the shown report; we do that because the contribution of each module to the global coverage is low).
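
Reason 5 can be illustrated with a small worked example. Clover documents its Total Percentage Coverage as TPC = (BT + BF + SC + MC) / (2*B + S + M), where BT and BF are the branches that evaluated to true and to false at least once, SC the covered statements, MC the entered methods, and B, S, M the total branches, statements and methods. The Python sketch below uses made-up counts and takes the method-split scenario at face value: one more method in the total, the same number of covered elements.

```python
def tpc(bt, bf, b, sc, s, mc, m):
    """Clover's Total Percentage Coverage, as a ratio between 0 and 1.

    bt/bf: branches evaluated to true/false at least once, b: total branches,
    sc/s: covered/total statements, mc/m: entered/total methods.
    """
    return (bt + bf + sc + mc) / (2 * b + s + m)

# Made-up counts for a small class.
before = tpc(bt=6, bf=6, b=10, sc=40, s=50, mc=4, m=5)
# Same covered elements, but one more method in the total after the split.
after = tpc(bt=6, bf=6, b=10, sc=40, s=50, mc=4, m=6)

print(f"before: {before:.4f}, after: {after:.4f}")  # before: 0.7467, after: 0.7368
```

The numerator is unchanged while the denominator grows by one, so the TPC drops by about one percentage point even though an instruction-based ratio would not move.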


So what strategy can we apply to ensure that the global coverage doesn't go down?

Here's the strategy that we're currently discussing/trying to set up on the XWiki project:

  • We run the Clover Jenkins pipeline every night (between 11PM-8AM)
  • The pipeline sends an email whenever the new report has its global TPC going down when compared with the baseline
  • The baseline report is the report generated just after each XWiki release. This means that we keep the same baseline during a whole release
  • We add a step in the Release Plan Template to have the report passing before we can release.
  • The Release Manager is in charge of a release from day 1 to the release day, and is also in charge of making sure that the global coverage job failures get addressed before the release day so that we’re ready on the release day, i.e. that the global coverage hasn't gone down.
  • Implementation detail: don’t send a failure email when there are failing tests in the build, to avoid false positives.
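
The notification rule in the list above can be sketched as a small decision function. This is a hedged Python illustration, not the actual pipeline code: the function name, its parameters and the tolerance value are all hypothetical, with the tolerance standing in for the rounding noise mentioned earlier.

```python
def should_send_alert(baseline_tpc, current_tpc, failing_tests, tolerance=0.0001):
    """Decide whether to email about a global coverage drop.

    baseline_tpc: global TPC of the report generated after the last release.
    current_tpc: global TPC of tonight's report.
    failing_tests: number of failing tests in tonight's build.
    tolerance: hypothetical threshold below which a drop is treated as noise.
    """
    # Failing tests make the coverage untrustworthy, so skip the email
    # to avoid false positives.
    if failing_tests > 0:
        return False
    return (baseline_tpc - current_tpc) > tolerance

print(should_send_alert(71.40, 71.35, failing_tests=0))  # True
print(should_send_alert(71.40, 71.35, failing_tests=3))  # False
print(should_send_alert(71.40, 71.45, failing_tests=0))  # False
```

Keeping the baseline fixed for a whole release means a slow drift over several nights still triggers the alert, which a comparison with the previous night's report could miss.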

For reference, the various discussions on the XWiki list:


This experiment has shown that, in the case of XWiki, the global coverage is increasing consistently over the years, even though, technically, it could go down. It also shows that with a bit more care, and by ensuring that we always grow the global coverage between releases, we could make that global coverage increase a bit faster.

Additional Learnings

  • Clover does a bad job at comparing reports.
  • Don't trust Clover reports at package level either: they don't include all files.
  • Test failures reported by the Clover report are not accurate at all. For example on this report Clover shows 276 failures and 0 errors. I checked the build logs and the reality is 109 failures and 37 errors. Some tests are reported as failing when they're passing.
    • Here's an interesting example where Clover says that MacroBlockSignatureGeneratorTest#testIncompatibleBlockSignature() is failing but in the logs we have:
      [INFO] Running MacroBlockSignatureGeneratorTest
      [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.023 s - in MacroBlockSignatureGeneratorTest

      What's interesting is that Clover reports:

      And the test contains:

      public void testIncompatibleBlockSignature() throws Exception
      {
          // "thrown" is a JUnit ExpectedException rule: the test passes only if generate() throws.
          thrown.expectMessage("Unsupported block [org.xwiki.rendering.block.WordBlock].");

          assertThat(signer.generate(new WordBlock("macro"), CMS_PARAMS), equalTo(BLOCK_SIGNATURE));
      }

      This is a known Clover issue with tests that assert exceptions (see https://community.atlassian.com/t5/Questions/JUnit-Rule-ExpectedException-marked-as-failure/qaq-p/76884)...


