This lack of confidence results in a marked and measurable increase in the effort the development team spends processing check-ins. In any given week, this might result in a 7% to 45% decrease in job throughput. Obviously, not being able to run unattended outside business hours can greatly impact the achievable throughput. All is not lost though, since a few manual jobs the next day always seem to "play catch-up".
At what cost does that "catch-up" come? Well, we get similar (though slightly lower) commit counts since we can do large merge jobs. The system processes the commits together though, so we get fewer data points for other tests that trigger from the committed build. Performance tests, which run on the merge instead of on the individual commit, are a perfect example. We now have a haystack and must look for needles when performance tests fail in these merges.
We also get social changes. Instead of smaller incremental commits, devs will "stack up" multiple dependent commits into a single commit. This in turn breaks principles like "Check-in Early, Check-in Often" that tend to lead to easier diagnosis of code defects. Developers also begin to accept sporadic failure and retry a failure many times during a merge in the hopes it will "just pass" and let the merge through. This can cover up various timing conditions and concurrency issues. Those concurrency and timing issues can be in the tests themselves or in the code being tested, and both contribute to further flakiness and issues later.
Failure Rates on Throughput
We had a bunch of historical data on the system, but it was draped in controversy. Which jobs failed due to bad changes? Which failed due to a sporadic test failure? How confident were we in classifying one way or the other? How do we tell infrastructure and VM setup problems apart from sporadic test failures? These should all be easy questions to answer if the data is clean, but in a system that processes hundreds of thousands of tests per day across hundreds of commits, the data is anything but clean. Instead we attacked the problem from the other direction and simply looked at the current corpus of tests and reported failure rates.
Our per check-in test infrastructure contains ~1200 test suites. Our ability to turn off tests is all or nothing at the suite level, but a developer can choose to turn off a sub-test within the suite by manually committing a test update that comments out or otherwise removes the test. Initial data showed that ~150 tests were on our watch list. 80 were already disabled, some were run to collect data but ignored on failure, and more were marked as potentially unstable, meaning we had previously seen them fail at least once.
In addition to the ~150 already marked, we estimated another 50 tests which failed more often than 1 in 100 runs. Using these numbers, we tried to calculate the theoretical throughput of the queue. The following table shows two immediate observations. First, 200 flaky tests would really suck. Second, even 100 flaky tests, which is less than our potential 70 + 50 additional for 120 total, would mean we only pass jobs 1 in 3 runs. Our live data was actually around the ~12% range, so it meant that something else was still off. Either our infrastructure modelling was wrong, or we had more flaky tests than we thought. Or both!
|Test Pass Rate||Test Variations|
|Unattended Pass Rate||~12%||~33%||~74%||~86%|
This at least gave us a good measure of how many flaky tests we could have and still achieve a decent throughput. It also gave us some options for improving the test pass rate, such as automatic retry counts for some tests, which could allow us to maintain some form of coverage while dramatically increasing the pass rate for those tests. We'll talk about why this can be bad later.
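The arithmetic behind a table like the one above can be sketched in a few lines. This is a minimal model, assuming every flaky test passes 99% of the time (the "1 in 100" threshold mentioned earlier) and that tests fail independently of each other. The function names are illustrative, and since the exact inputs to the table are not recorded here, the printed values will only roughly line up with it.

```python
import math


def job_pass_rate(test_pass_rate: float, flaky_tests: int) -> float:
    """Chance that a job passes when every flaky test must pass once.

    Assumes each flaky test passes independently with the same probability.
    """
    return test_pass_rate ** flaky_tests


def max_flaky_tests(test_pass_rate: float, target_job_pass_rate: float) -> int:
    """Largest number of flaky tests that still meets the target job pass rate."""
    # Solve test_pass_rate ** n >= target_job_pass_rate for n.
    return math.floor(math.log(target_job_pass_rate) / math.log(test_pass_rate))


if __name__ == "__main__":
    # 99% per-test pass rate is an assumption, not a measured value.
    for n in (200, 100, 20):
        print(f"{n:>3} flaky tests -> {job_pass_rate(0.99, n):.0%} of jobs pass")
    print("flaky tests allowed for a 90% pass rate:", max_flaky_tests(0.99, 0.90))
```

The second function turns the question around: instead of asking how bad a given number of flaky tests is, it asks how many you can afford for a throughput you would be happy with.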
Retrying Tests
At this point, disabling a bunch of tests seems pretty viable, but we'd still be failing 1 in 4 jobs if we got down to say 20. Do we have any reason to leave those tests running? Potentially. Can we drive them to 0? Potentially. But what about those retry counts? If the probability a test will fail is 1%, then the probability it would fail 2, or 3, or 4 times in a row gets increasingly smaller, right? Well, statistically yes, assuming that the trials are independent.
The types of failures that a retry would help with include timing or concurrency issues; pre- or post-configuration issues (something in the environment that might change between test runs); and machine load issues, if the re-run is done at a time when the machine has more resources available.
The types of failures that a retry would not help with include persistent machine issues. In this case, clearing the test to run on a different machine might be more applicable. If the entire VM infrastructure is constantly under load, it may drive the normal 99% pass rate down so far that even the retry is statistically likely to fail.
So are retries good or bad? It all depends. They are bad if they are used as a crutch to allow flaky tests to persist in the system indefinitely. They are also bad if they allow chronic timing and concurrency issues to sneak through over and over again, as they fly below the retry radar. If your team consistently relies on this approach, then eventually even a retry won't save the system. At 200 flaky tests you still have a 12% chance of failing. A team relying on this crutch can allow huge portions of their testing infrastructure to degrade to the point that even retries don't help anymore.
In our situation, where we have between 5 and 20 remaining tests, it might be worth enabling retries to drive the numbers up and increase developer confidence in the runs. For instance, if you know that the tests only fail due to specifically understood timing conditions, then you might retry only when that condition occurs. Adding a retry of 2 for the above table could increase throughput to 90%, all assuming that each retry will fail or pass independently of the first test run.
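To see why the kind of failure matters, here is a toy model, not something measured on our system. It assumes a hypothetical fraction of machines is persistently broken (any test landing there always fails), while a healthy machine fails a flaky test 1% of the time. All names and numbers are made up for illustration.

```python
def fail_after_retries_same_machine(bad_machine_rate: float,
                                    flaky_fail_rate: float,
                                    attempts: int) -> float:
    """Chance that every attempt fails when all attempts stay on one machine."""
    # A broken machine fails every attempt; a healthy one must be unlucky each time.
    return bad_machine_rate + (1 - bad_machine_rate) * flaky_fail_rate ** attempts


def fail_after_reruns_new_machine(bad_machine_rate: float,
                                  flaky_fail_rate: float,
                                  attempts: int) -> float:
    """Chance that every attempt fails when each attempt is cleared to a new machine."""
    # Assumes each attempt draws its machine independently from a large pool.
    single = bad_machine_rate + (1 - bad_machine_rate) * flaky_fail_rate
    return single ** attempts


if __name__ == "__main__":
    # 5% broken machines and a 1% flaky failure rate are hypothetical inputs.
    for attempts in (1, 2, 3):
        same = fail_after_retries_same_machine(0.05, 0.01, attempts)
        moved = fail_after_reruns_new_machine(0.05, 0.01, attempts)
        print(f"{attempts} attempt(s): same machine {same:.4f}, new machine {moved:.4f}")
```

In this toy setup the same-machine failure rate can never drop below the share of broken machines, no matter how many retries are allowed, while clearing to another machine keeps shrinking it.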
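The effect of retries on the whole queue can be sketched the same way. This assumes a 1% per-run failure rate and fully independent runs, which is exactly the assumption the text warns about, so treat the output as an optimistic upper bound rather than a prediction. Here `retries` counts the extra runs after the first attempt, which is one possible reading of "a retry of 2"; the function names are illustrative.

```python
def failure_rate_with_retries(fail_rate: float, retries: int) -> float:
    """Chance a test fails its first run and every retry (independent runs)."""
    return fail_rate ** (1 + retries)


def job_pass_rate_with_retries(fail_rate: float, flaky_tests: int, retries: int) -> float:
    """Chance a job passes when each flaky test may be retried on failure."""
    per_test_pass = 1 - failure_rate_with_retries(fail_rate, retries)
    return per_test_pass ** flaky_tests


if __name__ == "__main__":
    # 1% failure rate per run is an assumption taken from the "1 in 100" threshold.
    for flaky_tests in (200, 100, 20):
        for retries in (0, 1, 2):
            rate = job_pass_rate_with_retries(0.01, flaky_tests, retries)
            print(f"{flaky_tests:>3} flaky tests, {retries} retries -> {rate:.1%}")
```

Any correlation between the first run and its retries, such as a loaded machine or a broken VM, pulls the real numbers well below what this model prints.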
Infrastructure Degradation
As we started identifying and disabling tests, we quickly found some categories of test failures. It turns out, in many cases, that the tests we thought were flaky were not flaky at all. The infrastructure itself had already failed. Turns out VMs are pretty finicky things and setting up unreleased OSes that change daily on them can be quite challenging. Sometimes the installation mechanisms change and your prerequisites are not present. Maybe an entire tools folder is missing. Maybe something as innocuous as a font failure. Whatever it might be, if the failure does not happen equally on every VM and either retrying or clearing can fix the problem, then you can easily overlook the fact that your infrastructure is flaky.
In addition, a lot of tests run on the flaky infrastructure. In fact, 1000 tests run, so there is a 1% chance that one of your very stable tests will still fail. With the new knowledge of categories, the failure rate now becomes the likelihood of the infrastructure failure multiplied by the probability that a dependent test would then run on that machine. These equations get complicated pretty fast so I won't go into them, but it turns out that what we took to be an infrastructure failure rate of 1 in 10,000 was probably closer to 1 in 2000 or even 1 in 1000. So even if every single test was awesome, our chances of passing start to fall rapidly.
Note that retrying for infrastructure problems is never a good idea. Since an infrastructure problem tends to cause a persistent test failure on that machine, only clearing the test to run on another machine would work. When you have large scale testing systems these types of things are usually possible. In our case, while possible, they are not trivial to configure. They also degrade confidence and have the same problem as using retries for flaky tests. How can you be sure you aren't allowing very rare timing conditions into the code?
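A deliberately simplified version of that calculation shows how quickly things fall off. It assumes the infrastructure failure rate applies independently to each of the 1000 test runs in a job and that the tests themselves never fail; the real equations, with categories and dependent tests, are more involved than this sketch.

```python
def pass_rate_with_perfect_tests(infra_fail_rate: float, tests_per_job: int) -> float:
    """Chance a job passes when tests never fail but the infrastructure can."""
    # Every test run must avoid an infrastructure failure for the job to pass.
    return (1 - infra_fail_rate) ** tests_per_job


if __name__ == "__main__":
    # The three rates are the ones quoted in the text; 1000 tests per job likewise.
    for one_in in (10_000, 2_000, 1_000):
        rate = pass_rate_with_perfect_tests(1 / one_in, 1000)
        print(f"infrastructure failure 1 in {one_in}: {rate:.0%} of jobs pass")
```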
Confidence
How do we fix the problem and have confidence that we can maintain the throughput moving forward? After all, how did we get to a point where confidence was lost to begin with? That aside, the fastest way to confidence and throughput in our case was disabling flaky tests and prioritizing their fixes. With all flaky tests disabled and the various infrastructure issues categorized, the work to fix the infrastructure can also be scheduled and prioritized. While I don't know what our final unattended pass rates will be, I can provide some insights into our short term throughput rates after having implemented these changes.
First and foremost our test coverage reductions were in the area of 0.8% of our test suite. This in turn will represent some set of actual code coverage by block and/or arc. In our case the arcs are important since they cover a lot of important edge cases. Note that this won't add up if you crunch the numbers. Suffice to say that during the same time we disabled some tests, we also fixed some infrastructure and some existing tests. Also, losing 0.8% of our test suite doesn't represent a number large enough that it lowers confidence in the suite itself. After all, these flaky tests were already being rerun and ignored since they cried wolf far too often.
Our statistically unsound pass rates versus test coverage were as high as 100% (not enough samples, but that was the daily number). This didn't account for some infrastructure issues though, which were as high as 70% failure rates for us (bad VMs that had to be diagnosed and then rebuilt). For 3 days though, all accounted for, no mucking with the numbers, we saw up to a 50% throughput. Not bad for a week's work, to go from a fully manual check-in queue to a queue which can pass jobs cleanly at least half the time. This in turn raised our confidence in the suite significantly. We've already seen great response on failures since they represent something real and tangible, not simply a flaky test which needs to be rerun.
My major learning from this process has really been how important it is to maintain the stability of your testing systems and not let flakiness creep in. Even small sets of failures in the system can dramatically reduce your throughput. Small failures quickly allow the system to degrade further and there is a vicious, statistical model that eventually leads to a forced mode of manual operation. Once the system is manual it becomes a herculean effort to pull it back.
I would also say it seems counter-intuitive that to improve the system you have to begin turning off your test coverage. Removing devalued coverage in turn increases the value of the rest of the system. Further, once coverage is stable, remaining problems become infrastructural and fair game to correct as well.
This will be a continued effort for me. I'll be stabilizing, improving and fixing the test coverage until we are happy and confident with the throughput, infrastructure and test coverage. I would say it should be fun, but it's more likely it will simply be enlightening and rewarding ;-)