We've all encountered unexplainable CI failures. They are a total pain. But, let's be honest, most of us just cross our fingers and click the button to run the build again, right? What if I could show you what to look for to debug flaky tests and fix them once and for all? Would your behavior change? Would you actually stop and fix it?

The cynic in me says you'd probably just click retry in the moment. I mean, that flaky test had nothing to do with your code. It's my hope, however, that given the information I'm about to share, you would then use that build time to try to debug the flakiness.

Not all failures are caused by flaky tests

First, though, I feel it necessary to point out the obvious: not all build failures are caused by flaky tests. I've seen the build server run out of disk space, run out of memory, and lose network connectivity. I've seen problems building a container image, publishing it to the registry, and pulling it back down from the registry (on a different server so we could parallelize horizontally). And don't get me started on bugs in Jenkins pipelines.

So, what if it is a flaky test?

Flaky tests are tests that fail intermittently. The primary cause of these failures is a dependency on something in the environment that changes from one run to the next. Here are the three ways the environment might change between test runs:

  • Non-determinism
  • Leakiness
  • Race conditions

Non-determinism

Tests sometimes rely on a non-deterministic part of the environment, like the system clock, a random number, or access to a network resource. The system clock and random numbers are supposed to change. Network connections are not meant to change, but they do.

The good thing about this kind of flakiness is that you can reproduce it locally and in isolation. So, all you need to do is find what is changing in the environment and mock it. For example (see the sketch after this list):

  • If you are using the system clock, freeze time during your test to
    ensure a deterministic result.
  • If you are using a random number generator, mock it and return just
    the value you need in the current test case.
  • If you are using the network, don't. Mock the network call and return
    a payload matched to the running test.
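
Here's what those three fixes might look like in RSpec. The PriceQuote class, endpoint, and expected values are hypothetical, and Timecop and WebMock are just two common gems for this kind of stubbing; swap in whatever your project already uses:

    require "timecop"
    require "webmock/rspec"

    RSpec.describe PriceQuote do
      it "expires 24 hours after it is created" do
        # Freeze the system clock so "24 hours later" is deterministic.
        Timecop.freeze(Time.utc(2024, 1, 1, 12, 0, 0)) do
          quote = PriceQuote.new
          expect(quote.expires_at).to eq(Time.utc(2024, 1, 2, 12, 0, 0))
        end
      end

      it "assigns a reference code" do
        # Stub the random source and return exactly the value this test needs.
        allow(SecureRandom).to receive(:hex).and_return("abc12345")
        expect(PriceQuote.new.reference).to eq("abc12345")
      end

      it "fetches the current exchange rate" do
        # Don't hit the network; return a payload matched to this test.
        stub_request(:get, "https://rates.example.com/usd")
          .to_return(status: 200, body: '{"rate": 1.25}')
        expect(PriceQuote.new.exchange_rate).to eq(1.25)
      end
    end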

Leakiness

A leaky test is one that modifies some global state, then fails to clean up after itself. After it runs, every subsequent test starts from an unpredictable environment, which may cause some of them to fail. Some kinds of global state to watch out for are environment variables, class variables, and global data stores like Memcached, Redis, or a database.
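
As an illustration (the Checkout class and FEATURE_FLAG variable here are made up), a test that sets an environment variable and never restores it leaks state into every test that runs after it. An around hook is one way to guarantee cleanup:

    # Leaky: FEATURE_FLAG stays set for every test that runs afterwards.
    it "uses the beta checkout flow" do
      ENV["FEATURE_FLAG"] = "beta"
      expect(Checkout.new.flow).to eq(:beta)
    end

    # Cleaner: set the flag per example and restore the original value
    # however the test exits.
    around do |example|
      original = ENV["FEATURE_FLAG"]
      ENV["FEATURE_FLAG"] = "beta"
      begin
        example.run
      ensure
        ENV["FEATURE_FLAG"] = original
      end
    end

    it "uses the beta checkout flow" do
      expect(Checkout.new.flow).to eq(:beta)
    end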

Leaky tests are harder to pin down. They can be reproduced locally, but because the failures are order-dependent, the affected test will not fail when run in isolation. The key to debugging these failures is to inspect the state of the environment at the beginning of the test to ensure that it matches your expectations.
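
One low-tech way to do that is a suite-wide before hook that fails loudly whenever a test starts from a dirty environment. The specific checks below are hypothetical (a feature-flag variable and a Rails-style User model); assert whatever "clean" means in your suite:

    RSpec.configure do |config|
      config.before do
        # Fail fast if an earlier test leaked state this suite cares about.
        raise "FEATURE_FLAG leaked from a previous test" if ENV["FEATURE_FLAG"]
        raise "users table is not empty" unless User.count.zero?
      end
    end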

The key to permanently resolving these failures lies in finding out which test(s) are not cleaning up after themselves. Some tools (like RSpec) can "bisect" a test suite to determine which specific tests, run in which specific order, cause a downstream test to fail consistently. This is a great help, but it doesn't always lead to a definitive answer: it could be the combination of multiple leaky tests that causes a downstream test to fail.
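
With RSpec, for example, you can replay the ordering that failed on CI and let the bisect search find a minimal reproduction (the seed value below is a placeholder; use the one printed in your CI log):

    # Re-run with the ordering from the failed CI build, then let RSpec
    # narrow it down to the smallest set of examples that still fails.
    rspec --seed 4242 --bisect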

Race Conditions

Race conditions occur when tests that use the same shared resources run in parallel. That's rarely how you run tests locally, and you cannot reproduce these failures in isolation, so they usually only fail on CI.

This kind of failure is the hardest to reproduce. Once you do see a failure, though, look for where the test might be using some globally available shared resource and find a way to stop using it. For example, try giving each test its own in-memory implementation of the shared resource.
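
As a sketch of that idea (FeatureFlags and its Redis-backed store are hypothetical stand-ins), instead of pointing every parallel worker at the same Redis instance, inject a throwaway in-memory store into each example:

    # Shared resource: all parallel workers hit the same Redis, so tests
    # can race each other.
    # FeatureFlags.store = Redis.new(url: ENV["REDIS_URL"])

    # Per-test resource: each example gets its own private store.
    class InMemoryStore
      def initialize
        @data = {}
      end

      def get(key)
        @data[key]
      end

      def set(key, value)
        @data[key] = value
      end
    end

    RSpec.configure do |config|
      config.before do
        FeatureFlags.store = InMemoryStore.new
      end
    end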

Summary

Flakiness can almost always be traced back to tests that depend on something in their environment that changes. Now that you know this, please use the time while the build is running for a second or third time to debug the issue. Here are a few places to look:

  • Mock non-deterministic system resources
  • Avoid the use of global state
  • Clean up after yourself if you must use global state
  • Favor one-off in-memory stores for each test over persistent shared resources

Also, use this guide to determine what kind of flakiness to look for:

  • If you can reproduce the failure locally, and in isolation, you're likely dealing with a non-deterministic test.
  • If you can repro locally, but not in isolation, you're likely dealing with a leaky test. If your tooling supports bisect, leverage that to help identify which tests are leaking state.
  • And, if you cannot repro locally and the issue only occurs in CI, then you're likely dealing with a race condition.

Ok! Build's done! Did you fix the flaky test before the build finished?


I learned everything I shared here from thoughtbot engineer, RubyConf and RailsConf speaker, and co-host of The Bike Shed podcast, Joël Quenneville.


If you enjoyed the read and (hopefully) learned something, please consider clicking subscribe to receive an occasional post, like this one, directly in your inbox. Thanks for the support!