When do tests fail? by Mark Seemann
Optimise for the common scenario.
Unit tests occasionally fail. When does that happen? How often? What triggers it? What information is important when tests fail?
Regularly I encounter the viewpoint that it should be easy to understand the purpose of a test when it fails. Some people consider test names important, a topic that I've previously discussed. Recently I discussed the Assertion Roulette test smell on Twitter, and again I learned some surprising things about what people value in unit tests.
The importance of clear assertion messages #
The Assertion Roulette test smell is often simplified to degeneracy, but it really describes situations where it may be a problem if you can't tell which of several assertions actually caused a test to fail.
Josh McKinney described such a situation:
"Background. In a legacy product, we saw some tests start failing intermittently. They weren’t just flakey, but also failed without providing enough info to fix. One of things which caused time to fix to increase was multiple ways of a single test to fail."
He goes on:
"I.e. if you fix the first assertion and you know there still could be flakiness, or long cycle times to see the failure. Multiple assertions makes any test problem worse. In an ideal state, they are fine, but every assertion doubles the amount of failures a test catches."
"the other main way (unrelated) was things like:
Which tells you what failed, but nothing about how.
But the following is worse. You must run the test twice to fix:"
The final point is due to the short-circuiting nature of most assertion libraries. That, however, is a solvable problem.
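To illustrate the short-circuiting behaviour and one way around it, here is a minimal Python sketch. It is not the code from the quoted discussion: the `SoftAssertions` helper, the `result` dictionary, and its `status` and `count` keys are all hypothetical names invented for this example.

```python
class SoftAssertions:
    """Hypothetical helper: records failed checks instead of raising at the first one."""

    def __init__(self):
        self.failures = []

    def check(self, condition, message):
        # Remember the failure and keep going.
        if not condition:
            self.failures.append(message)

    def assert_all(self):
        # Report every recorded failure in a single AssertionError.
        if self.failures:
            raise AssertionError("; ".join(self.failures))


def short_circuiting(result):
    # Plain asserts stop at the first failure, so a second problem stays hidden
    # until the first one is fixed and the test is run again.
    assert result["status"] == "ok", "unexpected status"
    assert result["count"] == 3, "unexpected count"


def collecting(result):
    # Every check runs; the messages also say how the value was wrong.
    soft = SoftAssertions()
    soft.check(result["status"] == "ok", f"status was {result['status']!r}")
    soft.check(result["count"] == 3, f"count was {result['count']!r}")
    soft.assert_all()
```

With a result that is wrong in both respects, `short_circuiting` reports only the status, while `collecting` reports both problems in one run. Several test frameworks offer comparable features, so a hand-rolled helper like this may not be necessary in practice.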
I find the above a compelling example of why Assertion Roulette may be problematic.
It did give me pause, though. How common is this scenario?
Out of the blue #
The situation described by Josh McKinney comes with more than a single warning flag. I hope that it's okay to point some of them out. I didn't get the impression from my interaction with Josh McKinney that he considered the situation ideal in any way.
First, of course, there's the lack of information about the problem. Here, that's a real problem. As I understand it, it makes it harder to reproduce the problem in a development environment.
Next, there are long cycle times, which I interpret to mean that significant time may pass from when you attempt a fix until you can actually observe whether or not it worked. Josh McKinney doesn't say how long, but I wouldn't be surprised if it was measured in days. At least, if the cycle time is measured in days, I can see how this is a problem.
Finally, there's the observation that "some tests start failing intermittently". This was the remark that caught my attention. How often does that happen?
Tests shouldn't do that. Tests should be deterministic. If they're not, you should work to eradicate non-determinism in tests.
I'll be the first to admit that I also write non-deterministic tests. Not by design, but because I make mistakes. I've written many Erratic Tests in my career, and I've documented a few of them here:
- Waiting to happen
- Waiting to never happen
- Fortunately, I don't squash my commits
- Make pre-conditions explicit in Property-Based Tests
While it can happen, it shouldn't be the norm. When it nonetheless happens, eradicating that source of non-determinism should be top priority. Pull the andon cord.
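One common source of non-determinism is a hidden dependency on the system clock. As a minimal sketch of how such a dependency might be removed, the following Python example makes the current time an explicit, optional argument. The `greeting` function and its behaviour are hypothetical, chosen only for illustration; the articles listed above describe other causes.

```python
from datetime import datetime, timezone


def greeting(now=None):
    """Return a greeting; `now` can be injected so that tests can pin the time."""
    current = now if now is not None else datetime.now(timezone.utc)
    return "Good morning" if current.hour < 12 else "Good afternoon"


def test_greeting_is_deterministic():
    # Fixed timestamps: the outcome no longer depends on when the test runs.
    morning = datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)
    afternoon = datetime(2023, 1, 1, 15, 0, tzinfo=timezone.utc)
    assert greeting(morning) == "Good morning"
    assert greeting(afternoon) == "Good afternoon"
```

A test that called `greeting()` without an argument and expected "Good morning" would pass or fail depending on the time of day, which is exactly the kind of erratic behaviour to eradicate.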
When tests fail #
Ideally, tests should rarely fail. As examined above, you may have Erratic Tests in your test suite, and if you do, these tests will occasionally (or often) fail. As Martin Fowler writes, this is a problem and you should do something about it. He also outlines strategies for it.
Once you've eradicated non-determinism in unit tests, then when do tests fail?
I can think of a couple of situations.
Tests routinely fail as part of the red-green-refactor cycle. This is by design. If no test is failing in the red phase, you probably made a mistake (which also regularly happens to me), or you may not really be doing test-driven development (TDD).
Another situation that may cause a test to fail is if you changed some code and triggered a regression test.
In both cases, tests don't just fail out of the blue. They fail as an immediate consequence of something you did.
Optimise for the common scenario #
In both cases you're (hopefully) in a tight feedback loop. If you're in a tight feedback loop, then how important is the assertion message really? How important is the test name?
You work on the code base, make some changes, run the tests. If one or more tests fail, it's correlated to the change you just made. You should have a good idea of what went wrong. Are code forensics and elaborate documentation really necessary to understand a test that failed because you just did something a few minutes before?
The reason I don't care much about test names or whether there's one or more assertions in a unit test is exactly that: When tests fail, it's usually because of something I just did. I don't need diagnostic tools to find the root cause. The root cause is the change that I just made.
That's my common scenario, and I try to optimise my processes for the common scenarios.
Fast feedback #
There's an implied way of working that affects such attitudes. Since I learned about TDD in 2003 I've always relished the fast feedback I get from a test suite. Since I tried continuous deployment around 2014, I consider it central to modern software engineering (and Accelerate strongly suggests so, too).
The modus operandi I outline above is one of fast feedback. If you're sitting on a feature branch for weeks before integrating into master, or if you can only deploy two times a year, this influences what works and what doesn't.
Both Modern Software Engineering and Accelerate make a strong case that short feedback cycles are pivotal for successful software development organisations.
I also understand that that's not the reality for everyone. When faced with long cycle times, a multitude of Erratic Tests, a legacy code base, and so on, other things become important. In those circumstances, tests may fail for different reasons.
When you work with TDD, continuous integration (CI), and continuous deployment (CD), then when do tests fail? They fail because you made them fail, only minutes earlier. Fix your code and move forward.
When discussing test names and assertion messages, I've been surprised by the emphasis some people put on what I consider to be of secondary importance. I think the explanation is that circumstances differ.
With TDD and CI/CD you mostly look at a unit test when you write it, or if some regression test fails because you changed some code (perhaps in response to a test you just wrote). Your test suite may have hundreds or thousands of tests. Most of these pass every time you run the test suite. That's the normal state of affairs.
In other circumstances, you may have Erratic Tests that fail unpredictably. You should make it a priority to stop that, but as part of that process, you may need good assertion messages and good test names.
Different circumstances call for different reactions, so what works well in one situation may be a liability in other situations. I hope that this article has shed a little light on the forces you may want to consider.