100% coverage is not that trivial

Monday, 10 November 2025 12:00:00 UTC

Dispelling a myth I helped propagate.

Most people who have been around automated testing for a few years understand that code coverage is a useless target measure. Unfortunately, through a game of Chinese whispers, this message often degenerates to the simpler, but incorrect, notion that code coverage is useless.

As I've already covered in that article, code coverage may be useful for other reasons. That's not my agenda for this article. Rather, something about this discussion has been bothering me for a long time.

Have you ever had an uneasy feeling about a topic, without being able to put your finger on exactly what the problem is? This happens to me regularly. I'm going along with the accepted narrative until the cognitive dissonance becomes so conspicuous that I can no longer ignore it.

In this article, I'll grapple with the notion that 'reaching 100% code coverage is easy.'

Origins #

This tends to come up when discussing code coverage. People will say that 100% code coverage isn't a useful measure, because it's easy to reach 100%. I have used that argument myself. Fortunately I also cited my influences in 2015; in this case Martin Fowler's Assertion Free Testing.

"[...] of course you can do this and have 100% code coverage - which is one reason why you have to be careful on interpreting code coverage data."

This may not be the only source of such a claim, but it may have been a contributing factor. There's little wrong with Fowler's article, which doesn't make any groundless claims, but I can imagine how semantic diffusion works on an idea like that.

Fowler also wrote that it's "a story from a friend of a friend." When the source of a story is twice-removed like that, alarm bells should go off. This is the stuff that urban legends are made of, and I wonder if this isn't rather an example of 'programmer folk wisdom'. I've heard variations of that story many times over the years, from various people.

It's not that easy #

Even though I've helped promulgate the idea that reaching 100% code coverage is easy if you cheat, I now realise that that's an overstatement. Even if you write no assertions, and surround the test code with a try/catch block, you can't trivially reach 100% coverage. There are going to be branches that you can't reach.

This often happens in real code bases that query databases, call web services, and so on. If a branch depends on indirect input, you can't force execution down that path just by suppressing exceptions.

An example is warranted.

Example #

Consider this ReadReservation method in the SqlReservationsRepository class from the code base that accompanies my book Code That Fits in Your Head:

public async Task<Reservation?> ReadReservation(int restaurantId, Guid id)
{
    const string readByIdSql = @"
        SELECT [PublicId], [At], [Name], [Email], [Quantity]
        FROM [dbo].[Reservations]
        WHERE [PublicId] = @id";
 
    using var conn = new SqlConnection(ConnectionString);
    using var cmd = new SqlCommand(readByIdSql, conn);
    cmd.Parameters.AddWithValue("@id", id);
 
    await conn.OpenAsync().ConfigureAwait(false);
    using var rdr = await cmd.ExecuteReaderAsync().ConfigureAwait(false);
    if (!rdr.Read())
        return null;
 
    return ReadReservationRow(rdr);
}

Even though it only has a cyclomatic complexity of 2, most of it is unreachable to a test that tries to avoid hard work.

You can try to cheat in the suggested way by adding a test like this:

[Fact]
public async Task ReadReservation()
{
    try
    {
        var sut = new SqlReservationsRepository("dunno");
        var actual = await sut.ReadReservation(0, Guid.NewGuid());
    }
    catch { }
}

Granted, this test passes, and if you had 0% code coverage before, it does improve the metric slightly. Interestingly, the Coverlet collector for .NET reports that only the first line, which creates the conn variable, is covered. I wonder, though, if this is due to some kind of compiler optimization associated with asynchronous execution that the coverage tool fails to capture.

More understandably, execution reaches conn.OpenAsync() and crashes, since the test hasn't provided a connection to a real database. You can see this for yourself if you run the test without the surrounding try/catch block.

Coverlet reports 18% coverage, and that's as high as you can get with 'the easy hack'. 100% is some distance away.

Toward better coverage #

You may protest that we can do better than this. After all, with utter disregard for using proper arguments, I passed "dunno" as a connection string. Clearly that doesn't work.

Couldn't we easily get to 100% by providing a proper connection string? Perhaps, but what's a proper connection string?

It doesn't help if you pass a well-formed connection string instead of "dunno". In fact, it will only slow down the test, because then conn.OpenAsync() will attempt to open the connection. If the database is unreachable, that statement will eventually time out and fail with an exception.

Couldn't you, though, give it a connection string to a real database?

Yes, you could. If you do that, though, you should make sure that the database has a schema compatible with readByIdSql. Otherwise, the query will fail. What happens if the implied schema changes? Now you need to make sure that the database is updated, too. This sounds error-prone. Perhaps you should automate that.

Furthermore, you may easily cover the branch that returns null. After all, when you query for Guid.NewGuid(), that value is not going to be in the table. On the other hand, how will you cover the other branch; the one that returns a row?

You can only do that if you know the ID of a value already in that table. You may write a second test that queries for that known value. Now you have 100% coverage.
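Such a second test might look like the following sketch. The connection string, restaurant ID, and GUID are all invented for illustration, and the test assumes that a matching row already exists in a reachable test database:

[Fact]
public async Task ReadKnownReservation()
{
    // Hypothetical: assumes a reachable test database that already
    // contains a Reservations row with this PublicId.
    var sut = new SqlReservationsRepository(
        "Server=(LocalDB)\\MSSQLLocalDB;Database=RestaurantTest;Integrated Security=true");
    var knownId = Guid.Parse("6df8ea15-2b29-42a4-9e2c-8ae40622162f");

    var actual = await sut.ReadReservation(1, knownId);
}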

What you have done at this point, however, is no longer an easy cheat to get to 100%. You have, essentially, added integration tests of the data access subsystem.

How about adding some assertions to make the tests useful?

Integration tests for 100% #

In most systems, you will at least need some integration tests to reach 100% code coverage. While the code shown in Code That Fits in Your Head doesn't have 100% code coverage (that was never my goal), it looks quite good. (It's hard to get a single number, because Coverlet apparently can't measure coverage by running multiple test projects, so I can only get partial results. Coverage is probably better than 80%, I estimate.)

To test ReadReservation I wrote integration tests that automate setup and tear-down of a local test-specific database. The book, and the Git repository that accompanies it, has all the details.
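If you want a sense of the shape of such automation, here's a minimal, hypothetical sketch using xUnit's IAsyncLifetime. The mechanism in the book's repository differs, and the database name and connection strings here are invented:

public sealed class SqlReservationsRepositoryTests : IAsyncLifetime
{
    private const string MasterConnectionString =
        "Server=(LocalDB)\\MSSQLLocalDB;Database=master;Integrated Security=true";
    private const string TestConnectionString =
        "Server=(LocalDB)\\MSSQLLocalDB;Database=RestaurantTest;Integrated Security=true";

    public async Task InitializeAsync()
    {
        // Runs before each test: create a fresh test database.
        using var conn = new SqlConnection(MasterConnectionString);
        await conn.OpenAsync();
        using var cmd = new SqlCommand("CREATE DATABASE [RestaurantTest]", conn);
        await cmd.ExecuteNonQueryAsync();
        // Deploy the schema here, e.g. by running the script that
        // creates dbo.Reservations.
    }

    public async Task DisposeAsync()
    {
        // Runs after each test: drop the test database again.
        using var conn = new SqlConnection(MasterConnectionString);
        await conn.OpenAsync();
        using var cmd = new SqlCommand(
            @"ALTER DATABASE [RestaurantTest]
              SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
              DROP DATABASE [RestaurantTest];", conn);
        await cmd.ExecuteNonQueryAsync();
    }

    [Fact]
    public async Task ReadMissingReservationReturnsNull()
    {
        var sut = new SqlReservationsRepository(TestConnectionString);

        var actual = await sut.ReadReservation(1, Guid.NewGuid());

        Assert.Null(actual);
    }
}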

Getting to 100%, or even 80%, requires dedicated work. In a realistic code base, the claim that reaching 100% is trivial is hardly true.

Conclusion #

Programmer folk wisdom 'knows' that code coverage is useless. One argument is that any fool can reach 100% by writing assertion-free tests surrounded by try/catch blocks.

This is hardly true in most significant code bases. Whenever you deal with indirect input, try/catch is insufficient to control which way execution branches.

This suggests that high code-coverage numbers are good, and low numbers bad. What constitutes high and low is context-dependent. What seems to remain true, however, is that code coverage is a useless target. This has little to do with how trivial it is to reach 100%, but rather everything to do with how humans respond to incentives.


Empirical Characterization Testing

Monday, 03 November 2025 13:08:00 UTC

Gathering empirical evidence while adding tests to legacy code.

This article is part of a short series on empirical test-after techniques. Sometimes, test-driven development (TDD) is impractical. This often happens when faced with legacy code. Although there's a dearth of hard data, I guess that most code in the world falls into this category. Other software thought leaders seem to suggest the same notion.

For the purposes of this discussion, the definition of legacy code is code without automated tests.

"Code without tests is bad code. It doesn't matter how well written it is; it doesn't matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don't know if our code is getting better or worse."

As Michael Feathers suggests, the accumulation of knowledge is at the root of this definition. As I outlined in Epistemology of software, tests are the source of empirical evidence. In principle it's possible to apply a rigorous testing regimen with manual testing, but in most cases this is (also) impractical for reasons that are different from the barriers to automated testing. In the rest of this article, I'll exclusively discuss automated testing.

We may reasonably extend the definition of legacy code to a code base without adequate testing support.

When do we have enough tests? #

What, exactly, is adequate testing support? The answer is the same as in science, overall. When do you have enough scientific evidence that a particular theory is widely accepted? There's no universal answer to that, and no, a p-value less than 0.05 isn't the answer, either.

In short, adequate empirical evidence is when a hypothesis is sufficiently corroborated to be accepted. Keep in mind that science can never prove a theory correct, but performing experiments against falsifiable predictions can disprove it. This applies to software, too.

"Testing shows the presence, not the absence of bugs."

Edsger W. Dijkstra, in Software Engineering Techniques: Report on a conference sponsored by the NATO Science Committee, Rome, Italy, 1969 (published April 1970).

The terminology of hypothesis, corroboration, etc. may be opaque to many software developers. Here's what it means in terms of software engineering: You have unspoken and implicit hypotheses about the code you're writing. Usually, once you're done with a task, your hypothesis is that the code makes the software work as intended. Anyone who's written more than a hello-world program knows, however, that believing the code to be correct is not enough. How many times have you written code that you assumed correct, only to find that it was not?

That's the lack of knowledge that testing attempts to address. Even manual testing. A test is an experiment that produces empirical evidence. As Dijkstra quipped, passing tests don't prove the software correct, but the more passing tests we have, the more confidence we gain. At some point, the passing tests provide enough confidence that you and other stakeholders consider it sensible to release or deploy the software. We say that the failure to find failing tests corroborates our hypothesis that the software works as intended.

In the context of legacy code, it's not the absolute lack of automated tests that characterizes legacy code. Rather, it's the lack of adequate test coverage. It's that you don't have enough tests. Thus, when working with legacy code, you want to add tests after the fact.

Characterization Test recipes #

A test written after the fact against a legacy code base is called a Characterization Test, because it characterizes (i.e. describes) the behaviour of the system under test (SUT) at the time it was written. It's not a given that this behaviour is correct or desirable.

Michael Feathers gives this recipe for writing a Characterization Test:

  1. "Use a piece of code in a test harness.
  2. "Write an assertion that you know will fail.
  3. "Let the failure tell you what the behavior is.
  4. "Change the test so that it expects the behavior that the code produces.
  5. "Repeat."

Notice the second step: "Write an assertion that you know will fail." Why is that important? Why not write the 'correct' assertion from the outset?

The reason is the same as I outlined in Epistemology of software: It happens with surprising regularity that you inadvertently write a tautological assertion. You could also make other mistakes, but writing a failing test is a falsifiable experiment. In this case, the implied hypothesis is that the test will fail. If it does not fail, you've falsified the implied prediction. To paraphrase Dijkstra, you've proven the test wrong.

If, on the other hand, the test fails, you've failed to falsify the hypothesis. You have not proven the test correct, but you've failed in proving it wrong. Epistemologically, that's the best result you may hope for.
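In code, Feathers' recipe might play out like this hypothetical C# sketch; LegacyPriceCalculator is an invented stand-in for a piece of code without tests:

// An invented stand-in for some legacy code.
public sealed class LegacyPriceCalculator
{
    public decimal CalculateDiscount(int quantity, decimal unitPrice)
    {
        return quantity >= 10 ? quantity * unitPrice * 0.05m : 0m;
    }
}

[Fact]
public void CalculateDiscountCharacterization()
{
    // Step 1: use the code in a test harness.
    var sut = new LegacyPriceCalculator();

    var actual = sut.CalculateDiscount(10, 25m);

    // Steps 2-3: assert a value known to be wrong, and let the failure
    // message reveal the actual behaviour.
    Assert.Equal(-1m, actual);
    // Step 4: replace -1m with the value the failure message reported
    // (here 12.5m), so the test now characterizes the current behaviour.
}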

I'm a little uneasy about the above recipe, because it involves a step where you change the test code to make the test pass. How can you be sure that you haven't, without meaning to, replaced a proper assertion with a tautological one?

For that reason, I sometimes follow a variation of the recipe:

  1. Write a test that exercises the SUT, including the correct assertion you have in mind.
  2. Run the test to see it pass.
  3. Sabotage the SUT so that it fails the assertion. If there are several assertions, do this for each, one after the other.
  4. Run the test to see it fail.
  5. Revert the sabotage.
  6. Run the test again to see it pass.
  7. Repeat.

The last test run is, strictly speaking, not necessary if you've been rigorous about how you revert the sabotage, but psychologically, it gives me a better sense that all is good if I can end each cycle with a green test suite.

Example #

I don't get to interact that much with legacy code, but even so, I find myself writing Characterization Tests with surprising regularity. One example was when I was characterizing the song recommendations example. If you have the Git repository that accompanies that article series, you can see that the initial setup is adding one Characterization Test after the other. Even so, as I follow a policy of not adding commits with failing tests, you can't see the details of the process leading to each commit.

Perhaps a better example can be found in the Git repository that accompanies Code That Fits in Your Head. If you own the book, you also have access to the repository. In commit d66bc89443dc10a418837c0ae5b85e06272bd12b I wrote this message:

"Remove PostOffice dependency from Controller

"Instead, the PostOffice behaviour is now the responsibility of the EmailingReservationsRepository Decorator, which is configured in Startup.

"I meticulously edited the unit tests and introduced new unit tests as necessary. All new unit tests I added by following the checklist for Characterisation Tests, including seeing all the assertions fail by temporarily editing the SUT."

Notice the last paragraph, which is quite typical for how I tend to document my process when it's otherwise invisible in the Git history. Here's a breakdown of the process.

I first created the EmailingReservationsRepository without tests. This class is a Decorator, so quite a bit of it is boilerplate code. For instance, one method looks like this:

public Task<Reservation?> ReadReservation(int restaurantId, Guid id)
{
    return Inner.ReadReservation(restaurantId, id);
}

That's usually the case with such Decorators, but then one of the methods turned out like this:

public async Task Update(int restaurantId, Reservation reservation)
{
    if (reservation is null)
        throw new ArgumentNullException(nameof(reservation));

    var existing =
        await Inner.ReadReservation(restaurantId, reservation.Id)
            .ConfigureAwait(false);
    if (existing is { } && existing.Email != reservation.Email)
        await PostOffice
            .EmailReservationUpdating(restaurantId, existing)
            .ConfigureAwait(false);

    await Inner.Update(restaurantId, reservation)
        .ConfigureAwait(false);

    await PostOffice.EmailReservationUpdated(restaurantId, reservation)
        .ConfigureAwait(false);
}

I then realized that I should probably cover this class with some tests after all, which I then proceeded to do in the above commit.

Consider one of the state-based Characterisation Tests I added to cover the Update method.

[Theory]
[InlineData(32, "David")]
[InlineData(58, "Robert")]
[InlineData(58, "Jones")]
public async Task UpdateSendsEmail(int restaurantId, string newName)
{
    var postOffice = new SpyPostOffice();
    var existing = Some.Reservation;
    var db = new FakeDatabase();
    await db.Create(restaurantId, existing);
    var sut = new EmailingReservationsRepository(postOffice, db);

    var updated = existing.WithName(new Name(newName));
    await sut.Update(restaurantId, updated);

    var expected = new SpyPostOffice.Observation(
        SpyPostOffice.Event.Updated,
        restaurantId,
        updated);
    Assert.Contains(updated, db[restaurantId]);
    Assert.Contains(expected, postOffice);
    Assert.DoesNotContain(
        postOffice,
        o => o.Event == SpyPostOffice.Event.Updating);
}
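For context, a test spy along those lines might look like the following hypothetical sketch. The actual SpyPostOffice in the book's repository differs in its details; this version only records the two events that the test above inspects:

// Hypothetical sketch of a test spy. Because Observation is a record,
// it has structural equality, which is what makes
// Assert.Contains(expected, postOffice) work.
public sealed class SpyPostOffice : Collection<SpyPostOffice.Observation>
{
    public enum Event { Updating, Updated }

    public sealed record Observation(
        Event Event, int RestaurantId, Reservation Reservation);

    public Task EmailReservationUpdating(int restaurantId, Reservation reservation)
    {
        Add(new Observation(Event.Updating, restaurantId, reservation));
        return Task.CompletedTask;
    }

    public Task EmailReservationUpdated(int restaurantId, Reservation reservation)
    {
        Add(new Observation(Event.Updated, restaurantId, reservation));
        return Task.CompletedTask;
    }
}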

This test immediately passed when I added it, so I had to sabotage the Update method to see the assertions fail. Since there are three assertions, I had to sabotage the SUT in three different ways.

To see the first assertion fail, the most obvious sabotage was to simply comment out or delete the delegation to Inner.Update:

//await Inner.Update(restaurantId, reservation)
//    .ConfigureAwait(false);

This caused the first assertion to fail. I was sure to actually look at the error message and follow the link to the test failure to make sure that it was, indeed, that assertion that was failing, and not something else. Once I had that verified, I undid the sabotage.

With the SUT back to unedited state, it was time to sabotage the second assertion. Just like FakeDatabase inherits from ConcurrentDictionary, SpyPostOffice inherits from Collection, which means that the assertion can simply verify whether the postOffice contains the expected observation. Sabotaging that part was as easy as the first one:

//await PostOffice.EmailReservationUpdated(restaurantId, reservation)
//    .ConfigureAwait(false);

The test failed, but again I meticulously verified that the error was the expected error at the expected line. Once I'd done that, I again reverted the SUT to its virgin state, and ran the test to verify that all tests passed.

The last assertion is a bit different, because it checks that no Updating message is being sent. This should only happen if the user updates the reservation by changing his or her email address. In that case, but only in that case, should the system send an Updating message to the old address, and an Updated message to the new address. There's a separate test for that, but as it follows the same overall template as the one shown here, I'm not showing it. You can see it in the Git repository.

Here's how to sabotage the SUT to see the third assertion fail:

if (existing is { } /*&& existing.Email != reservation.Email*/)
    await PostOffice
        .EmailReservationUpdating(restaurantId, existing)
        .ConfigureAwait(false);

It's enough to comment out (or delete) the second Boolean check to fail the assertion. Again, I made sure to check that the test failed on the exact line of the third assertion. Once I'd made sure of that, I undid the change, ran the tests again, and committed the changes.

Conclusion #

When working with automated tests, a classic conundrum is that you're writing code to test some other code. How do you know that the test code is correct? After all, you're writing test code because you don't trust your abilities to produce perfect production code. The way out of that quandary is to first predict that the test will fail and run that experiment. If you haven't touched the production code, but the test passes, odds are that there's something wrong with the test.

When you are adding tests to an existing code base, you can't perform that experiment without jumping through some hoops. After all, the behaviour you want to observe is already implemented. You must therefore either write a variation of a test that deliberately fails (as Michael Feathers recommends), or temporarily sabotage the system under test so that you can verify that the new test fails as expected.

The example shows how to proceed empirically with a Characterisation Test of a C# class that I'd earlier added without tests. Perhaps, however, I should have rather approached the situation in another way.

Next: Empirical software prototyping.


Empirical test-after development

Monday, 27 October 2025 06:42:00 UTC

A few techniques for situations where TDD is impractical.

In Epistemology of software I described how test-driven development (TDD) is a scientific approach to software development. By running tests, we conduct falsifiable experiments to gather empirical evidence that corroborate our hypothesis about the software we're developing.

TDD is, in my experience, the most effective way to deliver useful software within reasonable time frames. Even so, I also understand that there are situations where TDD is impractical. I can think of a few overall categories where this is the case, but undoubtedly, there are more than those I enumerate in this article.

Not all is lost in those cases. What do you do when TDD seems impractical? The key is to understand how empirical methods work. How do you gather evidence that corroborates your hypotheses? In subsequent articles, I'll share some techniques that I've found useful.

Each of these articles will contain tips and examples that apply in those situations where TDD is impractical. Most of the ideas and techniques I've learned from other people, and I'll be as diligent as possible about citing my sources of inspiration.

Next: Empirical Characterization Testing.


Epistemology of software

Monday, 20 October 2025 06:16:00 UTC

How do you know that your code works?

In 2023 I gave a conference keynote titled Epistemology of software, a recording of which is available on YouTube. In it, I try to answer the question: How do we know that software works?

The keynote was for a mixed audience with some technical, but also a big contingent of non-technical, software people, so I took a long detour around general epistemology, and particularly the philosophy of science as it pertains to empirical science. Towards the end of the presentation, I returned to the epistemology of software in particular. While I recommend that you watch the recording for a broader perspective, I want to reiterate the points about software development here. Personally, I like prose better than video when it comes to succinctly presenting and preserving ideas.

How do we know anything? #

In philosophy of science it's long been an established truth that we can't know anything with certainty. We can only edge asymptotically closer to what we believe is the 'truth'. The most effective method for that is the 'scientific method', which, grossly simplified, is an iterative process of forming hypotheses, making predictions, performing experiments, and corroborating or falsifying predictions. When experiments don't quite turn out as predicted, you may adjust your hypothesis accordingly.

Cycle with arrows from prediction to experiment, from experiment to adjustment, and from adjustment to prediction.

An example, however, may still be useful to set the stage. Consider Galilei's idea of dropping a small and a big ball from the Leaning Tower of Pisa. The prediction is that two objects of different mass will hit the ground simultaneously, if dropped from the same height simultaneously. This experiment has been carried out multiple times, even in vacuum chambers.

Even so, thousands of experiments do not constitute proof that two objects fall with the same acceleration. The experiments only make it exceedingly likely that this is so.

What happens if you apply scientific thinking to software development?

Empirical epistemology of software #

Put yourself in the shoes of a non-coding product owner. He or she needs an application to solve a particular problem. How does he or she know when that goal has been reached?

Ultimately, he or she can only do what any scientist can do: Form hypotheses, make predictions, and perform experiments. Sometimes product owners assign these jobs to other people, but so do scientists.

Cycle with arrows from write software to test, from test to shatter all illusions, and from shatter all illusions to write software.

Testing can be ad-hoc or planned, automated or manual, thorough or cursory, but it's really the only way a product owner can determine whether or not the software works.

When I started as a software developer, testing was often the purview of a dedicated testing department. The test manager would oversee the production of a test plan, which was a written document that manual testers were supposed to follow. Even so, this, too, is an empirical approach to software verification. Each test essentially implies a hypothesis: If we perform these steps, the application will behave in this particular, observable manner.

For an application designed for human interaction, letting a real human tester interact with it may be the most realistic test scenario, but human testers are slow. They also tend to make mistakes. The twentieth time they run through the same test scenario, they are likely to overlook details.

If you want to perform faster, and more reliable, tests, you may wish to automate the tests. You could, for example, write code that executes a test plan. This, however, raises another problem: If tests are also code, how do we know that the tests contain no bugs?

The scientific method of software #

How about using the scientific method? Even more specifically, proceed by making incremental falsifiable predictions about the code you write.

For instance, write a single automated test without accompanying production code:

Two boxes labelled 'production code' and 'test code'. The test-code box contains a single red arrow going from a to b.

Before you run the test code, you make a prediction based on the implicit hypothesis formulated by the red-green-refactor checklist: If this test runs, it will fail.

This is a falsifiable prediction. While you expect the test to fail, it may succeed if, for example, you've inadvertently written a tautological assertion. In other words, if your prediction is falsified, you know that the test code is somehow wrong. On the other hand, if the test fails (as predicted), you've failed to falsify your prediction. As empirical science goes, this is the best you can hope for. It doesn't prove that the test is correct, but corroborates the hypothesis that it is.

The next step in the red-green-refactor cycle is to write just enough code to pass the test. You do that, and before rerunning the test implicitly formulate the opposite hypothesis: If this test runs, it will succeed.
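As a minimal, hypothetical C# illustration of those two predictions (the Calculator example and its xUnit test are invented for this article; a degenerate implementation stands in for 'no production code' so that the example compiles):

// First experiment: with Add still degenerate (returning 0), the
// prediction is that running this test will fail.
[Fact]
public void AddReturnsSum()
{
    int actual = Calculator.Add(2, 3);
    Assert.Equal(5, actual);
}

public static class Calculator
{
    // Degenerate implementation for the first experiment; the test fails
    // as predicted. For the second experiment, replace the body with
    // 'return x + y;' and predict that the test will now pass.
    public static int Add(int x, int y)
    {
        return 0;
    }
}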

Two boxes labelled 'production code' and 'test code'. The test-code box contains a single red arrow going from a to b. The product code box contains green arrows going from a to b to c.

This, again, is a falsifiable prediction. If, despite expectations, the test fails, you know that something is wrong. Most likely, it's the implementation that you just wrote, but it could also be the test which, after all, is somehow defective. Or perhaps a circuit in your computer was struck by a cosmic ray. On the other hand, if the test passes, you've failed to falsify your prediction, which is the best you can hope for.

You now write a second test, which comes with the implicit falsifiable prediction: If I run all tests, the new test will fail.

Two boxes labelled 'production code' and 'test code'. The test-code box contains a single blue arrow going from a to b, and red arrow going from a to g. The product code box contains green arrows going from a to b to c.

The process repeats. A succeeding test falsifies the prediction, while a failing test only corroborates the hypothesis.

Again, implement just enough code for the hypothesis that if you run all tests, they will pass.

Two boxes labelled 'production code' and 'test code'. The test-code box contains a single blue arrow going from a to b, and red arrow going from a to g. The product code box contains green arrows going from a to b to c, and other arrows branching off from b to go to g via e.

If this hypothesis, too, is corroborated (i.e. you failed to falsify it), you move on until you believe that you're done.

Two boxes labelled 'production code' and 'test code'. The test-code box contains blue arrows going from a to b, from a to g, and another arrow going from a to h. The product code box contains green arrows going from a to b to c, and other arrows branching off from b to go to g via e, and yet another set of arrows going from a to d to f to h.

As this process proceeds, you corroborate two related hypotheses: That the test code is correct, and that the production code is. None of these hypotheses are ever proven, but as you add tests that are first red, then green, and then stay green, you increase confidence that the entire code complex works as intended.

If you don't write the test first, you don't get to perform the first experiment: That you predict the new test to fail. If you don't do that, you collect no empirical evidence that the tests work as hypothesized. In other words, you'd lose half of the scientific evidence you otherwise could have gathered.

TDD is the scientific method #

Let me spell this out: Test-driven development (TDD) is an example of the scientific method. Watching a new test fail is an important part of the process. Without it, you have no empirical reason to believe that the tests are correct.

While you may read the test code, that only puts you on the same scientific footing as the ancient Greeks' introspective philosophy: The four humours, extramission, the elements, etc. By reading test code, or even writing it, you may believe that you understand what the code does, but reading it without running it gives you no empirical way to verify whether that belief is correct.

Consider: How many times have you written code that you believed was correct, but turned out to contain errors?

If you write the test after the system under test (SUT), you can run the test to see it pass, but consider what that experiment can tell you: If the test passes, you've learned little. It may be that the test exercises the SUT, but it may also be that you've written a tautological assertion. It may also be that you've faithfully captured a bug in the production code, and now preserved it for eternity as something that looks like a regression test. Or perhaps the test doesn't even cover the code path that you believe it covers.

Conversely, if such a test (that you believe to be correct) fails, you're also in the dark. Was the test wrong, after all? Or does the SUT have a defect?

This is the reason that the process for writing Characterization Tests includes a step where you

"Write an assertion that you know will fail."

I prefer a variation where I write what I believe is the correct assertion, but then temporarily sabotage the SUT to fail the assertion. The important part is to see the test fail, because the failure to falsify a strong prediction is important empirical evidence.

Conclusion #

How do we know that the software we develop works as intended? The answer lies in the much larger question: How do we know anything?

Scientific thinking effectively answers this by 'the scientific method': Form a hypothesis, make falsifiable predictions, perform experiments, adjust, repeat.

We can subject software to the same rigorous regimen that scientists do: Hypothesize that the software works in certain ways under given conditions, predict observable behaviour, test, record outcomes, fix defects, repeat.

Test-driven development closely follows that process, and is thus a highly scientific methodology for developing software. It should be noted that science is hard, and so is TDD. Still, if you care that your software behaves as it's supposed to, it's one of the most rigorous and effective processes I'm aware of.


Result isomorphism

Wednesday, 15 October 2025 14:47:00 UTC

Result types are roughly equivalent to exceptions.

This article is part of a series about software design isomorphisms, although naming this one an isomorphism is a stretch. A real isomorphism is when a lossless translation exists between two or more different representations. This article series has already shown a few examples that fit the definition better than what the present article will manage.

The reader, I hope, will bear with me. The overall series of software design isomorphisms establishes a theme, and even when a topic doesn't fit the definition to a T, I find that it harmonizes well enough that it still belongs.

In short, the claim made here is that 'Result' (or Either) types are equivalent to exceptions.

Two boxes labelled 'exception' and 'result', respectively, with curved arrows pointing from each to the other.

I've deliberately drawn the arrows in such a way that they fade or wash out as they approach their target. My intent is to suggest that there is some loss of information. We may consider exceptions and result types to be roughly equivalent, but they do, in general, have different semantics. The exact semantics are language-dependent, but most languages tend to align with each other when it comes to exceptions. If they have exceptions at all.

Checked exceptions #

As far as I'm aware, the language where exceptions and results are most similar may be Java, which has checked exceptions. This means that a method may declare that it throws certain exceptions. Any callers must either handle all declared exceptions, or rethrow them, thereby transitively declaring to their own callers that they must expect certain exceptions to be thrown.

Imagine, for example, that you want to create a library of basic statistical calculations. You may start out with this variation of mean:

public double mean(double[] values) {
    if (values == null || values.length == 0) {
        throw new IllegalArgumentException(
            "The parameter 'values' must not be null or empty.");
    }
    double sum = 0;
    for (double value : values) {
        sum += value;
    }
    return sum / values.length;
}

Since it's impossible to calculate the mean for an empty data set, this method throws an exception. If we had omitted the Guard Clause, the method would have returned NaN, a questionable language design choice, if you ask me.

One would think that you could add throws IllegalArgumentException to the method declaration in order to force callers to deal with the problem, but alas, IllegalArgumentException is a RuntimeException, so no caller is forced to deal with this exception, after all.

Purely for the sake of argument, we may introduce a special StatisticsException class as a checked exception, and change the mean method to this variation:

public static double mean(double[] values) throws StatisticsException {
    if (values == null || values.length == 0) {
        throw new StatisticsException(
            "The parameter 'values' must not be null or empty.");
    }
    double sum = 0;
    for (double value : values) {
        sum += value;
    }
    return sum / values.length;
}

Since the new StatisticsException class is a checked exception, callers must handle that exception, or declare that they themselves throw that exception type. Even unit tests have to do that:

@Test void meanOfOneValueIsTheValueItself() throws StatisticsException {
    double actual = Statistics.mean(new double[] { 42.0 });
    assertEquals(42.0, actual);
}

Instead of 'rethrowing' checked exceptions, you may also handle them, if you can.

Handling checked exceptions #

If you have a sensible way to deal with error values, you may handle checked exceptions. Let's assume, mostly to have an example to look at, that we also need a function to calculate the empirical variance of a data set. Furthermore, for the sole benefit of the example, let's handwave and say that if the data set is empty, this means that the variance is zero. (I do understand that that's not how variance is defined, but work with me: It's only for the sake of the example.)

public static double variance(double[] values) {
    try {
        double mean = mean(values);
        double sumOfSquares = 0;
        for (double value : values) {
            double deviation = value - mean;
            sumOfSquares += deviation * deviation;
        }
        return sumOfSquares / values.length;
    } catch (StatisticsException e) {
        return 0;
    }
}

Since the variance function handles StatisticsExceptions in a try/catch construction, the function doesn't throw that exception, and therefore doesn't have to declare that it throws anything. To belabour the obvious: The method is not adorned with any throws StatisticsException declaration.

Refactoring to Result values #

The claim in this article is that throwing exceptions is sufficiently equivalent to returning Result values that it warrants investigation. As far as I can tell, Java doesn't come with any built-in Result type (and neither does C#), mostly, it seems, because Result values seem rather redundant in a language with checked exceptions.

Still, imagine that we define a Church-encoded Either, but call it Result<Succ, Fail>. You can now refactor mean to return a Result value:

public static Result<Double, StatisticsException> mean(double[] values) {
    if (values == null || values.length == 0) {
        return Result.failure(new StatisticsException(
            "The parameter 'values' must not be null or empty."));
    }
    double sum = 0;
    for (double value : values) {
        sum += value;
    }
    return Result.success(sum / values.length);
}

In order to make the change as understandable as possible, I've only changed the function to return a Result value, while most other design choices remain as before. Particularly, the failure case contains StatisticsException, although as a general rule, I'd consider that an anti-pattern: You're better off using exceptions if exceptions are what you are dealing with.

That said, the above variation of mean no longer has to declare that it throws StatisticsException, because it implicitly does that by its static return type.

Furthermore, variance can still handle both success and failure cases with match:

public static double variance(double[] values) {
    return mean(values).match(
        mean -> {
            double sumOfSquares = 0;
            for (double value : values) {
                double deviation = value - mean;
                sumOfSquares += deviation * deviation;
            }
            return sumOfSquares / values.length;
        },
        error -> 0.0
    );
}

Just like try/catch enables you to 'escape' having to propagate a checked exception, match allows you to handle both cases of a sum type in order to instead return an 'unwrapped' value, like a double.

Not a true isomorphism #

Some languages (e.g. F# and Haskell) already have built-in Result types (although they may instead be called Either). In other languages, you can either find a reusable library that provides such a type, or you can add one yourself.
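Adding one yourself needn't take much code. The following C# sketch is only an illustration, not the API of any particular library; the type and member names are invented here:

// A minimal, hypothetical Church-encoded Result type.
public sealed class Result<TSuccess, TFailure>
{
    private readonly bool isSuccess;
    private readonly TSuccess success;
    private readonly TFailure failure;

    private Result(bool isSuccess, TSuccess success, TFailure failure)
    {
        this.isSuccess = isSuccess;
        this.success = success;
        this.failure = failure;
    }

    public static Result<TSuccess, TFailure> Success(TSuccess value)
    {
        return new Result<TSuccess, TFailure>(true, value, default!);
    }

    public static Result<TSuccess, TFailure> Failure(TFailure error)
    {
        return new Result<TSuccess, TFailure>(false, default!, error);
    }

    // Callers handle both cases by supplying a function for each,
    // mirroring the match method used in the Java examples above.
    public T Match<T>(
        Func<TSuccess, T> onSuccess,
        Func<TFailure, T> onFailure)
    {
        return isSuccess ? onSuccess(success) : onFailure(failure);
    }
}

With such a type in place, the refactoring shown above in Java translates more or less mechanically to other languages.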

Once you have a Result type, you can always refactor exception-throwing code to Result-returning code. This applies even if the language in question doesn't have checked exceptions. In fact, I've mostly performed this manoeuvre in C#, which doesn't have checked exceptions.

Most mainstream languages also support exceptions, so if you have a Result-valued method, you can also refactor the 'other way'. For the above statistics example, you simply read the examples from bottom toward the top.

Because it's possible to go back and forth like that, this relationship looks like a software design isomorphism. It's not quite, however, since information is lost in both directions.

When you refactor from throwing an exception to returning a Result value, you lose the stack trace embedded in the exception. Additionally, languages that support exceptions have very specific semantics for that language construct. Specifically, an unhandled exception crashes its program, and although this may look catastrophic, it usually happens in an orderly way. The compiler or language runtime makes sure that the process exits with a proper error code. Usually, an unhandled exception is communicated to the operating system, which logs the error, including the stack trace. All of this happens automatically.

As Eirik Tsarpalis points out, you lose all of this 'convenience' if you instead use Result values. A Result is just another data structure, and the semantics associated with failure cases are entirely your responsibility. If you need an 'unhandled' failure case to crash the program, you must explicitly write code to make the program return an error code to the operating system.

So why would you ever want to use Result types? Because you also, typically, lose information going from Result-valued operations to throwing exceptions.

Most importantly, you lose static type information about error conditions. Java is the odd man out in this respect, since checked exceptions actually do statically advertise to callers the error cases with which they must deal. Even so, in the first example, above, IllegalArgumentException is not part of the statically-typed method signature, since IllegalArgumentException is not a checked exception. Consequently, I had to invent the custom StatisticsException to make the example work. Other languages don't support checked exceptions, so there, a compiler or static analyser can't help you identify whether or not you've dealt with all error cases.

Thus, in statically typed languages, a Result value contains information about error cases. A compiler or static analyser can check whether you've dealt with all possible errors. If you refactor to throwing exceptions, this information is lost.

The bottom line is that you can refactor from exceptions to Results, or from Results to exceptions. As far as I can tell, these refactorings are always possible, but you gain and lose some capabilities in both directions.

Other languages #

So far, I've only shown a single example in Java. You can, however, easily do the same exercise in other languages that support exceptions. My article Non-exceptional averages goes over some of the same ground in C#, and Conservative codomain conjecture expands on the average example in both C# and F#.

You can even implement Result types in Python; the reusable library returns, for example, comes with Maybe and Result. Given that Python is fundamentally dynamically typed, however, I'm not sure I'm convinced of the utility of that.

At the other extreme, Haskell idiomatically uses Either for most error handling. Even so, the language also has exceptions, and even if some may think that they're mostly present for historical reasons, they're still used in modern Haskell code to model predictable errors that you probably can't handle, such as various IO-related problems: The file is gone, the network is down, the database is not responding, etc.

Finally, we should note that the similarity between exceptions and Result values depends on the language in question. Some languages don't support parametric polymorphism (AKA generics), and while I haven't tried, I'd expect the utility of Result values to be limited in those cases. On the other hand, some languages don't have exceptions, C being perhaps the most notable example.

Conclusion #

All programs can fail. Over the decades, various languages have had different approaches to error handling. Exceptions, including mechanisms for throwing and catching them, are perhaps the most common. Another strategy is to rely on Result values. In their given contexts, each offers benefits and drawbacks.

In many languages, you have a choice of both. In Haskell, Result (Either) is the idiomatic solution, but exceptions are still possible. In C-like languages, ironically, exceptions are the norm, but in many (like Java and C#) you can bolt on Results if you so decide, although it's likely to alienate some developers. In a language like F#, both options are present in almost equal proportion. I'd consider it idiomatic to use Result in 'native' F# code, while when interoperating with the rest of the .NET ecosystem (which is almost exclusively written in C#) it may be more prudent to just stick to exceptions.

In those languages where you have both options, you can go back and forth between exceptions and Result values. Since you can refactor both ways, this relationship looks like a software design isomorphism. It isn't, though. There are differences in language semantics between the two, so a choice of one or the other has consequences. Recall, however, as Sartre said, not making a choice is also making a choice.

Next: Builder isomorphisms.


Comments

Thank you for the article — an interesting perspective on comparing exceptions and the Result pattern. I recently wrote an article where I explored specific scenarios in which the Result pattern offers advantages over exceptions, and vice versa. Exception Handling in .NET I’d be very interested to hear your thoughts on it.

2025-10-17 08:34 UTC

Thank you for writing. Proper exception-throwing and -handling is indeed complicated, and I can tell from your article that you've seen many of the same antipatterns and much of the same cargo-cult programming that I have. Clearly, there's a need for content such as your article to educate people on better modelling of errors.

2025-10-22 05:49 UTC

Shift left on x

Monday, 06 October 2025 07:57:00 UTC

A difficult task may be easier if done sooner.

You've probably seen a figure like this before:

Graph with time along the x-axis and cost on the y-axis. One curve goes from low cost to high cost as time increases.

The point is that as time passes, the cost of doing something increases. This is often used to explain why test-driven development (TDD) or other agile methods are cost-effective alternatives to a waterfall process.

Last time I checked, however, there was scant scientific evidence for this curve.

Even so, it feels right. If you discover a bug while you write the code, it's much easier to fix it than if it's discovered weeks later.

Make security easier #

I was recently reminded of the above curve because a customer of mine was struggling with security; mostly authentication and authorization. They asked me if there was a software-engineering practice that could help them get a better handle on security. Since this is a customer who's otherwise quite knowledgeable about agile methods and software engineering, I was a little surprised that they hadn't heard the phrase shift left on security.

The idea fits with the above diagram. 'Shifting left' implies moving to the left on the time axis. In other words, do things sooner. Specifically related to security, the idea is to include security concerns early in every software development process.

There's little new in this. Writing Secure Code from 2004 describes how threat modelling is part of secure coding practices. This is something I've had in the back of my mind since reading the book. I also describe the technique and give an example in Code That Fits in Your Head.

If I know that a system I'm developing requires authentication, among the first automated acceptance tests I write are one that successfully authenticates against the system, and one or more that fail to do so. Again, the code base that accompanies Code That Fits in Your Head has examples of this.

Since my customer's question reminded me of this practice, I began pondering the idea of 'shifting left'. Since it's touted as a benefit of both TDD and DevSecOps, an obvious pattern suggests itself.

Sufficient requirements #

It's been years since I last drew the above diagram. As I implied, one problem with it is that there seems to be little quantifiable evidence for that relationship. On the other hand, you've surely had the experience that some tasks become harder, the longer you wait. I'll list some examples later.

While we may not have solid scientific evidence that a cost curve looks like above, it doesn't have to look like that to make shifting left worthwhile. All it takes, really, is that the relationship is non-decreasing, and increases at least once. It doesn't have to be polynomial or exponential; it may be linear or logarithmic. It may even be a non-decreasing step function, like this:

Graph with time along the x-axis and cost on the y-axis. A staircase-shaped figure indicates a stepwise increasing function.

This, as far as I can tell, is a sufficient condition to warrant shifting left on an activity. If you have even anecdotal evidence that it may be more costly to postpone an activity, do it sooner. In practice, I don't think that you need to wait for solid scientific evidence before you do this.

While not quite the same, it's a notion similar to the old agile saw: If it hurts, do it more often. Instead, we may phrase it as: If it gets harder with time, do it sooner.

Examples #

You've already seen two examples: TDD and security. Are there other examples where tackling problems sooner may decrease cost? Certainly.

A few, I cover in Code That Fits in Your Head. The earlier you automate the build process, the easier it is. The earlier you treat all warnings as errors, the easier it is. This seems almost self-explanatory, particularly when it comes to treating warnings as errors. In a brand-new code base, you have no warnings. In that situation, treating warnings as errors is free. When, later, a compiler warning appears, your code doesn't compile, and you're forced to immediately deal with it. At that time, it tends to be much easier to fix the issue, because no other code depends on the code with the warning.

A similar situation applies to an automated build. At the beginning, an automated build is a simple batch file with one command. dotnet test -c Release, stack test, py -m pytest, and so on. Later, when you need to deal with databases, security, third-party components, etc. you enhance the automated build 'just in time'.

Once you have an automated build, deployment is a small step further. In the beginning, deploying an application is typically as easy as copying some program files (compiled code or source files, depending on language) to the machines on which it's going to run. An exception may be if the deployment target is some sort of app store, with a vetting process that prevents you from deploying a walking skeleton. If, on the other hand, your organization controls the deployment target, the sooner you deploy a hello-world application, the easier it is.

Yet another shift-left example is using static code analysis or linting, particularly when combined with treating warnings as errors. Linters are usually free, and as I describe in Code That Fits in Your Head, I've always found it irrational that teams don't use them. Not that I don't understand the mechanism, because if you only turn them on at a time when the code base has already accumulated thousands of linter issues, the sheer volume is overwhelming.

Closely related to this discussion is the lean development notion that bugs are stop-the-line issues. The correct number of known bugs in the code base is zero. The correct number of unhandled exceptions in production is zero. This sounds unattainable to most people, but is possible if you shift left on managing defects.

In short:

  • Shift left on security
  • Shift left on testing
  • Shift left on treating warnings as errors
  • Shift left on automated builds
  • Shift left on deployment
  • Shift left on linting
  • Shift left on defect management

This list is hardly exhaustive.

Shift right #

While it increases your productivity to do some things sooner, it's not a universal rule. Some things become easier, the longer you wait. In terms of time-to-cost curves, this happens whenever the curve is decreasing, even if only step-wise.

Graph with time along the x-axis and cost on the y-axis. A staircase-shaped figure indicates a stepwise decreasing function.

The danger of these figures is that they may give the impression of a deterministic process. That need not be the case, but if you have reason to believe that waiting until later may make solving a problem easier, consider waiting. The notion of waiting until the last responsible moment is central to lean or agile software development.

In a sense, you could view this as 'shifting right' on certain tasks. More than once I've experienced that if you wait long enough with a certain task, it becomes irrelevant. Not just easier to perform, but something that you don't need to do at all. What looked like a requirement early on turned out to be not at all what the customer or user wanted, after all.

When to do what #

How do you distinguish? How do you decide if shifting left or shifting right is more appropriate? In practice, it's rarely difficult. The shift-left list above contains the usual suspects. While the list may not be exhaustive, it's a well-known list of practices that countless teams have found easier to do sooner rather than later.

My own inclination would be to treat most other things as tasks that are better postponed. After all, you can't do everything as the first thing. Naturally, there has to be some sequencing of tasks.

Thinking about such decisions in terms of time-cost curves feels natural to me. I find it an easy framework to consider whether I should shift left or right on some activity.

Conclusion #

Some things are easier if you get started as soon as possible. Candidates include testing, deployment, security, and defect management. This is the case when there's an increasing relationship between time and cost. This relationship need not be a quantified function. Often, you can get by with a sense that 'if I do this now, it's going to be easy; if I wait, it's going to be harder.'

Conversely, some things are better postponed to the last responsible moment. This happens if the relation between time and cost is decreasing.

Perhaps we can simplify this analysis even further. Perhaps you don't even need to think of (step) functions. All you may need is to consider the partial order of tasks in terms of cost. Since I'm a visual thinker, however, increasing and decreasing functions come more naturally to me.


Composing pure Haskell assertions

Monday, 29 September 2025 07:43:00 UTC

With HUnit and QuickCheck examples.

A question had been in the back of my mind for a long time, but I always got caught up in something seemingly more important, so I didn't get around to investigate until recently. It's simply this:

How do you compose pure assertions in HUnit or QuickCheck?

Let me explain what I mean, and why this isn't quite as straightforward as it may sound.

Assertions as statements #

What do I mean by composing assertions? Really nothing more than wanting to verify more than a single outcome of a test.

If you're used to writing test assertions in imperative languages like C#, Java, Python, or JavaScript, you think nothing of it. Just write an assertion on one line, and the next assertion on the next line.

If you're writing impure Haskell, you can also do that.

"CLRS example" ~: do
  p :: IOArray Int Int <-
    newListArray (1, 10) [1, 5, 8, 9, 10, 17, 17, 20, 24, 30]
 
  (r, s) <- cutRod p 10
  actualRevenue <- getElems r
  actualSizes <- getElems s
 
  let expectedRevenue = [0, 1, 5, 8, 10, 13, 17, 18, 22, 25, 30]
  let expectedSizes = [1, 2, 3, 2, 2, 6, 1, 2, 3, 10]
  expectedRevenue @=? actualRevenue
  expectedSizes @=? actualSizes

This example is an inlined HUnit test which tests the impure cutRod variation. The two final statements are assertions that use the @=? assertion operator. The value on the left side is the expected value, and to the right goes the actual value. This operator returns a type called Assertion, which turns out to be nothing but an alias for IO ().

In other words, those assertions are impure actions, and they work similarly to assertions in imperative languages. If the actual value passes the assertion, nothing happens and execution moves on to the assertion on the next line. If, on the other hand, the assertion fails, execution short-circuits, and an error is reported.

Imperative languages typically throw exceptions to achieve that behaviour. Even Unquote does this. Exactly how HUnit does it I don't know; I haven't looked under the hood.

You can do the same with Test.QuickCheck.Monadic:

testProperty "cutRod returns correct arrays" $ \ p -> monadicIO $ do
  let n = length p
  p' :: IOArray Int Int <- run $ newListArray (1, n) p
 
  (r, s) :: (IOArray Int Int, IOArray Int Int) <- run $ cutRod p' n
  actualRevenue <- run $ getElems r
  actualSizes <- run $ getElems s
 
  assertWith (length actualRevenue == n + 1) "Revenue length is incorrect"
  assertWith (length actualSizes == n) "Size length is incorrect"
  assertWith (all (\i -> 0 <= i && i <= n) actualSizes) "Sizes are not all in [0..n]"

Like the previous example, you can repeatedly call assertWith, since this action, too, is a statement that returns no value.

So far, so good.

Composing assertions #

What if, however, you want to write tests as pure functions?

Pure functions are composed from expressions, while statements aren't allowed (or are, at the least, ineffective and subject to being optimized away by a compiler). In other words, the above strategy isn't going to work. If you want to write more than one assertion, you need to figure out how they compose.

The naive answer might be to use logical conjunction (AKA Boolean and). Write one assertion as a Boolean expression, another assertion as another Boolean expression, and just compose them using the standard 'and' operator. In Haskell, that would be &&.

This works, after a fashion, but it has a major drawback. If such a composed assertion fails, it doesn't tell you why. All you know is that the entire Boolean expression evaluated to False.
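
To make the drawback concrete, here's a sketch of what that naive approach could look like as a QuickCheck property, assuming the pure, Map-returning cutRod variation that appears later in this article (and a qualified Map import). Since Bool is a Testable instance, the property runs, but when it fails, the report only shows the shrunk input; it can't say which of the two conjuncts was False:

testProperty "cutRod returns maps of correct sizes (naive)" $ \ p ->
  let n = length (p :: [Int])
      -- Prepend 0 so that prices are 1-indexed, as in the later examples.
      (r, s) = cutRod (0 : p) n
  in length (Map.elems r) == n + 1 && length (Map.elems s) == n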

This is the reason that most testing libraries come with explicit assertion APIs. In HUnit, you may wish to use the ~=? operator, and in QuickCheck the === operator.

The question, however, is how they compose. Ideally, assertions should compose applicatively, but I've never seen that in the wild. If not, look for a monoid, or at least a semigroup.

Let's do that for both HUnit and QuickCheck.

Composing HUnit assertions #

My favourite HUnit assertion is the ~=? operator, which has the (simplified) type a -> a -> Test. In other words, an expression like expectedRevenue ~=? actualRevenue has the type Test. The question, then, is: How does Test compose?

Not that well, I'm afraid, but I find the following workable. You can compose one or more Test values with the TestList constructor, but if you're already using the ~: operator, as I usually do (see below), then you just need a Testable instance, and it turns out that a list of Testable values is itself a Testable instance. This means that you can write a pure unit test and compose ~=? like this:

"CLRS example" ~:
  let p = [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30] :: [Int]
 
      (r, s) = cutRod p 10
      actualRevenue = Map.elems r
      actualSizes = Map.elems s
 
      expectedRevenue = [0, 1, 5, 8, 10, 13, 17, 18, 22, 25, 30]
      expectedSizes = [1, 2, 3, 2, 2, 6, 1, 2, 3, 10]
  in [expectedRevenue ~=? actualRevenue,
      expectedSizes ~=? actualSizes]

This is a refactoring of the above test, now as a pure function, because it tests the pure variation of cutRod. Notice that the two assertions are simply returned as a list.

While this has enough syntactical elegance to satisfy me, it does have the disadvantage that it actually creates two test cases. One that runs with the first assertion, and one that executes with the second:

:CLRS example:
  : [OK]
  : [OK]

In most cases this is unlikely to be a problem, but it could be if the test performs a resource-intensive computation. Each assertion you add makes it run one more time.

A test like the one shown here is so 'small' that this is rarely much of an issue. On the other hand, a property-based testing library might stress a System Under Test more, so fortunately, QuickCheck assertions compose better than HUnit assertions.

Composing QuickCheck assertions #

The === operator has the (simplified) type a -> a -> Property. Hoogling for a combinator with the type Property -> Property -> Property doesn't reveal anything useful, but fortunately it turns out that for running QuickCheck properties, all you really need is a Testable instance (not the same Testable as HUnit defines). And lo and behold! The .&&. operator is just what we need. That, or the conjoin function, if you have more than two assertions to combine, as in this example:

testProperty "cutRod returns correct arrays" $ \ p -> do
  let n = length p
  let p' = 0 : p  -- Ensure the first element is 0

  let (r, s) :: (Map Int Int, Map Int Int) = cutRod p' n
  let actualRevenue = Map.elems r
  let actualSizes = Map.elems s
 
  conjoin [
    length actualRevenue === n + 1,
    length actualSizes === n,
    counterexample "Sizes are not all in [0..n]" $
      all (\i -> 0 <= i && i <= n) actualSizes ]

The .&&. operator is actually a bit more flexible than conjoin, but due to operator precedence and indentation rules, trying to chain those three assertions with .&&. is less elegant than using conjoin. In this case.
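
For comparison, here's a sketch of a two-assertion variant chained with .&&., again assuming the pure, Map-returning cutRod. At this size the operator reads just fine, and unlike the naive Boolean composition sketched earlier, a failing === assertion reports the expected and actual values:

testProperty "cutRod returns maps of correct sizes" $ \ p ->
  let n = length (p :: [Int])
      (r, s) = cutRod (0 : p) n
  in length (Map.elems r) === n + 1 .&&. length (Map.elems s) === n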

Conclusion #

In imperative languages, composing test assertions is as simple as writing one assertion after another. Since assertions are statements, and imperative languages allow you to sequence statements, this is such a trivial way to compose assertions that you've probably never given it much thought.

Pure programs, however, are not composed from statements, but rather from expressions. A pure assertion is an expression that returns a value, so if you want to compose two or more pure assertions, you need to figure out how to compose the values that the assertions return.

Ideally, assertions should compose as applicative functors, but they rarely do. Instead, you'll have to go looking for combinators that enable you to combine two or more of a test library's built-in assertions. In this article, you've seen how to compose assertions in HUnit and QuickCheck.


It's striking how quickly the industry forgets that lines of code isn't a measure of productivity

Monday, 22 September 2025 06:52:00 UTC

Code is a liability, not an asset.

It's not a new idea that the more source code you have, the greater the maintenance burden. Dijkstra already touched on this topic in his Turing Award lecture in 1972, and later wrote,

"if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent""

On the cruelty of really teaching computing science, Edsger W. Dijkstra, 1988

He went on to note that

"the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger."

On the cruelty of really teaching computing science, Edsger W. Dijkstra, 1988

The use of the word ledger suggests an accounting perspective that was later also adopted by Tim Ottinger, who observed that Code is a Liability.

The entire premise of my book Code That Fits in Your Head is also that the more code you have, the harder it becomes to evolve and maintain the code base.

Even so, it seems to me that, once more, most of the software industry is equating the ability to spew out as much code as fast as possible with productivity. My guess is that some people are in for a rude awakening in a couple of years.


Greyscale-box test-driven development

Monday, 15 September 2025 18:45:00 UTC

Is TDD white-box testing or black-box testing?

Surely you're aware of the terms black-box testing and white-box testing, but have you ever wondered where test-driven development (TDD) fits in that picture?

The short answer is that TDD as a software development practice sits somewhere between the two. It really isn't black and white, and exactly where TDD sits on the spectrum changes with circumstances.

Black-box testing to the left, white-box testing to the right, with four grey boxes of varying greyscales in between, all four labeled TDD.

If the above diagram indicates that TDD can't occupy the space of undiluted black- or white-box testing, that's not my intent. In my experience, however, you rarely do either when you engage with the TDD process. Rather, you find yourself somewhere in-between.

In the following, I'll examine the two extremes in order to explain why TDD rarely leads to either, starting with black-box testing.

Compartmentalization of knowledge #

If you follow the usual red-green-refactor checklist, you write test code and production code in tight loops. You write some test code, some production code, more test code, then more production code, and so on.

If you're working by yourself, at least, that makes it almost impossible to treat the System Under Test (SUT) as a black box. After all, you're also the one who writes the production code.

You can try to 'forget' the production code you just wrote whenever you circle back to writing another test, but in practice, you can't. Even so, it may still be a useful exercise. I call this technique Gollum style (originally introduced in the Pluralsight course Outside-In Test-Driven Development as a variation on the Devil's advocate technique). The idea is to assume two roles, and to explicitly distinguish the goals of the tester from the aims of the implementer.

Still, while this can be an illuminating exercise, I don't pretend that this is truly black-box testing.

Pair programming #

If you pair-program, you have better options. You could have one person write a test, and another person implement the code to pass the test. I could imagine a setup where the tester can't see the production code. Although I've never seen or heard about anyone doing that, this would get close to true black-box TDD.

To demonstrate, imagine a team doing the FizzBuzz kata in this way. The tester writes the first test:

[Fact]
public void One()
{
    var actual = FizzBuzzer.Convert(1);
    Assert.Equal("1"actual);
}

Either the implementer is allowed to see the test, or the specification is communicated to him or her in some other way. In any case, the natural response to the first test is an implementation like this:

public static string Convert(int number)
{
    return "1";
}

In TDD, this is expected. This is the simplest implementation that passes all tests. We imagine that the tester already knows this, and therefore adds this test next:

[Fact]
public void Two()
{
    var actual = FizzBuzzer.Convert(2);
    Assert.Equal("2"actual);
}

The implementer's response is this:

public static string Convert(int number)
{
    if (number == 1)
        return "1";
    return "2";
}

The tester can't see the implementation, so may believe that the implementation is now 'appropriate'. Even if he or she wants to be 'more sure', a few more test cases (for, say, 4, 7, or 38) could be added; it doesn't make any difference for the following argument.

Next, incrementally, the tester may add a few test cases that cover the "Fizz" behaviour:

[Fact]
public void Three()
{
    var actual = FizzBuzzer.Convert(3);
    Assert.Equal("Fizz"actual);
}
 
[Fact]
public void Six()
{
    var actual = FizzBuzzer.Convert(6);
    Assert.Equal("Fizz"actual);
}

Similar test cases cover the "Buzz" and "FizzBuzz" behaviours. For this example, I wrote eight test cases in total, but a more sceptical tester might write twelve or even sixteen before feeling confident that the test suite sufficiently describes the desired behaviour of the system. Even so, a sufficiently adversarial implementer might (given eight test cases) deliver this implementation:

public static string Convert(int number)
{
    switch (number)
    {
        case  1: return "1";
        case  2: return "2";
        case  5:
        case 10: return "Buzz";
        case 15:
        case 30: return "FizzBuzz";
        default: return "Fizz";
    }
}

To be clear, it's not that I expect real-world programmers to be either obtuse or nefarious. In real life, on the other hand, requirements are more complicated, and may be introduced piecemeal in a fashion that leads to buggy, overly complicated implementations.

Under-determination #

Remarkably, black-box testing may work better as an ex-post technique, compared to TDD. If we imagine that an implementer has made an effort to correctly implement a system according to specification, a tester may use black-box testing to poke at the SUT, using both randomly selected test cases, and by explicitly exercising the SUT at boundary cases.

Even so, black-box testing in reality tends to run into the problem of under-determination, also known from philosophy of science. As I outlined in Code That Fits in Your Head, software testing has many similarities with empirical science. We use experiments (tests) to corroborate hypotheses that we have about software: Typically either that it doesn't pass tests, or that it does pass all tests, depending on where in the red-green-refactor cycle we are.

Similar to science, we are faced with the basic epistemological problem that we have a finite number of tests, but usually an infinite (or at least extremely big) state space. Thus, as pointed out by the problem of under-determination, more than one 'reality' fits the available observations (i.e. test cases). The above FizzBuzz implementation is an example of this.

As an aside, certain problems actually have input domains that are small enough that you can cover them exhaustively. In its most common description, the FizzBuzz kata, too, falls into this category.

"Write a program that prints the numbers from 1 to 100."

This means that you can, in fact, write 100 test cases and thereby specify the problem in its totality. What you still can't do with black-box testing, however, is impose a particular implementation. An adversarial implementer could write the Convert function as one big switch statement. Just like I did with the Tennis kata, another kata with a small state space.

This, however, rarely happens in the real world. Example-driven testing is under-determined. And no, property-based testing doesn't fundamentally change that conclusion. It behoves you to look critically at the actual implementation code, and not rely exclusively on testing.

Working with implementation code #

It's hardly a surprise that TDD isn't black-box testing. Is it white-box testing, then? Since the red-green-refactor cycle dictates a tight loop between test and production code, you always have the implementation code at hand. In that sense, the SUT is a white box.

That said, the common view on white-box testing is that you work with knowledge about the internal implementation of an already-written system, and use that to design test cases. Typically, looking at the code should enable a tester to identify weak spots that warrant testing.

This isn't always the case with TDD. If you follow the red-green-refactor checklist, each cycle should leave you with a SUT that passes all tests in the simplest way that could possibly work. Consider the first incarnation of Convert, above (the one that always returns "1"). It passes all tests, and from a white-box-testing perspective, it has no weak spots. You can't identify a test case that'll make it crash.

If you consider the test suite as an executable specification, that degenerate implementation is correct, since it passes all tests. Of course, according to the kata description, it's wrong. Looking at the SUT code will tell you that in a heartbeat. It should prompt you to add another test case. The question is, though, whether that qualifies as white-box testing, or it's rather reminiscent of the transformation priority premise. Not that that's necessarily a dichotomy.

Overspecified software #

Perhaps a more common problem with white-box testing in relation to TDD is the tendency to take a given implementation for granted. Of course, working according to the red-green-refactor cycle, there's no implementation before the test, but a common technique is to use Mock Objects to let tests specify how the SUT should be implemented. This leads to the familiar problem of Overspecified Software.

Here's an example.

Finding values in an interval #

In the code base that accompanies Code That Fits in Your Head, the code that handles a new restaurant reservation contains this code snippet:

var reservations = await Repository
    .ReadReservations(restaurant.Id, reservation.At)
    .ConfigureAwait(false);
var now = Clock.GetCurrentDateTime();
if (!restaurant.MaitreD.WillAccept(now, reservations, reservation))
    return NoTables500InternalServerError();
 
await Repository.Create(restaurant.Id, reservation)
    .ConfigureAwait(false);

The ReadReservations method is of particular interest in this context. It turns out to be a small extension method on a more general interface method:

internal static Task<IReadOnlyCollection<Reservation>> ReadReservations(
    this IReservationsRepository repository,
    int restaurantId,
    DateTime date)
{
    var min = date.Date;
    var max = min.AddDays(1).AddTicks(-1);
    return repository.ReadReservations(restaurantId, min, max);
}

The IReservationsRepository interface doesn't have a method that allows a client to search for all reservations on a given date. Rather, it defines a more general method that enables clients to search for reservations in a given interval:

Task<IReadOnlyCollection<Reservation>> ReadReservations(
    int restaurantId, DateTime min, DateTime max);

As the parameter names imply, the method finds and returns all the reservations for a given restaurant between the min and max values. A previous article already covers this method in much detail.

I think I've stated this more than once before: Code is never perfect. Although I made a genuine attempt to write quality code for the book's examples, now that I revisit this API, I realize that there's room for improvement. The most obvious problem with that method definition is that it's not clear whether the range includes, or excludes, the boundary values. Would it improve encapsulation if the method instead took a Range<DateTime> parameter?

At the very least, I could have named the parameters inclusiveMin and inclusiveMax. That's how the system is implemented, and you can see an artefact of that in the above extension method. It searches from midnight of date to the tick just before midnight on the next day.

The SQL implementation reflects that contract, too.

SELECT [PublicId], [At], [Name], [Email], [Quantity]
FROM [dbo].[Reservations]
WHERE [RestaurantId] = @RestaurantId AND
      @Min <= [At] AND [At] <= @Max

Here, @RestaurantId, @Min, and @Max are query parameters. Notice that the query uses the <= relation for both @Min and @Max, making both endpoints inclusive.

Interactive white-box testing #

Since I'm aware of the problem of overspecified software, I test-drove the entire code base using state-based testing. Imagine, however, that I'd instead used a dynamic mock library. If so, a test could have looked like this:

[Fact]
public async Task PostUsingMoq()
{
    var now = DateTime.Now;
    var reservation =
        Some.Reservation.WithDate(now.AddDays(2).At(20, 15));
    var repoTD = new Mock<IReservationsRepository>();
    repoTD
        .Setup(r => r.ReadReservations(
            Some.Restaurant.Id,
            reservation.At.Date,
            reservation.At.Date.AddDays(1).AddTicks(-1)))
        .ReturnsAsync(new Collection<Reservation>());
    var sut = new ReservationsController(
        new SystemClock(),
        new InMemoryRestaurantDatabase(Some.Restaurant),
        repoTD.Object);
 
    var ar = await sut.Post(Some.Restaurant.Id, reservation.ToDto());
 
    Assert.IsAssignableFrom<CreatedAtActionResult>(ar);
    // More assertions could go here.
}

This test uses Moq, but the example doesn't hinge on that. I rarely use dynamic mock libraries these days, but when I do, I still prefer Moq.

Notice how the Setup reproduces the implementation of the ReadReservations extension method. The implication is that if you change the implementation code, you break the test.

Even so, we may consider this an example of a test-driven white-box test. While, according to the red-green-refactor cycle, you're supposed to write the test before the implementation, this style of TDD only works if you, the test writer, have an exact plan for how the SUT is going to look.

An innocent refactoring? #

Don't you find that min.AddDays(1).AddTicks(-1) expression a bit odd? Wouldn't the code be cleaner if you could avoid the AddTicks(-1) part?

Well, you can.

A tick is the smallest unit of measurement of DateTime values. Since ticks are discrete, the range defined by the extension method would be equivalent to a right-open interval, where the minimum value is still included, but the maximum is not. If you made that change, the extension method would be simpler:

internal static Task<IReadOnlyCollection<Reservation>> ReadReservations(
    this IReservationsRepository repository,
    int restaurantId,
    DateTime date)
{
    var min = date.Date;
    var max = min.AddDays(1);
    return repository.ReadReservations(restaurantId, min, max);
}

In order to offset that change, you also change the SQL accordingly:

SELECT [PublicId], [At], [Name], [Email], [Quantity]
FROM [dbo].[Reservations]
WHERE [RestaurantId] = @RestaurantId AND
      @Min <= [At] AND [At] < @Max

Notice that the query now compares [At] with @Max using the < relation.

While this is formally a breaking change of the interface, it's entirely internal to the application code base. No external systems or libraries depend on IReservationsRepository. Thus, this change is a true refactoring: It improves the code without changing the observable behaviour of the system.

Even so, this change breaks the PostUsingMoq test.

To make the test pass, you'll need to repeat the change you made to the SUT:

[Fact]
public async Task PostUsingMoq()
{
    var now = DateTime.Now;
    var reservation =
        Some.Reservation.WithDate(now.AddDays(2).At(20, 15));
    var repoTD = new Mock<IReservationsRepository>();
    repoTD
        .Setup(r => r.ReadReservations(
            Some.Restaurant.Id,
            reservation.At.Date,
            reservation.At.Date.AddDays(1)))
        .ReturnsAsync(new Collection<Reservation>());
    var sut = new ReservationsController(
        new SystemClock(),
        new InMemoryRestaurantDatabase(Some.Restaurant),
        repoTD.Object);
 
    var ar = await sut.Post(Some.Restaurant.Id, reservation.ToDto());
 
    Assert.IsAssignableFrom<CreatedAtActionResult>(ar);
    // More assertions could go here.
}

If it's only one test, you can probably live with that, but it's the opposite of a robust test; it's a Fragile Test.

"to refactor, the essential precondition is [...] solid tests"

A common problem with interaction-based testing is that even small refactorings break many tests. We might see that as a symptom of the tests having too much knowledge of implementation details, which is also what relates this style to white-box testing.

To be clear, the tests in the code base that accompanies Code That Fits in Your Head are all state-based, so contrary to the PostUsingMoq test, all 161 tests easily survive the above refactoring.

Greyscale-box TDD #

It's not too hard to argue that TDD isn't black-box testing, but it's harder to argue that it's not white-box testing. Naturally, as you follow the red-green-refactor cycle, you know all about the implementation. Still, the danger of being too aware of the SUT code is being trapped in an implementation mindset.

While there's nothing wrong with getting the implementation right, many maintainability problems originate in insufficient encapsulation. Deliberately treating the SUT as a grey box helps in discovering a SUT's contract. That's why I recommend techniques like Devil's Advocate. Pretending to view the SUT from the outside can shed valuable light on usability and maintainability issues.

Conclusion #

The notions of white-box and black-box testing have been around for decades. So has TDD. Even so, it's not always clear to practitioners whether TDD is one or the other. The reason is, I believe, that TDD is neither. Good TDD practice sits somewhere between white-box testing and black-box testing.

Exactly where on that greyscale spectrum TDD belongs depends on context. The more important encapsulation is, the closer you should move towards black-box testing. The more important correctness or algorithm performance is, the closer to white-box testing you should move.

You can, however, move position on the spectrum even in the same code base. Perhaps you want to start close to white-box testing as you focus on getting the implementation right. Once the SUT works as intended, you may then decide to shift your focus towards encapsulation, in which case moving closer to black-box testing could prove beneficial.


IO is special

Monday, 08 September 2025 05:36:00 UTC

Are IO expressions really referentially transparent programs?

Sometimes, when I discuss functional architecture or the IO container, a reader will argue that Haskell IO really is 'pure', 'referentially transparent', 'functional', or has another similar property.

The argument usually goes like this: An IO value is a composable description of an action, but not in itself an action. Since IO is a Monad instance, it composes via the usual monadic bind combinator >>=, or one of its derivatives.
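
As a minimal sketch of that composition (my own toy example, not one lifted from any particular discussion), an IO value can be assembled from smaller IO values without anything being executed in the process:

askName :: IO ()
askName =
  putStrLn "What's your name?"
  >> getLine
  >>= \name -> putStrLn ("Hello, " ++ name ++ ".")

In this view, askName is merely a bigger description composed from smaller ones; nothing happens until the runtime ultimately executes the composed program.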

Another point sometimes made is that you can 'call' an IO-valued action from within a pure function, as demonstrated by this toy example:

greet :: TimeOfDay -> String -> String
greet timeOfDay name =
  let greeting = case () of
        _ | isMorning timeOfDay -> "Good morning"
          | isAfternoon timeOfDay -> "Good afternoon"
          | isEvening timeOfDay -> "Good evening"
          | otherwise -> "Hello"
 
      sideEffect = putStrLn "Side effect!"
 
  in if null name
     then greeting ++ "."
     else greeting ++ ", " ++ name ++ "."

This is effectively a Haskell port of the example given in Referential transparency of IO. Here, sideEffect is a value of the type IO (), even though greet is a pure function. Such examples are sometimes used to argue that the expression putStrLn "Side effect!" is pure, because it's deterministic and has no side effects.

Rather, sideEffect is a 'program' that describes an action. The program is a referentially transparent value, although actually running it is not.

As I also explained in Referential transparency of IO, the above function application is legal because greet never uses the value 'inside' the IO action. In fact, the compiler may choose to optimize the sideEffect expression away, and I believe that GHC does just that.

I've tried to summarize the most common arguments as succinctly as I can. While I could cite actual online discussions that I've had, I don't wish to single out anyone. I don't want to make this article appear as though it's an attack on anyone in particular. Rather, my position remains that IO is special, and I'll subsequently try to explain the reasoning.

Reductio ad absurdum #

While I could begin my argument by stating the general case, backed up by citing some papers, I'm afraid I'd lose most readers in the process. Therefore I'll flip the argument around and start with a counter-example. What would happen if we accept the claim that IO is pure or referentially transparent?

It would follow that all Haskell code should be considered pure. That would include putStrLn "Hello, world." or launchMissiles. That I find that conclusion absurd may just be my subjective opinion, but it also seems to go against the original purpose of using IO to tackle the awkward squad.

Furthermore, and this may be a more objective point, it would seem to allow us to write everything in IO and still call it 'functional'. What do I mean by that?

Functional imperative code #

If we accept that IO is pure, then we may decide to write everything in procedural style. We could, for example, implement rod-cutting by mirroring the imperative pseudocode used to describe the algorithm.

{-# LANGUAGE FlexibleContexts #-}
module RodCutting where
 
import Control.Monad (forM_, when)
import Data.Array.IO
import Data.IORef (newIORef, writeIORef, readIORef, modifyIORef)
 
cutRod :: (Ix i, Num i, Enum i, Num e, Bounded e, Ord e)
       => IOArray i e -> i -> IO (IOArray i e, IOArray i i)
cutRod p n = do
  r <- newArray_ (0, n)
  s <- newArray_ (1, n)
  writeArray r 0 0  -- r[0] = 0
  forM_ [1..n] $ \j -> do
    q <- newIORef minBound  -- q = -∞
    forM_ [1..j] $ \i -> do
      qValue <- readIORef q
      p_i <- readArray p i
      r_j_i <- readArray r (j - i)
      when (qValue < p_i + r_j_i) $ do
        writeIORef q (p_i + r_j_i)  -- q = p[i] + r[j - i]
        writeArray s j i            -- s[j] = i

    qValue' <- readIORef q
    writeArray r j qValue'  -- r[j] = q

  return (r, s)

Ironically, the cutRod action remains referentially transparent, as is the original pseudocode from CLRS. This is because the algorithm itself is deterministic, and has no (external) side effects. Even so, the Haskell type system can't 'see' that. This implementation is intrinsically IO-valued.

Functional encapsulation #

You may think that this just proves the point that IO is pure, but it doesn't. We've always known that we can lift any pure value into IO using return: return 42 remains referentially transparent, even if it's contained in IO.

The reverse isn't always true. We can't conclude that code is referentially transparent when it's contained in IO. Usually, it isn't.
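
A minimal sketch of that asymmetry, here using randomRIO from System.Random as the impure example:

import System.Random (randomRIO)
 
fortyTwo :: IO Int
fortyTwo = return 42        -- a pure value, merely lifted into IO
 
dieRoll :: IO Int
dieRoll = randomRIO (1, 6)  -- not referentially transparent; you can't replace
                            -- the action with any particular Int

Both definitions have the type IO Int, but only the first can safely be replaced by the value it contains.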

Be that as it may, why do we even care?

The problem is one of encapsulation. When an action like cutRod, above, returns an IO value, we're facing a dearth of guarantees. As users of the action, we may have many questions, most of which aren't answered by the type:

  • Does cutRod modify the input array p?
  • Is cutRod deterministic?
  • Does cutRod launch missiles?
  • Can I memoize the return values of cutRod?
  • Does cutRod somehow keep a reference to the arrays that it returns? Can I be sure that a background thread, or a subsequent API call, doesn't mutate these arrays? In other words, is there a potential aliasing problem?

At best, such a lack of guarantees leads to defensive coding, but usually it leads to bugs.

If, on the other hand, we were to write a version of cutRod that does not involve IO, we'd be able to answer all the above questions. The advantage would be that the function would be safer and easier to consume.

Referential transparency is not the same as purity #

This leads to a point that I failed to understand for years, until Tyson Williams pointed it out to me. Referential transparency is not the same as purity, although the overlap is substantial.

Venn diagram of two sets: Referential transparency and purity. The intersection is considerable.

Of course, such a claim requires me to define the terms, but I'll try to keep it light. I'll define referential transparency as the property that allows replacing a function with the value it produces. Practically, it allows memoization. On the other hand, I'll define purity as functions that Haskell can distinguish from impure actions. In practice, this implies the absence of IO.

Usually this amounts to the same thing, but as we've seen above, it's possible to write referentially transparent code that nonetheless is embedded in IO. There are also examples of functions that look pure, although they may not be referentially transparent. Fortunately these are, in my experience, more rare.
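
A commonly cited example of the latter is trace from Debug.Trace. The following sketch looks pure if you only consider its type, yet evaluating it writes to the console, and whether (and how many times) that happens depends on evaluation order and sharing:

import Debug.Trace (trace)
 
addLoudly :: Int -> Int -> Int
addLoudly x y = trace "Adding..." (x + y)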

That said, this is a digression. My agenda is to argue that IO is special. Yes, it's a Monad instance. Yes, it composes. No, it's not referentially transparent.

Semantics #

From the point of view of encapsulation, I've previously argued that referential transparency is attractive because it fits in your head. Code that is not referentially transparent usually doesn't.

Why is IO not referentially transparent? To repeat the argument that I sometimes run into, IO values describe programs. Every time your Haskell code runs, the same IO value describes the same program.

This strikes me as about as useful an assertion as insisting that all C code is referentially transparent. After all, a C program also describes the same program even if executed multiple times.

But you don't have to take my word for it. In Tackling the Awkward Squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell Simon Peyton Jones presents the semantics of Haskell.

"Our semantics is stratified in two levels: an inner denotational semantics that describes the behaviour of pure terms, while an outer monadic transition semantics describes the behaviour of IO computations."

Tackling the Awkward Squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell, Simon Peyton Jones, 2000

Over the next 20 pages, that paper goes into details on how IO is special. The point is that it has different semantics from the rest of Haskell.

Pure rod-cutting #

Before I close, I realize that the above cutRod action may cause some readers distress. To relieve the tension, I'll leave you with a pure implementation.

{-# LANGUAGE TupleSections #-}
module RodCutting (cutRod, solve) where
 
import Data.Foldable (foldl')
import Data.Map.Strict ((!))
import qualified Data.Map.Strict as Map
 
seekBetterCut :: (Ord a, Num a)
              => [a] -> Int -> (a, Map.Map Int a, Map.Map Int Int) -> Int
              -> (a, Map.Map Int a, Map.Map Int Int)
seekBetterCut p j (q, r, s) i =
  let price = p !! i
      remainingRevenue = r ! (j - i)
      (q', s') =
        if q < price + remainingRevenue then
          (price + remainingRevenue, Map.insert j i s)
        else (q, s)
 
      r' = Map.insert j q' r
  in (q', r', s')
 
findBestCut :: (Bounded a, Ord a, Num a)
            => [a] -> (Map.Map Int a, Map.Map Int Int) -> Int
            -> (Map.Map Int a, Map.Map Int Int)
findBestCut p (r, s) j =
  let q = minBound  -- q = -∞
      (_, r', s') = foldl' (seekBetterCut p j) (q, r, s) [1..j]
  in (r', s')
 
cutRod :: (Bounded a, Ord a, Num a)
       => [a] -> Int -> (Map.Map Int a, Map.Map Int Int)
cutRod p n = do
  let r = Map.fromAscList $ map (, 0) [0..n]  -- r[0:n] initialized to 0
  let s = Map.fromAscList $ map (, 0) [1..n]  -- s[1:n] initialized to 0
  foldl' (findBestCut p) (r, s) [1..n]
 
solve :: (Bounded a, Ord a, Num a) => [a] -> Int -> [Int]
solve p n =
  let (_, s) = cutRod p n
      loop l n' =
        if n' > 0 then
          let cut = s ! n'
          in loop (cut : l) (n' - cut)
        else l
      l' = loop [] n
  in reverse l'

This is a fairly direct translation of the imperative algorithm. It's possible that you could come up with something more elegant. At least, I think that I did so in F#.

Regardless of the level of elegance of the implementation, this version of cutRod advertises its properties via its type. A client developer can now trivially answer the above questions, just by looking at the type: No, the function doesn't mutate the input list p. Yes, the function is deterministic. No, it doesn't launch missiles. Yes, you can memoize it. No, there's no aliasing problem.
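
To illustrate how easy the pure version is to consume, a GHCi session with the CLRS prices from the earlier tests might look like this (I've annotated the list with Int to help type inference along; the results follow from the sizes shown in the earlier test expectations):

ghci> solve [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30 :: Int] 10
[10]
ghci> solve [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30 :: Int] 7
[1,6]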

Conclusion #

From time to time, I run into the claim that Haskell IO, being monadic and composable, is referentially transparent, and that it's only during execution that this property is lost.

I argue that such claims are of little practical interest. There are other parts of Haskell that remain referentially transparent, even during execution. Thus, IO is still special.

From a practical perspective, the reason I care about referential transparency is because the more you have of it, the simpler your code is; the better it fits in your head. The kind of referential transparency that some people argue that IO has does not have the property of making code simpler. In reality, IO code has the same inherent properties as code written in C, Python, Java, Fortran, etc.

