Is cyclomatic complexity really related to branch coverage?

Monday, 08 May 2023 05:38:00 UTC

A genuine case of doubt and bewilderment.

Regular readers of this blog may be used to its confident and opinionated tone. I write that way, not because I'm always convinced that I'm right, but because prose with too many caveats and qualifications tends to bury the message in verbose and circumlocutory ambiguity.

This time, however, I write to solicit feedback, and because I'm surprised to the edge of bemusement by a recent experience.

Collatz sequence #

Consider the following code:

public static class Collatz
{
    public static IReadOnlyCollection<int> Sequence(int n)
    {
        if (n < 1)
            throw new ArgumentOutOfRangeException(
                nameof(n),
                $"Only natural numbers allowed, but given {n}.");
 
        var sequence = new List<int>();
        var current = n;
        while (current != 1)
        {
            sequence.Add(current);
            if (current % 2 == 0)
                current = current / 2;
            else
                current = current * 3 + 1;
        }
        sequence.Add(current);
        return sequence;
    }
}

As the names imply, the Sequence function calculates the Collatz sequence for a given natural number.

Please don't tune out if that sounds mathematical and difficult, because it really isn't. While the Collatz conjecture still evades mathematical proof, the sequence is easy to calculate and understand. Given a number, produce a sequence starting with that number and stop when you arrive at 1. Every new number in the sequence is based on the previous number. If the input is even, divide it by two. If it's odd, multiply it by three and add one. Repeat until you arrive at one.

The conjecture is that any natural number will produce a finite sequence. That's the unproven part, but that doesn't concern us. In this article, I'm only interested in the above code, which computes such sequences.

Here are a few examples:

> Collatz.Sequence(1)
List<int>(1) { 1 }
> Collatz.Sequence(2)
List<int>(2) { 2, 1 }
> Collatz.Sequence(3)
List<int>(8) { 3, 10, 5, 16, 8, 4, 2, 1 }
> Collatz.Sequence(4)
List<int>(3) { 4, 2, 1 }

While there seems to be a general tendency for the sequence to grow as the input gets larger, that's clearly not a rule. The examples show that the sequence for 3 is longer than the sequence for 4.

All this, however, just sets the stage. The problem doesn't really have anything to do with Collatz sequences. I only ran into it while working with a Collatz sequence implementation that looked a lot like the above.

Cyclomatic complexity #

What is the cyclomatic complexity of the above Sequence function? If you need a reminder of how to count cyclomatic complexity, this is a good opportunity to take a moment to refresh your memory, count the number, and compare it with my answer.

Apart from the opportunity for exercise, it was a rhetorical question. The answer is 4.
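
For reference, here's one way to arrive at that number, using the common heuristic of starting at one and adding one for each branching point (the annotations below are mine, not part of the original code):

// Counting cyclomatic complexity for the Sequence function:
// 1                                 start at one for the method itself
// + 1 for if (n < 1)                the Guard Clause
// + 1 for while (current != 1)      the loop condition
// + 1 for if (current % 2 == 0)     the even/odd branch
// = 4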

This means that we'd need at least four unit tests to cover all branches. Right? Right?

Okay, let's try.

Branch coverage #

Before we start, let's make the ritual denouncement of code coverage as a target metric. The point isn't to reach 100% code coverage as such, but to gain confidence that you've added tests that cover whatever is important to you. Also, the best way to do that is usually with TDD, which isn't the situation I'm discussing here.

The first branch that we might want to cover is the Guard Clause. This is easily addressed with an xUnit.net test:

[Fact]
public void ThrowOnInvalidInput()
{
    Assert.Throws<ArgumentOutOfRangeException>(() => Collatz.Sequence(0));
}

This test calls the Sequence function with 0, which (in this context, at least) isn't a natural number.

If you measure test coverage (or, in this case, just think it through), there are no surprises yet. One branch is covered, the rest aren't. That's 25%.

(If you use the free code coverage option for .NET, it will surprisingly tell you that you're only at 16% branch coverage. It deems the cyclomatic complexity of the Sequence function to be 6, not 4, and 1/6 is 16.67%. Why it thinks it's 6 is not entirely clear to me, but Visual Studio agrees with me that the cyclomatic complexity is 4. In this particular case, it doesn't matter anyway. The conclusion that follows remains the same.)

Let's add another test case, and perhaps one that gives the algorithm a good exercise.

[Fact]
public void Example()
{
    var actual = Collatz.Sequence(5);
    Assert.Equal(new[] { 5, 16, 8, 4, 2, 1 }, actual);
}

As expected, the test passes. What's the branch coverage now?

Try to think it through instead of relying exclusively on a tool. The algorithm isn't so complicated that you can't emulate execution in your head, or perhaps with the assistance of a notepad. How many branches does it execute when the input is 5?
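
If you want to check your reasoning, here's my sketch of such a trace for the input 5, noting which branch executes at each step:

// Hand-tracing Collatz.Sequence(5):
// n = 5         guard (n < 1) is false     no exception thrown
// current = 5   while true, 5 is odd       else branch: current = 16
// current = 16  while true, 16 is even     if branch:   current = 8
// current = 8   while true, 8 is even      if branch:   current = 4
// current = 4   while true, 4 is even      if branch:   current = 2
// current = 2   while true, 2 is even      if branch:   current = 1
// current = 1   while condition is false   loop exits
// Result: { 5, 16, 8, 4, 2, 1 }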

Branch coverage is now 100%. (Even the dotnet coverage tool agrees, despite its weird cyclomatic complexity value.) All branches are exercised.

Two tests produce 100% branch coverage of a function with a cyclomatic complexity of 4.

Surprise #

That's what befuddles me. I thought that cyclomatic complexity and branch coverage were related. I thought that the number of branches was a good indicator of the number of tests you'd need to cover all branches. I even wrote an article to that effect, and no-one contradicted me.

That, in itself, is no proof of anything, but the notion that the article presents seems to be widely accepted. I never considered it controversial, and the only reason I didn't cite anyone is that this seems to be 'common knowledge'. I wasn't aware of a particular source I could cite.

Now, however, it seems that it's wrong. Is it wrong, or am I missing something?

To be clear, I completely understand why the above two tests are sufficient to fully cover the function. I also believe that I fully understand why the cyclomatic complexity is 4.

I am also painfully aware that the above two tests in no way fully specify the Collatz sequence. That's not the point.

The point is that it's possible to cover this function with only two tests, despite the cyclomatic complexity being 4. That surprises me.

Is this a known thing?

I'm sure it is. I've long since given up discovering anything new in programming.

Conclusion #

I recently encountered a function that performed a Collatz calculation similar to the one I've shown here. It exhibited the same trait, and since it had no Guard Clause, I could fully cover it with a single test case. That function even had a cyclomatic complexity of 6, so you can perhaps imagine my befuddlement.

Is it wrong, then, that cyclomatic complexity suggests a minimum number of test cases in order to cover all branches?

It seems so, but that's new to me. I don't mind being wrong on occasion. It's usually an opportunity to learn something new. If you have any insights, please leave a comment.


Comments

My first thought is that the code looks like an unrolled recursive function, so perhaps if it's refactored into a driver function and a "continuation passing style" it might make the cyclomatic complexity match the covering tests.

So given the following:

public delegate void ResultFunc(IEnumerable<int> result);
public delegate void ContFunc(int n, ResultFunc result, ContFunc cont);

public static void Cont(int n, ResultFunc result, ContFunc cont) {
  if (n == 1) {
    result(new[] { n });
    return;
  }

  void Result(IEnumerable<int> list) => result(list.Prepend(n));

  if (n % 2 == 0)
    cont(n / 2, Result, cont);
  else
    cont(n * 3 + 1, Result, cont);
}

public static IReadOnlyCollection<int> Continuation(int n) {
  if (n < 1)
    throw new ArgumentOutOfRangeException(
      nameof(n),
      $"Only natural numbers allowed, but given {n}.");

  var output = new List<int>();

  void Output(IEnumerable<int> list) => output = list.ToList();

  Cont(n, Output, Cont);

  return output;
}

I calculate the cyclomatic complexity of Continuation to be 2 and Cont to be 3.

And it would seem you need 5 tests to properly cover the code, 3 for Cont and 2 for Continuation.

But however you write the "n >=1" case for Continuation you will have to cover some of Cont.

2023-05-08 10:11 UTC
Jeroen Heijmans #

There is a relation between cyclomatic complexity and branches to cover, but it's not one of equality: cyclomatic complexity is an upper bound for the number of test cases needed to achieve full branch coverage. There's a nice example illustrating this in the Wikipedia article on cyclomatic complexity, which also explains the relation with path coverage (for which cyclomatic complexity is a lower bound).
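
A minimal sketch of that distinction (my own example, not taken from the Wikipedia article): a function with two consecutive, independent branches has a cyclomatic complexity of 3, yet two test cases achieve full branch coverage, while full path coverage requires four.

public static int F(bool a, bool b)
{
    var result = 0;
    if (a)           // decision point 1
        result += 1;
    if (b)           // decision point 2
        result += 2;
    return result;   // cyclomatic complexity: 2 decision points + 1 = 3
}

F(true, true) and F(false, false) together exercise every branch, but covering all four distinct paths requires all four input combinations.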

2023-05-08 15:03 UTC

I find cyclomatic complexity to be overly pedantic at times, and you will need four tests if you get really pedantic. First, test the guard clause as you already did. Then, test with 1 in order to test the while loop body not being run. Then, test with 2 in order to test that the while is executed, but we only hit the if part of the if/else. Finally, test with 3 in order to hit the else inside of the while. That's four tests where each test is only testing one of the branches (some tests hit more than one branch, but the "extra branch" is already covered by another test). Again, this is being really pedantic and I wouldn't test this function as laid out above (I'd probably put in the test with 1, since it's an edge case, but otherwise test as you did).
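
A sketch of those four test cases in xUnit.net (the test names are illustrative only):

[Fact]
public void GuardClauseThrows()
{
    Assert.Throws<ArgumentOutOfRangeException>(() => Collatz.Sequence(0));
}

[Fact]
public void WhileBodyNotRun()
{
    Assert.Equal(new[] { 1 }, Collatz.Sequence(1));
}

[Fact]
public void OnlyEvenBranch()
{
    Assert.Equal(new[] { 2, 1 }, Collatz.Sequence(2));
}

[Fact]
public void OddBranch()
{
    Assert.Equal(new[] { 3, 10, 5, 16, 8, 4, 2, 1 }, Collatz.Sequence(3));
}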

I don't think there's a rigorous relationship between cyclomatic complexity and number of tests. In simple cases, treating things as though the relationship exists can be helpful. But once you start having interrelated branches in a function, things get murky, and you may have to go to pedantic lengths in order to maintain the relationship. The same thing goes for code coverage, which can be 100% even though you haven't actually tested all paths through your code if there are multiple branches in the function that depend on each other.

2023-05-08 15:30 UTC

Thank you, all, for writing. I'm extraordinarily busy at the moment, so it'll take me longer than usual to respond. Rest assured, however, that I haven't forgotten.

2023-05-11 12:42 UTC

If we agree to the definition of cyclomatic complexity as the number of independent paths through a section of code, then the number of tests needed to cover that section must be the same per definition, if those tests are also independent. Independence is crucial here, and is also the main source of confusion. Both the while and if forks depend on the same variable (current), and so they are not independent.

The second test you wrote is similarly not independent, as it ends up tracing multiple paths through the if: odd for 5, and even for 16, 8, etc., and so ends up covering all paths. Had you picked 2 instead of 5 for the test, that would have been more independent, as it would not have traced the else path, requiring one additional test.

The standard way of computing cyclomatic complexity assumes independence, which simply is not possible in this case.

2023-06-02 00:38 UTC

Struan, thank you for writing, and please accept my apologies for the time it took me to respond. I agree with your calculations of cyclomatic complexity of your refactored code.

I agree with what you write, but you can't write a sentence like "however you write the "n >=1" case for [...] you will have to cover some of [..]" and expect me to just ignore it. To be clear, I agree with you in the particular case of the methods you provided, but you inspired me to refactor my code with that rule as a specific constraint. You can see the results in my new article Collatz sequences by function composition.

Thank you for the inspiration.

2023-06-12 5:46 UTC

Jeroen, thank you for writing, and please accept my apologies for the time it took me to respond. I should have read that Wikipedia article more closely, instead of just linking to it.

What still puzzles me is that I've been aware of, and actively used, cyclomatic complexity for more than a decade, and this distinction has never come up, and no-one has called me out on it.

As Cunningham's law says, the best way to get the right answer on the Internet is not to ask a question; it's to post the wrong answer. Even so, I posted Put cyclomatic complexity to good use in 2019, and no-one contradicted it.

I don't mention this as an argument that I'm right. Obviously, I was wrong, but no-one told me. Have I had something in my teeth all these years, too?

2023-06-12 6:35 UTC

Brett, thank you for writing, and please accept my apologies for the time it took me to respond. I suppose that I failed to make my overall motivation clear. When doing proper test-driven development (TDD), one doesn't need cyclomatic complexity in order to think about coverage. When following the red-green-refactor checklist, you only add enough code to pass all tests. With that process, cyclomatic complexity is rarely useful, and I tend to ignore it.

I do, however, often coach programmers in unit testing and TDD, and people new to the technique often struggle with basics. They add too much code, instead of the simplest thing that could possibly work, or they can't think of a good next test case to write.

When teaching TDD I sometimes suggest cyclomatic complexity as a metric to help decision-making. Did we add more code to the System Under Test than warranted by tests? Is it okay to forgo writing a test of a one-liner with cyclomatic complexity of one?

The metric is also useful in hybrid scenarios where you already have production code, and now you want to add characterisation tests: Which test cases should you at least write?

Another way to answer such questions is to run a code-coverage tool, but that often takes time. I find it useful to teach people about cyclomatic complexity, because it's a lightweight heuristic always at hand.

2023-06-12 7:24 UTC

Nikola, thank you for writing. The emphasis on independence is useful; I used compatible thinking in my new article Collatz sequences by function composition. By now, including the other comments to this article, it seems that we've been able to cover the problem better, and I, at least, feel that I've learned something.

I don't think, however, that the standard way of computing cyclomatic complexity assumes independence. You can easily compute the cyclomatic complexity of the above Sequence function, even though its branches aren't independent. Tooling such as Visual Studio seems to agree with me.

2023-06-13 5:32 UTC

Refactoring pure function composition without breaking existing tests

Monday, 01 May 2023 06:44:00 UTC

An example modifying a Haskell Gossiping Bus Drivers implementation.

This is an article in a series of articles about the epistemology of interaction testing. In short, this collection of articles discusses how to test the composition of pure functions. While a pure function is intrinsically testable, how do you test the composition of pure functions? As the introductory article outlines, I consider it mostly a matter of establishing confidence. With enough test coverage you can be confident that the composition produces the desired outputs.

Keep in mind that if you compose pure functions into a larger pure function, the composition is still pure. This implies that you can still test it by supplying input and verifying that the output is correct.

Tests that exercise the composition do so by verifying observable behaviour. This makes them more robust to refactoring. You'll see an example of that later in this article.

Gossiping bus drivers #

I recently did the Gossiping Bus Drivers kata in Haskell. At first, I added the tests suggested in the kata description.

{-# OPTIONS_GHC -Wno-type-defaults #-}
module Main where
 
import GossipingBusDrivers
import Test.HUnit
import Test.Framework.Providers.HUnit (hUnitTestToTests)
import Test.Framework (defaultMain)
 
main :: IO ()
main = defaultMain $ hUnitTestToTests $ TestList [
  "Kata examples" ~: do
    (routes, expected) <-
      [
        ([[3, 1, 2, 3],
          [3, 2, 3, 1],
          [4, 2, 3, 4, 5]],
         Just 5),
        ([[2, 1, 2],
          [5, 2, 8]],
         Nothing)
      ]
    let actual = drive routes
    return $ expected ~=? actual
  ]

As I prefer them, these tests are parametrised HUnit tests.

The problem with those suggested test cases is that they don't provide enough confidence that an implementation is correct. In fact, I wrote this implementation to pass them:

drive routes = if length routes == 3 then Just 5 else Nothing

This is clearly incorrect. It just looks at the number of routes and returns a fixed value for each count. It doesn't look at the contents of the routes.

Even if you don't try to deliberately cheat I'm not convinced that these two tests are enough. You could try to write the correct implementation, but how do you know that you've correctly dealt with various edge cases?

Helper function #

The kata description isn't hard to understand, so while the suggested test cases seem insufficient, I knew what was required. Perhaps I could write a proper implementation without additional tests. After all, I was convinced that it'd be possible to do it with a cyclomatic complexity of 1, and since a test function also has a cyclomatic complexity of 1, there's always that tension in test-driven development: Why write test code to exercise code with a cyclomatic complexity of 1?

To be clear: There are often good reasons to write tests even in this case, and this seems like one of them. Cyclomatic complexity indicates a minimum number of test cases, not necessarily a sufficient number.

Even though Haskell's type system is expressive, I soon found myself second-guessing the behaviour of various expressions that I'd experimented with. Sometimes I find GHCi (the Haskell REPL) sufficiently edifying, but in this case I thought that I might want to keep some test cases around for a helper function that I was developing:

import Data.List
import qualified Data.Map.Strict as Map
import Data.Map.Strict ((!))
import qualified Data.Set as Set
import Data.Set (Set)
 
evaluateStop :: (Functor f, Foldable f, Ord k, Ord a)
             => f (k, Set a) -> f (k, Set a)
evaluateStop stopsAndDrivers =
  let gossip (stop, driver) = Map.insertWith Set.union stop driver
      gossipAtStops = foldl' (flip gossip) Map.empty stopsAndDrivers
  in fmap (\(stop, _) -> (stop, gossipAtStops ! stop)) stopsAndDrivers

I was fairly confident that this function worked as I intended, but I wanted to be sure. I needed some examples, so I added these tests:

"evaluateStop examples" ~: do
  (stopsAndDrivers, expected) <- [
      ([(1, fromList [1]), (2, fromList [2]), (1, fromList [1])],
       [(1, fromList [1]), (2, fromList [2]), (1, fromList [1])]),
      ([(1, fromList [1]), (2, fromList [2]), (1, fromList [2])],
       [(1, fromList [1, 2]), (2, fromList [2]), (1, fromList [1, 2])]),
      ([(1, fromList [1, 2, 3]), (1, fromList [2, 3, 4])],
       [(1, fromList [1, 2, 3, 4]), (1, fromList [1, 2, 3, 4])])
    ]
  let actual = evaluateStop stopsAndDrivers
  return $ fromList expected ~=? fromList actual

They do, indeed, pass.

The idea behind that evaluateStop function is to evaluate the state at each 'minute' of the simulation. The first line of each test case is the state before the drivers meet, and the second line is the expected state after all drivers have gossiped.

My plan was to use some sort of left fold to keep evaluating states until all information has disseminated to all drivers.

Property #

Since I have already extolled the virtues of property-based testing in this article series, I wondered whether I could add some properties instead of relying on examples. Well, I did manage to add one QuickCheck property:

testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) ->
  let actual = drive $ fmap getNonEmpty routes
  in isJust actual ==>
     all (\i -> 0 <= i && i <= 480) actual

There's not much to talk about here. The property only states that the result of the drive function must be between 0 and 480, if it exists.

Such a property could vacuously pass if drive always returns Nothing, so I used the ==> QuickCheck combinator to make sure that the property is actually exercising only the Just cases.

Since the drive function only returns a number, apart from verifying its image I couldn't think of any other general property to add.

You can always come up with more specific properties that explicitly set up more constrained test scenarios, but is it worth it?

It's always worthwhile to stop and think. If you're writing a 'normal' example-based test, consider whether a property would be better. Likewise, if you're about to write a property, consider whether an example would be better.

'Better' can mean more than one thing. Preventing regressions is one thing, but making the code maintainable is another. If you're writing a property that is too complicated, it might be better to write a simpler example-based test.

I could definitely think of some complicated properties, but I found that more examples might make the test code easier to understand.

More examples #

After all that angst and soul-searching, I added a few more examples to the first parametrised test:

"Kata examples" ~: do
  (routes, expected) <-
    [
      ([[3, 1, 2, 3],
        [3, 2, 3, 1],
        [4, 2, 3, 4, 5]],
       Just 5),
      ([[2, 1, 2],
        [5, 2, 8]],
       Nothing),
      ([[1, 2, 3, 4, 5],
        [5, 6, 7, 8],
        [3, 9, 6]],
       Just 13),
      ([[1, 2, 3],
        [2, 1, 3],
        [2, 4, 5, 3]],
       Just 5),
      ([[1, 2],
        [2, 1]],
       Nothing),
      ([[1]],
       Just 0),
      ([[2],
        [2]],
       Just 1)
    ]
  let actual = drive routes
  return $ expected ~=? actual

The first two test cases are the same as before, and the last two are some edge cases I added myself. The middle three I adopted from another page about the kata. Since those examples turned out to be off by one, I did those examples on paper to verify that I understood what the expected value was. Then I adjusted them to my one-indexed results.

Drive #

The drive function now correctly implements the kata, I hope. At least it passes all the tests.

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
drive routes =
      -- Each driver starts with a single gossip. Any kind of value will do, as
      -- long as each is unique. Here I use the one-based index of each route,
      -- since it fulfills the requirements.
  let drivers = fmap Set.singleton [1 .. length routes]
      goal = Set.unions drivers
      stops = transpose $ fmap (take 480 . cycle) routes
      propagation =
        scanl (\ds ss -> snd <$> evaluateStop (zip ss ds)) drivers stops
  in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation

Haskell code can be information-dense, and if you don't have an integrated development environment (IDE) around, this may be hard to read.

drivers is a list of sets. Each set represents the gossip that a driver knows. At the beginning, each only knows one piece of gossip. The expression initialises each driver with a singleton set. Each piece of gossip is represented by a number, simply going from 1 to the number of routes. Incidentally, this is also the number of drivers, so you can consider the number 1 as a placeholder for the gossip that driver 1 knows, and so on.

The goal is the union of all the gossip. Once every driver's knowledge is equal to the goal the simulation can stop.

Since evaluateStop simulates one stop, the drive function needs a list of stops to fold. That's the stops value. In the very first example, you have three routes: [3, 1, 2, 3], [3, 2, 3, 1], and [4, 2, 3, 4, 5]. The first time the drivers stop (after one minute), the stops are 3, 3, and 4. That is, the first element in stops would be the list [3, 3, 4]. The next one would be [1, 2, 2], then [2, 3, 3], and so on.

My plan all along was to use some sort of left fold to repeatedly run evaluateStop over each minute. Since I need to produce a list of states, scanl was an appropriate choice. The lambda expression that I have to pass to it, though, is more complicated than I appreciate. We'll return to that in a moment.

The drive function can now index the propagation list by zipping it with the infinite list [0 ..], find the first element where all sets are equal to the goal set, and then return that index. That produces the correct results.

The need for a better helper function #

As I already warned, I wasn't happy with the lambda expression passed to scanl. It looks complicated and arcane. Is there a better way to express the same behaviour? Usually, when confronted with a nasty lambda expression like that, in Haskell my first instinct is to see if pointfree.io has a better option. Alas, (((snd <$>) . evaluateStop) .) . flip zip hardly seems an improvement. That flip zip expression to the right, however, suggests that it might help flipping the arguments to evaluateStop.

When I developed the evaluateStop helper function, I found it intuitive to define it over a list of tuples, where the first element in the tuple is the stop, and the second element is the set of gossip that the driver at that stop knows.

The tuples don't have to be in that order, though. Perhaps if I flip the tuples that would make the lambda expression more readable. It was worth a try.

Confidence #

Since this article is part of a small series about the epistemology of testing composed functions, let's take a moment to reflect on the confidence we may have in the drive function.

Keep in mind the goal of the kata: Calculate the number of minutes it takes for all gossip to spread to all drivers. There's a few tests that verify that; seven examples and a fairly vacuous QuickCheck property. Is that enough to be confident that the function is correct?

If it isn't, I think the best option you have is to add more examples. For the sake of argument, however, let's assume that the tests are good enough.

When summarising the tests that cover the drive function, I didn't count the three examples that exercise evaluateStop. Do these three test cases improve your confidence in the drive function? A bit, perhaps, but keep in mind that the kata description doesn't mandate that function. It's just a helper function I created in order to decompose the problem.

Granted, having tests that cover a helper function does, to a degree, increase my confidence in the code. I have confidence in the function itself, but that is largely irrelevant, because the problem I'm trying to solve is not implementing this particular function. On the other hand, my confidence in evaluateStop means that I have increased confidence in the code that calls it.

Compared to interaction-based testing, I'm not testing that drive calls evaluateStop, but I can still verify that this happens. I can just look at the code.

The composition is already there in the code. What do I gain from replicating that composition with Stubs and Spies?

It's not a breaking change if I decide to implement drive in a different way.

What gives me confidence when composing pure functions isn't that I've subjected the composition to an interaction-based test. Rather, it's that the function is composed from trustworthy components.

Strangler #

My main grievance with Stubs and Spies is that they break encapsulation. This may sound abstract, but is a real problem. This is the underlying reason that so many tests break when you refactor code.

This example code base, as other functional code that I write, avoids interaction-based testing. This makes it easier to refactor the code, as I will now demonstrate.

My goal is to change the evaluateStop helper function by flipping the tuples. If I just edit it, however, I'm going to (temporarily) break the drive function.

Katas typically result in small code bases where you can get away with a lot of bad practices that wouldn't work in a larger code base. To be honest, the refactoring I have in mind can be completed in a few minutes with a brute-force approach. Imagine, however, that we can't break compatibility of the evaluateStop function for the time being. Perhaps, had we had a larger code base, other code might have depended on this function. At the very least, the tests do.

Instead of brute-force changing the function, I'm going to make use of the Strangler pattern, as I've also described in my book Code That Fits in Your Head.

Leave the existing function alone, and add a new one. You can typically copy and paste the existing code and then make the necessary changes. In that way, you break neither client code nor tests, because there are none.

evaluateStop' :: (Functor f, Foldable f, Ord k, Ord a)
              => f (Set a, k) -> f (Set a, k)
evaluateStop' driversAndStops =
  let gossip (driver, stop) = Map.insertWith Set.union stop driver
      gossipAtStops = foldl' (flip gossip) Map.empty driversAndStops
  in fmap (\(_, stop) -> (gossipAtStops ! stop, stop)) driversAndStops

In a language like C# you can often get away with overloading a method name, but Haskell doesn't have overloading. Since I consider this side-by-side situation to be temporary, I've appended a prime after the function name. This is a fairly normal convention in Haskell, I gather.

The only change this function represents is that I've swapped the tuple order.

Once you've added the new function, you may want to copy, paste and edit the tests. Or perhaps you want to do the tests first. During this process, make micro-commits so that you can easily suspend your 'refactoring' activity if something more important comes up.

Once everything is in place, you can change the drive function:

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
drive routes =
      -- Each driver starts with a single gossip. Any kind of value will do, as
      -- long as each is unique. Here I use the one-based index of each route,
      -- since it fulfills the requirements.
  let drivers = fmap Set.singleton [1 .. length routes]
      goal = Set.unions drivers
      stops = transpose $ fmap (take 480 . cycle) routes
      propagation =
        scanl (\ds ss -> fst <$> evaluateStop' (zip ds ss)) drivers stops
  in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation

Notice that the type of drive hasn't changed, and neither has the behaviour. This means that although I've changed the composition (the interaction), no tests broke.

Finally, once I moved all code over, I deleted the old function and renamed the new one to take its place.

Was it all worth it? #

At first glance, it doesn't look as though much was gained. What happens if I eta-reduce the new lambda expression?

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
drive routes =
      -- Each driver starts with a single gossip. Any kind of value will do, as
      -- long as each is unique. Here I use the one-based index of each route,
      -- since it fulfills the requirements.
  let drivers = fmap Set.singleton [1 .. length routes]
      goal = Set.unions drivers
      stops = transpose $ fmap (take 480 . cycle) routes
      propagation = scanl (((fmap fst . evaluateStop) .) . zip) drivers stops
  in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation

Not much better. I can now fit the propagation expression on a single line of code and still stay within an 80x24 box, but that's about it. Is ((fmap fst . evaluateStop) .) . zip more readable than what we had before?

Hardly, I admit. I might consider reverting, and since I've been using Git tactically, I have that option.

If I hadn't tried, though, I wouldn't have known.

Conclusion #

When composing one pure function with another, how can you test that the outer function correctly calls the inner function?

By the same way that you test any other pure function. The only way you can observe whether a pure function works as intended is to compare its actual output to the output you expect its input to produce. How it arrives at that output is irrelevant. It could be looking up all results in a big table. As long as the result is correct, the function is correct.

In this article, you saw an example of how to test a composed function, as well as how to refactor it without breaking tests.

Next: When is an implementation detail an implementation detail?


Are pull requests bad because they originate from open-source development?

Monday, 24 April 2023 06:08:00 UTC

I don't think so, and at least find the argument flawed.

Increasingly I come across a quote that goes like this:

Pull requests were invented for open source projects where you want to gatekeep changes from people you don't know and don't trust to change the code safely.

If you're wondering where that 'quote' comes from, then read on. I'm not trying to stand up a straw man, but I had to do a bit of digging in order to find the source of what almost seems like a meme.

Quote investigation #

The quote is usually attributed to Dave Farley, who is a software luminary that I respect tremendously. Even with the attribution, the source is typically missing, but after asking around, Mitja Bezenšek pointed me in the right direction.

The source is most likely a video, from which I've transcribed a longer passage:

"Pull requests were invented to gatekeep access to open-source projects. In open source, it's very common that not everyone is given free access to changing the code, so contributors will issue a pull request so that a trusted person can then approve the change.

"I think this is really bad way to organise a development team.

"If you can't trust your team mates to make changes carefully, then your version control system is not going to fix that for you."

I've made an effort to transcribe as faithfully as possible, but if you really want to be sure what Dave Farley said, watch the video. The quote comes twelve minutes in.

My biases #

I agree that the argument sounds compelling, but I find it flawed. Before I proceed to put forward my arguments I want to make my own biases clear. Arguing against someone like Dave Farley is not something I take lightly. As far as I can tell, he's worked on systems more impressive than any I can showcase. I also think he has more industry experience than I have.

That doesn't necessarily make him right, but on the other hand, why should you side with me, with my less impressive résumé?

My objective is not to attack Dave Farley, or any other person for that matter. My agenda is the argument itself. I do, however, find it intellectually honest to cite sources, with the associated risk that my argument may look like a personal attack. To steelman my opponent, then, I'll try to put my own biases on display. To the degree I'm aware of them.

I prefer pull requests over pair and ensemble programming. I've tried all three, and I do admit that real-time collaboration has obvious advantages, but I find pairing or ensemble programming exhausting.

Since I read Quiet a decade ago, I've been alert to the introspective side of my personality. Although I agree with Brian Marick that one should be wary of understanding personality traits as destiny, I mostly prefer solo activities.

Increasingly, since I became self-employed, I've arranged my life to maximise the time I can work from home. The exercise regimen I've chosen for myself is independent of other people: I run, and lift weights at home. You may have noticed that I like writing. I like reading as well. And, hardly surprising, I prefer writing code in splendid isolation.

Even so, I find it perfectly possible to have meaningful relationships with other people. After all, I've been married to the same woman for decades, my (mostly) grown kids haven't fled from home, and I have friends that I've known for decades.

In a toot that I can no longer find, Brian Marick asked (and I paraphrase from memory): If you've tried a technique and didn't like it, what would it take to make you like it?

As a self-professed introvert, social interaction does tire me, but I still enjoy hanging out with friends or family. What makes those interactions different? Well, often, there's good food and wine involved. Perhaps ensemble programming would work better for me with a bottle of Champagne.

Other forces influence my preferences as well. I like the flexibility provided by asynchrony, and similarly dislike having to be somewhere at a specific time.

Having to be somewhere also involves transporting myself there, which I also don't appreciate.

In short, I prefer pull requests over pairing and ensemble programming. All of that, however, is just my subjective opinion, and that's not an argument.

Counter-examples #

The above tirade about my biases is not a refutation of Dave Farley's argument. Rather, I wanted to put my own blind spots on display. If you suspect me of motivated reasoning, that just might be the case.

All that said, I want to challenge the argument.

First, it includes an appeal to trust, which is a line of reasoning with which I don't agree. You can't trust your colleagues, just like you can't trust yourself. A code review serves more purposes than keeping malicious actors out of the code base. It also helps catch mistakes, security issues, or misunderstandings. It can also improve shared understanding of common goals and standards. Yes, this is also possible with other means, such as pair or ensemble programming, but from that, it doesn't follow that code reviews can't do that. They can. I've lived that dream.

If you take away the appeal to trust, though, there isn't much left of the argument. What remains is essentially: Pull requests were invented to solve a particular problem in open-source development. Internal software development is not open source. Pull requests are bad for internal software development.

That an invention was done in one context, however, doesn't preclude it from being useful in another. Git was invented to address an open-source problem. Should we stop using Git for internal software development?

Solar panels were originally developed for satellites and space probes. Does that mean that we shouldn't use them on Earth?

GPS was invented for use by the US military. Does that make civilian use wrong?

Are pull requests bad? #

I find the original argument logically flawed, but if I insist on logic, I'm also obliged to admit that my possible-world counter-examples don't prove that pull requests are good.

Dave Farley's claim may still turn out to be true. Not because of the argument he gives, but perhaps for other reasons.

I think I understand where the dislike of pull requests comes from. As they are often practised, pull requests can sit for days with no-one looking at them. This creates unnecessary delays. If this is the only way you know of working with pull requests, no wonder you don't like them.

I advocate a more agile workflow for pull requests. I consider that congruent with my view on agile development.

Conclusion #

Pull requests are often misused, but they don't have to be. On the other hand, that's just my experience and subjective preference.

Dave Farley has argued that pull requests are a bad way to organise a development team. I've argued that the argument is logically flawed.

The question remains unsettled. I've attempted to refute one particular argument, and even if you accept my counter-examples, pull requests may still be bad. Or good.


Comments

Another important angle, for me, is that pull requests are not merely code review. They can also be a way of enforcing a variety of automated checks, e.g. running tests or linting. This enforces quality too - so I'd argue for using pull requests even if you don't do peer review (I do on my hobby projects at least, for the exact reasons you mentioned in On trust in software development - I don't trust myself to be perfect.)

2023-04-26 10:26 UTC

Casper, thank you for writing. Indeed, other readers have made similar observations on other channels (Twitter, Mastodon). That, too, can be a benefit.

In order to once more steel-man 'the other side', they'd probably say that you can run automated checks in your Continuous Delivery pipeline, and halt it if automated checks fail.

When done this way, it's useful to be able to also run the same tests on your dev box. I consider that a good practice anyway.

2023-04-28 14:49 UTC

A restaurant example of refactoring from example-based to property-based testing

Monday, 17 April 2023 06:37:00 UTC

A C# example with xUnit.net and FsCheck.

This is the second comprehensive example that accompanies the article Epistemology of interaction testing. In that article, I argue that in a code base that leans toward functional programming (FP), property-based testing is a better fit than interaction-based testing. In this example, I will show how to refactor realistic state-based tests into (state-based) property-based tests.

The previous article showed a minimal and self-contained example that had the advantage of being simple, but the disadvantage of being perhaps too abstract and unrelatable. In this article, then, I will attempt to show a more realistic and concrete example. It actually doesn't start with interaction-based testing, since it's already written in the style of Functional Core, Imperative Shell. On the other hand, it shows how to refactor from concrete example-based tests to property-based tests.

I'll use the online restaurant reservation code base that accompanies my book Code That Fits in Your Head.

Smoke test #

I'll start with a simple test which was, if I remember correctly, the second test I wrote for this code base. It was a smoke test that I wrote to drive a walking skeleton. It verifies that if you post a valid reservation request to the system, you receive an HTTP response in the 200 range.

[Fact]
public async Task PostValidReservation()
{
    using var api = new LegacyApi();
 
    var expected = new ReservationDto
    {
        At = DateTime.Today.AddDays(778).At(19, 0)
                .ToIso8601DateTimeString(),
        Email = "katinka@example.com",
        Name = "Katinka Ingabogovinanana",
        Quantity = 2
    };
    var response = await api.PostReservation(expected);
 
    response.EnsureSuccessStatusCode();
    var actual = await response.ParseJsonContent<ReservationDto>();
    Assert.Equal(expected, actual, new ReservationDtoComparer());
}

Over the lifetime of the code base, I embellished and edited the test to reflect the evolution of the system as well as my understanding of it. Thus, when I wrote it, it may not have looked exactly like this. Even so, I kept it around even though other, more detailed tests eventually superseded it.

One characteristic of this test is that it's quite concrete. When I originally wrote it, I hard-coded the date and time as well. Later, however, I discovered that I had to make the time relative to the system clock. Thus, as you can see, the At property isn't a literal value, but all other properties (Email, Name, and Quantity) are.

This test is far from abstract or data-driven. Is it possible to turn such a test into a property-based test? Yes, I'll show you how.

A word of warning before we proceed: Tests with concrete, literal, easy-to-understand examples are valuable as programmer documentation. A person new to the code base can peruse such tests and learn about the system. Thus, this test is already quite valuable as it is. In a real, living code base, I'd prefer leaving it as it is, instead of turning it into a property-based test.

Since it's a simple and concrete test, on the other hand, it's easy to understand, and thus also a good place to start. So I'm going to refactor it into a property-based test; not because I think that you should (I don't), but because I think it'll be easy for you, the reader, to follow along. In other words, it's a good introduction to the process of turning a concrete test into a property-based test.

Adding parameters #

This code base already uses FsCheck so it makes sense to stick to that framework for property-based testing. While it's written in F# you can use it from C# as well. The easiest way to use it is as a parametrised test. This is possible with the FsCheck.Xunit glue library. In fact, as I refactor the PostValidReservation test, it'll look much like the AutoFixture-driven tests from the previous article.

When turning concrete examples into properties, it helps to consider whether literal values are representative of an equivalence class. In other words, is that particular value important, or is there a wider set of values that would be just as good? For example, why is the test making a reservation 778 days in the future? Why not 777 or 779? Is the value 778 important? Not really. What's important is that the reservation is in the future. How far in the future actually isn't important. Thus, we can replace the literal value 778 with a parameter:

[Property]
public async Task PostValidReservation(PositiveInt days)
{
    using var api = new LegacyApi();
 
    var expected = new ReservationDto
    {
        At = DateTime.Today.AddDays((int)days).At(19, 0)
                .ToIso8601DateTimeString(),
        // The rest of the test...

Notice that I've replaced the literal value 778 with the method parameter days. The PositiveInt type is a type from FsCheck. It's a wrapper around int that guarantees that the value is positive. This is important because we don't want to make a reservation in the past. The PositiveInt type is a good choice because it's a type that's already available with FsCheck, and the framework knows how to generate valid values. Since it's a wrapper, though, the test needs to unwrap the value before using it. This is done with the (int)days cast.

Notice, also, that I've replaced the [Fact] attribute with the [Property] attribute that comes with FsCheck.Xunit. This is what enables FsCheck to automatically generate test cases and feed them to the test method. You can't always do this, as you'll see later, but when you can, it's a nice and succinct way to express a property-based test.

Already, the PostValidReservation test method runs 100 test cases (the FsCheck default), rather than one.

What about Email and Name? Is it important for the test that these values are exactly katinka@example.com and Katinka Ingabogovinanana or might other values do? The answer is that it's not important. What's important is that the values are valid, and essentially any non-null string is. Thus, we can replace the literal values with parameters:

[Property]
public async Task PostValidReservation(
    PositiveInt days,
    StringNoNulls email,
    StringNoNulls name)
{
    using var api = new LegacyApi();
 
    var expected = new ReservationDto
    {
        At = DateTime.Today.AddDays((int)days).At(19, 0)
                .ToIso8601DateTimeString(),
        Email = email.Item,
        Name = name.Item,
        Quantity = 2
    };
    var response = await api.PostReservation(expected);
 
    response.EnsureSuccessStatusCode();
    var actual = await response.ParseJsonContent<ReservationDto>();
    Assert.Equal(expected, actual, new ReservationDtoComparer());
}

The StringNoNulls type is another FsCheck wrapper, this time around string. It ensures that FsCheck will generate no null strings. This time, however, a cast isn't possible, so instead I had to pull the wrapped string out of the value with the Item property.

That's enough conversion to illustrate the process.

What about the literal values 19, 0, or 2? Shouldn't we parametrise those as well? While we could, that takes a bit more effort. The problem is that with these values, any old positive integer isn't going to work. For example, the number 19 is the hour component of the reservation time; that is, the reservation is for 19:00. Clearly, we can't just let FsCheck generate any positive integer, because most integers aren't going to work. For example, 5 doesn't work because it's in the early morning, and the restaurant isn't open at that time.

Like other property-based testing frameworks FsCheck has an API that enables you to constrain value generation, but it doesn't work with the type-based approach I've used so far. Unlike PositiveInt there's no TimeBetween16And21 wrapper type.

You'll see what you can do to control how FsCheck generates values, but I'll use another test for that.

Parametrised unit test #

The PostValidReservation test is a high-level smoke test that gives you an idea about how the system works. It doesn't, however, reveal much about the possible variations in input. To drive such behaviour, I wrote and evolved the following state-based test:

[Theory]
[InlineData(1049, 19, 00, "juliad@example.net", "Julia Domna", 5)]
[InlineData(1130, 18, 15, "x@example.com", "Xenia Ng", 9)]
[InlineData( 956, 16, 55, "kite@example.edu", null, 2)]
[InlineData( 433, 17, 30, "shli@example.org", "Shanghai Li", 5)]
public async Task PostValidReservationWhenDatabaseIsEmpty(
    int days,
    int hours,
    int minutes,
    string email,
    string name,
    int quantity)
{
    var at = DateTime.Now.Date + new TimeSpan(days, hours, minutes, 0);
    var db = new FakeDatabase();
    var sut = new ReservationsController(
        new SystemClock(),
        new InMemoryRestaurantDatabase(Grandfather.Restaurant),
        db);
    var expected = new Reservation(
        new Guid("B50DF5B1-F484-4D99-88F9-1915087AF568"),
        at,
        new Email(email),
        new Name(name ?? ""),
        quantity);
 
    await sut.Post(expected.ToDto());
 
    Assert.Contains(expected, db.Grandfather);
}

This test gives more details, without exercising all possible code paths of the system. It's still a Facade Test that covers 'just enough' of the integration with underlying components to provide confidence that things work as they should. All the business logic is implemented by a class called MaitreD, which is covered by its own set of targeted unit tests.

While parametrised, this is still only four test cases, so perhaps you don't have sufficient confidence that everything works as it should. Perhaps, as I've outlined in the introductory article, it would help if we converted it to an FsCheck property.

Parametrised property #

I find it safest to refactor this parametrised test to a property in a series of small steps. This implies that I need to keep the [InlineData] attributes around for a while longer, removing one or two literal values at a time, turning them into randomly generated values.

From the previous test we know that the Email and Name values are almost unconstrained. This means that they are trivial for FsCheck to generate. That change, in itself, is easy, which is good, because combining an [InlineData]-driven [Theory] with an FsCheck property is enough of a mouthful for one refactoring step:

[Theory]
[InlineData(1049, 19, 00, 5)]
[InlineData(1130, 18, 15, 9)]
[InlineData( 956, 16, 55, 2)]
[InlineData( 433, 17, 30, 5)]
public void PostValidReservationWhenDatabaseIsEmpty(
    int days,
    int hours,
    int minutes,
    int quantity)
{
    Prop.ForAll(
        (from r in Gens.Reservation
         select r).ToArbitrary(),
        async r =>
        {
            var at = DateTime.Now.Date + new TimeSpan(days, hours, minutes, 0);
            var db = new FakeDatabase();
            var sut = new ReservationsController(
                new SystemClock(),
                new InMemoryRestaurantDatabase(Grandfather.Restaurant),
                db);
            var expected = r
                .WithQuantity(quantity)
                .WithDate(at);
 
            await sut.Post(expected.ToDto());
 
            Assert.Contains(expected, db.Grandfather);
        }).QuickCheckThrowOnFailure();
}

I've now managed to get rid of the email and name parameters, so I've also removed those values from the [InlineData] attributes. Instead, I've asked FsCheck to generate a valid reservation r, which comes with both valid Email and Name.

It turned out that this code base already had some custom generators in a static class called Gens, so I reused those:

internal static Gen<Email> Email =>
    from s in ArbMap.Default.GeneratorFor<NonWhiteSpaceString>()
    select new Email(s.Item);
 
internal static Gen<Name> Name =>
    from s in ArbMap.Default.GeneratorFor<StringNoNulls>()
    select new Name(s.Item);
 
internal static Gen<Reservation> Reservation =>
    from id in ArbMap.Default.GeneratorFor<Guid>()
    from d in ArbMap.Default.GeneratorFor<DateTime>()
    from e in Email
    from n in Name
    from q in ArbMap.Default.GeneratorFor<PositiveInt>()
    select new Reservation(id, d, e, n, q.Item);

As was also the case with CsCheck you typically use syntactic sugar for monads (which in C# is query syntax) to compose complex test data generators from simpler generators. This enables me to generate an entire Reservation object with a single expression.

Time of day #

Some of the values (such as the reservation's name and email address) that are involved in the PostValidReservationWhenDatabaseIsEmpty test don't really matter. Other values are constrained in some way. Even for the reservation r the above version of the test has to override the arbitrarily generated r value with a specific quantity and a specific at value. This is because you can't just reserve any quantity at any time of day. The restaurant has opening hours and actual tables. Most likely, it doesn't have a table for 100 people at 3 in the morning.

This particular test actually exercises a particular restaurant called Grandfather.Restaurant (because it was the original restaurant that was grandfathered in when the system was expanded to a multi-tenant system). It opens at 16 and has the last seating at 21. This means that the at value has to be between 16 and 21. What's the best way to generate a DateTime value that satisfies this constraint?

You could, naively, ask FsCheck to generate an integer between these two values. You'll see how to do that when we get to the quantity. While that would work for the at value, it would only generate the whole hours 16:00, 17:00, 18:00, etcetera. It would be nice if the test could also exercise times such as 18:30, 20:45, and so on. On the other hand, perhaps we don't want weird reservation times such as 17:09:23.282. How do we tell FsCheck to generate a DateTime value like that?
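
One option would be to hand-roll a generator; here's a sketch that assumes quarter-hour resolution between 16:00 and 21:00 is acceptable:

// A sketch of a hand-rolled time-of-day generator. The quarter-hour resolution
// is an assumption on my part; nothing in the system dictates it.
Gen<TimeSpan> timeOfDay =
    from quarters in Gen.Choose(16 * 4, 21 * 4)
    select TimeSpan.FromMinutes(quarters * 15);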

It's definitely possible to do from scratch, but I chose to do something else. The following shows how test code and production code can co-exist in a symbiotic relationship. The main business logic component that deals with reservations in the system is a class called MaitreD. One of its methods is used to generate a list of time slots for every day. A user interface can use that list to populate a drop-down list of available times. The method is called Segment and can also be used as a data source for an FsCheck test data generator:

internal static Gen<TimeSpan> ReservationTime(
    Restaurant restaurant,
    DateTime date)
{
    var slots = restaurant.MaitreD
        .Segment(date, Enumerable.Empty<Reservation>())
        .Select(ts => ts.At.TimeOfDay);
    return Gen.Elements(slots);
}

The Gen.Elements function is an FsCheck combinator that randomly picks a value from a collection. This one, then, picks one of the DateTime values generated by MaitreD.Segment.

The PostValidReservationWhenDatabaseIsEmpty test can now use the ReservationTime generator to produce a time of day:

[Theory]
[InlineData(5)]
[InlineData(9)]
[InlineData(2)]
public void PostValidReservationWhenDatabaseIsEmpty(int quantity)
{
    var today = DateTime.Now.Date;
    Prop.ForAll(
        (from days in ArbMap.Default.GeneratorFor<PositiveInt>()
         from t in Gens.ReservationTime(Grandfather.Restaurant, today)
         let offset = TimeSpan.FromDays((int)days) + t
         from r in Gens.Reservation
         select (r, offset)).ToArbitrary(),
        async t =>
        {
            var at = today + t.offset;
            var db = new FakeDatabase();
            var sut = new ReservationsController(
                new SystemClock(),
                new InMemoryRestaurantDatabase(Grandfather.Restaurant),
                db);
            var expected = t.r
                .WithQuantity(quantity)
                .WithDate(at);
 
            await sut.Post(expected.ToDto());
 
            Assert.Contains(expected, db.Grandfather);
        }).QuickCheckThrowOnFailure();
}

Granted, the test code is getting more and more busy, but there's room for improvement. Before I simplify it, though, I think that it's more prudent to deal with the remaining literal values.

Notice that the InlineData attributes now only supply a single value each: The quantity.

Quantity #

Like the at value, the quantity is constrained. It must be a positive integer, but it can't be larger than the capacity of the largest table in the restaurant. That capacity, however, isn't that hard to find:

var maxCapacity = restaurant.MaitreD.Tables.Max(t => t.Capacity);

The FsCheck API includes a function that generates a random number within a given range. It's called Gen.Choose, and now that we know the range, we can use it to generate the quantity value. Here, I'm only showing the test-data-generator part of the test, since the rest doesn't change that much. You'll see the full test again after a few more refactorings.

var today = DateTime.Now.Date;
var restaurant = Grandfather.Restaurant;
var maxCapacity = restaurant.MaitreD.Tables.Max(t => t.Capacity);
Prop.ForAll(
    (from days in ArbMap.Default.GeneratorFor<PositiveInt>()
     from t in Gens.ReservationTime(restaurant, today)
     let offset = TimeSpan.FromDays((int)days) + t
     from quantity in Gen.Choose(1, maxCapacity)
     from r in Gens.Reservation
     select (r.WithQuantity(quantity), offset)).ToArbitrary(),

There are now no more literal values in the test. In a sense, the refactoring from parametrised test to property-based test is complete. It could do with a bit of cleanup, though.

Simplification #

There's no longer any need to pass along the offset variable, and the explicit QuickCheckThrowOnFailure also seems a bit redundant. I can use the [Property] attribute from FsCheck.Xunit instead.

[Property]
public Property PostValidReservationWhenDatabaseIsEmpty()
{
    var today = DateTime.Now.Date;
    var restaurant = Grandfather.Restaurant;
    var maxCapacity = restaurant.MaitreD.Tables.Max(t => t.Capacity);
    return Prop.ForAll(
        (from days in ArbMap.Default.GeneratorFor<PositiveInt>()
         from t in Gens.ReservationTime(restaurant, today)
         let at = today + TimeSpan.FromDays((int)days) + t
         from quantity in Gen.Choose(1, maxCapacity)
         from r in Gens.Reservation
         select r.WithQuantity(quantity).WithDate(at)).ToArbitrary(),
        async expected =>
        {
            var db = new FakeDatabase();
            var sut = new ReservationsController(
                new SystemClock(),
                new InMemoryRestaurantDatabase(restaurant),
                db);
 
            await sut.Post(expected.ToDto());
 
            Assert.Contains(expected, db.Grandfather);
        });
}

Compared to the initial version of the test, it has become more top-heavy. It's about the same size, though. The original version was 30 lines of code. This version is only 26 lines of code, but it is admittedly more information-dense. The original version had more 'noise' interleaved with the 'signal'. The new variation actually has a better separation of data generation and the test itself. Consider the 'actual' test code:

var db = new FakeDatabase();
var sut = new ReservationsController(
    new SystemClock(),
    new InMemoryRestaurantDatabase(restaurant),
    db);
 
await sut.Post(expected.ToDto());
 
Assert.Contains(expected, db.Grandfather);

If we could somehow separate the data generation from the test itself, we might have something that was quite readable.

Extract test data generator #

The above data generation consists of a bit of initialisation and a query expression. Like all pure functions, it's easy to extract:

private static Gen<(Restaurant, Reservation)>
    GenValidReservationForEmptyDatabase()
{
    var today = DateTime.Now.Date;
    var restaurant = Grandfather.Restaurant;
    var capacity = restaurant.MaitreD.Tables.Max(t => t.Capacity);
 
    return from days in ArbMap.Default.GeneratorFor<PositiveInt>()
           from t in Gens.ReservationTime(restaurant, today)
           let at = today + TimeSpan.FromDays((int)days) + t
           from quantity in Gen.Choose(1, capacity)
           from r in Gens.Reservation
           select (restaurant, r.WithQuantity(quantity).WithDate(at));
}

While it's quite specialised, it leaves the test itself small and readable:

[Property]
public Property PostValidReservationWhenDatabaseIsEmpty()
{
    return Prop.ForAll(
        GenValidReservationForEmptyDatabase().ToArbitrary(),
        async t =>
        {
            var (restaurant, expected) = t;
            var db = new FakeDatabase();
            var sut = new ReservationsController(
                new SystemClock(),
                new InMemoryRestaurantDatabase(restaurant),
                db);
 
            await sut.Post(expected.ToDto());
 
            Assert.Contains(expected, db[restaurant.Id]);
        });
}

That's not the only way to separate test and data generation.

Test as implementation detail #

The above separation refactors the data-generating expression to a private helper function. Alternatively you can keep all that FsCheck infrastructure code in the public test method and extract the test body itself to a private helper method:

[Property]
public Property PostValidReservationWhenDatabaseIsEmpty()
{
    var today = DateTime.Now.Date;
    var restaurant = Grandfather.Restaurant;
    var capacity = restaurant.MaitreD.Tables.Max(t => t.Capacity);
 
    var g = from days in ArbMap.Default.GeneratorFor<PositiveInt>()
            from t in Gens.ReservationTime(restaurant, today)
            let at = today + TimeSpan.FromDays((int)days) + t
            from quantity in Gen.Choose(1, capacity)
            from r in Gens.Reservation
            select (restaurant, r.WithQuantity(quantity).WithDate(at));
 
    return Prop.ForAll(
        g.ToArbitrary(),
        t => PostValidReservationWhenDatabaseIsEmptyImp(
            t.restaurant,
            t.Item2));
}

At first glance, that doesn't look like an improvement, but it has the advantage that the actual test method is now devoid of FsCheck details. If we use that as a yardstick for how decoupled the test is from FsCheck, this seems cleaner.

private static async Task PostValidReservationWhenDatabaseIsEmptyImp(
    Restaurant restaurant, Reservation expected)
{
    var db = new FakeDatabase();
    var sut = new ReservationsController(
        new SystemClock(),
        new InMemoryRestaurantDatabase(restaurant),
        db);
 
    await sut.Post(expected.ToDto());
 
    Assert.Contains(expected, db[restaurant.Id]);
}

Using a property-based testing framework in C# is still more awkward than in a language with better support for monadic composition and pattern matching. That said, more recent versions of C# do have better pattern matching on tuples, but this code base is still on C# 8.

If you still think that this looks more complicated than the initial version of the test, then I agree. Property-based testing isn't free, but you get something in return. We started with four test cases and ended with 100. And that's just the default. If you want to increase the number of test cases, that's just an API call away. You could run 1,000 or 10,000 test cases if you wanted to. The only real downside is that the tests take longer to run.
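As a sketch (not taken from the restaurant code base), the MaxTest property of FsCheck.Xunit's [Property] attribute is one way to dial up the number of generated cases:

// Sketch only: run 1,000 generated cases instead of the default 100.
[Property(MaxTest = 1000)]
public bool AdditionIsCommutative(int x, int y)
{
    return x + y == y + x;
}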

Unhappy paths #

The tests above all test the happy path. A valid request arrives and the system is in a state where it can accept it. This small article series is, you may recall, a response to an email from Sergei Rogovtsev. In his email, he mentioned the need to test both happy path and various error scenarios. Let's cover a few before wrapping up.

As I was developing the system and fleshing out its behaviour, I evolved this parametrised test:

[Theory]
[InlineData(null"j@example.net""Jay Xerxes", 1)]
[InlineData("not a date""w@example.edu""Wk Hd", 8)]
[InlineData("2023-11-30 20:01"null"Thora", 19)]
[InlineData("2022-01-02 12:10""3@example.org""3 Beard", 0)]
[InlineData("2045-12-31 11:45""git@example.com""Gil Tan", -1)]
public async Task PostInvalidReservation(
    string at,
    string email,
    string name,
    int quantity)
{
    using var api = new LegacyApi();
    var response = await api.PostReservation(
        new { at, email, name, quantity });
    Assert.Equal(HttpStatusCode.BadRequest, response.StatusCode);
}

The test body itself is about as minimal as it can be. There are five test cases that I added one or two at a time.

  • The first test case covers what happens if the at value is missing (i.e. null)
  • The next test case covers a malformed at value
  • The third test case covers a missing email address
  • The last two test cases cover non-positive quantities: both 0 and a negative number

It's possible to combine FsCheck generators that deal with each of these cases, but here I want to demonstrate how it's still possible to keep each error case separate, if that's what you need. First, separate the test body from its data source, like I did above:

[Theory]
[InlineData(null"j@example.net""Jay Xerxes", 1)]
[InlineData("not a date""w@example.edu""Wk Hd", 8)]
[InlineData("2023-11-30 20:01"null"Thora", 19)]
[InlineData("2022-01-02 12:10""3@example.org""3 Beard", 0)]
[InlineData("2045-12-31 11:45""git@example.com""Gil Tan", -1)]
public async Task PostInvalidReservation(
    string at,
    string email,
    string name,
    int quantity)
{
    await PostInvalidReservationImp(at, email, name, quantity);
}
 
private static async Task PostInvalidReservationImp(
    string at,
    string email,
    string name,
    int quantity)
{
    using var api = new LegacyApi();
    var response = await api.PostReservation(
        new { at, email, name, quantity });
    Assert.Equal(HttpStatusCode.BadRequest, response.StatusCode);
}

If you consider this refactoring in isolation, it seems frivolous, but it's just preparation for further work. In the subsequent refactorings I'll convert each of the above error cases to a property.

Missing date and time #

Starting from the top, convert the reservation-at-null test case to a property:

[Property]
public async Task PostReservationAtNull(string email, string name, PositiveInt quantity)
{
    await PostInvalidReservationImp(null, email, name, (int)quantity);
}

I've left the parametrised PostInvalidReservation test in place, but removed the [InlineData] attribute with the null value for the at parameter:

[Theory]
[InlineData("not a date""w@example.edu""Wk Hd", 8)]
[InlineData("2023-11-30 20:01"null"Thora", 19)]
[InlineData("2022-01-02 12:10""3@example.org""3 Beard", 0)]
[InlineData("2045-12-31 11:45""git@example.com""Gil Tan", -1)]
public async Task PostInvalidReservation(

The PostReservationAtNull property can use the FsCheck.Xunit [Property] attribute, because any string can be used for email and name.

To be honest, it is, perhaps, cheating a bit to post any positive quantity, because a number like, say, 1837 would be a problem even if the posted representation was well-formed and valid, since no table of the restaurant has that capacity.

Validation does, however, happen before evaluating business rules and application state, so the way the system is currently implemented, the test never fails because of that. The service never gets to that part of handling the request.

One might argue that this is relying on (and thereby coupling to) an implementation detail, but honestly, it seems unlikely that the service would begin processing an invalid request - 'invalid' implying that the request makes no sense. Concretely, if the date and time is missing from a reservation, how can the service begin to process it? On which date? At what time?

Thus, it's not that likely that this behaviour would change in the future, and therefore unlikely that the test would fail because of a policy change. It is, however, worth considering.

Malformed date and time #

The next error case is when the at value is present, but malformed. You can also convert that case to a property:

[Property]
public Property PostMalformedDateAndTime()
{
    var g = from at in ArbMap.Default.GeneratorFor<string>()
                .Where(s => !DateTime.TryParse(s, out _))
            from email in Gens.Email
            from name in Gens.Name
            from quantity in Gen.Choose(1, 10)
            select (at,
                    email: email.ToString(),
                    name: name.ToString(),
                    quantity);
 
    return Prop.ForAll(
        g.ToArbitrary(),
        t => PostInvalidReservationImp(t.at, t.email, t.name, t.quantity));
}

Given how simple PostReservationAtNull turned out to be, you may be surprised that this case takes so much code to express. There's not that much going on, though. I reuse the generators I already have for email and name, and FsCheck's built-in Gen.Choose to pick a quantity between 1 and 10. The only slightly tricky expression is for the at value.

The distinguishing part of this test is that the at value should be malformed. A randomly generated string is a good starting point. After all, most strings aren't well-formed date-and-time values. Still, a random string could be interpreted as a date or time, so it's better to explicitly disallow such values. This is possible with the Where function. It's a filter that only allows values through that are not understandable as dates or times - which is the vast majority of them.

Null email #

The penultimate error case is when the email address is missing. That one is as easy to express as the missing at value.

[Property]
public async Task PostNullEmail(DateTime at, string name, PositiveInt quantity)
{
    await PostInvalidReservationImp(at.ToIso8601DateTimeString(), null, name, (int)quantity);
}

Again, with the addition of this specific property, I've removed the corresponding [InlineData] attribute from the PostInvalidReservation test. It only has two remaining test cases, both about non-positive quantities.

Non-positive quantity #

Finally, we can add a property that checks what happens if the quantity isn't positive:

[Property]
public async Task PostNonPositiveQuantity(
    DateTime at,
    string email,
    string name,
    NonNegativeInt quantity)
{
    await PostInvalidReservationImp(at.ToIso8601DateTimeString(), email, name, -(int)quantity);
}

FsCheck doesn't have a wrapper for non-positive integers, but I can use NonNegativeInt and negate it. The point is that I want to include 0, which NonNegativeInt does. That wrapper generates integers greater than or equal to zero.
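As a minimal sketch (not part of the test suite shown here), the effect of the negation is easy to state as a property of its own:

// Sketch: NonNegativeInt yields 0, 1, 2, ..., so negating it yields 0, -1, -2, ...,
// which is exactly the non-positive range the test needs.
[Property]
public bool NegatedNonNegativeIntIsNeverPositive(NonNegativeInt quantity)
{
    return -(int)quantity <= 0;
}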

Since I've now modelled each error case as a separate FsCheck property, I can remove the PostInvalidReservation method.

Conclusion #

To be honest, I think that turning these parametrised tests into FsCheck properties is overkill. After all, when I wrote the code base, I found the parametrised tests adequate. I used test-driven development all the way through, and while I also kept the Devil's Advocate in mind, the tests that I wrote gave me sufficient confidence that the system works as it should.

The main point of this article is to show how you can convert example-based tests to property-based tests. After all, just because I felt confident in my test suite, it doesn't follow that a few parametrised tests will do it for you. How much testing you need depends on a variety of factors, so you may need the extra confidence that thousands of test cases can give you.

The previous article in this series showed an abstract, but minimal example. This one is more realistic, but also more involved.

Next: Refactoring pure function composition without breaking existing tests.


Comments

In the section "Missing date and time", you mention that it could be worth considering the coupling of the test to the implementation details regarding validation order and possible false positive test results. Given that you already have a test data generator that produces valid reservations (GenValidReservationForEmptyDatabase), wouldn't it be more or less trivial to just generate valid test data and modify it to make it invalid in the single specific way you want to test?

2023-04-18 14:00 UTC

Am I right in thinking shrinking doesn't work in FsCheck with the query syntax? I've just tried with two ints. How would you make it work?

[Fact]
public void ShrinkingTest()
{
    Prop.ForAll(
        (from a1 in Arb.Default.Int32().Generator
         from a2 in Arb.Default.Int32().Generator
         select (a1, a2)).ToArbitrary(),
        t =>
        {
            if (t.a2 > 10)
                throw new System.Exception();
        })
    .QuickCheckThrowOnFailure();
}
2023-04-18 19:15 UTC

Christer, thank you for writing. It wouldn't be impossible to address that concern, but I haven't found a good way of doing it without introducing other problems. So, it's a trade-off.

What I meant by my remark in the article is that in order to make an (otherwise) valid request, the test needs to know the maximum valid quantity, which varies from restaurant to restaurant. The problem, in a nutshell, is that the test in question operates exclusively against the REST API of the service, and that API doesn't expose any functionality that enables clients to query the configuration of tables for a given restaurant. There's no way to obtain that information.

The only two options I can think of are:

  • Add such a query API to the REST API. In this case, that seems unwarranted.
  • Add a backdoor API to the self-host (LegacyApi).

If I had to, I'd prefer the second option, but it would still require me to add more (test) code to the code base. There's a cost to every line of code.

Here, I'm making a bet that the grandfathered restaurant isn't going to change its configuration. The tests are then written with the implicit knowledge that that particular restaurant has a maximum table size of 10, and also particular opening and closing times.

This makes those tests more concrete, which makes them more readable. They serve as easy-to-understand examples of how the system works (once the reader has gained the implicit knowledge I just described).

It's not perfect. The tests are, perhaps, too obscure for that reason, and they are vulnerable to configuration changes. Even so, the remedies I can think of come with their own disadvantages.

So far, I've decided that the best trade-off is to leave things as you see them here. That doesn't mean that I wouldn't change that decision in the future if it turns out that these tests are too brittle.

2023-04-19 8:18 UTC

Anthony, thank you for writing. You're correct that in FsCheck shrinking doesn't work with query syntax; at least in the versions I've used. I'm not sure if that's planned for a future release.

As far as I can tell, this is a consequence of the maturity of the library. You have the same issue with QuickCheck, which also distinguishes between Gen and Arbitrary. While Gen is a monad, Arbitrary's shrink function is invariant, which prevents it from being a functor (and hence, also from being a monad).

FsCheck is a mature port of QuickCheck, so it has the same limitation. No functor, no query syntax.

Later, this limitation was solved by modelling shrinking based on a lazily evaluated shrink tree, which does allow for a monad. The first time I saw that in effect was in Hedgehog.

2023-04-21 6:17 UTC

Hedgehog does a little better than FsCheck but it doesn't shrink well when the variables are dependent.

[Fact]
public void ShrinkingTest_Hedgehog()
{
    Property.ForAll(
        from a1 in Gen.Int32(Range.ConstantBoundedInt32())
        from a2 in Gen.Int32(Range.ConstantBoundedInt32())
        where a1 > a2
        select (a1, a2))
    .Select(t =>
    {
        if (t.a2 > 10)
            throw new System.Exception();
    })
    .Check(PropertyConfig.Default.WithTests(1_000_000).WithShrinks(1_000_000));
}

[Fact]
public void ShrinkingTest_Hedgehog2()
{
    Property.ForAll(
        from a1 in Gen.Int32(Range.ConstantBoundedInt32())
        from a2 in Gen.Int32(Range.Constant(0, a1))
        select (a1, a2))
    .Select(t =>
    {
        if (t.a2 > 10)
            throw new System.Exception();
    })
    .Check(PropertyConfig.Default.WithTests(1_000_000).WithShrinks(1_000_000));
}

[Fact]
public void ShrinkingTest_CsCheck()
{
    (from a1 in Gen.Int
     from a2 in Gen.Int
     where a1 > a2
     select (a1, a2))
    .Sample((_, a2) =>
    {
        if (a2 > 10)
            throw new Exception();
    }, iter: 1_000_000);
}

[Fact]
public void ShrinkingTest_CsCheck2()
{
    (from a1 in Gen.Int.Positive
     from a2 in Gen.Int[0, a1]
     select (a1, a2))
    .Sample((_, a2) =>
    {
        if (a2 > 10)
            throw new Exception();
    }, iter: 1_000_000);
}

This and the syntax complexity I mentioned in the previous post were the reasons I developed CsCheck. Random shrinking is the key innovation that makes it simpler.

2023-04-21 16:38 UTC

Anagrams kata as a one-liner

Monday, 10 April 2023 08:08:00 UTC

A futile exercise in code compaction.

Recently I was doing the Anagrams kata in F# with Grzegorz Dziadkiewicz, and along the way realised that the implementation is essentially a one-liner. I thought it would be fun to redo the exercise in Haskell and see how compact I could make the code.

In short, in the exercise, you're given a list of words, and you need to find all the anagrams in the list. For example, given the list bar, foo, bra, the result should be bar, bra, and foo shouldn't be part of the output, since it's not an anagram of any other word in the list.

A pipeline of transformations #

My idea was to collect all the words in a Map (dictionary) keyed by each word's characters, sorted. Even if the sorted string is a nonsense word, all anagrams sort to the same sequence of letters:

ghci> sort "bar"
"abr"
ghci> sort "bra"
"abr"

Each key should map to a Set of words, since I don't care about the order.

Once I have that map of sets, I can throw away the singleton sets, and then the keys. Or perhaps first throw away the keys, and then the singleton sets. The order of those two steps doesn't matter.

The reason I don't want the singleton sets is that a set with only one word means that no anagrams were found.

Creating the map #

How to create the desired map? The Map module exports the fromListWith function that enables you to go through an association list and combine values along the way, in case you encounter the key more than once. That sounds useful, but means that first I have to convert the list of words to an association list.

Importing Control.Arrow, I can do it like this:

ghci> fmap (sort &&& Set.singleton) ["bar", "foo", "bra"]
[("abr",fromList ["bar"]),("foo",fromList ["foo"]),("abr",fromList ["bra"])]

Each element in the list is a pair of a key, and a set containing a single word. Notice that the set containing "bar" has the same key as the set containing "bra". When using fromListWith, the supplied combining function will have to unite these two sets whenever fromListWith encounters the same key.

ghci> Map.fromListWith Set.union $ fmap (sort &&& Set.singleton) ["bar", "foo", "bra"]
fromList [("abr",fromList ["bar","bra"]),("foo",fromList ["foo"])]

The two anagrams "bar" and "bra" now belong to the same set, while "foo" is still solitary.

Finding the anagrams #

Now that we've grouped sets according to key, we no longer need the keys:

ghci> Map.elems $ Map.fromListWith Set.union $ fmap (sort &&& Set.singleton) ["bar", "foo", "bra"]
[fromList ["bar","bra"],fromList ["foo"]]

The anagrams are those sets that have more than one element, so we can throw away those that are smaller.

ghci> filter ((1 <) . Set.size) $ Map.elems $ Map.fromListWith Set.union $
      fmap (sort &&& Set.singleton) ["bar", "foo", "bra"]
[fromList ["bar","bra"]]

The expression has now grown to such a width that I've broken it into two lines to make it more readable. It really is just one line, though.

Function #

To save a bit of space, I eta-reduced the expression before I made it a function:

anagrams :: Ord a => [[a]] -> Set (Set [a])
anagrams =
  Set.fromList . filter ((1 <) . Set.size) . Map.elems . Map.fromListWith Set.union
  . fmap (sort &&& Set.singleton)

The leftmost Set.fromList converts the list of anagrams to a Set of anagrams, since I didn't think that it was a postcondition that the anagrams should be returned in a specific order.

Unfortunately the expression is still so wide that I found it necessary to break it into two lines.

Just for the hell of it, I tried to fix the situation by changing the imports:

import Control.Arrow
import Data.List (sort)
import Data.Map.Strict (fromListWith, elems)
import Data.Set (Set, fromList, singleton)

With this very specific set of imports, the expression now barely fits on a single line:

anagrams :: Ord a => [[a]] -> Set (Set [a])
anagrams = fromList . filter ((1 <) . length) . elems . fromListWith (<>) . fmap (sort &&& singleton)

Here, I also took advantage of Semigroup append (<>) being equal to Set.union for Set.

Is it readable? Hardly.

My main goal with the exercise was to implement the desired functionality as a single expression. Perhaps I was inspired by Dave Thomas, who wrote:

"I hacked a solution together in 25 lines of Ruby."

25 lines of Ruby? I can do it in one line of Haskell!

Is that interesting? Does it make sense to compare two languages? Why not? By trying out different languages you learn the strengths and weaknesses of each. There's no denying that Haskell is expressive. On the other hand, what you can't see in this blog post is that compilation takes forever. Not for this code in particular, but in general.

I'm sure Dave Thomas was done with his Ruby implementation before my machine had even finished compiling the empty, scaffolded Haskell code.

Performance #

Dave Thomas also wrote:

"It runs on this wordlist in 1.8s on a 1.7GHz i7."

Usually I don't care that much about performance as long as it's adequate. Or rather, I find that good software architecture with poor algorithms usually beats bad architecture with good algorithms. But I digress.

How fares my one-liner against Dave Thomas' implementation?

ghci> :set +s
ghci> length . anagrams . lines <$> readFile "wordlist.txt"
20683
(3.56 secs, 1,749,984,448 bytes)

Oh, 3.56 seconds isn't particularly g... Holy thunk, Batman! 1.7 gigabytes!

That's actually disappointing, I admit. If only I could excuse this by running on a weaker machine, but mine is a 1.9 GHz i7. Nominally faster than Dave Thomas' machine.

At least, the time it takes to run through that 3.7 MB file is of the same order of magnitude.

Tests #

Since I had a good idea about the kind of implementation I was aiming for, I didn't write that many tests. Only three, actually.

main :: IO ()
main = defaultMain $ hUnitTestToTests (TestList [
  "Examples" ~: do
    (words, expected) <-
      [
        (["foo""bar""baz"], Set.empty),
        (["bar""foo""bra"], Set.fromList [Set.fromList ["bar""bra"]]),
        (["foo""bar""bra""oof"],
         Set.fromList [
          Set.fromList ["foo""oof"], Set.fromList ["bar""bra"]])
      ]
    let actual = anagrams words
    return $ expected ~=? actual
  ])

As I usually do it in Haskell, these are inlined parametrised HUnit tests.

Conclusion #

Doing katas is a good way to try out new ideas, dumb or otherwise. Implementing the Anagrams kata as a one-liner was fun, but the final code shown here is sufficiently unreadable that I wouldn't recommend it.

You could still write the anagrams function based on the idea presented here, but in a shared code base with an indefinite life span, I'd break it up into multiple expressions with descriptive names.


An abstract example of refactoring from interaction-based to property-based testing

Monday, 03 April 2023 06:02:00 UTC

A C# example with xUnit.net and CsCheck

This is the first comprehensive example that accompanies the article Epistemology of interaction testing. In that article, I argue that in a code base that leans toward functional programming (FP), property-based testing is a better fit than interaction-based testing. In this example, I will show how to refactor simple interaction-based tests into property-based tests.

This small article series was prompted by an email from Sergei Rogovtsev, who was kind enough to furnish example code. I'll use his code as a starting point for this example, so I've forked the repository. If you want to follow along, all my work is in a branch called no-mocks. That branch simply continues off the master branch.

Interaction-based testing #

Sergei Rogovtsev writes:

"A major thing to point out here is that I'm not following TDD here not by my own choice, but because my original question arose in a context of a legacy system devoid of tests, so I choose to present it to you in the same way. I imagine that working from tests would avoid a lot of questions."

Even when using test-driven development (TDD), most code bases I've seen make use of Stubs and Mocks (or, rather, Spies). In an object-oriented context this can make much sense. After all, a catch phrase of object-oriented programming is tell, don't ask.

If you base API design on that principle, you're modelling side effects, and it makes sense that tests use Spies to verify those side effects. The book Growing Object-Oriented Software, Guided by Tests is a good example of this approach. Thus, even if you follow established good TDD practice, you could easily arrive at a code base reminiscent of Sergei Rogovtsev's example. I've written plenty of such code bases myself.

Sergei Rogovtsev then extracts a couple of components, leaving him with a Controller class looking like this:

public string Complete(string state, string code)
{
    var knownState = _repository.GetState(state);
    try
    {
        if (_stateValidator.Validate(code, knownState))
            return _renderer.Success(knownState);
        else
            return _renderer.Failure(knownState);
    }
    catch (Exception e)
    {
        return _renderer.Error(knownState, e);
    }
}

This code snippet doesn't show the entire class, but only its solitary action method. Keep in mind that the entire repository is available on GitHub if you want to see the surrounding code.

The Complete method orchestrates three injected dependencies: _repository, _stateValidator, and _renderer. The question that Sergei Rogovtsev asks is how to test this method. You may think that it's so simple that you don't need to test it, but keep in mind that this is a minimal and self-contained example that stands in for something more complicated.

The method has a cyclomatic complexity of 3, so you need at least three test cases. That's also what Sergei Rogovtsev's code contains. I'll show each test case in turn, while I refactor them.

The overall question is still this: The IStateValidator and IRenderer interfaces each have only a single production implementation, and in both cases the implementations are pure functions. If interaction-based testing is suboptimal, is there a better way to test this code?

As I outlined in the introductory article, I consider property-based testing a good alternative. In the following, I'll refactor the tests. Since the tests already use AutoFixture, most of the preliminary work can be done without choosing a property-based testing framework. I'll postpone that decision until I need it.

State validator #

The IStateValidator interface has a single implementation:

public class StateValidator : IStateValidator
{
    public bool Validate(string code, (string expectedCode, bool isMobile, Uri redirect) knownState)
        => code == knownState.expectedCode;
}

The Validate method is a pure function, so it's completely deterministic. It means that you don't have to hide it behind an interface and replace it with a Test Double in order to control it. Rather, just feed it proper data.
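As a minimal sketch (not part of the example code base), a test that feeds the pure function proper data could look like this:

// Sketch only: exercise the pure function directly instead of stubbing it.
[Fact]
public void ValidateReturnsTrueWhenCodeMatchesExpectedCode()
{
    var knownState = ("some-code", false, new Uri("https://example.com"));
    var sut = new StateValidator();

    var actual = sut.Validate("some-code", knownState);

    Assert.True(actual);
}

Still, that's not what the interaction-based tests do: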

[Theory]
[AutoData]
public void HappyPath(string state, string code, (string, bool, Uri) knownState, string response)
{
    _repository.Add(state, knownState);
    _stateValidator
        .Setup(validator => validator.Validate(code, knownState))
        .Returns(true);
    _renderer
        .Setup(renderer => renderer.Success(knownState))
        .Returns(response);
 
    _target
        .Complete(state, code)
        .Should().Be(response);
}

These tests use AutoFixture, which will make it a bit easier to refactor them to properties. It also makes the test a bit more abstract, since you don't get to see concrete test data. In short, the [AutoData] attribute will generate a random state string, a random code string, and so on. If you want to see an example with concrete test data, the next article shows that variation.

The test uses Moq to control the behaviour of the Test Doubles. It states that the Validate method will return true when called with certain arguments. This is possible because you can redefine its behaviour, but as far as executable specifications go, this test doesn't reflect reality. There's only one Validate implementation, and it doesn't behave like that. Rather, it'll return true when code is equal to knownState.expectedCode. The test poorly communicates that behaviour.

Even before I replace AutoFixture with CsCheck, I'll prepare the test by making it more honest. I'll replace the code parameter with a Derived Value:

[Theory]
[AutoData]
public void HappyPath(string state, (string, bool, Uri) knownState, string response)
{
    var (expectedCode, _, _) = knownState;
    var code = expectedCode;
    // The rest of the test...

I've removed the code parameter to replace it with a variable derived from knownState. Notice how this documents the overall behaviour of the (sub-)system.

This also means that I can now replace the IStateValidator Test Double with the real, pure implementation:

[Theory]
[AutoData]
public void HappyPath(string state, (string, bool, Uri) knownState, string response)
{
    var (expectedCode, _, _) = knownState;
    var code = expectedCode;
    _repository.Add(state, knownState);
    _renderer
        .Setup(renderer => renderer.Success(knownState))
        .Returns(response);
    var sut = new Controller(_repository, new StateValidator(), _renderer.Object);
 
    sut
        .Complete(state, code)
        .Should().Be(response);
}

I give the Failure test case the same treatment:

[Theory]
[AutoData]
public void Failure(string state, (string, bool, Uri) knownState, string response)
{
    var (expectedCode, _, _) = knownState;
    var code = expectedCode + "1"; // Any extra string will do
    _repository.Add(state, knownState);
    _renderer
        .Setup(renderer => renderer.Failure(knownState))
        .Returns(response);
    var sut = new Controller(_repository, new StateValidator(), _renderer.Object);
 
    sut
        .Complete(state, code)
        .Should().Be(response);
}

The third test case is a bit more interesting.

An impossible case #

Before I make any changes to it, the third test case is this:

[Theory]
[AutoData]
public void Error(
    string state,
    string code,
    (string, bool, Uri) knownState,
    Exception e,
    string response)
{
    _repository.Add(state, knownState);
    _stateValidator
        .Setup(validator => validator.Validate(code, knownState))
        .Throws(e);
    _renderer
        .Setup(renderer => renderer.Error(knownState, e))
        .Returns(response);
 
    _target
        .Complete(state, code)
        .Should().Be(response);
}

This test case verifies the behaviour of the Controller class when the Validate method throws an exception. If we want to instead use the real, pure implementation, how can we get it to throw an exception? Consider it again:

public bool Validate(string code, (string expectedCode, bool isMobile, Uri redirect) knownState)
    => code == knownState.expectedCode;

As far as I can tell, there's no way to get this method to throw an exception. You might suggest passing null as the knownState parameter, but that's not possible. This is a new version of C# and the nullable reference types feature is turned on. I spent some fifteen minutes trying to convince the compiler to pass a null argument in place of knownState, but I couldn't make it work in a unit test.

That's interesting. The Error test is exercising a code path that's impossible in production. Is it redundant?

It might be, but here I think that it's more an artefact of the process. Sergei Rogovtsev has provided a minimal example, and as it sometimes happens, perhaps it's a bit too minimal. He did write, however, that he considered it essential for the example that the logic involved more than a Boolean true/false condition. In order to keep with the spirit of the example, then, I'm going to modify the Validate method so that it's also possible to make it throw an exception:

public bool Validate(string code, (string expectedCode, bool isMobile, Uri redirect) knownState)
{
    if (knownState == default)
        throw new ArgumentNullException(nameof(knownState));
 
    return code == knownState.expectedCode;
}

The method now throws an exception if you pass it a default value for knownState. From an implementation standpoint, there's no reason to do this, so it's only for the sake of the example. You can now test how the Controller handles an exception:

[Theory]
[AutoData]
public void Error(string state, string code, string response)
{
    _repository.Add(state, default);
    _renderer
        .Setup(renderer => renderer.Error(default, It.IsAny<Exception>()))
        .Returns(response);
    var sut = new Controller(_repository, new StateValidator(), _renderer.Object);
 
    sut
        .Complete(state, code)
        .Should().Be(response);
}

The test no longer has a reference to the specific Exception object that Validate is going to throw, so instead it has to use Moq's It.IsAny API to configure the _renderer. This is, however, only an interim step, since it's now time to treat that dependency in the same way as the validator.

Renderer #

The Renderer class has three methods, and they are all pure functions:

public class Renderer : IRenderer
{
    public string Success((string expectedCode, bool isMobile, Uri redirect) knownState)
    {
        if (knownState.isMobile)
            return "{\"success\": true, \"redirect\": \"" + knownState.redirect + "\"}";
        else
            return "302 Location: " + knownState.redirect;
    }
 
    public string Failure((string expectedCode, bool isMobile, Uri redirect) knownState)
    {
        if (knownState.isMobile)
            return "{\"success\": false, \"redirect\": \"login\"}";
        else
            return "302 Location: login";
    }
 
    public string Error((string expectedCode, bool isMobile, Uri redirect) knownState, Exception e)
    {
        if (knownState.isMobile)
            return "{\"error\": \"" + e.Message + "\"}";
        else
            return "500";
    }
}

Since all three methods are deterministic, automated tests can control their behaviour simply by passing in the appropriate arguments:

[Theory]
[AutoData]
public void HappyPath(string state, (string, bool, Uri) knownState, string response)
{
    var (expectedCode, _, _) = knownState;
    var code = expectedCode;
    _repository.Add(state, knownState);
    var renderer = new Renderer();
    var sut = new Controller(_repository, renderer);
 
    var expected = renderer.Success(knownState);
    sut
        .Complete(state, code)
        .Should().Be(expected);
}

Instead of configuring an IRenderer Stub, the test can state the expected output: That the output is equal to the output that renderer.Success would return.

Notice that the test doesn't require that the implementation calls renderer.Success. It only requires that the output is equal to the output that renderer.Success would return. Thus, it has less of an opinion about the implementation, which means that it's marginally less coupled to it.

You might protest that the test now duplicates the implementation code. This is partially true, but no more than the previous incarnation of it. Before, the test used Moq to explicitly require that renderer.Success gets called. Now, there's still coupling, but this refactoring reduces it.

As a side note, this may partially be an artefact of the process. Here I'm refactoring tests while keeping the implementation intact. Had I started with a property, perhaps the test would have turned out differently, and less coupled to the implementation. If you're interested in a successful exercise in using property-based TDD, you may find my article Property-based testing is not the same as partition testing interesting.

Simplification #

Once you've refactored the tests to use the pure functions as dependencies, you no longer need the interfaces. The interfaces IStateValidator and IRenderer only existed to support testing. Now that the tests no longer use the interfaces, you can delete them.

Furthermore, once you've removed those interfaces, there's no reason for the classes to support instantiation. Instead, make them static:

public static class StateValidator
{
    public static bool Validate(
        string code,
        (string expectedCode, bool isMobile, Uri redirect) knownState)
    {
        if (knownState == default)
            throw new ArgumentNullException(nameof(knownState));
 
        return code == knownState.expectedCode;
    }
}

You can do the same for the Renderer class.

This doesn't change the overall flow of the Controller class' Complete method, although the implementation details have changed a bit:

public string Complete(string state, string code)
{
    var knownState = _repository.GetState(state);
    try
    {
        if (StateValidator.Validate(code, knownState))
            return Renderer.Success(knownState);
        else
            return Renderer.Failure(knownState);
    }
    catch (Exception e)
    {
        return Renderer.Error(knownState, e);
    }
}

StateValidator and Renderer are no longer injected dependencies, but rather 'modules' that afford pure functions.

Both the Controller class and the tests that cover it are simpler.

Properties #

So far I've been able to make all these changes without introducing a property-based testing framework. This was possible because the tests already used AutoFixture, which, while not a property-based testing framework, already strongly encourages you to write tests without literal test data.

This makes it easy to make the final change to property-based testing. On the other hand, it's a bit unfortunate from a pedagogical perspective. This means that you didn't get to see how to refactor a 'traditional' unit test to a property. The next article in this series will plug that hole, as well as show a more realistic example.

It's now time to pick a property-based testing framework. On .NET you have a few choices. Since this code base is C#, you may consider a framework written in C#. I'm not convinced that this is necessarily better, but it's a worthwhile experiment. Here I've used CsCheck.

Since the tests already used randomly generated test data, the conversion to CsCheck is relatively straightforward. I'm only going to show one of the tests. You can always find the rest of the code in the Git repository.

[Fact]
public void HappyPath()
{
    (from state in Gen.String
     from expectedCode in Gen.String
     from isMobile in Gen.Bool
     let urls = new[] { "https://example.com", "https://example.org" }
     from redirect in Gen.OneOfConst(urls).Select(s => new Uri(s))
     select (state, (expectedCode, isMobile, redirect)))
    .Sample((state, knownState) =>
    {
        var (expectedCode, _, _) = knownState;
        var code = expectedCode;
        var repository = new RepositoryStub();
        repository.Add(state, knownState);
        var sut = new Controller(repository);
 
        var expected = Renderer.Success(knownState);
        sut
            .Complete(state, code)
            .Should().Be(expected);
    });
}

Compared to the AutoFixture version of the test, this looks more complicated. Part of it is that CsCheck (as far as I know) doesn't have the same integration with xUnit.net that AutoFixture has. That might be an issue that someone could address; after all, FsCheck has framework integration, to name an example.

Test data generators are monads, so you typically leverage whatever syntactic sugar a language offers to simplify monadic composition. In C# that syntactic sugar is query syntax, which explains that initial from block.
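As a rough sketch (not taken from the repository), a two-generator query expression desugars to nested SelectMany and Select calls on the generators:

// Sketch: roughly what 'from state in Gen.String from isMobile in Gen.Bool
// select (state, isMobile)' desugars to.
var g = Gen.String.SelectMany(state =>
    Gen.Bool.Select(isMobile => (state, isMobile)));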

The test does look too top-heavy for my taste. An equivalent problem appears in the next article, where I also try to address it. In general, the better monad support a language offers, the more elegantly you can address this kind of problem. C# isn't really there yet, whereas languages like F# and Haskell offer superior alternatives.

Conclusion #

In this article I've tried to demonstrate how property-based testing is a viable alternative to using Stubs and Mocks for verification of composition. You can try to sabotage the Controller.Complete method in the no-mocks branch and see that one or more properties will fail.

While the example code base that I've used for this article has the strength of being small and self-contained, it also suffers from a few weaknesses. It's perhaps a bit too abstract to truly resonate. It also uses AutoFixture to generate test data, which already takes it halfway towards property-based testing. While that makes the refactoring easier, it also means that it may not fully demonstrate how to refactor an example-based test to a property. I'll try to address these shortcomings in the next article.

Next: A restaurant example of refactoring from example-based to property-based testing.


Comments

First of all, thanks again for continuing to explore this matter. This was very enlightening, but in the end, I was left with a sense of subtle wrongness, which is very hard to pin down, and even harder to tell apart between "this is actually not right for me" and "this is something new I'm not accustomed to".

I suppose that my main question would center around difference between your tests for IStateValidator and IRenderer. Let's start with the latter:

Instead of configuring an IRenderer Stub, the test can state the expected output: That the output is equal to the output that renderer.Success would return.

Coupled with the explanation ("[the test] has less of an opinion about the implementation, which means that it's marginally less coupled to it"), this makes a lot of sense, with the only caveat that in more production-like cases comparing the output would be harder (e.g., if IRenderer operates on HttpContext to produce a full HTTP response), but that's a technicality that can be sorted out with a proper assertion library. But let's now look at the IStateValidator part:

as far as executable specifications go, this test doesn't reflect reality. There's only one Validate implementation, and it doesn't behave like that. Rather, it'll return true when code is equal to knownState.expectedCode. The test poorly communicates that behaviour.

Here you act with the opposite intent: you want the test to communicate the specification, and thus be explicitly tied to the logic in the implementation (if not the actual code of it). There are two things about that that bother me. First of all, it's somewhat inconsistent, so it makes it harder for me to choose which path to follow when testing the next code I'd write (or articulating to another developer how they should do it). But what's more important - and that comes from my example being minimal, as you've already noted - is that the validation logic might be more complicated, and thus the setup would be complicated as well. And as you've already mentioned on Twitter, when changing the code in the validator implementation, you might be forced to change the implementation in the test, even if the test is more about the controller itself.

There's also another frame for the same issue: the original test read as (at least for me): "if the state is valid, we return successful response based on this state". It didn't matter what is "valid", nor did it matter what is "successful response". The new test reads as "if state in the repository matches passed code, we return successful response for the state". It still doesn't matter what is "successful response", but the definition of validity does matter. For me, this is a change of test meaning, and I'm not sure I understand where that leads me.

Let's consider the following scenario: we need to add another validity criterion, such as "state in repository has an expiration date, and this date should be in the future". We obviously need to add a couple of tests for this (negative and positive). Where do we add them? I'd say we add them into the tests for the validator itself (which are "not shown" for the purposes of brevity), but then it feels weird that we also need to change this "happy path" test...

2023-04-03 21:24 UTC

Thanks for showing CsCheck. I've put in a PR to show how I would refactor the CsCheck tests and will attempt to explain some of the design choices of CsCheck.

First of all, it may be a personal opinion but I don't really tend to use query syntax for CsCheck. I prefer to see the SelectManys and there are a number of additional overloads that simplify ranging and composing Gens.

On the design of CsCheck, I built it not to use reflection or attributes, nor to target test frameworks. I've seen the very difficult problems these lead to (for author and user) when you try to move past simple examples.

I wanted the user to be able to move from simple general type generators to ranged complex types easily in a fluent style. No Arbs, attributes, PositiveInt type etc.

CsCheck has automatic shrinking even for the most complex types that just comes out of composing Gens.

I think some of the reason it was so easy to extend the library to areas such as concurrency testing was because of this simplicity (as well as the random shrinking insight).

Gen<Uri> _genRedirect = Gen.OneOfConst(new Uri("https://example.com"), new Uri("https://example.org"));

[Fact]
public void HappyPath()
{
    Gen.Select(Gen.String, Gen.String, Gen.Bool, _genRedirect)
    .Sample((state, expectedCode, isMobile, redirect) =>
    {
        var code = expectedCode;
        var knownState = (expectedCode, isMobile, redirect);
        var repository = new RepositoryStub { { state, knownState } };
        var sut = new Controller(repository);
        var actual = sut.Complete(state, code);
        var expected = Renderer.Success(knownState);
        return actual == expected;
    });
}
2023-04-04 23:56 UTC

Sergei, thank you for writing. I'm afraid all of this is context-dependent, and I seem to constantly fail giving enough context. It's a fair criticism that I seem to employ inconsistent heuristics when making technical decisions. Part of it is caused by my lack of context. The code base is deliberately stripped of context, which has many other benefits, but gives me little to navigate after. I'm flying blind, so to speak. I've had to (implicitly) imagine some forces acting on the software in order to make technical decisions. Since we haven't explicitly outlined such forces, I've had to make them up as I went. It's quite possible, even, that I've imagined one set of forces in one place, and another set somewhere else. If so, no wonder the decisions are inconsistent.

What do I mean by forces? I'm thinking of the surrounding context that you have to take into account when making technical decisions about code: Is the current feature urgent? Is it a bug already in production? Is this a new system with a small code base? Or is it an old system with a large code base? Do you have good automated tests? Do you have a Continuous Delivery pipeline? Are you experiencing problems with code quality? What does the team makeup look like? Do you have mostly seasoned veterans who've worked in that code base for years? Or do you have many newcomers? Is the system public-facing or internal? Is it a system, even, or a library or framework? What sort of organisation owns the software? Is it a product group? Or a cost centre? What are the organization's goals? How are you incentivized? How are other stakeholders incentivized?

As you can imagine, I can keep going, asking questions like these, and they may all be relevant. Clearly, we can't expect a self-contained minimal example to also include all such superstructure, so that's what I (inconsistently) have to imagine, on the fly.

I admit that the decisions I describe seem inconsistent, and the explanation may simply be what is already implied above: I may have had a different context in mind when I made one, and a variation in mind when I made the other.

That's hardly the whole story, though. I didn't start my answer with the above litany of forces only to make a bad excuse for myself. Rather, what I had in mind was to argue that I use a wider context when making decisions. That context is not just technical, but includes, among many other considerations, the team structure.

As an example, I was recently working with some students in a university setting. These are people in their early twenties, with only a few months of academic programming under their belt, as well as, perhaps, a few years of hobby programming. They'd just been introduced to Git and GitHub a few weeks earlier, C# a month before that. I was trying to teach them how to use Git and GitHub, how to structure decent C# code, and many other things. During our project, they did send me a few pull requests I would have immediately rejected from a professional programmer. In this particular context, however, that would have been counter-productive. These students were doing a good job, based on their level of experience, and they needed the sense of accomplishment that I would (often, but not always) accept their code.

I could have insisted on a higher code quality, and I would also have been able to teach it to anyone patient enough to listen. One thing I've learned all too slowly in my decades of working with other people is that most people aren't as patient with me as I'd like them to be. I need to explicitly consider how to motivate my collaborators.

Here's another example: Years ago, I worked with a rag-tag team hastily assembled via word-of-mouth of some fine European freelancers. My challenge here was a different one. These people were used to being on top of their game - usually the ones brought in to an organisation because they were the best. I needed them to work together, and among other things, it meant showing them that even though they might think that their way was the best way, other ways exist. I wanted them to be able to work together and produce code with shared ownership. At the beginning, I was rather strict with my standards, clearly bruising a few egos, but ultimately several members have told me what a positive transformative experience it was for them. It was a positive transformative experience for me, too.

I discuss all of this because you, among various points, mention the need to be able to articulate to other developers how to make technical decisions about tests. The point is that there's a lot of context that goes into making decisions, and hardly a one-size-fits-all heuristic.

What usually guides me is an emphasis on coupling, and that's also, I believe, what ultimately motivated me here. There's always going to be some coupling between tests and production code, but the less the better. For example, when considering how to write an assertion, I consider whether a change in production code's behaviour would cause a test to break.

Consider, for example, the renderer in the present example. How important is the exact output? What happens if I change a character in the string that is being returned?

That's a good example of context being important. If that output is part of an implementation of a network protocol or some other technical spec, just one character change could, indeed, imply that your implementation is off spec. In that case, we do want to test the exact output, and we do want the test to fail if the output changes.

On the other hand, if the output is a piece of UI, or perhaps an error message, then the exact wording is likely to change over time. Since this doesn't really imply a change in behaviour, changing such a string output shouldn't break a test.

You need that wider context in order to make decisions like that: If we change the System Under Test in this way, will the test break? Should it? What if we change it in another way?

This is relevant in order to address your final concern: What if you now decide that the expiration date should be in the future? The way you describe it, it sounds like this strengthens the preconditions of the system - in other words, it breaks backwards compatibility. So yes, making that change may break existing tests, but this could be an indication that it's also going to break existing clients.

If you have any clients, that is. Again, you know your context better than I do, so only you can decide whether making such a change is okay. I can think of situations where it is, but I usually find myself in contexts where it isn't, so I tend to err on the side of avoiding breaking changes.

2023-04-09 14:28 UTC

Mark, thank you for taking the time to discuss this.

Having the magic word "architect" somewhere in my title, I know the importance of context, and in fact that would usually be the first counter-question that I have for somebody coming at me with a question: "what is your context?". So here we are, me being contractually obliged to strip as much context from my code as possible, and you having to reinvent it back from your experience. On the other hand, this allows us to point out which decisions are actually context-driven, and how different contexts affect them.

With that in mind, I can actually propose two different contexts to reframe the decisions above, so that we could arrive at more insights.

The first would be an imaginary context, which I had in mind when writing the code, but haven't thought of communicating: the renderer is as important as the validator. In case of "mobile" state the consumer is actually a mobile application, so we need to know we've produced the right JSON it will consume, and in case of non-"mobile" state the consumer is the browser, which again needs to be properly redirected. In my mind, this is no less important than the validation logic itself, because breaking it will break at least one consumer (mobile), and more likely both of them. Thus, according to the logic above, this is a compatibility issue, and as such we need to explicitly spell out this behavior in the tests. Which gives us six outcome branches... six tests? Or something more complicated? This is especially interesting, considering the fact that we can test the renderer in isolation, so we'd be either duplicating our tests... or just discarding the isolated tests for the renderer?

And then here's the actual real context, which I can thankfully divulge to this extent: this is, in fact, a migration problem, when we move from one externally-facing framework (i.e. ASP.NET Web API) to another (i.e. ASP.NET Core). So I am not, in fact, concerned about the validation at all - I'm concerned about the data being properly passed to the validator (because the validator already existed, and worked properly, I'm just calling it from another framework), its result properly handled in the controller (which I am replacing), and then I'm concerned that despite the heavy changes between ASP.NET versions I'm still rendering the output in exactly the same way.

Now that I'm thinking about it, it seems grossly unfair that I've hidden this context beforehand, but then, I didn't see how it was affecting the decisions in test design. So hopefully we can still find some use in this.

2023-04-10 18:03 UTC

Anthony, thank you for writing. I didn't intend to disparage CsCheck. Having once run a moderately successful open source project, I've learned that it's a good idea to limit the scope to a functional minimum. I'm not complaining about the design decisions made for CsCheck.

I was a bit concerned that a casual reader might compare the CsCheck example with the previous code and be put off by the apparent complexity. I admit that your example looks less complex than mine, so to address that particular concern, I could have used code similar to yours.

Whether one prefers the syntactic sugar of query syntax or the more explicit use of SelectMany is, to a degree, subjective. There are, on the other hand, some cases where one is objectively better than the other. One day I should write an article about that.

I agree that the test-framework integration that exists in FsCheck is less than ideal. I'm not wishing for anything like that. Something declarative, however, would be nice. Contrary to you, I consider wrapper types like PositiveInt to be a boon, but perhaps not like they're implemented in FsCheck. (And this, by the way, isn't to harp on FsCheck either; I only bring up those examples because FsCheck has more of that kind of API than Hedgehog does.) The Haskell QuickCheck approach is nice: While a wrapper like Positive is predefined in the package, it's just an Arbitrary instance. There's really nothing special about it, and you can easily define domain-specific wrappers for your own testing purposes. Here's an example: Naming newtypes for QuickCheck Arbitraries. I'm wondering if something similar wouldn't be possible with interfaces in .NET.

2023-04-13 8:46 UTC

Sergei, thank you for writing. I'm sorry if I came across as condescending or implying that you weren't aware that context matters. Of course you do. I was mostly trying to reconstruct my own decision process, which is far less explicit than you might like.

Regarding the renderer component, I understand that testing such a thing in reality may be more involved than the toy example we're looking at. My first concern is to avoid duplicating efforts too much. Again, however, the external behaviour should be the primary concern. I'm increasingly shifting towards making more and more things internal, as long as I can still test them via the application boundary. As I understand it, this is the same concern that made Dan North come up with behaviour-driven development. I'm also concerned primarily with testing the behaviour of systems, just without all the Gherkin overhead.

There comes a time, however, when testing everything through an external API becomes too awkward. I'm not averse to introducing classes or functions, or making them public, to enable testing at a different abstraction level. Doing that, however, represents a bet that it's possible to keep the new component's API stable enough that it isn't going to cause too much test churn. Still, if we imagine that we've already made such a decision, and that we now have some renderer component, then it's only natural to test it thoroughly. Then, in order to avoid duplicating assertions, we can state, as I did in this article, that the overall system should expect to see whatever the renderer component returns.

That was, perhaps, too wordy an explanation. Perhaps this is more helpful: Don't repeat yourself. What has been asserted in one place shouldn't be asserted in another place.

The other example you mention, about migrating to another framework, reminds me of two things.

The first is that we shouldn't forget about other ways of verifying code correctness. We've mostly been discussing black-box testing, and while it can be an interesting exercise to imagine an adversary developer, in general that's rarely the reality. Are there other ways to verify that methods are called with the correct values? How about looking at the code? Consistent code reviews are good at detecting bugs.

The second observation is that if you already have two working components (validator and renderer) you can treat them as Test Oracles. This still works well with property-based testing. Write tests based on equivalence classes and get a property-based testing framework to sample from those classes. Then use the Test Oracles to define the expected output. That's essentially what I have done in this article.
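To make that concrete, here's a minimal, hypothetical sketch in C# with xUnit.net. The Validate and Render methods stand in for your existing, trusted components (the Test Oracles), Handle stands in for the new framework-facing code, and a plain Random loop stands in for the sampling that a property-based framework such as FsCheck, Hedgehog, or CsCheck would otherwise do:

using System;
using Xunit;

public static class LegacyComponents
{
    // Hypothetical stand-ins for the existing, trusted validator and renderer.
    public static bool Validate(int candidate) => candidate > 0;
    public static string Render(bool isValid) => isValid ? "ok" : "rejected";
}

public static class NewController
{
    // Hypothetical stand-in for the new framework-facing code that composes the two.
    public static string Handle(int candidate) =>
        LegacyComponents.Render(LegacyComponents.Validate(candidate));
}

public class OracleTests
{
    [Fact]
    public void NewCodeAgreesWithOracles()
    {
        var rnd = new Random(42);
        for (var i = 0; i < 100; i++)
        {
            // Sample across equivalence classes (both valid and invalid input).
            var input = rnd.Next(-1_000, 1_000);

            // The existing components define the expected output...
            var expected = LegacyComponents.Render(LegacyComponents.Validate(input));

            // ...and the new composition must agree with them.
            Assert.Equal(expected, NewController.Handle(input));
        }
    }
}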

Does it prove that your framework-based code calls the components with the correct arguments? No, not like if you'd used a Test Spy. Property-based testing produces knowledge reminiscent of the kind of knowledge produced by experimental physics; not the kind of axiomatic knowledge produced by mathematics. That's why I named this article series Epistemology of interaction testing.

Is it wrong to test with Stubs or Spies in a case like this? Not necessarily. Ultimately, what I try to do with this blog is to investigate and present alternatives. Only once we have alternatives do we have choices.

2023-04-15 15:53 UTC

More functional pits of success

Monday, 27 March 2023 05:36:00 UTC

FAQ: What are the other pits of successes of functional programming?

People who have seen my presentation Functional architecture: the pits of success occasionally write to ask: What are the other pits?

The talk is about some of the design goals that we often struggle with in object-oriented programming, but which tend to happen automatically in functional programming (FP). In the presentation I cover three pits of success, but I also mention that there are more. In a one-hour conference presentation, I simply didn't have time to discuss more than three.

It's a natural question, then, to ask what are the pits of success that I don't cover in the talk?

I've been digging through my notes and found the following:

  • Parallelism
  • Ports and adapters
  • Services, Entities, Value Objects
  • Testability
  • Composition
  • Package and component principles
  • CQS
  • Encapsulation

Finding a lost list like this, more than six years after I jotted it down, presents a bit of a puzzle to me, too. In this post, I'll see if I can reconstruct some of the points.

Parallelism #

When most things are immutable you don't have to worry about multiple threads updating the same shared resource. Much has already been said and written about this in the context of functional programming, to the degree that for some people, it's the main (or only?) reason to adopt FP.

Even so, I had (and still have) a version of the presentation that included this advantage. When I realised that I had to cut some content for time, it was easy to cut this topic in favour of other benefits. After all, this one was already well known in 2016.
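For completeness, here's a minimal C# sketch of the idea; the record type and the values are made up for illustration. An immutable value can be shared freely among parallel computations, because no thread can modify it:

using System.Collections.Generic;
using System.Linq;

// An immutable value; once constructed, no thread can change it.
public sealed record Configuration(string BaseUrl, int TimeoutSeconds);

public static class ParallelExample
{
    public static IReadOnlyCollection<string> DescribeAll(IEnumerable<int> ids)
    {
        var shared = new Configuration("https://example.com", 30);

        // Safe to read 'shared' from many threads at once; there's nothing to lock.
        return ids
            .AsParallel()
            .Select(id => $"{shared.BaseUrl}/items/{id}?timeout={shared.TimeoutSeconds}")
            .ToList();
    }
}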

Ports and adapters #

This was one of the three benefits I kept in the talk. I've also covered it on my blog in the article Functional architecture is Ports and Adapters.

Services, Entities, Value Objects #

This was the second topic I included in the talk. I don't think that there's an explicit article here on the blog that deals with this particular subject matter, so if you want the details, you'll have to view the recording.

In short, though, what I had in mind was that Domain-Driven Design explicitly distinguishes between Services, Entities, and Value Objects, which often seems to pull in the opposite direction of the object-oriented notion of data with behaviour. In FP, on the contrary, it's natural to separate data from behaviour. Since behaviour often implements business logic, and since business logic tends to change at a different rate than data, it's a good idea to keep them apart.

Testability #

The third pit of success I covered in the talk was testability. I've also covered this here on the blog: Functional design is intrinsically testable. Pure functions are trivial to test: Supply some input and verify the output.

Composition #

Pure functions compose. In the simplest case, use the return value from one function as input for another function. In more complex cases, you may need various combinators in order to be able to 'click' functions together.
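Here's a small C# sketch of both cases; the Compose combinator is hypothetical, not part of the base class library:

using System;

public static class FunctionComposition
{
    // The simplest case: the output of one pure function becomes the input of another.
    public static string Describe(int i) => Render(IsEven(i));

    private static bool IsEven(int i) => i % 2 == 0;
    private static string Render(bool isEven) => isEven ? "even" : "odd";

    // A combinator that 'clicks' two functions together into one.
    public static Func<A, C> Compose<A, B, C>(this Func<A, B> f, Func<B, C> g) =>
        a => g(f(a));
}

With a combinator like that, a Func<int, bool> and a Func<bool, string> compose into a single Func<int, string> without any intermediate variables.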

I don't have a single article about this. Rather, I have scores: From design patterns to category theory.

Package and component principles #

When it comes to this one, I admit, I no longer remember what I had in mind. Perhaps I was thinking about Scott Wlaschin's studies of cycles and modularity. Perhaps I did, again, have my article about Ports and Adapters in mind, or perhaps it was my later articles on dependency rejection that were already stirring.

CQS #

The Command Query Separation principle states that an operation (i.e. a method) should be either a Command or a Query, but not both. In most programming languages, the onus is on you to maintain that discipline. It's not a design principle that comes easy to most object-oriented programmers.

Functional programming, on the other hand, emphasises pure functions, which are all Queries. Commands have side effects, and pure functions can't have side effects. Haskell even makes sure to type-check that pure functions don't perform side effects. If you're interested in a C# explanation of how that works, see The IO Container.

Whatever impure actions remain after you've implemented most of your code base as pure functions can violate or follow CQS as you see fit. I usually still follow that principle, but since the impure parts of a functional code base tend to be fairly small and isolated to the edges of the application, even if you decide to violate CQS there, it probably makes little difference.
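As a minimal sketch of the distinction (hypothetical types, not taken from any particular code base):

using System.Collections.Generic;

public sealed class ShoppingBasket
{
    private readonly List<string> items = new List<string>();

    // A Command: it changes state and returns nothing.
    public void Add(string item)
    {
        items.Add(item);
    }

    // A Query: it returns a value and changes nothing.
    public int Count
    {
        get { return items.Count; }
    }
}

// A pure function is automatically a Query; since it can't have side effects,
// it can't be a Command.
public static class Pricing
{
    public static decimal Total(IEnumerable<decimal> prices)
    {
        var total = 0m;
        foreach (var price in prices)
            total += price;
        return total;
    }
}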

Encapsulation #

Functional programmers don't talk much about encapsulation, but you'll often hear them say that we should make illegal states unrepresentable. I recently wrote an article that explains that this tends to originate from the same motivation: Encapsulation in Functional Programming.
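The idea translates to C# as well, if more verbosely. Here's a hypothetical sketch of a type that makes non-natural numbers unrepresentable:

using System;

// Once a value of this type exists, it's guaranteed to be a natural number.
public sealed class NaturalNumber
{
    public int Value { get; }

    public NaturalNumber(int value)
    {
        if (value < 1)
            throw new ArgumentOutOfRangeException(
                nameof(value),
                "A natural number must be positive.");
        Value = value;
    }
}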

In languages like F# and Haskell, most type definitions require a single line of code, in contrast to object-oriented programming where types are normally classes, which take up a whole file each. This makes it much easier and more succinct to define constructive data in proper FP languages.

Furthermore, perhaps most importantly, pure functions are referentially transparent, and referential transparency fits in your head.

Conclusion #

In a recording of the talk titled Functional architecture: the pits of success I explain that the presentation only discusses three pits of success, but that there are more. Consulting my notes, I found five more that I didn't cover. I've now tried to remedy this lapse.

I don't, however, believe that this list is exhaustive. Why should it be?


On trust in software development

Monday, 20 March 2023 08:55:00 UTC

Can you trust your colleagues to write good code? Can you trust yourself?

I've recently noticed a trend among some agile thought leaders. They talk about trust and gatekeeping. It goes something like this:

Why put up barriers to prevent people from committing code? Don't you trust your colleagues?

Gated check-ins, pull requests, reviews are a sign of a dysfunctional organisation.

I'm deliberately paraphrasing. While I could cite multiple examples, I wish to engage with the idea rather than the people who propose it. Thus, I apologise for the seeming use of weasel words in the above paragraph, but my agenda is the opposite of appealing to anonymous authority.

If someone asks me: "Don't you trust your colleagues?", my answer is:

No, I don't trust my colleagues, as I don't trust myself.

Framing #

I don't trust myself to write defect-free code. I don't trust that I've always correctly understood the requirements. I don't trust that I've written the code in the best possible way. Why should I trust my colleagues to be superhumanly perfect?

The trust framing is powerful because few people like to be labeled as mistrusting. When asked "don't you trust your colleagues?" you don't want to admit that, no, you don't. You don't want to come across as suspicious or paranoid. You want to belong.

The need to belong is fundamental to human nature. When asked if you trust your colleagues, saying "no" implicitly disassociates you from the group.

Sometimes the trust framing goes one step further and labels processes such as code reviews or pull requests as gatekeeping. This is still the same framing, but now turns the group dynamics around. Now the question isn't whether you belong, but whether you're excluding others from the group. Most people (me included) want to be nice people, and excluding other people is bullying. Since you don't want to be a bully, you don't want to be a gatekeeper.

Framing a discussion about software engineering as one of trust and belonging is powerful and seductive. You're inclined to accept arguments made from that position, and you may not discover the sleight of hand. It's subliminal.

Most likely, it's such a fundamental and subconscious part of human psychology that the thought leaders who make the argument don't realise what they are doing. Many of them are professionals that I highly respect; people with more merit, experience, and education than I have. I don't think they're deliberately trying to put one over you.

I do think, on the other hand, that this is an argument to be challenged.

Two kinds of trust #

On the surface, the trust framing seems to be about belonging, or its opposite, exclusion. It implies that if you don't trust your co-workers, you suspect them of malign intent. Organisational dysfunction, it follows, is a Hobbesian state of nature where everyone is out for themselves: Expect your colleague to be a back-stabbing liar out to get you.

Indeed, the word trust implies that, too, but that's usually not the reason to introduce guardrails and checks to a software engineering process.

Rather, another fundamental human characteristic is fallibility. We make mistakes in all sorts of ways, and we don't make them from malign intent. We make them because we're human.

Do we trust our colleagues to make no mistakes? Do we trust that our colleagues have perfect knowledge of requirements, goals, architecture, coding standards, and so on? I don't, just as I don't trust myself to have those qualities.

This interpretation of trust is, I believe, better aligned with software engineering. If we institute formal sign-offs, code reviews, and other guardrails, it's not that we suspect co-workers of ill intent. Rather, we're trying to prevent mistakes.

Two wrongs... #

That's not to say that all guardrails are necessary all of the time. The thought leaders I so vaguely refer to will often present alternatives: Pair programming instead of pull requests. Indeed, that can be an efficient and confidence-inducing way to work, in certain contexts. I describe advantages as well as disadvantages in my book Code That Fits in Your Head.

I've warned about the trust framing, but that doesn't mean that pull requests, code reviews, gated check-ins, or feature branches are always a good idea. Just because one argument is flawed, it doesn't mean that the message is wrong. It could be correct for other reasons.

I agree with the likes of Martin Fowler and Dave Farley that feature branching is a bad idea, and that you should adopt Continuous Delivery. Accelerate strongly suggests that.

I also agree that pull requests and formal reviews with sign-offs, as they're usually practised, are at odds with even Continuous Integration. Again, be aware of common pitfalls in logic. Just because one way to do reviews is counter-productive, it doesn't follow that all reviews are bad.

As I have outlined in another article, under the right circumstances, agile pull requests are possible. I've had good results with pull requests like that. Reviewing was everyone's job, and we integrated multiple times a day.

Is that way of working always possible? Is it always the best way to work? No, of course not. Context matters. I've worked with another team where it was evident that that process had little chance of working. On the other hand, that team wasn't keen on pair programming either. Then what do you do?

Mistakes were made #

I rarely have reason to believe that co-workers have malign intent. When we are working together towards a common goal, I trust that they have as much interest in reaching that goal as I have.

Does that trust mean that everyone is free to do whatever they want? Of course not. Even with the best of intentions, we make mistakes, there are misunderstandings, or we have incomplete information.

This is one among several reasons I practice test-driven development (TDD). Writing a test before implementation code catches many mistakes early in the process. In this context, the point is that I don't trust myself to be perfect.

Even with TDD and the best of intentions, there are other reasons to look at other people's work.

Last year, I did some freelance programming for a customer, and sometimes I would receive feedback that a function I'd included in a pull request already existed in the code base. I didn't have that knowledge, but the review caught it.

Could we have caught that with pair or ensemble programming? Yes, that would work too. There's more than one way to make things work, and they tend to be context-dependent.

If everyone on a team has the luxury of being able to work together, then pair or ensemble programming is an efficient way to coordinate work. Little extra process may be required, because everyone is already on the same page.

If team members are less fortunate, or have different preferences, they may need to rely on the flexibility offered by asynchrony. This doesn't mean that you can't do Continuous Delivery, even with pull requests and code reviews, but the trade-offs are different.

Conclusion #

There are many good reasons to be critical of code reviews, pull requests, and other processes that seem to slow things down. The lack of trust in co-workers is, however, not one of them.

You can easily be swayed by that argument because it touches something deep in our psyche. We want to be trusted, and we want to trust our colleagues. We want to belong.

The argument is visceral, but it misrepresents the motivation for process. We don't review code because we believe that all co-workers are really North Korean agents looking to sneak in security breaches if we look away.

We look at each other's work because it's human to make mistakes. If we can't all be in the same office at the same time, fast but asynchronous reviews also work.


Comments

This reminds me of Hanlon's razor.

Never attribute to malice that which is adequately explained by stupidity.
2023-03-23 23:29 UTC

Too much technical attribution is still being applied, on the whole, to a field which is far more of a social and psychological construct. What's even worse is that the evidence around organisation and team capabilities is pretty clear about what makes a good team and organisation.

The technical solutions for reaching trust or belongingness are somewhat of a means to an end. They can't stand alone, because it only takes two humans to sidetrack them.

Therefore I still absolutely believe that the technical parts of software engineering are by far the less demanding ones. Technical work is most often still done as individual contributions, based on loose requirements, equally loose leadership, often non-existent enforcement from tooling, scattered human ownership, and, even worse, subjective perceptions. That, I believe, on the other hand has everything to do with trust, gatekeeping, belonging, and other human psychology needs.

2023-03-27 11:29 UTC

I like to look at it from the other side - I ask my colleagues to review my code because I trust them. I trust they can help me make it better, and it's in all of our interests to make it better - as soon as the code gets into the main branch, it's not my code anymore - it's the team's code.

Also, not tightly connected to the topic of code reviews and pull requests, but still about trust in software development, I had this general thought about trusting yourself in software programming: Don't trust your past self. Do trust your future self.
Don't trust your past self, in the sense: don't hesitate to refactor your code. If you wrote the code a while ago, even with the best intentions of making it readable, as you normally have, but now you read it and don't understand it - don't put too much effort into trying to understand it. If you don't understand it, others won't either. Rewrite it to make it better.
Do trust your future self, in the sense: don't try to be smarter than you are now. If you don't know all the requirements or the answers to all the questions, don't guess them when you don't have to. Don't create an abstraction from the one use case you have - in the future you will have more use cases and will be able to create a better abstraction - trust me, or actually, trust yourself. I guess it's just YAGNI rephrased ;)

2023-05-16 21:22 UTC

Jakub, thank you for writing. I don't think that I have much to add. More people should use refactoring as a tool for enhancing understanding when they run into incomprehensible code. Regardless of who originally wrote that code.

2023-06-13 7:36 UTC

Confidence from Facade Tests

Monday, 13 March 2023 07:15:00 UTC

Recycling an old neologism of mine, I try to illustrate a point about the epistemology of testing function composition.

This article continues the introduction of a series on the epistemology of interaction testing. In the first article, I attempted to explain how to test the composition of functions. Despite my best efforts, I felt that that article somehow fell short of its potential. Particularly, I felt that I ought to have been able to provide some illustrations.

After publishing the first article, I finally found a way to illustrate what I'd been trying to communicate. That's this article. Better late than never.

Previously, on epistemology of interaction testing #

A brief summary of the previous article may be in order. The question this article series tries to address is how to unit test composition of functions - particularly pure functions.

Consider the illustration from the previous article, repeated here for your convenience:

Example component graph with four leaves.

When the leaves are pure functions they are intrinsically testable. That's not the hard part, but how do we test the internal nodes or the root?

While most people would reach for Stubs and Spies, those kinds of Test Doubles tend to break encapsulation.

What are the alternatives?

An alternative I find useful is to test groups of functions composed together. Particularly when they are pure functions, you have no problem with non-deterministic behaviour. On the other hand, this approach seems to run afoul of the problem with combinatorial explosion of integration testing so eloquently explained by J.B. Rainsberger.

What I suggest, however, isn't quite integration testing.

Neologism #

If it isn't integration testing, then what is it? What do we call it?

I'm going to resurrect and recycle an old term of mine: Facade Tests. Ten years ago I had a more narrow view of a term like 'unit test' than I do today, but the overall idea seems apt in this new context. A Facade Test is a test that exercises a Facade.

These days, I don't find it productive to distinguish narrowly between different kinds of tests. At least not to the degree that I wish to fight over terminology. On the other hand, occasionally it's useful to have a name for a thing, in order to be able to differentiate it from some other thing.

The term Facade Tests is my attempt at a neologism. I hope it helps.

Code coverage as a proxy for confidence #

The question I'm trying to address is how to test functions that compose other functions - the internal nodes or the root in the above graph. As I tried to explain in the previous article, you need to build confidence that various parts of the composition work. How do you gain confidence in the leaves?

One way is to test each leaf individually.

A single leaf node and a test module pointing to it.

The first test or two may exercise a tiny slice of the System Under Test (SUT):

A single leaf node with a thin slice of space filled in green, driven by a test.

The next few tests may exercise another part of the SUT:

A single leaf node with two thin slices of space filled in green, driven by tests.

Keep adding more tests:

A single leaf node with a triangle of space filled in green, driven by tests.

Stop when you have good confidence that the SUT works as intended:

A single leaf node fully filled in green, indicating full code coverage by tests.

If you're now thinking of code coverage, I can't blame you. To be clear, I haven't changed my position about code coverage. Code coverage is a useless target measure. On the other hand, there's no harm in having a high degree of code coverage. It still might give you confidence that the SUT works as intended.

You may think of the amount of green in the above diagrams as a proxy for confidence. The more green, the more confident you are in the SUT.

None of the arguments here hinge on code coverage per se. What matters is confidence.

Facade testing confidence #

With all the leaves covered, you can move on to the internal nodes. This is the actual problem that I'm trying to address. We would like to test an internal node, but it has dependencies. Fortunately, the context of this article is that the dependencies are pure functions, so we don't have a problem with non-deterministic behaviour. No need for Test Doubles.

It's really simple, then. Just test the internal node until you're confident that it works:

An internal node fully filled in blue, indicating full code coverage by tests. Dependencies are depicted as boxes below the internal node, but with only slivers of coverage.

The goal is to build confidence in the internal node, the new SUT. While it has dependencies, covering those with tests is no longer the goal. This is the key difference between Facade Testing and Integration Testing. You're not trying to cover all combinations of code paths in the integrated set of components. You're still just trying to test the new SUT.

Whether or not these tests exercise the leaves is irrelevant. The leaves are already covered by other tests. What 'coverage' you get of the leaves is incidental.
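As a concrete sketch (with made-up functions, not taken from the example component graph), a Facade Test of an internal node might look like this: two pure leaves, one composing function, and a test that only supplies input and verifies output - no Stubs, no Spies:

using Xunit;

public static class Leaves
{
    // Two pure 'leaf' functions, each already covered by their own tests.
    public static int Increment(int i) => i + 1;
    public static string Format(int i) => $"The value is {i}.";
}

public static class InternalNode
{
    // The composition we want to gain confidence in.
    public static string IncrementAndFormat(int i) =>
        Leaves.Format(Leaves.Increment(i));
}

public class InternalNodeTests
{
    [Theory]
    [InlineData(0, "The value is 1.")]
    [InlineData(41, "The value is 42.")]
    public void IncrementAndFormatReturnsExpectedResult(int input, string expected)
    {
        var actual = InternalNode.IncrementAndFormat(input);
        Assert.Equal(expected, actual);
    }
}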

Once you've built confidence in internal nodes, you can repeat the process with the root node:

A root node fully filled in blue, indicating full code coverage by tests. Dependencies are depicted as trees below the root node, but with only slivers of coverage.

The test covers enough of the root node to give you confidence in it. Some of the dependencies are also partially exercised by the tests, but this is still secondary. The way I've drawn the diagram, the left internal node is exercised in such a way that its dependencies (the leaves) are partially exercised. The test apparently also exercises the right internal node, but none of that activity makes it interact with the leaves.

These aren't integration tests, so they avoid the problem of combinatorial explosion.

Conclusion #

This article was an attempt to illustrate the prose in the previous article. You can unit test functions that compose other functions by first unit testing the leaf functions and then the compositions. While these tests exercise an 'integration' of components, the purpose is not to test the integration. Thus, they aren't integration tests. They're facade tests.

Next: An abstract example of refactoring from interaction-based to property-based testing.


Comments

I really appreciate that you are writing about testing compositions of pure functions. As an F# dev who tries to adhere to the impureim sandwich (which, indeed, you helped me with before), this is something I have also been struggling with, and failing to find good answers to.

But following your suggestion, aren’t we testing implementation details?

Using the terminology in this article, I often have a root function that is public, which composes and delegates work to private helper functions. Compared to having all the logic directly in the root function, the code is, unsurprisingly, easier to read and maintain this way. However, all the private helper functions (internal nodes and leaves) as well as the particularities of how the root and the internal nodes compose their “children”, are very much just implementation details of the root function.

I occasionally need to change such code in a way that does not change the public API (at least not significantly enough to cause excessive test maintenance), but which significantly restructures the internal helpers. If I were to test as suggested in this article, I would have many broken tests on my hands. These would be tests of the internal nodes and leaves (which may not exist at all after the refactor, having been replaced with completely different functions) as well as tests of how the root node composes the other functions (which, presumably, would still pass but may not actually test anything useful anymore).

In short, testing in the way suggested here would act as a force to avoid refactoring, which seems counter-productive.

One would also need to use InternalsVisibleTo or similar in order to test those helpers. I’m not very concerned about that on its own (though I’d like to keep the helpers private), but it always smells of testing implementation details, which, as I argue, is what I think we’re doing. (One could alternatively make the helpers public – they’re pure, after all, so presumably no harm done – but that would expose a public API that no-one should actually use, and doesn’t avoid the main problem anyway.)

As a motivating example from my world, consider a system for sending email notifications. The root function accepts a list of notifications that should be sent, together with any auxiliary data (names and other data from all users referenced by the notifications; translated strings; any environment-specific data such as base URLs for links; etc.), and returns the email HTML (or at least a structure that maps trivially to HTML). In doing this, the code has to group notifications in several levels, sort them in various ways, merge similar consecutive notifications in non-trivial ways, hide notifications that the user has not asked to receive (but which must still be passed to the root function since they are needed for other logic), and so on. All in all, I have almost 600 lines of pure code that does this. (In addition, I have 150 lines that fetches everything from the DB and creates necessary lookup maps of auxiliary data to pass to the root function. I consider this code “too boring to fail”.)

The pure part of the code was recently significantly revamped. Had I had tests for private/internal helpers, the refactor would likely have been much more painful.

I expect there is no perfect way to make the code both testable and easy to refactor. But I am still eager to hear your thoughts on my concern: Following your suggestion, aren’t we testing implementation details?

2023-03-13 14:00 UTC

Christer, thank you for writing. The short answer is: yes.

Isn't this a separate problem, though? If you're using Stubs and Spies to test interaction, and other tests to verify your implementations, then isn't that a similar problem?

I'm going to graze this topic in the future article in this series tentatively titled Refactoring pure function composition without breaking existing tests, but I should probably write another article more specifically about this topic...

2023-03-14 9:06 UTC

Christer (and everyone who may be interested), I've long been wanting to expand on my previous answer, and I finally found the time to write an article that discusses the implementation-detail question.

Apart from that I also want to remind readers that the article Refactoring pure function composition without breaking existing tests has been available since May 1st, 2023. It shows one example of using the Strangler pattern to refactor pure Facade Tests without breaking them.

This doesn't imply that one should irresponsibly make every pure function public. These days, I make things internal by default, but make them public if I think they'd be good seams. Particularly when following test-driven development, it's possible to unit test private helpers via a public API. This does, indeed, have the benefit that you're free to refactor those helpers without impacting test code.

The point of this article series isn't that you should make pure functions public and test interactions with property-based testing. The point is that if you already have pure functions and you wish to test how they interact, then property-based testing is a good way to achieve that goal.

If, on the other hand, you have a pure function for composing emails, and you can keep all helper functions private, still cover it enough to be confident that it works, and do that by only exercising a single public root function (mail-slot testing) then that's preferable. That's what I would aim for as well.

2023-06-19 7:04 UTC

Warnings-as-errors friction

Monday, 06 March 2023 07:17:00 UTC

TDD friction. Surely that's a bad thing(?)

Paul Wilson recently wrote on Mastodon:

Software development opinion (warnings as errors)

Just seen this via Elixir Radar, https://curiosum.com/til/warnings-as-errors-elixir-mix-compile on on treating warnings as errors, and yeah don't integrate code with warnings. But ....

Having worked on projects with this switched on in dev, it's an annoying bit of friction when Test Driving code. Yes, have it switched on in CI, but don't make me fix all the warnings before I can run my failing test.

(Using an env variable for the switch is a good compromise here, imo).

This made me reflect on similar experiences I've had. I thought perhaps I should write them down.

To be clear, this article is not an attack on Paul Wilson. He's right, but since he got me thinking, I only find it honest and respectful to acknowledge that.

The remark does, I think, invite more reflection.

Test friction example #

An example would be handy right about now.

As I was writing the example code base for Code That Fits in Your Head, I was following the advice of the book:

  • Turn on Nullable reference types (only relevant for C#)
  • Turn on static code analysis or linters
  • Treat warnings as errors. Yes, also the warnings produced by the two above steps

As Paul Wilson points out, this tends to create friction with test-driven development (TDD). When I started the code base, this was the first TDD test I wrote:

[Fact]
public async Task PostValidReservation()
{
    var response = await PostReservation(new {
        date = "2023-03-10 19:00",
        email = "katinka@example.com",
        name = "Katinka Ingabogovinanana",
        quantity = 2 });
 
    Assert.True(
        response.IsSuccessStatusCode,
        $"Actual status code: {response.StatusCode}.");
}

Looks good so far, doesn't it? Are any of the warnings-as-errors settings causing friction? Not directly, but now regard the PostReservation helper method:

[SuppressMessage(
    "Usage",
    "CA2234:Pass system uri objects instead of strings",
    Justification = "URL isn't passed as variable, but as literal.")]
private async Task<HttpResponseMessage> PostReservation(
    object reservation)
{
    using var factory = new WebApplicationFactory<Startup>();
    var client = factory.CreateClient();
 
    string json = JsonSerializer.Serialize(reservation);
    using var content = new StringContent(json);
    content.Headers.ContentType.MediaType = "application/json";
    return await client.PostAsync("reservations", content);
}

Notice the [SuppressMessage] attribute. Without it, the compiler emits this error:

error CA2234: Modify 'ReservationsTests.PostReservation(object)' to call 'HttpClient.PostAsync(Uri, HttpContent)' instead of 'HttpClient.PostAsync(string, HttpContent)'.

That's an example of friction in TDD. I could have fixed the problem by changing the last line to:

return await client.PostAsync(new Uri("reservations", UriKind.Relative), content);

This makes the actual code more obscure, which is the reason I didn't like that option. Instead, I chose to add the [SuppressMessage] attribute and write a Justification. It is, perhaps, not much of an explanation, but my position is that, in general, I consider CA2234 a good and proper rule. It's a specific example of favouring stronger types over stringly typed code. I'm all for it.

If you grok the motivation for the rule (which, evidently, the documentation code-example writer didn't) you also know when to safely ignore it. Types are useful because they enable you to encapsulate knowledge and guarantees about data in a way that strings and ints typically don't. Indeed, if you are passing URLs around, pass them around as Uri objects rather than strings. This prevents simple bugs, such as accidentally swapping the place of two variables because they're both strings.

In the above example, however, a URL isn't being passed around as a variable. The value is hard-coded right there in the code. Wrapping it in a Uri object doesn't change that.
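To illustrate the kind of bug the rule does guard against when URLs are passed around as variables, consider this hypothetical sketch:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class UriVersusString
{
    private static readonly HttpClient Client = new HttpClient();

    // With two string parameters, swapped arguments still compile:
    // PostJson(json, address) is wrong, but the compiler can't tell.
    public static Task<HttpResponseMessage> PostJson(string address, string json)
    {
        return Client.PostAsync(address, new StringContent(json));
    }

    // With a Uri parameter, the same mistake is a compile error.
    public static Task<HttpResponseMessage> PostJson(Uri address, string json)
    {
        return Client.PostAsync(address, new StringContent(json));
    }
}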

But I digress...

This is an example of friction in TDD. Instead of being able to just plough through, I had to stop and deal with a Code Analysis rule.

SUT friction example #

But wait! There's more.

To pass the test, I had to add this class:

    [Route("[controller]")]
    public class ReservationsController
    {
#pragma warning disable CA1822 // Mark members as static
        public void Post() { }
#pragma warning restore CA1822 // Mark members as static
    }

I had to suppress CA1822 as well, because it generated this error:

error CA1822: Member Post does not access instance data and can be marked as static (Shared in VisualBasic)

Keep in mind that because of my settings, it's an error. The code doesn't compile.

You can try to fix it by making the method static, but this then triggers another error:

error CA1052: Type 'ReservationsController' is a static holder type but is neither static nor NotInheritable

In other words, the class should be static as well:

[Route("[controller]")]
public static class ReservationsController
{
    public static void Post() { }
}

This compiles. What's not to like? Those Code Analysis rules are there for a reason, aren't they? Yes, but they are general rules that can't predict every corner case. While the code compiles, the test fails.

Out of the box, that's just not how that version of ASP.NET works. The MVC model of ASP.NET expects action methods to be instance members.

(I'm sure that there's a way to tweak ASP.NET so that it allows static HTTP handlers as well, but I wasn't interested in researching that option. After all, the above code only represents an interim stage during a longer TDD session. Subsequent tests would prompt me to give the Post method some proper behaviour that would make it an instance method anyway.)

So I kept the method as an instance method and suppressed the Code Analysis rule.

Friction? Demonstrably.

Opt in #

Is there a way to avoid the friction? Paul Wilson mentions a couple of options: Using an environment variable, or only turning warnings into errors in your deployment pipeline. A variation on using an environment variable is to only turn on errors for Release builds (for languages where that distinction exists).
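In a .NET project, for example, the Release-only variation might look like this in the project file (a sketch; adapt it to your own build setup):

<!-- Treat warnings as errors only in Release builds,
     so local TDD iterations in Debug stay frictionless. -->
<PropertyGroup Condition="'$(Configuration)' == 'Release'">
  <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
</PropertyGroup>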

In general, if you have a useful tool that unfortunately takes a long time to run, making it a scheduled or opt-in tool may be the way to go. A mutation testing tool like Stryker can easily run for hours, so it's not something you want to do for every change you make.

Another example is dependency analysis. One of my recent clients had a tool that scanned their code dependencies (NuGet, npm) for versions with known vulnerabilities. This tool would also take its time before delivering a verdict.

Making tools opt-in is definitely an option.

You may be concerned that this requires discipline that perhaps not all developers have. If a tool is opt-in, will anyone remember to run it?

As I also describe in Code That Fits in Your Head, you could address that issue with a checklist.

Yeah, but do we then need a checklist to remind us to look at the checklist? Right, quis custodiet ipsos custodes? Is it going to be turtles all the way down?

Well, if no-one in your organisation can be trusted to follow any commonly-agreed-on rules on a regular basis, you're in trouble anyway.

Good friction? #

So far, I've spent some time describing the problem. When encountering resistance your natural reaction is to find it disagreeable. You want to accomplish something, and then this rule/technique/tool gets in the way!

Despite this, is it possible that this particular kind of friction is beneficial?

By (subconsciously, I'm sure) picking a word like 'friction', you've already chosen sides. That word, in general, has a negative connotation. Is it the only word that describes the situation? What if we talked about it instead in terms of safety, assistance, or predictability?

Ironically, friction was a main complaint about TDD when it was first introduced.

"What do you mean? I have to write a test before I write the implementation? That's going to slow me down!"

The TDD and agile movement developed a whole set of standard responses to such objections. Brakes enable you to go faster. If it hurts, do it more often.

Try those on for size, only now applied to warnings as errors. Friction is what makes brakes work.

Additive mindset #

As I age, I'm becoming increasingly aware of a tendency in the software industry. Let's call it the additive mindset.

It's a reflex to consider addition a good thing. An API with a wide array of options is better than a narrow API. Software with more features is better than software with few features. More log data provides better insight.

More code is better than less code.

Obviously, that's not true, but we. keep. behaving as though it is. Just look at the recent hubbub about ChatGPT, or GitHub Copilot, which I recently wrote about. Everyone reflexively views them as productivity tools because they can help us produce more code faster.

I had a cup of coffee with my wife as I took a break from writing this article, and I told her about it. Her immediate reaction when told about friction is that it's a benefit. She's a doctor, and naturally views procedure, practice, regulation, etcetera as occasionally annoying, but essential to the practice of medicine. Without procedures, patients would die from preventable mistakes and doctors would prescribe morphine to themselves. Checking boxes and signing off on decisions slow you down, and that's half the point. Making you slow down can give you the opportunity to realise that you're about to do something stupid.

Worried that TDD will slow down your programmers? Don't. They probably need slowing down.

But if TDD is already being touted as a process to make us slow down and think, is it a good idea, then, to slow down TDD with warnings as errors? Are we not interfering with a beneficial and essential process?

Alternatives to TDD #

I don't have a confident answer to that question. What follows is tentative. I've been doing TDD since 2003 and while I was also an early critic, it's still central to how I write code.

When I began doing TDD with all the errors dialled to 11 I was concerned about the friction, too. While I also believe in linters, the two seem to work at cross purposes. The rule about static members in the above example seems clearly counterproductive. After all, a few commits later I'd written enough code for the Post method that it had to be an instance method after all. The degenerate state was temporary, an artefact of the TDD process, but the rule triggered anyway.

What should I think of that?

I don't like having to deal with such false positives. The question is whether treating warnings as errors is a net positive or a net negative.

It may help to recall why TDD is a useful practice. A major reason is that it provides rapid feedback. There are, however, other ways to produce rapid feedback. Static types, compiler warnings, and static code analysis are other ways.

I don't think of these as alternatives to TDD, but rather as complementary. Tests can produce feedback about some implementation details. Constructive data is another option. Compiler warnings and linters enter that mix as well.

Here I again speak with some hesitation, but it looks to me as though the TDD practice originated in a dynamically typed tradition (Smalltalk), and even though some Java programmers were early adopters as well, from my perspective it's always looked stronger among the dynamic languages than among the compiled languages. The unadulterated TDD tradition still seems to largely ignore the existence of other forms of feedback. Everything must be tested.

At the risk of repeating myself, I find TDD invaluable, but I'm happy to receive rapid feedback from heterogeneous sources: Tests, type checkers, compilers, linters, fellow ensemble programmers.

This suggests that TDD isn't the only game in town. This may also imply that the friction to TDD caused by treating warnings as errors may not be as costly as first perceived. After all, slowing down something that you rely on 75% of the time isn't quite as bad as slowing down something you rely on 100% of the time.

While it's a cost, perhaps it went down...

Simplicity #

As always, circumstances matter. Is it always a good idea to treat warnings as errors?

Not really. To be honest, treating warnings as errors is another case of treating a symptom. The reason I recommend it is that I've seen enough code bases where compiler warnings (not errors) have accumulated. In a setting where that happens, treating (new) warnings as errors can help get the situation under control.

When I work alone, I don't allow warnings to build up. I rarely tell the compiler to treat warnings as errors in my personal code bases. There's no need. I have zero tolerance for compiler warnings, and I do spot them.

If you have a team that never allows compiler warnings to accumulate, is there any reason to treat them as errors? Probably not.

This underlines an important point about productivity: A good team without strict process can outperform a poor team with a clearly defined process. Mindset beats tooling. Sometimes.

Which mindset is that? Not the additive mindset. Rather, I believe in focusing on simplicity. The alternative to adding things isn't to blindly remove things. You can't add features to a program only by deleting code. Rather, add code, but keep it simple. Decouple to delete.

perfection is attained not when there is nothing more to add, but when there is nothing more to remove.

Simple code. Simple tests. Be warned, however, that code simplicity does not imply naive code understandable by everyone. I'll refer you to Rich Hickey's wonderful talk Simple Made Easy and remind you that this was the line of thinking that led to Clojure.

Along the same lines, I tend to consider Haskell to be a vehicle for expressing my thoughts in a simpler way than I can do in F#, which again enables simplicity not available in C#. Simpler, not easier.

Conclusion #

Does treating warnings as errors imply TDD friction? It certainly looks that way.

Is it worth it, nonetheless? Possibly. It depends on why you need to turn warnings into errors in the first place. In some settings, the benefits of treating warnings as errors may be greater than the cost. If that's the only way you can keep compiler warnings down, then do treat warnings as errors. Such a situation, however, is likely to be a symptom of a more fundamental mindset problem.

This almost sounds like a moral judgement, I realise, but that's not my intent. Mindset is shaped by personal preference, but also by organisational and peer pressure, as well as knowledge. If you only know of one way to achieve a goal, you have no choice. Only if you know of more than one way can you choose.

Choose the way that leaves the code simpler than the other.

