Refactoring pure function composition without breaking existing tests

An example modifying a Haskell Gossiping Bus Drivers implementation.

This is an article in an series of articles about the epistemology of interaction testing. In short, this collection of articles discusses how to test the composition of pure functions. While a pure function is intrinsically testable, how do you test the composition of pure functions? As the introductory article outlines, I consider it mostly a matter of establishing confidence. With enough test coverage you can be confident that the composition produces the desired outputs.

Keep in mind that if you compose pure functions into a larger pure function, the composition is still pure. This implies that you can still test it by supplying input and verifying that the output is correct.

Tests that exercise the composition do so by verifying observable behaviour. This makes them more robust to refactoring. You'll see an example of that later in this article.

Gossiping bus drivers #

I recently did the Gossiping Bus Drivers kata in Haskell. At first, I added the tests suggested in the kata description.

{-# OPTIONS_GHC -Wno-type-defaults #-}
module Main where
 
import GossipingBusDrivers
import Test.HUnit
import Test.Framework.Providers.HUnit (hUnitTestToTests)
import Test.Framework (defaultMain)
 
main :: IO ()
main = defaultMain $ hUnitTestToTests $ TestList [
  "Kata examples" ~: do
    (routes, expected) <-
      [
        ([[3, 1, 2, 3],
          [3, 2, 3, 1],
          [4, 2, 3, 4, 5]],
         Just 5),
        ([[2, 1, 2],
          [5, 2, 8]],
         Nothing)
      ]
    let actual = drive routes
    return $ expected ~=? actual
  ]

As I prefer them, these tests are parametrised HUnit tests.

The problem with those suggested test cases is that they don't provide enough confidence that an implementation is correct. In fact, I wrote this implementation to pass them:

drive routes = if length routes == 3 then Just 5 else Nothing

This is clearly incorrect. It just looks at the number of routes and returns a fixed value for each count. It doesn't look at the contents of the routes.

Even if you don't try to deliberately cheat I'm not convinced that these two tests are enough. You could try to write the correct implementation, but how do you know that you've correctly dealt with various edge cases?

Helper function #

The kata description isn't hard to understand, so while the suggested test cases seem insufficient, I knew what was required. Perhaps I could write a proper implementation without additional tests. After all, I was convinced that it'd be possible to do it with a cyclomatic complexity of 1, and since a test function also has a cyclomatic complexity of 1, there's always that tension in test-driven development: Why write test code to exercise code with a cyclomatic complexity of 1?.

To be clear: There are often good reasons to write tests even in this case, and this seems like one of them. Cyclomatic complexity indicates a minimum number of test cases, not necessarily a sufficient number.

Even though Haskell's type system is expressive, I soon found myself second-guessing the behaviour of various expressions that I'd experimented with. Sometimes I find GHCi (the Haskell REPL) sufficiently edifying, but in this case I thought that I might want to keep some test cases around for a helper function that I was developing:

import Data.List
import qualified Data.Map.Strict as Map
import Data.Map.Strict ((!))
import qualified Data.Set as Set
import Data.Set (Set)
 
evaluateStop :: (Functor f, Foldable f, Ord k, Ord a)
             => f (k, Set a) -> f (k, Set a)
evaluateStop stopsAndDrivers =
  let gossip (stop, driver) = Map.insertWith Set.union stop driver
      gossipAtStops = foldl' (flip gossip) Map.empty stopsAndDrivers
  in fmap (\(stop, _) -> (stop, gossipAtStops ! stop)) stopsAndDrivers

I was fairly confident that this function worked as I intended, but I wanted to be sure. I needed some examples, so I added these tests:

"evaluateStop examples" ~: do
  (stopsAndDrivers, expected) <- [
      ([(1, fromList [1]), (2, fromList [2]), (1, fromList [1])],
       [(1, fromList [1]), (2, fromList [2]), (1, fromList [1])]),
      ([(1, fromList [1]), (2, fromList [2]), (1, fromList [2])],
       [(1, fromList [1, 2]), (2, fromList [2]), (1, fromList [1, 2])]),
      ([(1, fromList [1, 2, 3]), (1, fromList [2, 3, 4])],
       [(1, fromList [1, 2, 3, 4]), (1, fromList [1, 2, 3, 4])])
    ]
  let actual = evaluateStop stopsAndDrivers
  return $ fromList expected ~=? fromList actual

They do, indeed, pass.

The idea behind that evaluateStop function is to evaluate the state at each 'minute' of the simulation. The first line of each test case is the state before the drivers meet, and the second line is the expected state after all drivers have gossiped.

My plan was to use some sort of left fold to keep evaluating states until all information has disseminated to all drivers.

Property #

Since I have already extolled the virtues of property-based testing in this article series, I wondered whether I could add some properties instead of relying on examples. Well, I did manage to add one QuickCheck property:

testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) ->
  let actual = drive $ fmap getNonEmpty routes
  in isJust actual ==>
     all (\i -> 0 <= i && i <= 480) actual

There's not much to talk about here. The property only states that the result of the drive function must be between 0 and 480, if it exists.

Such a property could vacuously pass if drive always returns Nothing, so I used the ==> QuickCheck combinator to make sure that the property is actually exercising only the Just cases.

Since the drive function only returns a number, apart from verifying its image I couldn't think of any other general property to add.

You can always come up with more specific properties that explicitly set up more constrained test scenarios, but is it worth it?

It's always worthwhile to stop and think. If you're writing a 'normal' example-based test, consider whether a property would be better. Likewise, if you're about to write a property, consider whether an example would be better.

'Better' can mean more than one thing. Preventing regressions is one thing, but making the code maintainable is another. If you're writing a property that is too complicated, it might be better to write a simpler example-based test.

I could definitely think of some complicated properties, but I found that more examples might make the test code easier to understand.

More examples #

After all that angst and soul-searching, I added a few more examples to the first parametrised test:

"Kata examples" ~: do
  (routes, expected) <-
    [
      ([[3, 1, 2, 3],
        [3, 2, 3, 1],
        [4, 2, 3, 4, 5]],
       Just 5),
      ([[2, 1, 2],
        [5, 2, 8]],
       Nothing),
      ([[1, 2, 3, 4, 5],
        [5, 6, 7, 8],
        [3, 9, 6]],
       Just 13),
      ([[1, 2, 3],
        [2, 1, 3],
        [2, 4, 5, 3]],
       Just 5),
      ([[1, 2],
        [2, 1]],
       Nothing),
      ([[1]],
       Just 0),
      ([[2],
        [2]],
       Just 1)
    ]
  let actual = drive routes
  return $ expected ~=? actual

The first two test cases are the same as before, and the last two are some edge cases I added myself. The middle three I adopted from another page about the kata. Since those examples turned out to be off by one, I did those examples on paper to verify that I understood what the expected value was. Then I adjusted them to my one-indexed results.

Drive #

The drive function now correctly implements the kata, I hope. At least it passes all the tests.

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
drive routes =
      -- Each driver starts with a single gossip. Any kind of value will do, as
      -- long as each is unique. Here I use the one-based index of each route,
      -- since it fulfills the requirements.
  let drivers = fmap Set.singleton [1 .. length routes]
      goal = Set.unions drivers
      stops = transpose $ fmap (take 480 . cycle) routes
      propagation =
        scanl (\ds ss -> snd <$> evaluateStop (zip ss ds)) drivers stops
  in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation

Haskell code can be information-dense, and if you don't have an integrated development environment (IDE) around, this may be hard to read.

drivers is a list of sets. Each set represents the gossip that a driver knows. At the beginning, each only knows one piece of gossip. The expression initialises each driver with a singleton set. Each piece of gossip is represented by a number, simply going from 1 to the number of routes. Incidentally, this is also the number of drivers, so you can consider the number 1 as a placeholder for the gossip that driver 1 knows, and so on.

The goal is the union of all the gossip. Once every driver's knowledge is equal to the goal the simulation can stop.

Since evaluateStop simulates one stop, the drive function needs a list of stops to fold. That's the stops value. In the very first example, you have three routes: [3, 1, 2, 3], [3, 2, 3, 1], and [4, 2, 3, 4, 5]. The first time the drivers stop (after one minute), the stops are 3, 3, and 4. That is, the first element in stops would be the list [3, 3, 4]. The next one would be [1, 2, 2], then [2, 3, 3], and so on.

My plan all along was to use some sort of left fold to repeatedly run evaluateStop over each minute. Since I need to produce a list of states, scanl was an appropriate choice. The lambda expression that I have to pass to it, though, is more complicated than I appreciate. We'll return to that in a moment.

The drive function can now index the propagation list by zipping it with the infinite list [0 ..], find the first element where all sets are equal to the goal set, and then return that index. That produces the correct results.

The need for a better helper function #

As I already warned, I wasn't happy with the lambda expression passed to scanl. It looks complicated and arcane. Is there a better way to express the same behaviour? Usually, when confronted with a nasty lambda expression like that, in Haskell my first instinct is to see if pointfree.io has a better option. Alas, (((snd <$>) . evaluateStop) .) . flip zip hardly seems an improvement. That flip zip expression to the right, however, suggests that it might help flipping the arguments to evaluateStop.

When I developed the evaluateStop helper function, I found it intuitive to define it over a list of tuples, where the first element in the tuple is the stop, and the second element is the set of gossip that the driver at that stop knows.

The tuples don't have to be in that order, though. Perhaps if I flip the tuples that would make the lambda expression more readable. It was worth a try.

Confidence #

Since this article is part of a small series about the epistemology of testing composed functions, let's take a moment to reflect on the confidence we may have in the drive function.

Keep in mind the goal of the kata: Calculate the number of minutes it takes for all gossip to spread to all drivers. There's a few tests that verify that; seven examples and a fairly vacuous QuickCheck property. Is that enough to be confident that the function is correct?

If it isn't, I think the best option you have is to add more examples. For the sake of argument, however, let's assume that the tests are good enough.

When summarising the tests that cover the drive function, I didn't count the three examples that exercise evaluateStop. Do these three test cases improve your confidence in the drive function? A bit, perhaps, but keep in mind that the kata description doesn't mandate that function. It's just a helper function I created in order to decompose the problem.

Granted, having tests that cover a helper function does, to a degree, increase my confidence in the code. I have confidence in the function itself, but that is largely irrelevant, because the problem I'm trying to solve is not implementing this particular function. On the other hand, my confidence in evaluateStop means that I have increased confidence in the code that calls it.

Compared to interaction-based testing, I'm not testing that drive calls evaluateStop, but I can still verify that this happens. I can just look at the code.

The composition is already there in the code. What do I gain from replicating that composition with Stubs and Spies?

It's not a breaking change if I decide to implement drive in a different way.

What gives me confidence when composing pure functions isn't that I've subjected the composition to an interaction-based test. Rather, it's that the function is composed from trustworthy components.

Strangler #

My main grievance with Stubs and Spies is that they break encapsulation. This may sound abstract, but is a real problem. This is the underlying reason that so many tests break when you refactor code.

This example code base, as other functional code that I write, avoids interaction-based testing. This makes it easier to refactor the code, as I will now demonstrate.

My goal is to change the evaluateStop helper function by flipping the tuples. If I just edit it, however, I'm going to (temporarily) break the drive function.

Katas typically result in small code bases where you can get away with a lot of bad practices that wouldn't work in a larger code base. To be honest, the refactoring I have in mind can be completed in a few minutes with a brute-force approach. Imagine, however, that we can't break compatibility of the evaluateStop function for the time being. Perhaps, had we had a larger code base, there were other code that depended on this function. At the very least, the tests do.

Instead of brute-force changing the function, I'm going to make use of the Strangler pattern, as I've also described in my book Code That Fits in Your Head.

Leave the existing function alone, and add a new one. You can typically copy and paste the existing code and then make the necessary changes. In that way, you break neither client code nor tests, because there are none.

evaluateStop' :: (Functor f, Foldable f, Ord k, Ord a)
              => f (Set a, k) -> f (Set a, k)
evaluateStop' driversAndStops =
  let gossip (driver, stop) = Map.insertWith Set.union stop driver
      gossipAtStops = foldl' (flip gossip) Map.empty driversAndStops
  in fmap (\(_, stop) -> (gossipAtStops ! stop, stop)) driversAndStops

In a language like C# you can often get away with overloading a method name, but Haskell doesn't have overloading. Since I consider this side-by-side situation to be temporary, I've appended a prime after the function name. This is a fairly normal convention in Haskell, I gather.

The only change this function represents is that I've swapped the tuple order.

Once you've added the new function, you may want to copy, paste and edit the tests. Or perhaps you want to do the tests first. During this process, make micro-commits so that you can easily suspend your 'refactoring' activity if something more important comes up.

Once everything is in place, you can change the drive function:

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
drive routes =
      -- Each driver starts with a single gossip. Any kind of value will do, as
      -- long as each is unique. Here I use the one-based index of each route,
      -- since it fulfills the requirements.
  let drivers = fmap Set.singleton [1 .. length routes]
      goal = Set.unions drivers
      stops = transpose $ fmap (take 480 . cycle) routes
      propagation =
        scanl (\ds ss -> fst <$> evaluateStop' (zip ds ss)) drivers stops
  in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation

Notice that the type of drive hasn't change, and neither has the behaviour. This means that although I've changed the composition (the interaction) no tests broke.

Finally, once I moved all code over, I deleted the old function and renamed the new one to take its place.

Was it all worth it? #

At first glance, it doesn't look as though much was gained. What happens if I eta-reduce the new lambda expression?

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
drive routes =
      -- Each driver starts with a single gossip. Any kind of value will do, as
      -- long as each is unique. Here I use the one-based index of each route,
      -- since it fulfills the requirements.
  let drivers = fmap Set.singleton [1 .. length routes]
      goal = Set.unions drivers
      stops = transpose $ fmap (take 480 . cycle) routes
      propagation = scanl (((fmap fst . evaluateStop) .) . zip) drivers stops
  in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation

Not much better. I can now fit the propagation expression on a single line of code and still stay within a 80x24 box, but that's about it. Is ((fmap fst . evaluateStop) .) . zip more readable than what we had before?

Hardly, I admit. I might consider reverting, and since I've been using Git tactically, I have that option.

If I hadn't tried, though, I wouldn't have known.

Conclusion #

When composing one pure function with another, how can you test that the outer function correctly calls the inner function?

By the same way that you test any other pure function. The only way you can observe whether a pure function works as intended is to compare its actual output to the output you expect its input to produce. How it arrives at that output is irrelevant. It could be looking up all results in a big table. As long as the result is correct, the function is correct.

In this article, you saw an example of how to test a composed function, as well as how to refactor it without breaking tests.

Next: When is an implementation detail an implementation detail?

Published: Monday, 01 May 2023 06:44:00 UTC

Refactoring pure function composition without breaking existing tests by Mark Seemann