Refactoring pure function composition without breaking existing tests by Mark Seemann
An example modifying a Haskell Gossiping Bus Drivers implementation.
This is an article in an series of articles about the epistemology of interaction testing. In short, this collection of articles discusses how to test the composition of pure functions. While a pure function is intrinsically testable, how do you test the composition of pure functions? As the introductory article outlines, I consider it mostly a matter of establishing confidence. With enough test coverage you can be confident that the composition produces the desired outputs.
Keep in mind that if you compose pure functions into a larger pure function, the composition is still pure. This implies that you can still test it by supplying input and verifying that the output is correct.
Tests that exercise the composition do so by verifying observable behaviour. This makes them more robust to refactoring. You'll see an example of that later in this article.
Gossiping bus drivers #
I recently did the Gossiping Bus Drivers kata in Haskell. At first, I added the tests suggested in the kata description.
{-# OPTIONS_GHC -Wno-type-defaults #-} module Main where import GossipingBusDrivers import Test.HUnit import Test.Framework.Providers.HUnit (hUnitTestToTests) import Test.Framework (defaultMain) main :: IO () main = defaultMain $ hUnitTestToTests $ TestList [ "Kata examples" ~: do (routes, expected) <- [ ([[3, 1, 2, 3], [3, 2, 3, 1], [4, 2, 3, 4, 5]], Just 5), ([[2, 1, 2], [5, 2, 8]], Nothing) ] let actual = drive routes return $ expected ~=? actual ]
As I prefer them, these tests are parametrised HUnit tests.
The problem with those suggested test cases is that they don't provide enough confidence that an implementation is correct. In fact, I wrote this implementation to pass them:
drive routes = if length routes == 3 then Just 5 else Nothing
This is clearly incorrect. It just looks at the number of routes and returns a fixed value for each count. It doesn't look at the contents of the routes.
Even if you don't try to deliberately cheat I'm not convinced that these two tests are enough. You could try to write the correct implementation, but how do you know that you've correctly dealt with various edge cases?
Helper function #
The kata description isn't hard to understand, so while the suggested test cases seem insufficient, I knew what was required. Perhaps I could write a proper implementation without additional tests. After all, I was convinced that it'd be possible to do it with a cyclomatic complexity of 1, and since a test function also has a cyclomatic complexity of 1, there's always that tension in test-driven development: Why write test code to exercise code with a cyclomatic complexity of 1?.
To be clear: There are often good reasons to write tests even in this case, and this seems like one of them. Cyclomatic complexity indicates a minimum number of test cases, not necessarily a sufficient number.
Even though Haskell's type system is expressive, I soon found myself second-guessing the behaviour of various expressions that I'd experimented with. Sometimes I find GHCi (the Haskell REPL) sufficiently edifying, but in this case I thought that I might want to keep some test cases around for a helper function that I was developing:
import Data.List import qualified Data.Map.Strict as Map import Data.Map.Strict ((!)) import qualified Data.Set as Set import Data.Set (Set) evaluateStop :: (Functor f, Foldable f, Ord k, Ord a) => f (k, Set a) -> f (k, Set a) evaluateStop stopsAndDrivers = let gossip (stop, driver) = Map.insertWith Set.union stop driver gossipAtStops = foldl' (flip gossip) Map.empty stopsAndDrivers in fmap (\(stop, _) -> (stop, gossipAtStops ! stop)) stopsAndDrivers
I was fairly confident that this function worked as I intended, but I wanted to be sure. I needed some examples, so I added these tests:
"evaluateStop examples" ~: do (stopsAndDrivers, expected) <- [ ([(1, fromList [1]), (2, fromList [2]), (1, fromList [1])], [(1, fromList [1]), (2, fromList [2]), (1, fromList [1])]), ([(1, fromList [1]), (2, fromList [2]), (1, fromList [2])], [(1, fromList [1, 2]), (2, fromList [2]), (1, fromList [1, 2])]), ([(1, fromList [1, 2, 3]), (1, fromList [2, 3, 4])], [(1, fromList [1, 2, 3, 4]), (1, fromList [1, 2, 3, 4])]) ] let actual = evaluateStop stopsAndDrivers return $ fromList expected ~=? fromList actual
They do, indeed, pass.
The idea behind that evaluateStop
function is to evaluate the state at each 'minute' of the simulation. The first line of each test case is the state before the drivers meet, and the second line is the expected state after all drivers have gossiped.
My plan was to use some sort of left fold to keep evaluating states until all information has disseminated to all drivers.
Property #
Since I have already extolled the virtues of property-based testing in this article series, I wondered whether I could add some properties instead of relying on examples. Well, I did manage to add one QuickCheck property:
testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) -> let actual = drive $ fmap getNonEmpty routes in isJust actual ==> all (\i -> 0 <= i && i <= 480) actual
There's not much to talk about here. The property only states that the result of the drive
function must be between 0
and 480
, if it exists.
Such a property could vacuously pass if drive
always returns Nothing
, so I used the ==>
QuickCheck combinator to make sure that the property is actually exercising only the Just
cases.
Since the drive
function only returns a number, apart from verifying its image I couldn't think of any other general property to add.
You can always come up with more specific properties that explicitly set up more constrained test scenarios, but is it worth it?
It's always worthwhile to stop and think. If you're writing a 'normal' example-based test, consider whether a property would be better. Likewise, if you're about to write a property, consider whether an example would be better.
'Better' can mean more than one thing. Preventing regressions is one thing, but making the code maintainable is another. If you're writing a property that is too complicated, it might be better to write a simpler example-based test.
I could definitely think of some complicated properties, but I found that more examples might make the test code easier to understand.
More examples #
After all that angst and soul-searching, I added a few more examples to the first parametrised test:
"Kata examples" ~: do (routes, expected) <- [ ([[3, 1, 2, 3], [3, 2, 3, 1], [4, 2, 3, 4, 5]], Just 5), ([[2, 1, 2], [5, 2, 8]], Nothing), ([[1, 2, 3, 4, 5], [5, 6, 7, 8], [3, 9, 6]], Just 13), ([[1, 2, 3], [2, 1, 3], [2, 4, 5, 3]], Just 5), ([[1, 2], [2, 1]], Nothing), ([[1]], Just 0), ([[2], [2]], Just 1) ] let actual = drive routes return $ expected ~=? actual
The first two test cases are the same as before, and the last two are some edge cases I added myself. The middle three I adopted from another page about the kata. Since those examples turned out to be off by one, I did those examples on paper to verify that I understood what the expected value was. Then I adjusted them to my one-indexed results.
Drive #
The drive
function now correctly implements the kata, I hope. At least it passes all the tests.
drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b drive routes = -- Each driver starts with a single gossip. Any kind of value will do, as -- long as each is unique. Here I use the one-based index of each route, -- since it fulfills the requirements. let drivers = fmap Set.singleton [1 .. length routes] goal = Set.unions drivers stops = transpose $ fmap (take 480 . cycle) routes propagation = scanl (\ds ss -> snd <$> evaluateStop (zip ss ds)) drivers stops in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation
Haskell code can be information-dense, and if you don't have an integrated development environment (IDE) around, this may be hard to read.
drivers
is a list of sets. Each set represents the gossip that a driver knows. At the beginning, each only knows one piece of gossip. The expression initialises each driver with a singleton
set. Each piece of gossip is represented by a number, simply going from 1
to the number of routes. Incidentally, this is also the number of drivers, so you can consider the number 1
as a placeholder for the gossip that driver 1 knows, and so on.
The goal
is the union of all the gossip. Once every driver's knowledge is equal to the goal
the simulation can stop.
Since evaluateStop
simulates one stop, the drive
function needs a list of stops to fold. That's the stops
value. In the very first example, you have three routes: [3, 1, 2, 3]
, [3, 2, 3, 1]
, and [4, 2, 3, 4, 5]
. The first time the drivers stop (after one minute), the stops are 3
, 3
, and 4
. That is, the first element in stops
would be the list [3, 3, 4]
. The next one would be [1, 2, 2]
, then [2, 3, 3]
, and so on.
My plan all along was to use some sort of left fold to repeatedly run evaluateStop
over each minute. Since I need to produce a list of states, scanl
was an appropriate choice. The lambda expression that I have to pass to it, though, is more complicated than I appreciate. We'll return to that in a moment.
The drive
function can now index the propagation
list by zipping it with the infinite list [0 ..]
, find
the first element where all
sets are equal to the goal
set, and then return that index. That produces the correct results.
The need for a better helper function #
As I already warned, I wasn't happy with the lambda expression passed to scanl
. It looks complicated and arcane. Is there a better way to express the same behaviour? Usually, when confronted with a nasty lambda expression like that, in Haskell my first instinct is to see if pointfree.io has a better option. Alas, (((snd <$>) . evaluateStop) .) . flip zip
hardly seems an improvement. That flip zip
expression to the right, however, suggests that it might help flipping the arguments to evaluateStop
.
When I developed the evaluateStop
helper function, I found it intuitive to define it over a list of tuples, where the first element in the tuple is the stop, and the second element is the set of gossip that the driver at that stop knows.
The tuples don't have to be in that order, though. Perhaps if I flip the tuples that would make the lambda expression more readable. It was worth a try.
Confidence #
Since this article is part of a small series about the epistemology of testing composed functions, let's take a moment to reflect on the confidence we may have in the drive
function.
Keep in mind the goal of the kata: Calculate the number of minutes it takes for all gossip to spread to all drivers. There's a few tests that verify that; seven examples and a fairly vacuous QuickCheck property. Is that enough to be confident that the function is correct?
If it isn't, I think the best option you have is to add more examples. For the sake of argument, however, let's assume that the tests are good enough.
When summarising the tests that cover the drive
function, I didn't count the three examples that exercise evaluateStop
. Do these three test cases improve your confidence in the drive
function? A bit, perhaps, but keep in mind that the kata description doesn't mandate that function. It's just a helper function I created in order to decompose the problem.
Granted, having tests that cover a helper function does, to a degree, increase my confidence in the code. I have confidence in the function itself, but that is largely irrelevant, because the problem I'm trying to solve is not implementing this particular function. On the other hand, my confidence in evaluateStop
means that I have increased confidence in the code that calls it.
Compared to interaction-based testing, I'm not testing that drive
calls evaluateStop
, but I can still verify that this happens. I can just look at the code.
The composition is already there in the code. What do I gain from replicating that composition with Stubs and Spies?
It's not a breaking change if I decide to implement drive
in a different way.
What gives me confidence when composing pure functions isn't that I've subjected the composition to an interaction-based test. Rather, it's that the function is composed from trustworthy components.
Strangler #
My main grievance with Stubs and Spies is that they break encapsulation. This may sound abstract, but is a real problem. This is the underlying reason that so many tests break when you refactor code.
This example code base, as other functional code that I write, avoids interaction-based testing. This makes it easier to refactor the code, as I will now demonstrate.
My goal is to change the evaluateStop
helper function by flipping the tuples. If I just edit it, however, I'm going to (temporarily) break the drive
function.
Katas typically result in small code bases where you can get away with a lot of bad practices that wouldn't work in a larger code base. To be honest, the refactoring I have in mind can be completed in a few minutes with a brute-force approach. Imagine, however, that we can't break compatibility of the evaluateStop
function for the time being. Perhaps, had we had a larger code base, there were other code that depended on this function. At the very least, the tests do.
Instead of brute-force changing the function, I'm going to make use of the Strangler pattern, as I've also described in my book Code That Fits in Your Head.
Leave the existing function alone, and add a new one. You can typically copy and paste the existing code and then make the necessary changes. In that way, you break neither client code nor tests, because there are none.
evaluateStop' :: (Functor f, Foldable f, Ord k, Ord a) => f (Set a, k) -> f (Set a, k) evaluateStop' driversAndStops = let gossip (driver, stop) = Map.insertWith Set.union stop driver gossipAtStops = foldl' (flip gossip) Map.empty driversAndStops in fmap (\(_, stop) -> (gossipAtStops ! stop, stop)) driversAndStops
In a language like C# you can often get away with overloading a method name, but Haskell doesn't have overloading. Since I consider this side-by-side situation to be temporary, I've appended a prime after the function name. This is a fairly normal convention in Haskell, I gather.
The only change this function represents is that I've swapped the tuple order.
Once you've added the new function, you may want to copy, paste and edit the tests. Or perhaps you want to do the tests first. During this process, make micro-commits so that you can easily suspend your 'refactoring' activity if something more important comes up.
Once everything is in place, you can change the drive
function:
drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b drive routes = -- Each driver starts with a single gossip. Any kind of value will do, as -- long as each is unique. Here I use the one-based index of each route, -- since it fulfills the requirements. let drivers = fmap Set.singleton [1 .. length routes] goal = Set.unions drivers stops = transpose $ fmap (take 480 . cycle) routes propagation = scanl (\ds ss -> fst <$> evaluateStop' (zip ds ss)) drivers stops in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation
Notice that the type of drive
hasn't change, and neither has the behaviour. This means that although I've changed the composition (the interaction) no tests broke.
Finally, once I moved all code over, I deleted the old function and renamed the new one to take its place.
Was it all worth it? #
At first glance, it doesn't look as though much was gained. What happens if I eta-reduce the new lambda expression?
drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b drive routes = -- Each driver starts with a single gossip. Any kind of value will do, as -- long as each is unique. Here I use the one-based index of each route, -- since it fulfills the requirements. let drivers = fmap Set.singleton [1 .. length routes] goal = Set.unions drivers stops = transpose $ fmap (take 480 . cycle) routes propagation = scanl (((fmap fst . evaluateStop) .) . zip) drivers stops in fmap fst $ find (all (== goal) . snd) $ zip [0 ..] propagation
Not much better. I can now fit the propagation
expression on a single line of code and still stay within a 80x24 box, but that's about it. Is ((fmap fst . evaluateStop) .) . zip
more readable than what we had before?
Hardly, I admit. I might consider reverting, and since I've been using Git tactically, I have that option.
If I hadn't tried, though, I wouldn't have known.
Conclusion #
When composing one pure function with another, how can you test that the outer function correctly calls the inner function?
By the same way that you test any other pure function. The only way you can observe whether a pure function works as intended is to compare its actual output to the output you expect its input to produce. How it arrives at that output is irrelevant. It could be looking up all results in a big table. As long as the result is correct, the function is correct.
In this article, you saw an example of how to test a composed function, as well as how to refactor it without breaking tests.
Next: When is an implementation detail an implementation detail?