Property-based testing is not the same as partition testing by Mark Seemann
Including an example of property-based testing without much partitioning.
A tweet from Brian Marick induced me to read a paper by Dick Hamlet and Ross Taylor called Partition Testing Does Not Inspire Confidence. In general, I find the conclusion fairly intuitive, but on the other hand hardly an argument against property-based testing.
I'll later return to why I find the conclusion intuitive, but first, I'd like to address the implied connection between partition testing and property-based testing. I'll also show a detailed example.
The source code used in this article is available on GitHub.
Not the same #
The Hamlet & Taylor paper is exclusively concerned with partition testing, which makes sense, since it's from 1990. As far as I'm aware, property-based testing wasn't invented until later.
Brian Marick extends its conclusions to property-based testing:
This seems to imply that property-based testing isn't effective, because (if you accept the paper's conclusions) partition testing isn't effective."I've been a grump about property-based testing because I bought into the conclusion of Hamlet&Taylor's 1990 "Partition testing does not inspire confidence""
There's certainly overlap between partition testing and property-based testing, but it's not complete. Some property-based testing isn't partition testing, or the other way around:
To be fair, the overlap may easily be larger than the figure implies, but you can certainly describes properties without having to partition a function's domain.
In fact, the canonical example of property-based testing (that reversing a list twice yields the original list: reverse (reverse xs) == xs
) does not rely on partitioning. It works for all finite lists.
You may think that this is only because the case is so simple, but that's not the case. You can also avoid partitioning on the slightly more complex problem presented by the Diamond kata. In fact, the domain for that problem is so small that you don't need a property-based framework.
You may argue that the Diamond kata is another toy problem, but I've also solved a realistic, complex business problem with property-based testing without relying on partitioning. Granted, the property shown in that article doesn't sample uniformly from the entire domain of the System Under Test, but the property (there's only one) doesn't rely on partitioning. Instead, it relies on incremental tightening of preconditions to tease out the desired behaviour.
I'll show another example next.
FizzBuzz via partitioning #
When introducing equivalence classes and property-based testing in workshops, I sometimes use the FizzBuzz kata as an example. When I do this, I first introduce the concept of equivalence classes and then proceed to explain how instead of manually picking values from each partition, you can randomly sample from them:
[<Property(QuietOnSuccess = true)>] let ``FizzBuzz.transform returns Buzz`` (number : int) = (number % 5 = 0 && number % 3 <> 0) ==> lazy let actual = FizzBuzz.transform number let expected = "Buzz" expected = actual
(That's F# code, but the rest of the code in this article is going to be Haskell.)
While this gently introduces the concept of testing based on randomly sampling inputs, it relies heavily on partitioning. The above example filters inputs so that it only runs with numbers that are divisible by five, but not by three.
As at least one workshop attendee objected, it's getting perilously close to reproducing the implementation logic in the test. It always hurts when someone directs valid criticism at you, but he was right. That's not a problem with property-based testing, though, but rather with the way I presented it.
We can do better.
FizzBuzz with proper properties #
The trick to 'true' property-based testing is identifying proper properties for the problem being solved. Granted, this can be difficult and often requires some creative thinking (which is also why I find it so enjoyable). Fortunately, certain patterns tend to recur; for example, Scott Wlaschin has a small collection of property-based testing patterns.
As the FizzBuzz kata is described, the domain for a fizzBuzz
function is only the numbers from one to 100. Let's be generous, however, and expand it to all integers, since it makes no practical difference.
In Haskell, for example, we might aim for a function with this API:
fizzBuzz :: (Integral a, Show a) => a -> String
Is it possible to test-drive the correct implementation with QuickCheck without relying on partitioning?
I must admit that I can't figure out how to entirely avoid partitioning, but it's possible to bootstrap the process using only a single partition. If you know of a way to entirely do without partitioning, leave a comment.
FizzBuzz #
In order to anchor the behaviour, we have to describe how at least a single value translates to a string, for example that all multiples of both three and five translate to "FizzBuzz". It might be enough to simply state that a small number like 0
or 15
translates to "FizzBuzz"
, but we might as well exercise that entire partition:
testProperty "Divisible by both 3 and 5" $ \ (seed :: Int) -> let i = seed * 3 * 5 actual = fizzBuzz i in "FizzBuzz" === actual
Here I take any integer seed
and use it to produce an integer i
which is guaranteed to belong to the partition that always produces the output "FizzBuzz"
.
Certainly, this tests only a single partition, but as Johannes Link points out, property-based testing still performs randomised testing within the partition.
The simplest implementation that passes the test is this:
fizzBuzz :: Integral a => a -> String fizzBuzz _ = "FizzBuzz"
From here, however, it's possible to describe the rest of the problem without relying on partition testing.
At least one number in three consecutive values #
How to proceed from there long eluded me. Then it dawned on me that while it's hard to test a single call to the fizzBuzz
function without relying on partitioning, you can examine the output from projecting a small range of inputs to outputs.
What happens if we pick a random number, use it as an origin to enumerate three numbers in total (that is: two more numbers), and then call fizzBuzz
with each of them?
Imagine what happens if we randomly pick 10. In that case, we're going to enumerate three numbers, starting with 10: 10, 11, 12. What's the expected output of applying these three numbers to fizzBuzz
? It's Buzz, 11, Fizz. Let's try a few more, and make a table of it:
i | i+1 | i+2 |
---|---|---|
10 → Buzz | 11 → 11 | 12 → Fizz |
11 → 11 | 12 → Fizz | 13 → 13 |
12 → Fizz | 13 → 13 | 14 → 14 |
13 → 13 | 14 → 14 | 15 → FizzBuzz |
14 → 14 | 15 → FizzBuzz | 16 → 16 |
Do you notice a pattern?
There's more than a single pattern, but one is that there's always at least one number among the three results. Even if you have both a Fizz and a Buzz, as is the case with 10, 11, 12, at least one of the results (11) remains a number. Think about it some more to convince yourself that this should always be the case for three consecutive numbers.
That's a property of fizzBuzz
, and it holds universally (also for negative integers).
You can turn it into a QuickCheck property like this:
testProperty "At least one number in 3 consecutive values" $ \ (i :: Int) -> let range = [i..i+2] actual = fizzBuzz <$> range in counterexample (show range ++ "->" ++ show actual) $ any (\x -> isJust (readMaybe x :: Maybe Int)) actual
This test doesn't rely on input partitioning. It works for all integers.
In this test I used QuickCheck's counterexample
function to provide a helpful message in case of failure. Running the test suite against the above version of fizzBuzz
yields a failure like this:
At least one number in 3 consecutive values: [Failed]
*** Failed! Falsified (after 1 test):
0
[0,1,2]->["FizzBuzz","FizzBuzz","FizzBuzz"]
(used seed -6204080625786338123)
Here we see that the sequence [0,1,2]
produces the output ["FizzBuzz","FizzBuzz","FizzBuzz"]
, which is not only wrong, but is specifically incorrect in the sense that none of the values can be parsed as an integer.
Given the current implementation, that's hardly surprising.
Using the Devil's Advocate technique, I chose to pass both tests with this implementation:
fizzBuzz :: Integral a => a -> String fizzBuzz i = if i `mod` 15 == 0 then "FizzBuzz" else "2112"
The property I just added doesn't check whether the number is one of the input numbers, so the implementation can get away with returning the hard-coded string "2112"
.
At least one Fizz in three consecutive values #
Take a look at the above table. Do you notice any other patterns?
Of each set of three results, there's always a string that starts with Fizz. Sometimes, as we see with the input 15, the output is FizzBuzz, so it's not always just Fizz, but there's always a string that starts with Fizz.
This is another universal property of the fizzBuzz
function, which we can express as a test:
testProperty "At least one Fizz in 3 consecutive values" $ \ (i :: Int) -> let range = [i..i+2] actual = fizzBuzz <$> range in counterexample (show range ++ "->" ++ show actual) $ any ("Fizz" `isPrefixOf`) actual
Again, no partitioning is required to express this property. The arbitrary parameter i
is unconstrained.
To pass all tests, I implemented fizzBuzz
like this:
fizzBuzz :: Integral a => a -> String fizzBuzz i = if i `mod` 3 == 0 then "FizzBuzz" else "2112"
It doesn't look as though much changed, but notice that the modulo check changed from modulo 15 to modulo 3. While incorrect, it passes all tests.
Only one Buzz in five consecutive values #
Using the same reasoning as above, another property emerges. If, instead of looking at sequences of three input arguments, you create five values, only one of them should result in a Buzz result; e.g. 8, 9, 10, 11, 12 should result in 8, Fizz, Buzz, 11, Fizz. Sometimes, however, the Buzz value is FizzBuzz, for example when the origin is 11: 11, 12, 13, 14, 15 should produce 11, Fizz, 13, 14, FizzBuzz.
Like the above property, there's only one Buzz, but sometimes it's part of a compound word. What's clear, though, is that there should be only one result that ends with Buzz.
Not only is the idea similar to the one above, so is the test:
testProperty "Only one Buzz in 5 consecutive values" $ \ (i :: Int) -> let range = [i..i+4] actual = fizzBuzz <$> range in counterexample (show range ++ "->" ++ show actual) $ 1 == length (filter ("Buzz" `isSuffixOf`) actual)
Again, no partitioning is required to express this property.
This version of fizzBuzz
passes all tests:
fizzBuzz :: Integral a => a -> String fizzBuzz i | i `mod` 5 == 0 = "FizzBuzz" fizzBuzz i | i `mod` 3 == 0 = "Fizz" fizzBuzz _ = "2112"
We're not quite there yet, but we're getting closer.
At least one literal Buzz in ten consecutive values #
What's wrong with the above implementation?
It never returns Buzz. How can we express a property that forces it to do that?
We can keep going in the same vein. We know that if we sample a sufficiently large sequence of numbers, it might produce a FizzBuzz value, but even if it does, there's going to be a Buzz value five positions before and after. For example, if the input sequence contains 30 (producing FizzBuzz) then both 25 and 35 should produce Buzz.
How big a range should we sample to be certain that there's at least one Buzz?
If it's not immediately clear, try setting up a table similar to the one above:
i | i+1 | i+2 | i+3 | i+4 | i+5 | i+6 | i+7 | i+8 | i+9 |
---|---|---|---|---|---|---|---|---|---|
10 → Buzz |
11 → 11 |
12 → Fizz |
13 → 13 |
14 → 14 |
15 → FizzBuzz |
16 → 16 |
17 → 17 |
18 → Fizz |
19 → 19 |
11 → 11 |
12 → Fizz |
13 → 13 |
14 → 14 |
15 → FizzBuzz |
16 → 16 |
17 → 17 |
18 → Fizz |
19 → 19 |
20 → Buzz |
17 → 17 |
18 → Fizz |
19 → 19 |
20 → Buzz |
21 → Fizz |
22 → 22 |
23 → 23 |
24 → Fizz |
25 → Buzz |
26 → 26 |
Notice that as one Buzz drops off to the left, a new one appears to the right. Additionally, there may be more than one literal Buzz, but there's always at least one (that is, one that's exactly Buzz, and not just ending in Buzz).
That's another universal property: for any consecutive sequence of numbers of length ten, there's at least one exact Buzz. Here's how to express that as a QuickCheck property:
testProperty "At least one literal Buzz in 10 values" $ \ (i :: Int) -> let range = [i..i+9] actual = fizzBuzz <$> range in counterexample (show range ++ "->" ++ show actual) $ elem "Buzz" actual
Again, no partitioning is required to express this property.
This version of fizzBuzz
passes all tests:
fizzBuzz :: Integral a => a -> String fizzBuzz i | i `mod` 15 == 0 = "FizzBuzz" fizzBuzz i | i `mod` 3 == 0 = "Fizz" fizzBuzz i | i `mod` 5 == 0 = "Buzz" fizzBuzz _ = "2112"
What's left? Only that the number is still hard-coded.
Numbers round-trip #
How to get rid of the hard-coded number?
From one of the above properties, we know that if we pick an arbitrary consecutive sequence of three numbers, at least one of the results will be a string representation of the input number.
It's not guaranteed to be the origin, though. If the origin is, say, 3, the input sequence is 3, 4, 5, which should yield the resulting sequence Fizz, 4, Buzz.
Since we don't know which number(s) will remain, how can we check that it translates correctly? We can use a variation of a common property-based testing pattern - the one that Scott Wlaschin calls There and back again.
We can take any sequence of three outputs and try to parse them back to integers. All successfully parsed numbers must belong to the input sequence.
That's another universal property. Here's how to express that as a QuickCheck property:
testProperty "Numbers round-trip" $ \ (i :: Int) -> let range = [i..i+2] actual = fizzBuzz <$> range numbers = catMaybes $ readMaybe <$> actual in counterexample (show range ++ "->" ++ show actual) $ all (`elem` range) numbers
The parsed numbers
may contain one or two elements, but in both cases, all of them must be an element of range
.
Again, no partitioning is required to express this property.
This version of fizzBuzz
passes all tests:
fizzBuzz :: (Integral a, Show a) => a -> String fizzBuzz i | i `mod` 15 == 0 = "FizzBuzz" fizzBuzz i | i `mod` 3 == 0 = "Fizz" fizzBuzz i | i `mod` 5 == 0 = "Buzz" fizzBuzz i = show i
This looks good. Let's call it a day.
Not so fast, though.
Redundant property? #
With the new round-trip property, isn't the property titled At least one number in 3 consecutive values redundant?
You might think so, but it's not. What happens if we remove it?
If you remove the At least one number in 3 consecutive values property, the Devil's Advocate can corrupt fizzBuzz
like this:
fizzBuzz :: (Integral a, Show a) => a -> String fizzBuzz i | i `mod` 15 == 0 = "FizzBuzz" fizzBuzz i | i `mod` 3 == 0 = "Fizz" fizzBuzz i | i `mod` 5 == 0 = "Buzz" fizzBuzz _ = "Pop"
This passes all tests if you remove At least one number in 3 consecutive values. Why doesn't the Numbers round-trip property fail?
It doesn't fail because with the above implementation of fizzBuzz
its numbers list is always empty. This property doesn't require numbers to be non-empty. It doesn't have to, because that's the job of the At least one number in 3 consecutive values property. Thus, that property isn't redundant. Leave it in.
Intuition behind the paper #
What about the results from the Hamlet & Taylor paper? Are the conclusions in the paper wrong?
They may be, but that's not my take. Rather, the way I understand the paper, it says that partition testing isn't much more efficient at detecting errors than pure random sampling.
I've been using the rather schizophrenic version of the Devil's Advocate technique (the one that I call Gollum style) for so long that this conclusion rings true for me.
Consider a truly adversarial developer from Hell. He or she could subvert fizzBuzz
like this:
fizzBuzz :: (Integral a, Show a) => a -> String fizzBuzz 18641 = "Pwnd" fizzBuzz i | i `mod` 15 == 0 = "FizzBuzz" fizzBuzz i | i `mod` 3 == 0 = "Fizz" fizzBuzz i | i `mod` 5 == 0 = "Buzz" fizzBuzz i = show i
The test suite is very unlikely to detect the error - even if you ask it to run each property a million times:
$ stack test --ta "-a 1000000" FizzBuzzHaskellPropertyBased> test (suite: FizzBuzz-test, args: -a 1000000) Divisible by both 3 and 5: [OK, passed 1000000 tests] At least one number in 3 consecutive values: [OK, passed 1000000 tests] At least one Fizz in 3 consecutive values: [OK, passed 1000000 tests] Only one Buzz in 5 consecutive values: [OK, passed 1000000 tests] At least one literal Buzz in 10 values: [OK, passed 1000000 tests] Numbers round-trip: [OK, passed 1000000 tests] Properties Total Passed 6 6 Failed 0 0 Total 6 6 FizzBuzzHaskellPropertyBased> Test suite FizzBuzz-test passed
How many times should we run these properties before we'd expect the At least one number in 3 consecutive values property to detect the error?
In Haskell, Int
is an integer type with at least the range [-2^29 .. 2^29-1] - that is, from -536,870,912 to 536,870,911, for a total range of 1,073,741,823 numbers.
In order to detect the error, the At least one number in 3 consecutive values property needs to hit 18,641, which it can only do if QuickCheck supplies an i
value of 18,639, 18,640, or 18,641. That's three values out of 1,073,741,823.
If we assume a uniform distribution, the chance of detecting the error is 3 / 1,073,741,823, or approximately one in 333 million.
Neither property-based testing nor randomised testing is likely to detect this kind of error. That's basically the intuition that makes sense to me when reading the Hamlet & Taylor paper. If you don't know where to look, partition testing isn't going to help you detect errors like the above.
I can live with that. After all, the value I get out of property-based testing is as a variation on test-driven development, rather than only quality assurance. It enables me to incrementally flesh out a problem in a way that example-based testing sometimes can't.
Conclusion #
There's a big overlap between partition testing and property-based testing. Often, identifying equivalence classes is the first step to expressing a property. A conspicuous example can be seen in my article series Types + Properties = Software, which shows a detailed walk-through of the Tennis kata done with FsCheck. For a hybrid approach, see Roman numerals via property-based TDD.
In my experience, it's much easier to partition a domain into equivalence classes than it is to describe universal properties of a system. Thus, many properties I write tend to be a kind of partition testing. On the other hand, it's more satisfying when you can express universal properties. I'm not sure that it's always possible, but I find that when it is, it better decouples the tests from implementation details.
Based on the FizzBuzz example shown here, you may find it unappealing that there's more test code than 'production code'. Clearly, for a problem like FizzBuzz, this is unwarranted. That, however, wasn't the point with the example. The point was to show an easily digestible example of universal properties. For a more realistic example, I'll refer you to the scheduling problem I also linked to earlier. While the production code ultimately turned out to be compact, it's far from trivial.