A proof of concept of cross-branch testing for compiled languages.

Hillel Wayne recently published an article called Cross-Branch Testing. It outlines an approach to a class of problems that are hard to test. He mentions computer vision and simulations, among others. I can add that it's also difficult to write intuitive tests of convex hulls and Conway's game of life.

Hillel Wayne calls these rho problems, 'just because'. I'm totally going to run with that term.

In the article, he outlines an approach where you test an iteration of rho code against a 'last known good' snapshot. He uses git worktree to set up a snapshot of the reference implementation. He then writes a property that compares the refactored code's behaviour against the reference.

The example code is in Python, which is a language that I don't know. As far as I can tell, it works because Python is 'lightweight' enough that you can load and execute source code directly. I found that the approach makes much sense, but I wondered how it would apply for statically typed, compiled languages. I decided to create a proof of concept in F#.

Test cases from Python #

My first problem was to port Hillel Wayne's example rho problem to F#. The function f doesn't have any immediate mathematical properties; nor is its behaviour intuitive. While I think that I understand what each line of code in f means, I don't really know Python. Since one of the properties of rho problems is that bugs can be subtle, I didn't trust myself to be able to port the Python code to F# without some test cases.

To solve that problem, I first found an online Python interpreter and pasted the f function into it. I then wrote code to print the output of a function call:

print(f'1, 2, 3, { f(1, 2, 3) }')

This line of code produces this output:

1, 2, 3, True

In other words, I could produce comma-separated values of input and actual output.

Hillel Wayne wrote properties using Hypothesis, which, it seems, by default runs each property 200 times.

In F# I'm going to use FsCheck, so I first used F# Interactive with FsCheck to produce 200 Python print statements like the above:

> Arb.Default.Int32().Generator
|> Gen.three
|> Gen.map (fun (x, y, z) -> sprintf "print(f'%i, %i, %i, { f(%i, %i, %i) }')" x y z x y z)
|> Gen.sample 100 200
|> List.iter (printfn "%s");;
print(f'-77, 67, 84, { f(-77, 67, 84) }')
print(f'58, -46, 3, { f(58, -46, 3) }')
print(f'21, 13, 94, { f(21, 13, 94) }')
...

This is a throwaway data pipeline that starts with an FsCheck integer generator, creates a triple from it, turns that triple into a Python print statement, and finally writes 200 of those to the console. The above code listing only shows the first three lines of output, while the rest are indicated by an ellipsis.

I copied those 200 print statements over to the online Python interpreter and ran the code. That produced 200 comma-separated values like these:

-77, 67, 84, False
58, -46, 3, False
21, 13, 94, True
...

These can serve as test cases for porting the Python code to F#.

Port to F# #

The next step is to write a parametrised test, using a provisional implementation of f:

[<Theory; MemberData(nameof fTestCases)>]
let ``test f`` x y z expected = expected =! f x y z

This test uses xUnit.net 2.4.1 and Unquote 5.0.0. As you can tell, apart from the annotations, it's a true one-liner. It calls the f function with the three supplied arguments x, y, and z and compares the return value with the expected value.

The code uses the new nameof feature of F# 5. fTestCases is a function in the same module that holds the test:

// unit -> seq<obj []>
let fTestCases () =
    use strm = typeof<Anchor>.Assembly.GetManifestResourceStream streamName
    use rdr = new StreamReader (strm)
    let s = rdr.ReadToEnd ()
    s.Split Environment.NewLine |> Seq.map csvToTestCase

It reads an embedded resource stream of test cases, like the above comma-separated values. Even though the values are in a text file, it's easier to embed the file in the test assembly, because it nicely dispenses with the problem of copying a text file to the appropriate output directory when the code compiles. That would, however, be an valid alternative.

Anchor is a dummy type to support typeof, and streamName is just a string constant that identifies the name of the stream.

The csvToTestCase function converts each line of comma-separated values to test cases for the [<Theory>] attribute:

// string -> obj []
let csvToTestCase (csv : string) =
    let values = csv.Split ','
    [|
        values.[0] |> Convert.ToInt32 |> box
        values.[1] |> Convert.ToInt32 |> box
        values.[2] |> Convert.ToInt32 |> box
        values.[3] |> Convert.ToBoolean |> box
    |]

It's not the safest code I could write, but this is, after all, a proof of concept.

The most direct port of the Python code I could produce is this:

// f : int -> int -> int -> bool
let f (x : int) (y : int) (z : int) =
    let mutable mx = bigint x
    let mutable my = bigint y
    let mutable mz = bigint z
    let mutable out = 0I
    for i in [0I..9I] do
        out <- out * mx + abs (my * mz - i * i)
        let x' = mx
        let y' = my
        let z' = mz
        mx <- y' + 1I
        my <- z'
        mz <- x'
    abs out % 100I < 10I

As F# code goes, it's disagreeable, but it passes all 200 test cases, so this will serve as an initial implementation. The out variable can grow to values that overflow even 64-bit integers, so I had to convert to bigint to get all test cases to pass.

If I make the same mutation to the code that Hillel Wayne did (abs out % 100I < 9I) two test cases fail. This gives me some confidence that I have a degree of problem coverage comparable to his.

Test oracle #

Now that a reference implementation exists, we can use it as a test oracle for refactorings. You can, for example, add a little test-only utility to your program portfolio:

open Prod
open FsCheck
 
[<EntryPoint>]
let main argv =
    Arb.Default.Int32().Generator
    |> Gen.three
    |> Gen.sample 100 200
    |> List.iter (fun (x, y, z) -> printfn "%i, %i, %i, %b" x y z (f x y z))
 
    0 // return an integer exit code

Notice that the last step in the pipeline is to output the values of each x, y, and z, as well as the result of calling f x y z.

This is a command-line executable that uses FsCheck to produce new test cases by calling the f function. It looks similar to the above one-off script that produced Python code, but this one instead just produces comma-separated values. You can run it from the command line to produce a new sample of test cases:

$ ./foracle
29, -48, -78, false
-8, -25, 13, false
-74, 34, -68, true
...

As above, I've used an ellipsis to indicate that in reality, 200 lines of comma-separated values scroll by.

When you use Bash, you can even pipe the output straight to a file:

$ ./foracle > csv.txt

You can now take the new comma-separated values and update the test values that the above test f test uses.

In other words, you use version n of f as a test oracle for version n + 1. When iteration n + 1 is a function of iteration n, you have a so-called dynamic system, so I think that we can call this technique dynamic test oracles.

The above foracle program is just a proof of concept. You could make it more flexible by making it take command-line arguments that would let you control the sample size and FsCheck's size parameter (the hard-coded 100 in the above code listing).

Refactoring #

With the confidence instilled by the test cases, we can now refactor the f function:

// f : int -> int -> int -> bool
let f (x : int) (y : int) (z : int) =
    let imp (x, y, z, out) i =
        let out = out * x + abs (y * z - i * i)
        y + 1I, z, x, out
    let (_, _, _, out) = List.fold imp (bigint x, bigint y, bigint z, 0I) [0I..9I]
    abs out % 100I < 10I

Instead of all those mutable variables, the function is, after all, just a left fold. Phew, I feel better now.

Conclusion #

This article demonstrated a proof of concept where you use a known good version (n) of the code as a test oracle for the next version (n + 1). In interpreted languages, you may be able to load two versions of the code base side by side, but that's rarely practical in a statically typed compiled language like F#. Instead, I used a utility program to generate test cases that can be used as a data source for a parametrised test.

The example rho problem takes only integers as input, and returns a simple Boolean value, so in this case it's trivial to encode each test case as a line of comma-separated values. For (real) problems that may involve more complex types, it'd be better to use another serialisation format, such as JSON or XML.

An outstanding issue is whether it's possible to implement shrinking behaviour when tests fail. Currently, the proof of concept just uses a set of serialised test cases. Normally, when a property-based testing framework like FsCheck detects a counter-example, it'll shrink the counter-example to values that are easier to understand than the original. This proof of concept doesn't do that. I'm not sure if a framework like FsCheck currently contains enough extensibility points to enable that sort of behaviour. I'll leave that question open for any reader interested in taking on that problem.


Comments

Hi Mark! Thanks for another thought provoking post.

I believe you and Hillel are writing characterization tests, which you've mentioned in the past. Namely, you're both using the behavior of existing code to verify the correctness of a refactor. The novel part to me is that Hillel is using code as the test oracle. Your solution serializes the oracle to a static file. The library I use for characterization tests (ApprovalTests) does this as well.

I believe shrinking is impossible when the oracle is a static file. However with Hillel's solution the oracle may be consulted at any time, making shrinking viable. If only there was a practical way to combine the two...

2021-01-06 23:01 UTC

A thought provoking post indeed!

In F# I'm going to use FsCheck...

I think that is a fine choice given the use case laid out in this post. In general though, I think Hedgehog is a better property-based testing library. Its killer feature is integrated shrinking, which means that all generators can also shrink and this extra power is essentially free.

For the record (because this can be a point of confusion), Haskell has QuickCheck and (Haskell) Hedgehog while F# has ports from Haskell called FsCheck and (FSharp) Hedgehog.

Jacob Stanley gave this excellent talk at YOW! Lambda Jam 2017 that explains the key idea that allows Hedgehog to have integrated shrinking. (Spoiler: A generic type that is invariant in its only type parameter is replaced by a different generic type that is monadic in its only type parameter. API design guided by functional programming for the win!)

Normally, when a property-based testing framework like FsCheck detects a counter-example, it'll shrink the counter-example to values that are easier to understand than the original.

In my experience, property-based tests written with QuickCheck / FsCheck do not normally shrink. I think this is because of the extra work required to enable shrinking. For an anecdotal example, consider this post by Fraser Tweedale. He believed that it would be faster to add (Haskell) Hedgehog as a dependency and create a generator for it than to add shrinking to his existing generator in QuickCheck.

In other words, you use version n of f as a test oracle for version n + 1. When iteration n + 1 is a function of iteration n, you have a so-called dynamic system, so I think that we can call this technique dynamic test oracles.

I am confused by this paragraph. I interpret your word "When" at the start of the second sentence as a common-language synonym for the mathematical word "If". Here is roughly how I understand that paragraph, where A stands for "version / iteration n of f" and B stands for "version / iteration n + 1 of f". "A depends on B. If B depends on A, then we have a dynamic system. Therefore, we have a dynamic system." I feel like the paragraph assumes (because it is obvious?) that version / iteration n + 1 of f depends on version / iteration n of f. In what sense is that the case?

An outstanding issue is whether it's possible to implement shrinking behaviour when tests fail. [...] I'll leave that question open for any reader interested in taking on that problem.

I am interested!

Quoting Mark and then Alex.

Hillel Wayne [...] outlines an approach where you test an iteration of rho code against a 'last known good' snapshot. He uses git worktree to set up a snapshot of the reference implementation. He then writes a property that compares the refactored code's behaviour against the reference.

The example code is in Python, which is a language that I don't know. As far as I can tell, it works because Python is 'lightweight' enough that you can load and execute source code directly. I found that the approach makes much sense, but I wondered how it would apply for statically typed, compiled languages. I decided to create a proof of concept in F#.

I believe shrinking is impossible when the oracle is a static file. However with Hillel's solution the oracle may be consulted at any time, making shrinking viable.

I want to start by elaborating on this to make sure we are all on the same page. I think of shrinking as involving two parts. On the one hand, we have the "shrink tree", which contains the values to test during the shrinking process. On the other hand, for each input tested, we need to know if the output should cause the test to pass or fail.

With Hedgehog, getting a shrink tree would not be too difficult. For a generator with type parameter 'a, the current generator API returns a "random" shrink tree of type 'a in which the root is an instance a of the type 'a and the tree completely depends on a. It should be easy to expose an additional function that accepts inputs of type 'a Gen and 'a and returns the tree with the given 'a as its root.

The difficult part is being able to query the test oracle. As Mark said, this seems easy to do in a dynamically-typed language like Python. In contrast, the fundamental issue with a statically-typed language like F# is that the compiled code exists in an assembly and only one assembly of a given name can be loaded in a given process at the same time.

This leads me to two ideas for workarounds. First, we could query the test oracle in a different process. I imagine an entry point could be generated that gives direct access to the test oracle. Then the test process could query the test oracle by executing this generated process. Second, we could generate a different assembly that exposes the test oracle. Then the test process could load this generated assembly to query the test oracle. The second approach seems like it would have a faster query time but be harder to implement. The first approach seems easier to implement but would probably have a slower query time. Maybe the query time would be fast enough though, especially if it was only queried when shrinking.

But given such a solution, who wants to restrict access to the test oracle only to shrinking? If the test oracle is always available, then there is no need to store input-output pairs. Instead of always checking that the system under test works correctly for a previously selected set of inputs, the property-based test can check that the system under test has the expected behavior for a unique set of inputs each time the property-based test is executed. In my experience, this is the default behavior of a property-based test.

One concern that some people might have is the idea of checking into the code repository the binary containing the test oracle. My first though is that the size of this is likely so small that it does not matter. My second thought is that the binary containing the test oracle does not have to be included in the repository. Instead, the workflow could be to (1) create the property-based test that uses the compiled test oracle, (2) refactor the system under test, (3) observe that the property-based test still passes, (4) commit the refactored code, and (5) discard the remaining changes, which will delete the property-based test and the compiled test oracle.

Instead of completely removing that property-based test, it might be better to leave it there with input-output pairs stored in a file. Then the conversion from that state of the property-based test to the one that uses the compiled test oracle will be much smaller.

2021-01-07 19:27 UTC

Alex, thank you for writing. Yes, I think that calling this a Characterisation Test is correct. I wasn't aware of the ApprovalTests library; thank you for mentioning it.

When I originally wrote the article, I was under the impression that shrinking might still be possible. I admit, though, that I hadn't thought things through. I think that Tyson Williams argues convincingly that this isn't possible.

2021-01-15 13:42 UTC


Wish to comment?

You can add a comment to this post by sending me a pull request. Alternatively, you can discuss this post on Twitter or somewhere else with a permalink. Ping me with the link, and I may respond.

Published

Monday, 04 January 2021 06:26:00 UTC

Tags



"Our team wholeheartedly endorses Mark. His expert service provides tremendous value."
Hire me!
Published: Monday, 04 January 2021 06:26:00 UTC