Using characterisation tests and mutation testing to describe existing behaviour. An example.

This article is part of an article series that presents multiple design alternatives for a given problem that I call the song recommendations problem. In short, the problem is to recommend songs to a user based on a vast repository of scrobbles. The problem was originally proposed by Oleksii Holub as a an example of a problem that may not be a good fit for functional programming (FP).

As I've outlined in the introductory article, I'd like to use the opportunity to demonstrate alternative FP designs. Before I can do that, however, I need a working example of Oleksii Holub's code example, as well as a trustworthy test suite. That's what this article is about.

The code in this article mostly come from the master branch of the .NET repository that accompanies this article series, although some of the code is taken from intermediate commits on that branch.

Inferred details #

The original article only shows code, but doesn't link to an existing code base. While I suppose I could have asked Oleksii Holub if he had a copy he would share, the existence of such a code base isn't a given. In any case, inferring an entire code base from a comprehensive snippet is an interesting exercise in its own right.

The first step was to copy the example code into a code base. Initially it didn't compile because of some missing dependencies that I had to infer. It was only three Value Objects and an interface:

public sealed record Song(int Id, bool IsVerifiedArtist, byte Rating);

public sealed record Scrobble(Song Song, int ScrobbleCount);

public sealed record User(string UserName, int TotalScrobbleCount);

public interface SongService
{
    Task<IReadOnlyCollection<User>> GetTopListenersAsync(int songId);
    Task<IReadOnlyCollection<Scrobble>> GetTopScrobblesAsync(string userName);
}

These type declarations are straightforward, but still warrant a few comments. First, Song, Scrobble, and User are C# records, which is a more recent addition to the language. If you're reading along here, but using another C-based language, or an older version of C#, you can implement such immutable Value Objects with normal language constructs; it just takes more code, instead of the one-liner syntax. F# developers will, of course, be familiar with the concept of records, and Haskell also has them.

Another remark about the above type declarations is that while SongService is an interface, it has no I prefix. This is syntactically legal, but not idiomatic in C#. I've used the name from the original code sample verbatim, so that's the immediate explanation. It's possible that Oleksii Holub intended the type to be a base class, but for various reasons I prefer interfaces, although in this particular example I don't think it would have made much of a difference. I'm only pointing it out because there's a risk that it might confuse some readers who are used to the C# naming convention. Java programmers, on the other hand, should feel at home.

As far as I remember, the only other change I had to make to the code in order to make it compile was to give the RecommendationsProvider class a constructor:

public RecommendationsProvider(SongService songService)
{
    _songService = songService;
}

This is just basic Constructor Injection, and while I find the underscore prefix redundant, I've kept it in order to stay as faithful to the original example as possible.

At this point the code compiles.

Test Double #

The goal of this article series is to present several alternative designs that implement the same behaviour. This means that as I refactor the code, I need to know that I didn't break existing functionality.

"to refactor, the essential precondition is [...] solid tests"

Currently, I have no tests, so I'll have to add some. Since RecommendationsProvider makes heavy use of its injected SongService, tests must supply that dependency in order to do meaningful work. Since Stubs and Mocks break encapsulation I instead favour state-based testing with Fake Objects.

After some experimentation, I arrived at this FakeSongService:

public sealed class FakeSongService : SongService
{
    private readonly ConcurrentDictionary<int, Song> songs;
    private readonly ConcurrentDictionary<string, ConcurrentDictionary<intint>> users;
 
    public FakeSongService()
    {
        songs = new ConcurrentDictionary<int, Song>();
        users = new ConcurrentDictionary<string, ConcurrentDictionary<intint>>();
    }
 
    public Task<IReadOnlyCollection<User>> GetTopListenersAsync(int songId)
    {
        var listeners =
            from kvp in users
            where kvp.Value.ContainsKey(songId)
            select new User(kvp.Key, kvp.Value.Values.Sum());
 
        return Task.FromResult<IReadOnlyCollection<User>>(listeners.ToList());
    }
 
    public Task<IReadOnlyCollection<Scrobble>> GetTopScrobblesAsync(
        string userName)
    {
        var scrobbles = users
            .GetOrAdd(userName, new ConcurrentDictionary<intint>())
            .Select(kvp => new Scrobble(songs[kvp.Key], kvp.Value));
 
        return Task.FromResult<IReadOnlyCollection<Scrobble>>(scrobbles.ToList());
    }
 
    public void Scrobble(string userName, Song song, int scrobbleCount)
    {
        users.AddOrUpdate(
            userName,
            new ConcurrentDictionary<intint>(
                new[] { KeyValuePair.Create(song.Id, scrobbleCount) }),
            (_, scrobbles) => AddScrobbles(scrobbles, song, scrobbleCount));
 
        songs.AddOrUpdate(song.Id, song, (_, _) => song);
    }
 
    private static ConcurrentDictionary<intint> AddScrobbles(
        ConcurrentDictionary<intint> scrobbles,
        Song song,
        int scrobbleCount)
    {
        scrobbles.AddOrUpdate(
            song.Id,
            scrobbleCount,
            (_, oldCount) => oldCount + scrobbleCount);
        return scrobbles;
    }
}

If you're wondering about the use of concurrent dictionaries, I use them because it made it easier to write the implementation, and not because I need the implementation to be thread-safe. In fact, I'm fairly sure that it's not thread-safe. That's not an issue. The tests aren't going to use shared mutable state.

The GetTopListenersAsync and GetTopScrobblesAsync methods implement the interface, and the Scrobble method (here, scrobble is a verb: to scrobble) is a back door that enables tests to populate the FakeSongService.

Icebreaker Test #

While the 'production code' is in C#, I decided to write the tests in F# for two reasons.

The first reason was that I wanted to be able to present the various FP designs in both C# and F#. Writing the tests in F# would make it easier to replace the C# code base with an F# alternative.

The second reason was that I wanted to leverage a property-based testing framework's ability to produce many randomly-generated test cases. I considered this important to build confidence that my tests weren't just a few specific examples that wouldn't catch errors when I made changes. Since the RecommendationsProvider API is asynchronous, the only .NET property-based framework I knew of that can run Task-valued properties is FsCheck. While it's possible to use FsCheck from C#, the F# API is more powerful.

In order to get started, however, I first wrote an Icebreaker Test without FsCheck:

[<Fact>]
let ``No data`` () = task {
    let srvc = FakeSongService ()
    let sut = RecommendationsProvider srvc
    let! actual = sut.GetRecommendationsAsync "foo"
    Assert.Empty actual }

This is both a trivial case and an edge case, but clearly, if there's no data in the SongService, the RecommendationsProvider can't recommend any songs.

As I usually do with Characterisation Tests, I temporarily sabotage the System Under Test so that the test fails. This is to ensure that I didn't write a tautological assertion. Once I've seen the test fail for the appropriate reason, I undo the sabotage and check in the code.

Refactor to property #

In the above No data test, the specific input value "foo" is irrelevant. It might as well have been any other string, so why not make it a property?

In this particular case, the userName could be any string, but it might be appropriate to write a custom generator that produces 'realistic' user names. Just to make things simple, I'm going to assume that user names are between one and twenty characters and assembled from alphanumeric characters, and that the fist character must be a letter:

module Gen =
    let alphaNumeric = Gen.elements (['a'..'z'] @ ['A'..'Z'] @ ['0'..'9'])
 
    let userName = gen {
        let! length = Gen.choose (1, 19)
        let! firstLetter = Gen.elements <| ['a'..'z'] @ ['A'..'Z']
        let! rest = alphaNumeric |> Gen.listOfLength length
        return firstLetter :: rest |> List.toArray |> String }

Strictly speaking, as long as user names are distinct, the code ought to work, so this generator may be more conservative than necessary. Why am I constraining the generator? For two reasons: First, when FsCheck finds a counter-example, it displays the values that caused the property to fail. A twenty-character alphanumeric string is easier to relate to than some arbitrary string with line breaks and unprintable characters. The second reason is that I'm later going to measure memory load for some of the alternatives, and I wanted data to have realistic size. If user names are chosen by humans, they're unlikely to be longer than twenty characters on average (I've decided).

I can now rewrite the above No data test as an FsCheck property:

[<Property>]
let ``No data`` () =
    Gen.userName |> Arb.fromGen |> Prop.forAll <| fun userName ->
    task {
        let srvc = FakeSongService ()
        let sut = RecommendationsProvider srvc
 
        let! actual = sut.GetRecommendationsAsync userName
 
        Assert.Empty actual } :> Task

You may think that this is overkill just to be able to supply random user names to the GetRecommendationsAsync method. In isolation, I'd be inclined to agree, but this edit was an occasion to get my FsCheck infrastructure in place. I can now use that to add more properties.

Full coverage #

The cyclomatic complexity of the GetRecommendationsAsync method is only 3, so it doesn't require many tests to attain full code coverage. Not that 100% code coverage should be a goal in itself, but when adding tests to an untested code base, it can be useful as an indicator of confidence. Despite its low cyclomatic complexity, the method, with all of its filtering and sorting, is actually quite involved. 100% coverage strikes me as a low bar.

The above No data test exercises one of the three branches. At most two more tests are required to attain full coverage. I'll just show the simplest of them here.

The next test case requires a new FsCheck generator, in addition to Gen.userName already shown.

let song = ArbMap.generate ArbMap.defaults |> Gen.map Song

As a fairly simple one-liner, this seems close to the Fairbairn threshold, but I think that giving this generator a name makes the test easier to read.

[<Property>]
let ``One user, some songs`` () =
    gen {
        let! user = Gen.userName
        let! songs = Gen.arrayOf Gen.song
        let! scrobbleCounts =
            Gen.choose (1, 100) |> Gen.arrayOfLength songs.Length
        return (user, Array.zip songs scrobbleCounts) }
    |> Arb.fromGen |> Prop.forAll <| fun (user, scrobbles) ->
    task {
        let srvc = FakeSongService ()
        scrobbles |> Array.iter (fun (s, c) -> srvc.Scrobble (user, s, c))
        let sut = RecommendationsProvider srvc
 
        let! actual = sut.GetRecommendationsAsync user
 
        Assert.Empty actual } :> Task

This test creates scrobbles for a single user and adds them to the Fake data store. It uses TIE-fighter syntax to connect the generators to the test body.

Since all the scrobble counts are generated between 1 and 100, none of them are greater than or equal to 10,000 and thus the test expects no recommendations.

You may think that I'm cheating - after all, why didn't I choose another range for the scrobble count? To be honest, I was still in an exploratory phase, trying to figure out how to express the tests, and as a first step, I was aiming for full code coverage. Even though this test's assertion is weak, it does exercise another branch of the GetRecommendationsAsync method.

I had to write only one more test to fully cover the System Under Test. That method is more complicated, so I'll spare you the details. If you're interested, you may consider consulting the example source code repository.

Mutation testing #

While I don't think that code coverage is useful as a target measure, it can be illuminating as a tool. In this case, knowing that I've now attained full coverage tells me that I need to resort to other techniques if I want another goal to aim for.

I chose mutation testing as my new technique. The GetRecommendationsAsync method makes heavy use of LINQ methods such as OrderByDescending, Take, and Where. Stryker for .NET knows about LINQ, so among all the automated sabotage is does, it tries to see what happens if it removes e.g. Where or Take.

Although I find the Stryker jargon childish, I set myself the goal to see if I could 'kill mutants' to a degree that I'd at least get a green rating.

I found that I could edge closer to that goal by a combination of appending assertions (thus strengthening postconditions) and adding tests. While I sometimes find it easier to define properties than examples, at other times, it's the other way around. In this case, I found it easier to add single examples, like this one:

[<Fact>]
let ``One verified recommendation`` () = task {
    let srvc = FakeSongService ()
    srvc.Scrobble ("cat", Song (1, false, 6uy),     10)
    srvc.Scrobble ("ana", Song (1, false, 5uy),     10)
    srvc.Scrobble ("ana", Song (2,  true, 5uy), 9_9990)
    let sut = RecommendationsProvider srvc
 
    let! actual = sut.GetRecommendationsAsync "cat"
 
    Assert.Equal<Song> ([ Song (2, true, 5uy) ], actual) } :> Task

It adds three scrobbles to the data store, but only one of them is verified (which is what the true value indicates), so this is the only recommendation the test expects to see.

Notice that although song number 2 'only' has 9,9990 plays, the user ana has exactly 10,000 plays in all, so barely makes the cut. By carefully adding five examples like this one, I was able to 'kill all mutants'.

In all, I have eight tests; three FsCheck properties and five normal xUnit.net facts.

All tests work exclusively by supplying direct and indirect input to the System Under Test (SUT), and verify the return value of GetRecommendationsAsync. No Mocks or Stubs have opinions about how the SUT interacts with the injected SongService. This gives me confidence that the tests constitute a trustworthy regression test suite, and that they're still sufficiently decoupled from implementation details to enable me to completely rewrite the SUT.

Quirks #

When you add tests to an existing code base, you may discover edge cases that the original programmer overlooked. The GetRecommendationsAsync method is only a code example, so I don't blame Oleksii Holub for some casual coding, but it turns out that the code has some quirks.

For example, there's no deduplication, so I had to apologise in my test code:

[<Fact>]
let ``Only top-rated songs`` () = task {
    let srvc = FakeSongService ()
    // Scale ratings to keep them less than or equal to 10.
    [1..20] |> List.iter (fun i ->
        srvc.Scrobble ("hyle", Song (i, true, byte i / 2uy), 500))
    let sut = RecommendationsProvider srvc
 
    let! actual = sut.GetRecommendationsAsync "hyle"
 
    Assert.NotEmpty actual
    // Since there's only one user, but with 20 songs, the implementation loops
    // over the same songs 20 times, so 400 songs in total (with duplicates).
    // Ordering on rating, only the top-rated 200 remains, that is, those rated
    // 5-10. Note that this is a Characterization Test, so not necessarily
    // reflective of how a real recommendation system should work.
    Assert.All (actual, fun s -> Assert.True (5uy <= s.Rating)) } :> Task

This test creates twenty scrobbles for one user: One with a zero rating, two with rating 1, two with rating 2, and so on, up to a single song with rating 10.

The implementation of GetRecommendationsAsync uses these twenty songs to find 'other users' who have these top songs as well. In this case, there's only one user, so for every of those twenty songs, you get the same twenty songs, for a total of 400.

You might protest that this is because my FakeSongService implementation is too unsophisticated. Obviously, it should not return the 'original' user's songs! Do, however, consider the implied signature of the GetTopListenersAsync method:

Task<IReadOnlyCollection<User>> GetTopListenersAsync(int songId);

The method only accepts a songId as input, and if we assume that the service is stateless, it doesn't know who the 'original' user is.

Should I fix the quirks? In a real system, it might be appropriate, but in this context I find it better to keep the them. Real systems often have quirks in the shape of legacy business rules and the like, so I only find it realistic that the system may exhibit some weird behaviour. The goal of this set of articles isn't to refactor this particular system, but rather to showcase alternative designs for a system sufficiently complicated to warrant refactorings. Simplifying the code might defeat that purpose.

As shown, I have an automated test that requires me to keep that behaviour. I think I'm in a good position to make sweeping changes to the code.

Conclusion #

As Martin Fowler writes, an essential precondition for refactoring is a trustworthy test suite. On a daily basis, millions of developers prove him wrong by deploying untested changes to production. There are other ways to make changes, including manual testing, A/B testing, testing in production, etc. Some of them may even work in some contexts.

In contrast to such real-world concerns, I don't have a production system with real users. Neither do I have a product owner or a department of manual testers. The best I can do is to add enough Characterisation Tests that I feel confident that I've described the behaviour, rather than the implementation, in enough detail to hold it in place. A Software Vise, as Michael Feathers calls it in Working Effectively with Legacy Code.

Most systems in 'the real world' have too few automated tests. Adding tests to a legacy code base is a difficult discipline, so I found it worthwhile to document this work before embarking on the actual design changes promised by this article series. Now that this is out of the way, we can proceed.

The next two articles do more groundwork to establish equivalent code bases in F# and Haskell. If you only care about the C# examples, you can go back to the first article in this article series and use the table of contents to jump to the next C# example.

Next: Porting song recommendations to F#.



Wish to comment?

You can add a comment to this post by sending me a pull request. Alternatively, you can discuss this post on Twitter or somewhere else with a permalink. Ping me with the link, and I may respond.
Published: Thursday, 10 April 2025 08:05:00 UTC