Two attempts to measure the quality of automated test suites.

While test-driven development remains, in my view, the most scientific approach to software testing, I realize that it's still a minority practice. Furthermore, with the rise of AI, it's becoming increasingly common to let LLMs generate tests.

Being practical about it, we need to explore how to critique tests; how to measure or evaluate the quality of tests we never wrote, and that we never saw fail.

I'm aware of two technical measurements, as well as a handful of heuristics that we may apply, but I think we may need more. Thus, this overview is only preliminary.

Code coverage #

The notion of code coverage has long, with good reason, been dismissed as 'not really helpful'. And indeed, code coverage is a useless target measure, because it's too easy to game.
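To make the gaming concrete, here's a minimal sketch in C# with xUnit.net; the Shipping class is a made-up stand-in. The test executes the SUT, so a line-coverage tool reports the code as covered, yet it asserts nothing and therefore can never fail.

    using Xunit;

    // Hypothetical SUT, only for illustration.
    public sealed class Shipping
    {
        public decimal CostOf(decimal weight) =>
            weight <= 0 ? 0 : 10 + weight * 2;
    }

    public class ShippingTests
    {
        [Fact]
        public void CostOfExecutes()
        {
            var sut = new Shipping();

            sut.CostOf(5);

            // No assertion. A coverage tool counts the lines in CostOf as
            // covered, even though the test verifies nothing.
        }
    }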

Perhaps we should reevaluate that position, now that it looks as though tests will increasingly be written by LLMs. While human developers will game simple incentives, who knows what LLMs will do? In any case, as test generation becomes automated, we need no longer care that much about agents 'gaming' the system.

After all, when people game an incentive system, the problem is two-fold. First, direct outcomes may have adverse effects. In the context of testing, tests written to attain a certain level of test coverage may be of poor quality, requiring too much maintenance. Second, there's opportunity cost. The time spent writing poor tests could, perhaps, be spent doing something more valuable.

The first concern is still relevant when asking LLMs to generate tests, but the second concern may be of less importance. Assuming that LLM-generated tests are relatively inexpensive, the least we may ask of such tests is a high coverage ratio.

This is not much of a quality measure, but rather a minimum bar. If you ask an LLM to generate tests, and all it can do is to achieve 30% coverage, that really isn't impressive. In the end, it's up to you to determine what to test and not to test, but for LLM-generated tests, I would expect high coverage.

After all, reaching 100% coverage is not that trivial, so expecting high coverage means something.

The next technique may also indirectly reveal problems with path coverage, but it's less readily available. Most mainstream languages or programming platforms come with some kind of coverage tool, whereas mutation-testing tools are rarer.

Mutation testing #

Mutation testing is the process of changing (mutating) particular parts of the System Under Test and then running the tests to see if any of them fail. If, for example, you can change a greater-than operator to a greater-than-or-equal operator, and no tests fail, this indicates that the tests don't cover an edge case.
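Here's a minimal sketch of the idea, with a made-up Discount class and xUnit.net tests that only exercise values far from the boundary:

    using Xunit;

    // Hypothetical SUT: a discount that only applies above a threshold.
    public sealed class Discount
    {
        public decimal Apply(decimal price) =>
            price > 100 ? price * 0.9m : price;
    }

    public class DiscountTests
    {
        [Fact]
        public void AppliesDiscountToExpensiveItem()
        {
            Assert.Equal(135m, new Discount().Apply(150));
        }

        [Fact]
        public void DoesNotDiscountCheapItem()
        {
            Assert.Equal(50m, new Discount().Apply(50));
        }
    }

Neither test exercises the boundary value 100, so if a tool mutates price > 100 to price >= 100, both tests still pass. The surviving mutant is the signal that a boundary-value test is missing.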

As I understand it, originally mutation testing mostly targeted relational operators, replacing >= with > or perhaps even <, replacing == with != and so on. The last time I used Stryker for C#, however, it went much further than that; for example, it also tried to remove filter expressions from query pipelines.

Mutation testing overlaps with code coverage in that it also identifies uncovered branches, but it can flush out additional problems. Even so, mutation testing is not always an option. The first problem is that, if you want to automate the process, the solution is language-specific. If, for example, you want to mutate equality comparisons, in most languages you'd look for the == operator, but the replacement differs: in C# you'd change it to !=, while in Haskell the negated operator is /=. And in F#, the operator to look for is =, to be replaced with <>.

You might think that a simple search-and-replace script could get the job done, but consider that a character like < may have multiple meanings in a code base. In C# and Java, for example, < and > also delimit generic type arguments, and in Haskell those characters appear in compound operators such as >>=.

A mutation-testing tool must know about the language it targets. To be on the safe side, it's probably best to at least have a parser so that you can manipulate abstract syntax trees.
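As a rough sketch of what that could look like in C#, the following uses Roslyn (the Microsoft.CodeAnalysis.CSharp package) to produce one mutant per greater-than expression. It's nowhere near a real mutation-testing tool, which would also have to compile each mutant and run the test suite against it, but it illustrates why a parser beats textual search-and-replace.

    using System;
    using System.Linq;
    using Microsoft.CodeAnalysis;
    using Microsoft.CodeAnalysis.CSharp;
    using Microsoft.CodeAnalysis.CSharp.Syntax;

    public static class Mutator
    {
        public static void Main()
        {
            const string source = @"
                public static class Price
                {
                    public static bool IsExpensive(decimal price) => price > 100;
                }";

            var root = CSharpSyntaxTree.ParseText(source).GetRoot();
            var candidates = root
                .DescendantNodes()
                .OfType<BinaryExpressionSyntax>()
                .Where(e => e.IsKind(SyntaxKind.GreaterThanExpression));

            foreach (var original in candidates)
            {
                // Replace > with >= in the syntax tree, not in the raw text.
                var mutant = root.ReplaceNode(
                    original,
                    SyntaxFactory.BinaryExpression(
                        SyntaxKind.GreaterThanOrEqualExpression,
                        original.Left,
                        original.Right));
                // A real tool would compile this mutant and run the tests;
                // here it's only written to the console.
                Console.WriteLine(mutant.ToFullString());
            }
        }
    }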

Then, for each mutation, the tool needs to run the test suite in question, keeping track of which mutations cause test failures, and which ones don't. I'm not saying that this is impossibly difficult, but it's also not entirely trivial.

Another problem with mutation testing is that it takes time. Consider changing every relational operator in your code base. How many do you have? Thousands? Then consider how much time it takes to run the test suite. Now multiply those two numbers. If, say, you have 2,000 mutation sites and the test suite takes 30 seconds, a run of all single mutations already costs more than 16 hours.

And this is only for single mutations. If you also want to test combinations of mutations, the number of test runs grows exponentially rather than linearly. For most code bases, this is impractical. You can see how code coverage is a practical alternative.

Heuristics #

In addition to code coverage and mutation testing, if I were given a unit test suite and had to evaluate its quality (but were prevented from treating each test as a Characterization Test), I'd also consider the following.

As a rule of thumb, tests should have a cyclomatic complexity of 1. In many languages, you can get a report of cyclomatic complexity. If such a report finds tests with a cyclomatic complexity greater than 1, this bears investigation. Unless it's a parametrised test, it probably shouldn't contain loops or branching.
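To illustrate the difference, here's a contrived xUnit.net example. The first test has a loop and a branch, so its cyclomatic complexity is greater than 1: the test itself contains logic that could be wrong, and a failure doesn't say which case broke. The parametrised Theory below it expresses the same cases as data and keeps each execution linear.

    using System;
    using Xunit;

    public class RoundingTests
    {
        // Cyclomatic complexity > 1: a loop and a branch inside the test.
        [Fact]
        public void RoundsSeveralValues()
        {
            var cases = new[] { (1.4, 1), (1.5, 2), (2.5, 3) };
            foreach (var (input, expected) in cases)
            {
                if (input > 0)
                    Assert.Equal(
                        expected,
                        (int)Math.Round(input, MidpointRounding.AwayFromZero));
            }
        }

        // Cyclomatic complexity 1: the cases are data, not control flow.
        [Theory]
        [InlineData(1.4, 1)]
        [InlineData(1.5, 2)]
        [InlineData(2.5, 3)]
        public void RoundsValue(double input, int expected)
        {
            Assert.Equal(
                expected,
                (int)Math.Round(input, MidpointRounding.AwayFromZero));
        }
    }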

Even simpler than cyclomatic complexity, you may consider something as basic as the size of each test. How many lines of code is it? What's the line width? Does it fit into a reasonably sized box?

Furthermore, measure the running time of the new tests. In itself, this doesn't tell you anything about correctness, but if some tests are suspiciously slow, it could be because a test is waiting for some other event, suspending its thread while it does so. Such tests are not only slow, but may also be incorrect, because using timeouts or similar mechanisms for thread synchronization tends to fail in non-deterministic ways.
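A hypothetical example of the smell, again in xUnit.net: the test parks its thread for a fixed period and hopes that some background work has completed by then. It's slow by construction, and whether it passes depends on timing rather than on logic.

    using System.Threading;
    using System.Threading.Tasks;
    using Xunit;

    public class BackgroundWorkTests
    {
        [Fact]
        public void HandlerEventuallySetsFlag()
        {
            var handled = false;
            _ = Task.Run(() => { Thread.Sleep(500); handled = true; });

            // Timeout-based synchronization: slow, and on a loaded build
            // server two seconds may not be enough, so the test is flaky.
            Thread.Sleep(2000);

            Assert.True(handled);
        }
    }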

While we're on the topic of non-determinism, run the tests multiple times and make sure that the results are consistent from run to run.

Furthermore, if you have the choice, favour tests written in the language with the most powerful type system. For example, if the System Under Test (SUT) is written in JavaScript, you can target it from tests written in a selection of languages. I'd rather see LLM-generated tests in TypeScript than in JavaScript, because the TypeScript type checker can catch errors that may go unnoticed in JavaScript. I haven't kept up with that ecosystem, but perhaps PureScript is an even better choice than TypeScript.

Likewise, if the SUT is a .NET application, I'd trust LLM-generated tests written in F# over tests written in C#.

Not all ecosystems give you such a choice, but if possible, favour tests written in a language with a powerful type checker.

Additionally, run linters or static code analysis on the tests, and treat warnings as errors. And be sure to scan the code for pragmas that suppress warnings.
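In C#, for example, such suppressions tend to look like the following. The specific warning IDs are only illustrative, but the directives themselves are easy to search for.

    // The kind of markers that silence analysis, and that therefore
    // deserve a search across LLM-generated test code.
    #pragma warning disable xUnit2013 // suppresses an xUnit analyzer rule
    #pragma warning disable CS8602    // suppresses a nullable-reference warning

    // ... test code that would otherwise trigger the warnings ...

    #pragma warning restore CS8602
    #pragma warning restore xUnit2013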

There's quite a bit to look out for. Perhaps a checklist would be helpful.

Conclusion #

Using LLMs to generate tests will almost certainly become increasingly common. This raises the fundamental question: How do we know that the tests do what we want them to?

While you could go systematically through each test and apply the process for empirical Characterization Testing, I doubt most people have the patience or discipline. As a next-best solution, we may look for ways to critique the tests, or rather, measure their quality.

For the time being, I can think of two tools for this purpose: code coverage and mutation testing. Neither is particularly reassuring, so this seems to me to be a field where more research and development would be beneficial.




Published

Monday, 16 February 2026 13:10:00 UTC
