How I currently think about errors in programming.

A reader recently asked a question that caused me to reflect on the way I think about errors in software. While my approach to error handling has remained largely the same for years, I don't think I've described it in an organized way. I'll try to present those thoughts here.

This article is, for lack of a better term, a think piece. I don't pretend that it represents any fundamental truth, or that this is the only way to tackle problems. Rather, I write this article for two reasons.

  • Writing things down often helps clarifying your thoughts. While I already feel that my thinking on the topic of error handling is fairly clear, I've written enough articles that I know that by writing this one, I'll learn something new.
  • Publishing this article enables the exchange of ideas. By sharing my thoughts, I enable readers to point out errors in my thinking, or to improve on my work. Again, I may learn something. Perhaps others will, too.

Although I don't claim that the following is universal, I've found it useful for years.

Error categories #

Almost all software is at risk of failing for a myriad of reasons: User input, malformed data, network partitions, cosmic rays, race conditions, bugs, etc. Even so, we may categorize errors like this:

  • Predictable errors we can handle
  • Predictable errors we can't handle
  • Errors we've failed to predict

This distinction is hardly original. I believe I've picked it up from Michael Feathers, but although I've searched, I can't find the source, so perhaps I'm remembering it wrong.

You may find these three error categories underwhelming, but I find it useful to first consider what may be done about an error. Plenty of error situations are predictable. For example, all input should be considered suspect. This includes user input, but also data you receive from other systems. This kind of potential error you can typically solve with input validation, which I believe is a solved problem. Another predictable kind of error is unavailable services. Many systems store data in databases. You can easily predict that the database will, sooner or later, be unreachable. Potential causes include network partitions, a misconfigured connection string, logs running full, a crashed server, denial-of-service attacks, etc.

With some experience with software development, it's not that hard producing a list of things that could go wrong. The next step is to decide what to do about it.

There are scenarios that are so likely to happen, and where the solution is so well-known, that they fall into the category of predictable errors that you can handle. User input belongs here. You examine the input and inform the user if it's invalid.

Even with input, however, other scenarios may lead you down different paths. What if, instead of a system with a user interface, you're developing a batch job that receives a big data file every night? How do you deal with invalid input in that scenario? Do you reject the entire data set, or do you filter it so that you only handle the valid input? Do you raise a notification to asynchronously inform the sender that input was malformed?

Notice how categorization is context-dependent. It would be a (category?) error to interpret the above model as fixed and universal. Rather, it's an analysis framework that helps identifying how to categorize various fault scenarios in a particular application context.

Another example may be in order. If your system depends on a database, a predictable error is that the database will be unavailable. Can you handle that situation?

A common reaction is that there's really not a lot one can do about that. You may retry the operation, log the problem, or notify an on-call engineer, but ultimately the system depends on the database. If the database is unreachable, the system can't work. You can't handle that problem, so this falls in the category of predictable errors that you can't handle.

Or does it?

Trade-offs of error handling #

The example of an unreachable database is useful to explore in order to demonstrate that error handling isn't writ in stone, but rather an architectural design decision. Consider a common API design like this:

public interface IRepository<T>
{
    int Create(T item);
 
    // other members
}

What happens if client code calls Create but the database is unreachable? This is C# code, but the problem generalizes. With most implementations, the Create method will throw an exception.

Can you handle that situation? You may retry a couple of times, but if you have a user waiting for a response, you can't retry for too long. Once time is up, you'll have to accept that the operation failed. In a language like C#, the most robust implementation is to not handle the specific exception, but instead let it bubble up to be handled by a global exception handler that usually can't do much else than showing the user a generic error message, and then log the exception.

This isn't your only option, though. You may find yourself in a context where this kind of attitude towards errors is unacceptable. If you're working with BLOBAs it's probably fine, but if you're working with medical life-support systems, or deep-space probes, or in other high-value contexts, the overall error-tolerance may be lower. Then what do you do?

You may try to address the concern with IT operations: Configure failover systems for the database, installing two network cards in every machine, and so on. This may (also) be a way to address the problem, but isn't your only option. You may also consider changing the software architecture.

One option may be to switch to an asynchronous message-based system where messages are transmitted via durable queues. Granted, durables queues may fail as well (everything may fail), but when done right, they tend to be more robust. Even a machine that has lost all network connectivity may queue messages on its local disk until the network returns. Yes, the disk may run full, etc. but it's less likely to happen than a network partition or an unreachable database.

Notice that an unreachable database now goes into the category of errors that you've predicted, and that you can handle. On the other hand, failing to send an asynchronous message is now a new kind of error in your system: One that you can predict, but can't handle.

Making this change, however, impacts your software architecture. You can no longer have an interface method like the above Create method, because you can't rely on it returning an int in reasonable time. During error scenarios, messages may sit in queues for hours, if not days, so you can't block on such code.

As I've explained elsewhere you can instead model a Create method like this:

public interface IRepository<T>
{
    void Create(Guid id, T item);
 
    // other members
}

Not only does this follow the Command Query Separation principle, it also makes it easier for you to adopt an asynchronous message-based architecture. Done consistently, however, this requires that you approach application design in a way different from a design where you assume that the database is reachable.

It may even impact a user interface, because it'd be a good idea to design user experience in such a way that it helps the user have a congruent mental model of how the system works. This may include making the concept of an outbox explicit in the user interface, as it may help users realize that writes happen asynchronously. Most users understand that email works that way, so it's not inconceivable that they may be able to adopt a similar mental model of other applications.

The point is that this is an option that you may consider as an architect. Should you always design systems that way? I wouldn't. There's much extra complexity that you have to deal with in order to make asynchronous messaging work: UX, out-of-order messages, dead-letter queues, message versioning, etc. Getting to five nines is expensive, and often not warranted.

The point is rather that what goes in the predictable errors we can't handle category isn't fixed, but context-dependent. Perhaps we should rather name the category predictable errors we've decided not to handle.

Bugs #

How about the third category of errors, those we've failed to predict? We also call these bugs or defects. By definition, we only learn about them when they manifest. As soon as they become apparent, however, they fall into one of the other categories. If an error occurs once, it may occur again. It is now up to you to decide what to do about it.

I usually consider errors as stop-the-line-issues, so I'd be inclined to immediately address them. On they other hand, if you don't do that, you've implicitly decided to put them in the category of predictable errors that you've decided not to handle.

We don't intentionally write bugs; there will always be some of those around. On the other hand, various practices help reducing them: Test-driven development, code reviews, property-based testing, but also up-front design.

Error-free code #

Do consider explicitly how code may fail.

Despite the title of this section, there's no such thing as error-free code. Still, you can explicitly think about edge cases. For example, how might the following function fail?

public static TimeSpan Average(this IEnumerable<TimeSpantimeSpans)
{
    var sum = TimeSpan.Zero;
    var count = 0;
    foreach (var ts in timeSpans)
    {
        sum += ts;
        count++;
    }
    return sum / count;
}

In at least two ways: The input collection may be empty or infinite. I've already suggested a few ways to address those problems. Some of them are similar to what Michael Feathers calls unconditional code, in that we may change the domain. Another option, that I didn't cover in the linked article, is to expand the codomain:

public static TimeSpan? Average(this IReadOnlyCollection<TimeSpan> timeSpans)
{
    if (!timeSpans.Any())
        return null;
 
    var sum = TimeSpan.Zero;
    foreach (var ts in timeSpans)
        sum += ts;
    return sum / timeSpans.Count;
}

Now, instead of diminishing the domain, we expand the codomain by allowing the return value to be null. (Interestingly, this is the inverse of my profunctor description of the Liskov Substitution Principle. I don't yet know what to make of that. See: Just by writing things down, I learn something I hadn't realized before.)

This is beneficial in a statically typed language, because such a change makes hidden knowledge explicit. It makes it so explicit that a type checker can point out when we make mistakes. Make illegal states unrepresentable. Poka-yoke. A potential run-time exception is now a compile-time error, and it's firmly in the category of errors that we've predicted and decided to handle.

In the above example, we could use the built-in .NET Nullable<T> (with the ? syntactic-sugar alias). In other cases, you may resort to returning a Maybe (AKA option).

Modelling errors #

Explicitly expanding the codomain of functions to signal potential errors is beneficial if you expect the caller to be able to handle the problem. If callers can't handle an error, forcing them to deal with it is just going to make things more difficult. I've never done any professional Java programming, but I've heard plenty of Java developers complain about checked exceptions. As far as I can tell, the problem in Java isn't so much with the language feature per se, but rather with the exception types that APIs force you to handle.

As an example, imagine that every time you call a database API, the compiler forces you to handle an IOException. Unless you explicitly architect around it (as outlined above), this is likely to be one of the errors you can predict, but decide not to handle. But if the compiler forces you to handle it, then what do you do? You probably find some workaround that involves re-throwing the exception, or, as I understand that some Java developers do, declare that their own APIs may throw any exception, and by that means just pass the buck. Not helpful.

As far as I can tell, (checked) exceptions are equivalent to the Either container, also known as Result. We may imagine that instead of throwing exceptions, a function may return an Either value: Right for a right result (explicit mnemonic, there!), and left for an error.

It might be tempting to model all error-producing operations as Either-returning, but you're often better off using exceptions. Throw exceptions in those situations that you expect most clients can't recover from. Return left (or error) cases in those situations that you expect that a typical client would want to handle.

Again, it's context-specific, so if you're developing a reusable library, there's a balance to strike in API design (or overloads to supply).

Most errors are just branches #

In many languages, errors are somehow special. Most modern languages include a facility to model errors as exceptions, and special syntax to throw or catch them. (The odd man out may be C, with its reliance on error codes as return values, but that is incredible awkward for other reasons. You may also reasonably argue that C is hardly a modern language.)

Even Haskell has exceptions, even though it also has deep language support for Maybe and Either. Fortunately, Haskell APIs tend to only throw exceptions in those cases where average clients are unlikely to handle them: Timeouts, I/O failures, and so on.

It's unfortunate that languages treat errors as something exceptional, because this nudges us to make a proper category error: That errors are somehow special, and that we can't use normal coding constructs or API design practices to model them.

But you can. That's what Michael Feathers' presentation is about, and that's what you can do by making illegal states unrepresentable, or by returning Maybe or Either values.

Most errors are just branches in your code; where it diverges from the happy path in order to do something else.

Conclusion #

This article presents a framework for thinking about software errors. There are those you can predict may happen, and you choose to handle; those you predict may happen, but you choose to ignore; and those that you have not yet predicted: bugs.

A little up-front thinking will often help you predict some errors, but I'm not advocating that you foresee all errors. Some errors are programmer errors, and we make those errors because we're human, exactly because we're failing to predict the behaviour of a particular state of the code. Once you discover a bug, however, you have a choice: Do you address it or ignore it?

There are error conditions that you may deliberately choose to ignore. This doesn't necessarily make you an irresponsible programmer, but may rather be the result of a deliberate feasibility study. For example, every network operation may fail. How important is it that your application can keep running without the network? Is it worthwhile to make the code so robust that it can handle that situation? Or can you rather live with a few hours of downtime per quarter? If the latter, it may be best to let a human deal with network partitions when they occur.

The three error categories I suggest here are context-dependent. You decide which problems to deal with, and which ones to ignore, but apart from that, error-handling doesn't have to be difficult.



Wish to comment?

You can add a comment to this post by sending me a pull request. Alternatively, you can discuss this post on Twitter or somewhere else with a permalink. Ping me with the link, and I may respond.

Published

Monday, 29 January 2024 16:05:00 UTC

Tags



"Our team wholeheartedly endorses Mark. His expert service provides tremendous value."
Hire me!
Published: Monday, 29 January 2024 16:05:00 UTC