Conservative codomain conjecture

Monday, 06 May 2024 06:35:00 UTC

An API design heuristic.

For a while now, I've been wondering whether, in the language of Postel's law, one should favour being liberal in what one accepts over being conservative in what one sends. Yes, according to the design principle, a protocol or API should do both, but sometimes, you can't do that. Instead, you'll have to choose. I've recently reached the tentative conclusion that it may be a good idea favouring being conservative in what one sends.

Good API design explicitly considers contracts. What are the preconditions for invoking an operation? What are the postconditions? Are there any invariants? These questions are relevant far beyond object-oriented design. They are equally important in Functional Programming, as well as in service-oriented design.

If you have a type system at your disposal, you can often model pre- and postconditions as types. In practice, however, it frequently turns out that there's more than one way of doing that. You can model an additional precondition with an input type, but you can also model potential errors as a return type. Which option is best?

That's what this article is about, and my conjecture is that constraining the input type may be preferable, thus being conservative about what is returned.

An average example #

That's all quite abstract, so for the rest of this article, I'll discuss this kind of problem in the context of an example. We'll revisit the good old example of calculating an average value. This example, however, is only a placeholder for any kind of API design problem. This article is only superficially about designing an API for calculating an average. More generally, this is about API design. I like the average example because it's easy to follow, and it does exhibit some characteristics that you can hopefully extrapolate from.

In short, what is the contract of the following method?

public static TimeSpan Average(this IEnumerable<TimeSpantimeSpans)
{
    var sum = TimeSpan.Zero;
    var count = 0;
    foreach (var ts in timeSpans)
    {
        sum += ts;
        count++;
    }
    return sum / count;
}

What are the preconditions? What are the postconditions? Are there any invariants?

Before I answer these questions, I'll offer equivalent code in two other languages. Here it is in F#:

let average (timeSpans : TimeSpan seq) =
    timeSpans
    |> Seq.averageBy (_.Ticks >> double)
    |> int64
    |> TimeSpan.FromTicks

And in Haskell:

average :: (Fractional a, Foldable t) => t a -> a
average xs = sum xs / fromIntegral (length xs)

These three examples have somewhat different implementations, but the same externally observable behaviour. What is the contract?

It seems straightforward: If you input a sequence of values, you get the average of all of those values. Are there any preconditions? Yes, the sequence can't be empty. Given an empty sequence, all three implementations throw an exception. (The Haskell version is a little more nuanced than that, but given an empty list of NominalDiffTime, it does throw an exception.)

Any other preconditions? At least one more: The sequence must be finite. All three functions allow infinite streams as input, but if given one, they will fail to return an average.

Are there any postconditions? I can only think of a statement that relates to the preconditions: If the preconditions are fulfilled, the functions will return the correct average value (within the precision allowed by floating-point calculations).

All of this, however, is just warming up. We've been over this ground before.

Modelling contracts #

Keep in mind that this average function is just an example. Think of it as a stand-in for a procedure that's much more complicated. Think of the most complicated operation in your code base.

Not only do real code bases have many complicated operations. Each comes with its own contract, different from the other operations, and if the team isn't explicitly thinking in terms of contracts, these contracts may change over time, as the team adds new features and fixes bugs.

It's difficult work to keep track of all those contracts. As I argue in Code That Fits in Your Head, it helps if you can automate away some of that work. One way is having good test coverage. Another is to leverage a static type system, if you're fortunate enough to work in a language that has one. As I've also already covered, you can't replace all rules with types, but it doesn't mean that using the type system is ineffectual. Quite the contrary. Every part of a contract that you can offload to the type system frees up your brain to think about something else - something more important, hopefully.

Sometimes there's no good way to to model a precondition with a type, or perhaps it's just too awkward. At other times, there's really only a single way to address a concern. When it comes to the precondition that you can't pass an infinite sequence to the average function, change the type so that it takes some finite collection instead. That's not what this article is about, though.

Assuming that you've already dealt with the infinite-sequence issue, how do you address the other precondition?

Error-handling #

A typical object-oriented move is to introduce a Guard Clause:

public static TimeSpan Average(this IReadOnlyCollection<TimeSpantimeSpans)
{
    if (!timeSpans.Any())
        throw new ArgumentOutOfRangeException(
            nameof(timeSpans),
            "Can't calculate the average of an empty collection.");
 
    var sum = TimeSpan.Zero;
    foreach (var ts in timeSpans)
        sum += ts;
    return sum / timeSpans.Count;
}

You could do the same in F#:

let average (timeSpans : TimeSpan seq) =
    if Seq.isEmpty timeSpans then
        raise (
            ArgumentOutOfRangeException(
                nameof timeSpans,
                "Can't calculate the average of an empty collection."))
 
    timeSpans
    |> Seq.averageBy (_.Ticks >> double)
    |> int64
    |> TimeSpan.FromTicks

You could also replicate such behaviour in Haskell, but it'd be highly unidiomatic. Instead, I'd rather discuss one idiomatic solution in Haskell, and then back-port it.

While you can throw exceptions in Haskell, you typically handle predictable errors with a sum type. Here's a version of the Haskell function equivalent to the above C# code:

average :: (Foldable t, Fractional a) => t a -> Either String a
average xs =
  if null xs
    then Left "Can't calculate the average of an empty collection."
    else Right $ sum xs / fromIntegral (length xs)

For the readers that don't know the Haskell base library by heart, null is a predicate that checks whether or not a collection is empty. It has nothing to do with null pointers.

This variation returns an Either value. In practice you shouldn't just return a String as the error value, but rather a strongly-typed value that other code can deal with in a robust manner.

On the other hand, in this particular example, there's really only one error condition that the function is able to detect, so you often see a variation where instead of a single error message, such a function just doesn't return anything:

average :: (Foldable t, Fractional a) => t a -> Maybe a
average xs = if null xs then Nothing else Just $ sum xs / fromIntegral (length xs)

This iteration of the function returns a Maybe value, indicating that a return value may or may not be present.

Liberal domain #

We can back-port this design to F#, where I'd also consider it idiomatic:

let average (timeSpans : IReadOnlyCollection<TimeSpan>) =
    if timeSpans.Count = 0 then None else
        timeSpans
        |> Seq.averageBy (_.Ticks >> double)
        |> int64
        |> TimeSpan.FromTicks
        |> Some

This version returns a TimeSpan option rather than just a TimeSpan. While this may seem to put the burden of error-handling on the caller, nothing has really changed. The fundamental situation is the same. Now the function is just being more explicit (more honest, you could say) about the pre- and postconditions. The type system also now insists that you deal with the possibility of error, rather than just hoping that the problem doesn't occur.

In C# you can expand the codomain by returning a nullable TimeSpan value, but such an option may not always be available at the language level. Keep in mind that the Average method is just an example standing in for something that may be more complicated. If the original return type is a reference type rather than a value type, only recent versions of C# allows statically-checked nullable reference types. What if you're working in an older version of C#, or another language that doesn't have that feature?

In that case, you may need to introduce an explicit Maybe class and return that:

public static Maybe<TimeSpan> Average(this IReadOnlyCollection<TimeSpan> timeSpans)
{
    if (timeSpans.Count == 0)
        return new Maybe<TimeSpan>();
 
    var sum = TimeSpan.Zero;
    foreach (var ts in timeSpans)
        sum += ts;
    return new Maybe<TimeSpan>(sum / timeSpans.Count);
}

Two things are going on here; one is obvious while the other is more subtle. Clearly, all of these alternatives change the static type of the function in order to make the pre- and postconditions more explicit. So far, they've all been loosening the codomain (the return type). This suggests a connection with Postel's law: be conservative in what you send, be liberal in what you accept. These variations are all liberal in what they accept, but it seems that the API design pays the price by also having to widen the set of possible return values. In other words, such designs aren't conservative in what they send.

Do we have other options?

Conservative codomain #

Is it possible to instead design the API in such a way that it's conservative in what it returns? Ideally, we'd like it to guarantee that it returns a number. This is possible by making the preconditions even more explicit. I've also covered that alternative already, so I'm just going to repeat the C# code here without further comments:

public static TimeSpan Average(this NotEmptyCollection<TimeSpantimeSpans)
{
    var sum = timeSpans.Head;
    foreach (var ts in timeSpans.Tail)
        sum += ts;
    return sum / timeSpans.Count;
}

This variation promotes another precondition to a type. The precondition that the input collection mustn't be empty can be explicitly modelled with a type. This enables us to be conservative about the codomain. The method now guarantees that it will return a value.

This idea is also easily ported to F#:

type NonEmpty<'a> = { Head : 'a; Tail : IReadOnlyCollection<'a> }
 
let average (timeSpans : NonEmpty<TimeSpan>) =
    [ timeSpans.Head ] @ List.ofSeq timeSpans.Tail
    |> List.averageBy (_.Ticks >> double)
    |> int64
    |> TimeSpan.FromTicks

The average function now takes a NonEmpty collection as input, and always returns a proper TimeSpan value.

Haskell already comes with a built-in NonEmpty collection type, and while it oddly doesn't come with an average function, it's easy enough to write:

import qualified Data.List.NonEmpty as NE

average :: Fractional a => NE.NonEmpty a -> a
average xs = sum xs / fromIntegral (NE.length xs)

You can find a recent example of using a variation of that function here.

Choosing between the two alternatives #

While Postel's law recommends having liberal domains and conservative codomains, in the case of the average API, we can't have both. If we design the API with a liberal input type, the output type has to be liberal as well. If we design with a restrictive input type, the output can be guaranteed. In my experience, you'll often find yourself in such a conundrum. The average API examined in this article is just an example, while the problem occurs often.

Given such a choice, what should you choose? Is it even possible to give general guidance on this sort of problem?

For decades, I considered such a choice a toss-up. After all, these solutions seem to be equivalent. Perhaps even isomorphic?

When I recently began to explore this isomorphism more closely, it dawned on me that there's a small asymmetry in the isomorphism that favours the conservative codomain option.

Isomorphism #

An isomorphism is a two-way translation between two representations. You can go back and forth between the two alternatives without loss of information.

Is this possible with the two alternatives outlined above? For example, if you have the conservative version, can create the liberal alternative? Yes, you can:

average' :: Fractional a => [a] -> Maybe a
average' = fmap average . NE.nonEmpty

Not surprisingly, this is trivial in Haskell. If you have the conservative version, you can just map it over a more liberal input.

In F# it looks like this:

module NonEmpty =
    let tryOfSeq xs =
        if Seq.isEmpty xs then None
        else Some { Head = Seq.head xs; Tail = Seq.tail xs |> List.ofSeq }
 
let average' (timeSpans : IReadOnlyCollection<TimeSpan>) =
    NonEmpty.tryOfSeq timeSpans |> Option.map average

In C# we can create a liberal overload that calls the conservative method:

public static TimeSpan? Average(this IReadOnlyCollection<TimeSpan> timeSpans)
{
    if (timeSpans.Count == 0)
        return null;
 
    var arr = timeSpans.ToArray();
    return new NotEmptyCollection<TimeSpan>(arr[0], arr[1..]).Average();
}

Here I just used a Guard Clause and explicit construction of the NotEmptyCollection. I could also have added a NotEmptyCollection.TryCreate method, like in the F# and Haskell examples, but I chose the above slightly more imperative style in order to demonstrate that my point isn't tightly coupled to the concept of functors, mapping, and other Functional Programming trappings.

These examples highlight how you can trivially make a conservative API look like a liberal API. Is it possible to go the other way? Can you make a liberal API look like a conservative API?

Yes and no.

Consider the liberal Haskell version of average, shown above; that's the one that returns Maybe a. Can you make a conservative function based on that?

average' :: Fractional a => NE.NonEmpty a -> a
average' xs = fromJust $ average xs

Yes, this is possible, but only by resorting to the partial function fromJust. I'll explain why that is a problem once we've covered examples in the two other languages, such as F#:

let average' (timeSpans : NonEmpty<TimeSpan>) =
    [ timeSpans.Head ] @ List.ofSeq timeSpans.Tail |> average |> Option.get

In this variation, average is the liberal version shown above; the one that returns a TimeSpan option. In order to make a conservative version, the average' function can call the liberal average function, but has to resort to the partial function Option.get.

The same issue repeats a third time in C#:

public static TimeSpan Average(this NotEmptyCollection<TimeSpan> timeSpans)
{
    return timeSpans.ToList().Average().Value;
}

This time, the partial function is the unsafe Value property, which throws an InvalidOperationException if there's no value.

This even violates Microsoft's own design guidelines:

"AVOID throwing exceptions from property getters."

I've cited Cwalina and Abrams as the authors, since this rule can be found in my 2006 edition of Framework Design Guidelines. This isn't a new insight.

While the two alternatives are 'isomorphic enough' that we can translate both ways, the translations are asymmetric in the sense that one is safe, while the other has to resort to an inherently unsafe operation to make it work.

Encapsulation #

I've called the operations fromJust, Option.get, and Value partial, and only just now used the word unsafe. You may protest that neither of the three examples are unsafe in practice, since we know that the input is never empty. Thus, we know that the liberal function will always return a value, and therefore it's safe to call a partial function, even though these operations are unsafe in the general case.

While that's true, consider how the burden shifts. When you want to promote a conservative variant to a liberal variant, you can rely on all the operations being total. On the other hand, if you want to make a liberal variant look conservative, the onus is on you. None of the three type systems on display here can perform that analysis for you.

This may not be so bad when the example is as simple as taking the average of a collection of numbers, but does it scale? What if the operation you're invoking is much more complicated? Can you still be sure that you safely invoke a partial function on the return value?

As Code That Fits in Your Head argues, procedures quickly become so complicated that they no longer fit in your head. If you don't have well-described and patrolled contracts, you don't know what the postconditions are. You can't trust the return values from method calls, or even the state of the objects you passed as arguments. This tend to lead to defensive coding, where you write code that checks the state of everything all too often.

The remedy is, as always, good old encapsulation. In this case, check the preconditions at the beginning, and capture the result of that check in an object or type that is guaranteed to be always valid. This goes beyond making illegal states unrepresentable because it also works with predicative types. Once you're past the Guard Clauses, you don't have to check the preconditions again.

This kind of thinking illustrates why you need a multidimensional view on API design. As useful as Postel's law sometimes is, it doesn't address all problems. In fact, it turned out to be unhelpful in this context, while another perspective proves more fruitful. Encapsulation is the art and craft of designing APIs in such a way that they suggest or even compels correct interactions. The more I think of this, the more it strikes me that a ranking is implied: Preconditions are more important than postconditions, because if the preconditions are unfulfilled, you can't trust the postconditions, either.

Mapping #

What's going on here? One perspective is to view types as sets. In the average example, the function maps from one set to another:

Mapping from the set of collections to the set of real numbers.

Which sets are they? We can think of the average function as a mapping from the set of non-empty collections of numbers to the set of real numbers. In programming, we can't represent real numbers, so instead, the left set is going to be the set of all the non-empty collections the computer or the language can represent and hold in (virtual) memory, and the right-hand set is the set of all the possible numbers of whichever type you'd like (32-bit signed integers, 64-bit floating-point numbers, 8-bit unsigned integers, etc.).

In reality, the left-hand set is much larger than the set to the right.

Drawing all those arrows quickly becomes awkward , so instead, we may draw each mapping as a pipe. Such a pipe also corresponds to a function. Here's an intermediate step in such a representation:

Mapping from one set to the other, drawn inside a transparent pipe.

One common element is, however, missing from the left set. Which one?

Pipes #

The above mapping corresponds to the conservative variation of the function. It's a total function that maps all values in the domain to a value in the codomain. It accomplishes this trick by explicitly constraining the domain to only those elements on which it's defined. Due to the preconditions, that excludes the empty collection, which is therefore absent from the left set.

What if we also want to allow the empty collection to be a valid input?

Unless we find ourselves in some special context where it makes sense to define a 'default average value', we can't map an empty collection to any meaningful number. Rather, we'll have to map it to some special value, such as Nothing, None, or null:

Mapping the empty collection to null in a pipe separate, but on top of, the proper function pipe.

This extra pipe is free, because it's supplied by the Maybe functor's mapping (Select, map, fmap).

What happens if we need to go the other way? If the function is the liberal variant that also maps the empty collection to a special element that indicates a missing value?

Mapping all collections, including the empty collection, to the set of real numbers.

In this case, it's much harder to disentangle the mappings. If you imagine that a liquid flows through the pipes, we can try to be careful and avoid 'filling up' the pipe.

Pipe partially filled with liquid.

The liquid represents the data that we do want to transmit through the pipe. As this illustration suggests, we now have to be careful that nothing goes wrong. In order to catch just the right outputs on the right side, you need to know how high the liquid may go, and attach a an 'flat-top' pipe to it:

Pipe composed with open-top pipe.

As this illustration tries to get across, this kind of composition is awkward and error-prone. What's worse is that you need to know how high the liquid is going to get on the right side. This depends on what actually goes on inside the pipe, and what kind of input goes into the left-hand side.

This is a metaphor. The longer the pipe is, the more difficult it gets to keep track of that knowledge. The stubby little pipe in these illustrations may correspond to the average function, which is an operation that easily fits in our heads. It's not too hard to keep track of the preconditions, and how they map to postconditions.

Thus, turning such a small liberal function into a conservative function is possible, but already awkward. If the operation is complicated, you can no longer keep track of all the details of how the inputs relate to the outputs.

Additive extensibility #

This really shouldn't surprise us. Most programming languages come with all sorts of facilities that enable extensibility: The ability to add more functionality, more behaviour, more capabilities, to existing building blocks. Conversely, few languages come with removability facilities. You can't, commonly, declare that an object is an instance of a class, except one method, or that a function is just like another function, except that it doesn't accept a particular subset of input.

This explains why we can safely make a conservative function liberal, but why it's difficult to make a liberal function conservative. This is because making a conservative function liberal adds functionality, while making a liberal function conservative attempts to remove functionality.

Conjecture #

All this leads me to the following conjecture: When faced with a choice between two versions of an API, where one has a liberal domain, and the other a conservative codomain, choose the design with the conservative codomain.

If you need the liberal version, you can create it from the conservative operation. The converse need not be true.

Conclusion #

Postel's law encourages us to be liberal with what we accept, but conservative with what we return. This is a good design heuristic, but sometimes you're faced with mutually exclusive alternatives. If you're liberal with what you accept, you'll also need to be too loose with what you return, because there are input values that you can't handle. On the other hand, sometimes the only way to be conservative with the output is to also be restrictive when it comes to input.

Given two such alternatives, which one should you choose?

This article conjectures that you should choose the conservative alternative. This isn't a political statement, but simply a result of the conservative design being the smaller building block. From a small building block, you can compose something bigger, whereas from a bigger unit, you can't easily extract something smaller that's still robust and useful.


Service compatibility is determined based on policy

Monday, 29 April 2024 11:12:00 UTC

A reading of the fourth Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the fourth tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Service compatibility is determined based on policy #

The fourth tenet is the forgotten one. I could rarely remember exactly what it included, but it does give me an opportunity to bring up a few points about compatibility. The articles said:

Object-oriented designs often confuse structural compatibility with semantic compatibility. Service-orientation deals with these two axes separately. Structural compatibility is based on contract and schema and can be validated (if not enforced) by machine-based techniques (such as packet-sniffing, validating firewalls). Semantic compatibility is based on explicit statements of capabilities and requirements in the form of policy.

Every service advertises its capabilities and requirements in the form of a machine-readable policy expression. Policy expressions indicate which conditions and guarantees (called assertions) must hold true to enable the normal operation of the service. Policy assertions are identified by a stable and globally unique name whose meaning is consistent in time and space no matter which service the assertion is applied to. Policy assertions may also have parameters that qualify the exact interpretation of the assertion. Individual policy assertions are opaque to the system at large, which enables implementations to apply simple propositional logic to determine service compatibility.

As you can tell, this description is the shortest of the four. This is also the point where I begin to suspect that my reading of the third tenet may deviate from what Don Box originally had in mind.

This tenet is also the most baffling to me. As I understand it, the motivation behind the four tenets was to describe assumptions about the kind of systems that people would develop with Windows Communication Foundation (WCF), or SOAP in general.

While I worked with WCF for a decade, the above description doesn't ring a bell. Reading it now, the description of policy sounds more like a system such as clojure.spec, although that's not something I know much about either. I don't recall WCF ever having a machine-readable policy subsystem, and if it had, I never encountered it.

It does seem, however, as though what I interpret as contract, Don Box called policy.

Despite my confusion, the word compatibility is worth discussing, regardless of whether that was what Don Box meant. A well-designed service is one where you've explicitly considered forwards and backwards compatibility.

Versioning #

Planning for forwards and backwards compatibility does not imply that you're expected to be able to predict the future. It's fine if you have so much experience developing and maintaining online systems that you may have enough foresight to plan for certain likely changes that you may have to make in the future, but that's not what I have in mind.

Rather, what you should do is to have a system that enables you to detect breaking changes before you deploy them. Furthermore you should have a strategy for how to deal with the perceived necessity to introduce breaking changes.

The most effective strategy that I know of is to employ explicit versioning, particularly message versioning. You can version an entire service as one indivisible block, but I often find it more useful to version at the message level. If you're designing a REST API, for example, you can take advantage of Content Negotiation.

If you like, you can use Semantic Versioning as a versioning scheme, but for services, the thing that mostly matters is the major version. Thus, you may simply label your messages with the version numbers 1, 2, etc.

If you already have a published service without explicit message version information, then you can still retrofit versioning afterwards. Imagine that your existing data looks like this:

{
  "singleTable": {
    "capacity": 16,
    "minimalReservation": 10
  }
}

This JSON document has no explicit version information, but you can interpret that as implying that the document has the 'default' version, which is always 1:

{
  "singleTable": {
    "version": 1,
    "capacity": 16,
    "minimalReservation": 10
  }
}

If you later realize that you need to make a breaking change, you can do that by increasing the (major) version:

{
  "singleTable": {
    "version": 2,
    "id": 12,
    "capacity": 16,
    "minimalReservation": 10
  }
}

Recipients can now look for the version property to learn how to interpret the rest of the message, and failing to find it, infer that this is version 1.

As Don Box wrote, in a service-oriented system, you can't just update all systems in a single coordinated release. Therefore, you must never break compatibility. Versioning enables you to move forward in a way that does break with the past, but without breaking existing clients.

Ultimately, you may attempt to retire old service versions, but be ready to keep them around for a long time.

For more of my thoughts about backwards compatibility, see Backwards compatibility as a profunctor.

Conclusion #

The fourth tenet is the most nebulous, and I wonder if it was ever implemented. If it was, I'm not aware of it. Even so, compatibility is an important component of service design, so I took the opportunity to write about that. In most cases, it pays to think explicitly about message versioning.

I have the impression that Don Box had something in mind more akin to what I call contract. Whether you call it one thing or another, it stands to reason that you often need to attach extra rules to simple types. The schema may define an input value as a number, but the service does require that this particular number is a natural number. Or that a string is really a proper encoding of a date. Perhaps you call that policy. I call it contract. In any case, clearly communicating such expectations is important for systems to be compatible.


Fitting a polynomial to a set of points

Monday, 22 April 2024 05:35:00 UTC

The story of a fiasco.

This is the second in a small series of articles titled Trying to fit the hype cycle. In the introduction, I've described the exercise I had in mind: Determining a formula, or at least a piecewise function, for the Gartner hype cycle. This, to be clear, is an entirely frivolous exercise with little practical application.

In the previous article, I extracted a set of (x, y) coordinates from a bitmap. In this article, I'll showcase my failed attempt at fitting the data to a polynomial.

Failure #

I've already revealed that I failed to accomplish what I set out to do. Why should you read on, then?

You don't have to, and I can't predict the many reasons my readers have for occasionally swinging by. Therefore, I can't tell you why you should keep reading, but I can tell you why I'm writing this article.

This blog is a mix of articles that I write because readers ask me interesting questions, and partly, it's my personal research-and-development log. In that mode, I write about things that I've learned, and I write in order to learn. One can learn from failure as well as from success.

I'm not that connected to 'the' research community (if such a thing exists), but I'm getting the sense that there's a general tendency in academia that researchers rarely publish their negative results. This could be a problem, because this means that the rest of us never learn about the thousands of ways that don't work.

Additionally, in 'the' programming community, we also tend to boast our victories and hide our failures. More than one podcast (sorry about the weasel words, but I don't remember which ones) have discussed how this gives young programmers the wrong impression of what programming is like. It is, indeed, a process of much trial and error, but usually, we only publish our polished, final result.

Well, I did manage to produce code to fit a polynomial to the Gartner hype cycle, but I never managed to get a good fit.

The big picture #

I realize that I have a habit of burying the lede when I write technical articles. I don't know if I've picked up that tendency from F#, which does demand that you define a value or function before you can use it. This, by the way, is a good feature.

Here, I'll try to do it the other way around, and start with the big picture:

data = numpy.loadtxt('coords.txt', delimiter=',')
 
x = data[:, 0]
t = data[:, 1]
w = fit_polynomial(x, t, 9)
 
plot_fit(x, t, w)

This, by the way, is a Python script, and it opens with these imports:

import numpy
import matplotlib.pyplot as plt

The first line of code reads the CSV file into the data variable. The first column in that file contains all the x values, and the second column the y values. The book that I've been following uses t for the data, rather than y. (Now that I think about it, I believe that this may only be because it works from an example in which the data to be fitted are 100 m dash times, denoted t.)

Once the script has extracted the data, it calls the fit_polynomial function to produce a set of weights w. The constant 9 is the degree of polynomial to fit, although I think that I've made an off-by-one error so that the result is only a eighth-degree polynomial.

Finally, the code plots the original data together with the polynomial:

Gartner hype cycle and a eighth-degree fitted polynomial.

The green dots are the (x, y) coordinates that I extracted in the previous article, while the red curve is the fitted eighth-degree polynomial. Even though we're definitely in the realm of over-fitting, it doesn't reproduce the Gartner hype cycle.

I've even arrived at the value 9 after some trial and error. After all, I wasn't trying to do any real science here, so over-fitting is definitely allowed. Even so, 9 seems to be the best fit I can achieve. With lover values, like 8, below, the curve deviates too much:

Gartner hype cycle and a seventh-degree fitted polynomial.

The value 10 looks much like 9, but above that (11), the curve completely disconnects from the data, it seems:

Gartner hype cycle and a tenth-degree fitted polynomial.

I'm not sure why it does this, to be honest. I would have thought that the more degrees you added, the more (over-)fitted the curve would be. Apparently, this is not so, or perhaps I made a mistake in my code.

Calculating the weights #

The fit_polynomial function calculates the polynomial coefficients using a linear algebra formula that I've found in at least two text books. Numpy makes it easy to invert, transpose, and multiply matrices, so the formula itself is just a one-liner. Here it is in the entire context of the function, though:

def fit_polynomial(x, t, degree):
    """
    Fits a polynomial to the given data.
 
    Parameters
    ----------
    x : Array of shape [n_samples]
    t : Array of shape [n_samples]
    degree : degree of the polynomial
 
    Returns
    -------
    w : Array of shape [degree + 1]
    """
 
    # This expansion creates a matrix, so we name that with an upper-case letter
    # rather than a lower-case letter, which is used for vectors.
    X = expand(x.reshape((len(x), 1)), degree)
    return numpy.linalg.inv(X.T @ X) @ X.T @ t

This may look daunting, but is really just two lines of code. The rest is docstring and a comment.

The above-mentioned formula is the last line of code. The one before that expands the input data t from a simple one-dimensional array to a matrix of those values squared, cubed, etc. That's how you use the least squares method if you want to fit it to a polynomial of arbitrary degree.

Expansion #

The expand function looks like this:

def expand(x, degree):
    """
    Expands the given array to polynomial elements of the given degree.
 
    Parameters
    ----------
    x : Array of shape [n_samples, 1]
    degree : degree of the polynomial
 
    Returns
    -------
    Xp : Array of shape [n_samples, degree + 1]
    """
 
    Xp = numpy.ones((len(x), 1))
    for i in range(1, degree + 1):
        Xp = numpy.hstack((Xp, numpy.power(x, i)))
    return Xp

The function begins by creating a column vector of ones, here illustrated with only three rows:

>>> Xp = numpy.ones((3, 1))
>>> Xp
array([[1.],
       [1.],
       [1.]])

It then proceeds to loop over as many degrees as you've asked it to, each time adding a column to the Xp matrix. Here's an example of doing that up to a power of three, on example input [1,2,3]:

>>> x = numpy.array([1,2,3]).reshape((3, 1))
>>> x
array([[1],
       [2],
       [3]])
>>> Xp = numpy.hstack((Xp, numpy.power(x, 1)))
>>> Xp
array([[1., 1.],
       [1., 2.],
       [1., 3.]])
>>> Xp = numpy.hstack((Xp, numpy.power(x, 2))) 
>>> Xp
array([[1., 1., 1.],
       [1., 2., 4.],
       [1., 3., 9.]])
>>> Xp = numpy.hstack((Xp, numpy.power(x, 3))) 
>>> Xp
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  2.,  4.,  8.],
       [ 1.,  3.,  9., 27.]])

Once it's done looping, the expand function returns the resulting Xp matrix.

Plotting #

Finally, here's the plot_fit procedure:

def plot_fit(x, t, w):
    """
    Plots the polynomial with the given weights and the data.
 
    Parameters
    ----------
    x : Array of shape [n_samples]
    t : Array of shape [n_samples]
    w : Array of shape [degree + 1]
    """
 
    xs = numpy.linspace(x[0], x[0]+len(x), 100)
    ys = numpy.polyval(w[::-1], xs)
 
    plt.plot(xs, ys, 'r')
    plt.scatter(x, t, s=10, c='g')
    plt.show()

This is fairly standard pyplot code, so I don't have much to say about it.

Conclusion #

When I started this exercise, I'd hoped that I could get close to the Gartner hype cycle by over-fitting the model to some ridiculous polynomial degree. This turned out not to be the case, for reasons that I don't fully understand. As I increase the degree, the curve begins to deviate from the data.

I can't say that I'm a data scientist or a statistician of any skill, so it's possible that my understanding is still too shallow. Perhaps I'll return to this article later and marvel at the ineptitude on display here.


Services share schema and contract, not class

Monday, 15 April 2024 07:25:00 UTC

A reading of the third Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the third tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Services share schema and contract, not class #

Compared to the second tenet, the following description may at first seem more dated. Here's what the article said:

Object-oriented programming encourages developers to create new abstractions in the form of classes. Most modern development environments not only make it trivial to define new classes, modern IDEs do a better job guiding you through the development process as the number of classes increases (as features like IntelliSense® provide a more specific list of options for a given scenario).

Classes are convenient abstractions as they share both structure and behavior in a single named unit. Service-oriented development has no such construct. Rather, services interact based solely on schemas (for structures) and contracts (for behaviors). Every service advertises a contract that describes the structure of messages it can send and/or receive as well as some degree of ordering constraints over those messages. This strict separation between structure and behavior vastly simplifies deployment, as distributed object concepts such as marshal-by-value require a common execution and security environment which is in direct conflict with the goals of autonomous computing.

Services do not deal in types or classes per se; rather, only with machine readable and verifiable descriptions of the legal "ins and outs" the service supports. The emphasis on machine verifiability and validation is important given the inherently distributed nature of how a service-oriented application is developed and deployed. Unlike a traditional class library, a service must be exceedingly careful about validating the input data that arrives in each message. Basing the architecture on machine-validatible schema and contract gives both developers and infrastructure the hints they need to protect the integrity of an individual service as well as the overall application as a whole.

Because the contract and schema for a given service are visible over broad ranges of both space and time, service-orientation requires that contracts and schema remain stable over time. In the general case, it is impossible to propagate changes in schema and/or contract to all parties who have ever encountered a service. For that reason, the contract and schema used in service-oriented designs tend to have more flexibility than traditional object-oriented interfaces. It is common for services to use features such as XML element wildcards (like xsd:any) and optional SOAP header blocks to evolve a service in ways that do not break already deployed code.

With its explicit discussion of XML, SOAP, and XSD, this description may seem more stuck in 2004 than the two first tenets.

I'll cover the most obvious consequence first.

At the boundaries... #

In the MSDN article, the four tenets guide the design of Windows Communication Foundation (WCF) - a technology that in 2004 was under development, but still not completed. While SOAP already existed as a platform-independent protocol, WCF was a .NET endeavour. Most developers using the Microsoft platform at the time were used to some sort of binary protocol, such as DCOM or .NET Remoting. Thus, it makes sense that Don Box was deliberately explicit that this was not how SOA (or WCF) was supposed to work.

In fact, since SOAP is platform-independent, you could write a web service in one language (say, Java) and consume it with a different language (e.g. C++). WCF was Microsoft's SOAP technology for .NET.

If you squint enough that you don't see the explicit references to XML or SOAP, however, the description still applies. Today, you may exchange data with JSON over REST, Protocol Buffers via gRPC, or something else, but it's still common to have a communications protocol that is independent of specific service implementations. A service may be written in Python, Haskell, C, or any other language that supports the wire format. As this little list suggests, the implementation language doesn't even have to be object-oriented.

In fact,

A formal interface definition language (IDL) may enable you to automate serialization and deserialization, but these are usually constrained to defining the shape of data and operations. Don Box talks about validation, and types don't replace validation - particularly if you allow xsd:any. That particular remark is quite at odds with the notion that a formal schema definition is necessary, or even desirable.

And indeed, today we often see JSON-based REST APIs that are more loosely defined. Even so, the absence of a machine-readable IDL doesn't entail the absence of a schema. As Alexis King wrote related to the static-versus-dynamic-types debate, dynamic type systems are not inherently more open. A similar argument can be made about schema. Regardless of whether or not a formal specification exists, a service always has a de-facto schema.

To be honest, though, when I try to interpret what this and the next tenet seem to imply, an IDL may have been all that Don Box had in mind. By schema he may only have meant XSD, and by contract, he may only have meant SOAP. More broadly speaking, this notion of contract may entail nothing more than a list of named operations, and references to schemas that indicate what input each operation takes, and what output it returns.

What I have in mind with the rest of this article may be quite an embellishment on that notion. In fact, my usual interpretation of the word contract may be more aligned with what Don Box calls policy. Thus, if you want a very literal reading of the four tenets, what comes next may fit better with the fourth tenet, that service compatibility is determined based on policy.

Regardless of whether you think that the following discussion belongs here, or in the next article, I'll assert that it's paramount to designing and developing useful and maintainable web services.

Encapsulation #

If we, once more, ignore the particulars related to SOAP and XML, we may rephrase the notion of schema and contract as follows. Schema describes the shape of data: Is it a number, a string, a tuple, or a combination of these? Is there only one, or several? Is the data composed from smaller such definitions? Does the composition describe the combination of several such definitions, or does it describe mutually exclusive alternatives?

Compliant data may be encoded as objects or data structures in memory, or serialized to JSON, XML, CSV, byte streams, etc. We may choose to call a particular agglomeration of data a message, which we may pass from one system to another. The first tenet already used this metaphor.

You can't, however, just pass arbitrary valid messages from one system to another. Certain operations allow certain data, and may promise to return other kinds of messages. In additions to the schema, we also need to describe a contract.

What's a contract? If you consult Object-Oriented Software Construction, a contract stipulates invariants, pre- and postconditions for various operations.

Preconditions state what must be true before an operation can take place. This often puts the responsibility on the caller to ensure that the system is in an appropriate state, and that the message that it intends to pass to the other system is valid according to that state.

Postconditions, on the other hand, detail what the caller can expect in return. This includes guarantees about response messages, but may also describe the posterior state of the system.

Invariants, finally, outline what is always true about the system.

Although such a description of a contract originates from a book about object-oriented design, it's useful in other areas, too, such as functional programming. It strikes me that it applies equally well in the context of service-orientation.

The combination of contract and well-described message structure is, in other words, encapsulation. There's nothing wrong with that: It works. If you actually apply it as a design principle, that is.

Conclusion #

The third SOA tenet emphasizes that only data travels over service boundaries. In order to communicate effectively, services must agree on the shape of data, and which operations are legal when. While they exchange data, however, they don't share address space, or even internal representation.

One service may be written in F# and the client in Clojure. Even so, it's important that they have a shared understanding of what is possible, and what is not. The more explicit you, as a service owner, can be, the better.

Next: Service compatibility is determined based on policy.


Extracting curve coordinates from a bitmap

Monday, 08 April 2024 05:32:00 UTC

Another example of using Haskell as an ad-hoc scripting language.

This article is part of a short series titled Trying to fit the hype cycle. In the first article, I outlined what it is that I'm trying to do. In this article, I'll describe how I extract a set of x and y coordinates from this bitmap:

Gartner hype cycle.

(Actually, this is scaled-down version of the image. The file I work with is a bit larger.)

As I already mentioned in the previous article, these days there are online tools for just about everything. Most likely, there's also an online tool that will take a bitmap like that and return a set of (x, y) coordinates.

Since I'm doing this for the programming exercise, I'm not interested in that. Rather, I'd like to write a little Haskell script to do it for me.

Module and imports #

Yes, I wrote Haskell script. As I've described before, with good type inference, a statically typed language can be as good for scripting as a dynamic one. Just as might be the case with, say, a Python script, you'll be iterating, trying things out until finally the script settles into its final form. What I present here is the result of my exercise. You should imagine that I made lots of mistakes underway, tried things that didn't work, commented out code and print statements, imported modules I eventually didn't need, etc. Just like I imagine you'd also do with a script in a dynamically typed language. At least, that's how I write Python, when I'm forced to do that.

In other words, the following is far from the result of perfect foresight, but rather the equilibrium into which the script settled.

I named the module HypeCoords, because the purpose of it is to extract the (x, y) coordinates from the above Gartner hype cycle image. These are the imports it turned out that I ultimately needed:

module HypeCoords where
 
import qualified Data.List.NonEmpty as NE
import Data.List.NonEmpty (NonEmpty((:|)))
import Codec.Picture
import Codec.Picture.Types

The Codec.Picture modules come from the JuicyPixels package. This is what enables me to read a .png file and extract the pixels.

Black and white #

If you look at the above bitmap, you may notice that it has some vertical lines in a lighter grey than the curve itself. My first task, then, is to get rid of those. The easiest way to do that is to convert the image to a black-and-white bitmap, with no grey scale.

Since this is a one-off exercise, I could easily do that with a bitmap editor, but on the other hand, I thought that this was a good first task to give myself. After all, I didn't know the JuicyPixels library at all, so this was an opportunity to start with a task just a notch simpler than the one that was my actual goal.

I thought that the easiest way to convert to a black-and-white image would be to turn all pixels white if they are lighter than some threshold, and black otherwise.

A PNG file has more information than I need, so I first converted the image to an 8-bit RGB bitmap. Even though the above image looks as though it's entirely grey scale, each pixel is actually composed of three colours. In order to compare a pixel with a threshold, I needed a single measure of how light or dark it is.

That turned out to be about as simple as it sounds: Just take the average of the three colours. Later, I'd need a function to compute the average for another reason, so I made it a reusable function:

average :: Integral a => NE.NonEmpty a -> a
average nel = sum nel `div` fromIntegral (NE.length nel)

It's a bit odd that the Haskell base library doesn't come with such a function (at least to my knowledge), but anyway, this one is specialized to do integer division. Notice that this function computes only non-exceptional averages, since it requires the input to be a NonEmpty list. No division-by-zero errors here, please!

Once I'd computed a pixel average and compared it to a threshold value, I wanted to replace it with either black or white. In order to make the code more readable I defined two named constants:

black :: PixelRGB8
black = PixelRGB8 minBound minBound minBound
white :: PixelRGB8
white = PixelRGB8 maxBound maxBound maxBound

With that in place, converting to black-and-white is only a few more lines of code:

toBW :: PixelRGB8 -> PixelRGB8
toBW (PixelRGB8 r g b) =
  let threshold = 192 :: Integer
      lum = average (fromIntegral r :| [fromIntegral g, fromIntegral b])
  in if lum <= threshold then black else white

I arrived at the threshold of 192 after a bit of trial-and-error. That's dark enough that the light vertical lines fall to the white side, while the real curve becomes black.

What remained was to glue the parts together to save the black-and-white file:

main :: IO ()
main = do
  readResult <- readImage "hype-cycle-cleaned.png"
  case readResult of
    Left msg -> putStrLn msg
    Right img -> do
      let bwImg = pixelMap toBW $ convertRGB8 img
      writePng "hype-cycle-bw.png" bwImg

The convertRGB8 function comes from JuicyPixels.

The hype-cycle-bw.png picture unsurprisingly looks like this:

Black-and-white Gartner hype cycle.

Ultimately, I didn't need the black-and-white bitmap file. I just wrote the script to create the file in order to be able to get some insights into what I was doing. Trust me, I made a lot of stupid mistakes along the way, and among other issues had some 'fun' with integer overflows.

Extracting image coordinates #

Now I had a general feel for how to work with the JuicyPixels library. It still required quite a bit of spelunking through the documentation before I found a useful API to extract all the pixels from a bitmap:

pixelCoordinates :: Pixel a => Image a -> [((IntInt), a)]
pixelCoordinates = pixelFold (\acc x y px -> ((x,y),px):acc) []

While this is, after all, just a one-liner, I'm surprised that something like this doesn't come in the box. It returns a list of tuples, where the first element contains the pixel coordinates (another tuple), and the second element the pixel information (e.g. the RGB value).

One y value per x value #

There were a few more issues to be addressed. The black curve in the black-and-white bitmap is thicker than a single pixel. This means that for each x value, there will be several black pixels. In order to do linear regression, however, we need a single y value per x value.

One easy way to address that concern is to calculate the average y value for each x value. This may not always be the best choice, but as far as we can see in the above black-and-white image, it doesn't look as though there's any noise left in the picture. This means that we don't have to worry about outliers pulling the average value away from the curve. In other words, finding the average y value is an easy way to get what we need.

averageY :: Integral b => NonEmpty (a, b) -> (a, b)
averageY nel = (fst $ NE.head nel, average $ snd <$> nel)

The averageY function converts a NonEmpty list of tuples to a single tuple. Watch out! The input tuples are not the 'outer' tuples that pixelCoordinates returns, but rather a list of actual pixel coordinates. Each tuple is a set of coordinates, but since the function never manipulates the x coordinate, the type of the first element is just unconstrained a. It can literally be anything, but will, in practice, be an integer.

The assumption is that the input is a small list of coordinates that all share the same x coordinate, such as (42, 99) :| [(42, 100), (42, 102)]. The function simply returns a single tuple that it creates on the fly. For the first element of the return tuple, it picks the head tuple from the input ((42, 99) in the example), and then that tuple's fst element (42). For the second element, the function averages all the snd elements (99, 100, and 102) to get 100 (integer division, you may recall):

ghci> averageY ((42, 99) :| [(42, 100), (42, 102)])
(42,100)

What remains is to glue together the building blocks.

Extracting curve coordinates #

A few more steps were required, but these I just composed in situ. I found no need to define them as individual functions.

The final composition looks like this:

main :: IO ()
main = do
  readResult <- readImage "hype-cycle-cleaned.png"
  case readResult of
    Left msg -> putStrLn msg
    Right img -> do
      let bwImg = pixelMap toBW $ convertRGB8 img
      let blackPixels =
            fst <$> filter ((black ==) . snd) (pixelCoordinates bwImg)
      let h = imageHeight bwImg
      let lineCoords = fmap (h -) . averageY <$> NE.groupAllWith fst blackPixels
      writeFile "coords.txt" $
        unlines $ (\(x,y) -> show x ++ "," ++ show y) <$> lineCoords

The first lines of code, until and including let bwImg, are identical to what you've already seen.

We're only interested in the black pixels, so the main action uses the standard filter function to keep only those that are equal to the black constant value. Once the white pixels are gone, we no longer need the pixel information. The expression that defines the blackPixels value finally (remember, you read Haskell code from right to left) throws away the pixel information by only retaining the fst element. That's the tuple that contains the coordinates. You may want to refer back to the type signature of pixelCoordinates to see what I mean.

The blackPixels value has the type [(Int, Int)].

Two more things need to happen. One is to group the pixels together per x value so that we can use averageY. The other is that we want the coordinates as normal Cartesian coordinates, and right now, they're in screen coordinates.

When working with bitmaps, it's quite common that pixels are measured out from the top left corner, instead of from the bottom left corner. It's not difficult to flip the coordinates, but we need to know the height of the image:

let h = imageHeight bwImg

The imageHeight function is another JuicyPixels function.

Because I sometimes get carried away, I write the code in a 'nice' compact style that could be more readable. I accomplished both of the above remaining tasks with a single line of code:

let lineCoords = fmap (h -) . averageY <$> NE.groupAllWith fst blackPixels

This first groups the coordinates according to x value, so that all coordinates that share an x value are collected in a single NonEmpty list. This means that we can map all of those groups over averageY. Finally, the expression flips from screen coordinates to Cartesian coordinates by subtracting the y coordinate from the height h.

The final writeFile expression writes the coordinates to a text file as comma-separated values. The first ten lines of that file looks like this:

9,13
10,13
11,13
12,14
13,15
14,15
15,16
16,17
17,17
18,18
...

Do these points plot the Gartner hype cycle?

Sanity checking by plotting the coordinates #

To check whether the coordinates look useful, we could plot them. If I wanted to use a few more hours, I could probably figure out how to do that with JuicyPixels as well, but on the other hand, I already know how to do that with Python:

data = numpy.loadtxt('coords.txt', delimiter=',')
x = data[:, 0]
t = data[:, 1]
plt.scatter(x, t, s=10, c='g')
plt.show()

That produces this plot:

Coordinates plotted with Python.

LGTM.

Conclusion #

In this article, you've seen how a single Haskell script can extract curve coordinates from a bitmap. The file is 41 lines all in all, including module declaration and white space. This article shows every single line in that file, apart from some blank lines.

I loaded the file into GHCi and ran the main action in order to produce the CSV file.

I did spend a few hours looking around in the JuicyPixels documentation before I'd identified the functions that I needed. All in all I used some hours on this exercise. I didn't keep track of time, but I guess that I used more than three, but probably fewer than six, hours on this.

This was the successful part of the overall exercise. Now onto the fiasco.

Next: Fitting a polynomial to a set of points.


Trying to fit the hype cycle

Monday, 01 April 2024 07:14:00 UTC

An amateur tries his hand at linear modelling.

About a year ago, I was contemplating a conference talk I was going to give. Although I later abandoned the idea for other reasons, for a few days I was thinking about using the Gartner hype cycle for an animation. What I had in mind would require me to draw the curve in a way that would enable me to zoom in and out. Vector graphics would be much more useful for that job than a bitmap.

Gartner hype cycle.

Along the way, I considered if there was a function that would enable me to draw it on the fly. A few web searches revealed the Cross Validated question Is there a linear/mixture function that can fit the Gartner hype curve? So I wasn't the first person to have that idea, but at the time I found it, the question was effectively dismissed without a proper answer. Off topic, dontcha know?

A web search also seems to indicate the existence of a few research papers where people have undertaken this task, but there's not a lot about it. True, the Gartner hype cycle isn't a real function, but it sounds like a relevant exercise in statistics, if one's into that kind of thing.

Eventually, for my presentation, I went with another way to illustrate what I wanted to say, so for half I year, I didn't think more about it.

Linear regression? #

Recently, however, I was following a course in mathematical analysis of data, and among other things, I learned how to fit a line to data. Not just a straight line, but any degree of polynomial. So I thought that perhaps it'd be an interesting exercise to see if I could fit the hype cycle to some high-degree polynomial - even though I do realize that the hype cycle isn't a real function, and neither does it look like a straight polynomial function.

In order to fit a polynomial to the curve, I needed some data, so my first task was to convert an image to a series of data points.

I'm sure that there are online tools and apps that offer to do that for me, but the whole point of this was that I wanted to learn how to tackle problems like these. It's like doing katas. The journey is the goal.

This turned out to be an exercise consisting of two phases so distinct that I wrote them in two different languages.

As the articles will reveal, the first part went quite well, while the other was, essentially, a fiasco.

Conclusion #

There's not much point in finding a formula for the Gartner hype cycle, but the goal of this exercise was, for me, to tinker with some new techniques to see if I could learn from doing the exercise. And I did learn something.

In the next articles in this series, I'll go over some of the details.

Next: Extracting curve coordinates from a bitmap.


Services are autonomous

Monday, 25 March 2024 08:31:00 UTC

A reading of the second Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the second tenet. The quotes are from the MSDN Magazine article unless otherwise indicated.

Services are autonomous #

Compared with the first tenet, you'll see that Don Box had more to say about this one. I, conversely, have less to add. First, here's what the article said:

Service-orientation mirrors the real world in that it does not assume the presence of an omniscient or omnipotent oracle that has awareness and control over all parts of a running system. This notion of service autonomy appears in several facets of development, the most obvious place being the area of deployment and versioning.

Object-oriented programs tend to be deployed as a unit. Despite the Herculean efforts made in the 1990s to enable classes to be independently deployed, the discipline required to enable object-oriented interaction with a component proved to be impractical for most development organizations. When coupled with the complexities of versioning object-oriented interfaces, many organizations have become extremely conservative in how they roll out object-oriented code. The popularity of the XCOPY deployment and private assemblies capabilities of the .NET Framework is indicative of this trend.

Service-oriented development departs from object-orientation by assuming that atomic deployment of an application is the exception, not the rule. While individual services are almost always deployed atomically, the aggregate deployment state of the overall system/application rarely stands still. It is common for an individual service to be deployed long before any consuming applications are even developed, let alone deployed into the wild. Amazon.com is one example of this build-it-and-they-will-come philosophy. There was no way the developers at Amazon could have known the multitude of ways their service would be used to build interesting and novel applications.

It is common for the topology of a service-oriented application to evolve over time, sometimes without direct intervention from an administrator or developer. The degree to which new services may be introduced into a service-oriented system depends on both the complexity of the service interaction and the ubiquity of services that interact in a common way. Service-orientation encourages a model that increases ubiquity by reducing the complexity of service interactions. As service-specific assumptions leak into the public facade of a service, fewer services can reasonably mimic that facade and stand in as a reasonable substitute.

The notion of autonomous services also impacts the way failures are handled. Objects are deployed to run in the same execution context as the consuming application. Service-oriented designs assume that this situation is the exception, not the rule. For that reason, services expect that the consuming application can fail without notice and often without any notification. To maintain system integrity, service-oriented designs use a variety of techniques to deal with partial failure modes. Techniques such as transactions, durable queues, and redundant deployment and failover are quite common in a service-oriented system.

Because many services are deployed to function over public networks (such as the Internet), service-oriented development assumes not only that incoming message data may be malformed but also that it may have been transmitted for malicious purposes. Service-oriented architectures protect themselves by placing the burden of proof on all message senders by requiring applications to prove that all required rights and privileges have been granted. Consistent with the notion of service autonomy, service-oriented architectures invariably rely on administratively managed trust relationships in order to avoid per-service authentication mechanisms common in classic Web applications.

Again, I'd like to highlight how general these ideas are. Once lifted out of the context of Windows Communication Foundation, all of this applies more broadly.

Perhaps a few details now seem dated, but in general I find that this description holds up well.

Wildlife #

It's striking that someone in 2004 observed that big, complex, coordinated releases are impractical. Even so, it doesn't seem as though adopting a network-based technology and architecture in itself solves that problem. I wrote about that in 2012, and I've seen Adam Ralph make a similar observation. Many organizations inadvertently create distributed monoliths. I think that this often stems from a failure of heeding the tenet that services are autonomous.

I've experienced the following more than once. A team of developers rely on a service. As they take on a new feature, they realize that the way things are currently modelled prevents them from moving forward. Typical examples include mismatched cardinalities. For example, a customer record has a single active address, but the new feature requires that customers may have multiple active addresses. It could be that a customer has a permanent address, but also a summerhouse.

It is, however, the other service that defines how customer addresses are modelled, so the development team contacts the service team to discuss a breaking change. The service team agrees to the breaking change, but this means that the service and the relying client team now have to coordinate when they deploy the new versions of their software. The service is no longer autonomous.

I've already discussed this kind of problem in a previous article, and as Don Box also implied, this discussion is related to the question of versioning, which we'll get back to when covering the fourth tenet.

Transactions #

It may be worthwhile to comment on this sentence:

Techniques such as transactions, durable queues, and redundant deployment and failover are quite common in a service-oriented system.

Indeed, but particularly regarding database transactions, a service may use them internally (typically leveraging a database engine like SQL Server, Oracle, PostgreSQL, etc.), but not across services. Around the time Don Box wrote the original MSDN Magazine article an extension to SOAP colloquially known as WS-Death Star was in the works, and it included WS Transaction.

I don't know whether Don Box had something like this in mind when he wrote the word transaction, but in my experience, you don't want to go there. If you need to, you can make use of database transactions to keep your own service ACID-consistent, but don't presume that this is possible with multiple autonomous services.

As always, even if a catchphrase such as services are autonomous sounds good, it's always illuminating to understand that there are trade-offs involved - and what they are. Here, a major trade-off is that you need to think about error-handling in a different way. If you don't already know how to address such concerns, look up lock-free transactions and eventual consistency. As Don Box also mentioned, durable queues are often part of such a solution, as is idempotence.

Validation #

From this discussion follows that an autonomous service should, ideally, exist independently of the software ecosystem in which it exists. While an individual service can't impose its will on its surroundings, it can, and should, behave in a consistent and correct manner.

This does include deliberate consistency for the service itself. An autonomous service may make use of ACID or eventual consistency as the service owner deems appropriate.

It should also treat all input as suspect, until proven otherwise. Input validation is an important part of service design. It is my belief that validation is a solved problem, but that doesn't mean that you don't have to put in the work. You should consider correctness, versioning, as well as Postel's law.

Security #

A similar observation relates to security. Some services (particularly read-only services) may allow for anonymous access, but if a service needs to authenticate or authorize requests, consider how this is done in an autonomous manner. Looking up account information in a centralized database isn't the autonomous way. If a service does that, it now relies on the account database, and is no longer autonomous.

Instead, rely on claims-based identity. In my experience, OAuth with JWT is usually fine.

If your service needs to know something about the user that only an external source can tell it, don't look it up in an external system. Instead, demand that it's included in the JWT as a claim. Do you need to validate the age of the user? Require a date-of-birth or age claim. Do you need to know if the request is made on behalf of a system administrator? Demand a list of role claims.

Conclusion #

The second of Don Box's four tenets of SOA state that services should be autonomous. At first glance, you may think that all this means is that a service shouldn't share its database with another service. That is, however, a minimum bar. You need to consider how a service exists in an environment that it doesn't control. Again, the wildlife metaphor seems apt. Particularly if your service is exposed to the internet, it lives in a hostile environment.

Not only should you consider all input belligerent, you must also take into account that friendly systems may disappear or change. Your service exists by itself, supported by itself, relying on itself. If you need to coordinate work with other service owners, that's a strong hint that your service isn't, after all, autonomous.

Next: Services share schema and contract, not class.


Extracting data from a small CSV file with Python

Monday, 18 March 2024 08:36:00 UTC

My inept adventures with a dynamically typed language.

This article is the third in a small series about ad-hoc programming in two languages. In the previous article you saw how I originally solved a small data extraction and analysis problem with Haskell, even though it was strongly implied that Python was the language for the job.

Months after having solved the problem I'd learned a bit more Python, so I decided to return to it and do it again in Python as an exercise. In this article, I'll briefly describe what I did.

Reading CSV data #

When writing Python, I feel the way I suppose a script kiddie might feel. I cobble together code based on various examples I've seen somewhere else, without a full or deep understanding of what I'm doing. There's more than a hint of programming by coincidence, I'm afraid. One thing I've picked up along the way is that I can use pandas to read a CSV file:

data = pd.read_csv('survey_data.csv', header=None)
grades = data.iloc[:, 2]
experiences = data.iloc[:, 3]

In order for this to work, I needed to import pandas. Ultimately, my imports looked like this:

import pandas as pd
from collections import Counter
from itertools import combinations, combinations_with_replacement
import matplotlib.pyplot as plt

In other Python code that I've written, I've been a heavy user of NumPy, and while I several times added it to my imports, I never needed it for this task. That was a bit surprising, but I've only done Python programming for a year, and I still don't have a good feel for the ecosystem.

The above code snippet also demonstrates how easy it is to slice a dataframe into columns: grades contains all the values in the (zero-indexed) second column, and experiences likewise the third column.

Sum of grades #

All the trouble I had with binomial choice without replacement that I had with my Haskell code is handled with combinations, which happily handles duplicate values:

>>> list(combinations('foo', 2))
[('f', 'o'), ('f', 'o'), ('o', 'o')]

Notice that combinations doesn't list ('o', 'f'), since (apparently) it doesn't consider ordering important. That's more in line with the binomial coefficient, whereas my Haskell code considers a tuple like ('f', 'o') to be distinct from ('o', 'f'). This is completely consistent with how Haskell works, but means that all the counts I arrived at with Haskell are double what they are in this article. Ultimately, 6/1406 is equal to 3/703, so the probabilities are the same. I'll try to call out this factor-of-two difference whenever it occurs.

A Counter object counts the number of occurrences of each value, so reading, picking combinations without replacement and adding them together is just two lines of code, and one more to print them:

sumOfGrades = Counter(map(sum, combinations(grades, 2)))
sumOfGrades = sorted(sumOfGrades.items(), key=lambda item: item[0])
print(f'Sums of grades: {sumOfGrades}')

The output is:

Sums of grades: [(0, 3), (2, 51), (4, 157), (6, 119), (7, 24), (8, 21), (9, 136), (10, 3),
                 (11, 56), (12, 23), (14, 69), (16, 14), (17, 8), (19, 16), (22, 2), (24, 1)]

(Formatting courtesy of yours truly.)

As already mentioned, these values are off by a factor two compared to the previous Haskell code, but since I'll ultimately be dealing in ratios, it doesn't matter. What this output indicates is that the sum 0 occurs three times, the sum 2 appears 51 times, and so on.

This is where I, in my Haskell code, dropped down to a few ephemeral REPL-based queries that enabled me to collect enough information to paste into Excel in order to produce a figure. In Python, however, I have Matplotlib, which means that I can create the desired plots entirely in code. It does require that I write a bit more code, though.

First, I need to calculate the range of the Probability Mass Function (PMF), since there are values that are possible, but not represented in the above data set. To calculate all possible values in the PMF's range, I use combinations_with_replacement against the Danish grading scale.

grade_scale = [-3, 0, 2, 4, 7, 10, 12]
sumOfGradesRange = set(map(sum, combinations_with_replacement(grade_scale, 2)))
sumOfGradesRange = sorted(sumOfGradesRange)
print(f'Range of sums of grades: {sumOfGradesRange}')

The output is this:

Range of sums of grades: [-6, -3, -1, 0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 17, 19, 20, 22, 24]

Next, I create a dictionary of all possible grades, initializing all entries to zero, but then updating that dictionary with the observed values, where they are present:

probs = dict.fromkeys(sumOfGradesRange, 0)
probs.update(dict(sumOfGrades))

Finally, I recompute the dictionary entries to probabilities.

total = sum(x[1] for x in sumOfGrades)
for k, v in probs.items():
    probs[k] = v / total

Now I have all the data needed to plot the desired bar char:

plt.bar(probs.keys(), probs.values())
plt.xlabel('Sum')
plt.ylabel('Probability')
plt.show()

The result looks like this:

Bar chart of the sum-of-grades PMF.

While I'm already on line 34 in my Python file, with one more question to answer, I've written proper code in order to produce data that I only wrote ephemeral queries for in Haskell.

Difference of experiences #

The next question is almost a repetition of the the first one, and I've addressed it by copying and pasting. After all, it's only duplication, not triplication, so I can always invoke the Rule of Three. Furthermore, this is a one-off script that I don't expect to have to maintain in the future, so copy-and-paste, here we go:

diffOfExperiances = \
    Counter(map(lambda x: abs(x[0] - x[1]), combinations(experiences, 2)))
diffOfExperiances = sorted(diffOfExperiances.items(), key=lambda item: item[0])
print(f'Differences of experiences: {diffOfExperiances}')
 
experience_scale = list(range(1, 8))
diffOfExperiancesRange = set(\
    map(lambda x: abs(x[0] - x[1]),\
    combinations_with_replacement(experience_scale, 2)))
diffOfExperiancesRange = sorted(diffOfExperiancesRange)
 
probs = dict.fromkeys(diffOfExperiancesRange, 0)
probs.update(dict(diffOfExperiances))
total = sum(x[1] for x in diffOfExperiances)
for k, v in probs.items():
    probs[k] = v / total
 
# Plot the histogram of differences of experiences
plt.bar(probs.keys(), probs.values())
plt.xlabel('Difference')
plt.ylabel('Probability')
plt.show()

The bar chart has the same style as before, but obviously displays different data. See the bar chart in the previous article for the Excel-based rendition of that data.

Conclusion #

The Python code runs to 61 lines of code, compared with the 34 lines of Haskell code. The Python code, however, is much more complete than the Haskell code, since it also contains the code that computes the range of each PMF, as well as code that produces the figures.

Like the Haskell code, it took me a couple of hours to produce this, so I can't say that I feel much more productive in Python than in Haskell. On the other hand, I also acknowledge that I have less experience writing Python code. If I had to do a lot of ad-hoc data crunching like this, I can see how Python is useful.


Boundaries are explicit

Monday, 11 March 2024 08:03:00 UTC

A reading of the first Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the first tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Boundaries are explicit #

This tenet was the one I struggled with the most. It took me a long time to come to grips with how to apply it, but I'll get back to that in a moment. First, here's what the article said:

A service-oriented application often consists of services that are spread over large geographical distances, multiple trust authorities, and distinct execution environments. The cost of traversing these various boundaries is nontrivial in terms of complexity and performance. Service-oriented designs acknowledge these costs by putting a premium on boundary crossings. Because each cross-boundary communication is potentially costly, service-orientation is based on a model of explicit message passing rather than implicit method invocation. Compared to distributed objects, the service-oriented model views cross-service method invocation as a private implementation technique, not as a primitive construct—the fact that a given interaction may be implemented as a method call is a private implementation detail that is not visible outside the service boundary.

Though service-orientation does not impose the RPC-style notion of a network-wide call stack, it can support a strong notion of causality. It is common for messages to explicitly indicate which chain(s) of messages a particular message belongs to. This indication is useful for message correlation and for implementing several common concurrency models.

The notion that boundaries are explicit applies not only to inter-service communication but also to inter-developer communication. Even in scenarios in which all services are deployed in a single location, it is commonplace for the developers of each service to be spread across geographical, cultural, and/or organizational boundaries. Each of these boundaries increases the cost of communication between developers. Service orientation adapts to this model by reducing the number and complexity of abstractions that must be shared across service boundaries. By keeping the surface area of a service as small as possible, the interaction and communication between development organizations is reduced. One theme that is consistent in service-oriented designs is that simplicity and generality aren't a luxury but rather a critical survival skill.

Notice that there's nothing here about Windows Communication Framework (WCF), or any other specific technology. This is common to all four tenets, and one of the reasons that I think they deserve to be lifted out of their original context and put on display as the general ideas that they are.

I'm getting the vibe that the above description was written under the impression of the disenchantment with distributed objects that was setting in around that time. The year before, Martin Fowler had formulated his

"First Law of Distributed Object Design: Don't distribute your objects!"

The way that I read the tenet then, and the way I still read it today, is that in contrast to distributed objects, you should treat any service invocation as a noticeable operation, "putting a premium on boundary crossings", somehow distinct from normal code.

Perhaps I read to much into that, because WCF immediately proceeded to convert any SOAP service into a lot of auto-generated C# code that would then enable you to invoke operations on a remote service using (you guessed it) a method invocation.

Here a code snippet from the WCF documentation:

double value1 = 100.00D;
double value2 = 15.99D;
double result = client.Add(value1, value2);

What happens here is that client.Add creates and sends a SOAP message to a service, receives the response, unpacks it, and returns it as a double value. Even so, it looks just like any other method call. There's no "premium on boundary crossings" here.

So much for the principle that boundaries are explicit. They're not, and it bothered me twenty years ago, as it bothers me today.

I'll remind you what the problem is. When the boundary is not explicit, you may inadvertently write client code that makes network calls, and you may not be aware of it. This could noticeably slow down the application, particularly if you do it in a loop.

How do you make boundaries explicit? #

This problem isn't isolated to WCF or SOAP. Network calls are slow and unreliable. Perhaps you're connecting to a system on the other side of the Earth. Perhaps the system is unavailable. This is true regardless of protocol.

From the software architect's perspective, the tenet that boundaries are explicit is a really good idea. The clearer it is where in a code base network operations take place, the easier it is to troubleshot and maintain that code. This could make it easier to spot n + 1 problems, as well as give you opportunities to add logging, Circuit Breakers, etc.

How do you make boundaries explicit? Clearly, WCF failed to do so, despite the design goal.

Only Commands #

After having struggled with this question for years, I had an idea. This idea, however, doesn't really work, but I'll briefly cover it here. After all, if I can have that idea, other people may get it as well. It could save you some time if I explain why I believe that it doesn't address the problem.

The idea is to mandate that all network operations are Commands. In a C-like language, that would indicate a void method.

While it turns out that it ultimately doesn't work, this isn't just some arbitrary rule that I've invented. After all, if a method doesn't return anything, the boundary does, in a sense, become explicit. You can't just 'keep dotting', fluent-interface style.

channel.UpdateProduct(pc);

This gives you the opportunity to treat network operations as fire-and-forget operations. While you could still run such Commands in a tight loop, you could at least add them to a queue and move on. Such a queue could be be an in-process data structure, or a persistent queue. Your network card also holds a small queue of network packets.

This is essentially an asynchronous messaging architecture. It seems to correlate with Don Box's talk about messages.

Although this may seem as though it addresses some concerns about making boundaries explicit, an obvious question arises: How do you perform queries in this model?

You could keep such an architecture clean. You might, for example, implement a CQRS architecture where Commands create Events for which your application may subscribe. Such events could be handled by event handlers (other void methods) to update in-memory data as it changes.

Even so, there are practical limitations with such a model. What's likely to happen, instead, is the following.

Request-Reply #

It's hardly unlikely that you may need to perform some kind of Query against a remote system. If you can only communicate with services using void methods, such a scenario seems impossible.

It's not. There's even a pattern for that. Enterprise Integration Patterns call it Request-Reply. You create a Query message and give it a correlation ID, send it, and wait for the reply message to arrive at your own message handler. Client code might look like this:

var correlationId = Guid.NewGuid();
var query = new FavouriteSongsQuery(UserId: 123, correlationId);
channel.Send(query);
IReadOnlyCollection<Song> songs = [];
while (true)
{
    var response = subscriber.GetNextResponse(correlationId);
    if (response is null)
        Thread.Sleep(100);
    else
        songs = response;
        break;
}

While this works, it's awkward to use, so it doesn't take long before someone decides to wrap it in a helpful helper method:

public IReadOnlyCollection<Song> GetFavouriteSongs(int userId)
{
    var correlationId = Guid.NewGuid();
    var query = new FavouriteSongsQuery(userId, correlationId);
    channel.Send(query);
    IReadOnlyCollection<Song> songs = [];
    while (true)
    {
        var response = subscriber.GetNextResponse(correlationId);
        if (response is null)
            Thread.Sleep(100);
        else
            songs = response;
        break;
    }
 
    return songs;
}

This now enables you to write client code like this:

var songService = new SongService();
var songs = songService.GetFavouriteSongs(123);

We're back where we started. Boundaries are no longer explicit. Equivalent to how good names are only skin-deep, this attempt to make boundaries explicit can't resist programmers' natural tendency to make things easier for themselves.

If only there was some way to make an abstraction contagious...

Contagion #

Ideally, we'd like to make boundaries explicit in such a way that they can't be hidden away. After all,

"Abstraction is the elimination of the irrelevant and the amplification of the essential."

The existence of a boundary is essential, so while we might want to eliminate various other irrelevant details, this is a property that we should retain and surface in APIs. Even better, it'd be best if we could do it in such a way that it can't easily be swept under the rug, as shown above.

In Haskell, this is true for all input/output - not only network requests, but also file access, and other non-deterministic actions. In Haskell this is a 'wrapper' type called IO; for an explanation with C# examples, see The IO Container.

In a more idiomatic way, we can use task asynchronous programming as an IO surrogate. People often complain that async code is contagious. By that they mean that once a piece of code is asynchronous, the caller must also be asynchronous. This effect is transitive, and while this is often lamented as a problem, this is exactly what we need. Amplify the essential. Make boundaries explicit.

This doesn't mean that your entire code base has to be asynchronous. Only your network (and similar, non-deterministic) code should be asynchronous. Write your Domain Model and application code as pure functions, and compose them with the asynchronous code using standard combinators.

Conclusion #

The first of Don Box's four tenets of SOA is that boundaries should be explicit. WCF failed to deliver on that ideal, and it took me more than a decade to figure out how to square that circle.

Many languages now come with support for asynchronous programming, often utilizing some kind of generic Task or Async monad. Since such types are usually contagious, you can use them to make boundaries explicit.

Next: Services are autonomous.


The four tenets of SOA revisited

Monday, 04 March 2024 06:39:00 UTC

Twenty years after.

In the January 2004 issue of MSDN Magazine you can find an article by Don Box titled A Guide to Developing and Running Connected Systems with Indigo. Buried within the (now dated) discussion of the technology code-named Indigo (later Windows Communication Foundation) you can find a general discussion of four tenets of service-oriented architecture (SOA).

I remember that they resonated strongly with me back then, or that they at least prompted me to think explicitly about how to design software services. Some of these ideas have stayed with me ever since, while another has nagged at me for decades before I found a way to reconcile it with other principles of software design.

Now that it's twenty years ago that the MSDN article was published, I find that this is as good a time as ever to revisit it.

Legacy #

Why should we care about an old article about SOAP and SOA? Does anyone even use such things today, apart from legacy systems?

After all, we've moved on from SOAP to REST, gRPC, or GraphQL, and from SOA to microservices - that is, if we're not already swinging back towards monoliths.

Even so, I find much of what Don Box wrote twenty years ago surprisingly prescient. If you're interested in distributed software design involving some kind of remote API design, the four tenets of service-orientation apply beyond their original context. Some of the ideas, at least.

As is often the case in our field, various resources discuss the tenets without much regard to proper citation. Thus, I can't be sure that the MSDN article is where they first appeared, but I haven't found any earlier source.

My motivation for writing these article is partly to rescue the four tenets from obscurity, and partly to add some of my own commentary.

Much of the original article is about Indigo, and I'm going to skip that. On the other hand, I'm going to quote rather extensively from the article, in order to lift the more universal ideas out of their original context.

I'll do that in a series of articles, each covering one of the tenets.

Not all of the tenets have stood the test of time equally well, so I may not add an equal amount of commentary to all four.

Conclusion #

Ever since I first encountered the four tenets of SOA, they've stayed with me in one form or other. When helping teams to design services, even what we may today consider 'modern services', I've drawn on some of those ideas. There are insights of a general nature that are worth considering even today.

Next: Boundaries are explicit.


Page 1 of 73

"Our team wholeheartedly endorses Mark. His expert service provides tremendous value."
Hire me!