Services share schema and contract, not class

Monday, 15 April 2024 07:25:00 UTC

A reading of the third Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the third tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Services share schema and contract, not class #

Compared to the second tenet, the following description may at first seem more dated. Here's what the article said:

Object-oriented programming encourages developers to create new abstractions in the form of classes. Most modern development environments not only make it trivial to define new classes, modern IDEs do a better job guiding you through the development process as the number of classes increases (as features like IntelliSense® provide a more specific list of options for a given scenario).

Classes are convenient abstractions as they share both structure and behavior in a single named unit. Service-oriented development has no such construct. Rather, services interact based solely on schemas (for structures) and contracts (for behaviors). Every service advertises a contract that describes the structure of messages it can send and/or receive as well as some degree of ordering constraints over those messages. This strict separation between structure and behavior vastly simplifies deployment, as distributed object concepts such as marshal-by-value require a common execution and security environment which is in direct conflict with the goals of autonomous computing.

Services do not deal in types or classes per se; rather, only with machine readable and verifiable descriptions of the legal "ins and outs" the service supports. The emphasis on machine verifiability and validation is important given the inherently distributed nature of how a service-oriented application is developed and deployed. Unlike a traditional class library, a service must be exceedingly careful about validating the input data that arrives in each message. Basing the architecture on machine-validatible schema and contract gives both developers and infrastructure the hints they need to protect the integrity of an individual service as well as the overall application as a whole.

Because the contract and schema for a given service are visible over broad ranges of both space and time, service-orientation requires that contracts and schema remain stable over time. In the general case, it is impossible to propagate changes in schema and/or contract to all parties who have ever encountered a service. For that reason, the contract and schema used in service-oriented designs tend to have more flexibility than traditional object-oriented interfaces. It is common for services to use features such as XML element wildcards (like xsd:any) and optional SOAP header blocks to evolve a service in ways that do not break already deployed code.

With its explicit discussion of XML, SOAP, and XSD, this description may seem more stuck in 2004 than the first two tenets.

I'll cover the most obvious consequence first.

At the boundaries... #

In the MSDN article, the four tenets guide the design of Windows Communication Foundation (WCF) - a technology that in 2004 was under development, but not yet complete. While SOAP already existed as a platform-independent protocol, WCF was a .NET endeavour. Most developers using the Microsoft platform at the time were used to some sort of binary protocol, such as DCOM or .NET Remoting. Thus, it makes sense that Don Box was deliberately explicit that this was not how SOA (or WCF) was supposed to work.

In fact, since SOAP is platform-independent, you could write a web service in one language (say, Java) and consume it with a different language (e.g. C++). WCF was Microsoft's SOAP technology for .NET.

If you squint enough that you don't see the explicit references to XML or SOAP, however, the description still applies. Today, you may exchange data with JSON over REST, Protocol Buffers via gRPC, or something else, but it's still common to have a communications protocol that is independent of specific service implementations. A service may be written in Python, Haskell, C, or any other language that supports the wire format. As this little list suggests, the implementation language doesn't even have to be object-oriented.

In fact, at the boundaries, applications aren't object-oriented.

A formal interface definition language (IDL) may enable you to automate serialization and deserialization, but these are usually constrained to defining the shape of data and operations. Don Box talks about validation, and types don't replace validation - particularly if you allow xsd:any. That particular remark is quite at odds with the notion that a formal schema definition is necessary, or even desirable.

And indeed, today we often see JSON-based REST APIs that are more loosely defined. Even so, the absence of a machine-readable IDL doesn't entail the absence of a schema. As Alexis King wrote in relation to the static-versus-dynamic-types debate, dynamic type systems are not inherently more open. A similar argument can be made about schema. Regardless of whether or not a formal specification exists, a service always has a de-facto schema.
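To illustrate the point, consider a client consuming a loosely defined JSON-based REST API that publishes no IDL at all. Using Haskell and the aeson package as an example (the type and field names below are hypothetical), the client still has to commit to some shape when deserializing responses; whatever shape it's forced to assume is the service's de-facto schema.

{-# LANGUAGE DeriveGeneric #-}
module DeFactoSchema where
 
import Data.Aeson (FromJSON)
import GHC.Generics (Generic)
 
-- The shape a client has inferred from documentation, examples, or
-- observed responses. It exists whether or not the service publishes it.
data Reservation = Reservation
  { date     :: String
  , quantity :: Int
  , email    :: Maybe String  -- observed to be absent in some responses
  } deriving (Show, Generic)
 
instance FromJSON Reservation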

To be honest, though, when I try to interpret what this and the next tenet seem to imply, an IDL may have been all that Don Box had in mind. By schema he may only have meant XSD, and by contract, he may only have meant SOAP. More broadly speaking, this notion of contract may entail nothing more than a list of named operations, and references to schemas that indicate what input each operation takes, and what output it returns.

What I have in mind with the rest of this article may be quite an embellishment on that notion. In fact, my usual interpretation of the word contract may be more aligned with what Don Box calls policy. Thus, if you want a very literal reading of the four tenets, what comes next may fit better with the fourth tenet, that service compatibility is determined based on policy.

Regardless of whether you think that the following discussion belongs here, or in the next article, I'll assert that it's paramount to designing and developing useful and maintainable web services.

Encapsulation #

If we, once more, ignore the particulars related to SOAP and XML, we may rephrase the notion of schema and contract as follows. Schema describes the shape of data: Is it a number, a string, a tuple, or a combination of these? Is there only one, or several? Is the data composed from smaller such definitions? Does the composition describe the combination of several such definitions, or does it describe mutually exclusive alternatives?
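As a minimal Haskell sketch of those questions (type names mine, purely for illustration), a record combines several smaller definitions into one, a sum type describes mutually exclusive alternatives, a list captures 'one or several', and Maybe captures optionality:

-- Mutually exclusive alternatives.
data PaymentMethod = CreditCard | Invoice | Voucher
 
-- A combination of smaller definitions.
data OrderLine = OrderLine
  { productId :: Int
  , quantity  :: Int
  }
 
data Order = Order
  { orderLines :: [OrderLine]      -- one or several
  , comment    :: Maybe String     -- optional
  , payment    :: PaymentMethod
  }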

Compliant data may be encoded as objects or data structures in memory, or serialized to JSON, XML, CSV, byte streams, etc. We may choose to call a particular agglomeration of data a message, which we may pass from one system to another. The first tenet already used this metaphor.

You can't, however, just pass arbitrary valid messages from one system to another. Certain operations allow certain data, and may promise to return other kinds of messages. In addition to the schema, we also need to describe a contract.

What's a contract? If you consult Object-Oriented Software Construction, a contract stipulates invariants, pre- and postconditions for various operations.

Preconditions state what must be true before an operation can take place. This often puts the responsibility on the caller to ensure that the system is in an appropriate state, and that the message that it intends to pass to the other system is valid according to that state.

Postconditions, on the other hand, detail what the caller can expect in return. This includes guarantees about response messages, but may also describe the posterior state of the system.

Invariants, finally, outline what is always true about the system.

Although such a description of a contract originates from a book about object-oriented design, it's useful in other areas, too, such as functional programming. It strikes me that it applies equally well in the context of service-orientation.
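As a small, hypothetical Haskell sketch of that idea (all names are mine), a precondition can be pushed into the type of the input, while the postcondition is visible in the type of the output; the newtype maintains the invariant that a quantity, once constructed, is always positive:

import Numeric.Natural (Natural)
 
-- Invariant: a Quantity is always positive (as long as only the smart
-- constructor below is exported).
newtype Quantity = Quantity Natural deriving (Show)
 
-- Precondition: callers must supply a positive number; anything else is
-- rejected up front.
tryQuantity :: Integer -> Maybe Quantity
tryQuantity n
  | n > 0     = Just (Quantity (fromInteger n))
  | otherwise = Nothing
 
-- Postcondition: the caller is promised either one of the documented
-- errors or a confirmation ID - nothing else.
data ReservationError = SoldOut | Closed deriving (Show)
 
tryReserve :: Quantity -> Either ReservationError Int
tryReserve (Quantity q) =
  if q > 10 then Left SoldOut else Right 42  -- placeholder logic

The specifics don't matter; the point is that both sides of the contract show up in the types, where a machine can check them.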

The combination of contract and well-described message structure is, in other words, encapsulation. There's nothing wrong with that: It works. If you actually apply it as a design principle, that is.

Conclusion #

The third SOA tenet emphasizes that only data travels over service boundaries. In order to communicate effectively, services must agree on the shape of data, and which operations are legal when. While they exchange data, however, they don't share address space, or even internal representation.

One service may be written in F# and the client in Clojure. Even so, it's important that they have a shared understanding of what is possible, and what is not. The more explicit you, as a service owner, can be, the better.

Next: Service compatibility is determined based on policy.


Extracting curve coordinates from a bitmap

Monday, 08 April 2024 05:32:00 UTC

Another example of using Haskell as an ad-hoc scripting language.

This article is part of a short series titled Trying to fit the hype cycle. In the first article, I outlined what it is that I'm trying to do. In this article, I'll describe how I extract a set of x and y coordinates from this bitmap:

Gartner hype cycle.

(Actually, this is a scaled-down version of the image. The file I work with is a bit larger.)

As I already mentioned in the previous article, these days there are online tools for just about everything. Most likely, there's also an online tool that will take a bitmap like that and return a set of (x, y) coordinates.

Since I'm doing this for the programming exercise, I'm not interested in that. Rather, I'd like to write a little Haskell script to do it for me.

Module and imports #

Yes, I wrote Haskell script. As I've described before, with good type inference, a statically typed language can be as good for scripting as a dynamic one. Just as might be the case with, say, a Python script, you'll be iterating, trying things out until finally the script settles into its final form. What I present here is the result of my exercise. You should imagine that I made lots of mistakes underway, tried things that didn't work, commented out code and print statements, imported modules I eventually didn't need, etc. Just like I imagine you'd also do with a script in a dynamically typed language. At least, that's how I write Python, when I'm forced to do that.

In other words, the following is far from the result of perfect foresight, but rather the equilibrium into which the script settled.

I named the module HypeCoords, because the purpose of it is to extract the (x, y) coordinates from the above Gartner hype cycle image. These are the imports it turned out that I ultimately needed:

module HypeCoords where
 
import qualified Data.List.NonEmpty as NE
import Data.List.NonEmpty (NonEmpty((:|)))
import Codec.Picture
import Codec.Picture.Types

The Codec.Picture modules come from the JuicyPixels package. This is what enables me to read a .png file and extract the pixels.

Black and white #

If you look at the above bitmap, you may notice that it has some vertical lines in a lighter grey than the curve itself. My first task, then, is to get rid of those. The easiest way to do that is to convert the image to a black-and-white bitmap, with no grey scale.

Since this is a one-off exercise, I could easily do that with a bitmap editor, but on the other hand, I thought that this was a good first task to give myself. After all, I didn't know the JuicyPixels library at all, so this was an opportunity to start with a task just a notch simpler than the one that was my actual goal.

I thought that the easiest way to convert to a black-and-white image would be to turn all pixels white if they are lighter than some threshold, and black otherwise.

A PNG file has more information than I need, so I first converted the image to an 8-bit RGB bitmap. Even though the above image looks as though it's entirely grey scale, each pixel is actually composed of three colours. In order to compare a pixel with a threshold, I needed a single measure of how light or dark it is.

That turned out to be about as simple as it sounds: Just take the average of the three colours. Later, I'd need a function to compute the average for another reason, so I made it a reusable function:

average :: Integral a => NE.NonEmpty a -> a
average nel = sum nel `div` fromIntegral (NE.length nel)

It's a bit odd that the Haskell base library doesn't come with such a function (at least to my knowledge), but anyway, this one is specialized to do integer division. Notice that this function computes only non-exceptional averages, since it requires the input to be a NonEmpty list. No division-by-zero errors here, please!
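For example, with the module loaded into GHCi, a quick check (transcript mine) illustrates both the averaging and the truncation that integer division entails:

ghci> average (1 :| [2, 3, 4])
2
ghci> average (1 :| [2])
1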

Once I'd computed a pixel average and compared it to a threshold value, I wanted to replace it with either black or white. In order to make the code more readable I defined two named constants:

black :: PixelRGB8
black = PixelRGB8 minBound minBound minBound
white :: PixelRGB8
white = PixelRGB8 maxBound maxBound maxBound

With that in place, converting to black-and-white is only a few more lines of code:

toBW :: PixelRGB8 -> PixelRGB8
toBW (PixelRGB8 r g b) =
  let threshold = 192 :: Integer
      lum = average (fromIntegral r :| [fromIntegral g, fromIntegral b])
  in if lum <= threshold then black else white

I arrived at the threshold of 192 after a bit of trial-and-error. That's dark enough that the light vertical lines fall to the white side, while the real curve becomes black.

What remained was to glue the parts together to save the black-and-white file:

main :: IO ()
main = do
  readResult <- readImage "hype-cycle-cleaned.png"
  case readResult of
    Left msg -> putStrLn msg
    Right img -> do
      let bwImg = pixelMap toBW $ convertRGB8 img
      writePng "hype-cycle-bw.png" bwImg

The convertRGB8 function comes from JuicyPixels.

The hype-cycle-bw.png picture unsurprisingly looks like this:

Black-and-white Gartner hype cycle.

Ultimately, I didn't need the black-and-white bitmap file. I just wrote the script to create the file in order to be able to get some insights into what I was doing. Trust me, I made a lot of stupid mistakes along the way, and among other issues had some 'fun' with integer overflows.

Extracting image coordinates #

Now I had a general feel for how to work with the JuicyPixels library. It still required quite a bit of spelunking through the documentation before I found a useful API to extract all the pixels from a bitmap:

pixelCoordinates :: Pixel a => Image a -> [((Int, Int), a)]
pixelCoordinates = pixelFold (\acc x y px -> ((x,y),px):acc) []

While this is, after all, just a one-liner, I'm surprised that something like this doesn't come in the box. It returns a list of tuples, where the first element contains the pixel coordinates (another tuple), and the second element the pixel information (e.g. the RGB value).

One y value per x value #

There were a few more issues to be addressed. The black curve in the black-and-white bitmap is thicker than a single pixel. This means that for each x value, there will be several black pixels. In order to do linear regression, however, we need a single y value per x value.

One easy way to address that concern is to calculate the average y value for each x value. This may not always be the best choice, but as far as we can see in the above black-and-white image, it doesn't look as though there's any noise left in the picture. This means that we don't have to worry about outliers pulling the average value away from the curve. In other words, finding the average y value is an easy way to get what we need.

averageY :: Integral b => NonEmpty (a, b) -> (a, b)
averageY nel = (fst $ NE.head nel, average $ snd <$> nel)

The averageY function converts a NonEmpty list of tuples to a single tuple. Watch out! The input tuples are not the 'outer' tuples that pixelCoordinates returns, but rather a list of actual pixel coordinates. Each tuple is a pair of coordinates, but since the function never manipulates the x coordinate, the type of the first element is just unconstrained a. It can literally be anything, but will, in practice, be an integer.

The assumption is that the input is a small list of coordinates that all share the same x coordinate, such as (42, 99) :| [(42, 100), (42, 102)]. The function simply returns a single tuple that it creates on the fly. For the first element of the return tuple, it picks the head tuple from the input ((42, 99) in the example), and then that tuple's fst element (42). For the second element, the function averages all the snd elements (99, 100, and 102) to get 100 (integer division, you may recall):

ghci> averageY ((42, 99) :| [(42, 100), (42, 102)])
(42,100)

What remains is to glue together the building blocks.

Extracting curve coordinates #

A few more steps were required, but these I just composed in situ. I found no need to define them as individual functions.

The final composition looks like this:

main :: IO ()
main = do
  readResult <- readImage "hype-cycle-cleaned.png"
  case readResult of
    Left msg -> putStrLn msg
    Right img -> do
      let bwImg = pixelMap toBW $ convertRGB8 img
      let blackPixels =
            fst <$> filter ((black ==) . snd) (pixelCoordinates bwImg)
      let h = imageHeight bwImg
      let lineCoords = fmap (h -) . averageY <$> NE.groupAllWith fst blackPixels
      writeFile "coords.txt" $
        unlines $ (\(x,y) -> show x ++ "," ++ show y) <$> lineCoords

The first lines of code, up to and including let bwImg, are identical to what you've already seen.

We're only interested in the black pixels, so the main action uses the standard filter function to keep only those that are equal to the black constant value. Once the white pixels are gone, we no longer need the pixel information. The expression that defines the blackPixels value finally (remember, you read Haskell code from right to left) throws away the pixel information by only retaining the fst element. That's the tuple that contains the coordinates. You may want to refer back to the type signature of pixelCoordinates to see what I mean.

The blackPixels value has the type [(Int, Int)].

Two more things need to happen. One is to group the pixels together per x value so that we can use averageY. The other is that we want the coordinates as normal Cartesian coordinates, and right now, they're in screen coordinates.

When working with bitmaps, it's quite common that pixels are measured out from the top left corner, instead of from the bottom left corner. It's not difficult to flip the coordinates, but we need to know the height of the image:

let h = imageHeight bwImg

The imageHeight function is another JuicyPixels function.

Because I sometimes get carried away, I wrote the code in a 'nice', compact style that could arguably be more readable than it is. I accomplished both of the above remaining tasks with a single line of code:

let lineCoords = fmap (h -) . averageY <$> NE.groupAllWith fst blackPixels

This first groups the coordinates according to x value, so that all coordinates that share an x value are collected in a single NonEmpty list. This means that we can map all of those groups over averageY. Finally, the expression flips from screen coordinates to Cartesian coordinates by subtracting the y coordinate from the height h.

The final writeFile expression writes the coordinates to a text file as comma-separated values. The first ten lines of that file look like this:

9,13
10,13
11,13
12,14
13,15
14,15
15,16
16,17
17,17
18,18
...

Do these points plot the Gartner hype cycle?

Sanity checking by plotting the coordinates #

To check whether the coordinates look useful, we could plot them. If I wanted to use a few more hours, I could probably figure out how to do that with JuicyPixels as well, but on the other hand, I already know how to do that with Python:

import numpy
import matplotlib.pyplot as plt
 
data = numpy.loadtxt('coords.txt', delimiter=',')
x = data[:, 0]
t = data[:, 1]
plt.scatter(x, t, s=10, c='g')
plt.show()

That produces this plot:

Coordinates plotted with Python.

LGTM.

Conclusion #

In this article, you've seen how a single Haskell script can extract curve coordinates from a bitmap. The file is 41 lines all in all, including module declaration and white space. This article shows every single line in that file, apart from some blank lines.

I loaded the file into GHCi and ran the main action in order to produce the CSV file.

I did spend a few hours looking around in the JuicyPixels documentation before I'd identified the functions that I needed. I didn't keep track of time, but I guess that I spent more than three, but probably fewer than six, hours on the exercise all in all.

This was the successful part of the overall exercise. Now onto the fiasco.

Next: Fitting a polynomial to a set of points.


Trying to fit the hype cycle

Monday, 01 April 2024 07:14:00 UTC

An amateur tries his hand at linear modelling.

About a year ago, I was contemplating a conference talk I was going to give. Although I later abandoned the idea for other reasons, for a few days I was thinking about using the Gartner hype cycle for an animation. What I had in mind would require me to draw the curve in a way that would enable me to zoom in and out. Vector graphics would be much more useful for that job than a bitmap.

Gartner hype cycle.

Along the way, I considered if there was a function that would enable me to draw it on the fly. A few web searches revealed the Cross Validated question Is there a linear/mixture function that can fit the Gartner hype curve? So I wasn't the first person to have that idea, but at the time I found it, the question was effectively dismissed without a proper answer. Off topic, dontcha know?

A web search also seems to indicate the existence of a few research papers where people have undertaken this task, but there's not a lot about it. True, the Gartner hype cycle isn't a real function, but it sounds like a relevant exercise in statistics, if one's into that kind of thing.

Eventually, for my presentation, I went with another way to illustrate what I wanted to say, so for half a year, I didn't think more about it.

Linear regression? #

Recently, however, I was following a course in mathematical analysis of data, and among other things, I learned how to fit a line to data. Not just a straight line, but any degree of polynomial. So I thought that perhaps it'd be an interesting exercise to see if I could fit the hype cycle to some high-degree polynomial - even though I do realize that the hype cycle isn't a real function, and neither does it look like a straight polynomial function.

In order to fit a polynomial to the curve, I needed some data, so my first task was to convert an image to a series of data points.

I'm sure that there are online tools and apps that offer to do that for me, but the whole point of this was that I wanted to learn how to tackle problems like these. It's like doing katas. The journey is the goal.

This turned out to be an exercise consisting of two phases so distinct that I wrote them in two different languages.

As the articles will reveal, the first part went quite well, while the other was, essentially, a fiasco.

Conclusion #

There's not much point in finding a formula for the Gartner hype cycle, but the goal of this exercise was, for me, to tinker with some new techniques to see if I could learn from doing the exercise. And I did learn something.

In the next articles in this series, I'll go over some of the details.

Next: Extracting curve coordinates from a bitmap.


Services are autonomous

Monday, 25 March 2024 08:31:00 UTC

A reading of the second Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the second tenet. The quotes are from the MSDN Magazine article unless otherwise indicated.

Services are autonomous #

Compared with the first tenet, you'll see that Don Box had more to say about this one. I, conversely, have less to add. First, here's what the article said:

Service-orientation mirrors the real world in that it does not assume the presence of an omniscient or omnipotent oracle that has awareness and control over all parts of a running system. This notion of service autonomy appears in several facets of development, the most obvious place being the area of deployment and versioning.

Object-oriented programs tend to be deployed as a unit. Despite the Herculean efforts made in the 1990s to enable classes to be independently deployed, the discipline required to enable object-oriented interaction with a component proved to be impractical for most development organizations. When coupled with the complexities of versioning object-oriented interfaces, many organizations have become extremely conservative in how they roll out object-oriented code. The popularity of the XCOPY deployment and private assemblies capabilities of the .NET Framework is indicative of this trend.

Service-oriented development departs from object-orientation by assuming that atomic deployment of an application is the exception, not the rule. While individual services are almost always deployed atomically, the aggregate deployment state of the overall system/application rarely stands still. It is common for an individual service to be deployed long before any consuming applications are even developed, let alone deployed into the wild. Amazon.com is one example of this build-it-and-they-will-come philosophy. There was no way the developers at Amazon could have known the multitude of ways their service would be used to build interesting and novel applications.

It is common for the topology of a service-oriented application to evolve over time, sometimes without direct intervention from an administrator or developer. The degree to which new services may be introduced into a service-oriented system depends on both the complexity of the service interaction and the ubiquity of services that interact in a common way. Service-orientation encourages a model that increases ubiquity by reducing the complexity of service interactions. As service-specific assumptions leak into the public facade of a service, fewer services can reasonably mimic that facade and stand in as a reasonable substitute.

The notion of autonomous services also impacts the way failures are handled. Objects are deployed to run in the same execution context as the consuming application. Service-oriented designs assume that this situation is the exception, not the rule. For that reason, services expect that the consuming application can fail without notice and often without any notification. To maintain system integrity, service-oriented designs use a variety of techniques to deal with partial failure modes. Techniques such as transactions, durable queues, and redundant deployment and failover are quite common in a service-oriented system.

Because many services are deployed to function over public networks (such as the Internet), service-oriented development assumes not only that incoming message data may be malformed but also that it may have been transmitted for malicious purposes. Service-oriented architectures protect themselves by placing the burden of proof on all message senders by requiring applications to prove that all required rights and privileges have been granted. Consistent with the notion of service autonomy, service-oriented architectures invariably rely on administratively managed trust relationships in order to avoid per-service authentication mechanisms common in classic Web applications.

Again, I'd like to highlight how general these ideas are. Once lifted out of the context of Windows Communication Foundation, all of this applies more broadly.

Perhaps a few details now seem dated, but in general I find that this description holds up well.

Wildlife #

It's striking that someone in 2004 observed that big, complex, coordinated releases are impractical. Even so, it doesn't seem as though adopting a network-based technology and architecture in itself solves that problem. I wrote about that in 2012, and I've seen Adam Ralph make a similar observation. Many organizations inadvertently create distributed monoliths. I think that this often stems from a failure to heed the tenet that services are autonomous.

I've experienced the following more than once. A team of developers rely on a service. As they take on a new feature, they realize that the way things are currently modelled prevents them from moving forward. Typical examples include mismatched cardinalities. For example, a customer record has a single active address, but the new feature requires that customers may have multiple active addresses. It could be that a customer has a permanent address, but also a summerhouse.

It is, however, the other service that defines how customer addresses are modelled, so the development team contacts the service team to discuss a breaking change. The service team agrees to the breaking change, but this means that the service and the relying client team now have to coordinate when they deploy the new versions of their software. The service is no longer autonomous.

I've already discussed this kind of problem in a previous article, and as Don Box also implied, this discussion is related to the question of versioning, which we'll get back to when covering the fourth tenet.

Transactions #

It may be worthwhile to comment on this sentence:

Techniques such as transactions, durable queues, and redundant deployment and failover are quite common in a service-oriented system.

Indeed, but particularly regarding database transactions, a service may use them internally (typically leveraging a database engine like SQL Server, Oracle, PostgreSQL, etc.), but not across services. Around the time Don Box wrote the original MSDN Magazine article, an extension to SOAP colloquially known as WS-Death Star was in the works, and it included WS-Transaction.

I don't know whether Don Box had something like this in mind when he wrote the word transaction, but in my experience, you don't want to go there. If you need to, you can make use of database transactions to keep your own service ACID-consistent, but don't presume that this is possible with multiple autonomous services.

As always, even if a catchphrase such as services are autonomous sounds good, it's always illuminating to understand that there are trade-offs involved - and what they are. Here, a major trade-off is that you need to think about error-handling in a different way. If you don't already know how to address such concerns, look up lock-free transactions and eventual consistency. As Don Box also mentioned, durable queues are often part of such a solution, as is idempotence.

Validation #

From this discussion follows that an autonomous service should, ideally, be able to operate independently of the software ecosystem in which it exists. While an individual service can't impose its will on its surroundings, it can, and should, behave in a consistent and correct manner.

This does include deliberate consistency for the service itself. An autonomous service may make use of ACID or eventual consistency as the service owner deems appropriate.

It should also treat all input as suspect, until proven otherwise. Input validation is an important part of service design. It is my belief that validation is a solved problem, but that doesn't mean that you don't have to put in the work. You should consider correctness, versioning, as well as Postel's law.

Security #

A similar observation relates to security. Some services (particularly read-only services) may allow for anonymous access, but if a service needs to authenticate or authorize requests, consider how this is done in an autonomous manner. Looking up account information in a centralized database isn't the autonomous way. If a service does that, it now relies on the account database, and is no longer autonomous.

Instead, rely on claims-based identity. In my experience, OAuth with JWT is usually fine.

If your service needs to know something about the user that only an external source can tell it, don't look it up in an external system. Instead, demand that it's included in the JWT as a claim. Do you need to validate the age of the user? Require a date-of-birth or age claim. Do you need to know if the request is made on behalf of a system administrator? Demand a list of role claims.
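As a hypothetical sketch of what that might look like once the token has been validated and decoded (with whichever JWT library you prefer), the service can work against a plain record of claims instead of querying an account database. All names below are made up for illustration:

import Data.Time (Day)
 
-- Claims carried by the (already validated) JWT; no external lookup needed.
data Claims = Claims
  { subject     :: String
  , dateOfBirth :: Maybe Day
  , roles       :: [String]
  } deriving (Show)
 
isAdministrator :: Claims -> Bool
isAdministrator = elem "administrator" . roles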

Conclusion #

The second of Don Box's four tenets of SOA states that services should be autonomous. At first glance, you may think that all this means is that a service shouldn't share its database with another service. That is, however, a minimum bar. You need to consider how a service exists in an environment that it doesn't control. Again, the wildlife metaphor seems apt. Particularly if your service is exposed to the internet, it lives in a hostile environment.

Not only should you consider all input belligerent, you must also take into account that friendly systems may disappear or change. Your service exists by itself, supported by itself, relying on itself. If you need to coordinate work with other service owners, that's a strong hint that your service isn't, after all, autonomous.

Next: Services share schema and contract, not class.


Extracting data from a small CSV file with Python

Monday, 18 March 2024 08:36:00 UTC

My inept adventures with a dynamically typed language.

This article is the third in a small series about ad-hoc programming in two languages. In the previous article you saw how I originally solved a small data extraction and analysis problem with Haskell, even though it was strongly implied that Python was the language for the job.

Months after having solved the problem I'd learned a bit more Python, so I decided to return to it and do it again in Python as an exercise. In this article, I'll briefly describe what I did.

Reading CSV data #

When writing Python, I feel the way I suppose a script kiddie might feel. I cobble together code based on various examples I've seen somewhere else, without a full or deep understanding of what I'm doing. There's more than a hint of programming by coincidence, I'm afraid. One thing I've picked up along the way is that I can use pandas to read a CSV file:

data = pd.read_csv('survey_data.csv', header=None)
grades = data.iloc[:, 2]
experiences = data.iloc[:, 3]

In order for this to work, I needed to import pandas. Ultimately, my imports looked like this:

import pandas as pd
from collections import Counter
from itertools import combinations, combinations_with_replacement
import matplotlib.pyplot as plt

In other Python code that I've written, I've been a heavy user of NumPy, and while I added it to my imports several times, I never needed it for this task. That was a bit surprising, but I've only done Python programming for a year, and I still don't have a good feel for the ecosystem.

The above code snippet also demonstrates how easy it is to slice a dataframe into columns: grades contains all the values in the (zero-indexed) second column, and experiences likewise the third column.

Sum of grades #

All the trouble with binomial choice without replacement that I had with my Haskell code is handled by combinations, which happily handles duplicate values:

>>> list(combinations('foo', 2))
[('f', 'o'), ('f', 'o'), ('o', 'o')]

Notice that combinations doesn't list ('o', 'f'), since (apparently) it doesn't consider ordering important. That's more in line with the binomial coefficient, whereas my Haskell code considers a tuple like ('f', 'o') to be distinct from ('o', 'f'). This is completely consistent with how Haskell works, but means that all the counts I arrived at with Haskell are double what they are in this article. Ultimately, 6/1406 is equal to 3/703, so the probabilities are the same. I'll try to call out this factor-of-two difference whenever it occurs.

A Counter object counts the number of occurrences of each value, so reading, picking combinations without replacement and adding them together is just two lines of code, and one more to print them:

sumOfGrades = Counter(map(sum, combinations(grades, 2)))
sumOfGrades = sorted(sumOfGrades.items(), key=lambda item: item[0])
print(f'Sums of grades: {sumOfGrades}')

The output is:

Sums of grades: [(0, 3), (2, 51), (4, 157), (6, 119), (7, 24), (8, 21), (9, 136), (10, 3),
                 (11, 56), (12, 23), (14, 69), (16, 14), (17, 8), (19, 16), (22, 2), (24, 1)]

(Formatting courtesy of yours truly.)

As already mentioned, these values are off by a factor two compared to the previous Haskell code, but since I'll ultimately be dealing in ratios, it doesn't matter. What this output indicates is that the sum 0 occurs three times, the sum 2 appears 51 times, and so on.

This is where I, in my Haskell code, dropped down to a few ephemeral REPL-based queries that enabled me to collect enough information to paste into Excel in order to produce a figure. In Python, however, I have Matplotlib, which means that I can create the desired plots entirely in code. It does require that I write a bit more code, though.

First, I need to calculate the range of the Probability Mass Function (PMF), since there are values that are possible, but not represented in the above data set. To calculate all possible values in the PMF's range, I use combinations_with_replacement against the Danish grading scale.

grade_scale = [-3, 0, 2, 4, 7, 10, 12]
sumOfGradesRange = set(map(sum, combinations_with_replacement(grade_scale, 2)))
sumOfGradesRange = sorted(sumOfGradesRange)
print(f'Range of sums of grades: {sumOfGradesRange}')

The output is this:

Range of sums of grades: [-6, -3, -1, 0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 17, 19, 20, 22, 24]

Next, I create a dictionary keyed on all possible sums, initializing all entries to zero, but then updating that dictionary with the observed values, where they are present:

probs = dict.fromkeys(sumOfGradesRange, 0)
probs.update(dict(sumOfGrades))

Finally, I recompute the dictionary entries to probabilities.

total = sum(x[1] for x in sumOfGrades)
for k, v in probs.items():
    probs[k] = v / total

Now I have all the data needed to plot the desired bar chart:

plt.bar(probs.keys(), probs.values())
plt.xlabel('Sum')
plt.ylabel('Probability')
plt.show()

The result looks like this:

Bar chart of the sum-of-grades PMF.

While I'm already on line 34 in my Python file, with one more question to answer, I've written proper code in order to produce data that I only wrote ephemeral queries for in Haskell.

Difference of experiences #

The next question is almost a repetition of the first one, and I've addressed it by copying and pasting. After all, it's only duplication, not triplication, so I can always invoke the Rule of Three. Furthermore, this is a one-off script that I don't expect to have to maintain in the future, so copy-and-paste, here we go:

diffOfExperiances = \
    Counter(map(lambda x: abs(x[0] - x[1]), combinations(experiences, 2)))
diffOfExperiances = sorted(diffOfExperiances.items(), key=lambda item: item[0])
print(f'Differences of experiences: {diffOfExperiances}')
 
experience_scale = list(range(1, 8))
diffOfExperiancesRange = set(\
    map(lambda x: abs(x[0] - x[1]),\
    combinations_with_replacement(experience_scale, 2)))
diffOfExperiancesRange = sorted(diffOfExperiancesRange)
 
probs = dict.fromkeys(diffOfExperiancesRange, 0)
probs.update(dict(diffOfExperiances))
total = sum(x[1] for x in diffOfExperiances)
for k, v in probs.items():
    probs[k] = v / total
 
# Plot the histogram of differences of experiences
plt.bar(probs.keys(), probs.values())
plt.xlabel('Difference')
plt.ylabel('Probability')
plt.show()

The bar chart has the same style as before, but obviously displays different data. See the bar chart in the previous article for the Excel-based rendition of that data.

Conclusion #

The Python code runs to 61 lines of code, compared with the 34 lines of Haskell code. The Python code, however, is much more complete than the Haskell code, since it also contains the code that computes the range of each PMF, as well as code that produces the figures.

Like the Haskell code, it took me a couple of hours to produce this, so I can't say that I feel much more productive in Python than in Haskell. On the other hand, I also acknowledge that I have less experience writing Python code. If I had to do a lot of ad-hoc data crunching like this, I can see how Python is useful.


Boundaries are explicit

Monday, 11 March 2024 08:03:00 UTC

A reading of the first Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the first tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Boundaries are explicit #

This tenet was the one I struggled with the most. It took me a long time to come to grips with how to apply it, but I'll get back to that in a moment. First, here's what the article said:

A service-oriented application often consists of services that are spread over large geographical distances, multiple trust authorities, and distinct execution environments. The cost of traversing these various boundaries is nontrivial in terms of complexity and performance. Service-oriented designs acknowledge these costs by putting a premium on boundary crossings. Because each cross-boundary communication is potentially costly, service-orientation is based on a model of explicit message passing rather than implicit method invocation. Compared to distributed objects, the service-oriented model views cross-service method invocation as a private implementation technique, not as a primitive construct—the fact that a given interaction may be implemented as a method call is a private implementation detail that is not visible outside the service boundary.

Though service-orientation does not impose the RPC-style notion of a network-wide call stack, it can support a strong notion of causality. It is common for messages to explicitly indicate which chain(s) of messages a particular message belongs to. This indication is useful for message correlation and for implementing several common concurrency models.

The notion that boundaries are explicit applies not only to inter-service communication but also to inter-developer communication. Even in scenarios in which all services are deployed in a single location, it is commonplace for the developers of each service to be spread across geographical, cultural, and/or organizational boundaries. Each of these boundaries increases the cost of communication between developers. Service orientation adapts to this model by reducing the number and complexity of abstractions that must be shared across service boundaries. By keeping the surface area of a service as small as possible, the interaction and communication between development organizations is reduced. One theme that is consistent in service-oriented designs is that simplicity and generality aren't a luxury but rather a critical survival skill.

Notice that there's nothing here about Windows Communication Framework (WCF), or any other specific technology. This is common to all four tenets, and one of the reasons that I think they deserve to be lifted out of their original context and put on display as the general ideas that they are.

I'm getting the vibe that the above description was written under the influence of the disenchantment with distributed objects that was setting in around that time. The year before, Martin Fowler had formulated his

"First Law of Distributed Object Design: Don't distribute your objects!"

The way that I read the tenet then, and the way I still read it today, is that in contrast to distributed objects, you should treat any service invocation as a noticeable operation, "putting a premium on boundary crossings", somehow distinct from normal code.

Perhaps I read too much into that, because WCF immediately proceeded to convert any SOAP service into a lot of auto-generated C# code that would then enable you to invoke operations on a remote service using (you guessed it) a method invocation.

Here's a code snippet from the WCF documentation:

double value1 = 100.00D;
double value2 = 15.99D;
double result = client.Add(value1, value2);

What happens here is that client.Add creates and sends a SOAP message to a service, receives the response, unpacks it, and returns it as a double value. Even so, it looks just like any other method call. There's no "premium on boundary crossings" here.

So much for the principle that boundaries are explicit. They're not, and it bothered me twenty years ago, as it bothers me today.

I'll remind you what the problem is. When the boundary is not explicit, you may inadvertently write client code that makes network calls, and you may not be aware of it. This could noticeably slow down the application, particularly if you do it in a loop.

How do you make boundaries explicit? #

This problem isn't isolated to WCF or SOAP. Network calls are slow and unreliable. Perhaps you're connecting to a system on the other side of the Earth. Perhaps the system is unavailable. This is true regardless of protocol.

From the software architect's perspective, the tenet that boundaries are explicit is a really good idea. The clearer it is where in a code base network operations take place, the easier it is to troubleshoot and maintain that code. This could make it easier to spot n + 1 problems, as well as give you opportunities to add logging, Circuit Breakers, etc.

How do you make boundaries explicit? Clearly, WCF failed to do so, despite the design goal.

Only Commands #

After having struggled with this question for years, I had an idea. This idea, however, doesn't really work, but I'll briefly cover it here. After all, if I can have that idea, other people may get it as well. It could save you some time if I explain why I believe that it doesn't address the problem.

The idea is to mandate that all network operations are Commands. In a C-like language, that would indicate a void method.

While it turns out that it ultimately doesn't work, this isn't just some arbitrary rule that I've invented. After all, if a method doesn't return anything, the boundary does, in a sense, become explicit. You can't just 'keep dotting', fluent-interface style.

channel.UpdateProduct(pc);

This gives you the opportunity to treat network operations as fire-and-forget operations. While you could still run such Commands in a tight loop, you could at least add them to a queue and move on. Such a queue could be an in-process data structure, or a persistent queue. Your network card also holds a small queue of network packets.

This is essentially an asynchronous messaging architecture. It seems to correlate with Don Box's talk about messages.

Although this may seem as though it addresses some concerns about making boundaries explicit, an obvious question arises: How do you perform queries in this model?

You could keep such an architecture clean. You might, for example, implement a CQRS architecture where Commands create Events for which your application may subscribe. Such events could be handled by event handlers (other void methods) to update in-memory data as it changes.

Even so, there are practical limitations with such a model. What's likely to happen, instead, is the following.

Request-Reply #

It's quite likely that you'll need to perform some kind of Query against a remote system. If you can only communicate with services using void methods, such a scenario seems impossible.

It's not. There's even a pattern for that. Enterprise Integration Patterns calls it Request-Reply. You create a Query message and give it a correlation ID, send it, and wait for the reply message to arrive at your own message handler. Client code might look like this:

var correlationId = Guid.NewGuid();
var query = new FavouriteSongsQuery(UserId: 123, correlationId);
channel.Send(query);
IReadOnlyCollection<Song> songs = [];
while (true)
{
    var response = subscriber.GetNextResponse(correlationId);
    if (response is null)
        Thread.Sleep(100);
    else
    {
        songs = response;
        break;
    }
}

While this works, it's awkward to use, so it doesn't take long before someone decides to wrap it in a helpful helper method:

public IReadOnlyCollection<Song> GetFavouriteSongs(int userId)
{
    var correlationId = Guid.NewGuid();
    var query = new FavouriteSongsQuery(userId, correlationId);
    channel.Send(query);
    IReadOnlyCollection<Song> songs = [];
    while (true)
    {
        var response = subscriber.GetNextResponse(correlationId);
        if (response is null)
            Thread.Sleep(100);
        else
        {
            songs = response;
            break;
        }
    }
 
    return songs;
}

This now enables you to write client code like this:

var songService = new SongService();
var songs = songService.GetFavouriteSongs(123);

We're back where we started. Boundaries are no longer explicit. Just as good names are only skin-deep, this attempt to make boundaries explicit can't resist programmers' natural tendency to make things easier for themselves.

If only there was some way to make an abstraction contagious...

Contagion #

Ideally, we'd like to make boundaries explicit in such a way that they can't be hidden away. After all,

"Abstraction is the elimination of the irrelevant and the amplification of the essential."

The existence of a boundary is essential, so while we might want to eliminate various other irrelevant details, this is a property that we should retain and surface in APIs. Even better, it'd be best if we could do it in such a way that it can't easily be swept under the rug, as shown above.

In Haskell, this is true for all input/output - not only network requests, but also file access and other non-deterministic actions. There, the 'wrapper' type is called IO; for an explanation with C# examples, see The IO Container.

More idiomatically, in a language like C#, we can use task asynchronous programming as an IO surrogate. People often complain that async code is contagious. By that they mean that once a piece of code is asynchronous, the caller must also be asynchronous. This effect is transitive, and while this is often lamented as a problem, this is exactly what we need. Amplify the essential. Make boundaries explicit.

This doesn't mean that your entire code base has to be asynchronous. Only your network (and similar, non-deterministic) code should be asynchronous. Write your Domain Model and application code as pure functions, and compose them with the asynchronous code using standard combinators.
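To sketch what that composition might look like, here's a minimal Haskell example (all names hypothetical); IO plays the role here that Task plays in C#, and because it's contagious, no caller can pretend that getTopSongs is just another in-memory method call:

import Data.List (sortOn)
import Data.Ord (Down (..))
 
type UserId = Int
data Song = Song { title :: String, rating :: Int } deriving (Show)
 
-- Stand-in for a real network call; the IO in the type is the explicit boundary.
fetchFavouriteSongs :: UserId -> IO [Song]
fetchFavouriteSongs _ = return []
 
-- Pure Domain Model: no IO in the type, so no hidden boundary crossing.
pickTopSongs :: Int -> [Song] -> [Song]
pickTopSongs n = take n . sortOn (Down . rating)
 
-- Composition with a standard combinator (<$>): the boundary stays visible
-- in the return type of getTopSongs.
getTopSongs :: UserId -> IO [Song]
getTopSongs userId = pickTopSongs 3 <$> fetchFavouriteSongs userId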

Conclusion #

The first of Don Box's four tenets of SOA is that boundaries should be explicit. WCF failed to deliver on that ideal, and it took me more than a decade to figure out how to square that circle.

Many languages now come with support for asynchronous programming, often utilizing some kind of generic Task or Async monad. Since such types are usually contagious, you can use them to make boundaries explicit.

Next: Services are autonomous.


The four tenets of SOA revisited

Monday, 04 March 2024 06:39:00 UTC

Twenty years after.

In the January 2004 issue of MSDN Magazine you can find an article by Don Box titled A Guide to Developing and Running Connected Systems with Indigo. Buried within the (now dated) discussion of the technology code-named Indigo (later Windows Communication Foundation) you can find a general discussion of four tenets of service-oriented architecture (SOA).

I remember that they resonated strongly with me back then, or that they at least prompted me to think explicitly about how to design software services. Some of these ideas have stayed with me ever since, while another has nagged at me for decades before I found a way to reconcile it with other principles of software design.

Now that it's twenty years ago that the MSDN article was published, I find that this is as good a time as ever to revisit it.

Legacy #

Why should we care about an old article about SOAP and SOA? Does anyone even use such things today, apart from legacy systems?

After all, we've moved on from SOAP to REST, gRPC, or GraphQL, and from SOA to microservices - that is, if we're not already swinging back towards monoliths.

Even so, I find much of what Don Box wrote twenty years ago surprisingly prescient. If you're interested in distributed software design involving some kind of remote API design, the four tenets of service-orientation apply beyond their original context. Some of the ideas, at least.

As is often the case in our field, various resources discuss the tenets without much regard to proper citation. Thus, I can't be sure that the MSDN article is where they first appeared, but I haven't found any earlier source.

My motivation for writing these articles is partly to rescue the four tenets from obscurity, and partly to add some of my own commentary.

Much of the original article is about Indigo, and I'm going to skip that. On the other hand, I'm going to quote rather extensively from the article, in order to lift the more universal ideas out of their original context.

I'll do that in a series of articles, each covering one of the tenets.

Not all of the tenets have stood the test of time equally well, so I may not add an equal amount of commentary to all four.

Conclusion #

Ever since I first encountered the four tenets of SOA, they've stayed with me in one form or other. When helping teams to design services, even what we may today consider 'modern services', I've drawn on some of those ideas. There are insights of a general nature that are worth considering even today.

Next: Boundaries are explicit.


Testing exceptions

Monday, 26 February 2024 06:47:00 UTC

Some thoughts on testing past the happy path.

Test-driven development is a great development technique that enables you to get rapid feedback on design and implementation ideas. It enables you to rapidly move towards a working solution.

The emphasis on the happy path, however, can make you forget about all the things that may go wrong. Sooner or later, though, you may realize that the implementation can fail for a number of reasons, and, wanting to make things more robust, you may want to also subject your error-handling code to automated testing.

This doesn't have to be difficult, but can raise some interesting questions. In this article, I'll try to address a few.

Throwing exceptions with a dynamic mock #

In a question to another article, AmirB asks how to use a Fake Object to test exceptions. Specifically, since a Fake is a Test Double with a coherent contract, it'll be inappropriate to let it throw exceptions that relate to different implementations.

Egads, that was quite abstract, so let's consider a concrete example.

The original article that AmirB asked about used this interface as an example:

public interface IUserRepository
{
    User Read(int userId);
 
    void Create(int userId);
}

Granted, this interface is a little odd, but it should be good enough for the present purpose. As AmirB wrote:

"In scenarios where dynamic mocks (like Moq) are employed, we can mock a method to throw an exception, allowing us to test the expected behavior of the System Under Test (SUT)."

Specifically, this might look like this, using Moq:

[Fact]
public void CreateThrows()
{
    var td = new Mock<IUserRepository>();
    td.Setup(r => r.Read(1234)).Returns(new User { Id = 0 });
    td.Setup(r => r.Create(It.IsAny<int>())).Throws(MakeSqlException());
    var sut = new SomeController(td.Object);
 
    var actual = sut.GetUser(1234);
 
    Assert.NotNull(actual);
}

It's just an example, but the point is that since you can make a dynamic mock do anything that you can define in code, you can also use it to simulate database exceptions. This test pretends that the IUserRepository throws a SqlException from the Create method.

Perhaps the GetUser implementation now looks like this:

public User GetUser(int userId)
{
    var u = this.userRepository.Read(userId);
    if (u.Id == 0)
        try
        {
            this.userRepository.Create(userId);
        }
        catch (SqlException)
        {
        }
    return u;
}

If you find the example contrived, I don't blame you. The IUserRepository interface, the User class, and the GetUser method that orchestrates them are all degenerate in various ways. I originally created this little code example to discuss data flow verification, and I'm now stretching it beyond any reason. I hope that you can look past that. The point I'm making here is more general, and doesn't hinge on particulars.

Fake #

The article also suggests a FakeUserRepository that is small enough that I can repeat it here.

public sealed class FakeUserRepository : Collection<User>, IUserRepository
{
    public void Create(int userId)
    {
        Add(new User { Id = userId });
    }
 
    public User Read(int userId)
    {
        var user = this.SingleOrDefault(u => u.Id == userId);
        if (user == null)
            return new User { Id = 0 };
        return user;
    }
}

The question is how to use something like this when you want to test exceptions? It's possible that this little class may produce errors that I've failed to predict, but it certainly doesn't throw any SqlExceptions!

Should we inflate FakeUserRepository by somehow also giving it the capability to throw particular exceptions?

Throwing exceptions from Test Doubles #

I understand why AmirB asks that question, because it doesn't seem right. As a start, it would go against the Single Responsibility Principle. The FakeUserRepository would then have more than one reason to change: You'd have to change it if the IUserRepository interface changes, but you'd also have to change it if you wanted to simulate a different error situation.

Good coding practices apply to test code as well. Test code is code that you have to read and maintain, so all the good practices that keep production code in good shape also apply to test code. This may include the SOLID principles, unless you're of the mind that SOLID ought to be a thing of the past.

If you really must throw exceptions from a Test Double, perhaps a dynamic mock object as shown above is the best option. No-one says that if you use a Fake Object for most of your tests, you can't use a dynamic mock library for truly one-off test cases. Or perhaps a one-off Test Double that throws the desired exception.
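
Such a one-off Test Double could be as simple as the following sketch. It's not from the original article; the class name is made up, and it assumes access to the MakeSqlException helper used in the Moq-based test above.

// A hypothetical one-off Test Double that exists only to simulate a
// database failure when creating a user.
public sealed class SqlExceptionThrowingUserRepository : IUserRepository
{
    public User Read(int userId)
    {
        // Report a 'missing' user so that the SUT proceeds to call Create.
        return new User { Id = 0 };
    }
 
    public void Create(int userId)
    {
        // Simulate the database error that the test cares about.
        throw MakeSqlException();
    }
}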

I would, however, consider it a code smell if this happens too often. Not a test smell, but a code smell.

Is the exception part of the contract? #

You may ask yourself whether a particular exception type is part of an object's contract. As I always do, when I use the word contract, I refer to a set of invariants, pre-, and postconditions, taking a cue from Object-Oriented Software Construction. See also my video course Encapsulation and SOLID for more details.

You can imply many things about a contract when you have a static type system at your disposal, but there are always rules that you can't express that way. Parts of a contract are implicitly understood, or communicated in other ways. Code comments, docstrings, or similar, are good options.
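
In C#, for instance, you might spell out such implicit rules with XML documentation comments. This is only a sketch of what that could look like; the wording is mine, not part of the original example:

public interface IUserRepository
{
    /// <summary>Reads the user with the given ID.</summary>
    /// <returns>
    /// The user, or a User with Id = 0 if no such user exists.
    /// Never returns null.
    /// </returns>
    User Read(int userId);
 
    /// <summary>Creates a new user with the given ID.</summary>
    /// <remarks>
    /// May fail if the underlying data store rejects the operation; whether
    /// a specific exception type belongs in the contract is the question
    /// discussed in the rest of this article.
    /// </remarks>
    void Create(int userId);
}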

What may you infer from the IUserRepository interface? What should you not infer?

I'd expect the Read method to return a User object. This code example hails from 2013, before C# had nullable reference types. Around that time I'd begun using Maybe to signal that the return value might be missing. This is a convention, so the reader needs to be aware of it in order to correctly infer that part of the contract. Since the Read method does not return Maybe<User>, I might infer that a non-null User object is guaranteed; that's a postcondition.

These days, I'd also use asynchronous APIs to hint that I/O is involved, but again, the example is so old and simplified that this isn't the case here. Still, regardless of how this is communicated to the reader, if an interface (or base class) is intended for I/O, we may expect it to fail at times. In most languages, such errors manifest as exceptions.
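
For illustration only, a version of the interface designed with those conventions might look like this. The Maybe<T> type is assumed to come from a library or a hand-written implementation; it's not in the base class library.

public interface IUserRepository
{
    // Task hints that I/O (which may fail) is involved;
    // Maybe signals that the user may not exist.
    Task<Maybe<User>> Read(int userId);
 
    Task Create(int userId);
}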

At least two questions arise from such deliberations:

  • Which exception types may the methods throw?
  • Can you even handle such exceptions?

Should SqlException even be part of the contract? Isn't that an implementation detail?

The FakeUserRepository class neither uses SQL Server nor throws SqlExceptions. You may imagine other implementations that use a document database, or even just another relational database than SQL Server (Oracle, MySQL, PostgreSQL, etc.). Those wouldn't throw SqlExceptions, but perhaps other exception types.

According to the Dependency Inversion Principle,

"Abstractions should not depend upon details. Details should depend upon abstractions."

If we make SqlException part of the contract, an implementation detail becomes part of the contract. Not only that: With an implementation like the above GetUser method, which catches SqlException, we've also violated the Liskov Substitution Principle. If you injected another implementation, one that throws a different type of exception, the code would no longer work as intended.

Loosely coupled code shouldn't look like that.
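
To make that concrete, imagine injecting a hypothetical implementation that talks to some other data store and fails with another exception type; the class name and the choice of exception are mine:

// Hypothetical alternative implementation; not part of the original example.
public sealed class TimingOutUserRepository : IUserRepository
{
    public User Read(int userId)
    {
        return new User { Id = 0 };
    }
 
    public void Create(int userId)
    {
        // Fails with a different exception type than SqlException, so the
        // catch block in the above GetUser method doesn't handle it.
        throw new TimeoutException("The data store did not respond.");
    }
}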

Many specific exceptions are of a kind that you can't handle anyway. On the other hand, if you do decide to handle particular error scenarios, make it part of the contract, or, as Michael Feathers puts it, extend the domain.
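
Extending the domain could, for example, mean turning the error conditions you do care about into return values, so that they become part of the contract rather than an implementation detail. The following is only a sketch of that idea; neither the enum nor this variation of the interface appears in the original example.

// Hypothetical variation that extends the domain: the failure cases you've
// decided to handle are now explicit return values.
public enum CreateUserResult
{
    Success,
    DuplicateUserId,
    StoreUnavailable
}
 
public interface IUserRepository
{
    User Read(int userId);
 
    CreateUserResult Create(int userId);
}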

Integration testing #

How should we unit test specific exceptions? Mu, we shouldn't.

"Personally, I avoid using try-catch blocks in repositories or controllers and prefer handling exceptions in middleware (e.g., ErrorHandler). In such cases, I write separate unit tests for the middleware. Could this be a more fitting approach?"

That is, I think, an excellent approach to those exceptions that you've decided not to handle explicitly. Such middleware would typically log or otherwise notify operators that a problem has arisen. You could also write some general-purpose middleware that performs retries or implements the Circuit Breaker pattern, but reusable libraries that do that already exist. Consider using one.
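
A minimal sketch of such middleware, assuming ASP.NET Core, might look like the following. The class name and the choice of what to log and return are mine, and a real implementation would probably also write a problem-details body.

// A hypothetical catch-all error handler; a sketch, not production code.
public sealed class ErrorHandlerMiddleware
{
    private readonly RequestDelegate next;
    private readonly ILogger<ErrorHandlerMiddleware> logger;
 
    public ErrorHandlerMiddleware(
        RequestDelegate next,
        ILogger<ErrorHandlerMiddleware> logger)
    {
        this.next = next;
        this.logger = logger;
    }
 
    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await next(context);
        }
        catch (Exception e)
        {
            // Notify operators and translate the failure to a 500 response.
            logger.LogError(e, "Unhandled exception.");
            context.Response.StatusCode =
                StatusCodes.Status500InternalServerError;
        }
    }
}

You'd register it early in the pipeline with app.UseMiddleware<ErrorHandlerMiddleware>(), and you can unit test it in isolation by invoking InvokeAsync with a next delegate that throws.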

Still, you may have decided to implement a particular feature by leveraging a capability of a particular piece of technology, and the code you intend to write is complicated enough, or important enough, that you'd like to have good test coverage. How do you do that?

I'd suggest an integration test.

I don't have a good example lying around that involves throwing specific exceptions, but something similar may be of service. The example code base that accompanies my book Code That Fits in Your Head pretends to be an online restaurant reservation system. Two concurrent clients may compete for the last table on a particular date; a typical race condition.

There is more than one way to address such a concern. As implied in a previous article, you may decide to rearchitect the entire application to be able to handle such edge cases in a robust manner. For the purposes of the book's example code base, however, I considered a lock-free architecture out of scope. Instead, I had in mind dealing with that issue by taking advantage of .NET and SQL Server's support for lightweight transactions via a TransactionScope. While this is a handy solution, it's utterly dependent on the technology stack. It's a good example of an implementation detail that I'd rather not expose to a unit test.

Instead, I wrote a self-hosted integration test that runs against a real SQL Server instance (automatically deployed and configured on demand). It tests behaviour rather than implementation details:

[Fact]
public async Task NoOverbookingRace()
{
    var start = DateTimeOffset.UtcNow;
    var timeOut = TimeSpan.FromSeconds(30);
    var i = 0;
    while (DateTimeOffset.UtcNow - start < timeOut)
        await PostTwoConcurrentLiminalReservations(start.DateTime.AddDays(++i));
}
 
private static async Task PostTwoConcurrentLiminalReservations(DateTime date)
{
    date = date.Date.AddHours(18.5);
    using var service = new RestaurantService();
 
    var task1 = service.PostReservation(new ReservationDtoBuilder()
        .WithDate(date)
        .WithQuantity(10)
        .Build());
    var task2 = service.PostReservation(new ReservationDtoBuilder()
        .WithDate(date)
        .WithQuantity(10)
        .Build());
    var actual = await Task.WhenAll(task1, task2);
 
    Assert.Single(actual, msg => msg.IsSuccessStatusCode);
    Assert.Single(
        actual,
        msg => msg.StatusCode == HttpStatusCode.InternalServerError);
}

This test attempts to make two concurrent reservations for ten people. This is also the maximum capacity of the restaurant: It's impossible to seat twenty people. We'd like for one of the requests to win that race, while the server should reject the loser.

This test is only concerned with the behaviour that clients can observe, and since this code base contains hundreds of other tests that inspect HTTP response messages, this one only looks at the status codes.

The implementation handles the potential overbooking scenario like this:

private async Task<ActionResult> TryCreate(Restaurant restaurant, Reservation reservation)
{
    using var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);
 
    var reservations = await Repository.ReadReservations(restaurant.Id, reservation.At);
    var now = Clock.GetCurrentDateTime();
    if (!restaurant.MaitreD.WillAccept(now, reservations, reservation))
        return NoTables500InternalServerError();
 
    await Repository.Create(restaurant.Id, reservation);
 
    scope.Complete();
 
    return Reservation201Created(restaurant.Id, reservation);
}

Notice the TransactionScope.

I'm under the illusion that I could radically change this implementation detail without breaking the above test. Granted, unlike another experiment, this hypothesis isn't one I've put to the test.

Conclusion #

How does one automatically test error branches? Most unit testing frameworks come with APIs that make it easy to verify that specific exceptions were thrown, so that's not the hard part. If a particular exception is part of the System Under Test's contract, just test it like that.
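
In xUnit.net, for example, such a test is essentially a one-liner with Assert.Throws. Here's a deliberately contrived sketch, unrelated to the code base above:

[Fact]
public void DividingByZeroThrows()
{
    var divisor = 0;
 
    // Integer division by zero is documented to throw DivideByZeroException,
    // so that exception is part of the contract and can be asserted directly.
    Assert.Throws<DivideByZeroException>(() => { _ = 10 / divisor; });
}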

On the other hand, when it comes to objects composed with other objects, implementation details may easily leak through in the shape of specific exception types. I'd think twice before writing a test that verifies whether a piece of client code (such as the above SomeController) handles a particular exception type (such as SqlException).

If such a test is difficult to write because all you have is a Fake Object (e.g. FakeUserRepository), that's only good. The rapid feedback furnished by test-driven development strikes again. Listen to your tests.

You should probably not write that test at all, because there seems to be an issue with the planned structure of the code. Address that problem instead.


Extracting data from a small CSV file with Haskell

Monday, 19 February 2024 12:57:00 UTC

Statically typed languages are also good for ad-hoc scripting.

This article is part of a short series of articles that compares ad-hoc scripting in Haskell with solving the same problem in Python. The introductory article describes the problem to be solved, so here I'll jump straight into the Haskell code. In the next article I'll give a similar walkthrough of my Python script.

Getting started #

When working with Haskell for more than a few true one-off expressions that I can type into GHCi (the Haskell REPL), I usually create a module file. Since I'd been asked to crunch some data, and I wasn't feeling very imaginative that day, I just named the module (and the file) Crunch. After some iterative exploration of the problem, I also arrived at a set of imports:

module Crunch where
 
import Data.List (sort)
import qualified Data.List.NonEmpty as NE
import Data.List.Split
import Control.Applicative
import Control.Monad
import Data.Foldable

As we go along, you'll see where some of these fit in.

Reading the actual data file, however, can be done with just the Haskell Prelude:

inputLines = words <$> readFile "survey_data.csv"

Already now, it's possible to load the module in GHCi and start examining the data:

ghci> :l Crunch.hs
[1 of 1] Compiling Crunch           ( Crunch.hs, interpreted )
Ok, one module loaded.
ghci> length <$> inputLines
38

Looks good, but reading a text file is hardly the difficult part. The first obstacle, surprisingly, is to split comma-separated values into individual parts. For some reason that I've never understood, the Haskell base library doesn't even include something as basic as String.Split from .NET. I could probably hack together a function that does that, but on the other hand, it's available in the split package; that explains the Data.List.Split import. It's just a bit of a bother that one has to pull in another package only to do that.

Grades #

Extracting all the grades is now relatively easy. This function extracts and parses a grade from a single line:

grade :: Read a => String -> a
grade line = read $ splitOn "," line !! 2

It splits the line on commas, picks the third element (zero-indexed, of course, so element 2), and finally parses it.

One may experiment with it in GHCi to get an impression that it works:

ghci> fmap grade <$> inputLines :: IO [Int]
[2,2,12,10,4,12,2,7,2,2,2,7,2,7,2,4,2,7,4,7,0,4,0,7,2,2,2,2,2,2,4,4,2,7,4,0,7,2]

This lists all 38 expected grades found in the data file.

In the introduction article I spent some time explaining how languages with strong type inference don't need type declarations. This makes iterative development easier, because you can fiddle with an expression until it does what you'd like it to do. When you change an expression, often the inferred type changes as well, but there's no programmer overhead involved with that. The compiler figures that out for you.

Even so, the above grade function does have a type annotation. How does that gel with what I just wrote?

It doesn't, on the surface, but when I was fiddling with the code, there was no type annotation. The Haskell compiler is perfectly happy to infer the type of an expression like

grade line = read $ splitOn "," line !! 2

The human reader, however, is not so clever (I'm not, at least), so once a particular expression settles, and I'm fairly sure that it's not going to change further, I sometimes add the type annotation to aid myself.

When writing this, I was debating the didactics of showing the function with the type annotation, against showing it without it. Eventually I decided to include it, because it's more understandable that way. That decision, however, prompted this explanation.

Binomial choice #

The next thing I needed to do was to list all pairs from the data file. Usually, when I run into a problem related to combinations, I reach for applicative composition. For example, to list all possible combinations of the first three primes, I might do this:

ghci> liftA2 (,) [2,3,5] [2,3,5]
[(2,2),(2,3),(2,5),(3,2),(3,3),(3,5),(5,2),(5,3),(5,5)]

You may now protest that this is sampling with replacement, whereas the task is to pick two different rows from the data file. Usually, when I run into that requirement, I just remove the ones that pick the same value twice:

ghci> filter (uncurry (/=)) $ liftA2 (,) [2,3,5] [2,3,5]
[(2,3),(2,5),(3,2),(3,5),(5,2),(5,3)]

That works great as long as the values are unique, but what if that's not the case?

ghci> liftA2 (,) "foo" "foo"
[('f','f'),('f','o'),('f','o'),('o','f'),('o','o'),('o','o'),('o','f'),('o','o'),('o','o')]
ghci> filter (uncurry (/=)) $ liftA2 (,) "foo" "foo"
[('f','o'),('f','o'),('o','f'),('o','f')]

This removes too many values! We don't want the combinations where the first o is paired with itself, or when the second o is paired with itself, but we do want the combination where the first o is paired with the second, and vice versa.

This is relevant because the data set turns out to contain identical rows. Thus, I needed something that would deal with that issue.

Now, bear with me, because it's quite possible that what I did isn't the simplest solution to the problem. On the other hand, I'm reporting what I did, and how I used Haskell to solve a one-off problem. If you have a simpler solution, please leave a comment.

You often reach for the tool that you already know, so I used a variation of the above. Instead of combining values, I decided to combine row indices instead. This meant that I needed a function that would produce the indices for a particular list:

indices :: Foldable t => t a -> [Int]
indices f = [0 .. length f - 1]

Again, the type annotation came later. This just produces sequential numbers, starting from zero:

ghci> indices <$> inputLines
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
 21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]

Such a function hovers just around the Fairbairn threshold; some experienced Haskellers would probably just inline it.

Since row numbers (indices) are unique, the above approach to binomial choice works, so I also added a function for that:

choices :: Eq a => [a] -> [(a, a)]
choices = filter (uncurry (/=)) . join (liftA2 (,))

Combined with indices I can now enumerate all combinations of two rows in the data set:

ghci> choices . indices <$> inputLines
[(0,1),(0,2),(0,3),(0,4),(0,5),(0,6),(0,7),(0,8),(0,9),...

I'm only showing the first few results here, because in reality, there are 1406 such pairs.

Perhaps you think that all of this seems quite elaborate, but so far it's only four lines of code. The reason it looks like more is because I've gone to some lengths to explain what the code does.

Sum of grades #

The above combinations are pairs of indices, not values. What I need is to use each index to look up the row, from the row get the grade, and then sum the two grades. The first parts of that I can accomplish with the grade function, but I need to do it for every row, and for both elements of each pair.

While tuples are Functor instances, they only map over the second element, and that's not what I need:

ghci> rows = ["foo", "bar", "baz"]
ghci> fmap (rows!!) <$> [(0,1),(0,2)]
[(0,"bar"),(0,"baz")]

While this is just a simple example that maps over the two pairs (0,1) and (0,2), it illustrates the problem: It only finds the row for each tuple's second element, but I need it for both.

On the other hand, a type like (a, a) gives rise to a functor, and while a wrapper type like that is not readily available in the base library, defining one is a one-liner:

newtype Pair a = Pair { unPair :: (a, a) } deriving (Eq, Show, Functor)

This enables me to map over pairs in one go:

ghci> unPair <$> fmap (rows!!) <$> Pair <$> [(0,1),(0,2)]
[("foo","bar"),("foo","baz")]

This makes things a little easier. What remains is to use the grade function to look up the grade value for each row, then add the two numbers together, and finally count how many occurrences there are of each:

sumGrades ls =
  liftA2 (,) NE.head length <$> NE.group
    (sort (uncurry (+) . unPair . fmap (grade . (ls !!)) . Pair <$>
      choices (indices ls)))

You'll notice that this function doesn't have a type annotation, but we can ask GHCi if we're curious:

ghci> :t sumGrades
sumGrades :: (Ord a, Num a, Read a) => [String] -> [(a, Int)]

This enabled me to get a count of each sum of grades:

ghci> sumGrades <$> inputLines
[(0,6),(2,102),(4,314),(6,238),(7,48),(8,42),(9,272),(10,6),
 (11,112),(12,46),(14,138),(16,28),(17,16),(19,32),(22,4),(24,2)]

The way to read this is that the sum 0 occurs six times, 2 appears 102 times, etc.

There's one remaining task to accomplish before we can produce a PMF of the sum of grades: We need to enumerate the range, because, as it turns out, there are sums that are possible, but that don't appear in the data set. Can you spot which ones?

Using tools already covered, it's easy to enumerate all possible sums:

ghci> import Data.List
ghci> sort $ nub $ (uncurry (+)) <$> join (liftA2 (,)) [-3,0,2,4,7,10,12]
[-6,-3,-1,0,1,2,4,6,7,8,9,10,11,12,14,16,17,19,20,22,24]

The sums -6, -3, -1, and more, are possible, but don't appear in the data set. Thus, in the PMF for two randomly picked grades, the probability that the sum is -6 is 0. On the other hand, the probability that the sum is 0 is 6/1406 ~ 0.004267, and so on.

Difference of experience levels #

The other question posed in the assignment was to produce the PMF for the absolute difference between two randomly selected students' experience levels.

Answering that question follows the same mould as above. First, extract experience level from each data row, instead of the grade:

experience :: Read a => String -> a
experience line = read $ splitOn "," line !! 3

Since I was doing an ad-hoc script, I just copied the grade function and changed the index from 2 to 3. Enumerating the experience differences was also a close copy of sumGrades:

diffExp ls =
  liftA2 (,) NE.head length <$> NE.group
    (sort (abs . uncurry (-) . unPair . fmap (experience . (ls !!)) . Pair <$>
      choices (indices ls)))

Running it in the REPL produces some other numbers, to be interpreted the same way as above:

ghci> diffExp <$> inputLines
[(0,246),(1,472),(2,352),(3,224),(4,82),(5,24),(6,6)]

This means that the difference 0 occurs 246 times, 1 appears 472 times, and so on. From those numbers, it's fairly simple to set up the PMF.

Figures #

Another part of the assignment was to produce plots of both PMFs. I don't know how to produce figures with Haskell, and since the final results are just a handful of numbers each, I just copied them into a text editor to align them, and then pasted them into Excel to produce the figures there.

Here's the PMF for the differences:

Bar chart of the differences PMF.

I originally created the figure with Danish labels. I'm sure that you can guess what differens means, and sandsynlighed means probability.

Conclusion #

In this article you've seen the artefacts of an ad-hoc script to extract and analyze a small data set. While I've spent quite a few words to explain what's going on, the entire Crunch module is only 34 lines of code. Add to that a few ephemeral queries done directly in GHCi, but never saved to a file. It's been some months since I wrote the code, but as far as I recall, it took me a few hours all in all.

If you do stuff like this every day, you probably find that appalling, but data crunching isn't really my main thing.

Is it quicker to do it in Python? Not for me, it turns out. It also took me a couple of hours to repeat the exercise in Python.

Next: Extracting data from a small CSV file with Python.


Range as a functor

Monday, 12 February 2024 06:59:00 UTC

With examples in C#, F#, and Haskell.

This article is an instalment in a short series of articles on the Range kata. In the previous three articles you've seen the Range kata implemented in Haskell, in F#, and in C#.

The reason I engaged with this kata was that I find that it provides a credible example of how a pair of functors itself forms a functor. In this article, you'll see how that works out in three languages. If you don't care about one or two of those languages, just skip that section.

Haskell perspective #

If you've done any Haskell programming, you may be thinking that I have in mind the default Functor instances for tuples. As part of the base library, tuples (pairs, triples, quadruples, etc.) are already Functor instances. Specifically for pairs, we have this instance:

instance Functor ((,) a)

Those are not the functor instances I have in mind. To a degree, I find these default Functor instances unfortunate, or at least arbitrary. Let's briefly explore the above instance to see why that is.

Haskell is a notoriously terse language, but if we expand the above instance to (invalid) pseudocode, it says something like this:

instance Functor ((a,b) b)

What I'm trying to get across here is that the a type argument is fixed, and only the second type argument b can be mapped. Thus, you can map a (Bool, String) pair to a (Bool, Int) pair:

ghci> fmap length (True, "foo")
(True,3)

but the first element (Bool, in this example) is fixed, and you can't map that. To be clear, the first element can be any type, but once you've fixed it, you can't change it (within the constraints of the Functor API, mind):

ghci> fmap (replicate 3) (42, 'f')
(42,"fff")
ghci> fmap ($ 3) ("bar", (* 2))
("bar",6)

The reason I find these default instances arbitrary is that this isn't the only possible Functor instance. Pairs, in particular, are also Bifunctor instances, so you can easily map over the first element, instead of the second:

ghci> first show (42, 'f')
("42",'f')

Similarly, one can easily imagine a Functor instance for triples (three-tuples) that map the middle element. The default instance, however, maps the third (i.e. last) element only.

There are some hand-wavy rationalizations out there that argue that in Haskell, application and reduction is usually done from the right, so therefore it's most appropriate to map over the rightmost element of tuples. I admit that it at least argues from a position of consistency, and it does make it easier to remember, but from a didactic perspective I still find it a bit unfortunate. It suggests that a tuple functor only maps the last element.

What I had in mind for ranges, however, wasn't to map only the first or the last element. Neither did I wish to treat ranges as bifunctors. What I really wanted was the ability to project an entire range.

In my Haskell Range implementation, I'd simply treated ranges as tuples of Endpoint values, and although I didn't show that in the article, I ultimately declared Endpoint as a Functor instance:

data Endpoint a = Open a | Closed a deriving (Eq, Show, Functor)

This enables you to map a single Endpoint value:

ghci> fmap length $ Closed "foo"
Closed 3

That's just a single value, but the Range kata API operates with pairs of Endpoint values. For example, the contains function has this type:

contains :: (Foldable t, Ord a) => (Endpoint a, Endpoint a) -> t a -> Bool

Notice the (Endpoint a, Endpoint a) input type.

Is it possible to treat such a pair as a functor? Yes, indeed, just import Data.Functor.Product, which enables you to package two functor values in a single wrapper:

ghci> import Data.Functor.Product
ghci> Pair (Closed "foo") (Open "corge")
Pair (Closed "foo") (Open "corge")

Now, granted, the Pair data constructor doesn't wrap a tuple, but that's easily fixed:

ghci> uncurry Pair (Closed "foo", Open "corge")
Pair (Closed "foo") (Open "corge")

The resulting Pair value is a Functor instance, which means that you can project it:

ghci> fmap length $ uncurry Pair (Closed "foo", Open "corge")
Pair (Closed 3) (Open 5)

Now, granted, I find the Data.Functor.Product API a bit lacking in convenience. For instance, there's no getPair function to retrieve the underlying values; you'd have to use pattern matching for that.

In any case, my motivation for covering this ground wasn't to argue that Data.Functor.Product is all we need. The point was rather to observe that when you have two functors, you can combine them, and the combination is also a functor.

This is one of the many reasons I get so much value out of Haskell. Its abstraction level is so high that it substantiates relationships that may also exist in other code bases, written in other programming languages. Even if a language like F# or C# can't formally express some of those abstractions, you can still make use of them as 'design patterns' (for lack of a better term).

F# functor #

What we've learned from Haskell is that if we have two functors we can combine them into one. Specifically, I made Endpoint a Functor instance, and from that followed automatically that a Pair of those was also a Functor instance.

I can do the same in F#, starting with Endpoint. In F# I've unsurprisingly defined the type like this:

type Endpoint<'a> = Open of 'a | Closed of 'a

That's just a standard discriminated union. In order to make it a functor, you'll have to add a map function:

module Endpoint =
    let map f = function
        | Open   x -> Open   (f x)
        | Closed x -> Closed (f x)

The function alone, however, isn't enough to give rise to a functor. We must also convince ourselves that the map function obeys the functor laws. One way to do that is to write tests. While tests aren't proofs, we may still be sufficiently reassured by the tests that that's good enough for us. While I could, I'm not going to prove that Endpoint.map satisfies the functor laws. I will, later, do just that with the pair, but I'll leave this one as an exercise for the interested reader.

Since I was already using Hedgehog for property-based testing in my F# code, it was obvious to write properties for the functor laws as well.

[<Fact>]
let ``First functor law`` () = Property.check <| property {
    let genInt32 = Gen.int32 (Range.linearBounded ())
    let! expected = Gen.choice [Gen.map Open genInt32; Gen.map Closed genInt32]
 
    let actual = Endpoint.map id expected
 
    expected =! actual }

This property exercises the first functor law for integer endpoints. Recall that this law states that if you map a value with the identity function, nothing really happens.

The second functor law is more interesting.

[<Fact>]
let ``Second functor law`` () = Property.check <| property {
    let genInt32 = Gen.int32 (Range.linearBounded ())
    let! endpoint = Gen.choice [Gen.map Open genInt32; Gen.map Closed genInt32]
    let! f = Gen.item [id; ((+) 1); ((*) 2)]
    let! g = Gen.item [id; ((+) 1); ((*) 2)]
 
    let actual = Endpoint.map (f << g) endpoint
 
    Endpoint.map f (Endpoint.map g endpoint) =! actual }

This property again exercises the property for integer endpoints. Not only does the property pick a random integer and varies whether the Endpoint is Open or Closed, it also picks two random functions from a small list of functions: The identity function (again), a function that increments by one, and a function that doubles the input. These two functions, f and g, might then be the same, but might also be different from each other. Thus, the composition f << g might be id << id or ((+) 1) << ((+) 1), but might just as well be ((+) 1) << ((*) 2), or one of the other possible combinations.

The law states that the result should be the same regardless of whether you first compose the functions and then map them, or map them one after the other.

Which is the case.

A Range is defined like this:

type Range<'a> = { LowerBound : Endpoint<'a>; UpperBound : Endpoint<'a> }

This record type also gives rise to a functor:

module Range =
    let map f { LowerBound = lowerBound; UpperBound = upperBound } =
       { LowerBound = Endpoint.map f lowerBound
         UpperBound = Endpoint.map f upperBound }

This map function uses the projection f on both the lowerBound and the upperBound. It, too, obeys the functor laws:

[<Fact>]
let ``First functor law`` () = Property.check <| property {
    let genInt64 = Gen.int64 (Range.linearBounded ())
    let genEndpoint = Gen.choice [Gen.map Open genInt64; Gen.map Closed genInt64]
    let! expected = Gen.tuple genEndpoint |> Gen.map Range.ofEndpoints
 
    let actual = expected |> Ploeh.Katas.Range.map id
 
    expected =! actual }
 
[<Fact>]
let ``Second functor law`` () = Property.check <| property {
    let genInt16 = Gen.int16 (Range.linearBounded ())
    let genEndpoint = Gen.choice [Gen.map Open genInt16; Gen.map Closed genInt16]
    let! range = Gen.tuple genEndpoint |> Gen.map Range.ofEndpoints
    let! f = Gen.item [id; ((+) 1s); ((*) 2s)]
    let! g = Gen.item [id; ((+) 1s); ((*) 2s)]
 
    let actual = range |> Ploeh.Katas.Range.map (f << g)
 
    Ploeh.Katas.Range.map f (Ploeh.Katas.Range.map g range) =! actual }

These two Hedgehog properties are cast in the same mould as the Endpoint properties, only they create 64-bit and 16-bit ranges for variation's sake.

C# functor #

As I wrote about the Haskell result, it teaches us which abstractions are possible, even if we can't formalise them to the same degree in, say, C# as we can in Haskell. It should come as no surprise, then, that we can also make Range<T> a functor in C#.

In C# we idiomatically do that by giving a class a Select method. Again, we'll have to begin with Endpoint:

public Endpoint<TResult> Select<TResult>(Func<T, TResult> selector)
{
    return Match(
        whenClosed: x => Endpoint.Closed(selector(x)),
        whenOpen: x => Endpoint.Open(selector(x)));
}

Does that Select method obey the functor laws? Yes, as we can demonstrate (not prove) with a few properties:

[Fact]
public void FirstFunctorLaw()
{
    Gen.OneOf(
        Gen.Int.Select(Endpoint.Open),
        Gen.Int.Select(Endpoint.Closed))
    .Sample(expected =>
    {
        var actual = expected.Select(x => x);
 
        Assert.Equal(expected, actual);
    });
}
 
[Fact]
public void SecondFunctorLaw()
{
    (from endpoint in Gen.OneOf(
        Gen.Int.Select(Endpoint.Open),
        Gen.Int.Select(Endpoint.Closed))
     from f in Gen.OneOfConst<Func<int, int>>(x => x, x => x + 1, x => x * 2)
     from g in Gen.OneOfConst<Func<int, int>>(x => x, x => x + 1, x => x * 2)
     select (endpoint, f, g))
    .Sample(t =>
    {
        var actual = t.endpoint.Select(x => t.g(t.f(x)));
 
        Assert.Equal(
            t.endpoint.Select(t.f).Select(t.g),
            actual);
    });
}

These two tests follow the scheme laid out by the above F# properties, and they both pass.

The Range class gets the same treatment. First, a Select method:

public Range<TResult> Select<TResult>(Func<T, TResult> selector)
    where TResult : IComparable<TResult>
{
    return new Range<TResult>(min.Select(selector), max.Select(selector));
}

which, again, can be demonstrated with two properties that exercise the functor laws:

[Fact]
public void FirstFunctorLaw()
{
    var genEndpoint = Gen.OneOf(
        Gen.Int.Select(Endpoint.Closed),
        Gen.Int.Select(Endpoint.Open));
    genEndpoint.SelectMany(min => genEndpoint
        .Select(max => new Range<int>(min, max)))
    .Sample(sut =>
    {
        var actual = sut.Select(x => x);
 
        Assert.Equal(sut, actual);
    });
}
 
[Fact]
public void SecondFunctorLaw()
{
    var genEndpoint = Gen.OneOf(
        Gen.Int.Select(Endpoint.Closed),
        Gen.Int.Select(Endpoint.Open));
    (from min in genEndpoint
     from max in genEndpoint
     from f in Gen.OneOfConst<Func<int, int>>(x => x, x => x + 1, x => x * 2)
     from g in Gen.OneOfConst<Func<int, int>>(x => x, x => x + 1, x => x * 2)
     select (sut: new Range<int>(min, max), f, g))
    .Sample(t =>
    {
        var actual = t.sut.Select(x => t.g(t.f(x)));
 
        Assert.Equal(
            t.sut.Select(t.f).Select(t.g),
            actual);
    });
}

These tests also pass.

Laws #

Exercising a pair of properties can give us a good warm feeling that the data structures and functions defined above are proper functors. Sometimes, tests are all we have, but in this case we can do better. We can prove that the functor laws always hold.

The various above incarnations of a Range type are all product types, and the canonical form of a product type is a tuple (see e.g. Thinking with Types for a clear explanation of why that is). That's the reason I stuck with a tuple in my Haskell code.

Consider the fmap implementation for Pair:

fmap f (Pair x y) = Pair (fmap f x) (fmap f y)

We can use equational reasoning, and as always I'll use the notation that Bartosz Milewski uses. It's only natural to begin with the first functor law, using F and G as placeholders for two arbitrary Functor data constructors.

  fmap id (Pair (F x) (G y))
= { definition of fmap }
  Pair (fmap id (F x)) (fmap id (G y))
= { first functor law }
  Pair (F x) (G y)
= { definition of id }
  id (Pair (F x) (G y))

Keep in mind that in this notation, the equal signs are true equalities, going both ways. Thus, you can read this proof from the top to the bottom, or from the bottom to the top. The equality holds both ways, as should be the case for a true equality.

We can proceed in the same vein to prove the second functor law, being careful to distinguish between Functor instances (F and G) and functions (f and g):

  fmap (g . f) (Pair (F x) (G y))
= { definition of fmap }
  Pair (fmap (g . f) (F x)) (fmap (g . f) (G y))
= { second functor law }
  Pair ((fmap g . fmap f) (F x)) ((fmap g . fmap f) (G y))
= { definition of composition }
  Pair (fmap g (fmap f (F x))) (fmap g (fmap f (G y)))
= { definition of fmap }
  fmap g (Pair (fmap f (F x)) (fmap f (G y)))
= { definition of fmap }
  fmap g (fmap f (Pair (F x) (G y)))
= { definition of composition }
  (fmap g . fmap f) (Pair (F x) (G y))

Notice that both proofs make use of the functor laws. This may seem self-referential, but is rather recursive. When the proofs refer to the functor laws, they refer to the functors F and G, which are both assumed to be lawful.

This is how we know that the product of two lawful functors is itself a functor.

Negations #

During all of this, you may have thought: What happens if we project a range with a negation?

As a simple example, let's consider the range from -1 to 2:

ghci> uncurry Pair (Closed (-1), Closed 2)
Pair (Closed (-1)) (Closed 2)

We may draw this range on the number line like this:

The range from -1 to 2 drawn on the number line.

What happens if we map that range by multiplying with -1?

ghci> fmap negate $ uncurry Pair (Closed (-1), Closed 2)
Pair (Closed 1) (Closed (-2))

We get a range from 1 to -2!

Aha! you say, clearly that's wrong! We've just found a counterexample. After all, range isn't a functor.

Not so. The functor laws say nothing about the interpretation of projections (but I'll get back to that in a moment). Rather, they say something about composition, so let's consider an example that reaches a similar, seemingly wrong result:

ghci> fmap ((+1) . negate) $ uncurry Pair (Closed (-1), Closed 2)
Pair (Closed 2) (Closed (-1))

This is a range from 2 to -1, so just as problematic as before.

The second functor law states that the outcome should be the same if we map piecewise:

ghci> (fmap (+ 1) . fmap negate) $ uncurry Pair (Closed (-1), Closed 2)
Pair (Closed 2) (Closed (-1))

Still a range from 2 to -1. The second functor law holds.

But, you protest, that doesn't make any sense!

I disagree. It could make sense in at least three different ways.

What does a range from 2 to -1 mean? I can think of three interpretations:

  • It's the empty set
  • It's the range from -1 to 2
  • It's the set of numbers that are either less than or equal to -1 or greater than or equal to 2

We may illustrate those three interpretations, together with the original range, like this:

Four number lines, each with a range interpretation drawn in.

According to the first interpretation, we consider the range as the Boolean and of two predicates. In this interpretation the initial range is really the Boolean expression -1 ≤ x ∧ x ≤ 2. The projected range then becomes the expression 2 ≤ x ∧ x ≤ -1, which is not possible. This is how I've chosen to implement the contains function:

ghci> Pair x y = fmap ((+1) . negate) $ uncurry Pair (Closed (-1), Closed 2)
ghci> contains (x, y) [0]
False
ghci> contains (x, y) [-3]
False
ghci> contains (x, y) [4]
False

In this interpretation, the result is the empty set. The range isn't impossible; it's just empty. That's the second number line from the top in the above illustration.

This isn't, however, the only interpretation. Instead, we may choose to be liberal in what we accept and interpret the range from 2 to -1 as a 'programmer mistake': What you asked me to do is formally wrong, but I think that I understand that you meant the range from -1 to 2.

That's the third number line in the above illustration.

The third interpretation is that when the first element of the range is greater than the second, the range represents the complement of the corresponding ordinary range. That's the fourth number line in the above illustration.

The reason I spent some time on this is that it's easy to confuse the functor laws with other properties that you may associate with a data structure. This may lead you to falsely conclude that a functor isn't a functor, because you feel that it violates some other invariant.

If this happens, consider instead whether you could possibly expand the interpretation of the data structure in question.

Conclusion #

You can model a range as a functor, which enables you to project ranges, either moving them around on an imaginary number line, or changing the type of the range. This might for example enable you to map a date range to an integer range, or vice versa.
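
As a sketch, using the C# Range<T> and Endpoint<T> API from this article (so the exact member names are the article's, but the scenario and variable names are mine), mapping a closed date range to its day-of-year range could look like this:

var dateRange = new Range<DateTime>(
    Endpoint.Closed(new DateTime(2024, 2, 1)),
    Endpoint.Closed(new DateTime(2024, 2, 29)));
 
// Projection from Range<DateTime> to Range<int>.
Range<int> dayOfYearRange = dateRange.Select(d => d.DayOfYear);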

A functor enables mapping or projection, and some maps may produce results that you find odd or counter-intuitive. In this article you saw an example of that in the shape of a negated range where the first element (the 'minimum', in one interpretation) becomes greater than the second element (the 'maximum'). You may take that as an indication that the functor isn't, after all, a functor.

This isn't the case. A data structure and its map function is a functor if the mapping obeys the functor laws, which is the case for the range structures you've seen here.

