Simpler encapsulation with immutability

Wednesday, 12 June 2024 15:33:00 UTC

A worked example.

I've noticed that many software organizations struggle with encapsulation with 'bigger' problems. It may be understandable and easily applicable to define a NaturalNumber type or ensure that a minimum value is less than a maximum value, and so on. How do you, however, guarantee invariants once the scope of the problem becomes bigger and more complex?

In this series of articles, I'll attempt to illustrate how and why this worthy design goal seems elusive, and what you can do to achieve it.

Contracts #

As usual, when I discuss encapsulation, I first need to establish what I mean by the term. It is, after all, one of the most misunderstood concepts in software development. As regular readers will know, I follow the lead of Object-Oriented Software Construction. In that perspective, encapsulation is the appropriate modelling and application of preconditions, invariants, and postconditions.

Particularly when it comes to invariants, things seem to fall apart as the problem being modelled grows in complexity. Teams eventually give up guaranteeing any invariants, leaving client developers with no recourse but defensive coding, which again leads to code duplication, bugs, and maintenance problems.

If you need a reminder, an invariant is an assertion about an object that is always true. The more invariants an object has, the better guarantees it gives, and the more you can trust it. The more you can trust it, the less defensive coding you have to write. You don't have to check if return values are null, strings empty, numbers negative, collections empty, or so on.

The three sets of preconditions, postconditions, and invariants, embedded in their common superset labeled contract.

All together, I usually denote the collection of invariants, pre-, and postconditions as a type's contract.

For a simple example like modelling a natural number, or a range, or a user name, most people are able to produce sensible and coherent designs. Once, however, the problem becomes more complex, and the invariants involve multiple interacting values, maintaining the contract becomes harder.

Immutability to the rescue #

I'm not going to bury the lede any longer. It strikes me that mutation is a major source of complexity. It's not that hard to check a set of conditions when you create a value (or object or record). What makes it hard to maintain invariants is when objects are allowed to change. This implies that for every possible change to the object, it needs to examine its current state in order to decide whether or not it should allow the operation.

If the mutation happens on a leaf node in an object graph, the leaf may have to notify its parent, so that the parent can recheck the invariants. If the graph has cycles it becomes more complicated still, and if you want to make the problem truly formidable, try making the object thread-safe.

Making the object immutable makes most of these problems go away. You don't have to worry about thread-safety, because immutable values are automatically thread-safe; there's no state for any thread to change.

Even better, though, is that an immutable object's contract is smaller and simpler. It still has preconditions, because there are rules that govern what has to be true before you can create such an object. Furthermore, there may also be rules that stipulate what must be true before you can call a method on it.

Likewise, postconditions are still relevant. If you call a method on the object, it may give you guarantees about what it returns.

There are, however, no independent invariants.

The two sets of preconditions and postconditions, embedded in their common superset labeled contract.

Or rather, the invariants for an immutable object entirely coincide with its preconditions. If it was valid at creation, it remains valid.

Priority collection #

As promised, I'll work through a problem to demonstrate what I mean. I'll first showcase how mutation makes the problem hard, and then how trivial it becomes with an immutable design.

The problem is this: Design and implement a class (or just a data structure if you don't want to do Object-Oriented programming) that models a priority list (not a Priority Queue) as you sometimes run into in surveys. You know, one of these survey questions that asks you to distribute 100 points on various different options:

  • Option F: 30%
  • Option A: 25%
  • Option C: 25%
  • Option E: 20%
  • Option B: 0%
  • Option D: 0%

If you have the time, I suggest that you treat this problem as a kata. Try to do the exercise before reading the next articles in this series. You can assume the following, which is what I did.

  • The budget is 100. (You could make it configurable, but the problem is gnarly enough even with a hard-coded value.)
  • You don't need to include items with priority value 0, but you should allow it.
  • The sum of priorities must be exactly 100. This is the invariant.

The difficult part is that last invariant. Let me stress this requirement: At any time, the object should be in a consistent state; i.e. at any time should the sum of priorities be exactly 100. Not 101 or 99, but 100. Good luck with that.

The object should also be valid at initialization.

Of course, having read this far, you understand that all you have to do is to make the object immutable, but just for the sake of argument, try designing a mutable object with this invariant. Once you've tried your hand with that, read on.

Attempts #

There's educational value going through even failed attempts. When I thought of this example, I fairly quickly outlined in my head one approach that was unlikely to ever work, one that could work, and the nice immutable solution that trivially works.

I'll cover each in turn:

  • A failed attempt at priority collection with inheritance
  • A mutable priority collection
  • An immutable priority collection

It's surprising how hard even a simple exercise like this one turns out to be, if you try to do it the object-oriented way.

In reality, business rules are much more involved than what's on display here. For only a taste of how bad it might get, read Hillel Wayne's suggestions regarding a similar kind of problem.

Conclusion #

If you've lived all your programming life with mutation as an ever-present possibility, you may not realize how much easier immutability makes everything. This includes invariants.

When you have immutable data, object graphs tend to be simpler. You can't easily define cyclic graphs (although Haskell, due to its laziness, surprisingly does enable this), and invariants essentially coincide with preconditions.

In the following articles, I'll show how mutability makes even simple invariants difficult to implement, and how immutability easily addresses the issue.

Next: A failed attempt at priority collection with inheritance.


Comments

Marken Foo #

I've been enjoying going through your articles in the past couple months, and I really like the very pedagogic treatment of functional programming and adjacent topics.

The kata here is an interesting one, but I don't think I'd link it with the concept of immutability/mutability. My immediate thought was a naïve struct that can represent illegal values and whose validity is managed through functions containing some tricky logic, but that didn't seem promising whether it was done immutably or not.

Instead, the phrase "distribute 100 points" triggered an association with the stars and bars method for similar problems. The idea is that we have N=100 points in a row, and inserting dividers to break it into (numOptions) groups. Concretely, our data structure is (dividers: int array), which is a sorted array of length (numOptions + 1) where the first element is 0 and the last element is N=100. The priorities are then exactly the differences between adjacent elements of the array. The example in the kata (A=25, B=0, C=25, D=0, E=20, F=30) is then represented by the array [| 0; 25; 25; 50; 50; 70; 100|].

This solution seems to respect the invariant, has a configurable budget, can work with other numerical types, and works well whether immutable or not (if mutable, just ensure the array remains sorted, has min 0, and max N). The invariant is encoded in the representation of the data, which seems to me to be the more relevant point than mutability.

And a somewhat disjoint thought, the kata reminded me of a WinForms TableLayoutPanel (or MS Word table) whose column widths all must fit within the container's width...

2024-06-13 13:55 UTC

Thank you for writing. The danger of writing these article series is always that as soon as I've published the first one, someone comes by and puts a big hole through my premise. Well, I write this blog for a couple of independent reasons, and one of them is to learn.

And you just taught me something. Thank you. That is, at least, an elegant implementation.

How would you design the API encapsulating that implementation?

Clearly, arrays already have APIs, so you could obviously define an array-like API that performs the appropriate boundary checks. That, however, doesn't seem to model the given problem. Rather, it reveals the implementation, and forces a client developer to think in terms of the data structure, rather the problem (s)he has to solve.

Ideally, again channelling Bertrand Meyer, an object should present as an Abstract Data Structure (ADT) that doesn't require client developers to understand the implementation details. I'm curious what such an API would look like.

You've already surprised me once, and please do so once again. I'm always happy to learn something new, and that little stars-and-bars concept I've now added to my tool belt.

All that said, this article makes a more general claim, although its possible that the example it showcases is a tad too simple and naive to be a truly revealing one. The claim is that this kind of 'aggregate constraint' often causes so much trouble in the face of arbitrary state mutation that most programmers give up on encapsulation.

What happens if we instead expand the requirements a bit? Let's say that we will require the user to spend at least 90% of the budget, but no more than 100%. Also, there must be at least three prioritized items, and no individual item can receive more than a third of the budget.

2024-06-14 14:22 UTC
Marken Foo #

Thank you for the response. Here's my thoughts - it's a bit of a wall of text, I might be wrong in any of the following, and the conclusion may be disappointing. When you ask how I'd design the API, I'd say it depends on how the priority list is going to be used. The implementation trick with stars and bars might just be a one-off trick that happens to work here, but it doesn't (shouldn't) affect the contract with the outside world.

If we're considering survey questions or budgets, the interest is in the priority values. So I think the problem then is about a list of priorities with an aggregate constraint. So I would define... an array-like API that performs the appropriate boundary checks (wow), but for the item priorities. My approach would be to go for "private data, public functions", and rely on a legal starting state and preserving the legality through the public API. In pseudocode:

                type PriorityList = { budget: int; dividers: int list }
                create :: numItems: int -> budget: int -> PriorityList
        
                // Returns priorities.
                getAll :: plist: PriorityList -> int list
                get :: itemIdx: int -> plist: PriorityList -> int
        
                // *Sets the priority for an item (taking the priority from other items, starting from the back).
                set :: itemIdx: int -> priority: int -> plist: PriorityList -> PriorityList
        
                // *Adds a new item to (the end of) the PriorityList (with priority zero).
                addItem :: plist: PriorityList -> PriorityList
        
                // *Removes an item from the PriorityList (and assigns its priority to the last item). 
                removeItem :: itemIdx: int -> plist PriorityList -> PriorityList
        
                // Utility functions: see text
                _toPriorities :: dividers: int list -> int list
                _toDividers :: priorities: int list -> int list
            

Crucially: since set, addItem, and removeItem must maintain the invariants, they must have "side effects" of altering other priorities. I think this is unavoidable here because we have aggregate/global constraints, rather than just elementwise/local constraints. (Is this why resizing rows and columns in WinForms tableLayoutPanels and MS Word tables is so tedious?) This will manifest in the API - the client needs to know what "side effects" there are (suggested behaviour in parentheses in the pseudocode comments above). See my crude attempt at implementation.

You may already see where this is going. If I accept that boundary checks are needed, then my secondary goal in encapsulation is to express the constraints as clearly as possible, and hopefully not spread the checking logic all over the code.

Whence the utility functions: it turned out to be useful to convert from a list of dividers to priorities, and vice versa. This is because the elementwise operations/invariants like the individual priority values are easier to express in terms of raw priorities, while the aggregate ones like the total budget are easier in terms of "dividers" (the cumulative priorities). There is a runtime cost to the conversion, but the code becomes clearer. This smells similar to feature envy...

So why not just have the underlying implementation hold a list of priorities in the first place?! Almost everything in the implementation needs translation back to that anyway. D'oh! I refactored myself back to the naïve approach. The original representation seemed elegant, but I couldn't find a way to manipulate it that clients would find intuitive and useful in the given problem.

But... if I approach the design from the angle "what advantages does the cumulative priority model offer?", I might come up with the following candidate API functions, which could be implemented cleanly in the "divider" space:

                // (same type, create, get, getAll, addItem as above)
                // Removes the item and merges its priority with the item before it.
                merge :: ItemIdx: int -> PriorityList
                // Sets the priority of an item to zero and gives it to the item after it.
                collapse :: itemIdx: int -> PriorityList
                // Swaps the priority of an item and the one after it (e.g. to "bubble" a priority value forwards or backwards, although this is easier in the "priority" space)
                swap :: itemIdx: int -> PriorityList
                // Sets (alternative: adds to) the priority of an item, taking the priority from the items after it in sequence ("consuming" them in the forward direction)
                consume :: itemIdx: int -> priority: int -> PriorityList
                // Splits the item into 2 smaller items each with half the priority (could be generalised to n items)
                split :: ItemIdx: int -> PriorityList
                // etc.
            

And this seems like a more fitting API for that table column width example I keep bringing up. What's interesting to me is that despite the data structures of the budget/survey question and the table column widths being isomorphic, we can come up with rather different APIs depending on which view we consider. I think this is my main takeaway from this exploration, actually.

As for the additional requirements, individually each constraint is easy to handle, but their composition is tricky. If it's easy to transform an illegal PriorityList to make it respect the invariants, we can just apply the transformation after every create/set/add/remove. Something like:

                type PriorityList =
                    { budget: int
                      dividers: int list
                      budgetCondition: int -> bool
                      maxPriority: int
                      minChoices: int }
                
                let _enforceBudget (predicate: int -> bool) (defaultBudget: int) (dividers: int list) : int list =
                    if (List.last dividers |> predicate) then
                        dividers
                    else
                        List.take (dividers.Length - 1) dividers @ [ defaultBudget ]
        
                let _enforceMaxPriority (maxPriority: int) (dividers: int list) : int list =
                    _toPriorities dividers |> List.map (fun p -> min p maxPriority) |> _toDividers
            

The problem is those transforms may not preserve each others' invariant. Life would be easy if we could write a single transform to preserve everything (I haven't found one - notice that the two above are operating on different int lists so it's tricky). Otherwise, we could write validations instead of transformations, then let create/set/add/remove fail by returning Option.None (explicitly fail) or the original list (silently fail). This comes at the cost of making the API less friendly.

Ultimately with this approach I can't see a way to make all illegal states unrepresentable without sprinkling ad-hoc checks everywhere in the code. The advantages of the "cumulative priorities" representation I can think of are (a) it makes the total budget invariant obvious, and (b) it maps nicely to a UI where you click and drag segments around. Since you might have gone down a different path in the series, I'm curious to see how that shapes up.

2024-06-15 14:48 UTC

You'll regret using natural keys

Monday, 03 June 2024 19:46:00 UTC

Beating another dead horse.

Although I live in Copenhagen and mostly walk or ride my bicycle in order to get around town, I do own an old car for getting around the rest of the country. In Denmark, cars go through mandatory official inspection every other year, and I've been through a few of these in my life. A few years ago, the mechanic doing the inspection informed me that my car's chassis number was incorrect.

This did make me a bit nervous, because I'd bought the car used, and I was suddenly concerned that things weren't really as I thought. Had I unwittingly bought a stolen car?

But the mechanic just walked over to his computer in order to correct the error. That's when a different kind of unease hit me. When you've programmed for some decades, you learn to foresee various typical failure modes. Since a chassis number is an obvious candidate for a natural key, I already predicted that changing the number would prove to be either impossible, or have all sorts of cascading effects, ultimately terminating in official records no longer recognizing that the car is mine.

As it turned out, though, whoever made that piece of software knew what they were doing, because the mechanic just changed the chassis number, and that was that. This is now five or six years ago, and I still own the same car, and I've never had any problems with the official ownership records.

Uniqueness #

The reason I related this story is that I'm currently following an undergraduate course in databases and information systems. Since this course is aimed at students with no real-world experience, it wisely moves forward in a pedagogical progression. In order to teach database keys, it starts with natural keys. From a didactic perspective, this makes sense, but the result, so far, is that the young people I work with now propose database designs with natural keys.

I'm not blaming anyone. You have to learn to crawl before you can walk.

Still, this situation made me reflect on the following question: Are natural keys ever a good idea?

Let's consider an example. For a little project we're doing, we've created a database of the World's 50 best restaurants. My fellow students suggest a table design like this:

CREATE TABLE Restaurants (
    year TEXT NOT NULL,
    rank TEXT NOT NULL,
    restaurantName TEXT NOT NULL,
    cityName TEXT NOT NULL
);

Granted, at this point, this table definition defines no key at all. I'm not complaining about that. After all, a month ago, the students probably hadn't seen a database table.

From following the course curriculum, it'd be natural, however, to define a key for the Restaurants table as the combination of restaurantName, cityName, and year. The assumption is that name and city uniquely identifies a restaurant.

In this particular example, this assumption may actually turn out to hold. So far. After all, the data set isn't that big, and it's important for restaurants in that league to have recognizable names. If I had to guess, I'd say that there's probably only one Nobelhart & Schmutzig in the world.

Still, a good software architect should challenge the underlying assumptions. Is name and city a natural key? It's easy to imagine that it's not. What if we expand the key to include the country as well? Okay, but what if we had a restaurant named China Wok in Springfield, USA? Hardly unique. Add the state, you say? Probably still not unique.

Identity #

Ensuring uniqueness is only the first of many problems with natural keys. You may quickly reach the conclusion that for a restaurant database, a synthetic key is probably the best choice.

But what about 'natural' natural keys, so to speak? An example may be a car's chassis number. This is already an opaque number, and it probably originates from a database somewhere. Or how about a personal identification number? In Denmark we have the CPR number, and I understand that the US Social Security Number is vaguely analogous.

If you're designing a database that already includes such a personal identification number, you might be tempted to use it as a natural key. After all, it's already a key somewhere else, so it's guaranteed to be unique, right?

Yes, the number may uniquely identify a person, but the converse may not be true. A person may have more than one identification number. At least when time is a factor.

As an example, for technical-historical reasons, the Danish CPR number carries information (which keys shouldn't do), such as a person's date of birth and sex. Since 2014 a new law enables transsexual citizens to get a new CPR number that reflects their perceived gender. The consequence is that the same person may have more than one CPR number. Perhaps not more than one at the same time, but definitely two during a lifetime.

Even if existing keys are guaranteed to be unique, you can't assume that the uniqueness gives rise to a bijection. If you use an external unique key, you may lose track of the entities that you're trying to keep track of.

This is true not only for people, but cars, bicycles (which also have chassis numbers), network cards, etc.

Clerical errors #

Finally, even if you've found a natural key that is guaranteed to be unique and track the actual entity that you want to keep track of, there's a final argument against using an externally defined key in your system: Data-entry errors.

Take the story about my car's chassis number. The mechanic who spotted the discrepancy clearly interpreted it as a clerical error.

After a few decades of programming, I've learned that sooner or later, there will be errors in your data. Either it's a clerical error, or the end-user mistyped, or there was a data conversion error when importing from an external system. Or even data conversion errors within the same system, as it goes through upgrades and migrations.

Your system should be designed to allow corrections to data. This includes corrections of external keys, such as chassis numbers, government IDs, etc. This means that you can't use such keys as database keys in your own system.

Heuristic #

Many were the times, earlier in my career, when I decided to use a 'natural key' as a key in my own database. As far as I recall, I've regretted it every single time.

These days I follow a hard heuristic: Always use synthetic keys for database tables.

Conclusion #

Is it ever a good idea to use natural keys in a database design? My experience tells me that it's not. Ultimately, regardless of how certain you can be that the natural key is stable and correctly tracks the entity that it's supposed to keep track of, data errors will occur. This includes errors in those natural keys.

You should be able to correct such errors without losing track of the involved entities. You'll regret using natural keys. Use synthetic keys.


Comments

There are lots of different types of keys. I agree that using natural keys as physical primary keys is a bad idea but you really should be modelling your data logically with natural keys. Thinking about uniqueness and identity is a part of your data design. Natural keys often end up as constraints, indexes and query plans. When natural keys are not unique enough then you need to consider additional attributes in your design to ensure access to a specific record.

Considering natural keys during design can help elicit additional requirements and business rules. "Does a social security number uniquely identify a person? If not why?" In the UK they recycle them so the natural key is a combination of national insurance number and birth year. You have to ask questions.

2024-06-04 15:43 UTC
2024-06-05 9:33 UTC

I largely agree with James Snape, but wanted to throw in a few other thoughts on top. Surrogates don't defend you from duplicate data, in fact they facilitate it, because the routine generating the surrogate key isn't influenced by any of the other data in the record. The concept of being unable to correct a natural key is also odd, why can't you? Start a transaction, insert a new record with the correct key, update the related records to point to the new record, then delete the old record, done. Want some crucial information about a related record but only have the surrogate to it? I guess you have to join it every time in order to get the columns the user actually wants to see. A foreign key that uses a natural key often often prevents the join entirely, because it tells the user what they wanted to know.

I find the problem with natural keys usually comes from another source entirely. Developers write code and don't tend to prefer using SQL. They typically interact with databases through ORM libraries. ORMs are complicated and rely on conventions to uniformly deal with data. It's not uncommon for ORMs to dictate the structure of tables to some degree, or what datatypes to prefer. It's usually easier in an ORM to have a single datatype for keys (BIGINT?) and use it uniformly across all the tables.

2024-06-05 12:42 UTC

James, Nicholas, thank you for writing. I realize that there are some unstated assumptions and implied concerns that I should have made more explicit. I certainly have no problem with adding constraints and other rules to model data. For the Danish CPR number, for example, while I wouldn't make it a primary key (for the reasons outlined in the article), I'd definitely put a UNIQUE constraint on it.

Another unspoken context that I had in mind is that systems often exist in a wider context where ACID guarantees fall apart. I suppose it's true that if you look at a database in isolation, you may be able to update a foreign key with the help of some cascading changes rippling through the database, but if you've ever shared the old key outside of the database, you now have orphaned data.

A simple example could be sending out an email with a link that embeds the old key. If you change the key after sending out the email, but before the user clicks, the link no longer works.

That's just a simple and easy-to-explain example. The more integration (particularly system-to-system integration) you have, the worse this kind of problem becomes. I briefly discussed the CPR number example with my doctor wife, and she immediately confirmed that this is a real problem in the Danish health sector, where many independent software systems need to exchange patient data.

You can probably work around such problems in various ways, but if you had avoided using natural keys, you wouldn't have had to change the key in the first place.

2024-06-06 6:56 UTC

I think it is best to have two separate generated keys for each row:

  • A key used only for relationships between tables. I like to call this relid, and make it serialised, so it is just an increasing number. This key is the primary key and should never be exposed outside the database.
  • A key used only outside the database as a unique reference to which row to update. I like to call this id, and make it a uuid, since it is well accepted to uniquely identify rows by a uuid, and to expose them to the outside world - many public APIs do this. Theoretically, the same uuid should never be generated twice, so this key doesn't necessarily have to be declared as unique.

The relid can be used in simple foreign keys, and in bridging/join tables - tables that contain primary keys of multiple tables. Generally speaking, the relid is far more readable than a uuid - it is easier to hold in your head a simple integer, which usually is not that large, than a 36 character sequence that looks similar to other 36 character sequences. UUIDs generally look like a jumble.

A relid can be 32-bits for tables you're confident will never need more than 2.1 billion rows, which really is 99.99% of all tables ever created by 99.99% of applications. If this turns out to be wrong, it is possible to upgrade the relids to 64-bit for a given table. It's a bit of a pain, especially if there are lots of references to it, but it can be done.

The relid doesn't always have to be a serialised value, and you don't always have to call the column relid. Since the primary key is never exposed publicly, it doesn't matter if different column types or names are used for different use cases. For example, code tables might use one of the codes as the primary key.

I don't think it makes sense to be religious on key usage; just like everything else, there are valid reasons for periodically varying how they work. I'm sure somebody has a valid case where a single key is better than two. I just think it generally makes sense to have a pair of internal and external keys for most cases.

2024-06-07 3:31 UTC

The thing with databases keys is you really need to be precise on what you mean by a key. Any combination of attributes is a candidate key. There are also logical and physical representations of keys. For example, a SQL Server primary key is a physical record locator but logically a unique key constraint. Yes, these behave poorly when you use natural keys as the primary key for all the reasons you mention. They are a complete implementation detail. Users should never see these attributes though and you shouldn't share the values outside of your implementation. Sharing integer surrogate keys in urls is a classic issue allowing enumeration attacks on your data if not secured properly.

Foreign keys are another logical and physical dual use concept. In SQL Server a physical foreign key constrain must reference the primary key from a parent table but logically that doesn't need to happen for relational theory to work.

Alternate keys are combinations of attributes that identify a record (or many records); these are often the natural keys you use in your user interface and where clauses etc. Alternate keys are also how systems communicate. Take your CPR number example, you cannot exchange patient data unless both systems agree on a common key. This can't be an internally generated surrogate value.

Natural keys also serve another purpose in parent-child relationships. By sharing natural key attributes with a parent you can ensure a child is not accidentally moved to a new parent plus you can query a child table without needing to join to the parent table.

There isn't a one-size-fits all when it comes to databases and keys. Joe Celko has written extensively on the subject so maybe its better to read the following than my small commentary:

2024-06-07 09:57 UTC

Greg, thank you for writing. I agree with everything you wrote, and I've been using that kind of design for... wow, at least a decade, it looks! for a slightly different reason. This kind of design seems, even if motivated by a different concern, congruent with what you describe.

Like you also imply, only a sith speaks in absolutes. The irony of the article is that I originally intended it to be more open-ended, in the sense that I was curious if there were genuinely good reasons to use natural keys. As I wrote, the article turned out more unconditional than I originally had in mind.

I am, in reality, quite ready to consider arguments to the contrary. But really, I was curious: Is it ever a good idea to use natural keys as primary keys? It sounds like a rhetorical question, but I don't mind if someone furnishes a counter-example.

As Nicholas Peterson intimated, it's probably not a real problem if those keys never 'leave' the database. What I failed to make explicit in this article is that the problems I've consistently run into occur when a system has shared keys with external systems or users.

2024-06-14 11:26 UTC

James, thank you for writing. I think we're discussing issues at different levels of abstraction. This just underscores how difficult technical writing is. I should have made my context and assumptions more explicit. The error is mine.

Everything you write sounds correct to me. I am aware of both relational calculus and relational algebra, so I'm familiar with the claims you make, and I don't dispute them.

My focus is rather on systems architecture. Even an 'internal' system may actually be composed from multiple independent systems, and my concern is that using natural keys to exchange data between such systems ultimately turns out to make things more difficult than they could have been. The only statement of yours with which I think I disagree is that you can't exchange data between systems unless you use natural keys. You definitely can, although you need to appoint one of the systems to be a 'master key issuer'.

In practice, like Greg Hall, I'd prefer using GUIDs for that purpose, rather than sequential numbers. That also addresses the concern about enumeration attacks. (Somewhat tangentially, I also recommend signing URLs with a private key in order to prevent reverse-engineering, or 'URL-hacking'.)

2024-06-14 11:55 UTC

I think we are basically agreeing here because I would never use natural keys nor externally visible synthetic keys for physical primary keys. (I think this statement is even more restrictive than the article's main premise). Well, with a rule exception for configurable enum type tables because the overhead of joining to resolve a single column value is inefficient. I would however always use a natural key for a logical primary key.

The only reason why I'm slightly pedantic about this is due the the number of clients why have used surrogate keys in a logical model and then gone on to create databases where the concept of entity identity doesn't exist. This creates many of the issues Nicholas Peterson mentioned above: duplicates, historical change tracking, etc. Frankly, it doesn't help that lots of code examples for ORMs just start with an entity that has an ID attribute.

One final comment on sharing data based on a golden master synthetic key. The moment you do I would argue that you have now committed to maintaining that key through all types of data mergers and acquisitions. It must never collide, and always point to exactly the same record and only that record. Since users can use it to refer to an entity and it makes up part of your external API, it now meets the definition of a natural key. Whether you agree or not on my stretching the definition a little, you still should not use this attribute as the physical primary key (record locator) because we should not expose implementation details in our APIs. The first Celko article I linked to explains some of the difficulties for externally visible synthetic keys.

2024-06-14 13:45 UTC

Continuous delivery without a CI server

Monday, 27 May 2024 13:34:00 UTC

An illustrative example.

More than a decade ago, I worked on a small project. It was a small single-page application (SPA) with a REST API backend, deployed to Azure. As far as I recall, the REST API used blob storage, so all in all it wasn't a complex system.

We were two developers, and although we wanted to do continuous delivery (CD), we didn't have much development infrastructure. This was a little startup, and back then, there weren't a lot of free build services available. We were using GitHub, but it was before it had any free services to compile your code and run tests.

Given those constraints, we figured out a simple way to do CD, even though we didn't have a continuous integration (CI) server.

I'll tell you how we did this.

Shining an extraordinary light on the mundane #

The reason I'm relating this little story isn't to convince you that you, too, should do it that way. Rather, it's a didactic device. By doing something extreme, we can sometimes learn about the ordinary.

You can only be pragmatic if you know how to be dogmatic.

From what I hear and read, it seems that there's a lot of organizations that believe that they're doing CI (or perhaps even CD) because they have a CI server. What the following tale will hopefully highlight is that, while build servers are useful, they aren't a requirement for CI or CD.

Distributed CD #

Dramatis personae: My colleague and me. Scene: One small SPA project with a REST API and blob storage, to be deployed to Azure. Code base in GitHub. Two laptops. Remote work.

One of us (let's say me) would start on implementing a feature, or fixing a bug. I'd use test-driven development (TDD) to get feedback on API ideas, as well as to accumulate a suite of regression tests. After a few hours of effective work, I'd send a pull request to my colleague.

Since we were only two people on the team, the responsibility was clear. It was the other person's job to review the pull request. It was also clear that the longer the reviewer dawdled, the less efficient the process would be. For that reason, we'd typically have agile pull requests with a good turnaround time.

While we were taking advantage of GitHub as a central coordination hub for pull requests, Git itself is famously distributed. Thus, we wondered whether it'd be possible to make the CD process distributed as well.

Yes, apart from GitHub, what we did was already distributed.

A little more automation #

Since we were both doing TDD, we already had automated tests. Due to the simple setup of the system, we'd already automated more than 80% of our process. It wasn't much of a stretch to automate whatever else needed automation. Such as deployment.

We agreed on a few simple rules:

  • Every part of our process should be automated.
  • Reviewing a pull request included running all tests.

When people review pull requests, they often just go to GitHub and look around before issuing an LGTM.

But, you do realize that this is Git, right? You can pull down the proposed changes and run them.

What if you're already in the middle of something, working on the same code base? Stash your changes and pull down the code.

The consequence of this process was that every time a pull request was accepted, we already knew that it passed all automated tests on two physical machines. We actually didn't need a server to run the tests a third time.

Two laptops, a box indicating GitHub, and another box indicating a production system.

After a merge, the final part of the development process mandated that the original author should deploy to production. We had Bash script that did that.

Simplicity #

This process came with some built-in advantages. First of all, it was simple. There wasn't a lot of moving parts, so there weren't many steps that could break.

Have you ever had the pleasure of troubleshooting a build? The code works on your machine, but not on the build server.

It sometimes turns out that there's a configuration mismatch with the compiler or test tools. Thus, the problem with the build server doesn't mean that you prevented a dangerous defect from being deployed to production. No, the code just didn't compile on the build server, but would actually have run fine on the production system.

It's much easier troubleshooting issues on your own machine than on some remote server.

I've also seen build servers that were set up to run tests, but along the way, something had failed and the tests didn't run. And no-one was looking at logs or warning emails from the build system because that system would already be sending hundreds of warnings a day.

By agreeing to manually(!) run the automated tests as part of the review process, we were sure that they were exercised.

Finally, by keeping the process simple, we could focus on what mattered: Delivering value to our customer. We didn't have to waste time learning how a proprietary build system worked.

Does it scale? #

I know what you're going to say: This may have worked because the overall requirements were so simple. This will never work in a 'real' development organization, with a 'real' code base.

I understand. I never claimed that it would.

The point of this story is to highlight what CI and CD is. It's a way of working where you continuously integrate your code with everyone else's code, and where you continuously deploy changes to production.

In reality, having a dedicated build system for that can be useful. These days, such systems tend to be services that integrate with GitHub or other sites, rather than an actual server that you have to care for. Even so, having such a system doesn't mean that your organization makes use of CI or CD.

(Oh, and for the mathematically inclined: In this context continuous doesn't mean actually continuous. It just means arbitrarily often.)

Conclusion #

CI and CD are processes that describe how we work with code, and how we work together.

Continuous integration means that you often integrate your code with everyone else's code. How often? More than once a day.

Continuous deployment means that you often deploy code changes to production. How often? Every time new code is integrated.

A build system can be convenient to help along such processes, but it's strictly speaking not required.


Fundamentals

Monday, 20 May 2024 07:04:00 UTC

How to stay current with technology progress.

A long time ago, I landed my dream job. My new employer was a consulting company, and my role was to be the resident Azure expert. Cloud computing was still in its infancy, and there was a good chance that I might be able to establish myself as a leading regional authority on the topic.

As part of the role, I was supposed to write articles and give presentations showing how to solve various problems with Azure. I dug in with fervour, writing sample code bases and even an MSDN Magazine article. To my surprise, after half a year I realized that I was bored.

At that time I'd already spent more than a decade learning new technology, and I knew that I was good at it. For instance, I worked five years for Microsoft Consulting Services, and a dirty little secret of that kind of role is that, although you're sold as an expert in some new technology, you're often only a few weeks ahead of your customer. For example, I was once engaged as a Windows Workflow Foundation expert at a time when it was still in beta. No-one had years of experience with that technology, but I was still expected to know much more about it than my customer.

I had lots of engagements like that, and they usually went well. I've always been good at cramming, and as a consultant you're also unencumbered by all the daily responsibilities and politics that often occupy the time and energy of regular employees. The point being that while I'm decent at learning new stuff, the role of being a consultant also facilitates that sort of activity.

After more then a decade of learning new frameworks, new software libraries, new programming languages, new tools, new online services, it turned out that I was ready for something else. After spending a few months learning Azure, I realized that I'd lost interest in that kind of learning. When investigating a new Azure SDK, I'd quickly come to the conclusion that, oh, this is just another object-oriented library. There are these objects, and you call this method to do that, etc. That's not to say that learning a specific technology is a trivial undertaking. The worse the design, the more difficult it is to learn.

Still, after years of learning new technologies, I'd started recognizing certain patterns. Perhaps, I thought, well-designed technologies are based on some fundamental ideas that may be worth learning instead.

Staying current #

A common lament among software developers is that the pace of technology is so overwhelming that they can't keep up. This is true. You can't keep up.

There will always be something that you don't know. In fact, most things you don't know. This isn't a condition isolated only to technology. The sum total of all human knowledge is so vast that you can't know it all. What you will learn, even after a lifetime of diligent study, will be a nanoscopic fraction of all human knowledge - even of everything related to software development. You can't stay current. Get used to it.

A more appropriate question is: How do I keep my skill set relevant?

Assuming that you wish to stay employable in some capacity, it's natural to be concerned with how your mad Flash skillz will land you the next gig.

Trying to keep abreast of all new technologies in your field is likely to lead to burnout. Rather, put yourself in a position so that you can quickly learn necessary skills, just in time.

Study fundamentals, rather than specifics #

Those many years ago, I realized that it'd be a better investment of my time to study fundamentals. Often, once you have some foundational knowledge, you can apply it in many circumstances. Your general knowledge will enable you to get quickly up to speed with specific technologies.

Success isn't guaranteed, but knowing fundamentals increases your chances.

This may still seem too abstract. Which fundamentals should you learn?

In the remainder of this article, I'll give you some examples. The following collection of general programmer knowledge spans software engineering, computer science, broad ideas, but also specific tools. I only intend this set of examples to serve as inspiration. The list isn't complete, nor does it constitute a minimum of what you should learn.

If you have other interests, you may put together your own research programme. What follows here are just some examples of fundamentals that I've found useful during my career.

A criterion, however, for constituting foundational knowledge is that you should be able to apply that knowledge in a wide variety of contexts. The fundamental should not be tied to a particular programming language, platform, or operating system.

Design patterns #

Perhaps the first foundational notion that I personally encountered was that of design patterns. As the Gang of Four (GoF) wrote in the book, a design pattern is an abstract description of a solution that has been observed 'in the wild', more than once, independently evolved.

Please pay attention to the causality. A design pattern isn't prescriptive, but descriptive. It's an observation that a particular code organisation tends to solve a particular problem.

There are lots of misconceptions related to design patterns. One of them is that the 'library of patterns' is finite, and more or less constrained to the patterns included in the original book.

There are, however, many more patterns. To illustrate how much wider this area is, here's a list of some patterns books in my personal library:

In addition to these, there are many more books in my library that are patterns-adjacent, including one of my own. The point is that software design patterns is a vast topic, and it pays to know at least the most important ones.

A design pattern fits the criterion that you can apply the knowledge independently of technology. The original GoF book has examples in C++ and Smalltalk, but I've found that they apply well to C#. Other people employ them in their Java code.

Knowing design patterns not only helps you design solutions. That knowledge also enables you to recognize patterns in existing libraries and frameworks. It's this fundamental knowledge that makes it easier to learn new technologies.

Often (although not always) successful software libraries and frameworks tend to follow known patterns, so if you're aware of these patterns, it becomes easier to learn such technologies. Again, be aware of the causality involved. I'm not claiming that successful libraries are explicitly designed according to published design patterns. Rather, some libraries become successful because they offer good solutions to certain problems. It's not surprising if such a good solution falls into a pattern that other people have already observed and recorded. It's like parallel evolution.

This was my experience when I started to learn the details of Azure. Many of those SDKs and APIs manifested various design patterns, and once I'd recognized a pattern it became much easier to learn the rest.

The idea of design patterns, particularly object-oriented design patterns, have its detractors, too. Let's visit that as the next set of fundamental ideas.

Functional programming abstractions #

As I'm writing this, yet another Twitter thread pokes fun at object-oriented design (OOD) patterns as being nothing but a published collection of workarounds for the shortcomings of object orientation. The people who most zealously pursue that agenda tends to be functional programmers.

Well, I certainly like functional programming (FP) better than OOD too, but rather than poking fun at OOD, I'm more interested in how design patterns relate to universal abstractions. I also believe that FP has shortcomings of its own, but I'll have more to say about that in a future article.

Should you learn about monoids, functors, monads, catamorphisms, and so on?

Yes you should, because these ideas also fit the criterion that the knowledge is technology-independent. I've used my knowledge of these topics in Haskell (hardly surprising) and F#, but also in C# and Python. The various LINQ methods are really just well-known APIs associated with, you guessed it, functors, monads, monoids, and catamorphisms.

Once you've learned these fundamental ideas, it becomes easier to learn new technologies. This has happened to me multiple times, for example in contexts as diverse as property-based testing and asynchronous message-passing architectures. Once I realize that an API gives rise to a monad, say, I know that certain functions must be available. I also know how I should best compose larger code blocks from smaller ones.

Must you know all of these concepts before learning, say, F#? No, not at all. Rather, a language like F# is a great vehicle for learning such fundamentals. There's a first time for learning anything, and you need to start somewhere. Rather, the point is that once you know these concepts, it becomes easier to learn the next thing.

If, for example, you already know what a monad is when learning F#, picking up the idea behind computation expressions is easy once you realize that it's just a compiler-specific way to enable syntactic sugaring of monadic expressions. You can learn how computation expressions work without that knowledge, too; it's just harder.

This is a recurring theme with many of these examples. You can learn a particular technology without knowing the fundamentals, but you'll have to put in more time to do that.

On to the next example.

SQL #

Which object-relational mapper (ORM) should you learn? Hibernate? Entity Framework?

How about learning SQL? I learned SQL in 1999, I believe, and it's served me well ever since. I consider raw SQL to be more productive than using an ORM. Once more, SQL is largely technology-independent. While each database typically has its own SQL dialect, the fundamentals are the same. I'm most well-versed in the SQL Server dialect, but I've also used my SQL knowledge to interact with Oracle and PostgreSQL. Once you know one SQL dialect, you can quickly solve data problems in one of the other dialects.

It doesn't matter much whether you're interacting with a database from .NET, Haskell, Python, Ruby, or another language. SQL is not only universal, the core of the language is stable. What I learned in 1999 is still useful today. Can you say the same about your current ORM?

Most programmers prefer learning the newest, most cutting-edge technology, but that's a risky gamble. Once upon a time Silverlight was a cutting-edge technology, and more than one of my contemporaries went all-in on it.

On the contrary, most programmers find old stuff boring. It turns out, though, that it may be worthwhile learning some old technologies like SQL. Be aware of the Lindy effect. If it's been around for a long time, it's likely to still be around for a long time. This is true for the next example as well.

HTTP #

The HTTP protocol has been around since 1991. It's an effectively text-based protocol, and you can easily engage with a web server on a near-protocol level. This is true for other older protocols as well.

In my first IT job in the late 1990s, one of my tasks was to set up and maintain Exchange Servers. It was also my responsibility to make sure that email could flow not only within the organization, but that we could exchange email with the rest of the internet. In order to test my mail servers, I would often just telnet into them on port 25 and type in the correct, text-based instructions to send a test email.

Granted, it's not that easy to telnet into a modern web server on port 80, but a ubiquitous tool like curl accomplishes the same goal. I recently wrote how knowing curl is better than knowing Postman. While this wasn't meant as an attack on Postman specifically, neither was it meant as a facile claim that curl is the only tool useful for ad-hoc interaction with HTTP-based APIs. Sometimes you only realize an underlying truth when you write about a thing and then other people find fault with your argument. The underlying truth, I think, is that it pays to understand HTTP and being able to engage with an HTTP-based web service at that level of abstraction.

Preferably in an automatable way.

Shells and scripting #

The reason I favour curl over other tools to interact with HTTP is that I already spend quite a bit of time at the command line. I typically have a little handful of terminal windows open on my laptop. If I need to test an HTTP server, curl is already available.

Many years ago, an employer introduced me to Git. Back then, there were no good graphical tools to interact with Git, so I had to learn to use it from the command line. I'm eternally grateful that it turned out that way. I still use Git from the command line.

When you install Git, by default you also install Git Bash. Since I was already using that shell to interact with Git, it began to dawn on me that it's a full-fledged shell, and that I could do all sorts of other things with it. It also struck me that learning Bash would be a better investment of my time than learning PowerShell. At the time, there was no indication that PowerShell would ever be relevant outside of Windows, while Bash was already available on most systems. Even today, knowing Bash strikes me as more useful than knowing PowerShell.

It's not that I do much Bash-scripting, but I could. Since I'm a programmer, if I need to automate something, I naturally reach for something more robust than shell scripting. Still, it gives me confidence to know that, since I already know Bash, Git, curl, etc., I could automate some tasks if I needed to.

Many a reader will probably complain that the Git CLI has horrible developer experience, but I will, again, postulate that it's not that bad. It helps if you understand some fundamentals.

Algorithms and data structures #

Git really isn't that difficult to understand once you realize that a Git repository is just a directed acyclic graph (DAG), and that branches are just labels that point to nodes in the graph. There are basic data structures that it's just useful to know. DAGs, trees, graphs in general, adjacency lists or adjacency matrices.

Knowing that such data structures exist is, however, not that useful if you don't know what you can do with them. If you have a graph, you can find a minimum spanning tree or a shortest-path tree, which sometimes turn out to be useful. Adjacency lists or matrices give you ways to represent graphs in code, which is why they are useful.

Contrary to certain infamous interview practices, you don't need to know these algorithms by heart. It's usually enough to know that they exist. I can't remember Dijkstra's algorithm off the top of my head, but if I encounter a problem where I need to find the shortest path, I can look it up.

Or, if presented with the problem of constructing current state from an Event Store, you may realize that it's just a left fold over a linked list. (This isn't my own realization; I first heard it from Greg Young in 2011.)

Now we're back at one of the first examples, that of FP knowledge. A list fold is its catamorphism. Again, these things are much easier to learn if you already know some fundamentals.

What to learn #

These examples may seems overwhelming. Do you really need to know all of that before things become easier?

No, that's not the point. I didn't start out knowing all these things, and some of them, I'm still not very good at. The point is rather that if you're wondering how to invest your limited time so that you can remain up to date, consider pursuing general-purpose knowledge rather than learning a specific technology.

Of course, if your employer asks you to use a particular library or programming language, you need to study that, if you're not already good at it. If, on the other hand, you decide to better yourself, you can choose what to learn next.

Ultimately, if your're learning for your own sake, the most important criterion may be: Choose something that interests you. If no-one forces you to study, it's too easy to give up if you lose interest.

If, however, you have the choice between learning Noun.js or design patterns, may I suggest the latter?

For life #

When are you done, you ask?

Never. There's more stuff than you can learn in a lifetime. I've met a lot of programmers who finally give up on the grind to keep up, and instead become managers.

As if there's nothing to learn when you're a manager. I'm fortunate that, before I went solo, I mainly had good managers. I'm under no illusion that they automatically became good managers. All I've heard said about management is that there's a lot to learn in that field, too. Really, it'd be surprising if that wasn't the case.

I can understand, however, how just keep learning the next library, the next framework, the next tool becomes tiring. As I've already outlined, I hit that wall more than a decade ago.

On the other hand, there are so many wonderful fundamentals that you can learn. You can do self-study, or you can enrol in a more formal programme if you have the opportunity. I'm currently following a course on compiler design. It's not that I expect to pivot to writing compilers for the rest of my career, but rather,

  1. "It is considered a topic that you should know in order to be "well-cultured" in computer science.
  2. "A good craftsman should know his tools, and compilers are important tools for programmers and computer scientists.
  3. "The techniques used for constructing a compiler are useful for other purposes as well.
  4. "There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language."

That's good enough for me, and so far, I'm enjoying the course (although it's also hard work).

You may not find this particular topic interesting, but then hopefully you can find something else that you fancy. 3D rendering? Machine learning? Distributed systems architecture?

Conclusion #

Technology moves at a pace with which it's impossible to keep up. It's not just you who's falling behind. Everyone is. Even the best-paid GAMMA programmer knows next to nothing of all there is to know in the field. They may have superior skills in certain areas, but there will be so much other stuff that they don't know.

You may think of me as a thought leader if you will. If nothing else, I tend to be a prolific writer. Perhaps you even think I'm a good programmer. I should hope so. Who fancies themselves bad at something?

You should, however, have seen me struggle with C programming during a course on computer systems programming. There's a thing I'm happy if I never have to revisit.

You can't know it all. You can't keep up. But you can focus on learning the fundamentals. That tends to make it easier to learn specific technologies that build on those foundations.


Gratification

Monday, 13 May 2024 06:27:00 UTC

Some thoughts on developer experience.

Years ago, I was introduced to a concept called developer ergonomics. Despite the name, it's not about good chairs, standing desks, or multiple monitors. Rather, the concept was related to how easy it'd be for a developer to achieve a certain outcome. How easy is it to set up a new code base in a particular language? How much work is required to save a row in a database? How hard is it to read rows from a database and display the data on a web page? And so on.

These days, we tend to discuss developer experience rather than ergonomics, and that's probably a good thing. This term more immediately conveys what it's about.

I've recently had some discussions about developer experience (DevEx, DX) with one of my customers, and this has lead me to reflect more explicitly on this topic than previously. Most of what I'm going to write here are opinions and beliefs that go back a long time, but apparently, it's only recently that these notions have congealed in my mind under the category name developer experience.

This article may look like your usual old-man-yells-at-cloud article, but I hope that I can avoid that. It's not the case that I yearn for some lost past where 'we' wrote Plankalkül in Edlin. That, in fact, sounds like a horrible developer experience.

The point, rather, is that most attractive things come with consequences. For anyone who have been reading this blog even once in a while, this should come as no surprise.

Instant gratification #

Fat foods, cakes, and wine can be wonderful, but can be detrimental to your health if you overindulge. It can, however, be hard to resist a piece of chocolate, and even if we think that we shouldn't, we often fail to restrain ourselves. The temptation of instant gratification is simply too great.

There are other examples like this. The most obvious are the use of narcotics, lack of exercise, smoking, and dropping out of school. It may feel good in the moment, but can have long-term consequences.

Small children are notoriously bad at delaying gratification, and we often associate the ability to delay gratification with maturity. We all, however, fall in from time to time. Food and wine are my weak spots, while I don't do drugs, and I didn't drop out of school.

It strikes me that we often talk about ideas related to developer experience in a way where we treat developers as children. To be fair, many developers also act like children. I don't know how many times I've something like, "I don't want to write tests/go through a code review/refactor! I just want to ship working code now!"

Fine, so do I.

Even if wine is bad for me, it makes life worth living. As the saying goes, even if you don't smoke, don't drink, exercise rigorously, eat healthily, don't do drugs, and don't engage in dangerous activities, you're not guaranteed to live until ninety, but you're guaranteed that it's going to feel that long.

Likewise, I'm aware that doing everything right can sometimes take so long that by the time we've deployed the software, it's too late. The point isn't to always or never do certain things, but rather to be aware of the consequences of our choices.

Developer experience #

I've no problem with aiming to make the experience of writing software as good as possible. Some developer-experience thought leaders talk about the importance of documentation, predictability, and timeliness. Neither do I mind that a development environment looks good, completes my words, or helps me refactor.

To return to the analogy of human vices, not everything that feels good is ultimately bad for you. While I do like wine and chocolate, I also love sushi, white asparagus, turbot, chanterelles, lumpfish roe caviar, true morels, Norway lobster, and various other foods that tend to be categorized as healthy.

A good IDE with refactoring support, statement completion, type information, test runner, etc. is certainly preferable to writing all code in Notepad.

That said, there's a certain kind of developer tooling and language features that strikes me as more akin to candy. These are typically tools and technologies that tend to demo well. Recent examples include OpenAPI, GitHub Copilot, C# top-level statements, code generation, and Postman. Not all of these are unequivocally bad, but they strike me as mostly aiming at immature developers.

The point of this article isn't to single out these particular products, standards, or language features, but on the other hand, in order to make a point, I do have to at least outline why I find them problematic. They're just examples, and I hope that by explaining what is on my mind, you can see the pattern and apply it elsewhere.

OpenAPI #

A standard like OpenAPI, for example, looks attractive because it automates or standardizes much work related to developing and maintaining REST APIs. Frameworks and tools that leverage that standard automatically creates machine-readable schema and contract, which can be used to generate client code. Furthermore, an OpenAPI-aware framework can also autogenerate an entire web-based graphical user interface, which developers can use for ad-hoc testing.

I've worked with clients who also published these OpenAPI user interfaces to their customers, so that it was easy to get started with the APIs. Easy onboarding.

Instant gratification.

What's the problem with this? There are clearly enough apparent benefits that I usually have a hard time talking my clients out of pursuing this strategy. What are the disadvantages? Essentially, OpenAPI locks you into level 2 APIs. No hypermedia controls, no smooth conneg-based versioning, no HATEOAS. In fact, most of what makes REST flexible is lost. What remains is an ad-hoc, informally-specified, bug-ridden, slow implementation of half of SOAP.

I've previously described my misgivings about Copilot, and while I actually still use it, I don't want to repeat all of that here. Let's move on to another example.

Top-level statements #

Among many other language features, C# 9 got top-level-statements. This means that you don't need to write a Main method in a static class. Rather, you can have a single C# code file where you can immediately start executing code.

It's not that I consider this language feature particularly harmful, but it also solves what seems to me a non-problem. It demos well, though. If I understand the motivation right, the feature exists because 'modern' developers are used to languages like Python where you can, indeed, just create a .py file and start adding code statements.

In an attempt to make C# more attractive to such an audience, it, too, got that kind of developer experience enabled.

You may argue that this is a bid to remove some of the ceremony from the language, but I'm not convinced that this moves that needle much. The level of ceremony that a language like C# has is much deeper than that. That's not to target C# in particular. Java is similar, and don't even get me started on C or C++! Did anyone say header files?

Do 'modern' developers choose Python over C# because they can't be arsed to write a Main method? If that's the only reason, it strikes me as incredibly immature. I want instant gratification, and writing a Main method is just too much trouble!

If developers do, indeed, choose Python or JavaScript over C# and Java, I hope and believe that it's for other reasons.

This particular C# feature doesn't bother me, but I find it symptomatic of a kind of 'innovation' where language designers target instant gratification.

Postman #

Let's consider one more example. You may think that I'm now attacking a company that, for all I know, makes a decent product. I don't really care about that, though. What I do care about is the developer mentality that makes a particular tool so ubiquitous.

I've met web service developers who would be unable to interact with the HTTP APIs that they are themselves developing if they didn't have Postman. Likewise, there are innumerable questions on Stack Overflow where people ask questions about HTTP APIs and post screen shots of Postman sessions.

It's okay if you don't know how to interact with an HTTP API. After all, there's a first time for everything, and there was a time when I didn't know how to do this either. Apparently, however, it's easier to install an application with a graphical user interface than it is to use curl.

Do yourself a favour and learn curl instead of using Postman. Curl is a command-line tool, which means that you can use it for both ad-hoc experimentation and automation. It takes five to ten minutes to learn the basics. It's also free.

It still seems to me that many people are of a mind that it's easier to use Postman than to learn curl. Ultimately, I'd wager that for any task you do with some regularity, it's more productive to learn the text-based tool than the point-and-click tool. In a situation like this, I'd suggest that delayed gratification beats instant gratification.

CV-driven development #

It is, perhaps, easy to get the wrong impression from the above examples. I'm not pointing fingers at just any 'cool' new technology. There are techniques, languages, frameworks, and so on, which people pick up because they're exciting for other reasons. Often, such technologies solve real problems in their niches, but are then applied for the sole reason that people want to get on the bandwagon. Examples include Kubernetes, mocks, DI Containers, reflection, AOP, and microservices. All of these have legitimate applications, but we also hear about many examples where people use them just to use them.

That's a different problem from the one I'm discussing in this article. Usually, learning about such advanced techniques requires delaying gratification. There's nothing wrong with learning new skills, but part of that process is also gaining the understanding of when to apply the skill, and when not to. That's a different discussion.

Innovation is fine #

The point of this article isn't that every innovation is bad. Contrary to Charles Petzold, I don't really believe that Visual Studio rots the mind, although I once did publish an article that navigated the same waters.

Despite my misgivings, I haven't uninstalled GitHub Copilot, and I do enjoy many of the features in both Visual Studio (VS) and Visual Studio Code (VS Code). I also welcome and use many new language features in various languages.

I can certainly appreciate how an IDE makes many things easier. Every time I have to begin a new Haskell code base, I long for the hand-holding offered by Visual Studio when creating a new C# project.

And although I don't use the debugger much, the built-in debuggers in VS and VS Code sure beat GDB. It even works in Python!

There's even tooling that I wish for, but apparently never will get.

Simple made easy #

In Simple Made Easy Rich Hickey follows his usual look-up-a-word-in-the-dictionary-and-build-a-talk-around-the-definition style to contrast simple with easy. I find his distinction useful. A tool or technique that's close at hand is easy. This certainly includes many of the above instant-gratification examples.

An easy technique is not, however, necessarily simple. It may or may not be. Rich Hickey defines simple as the opposite of complex. Something that is complex is assembled from parts, whereas a simple thing is, ideally, single and undivisible. In practice, truly simple ideas and tools may not be available, and instead we may have to settle with things that are less complex than their alternatives.

Once you start looking for things that make simple things easy, you see them in many places. A big category that I personally favour contains all the language features and tools that make functional programming (FP) easier. FP tends to be simpler than object-oriented or procedural programming, because it explicitly distinguishes between and separates predictable code from unpredictable code. This does, however, in itself tend to make some programming tasks harder. How do you generate a random number? Look up the system time? Write a record to a database?

Several FP languages have special features that make even those difficult tasks easy. F# has computation expressions and Haskell has do notation.

Let's say you want to call a function that consumes a random number generator. In Haskell (as in .NET) random number generators are actually deterministic, as long as you give them the same seed. Generating a random seed, on the other hand, is non-deterministic, so has to happen in IO.

Without do notation, you could write the action like this:

rndSelect :: Integral i => [a] -> i -> IO [a]
rndSelect xs count = (\rnd -> rndGenSelect rnd xs count) <$> newStdGen

(The type annotation is optional.) While terse, this is hardly readable, and the developer experience also leaves something to be desired. Fortunately, however, you can rewrite this action with do notation, like this:

rndSelect :: Integral i => [a] -> i -> IO [a]
rndSelect xs count = do
  rnd <- newStdGen
  return $ rndGenSelect rnd xs count

Now we can clearly see that the action first creates the rnd random number generator and then passes it to rndGenSelect. That's what happened before, but it was buried in a lambda expression and Haskell's right-to-left causality. Most people would find the first version (without do notation) less readable, and more difficult to write.

Related to developer ergonomics, though, do notation makes the simple code (i.e. code that separates predictable code from unpredictable code) easy (that is; at hand).

F# computation expressions offer the same kind of syntactic sugar, making it easy to write simple code.

Delay gratification #

While it's possible to set up a development context in such a way that it nudges you to work in a way that's ultimately good for you, temptation is everywhere.

Not only may new language features, IDE functionality, or frameworks entice you to do something that may be disadvantageous in the long run. There may also be actions you don't take because it just feels better to move on.

Do you take the time to write good commit messages? Not just a single-line heading, but a proper message that explains your context and reasoning?

Most people I've observed working with source control 'just want to move on', and can't be bothered to write a useful commit message.

I hear about the same mindset when it comes to code reviews, particularly pull request reviews. Everyone 'just wants to write code', and no-one want to review other people's code. Yet, in a shared code base, you have to live with the code that other people write. Why not review it so that you have a chance to decide what that shared code base should look like?

Delay your own gratification a bit, and reap the awards later.

Conclusion #

The only goal I have with this article is to make you think about the consequences of new and innovative tools and frameworks. Particularly if they are immediately compelling, they may be empty calories. Consider if there may be disadvantages to adopting a new way of doing things.

Some tools and technologies give you instant gratification, but may be unhealthy in the long run. This is, like most other things, context-dependent. In the long run your company may no longer be around. Sometimes, it pays to deliberately do something that you know is bad, in order to reach a goal before your competition. That was the original technical debt metaphor.

Often, however, it pays to delay gratification. Learn curl instead of Postman. Learn to design proper REST APIs instead of relying on OpenAI. If you need to write ad-hoc scripts, use a language suitable for that.


Comments

Regarding Postman vs. curl, I have to disagree. Sure, curl is pretty easy to use. But while it's good for one-off tests, it sucks when you need to maintain a collection of requests that you can re-execute whevenever you want. In a testing session, you either need to re-type whole command, or reuse a previous command from the shell's history. Or have a file with all your commands and copy-paste to the shell. Either way, it's not a good experience.

That being said, I'm not very fond of Postman either. It's too heavyweight for what it does, IMHO, and the import/export mechanism is terrible for sharing collections with the team. These days, I tend to use VSCode extensions like httpYac or REST Client, or the equivalent that is now built into Visual Studio and Rider. It's much easier to work with than Postman (it's just text), while still being interactive. And since it's just a text file, you can just add it to the Git to share it with the team.

2024-05-14 02:38 UTC

@Thomas Levesque: I agree with you, yet VSCode or Rider's extensions lock you into an editor quite quickly.

But you can have the best of both worlds: a cli tool first, with editor extensions. Just like Hurl.

Note that you can run a curl command from a file with curl --config [curl_request.file], it makes chaining requests (like with login and secrets) rather cumbersome very quickly.

2024-05-16 13:57 UTC

Thank you, both, for writing. In the end, it's up to every team to settle on technical solutions that work for them, in that context. Likewise, it's up to each developer to identify methodology and tools that work for her or him, as long as it doesn't impact the rest of the team.

The reason I suggest curl over other alternatives is that not only is it free, it also tends to be ubiquitous. Most systems come with curl baked in - perhaps not a consumer installation of Windows, but if you have developer tools installed, it's highly likely that you have curl on your machine. It's a fundamental skill that may serve you well if you know it.

In addition to that, since curl is a CLI you can always script it if you need a kind of semi-automation. What prevents you from maintaining a collection of script files? They could even take command-line arguments, if you'd like.

That said, personally, if I realize that I need to maintain a collection of requests that I can re-execute whenever I want, I'd prefer writing a 'real' program. On the other hand, I find a tool like curl useful for ad-hoc testing.

2024-05-21 5:36 UTC
Johannes Egger #
... maintain a collection of requests that you can re-execute whevenever you want.

@Thomas Levesque: that sounds like a proper collection of automatically executable tests would be a better fit. But yeah, it's just easier to write those simple commands than to set up a test project - instant gratification 😉

2024-05-28 17:02 UTC

Conservative codomain conjecture

Monday, 06 May 2024 06:35:00 UTC

An API design heuristic.

For a while now, I've been wondering whether, in the language of Postel's law, one should favour being liberal in what one accepts over being conservative in what one sends. Yes, according to the design principle, a protocol or API should do both, but sometimes, you can't do that. Instead, you'll have to choose. I've recently reached the tentative conclusion that it may be a good idea favouring being conservative in what one sends.

Good API design explicitly considers contracts. What are the preconditions for invoking an operation? What are the postconditions? Are there any invariants? These questions are relevant far beyond object-oriented design. They are equally important in Functional Programming, as well as in service-oriented design.

If you have a type system at your disposal, you can often model pre- and postconditions as types. In practice, however, it frequently turns out that there's more than one way of doing that. You can model an additional precondition with an input type, but you can also model potential errors as a return type. Which option is best?

That's what this article is about, and my conjecture is that constraining the input type may be preferable, thus being conservative about what is returned.

An average example #

That's all quite abstract, so for the rest of this article, I'll discuss this kind of problem in the context of an example. We'll revisit the good old example of calculating an average value. This example, however, is only a placeholder for any kind of API design problem. This article is only superficially about designing an API for calculating an average. More generally, this is about API design. I like the average example because it's easy to follow, and it does exhibit some characteristics that you can hopefully extrapolate from.

In short, what is the contract of the following method?

public static TimeSpan Average(this IEnumerable<TimeSpantimeSpans)
{
    var sum = TimeSpan.Zero;
    var count = 0;
    foreach (var ts in timeSpans)
    {
        sum += ts;
        count++;
    }
    return sum / count;
}

What are the preconditions? What are the postconditions? Are there any invariants?

Before I answer these questions, I'll offer equivalent code in two other languages. Here it is in F#:

let average (timeSpans : TimeSpan seq) =
    timeSpans
    |> Seq.averageBy (_.Ticks >> double)
    |> int64
    |> TimeSpan.FromTicks

And in Haskell:

average :: (Fractional a, Foldable t) => t a -> a
average xs = sum xs / fromIntegral (length xs)

These three examples have somewhat different implementations, but the same externally observable behaviour. What is the contract?

It seems straightforward: If you input a sequence of values, you get the average of all of those values. Are there any preconditions? Yes, the sequence can't be empty. Given an empty sequence, all three implementations throw an exception. (The Haskell version is a little more nuanced than that, but given an empty list of NominalDiffTime, it does throw an exception.)

Any other preconditions? At least one more: The sequence must be finite. All three functions allow infinite streams as input, but if given one, they will fail to return an average.

Are there any postconditions? I can only think of a statement that relates to the preconditions: If the preconditions are fulfilled, the functions will return the correct average value (within the precision allowed by floating-point calculations).

All of this, however, is just warming up. We've been over this ground before.

Modelling contracts #

Keep in mind that this average function is just an example. Think of it as a stand-in for a procedure that's much more complicated. Think of the most complicated operation in your code base.

Not only do real code bases have many complicated operations. Each comes with its own contract, different from the other operations, and if the team isn't explicitly thinking in terms of contracts, these contracts may change over time, as the team adds new features and fixes bugs.

It's difficult work to keep track of all those contracts. As I argue in Code That Fits in Your Head, it helps if you can automate away some of that work. One way is having good test coverage. Another is to leverage a static type system, if you're fortunate enough to work in a language that has one. As I've also already covered, you can't replace all rules with types, but it doesn't mean that using the type system is ineffectual. Quite the contrary. Every part of a contract that you can offload to the type system frees up your brain to think about something else - something more important, hopefully.

Sometimes there's no good way to to model a precondition with a type, or perhaps it's just too awkward. At other times, there's really only a single way to address a concern. When it comes to the precondition that you can't pass an infinite sequence to the average function, change the type so that it takes some finite collection instead. That's not what this article is about, though.

Assuming that you've already dealt with the infinite-sequence issue, how do you address the other precondition?

Error-handling #

A typical object-oriented move is to introduce a Guard Clause:

public static TimeSpan Average(this IReadOnlyCollection<TimeSpantimeSpans)
{
    if (!timeSpans.Any())
        throw new ArgumentOutOfRangeException(
            nameof(timeSpans),
            "Can't calculate the average of an empty collection.");
 
    var sum = TimeSpan.Zero;
    foreach (var ts in timeSpans)
        sum += ts;
    return sum / timeSpans.Count;
}

You could do the same in F#:

let average (timeSpans : TimeSpan seq) =
    if Seq.isEmpty timeSpans then
        raise (
            ArgumentOutOfRangeException(
                nameof timeSpans,
                "Can't calculate the average of an empty collection."))
 
    timeSpans
    |> Seq.averageBy (_.Ticks >> double)
    |> int64
    |> TimeSpan.FromTicks

You could also replicate such behaviour in Haskell, but it'd be highly unidiomatic. Instead, I'd rather discuss one idiomatic solution in Haskell, and then back-port it.

While you can throw exceptions in Haskell, you typically handle predictable errors with a sum type. Here's a version of the Haskell function equivalent to the above C# code:

average :: (Foldable t, Fractional a) => t a -> Either String a
average xs =
  if null xs
    then Left "Can't calculate the average of an empty collection."
    else Right $ sum xs / fromIntegral (length xs)

For the readers that don't know the Haskell base library by heart, null is a predicate that checks whether or not a collection is empty. It has nothing to do with null pointers.

This variation returns an Either value. In practice you shouldn't just return a String as the error value, but rather a strongly-typed value that other code can deal with in a robust manner.

On the other hand, in this particular example, there's really only one error condition that the function is able to detect, so you often see a variation where instead of a single error message, such a function just doesn't return anything:

average :: (Foldable t, Fractional a) => t a -> Maybe a
average xs = if null xs then Nothing else Just $ sum xs / fromIntegral (length xs)

This iteration of the function returns a Maybe value, indicating that a return value may or may not be present.

Liberal domain #

We can back-port this design to F#, where I'd also consider it idiomatic:

let average (timeSpans : IReadOnlyCollection<TimeSpan>) =
    if timeSpans.Count = 0 then None else
        timeSpans
        |> Seq.averageBy (_.Ticks >> double)
        |> int64
        |> TimeSpan.FromTicks
        |> Some

This version returns a TimeSpan option rather than just a TimeSpan. While this may seem to put the burden of error-handling on the caller, nothing has really changed. The fundamental situation is the same. Now the function is just being more explicit (more honest, you could say) about the pre- and postconditions. The type system also now insists that you deal with the possibility of error, rather than just hoping that the problem doesn't occur.

In C# you can expand the codomain by returning a nullable TimeSpan value, but such an option may not always be available at the language level. Keep in mind that the Average method is just an example standing in for something that may be more complicated. If the original return type is a reference type rather than a value type, only recent versions of C# allows statically-checked nullable reference types. What if you're working in an older version of C#, or another language that doesn't have that feature?

In that case, you may need to introduce an explicit Maybe class and return that:

public static Maybe<TimeSpan> Average(this IReadOnlyCollection<TimeSpan> timeSpans)
{
    if (timeSpans.Count == 0)
        return new Maybe<TimeSpan>();
 
    var sum = TimeSpan.Zero;
    foreach (var ts in timeSpans)
        sum += ts;
    return new Maybe<TimeSpan>(sum / timeSpans.Count);
}

Two things are going on here; one is obvious while the other is more subtle. Clearly, all of these alternatives change the static type of the function in order to make the pre- and postconditions more explicit. So far, they've all been loosening the codomain (the return type). This suggests a connection with Postel's law: be conservative in what you send, be liberal in what you accept. These variations are all liberal in what they accept, but it seems that the API design pays the price by also having to widen the set of possible return values. In other words, such designs aren't conservative in what they send.

Do we have other options?

Conservative codomain #

Is it possible to instead design the API in such a way that it's conservative in what it returns? Ideally, we'd like it to guarantee that it returns a number. This is possible by making the preconditions even more explicit. I've also covered that alternative already, so I'm just going to repeat the C# code here without further comments:

public static TimeSpan Average(this NotEmptyCollection<TimeSpantimeSpans)
{
    var sum = timeSpans.Head;
    foreach (var ts in timeSpans.Tail)
        sum += ts;
    return sum / timeSpans.Count;
}

This variation promotes another precondition to a type. The precondition that the input collection mustn't be empty can be explicitly modelled with a type. This enables us to be conservative about the codomain. The method now guarantees that it will return a value.

This idea is also easily ported to F#:

type NonEmpty<'a> = { Head : 'a; Tail : IReadOnlyCollection<'a> }
 
let average (timeSpans : NonEmpty<TimeSpan>) =
    [ timeSpans.Head ] @ List.ofSeq timeSpans.Tail
    |> List.averageBy (_.Ticks >> double)
    |> int64
    |> TimeSpan.FromTicks

The average function now takes a NonEmpty collection as input, and always returns a proper TimeSpan value.

Haskell already comes with a built-in NonEmpty collection type, and while it oddly doesn't come with an average function, it's easy enough to write:

import qualified Data.List.NonEmpty as NE

average :: Fractional a => NE.NonEmpty a -> a
average xs = sum xs / fromIntegral (NE.length xs)

You can find a recent example of using a variation of that function here.

Choosing between the two alternatives #

While Postel's law recommends having liberal domains and conservative codomains, in the case of the average API, we can't have both. If we design the API with a liberal input type, the output type has to be liberal as well. If we design with a restrictive input type, the output can be guaranteed. In my experience, you'll often find yourself in such a conundrum. The average API examined in this article is just an example, while the problem occurs often.

Given such a choice, what should you choose? Is it even possible to give general guidance on this sort of problem?

For decades, I considered such a choice a toss-up. After all, these solutions seem to be equivalent. Perhaps even isomorphic?

When I recently began to explore this isomorphism more closely, it dawned on me that there's a small asymmetry in the isomorphism that favours the conservative codomain option.

Isomorphism #

An isomorphism is a two-way translation between two representations. You can go back and forth between the two alternatives without loss of information.

Is this possible with the two alternatives outlined above? For example, if you have the conservative version, can create the liberal alternative? Yes, you can:

average' :: Fractional a => [a] -> Maybe a
average' = fmap average . NE.nonEmpty

Not surprisingly, this is trivial in Haskell. If you have the conservative version, you can just map it over a more liberal input.

In F# it looks like this:

module NonEmpty =
    let tryOfSeq xs =
        if Seq.isEmpty xs then None
        else Some { Head = Seq.head xs; Tail = Seq.tail xs |> List.ofSeq }
 
let average' (timeSpans : IReadOnlyCollection<TimeSpan>) =
    NonEmpty.tryOfSeq timeSpans |> Option.map average

In C# we can create a liberal overload that calls the conservative method:

public static TimeSpan? Average(this IReadOnlyCollection<TimeSpan> timeSpans)
{
    if (timeSpans.Count == 0)
        return null;
 
    var arr = timeSpans.ToArray();
    return new NotEmptyCollection<TimeSpan>(arr[0], arr[1..]).Average();
}

Here I just used a Guard Clause and explicit construction of the NotEmptyCollection. I could also have added a NotEmptyCollection.TryCreate method, like in the F# and Haskell examples, but I chose the above slightly more imperative style in order to demonstrate that my point isn't tightly coupled to the concept of functors, mapping, and other Functional Programming trappings.

These examples highlight how you can trivially make a conservative API look like a liberal API. Is it possible to go the other way? Can you make a liberal API look like a conservative API?

Yes and no.

Consider the liberal Haskell version of average, shown above; that's the one that returns Maybe a. Can you make a conservative function based on that?

average' :: Fractional a => NE.NonEmpty a -> a
average' xs = fromJust $ average xs

Yes, this is possible, but only by resorting to the partial function fromJust. I'll explain why that is a problem once we've covered examples in the two other languages, such as F#:

let average' (timeSpans : NonEmpty<TimeSpan>) =
    [ timeSpans.Head ] @ List.ofSeq timeSpans.Tail |> average |> Option.get

In this variation, average is the liberal version shown above; the one that returns a TimeSpan option. In order to make a conservative version, the average' function can call the liberal average function, but has to resort to the partial function Option.get.

The same issue repeats a third time in C#:

public static TimeSpan Average(this NotEmptyCollection<TimeSpan> timeSpans)
{
    return timeSpans.ToList().Average().Value;
}

This time, the partial function is the unsafe Value property, which throws an InvalidOperationException if there's no value.

This even violates Microsoft's own design guidelines:

"AVOID throwing exceptions from property getters."

I've cited Cwalina and Abrams as the authors, since this rule can be found in my 2006 edition of Framework Design Guidelines. This isn't a new insight.

While the two alternatives are 'isomorphic enough' that we can translate both ways, the translations are asymmetric in the sense that one is safe, while the other has to resort to an inherently unsafe operation to make it work.

Encapsulation #

I've called the operations fromJust, Option.get, and Value partial, and only just now used the word unsafe. You may protest that neither of the three examples are unsafe in practice, since we know that the input is never empty. Thus, we know that the liberal function will always return a value, and therefore it's safe to call a partial function, even though these operations are unsafe in the general case.

While that's true, consider how the burden shifts. When you want to promote a conservative variant to a liberal variant, you can rely on all the operations being total. On the other hand, if you want to make a liberal variant look conservative, the onus is on you. None of the three type systems on display here can perform that analysis for you.

This may not be so bad when the example is as simple as taking the average of a collection of numbers, but does it scale? What if the operation you're invoking is much more complicated? Can you still be sure that you safely invoke a partial function on the return value?

As Code That Fits in Your Head argues, procedures quickly become so complicated that they no longer fit in your head. If you don't have well-described and patrolled contracts, you don't know what the postconditions are. You can't trust the return values from method calls, or even the state of the objects you passed as arguments. This tend to lead to defensive coding, where you write code that checks the state of everything all too often.

The remedy is, as always, good old encapsulation. In this case, check the preconditions at the beginning, and capture the result of that check in an object or type that is guaranteed to be always valid. This goes beyond making illegal states unrepresentable because it also works with predicative types. Once you're past the Guard Clauses, you don't have to check the preconditions again.

This kind of thinking illustrates why you need a multidimensional view on API design. As useful as Postel's law sometimes is, it doesn't address all problems. In fact, it turned out to be unhelpful in this context, while another perspective proves more fruitful. Encapsulation is the art and craft of designing APIs in such a way that they suggest or even compels correct interactions. The more I think of this, the more it strikes me that a ranking is implied: Preconditions are more important than postconditions, because if the preconditions are unfulfilled, you can't trust the postconditions, either.

Mapping #

What's going on here? One perspective is to view types as sets. In the average example, the function maps from one set to another:

Mapping from the set of collections to the set of real numbers.

Which sets are they? We can think of the average function as a mapping from the set of non-empty collections of numbers to the set of real numbers. In programming, we can't represent real numbers, so instead, the left set is going to be the set of all the non-empty collections the computer or the language can represent and hold in (virtual) memory, and the right-hand set is the set of all the possible numbers of whichever type you'd like (32-bit signed integers, 64-bit floating-point numbers, 8-bit unsigned integers, etc.).

In reality, the left-hand set is much larger than the set to the right.

Drawing all those arrows quickly becomes awkward , so instead, we may draw each mapping as a pipe. Such a pipe also corresponds to a function. Here's an intermediate step in such a representation:

Mapping from one set to the other, drawn inside a transparent pipe.

One common element is, however, missing from the left set. Which one?

Pipes #

The above mapping corresponds to the conservative variation of the function. It's a total function that maps all values in the domain to a value in the codomain. It accomplishes this trick by explicitly constraining the domain to only those elements on which it's defined. Due to the preconditions, that excludes the empty collection, which is therefore absent from the left set.

What if we also want to allow the empty collection to be a valid input?

Unless we find ourselves in some special context where it makes sense to define a 'default average value', we can't map an empty collection to any meaningful number. Rather, we'll have to map it to some special value, such as Nothing, None, or null:

Mapping the empty collection to null in a pipe separate, but on top of, the proper function pipe.

This extra pipe is free, because it's supplied by the Maybe functor's mapping (Select, map, fmap).

What happens if we need to go the other way? If the function is the liberal variant that also maps the empty collection to a special element that indicates a missing value?

Mapping all collections, including the empty collection, to the set of real numbers.

In this case, it's much harder to disentangle the mappings. If you imagine that a liquid flows through the pipes, we can try to be careful and avoid 'filling up' the pipe.

Pipe partially filled with liquid.

The liquid represents the data that we do want to transmit through the pipe. As this illustration suggests, we now have to be careful that nothing goes wrong. In order to catch just the right outputs on the right side, you need to know how high the liquid may go, and attach a an 'flat-top' pipe to it:

Pipe composed with open-top pipe.

As this illustration tries to get across, this kind of composition is awkward and error-prone. What's worse is that you need to know how high the liquid is going to get on the right side. This depends on what actually goes on inside the pipe, and what kind of input goes into the left-hand side.

This is a metaphor. The longer the pipe is, the more difficult it gets to keep track of that knowledge. The stubby little pipe in these illustrations may correspond to the average function, which is an operation that easily fits in our heads. It's not too hard to keep track of the preconditions, and how they map to postconditions.

Thus, turning such a small liberal function into a conservative function is possible, but already awkward. If the operation is complicated, you can no longer keep track of all the details of how the inputs relate to the outputs.

Additive extensibility #

This really shouldn't surprise us. Most programming languages come with all sorts of facilities that enable extensibility: The ability to add more functionality, more behaviour, more capabilities, to existing building blocks. Conversely, few languages come with removability facilities. You can't, commonly, declare that an object is an instance of a class, except one method, or that a function is just like another function, except that it doesn't accept a particular subset of input.

This explains why we can safely make a conservative function liberal, but why it's difficult to make a liberal function conservative. This is because making a conservative function liberal adds functionality, while making a liberal function conservative attempts to remove functionality.

Conjecture #

All this leads me to the following conjecture: When faced with a choice between two versions of an API, where one has a liberal domain, and the other a conservative codomain, choose the design with the conservative codomain.

If you need the liberal version, you can create it from the conservative operation. The converse need not be true.

Conclusion #

Postel's law encourages us to be liberal with what we accept, but conservative with what we return. This is a good design heuristic, but sometimes you're faced with mutually exclusive alternatives. If you're liberal with what you accept, you'll also need to be too loose with what you return, because there are input values that you can't handle. On the other hand, sometimes the only way to be conservative with the output is to also be restrictive when it comes to input.

Given two such alternatives, which one should you choose?

This article conjectures that you should choose the conservative alternative. This isn't a political statement, but simply a result of the conservative design being the smaller building block. From a small building block, you can compose something bigger, whereas from a bigger unit, you can't easily extract something smaller that's still robust and useful.


Service compatibility is determined based on policy

Monday, 29 April 2024 11:12:00 UTC

A reading of the fourth Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the fourth tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Service compatibility is determined based on policy #

The fourth tenet is the forgotten one. I could rarely remember exactly what it included, but it does give me an opportunity to bring up a few points about compatibility. The articles said:

Object-oriented designs often confuse structural compatibility with semantic compatibility. Service-orientation deals with these two axes separately. Structural compatibility is based on contract and schema and can be validated (if not enforced) by machine-based techniques (such as packet-sniffing, validating firewalls). Semantic compatibility is based on explicit statements of capabilities and requirements in the form of policy.

Every service advertises its capabilities and requirements in the form of a machine-readable policy expression. Policy expressions indicate which conditions and guarantees (called assertions) must hold true to enable the normal operation of the service. Policy assertions are identified by a stable and globally unique name whose meaning is consistent in time and space no matter which service the assertion is applied to. Policy assertions may also have parameters that qualify the exact interpretation of the assertion. Individual policy assertions are opaque to the system at large, which enables implementations to apply simple propositional logic to determine service compatibility.

As you can tell, this description is the shortest of the four. This is also the point where I begin to suspect that my reading of the third tenet may deviate from what Don Box originally had in mind.

This tenet is also the most baffling to me. As I understand it, the motivation behind the four tenets was to describe assumptions about the kind of systems that people would develop with Windows Communication Foundation (WCF), or SOAP in general.

While I worked with WCF for a decade, the above description doesn't ring a bell. Reading it now, the description of policy sounds more like a system such as clojure.spec, although that's not something I know much about either. I don't recall WCF ever having a machine-readable policy subsystem, and if it had, I never encountered it.

It does seem, however, as though what I interpret as contract, Don Box called policy.

Despite my confusion, the word compatibility is worth discussing, regardless of whether that was what Don Box meant. A well-designed service is one where you've explicitly considered forwards and backwards compatibility.

Versioning #

Planning for forwards and backwards compatibility does not imply that you're expected to be able to predict the future. It's fine if you have so much experience developing and maintaining online systems that you may have enough foresight to plan for certain likely changes that you may have to make in the future, but that's not what I have in mind.

Rather, what you should do is to have a system that enables you to detect breaking changes before you deploy them. Furthermore you should have a strategy for how to deal with the perceived necessity to introduce breaking changes.

The most effective strategy that I know of is to employ explicit versioning, particularly message versioning. You can version an entire service as one indivisible block, but I often find it more useful to version at the message level. If you're designing a REST API, for example, you can take advantage of Content Negotiation.

If you like, you can use Semantic Versioning as a versioning scheme, but for services, the thing that mostly matters is the major version. Thus, you may simply label your messages with the version numbers 1, 2, etc.

If you already have a published service without explicit message version information, then you can still retrofit versioning afterwards. Imagine that your existing data looks like this:

{
  "singleTable": {
    "capacity": 16,
    "minimalReservation": 10
  }
}

This JSON document has no explicit version information, but you can interpret that as implying that the document has the 'default' version, which is always 1:

{
  "singleTable": {
    "version": 1,
    "capacity": 16,
    "minimalReservation": 10
  }
}

If you later realize that you need to make a breaking change, you can do that by increasing the (major) version:

{
  "singleTable": {
    "version": 2,
    "id": 12,
    "capacity": 16,
    "minimalReservation": 10
  }
}

Recipients can now look for the version property to learn how to interpret the rest of the message, and failing to find it, infer that this is version 1.

As Don Box wrote, in a service-oriented system, you can't just update all systems in a single coordinated release. Therefore, you must never break compatibility. Versioning enables you to move forward in a way that does break with the past, but without breaking existing clients.

Ultimately, you may attempt to retire old service versions, but be ready to keep them around for a long time.

For more of my thoughts about backwards compatibility, see Backwards compatibility as a profunctor.

Conclusion #

The fourth tenet is the most nebulous, and I wonder if it was ever implemented. If it was, I'm not aware of it. Even so, compatibility is an important component of service design, so I took the opportunity to write about that. In most cases, it pays to think explicitly about message versioning.

I have the impression that Don Box had something in mind more akin to what I call contract. Whether you call it one thing or another, it stands to reason that you often need to attach extra rules to simple types. The schema may define an input value as a number, but the service does require that this particular number is a natural number. Or that a string is really a proper encoding of a date. Perhaps you call that policy. I call it contract. In any case, clearly communicating such expectations is important for systems to be compatible.


Fitting a polynomial to a set of points

Monday, 22 April 2024 05:35:00 UTC

The story of a fiasco.

This is the second in a small series of articles titled Trying to fit the hype cycle. In the introduction, I've described the exercise I had in mind: Determining a formula, or at least a piecewise function, for the Gartner hype cycle. This, to be clear, is an entirely frivolous exercise with little practical application.

In the previous article, I extracted a set of (x, y) coordinates from a bitmap. In this article, I'll showcase my failed attempt at fitting the data to a polynomial.

Failure #

I've already revealed that I failed to accomplish what I set out to do. Why should you read on, then?

You don't have to, and I can't predict the many reasons my readers have for occasionally swinging by. Therefore, I can't tell you why you should keep reading, but I can tell you why I'm writing this article.

This blog is a mix of articles that I write because readers ask me interesting questions, and partly, it's my personal research-and-development log. In that mode, I write about things that I've learned, and I write in order to learn. One can learn from failure as well as from success.

I'm not that connected to 'the' research community (if such a thing exists), but I'm getting the sense that there's a general tendency in academia that researchers rarely publish their negative results. This could be a problem, because this means that the rest of us never learn about the thousands of ways that don't work.

Additionally, in 'the' programming community, we also tend to boast our victories and hide our failures. More than one podcast (sorry about the weasel words, but I don't remember which ones) have discussed how this gives young programmers the wrong impression of what programming is like. It is, indeed, a process of much trial and error, but usually, we only publish our polished, final result.

Well, I did manage to produce code to fit a polynomial to the Gartner hype cycle, but I never managed to get a good fit.

The big picture #

I realize that I have a habit of burying the lede when I write technical articles. I don't know if I've picked up that tendency from F#, which does demand that you define a value or function before you can use it. This, by the way, is a good feature.

Here, I'll try to do it the other way around, and start with the big picture:

data = numpy.loadtxt('coords.txt', delimiter=',')
 
x = data[:, 0]
t = data[:, 1]
w = fit_polynomial(x, t, 9)
 
plot_fit(x, t, w)

This, by the way, is a Python script, and it opens with these imports:

import numpy
import matplotlib.pyplot as plt

The first line of code reads the CSV file into the data variable. The first column in that file contains all the x values, and the second column the y values. The book that I've been following uses t for the data, rather than y. (Now that I think about it, I believe that this may only be because it works from an example in which the data to be fitted are 100 m dash times, denoted t.)

Once the script has extracted the data, it calls the fit_polynomial function to produce a set of weights w. The constant 9 is the degree of polynomial to fit, although I think that I've made an off-by-one error so that the result is only a eighth-degree polynomial.

Finally, the code plots the original data together with the polynomial:

Gartner hype cycle and a eighth-degree fitted polynomial.

The green dots are the (x, y) coordinates that I extracted in the previous article, while the red curve is the fitted eighth-degree polynomial. Even though we're definitely in the realm of over-fitting, it doesn't reproduce the Gartner hype cycle.

I've even arrived at the value 9 after some trial and error. After all, I wasn't trying to do any real science here, so over-fitting is definitely allowed. Even so, 9 seems to be the best fit I can achieve. With lover values, like 8, below, the curve deviates too much:

Gartner hype cycle and a seventh-degree fitted polynomial.

The value 10 looks much like 9, but above that (11), the curve completely disconnects from the data, it seems:

Gartner hype cycle and a tenth-degree fitted polynomial.

I'm not sure why it does this, to be honest. I would have thought that the more degrees you added, the more (over-)fitted the curve would be. Apparently, this is not so, or perhaps I made a mistake in my code.

Calculating the weights #

The fit_polynomial function calculates the polynomial coefficients using a linear algebra formula that I've found in at least two text books. Numpy makes it easy to invert, transpose, and multiply matrices, so the formula itself is just a one-liner. Here it is in the entire context of the function, though:

def fit_polynomial(x, t, degree):
    """
    Fits a polynomial to the given data.
 
    Parameters
    ----------
    x : Array of shape [n_samples]
    t : Array of shape [n_samples]
    degree : degree of the polynomial
 
    Returns
    -------
    w : Array of shape [degree + 1]
    """
 
    # This expansion creates a matrix, so we name that with an upper-case letter
    # rather than a lower-case letter, which is used for vectors.
    X = expand(x.reshape((len(x), 1)), degree)
    return numpy.linalg.inv(X.T @ X) @ X.T @ t

This may look daunting, but is really just two lines of code. The rest is docstring and a comment.

The above-mentioned formula is the last line of code. The one before that expands the input data t from a simple one-dimensional array to a matrix of those values squared, cubed, etc. That's how you use the least squares method if you want to fit it to a polynomial of arbitrary degree.

Expansion #

The expand function looks like this:

def expand(x, degree):
    """
    Expands the given array to polynomial elements of the given degree.
 
    Parameters
    ----------
    x : Array of shape [n_samples, 1]
    degree : degree of the polynomial
 
    Returns
    -------
    Xp : Array of shape [n_samples, degree + 1]
    """
 
    Xp = numpy.ones((len(x), 1))
    for i in range(1, degree + 1):
        Xp = numpy.hstack((Xp, numpy.power(x, i)))
    return Xp

The function begins by creating a column vector of ones, here illustrated with only three rows:

>>> Xp = numpy.ones((3, 1))
>>> Xp
array([[1.],
       [1.],
       [1.]])

It then proceeds to loop over as many degrees as you've asked it to, each time adding a column to the Xp matrix. Here's an example of doing that up to a power of three, on example input [1,2,3]:

>>> x = numpy.array([1,2,3]).reshape((3, 1))
>>> x
array([[1],
       [2],
       [3]])
>>> Xp = numpy.hstack((Xp, numpy.power(x, 1)))
>>> Xp
array([[1., 1.],
       [1., 2.],
       [1., 3.]])
>>> Xp = numpy.hstack((Xp, numpy.power(x, 2))) 
>>> Xp
array([[1., 1., 1.],
       [1., 2., 4.],
       [1., 3., 9.]])
>>> Xp = numpy.hstack((Xp, numpy.power(x, 3))) 
>>> Xp
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  2.,  4.,  8.],
       [ 1.,  3.,  9., 27.]])

Once it's done looping, the expand function returns the resulting Xp matrix.

Plotting #

Finally, here's the plot_fit procedure:

def plot_fit(x, t, w):
    """
    Plots the polynomial with the given weights and the data.
 
    Parameters
    ----------
    x : Array of shape [n_samples]
    t : Array of shape [n_samples]
    w : Array of shape [degree + 1]
    """
 
    xs = numpy.linspace(x[0], x[0]+len(x), 100)
    ys = numpy.polyval(w[::-1], xs)
 
    plt.plot(xs, ys, 'r')
    plt.scatter(x, t, s=10, c='g')
    plt.show()

This is fairly standard pyplot code, so I don't have much to say about it.

Conclusion #

When I started this exercise, I'd hoped that I could get close to the Gartner hype cycle by over-fitting the model to some ridiculous polynomial degree. This turned out not to be the case, for reasons that I don't fully understand. As I increase the degree, the curve begins to deviate from the data.

I can't say that I'm a data scientist or a statistician of any skill, so it's possible that my understanding is still too shallow. Perhaps I'll return to this article later and marvel at the ineptitude on display here.


Comments

I suspect that increasing the degree wound up backfiring by effectively putting too much weight on the right side, whose flatness clashed with the increasingly steep powers you were trying to mix in. A vertically offset damped sinusoid might make a better starting point for modeling, though identifying its parameters wouldn't be quite as straightforward. One additional wrinkle there is that you want to level fully off after the valley; you could perhaps make that happen by plugging a scaled arctangent or something along those lines into the sinusoid.

Incidentally, a neighboring post in my feed reader was about a new release of an open-source data analysis and curve fitting program (QSoas) that might help if you don't want to take such a DIY approach.

2024-05-16 02:37 UTC

Aaron, thank you for writing. In retrospect, it becomes increasingly clear to me why this doesn't work. This highlights, I think, why it's a good idea to sometimes do stupid exercises like this one. You learn something from it, even when you fail.

2024-05-22 6:15 UTC

Services share schema and contract, not class

Monday, 15 April 2024 07:25:00 UTC

A reading of the third Don Box tenet, with some commentary.

This article is part of a series titled The four tenets of SOA revisited. In each of these articles, I'll pull one of Don Box's four tenets of service-oriented architecture (SOA) out of the original MSDN Magazine article and add some of my own commentary. If you're curious why I do that, I cover that in the introductory article.

In this article, I'll go over the third tenet, quoting from the MSDN Magazine article unless otherwise indicated.

Services share schema and contract, not class #

Compared to the second tenet, the following description may at first seem more dated. Here's what the article said:

Object-oriented programming encourages developers to create new abstractions in the form of classes. Most modern development environments not only make it trivial to define new classes, modern IDEs do a better job guiding you through the development process as the number of classes increases (as features like IntelliSense® provide a more specific list of options for a given scenario).

Classes are convenient abstractions as they share both structure and behavior in a single named unit. Service-oriented development has no such construct. Rather, services interact based solely on schemas (for structures) and contracts (for behaviors). Every service advertises a contract that describes the structure of messages it can send and/or receive as well as some degree of ordering constraints over those messages. This strict separation between structure and behavior vastly simplifies deployment, as distributed object concepts such as marshal-by-value require a common execution and security environment which is in direct conflict with the goals of autonomous computing.

Services do not deal in types or classes per se; rather, only with machine readable and verifiable descriptions of the legal "ins and outs" the service supports. The emphasis on machine verifiability and validation is important given the inherently distributed nature of how a service-oriented application is developed and deployed. Unlike a traditional class library, a service must be exceedingly careful about validating the input data that arrives in each message. Basing the architecture on machine-validatible schema and contract gives both developers and infrastructure the hints they need to protect the integrity of an individual service as well as the overall application as a whole.

Because the contract and schema for a given service are visible over broad ranges of both space and time, service-orientation requires that contracts and schema remain stable over time. In the general case, it is impossible to propagate changes in schema and/or contract to all parties who have ever encountered a service. For that reason, the contract and schema used in service-oriented designs tend to have more flexibility than traditional object-oriented interfaces. It is common for services to use features such as XML element wildcards (like xsd:any) and optional SOAP header blocks to evolve a service in ways that do not break already deployed code.

With its explicit discussion of XML, SOAP, and XSD, this description may seem more stuck in 2004 than the two first tenets.

I'll cover the most obvious consequence first.

At the boundaries... #

In the MSDN article, the four tenets guide the design of Windows Communication Foundation (WCF) - a technology that in 2004 was under development, but still not completed. While SOAP already existed as a platform-independent protocol, WCF was a .NET endeavour. Most developers using the Microsoft platform at the time were used to some sort of binary protocol, such as DCOM or .NET Remoting. Thus, it makes sense that Don Box was deliberately explicit that this was not how SOA (or WCF) was supposed to work.

In fact, since SOAP is platform-independent, you could write a web service in one language (say, Java) and consume it with a different language (e.g. C++). WCF was Microsoft's SOAP technology for .NET.

If you squint enough that you don't see the explicit references to XML or SOAP, however, the description still applies. Today, you may exchange data with JSON over REST, Protocol Buffers via gRPC, or something else, but it's still common to have a communications protocol that is independent of specific service implementations. A service may be written in Python, Haskell, C, or any other language that supports the wire format. As this little list suggests, the implementation language doesn't even have to be object-oriented.

In fact,

A formal interface definition language (IDL) may enable you to automate serialization and deserialization, but these are usually constrained to defining the shape of data and operations. Don Box talks about validation, and types don't replace validation - particularly if you allow xsd:any. That particular remark is quite at odds with the notion that a formal schema definition is necessary, or even desirable.

And indeed, today we often see JSON-based REST APIs that are more loosely defined. Even so, the absence of a machine-readable IDL doesn't entail the absence of a schema. As Alexis King wrote related to the static-versus-dynamic-types debate, dynamic type systems are not inherently more open. A similar argument can be made about schema. Regardless of whether or not a formal specification exists, a service always has a de-facto schema.

To be honest, though, when I try to interpret what this and the next tenet seem to imply, an IDL may have been all that Don Box had in mind. By schema he may only have meant XSD, and by contract, he may only have meant SOAP. More broadly speaking, this notion of contract may entail nothing more than a list of named operations, and references to schemas that indicate what input each operation takes, and what output it returns.

What I have in mind with the rest of this article may be quite an embellishment on that notion. In fact, my usual interpretation of the word contract may be more aligned with what Don Box calls policy. Thus, if you want a very literal reading of the four tenets, what comes next may fit better with the fourth tenet, that service compatibility is determined based on policy.

Regardless of whether you think that the following discussion belongs here, or in the next article, I'll assert that it's paramount to designing and developing useful and maintainable web services.

Encapsulation #

If we, once more, ignore the particulars related to SOAP and XML, we may rephrase the notion of schema and contract as follows. Schema describes the shape of data: Is it a number, a string, a tuple, or a combination of these? Is there only one, or several? Is the data composed from smaller such definitions? Does the composition describe the combination of several such definitions, or does it describe mutually exclusive alternatives?

Compliant data may be encoded as objects or data structures in memory, or serialized to JSON, XML, CSV, byte streams, etc. We may choose to call a particular agglomeration of data a message, which we may pass from one system to another. The first tenet already used this metaphor.

You can't, however, just pass arbitrary valid messages from one system to another. Certain operations allow certain data, and may promise to return other kinds of messages. In additions to the schema, we also need to describe a contract.

What's a contract? If you consult Object-Oriented Software Construction, a contract stipulates invariants, pre- and postconditions for various operations.

Preconditions state what must be true before an operation can take place. This often puts the responsibility on the caller to ensure that the system is in an appropriate state, and that the message that it intends to pass to the other system is valid according to that state.

Postconditions, on the other hand, detail what the caller can expect in return. This includes guarantees about response messages, but may also describe the posterior state of the system.

Invariants, finally, outline what is always true about the system.

Although such a description of a contract originates from a book about object-oriented design, it's useful in other areas, too, such as functional programming. It strikes me that it applies equally well in the context of service-orientation.

The combination of contract and well-described message structure is, in other words, encapsulation. There's nothing wrong with that: It works. If you actually apply it as a design principle, that is.

Conclusion #

The third SOA tenet emphasizes that only data travels over service boundaries. In order to communicate effectively, services must agree on the shape of data, and which operations are legal when. While they exchange data, however, they don't share address space, or even internal representation.

One service may be written in F# and the client in Clojure. Even so, it's important that they have a shared understanding of what is possible, and what is not. The more explicit you, as a service owner, can be, the better.

Next: Service compatibility is determined based on policy.


Extracting curve coordinates from a bitmap

Monday, 08 April 2024 05:32:00 UTC

Another example of using Haskell as an ad-hoc scripting language.

This article is part of a short series titled Trying to fit the hype cycle. In the first article, I outlined what it is that I'm trying to do. In this article, I'll describe how I extract a set of x and y coordinates from this bitmap:

Gartner hype cycle.

(Actually, this is scaled-down version of the image. The file I work with is a bit larger.)

As I already mentioned in the previous article, these days there are online tools for just about everything. Most likely, there's also an online tool that will take a bitmap like that and return a set of (x, y) coordinates.

Since I'm doing this for the programming exercise, I'm not interested in that. Rather, I'd like to write a little Haskell script to do it for me.

Module and imports #

Yes, I wrote Haskell script. As I've described before, with good type inference, a statically typed language can be as good for scripting as a dynamic one. Just as might be the case with, say, a Python script, you'll be iterating, trying things out until finally the script settles into its final form. What I present here is the result of my exercise. You should imagine that I made lots of mistakes underway, tried things that didn't work, commented out code and print statements, imported modules I eventually didn't need, etc. Just like I imagine you'd also do with a script in a dynamically typed language. At least, that's how I write Python, when I'm forced to do that.

In other words, the following is far from the result of perfect foresight, but rather the equilibrium into which the script settled.

I named the module HypeCoords, because the purpose of it is to extract the (x, y) coordinates from the above Gartner hype cycle image. These are the imports it turned out that I ultimately needed:

module HypeCoords where
 
import qualified Data.List.NonEmpty as NE
import Data.List.NonEmpty (NonEmpty((:|)))
import Codec.Picture
import Codec.Picture.Types

The Codec.Picture modules come from the JuicyPixels package. This is what enables me to read a .png file and extract the pixels.

Black and white #

If you look at the above bitmap, you may notice that it has some vertical lines in a lighter grey than the curve itself. My first task, then, is to get rid of those. The easiest way to do that is to convert the image to a black-and-white bitmap, with no grey scale.

Since this is a one-off exercise, I could easily do that with a bitmap editor, but on the other hand, I thought that this was a good first task to give myself. After all, I didn't know the JuicyPixels library at all, so this was an opportunity to start with a task just a notch simpler than the one that was my actual goal.

I thought that the easiest way to convert to a black-and-white image would be to turn all pixels white if they are lighter than some threshold, and black otherwise.

A PNG file has more information than I need, so I first converted the image to an 8-bit RGB bitmap. Even though the above image looks as though it's entirely grey scale, each pixel is actually composed of three colours. In order to compare a pixel with a threshold, I needed a single measure of how light or dark it is.

That turned out to be about as simple as it sounds: Just take the average of the three colours. Later, I'd need a function to compute the average for another reason, so I made it a reusable function:

average :: Integral a => NE.NonEmpty a -> a
average nel = sum nel `div` fromIntegral (NE.length nel)

It's a bit odd that the Haskell base library doesn't come with such a function (at least to my knowledge), but anyway, this one is specialized to do integer division. Notice that this function computes only non-exceptional averages, since it requires the input to be a NonEmpty list. No division-by-zero errors here, please!

Once I'd computed a pixel average and compared it to a threshold value, I wanted to replace it with either black or white. In order to make the code more readable I defined two named constants:

black :: PixelRGB8
black = PixelRGB8 minBound minBound minBound
white :: PixelRGB8
white = PixelRGB8 maxBound maxBound maxBound

With that in place, converting to black-and-white is only a few more lines of code:

toBW :: PixelRGB8 -> PixelRGB8
toBW (PixelRGB8 r g b) =
  let threshold = 192 :: Integer
      lum = average (fromIntegral r :| [fromIntegral g, fromIntegral b])
  in if lum <= threshold then black else white

I arrived at the threshold of 192 after a bit of trial-and-error. That's dark enough that the light vertical lines fall to the white side, while the real curve becomes black.

What remained was to glue the parts together to save the black-and-white file:

main :: IO ()
main = do
  readResult <- readImage "hype-cycle-cleaned.png"
  case readResult of
    Left msg -> putStrLn msg
    Right img -> do
      let bwImg = pixelMap toBW $ convertRGB8 img
      writePng "hype-cycle-bw.png" bwImg

The convertRGB8 function comes from JuicyPixels.

The hype-cycle-bw.png picture unsurprisingly looks like this:

Black-and-white Gartner hype cycle.

Ultimately, I didn't need the black-and-white bitmap file. I just wrote the script to create the file in order to be able to get some insights into what I was doing. Trust me, I made a lot of stupid mistakes along the way, and among other issues had some 'fun' with integer overflows.

Extracting image coordinates #

Now I had a general feel for how to work with the JuicyPixels library. It still required quite a bit of spelunking through the documentation before I found a useful API to extract all the pixels from a bitmap:

pixelCoordinates :: Pixel a => Image a -> [((IntInt), a)]
pixelCoordinates = pixelFold (\acc x y px -> ((x,y),px):acc) []

While this is, after all, just a one-liner, I'm surprised that something like this doesn't come in the box. It returns a list of tuples, where the first element contains the pixel coordinates (another tuple), and the second element the pixel information (e.g. the RGB value).

One y value per x value #

There were a few more issues to be addressed. The black curve in the black-and-white bitmap is thicker than a single pixel. This means that for each x value, there will be several black pixels. In order to do linear regression, however, we need a single y value per x value.

One easy way to address that concern is to calculate the average y value for each x value. This may not always be the best choice, but as far as we can see in the above black-and-white image, it doesn't look as though there's any noise left in the picture. This means that we don't have to worry about outliers pulling the average value away from the curve. In other words, finding the average y value is an easy way to get what we need.

averageY :: Integral b => NonEmpty (a, b) -> (a, b)
averageY nel = (fst $ NE.head nel, average $ snd <$> nel)

The averageY function converts a NonEmpty list of tuples to a single tuple. Watch out! The input tuples are not the 'outer' tuples that pixelCoordinates returns, but rather a list of actual pixel coordinates. Each tuple is a set of coordinates, but since the function never manipulates the x coordinate, the type of the first element is just unconstrained a. It can literally be anything, but will, in practice, be an integer.

The assumption is that the input is a small list of coordinates that all share the same x coordinate, such as (42, 99) :| [(42, 100), (42, 102)]. The function simply returns a single tuple that it creates on the fly. For the first element of the return tuple, it picks the head tuple from the input ((42, 99) in the example), and then that tuple's fst element (42). For the second element, the function averages all the snd elements (99, 100, and 102) to get 100 (integer division, you may recall):

ghci> averageY ((42, 99) :| [(42, 100), (42, 102)])
(42,100)

What remains is to glue together the building blocks.

Extracting curve coordinates #

A few more steps were required, but these I just composed in situ. I found no need to define them as individual functions.

The final composition looks like this:

main :: IO ()
main = do
  readResult <- readImage "hype-cycle-cleaned.png"
  case readResult of
    Left msg -> putStrLn msg
    Right img -> do
      let bwImg = pixelMap toBW $ convertRGB8 img
      let blackPixels =
            fst <$> filter ((black ==) . snd) (pixelCoordinates bwImg)
      let h = imageHeight bwImg
      let lineCoords = fmap (h -) . averageY <$> NE.groupAllWith fst blackPixels
      writeFile "coords.txt" $
        unlines $ (\(x,y) -> show x ++ "," ++ show y) <$> lineCoords

The first lines of code, until and including let bwImg, are identical to what you've already seen.

We're only interested in the black pixels, so the main action uses the standard filter function to keep only those that are equal to the black constant value. Once the white pixels are gone, we no longer need the pixel information. The expression that defines the blackPixels value finally (remember, you read Haskell code from right to left) throws away the pixel information by only retaining the fst element. That's the tuple that contains the coordinates. You may want to refer back to the type signature of pixelCoordinates to see what I mean.

The blackPixels value has the type [(Int, Int)].

Two more things need to happen. One is to group the pixels together per x value so that we can use averageY. The other is that we want the coordinates as normal Cartesian coordinates, and right now, they're in screen coordinates.

When working with bitmaps, it's quite common that pixels are measured out from the top left corner, instead of from the bottom left corner. It's not difficult to flip the coordinates, but we need to know the height of the image:

let h = imageHeight bwImg

The imageHeight function is another JuicyPixels function.

Because I sometimes get carried away, I write the code in a 'nice' compact style that could be more readable. I accomplished both of the above remaining tasks with a single line of code:

let lineCoords = fmap (h -) . averageY <$> NE.groupAllWith fst blackPixels

This first groups the coordinates according to x value, so that all coordinates that share an x value are collected in a single NonEmpty list. This means that we can map all of those groups over averageY. Finally, the expression flips from screen coordinates to Cartesian coordinates by subtracting the y coordinate from the height h.

The final writeFile expression writes the coordinates to a text file as comma-separated values. The first ten lines of that file looks like this:

9,13
10,13
11,13
12,14
13,15
14,15
15,16
16,17
17,17
18,18
...

Do these points plot the Gartner hype cycle?

Sanity checking by plotting the coordinates #

To check whether the coordinates look useful, we could plot them. If I wanted to use a few more hours, I could probably figure out how to do that with JuicyPixels as well, but on the other hand, I already know how to do that with Python:

data = numpy.loadtxt('coords.txt', delimiter=',')
x = data[:, 0]
t = data[:, 1]
plt.scatter(x, t, s=10, c='g')
plt.show()

That produces this plot:

Coordinates plotted with Python.

LGTM.

Conclusion #

In this article, you've seen how a single Haskell script can extract curve coordinates from a bitmap. The file is 41 lines all in all, including module declaration and white space. This article shows every single line in that file, apart from some blank lines.

I loaded the file into GHCi and ran the main action in order to produce the CSV file.

I did spend a few hours looking around in the JuicyPixels documentation before I'd identified the functions that I needed. All in all I used some hours on this exercise. I didn't keep track of time, but I guess that I used more than three, but probably fewer than six, hours on this.

This was the successful part of the overall exercise. Now onto the fiasco.

Next: Fitting a polynomial to a set of points.


Page 1 of 74

"Our team wholeheartedly endorses Mark. His expert service provides tremendous value."
Hire me!