Beating another dead horse.

Although I live in Copenhagen and mostly walk or ride my bicycle in order to get around town, I do own an old car for getting around the rest of the country. In Denmark, cars go through mandatory official inspection every other year, and I've been through a few of these in my life. A few years ago, the mechanic doing the inspection informed me that my car's chassis number was incorrect.

This did make me a bit nervous, because I'd bought the car used, and I was suddenly concerned that things weren't really as I thought. Had I unwittingly bought a stolen car?

But the mechanic just walked over to his computer in order to correct the error. That's when a different kind of unease hit me. When you've programmed for some decades, you learn to foresee various typical failure modes. Since a chassis number is an obvious candidate for a natural key, I immediately predicted that changing the number would either prove impossible or trigger all sorts of cascading effects, ultimately ending with the official records no longer recognizing the car as mine.

As it turned out, though, whoever made that piece of software knew what they were doing, because the mechanic just changed the chassis number, and that was that. That was five or six years ago; I still own the same car, and I've never had any problems with the official ownership records.

Uniqueness #

The reason I related this story is that I'm currently following an undergraduate course in databases and information systems. Since this course is aimed at students with no real-world experience, it wisely moves forward in a pedagogical progression. In order to teach database keys, it starts with natural keys. From a didactic perspective, this makes sense, but the result, so far, is that the young people I work with now propose database designs with natural keys.

I'm not blaming anyone. You have to learn to crawl before you can walk.

Still, this situation made me reflect on the following question: Are natural keys ever a good idea?

Let's consider an example. For a little project we're doing, we've created a database of the World's 50 best restaurants. My fellow students suggest a table design like this:

CREATE TABLE Restaurants (
    year TEXT NOT NULL,
    rank TEXT NOT NULL,
    restaurantName TEXT NOT NULL,
    cityName TEXT NOT NULL
);

Granted, at this point, this table definition defines no key at all. I'm not complaining about that. After all, a month ago, the students probably hadn't seen a database table.

From following the course curriculum, it'd be natural, however, to define a key for the Restaurants table as the combination of restaurantName, cityName, and year. The assumption is that name and city together uniquely identify a restaurant.
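
Spelled out, that would amount to something like this table definition, with the composite natural key doubling as the primary key:

CREATE TABLE Restaurants (
    year TEXT NOT NULL,
    rank TEXT NOT NULL,
    restaurantName TEXT NOT NULL,
    cityName TEXT NOT NULL,
    PRIMARY KEY (restaurantName, cityName, year)
);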

In this particular example, this assumption may actually turn out to hold. So far. After all, the data set isn't that big, and it's important for restaurants in that league to have recognizable names. If I had to guess, I'd say that there's probably only one Nobelhart & Schmutzig in the world.

Still, a good software architect should challenge the underlying assumptions. Is the combination of name and city a natural key? It's easy to imagine that it's not. What if we expand the key to include the country as well? Okay, but what if we had a restaurant named China Wok in Springfield, USA? Hardly unique. Add the state, you say? Probably still not unique.

Identity #

Ensuring uniqueness is only the first of many problems with natural keys. You may quickly reach the conclusion that for a restaurant database, a synthetic key is probably the best choice.
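
As a sketch of that alternative, one might add an auto-incrementing id column; the column name and the SQLite-style INTEGER PRIMARY KEY are my own choices, not something dictated by the course material:

CREATE TABLE Restaurants (
    id INTEGER PRIMARY KEY,   -- synthetic key; carries no meaning outside the database
    year TEXT NOT NULL,
    rank TEXT NOT NULL,
    restaurantName TEXT NOT NULL,
    cityName TEXT NOT NULL
);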

But what about 'natural' natural keys, so to speak? An example may be a car's chassis number. This is already an opaque number, and it probably originates from a database somewhere. Or how about a personal identification number? In Denmark we have the CPR number, and I understand that the US Social Security Number is vaguely analogous.

If you're designing a database that already includes such a personal identification number, you might be tempted to use it as a natural key. After all, it's already a key somewhere else, so it's guaranteed to be unique, right?

Yes, the number may uniquely identify a person, but the converse may not be true. A person may have more than one identification number. At least when time is a factor.

As an example, for technical-historical reasons, the Danish CPR number carries information (which keys shouldn't do), such as a person's date of birth and sex. Since 2014, a law has enabled transsexual citizens to get a new CPR number that reflects their perceived gender. The consequence is that the same person may have more than one CPR number. Perhaps not more than one at the same time, but definitely two during a lifetime.

Even if existing keys are guaranteed to be unique, you can't assume that the uniqueness gives rise to a bijection. If you use an external unique key, you may lose track of the entities that you're trying to keep track of.

This is true not only for people, but also for cars, bicycles (which also have chassis numbers), network cards, and so on.

Clerical errors #

Finally, even if you've found a natural key that is guaranteed to be unique and track the actual entity that you want to keep track of, there's a final argument against using an externally defined key in your system: Data-entry errors.

Take the story about my car's chassis number. The mechanic who spotted the discrepancy clearly interpreted it as a clerical error.

After a few decades of programming, I've learned that sooner or later, there will be errors in your data. Either it's a clerical error, or the end-user mistyped, or there was a data conversion error when importing from an external system. Or even data conversion errors within the same system, as it goes through upgrades and migrations.

Your system should be designed to allow corrections to data. This includes corrections of external keys, such as chassis numbers, government IDs, etc. This means that you can't use such keys as database keys in your own system.
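
To make that concrete, here's a sketch (the table name, column names, and values are hypothetical) in which the chassis number is merely a constrained attribute, so a clerical error can be fixed with a single, local update:

CREATE TABLE Cars (
    id INTEGER PRIMARY KEY,             -- synthetic key; all foreign keys point here
    chassisNumber TEXT NOT NULL UNIQUE  -- external identifier, constrained but not the key
);

-- Correcting a mistyped chassis number touches one row and nothing else.
UPDATE Cars SET chassisNumber = 'WDB1234567890' WHERE id = 42;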

Heuristic #

Many were the times, earlier in my career, when I decided to use a 'natural key' as a key in my own database. As far as I recall, I've regretted it every single time.

These days I follow a hard heuristic: Always use synthetic keys for database tables.

Conclusion #

Is it ever a good idea to use natural keys in a database design? My experience tells me that it's not. Ultimately, regardless of how certain you can be that the natural key is stable and correctly tracks the entity that it's supposed to keep track of, data errors will occur. This includes errors in those natural keys.

You should be able to correct such errors without losing track of the involved entities. You'll regret using natural keys. Use synthetic keys.


Comments

There are lots of different types of keys. I agree that using natural keys as physical primary keys is a bad idea, but you really should be modelling your data logically with natural keys. Thinking about uniqueness and identity is a part of your data design. Natural keys often end up as constraints, indexes, and query plans. When natural keys are not unique enough, you need to consider additional attributes in your design to ensure access to a specific record.

Considering natural keys during design can help elicit additional requirements and business rules. "Does a social security number uniquely identify a person? If not why?" In the UK they recycle them so the natural key is a combination of national insurance number and birth year. You have to ask questions.

2024-06-04 15:43 UTC

I largely agree with James Snape, but wanted to throw in a few other thoughts on top. Surrogates don't defend you from duplicate data; in fact they facilitate it, because the routine generating the surrogate key isn't influenced by any of the other data in the record. The concept of being unable to correct a natural key is also odd: why can't you? Start a transaction, insert a new record with the correct key, update the related records to point to the new record, then delete the old record, done. Want some crucial information about a related record but only have the surrogate to it? I guess you have to join it every time in order to get the columns the user actually wants to see. A foreign key that uses a natural key often prevents the join entirely, because it tells the user what they wanted to know.

I find the problem with natural keys usually comes from another source entirely. Developers write code and don't tend to prefer using SQL. They typically interact with databases through ORM libraries. ORMs are complicated and rely on conventions to uniformly deal with data. It's not uncommon for ORMs to dictate the structure of tables to some degree, or what datatypes to prefer. It's usually easier in an ORM to have a single datatype for keys (BIGINT?) and use it uniformly across all the tables.

2024-06-05 12:42 UTC

James, Nicholas, thank you for writing. I realize that there are some unstated assumptions and implied concerns that I should have made more explicit. I certainly have no problem with adding constraints and other rules to model data. For the Danish CPR number, for example, while I wouldn't make it a primary key (for the reasons outlined in the article), I'd definitely put a UNIQUE constraint on it.
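
As a sketch (the column names are mine), that combination of a synthetic primary key and a uniqueness rule could look like this:

CREATE TABLE Persons (
    id INTEGER PRIMARY KEY,    -- synthetic primary key
    cprNumber TEXT NOT NULL,
    UNIQUE (cprNumber)         -- the rule is modelled, but the CPR number isn't the key
);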

Another unspoken context that I had in mind is that systems often exist in a wider context where ACID guarantees fall apart. I suppose it's true that if you look at a database in isolation, you may be able to update a foreign key with the help of some cascading changes rippling through the database, but if you've ever shared the old key outside of the database, you now have orphaned data.

A simple example could be sending out an email with a link that embeds the old key. If you change the key after sending out the email, but before the user clicks, the link no longer works.

That's just a simple and easy-to-explain example. The more integration (particularly system-to-system integration) you have, the worse this kind of problem becomes. I briefly discussed the CPR number example with my doctor wife, and she immediately confirmed that this is a real problem in the Danish health sector, where many independent software systems need to exchange patient data.

You can probably work around such problems in various ways, but if you had avoided using natural keys, you wouldn't have had to change the key in the first place.

2024-06-06 6:56 UTC

I think it is best to have two separate generated keys for each row:

  • A key used only for relationships between tables. I like to call this relid, and make it serialised, so it is just an increasing number. This key is the primary key and should never be exposed outside the database.
  • A key used only outside the database as a unique reference to which row to update. I like to call this id, and make it a uuid, since it is well accepted to uniquely identify rows by a uuid, and to expose them to the outside world - many public APIs do this. Theoretically, the same uuid should never be generated twice, so this key doesn't necessarily have to be declared as unique.

The relid can be used in simple foreign keys, and in bridging/join tables - tables that contain primary keys of multiple tables. Generally speaking, the relid is far more readable than a uuid - it is easier to hold in your head a simple integer, which usually is not that large, than a 36 character sequence that looks similar to other 36 character sequences. UUIDs generally look like a jumble.
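
A sketch of this two-key layout in PostgreSQL-flavoured SQL (table and column names are only illustrative, and gen_random_uuid() assumes PostgreSQL 13 or later):

CREATE TABLE Posts (
    relid SERIAL PRIMARY KEY,                    -- internal; used only for joins and foreign keys
    id UUID NOT NULL DEFAULT gen_random_uuid(),  -- external reference, safe to expose
    title TEXT NOT NULL
);

CREATE TABLE Comments (
    relid SERIAL PRIMARY KEY,
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    postRelid INTEGER NOT NULL REFERENCES Posts (relid),
    body TEXT NOT NULL
);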

A relid can be 32-bits for tables you're confident will never need more than 2.1 billion rows, which really is 99.99% of all tables ever created by 99.99% of applications. If this turns out to be wrong, it is possible to upgrade the relids to 64-bit for a given table. It's a bit of a pain, especially if there are lots of references to it, but it can be done.

The relid doesn't always have to be a serialised value, and you don't always have to call the column relid. Since the primary key is never exposed publicly, it doesn't matter if different column types or names are used for different use cases. For example, code tables might use one of the codes as the primary key.

I don't think it makes sense to be religious on key usage; just like everything else, there are valid reasons for periodically varying how they work. I'm sure somebody has a valid case where a single key is better than two. I just think it generally makes sense to have a pair of internal and external keys for most cases.

2024-06-07 3:31 UTC

The thing with database keys is you really need to be precise about what you mean by a key. Any combination of attributes is a candidate key. There are also logical and physical representations of keys. For example, a SQL Server primary key is a physical record locator but logically a unique key constraint. Yes, these behave poorly when you use natural keys as the primary key, for all the reasons you mention. They are a complete implementation detail. Users should never see these attributes, though, and you shouldn't share the values outside of your implementation. Sharing integer surrogate keys in URLs is a classic issue, allowing enumeration attacks on your data if not secured properly.

Foreign keys are another logical and physical dual-use concept. In SQL Server a physical foreign key constraint must reference the primary key from a parent table, but logically that doesn't need to happen for relational theory to work.

Alternate keys are combinations of attributes that identify a record (or many records); these are often the natural keys you use in your user interface, WHERE clauses, etc. Alternate keys are also how systems communicate. Take your CPR number example: you cannot exchange patient data unless both systems agree on a common key. This can't be an internally generated surrogate value.

Natural keys also serve another purpose in parent-child relationships. By sharing natural key attributes with a parent you can ensure a child is not accidentally moved to a new parent plus you can query a child table without needing to join to the parent table.
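
For instance, a sketch of such a parent-child pair (table and column names are purely illustrative) might be:

CREATE TABLE Orders (
    customerNumber TEXT NOT NULL,
    orderNumber TEXT NOT NULL,
    PRIMARY KEY (customerNumber, orderNumber)
);

CREATE TABLE OrderLines (
    customerNumber TEXT NOT NULL,   -- parent attribute repeated in the child
    orderNumber TEXT NOT NULL,
    lineNumber INT NOT NULL,
    PRIMARY KEY (customerNumber, orderNumber, lineNumber),
    FOREIGN KEY (customerNumber, orderNumber)
        REFERENCES Orders (customerNumber, orderNumber)
);

-- The customer can be read straight off an order line without joining to Orders,
-- and a line can't be re-pointed at an order belonging to a different customer
-- without also changing the customer attribute.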

There isn't a one-size-fits-all when it comes to databases and keys. Joe Celko has written extensively on the subject, so maybe it's better to read the following than my small commentary:

2024-06-07 09:57 UTC

Greg, thank you for writing. I agree with everything you wrote, and I've been using that kind of design for... wow, at least a decade, it seems, though for a slightly different reason. Even if motivated by a different concern, that kind of design seems congruent with what you describe.

Like you also imply, only a Sith deals in absolutes. The irony of the article is that I originally intended it to be more open-ended, in the sense that I was curious whether there were genuinely good reasons to use natural keys. As I wrote it, the article turned out more unconditional than I originally had in mind.

I am, in reality, quite ready to consider arguments to the contrary. But really, I was curious: Is it ever a good idea to use natural keys as primary keys? It sounds like a rhetorical question, but I don't mind if someone furnishes a counter-example.

As Nicholas Peterson intimated, it's probably not a real problem if those keys never 'leave' the database. What I failed to make explicit in this article is that the problems I've consistently run into occur when a system has shared keys with external systems or users.

2024-06-14 11:26 UTC

James, thank you for writing. I think we're discussing issues at different levels of abstraction. This just underscores how difficult technical writing is. I should have made my context and assumptions more explicit. The error is mine.

Everything you write sounds correct to me. I am aware of both relational calculus and relational algebra, so I'm familiar with the claims you make, and I don't dispute them.

My focus is rather on systems architecture. Even an 'internal' system may actually be composed from multiple independent systems, and my concern is that using natural keys to exchange data between such systems ultimately turns out to make things more difficult than they could have been. The only statement of yours with which I think I disagree is that you can't exchange data between systems unless you use natural keys. You definitely can, although you need to appoint one of the systems to be a 'master key issuer'.

In practice, like Greg Hall, I'd prefer using GUIDs for that purpose, rather than sequential numbers. That also addresses the concern about enumeration attacks. (Somewhat tangentially, I also recommend signing URLs with a private key in order to prevent reverse-engineering, or 'URL-hacking'.)

2024-06-14 11:55 UTC

I think we are basically agreeing here, because I would never use natural keys nor externally visible synthetic keys as physical primary keys. (I think this statement is even more restrictive than the article's main premise.) Well, with one exception: configurable enum-type tables, because the overhead of joining just to resolve a single column value is inefficient. I would, however, always use a natural key for a logical primary key.

The only reason I'm slightly pedantic about this is due to the number of clients who have used surrogate keys in a logical model and then gone on to create databases where the concept of entity identity doesn't exist. This creates many of the issues Nicholas Peterson mentioned above: duplicates, historical change tracking, etc. Frankly, it doesn't help that lots of code examples for ORMs just start with an entity that has an ID attribute.

One final comment on sharing data based on a golden master synthetic key. The moment you do, I would argue that you have committed to maintaining that key through all types of data mergers and acquisitions. It must never collide, and always point to exactly the same record and only that record. Since users can use it to refer to an entity and it makes up part of your external API, it now meets the definition of a natural key. Whether or not you agree with my stretching the definition a little, you still should not use this attribute as the physical primary key (record locator), because we should not expose implementation details in our APIs. The first Celko article I linked to explains some of the difficulties with externally visible synthetic keys.

2024-06-14 13:45 UTC
Julius H #

I'd like to comment with an example where using a synthetic key came back to bite me. My system had posts and users with synthetic IDs. Now I wanted to track an unread state across them. Naively, I designed just another entity:

public int ID { get; set; }
public int PostID { get; set; }
public int UserID { get; set; }

And it worked flawlessly for years. One day, however, a user complained that he always got an exception "Sequence contains more than one element". Of course I used SingleOrDefault() in application code, because I expected 0 or 1 records per user and post. The quick solution was deleting the spurious table row. As a permanent solution I removed the ID field (and column) so the unread state had its natural key as primary key (both columns). So if it happens again in the future, the app will error on insertion rather than on querying.
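
In database terms, the resulting join table might look something like this sketch (assuming Posts and Users tables whose primary key column is ID):

CREATE TABLE UnreadStates (
    postId INT NOT NULL REFERENCES Posts (ID),
    userId INT NOT NULL REFERENCES Users (ID),
    PRIMARY KEY (postId, userId)   -- a duplicate now fails on insert instead of at query time
);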

Since my application is in control of the IDs and it's just a very simple join table I think it was the best solution. If the future requirements hold different kinds of unread state, I can always add the key again.

2024-07-22 14:40 UTC

Julius, thank you for writing. I see what you mean, and would also tend to model this as just a table with two foreign keys. From the perspective of entity-relationship modelling, such a table isn't even an entity, but rather a relationship. For that reason, it doesn't need its own key; not because the combination is 'natural', but rather because it's not really an independent 'thing'.

2024-07-29 14:39 UTC


Wish to comment?

You can add a comment to this post by sending me a pull request. Alternatively, you can discuss this post on Twitter or somewhere else with a permalink. Ping me with the link, and I may respond.

Published

Monday, 03 June 2024 19:46:00 UTC
