Why DRY? by Mark Seemann
Code duplication is often harmful - except when it isn't. Learn how to think about the trade-offs involved.
Good programmers know that code duplication should be avoided. There's a cost associated with duplicated code, so we have catchphrases like Don't Repeat Yourself (DRY) in order to remind ourselves that code duplication is evil.
It seems to me that some programmers see themselves as Terminators: out to eliminate all code duplication with extreme prejudice; sometimes, perhaps, without even considering the trade-offs involved. Every time you remove duplicated code, you add a level of indirection, and as you've probably heard before, all problems in computer science can be solved by another level of indirection, except for the problem of too many levels of indirection.
Removing code duplication is important, but it tends to add a cognitive overhead. Therefore, it's important to understand why code duplication is harmful - or rather: when it's harmful.
Rates of change #
Imagine that you copy a piece of code and paste it into ten other code bases, and then never touch that piece of code again. Is that harmful?
Here's one of my favourite examples. When protecting the invariants of objects, I always add Guard Clauses against nulls:
if (subject == null) throw new ArgumentNullException("subject");
In fact, I have a Visual Studio code snippet for this; I've been using this code snippet for years, which means that I have code like this Guard Clause duplicated all over my code bases. Most likely, there are thousands of examples of such Guard Clauses on my hard drive, with the only variation being the name of the parameter. I don't mind, because, in my experience, these two lines of code never change.
Yet many programmers see that as a violation of DRY, so instead, they introduce something like this:
The end result of this is that you've slightly increased the cognitive overhead, but what have you gained? As far as I can tell: nothing. The code still has the same number of Guard Clauses. Instead of idiomatic
if statements, they are now method calls, but it's hardly DRY when you have to repeat those calls to Guard.AgainstNull all over the place. You'd still be repeating yourself.
The point here is that DRY is a catchphrase, but shouldn't be an excuse for avoiding thinking explicitly about any given problem.
If the duplicated code is likely to change a lot, the cost of duplication is likely to be high, because you'll have to spend time making the same change in lots of different places - and if you forget one, you'll be introducing bugs in the system. If the duplicated code is unlikely to change, perhaps the cost is low. As with all other risk management, you conceptually multiply the risk of the adverse event happening with the cost of the damage associated with that event. If the product is low, don't bother addressing the risk.
The Rule of Three #
It's not a new observation that unconditional elimination of duplicated code can be harmful. The Rule of Three exists for this reason:
- Write a piece of code.
- Write the same piece of code again. Resist the urge to generalise.
- Write the same piece of code again. Now you are allowed to consider generalising it.
Another reason is that even if the duplication is 'real' (and not coincidental), you may not have enough examples to enable you to make the correct refactoring. Often, even duplicated code comes with small variations:
- The logic is the same, but a string value differs.
- The logic is almost the same, but one duplicate performs an extra small step.
- The logic looks similar, but operates on two different types of object.
If you refactor too prematurely, you may perform the wrong refactoring. Often, people introduce helper methods, and then when they realize that the axis of variability was not what they expected, they add more and more parameters to the helper method, and more and more complexity to its implementation. This leads to ripple effects. Ripple effects lead to thrashing. Thrashing leads to poor maintainability. Poor maintainability leads to low productivity.
This is, in my experience, the most important reason to follow the Rule of Three: wait, until you have more facts available to you. You don't have to take the rule literally either. You can wait until you have four, five, or six examples of the duplication, if the rate of change is low.
The parallel to statistics #
If you've ever taken a course in statistics, you would have learned that the less data you have, the less confidence you can have in any sort of analysis. Conversely, the more samples you have, the more confidence can you have if you are trying to find or verify some sort of correlation.
The same holds true for code duplication, I believe. The more samples you have of duplicated code, the better you understand what is truly duplicated, and what varies. The better you understand the axes of variability, the better a refactoring you can perform in order to get rid of the duplication.
Code duplication is costly - but only if the code changes. The cost of code duplication, thus, is C*p, where C is the cost incurred, when you need to change the code, and p is the probability that you'll need to change the code. In my experience, for example, the Null Guard Clause in this article has a cost of duplication of 0, because the probability that I'll need to change it is 0.
There's a cost associated with removing duplication - particularly if you make the wrong refactoring. Thus, depending on the values of C and p, you may be better off allowing a bit of duplication, instead of trying to eradicate it as soon as you see it.
You may not be able to quantify C and p (I'm not), but you should be able to estimate whether these values are small or large. This should help you decide if you need to eliminate the duplication right away, or if you'd be better off waiting to see what happens.
I would add one more aspect to this article: Experience.
The more experience you gain the more likely you will perceive parts of your code which could possibly benefit from a generalization; but not yet because it would be premature.
I try to keep these places in mind and try to simplify the potential later refactoring. When your code meets the rule of three, you’re able to adopt your preparations to efficiently refactor your code. This leads to better code even if the code isn't repeated for the third time.
Needless to say that experience is always crucial but I like to show people, who are new to programing, which decisions are based on experience and which are based on easy to follow guidelines like the rule of three.
Markus, thank you for writing. Yes, I agree that experience is always helpful.
Wonderful article but I'd like to say that I prefer sticking to the Zero one infinity rule instead of the Rule Of Three
That way I do overcome my laziness of searching for duplicates and extracting them to a procedure. Also it does help to keep code DRY without insanity to use it only on 2 or more lines of code. If it's a non-ternary oneliner I'm ok with that. If it's 2 lines of code I may consider making a procedure out of it and 3 lines and more are extracted every time I need to repeat them.
I generally agree with the points raised here, but I'd like to point out another component to the example used of argument validation. I realize this post is fairly old now, so I'm not entirely sure if you'd agree or not considering new development since then.
While it is fair to say that argument validation is simple enough not to warrant some sort of encapsulation like the Guard class, this can quickly change once you add other factors into the mix, like number of arguments, and project-specific style rules. Take this sample code as an example:
This uses "standard" StyleCop rules with 5 parameters (probably the very limit of what is acceptable before having to refactor the code). Every block has the explicit scope added, and a line break after the scope terminator, as per style guidelines. This can get incredibly verbose and distract from the main logic of the method, which in cases like this, can become smaller than the argument validation code itself! Now, compare that to a more modern approach using discards:
Or even better, using C#6 expression name deduction and the new, native guard method:
Now, even that can be improved upon, and C# is moving in the direction of making contract checks like this even easier.
I only wanted to make the point that, in this particular example, there are other benefits to the refactor besides just DRY, and those can be quite important as well since we read code much more than we write it: any gains in clarity and conciseness can net substantial productivity improvements.
Juliano, thank you for writing. I agree that the first example is too verbose, but when addressing it, I'd start from the position that it's just a symptom. The underlying problem, here, is that the style guide is counterproductive. For the same reason, I don't write null guards that way.
I'm not sure that the following was your intent, so please excuse me as I take this topic in the direction of one of my pet peeves. In general, I think that 'we' have a tendency towards action bias. Instead of taking a step back and consider whether a problem even ought to be a problem, we often launch directly into producing a solution.
In the spirit of five whys, we may first find that the underlying explanation of the problem is the style guide.
Asking one more time, we may begin to consider the need for null guards altogether. Why do we need them? Because objects may be null. Is there a way to avoid that? In modern C#, you can turn on the nullable reference types feature. Other languages like Haskell famously have no nulls.
I realise that this line of inquiry is unproductive when you have existing C# code bases.
Your point, however, remains valid. It's perfectly fine to extract a helper method when the motivation is to make the code more readable. And while my book Code That Fits in Your Head is an attempt to instrumentalise readability to some extend, a subjective component will always linger.
I don't agree that the style guide is counter productive per se. There are fair justifications for prefering the explicit braces and the blank lines in the StyleCop documentation. They might not be the easiest in the eyes in this particular scenario, but we have to take into consideration the ease of enforcement in the entire project of a particular style. If you deviate from the style only in certain scenarios, it becomes much harder to manage on a big team. Applying the single rule everywhere is much easier. I do understand however that this is fairly subjective, so I won't pursue the argument here.
Now, regarding nullable reference types, I'd just like to point out that one cannot only rely on that feature for argument validation. As you are aware, that's an optional feature, that might not be enabled on consumer code. If you are writing a reusable library for example, and you rely solely on Nullable Reference Types as your argument validation strategy, this will still cause runtime exceptions if a consumer that does not have NRTs enabled calls you passing null, and this would be a very bad developer experience for them as the stack trace will be potentially very cryptic. For that reason, it is a best practice to keep the runtime checks in place even when using nullable reference types.
Juliano, thank you for writing. I'm sure that there are reasons for that style guide. I admit that I haven't read them (you didn't link to them), but I can think of some arguments myself. Unless you're interested in that particular discussion, I'll leave it at that. Ultimately, this blog post is about DRY in general, and not particularly about Guard Clauses.
I'm aware of the limitations of nullable reference types. This once again highlights the difference between developing and maintaining a reusable library versus a complete application.
Understanding the forces that apply in a particular context is, I believe, a key to making good decisions.
I love the call out of risk * probability. I use threat matrixes to guide decisions all the time.
I'd also like to elaborate that duplication is determined not just by similar code, but by similar forces that change the code. If code looks similar, but changes for different reasons, then centralizing will cause tension between the different cases.
Using the example above, guard clauses are unlikely to change together. Most likely, the programmmer changes a parameter and just the single guard clause is effected. Since the clauses don't change for the same reason, the effect of duplication is very low.
Different forces might effect all guard clauses and justify centralizing. For example, if a system wanted to change its global null handling strategy at debug-time versus production.
However, I often see semantically separate code struggling to share an implementation. I think header interfaces and conforming container-like library wrappers are companion smells of this scenario. For example, I frequently see multiple flows trying to wrap email notifications behind some generic email notification abstraction. The parent flow is still coupled to the idea of email, the two notifications still change separately, and there's an extra poor abstraction between us and sending emails.