Why DRY? by Mark Seemann
Code duplication is often harmful - except when it isn't. Learn how to think about the trade-offs involved.
Good programmers know that code duplication should be avoided. There's a cost associated with duplicated code, so we have catchphrases like Don't Repeat Yourself (DRY) in order to remind ourselves that code duplication is evil.
It seems to me that some programmers see themselves as Terminators: out to eliminate all code duplication with extreme prejudice; sometimes, perhaps, without even considering the trade-offs involved. Every time you remove duplicated code, you add a level of indirection, and as you've probably heard before, all problems in computer science can be solved by another level of indirection, except for the problem of too many levels of indirection.
Removing code duplication is important, but it tends to add a cognitive overhead. Therefore, it's important to understand why code duplication is harmful - or rather: when it's harmful.
Rates of change #
Imagine that you copy a piece of code and paste it into ten other code bases, and then never touch that piece of code again. Is that harmful?
Here's one of my favourite examples. When protecting the invariants of objects, I always add Guard Clauses against nulls:
if (subject == null) throw new ArgumentNullException("subject");
In fact, I have a Visual Studio code snippet for this; I've been using this code snippet for years, which means that I have code like this Guard Clause duplicated all over my code bases. Most likely, there are thousands of examples of such Guard Clauses on my hard drive, with the only variation being the name of the parameter. I don't mind, because, in my experience, these two lines of code never change.
Yet many programmers see that as a violation of DRY, so instead, they introduce something like this:
The end result of this is that you've slightly increased the cognitive overhead, but what have you gained? As far as I can tell: nothing. The code still has the same number of Guard Clauses. Instead of idiomatic
if statements, they are now method calls, but it's hardly DRY when you have to repeat those calls to Guard.AgainstNull all over the place. You'd still be repeating yourself.
The point here is that DRY is a catchphrase, but shouldn't be an excuse for avoiding thinking explicitly about any given problem.
If the duplicated code is likely to change a lot, the cost of duplication is likely to be high, because you'll have to spend time making the same change in lots of different places - and if you forget one, you'll be introducing bugs in the system. If the duplicated code is unlikely to change, perhaps the cost is low. As with all other risk management, you conceptually multiply the risk of the adverse event happening with the cost of the damage associated with that event. If the product is low, don't bother addressing the risk.
The Rule of Three #
It's not a new observation that unconditional elimination of duplicated code can be harmful. The Rule of Three exists for this reason:
- Write a piece of code.
- Write the same piece of code again. Resist the urge to generalise.
- Write the same piece of code again. Now you are allowed to consider generalising it.
Another reason is that even if the duplication is 'real' (and not coincidental), you may not have enough examples to enable you to make the correct refactoring. Often, even duplicated code comes with small variations:
- The logic is the same, but a string value differs.
- The logic is almost the same, but one duplicate performs an extra small step.
- The logic looks similar, but operates on two different types of object.
If you refactor too prematurely, you may perform the wrong refactoring. Often, people introduce helper methods, and then when they realize that the axis of variability was not what they expected, they add more and more parameters to the helper method, and more and more complexity to its implementation. This leads to ripple effects. Ripple effects lead to thrashing. Thrashing leads to poor maintainability. Poor maintainability leads to low productivity.
This is, in my experience, the most important reason to follow the Rule of Three: wait, until you have more facts available to you. You don't have to take the rule literally either. You can wait until you have four, five, or six examples of the duplication, if the rate of change is low.
The parallel to statistics #
If you've ever taken a course in statistics, you would have learned that the less data you have, the less confidence you can have in any sort of analysis. Conversely, the more samples you have, the more confidence can you have if you are trying to find or verify some sort of correlation.
The same holds true for code duplication, I believe. The more samples you have of duplicated code, the better you understand what is truly duplicated, and what varies. The better you understand the axes of variability, the better a refactoring you can perform in order to get rid of the duplication.
Code duplication is costly - but only if the code changes. The cost of code duplication, thus, is C*p, where C is the cost incurred, when you need to change the code, and p is the probability that you'll need to change the code. In my experience, for example, the Null Guard Clause in this article has a cost of duplication of 0, because the probability that I'll need to change it is 0.
There's a cost associated with removing duplication - particularly if you make the wrong refactoring. Thus, depending on the values of C and p, you may be better off allowing a bit of duplication, instead of trying to eradicate it as soon as you see it.
You may not be able to quantify C and p (I'm not), but you should be able to estimate whether these values are small or large. This should help you decide if you need to eliminate the duplication right away, or if you'd be better off waiting to see what happens.