Where's the science? by Mark Seemann
Is a scientific discussion about software development possible?
Have you ever found yourself in a heated discussion about a software development topic? Which is best? Tabs or spaces? Where do you put the curly brackets? Is significant whitespace a good idea? Is Python better than Go? Does test-driven development yield an advantage? Is there a silver bullet? Can you measure software development productivity?
I've had plenty of such discussions, and I'll have them again in the future.
While some of these discussions may resemble debates on how many angels can dance on the head of a pin, other discussions might be important. Ex ante, it's hard to tell which is which.
Why don't we settle these discussions with science?
A notion of science #
I love science. Why don't I apply scientific knowledge instead of arguments based on anecdotal evidence?
To answer such questions, we must first agree on a definition of science. I favour Karl Popper's description of empirical falsifiability. A hypothesis that makes successful falsifiable predictions of the future is a good scientific theory. Such a theory has predictive power.
Newton's theory of gravity had ample predictive power, but Einstein's theory of general relativity supplanted it because its predictive power was even better.
Mendel's theory of inheritance had predictive power, but was refined into modern-day genetics, which yields much greater predictive power.
Is predictive power the only distinguishing trait of good science? I'm already venturing into controversial territory by taking this position. I've met people in the programming community who consider my position naive or reductionist.
What about explanatory power? If a theory satisfactorily explains observed phenomena, doesn't that count as a proper scientific theory?
Controversy #
I don't believe in explanatory power as a sufficient criterion for science. Without predictive power, we have little evidence that an explanation is correct. An explanatory theory can even be internally consistent, and yet we may not know if it describes reality.
Theories with explanatory power are the realm of politics or religion. Consider the observation that some people are rich and some are poor. You can believe in a theory that explains this by claiming structural societal oppression. You can believe in another theory that views poor people as fundamentally lazy. Both are (somewhat internally consistent) political theories, but they have yet to demonstrate much predictive power.
Likewise, you may believe that some deity created the universe, but that belief produces no predictions. You can apply Occam's razor and explain the same phenomena without a god. A belief in one or more gods is a religious theory, not a scientific theory.
It seems to me that there's a correlation between explanatory power and controversy. Over time, theories with predictive power become uncontroversial. Even if they start out controversial (such as Einstein's theory of general relativity), the dust soon settles because it's hard to argue with results.
Theories with mere explanatory power, on the other hand, can fuel controversy forever. Explanations can be compelling, and without evidence to refute them, the theories linger.
Ironically, you might argue that Popper's theory of scientific discovery itself is controversial. It's a great explanation, but does it have predictive power? Not much, I admit, but I'm also not aware of competing views on science with better predictive power. Thus, you're free to disagree with everything in this article. I admit that it's a piece of philosophy, not of science.
The practicality of experimental verification #
We typically see our field of software development as one of the pillars of STEM. Many of us have STEM educations (I don't; I'm an economist). Yet, we're struggling to come to grips with the lack of scientific methodology in our field. It seems to me that we suffer from physics envy.
It's really hard to compete with physics when it comes to predictive power, but even modern physics struggles with experimental verification. Consider an establishment like CERN. It takes billions of euros of investment to make today's physics experiments possible. The only reason we make such investments, I think, is that physics so far has had a good track record.
What about another fairly 'hard' science like medicine? In order to produce proper falsifiable predictions, medical science has evolved the process of the randomised controlled trial. It works well when you're studying short-term effects. Does this medicine cure this disease? Does this surgical procedure improve a patient's condition? How does lack of REM sleep for three days affect your ability to remember strings of numbers?
When a doctor tells you that a particular medicine helps, or that surgery might be warranted, he or she is typically on solid scientific grounds.
Here's where things start to become less clear, though. What if a doctor tells you that a particular diet will improve your expected life span? Is he or she on solid scientific ground?
That's rarely the case, because you can't make randomised controlled trials about life styles. Or, rather, a totalitarian society might be able to do that, but we'd consider it unethical. Consider what it would involve: You'd have to randomly select a significant number of babies and group them into those who must follow a particular life style, and those who must not. Then you'd somehow have to force those people to stick to their randomly assigned life style for the entirety of their lives. This is not only unethical, but the experiment also takes most of a century to perform.
What life-style scientists instead do is resort to demographic studies, with all the problems of statistics they involve. Again, the question is whether scientific theories in this field offer predictive power. Perhaps they do, but it takes decades to evaluate the results.
My point is that medicine isn't exclusively a hard science. Some medicine is, and some is closer to social sciences.
I'm an economist by education. Economics is generally not considered a hard science, although it's a field where it's easy to make falsifiable predictions. Almost all of economics is about numbers, so formulating a falsifiable prediction is trivial: The MSFT stock will be at 200 by January 1 2021. The unemployment rate in Denmark will be below 5% in the third quarter of 2020. The problem with economics is that most such predictions turn out to be no better than the toss of a coin - even when made by economists. You can make falsifiable predictions in economics, but most of them do, in fact, get falsified.
On the other hand, with the advances in such disparate fields as DNA forensics, satellite surveys, and computer-aided image processing, a venerable 'art' like archaeology is gaining predictive power. We predict that if we dig here, we'll find artefacts from the iron age. We predict that if we make a DNA test of these skeletal remains, they'll show that the person buried was a middle-aged woman. And so on.
One thing is the ability to produce falsifiable predictions. Another thing is whether or not the associated experiment is practically possible.
The science of software development #
Do we have a science of software development? I don't think that we have.
There's computer science, but that's not quite the same. That field of study has produced many predictions that hold. In general, quicksort will be faster than bubble sort. There's an algorithm for finding the shortest way through a network. That sort of thing.
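To illustrate what such a falsifiable, uncontroversial prediction looks like in practice, here's a minimal sketch (not from the original article) that you can run yourself. It predicts that, on a large random input, quicksort beats bubble sort; the input size and repetition count are arbitrary choices for illustration.

```python
# A minimal sketch of a falsifiable prediction from computer science:
# on large random inputs, quicksort finishes faster than bubble sort.
import random
import timeit


def bubble_sort(xs):
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs


def quicksort(xs):
    if len(xs) <= 1:
        return list(xs)
    pivot = xs[len(xs) // 2]
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))


data = [random.random() for _ in range(2000)]
t_bubble = timeit.timeit(lambda: bubble_sort(data), number=3)
t_quick = timeit.timeit(lambda: quicksort(data), number=3)
print(f"bubble sort: {t_bubble:.2f}s, quicksort: {t_quick:.2f}s")
# The prediction would be falsified if bubble sort won; in practice it doesn't.
```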
You will notice that these results are hardly controversial. It's not those topics that we endlessly debate.
We debate whether certain ways of organising work are more 'productive'. The entire productivity debate revolves around an often implicit context: that what we discuss is long-term productivity. We don't much argue about how to throw something together during a weekend hackathon. We argue whether we can complete a nine-month software project more safely with test-driven development. We argue whether a code base can better sustain its organisation year after year if it's written in F# or JavaScript.
There's little scientific evidence on those questions.
The main problem, as I see it, is that it's impractical to perform experiments. Coming up with falsifiable predictions is easy.
Let's consider an example. Imagine that your hypothesis is that test-driven development makes you more productive in the middle and long run. You'll have to turn that into a falsifiable claim, so first, pick a software development project of sufficient complexity. Imagine that you can find a project that someone estimates will take around ten months to complete for a team of five people. This has to be a real project that someone needs done, complete with vague, contradictory, and changing requirements. Now you formulate your falsifiable prediction, for example: "This project will be delivered one month earlier with test-driven development."
Next, you form teams to undertake the work. One team to perform the work with test-driven development, and one team to do the work without it. Then you measure when they're done.
This is already impractical, because who's going to pay for two teams when one would suffice?
Perhaps, if you're an exceptional proposal writer, you could get a research grant for that, but alas, that wouldn't be good enough.
With two competing teams of five people each, it might happen that one team member exhibits productivity orders of magnitude different from the others. That could skew the experimental outcome, so you'd have to go for a proper randomised controlled trial. This would involve picking numerous teams and assigning a methodology at random: either they do test-driven development, or they don't. Nothing else should vary. They should all use the same programming language, the same libraries, the same development tools, and work the same hours. Furthermore, no-one outside the team should know which teams follow which method.
Theoretically possible, but impractical. It would require finding and paying many software teams for most of a year. One such experiment would cost millions of euros.
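To see how quickly the cost mounts, here's a rough back-of-envelope calculation; the team count, salary, and duration are assumptions for illustration only, not figures from any actual study.

```python
# Back-of-envelope cost estimate for the hypothetical trial.
# All figures are assumptions for illustration only.
teams = 20                          # enough teams to hope for statistical significance
developers_per_team = 5
months = 10                         # the estimated project duration
cost_per_developer_month = 10_000   # salary plus overhead, in euros

total = teams * developers_per_team * months * cost_per_developer_month
print(f"{total:,} euros")           # 10,000,000 euros
```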
If you did such an experiment, it would tell you something, but it'd still be open to interpretation. You might argue that the outcome was caused by the programming language used, and that one can't extrapolate from that result to other languages. Or perhaps there was something special about the project that you think doesn't generalise. Or perhaps you take issue with the pool from which the team members were drawn. You'd have to repeat the experiment while varying one of the other dimensions. That'll cost millions more, and take another year.
Considering the billions of euros/dollars/pounds the world's governments pour into research, you'd think that we could divert a few hundred million to do proper research in software development, but it's unlikely to happen. That's the reason we have to content ourselves with arguing from anecdotal evidence.
Conclusion #
I can imagine how scientific inquiry into software engineering could work. It'd involve making a falsifiable prediction, and then setting up an experiment to prove it wrong. Unfortunately, to be on a scientifically sound footing, experiments should be performed as randomised controlled trials, with a statistically significant number of participants. It's not too hard to conceive of such experiments, but they'd be prohibitively expensive.
In the meantime, the software development industry moves forward. We share ideas and copy each other. Some of us are successful, and some of us fail. Slowly, this might lead to improvements.
That process, however, looks more like evolution than scientific progress. The fittest software development organisations survive. They need not be the best, as they could be stuck in local maxima.
When we argue, when we write blog posts, when we speak at conferences, when we appear on podcasts, we exchange ideas and experiences. Not genes, but memes.
Comments
That topic is something many software developers think about; at least, I do from time to time.
Your post reminded me of the conference talk Intro to Empirical Software Engineering: What We Know We Don't Know by Hillel Wayne. Just curious - have you seen the talk, and if so, what do you think? The studies mentioned in the talk aren't proper scientific experiments as you describe, but they still looked really interesting to me.
Sergey, thank you for writing. I didn't know about that talk, but Hillel Wayne regularly makes an appearance in my Twitter feed. I've now seen the talk, and I think it offers a perspective close to mine.
I've already read The Leprechauns of Software Engineering (above, I linked to my review), but while I was aware of Making Software, I've yet to read it. Several people have reacted to my article by recommending that book, so it's now on its way to me in the mail.