Are We Ready For Spec-Driven Development?
Spec-driven development is the waterfall dream. Say what you want, and you get it. Pure intent and verification. Many people (including me) think that this is probably where we’re heading in the medium to long term.
Is it practical today? How about tomorrow?
A couple of days ago, Anthropic published a study where they took 52 junior software engineers and gave them a task to perform using the Python library Trio. Trio implements an asynchronous concurrency model that differs from the standard library’s, and a junior engineer is unlikely to be familiar with it, so this was also a test of learning ability. Afterwards, subjects were given a quiz on the Trio concepts used in the task to see how much knowledge they retained. Crucially, they were also divided into a control group (for whom AI use was completely banned) and an experimental group (who could go wild).
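If you haven’t used Trio before, here’s a minimal sketch of its nursery-based structured concurrency, just to show how it differs from asyncio (my own illustration; the paper doesn’t publish the task code):

```python
# Minimal Trio sketch (illustrative only). Unlike asyncio's free-floating
# tasks, every Trio task lives inside a "nursery" that owns its lifetime
# and propagates its errors.
import trio

async def fetch(name, delay):
    await trio.sleep(delay)  # an await is a checkpoint where Trio can switch tasks
    print(f"{name} finished after {delay}s")

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(fetch, "a", 1)
        nursery.start_soon(fetch, "b", 2)
    # the nursery block only exits once both child tasks have completed
    print("all tasks done")

trio.run(main)
```

That scoping discipline (tasks can’t outlive the block that spawned them) is the kind of thing you only internalise by actually engaging with the code, which matters for what follows.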
We’ve seen AI’s effect on task productivity studied before (METR), which found that developers with AI were 19% slower than without. That was almost a year ago, though. The models have gotten better and so have we. In the Anthropic study, the effect on task productivity was… statistically insignificant. At least it’s not negative, I guess, but the legendary AI productivity gains are yet to show themselves in the data. I firmly believe this is something that will change in time, but not today.
AI-assisted engineers scored an average of 17% lower on the knowledge test after completing the task. The far more interesting part, though, is how much this varied by the AI assistance techniques used by the engineer. An average subject in the AI-assisted cohort scored ~50% on the quiz, but users of one technique averaged 24% and users of another averaged 86%!
There is, of course, far more to this than the headline numbers.
The authors identify two major categories of AI usage patterns: low-scoring (21-39% average quiz scores) and high-scoring (65-86%).
Low-scoring patterns were: AI delegation, progressive AI reliance, and iterative AI debugging. The key thing tying them together is that the human avoids engaging with the problem domain, instead offloading the cognitive effort to the AI.
High-scoring patterns were: conceptual inquiry, hybrid code-explanation, and generation-then-comprehension. These are the opposite of the low-scoring patterns in terms of mental effort required. Conceptual inquiry involves just using the AI as a “search engine on steroids” and writing the code by hand. The other two approaches still have the AI generate the code, but then also make sure the human engages with it and builds a model of the system along with the machine.
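To make that concrete, here’s a toy example of what generation-then-comprehension might look like in practice (my own hypothetical, not taken from the paper): the AI produces the Trio code, and the comments are what the human adds while reading it back to check their understanding.

```python
import trio

async def worker(task_id):
    # Comprehension note: each await is a checkpoint; Trio only switches
    # tasks or delivers cancellation at these points, never mid-statement.
    await trio.sleep(1)
    print(f"worker {task_id} done")

async def main():
    # Comprehension note: move_on_after puts a shared deadline on everything
    # inside the block; when it expires, the whole scope is cancelled together.
    with trio.move_on_after(5):
        async with trio.open_nursery() as nursery:
            for i in range(3):
                nursery.start_soon(worker, i)

trio.run(main)
```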
There are limitations to this paper. The sample size was small (n=52), it was a single task and may not be representative of other software engineering work, the subjects were junior rather than experienced engineers, etc. However, it is useful data and a good starting point for informing the development of our working practices as our industry advances.
So two recent studies suggest we’re further from the spec-driven development dream than we’d like. AI usage that delegates cognitive work to the AI is not significantly faster than traditional approaches, and it prevents developers from building an understanding of the codebase.
Prod breaks at 3am. Who’s going to fix it? Your engineers used AI delegation to write the code, so they don’t have a deep understanding of it. The Anthropic paper found that debugging skills were the most affected by AI use; subjects failed to develop the ability to identify what’s wrong with the code. Now they’re debugging unfamiliar code, under time pressure, probably also using AI to help them debug. Iterative AI debugging was the worst pattern for both building understanding and productivity! Your 3am incident just got longer.
One day we will be able to offload the whole development process (requirements gathering, design, devops, debugging, etc.) to AI. In the meantime, we want to reap the rewards of AI assistance without kneecapping our ability to do the rest of the job. How can we do that?
- Comprehension over delegation. For the time being, we still need to understand the systems we build. Generation-then-comprehension, hybrid code-explanation, and conceptual inquiry are the patterns identified in the study as best for this, but I don’t think we should take these as a direct methodological prescription. The important thing is designing AI assistance workflows that keep the human actively engaged with the problem. I talk more about this in Coding Agents As An Interface To The Codebase.
- Middle-ground expertise. METR found that developers were slowed down more on highly familiar tasks. Some of the biggest gains in AI-assisted coding come from working with unfamiliar technologies. The sweet spot seems to be where the user knows enough to evaluate outputs, but not enough to be faster alone. The age of the coding agent favors generalists.
- Enhance, rather than replace, developer workflows. As yet, there is no clear winner for an AI-oriented software development workflow that reliably produces production-quality code. There are, however, clear wins for improving human development processes with AI assistance. ‘Conceptual inquiry’ (where subjects used the AI for data gathering and conceptual understanding, then implemented the code themselves) was the second-fastest pattern overall, only slightly slower than delegating the task to AI wholesale.
Having said all that, perhaps I’m missing the point. Just like the invention of the calculator made drilling long division in school obsolete, maybe coding is the new long division? Perhaps instead of learning programming languages, system design, etc., we should be learning to write robust specifications, divide tasks up for delegation, and verify outputs?
That may well be the case in the long run, and I’ll have more to say on it in another post, but it is not the world we’re in today. If we want AI-assisted development practices that produce high-quality work, then we must design around the constraints the models have today, while positioning ourselves for the future.
- omegastick