Why Coding Agents Kind of Suck for Most People
Coding agents are hot shit right now. Claude Code, Codex, Antigravity, Cursor, Kiro, and so on and so forth. A new tool that obviously changes everything about making software has come along and everyone is scrambling to figure out how to make the best use of it.
It’s kind of like someone dropped the full Rust toolchain into the laps of 1960s FORTRAN developers, then left with no explanation.
With every major new model release, I give the main agentic software development approaches another try, and I am consistently disappointed. A snapshot of how the current state of the art (Claude Opus 4.5) fares:
- “Just say what you want”: Produces a brittle, incomplete pile of technical debt.
- Spec-driven development: Works for very simple specs, then turns into a brittle, incomplete pile of technical debt as you add features.
- Test-driven (agentic) development: Works for the first few test suites, then turns into a brittle, incomplete pile of technical debt.
- Multi-agent workflows: Produces a brittle, incomplete pile of technical debt very slowly.
- Anthropic’s long-running agent harness: Works for a while, then….
A few of the most common errors I see:
- Create a class, forget it exists 10 minutes later, then create it again.
- Spend 150 lines building a bad heuristic for timezone conversion instead of calling pytz. The timezone database is right there (see the sketch after this list).
- See nouns and verbs and map them 1:1 to classes and methods, missing the underlying abstractions.
- Enumerate the design options and correctly evaluate the pros and cons of each, then pick the wrong one. It’s like watching someone explain exactly why they shouldn’t touch the hot stove, then touch the hot stove.
- Pile workaround on top of workaround to cajole a bad design into doing what they want instead of refactoring.
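To make the timezone point concrete, here is roughly what “just call the library” looks like. This is a minimal sketch using Python’s standard-library zoneinfo module rather than pytz (both sit on the same IANA timezone database); the timestamp is made up for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9, backed by the IANA tz database

# Convert a UTC timestamp to Tokyo local time: two lines, no hand-rolled offset heuristics.
utc_time = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
tokyo_time = utc_time.astimezone(ZoneInfo("Asia/Tokyo"))

print(tokyo_time)  # 2026-01-15 21:00:00+09:00
```

DST rules, historical offset changes, and leap-second quirks are all handled by the database; that is the whole reason reimplementing it by hand is such a red flag.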
None of these are just aesthetic differences. I’m fine (well, maybe not quite fine) with AI-produced code being unreadable and unmaintainable for humans. In fact, I think there’s a good chance this is the unavoidable direction we will go in as we start to produce languages designed for AI rather than humans. But all of the cases listed above (which are a tiny fraction of the issues I see daily) have real consequences: user-facing bugs, downtime, maintenance burden, and so on, any of which would prevent this code from passing code review on my team.
As of January 2026, not even our best LLMs can reason about a mid-sized codebase as a whole the way a senior software engineer does, weighing accuracy, maintainability, performance, resilience, observability, and more all at the same time.
Setting aside raw intellectual capacity for a moment (are there even enough parameters to encode this behavior?), modern RLVR trains models on short-horizon, contained tasks, not the years-long development initiatives that would require a model to take all of those things into account. The models simply aren’t incentivized to learn the skills necessary to build large software projects.
While those long-horizon skills will surely come in time, today’s models already have scarily impressive short-horizon skills. How can we best make use of those?
- omegastick