
Why Coding Agents Kind of Suck for Most People

Coding agents are hot shit right now. Claude Code, Codex, Antigravity, Cursor, Kiro, and so on and so forth. A new tool that obviously changes everything about making software has come along and everyone is scrambling to figure out how to make the best use of it.

It’s kind of like someone dropped the full Rust toolchain into the laps of 1960s FORTRAN developers, then left with no explanation.

With every major new model release, I give the major agentic software development approaches another try, and am consistently disappointed. What follows is a snapshot of the current state of the art (Claude Opus 4.5) in practice.

A few of the most common errors I see:

- Create a class, forget it exists 10 minutes later, then create it again.
- Spend 150 lines building a bad heuristic for timezone conversion instead of calling pytz. The timezone database is right there (see the sketch after this list).
- See nouns and verbs and map them 1:1 to classes and methods, missing the underlying abstractions.
- Enumerate the design options, correctly evaluate the pros and cons of each, then pick the wrong one. It's like watching someone explain exactly why they shouldn't touch the hot stove, then touch the hot stove.
- Pile workaround on top of workaround to cajole a bad design into doing what they want instead of refactoring.
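To make the pytz point concrete, here's a minimal sketch (the datetimes and zone names are illustrative, not from any real agent transcript) of what "just call the library" looks like. Hand-rolled offset arithmetic inevitably gets DST transitions wrong; the library reads them from the IANA database.

```python
from datetime import datetime
import pytz

# Lean on the IANA timezone database via pytz instead of hard-coding
# UTC offsets, which silently break across DST transitions.
tokyo = pytz.timezone("Asia/Tokyo")
new_york = pytz.timezone("America/New_York")

# A naive local timestamp, e.g. parsed from user input.
naive = datetime(2026, 3, 8, 9, 30)

# localize() attaches the zone with the correct rules for that instant;
# astimezone() converts between zones.
in_tokyo = tokyo.localize(naive)
in_new_york = in_tokyo.astimezone(new_york)

print(in_tokyo.isoformat())     # 2026-03-08T09:30:00+09:00
print(in_new_york.isoformat())  # 2026-03-07T19:30:00-05:00 (still EST; DST starts later that day)
```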

None of these are just aesthetic differences. I'm fine (well, maybe not quite fine) with AI-produced code being unreadable and unmaintainable for humans. In fact, I think there's a good chance this is the unavoidable direction we will go in as we start to produce languages designed for AI rather than humans. But all of the cases listed above (a tiny fraction of the issues I see daily) have real consequences: user-facing bugs, downtime, and maintenance burden that would prevent this code from passing code review on my team.

As of January 2026, not even our best LLMs can reason about a mid-sized codebase as a whole the way a senior software engineer does, taking accuracy, maintainability, performance, resilience, observability, and so on into account all at the same time.

Setting raw intellectual capacity aside for a moment (are there even enough parameters to encode this behavior?), modern RLVR (reinforcement learning with verifiable rewards) trains models on short-horizon, contained tasks, not the years-long development initiatives that would require a model to take all of those things into account. The models simply aren't incentivized to learn the skills necessary to build large software projects.

While models will surely develop these long-horizon skills in time, they have scarily impressive short-horizon skills today. How can we best make use of that?

- omegastick