De-Vibing A Codebase

So you’ve vibe coded your dream app. It’s got all the cool features you couldn’t live without and life is good. The only problem is that it has a few bugs that your agent is playing whack-a-mole with. Whenever you fix one issue, another two crop up in its place.

Why does that happen and what can you do to fix it?

The Codebase

FreedomRPG is a solo text-based RPG inspired by Dungeons & Dragons and Ironsworn, with an AI game master that uses a suite of tools to perform rolls, track game state for both mechanics and plot, and simulate NPC internals. It’s open source, so you can try it yourself.

The bulk of the features were vibe coded with Claude Opus 4.5 over the course of about a week, to test out how spec-driven development is going (not as well as I’d hoped). The workflow looked something like:

- Write a bunch of feature specifications in the morning
- Kick off an agent to churn through them during the day
- Check on the agent’s progress in the evening

I made a serious effort to not even look at the code during this time. It was a true test of how far an agent can take a project with only minor steering.

The result? Eh, it kind of mostly seemed to work. You could send a message and go on an adventure. The GM’s tools didn’t show any obvious errors. There were a handful of UI bugs but nothing game-breaking. Hey, maybe this is actually OK…

The Problems

It wasn’t. From unimplemented features to security failures to horrifically unoptimized code to race conditions, there were heaps of bugs hiding amongst the slop. I had Claude Code analyze the commit history and categorize them.

These numbers are underestimates: they’re based on commit messages, and I often didn’t mention every bug fixed in a single commit.

Not to mention the total lack of any kind of intentional structure to the codebase. 1,000+ line god objects ruled supreme. Endpoint handlers were thrown into random domain-logic files. Close to 50% of the codebase was dead code!

This was the result of letting Claude chug along at a task for around a week. It had instructions to refactor as necessary and report any issues it found, so I’m going to go ahead and assume that this is basically representative of its abilities and make some observations about the limitations I see:

  1. Jumping the gun. The single most common failure case was where Claude would partially implement something (or implement it with serious bugs) and declare “SUCCESS!”. This could be because it launched the app and decided that was proof enough that the change worked. It could be because it tested the happy path and nothing else. Usually, though, it was because it forgot to test the change at all, despite making a clear to-do for itself saying “test the change”.
  2. Shortsightedness. Claude is like an attack dog. It runs straight at whatever task you put in front of it. An error is causing us to lose data? Commit the data before the error happens. A variable isn’t being passed through the system correctly? Hard code the expected value at the destination. There’s a typo in a variable name on the server, which is causing an error when the client receives it? Add an adapter on the client which looks for the typo in responses and corrects it (sketched after this list). Claude has an astounding talent for finding the minimal change to make a problem go away, but no human software engineer would make the above decisions because they create far too much technical debt.
  3. Bad mental models. Sometimes you’re working on a system and it’s clear that Claude gets it. It knows how the inputs relate to the outputs. It can (with some effort) tell you what the value of each variable should be in different situations. But sometimes it does not get it. It makes confident (wrong) guesses about outputs and doesn’t even know which variables exist. I speculate that this is caused by both systems getting too complex and being too far outside the model’s training distribution.
  4. No design intentionality. Claude’s approach to software architecture is: Is there any existing piece of code that looks remotely related to the task? Okay, great, chuck it in there. I guess this makes sense from Claude’s perspective. It mostly navigates the codebase by text search anyway, so why bother with file organization?
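
To make that typo example concrete, here’s a minimal sketch of the adapter “fix” in Python. The names (`adapt_response`, `charcter_name`) are invented for illustration, not taken from FreedomRPG:

```python
# Anti-pattern: the server sends "charcter_name" (a typo), and rather than
# renaming the field at its source, the agent patches it up on the client.
def adapt_response(payload: dict) -> dict:
    if "charcter_name" in payload:
        payload["character_name"] = payload.pop("charcter_name")
    return payload

# The debt-free fix is a one-line rename where the typo lives: on the server.
```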

The Solutions

After the initial building period, I got started on The Great De-Vibing. Actually, I’ve only de-vibed the backend so far; the frontend is still kind of a mess and is next on my list.

Percentage of commits adding new functionality vs. refactoring and bug fixing over time. Note: 21 Dec is an outlier; it was Christmas and I only made 2 commits that week.

At this point, I didn’t know the true extent of what horrors existed in the code. I had 80,000 lines of slop and a dream. The plan: refactor the codebase into something with a sane structure, and fix bugs as they come up.

Bug fixes are pretty straightforward, but here are the key refactoring changes I made which you might be able to apply in your own de-vibing projects:

  1. Delete dead code. Almost 50% of my codebase was dead when I started refactoring.
  2. Figure out your testing patterns. What’s your test strategy? Unit, integration, end-to-end? Specify their purposes and scopes and write a canonical example for each, then make sure the agent follows those examples. This is a good time to implement dependency injection (see the first sketch after this list).
  3. Decompose god objects. Each component in your system should do one thing, and do it well.
  4. Separate domain from infrastructure. I’m not saying you have to go and read The Blue Book, but put some thought into separating the logic that is key to your application from the ‘plumbing’ that is necessary to bring it to life (see the second sketch after this list).
  5. Define your sources of truth. For every piece of data, which component owns it?
  6. Make components stateless. Agents love tracking state for no reason; go directly to the source of truth instead (see the third sketch after this list).
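
First, a minimal sketch of what dependency injection buys you for testing. Every name here (`DiceRoller`, `FixedRoller`, `AttackResolver`) is invented for illustration, not FreedomRPG’s actual API: the game logic depends on an interface, so a test can swap real randomness for a deterministic fake.

```python
from dataclasses import dataclass
from typing import Protocol
import random


class DiceRoller(Protocol):
    """Port: game logic depends on this interface, not on `random` directly."""
    def roll(self, sides: int) -> int: ...


class RandomRoller:
    """Production implementation."""
    def roll(self, sides: int) -> int:
        return random.randint(1, sides)


class FixedRoller:
    """Test double that returns a predetermined sequence of rolls."""
    def __init__(self, rolls: list[int]) -> None:
        self._rolls = iter(rolls)

    def roll(self, sides: int) -> int:
        return next(self._rolls)


@dataclass
class AttackResolver:
    roller: DiceRoller  # injected, so tests control the randomness

    def resolve(self, bonus: int, difficulty: int) -> bool:
        return self.roller.roll(20) + bonus >= difficulty


def test_attack_hits_on_a_high_roll() -> None:
    # The fake makes the "random" outcome deterministic and assertable.
    resolver = AttackResolver(roller=FixedRoller([18]))
    assert resolver.resolve(bonus=2, difficulty=15)
```

The payoff is that the unit test owns the dice, so assertions are exact instead of flaky.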
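Second, a sketch of the shape a domain/infrastructure split can take, again with invented names. The rule stays a pure function, the domain owns the repository interface, and the plumbing implements it off to the side:

```python
from typing import Protocol


# Domain: a pure rule with no I/O, trivially testable.
def apply_damage(hp: int, damage: int) -> int:
    return max(0, hp - damage)


# The domain owns this interface; infrastructure implements it.
class CharacterRepository(Protocol):
    def get_hp(self, character_id: str) -> int: ...
    def save_hp(self, character_id: str, hp: int) -> None: ...


# Application layer: orchestration only, no SQL or HTTP details.
def deal_damage(repo: CharacterRepository, character_id: str, damage: int) -> int:
    hp = apply_damage(repo.get_hp(character_id), damage)
    repo.save_hp(character_id, hp)
    return hp


# Infrastructure: one concrete adapter; a database-backed version would
# live in its own module, away from the rules above.
class InMemoryCharacterRepository:
    def __init__(self) -> None:
        self._hp: dict[str, int] = {}

    def get_hp(self, character_id: str) -> int:
        return self._hp.get(character_id, 10)

    def save_hp(self, character_id: str, hp: int) -> None:
        self._hp[character_id] = hp
```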
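And third, a before/after sketch of statelessness, with hypothetical names: instead of a component keeping its own cached copy of game state (which drifts), it reads from the owning object on demand.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Owns the canonical turn counter: the single source of truth."""
    turn: int = 0


# Before: the component keeps its own copy, which can silently drift.
class CachedTurnTracker:
    def __init__(self, game_state: GameState) -> None:
        self._turn = game_state.turn  # a second source of truth

    def current_turn(self) -> int:
        return self._turn  # stale as soon as GameState.turn changes


# After: stateless; every read goes straight to the owner.
class TurnTracker:
    def __init__(self, game_state: GameState) -> None:
        self._game_state = game_state

    def current_turn(self) -> int:
        return self._game_state.turn
```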

- omegastick