
The agent-ready codebase



If you want to get the most out of agents, your codebase needs to be agent friendly.
You’ll see incredibly strong performance with existing models, even in complex codebases, and it will set you up for even better models in the future.

Getting this right is the difference between agents that “always produce slop” and agents that ship.

Agentic engineering fails when the human becomes the bottleneck.
The models just want to work. You need to do everything you can to get out of their way.

Charlie Munger said “Take a simple idea and take it seriously.”
That’s what we need to do here. Make the codebase and environment agent-verifiable so the agent can reliably ship changes with minimal intervention.

It comes down to three things:

  1. Environment - the agent can use the system autonomously
  2. Intent - the agent understands what you want and why
  3. Feedback loops - the agent can verify its own work, fast

These are guidelines and principles I’ve used and know work.
But before getting to that, we need to talk about Agentic Engineering.

Agentic Engineering is high leverage and anti-slop

Agentic Engineering is using AI agents as the primary means of writing code, with the engineer as orchestrator and oversight. You’re progressively moving further out of the loop and focusing on higher leverage activities.

It’s about building the machine that builds the machine.

Unlike with Vibe Coding, you aren’t throwing your standards out the window.
If anything, we should aspire to raise our standards. Even more so as the agents get better.

As the agentic engineer, you are the architect and reviewer. You define architecture, scope tasks for agents, and review output with the same scrutiny you’d give a junior’s PR. The agent writes the code; you own the system.

Vibing has its place. Low oversight is useful for throwaway projects, personal scripts, or just exploration. As always, try to find a good balance between exploration and exploitation.

1. Environment

The agent should be able to see and do everything you can. If it can’t boot the system or reach the APIs, it can’t do its job, no matter how good the model is.

The first question is whether the agent can spin up a fully isolated instance per change. Think git worktree, multiple checkouts, or cloud sandboxes like e2b, Cloudflare, or Daytona.

Could you launch a hundred agents in your codebase right now?
If not, figure out what’s stopping you. It’s a bottleneck in your factory.
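The worktree approach can be sketched in a few lines: each agent task gets its own detached checkout, so a hundred agents never trip over each other’s working directory. A minimal sketch, assuming you run it from inside a git repository (the task-id naming is illustrative):

```python
# Sketch: spin up one isolated git worktree per agent task.
import subprocess
import tempfile
from pathlib import Path

def create_worktree(repo: Path, task_id: str) -> Path:
    """Create a detached worktree so each agent gets its own checkout."""
    workdir = Path(tempfile.mkdtemp(prefix=f"agent-{task_id}-"))
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(workdir)],
        check=True,
    )
    return workdir

def remove_worktree(repo: Path, workdir: Path) -> None:
    """Tear the checkout down once the agent's change has shipped."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(workdir)],
        check=True,
    )
```

Cloud sandboxes buy you stronger isolation (network, filesystem), but worktrees are often enough to remove the "one checkout, one agent" bottleneck.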

There are incredible engineers who don’t do this and still get great results. Your mileage may vary. I find that I’m either

  • working in many repositories at once
  • spinning up many agents in isolated environments in the same repository

or some mix of the two.

In my own work, as the agents get better at carrying tasks from design and plan all the way to “shipped,” booting up many of them at once is becoming more important for throughput.
As their failure rate goes down, it becomes easier to orchestrate more agents.

The agent also needs access to credentials and APIs.
Can it obtain an API key or access token and test the system itself?
Validate the API it just built, or the config it just pushed?
Are your APIs understandable, reachable, and usable for the agent?

It’s useful to build CLIs the agent can use both to obtain API keys and to interact with the API. Since the agent goes through the CLI, it’s already authenticated, and you can handle token refreshes, make error messages agent-friendly (“what to do instead”), and build custom workflows that mirror common tasks.
This also makes it easier for the agent to run deterministic tests and validation on its own.
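The “what to do instead” idea is simple to sketch: map raw status codes to remediation instructions so the agent’s next step is in the error text itself. The command names and messages below are hypothetical:

```python
# Sketch of agent-friendly error handling for an internal CLI.
# The subcommands and remediation text are hypothetical examples.
REMEDIATIONS = {
    401: "Token expired. Run `mycli auth refresh` and retry.",
    404: "Unknown resource. Run `mycli resources list` to find valid IDs.",
    429: "Rate limited. Wait 30s, or point tests at the local replay proxy.",
}

def explain(status: int) -> str:
    """Turn a raw status code into a 'what to do instead' instruction."""
    return REMEDIATIONS.get(
        status,
        f"Unexpected status {status}. Re-run with --verbose and attach the output.",
    )
```

An agent that reads "run `mycli auth refresh`" recovers in one step; an agent that reads "401 Unauthorized" burns a few turns guessing.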

Can it see the interface and navigate your app through similar means to your users, like a browser?

If the answer to any of these is no, you’re the bottleneck.
Unblock your agent.

Keep validation cheap

Validation can’t be expensive. If every test run costs real money, it won’t scale to hundreds of agents.

It probably won’t even feel good running a single agent.

If API calls are expensive, fake them. Write a mock, or use a capture-and-replay proxy that returns the same response for the same input.

Faking an entire API is incredibly cheap with agents. So is building a generic capture-and-replay proxy.
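The capture-and-replay idea fits in a page: hash the request, hit the real API on a cache miss, and replay the recorded response forever after. A minimal sketch (the cassette-directory layout and the `real_call` callback are assumptions, not a real library’s API):

```python
# Minimal capture-and-replay sketch: the first call with a given input
# hits the real (expensive) API; repeats return the recorded response.
import hashlib
import json
from pathlib import Path

class ReplayCache:
    def __init__(self, cassette_dir: Path):
        self.dir = cassette_dir
        self.dir.mkdir(parents=True, exist_ok=True)

    def fetch(self, request: dict, real_call) -> dict:
        # Same input -> same cassette file -> same response.
        key = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        cassette = self.dir / f"{key}.json"
        if cassette.exists():                      # replay
            return json.loads(cassette.read_text())
        response = real_call(request)              # capture once
        cassette.write_text(json.dumps(response))
        return response
```

Checking the cassette files into the repo makes the replayed responses part of the agent’s deterministic test environment.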

Observability

The agent needs access to logs, metrics, and traces, not just raw log files. Can it query your telemetry directly, including in deployed environments?

Give it read-access to k8s logs. Let it query your observability stack. Give it access to your browser’s devtools. Redirect dev process logs to a dev.log file.
There are so many ways to provide the information it needs; just make them easy for the agent to access.

A navigable codebase

If the codebase is good for humans, it’s probably good for agents.

Tangled abstractions and missing docs hurt agents the same way they hurt people.
If your codebase is a mess, I’ve found the agent is more likely to continue the mess.
Don’t live with broken windows.

Verbose test output is another example of something that doesn’t work for either humans or agents. Spitting out thousands of lines only bloats the context.

2. Intent

I won’t belabor that context engineering is incredibly important; that’s fairly common knowledge by now. Research, design, and plan well so the agent isn’t led astray.
Doing this is a force multiplier. If the agent doesn’t know what you’re building and why, it’ll always miss.

I’ve found that many underinvest here.
You don’t need a 10,000 line AGENTS.md to teach the agent how to run bun run test.
It knows how to use a computer better than most humans.
But it doesn’t know your domain. Your business. And it certainly doesn’t know the tacit knowledge that your rank-one bus factor employee knows.

Make domain knowledge readable

There is a ton of tacit knowledge scattered through Slack threads, meeting notes, and people’s heads. If the agent can’t access this, either by reading it directly or through you curating it as context, it will struggle to understand intent and domain.

Write it down. Architecture decision records, domain glossaries, constraint docs.

If it’s not written down, it doesn’t exist for the agent.

Encode intent in the codebase itself

Rules files, CLAUDE.md, cursor rules, system prompts. These are how you communicate conventions, patterns, and constraints to the agent persistently.

Types, naming, and code structure are also communication. Self-documenting code matters more when agents are reading it.

My general take on AGENTS.md files is that you’re likely overdoing them. It’s rare that I have something that I need to tell the agent 100% of the time. Use skills or better tooling to allow the agent to learn what it needs, when it needs it.

Scope tasks well

Small, clear, verifiable units of work with the right context attached outperform vague instructions every time.

Agents are getting better and better at “Here’s a problem, debug it”-style tasks.
But it takes experience to know what you can get away with. Just use the agents a lot.

Architectural decisions still need you

Human-in-the-loop is still basically required for architecture. My experience is that the agent executes well, but deciding the shape of the system is your job.

I prefer iterating with a frontier model (like GPT-5.2 Pro) and using RepoPrompt to build context for the ‘oracle’.

As for implementation strategy: the models are becoming better at this, especially with good Context Engineering. The better you are at curating relevant information, about the domain, constraints, and current systems, the better the agent does.
I’m still reading the code, and I encourage you to as well. Even if just for mental alignment.

3. Feedback Loops

The agent shouldn’t have to ask you if it worked.

Every change should be machine-verifiable, and that verification should run automatically, not because the agent remembered to run it.

Start with the basics

Linters, compilation, formatters, static checks. These catch the obvious stuff before anything else runs.

Follow Google’s Beyoncé Rule: if you like it, then you shoulda put a test on it.

Use static code analysis tools like Sonar and CodeScene. These have MCP servers (SonarQube MCP).

Write fewer, better tests

Deterministic tests of all kinds: unit, integration, end-to-end.

But don’t chase 100% coverage. It’s useless. Respect Goodhart’s Law.

Write fewer, high-quality behavioral tests. Tests that verify behavior, not implementation details. If you have to change your tests to change your code without the behavior changing, something is wrong.
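The distinction can be sketched with a toy function (the slugify function below is a stand-in for your own code): the test pins the observable behavior, so you can swap the regex for a loop, or the loop for a library, without touching it.

```python
# A behavioral test asserts what callers observe, not how it's computed,
# so it survives refactors. slugify is an illustrative stand-in.
import re

def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify_behavior():
    # Assert the output users see, not which regex produced it.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  --Agents--  ") == "agents"
```

A test that instead asserted "slugify calls re.sub exactly once" would break on every refactor while catching no real bugs.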

Enforce architecture mechanically

This is a great idea from OpenAI that I would love to explore further.

Architectural tests (ArchUnit-style tools) and dependency constraints encoded as tests catch structural drift that no amount of code review will consistently find.
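An ArchUnit-style constraint doesn’t need a framework; in Python you can walk the AST and fail the build on a forbidden dependency. A sketch, assuming a hypothetical layering rule that domain code must not import from a `myapp.web` package:

```python
# Sketch of an architectural test: flag any import that reaches from
# domain code into the web layer. The package name is hypothetical.
import ast

FORBIDDEN_PREFIX = "myapp.web"  # domain code must not depend on the web layer

def forbidden_imports(source: str) -> list[str]:
    """Return every import in `source` that violates the layering rule."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            bad += [a.name for a in node.names
                    if a.name.startswith(FORBIDDEN_PREFIX)]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.startswith(FORBIDDEN_PREFIX):
                bad.append(node.module)
    return bad
```

Run it over every file under your domain package in a regular test, and structural drift fails CI instead of slipping past review.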

Another great tip from OpenAI here: when you build custom lints, write the error messages as remediation instructions. The error output becomes agent context.
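Concretely, a remediation-style lint message tells the agent what to do, not just what’s wrong. A hypothetical example (the rule, the Clock pattern, and the docs path are all illustrative):

```python
# Hypothetical custom lint output written as a remediation instruction,
# so the error text doubles as agent context. Rule and paths are made up.
def lint_error(path: str, line: int) -> str:
    return (
        f"{path}:{line}: direct datetime.now() call.\n"
        f"  Fix: inject a Clock and call clock.now() instead, "
        f"so tests can freeze time. See docs/testing.md."
    )
```

The first line is for tooling; the “Fix:” line is what the agent actually acts on.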

Put verification in the agentic loop

More of your validation should run inside the agent’s workflow, not just in CI.

Use hooks, pre-commit checks, or whatever mechanism guarantees the checks actually run. Prefer enforcing validation mechanically over hoping the agent “remembers” conventions.

You don’t want correctness to depend on the model’s memory.
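One mechanical shape for this: a gate the agent loop calls after every edit, which runs the project’s checks and hands back remediation text on failure. A sketch (the `ruff` and `pytest` commands are illustrative; substitute your own checks):

```python
# Sketch of a mechanical gate for the agent loop: run the project's
# checks after each edit and return remediation text on failure.
import subprocess

CHECKS = [["ruff", "check", "."], ["pytest", "-q"]]  # illustrative commands

def run_gate() -> tuple[bool, str]:
    """Run every check; stop at the first failure and report it."""
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Tail of the output, so a verbose failure doesn't bloat context.
            return False, f"`{' '.join(cmd)}` failed:\n{result.stdout[-2000:]}"
    return True, "all checks passed"
```

Wired into a post-edit hook, this makes verification a property of the environment rather than of the model’s memory.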

I’m oscillating on whether this violates the bitter lesson. Currently, I’m leaning towards it being a temporary ‘hack’: the scaffolding you build around today’s weaknesses will dissolve as the models improve.
For now, the juice is worth the squeeze. It’s worth continuously pushing the models to see where the boundaries are.

Conclusion

Most of what makes a codebase agent-ready is what makes it good for humans. Clean abstractions, behavioral tests, written-down domain knowledge, fast feedback. Agents didn’t create the need for any of this. They just made the cost of not having it obvious.
It took an army of PhDs with amnesia for us to learn that. Again.

The difference is that this investment now compounds .
And as the models get better, the return goes up.

Further reading

There are so many talented engineers doing incredible work in this field. I can’t hope to name them all, so here are some resources I’ve enjoyed.

There are so, so many more. You should see my Readwise Reader vault.
