How Codev brings discipline to AI software development
Casual AI prompting breaks down as codebases grow. Codev introduces strict protocols and multi-model reviews to help teams ship maintainable software.
The software industry is currently obsessed with “vibe coding,” the process of using conversational AI prompts to generate software on the fly. For the first hour of a project, the experience feels like magic. You type a sentence, code appears on the screen, and the application runs.
But unstructured chat hits a hard and painful ceiling. Vibe coding tends to collapse under its own context drift once the codebase grows beyond several thousand lines of code.
The fundamental problem is that chat context is ephemeral. As a project grows, the AI must balance new feature requests against existing architectural rules. In a chat-based interface, instructions, early architectural decisions, and bug-fix logic get compressed and eventually scroll away. Once the AI loses that context, the system’s architecture breaks down. The AI starts hallucinating functions, breaking dependencies, and leaving developers with a brittle codebase they no longer fully understand.
Codev, an open-source platform designed to orchestrate AI coding tools, flips this paradigm with a concept called “Context-Driven Development.” Instead of relying on chat logs to guide the AI, Codev requires developers to treat natural language specifications as the true source code. These specifications are checked into Git alongside the software, allowing the AI’s instructions to be versioned, reviewed, and maintained with the same rigor as the code itself.
The AI chief of staff
To manage this spec-first process, Codev shifts developers away from using AI merely as a smart autocomplete. Instead, it pushes teams into a framework where human developers act as directors, orchestrating specialized AI agents that, in turn, coordinate other agents.
The system relies on an Architect-Builder pattern. The human developer acts as the client commissioning the software. An Architect agent acts as the project manager, and autonomous Builder agents work in parallel to actually write the code.
“Imagine you’re trying to commission a building. You would interact with the architect, and the architect would interact with the builders,” Waleed Kadous, the primary developer behind Codev, told TechTalks. “In the ideal case you have a large team of builders working in parallel and they’ll come back to the architect if they need a final check on their work or if they get stuck.”
In this setup, the Architect agent gathers choices, reviews the Builders’ progress, and surfaces only the critical decisions to the human developer in a “Needs Attention” queue.
“Just like a real architect, it gathers all the choices to be made and then helps you make them, offering its suggestions with an eye to the project as a whole. But you, as the person commissioning the building, make the final call, and if you want to inspect every brick, you can,” Kadous said.
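The triage step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Codev's implementation: the `Decision` and `Architect` classes, their fields, and the `critical` flag are hypothetical names invented here to show how routine choices might be absorbed by the Architect while critical ones land in a "Needs Attention" queue.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """A choice raised by a Builder (hypothetical structure)."""
    builder: str
    question: str
    critical: bool  # assumption: the Architect has already classified the decision


@dataclass
class Architect:
    """Hypothetical Architect that filters what reaches the human."""
    needs_attention: list[Decision] = field(default_factory=list)
    handled: list[Decision] = field(default_factory=list)

    def triage(self, decision: Decision) -> None:
        # Critical decisions are surfaced to the human; the rest stay with the Architect.
        if decision.critical:
            self.needs_attention.append(decision)
        else:
            self.handled.append(decision)


architect = Architect()
architect.triage(Decision("builder-1", "Rename a local variable?", critical=False))
architect.triage(Decision("builder-2", "Switch the auth provider?", critical=True))
print([d.question for d in architect.needs_attention])
```

In this sketch the human only ever reads `needs_attention`, but `handled` is still kept, which mirrors the point that the person commissioning the work can inspect every brick if they want to.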
Erasing workspace fragmentation
Previous iterations of AI agent workflows were highly fragmented. Developers had to juggle their primary code editor, a browser tab for GitHub to check pull requests, and multiple terminal windows just to monitor what their autonomous agents were doing.
Codev 3.0 fixes this context switching by bringing the entire ecosystem directly into the integrated development environment (IDE). With a newly introduced VS Code extension, the agent terminals run natively inside the editor. A single sidebar shows the builders, backlog, pull requests, and the “Needs Attention” list. When an agent references a specific file or function during a task, clicking it opens the exact line of code instantly.
The 3.0 release also introduces a modular “forge” abstraction. Forges are repository management platforms like GitHub, GitLab, or Gitea. Historically, integrating AI agents with these platforms required hard-coding API calls for each specific service. Codev abstracts these platforms into a standardized set of 17 distinct operations, such as creating an issue, reading comments, or merging a pull request.
Because the AI sees “the forge” as a single skill with identical commands, teams can mix and match their stack without breaking the AI’s workflow. For example, a team can run a hybrid setup that uses Linear for bug tracking and GitHub for pull requests.
The agent’s real context lives in the repository itself, stored as specifications and plans in version control. The forge simply supplies the live operational data through a single API layer, meaning the AI frontend never needs to know which underlying platform is configured.
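One way to picture the forge abstraction is as a single interface with interchangeable backends. The sketch below is an assumption-laden illustration, not Codev's actual API: the `Forge` protocol shows only three of the operations the article mentions (the real set is described as 17), and `InMemoryForge`, `HybridForge`, and `agent_task` are hypothetical names. The in-memory backends stand in for real services so the example runs without network access.

```python
from typing import Protocol


class Forge(Protocol):
    """Hypothetical subset of the standardized forge operations."""
    def create_issue(self, title: str) -> int: ...
    def read_comments(self, issue_id: int) -> list[str]: ...
    def merge_pull_request(self, pr_id: int) -> bool: ...


class InMemoryForge:
    """Stand-in backend; a real one would call a hosted platform's API."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.issues: dict[int, str] = {}
        self.merged: set[int] = set()

    def create_issue(self, title: str) -> int:
        issue_id = len(self.issues) + 1
        self.issues[issue_id] = title
        return issue_id

    def read_comments(self, issue_id: int) -> list[str]:
        return [f"{self.name}: comment on #{issue_id}"]

    def merge_pull_request(self, pr_id: int) -> bool:
        self.merged.add(pr_id)
        return True


class HybridForge:
    """Routes issue operations to one backend and pull requests to another."""
    def __init__(self, tracker: Forge, code_host: Forge) -> None:
        self.tracker = tracker
        self.code_host = code_host

    def create_issue(self, title: str) -> int:
        return self.tracker.create_issue(title)

    def read_comments(self, issue_id: int) -> list[str]:
        return self.tracker.read_comments(issue_id)

    def merge_pull_request(self, pr_id: int) -> bool:
        return self.code_host.merge_pull_request(pr_id)


def agent_task(forge: Forge) -> list[str]:
    # The agent issues the same calls regardless of which platform is configured.
    issue_id = forge.create_issue("Fix login bug")
    forge.merge_pull_request(7)
    return forge.read_comments(issue_id)


linear = InMemoryForge("linear-stand-in")
github = InMemoryForge("github-stand-in")
print(agent_task(HybridForge(tracker=linear, code_host=github)))
```

Because `agent_task` depends only on the `Forge` interface, swapping either backend requires no change to the agent-facing code, which is the property the hybrid setup relies on.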
Forcing discipline onto autonomous agents
While frontier AI models are highly capable coders, they lack inherent discipline. Left to their own devices, autonomous agents will often take shortcuts, skip writing tests, or ignore the overarching system architecture to quickly solve the immediate prompt.
To prevent agents from going off the rails, Codev uses an orchestrator named “porch” to act as a sheriff, forcing models to adhere to strict, deterministic workflows.
“As one of the AIs themselves told me, they’re good at coding but bad at discipline,” Kadous said. “If the agent doesn’t do this, it’s not allowed to advance to the next stage of work, and it’s told to try again.”
The flagship workflow enforced by the sheriff is the SPIR protocol, which requires agents to walk through four strict phases:
- Specify: Define exactly why and what is being built in clear natural language.
- Plan: Break the specification down into how it will be built.
- Implement: Write the code, write the tests, and verify the requirements.
- Review: Ensure the code meets the quality bar.
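The gating behavior that porch enforces over these four phases can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not porch's real code: the `Phase` enum, the `Workflow` class, and the boolean `checks_passed` argument are hypothetical, and a real orchestrator would run actual checks (tests, reviews) rather than accept a flag.

```python
from enum import Enum


class Phase(Enum):
    SPECIFY = 1
    PLAN = 2
    IMPLEMENT = 3
    REVIEW = 4
    DONE = 5


class Workflow:
    """Hypothetical phase gate: an agent advances only when its checks pass."""
    def __init__(self) -> None:
        self.phase = Phase.SPECIFY

    def submit(self, checks_passed: bool) -> str:
        if self.phase is Phase.DONE:
            return "already done"
        if not checks_passed:
            # The agent stays in the same phase and is told to try again.
            return f"retry {self.phase.name}"
        self.phase = Phase(self.phase.value + 1)
        return f"advanced to {self.phase.name}"


wf = Workflow()
print(wf.submit(checks_passed=False))  # stays in SPECIFY
print(wf.submit(checks_passed=True))   # moves on to PLAN
```

The point of the design is that the order of phases lives in deterministic code rather than in the model's prompt, so an agent cannot skip ahead by simply deciding to.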
At various stages of this protocol, Codev invokes a three-way, multi-model review. Different AI models have entirely different analytical blind spots. For instance, testing has shown that OpenAI’s Codex excels at catching edge cases and security surface area, while Anthropic’s Claude is better at spotting runtime semantics and protocol-level mistakes, and Google’s Gemini excels at overarching architecture.
As Kadous recounted, during a recent Codev development sprint, Codex flagged a Unix socket that was created without restrictive permissions, a flaw that would allow any local user on the machine to hijack a shell session. Both Claude and Gemini missed it. However, later in the same project, Claude caught an OAuth vulnerability where a secret validation token was placed on the wrong URL, opening the door to a severe cross-site request forgery attack. Both Codex and Gemini missed that vulnerability entirely.
Rather than relying on one model’s perspective, Codev brings in all three to review the code and render an independent opinion: Approve, Comment, or Request Changes.
If a reviewing model requests changes, the original Builder agent can push back in a “rebuttal-and-re-iterate” loop. It can either implement the request or debate the reviewer.
“When models fundamentally disagree, Codev doesn’t compute a winner. First there are a few rounds of negotiation between the agents, but if that fails, it surfaces the disagreement and escalates it to a person, because that disagreement is exactly the kind of thing a human should look at,” Kadous said. Averaging away the disagreement would throw out the most useful signal the system generates.
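The "no computed winner" rule can be made concrete with a short sketch. The code below is a hypothetical illustration rather than Codev's logic: the `Verdict` enum mirrors the three opinions named in the article, while the `resolve` function, the reviewer labels, and the limit of three negotiation rounds are assumptions chosen for the example (the article says only "a few rounds").

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    COMMENT = "comment"
    REQUEST_CHANGES = "request_changes"


def resolve(rounds: list[dict[str, Verdict]], max_rounds: int = 3) -> str:
    """Each entry in `rounds` maps a reviewer name to its verdict for that round.

    No votes are counted or averaged: work proceeds only once no reviewer is
    still requesting changes; otherwise the disagreement goes to a human.
    """
    for verdicts in rounds[:max_rounds]:
        if all(v is not Verdict.REQUEST_CHANGES for v in verdicts.values()):
            return "proceed"
    return "escalate to human"


settled = [
    {"codex": Verdict.REQUEST_CHANGES, "claude": Verdict.APPROVE, "gemini": Verdict.COMMENT},
    {"codex": Verdict.APPROVE, "claude": Verdict.APPROVE, "gemini": Verdict.COMMENT},
]
stuck = [
    {"codex": Verdict.REQUEST_CHANGES, "claude": Verdict.APPROVE, "gemini": Verdict.APPROVE},
] * 3

print(resolve(settled))
print(resolve(stuck))
```

Note that in the `stuck` case two of three reviewers approve in every round, yet the result is still an escalation. A majority vote would have hidden exactly the objection a human is supposed to see.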
The hard truth of context-driven development
Spec-first development feels unnatural to developers accustomed to instant chat output. It asks developers to slow down at the exact moment they are most eager to see code execution.
“Every instinct trained by chat says you’re wasting time,” Kadous said. “So the hurdle isn’t intellectual; it’s learning to trust a process that front-loads the discipline before you’ve seen it pay off.”
But the Codev team's own data suggests the discipline does pay off for larger projects. The team ran a controlled experiment comparing the SPIR protocol against unstructured prompting with Claude Code using the same prompt and the same underlying model. SPIR scored 1.2 points higher overall, as judged by independent AI reviewers.
More importantly, the rigorous process excelled at the unglamorous tasks that separate a quick demo from shippable software. The SPIR protocol delivered roughly three times the test coverage and significantly better deployment readiness.
The catch is the cost. Adopting this rigid structure took roughly 3.7 times longer to execute and cost three to five times more in compute tokens.
The conclusion is pragmatic: vibe coding is genuinely the right call for throwaway weekend prototypes. But the structure of Context-Driven Development earns its keep when a team has to maintain the software long-term. Using this method, Codev has maintained productivity on codebases scaling up to 200,000 lines of source code.
Guardrails and human-in-the-loop gates
To safely scale this process across a team, Codev 3.0 decoupled autonomous builders from single branches. Historically, an AI agent would operate on one branch and issue one pull request. Now, a persistent workspace generates a sequence of pull requests over a feature’s life, starting with a pull request for the specification, then the plan, and finally the code implementation.
This multi-PR approach allows human teammates to review and tweak the AI’s intent before it wastes compute tokens writing code. Kadous points to a recent feature he built with a frontend-focused teammate.
“Usually you have one reviewer for the spec,” Kadous said. But with the 3.0 features, he can pause at the spec stage so that his colleague can review the spec alongside him and sign off before any code is implemented. The colleague left comments on the specification in GitHub, and the Architect agent read those comments and modified the spec to address the concerns across both the frontend and backend.
When the Builder agents finally do write code, they execute entirely inside isolated Git worktrees. A worktree is a separate working directory attached to the same repository, with its own checked-out branch and files. If an autonomous agent fails, hallucinates, or thrashes, the damage is contained within its own worktree. The main working tree remains untouched.
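For readers unfamiliar with worktrees, the sketch below shows how a per-builder worktree might be created from Python using the standard `git worktree add` command. The branch naming scheme (`builder/<id>`), the sibling `-worktrees` directory layout, and the function names are assumptions made for illustration; the article does not describe how Codev names or places its worktrees.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, builder_id: str) -> list[str]:
    """Build the git command that gives one builder its own branch and directory."""
    branch = f"builder/{builder_id}"  # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-worktrees" / builder_id  # hypothetical layout
    # `git worktree add -b <branch> <path>` creates a new branch and checks it
    # out in a separate directory, leaving the main working tree untouched.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, builder_id: str) -> None:
    """Run the command; raises CalledProcessError if git reports a failure."""
    subprocess.run(worktree_command(repo, builder_id), check=True)


print(worktree_command(Path("/tmp/app"), "42"))
```

If a builder's work has to be thrown away, `git worktree remove <path>` deletes that directory without affecting other worktrees or the main checkout.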
At the end of the line, critical merge gates cannot be bypassed by AI.
“Approving a gate needs a command with an explicit flag whose literal name says a human approved it, and no code path can supply that flag automatically,” Kadous said. “The builder prompt forbids self-approval.”
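A gate of this kind can be sketched with Python's `argparse`. Everything named here is hypothetical: the `approve-gate` program name, the `--a-human-explicitly-approved-this` flag, and the `approve` function are invented for illustration, since the article does not give the real command or flag name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # allow_abbrev=False means the flag must be typed out in full;
    # a shortened prefix of the flag is rejected rather than accepted.
    parser = argparse.ArgumentParser(prog="approve-gate", allow_abbrev=False)
    parser.add_argument("gate", help="name of the gate to approve")
    parser.add_argument(
        "--a-human-explicitly-approved-this",  # hypothetical flag name
        action="store_true",
        dest="human_approved",
    )
    return parser


def approve(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    if not args.human_approved:
        # There is no default, environment variable, or config file that sets this.
        return f"refused: gate '{args.gate}' needs explicit human approval"
    return f"gate '{args.gate}' approved"


print(approve(["merge"]))
print(approve(["merge", "--a-human-explicitly-approved-this"]))
```

A flag alone cannot physically stop a process from typing it, which is presumably why the design pairs it with a builder prompt that forbids self-approval; the literal flag name makes any violation obvious in logs and shell history.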
The evolution toward hybrid teams
Codev is pushing the boundary toward true hybrid teams where AI agents actively coordinate tasks alongside human colleagues, rather than just acting as passive subordinates waiting for a prompt.
In the near future, software will become increasingly self-improving. Kadous anticipates systems where the AI automatically clusters user feedback and bug reports, translates them into actionable issues, and autonomously spawns its own builder agents to investigate and draft fixes.
Ultimately, the goal is not to remove humans from the loop, but to elevate their role from line-by-line coding to engineering oversight.
“I still feel the ‘humans vs machines’ framing is naive and simplistic, and the question should be what humans and machines can do together that neither of them could do alone,” Kadous said.





Indeed: keep humans in the loop, and have AI with one set of goals (enforcement and quality) push hard against AI that just wants to get the job done.
The nice thing is that the lessons learned about multi-AI collaboration, and the problems encountered along the way, will be baked into the next generation of AI, either as LLM training data or as improved software frameworks. That amounts to a weak version of recursive self-improvement.