Your Codebase Determines Your AI Agent’s Success

In a landscape crowded with AI coding assistants from major tech players, Factory is carving out a niche by focusing on a critical, often-overlooked problem: code quality. Their frontier coding agent, Droid, is designed not just to write code, but to improve the very environment it operates in. We sat down with Eno Reyes, the co-founder and CTO of Factory, to explore how his team is tackling the “harness engineering” required for a truly model-agnostic agent, why an agent’s success hinges on a codebase’s “autonomy maturity,” and how AI can be trained to spot the subtle “code smells” that even senior developers find elusive. Our conversation delved into the surprising finding that AI can actually decelerate teams with low-quality code and looked ahead to a future where even building a sales presentation is considered a software task.

Many large engineering teams face a choice between vendor-locked AI tools and forcing developers to switch IDEs. What does the “harness engineering” for a truly model-agnostic agent entail, and could you walk us through the most difficult optimizations your team had to make?

It’s a genuinely hard problem because the goal is to be deployable in any environment, any OS, any IDE, without locking a company into one specific LLM. This requires an incredible amount of what we call harness engineering. It’s not one single secret; it’s the sum of hundreds of little optimizations. We’ve wrestled with things like context management, because an agent might need to work continuously for eight to ten hours on a complex task, and all LLMs have context limits you have to intelligently manage. Then there’s the nuance of how you inject environment information or handle tool calls. The real differentiation comes from building an industrial process around this. We’ve developed a deep institutional knowledge of what a good harness looks like, and just as importantly, what a bad outcome feels like, which allows us to systematize and even automate improvements to the agents themselves.
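To make the context-management point concrete, here is a minimal sketch (not Factory’s implementation) of how a harness might keep a long-running session inside a model’s context window. The token estimate and the summary message are stand-ins for a real tokenizer and an LLM-generated summary.

```python
from dataclasses import dataclass

# Rough sketch of context-window management for a long-running agent session.
# Token counting and summarization are placeholders: a real harness would use
# the target model's tokenizer and an LLM call to compress older turns.

@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def fit_to_context(history: list[Message], budget: int) -> list[Message]:
    """Keep the system prompt and the most recent turns inside the budget,
    collapsing everything older into a single summary message."""
    system = [m for m in history if m.role == "system"]
    rest = [m for m in history if m.role != "system"]

    kept: list[Message] = []
    used = sum(estimate_tokens(m.content) for m in system)
    for msg in reversed(rest):                      # walk newest-first
        cost = estimate_tokens(msg.content)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()

    dropped = rest[: len(rest) - len(kept)]
    if dropped:
        # Placeholder summary; a production harness would summarize with an LLM.
        summary = Message("system", f"[summary of {len(dropped)} earlier turns omitted]")
        return system + [summary] + kept
    return system + kept
```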

Agents need clear signals—like linters, formatters, and passing tests—to function autonomously. When a company lacks this “autonomy maturity,” what are the first steps your agent takes to improve the codebase, and can you describe the resulting feedback loop that accelerates development?

This is central to our philosophy. In reality, most organizations have very few of these automated signals fully implemented. A senior engineer can get by, relying on a human code review to catch a formatting mistake. But bringing in an agent isn’t like hiring one person; it’s like hiring a hundred intern-level engineers at once. You can’t manually review all that work. So, the first thing our agent, Droid, does is run an “autonomy maturity analysis” to find all the missing signals. It might identify six areas where linters, tests, or static checkers are absent. A developer can then simply tell Droid, “Fix those six missing signals.” Once the agent itself establishes those guardrails, a powerful feedback loop kicks in. The agent now has clear, automated feedback, which dramatically improves the quality of its own output and accelerates the entire development cycle.
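The feedback loop Reyes describes can be pictured roughly like this: once the guardrails exist, every proposed change is validated against them, and the failures become the agent’s next instruction. The check commands and the revise_change() hook below are illustrative placeholders, not Factory’s tooling.

```python
import subprocess

# Illustrative feedback loop: once linters and tests exist, every agent change
# is validated against them and any failures are fed back to the agent.

CHECKS = [
    ["ruff", "check", "."],        # linter (example tool)
    ["pytest", "-q"],              # test suite (example tool)
]

def run_checks() -> list[str]:
    """Run each automated signal and collect the output of any failures."""
    failures = []
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{' '.join(cmd)} failed:\n{result.stdout}{result.stderr}")
    return failures

def iterate(revise_change, max_rounds: int = 5) -> bool:
    """Ask the agent to revise its change until all automated signals pass."""
    for _ in range(max_rounds):
        failures = run_checks()
        if not failures:
            return True                            # clear, automated "green" signal
        revise_change("\n\n".join(failures))       # feedback becomes the next instruction
    return False
```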

Research suggests that AI agents can actually decelerate teams working with low-quality code. Before a full rollout, how can an organization measure its own code quality, and what are the most critical red flags that indicate an agent might generate more “slop code” than solutions?

There’s a fantastic body of research from Stanford that really validated what we were seeing. They looked at everything—the volume of AI-generated code, user adoption, all of it—and tried to find what predicted a productivity increase. The only signal that correlated at all was the baseline quality of the codebase. It’s incredibly intuitive when you think about it. An AI is fundamentally a great pattern recognizer. If it’s learning from a pattern of high-quality code, it produces high-quality code. If the pattern is spaghetti, you get more spaghetti. The biggest red flag is a lack of automated quality signals. If you don’t have linters, formatters, and a solid testing suite, you’re setting an agent up to fail. We provide tooling to analyze this from the start, so organizations can see if they are set up to be accelerated or decelerated by AI.
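A rough way to run that kind of pre-rollout check is simply to look for the automated signals Reyes names. The sketch below is illustrative only; the file names are common conventions, not an official checklist, and a real audit would go much deeper than presence checks.

```python
from pathlib import Path

# Illustrative pre-rollout audit: look for the automated quality signals the
# interview calls out (linters, formatters, tests, CI). The candidate paths
# are common conventions, not an exhaustive or authoritative list.

SIGNALS = {
    "linter":    [".eslintrc.json", "ruff.toml", ".flake8", ".golangci.yml"],
    "formatter": [".prettierrc", "pyproject.toml", ".clang-format"],
    "tests":     ["tests", "test", "__tests__"],
    "ci":        [".github/workflows", ".gitlab-ci.yml"],
}

def audit(repo: Path) -> dict[str, bool]:
    """Report which categories of automated signal appear to be present."""
    return {
        name: any((repo / candidate).exists() for candidate in candidates)
        for name, candidates in SIGNALS.items()
    }

if __name__ == "__main__":
    report = audit(Path("."))
    missing = [name for name, present in report.items() if not present]
    print("signals:", report)
    if missing:
        print("red flags (missing signals):", ", ".join(missing))
```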

Senior developers often talk about “code smells” that go beyond static analysis. How can an AI agent be customized to detect these nuanced issues, and could you describe how a “Droid” plugged into a GitHub pipeline would handle such a fuzzy code review?

That’s a great question, because true autonomy has to go beyond just static, red-or-green signals. This is where AI automation becomes incredibly powerful. You can configure a Droid to handle specific workflows, like code review. By plugging it directly into your GitHub Actions pipeline, it becomes a fully customizable code review tool. A senior engineer can essentially teach it to look for those “fuzzy,” non-statically determinable practices that they instinctively recognize as code smells. So, instead of a simple linter rule, the Droid can analyze the context, the logic, and the architecture to flag potential issues, just like a human reviewer would. This moves the agent from a simple code generator to a true partner in maintaining the craftsmanship and long-term health of the codebase.
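As a general illustration of what such a pipeline step could look like (this is not Factory’s Droid integration), a review job might pair the pull-request diff with a senior engineer’s free-text guidelines; ask_model() below stands in for whatever agent or model the pipeline actually invokes.

```python
import subprocess

# Generic sketch of a CI code-review step. A senior engineer's "code smell"
# guidance is passed as free-text rules alongside the pull-request diff.

REVIEW_GUIDELINES = """
- Flag functions that mix I/O with business logic.
- Flag abstractions introduced for a single call site.
- Flag error handling that swallows exceptions without context.
"""

def pr_diff(base: str = "origin/main") -> str:
    # Diff of the current branch against the base branch, as CI would see it.
    return subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout

def ask_model(prompt: str) -> str:
    # Placeholder for the actual agent/model call made from the pipeline.
    raise NotImplementedError

def review() -> str:
    prompt = (
        "Review this diff against the team's guidelines. "
        "Explain each issue in context rather than citing a lint rule.\n\n"
        f"Guidelines:\n{REVIEW_GUIDELINES}\nDiff:\n{pr_diff()}"
    )
    return ask_model(prompt)
```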

You describe tasks like building a sales presentation or answering support tickets as “software tasks.” How does a software development agent approach these non-coding problems, and what capabilities allow it to be more effective than a general-purpose agent?

It’s a shift in perspective that we’re seeing across the industry. As software development agents become more capable, you start to see the whole world as a series of software tasks. What is building a complex PowerPoint for a sales team? It’s a task with dependencies, a required output format, and data integration—it’s a software problem. The same goes for digging through complex documentation to answer a customer support ticket. We’re bullish on the idea that the best general agents will actually be the best software development agents. That’s because they are fundamentally built to break down large, ambiguous problems into executable steps, manage state, and integrate different tools to achieve a goal. This structured problem-solving capability is what makes them so powerful, whether the end product is a running application or a well-researched sales deck.

What is your forecast for AI coding agents?

My forecast is that the very definition of who uses a software development agent is going to expand dramatically. Right now, we think of them as tools for engineers. But we’re already seeing product managers, data scientists, and even people in sales realizing their potential. The line between a “coding task” and a “business task” will continue to blur. As these agents get better at orchestrating complex work, they will become fundamental platforms for productivity across an entire organization, changing the nature of how teams collaborate and solve problems far beyond the traditional software development lifecycle.
