In the world of enterprise software, many teams find themselves responsible for sprawling, legacy React applications that are as critical as they are fragile. Our guest, Vijay Raina, a seasoned specialist in enterprise SaaS technology and architecture, has spent over a decade navigating these complex codebases. He’s at the forefront of a new approach, pioneering the use of specialized AI agent teams to not just manage, but to truly modernize these monoliths. We’ll explore his five-agent system that breaks down refactoring into discrete, automated tasks, the crucial art of prompt engineering that ensures high-quality code generation, and the practical lessons learned from applying this workflow to real-world projects, turning daunting refactors into manageable sprints.
You’ve described a five-agent team for refactoring, including a Planner, Coder, and Tester. What is the key advantage of this specialized structure over a single, generalist agent, and how do you orchestrate the handoff from the Planner’s JSON output to the Coder agent?
The advantage is focus and reliability. A single, generalist agent trying to analyze, plan, code, and test all at once often gets lost in the complexity. It’s like asking one person to be an architect, a bricklayer, and a building inspector simultaneously; they’ll inevitably miss something. By breaking it down, each agent has a single, well-defined job. The Analyzer is great at pattern recognition, the Planner excels at strategic decomposition, and the Coder can focus entirely on writing clean, idiomatic code based on a clear spec. This separation of concerns dramatically reduces the chances of chaotic or incomplete output. The handoff is the most critical part of the orchestration. The Planner doesn’t just write a prose description; it produces a highly structured JSON output. This JSON is our contract between agents. It contains precise instructions, including the exact file paths for new components, the props they should accept, and the state they need to manage. The Coder agent is specifically prompted to parse this JSON and execute those file operations. It’s a very deterministic, machine-readable handoff, which is what makes the automation so effective.
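For illustration only, a Planner payload along these lines might look like the sketch below. The field names, paths, and schema here are hypothetical, not the exact contract Raina's team uses, but they show the kind of machine-readable handoff he describes.

```typescript
// Hypothetical shape of the Planner -> Coder contract; field names and file
// paths are illustrative, not the exact schema used in the workflow above.
interface PlannerTask {
  action: "create" | "modify";
  filePath: string;                // exact location the Coder must write to
  componentName: string;
  props: Record<string, string>;   // prop name -> type, as a spec for the Coder
  state: string[];                 // state the component owns or consumes
  notes?: string;                  // extra constraints for the Coder
}

// Example payload handed from the Planner node to the Coder node.
const plan: PlannerTask[] = [
  {
    action: "create",
    filePath: "src/features/auth/LoginForm.tsx",
    componentName: "LoginForm",
    props: { onSuccess: "() => void" },
    state: ["email", "password", "isSubmitting"],
    notes: "Read and update auth status via AuthContext, not module-level globals.",
  },
];
```

Because the Coder is prompted to treat this structure as ground truth, the handoff stays deterministic: every file it creates and every prop it wires up can be traced back to a field in the plan.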
Many developers struggle with AI generating inconsistent or buggy code. Could you walk me through the specific elements in your Coder agent’s system prompt that prevent common issues like broken imports and ensure it produces modular components with things like lazy loading and context providers?
I’ve learned the hard way that vague instructions are a recipe for disaster. The key is to be relentlessly specific in the system prompt. To prevent broken imports, the prompt explicitly commands the agent to verify all file paths and only import from existing modules or the ones it’s about to create. We’re not just asking it to “refactor the component”; we’re giving it a checklist. For modularity, the prompt contains rules like, “Each new component must be in its own file,” and “Use React Context for state shared between components like authentication or cart status.” I also explicitly instruct it to implement performance patterns. For example, the prompt includes a directive to “Wrap all route components in React.lazy() and a top-level Suspense boundary to enable code splitting.” By baking these architectural best practices directly into the agent’s core instructions, we guide it toward producing the kind of clean, performant, and maintainable code that we want, rather than just hoping it figures it out.
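The pattern those directives push the Coder toward is standard React. A minimal sketch is shown below; the route paths, feature names, and context shape are placeholders, and react-router-dom is assumed as the routing library.

```tsx
// Sketch of the lazy-loading and context pattern the Coder prompt asks for.
// Component names, routes, and the auth context shape are illustrative.
import React, { Suspense, createContext, lazy, useContext, useState } from "react";
import { BrowserRouter, Route, Routes } from "react-router-dom";

// Shared auth state lives in a context provider instead of the monolithic App.
const AuthContext = createContext<{ loggedIn: boolean; setLoggedIn: (v: boolean) => void }>({
  loggedIn: false,
  setLoggedIn: () => {},
});
export const useAuth = () => useContext(AuthContext);

// Each feature slice is its own file and is only downloaded when its route is visited.
const AuthPage = lazy(() => import("./features/auth/AuthPage"));
const ProductsPage = lazy(() => import("./features/products/ProductsPage"));
const CartPage = lazy(() => import("./features/cart/CartPage"));

export default function App() {
  const [loggedIn, setLoggedIn] = useState(false);
  return (
    <AuthContext.Provider value={{ loggedIn, setLoggedIn }}>
      <BrowserRouter>
        {/* Top-level Suspense boundary shows a fallback while a lazy chunk loads. */}
        <Suspense fallback={<p>Loading…</p>}>
          <Routes>
            <Route path="/" element={<AuthPage />} />
            <Route path="/products" element={<ProductsPage />} />
            <Route path="/cart" element={<CartPage />} />
          </Routes>
        </Suspense>
      </BrowserRouter>
    </AuthContext.Provider>
  );
}
```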
Your process involves a Reviewer agent that can loop back to the Coder for fixes. Can you share an example of a subtle bug or a necessary tweak this agent caught that a human might have missed on a first pass, and how that feedback loop is managed automatically?
Absolutely. In one of our recent runs, the Coder successfully extracted a login form into its own component. The code was clean, and the state was managed correctly. A human looking at the diff might have approved it quickly. However, the Reviewer agent, which is prompted to think about user flow and best practices, flagged that after a successful login, the user was left staring at the same login page. It identified a missing piece of logic: there was no redirect to the main dashboard. It sent feedback to the Coder with a specific instruction: “After the authentication state is set to true, use the routing library to programmatically navigate the user to the /products route.” The orchestration, which we manage with LangGraph, is set up as a state machine. The Reviewer’s output triggers a transition that sends the task back to the Coder node with the new instructions. The Coder makes the change, and the cycle repeats. It’s this automated, iterative refinement that catches those small but crucial user experience issues.
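The production graph runs in LangGraph; as a simplified, framework-free sketch of the same control flow, the loop looks roughly like this. The node functions, state fields, and retry cap are assumptions for illustration.

```typescript
// Simplified stand-in for the Reviewer -> Coder feedback loop. The real
// workflow is a LangGraph state machine; this only mirrors its control flow.
interface RefactorState {
  plan: string;          // instructions from the Planner
  code: string;          // latest output from the Coder
  feedback: string[];    // accumulated Reviewer comments
  approved: boolean;
}

// Stand-ins for the LLM-backed nodes; implementations are out of scope here.
declare function runCoder(state: RefactorState): Promise<string>;
declare function runReviewer(code: string): Promise<{ approved: boolean; comment?: string }>;

async function reviewLoop(initial: RefactorState, maxRounds = 3): Promise<RefactorState> {
  const state = { ...initial };
  for (let round = 0; round < maxRounds && !state.approved; round++) {
    // Coder node: (re)generates code from the plan plus any Reviewer feedback.
    state.code = await runCoder(state);

    // Reviewer node: approves, or sends a specific fix instruction back.
    const review = await runReviewer(state.code);
    state.approved = review.approved;
    if (!review.approved && review.comment) {
      // e.g. "After the auth state is set to true, navigate to /products."
      state.feedback.push(review.comment);
    }
  }
  return state; // either approved, or escalated to a human after maxRounds
}
```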
In a successful refactor you mentioned, a three-month project was completed in just five weeks. Beyond the time savings, what were the most significant improvements you saw in the codebase, such as changes in bundle size, test coverage, or overall team velocity on subsequent features?
The time savings were obviously a huge win, but the downstream effects were even more impactful. The most immediate technical improvement was the bundle size; introducing lazy loading for each feature slice cut the initial load chunk by a noticeable margin, which is a massive win for mobile users. Before the refactor, test coverage was spotty because the monolithic App.js was so hard to test in isolation. The Tester agent generated focused, maintainable tests for each new component, which gave us much more confidence in shipping changes. But the biggest improvement was in team velocity. Before, adding a simple feature might require a developer to spend days carefully navigating the giant component, terrified of breaking something. After breaking the app into logical Auth, Products, and Cart slices, a developer could work on a new cart feature, for instance, by only touching a few small, self-contained files. This isolation and clarity made subsequent development work faster and far less stressful for the entire team.
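The tests the Tester agent produces are the focused, component-level kind familiar from React Testing Library. A representative, hypothetical example for the extracted login form might look like this (the component, its props, and the Vitest setup are assumptions):

```tsx
// Hypothetical example of the focused component tests the Tester agent writes.
// Assumes React Testing Library with Vitest and a LoginForm with an onSuccess prop.
import { render, screen, fireEvent, waitFor } from "@testing-library/react";
import { describe, expect, it, vi } from "vitest";
import LoginForm from "../features/auth/LoginForm";

describe("LoginForm", () => {
  it("calls onSuccess after the form is submitted with valid input", async () => {
    const onSuccess = vi.fn();
    render(<LoginForm onSuccess={onSuccess} />);

    fireEvent.change(screen.getByLabelText(/email/i), { target: { value: "a@b.com" } });
    fireEvent.change(screen.getByLabelText(/password/i), { target: { value: "secret" } });
    fireEvent.click(screen.getByRole("button", { name: /log in/i }));

    // The submit handler may be async, so wait for the callback.
    await waitFor(() => expect(onSuccess).toHaveBeenCalled());
  });
});
```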
You noted that for large, enterprise-scale monoliths, token limits are a real barrier. What specific strategies, like code chunking or using vector search on a repo, have you found most effective for applying this agentic workflow to a codebase that exceeds the model’s context window?
This is the reality check for enterprise-scale work. You simply can’t feed a million-line dashboard into the context window and say, “Go.” The most practical strategy has been a “divide and conquer” approach using code chunking. We don’t try to refactor the whole app at once. Instead, we manually identify a single, bounded feature—say, the user profile settings—and feed only the relevant files and directories for that feature into the agent workflow. This keeps the context manageable. For more advanced needs, we’re building smarter retrieval tools. We’ve had success creating vector embeddings of the entire codebase, which allows an agent to perform a semantic search over the repo. So, if it’s working on a component that needs a specific data-fetching utility, it can query the vector database to find the right helper function and its usage examples, without needing the entire repo in its immediate context. It’s about giving the agents targeted, relevant information instead of overwhelming them with the whole monolith.
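A minimal sketch of that retrieval step is below, assuming OpenAI embeddings and a simple in-memory index; a production setup would use a dedicated vector store and an AST-aware chunker rather than this brute-force approach.

```typescript
// Minimal sketch of embedding code chunks and retrieving the most relevant ones
// for an agent's context. Assumes the `openai` npm package; the chunking and
// storage strategy here are deliberately simplified.
import OpenAI from "openai";

const client = new OpenAI();

interface CodeChunk {
  path: string;       // file the chunk came from
  text: string;       // the code itself
  embedding: number[];
}

async function embed(text: string): Promise<number[]> {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Build the index once per repo snapshot...
async function indexChunks(chunks: { path: string; text: string }[]): Promise<CodeChunk[]> {
  return Promise.all(chunks.map(async (c) => ({ ...c, embedding: await embed(c.text) })));
}

// ...then, at run time, hand the agent only the top-k most relevant chunks.
async function retrieve(query: string, index: CodeChunk[], k = 5): Promise<CodeChunk[]> {
  const q = await embed(query);
  return [...index]
    .sort((a, b) => cosine(b.embedding, q) - cosine(a.embedding, q))
    .slice(0, k);
}
```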
Considering the risk of hallucinations and the need for human oversight, how do you balance speed with safety? Could you detail your review process, including the tools you use, when an AI-generated pull request lands on your desk for a critical legacy application?
The balance comes from treating the AI output as a highly skilled, but not infallible, junior developer’s first draft. We never, ever merge its output directly into the main branch without a thorough human review. Speed comes from the AI doing the tedious 80% of the work—moving files, rewriting boilerplate, adding tests. Safety comes from the human providing the final 20% of critical oversight. When a PR from the agent system lands, my first step is to pull it down locally and run the test suite it generated. If the tests pass, that’s a good first signal. Then, I open the diff in an editor with a good comparison tool; I personally rely heavily on Cursor, the AI-focused editor built on VS Code. I’m not just looking for syntax errors; I’m scrutinizing the business logic. Did it subtly change how a discount is calculated? Did it miss an important edge case in the user authentication flow? We treat it as a code review, just like we would for any human team member, because at the end of the day, accountability for the code still rests with the human engineers.
What is your forecast for autonomous coding agents in software development?
My forecast is one of pragmatic acceleration, not wholesale replacement. In the next few years, we’re not going to see agents autonomously building and shipping entire complex applications from scratch. The reality of non-code concerns—like infrastructure, security policies, and intricate business negotiations—is far too complex. However, I predict they will become indispensable “accelerators” for targeted, well-defined tasks. We’ll see them embedded directly into our IDEs and CI/CD pipelines, where they’ll handle things like mechanical refactors, generating boilerplate, writing comprehensive test suites, and upgrading dependencies. This will result in a significant productivity boost, somewhere in the 30-50% range for these specific tasks, freeing up developers to focus on architecture, complex problem-solving, and user experience—the creative work that humans do best. The role of the senior developer will shift even more towards that of a systems thinker and a skilled reviewer of AI-generated code.
