Vijay Raina is a seasoned veteran in the enterprise SaaS and software architecture space, known for his deep technical dives into how complex systems communicate—or fail to communicate—with the engineers who manage them. With a career dedicated to refining software design, Vijay has become a leading voice in data engineering, particularly in how we move beyond the “brute force” methods of debugging and toward more intelligent, automated systems. His insights help organizations bridge the gap between raw data processing and actionable business intelligence, ensuring that the pipelines powering modern enterprises are not just functional, but observable and resilient.
This conversation explores the shift from manual data pipeline debugging to a more sophisticated “agentic” observability layer. We dive into the specific dangers of partial data writes in AWS Glue and how traditional alerting often leaves engineers in the dark. Vijay outlines a reference architecture that uses Large Language Models as reasoning engines to classify failures—ranging from schema drift to data skew—without granting them control over production data. He also emphasizes the non-negotiable need for data hygiene, explaining why signals like batch IDs and table commits are the bedrock of any automated triage system. Finally, we discuss the metrics that truly matter when evaluating AI-driven observability and why the “safe-to-retry” decision is the most critical output for any data team.
When a Glue job fails after writing partial data to an Iceberg table, downstream reports can become silently corrupted; could you walk us through why this specific scenario is such a nightmare for data engineers and how manual debugging typically falls short?
The real nightmare isn’t just that the job failed—it’s that the job failed quietly enough to let downstream systems think everything is fine. Imagine a typical overnight Glue job designed to process twelve distinct partitions for a day’s worth of events. If that job errors out after successfully writing nine of those twelve partitions into an Iceberg curated table, you’re in a very dangerous middle ground. On the surface, your control table shows a “FAILED” status for the current batch, but the Iceberg table has already been altered. A downstream report or a dashboard, running on its own rigid schedule, might see those nine partitions and assume it’s the complete set of data for the day. By the time a human engineer logs in at 8:00 AM, the discrepancy has already rippled through the organization. The question then shifts from a simple “Why did it fail?” to a much more terrifying “What did we already break downstream?”
Manual debugging in this environment is a slow, error-prone process of stitching together a forensic puzzle. You have to jump into CloudWatch logs to find the specific error, check the Glue run metadata, verify the source S3 path, and then manually count records or inspect Iceberg snapshots to see where the write stopped. For an experienced engineer, this might take thirty minutes of intense focus, but for someone less familiar with the pipeline, it’s easy to miss a signal. This delay often leads to “blind reruns”—where an engineer just hits the retry button hoping for the best—which can result in duplicate records or even more corrupted partitions. The evidence needed to prevent this is always there, hidden in the logs and metadata, but the pipeline has no structured way to tell its own story. That’s the gap we’re trying to close with a triage layer that can explain the situation before a single manual click occurs.
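To make the twelve-partition scenario concrete, here is a minimal sketch of the check an engineer performs by hand: comparing the partitions a batch was supposed to produce with the ones that actually landed. The function name, the `part=NN` naming, and the sample data are illustrative assumptions, not part of any Glue or Iceberg API. In practice the "written" set would come from the table's own metadata.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WriteCheck:
    missing: list[str]      # expected partitions that never landed
    unexpected: list[str]   # partitions present but not part of this batch
    partial_write: bool     # True when some, but not all, expected partitions landed


def check_partitions(expected: set[str], written: set[str]) -> WriteCheck:
    """Compare the partitions a batch should produce with those actually found."""
    missing = sorted(expected - written)
    unexpected = sorted(written - expected)
    landed = expected & written
    return WriteCheck(missing, unexpected, bool(landed) and bool(missing))


# Hypothetical data: twelve expected partitions, nine of which were written.
expected = {f"part={i:02d}" for i in range(12)}
written = {f"part={i:02d}" for i in range(9)}

result = check_partitions(expected, written)
print(result.partial_write)  # True
print(result.missing)        # ['part=09', 'part=10', 'part=11']
```

A result with `partial_write` set is exactly the case where a blind rerun is risky, because part of the batch is already visible to downstream readers.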
You’ve used the term “agentic” to describe a new layer of observability; how does this differ from the standard alerts we’ve used for years, and what does the “observe, classify, explain, recommend” loop look like in practice?
The term “agentic” is frequently thrown around as a buzzword, but in the context of data engineering, it has a very specific and disciplined meaning. It’s not about giving an AI the keys to your production environment and letting it run wild. Instead, an agentic observability layer is a controlled workflow that acts as a sophisticated triage nurse for your pipelines. A traditional alert is a blunt instrument; it tells you “Glue job daily_customer_interactions failed,” which is about as helpful as a smoke detector—it tells you there’s a fire, but not whether it’s a candle or the entire kitchen. An agentic layer, by contrast, provides context. It might tell you that the failure happened because of a new column in the source data that isn’t in your schema, and because the write started before the crash, a blind retry will absolutely create duplicates.
The loop itself—observe, classify, explain, recommend—is a structured way to handle incident response. First, we “observe” by pulling a curated set of signals, such as the last fifty error log lines, record counts, and table snapshot IDs. We don’t just dump ten thousand lines of raw logs into an LLM because that’s noisy and expensive. Next, we “classify” the failure into a fixed set of categories, like schema drift or partial write risk. Then, the system “explains” the root cause in plain English, and finally, it “recommends” a specific action, like quarantining the batch. The crucial part of this design is that the loop stops at “recommend.” The system suggests what to do, but a human engineer or a deterministic rule-based orchestrator makes the final call. This reduces the cognitive load on the engineer significantly, allowing them to make a high-quality decision in seconds rather than spending an hour digging through logs.
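The loop described above can be sketched in a few lines. This is one possible shape under stated assumptions, not a reference implementation: the category names, the `Triage` structure, and the keyword-based `rule_based_classifier` are all hypothetical, and the classifier is a deterministic stand-in for the LLM call so that the example runs on its own. The two points it tries to illustrate are the curated input (only the last fifty error lines) and the fact that nothing in the loop executes the recommendation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

CATEGORIES = {"schema_drift", "partial_write_risk", "unknown"}  # fixed label set (illustrative)
MAX_LOG_LINES = 50  # "the last fifty error log lines"


@dataclass
class Triage:
    category: str
    explanation: str
    recommendation: str


def observe(raw_log: list[str], record_count: int, snapshot_id: str) -> dict:
    """Curate signals: keep only the most recent error lines, not the full log."""
    errors = [line for line in raw_log if "ERROR" in line]
    return {
        "error_lines": errors[-MAX_LOG_LINES:],
        "record_count": record_count,
        "snapshot_id": snapshot_id,
    }


def rule_based_classifier(signals: dict) -> Triage:
    """Deterministic stand-in for the LLM call, used only so the sketch runs."""
    text = " ".join(signals["error_lines"]).lower()
    if "column" in text and "schema" in text:
        return Triage(
            "schema_drift",
            "The source contains a column that is missing from the target schema.",
            "quarantine_batch",
        )
    return Triage("unknown", "No known pattern matched.", "escalate_to_on_call")


def triage(signals: dict, classify: Callable[[dict], Triage]) -> Triage:
    """Classify, explain, recommend. Nothing here executes the recommendation."""
    result = classify(signals)
    if result.category not in CATEGORIES:
        # Never accept a label outside the fixed set.
        return Triage("unknown", "Classifier returned an unrecognized category.",
                      "escalate_to_on_call")
    return result


# Hypothetical log: 60 noisy errors followed by the one that matters.
log = ["INFO start"] + [f"ERROR retry {i}" for i in range(60)]
log.append("ERROR new column 'channel' not in schema")

signals = observe(log, record_count=1200, snapshot_id="snap-001")
outcome = triage(signals, rule_based_classifier)
print(len(signals["error_lines"]), outcome.category, outcome.recommendation)
```

Passing the classifier in as a callable keeps the surrounding workflow testable without a model, and the label check in `triage` means a free-form answer can never leak into downstream routing.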
For a team looking to implement this architecture, what kind of “pipeline hygiene” or foundational data signals must be in place before an agent can actually provide meaningful insights?
I always tell teams that an LLM-based agent cannot create observability out of thin air; it can only reason over the observability you’ve already built. If your house isn’t in order, the agent will just give you high-confidence hallucinations. To make this work, you need a disciplined data engineering foundation. This means you must be tracking batch IDs across your runs, maintaining clear source-to-target paths, and storing the results of data quality checks in a way that’s accessible. If your pipeline doesn’t track whether a write has actually started before a failure, the agent has no way of knowing if there’s a partial write risk. It’s the difference between guessing and knowing.
Specifically, the agent needs to see table commits and deployment versions. Without knowing which version of the code was running, the agent can’t distinguish between a transient data error and a code regression from a recent push. It also needs to see the control table status—did the previous run succeed? Is this the third attempt at the same batch? If these signals aren’t structured and stored, the agent is flying blind. LLMs don’t replace the need for good engineering; they reward it. The teams that see the most success with agentic observability are the ones who already have structured logs and clear ownership mapping. You have to give the agent the right ingredients if you want a useful recommendation.
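One way to check whether this hygiene is in place is to write the signals down as a structure and report which ones are absent before any agent is involved. The sketch below assumes a hypothetical `IncidentContext` record; the field names mirror the signals mentioned in the answer, but the exact shape, the bucket path, and the table name are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentContext:
    batch_id: Optional[str] = None
    source_path: Optional[str] = None
    target_table: Optional[str] = None
    write_started: Optional[bool] = None       # did any write begin before the failure?
    last_commit_id: Optional[str] = None       # most recent table commit or snapshot
    deployment_version: Optional[str] = None   # code version that was running
    previous_run_status: Optional[str] = None  # control table status of the prior run
    attempt_number: Optional[int] = None       # e.g. third attempt at the same batch
    dq_results: Optional[dict] = None          # stored data quality check results


def missing_signals(ctx: IncidentContext) -> list[str]:
    """Names of signals that were never recorded (still None), in field order."""
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]


# Hypothetical incident where only part of the context was tracked.
ctx = IncidentContext(
    batch_id="2024-01-01",
    source_path="s3://example-bucket/events/",
    target_table="curated.events",
    attempt_number=3,
)
gaps = missing_signals(ctx)
print(gaps)
```

Note that `write_started=False` counts as a recorded signal; only `None` means the pipeline never captured it. A non-empty list of gaps is a reason to fix the pipeline first, since the agent cannot reason about a partial write risk it has no evidence for.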
When a failure occurs, the agent classifies it into specific categories like “Schema Drift” or “Small File Pressure.” How does this classification change the way a data team actually responds to an incident?
Classification is the bridge between a raw error and a targeted resolution. In a traditional setup, every failure goes into a single “data engineering” queue, and the person on call has to sort through them. By using a fixed set of categories—like schema drift, data skew, or permission issues—you can route incidents to the people best equipped to fix them. For instance, if the agent classifies a failure as “Schema Drift” with high confidence, that incident can be automatically routed to the data contract owner or the team responsible for the source system. If it’s a “Permission Error,” it goes straight to the platform or IAM team. This prevents the “ping-pong” effect where a ticket is passed around between teams who aren’t sure if the issue is in the code, the data, or the infrastructure.
Furthermore, these categories allow you to automate certain safety protocols. If an incident is classified as “Partial Write Risk,” your orchestrator can be programmed to block any “Retry” button until a human has manually verified the state of the target table. If it’s “Source Delay,” the system might just wait and retry later without bothering an engineer at all. This turns your observability layer into a traffic controller. By moving away from free-form text and toward a structured JSON output with a confidence score, you can also start tracking which types of failures are most common. If you see “Small File Pressure” popping up five times a week, you know it’s time to invest in a compaction strategy rather than just treating each failure as a one-off event.
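As a rough illustration of the traffic-controller idea, the sketch below turns a structured JSON answer into a routing decision. Everything specific here is an assumption made for the example: the category keys, the queue names, the JSON field names `category` and `confidence`, and the 0.8 threshold. The only behavior taken from the discussion is that a partial write risk blocks retries, a source delay can be retried without paging anyone, and different categories go to different owners.

```python
import json

# Hypothetical mapping from category to owning queue.
ROUTES = {
    "schema_drift": "data-contract-owner",
    "permission_error": "platform-iam",
    "partial_write_risk": "data-engineering-on-call",
    "source_delay": None,  # nobody is paged; the orchestrator waits and retries
    "small_file_pressure": "data-engineering-backlog",
}
MIN_CONFIDENCE = 0.8  # illustrative threshold, not a recommended value


def route(agent_output: str) -> dict:
    """Turn the agent's JSON into a routing decision; fall back to on-call if unsure."""
    fallback = {"queue": "data-engineering-on-call", "retry_blocked": True, "auto_retry": False}
    try:
        data = json.loads(agent_output)
        category = data["category"]
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return fallback  # malformed or incomplete output is treated as untrusted
    if category not in ROUTES or confidence < MIN_CONFIDENCE:
        return fallback
    return {
        "queue": ROUTES[category],
        "retry_blocked": category == "partial_write_risk",
        "auto_retry": category == "source_delay",
    }


print(route('{"category": "partial_write_risk", "confidence": 0.93}'))
print(route('{"category": "source_delay", "confidence": 0.9}'))
print(route("not json"))
```

The fallback is deliberately conservative: anything unparseable, unknown, or low-confidence goes to a human with retries blocked, which keeps a bad model answer from silently becoming an automated action.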
You’ve been very clear about the “line in the sand” regarding what the agent should never be allowed to do. Why is it so dangerous to let an AI autonomously “fix” production data issues, even if it has high confidence in its solution?
The cost of a confident but wrong action in a data environment is astronomically high. While it’s tempting to think about “self-healing” pipelines where an AI fixes a schema mismatch or grants itself the necessary IAM permissions to finish a job, the risks far outweigh the benefits. If an agent has a 95% confidence level that it should delete a “corrupted” partition, that 5% chance of being wrong could mean the permanent loss of critical business data or silent corruption that stays hidden for months. In a regulated environment, an agent granting itself Lake Formation permissions isn’t just a technical risk; it’s a major security and compliance violation.
The boundary between observability and production control must remain absolute. An agent should explain the problem and suggest a fix, but it should never have the authority to rewrite production tables, promote quarantined data, or drop partitions. We have to remember that LLMs are reasoning engines, not deterministic logic gates. They can be swayed by unusual log patterns or conflicting signals. A “safe-to-retry” recommendation from an agent is incredibly useful for a human to see, but letting the agent actually trigger that retry on a partially written Iceberg table is an incident waiting to happen. The goal is to reduce the “mean time to recovery” by speeding up the triage, not by removing the necessary safeguards that keep our data systems reliable and compliant.
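The boundary can also be expressed as a small gate that sits between the agent's recommendation and anything that touches production. This is a sketch under assumptions: the action names and status strings are invented for illustration, and a real orchestrator would enforce the same rule through its own permission model rather than a single function.

```python
from typing import Optional

# Actions the agent may recommend (illustrative names).
RECOMMENDABLE = {"retry", "quarantine_batch", "wait_for_source", "escalate"}
# Actions the triage layer must never be able to trigger (from the discussion above).
PROTECTED_ACTIONS = {
    "rewrite_table",
    "promote_quarantined_data",
    "drop_partition",
    "grant_permission",
}


def gate(action: str, partial_write: bool, approved_by: Optional[str] = None) -> str:
    """Decide what happens to an agent recommendation. The agent itself executes nothing."""
    if action in PROTECTED_ACTIONS or action not in RECOMMENDABLE:
        return "rejected"
    if action == "retry" and partial_write and approved_by is None:
        return "blocked_pending_human_review"
    if approved_by is None:
        return "awaiting_decision"
    return "approved"


print(gate("drop_partition", partial_write=False, approved_by="alice"))  # rejected
print(gate("retry", partial_write=True))                                 # blocked_pending_human_review
print(gate("retry", partial_write=True, approved_by="alice"))            # approved
```

Protected actions are rejected even when an approver is named, because in this sketch they are simply outside what the triage layer is allowed to route; a human who wants to drop a partition does so through a separate, audited path.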
In your framework, you mention that “the summary looks good” is not a valid evaluation. How should data teams practically measure the success and accuracy of their triage layer to ensure it’s actually helping?
Evaluating a triage layer requires the same rigor you’d apply to any other production software. You can’t just look at a few examples and call it a day. We recommend starting with a synthetic evaluation—basically a “test suite” for your observability agent. You create ten or twenty specific failure scenarios, including things like missing source paths, shuffle spills, and access-denied errors. For each scenario, you define what the “Expected Category” and the “Safe-to-Retry” decision should be. Then, you run the agent against these scenarios and score it. Is it picking the right category? More importantly, is it getting the “safe-to-retry” decision right?
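A synthetic evaluation of this kind can start as a very small harness. The sketch below assumes a hypothetical agent interface (a callable that takes a dictionary of signals and returns a category plus a safe-to-retry flag) and uses a keyword stub in place of the real agent. The three scenarios and their expected answers are placeholders; each team would define its own expected category and retry decision per scenario.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Scenario(NamedTuple):
    name: str
    signals: dict
    expected_category: str
    expected_safe_to_retry: bool


class Verdict(NamedTuple):
    category: str
    safe_to_retry: bool


def score(scenarios: list[Scenario], agent: Callable[[dict], Verdict]) -> dict:
    """Run the agent over every scenario and report accuracy on both outputs."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    category_hits = retry_hits = 0
    failures = []
    for s in scenarios:
        verdict = agent(s.signals)
        category_ok = verdict.category == s.expected_category
        retry_ok = verdict.safe_to_retry == s.expected_safe_to_retry
        category_hits += category_ok
        retry_hits += retry_ok
        if not (category_ok and retry_ok):
            failures.append(s.name)
    return {
        "category_accuracy": category_hits / len(scenarios),
        "safe_to_retry_accuracy": retry_hits / len(scenarios),
        "failed_scenarios": failures,
    }


# Placeholder scenarios; expected values are illustrative, not authoritative.
SCENARIOS = [
    Scenario("missing_source_path",
             {"error": "Path does not exist", "write_started": False}, "source_delay", True),
    Scenario("access_denied",
             {"error": "AccessDenied", "write_started": False}, "permission_error", False),
    Scenario("shuffle_spill",
             {"error": "shuffle spill to disk", "write_started": True}, "data_skew", False),
]


def stub_agent(signals: dict) -> Verdict:
    """Keyword stub standing in for the real agent."""
    if "AccessDenied" in signals["error"]:
        return Verdict("permission_error", False)
    if "Path does not exist" in signals["error"]:
        return Verdict("source_delay", True)
    return Verdict("unknown", not signals["write_started"])


report = score(SCENARIOS, stub_agent)
print(report)
```

Reporting the two accuracies separately matters because an agent can mislabel a failure and still get the retry decision right, as the stub does for the shuffle spill case, or the reverse, which is the more dangerous outcome.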
In production, you should be tracking specific metrics like “Safe-Retry Precision.” This measures, out of all the times the agent said a retry was safe, how often it actually was. If the agent says “Yes, retry” but the job then fails or creates duplicates, your precision is low, and you need to tighten your prompt rules. You also want to track the “False Confidence Rate,” which captures those dangerous moments where the agent is very sure of a wrong answer. Another key metric is the “Human Override Rate.” If your engineers are rejecting the agent’s recommendations 50% of the time, the system isn’t providing value. Ultimately, the goal is to see a measurable reduction in “Mean Triage Time.” If it used to take forty minutes to understand a Glue failure and now it takes four, you’ve built something that’s actually moving the needle for your team.
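These production metrics are simple ratios once each incident is recorded with its outcome. The sketch below is one possible formulation and makes several assumptions: the `Outcome` record is hypothetical, precision is computed over incidents where the agent said a retry was safe, and the false confidence rate is taken as the share of high-confidence answers (here, confidence of at least 0.9) whose category was wrong. Teams may reasonably choose different denominators or thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Outcome:
    said_safe_to_retry: bool
    was_safe_to_retry: bool   # ground truth established after the incident
    confidence: float
    category_correct: bool
    overridden: bool          # an engineer rejected the recommendation
    triage_minutes: float


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero when there is no data."""
    return numerator / denominator if denominator else None


def summarize(outcomes: list[Outcome], high_confidence: float = 0.9) -> dict:
    said_safe = [o for o in outcomes if o.said_safe_to_retry]
    confident = [o for o in outcomes if o.confidence >= high_confidence]
    return {
        "safe_retry_precision": ratio(sum(o.was_safe_to_retry for o in said_safe), len(said_safe)),
        "false_confidence_rate": ratio(sum(not o.category_correct for o in confident), len(confident)),
        "human_override_rate": ratio(sum(o.overridden for o in outcomes), len(outcomes)),
        "mean_triage_minutes": mean(o.triage_minutes for o in outcomes) if outcomes else None,
    }


# Made-up incident history for illustration.
history = [
    Outcome(True, True, 0.95, True, False, 4.0),
    Outcome(True, False, 0.92, False, True, 6.0),
    Outcome(False, False, 0.60, True, False, 5.0),
    Outcome(True, True, 0.85, True, False, 5.0),
]
stats = summarize(history)
print(stats)
```

With this sample, two of the three "safe" verdicts were correct, one of the two high-confidence answers was wrong, and one of four recommendations was overridden, which shows how a single bad incident can move several metrics at once.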
What is your forecast for the future of agentic data operations over the next few years?
My forecast is that we will see a significant shift from “human-in-the-loop” to “human-on-the-loop” operations, but only for teams that have mastered the fundamentals of data hygiene. In the next three to five years, I expect the “Observe-Classify-Explain-Recommend” pattern to become a standard feature in major data orchestration platforms. We will move away from raw, unparsed log streams and toward these structured incident summaries as the primary way engineers interact with failures. However, I believe the “action” boundary will remain firm for a long time. We won’t see autonomous agents fixing schemas in production anytime soon because the trust gap is still too wide and the cost of failure is too high.
Instead, we will see much tighter integration between these triage agents and “Data Contracts.” The agent will be able to say, “This failure is a violation of the contract signed by the upstream marketing team,” and it will automatically link to the specific line in the contract that was broken. We’ll also see these agents becoming more proactive—identifying “Small File Pressure” or “Data Skew” before a job actually fails, acting more like a performance consultant than just an emergency room doctor. The future isn’t about AI replacing data engineers; it’s about AI finally giving data engineers the structured, clear information they need to stop being forensic investigators and start being architects again. For the readers, my best advice is to start building your “incident context builder” today. Even without an LLM, having a script that pulls your logs, record counts, and table commits into one place will put you miles ahead of the competition.
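A first version of such an incident context builder does not need an LLM or any particular cloud SDK. The sketch below assumes three hypothetical fetcher functions supplied by the caller (for logs, record counts, and table commits) and merges their results into one JSON document. The fetchers in the example are fakes; real ones would wrap whatever log store, control table, and table metadata the team already has.

```python
from __future__ import annotations

import json
from typing import Callable


def build_incident_context(
    batch_id: str,
    fetch_logs: Callable[[str], list[str]],
    fetch_record_counts: Callable[[str], dict],
    fetch_table_commits: Callable[[str], list[dict]],
    max_log_lines: int = 50,
) -> str:
    """Collect logs, record counts, and table commits for one batch into a JSON document."""
    context: dict = {"batch_id": batch_id, "errors": {}}
    sources = {
        "log_tail": lambda: fetch_logs(batch_id)[-max_log_lines:],
        "record_counts": lambda: fetch_record_counts(batch_id),
        "table_commits": lambda: fetch_table_commits(batch_id),
    }
    for name, fetch in sources.items():
        try:
            context[name] = fetch()
        except Exception as exc:  # one failing source should not hide the others
            context[name] = None
            context["errors"][name] = str(exc)
    return json.dumps(context, indent=2, sort_keys=True)


# Fake fetchers for illustration only.
def fake_logs(batch_id: str) -> list[str]:
    return [f"line {i}" for i in range(120)]


def fake_counts(batch_id: str) -> dict:
    return {"source": 1000, "target": 750}


def failing_commits(batch_id: str) -> list[dict]:
    raise TimeoutError("catalog unavailable")


doc = json.loads(build_incident_context("batch-42", fake_logs, fake_counts, failing_commits))
print(doc["record_counts"], doc["errors"])
```

Recording which sources failed, rather than aborting, keeps the partial context usable and makes the gap itself visible to whoever, or whatever, reads the summary next.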
