In the rapidly evolving world of data engineering, few have navigated the complexities of modernizing data architectures as adeptly as Vijay Raina. As a seasoned expert in enterprise SaaS technology and software design, Vijay has been at the forefront of transitioning from cumbersome batch-processing systems to agile, real-time query models. Today, we dive into his insights on revolutionizing data-serving architectures with tools like Databricks SQL and AWS Lambda, exploring the challenges of legacy systems, the benefits of query-at-source designs, and the practical lessons learned along the way.
How did the traditional batch-load-to-RDS setup function in your experience, and what did it look like in practice?
In the old setup, we relied on a batch process to pull data from our lakehouse and warehouse sources, such as Delta Lake and Snowflake. Nightly jobs, often built on AWS Glue, would transform that data and load it into an operational data store like Amazon RDS. From there, our Lambda-based APIs fetched the data to power dashboards and front-end applications. It was straightforward at first: data was updated once a day, and users accessed it through the RDS layer. But as our platform grew, that rigidity started to show cracks, especially around delays and maintenance demands.
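The legacy serving path Vijay describes can be sketched as a Lambda handler reading batch-loaded rows out of RDS. This is a minimal illustration, not his production code: the table, columns, and tenant parameter are assumed names, and the DB-API connection is injected so the handler's shape is visible without a live database.

```python
import json

# Hypothetical table and column names for illustration.
METRICS_QUERY = (
    "SELECT metric_name, metric_value, updated_at "
    "FROM daily_metrics WHERE tenant_id = %s "
    "ORDER BY updated_at DESC LIMIT %s"
)

def build_metrics_query(tenant_id: str, limit: int = 100):
    """Return the SQL text and parameter tuple the RDS-backed API would run."""
    return METRICS_QUERY, (tenant_id, limit)

def handler(event, context, conn=None):
    """Legacy-style Lambda handler: read yesterday's batch-loaded rows from RDS.

    `conn` is a DB-API connection (e.g. from psycopg2) supplied by a runtime
    wrapper; injecting it keeps the handler testable without a database.
    """
    sql, params = build_metrics_query(event.get("tenant_id", "demo"), 50)
    if conn is None:
        # No database attached: surface the query for inspection.
        return {"statusCode": 200, "body": json.dumps({"sql": sql})}
    with conn.cursor() as cur:
        cur.execute(sql, params)
        rows = cur.fetchall()
    return {"statusCode": 200, "body": json.dumps(rows, default=str)}
```

Note the freshness ceiling built into this design: the handler can only ever return whatever the previous night's Glue job loaded.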
What were some of the biggest pain points with that batch-driven design as your platform scaled?
The issues piled up quickly. Data latency was a major problem—since updates happened only nightly, dashboards were often hours or even a full day behind the source data. Then there was the operational burden; ETL pipelines would fail, schema changes caused cascading issues, and recovery ate up engineering hours. Costs also spiraled out of control with RDS clusters and ETL compute running even when usage was low. Worst of all, we had data drift—analysts saw one version of the data in the lakehouse, while end users saw something different via the API. It just wasn’t sustainable.
Can you elaborate on how data latency in the old system impacted the end-user experience?
Absolutely. Because the data was only refreshed during nightly batch jobs, users were often looking at stale information. Imagine a dashboard meant to show real-time business metrics, but it’s reflecting yesterday’s numbers. That delay frustrated users who needed up-to-date insights for decision-making. It eroded trust in the platform, as they couldn’t rely on the data being current, especially in fast-moving environments where every hour counts.
What prompted the shift to a query-at-source architecture, and what was the overarching vision behind it?
The core goal was to cut out the middleman—namely, the persistent RDS layer and the nightly ETL cycles. We wanted to serve data directly from governed sources using an API that could query in real time. The vision was to achieve faster data delivery, reduce operational overhead, and lower costs by simplifying the architecture. Essentially, we aimed to make the lakehouse itself the serving layer, leveraging modern tools to query data where it lives rather than moving it around unnecessarily.
How does the new data flow compare to the old one in terms of structure and efficiency?
The new flow is a complete overhaul. Instead of batch jobs pushing data into RDS, we now have a stateless REST API built on AWS Lambda that connects directly to Databricks SQL via JDBC. Databricks SQL acts as the query engine, accessing curated Delta tables and federated data from sources like Snowflake. There’s no staging or lag—queries execute in real time, and results stream straight to the client. It’s down to just two core components, Lambda and Databricks SQL, which makes it far leaner and more efficient than the old multi-step process.
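The two-component flow above can be sketched in Python using the `databricks-sql-connector` client (the interview mentions JDBC; this is the equivalent Python path to a Databricks SQL warehouse). The environment-variable names, catalog path, and default table are assumptions for illustration, not the actual deployment.

```python
import json
import os

def build_select(table: str, limit: int) -> str:
    """Compose the bounded SELECT the API runs against a curated Delta table."""
    return f"SELECT * FROM {table} LIMIT {int(limit)}"

def run_query(sql_text: str):
    """Execute a query on a Databricks SQL warehouse and return all rows.

    Requires `pip install databricks-sql-connector`; the environment-variable
    names below are assumed for this sketch.
    """
    from databricks import sql as dbsql
    with dbsql.connect(
        server_hostname=os.environ["DBSQL_HOST"],
        http_path=os.environ["DBSQL_HTTP_PATH"],
        access_token=os.environ["DBSQL_TOKEN"],
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(sql_text)
            return cur.fetchall()

def handler(event, context):
    """Stateless Lambda handler: query the lakehouse at source, no RDS staging."""
    table = event.get("table", "main.curated.sales_daily")  # hypothetical table
    rows = run_query(build_select(table, event.get("limit", 100)))
    return {
        "statusCode": 200,
        "body": json.dumps([list(r) for r in rows], default=str),
    }
```

Because the handler holds no state and no staging store, every response reflects the current contents of the governed tables.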
What role does Databricks SQL play in this modern setup, and why was it the right fit?
Databricks SQL is the heart of the query-at-source model. It serves as the endpoint where queries are executed against our data sources, governed by Unity Catalog for access and security. It’s incredibly powerful for handling real-time analytical queries on large datasets, and its integration with Delta tables ensures optimized performance. We chose it because it’s fast, scalable, and secure, plus it aligns perfectly with the lakehouse paradigm, allowing us to query data directly without needing a separate database layer.
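Unity Catalog governance of this kind is typically expressed as SQL `GRANT` statements against three-level (catalog.schema.table) names. A small sketch of how such grants might be batched through the same SQL connection; the principal and table names are illustrative only.

```python
def grant_statement(privilege: str, table: str, principal: str) -> str:
    """Build a Unity Catalog GRANT for a catalog.schema.table securable.

    Example output: GRANT SELECT ON TABLE main.curated.sales_daily TO `analysts`
    """
    return f"GRANT {privilege} ON TABLE {table} TO `{principal}`"

def apply_grants(cursor, grants):
    """Run a batch of grant statements through a Databricks SQL cursor.

    `grants` is an iterable of (privilege, table, principal) tuples; returns
    the statements that were executed, for logging or audit.
    """
    statements = [grant_statement(p, t, who) for p, t, who in grants]
    for stmt in statements:
        cursor.execute(stmt)
    return statements
```

Centralizing access this way means the API inherits the same permissions analysts see in the lakehouse, which is what closes the data-drift gap described earlier.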
What were the most noticeable improvements after adopting the query-at-source model?
The impact was immediate and transformative. Dashboards started reflecting near real-time data, which was a huge win for user experience. We eliminated the 1–2 hour nightly batch window and got rid of the ETL pipelines entirely, saving 4–6 hours of maintenance per week. Infrastructure costs dropped by nearly 70% since we no longer needed persistent RDS clusters. Plus, the serverless nature of the setup meant we could scale on demand without fixed overhead. It was a game-changer across the board.
What challenges did you encounter during the transition to this new architecture?
It wasn’t a seamless switch by any means. One early hurdle was query performance on large, unoptimized datasets—sometimes slower than the pre-aggregated tables we had in RDS. High concurrency from API requests led to queuing issues, and Lambda cold starts added overhead to JDBC connections. We had to invest time in optimizations like result caching, partitioning Delta tables, and setting API guardrails to ensure efficiency. It was a learning curve, but each challenge taught us how to fine-tune the system.
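One common mitigation for the cold-start and connection-setup overhead mentioned above is to cache the warehouse connection at module level, so only the first invocation of a warm Lambda container pays the session cost. A minimal sketch, with an injectable factory so the caching behavior is testable offline; the environment-variable names are assumptions.

```python
import os

_conn = None  # cached across warm Lambda invocations in the same container

def get_connection(factory=None):
    """Return a cached Databricks SQL connection, creating it once per container.

    Reusing the module-level connection avoids paying session setup on every
    request; only cold starts open a new one. `factory` is injectable for
    testing; by default it opens a real connection.
    """
    global _conn
    if _conn is None:
        if factory is None:
            from databricks import sql as dbsql

            def factory():
                return dbsql.connect(
                    server_hostname=os.environ["DBSQL_HOST"],
                    http_path=os.environ["DBSQL_HTTP_PATH"],
                    access_token=os.environ["DBSQL_TOKEN"],
                )
        _conn = factory()
    return _conn
```

The trade-off is that a long-lived connection can go stale; production code would also need a retry-and-reconnect path around it.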
How did you address performance issues like slow queries or high concurrency in the new system?
We tackled performance with a few key strategies. Databricks SQL’s result caching was a lifesaver for repeated queries, delivering results in milliseconds. We used serverless endpoints to auto-scale with query loads and optimized Delta tables through partitioning and Z-ordering to reduce data scanning. On the API side, we implemented pagination and LIMIT clauses to keep queries efficient. For dashboards needing sub-second responses, we even built lightweight pre-aggregated tables in the lakehouse. These steps turned potential bottlenecks into manageable aspects of the system.
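The pagination and table-optimization guardrails above can be illustrated with a small helper. This sketch uses keyset pagination (a `WHERE` on the last-seen key rather than `OFFSET`), which keeps each API call scanning a bounded slice of a large Delta table; the table, columns, and the one-off `OPTIMIZE ... ZORDER BY` maintenance statement are assumed names.

```python
# One-off table maintenance, run in the warehouse rather than per request;
# Z-ordering co-locates rows on the filter columns to cut data scanned.
OPTIMIZE_STMT = "OPTIMIZE main.curated.events ZORDER BY (tenant_id, event_ts)"

def paginated_query(table: str, order_col: str, page_size: int, cursor=None) -> str:
    """Build a keyset-paginated SELECT so each API call reads one bounded page.

    `cursor` is the last value of `order_col` seen by the client; omitting it
    returns the first page. Keyset pagination stays fast where OFFSET would
    re-scan all skipped rows on every request.
    """
    where = f" WHERE {order_col} > '{cursor}'" if cursor is not None else ""
    return (
        f"SELECT * FROM {table}{where} "
        f"ORDER BY {order_col} LIMIT {int(page_size)}"
    )
```

A client would pass the last `order_col` value from each response as the next call's cursor, so even a billion-row table is served in predictable, cache-friendly pages.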
Looking ahead, what is your forecast for the evolution of data-serving architectures in the coming years?
I believe we’re just at the beginning of a broader shift toward query-at-source and lakehouse-centric models. As query engines like Databricks SQL continue to advance, they’ll become even more capable of handling live analytical workloads at massive scale. I expect serverless architectures to dominate, with tighter integrations between APIs and data platforms, reducing latency further. We’ll also see AI and machine learning play a bigger role in automating optimizations like query planning and caching. For teams still tethered to traditional databases, the push for real-time, cost-effective solutions will only grow stronger, making architectures like these the standard.
