QA’s New Role in the Age of SRE and Platform Engineering

In a world where software delivery is accelerating at a breakneck pace, the roles that build, test, and maintain our digital infrastructure are undergoing a seismic shift. We sat down with Vijay Raina, an expert in enterprise SaaS technology and software architecture, to cut through the noise surrounding the evolution of DevOps. Today, we’re exploring the rise of Platform Engineering and Site Reliability Engineering (SRE) and what this profound change means for the future of Quality Assurance. We’ll discuss why traditional QA is becoming a bottleneck, how QA can partner with SREs to champion reliability, and the critical new skills, from Infrastructure as Code validation to Chaos Engineering, that will define the next generation of quality leaders. This isn’t just about survival; it’s a roadmap for QA to transform into a strategic force in modern engineering.

We’re seeing a market shift where platform engineering and SRE roles are growing, while dedicated “DevOps engineer” titles are less common. How does this reclassification of responsibilities impact the core philosophy of DevOps, and what does it mean for team structures in practice?

It’s a fascinating and completely natural evolution. The core philosophy of DevOps—automation, collaboration, continuous delivery—is not dying; it’s actually maturing and becoming so ingrained that it’s being operationalized into more specialized roles. Think of it less as a decline and more as a graduation. The market data reflects this perfectly; while you see a reclassification of job titles, the global DevOps market is still projected to grow impressively, hitting over $15 billion by 2025. What this means in practice is that the vague, catch-all “DevOps engineer” who did a bit of everything is being replaced by a team of specialists. You have Platform Engineers who are laser-focused on building that paved road for developers—the CI/CD pipelines, the Kubernetes platforms. Then you have SREs who apply software engineering principles to make sure that road is smooth, reliable, and performant. For teams, it means you’re no longer looking for one hero but fostering a collaborative ecosystem of experts.

As engineering teams adopt cloud-native architectures, traditional manual-heavy QA roles can become bottlenecks. Beyond just speed, what specific technical complexities of microservices and ephemeral infrastructure make these older QA models obsolete? Please share an example of how a team successfully navigated this transition.

The problem goes much deeper than just speed. It’s about a fundamental mismatch between the testing model and the system architecture. In a monolithic world, you could test a single, predictable application. But in a cloud-native ecosystem, you’re dealing with dozens, sometimes hundreds, of microservices in a distributed system. A failure isn’t a simple bug; it can be a cascade of issues across services. Then you have ephemeral infrastructure, where servers and containers can be created and destroyed in minutes. A manual tester simply cannot validate an environment that is in constant flux. The complexity is immense. I saw this firsthand with a large e-commerce company that was plagued by deployment failures, even with separate DevOps and QA teams. The solution was to stop treating them as separate functions. They embedded key QA staff directly into their platform engineering team. This wasn’t about just running tests faster; it was about building quality into the infrastructure itself. The QA professionals started writing automated checks for infrastructure provisioning and code quality that ran directly in the CI/CD pipelines. The result was a stunning 40% drop in deployment failures within just six months. It proved that when QA understands and influences the platform, they prevent problems, not just find them.

The role of QA is evolving from “test executor” to “quality enabler.” In this new capacity, how can QA professionals practically collaborate with SREs to define and monitor SLOs and SLIs? Could you walk through the steps they would take using an observability tool like Prometheus?

This is the most exciting part of the transformation. It’s where QA truly becomes a strategic partner. The collaboration starts by getting a seat at the SRE’s table. QA needs to be in the room when Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are being defined. Their unique perspective on user behavior and failure modes is invaluable. From a practical standpoint, using a tool like Prometheus, the first step is for QA and SRE to jointly identify the critical SLIs—things like request latency, error rate, or system saturation. QA’s role is to design tests that specifically stress these indicators. For instance, during a performance test, QA isn’t just looking for a pass/fail. They’re watching the Prometheus dashboards in real-time. When they see a latency SLI begin to degrade as user load increases, they can pinpoint the exact moment and transaction that caused it. They can then correlate that metric with application logs and traces to give SREs a rich, actionable story: “When we run this specific checkout workflow with 500 concurrent users, the payment service latency spikes, breaching our SLO.” This transforms the conversation from “the app is slow” to a precise, data-driven diagnosis of a reliability issue.
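To make that workflow concrete, here is a minimal sketch of how a QA engineer might pull a latency SLI out of Prometheus during a load test and compare it against the agreed SLO. It uses only the standard Prometheus HTTP query API; the server address, the histogram metric name, the `service="payment"` label, and the 300ms threshold are all assumptions to be replaced with whatever your services actually expose.

```go
// slicheck.go: query a latency SLI from Prometheus during a load test and
// compare it against an SLO threshold. Metric and label names below are
// assumptions; substitute the ones your services actually emit.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// promResponse models the subset of the Prometheus /api/v1/query response we need.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [unix timestamp, value as string]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	const (
		prometheusURL = "http://prometheus:9090" // assumed Prometheus server address
		sloSeconds    = 0.300                    // agreed SLO: p95 latency under 300ms
	)

	// p95 latency of the payment service over the last 5 minutes (hypothetical metric/label names).
	query := `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="payment"}[5m])) by (le))`

	resp, err := http.Get(prometheusURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var pr promResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		panic(err)
	}
	if pr.Status != "success" || len(pr.Data.Result) == 0 {
		fmt.Println("no data returned for SLI query")
		return
	}

	// Prometheus returns the sample value as a string, e.g. "0.412".
	p95, err := strconv.ParseFloat(pr.Data.Result[0].Value[1].(string), 64)
	if err != nil {
		panic(err)
	}

	if p95 > sloSeconds {
		fmt.Printf("SLO BREACH: p95 latency %.3fs exceeds %.3fs target\n", p95, sloSeconds)
	} else {
		fmt.Printf("OK: p95 latency %.3fs within %.3fs target\n", p95, sloSeconds)
	}
}
```

A check like this can run as a post-stage in the same pipeline that drives the load test, turning the "watch the dashboard" step into an automated, repeatable gate.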

One company found that empowering its QA team to validate Infrastructure as Code scripts caught over 90% of critical misconfigurations pre-deployment. What specific tools and automated checks should QA integrate into a CI/CD pipeline to achieve this? Please provide a step-by-step example for validating a Terraform script.

That 90% figure is incredible, and it highlights how QA can prevent massive security breaches and outages before a single line of infrastructure code hits production. It’s about treating infrastructure with the same rigor as application code. A typical, effective pipeline for validating a Terraform script would have several automated stages. First, when a developer commits a .tf file, the CI pipeline automatically triggers a static analysis scan using a tool like tfsec or Checkov. This is the first line of defense, catching common misconfigurations like public S3 buckets or unencrypted databases. Next, you enforce organizational policies using a tool like Open Policy Agent (OPA). Here, QA can define custom rules, such as requiring specific tags on all AWS resources or disallowing certain machine instance types. If the Terraform plan violates these policies, the build fails immediately. Finally, for more complex validation, you can use a framework like Terratest to write actual integration tests. This step would spin up the infrastructure in an isolated sandbox account, run tests to verify its configuration—like checking if a security group has the correct ingress rules—and then tear it all down. By layering these automated checks, QA builds a robust safety net that makes infrastructure deployments dramatically safer.
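For the final stage Vijay describes, here is a minimal Terratest sketch of that integration-test step. Terratest is a Go library, so the test reads like any other Go test; the module directory, variables, and output names used below are hypothetical placeholders for your own Terraform configuration and sandbox account.

```go
// iac_validation_test.go: a sketch of the Terratest stage that provisions a
// Terraform module in an isolated sandbox, asserts on its configuration, and
// tears it down. Paths, variables, and output names are hypothetical.
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestNetworkModule(t *testing.T) {
	// Point Terratest at the module under test (hypothetical path and variables).
	opts := &terraform.Options{
		TerraformDir: "../modules/network",
		Vars: map[string]interface{}{
			"environment": "qa-sandbox",
		},
	}

	// Always tear the sandbox infrastructure down, even if assertions fail.
	defer terraform.Destroy(t, opts)

	// Provision the infrastructure in the isolated sandbox account.
	terraform.InitAndApply(t, opts)

	// Read outputs exposed by the module (hypothetical output names) and assert
	// the security group only allows the expected ingress.
	ingressPort := terraform.Output(t, opts, "allowed_ingress_port")
	assert.Equal(t, "443", ingressPort)

	sgID := terraform.Output(t, opts, "security_group_id")
	assert.NotEmpty(t, sgID)
}
```

Because it is just a Go test, the pipeline can run it with `go test` after the tfsec/Checkov and OPA stages pass, so the expensive sandbox provisioning only happens for plans that already cleared the cheaper static checks.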

Proactively testing system resilience is becoming critical. How can a QA team begin to pilot chaos engineering principles within a single project? What kind of collaboration is needed with platform or SRE teams to safely inject failures and measure the impact on application behavior and reliability?

Chaos engineering can sound intimidating, but the key is to start small and controlled. You don’t start by pulling the plug in production. A perfect pilot project for a QA team is to select a single, non-critical microservice and define a very specific experiment. The collaboration with the platform and SRE teams is absolutely essential here, as they are the guardians of the environment. The process is a true partnership. QA takes the lead on designing the experiment hypothesis: “We believe that if the product recommendation service experiences 300ms of network latency, the main application will gracefully degrade and display a default set of products without crashing.” The SRE or platform team then provides the tools and the safe environment to execute this. They might use a tool to inject that precise amount of latency into the network traffic for that specific service in a staging environment. QA’s role is then to execute their functional and performance tests during this failure event and observe the outcome. Did the application behave as expected? Did circuit breakers trip? Did alerts fire correctly? This creates a powerful feedback loop, turning theoretical resilience plans into proven, battle-tested capabilities.
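To illustrate QA's side of that experiment, here is a small sketch of the verification test that would run while the SRE or platform team injects the 300ms of latency in staging. The staging URL, the two-second response budget, and the "default-product-list" fallback marker are all assumptions standing in for whatever your application actually returns when it degrades gracefully.

```go
// chaos_degradation_test.go: QA-side verification for the latency experiment
// described above. It assumes the SRE/platform team is already injecting 300ms
// of latency into the recommendation service in staging; the endpoint URL and
// the fallback marker string are hypothetical placeholders.
package chaos

import (
	"io"
	"net/http"
	"strings"
	"testing"
	"time"
)

func TestGracefulDegradationUnderRecommendationLatency(t *testing.T) {
	// The page must still respond within a reasonable budget even while its
	// recommendation dependency is slow.
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get("https://staging.example.com/products") // assumed staging URL
	if err != nil {
		t.Fatalf("application did not respond during the failure injection: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected HTTP 200 during degradation, got %d", resp.StatusCode)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		t.Fatalf("failed to read response body: %v", err)
	}

	// Hypothesis: with recommendations slow, the page falls back to a default
	// product list instead of crashing. The marker below is a placeholder for
	// whatever your fallback response actually contains.
	if !strings.Contains(string(body), "default-product-list") {
		t.Errorf("expected fallback to default products, but the marker was not found")
	}
}
```

Pairing a test like this with the SLI check shown earlier closes the loop: the experiment either confirms the hypothesis with evidence, or produces a precise, reproducible failure report for the SRE team.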

For a QA manager noticing significant skill gaps in their team, what is the most effective first step: a comprehensive skills audit or starting a small pilot project with the SRE team? Please explain the trade-offs of your chosen approach and how it builds momentum for broader change.

This is a classic question of analysis versus action, and I strongly advocate for starting with a small pilot project. A comprehensive skills audit, while thorough, can often lead to analysis paralysis. It can feel like a top-down mandate, creating anxiety among the team as they list all the things they don’t know. You can spend weeks building spreadsheets and charts, and all that time, no real progress is made. The pilot project, on the other hand, builds immediate momentum and excitement. Partnering with the SRE team on a focused, achievable goal—like the chaos engineering experiment we just discussed—generates a quick, visible win. It shows the team the why behind the need for new skills. They get hands-on experience and see the direct impact of their work. The trade-off, of course, is that a pilot isn’t a complete strategy. But its success creates the energy and buy-in needed to justify a broader upskilling program. The skills audit then becomes a natural next step, driven by a proven need rather than an abstract requirement. It’s far more effective to say, “Look at what we achieved on that pilot; now let’s figure out how to scale those skills across the team.”

What is your forecast for the role of Quality Assurance over the next five years? Will it become a fully integrated engineering discipline, or will a specialized testing function always remain necessary?

My forecast is that it will become both, but the nature of that specialization will be radically different from what we see today. The days of the purely manual, reactive tester are numbered; that function will be almost entirely absorbed by automation and become a baseline responsibility for all engineers. However, this doesn’t mean the end of QA. In fact, I believe a highly specialized quality engineering function will become more critical than ever. As systems become more complex, distributed, and AI-driven, quality is no longer a simple checklist. We will need deep specialists—true Quality Engineers—who are experts in test architecture, reliability science, performance engineering, and security validation. They won’t just be finding bugs; they’ll be working alongside architects and SREs to design resilient, observable, and secure systems from the ground up. So, while QA will be a fully integrated part of the engineering discipline, it will thrive as a specialized field of excellence that other engineers rely on to navigate the immense complexity of modern software.
