In recent years, Site Reliability Engineering (SRE) teams have faced a rapidly evolving landscape as organizations adopt AI technologies to improve operational efficiency. Instead of reducing workload as expected, AI adoption has introduced new challenges, leading to an unexpected rise in toil and operational complexity. The Catchpoint SRE Report 2025 highlights several areas of concern, including harder performance management, a growing operational burden, and conflicting organizational priorities. As businesses navigate these complexities, it becomes crucial to understand whether SRE teams are keeping up and which strategies might help manage rising toil levels.
Rising Toil Levels Amid AI Integration
One of the significant findings from the Catchpoint SRE Report 2025 is the unexpected rise in toil, which reached a median of 20% in 2025, up from 14% the previous year. The rise is surprising because AI integration was expected to ease operational burdens by streamlining processes and reducing manual tasks. The reality has been different. AI systems require extensive maintenance, including updating models and managing GPU clusters, and they often need manual supervision to catch unpredictable errors. As a result, instead of reducing toil, AI has added new layers of complexity that SRE teams must manage.
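The jump from 14% to 20% can be read two ways: as a rise of 6 percentage points, or as a relative increase of roughly 43%. The short sketch below works through that arithmetic. It assumes toil is tracked as a share of working hours; the function names and the 40-hour week are hypothetical illustrations, not figures or methods from the report.

```python
def toil_share(toil_hours: float, total_hours: float) -> float:
    """Return toil as a fraction of total working hours."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def relative_change(previous: float, current: float) -> float:
    """Return the relative change from previous to current."""
    if previous == 0:
        raise ValueError("previous must be non-zero")
    return (current - previous) / previous


# Hypothetical 40-hour week in which 8 hours go to manual, repetitive work.
weekly_share = toil_share(toil_hours=8, total_hours=40)

# Median toil figures quoted in the article: 14% last year, 20% in 2025.
point_increase = 0.20 - 0.14
relative_increase = relative_change(0.14, 0.20)

print(f"weekly toil share: {weekly_share:.0%}")
print(f"increase: {point_increase * 100:.0f} percentage points")
print(f"relative increase: {relative_increase:.0%}")
```

Under these assumptions, a 20% median corresponds to about one working day per week spent on toil.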
The increase in toil suggests that the expectations set around AI’s capability to mitigate operational workloads may have been overly optimistic. AI systems, while powerful, are not yet at the stage where they can fully operate autonomously without human intervention. SRE teams must constantly monitor and adjust these systems to ensure they function correctly, which adds to their workload. This finding challenges the prevailing assumption that AI would help reduce toil and highlights the need for a more nuanced understanding of AI’s role in operational management. It also underscores the importance of developing robust strategies to address these challenges and mitigate the impact on SRE teams’ workloads.
Balancing Performance and Uptime
Another critical point highlighted by the Catchpoint SRE Report is the shifting perceptions of performance issues among SRE teams. A notable 53% of respondents agreed that poor performance is as detrimental as complete outages, indicating a shift in organizational priorities that places equal emphasis on performance and uptime. This shift underscores the growing importance of maintaining high performance alongside achieving near-perfect uptime. As a result, SRE teams are under pressure to ensure that systems not only remain operational but also perform optimally under varying conditions.
Additionally, 41% of SRE professionals feel pressured to prioritize release schedules over reliability. This finding highlights the difficult balance that SRE teams must strike between achieving high release velocity and maintaining system stability. Fast release cycles are crucial for organizations to stay competitive and relevant, but they can also introduce new risks and potential for errors. The challenge for SRE teams is to manage these competing priorities effectively and find ways to maintain reliability without compromising on delivery schedules. This requires implementing robust practices that ensure both performance and reliability are maintained.
Toolchain Efficiency and Observability
The Catchpoint SRE Report 2025 also sheds light on toolchain and observability practices within organizations. Despite using between two and five monitoring tools, many organizations still face numerous incidents. According to the report, 40% of respondents managed one to five incidents in the past 30 days. This raises concerns about whether current toolchains are efficient and provide sufficient observability. The report suggests that these tools might be generating excessive, unactionable telemetry, resulting in information overload that hinders effective incident management.
Efficient observability instrumentation is crucial for handling the complexities of modern infrastructure. The persistence of incidents in organizations that run multiple monitoring tools points to inefficiencies and gaps in current systems. There is a need for better integration of Digital Experience Monitoring (DEM) and Internet Performance Monitoring (IPM) to gain holistic visibility into the user experience. By improving observability, organizations can better identify and address issues before they escalate into significant incidents. Making toolchains more efficient and their output actionable will be key to reducing the operational burden on SRE teams and improving overall system reliability.
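One way a team might check whether its monitoring tools produce actionable signals is to measure, per tool, the share of alerts that led to any follow-up. The following is a rough sketch under stated assumptions: the `Alert` record, the tool names, and the sample data are all hypothetical and do not come from the report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    tool: str       # hypothetical monitoring tool name
    actioned: bool  # True if a responder took any follow-up action


def actionable_ratio_by_tool(alerts: list[Alert]) -> dict[str, float]:
    """Return, per tool, the fraction of alerts that were acted on."""
    totals: dict[str, int] = defaultdict(int)
    actioned: dict[str, int] = defaultdict(int)
    for alert in alerts:
        totals[alert.tool] += 1
        if alert.actioned:
            actioned[alert.tool] += 1
    return {tool: actioned[tool] / totals[tool] for tool in totals}


# Hypothetical sample data, not taken from the report.
sample = [
    Alert("synthetic-checks", True),
    Alert("synthetic-checks", True),
    Alert("infra-metrics", False),
    Alert("infra-metrics", False),
    Alert("infra-metrics", True),
    Alert("infra-metrics", False),
]

ratios = actionable_ratio_by_tool(sample)
for tool, ratio in sorted(ratios.items()):
    print(f"{tool}: {ratio:.0%} of alerts actioned")
```

A tool with a persistently low ratio is a candidate for tuning or consolidation, though what counts as "actioned" is a judgment each team would have to define for itself.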
Strategic Approaches to Mitigate Challenges
AI adoption has not reduced the workload of SRE teams as anticipated; it has added toil and operational complexity. The Catchpoint SRE Report 2025 points to three areas of concern: harder performance management, a growing operational burden, and conflicting organizational priorities. Managing the rise in toil starts with understanding how well SRE teams are coping with these pressures and which practices actually help. Whether teams can adapt and put suitable practices in place will determine if they can stay effective and keep contributing to their organizations' success as AI adoption continues.