How Did Cloudflare Scale Security Insights by 10x?

Jun 16, 2026

Grace Morain, Digital Transformation Consultant

Maintaining the integrity of millions of online properties requires more than just reactive measures; it demands a proactive infrastructure capable of identifying vulnerabilities before they are exploited. By early 2026, the existing Security Insights tool reached a critical performance threshold where its fourteen-day audit cycle proved insufficient against modern, high-velocity threats. In an environment where a single misconfiguration can lead to a catastrophic data breach within minutes, the engineering teams faced the daunting task of scaling their scanning throughput by a factor of ten. The objective was clear: transition from a processing rate of ten scans per second to a robust hundred scans per second. This leap required more than just incremental hardware upgrades; it necessitated a fundamental reimagining of the data pipeline and scheduling logic. The original system architecture, while initially revolutionary, had begun to buckle under the weight of an expanding user base, resulting in massive backlogs and frequent process crashes that threatened the reliability of automated security recommendations. To address these systemic failures, engineers embarked on a multi-layered optimization project that touched everything from message queuing and concurrent processing to database ingestion and geographic latency reduction.

Overcoming Architectural Bottlenecks

Enhancing Data Streams: Concurrent Processing

The initial bottleneck was identified within the messaging layer, where the sequential nature of Apache Kafka partitions created a significant hurdle for high-speed scanning operations. In the legacy setup, each checker service consumed messages one by one, meaning that a single slow scan involving a complex network configuration would effectively block the entire partition. This “head-of-line blocking” meant that even if thousands of simple scans were ready for processing, they remained trapped behind a single resource-intensive task. To rectify this, the engineering team redesigned the Go-based microservices to leverage concurrent processing models. By implementing sophisticated worker pools within each service instance, the system began to handle multiple scan requests simultaneously from a single partition. This transition allowed the infrastructure to maximize CPU utilization and maintain a steady flow of data, ensuring that the delay of one task no longer crippled the performance of the entire pipeline.

Building on the foundation of concurrency, the team had to manage the increased memory pressure that came with parallelizing hundreds of simultaneous scans. In the new model, each goroutine required its own set of resources to track the state of a scan and communicate with the broader network. This necessitated a fine-tuned approach to resource allocation, where the system would dynamically adjust the number of active workers based on the health of the underlying hardware. The result was a significantly more resilient message consumption pattern that could absorb temporary spikes in traffic without crashing. By decoupling the message ingestion from the actual scan execution, the architecture achieved a level of local throughput that was previously impossible. This fundamental change in how data moved through the checkers provided the necessary headroom to reach the target of one hundred scans per second, effectively turning a formerly rigid pipeline into a flexible and highly responsive stream of security intelligence.

Managing Workload Diversity: Lane Prioritization

Another major challenge involved the sheer diversity of the accounts being scanned, as the workload required for a small personal blog is vastly different from that of a global enterprise with hundreds of thousands of DNS records. The original “one-size-fits-all” approach meant that massive accounts were mixed in with smaller ones, often leading to situations where a single large entity would exhaust the available processing power for an extended period. To resolve this, the engineers implemented a dual-lane architecture that separated tasks based on their predicted complexity and resource consumption. By analyzing the scale of an account’s infrastructure before a scan began, the system could intelligently shunt massive datasets into a dedicated “slow lane.” This ensured that the “fast lane” remained clear for the vast majority of users, who could then receive their security updates and insights in a matter of seconds rather than waiting for hours behind an enterprise-scale audit.

This lane-based prioritization strategy required a sophisticated pre-processing layer capable of estimating the weight of a scan before it ever reached the Kafka queue. The team developed a metadata-driven analyzer that looked at historical scan data and current record counts to assign a priority score to each task. This scoring system not only improved the speed of the fast lane but also allowed for better resource allocation in the slow lane, where heavier compute instances could be deployed specifically for large-scale operations. By isolating these different workload types, the system eliminated the “noisy neighbor” effect that had previously plagued the platform. This modernization ensured that the growth of large enterprise customers would not negatively impact the experience of the millions of smaller users who rely on the platform for their daily security posture. The tiered approach effectively balanced the needs of a diverse user base while maintaining a high overall system velocity.

Optimizing the Storage and Network Layers

Minimizing Database Load: Efficient Ingest

As the scanning throughput increased, the persistence layer became the next primary point of failure, specifically the PostgreSQL database used to store millions of individual security findings. In the original architecture, each finding was inserted into the database as a single transaction, a method that worked well at lower volumes but became catastrophic as the scan rate climbed toward one hundred per second. The sheer number of round trips between the API and the database engine created an unsustainable overhead, leading to lock contention and high CPU wait times. To solve this, the engineering team moved away from individual inserts in favor of a hybrid bulk-loading strategy. This new approach grouped findings into larger batches before attempting to write them to the disk, drastically reducing the number of network calls and transaction headers required to maintain the security log.

The technical implementation of this batching strategy utilized two primary PostgreSQL features: the UNNEST function for standard-sized batches and the COPY command for exceptionally large datasets. For the majority of scan results, the system would package multiple findings into arrays and send them in a single call, allowing the database to process them efficiently in memory before committing them to storage. For the massive accounts in the slow lane, the system utilized the COPY command to stream data directly into the tables, bypassing much of the traditional SQL overhead. This transformation turned multi-minute ingestion tasks into operations that completed in mere seconds. By optimizing the storage engine to handle bulk data, the team not only stabilized the database but also reduced the total cost of ownership by decreasing the amount of hardware required to maintain the same level of performance. This efficiency gain was critical for sustaining the 10x increase in scanning volume.
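The hybrid strategy can be sketched in Go as follows. The `findings` table, its columns, the `Finding` type, and the `copyThreshold` cutoff are all hypothetical; the article does not give the real schema or the batch size at which the system switches to COPY. The sketch only prepares the data and picks a strategy, without opening a database connection.

```go
package main

import "fmt"

// Finding is a hypothetical row in the security findings table.
type Finding struct {
	ZoneID   int64
	Kind     string
	Severity string
}

// copyThreshold is an assumed cutoff above which COPY is preferred.
const copyThreshold = 5000

// unnestInsert writes a whole batch in one round trip: each parameter is
// an array holding one column, and UNNEST expands the arrays into rows.
const unnestInsert = `
INSERT INTO findings (zone_id, kind, severity)
SELECT * FROM UNNEST($1::bigint[], $2::text[], $3::text[])`

// Columns pivots row-oriented findings into one slice per column, which
// is the shape the UNNEST statement above expects.
func Columns(fs []Finding) (zones []int64, kinds, severities []string) {
	for _, f := range fs {
		zones = append(zones, f.ZoneID)
		kinds = append(kinds, f.Kind)
		severities = append(severities, f.Severity)
	}
	return zones, kinds, severities
}

// Strategy picks the ingest path for a batch of n findings.
func Strategy(n int) string {
	if n > copyThreshold {
		return "COPY"
	}
	return "UNNEST"
}

func main() {
	batch := []Finding{
		{ZoneID: 1, Kind: "dnssec_disabled", Severity: "moderate"},
		{ZoneID: 1, Kind: "tls_outdated", Severity: "high"},
		{ZoneID: 2, Kind: "open_redirect", Severity: "low"},
	}
	zones, kinds, severities := Columns(batch)
	fmt.Println(Strategy(len(batch)), len(zones), len(kinds), len(severities))
	fmt.Println(unnestInsert)
}
```

How the slices are bound to the array parameters depends on the PostgreSQL driver in use, so that step is deliberately omitted.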

Eliminating Geographic Latency: API Calls

Geography presented its own set of challenges, particularly the fifty-millisecond latency between the data centers located in Amsterdam and the primary database clusters in Portland. While fifty milliseconds may seem insignificant in a vacuum, it became a major bottleneck when scaled across thousands of API calls per second. The checkers in the European region were forced to hold database connections open for much longer than those in North America, leading to rapid exhaustion of the connection pools and increased error rates for European users. To eliminate this geographic paradox, the team shifted from an active-active API model to an active-passive configuration. This change ensured that the active API instance responsible for writing to the database was always located in the same geographic region as the database itself, effectively eliminating the cross-continental round trips during critical write operations.

This strategic shift required a reconfiguration of the global traffic manager to route all write-heavy security insight traffic to the region hosting the primary database. While this introduced a single point of entry for security updates, the reduction in latency and the subsequent stability of the connection pools far outweighed the benefits of a distributed write model that was hampered by physical distance. The transition allowed the system to maintain a consistent processing speed regardless of where the initial scan was triggered. By ensuring that the API and database were physically close, the team achieved a much more predictable performance profile, which was essential for meeting the strict timing requirements of the updated scanning cycles. This geographic optimization was a vital component in creating a truly global security infrastructure that could operate at scale without being held back by the speed of light across fiber optic cables.

Modernizing Scheduler Logic for Stability

Flattening Traffic Peaks: Intelligent Scheduling

The original scheduling logic was another source of systemic instability, as it tended to trigger massive bursts of activity based on the exact time an account was first created. This led to “thundering herd” scenarios where millions of scans would attempt to start simultaneously every Monday morning, overwhelming the infrastructure and creating a backlog that took days to clear. To fix this, the engineering team moved to a zone-independent scheduling model that introduced a randomized “jitter” into the timing of each scan. By slightly shifting the start time of each audit, the team was able to spread the workload evenly across the entire week. This transformation turned an unpredictable and spiky traffic pattern into a smooth, manageable flow of tasks, allowing the system to operate at peak efficiency without the risk of sudden saturation.

In addition to randomized jitter, the team introduced an adaptive rate limiter that automatically adjusts the scheduling speed based on the current system load and the total number of active users. The limiter recalculates the allowed throughput every thirty minutes, taking into account the specific requirements of different plan types, such as the need for more frequent audits for enterprise users. By automating this process, the engineering team eliminated the need for manual intervention during periods of rapid growth. If a million new users were to join the platform in a single day, the adaptive rate limiter would automatically recalibrate the schedule to ensure that the existing quality of service is maintained for everyone. This intelligent scheduling layer provided the final piece of the puzzle, ensuring that the 10x increase in capacity was matched by comparable gains in the stability and predictability of the entire security insights ecosystem.

Scaling System Performance: User Features

The culmination of these extensive engineering efforts resulted in a system that now consistently sustains over 120 scans per second, a figure that significantly exceeds the project’s initial goals. This massive increase in throughput has directly translated into tangible benefits for the end-user, with free accounts now receiving a full security audit every week and enterprise users benefiting from daily, high-depth scans. This higher frequency of monitoring ensures that vulnerabilities are caught and remediated much faster than was ever possible under the old architecture. Furthermore, the newfound stability of the backend has paved the way for the introduction of on-demand scanning. This feature allows administrators to trigger a manual audit of their entire infrastructure at any time, providing instant verification of their security posture after making significant configuration changes.

The architectural improvements established a new standard for how security services are delivered at scale, demonstrating that even the most complex bottlenecks can be overcome with a combination of concurrent processing and efficient data management. Engineering teams looking to replicate this success should focus on isolating heavy workloads and minimizing the overhead of high-frequency database interactions. The transition from an active-active to an active-passive API model also highlighted the ongoing importance of physical proximity in a world where every millisecond counts toward system reliability. By moving toward a more automated, adaptive scheduling model, the platform ensured that it could grow indefinitely without sacrificing the performance that millions of websites depend on for their protection. These advancements not only solved the immediate performance crisis but also created a foundation for future security features that will continue to evolve as the threat landscape changes.