How Did Superhuman Scale Real-Time AI to 200,000 QPS?

The pursuit of instantaneous artificial intelligence at a global scale has forced modern engineering teams to rethink the fundamental relationship between software responsiveness and underlying hardware capacity. For a platform like Superhuman, which facilitates communication for millions of daily users, the integration of generative features like grammar correction and tone adjustment is not merely a luxury but a core expectation of the user experience. To maintain this experience during peak periods, the infrastructure must handle a staggering 200,000 queries per second (QPS) without allowing end-to-end latency to exceed one second at the P99 percentile. This level of throughput requires a specialized architecture that transcends traditional cloud deployments, necessitating a deep technical partnership with an infrastructure provider capable of co-engineering solutions at the edge of possibility. The shift from managing individual server instances to utilizing a highly optimized, managed inference platform has redefined how productivity tools leverage large language models in real-time environments.

Transitioning From Manual Orchestration to Intelligent Managed Platforms

Challenges of Self-Managed Infrastructure and Scaling

In the early stages of their artificial intelligence journey, Superhuman relied on a self-managed serving stack built upon vLLM, an open-source library designed for efficient large language model inference. While this “Do-It-Yourself” approach provided the team with initial control and flexibility, it quickly revealed significant operational limitations as the user base expanded toward 40 million daily active users. The primary workload centered on a sophisticated, custom-built model dedicated to grammatical error correction, which demanded constant updates and fine-tuning. Maintaining this infrastructure required a lean engineering team to manage everything from hardware provisioning to the complex configurations of Kubernetes clusters. These manual tasks created a bottleneck, as every new model iteration necessitated months of performance tuning on L40S GPUs to ensure that the system could handle the increasing traffic without crashing or slowing down to unacceptable levels.

The move toward a more robust solution became inevitable when the operational burden began to detract from the company’s ability to innovate on the actual AI models. Capacity planning and the management of autoscaling groups became a full-time endeavor, yet the system remained vulnerable to sudden spikes in demand that are typical of productivity software. When tens of thousands of users simultaneously begin their workday, the surge in requests can overwhelm even the most carefully tuned manual setups. The engineering team realized that to sustain “four nines” of reliability—99.99% uptime—they needed to move away from the high-maintenance DIY model. The objective was to find a partner that could provide a managed platform capable of delivering extreme performance while allowing the internal team to focus exclusively on model quality and user-facing features, rather than the underlying plumbing of the cloud.

Strategic Collaboration for Seamless Model Serving

The transition to a co-engineered platform with Databricks was driven by the need for strict Service Level Agreements (SLAs) that guaranteed both latency and reliability at a scale few companies ever reach. This migration was far more than a simple hand-off of infrastructure; it was a collaborative effort to re-engineer the serving stack for the specific requirements of high-frequency communication tools. The partnership focused on creating a seamless pipeline where models could be trained, evaluated, and deployed within a single ecosystem. By moving to a managed model serving environment, Superhuman could leverage advanced hardware, including H100 GPUs, without having to manage the low-level complexities of driver updates, CUDA versions, or hardware-specific optimizations. This allowed for a much faster deployment cycle, reducing the time from model completion to production from several months to just a few days.

During this integration process, the focus remained on ensuring that the move to a managed service did not result in a “black box” environment where performance was opaque. Instead, the collaboration allowed for deep visibility into the inference process, ensuring that the grammatical error correction models maintained their high standards of accuracy and speed. The teams established clear Service Level Objectives (SLOs) that targeted sub-second P99 latency even at peak loads of 200,000 QPS. This strategic shift enabled the platform to support a more diverse range of AI tasks, from simple text completion to complex stylistic transformations, all while maintaining a consistent and predictable performance profile. The result was a modernized infrastructure that could scale elastically with user demand, providing a stable foundation for the next generation of productivity tools without the overhead of manual cluster management.

Engineering High-Performance Infrastructure for Massive Workloads

Advanced Traffic Distribution and Load Balancing Strategies

Managing 200,000 queries per second requires an orchestration layer that is significantly more sophisticated than standard cloud-native load balancers. Traditional Kubernetes round-robin approaches often fail at this scale because they do not account for the varying processing times of different requests, which can lead to “hotspots” where individual pods become overwhelmed while others remain underutilized. To solve this, the engineering teams implemented a “power of two choices” algorithm within an Endpoint Discovery Service (EDS). This lightweight control plane constantly monitors the state of the fleet and, when a new request arrives, samples two candidate pods to determine which one has fewer active requests. By routing the traffic to the less-burdened node, the system ensures an exceptionally even distribution of work, which is critical for preventing the tail-latency spikes that frustrate users during high-concurrency periods.
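The core of the "power of two choices" idea fits in a few lines. The sketch below is illustrative only; the Pod and PowerOfTwoBalancer names are stand-ins for whatever the Endpoint Discovery Service actually tracks, not the production implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Pod:
    """A serving replica tracked by the endpoint discovery service."""
    name: str
    active_requests: int = 0

class PowerOfTwoBalancer:
    """Sample two random pods and route to the one with fewer in-flight requests."""

    def __init__(self, pods: list[Pod]):
        self.pods = pods

    def pick(self) -> Pod:
        a, b = random.sample(self.pods, 2)
        return a if a.active_requests <= b.active_requests else b

    def dispatch(self, handler):
        pod = self.pick()
        pod.active_requests += 1          # track in-flight work on the chosen pod
        try:
            return handler(pod)
        finally:
            pod.active_requests -= 1

# Route a handful of requests across four pods.
pods = [Pod(f"pod-{i}") for i in range(4)]
balancer = PowerOfTwoBalancer(pods)
for _ in range(10):
    balancer.dispatch(lambda pod: print(f"served by {pod.name}"))
```

Because each routing decision only compares two candidates, the balancer stays cheap even across a very large fleet, while still avoiding the hotspots that pure random or round-robin assignment creates.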

In addition to intelligent routing, the system utilizes an asymmetric dynamic autoscaling strategy to balance cost-efficiency with extreme responsiveness. In the fast-paced world of digital communication, traffic patterns follow predictable but aggressive diurnal cycles, with massive ramps in demand occurring as different global time zones begin their workdays. The autoscaling logic is designed to be highly aggressive during “scale-up” events, adding new GPU capacity the moment a surge is detected to ensure that the system never falls behind the incoming request volume. Conversely, the “scale-down” process is deliberately conservative to prevent a phenomenon known as “flapping,” where pods are rapidly added and removed in quick succession. This stability is vital because the process of initializing a new inference pod is resource-intensive; avoiding unnecessary restarts keeps the overall system performance smooth and predictable for the end-user.
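The asymmetry can be expressed as a small piece of control logic: scale up to the full desired count immediately, but shed capacity one step at a time behind a cooldown. The thresholds and cooldown below are assumed, illustrative values, not Superhuman's production settings.

```python
import math
import time

class AsymmetricAutoscaler:
    """Scale up immediately on a surge; scale down slowly to avoid flapping."""

    def __init__(self, min_replicas=4, max_replicas=256,
                 target_rps_per_replica=1_000, scale_down_cooldown_s=600):
        self.replicas = min_replicas
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.target = target_rps_per_replica
        self.cooldown = scale_down_cooldown_s
        self.last_scale_down = 0.0

    def evaluate(self, observed_rps: float, now: float | None = None) -> int:
        now = time.time() if now is None else now
        desired = math.ceil(observed_rps / self.target)
        desired = max(self.min_replicas, min(self.max_replicas, desired))
        if desired > self.replicas:
            self.replicas = desired                      # aggressive: jump straight up
        elif desired < self.replicas and now - self.last_scale_down >= self.cooldown:
            self.replicas -= 1                           # conservative: step down slowly
            self.last_scale_down = now
        return self.replicas

scaler = AsymmetricAutoscaler()
print(scaler.evaluate(180_000))   # morning surge: scales up at once
print(scaler.evaluate(30_000))    # traffic fades: sheds at most one replica per cooldown
```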

Accelerating Deployment With Lazy Loading and Image Metadata

One of the most persistent bottlenecks in scaling large-scale artificial intelligence clusters is the time required to boot up new server instances and load the necessary software environments. Traditionally, pulling a container image that includes heavy AI libraries and model weights can take several minutes, a delay that is unacceptable when trying to react to a sudden spike in traffic. To overcome this, the platform adopted advanced image acceleration technology that enables “lazy loading.” This process involves converting standard container images into a block-device-based format that the system can read incrementally. Instead of waiting for the entire multi-gigabyte image to download before starting, the container runtime fetches only the metadata required to mount the container's root filesystem, allowing the application to begin execution in a fraction of the usual time.

As the application starts running, the system dynamically retrieves only the data blocks necessary for the current operations from the registry, typically in 4MB sectors. These blocks are then cached locally to ensure that subsequent requests for the same data are handled with near-zero latency. This innovation reduced the total time to bring a new inference pod online from several minutes to just a few seconds, transforming how the infrastructure responds to volatility. For a company handling 200,000 QPS, this means that the capacity of the entire cluster can be doubled or tripled almost instantaneously in response to real-time demand. This capability not only improves the reliability of the service during unexpected traffic events but also allows for much tighter cost control, as the company no longer needs to maintain a large, expensive buffer of idle servers “just in case” a surge occurs.
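Conceptually, the read path resembles the sketch below: blocks are pulled from the registry only on first access and cached locally thereafter. The fetch_block callable is a placeholder for a registry range-read, and the class is an assumption-laden illustration rather than the actual image-acceleration implementation.

```python
BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB sectors, matching the description above

class LazyImageReader:
    """Conceptual sketch: fetch image blocks on demand and cache them locally."""

    def __init__(self, fetch_block):
        self.fetch_block = fetch_block       # callable: block_index -> bytes
        self.cache: dict[int, bytes] = {}

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        end = offset + length
        while offset < end:
            idx = offset // BLOCK_SIZE
            if idx not in self.cache:        # cache miss: pull one 4 MB block
                self.cache[idx] = self.fetch_block(idx)
            block = self.cache[idx]
            start = offset % BLOCK_SIZE
            take = min(end - offset, BLOCK_SIZE - start)
            out += block[start:start + take]
            offset += take
        return bytes(out)

# Example: serve reads from an in-memory "registry" of zero-filled blocks.
reader = LazyImageReader(lambda idx: bytes(BLOCK_SIZE))
first_bytes = reader.read(0, 512)            # only block 0 is fetched and cached
```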

Maximizing Computational Efficiency at the Runtime Layer

Hardware Optimization Through Precision and Quantization

While infrastructure-level scaling provides the necessary breadth for handling massive traffic, maximizing the efficiency of each individual GPU is essential for maintaining a sustainable cost-to-performance ratio. The teams achieved a significant breakthrough by implementing FP8 quantization on H100 GPUs, a technique that converts the model’s mathematical weights into a lower-precision 8-bit floating-point format. This allows the hardware to perform calculations much faster and reduces the memory footprint of the model, enabling higher batch sizes and greater throughput per node. However, this was not a blanket application of lower precision; it was a surgical optimization. Through rigorous testing, it was determined that quantizing the Multi-Layer Perceptron (MLP) projections and attention mechanisms yielded the best performance gains, increasing throughput by approximately 30% without compromising the linguistic accuracy of the grammar correction features.

The precision of this quantization was further enhanced by using per-channel scaling rather than per-tensor scaling. By calculating unique scale factors for each output channel, the system preserved the dynamic range of the model’s activations, which is crucial for tasks that require high levels of nuance and correctness. The team deliberately avoided quantizing the KV-cache, as their evaluations showed that doing so introduced unacceptable regressions in the quality of the generated text for this specific use case. This balanced approach ensured that the performance gains were “free” in terms of user experience, providing the speed of a smaller model with the intelligence and accuracy of a much larger one. By optimizing the model to run more efficiently on the latest H100 hardware, the platform was able to increase the maximum requests per second (RPS) per replica from 750 to over 1,200, a massive leap in efficiency.
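The mechanics of per-channel scaling can be shown with a short PyTorch sketch, assuming a recent build that exposes torch.float8_e4m3fn. It illustrates one scale factor per output channel on a weight matrix; it is not the production quantization pipeline, which would also handle calibration and fused kernels.

```python
import torch

def quantize_fp8_per_channel(weight: torch.Tensor):
    """Quantize a [out_features, in_features] weight matrix to FP8 (e4m3),
    with one scale factor per output channel instead of one per tensor."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448 for e4m3
    scale = weight.abs().amax(dim=1, keepdim=True) / fp8_max  # per-row scale
    scale = scale.clamp(min=1e-12)                            # guard against zero rows
    q = (weight / scale).to(torch.float8_e4m3fn)              # quantized weights
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_fp8_per_channel(w)
print("max abs error:", (dequantize(q, s) - w).abs().max().item())
```

Because each row keeps its own scale, an unusually large weight in one channel no longer forces every other channel to sacrifice resolution, which is why per-channel scaling preserves quality better than a single per-tensor scale.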

Eliminating Bottlenecks Through Low-Level Software Refinement

An often-overlooked challenge in high-speed AI inference is that as the GPU becomes faster, the CPU can become the primary bottleneck in the system. If the GPU completes its forward pass through the model faster than the CPU can prepare the next batch of input data, the most expensive part of the hardware sits idle. To address this, the engineering team introduced a multiprocessing RPC server that allows multiple CPU processes to work in parallel to prepare and dispatch tasks. This ensured that the GPU was constantly saturated with work, leading to a 20% increase in overall throughput. By decoupling the data preparation from the inference execution, the system could maintain a high “duty cycle” for the hardware, ensuring that every dollar spent on premium GPU compute was utilized to its fullest extent.
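The shape of that pattern is roughly the following: a pool of CPU workers prepares batches in parallel so the inference loop never waits on preprocessing. The prepare_batch and run_inference functions here are placeholder stubs, not the actual RPC server described above.

```python
import multiprocessing as mp

def prepare_batch(raw_requests):
    """Stand-in for CPU-heavy preprocessing: tokenization, padding, batching."""
    return [r.upper() for r in raw_requests]

def run_inference(batch):
    """Stand-in for the GPU forward pass."""
    return [f"corrected:{x}" for x in batch]

if __name__ == "__main__":
    incoming = [[f"req-{i}-{j}" for j in range(4)] for i in range(8)]
    # CPU workers prepare upcoming batches in parallel so the (simulated)
    # inference loop below is never starved waiting on preprocessing.
    with mp.Pool(processes=4) as pool:
        for batch in pool.imap(prepare_batch, incoming):
            print(run_inference(batch)[0])
```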

Further refinements were made to the software stack to minimize the overhead associated with the Python programming language, which is commonly used for AI development but can be slow for high-concurrency operations. The engineers replaced critical Python-based tensor operations with specialized C++ calls and implemented an asynchronous scheduling model. This architecture allows the CPU to handle the post-processing of one batch of results while the GPU simultaneously begins the forward pass for the next batch. This overlap of duties eliminates the sequential “wait time” that often plagues inference pipelines, shaving valuable milliseconds off the total response time. Collectively, these low-level optimizations and infrastructure improvements allowed the platform to support 200,000 QPS with ease, providing a blueprint for how modern organizations can scale real-time AI to meet the demands of a global user base.
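In spirit, the overlap works like the sketch below, where a worker thread stands in for asynchronous GPU execution: the next batch is launched before the previous batch's results are post-processed, so the two phases run concurrently. Names and timings are illustrative assumptions, not the production scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def forward_pass(batch_id: int) -> list[str]:
    """Stand-in for the GPU forward pass; real code would launch CUDA work
    that runs asynchronously with respect to the host CPU."""
    time.sleep(0.05)                                    # simulated GPU time
    return [f"token-{batch_id}-{i}" for i in range(4)]

def post_process(outputs: list[str]) -> str:
    """CPU-side detokenization and formatting of a finished batch."""
    return ",".join(outputs)

def serve(num_batches: int = 5) -> None:
    with ThreadPoolExecutor(max_workers=1) as gpu:
        future = gpu.submit(forward_pass, 0)            # launch batch 0
        for batch_id in range(1, num_batches):
            done = future.result()                      # collect the in-flight batch
            future = gpu.submit(forward_pass, batch_id) # launch the next batch...
            print(post_process(done))                   # ...while post-processing this one
        print(post_process(future.result()))            # drain the final batch

serve()
```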

Achieving Sustainable Scale in the Intelligence Era

The successful deployment of an inference platform capable of handling 200,000 QPS marks a fundamental shift in the maturity of real-time artificial intelligence. This achievement demonstrated that the primary challenge for AI-driven companies has moved beyond model training into the realm of complex, high-scale engineering and infrastructure optimization. By successfully increasing per-pod throughput by 60% and reducing pod startup times from minutes to seconds, the collaboration proved that massive scale does not have to come at the expense of latency or model quality. The implementation of FP8 quantization and advanced load balancing showed that when hardware and software are tuned in unison, the resulting efficiencies can significantly lower the operational costs of deploying large language models. This project established a new standard for how productivity platforms can integrate AI into the daily workflows of millions without the risk of system instability.

Moving forward, the focus for engineering teams should shift toward building deep technical partnerships that allow for this level of co-innovation. The era of simply renting raw compute is being replaced by a model where the infrastructure provider and the application developer work together to squeeze every possible millisecond of performance out of the stack. Organizations looking to replicate this success should prioritize the elimination of CPU bottlenecks and the adoption of lazy-loading container technologies to ensure they can react to the volatility of modern internet traffic. As AI becomes even more deeply embedded in communication and collaboration tools, the ability to serve these models at extreme scale will be the primary differentiator between services that feel “magical” and those that feel sluggish. The playbook developed here provides a clear path for any organization aiming to push the boundaries of what is possible in the age of real-time intelligence.
