Vijay Raina is a preeminent expert in enterprise SaaS technology and systems architecture, with a specialized focus on the intersection of infrastructure and high-performance software design. As organizations rush to integrate Large Language Models (LLMs) into their production stacks, Vijay provides the critical bridge between machine learning theory and the harsh realities of systems engineering. His approach moves beyond simple prompt tuning, treating LLM serving as a complex problem of applied mathematics, queueing theory, and load management. In this discussion, he explores the mechanics of Little’s Law, the nuances of dynamic batching, and the strategies required to maintain rigorous service level objectives (SLOs) under the pressure of bursty, unpredictable traffic.
The following conversation delves into the quantitative frameworks necessary for scaling inference, managing tail latency, and establishing robust admission control policies.
How do you use Little’s Law to determine the average time a request spends in the system? When utilization approaches saturation, what specific indicators signal that queues are growing superlinearly, and how do you calculate the required number of workers to maintain a safe headroom?
Little’s Law provides an incredibly elegant foundation for our SLO meetings because it dictates that the average number of jobs in the system, L, is equal to the arrival rate, λ, multiplied by the average time in the system, W. When we monitor our infrastructure, if we see the number of concurrent jobs rising while our arrival rate remains steady, we know immediately that our latency—the time a request spends waiting and being processed—is increasing. As our utilization rate, represented by ρ, inches toward 1.0, the system hits a tipping point where even a tiny increase in traffic causes the queue to explode superlinearly, turning a 900 ms response into a 6-second delay. To prevent this, I use a practical sizing formula where the number of workers, k, is the product of the arrival rate and average service time, λS, divided by a target utilization rate and rounded up, with the target typically between 0.4 and 0.7. For example, if we have a peak arrival rate of 120 requests per second and an average service time of 0.18 seconds, targeting a 0.6 utilization rate tells us we need exactly 36 workers to maintain that essential headroom.
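That sizing arithmetic can be sketched in a few lines. The function name and the exact-fraction handling are illustrative choices, not part of any serving stack; fractions simply keep a boundary case like 36.0 from being nudged up to 37 by floating-point noise.

```python
import math
from fractions import Fraction

def workers_needed(arrival_rate, service_time, target_util):
    """k = ceil(lambda * S / rho_target), the Little's Law sizing bound.

    Exact decimal arithmetic avoids a float artifact rounding an
    exact result like 36.0 up to 37.
    """
    x = (Fraction(str(arrival_rate)) * Fraction(str(service_time))
         / Fraction(str(target_util)))
    return math.ceil(x)

# Peak of 120 req/s, 0.18 s mean service time, 0.6 target utilization.
print(workers_needed(120, 0.18, 0.6))  # 36
```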
LLM inference involves highly variable service times due to fluctuating prompt lengths and cache hit rates. How does this variability complicate capacity planning compared to traditional web serving, and what specific metrics should be tracked to distinguish between prefill and decode performance?
Unlike traditional web serving where requests are relatively uniform, LLM inference is a volatile environment where service time variability is the primary enemy of stability. Because prompt lengths, output lengths, and KV cache hit rates vary wildly from one request to the next, we cannot rely on mean latency; we must instrument our per-request compute time with a sharp focus on the split between prefill and decode phases. Prefill often scales better with batching, whereas the iterative nature of decoding can become a bottleneck as the sequence grows. To navigate this, we track prompt tokens and output tokens as primary metrics to understand the weight of each job. This granular telemetry allows us to see when a specific distribution of long-form generations is starting to eat into our capacity, enabling us to move beyond “average” predictions and into safe, mathematically defensible bounds.
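A minimal way to instrument that split, assuming the serving layer already emits per-request prefill and decode timings alongside token counts. The sample rows, field order, and metric names here are hypothetical; real numbers would come from the metrics pipeline, not a literal list.

```python
import statistics

# Hypothetical telemetry rows: (prompt_tokens, output_tokens,
# prefill_seconds, decode_seconds).
rows = [
    (512, 64, 0.090, 0.80),
    (2048, 256, 0.310, 3.10),
    (128, 32, 0.030, 0.40),
    (1024, 512, 0.170, 6.20),
]

def phase_summary(rows):
    """Split per-request time into prefill vs decode so heavy
    long-form generations show up in the right bucket."""
    def q(xs, p):  # crude empirical quantile; fine for large samples
        return xs[int(p * (len(xs) - 1))]
    prefill = sorted(r[2] for r in rows)
    decode = sorted(r[3] for r in rows)
    tok_rate = [r[1] / r[3] for r in rows]  # decode tokens/s per request
    return {
        "prefill_p95_s": q(prefill, 0.95),
        "decode_p95_s": q(decode, 0.95),
        "decode_tok_per_s_median": statistics.median(tok_rate),
    }

print(phase_summary(rows))
```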
Dynamic batching involves a trade-off between compute efficiency and batch formation delay. How do you determine the optimal maximum batch wait timer to protect tail latency, and under what specific traffic conditions does a small timer dominate the performance of larger, fixed batches?
Dynamic batching is essentially a scheduling policy where we hunt for compute efficiency without sacrificing the user experience. I recommend a heuristic where you take your total latency SLO—say, 800 ms—and subtract the time needed for model compute and network orchestration; the remainder is your budget for queueing and batching. If your budget is 400 ms, setting a maximum batch wait timer, or T_max, at 20 to 50 ms ensures you aren’t manufacturing tail latency by holding requests too long. In bursty traffic conditions, a small, strict timer is far superior to large, fixed batches because it allows for high-throughput processing when the load is heavy while ensuring that during lulls, individual requests aren’t sitting idle waiting for a batch that won’t fill. This “in-flight” batching strategy, supported by stacks like TensorRT-LLM, is what allows us to hit p95 targets while still maximizing the expensive GPU hardware we’ve deployed.
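The budget subtraction is trivial to encode. The 10% fraction and the 5 to 50 ms clamp below are illustrative defaults chosen to match the 20 to 50 ms band described above, not a published rule:

```python
def batching_budget_ms(slo_ms, compute_ms, network_ms):
    """Whatever the SLO leaves after model compute and network
    orchestration is the allowance for queueing plus batch formation."""
    return max(slo_ms - compute_ms - network_ms, 0)

def pick_batch_timer_ms(budget_ms, fraction=0.1, lo=5, hi=50):
    # Keep T_max a small slice of the budget, clamped to tens of ms,
    # so the timer itself can't manufacture tail latency.
    return min(max(budget_ms * fraction, lo), hi)

budget = batching_budget_ms(800, 300, 100)  # 400 ms left for queueing
t_max = pick_batch_timer_ms(budget)         # 40 ms, inside the band
```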
Engineering guidelines often suggest targeting a utilization rate between 0.4 and 0.7 to manage bursty traffic. What are the practical steps for calculating worker requirements at peak arrival rates, and how do you adjust these targets when optimizing for p99 latency versus overall cost efficiency?
The calculation begins with a simple capacity bound: the arrival rate must be less than the number of workers divided by the service time, λ < k/S, which is just the stability condition ρ = λS/k < 1. To get a realistic deployment number, we use the formula k = ceil(λS / ρ_target), which forces us to make a conscious choice about our business goals. If the priority is strictly p99 latency for a premium application, we lean toward a conservative utilization target of 0.4, effectively buying our way out of queueing spikes with extra hardware. However, if we are optimizing for cost efficiency, we might push that target toward 0.7 and implement stricter admission control to shed load when the queue builds. It is a classic engineering lever—either you pay for the idle capacity to handle bursts, or you accept that during peak moments, your tail latency will degrade or requests will be throttled.
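Sweeping the utilization target makes that lever concrete. This sketch reuses the 120 req/s, 0.18 s example from earlier; exact fractions keep the 0.6 case from rounding up spuriously:

```python
import math
from fractions import Fraction

def workers(arrival_rate, service_time, rho_target):
    """k = ceil(lambda * S / rho_target)."""
    x = (Fraction(str(arrival_rate)) * Fraction(str(service_time))
         / Fraction(str(rho_target)))
    return math.ceil(x)

# Conservative p99 posture vs. cost-efficient posture at the same peak.
for rho in ("0.4", "0.5", "0.6", "0.7"):
    print(rho, workers(120, 0.18, rho))
# rho=0.4 needs 54 workers (buying burst headroom);
# rho=0.7 needs 31, but leans on stricter admission control.
```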
When a system reaches saturation and p99 latency spikes, what admission control policies or circuit breakers are most effective at preventing infinite queue accumulation? How can request shaping, such as routing long prompts to separate pools, help stabilize performance during release-driven spikes?
To prevent the “infinite queue” death spiral described in standard SRE playbooks, you must implement a queue length circuit breaker that triggers long before the system completely stalls. Effective admission control means degrading gracefully—rejecting new requests with a 503 error rather than letting them sit in a queue for 10 seconds. Request shaping is a more sophisticated version of this; by identifying long-context prompts or massive generation tasks, we can route them to a dedicated pool of workers. This prevents a single “heavy” job from blocking a dozen “light” jobs, stabilizing the p99 for the majority of users. We can also cap the maximum generation length by user tier during release-driven spikes, giving us a second lever to pull when hardware alone can’t keep up with the surge.
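A sketch of that breaker combined with heavy/light routing. The queue thresholds, the token cutoff, and the dict-shaped request are all illustrative assumptions, not any particular gateway's API:

```python
from collections import deque

class AdmissionController:
    """Queue-length circuit breaker with heavy/light request shaping.
    Thresholds here are illustrative, not production-tuned."""

    def __init__(self, max_queue=64, heavy_prompt_tokens=4096):
        self.max_queue = max_queue
        self.heavy_prompt_tokens = heavy_prompt_tokens
        self.light = deque()
        self.heavy = deque()  # dedicated pool for long-context jobs

    def admit(self, request):
        # Shape first: one heavy job must not block a dozen light ones.
        queue = (self.heavy
                 if request["prompt_tokens"] >= self.heavy_prompt_tokens
                 else self.light)
        if len(queue) >= self.max_queue:
            return 503  # shed load now rather than queueing for 10 s
        queue.append(request)
        return 200

ac = AdmissionController(max_queue=2)
print(ac.admit({"prompt_tokens": 256}))   # 200
print(ac.admit({"prompt_tokens": 256}))   # 200
print(ac.admit({"prompt_tokens": 256}))   # 503, light pool is full
print(ac.admit({"prompt_tokens": 8192}))  # 200, heavy pool still open
```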
Discrete event simulations are frequently used to explore capacity and service time variability. How should engineers adapt these simulations to reflect real telemetry distributions, and what specific insights do these models provide that simple analytical queueing formulas might overlook?
Standard analytical models like M/M/k are useful for back-of-the-envelope math, but they often assume exponential service times, which LLMs rarely follow. A discrete event simulation allows us to plug in our actual measured telemetry distributions for prompt and decode times, reflecting the “real” jaggedness of our traffic. These simulations are invaluable because they reveal the interaction between batching timers and arrival bursts that simple formulas miss—specifically how p99 latency can deviate drastically from the mean even when average utilization looks safe. By sweeping through different variables like batch_max and worker counts in a simulated environment, we turn a fuzzy infrastructure debate into a quantitative policy discussion. This allows us to prove to leadership exactly why p99 jumped from 900 ms to 6 seconds when our utilization moved just a few percentage points closer to saturation.
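A compact version of such a simulation, under stated assumptions: Poisson arrivals, a single batching server, and a per-batch cost dominated by its slowest member. The `service_times` list stands in for draws from real telemetry, and every parameter below is illustrative:

```python
import random

def p99_latency(arrival_rate, service_times, batch_max, t_max,
                horizon, seed=0):
    """Single-server dynamic-batching simulation. Per-request service
    is drawn from measured samples, not an exponential assumption."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    while t < horizon:                      # Poisson arrival process
        t += rng.expovariate(arrival_rate)
        arrivals.append(t)

    latencies, free_at, i = [], 0.0, 0
    while i < len(arrivals):
        open_t = max(arrivals[i], free_at)  # batch window opens
        deadline = open_t + t_max
        batch = []
        while (i < len(arrivals) and len(batch) < batch_max
               and arrivals[i] <= deadline):
            batch.append(arrivals[i])
            i += 1
        # A full batch launches immediately; a partial one waits out T_max.
        start = max(free_at,
                    batch[-1] if len(batch) == batch_max else deadline)
        svc = max(rng.choice(service_times) for _ in batch)
        free_at = start + svc               # batch cost ~ slowest member
        latencies.extend(free_at - a for a in batch)

    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]

# Sweep the batching timer against one empirical service distribution.
for t_max in (0.005, 0.02, 0.05):
    print(t_max, p99_latency(20.0, [0.01, 0.03, 0.05], 8, t_max, 50.0))
```

Sweeping `t_max` and `batch_max` this way is what turns the fuzzy debate into the quantitative policy discussion described above.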
What is your forecast for LLM serving infrastructure?
I believe we are moving away from the era of “brute force” inference and toward a future where the serving stack is as intelligent as the model it hosts. We will see the widespread adoption of multi-tiered request shaping and PagedAttention-style memory management becoming the baseline, not the exception. The infrastructure of the next two years will focus heavily on “density”—finding ways to pack more concurrent requests into the same VRAM footprint through aggressive quantization and dynamic KV cache offloading. Ultimately, the winners in this space won’t just have the fastest models; they will have the most mathematically rigorous schedulers that can maintain 0.6 utilization while guaranteeing sub-second p99s. Serving will become a game of micro-optimizations where the difference between a profitable service and a money-loser is determined by how well you manage the queue.
