Paul Lainez sits down with Vijay Raina, a SaaS and software expert known for his pragmatic architecture decisions in harsh, resource-constrained environments. Vijay walks us through a complete, production-style observability pipeline that runs on the edge and keeps traces, logs, and metrics correlated end-to-end despite flapping links and limited CPU, RAM, and disk. The conversation digs into tail-based sampling tuned to keep all errors and anything slower than 200 ms, a Fluent Bit plus Lua filter that mirrors trace policies to drop 86% of logs at the source, and a file-backed queue using bbolt that survives multi-hour disconnects without lying about backlog. We also explore why a 30 MB OpenTelemetry Collector can push roughly 2,500 spans per second at only 1% to 5% of one CPU core and 80–150 MiB RAM, how spanmetrics placed after sampling guarantees zero dead links in dashboards, and why metrics continuity sometimes gives way to pipeline liveness when prometheusremotewrite is the only safe path on v0.95–v0.96.
Key themes include: choosing tail-based over head-based sampling for correctness while still dropping around 80% of spans; mirroring sampling logic across logs and traces to ensure deterministic correlation; using a bbolt-backed queue with num_consumers: 1 to expose true backlog and tame GC; validating “time travel” tolerance in Jaeger and enabling unordered_writes in Loki; accepting intentional metric loss during outages to avoid OTLP 400 retry storms; and performance-tuning the collector with a 320/400 MiB memory limiter, 512-span/5-second batching, and a single exporter consumer to smooth drains after reconnection.
When bandwidth is scarce and storage unreliable on edge nodes, how do you prioritize which signals to keep? Walk me through your sampling philosophy, the thresholds you choose, and the operational risks you’re willing to take. Where have these trade-offs bitten you, and how did you recover?
At the edge, I preserve the signals that answer “what broke and where” without shipping the haystack. Tail-based sampling keeps 100% of traces containing errors and 100% of traces exceeding 200 ms, while normal fast successes are dropped; in practice this trims roughly 80% of spans before they ever touch the wire. On logs, Fluent Bit plus a Lua filter mirrors those exact rules and drops around 86% of log lines at the source so our bandwidth savings aren’t undone. The risk is losing rare-but-interesting successes; I accept that because the 200 ms threshold and error focus bias us toward actionable outliers. The one time it hurt was a brief performance regression hidden under 200 ms; we lowered the duration threshold, added a time-window policy that tightens automatically during incidents, and restored the baseline once the storm passed.
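For readers who want the shape of that policy in config, here is a minimal sketch using the contrib tail_sampling processor; the decision_wait value and policy names are illustrative placeholders rather than the exact production settings. Any trace matching neither policy is dropped, which is where the roughly 80% reduction comes from.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # assumed buffer window before a keep/drop decision
    policies:
      # keep 100% of traces that contain an error
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # keep 100% of traces slower than 200 ms
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 200
```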
Tail-based sampling kept all error traces and those exceeding 200 ms. Why that duration, and how would you tune it under changing latency baselines? Share the metrics you monitor to validate effectiveness, and describe the rollback plan if sampling becomes too aggressive.
The 200 ms cutoff came from the application’s steady-state latency profile: it’s a clean divider between “fast path” and “user-noticeable,” and it keeps the error budget honest without exploding volume. If the baseline drifts, I watch spanmetrics histograms and their exemplars; when the p95 slides toward 200 ms, I’ll lift the threshold in 25–50 ms steps, and when the p95 recovers, I bring it back down. I validate by checking that the saved span rate remains near 20% of peak (since about 80% are dropped), dashboard exemplars resolve reliably, and exporter backpressure stays flat during steady-state. If I overshoot and starve diagnostics, the rollback is a config flip: revert to 200 ms, temporarily keep 100% for specific routes, and redeploy the collector—since the policy is data-plane only, it’s a low-risk, fast rollback.
Head-based vs tail-based sampling: where does head-based still win at the edge? Detail latency overheads, memory costs, and failure modes you’ve seen. If you had to blend them, how would you route traffic, set policies, and verify coverage during incidents?
Head-based wins when cycles and memory are vanishingly small and you need the lowest-latency decision at the ingress span. It adds virtually no compute latency and essentially zero buffering, which matters when your node has only 1%–5% of a core to spare and 80–150 MiB RAM is earmarked for the collector. The failure mode is obvious: you can drop a trace that later fails or crosses 200 ms, which is catastrophic for rare, high-severity events. A hybrid approach routes low-risk paths through head-based “keep at N%” and sends critical routes to tail-based policies that keep errors and >200 ms; I verify coverage by comparing spanmetrics rates to expected traffic, confirming exemplars click through to Jaeger, and spot-checking that roughly 80% of spans are still dropped without losing error cases.
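True head-based sampling lives in the application SDK (a ratio-based sampler deciding at the ingress span), so it cannot be shown purely in collector config; at the collector tier, though, the blend Vijay describes can be approximated with an and policy that applies a flat percentage to low-risk routes while the error and latency policies act as the safety net. A hedged sketch, with hypothetical route values:

```yaml
processors:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 200
      # low-risk routes: flat "keep at N%" (route values are hypothetical)
      - name: low-risk-sample
        type: and
        and:
          and_sub_policy:
            - name: route-match
              type: string_attribute
              string_attribute:
                key: http.route
                values: [/healthz, /static/.*]
                enabled_regex_matching: true
            - name: keep-ten-percent
              type: probabilistic
              probabilistic:
                sampling_percentage: 10
```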
Logs were filtered at the source with Fluent Bit plus Lua, mirroring trace policies. How did you design the logic to avoid drift between logs and traces? Share examples where minor field mismatches caused misalignment, and the test harness you use to catch regressions.
Drift disappears when you drive both sides off the same semantics: errors and >200 ms. The Lua filter reads JSON logs that include trace_id and span_id, evaluates level and duration, and only keeps records that match the tail-sampling logic, so a sampled trace and its logs either both live or both get dropped. We once saw misalignment when a service wrote duration_ms while the Lua expected duration; another time, error levels used “ERR” instead of “error,” creating silent drops. Our harness replays golden logs and traces with known outcomes, asserts a 1:1 mapping on trace_id across the kept sets, and flags schema drift the moment field names or values deviate.
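A minimal sketch of that mirror, assuming Fluent Bit’s YAML config format and the field names from the schema described here (level, and duration in milliseconds); the tag match and function name are assumptions, and older Fluent Bit releases need the Lua in a separate script file rather than the inline code option:

```yaml
pipeline:
  filters:
    - name: lua
      match: 'app.*'              # assumed tag for application logs
      call: mirror_policy
      code: |
        -- Mirror the tail sampler: keep errors and anything over 200 ms.
        function mirror_policy(tag, timestamp, record)
          if record["level"] == "error" then
            return 0, timestamp, record    -- keep as-is
          end
          local d = tonumber(record["duration"])
          if d ~= nil and d > 200 then
            return 0, timestamp, record    -- keep as-is
          end
          return -1, timestamp, record     -- drop fast successes at the source
        end
```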
You used file-backed queues via bbolt to survive outages. How did you size disk, compaction, and batch serialization to avoid corruption or thrash? Describe the exact telemetry and alerts that tell you the queue is healthy during a multi-hour disconnect.
We lean on the file_storage extension’s bbolt store and keep batches modest—512 spans per batch with a 5-second cap—so serialized blobs remain small and durable. Disk sizing is empirical: I multiply peak ingest (about 2,500 spans/s under test) by expected outage length, add headroom for logs, and ensure the node’s local disk can absorb that plus bbolt overhead without compaction churn. Because the default exporter had four consumers that masked depth, we set num_consumers: 1 so nearly all backlog stays on disk, which tames memory and GC when the link returns. Health is a set of boring truths: queue depth rises linearly, exporter errors go up without panicking, RSS holds near 80–150 MiB thanks to the 320/400 MiB limiter, and drain completes cleanly on reconnection; alerts key off queue growth velocity, free disk thresholds, and consecutive export failures so operators know it’s expected, not a meltdown.
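In collector config, the pairing looks roughly like this; the directory path, endpoint, and queue_size are assumptions, and queue_size counts batches rather than individual spans:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # assumed path; size for peak ingest x outage window, plus headroom

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317     # assumed endpoint
    sending_queue:
      enabled: true
      num_consumers: 1                  # one in-flight batch; the rest of the backlog stays in bbolt
      queue_size: 10000
      storage: file_storage             # back the queue with the bbolt-based extension

service:
  extensions: [file_storage]
```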
Setting num_consumers to 1 exposed true backlog depth. What symptoms convinced you the default concurrency masked problems? Explain the before-and-after behavior of memory, retries, and backpressure. How do you protect throughput when connectivity flaps rapidly?
With the default four consumers, the queue metric lied—depth sat at zero while in-memory retry buffers ballooned, and you’d only notice when GC began to chatter and the exporter thrashed. After switching to num_consumers: 1, only one batch is in-flight; the rest sit safely in bbolt, so depth is accurate, memory stays steady, and retries follow a clean cadence. Backpressure becomes visible and honest, which lets the memory_limiter at 320/400 MiB and the 512-span/5-second batcher smooth allocation and flush patterns. During flaps, we keep the single consumer, rely on the batch timeout to avoid oversized payloads, and let the queue absorb bursts so the collector never stampedes the link on partial reconnects.
After reconnection, out-of-order “time travel” data floods backends. How did you validate Jaeger’s tolerance and Loki’s unordered_writes behavior under stress? Share concrete error rates, queue drain speeds, and any guardrails you added to prevent overload or data loss.
We validated Jaeger first because it’s append-only; when connectivity returned, traces with old timestamps simply appeared in the right chronological spot with no rejections. Loki was harsher—without unordered_writes: true, you hit HTTP 400s on out-of-order streams; with that flag enabled, the backlog drained successfully even after multi-hour disconnects. Drain pace was bounded by network availability; in our load test we generated about 2,500 spans/second, and on restoration we observed the exporter running flat-out until the bbolt depth hit zero. Guardrails included limiting concurrent consumers to 1, setting batch limits to 512/5s, and alerting on sustained HTTP 4xx so we’d know immediately if a config regression reintroduced ordering strictness.
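The Loki side is a one-line change in the server’s limits_config; this sketch assumes a single-tenant deployment without per-tenant overrides:

```yaml
# loki.yaml
limits_config:
  unordered_writes: true   # accept out-of-order entries instead of returning HTTP 400
```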
Metrics via prometheusremotewrite lacked file-backed queues in certain collector versions. How did you conclude that sacrificing outage-window metrics was safer than blocking the pipeline? Walk us through your decision matrix, simulated failure results, and operator communications.
On the collector versions we used (v0.95 and v0.96), prometheusremotewrite can’t use file_storage; keeping metrics in RAM and retrying old samples is a recipe for out-of-order rejections and pipeline stalls. We chose remote-write precisely because Prometheus will return 204 with zero written on stale data, which drains the queue cleanly and unblocks exporters. The simulation showed a clear split: either block forever with OTLP 400 storms, or accept that the outage window’s metrics are lost while logs and traces remain healthy. We documented this explicitly for operators—“metrics continuity during outages is traded for pipeline liveness”—and paired it with dashboards that highlight “partial data” banners during known disconnect periods.
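The exporter config that embodies that trade-off looks roughly like this; the endpoint, queue_size, and retry cutoff are assumptions. The key point is that remote_write_queue on these versions has no storage option, so whatever doesn’t fit in RAM or outlives the retry window is intentionally lost:

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write   # assumed endpoint
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000          # in-memory only: no file-backed storage on v0.95/v0.96
    retry_on_failure:
      enabled: true
      max_elapsed_time: 30s     # assumed cutoff: give up rather than block the pipeline
```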
OTLP-based metrics often trigger 400s on out-of-order samples, causing retry storms. What safeguards would you add if you had to use OTLP anyway? Outline circuit breakers, time-window filters, and backoff strategies, including exact thresholds that worked in practice.
If I were forced onto OTLP, I’d cut retries off at the knees: a time-window filter drops samples that arrive more than 5 seconds behind the newest accepted timestamp, and a circuit breaker opens after consecutive 400s, holding for 5 seconds before a half-open probe. Backoff grows from 1 to 5 seconds so we don’t saturate the path; the batcher still caps at 512 samples or 5 seconds to avoid oversized resubmissions. The memory_limiter at 320/400 MiB prevents retry storms from starving the process, and num_consumers: 1 ensures most backlog remains on disk, not in RAM. None of this makes OTLP love out-of-order data, but it converts a storm into a controlled drizzle that doesn’t freeze the pipeline.
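Of these safeguards, only the backoff and batching map onto stock collector settings; the time-window filter and circuit breaker are not built-in components and would require custom processors, so they appear below only as comments. A sketch using the numbers from the answer (endpoint and final cutoff assumed):

```yaml
exporters:
  otlp:
    endpoint: backend:4317      # assumed endpoint
    retry_on_failure:
      enabled: true
      initial_interval: 1s      # backoff starts at 1 s...
      max_interval: 5s          # ...and is capped at 5 s
      max_elapsed_time: 60s     # assumed: eventually abandon a doomed batch
    # A circuit breaker (open after consecutive 400s, half-open probe after 5 s)
    # and a drop-older-than-5s time-window filter would both be custom components.
processors:
  batch:
    send_batch_size: 512
    timeout: 5s
```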
Deterministic correlation hinged on injecting trace_id and span_id into every log and using spanmetrics for exemplars post-sampling. How did you ensure no orphaned signals? Describe your schema, logger configuration, and a concrete dashboard workflow from metric spike to trace to log.
The schema is boring by design: JSON logs always include trace_id and span_id sourced from the request context, plus level and duration fields the Lua understands. Tail-based sampling decides keep/drop on errors and >200 ms; the Fluent Bit Lua code mirrors it so the same events survive on both sides, which means no orphaned logs and no orphaned traces. The spanmetrics connector runs after sampling, so every exemplar in the latency histogram points at a trace that’s guaranteed to exist in Jaeger. In practice, a spike on the p95 panel shows diamonds; click one, land in the exact trace; from there, jump to associated logs via the embedded IDs and see the precise error lines that justified the sampler’s decision.
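The wiring that enforces the post-sampling guarantee is the pipeline graph itself: the spanmetrics connector acts as an exporter of the traces pipeline, downstream of tail_sampling, and as a receiver of the metrics pipeline. Exporter names here are assumptions:

```yaml
connectors:
  spanmetrics:
    exemplars:
      enabled: true             # attach trace_id exemplars to the latency histograms

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger, spanmetrics]   # the connector only ever sees kept spans
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```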
Placing spanmetrics after tail-sampling guarantees resolvable exemplars. What problems emerge if it runs pre-sampling? Share real metrics showing dashboard dead-link rates, and how you verify exemplar integrity during load or partial outages.
Pre-sampling spanmetrics creates Schrödinger exemplars—some point to traces you later drop, so dashboard diamonds become dead ends. We measured this the hard way: before we moved it post-sampling, clicking exemplars during high load would frequently land on nothing in Jaeger, even though the chart said a trace existed. After wiring spanmetrics strictly after the tail sampler, dead links disappeared because every exemplar’s trace_id had already cleared the 200 ms/error rules. We verify integrity by load-testing at about 2,500 spans/second, triggering controlled outages, and confirming that all exemplar clicks resolve even while the bbolt queue drains and Loki ingests with unordered_writes: true.
Performance tuning targeted a 30 MB collector using minimal components. How did you choose which processors/exporters to cut? Provide before/after CPU, RSS, and p99 latency numbers. What profiling or flame graphs revealed surprising hot spots in Go’s runtime or your config?
We used the Collector Builder to remove anything not essential to tail-based sampling, spanmetrics, and our exporters—no extra receivers, no vanity processors. The result was a 30 MB binary that held RSS between 80 and 150 MiB and sipped 1%–5% of a single CPU core while processing thousands of spans per second. p99 export latency stabilized thanks to the 512-span/5-second batcher, which reduced tiny-fragment overhead without creating jumbo payload spikes. Profiling showed allocation churn at exporter reconnection; setting num_consumers: 1 and enforcing the 320/400 MiB memory limits cooled GC, turning jagged flame stacks into flatter profiles with shorter pauses.
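A builder manifest in that spirit might look like the following; the component set matches the pipeline described in this interview, but the exact module list and the v0.96.0 pins are assumptions rather than the production file:

```yaml
# builder-config.yaml for ocb (OpenTelemetry Collector Builder)
dist:
  name: otelcol-edge
  output_path: ./otelcol-edge
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.96.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.96.0
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.96.0
connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/spanmetricsconnector v0.96.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.96.0
extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage v0.96.0
```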
Memory-limiter at 320/400 MiB, batch at 512 spans or 5s, and a single exporter consumer stabilized GC under drain spikes. Why did this exact trio matter? Show step-by-step how each setting influenced allocation rates, GC pause times, and export burst smoothness.
The memory_limiter pins a soft ceiling at 320 MiB and a hard stop at 400 MiB, which prevents runaway queues and gives GC a predictable heap range. The batcher collects 512 spans or waits 5 seconds; that size is big enough to amortize overhead yet small enough not to create bursty allocations on serialize/flush. num_consumers: 1 ensures only one batch is live in memory, so when the link returns and bbolt drains, you don’t have four goroutines racing to inflate retry buffers. In combination, allocation rates step down, GC pauses shorten, and exports arrive as even pulses instead of spiky surges that used to cascade into backpressure and jitter.
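In memory_limiter terms, the 320/400 MiB split is expressed as a hard limit plus a spike allowance (soft limit = limit_mib minus spike_limit_mib); check_interval is required and assumed here to be 1 s:

```yaml
processors:
  memory_limiter:
    check_interval: 1s        # how often heap usage is checked
    limit_mib: 400            # hard ceiling
    spike_limit_mib: 80       # soft ceiling = 400 - 80 = 320 MiB
  batch:
    send_batch_size: 512      # flush at 512 spans...
    timeout: 5s               # ...or after 5 seconds, whichever comes first
```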
Load testing with the k6 Operator ramped to 500 VUs and ~2,500 spans/second. How did you architect test scenarios to hit error paths and latency buckets reliably? Share failure thresholds, saturation points, and what broke first in early runs.
We ran a deliberate one-minute ramp—0 to 100 VUs in the first 30 seconds, 250 midway through the second, and 500 at the one-minute mark—then sustained for 40 minutes. Traffic hit four endpoints with distinct latency and error profiles so the sampler could exercise both branches: errors and >200 ms. The saturation point showed up when the exporter met network limits; that’s where bbolt depth grew linearly and the sampler continued dropping the expected ~80% of spans, proving selectivity. Early runs broke in subtle ways: logs outpaced trace keeps until the Lua matched the 200 ms threshold exactly, and GC spiked until we locked in 512/5s batching and num_consumers: 1.
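On the k6 Operator side, the staged ramp lives in the JavaScript test itself; the custom resource just schedules it and fans it out. A hedged sketch with hypothetical resource names (older operator releases use kind: K6 instead of TestRun):

```yaml
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: edge-ramp               # hypothetical name
spec:
  parallelism: 4                # split the 500 VUs across 4 runner pods
  script:
    configMap:
      name: edge-ramp-script    # hypothetical ConfigMap holding the test
      file: test.js             # contains the staged ramp and the four-endpoint mix
```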
Network chaos used host-namespace netshoot, iptables DROP rules, and targeted conntrack flushes. What pitfalls did you hit with stateful connections, and how did you script repeatable transitions? Provide the exact command sequence and the telemetry you watch minute-by-minute.
Stateful gRPC connections will happily sail past new DROP rules unless you tear down conntrack state. The script runs in a privileged netshoot pod on the host network: first insert iptables rules in the FORWARD chain targeting Jaeger, Loki, and Prometheus ports; then flush conntrack entries for the collector’s IP so handshakes are forced and dropped. A representative sequence is:
```bash
# block exporter traffic to Jaeger, Loki, and Prometheus
iptables -I FORWARD 1 -p tcp --dport 14250 -j DROP
iptables -I FORWARD 1 -p tcp --dport 3100 -j DROP
iptables -I FORWARD 1 -p tcp --dport 9090 -j DROP
# tear down established flows so new handshakes hit the DROP rules
conntrack -D -s <collector-ip> || true
```
To restore: iptables -D FORWARD 1 (repeat for each rule). Minute-by-minute, I watch queue depth rise, exporter error codes, Loki 400s vanish once unordered_writes: true is set, RSS near 80–150 MiB, and then the glorious drain curve when rules are removed.
If you had to deploy this on even leaner hardware—say sub-1 vCPU and 256 MiB RAM—what would you cut first, and why? Detail the minimal viable observability that still preserves incident response, and how you’d prove it in staging.
I’d keep traces plus aligned logs and drop most metrics during outages, since prometheusremotewrite already accepts the sacrifice. I’d strip to a tail sampler with errors and >200 ms, Fluent Bit with the Lua mirror, spanmetrics post-sampling for exemplars, and the bbolt queue with num_consumers: 1. Memory would be guarded by the 320/400 MiB model adjusted downward, and batching stays at 512/5s to keep payloads efficient; if 256 MiB is hard, I’d test 256/320 MiB limits and narrower routes. Proof comes from a staging run with the same k6 profile (500 VUs, ~2,500 spans/s), validating no orphaned signals, steady RSS, and clean drains after chaos-induced disconnects.
How do you operationalize version drift risks—collector v0.95/v0.96 constraints, backend quirks, or plugin updates—without surprises? Describe pinning strategies, canarying, compatibility matrices, and your runbook for rapid rollback with zero telemetry blind spots.
We pin collector versions and plugin SHAs so we always know we’re on v0.95 or v0.96 when those constraints matter, and we codify capabilities like “no file-backed queue for prometheusremotewrite” in a living compatibility matrix. Changes ship through a canary node that runs the full chaos drill—iptables drops, conntrack flush, reconnect—and must show healthy bbolt drains, zero dead exemplars, and no unbounded retries. The rollback is a simple image tag revert in the DaemonSet plus config map restore; because tail-sampling and Lua live in config, we can flip thresholds back to the 200 ms/error baseline in one commit. The runbook emphasizes observability of observability: queue depth linearity, 204s on remote-write post-outage, and exemplar click-throughs—all green, or we roll back immediately.
Do you have any advice for our readers?
Treat the edge like hostile territory and make every byte earn its ticket home. Start with tail-based sampling that keeps errors and anything over 200 ms, mirror it in Fluent Bit with Lua, and you’ll drop roughly 80% of spans and 86% of logs without losing the plot. Put spanmetrics after sampling so exemplars never go stale, and anchor resiliency with a bbolt-backed queue and num_consumers: 1 so your backlog is honest and your GC calm. Finally, don’t be afraid to sacrifice outage-window metrics via prometheusremotewrite on v0.95–v0.96; keeping the pipeline breathing is what saves you when the link comes back and the real debugging begins.
