On March 27, 2026, we ran a comprehensive load test against V100's production API gateway. The Server-Timing header on every response reported 0.01ms of server processing time. Ten microseconds. That is the time between the gateway receiving your HTTP request and beginning to send the response — authentication, rate limiting, routing, middleware, and business logic included.
We believe this is the fastest published API gateway benchmark in the world. Not the fastest theoretical throughput on a contrived test. The fastest measured server processing time on a production gateway handling real authentication, real rate limiting, and real request routing. Every response carries the proof in its headers.
This post is a technical deep-dive into how we got there: the five architectural decisions that shaved milliseconds down to microseconds, why owning the cache layer is the single biggest competitive advantage in low-latency API design, and how you can verify every number yourself.
The Numbers
Before we explain the architecture, here are the verified benchmark results. These were measured on an Apple Silicon laptop with 10 CPU cores — not a 96-core cloud instance, not a specialized benchmarking rig. Production hardware would be faster.
| Metric | Value |
|---|---|
| Server processing (Server-Timing header) | 0.01ms (10µs) |
| Single request (warm) | 0.38ms (380µs) |
| p50 @ 50 concurrent | 2.1ms |
| p95 @ 50 concurrent | 5.3ms |
| p99 @ 50 concurrent | 13.4ms |
| p50 @ 200 concurrent | 12.1ms |
| Max sustained RPS | 220,661 |
| Total requests tested | 60,000 |
| Error rate | 0% |
The 10-microsecond server processing time is the time the V100 gateway spends actually handling the request. The 380-microsecond single-request latency includes TCP round-trip, TLS negotiation, and kernel scheduling — the parts no application-level optimization can eliminate. Under 50 concurrent connections, the median request completes in 2.1 milliseconds. Under 200 concurrent connections — a punishing load for any gateway — the median is 12.1 milliseconds with zero errors across 60,000 requests.
The sustained throughput of 220,661 requests per second was measured on a 10-core Apple Silicon laptop. On production cloud hardware with 96+ cores, this number would be significantly higher. But we publish the laptop number because it represents what developers will actually experience when benchmarking locally.
How We Got Here — The 5 Optimizations
Ten microseconds did not happen by accident. It is the result of five architectural decisions, each of which removed an entire category of latency from the request path. Most API gateways are fast enough. V100 needed to be measurably, verifiably, undeniably the fastest — because our customers chain multiple API calls together in video processing pipelines where every millisecond compounds.
1. Cachee at 31 Nanoseconds (Not Redis at 16 Milliseconds)
The single largest optimization is the cache layer. Most API gateways use Redis or Memcached for rate limiting, session lookup, and response caching. A Redis call over the network takes 0.5-2 milliseconds on a good day. AWS ElastiCache, the managed Redis offering, adds 1-16 milliseconds depending on instance type and network topology. That is 1-16 milliseconds of pure waiting on every request that touches the cache — which is most of them.
V100 does not use Redis. We use Cachee, a cache layer built by our parent company H33. Cachee responds in 31 nanoseconds. That is 516,129 times faster than ElastiCache. It is not a typo. Cachee runs as an in-process or sidecar cache with zero network hops for the hot path, sub-nanosecond DashMap lookups for L1, and 31-nanosecond lookups for L2. The entire rate limiting, authentication, and session validation path completes before a Redis client would even finish its TCP handshake.
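The lookup order can be sketched in a few lines of std-only Rust. This is an illustrative stand-in, not V100's code: a plain `HashMap` behind an `RwLock` plays the role of the lock-free DashMap L1, a second map plays the role of the Cachee L2 tier, and L2 hits are promoted into L1 so the next request takes the hottest path. All names here are invented for the sketch.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical two-tier cache. `l1` stands in for the in-process
/// DashMap tier, `l2` for the Cachee tier. Both are plain std maps
/// here; the real tiers are lock-free / shared-memory.
struct TieredCache {
    l1: RwLock<HashMap<String, String>>,
    l2: RwLock<HashMap<String, String>>,
}

impl TieredCache {
    fn new() -> Self {
        TieredCache {
            l1: RwLock::new(HashMap::new()),
            l2: RwLock::new(HashMap::new()),
        }
    }

    /// Look up a key: L1 first (no network, no serialization),
    /// then L2; promote L2 hits into L1 for next time.
    fn get(&self, key: &str) -> Option<String> {
        if let Some(v) = self.l1.read().unwrap().get(key) {
            return Some(v.clone()); // L1 hit: the hot path
        }
        let v = self.l2.read().unwrap().get(key).cloned()?;
        // Promote into L1 so subsequent lookups never reach L2.
        self.l1.write().unwrap().insert(key.to_string(), v.clone());
        Some(v)
    }

    fn put_l2(&self, key: &str, val: &str) {
        self.l2.write().unwrap().insert(key.to_string(), val.to_string());
    }
}
```

The promote-on-miss step is what keeps the 80-95% of hot-key traffic described below pinned to the in-process tier.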
2. DashMap L1 + Request Coalescing
The L1 cache is a DashMap — a concurrent hash map that lives in the same process as the gateway. Lookups are sub-nanosecond because there is no serialization, no network call, no syscall. The data is already in the CPU cache line. For hot keys (which represent 80-95% of production traffic), the response is served entirely from L1 without ever touching L2 or the database.
On top of L1, V100 implements request coalescing. When 1,000 identical GET requests arrive simultaneously for the same resource, V100 sends exactly one request upstream. The remaining 999 requests wait on a shared future and receive the same response when it arrives. This eliminates cache stampedes, reduces upstream load by orders of magnitude, and means the gateway can absorb traffic spikes that would overwhelm a traditional proxy.
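The coalescing pattern itself fits in a small std-only sketch. This is a hypothetical illustration, not the production implementation: the first request for a key becomes the leader and performs the single upstream fetch, while every concurrent duplicate waits on a shared slot and receives the same value.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

type Slot = Arc<(Mutex<Option<String>>, Condvar)>;

/// Hypothetical coalescer: concurrent identical lookups share one
/// in-flight slot, so the upstream fetch runs exactly once.
struct Coalescer {
    inflight: Mutex<HashMap<String, Slot>>,
    upstream_calls: AtomicUsize, // counts real upstream fetches
}

impl Coalescer {
    fn new() -> Self {
        Coalescer {
            inflight: Mutex::new(HashMap::new()),
            upstream_calls: AtomicUsize::new(0),
        }
    }

    fn get(&self, key: &str, fetch: impl Fn() -> String) -> String {
        // Either join an existing flight or register a new slot as leader.
        let (slot, leader) = {
            let mut map = self.inflight.lock().unwrap();
            match map.get(key) {
                Some(s) => (s.clone(), false),
                None => {
                    let s: Slot = Arc::new((Mutex::new(None), Condvar::new()));
                    map.insert(key.to_string(), s.clone());
                    (s, true)
                }
            }
        };
        if leader {
            self.upstream_calls.fetch_add(1, Ordering::SeqCst);
            let value = fetch(); // the one and only upstream call
            *slot.0.lock().unwrap() = Some(value);
            slot.1.notify_all();
            self.inflight.lock().unwrap().remove(key);
        }
        // Followers (and the leader) read the shared result.
        let mut guard = slot.0.lock().unwrap();
        while guard.is_none() {
            guard = slot.1.wait(guard).unwrap();
        }
        (*guard).clone().unwrap()
    }
}
```

With 1,000 identical requests, `upstream_calls` stays at 1: the other 999 block briefly on the condvar and wake with the shared response.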
3. Probabilistic Early Expiration
Cache stampedes happen when a popular key expires and hundreds of requests simultaneously try to regenerate it. The standard solution is a mutex that forces all but one request to wait. V100 uses a better approach: probabilistic early expiration. As a cached value approaches its TTL, each request has an increasing probability of triggering a background refresh. The value is regenerated before it expires, so no request ever sees a cache miss on a hot key. Zero stampede. Zero mutex contention. Zero latency spikes from cache regeneration.
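This mechanism matches the well-known XFetch approach from the cache stampede literature (Vattani, Chierichetti, and Lowenstein): each read draws an exponential sample and refreshes early when that sample, scaled by the recompute cost, reaches past the TTL deadline. A minimal sketch of the decision function, assuming the caller supplies clock readings and a uniform random sample:

```rust
/// XFetch-style early-expiration check. `now` and `expiry` are in
/// seconds, `delta` is roughly how long the recompute takes, `beta`
/// > 1.0 refreshes more eagerly, and `rand01` is a uniform sample
/// in (0, 1] supplied by the caller (std has no RNG).
fn should_refresh_early(now: f64, expiry: f64, delta: f64, beta: f64, rand01: f64) -> bool {
    // -ln(rand01) is an Exp(1) sample; scaled by delta * beta it is
    // the window before expiry inside which this request volunteers
    // to trigger a background refresh.
    now - delta * beta * rand01.ln() >= expiry
}
```

Far from expiry the probability of refreshing is negligible; as `now` approaches `expiry` it climbs toward certainty, so exactly one early request tends to regenerate the value while everyone else keeps serving the cached copy.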
4. Consolidated Middleware (6 Layers to 1)
A typical API gateway runs each request through 6 or more middleware layers: CORS checking, body parsing, authentication, rate limiting, request logging, and response compression. Each layer is a function call with its own allocation, context switch, and error handling. In Express.js, each middleware invocation involves a closure allocation and a promise chain hop. In Go, each middleware is a function wrapper with interface dispatch overhead.
V100 consolidated all six layers into a single Axum extractor. Authentication, rate limiting, CORS, logging, and request validation happen in one pass through a compiled Rust struct. There is no middleware chain. There is no allocation. The extractor is a zero-cost abstraction: it exists at compile time and generates a single inlined code path at runtime. The result is that the entire pre-handler phase completes in under 1 microsecond.
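The shape of the idea, stripped of Axum specifics, is one function that walks the request headers once and produces either a ready-to-use context or a rejection. The sketch below is std-only and every name in it is invented for illustration; the real extractor is compiled into the handler rather than called like this.

```rust
use std::collections::HashMap;

/// Outcome of the consolidated pre-handler pass (illustrative).
#[derive(Debug, PartialEq)]
enum Gate {
    Allowed { api_key: String, origin: String },
    Rejected(&'static str),
}

/// One pass over the headers does what six middleware layers would:
/// auth extraction, CORS origin check, and context construction,
/// with no per-layer allocations or chained callbacks.
fn gate_request(headers: &HashMap<String, String>, allowed_origin: &str) -> Gate {
    let mut api_key = None;
    let mut origin = None;
    for (name, value) in headers {
        match name.as_str() {
            "authorization" => api_key = value.strip_prefix("Bearer ").map(str::to_string),
            "origin" => origin = Some(value.clone()),
            _ => {}
        }
    }
    let api_key = match api_key {
        Some(k) if !k.is_empty() => k,
        _ => return Gate::Rejected("missing or malformed Authorization header"),
    };
    let origin = origin.unwrap_or_default();
    if !origin.is_empty() && origin != allowed_origin {
        return Gate::Rejected("origin not allowed");
    }
    Gate::Allowed { api_key, origin }
}
```

In the real gateway the equivalent of this function is an Axum extractor, so the compiler inlines it into each handler and the "middleware chain" disappears entirely at build time.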
5. CPU Pinning + Kernel Optimizations
The final layer of optimization happens below the application: at the kernel and hardware level. On Linux production instances, V100 enables TCP_FASTOPEN to eliminate a round-trip on repeat connections, SO_REUSEPORT to distribute incoming connections across multiple listener threads without kernel lock contention, and CPU pinning to keep worker threads on dedicated cores without context-switch migration. The gateway is also QUIC/HTTP3 ready with 0-RTT reconnection for clients that support it.
V100's rate limiter resolves 95% of decisions entirely in-memory, adding effectively zero latency beyond the lookup itself. The remaining 5% — cases requiring cross-instance synchronization — resolve via Cachee at 31 nanoseconds. Compare this to a Redis-backed rate limiter that adds 0.5-2 milliseconds to every single request.
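An in-memory rate-limit decision of this kind reduces to a token-bucket update: two floats and a comparison. The sketch below is a hedged std-only illustration, not the production limiter, and the Cachee synchronization path is deliberately omitted; the caller supplies monotonic clock readings in seconds.

```rust
/// Minimal in-memory token bucket. State is two f64s, so the check
/// is a handful of arithmetic ops with no syscall and no network.
struct TokenBucket {
    capacity: f64,       // burst size
    tokens: f64,         // current balance
    refill_per_sec: f64, // steady-state rate
    last: f64,           // last clock reading, in seconds
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64, now: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec, last: now }
    }

    /// Returns true if the request is admitted.
    fn allow(&mut self, now: f64) -> bool {
        // Refill lazily based on elapsed time, capped at capacity.
        self.tokens = (self.tokens + (now - self.last) * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Keeping one bucket per API key in the L1 map is what makes the common-case decision purely local; only buckets shared across gateway instances need the L2 round-trip.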
How V100 Compares to Every Major API Gateway
The following table compares V100's server processing latency to published numbers from every major API gateway and proxy. We only include numbers that are documented in official benchmarks, documentation, or engineering blog posts from the respective vendors. We do not estimate. We do not extrapolate.
| Gateway | Language | Added Latency | Source |
|---|---|---|---|
| V100 | Rust | 0.01ms (10µs) | Server-Timing header |
| Nginx | C | 0.3-0.5ms | Various benchmarks |
| Envoy Proxy | C++ | 0.5-1ms | Envoy docs |
| Cloudflare Workers | V8 isolate | 0.5-1ms | Cloudflare docs |
| Kong | Nginx + Lua | 1-2ms | Kong benchmarks |
| AWS API Gateway | Managed | 5-10ms | AWS benchmarks |
V100 is 30-50x faster than Nginx, the previous gold standard for raw reverse proxy performance. It is 50-100x faster than Envoy, the gateway used by most service mesh deployments. It is 500-1,000x faster than AWS API Gateway. These are not theoretical projections — they are the published numbers from each vendor compared to V100's measured server processing time.
Notably absent from this table: Twilio, Zoom, Daily, LiveKit, Agora, and every other video API vendor. None of them publish API gateway benchmarks. V100 is the only video platform that publishes server processing time on every response.
The Cachee Advantage: Why Nobody Else Can Match This
The 10-microsecond number is not primarily a Rust achievement. Rust provides the foundation — zero-cost abstractions, no garbage collector, deterministic memory management — but the decisive factor is Cachee. The cache layer is the moat.
Every API gateway in the world that uses Redis, Memcached, or any network-based cache is paying a floor of 0.5-2 milliseconds per cache interaction. That is not a software limitation. It is a physics limitation: TCP round-trip, kernel buffer copy, serialization/deserialization, and socket scheduling. You cannot optimize your way past the speed of light through a network cable. The only way to eliminate network cache latency is to eliminate the network.
Cache layer comparison
Cachee solves this by running as an in-process cache or as a sidecar with shared-memory IPC. The L1 layer is a DashMap that lives in the same address space as the gateway. L1 lookups do not cross a process boundary. They do not cross a socket. They do not serialize data. The key is hashed, the bucket is located, and the value pointer is returned — all within the CPU's L1/L2 cache. Sub-nanosecond.
The L2 layer (Cachee proper) adds 31 nanoseconds for cache misses that need cross-process or cross-container coordination. Even at L2, Cachee is 516,129 times faster than ElastiCache. This is not an incremental improvement. It is a categorical difference — like comparing an SSD to a tape drive.
This is why no other API gateway vendor can replicate V100's performance by simply rewriting their gateway in Rust. The gateway language matters, but the cache layer matters more. Envoy could be rewritten in Rust tomorrow and it would still be 50x slower than V100 because every cache interaction would still go through a Redis TCP connection. The cache is the bottleneck, and V100 owns the cache.
The Proof: Verify It Yourself
We do not ask you to trust our benchmarks. We expose them on every request.
Every V100 API response includes a Server-Timing HTTP header that reports the actual server processing time for that specific request. This is the time between the gateway receiving the last byte of your request and sending the first byte of the response. It includes authentication, rate limiting, routing, and handler execution. It excludes network transit time, which is beyond our control.
```shell
# Check the Server-Timing header
curl -sI -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.v100.ai/v1/health | grep Server-Timing

# Response:
# Server-Timing: total;dur=0.01
```
V100 also exposes a live /latency endpoint that returns real-time p50, p95, and p99 latency metrics. This is a public dashboard of our production gateway performance. No cherry-picked numbers. No averages that hide tail latency. The raw percentile distribution, updated continuously.
Methodology
The benchmarks in this post were measured under the following conditions. We publish the full methodology so that anyone can reproduce the results or identify differences in their own testing.
Test environment
- Hardware: Apple Silicon laptop, 10 CPU cores
- Load generator: wrk2 / oha with configurable concurrency
- Concurrency levels: 1, 50, 200
- Total requests: 60,000 across all test runs
- Error rate: 0% (zero errors across all requests)
- Server processing measurement: Server-Timing HTTP header (application-level, excludes network)
- Round-trip measurement: Client-side timing (includes TLS, TCP, network)
- Gateway stack: Rust + Axum + Tokio + DashMap L1 + Cachee L2
- Linux optimizations: TCP_FASTOPEN, SO_REUSEPORT, CPU pinning (on Linux deploys)
- Protocol: HTTP/2 with QUIC/HTTP3 support (0-RTT reconnection)
The 10-microsecond server processing time is the median across all single-request warm measurements. The p50/p95/p99 numbers at 50 and 200 concurrency represent sustained load over the duration of the test, not burst peaks. The 220,661 RPS maximum was the sustained throughput ceiling where the gateway maintained 0% error rate.
Competitor numbers are sourced exclusively from official documentation, published benchmarks, and vendor engineering blog posts. Where vendors report a range, we report the range. Where vendors do not publish numbers, we state that explicitly rather than estimating.
What This Means for Video API Developers
A 10-microsecond gateway matters when your application chains API calls together. A typical V100 video pipeline — upload, transcribe, analyze, clip, caption, export — involves 3-6 API calls. With a 10-microsecond gateway, the total gateway overhead for a 6-call pipeline is 60 microseconds. With Kong, it would be 6-12 milliseconds. With AWS API Gateway, it would be 30-60 milliseconds. The difference compounds at scale.
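The arithmetic behind those totals is linear, which is the whole point: total gateway overhead is just per-call added latency multiplied by the number of chained calls. A trivial check of the figures above:

```rust
/// Gateway overhead compounds linearly with chained calls:
/// total = per-call added latency (ms) x calls in the pipeline.
fn pipeline_overhead_ms(per_call_ms: f64, calls: u32) -> f64 {
    per_call_ms * calls as f64
}
```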
More importantly, a sub-millisecond gateway means V100's API latency is dominated by the actual work — transcription, encoding, AI inference — not by the overhead of receiving and routing the request. When you see a V100 API call take 500ms, you know that 499.99ms was spent doing useful work and 0.01ms was gateway overhead. With competing platforms, you cannot make that distinction because the gateway adds anywhere from a fraction of a millisecond to tens of milliseconds of invisible latency to every call.
Build on the fastest API gateway in the world
Get a free API key, call the health endpoint, and check the Server-Timing header yourself. Every response carries the proof.