Every API request touches the cache. Rate limiting checks whether the caller has exceeded their quota. Authentication validates the API key. Response caching determines whether the result is already available. Session lookup retrieves the caller's context. These are not optional operations. They happen on every single request, and the speed at which they resolve determines the floor of your API latency.
For most API platforms, the cache is Redis. Redis is excellent software. It is fast, reliable, and well-understood. It is also, fundamentally, a network service. Every Redis operation requires a TCP round-trip: the client serializes the command, sends it over a socket, the kernel copies it into a buffer, Redis processes it, and the response travels back through the same chain. On a good day, this takes 0.5-2 milliseconds. On AWS ElastiCache, the managed Redis offering, it takes 1-16 milliseconds depending on instance type, availability zone placement, and network congestion.
V100 does not use Redis. V100 uses Cachee, a cache built by H33 that responds in 31 nanoseconds. That is not a typo. Thirty-one nanoseconds. To put that in context: light travels approximately 9.3 meters in 31 nanoseconds. A single Redis round-trip, at 0.5-2 milliseconds, lasts longer than it takes light to travel 150 kilometers. The difference between Cachee and ElastiCache is not an optimization. It is a different category of technology.
The Numbers: Cachee vs. Everything Else
Before we explain how Cachee works, let us establish the scale of the difference. These are measured lookup latencies for a single cache GET operation, comparing Cachee against the most commonly used cache layers.
| Cache Layer | Lookup Latency | vs. Cachee | Transport |
|---|---|---|---|
| V100 L1 (DashMap) | <1ns | >31x faster | In-process (CPU cache line) |
| V100 L2 (Cachee) | 31ns | Baseline | In-process / sidecar SHM |
| Memcached (localhost) | 0.5ms (500,000ns) | 16,129x slower | TCP loopback |
| Redis (localhost) | 1ms (1,000,000ns) | 32,258x slower | TCP loopback |
| Redis (same AZ) | 2-5ms | 64K-161Kx slower | TCP over VPC |
| AWS ElastiCache | 1-16ms | 32K-516Kx slower | TCP over VPC + managed proxy |
The gap is not linear. Cachee is not "a bit faster than Redis." It is five to six orders of magnitude faster. The performance difference between Cachee and ElastiCache is comparable to the difference between an SSD and a floppy disk, or between a fiber-optic link and a carrier pigeon. They solve the same problem — key-value lookup — but Cachee solves it by eliminating the network entirely.
How Cachee Works: Eliminating the Network
The fundamental insight behind Cachee is that the network is the bottleneck, not the data structure. A hash map lookup takes nanoseconds. A network round-trip takes milliseconds. When your cache is accessed on every request, the correct architecture is to put the cache in the same process as the application and eliminate the network entirely.
Cachee operates in two modes. In in-process mode, the cache is a DashMap (a concurrent, sharded hash map) that lives in the same address space as the application. There is no serialization, no deserialization, no syscall, no context switch. The key is hashed, the shard lock is acquired (a lightweight spinlock), the pointer is looked up, and the value is returned. The entire operation completes in less than 1 nanosecond for L1 hits. In sidecar mode, Cachee runs as a separate process communicating via shared memory, adding the 31-nanosecond lookup time for L2. Even in sidecar mode, there is no TCP, no kernel buffer copy, and no serialization — data is read directly from a memory-mapped region.
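The two-tier flow described above can be sketched with plain standard-library maps. This is purely illustrative: `TieredCache` and its methods are stand-ins, not the actual Cachee API, and real L1/L2 tiers are a DashMap and a shared-memory region rather than two `HashMap`s.

```rust
use std::collections::HashMap;

// Illustrative two-tier lookup shape. HashMaps stand in for the real
// L1 DashMap and the L2 shared-memory region.
pub struct TieredCache {
    l1: HashMap<String, String>,
    l2: HashMap<String, String>,
}

impl TieredCache {
    pub fn new() -> Self {
        Self { l1: HashMap::new(), l2: HashMap::new() }
    }

    pub fn put_l2(&mut self, key: &str, value: &str) {
        self.l2.insert(key.to_string(), value.to_string());
    }

    pub fn get(&mut self, key: &str) -> Option<String> {
        // L1 hit: a lookup in the application's own address space --
        // no syscall, no serialization, no buffer copy.
        if let Some(v) = self.l1.get(key) {
            return Some(v.clone());
        }
        // L1 miss: fall through to L2, then promote the value into L1
        // so the next lookup resolves at the fast tier.
        let v = self.l2.get(key)?.clone();
        self.l1.insert(key.to_string(), v.clone());
        Some(v)
    }
}
```

The promotion step in `get` is what keeps the L1 hit rate high: any key that misses once is served from L1 on every subsequent lookup until it is evicted or expires.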
V100 runs Cachee in in-process mode. The L1 DashMap handles 95% of cache operations. The L2 layer handles the remaining 5% that require cross-worker synchronization or persistence. For a typical V100 API request, rate limiting, API key validation, and session lookup all resolve at L1 — sub-nanosecond — and the entire pre-handler phase completes before a Redis client would finish serializing its first command.
How V100 Uses Cachee: The Four Cache Paths
V100's API gateway uses Cachee for four distinct operations on every request. Each operation would add 1-16ms of latency with a network-based cache. With Cachee, all four together add less than 1 microsecond.
1. Rate Limiting (95% L1, 5% L2)
Every incoming request is checked against a per-API-key rate limit. The rate limiter uses a sliding window counter stored in the DashMap. For 95% of requests, the current window counter is already in L1 — the calling pattern has not changed, the key has not expired, and the counter is incremented atomically in-place. No cache miss, no L2 lookup, no network call. The rate limit check completes at the speed of an atomic increment: effectively zero additional latency.
The remaining 5% of rate limit checks — new windows, expired counters, or cross-instance synchronization events — fall through to Cachee L2 at 31 nanoseconds. Compare this to Redis-backed rate limiting, where every single check costs 0.5-2ms regardless of hit pattern. At 220,000 requests per second, the difference between 0ns (L1 hit) and 1ms (Redis) is 220 CPU-seconds per second. That is not rounding error. That is the difference between handling your traffic on 5 servers or 50.
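The hot path described above can be sketched as follows. For brevity this uses a fixed-window variant with plain standard-library types; the article's limiter is a sliding window over DashMap, and all names here are illustrative. The point the sketch makes is that once the window entry exists, the entire check reduces to one in-process atomic increment.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

// Illustrative fixed-window rate limiter. The hot path is a single
// atomic increment in process memory -- never a network call.
pub struct RateLimiter {
    windows: Mutex<HashMap<String, (u64, Arc<AtomicU64>)>>,
    window_secs: u64,
    limit: u64,
}

impl RateLimiter {
    pub fn new(window_secs: u64, limit: u64) -> Self {
        Self { windows: Mutex::new(HashMap::new()), window_secs, limit }
    }

    /// Returns true if the request is allowed. `now_secs` is passed in
    /// explicitly to keep the sketch deterministic and testable.
    pub fn check(&self, api_key: &str, now_secs: u64) -> bool {
        let window_start = now_secs - now_secs % self.window_secs;
        let counter = {
            let mut map = self.windows.lock().unwrap();
            let entry = map
                .entry(api_key.to_string())
                .or_insert_with(|| (window_start, Arc::new(AtomicU64::new(0))));
            if entry.0 != window_start {
                // Stale window: reset the counter -- the rare miss path.
                *entry = (window_start, Arc::new(AtomicU64::new(0)));
            }
            Arc::clone(&entry.1)
        };
        // Hot path: one atomic increment, effectively zero latency.
        counter.fetch_add(1, Ordering::Relaxed) < self.limit
    }
}
```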
2. API Key Validation (5-minute TTL)
V100 validates the API key on every request. The first time a key is seen, it is verified against the database and the result (valid/invalid, tier, rate limits, permissions) is cached in L1 with a 5-minute TTL. Subsequent requests with the same key resolve entirely from the DashMap. An active API key that sends requests at any reasonable frequency will never incur a database lookup after the initial validation.
The 5-minute TTL is a deliberate trade-off. It means that a revoked API key will continue to be accepted for up to 5 minutes. In exchange, 99.9% of API key validations complete at sub-nanosecond latency. For security-critical revocation (compromised keys), V100 supports immediate cache invalidation through the admin API, which clears the L1 entry instantly.
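A minimal sketch of this pattern, with explicit timestamps for testability. The `KeyInfo` record and all method names are hypothetical stand-ins for the real validation result and V100 internals.

```rust
use std::collections::HashMap;

// Hypothetical cached validation result; fields are illustrative.
#[derive(Clone, PartialEq, Debug)]
pub struct KeyInfo {
    pub valid: bool,
    pub tier: String,
}

pub struct KeyCache {
    ttl_secs: u64,
    entries: HashMap<String, (u64, KeyInfo)>, // key -> (cached_at, result)
}

impl KeyCache {
    pub fn new(ttl_secs: u64) -> Self {
        Self { ttl_secs, entries: HashMap::new() }
    }

    /// Returns the cached validation result if it is still within TTL.
    pub fn get(&self, key: &str, now_secs: u64) -> Option<KeyInfo> {
        self.entries.get(key).and_then(|(cached_at, info)| {
            (now_secs.saturating_sub(*cached_at) < self.ttl_secs)
                .then(|| info.clone())
        })
    }

    pub fn insert(&mut self, key: &str, info: KeyInfo, now_secs: u64) {
        self.entries.insert(key.to_string(), (now_secs, info));
    }

    /// Immediate revocation path for compromised keys: drop the entry so
    /// the next request re-validates against the database.
    pub fn invalidate(&mut self, key: &str) {
        self.entries.remove(key);
    }
}
```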
3. Response Caching with Probabilistic Early Expiration
Idempotent API responses (room configurations, participant lists for read-only endpoints, encoding presets) are cached with TTLs ranging from 1 second to 5 minutes depending on the endpoint. Cachee uses probabilistic early expiration to prevent cache stampedes: as a cached value approaches its TTL, each incoming request has an increasing probability of triggering a background refresh. The probability follows a logarithmic curve, ensuring the refresh happens before expiration but not too early.
This is a critical detail. Without probabilistic early expiration, a popular cached key that expires triggers hundreds of simultaneous upstream requests — a cache stampede. The traditional solution (mutex + single-flight) adds lock contention. Cachee's approach avoids both the stampede and the lock: in the common case a single request refreshes the value in the background, and no request ever sees a cold miss on a popular key.
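The published form of this technique is the "XFetch" rule from the cache-stampede-prevention literature. Cachee's exact curve and parameters are not public, so the sketch below shows the standard rule, with all names being illustrative: the refresh probability is negligible far from expiry and approaches certainty as expiry nears.

```rust
// XFetch-style probabilistic early expiration (standard published rule;
// Cachee's exact parameters are an assumption here). `delta` is the
// measured cost of recomputing the value, `beta` tunes eagerness.
pub fn should_refresh_early(
    now: f64,    // current time (seconds)
    expiry: f64, // absolute expiry time (seconds)
    delta: f64,  // recompute cost (seconds)
    beta: f64,   // typically 1.0; >1.0 refreshes earlier
    rand01: f64, // uniform random sample in (0, 1]
) -> bool {
    // -ln(rand01) is exponentially distributed, so far from expiry the
    // refresh almost never fires, and near expiry it almost always does.
    now - delta * beta * rand01.ln() >= expiry
}
```

Because each request draws its own random sample, no coordination or lock is needed: the first request whose draw crosses the threshold kicks off the background refresh.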
4. Request Coalescing
Request coalescing is not a cache operation per se, but it is deeply integrated with the cache layer. When 1,000 identical GET requests arrive within the same time window for the same resource, V100 sends exactly one request upstream. The first request acquires a coalescing lock (DashMap entry), sends the upstream request, and broadcasts the result to all waiting requests through a Tokio broadcast channel. The remaining 999 requests attach to the broadcast receiver and return the same response without any upstream call.
```rust
// V100 request coalescing via DashMap + broadcast channel.
use std::future::Future;

use dashmap::mapref::entry::Entry;
use dashmap::DashMap;
use tokio::sync::broadcast;

pub struct Coalescer<Response: Clone + Send + 'static> {
    inflight: DashMap<String, broadcast::Sender<Response>>,
}

impl<Response: Clone + Send + 'static> Coalescer<Response> {
    pub async fn get_or_fetch(
        &self,
        key: &str,
        fetch: impl Future<Output = Response>,
    ) -> Response {
        // Atomic check-and-insert via the entry API, so two "first"
        // requests cannot race between a get() and an insert().
        let tx = match self.inflight.entry(key.to_string()) {
            Entry::Occupied(entry) => {
                // Another request is already fetching this key: attach to
                // its broadcast -- zero additional upstream calls.
                let mut rx = entry.get().subscribe();
                // Release the shard lock before awaiting.
                drop(entry);
                return rx.recv().await.expect("in-flight fetch was dropped");
            }
            Entry::Vacant(entry) => {
                // First request: register the in-flight fetch.
                let (tx, _) = broadcast::channel(1);
                entry.insert(tx.clone());
                tx
            }
        };
        let response = fetch.await;
        // Remove the entry before sending, so a request arriving after the
        // broadcast starts a fresh fetch instead of waiting forever.
        self.inflight.remove(key);
        tx.send(response.clone()).ok();
        response
        // 1,000 identical GETs = 1 upstream fetch, 999 broadcast receives.
    }
}
```
The combined effect of these four cache paths is that the typical V100 API request resolves rate limiting, authentication, and response lookup in under 100 nanoseconds. The 10-microsecond gateway processing time is not a Rust optimization alone. It is a cache architecture decision. Rust gives us zero-cost abstractions and deterministic memory. Cachee gives us zero-cost cache operations. Together, they produce a gateway that is 30-1,000x faster than anything else published.
Architecture: The Tiered Cache Pipeline
```
Incoming Request
        |
        v
+-----------------+
|  L1: DashMap    |   <1 ns   (95% of lookups resolve here)
|  (in-process)   |   Sub-nanosecond concurrent hash map
+--------+--------+
         | miss
         v
+-----------------+
|  L2: Cachee     |   31 ns   (4.9% resolve here)
|  (in-process    |   Shared-memory or in-process L2
|   or sidecar)   |   Probabilistic early expiration
+--------+--------+
         | miss
         v
+-----------------+
|  Coalescer      |   Dedup identical requests
|  (broadcast)    |   1,000 GETs = 1 upstream fetch
+--------+--------+
         | single fetch
         v
+-----------------+
| Upstream Service|   Database / API / compute
|  (origin)       |   Response cached in L1 + L2
+-----------------+

Total pre-handler latency: <1 µs (typical)

Comparison:
  Redis path:   1-2 ms   per cache operation
  ElastiCache:  1-16 ms  per cache operation
  V100/Cachee:  <0.001 ms (31 ns) per cache operation
```
The Math: Why Cache Latency Determines Server Count
Cache latency is not just a performance metric. It is a cost multiplier. Every millisecond your cache adds to request processing is a millisecond that a CPU core is occupied waiting instead of handling the next request. At scale, this directly translates to the number of servers you need.
Consider a video API handling 220,000 requests per second (V100's measured throughput on a 10-core laptop). Each request touches the cache 3 times on average (rate limit check, auth validation, response lookup). With Redis at 1ms per operation, the total cache wait time is 3ms per request. At 220,000 RPS, that is 660 CPU-seconds per second spent waiting on cache. On a machine with 10 effective cores, you need 66 servers just to handle the cache wait time — before any actual request processing begins.
With Cachee at 31ns per L2 operation (and <1ns for L1), the total cache time is 0.000093ms per request. At 220,000 RPS, that is 0.02 CPU-seconds per second. Effectively zero. The cache layer is invisible. All 10 cores are available for actual request handling. One machine handles what would require 66 machines with Redis.
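The arithmetic in the two paragraphs above, written out as a small helper (the function name is illustrative):

```rust
// Reproduces the article's server-count arithmetic: CPU-seconds per
// second spent waiting on the cache, divided by usable cores per server.
pub fn servers_for_cache_wait(
    rps: f64,             // requests per second
    ops_per_request: f64, // cache operations per request
    op_latency_secs: f64, // latency of one cache operation
    cores_per_server: f64,
) -> f64 {
    let cpu_seconds_per_second = rps * ops_per_request * op_latency_secs;
    (cpu_seconds_per_second / cores_per_server).ceil().max(1.0)
}
```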
Server cost comparison at 220K RPS
This is not a theoretical exercise. It is the actual infrastructure cost equation that V100 faces. By using Cachee instead of Redis, V100 can serve the same traffic on dramatically fewer servers. The savings compound at scale: at 1 million RPS, the difference between Cachee and Redis is the difference between 5 servers and 300 servers. At cloud computing prices, that is the difference between a viable business and a cost-prohibitive one.
Why Redis Cannot Close the Gap
Redis is not slow because of bad engineering. Redis is one of the most well-optimized network services ever built. The single-threaded event loop, the optimized RESP protocol, the memory-efficient data structures — every decision in Redis is carefully considered for performance. But Redis is fundamentally a network service, and network services have a latency floor that no amount of optimization can eliminate.
A TCP round-trip on localhost involves: application serializes command, system call to send (context switch to kernel), kernel copies data to socket buffer, kernel schedules packet, loopback driver delivers to destination socket, kernel copies to receive buffer, context switch back to application, application deserializes response. On Linux, the minimum achievable TCP loopback latency is approximately 30-50 microseconds under ideal conditions. That is 1,000x slower than Cachee's 31 nanoseconds, and it assumes zero contention, zero GC pauses, and a perfectly warm connection.
Unix domain sockets improve on TCP by eliminating the network stack, but they still require two context switches (send and receive), kernel buffer copies, and serialization. Measured UDS latency for Redis is approximately 20-40 microseconds — better than TCP, but still 600-1,300x slower than Cachee.
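You can observe this floor directly. The sketch below measures a raw one-byte TCP round-trip over loopback: no Redis, no protocol, just sockets and syscalls. On a typical machine it still reports microseconds, not nanoseconds. (Exact numbers depend on the kernel, scheduler, and load.)

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Instant;

// Measures the average one-byte TCP round-trip over 127.0.0.1. Even with
// TCP_NODELAY and zero serialization, the syscall + loopback path keeps
// this far above an in-process map lookup.
pub fn loopback_roundtrip_nanos(iters: u32) -> u64 {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let echo = thread::spawn(move || {
        let (mut sock, _) = listener.accept().unwrap();
        let mut buf = [0u8; 1];
        for _ in 0..iters {
            sock.read_exact(&mut buf).unwrap();
            sock.write_all(&buf).unwrap();
        }
    });
    let mut client = TcpStream::connect(addr).unwrap();
    client.set_nodelay(true).unwrap(); // disable Nagle, as Redis clients do
    let mut buf = [0u8; 1];
    let start = Instant::now();
    for _ in 0..iters {
        client.write_all(&buf).unwrap();
        client.read_exact(&mut buf).unwrap();
    }
    let elapsed = start.elapsed();
    echo.join().unwrap();
    elapsed.as_nanos() as u64 / u64::from(iters)
}
```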
The only way to match Cachee's latency is to eliminate the process boundary entirely and put the cache in the same address space as the application. That is what Cachee does. It is not a faster Redis. It is a fundamentally different approach to caching.
Cachee Is an H33 Product — V100 Dogfoods It
Cachee was not built specifically for V100. It is a standalone caching product built by H33, V100's parent company. H33 originally built Cachee for its post-quantum authentication infrastructure, which processes over 2.1 million authentications per second and requires sub-microsecond cache lookups for ZKP (zero-knowledge proof) verification results.
V100 is Cachee's first and most demanding customer. Every V100 API request exercises Cachee's L1 and L2 tiers at production scale. Every performance regression in Cachee shows up immediately in V100's Server-Timing headers. Every cache stampede, every edge case in probabilistic early expiration, every DashMap contention issue under high concurrency — V100 surfaces these problems in production before any external Cachee customer encounters them.
This is the advantage of dogfooding. V100 customers get a cache layer that has been battle-tested under conditions that no synthetic benchmark can replicate: real API traffic patterns, real concurrency spikes, real-world key distributions. And Cachee customers get a product that is continuously hardened by one of the most demanding API workloads on the internet.
Cachee is available as a standalone product at cachee.ai. If your application uses Redis or Memcached and cache latency is a bottleneck, Cachee can replace it with a one-line integration. Managed deployment, sidecar mode, and in-process mode are all supported.
What This Means for V100 Customers
V100 customers do not need to know what Cachee is. They do not need to configure it, manage it, or think about it. The benefit is invisible and automatic: every API call is faster because the cache layer resolves in nanoseconds instead of milliseconds. The `Server-Timing: total;dur=0.01` header on every response is the visible proof that the entire pipeline — including caching — completed in 10 microseconds.
But for engineering teams evaluating video API platforms, the cache architecture should be a critical factor. Ask any video API vendor: what cache do you use, and what is the measured lookup latency? If the answer is Redis, Memcached, or ElastiCache, you know that every API call is paying a 0.5-16ms tax that V100 does not pay. That tax compounds across every API call in your pipeline. Over millions of requests, it determines your infrastructure cost, your tail latency, and your ability to deliver real-time experiences.
V100 is the only video API built on a sub-microsecond cache layer. It is not something competitors can replicate by switching to a faster Redis instance or a closer availability zone. The gap is architectural, and it is five orders of magnitude wide.
Build on the fastest cache in the world
V100's 10-microsecond gateway is powered by Cachee's 31-nanosecond cache. Start a free trial and check the Server-Timing header on every response.