When we began building V100 in 2024, we faced a decision that would define the platform's architecture forever: what language do you write a post-quantum video platform in? The answer seemed obvious in hindsight, but let us walk through the engineering reasoning that made Rust the only viable choice.
The core problem is this: post-quantum cryptography is computationally expensive. Not slightly expensive — fundamentally, architecturally expensive. The mathematical operations underlying lattice-based cryptography (polynomial multiplication via NTT, matrix operations over polynomial rings, rejection sampling) demand tight control over memory layout, CPU cache utilization, and instruction scheduling. These are not operations you can hide behind an async event loop.
V100 is not a single service. It is 20 microservices, all written in Rust, all performing post-quantum attestation on every request. The cumulative performance requirements made garbage-collected languages, JIT-compiled runtimes, and interpreted languages categorically unsuitable.
The Post-Quantum Performance Problem
Classical cryptography is fast because the underlying operations are simple: elliptic curve point multiplication is a sequence of modular additions and doublings on 256-bit integers. Ed25519 signing takes approximately 50 microseconds. ECDH key agreement takes approximately 50 microseconds. These operations are so fast they are effectively free in the context of a web request.
Post-quantum cryptography is a different magnitude of work:
Post-quantum operation costs (per operation)
| Operation | Latency | Output Size | Classical Equivalent |
|---|---|---|---|
| ML-KEM-768 encapsulate | ~0.08 ms | 1,088 B ciphertext | ECDH: ~0.05 ms, 32 B |
| ML-KEM-768 decapsulate | ~0.10 ms | 32 B shared secret | ECDH: ~0.05 ms, 32 B |
| ML-DSA-65 sign | ~0.30 ms | 3,309 B signature | Ed25519: ~0.05 ms, 64 B |
| ML-DSA-65 verify | ~0.10 ms | boolean | Ed25519: ~0.07 ms |
| FALCON-512 sign | ~0.50 ms | 666 B signature | Ed25519: ~0.05 ms, 64 B |
| FALCON-512 verify | ~0.07 ms | boolean | Ed25519: ~0.07 ms |
Individually, these numbers seem manageable. But at scale, they compound catastrophically. Consider V100's API gateway processing 220,000 requests per second. Each request requires at minimum one ML-DSA-65 signature verification (authentication) and one ML-KEM decapsulation (if establishing a new session). That is:
220,000 RPS * 0.10 ms verify = 22,000 ms of CPU time per second
That requires 22 CPU cores just for signature verification
With classical Ed25519: 220,000 * 0.07 ms = 15,400 ms = ~15.4 cores
PQ overhead: +43% CPU just for authentication
And that is just one service. V100 has 20 services, each performing its own PQ operations. The cumulative CPU overhead of post-quantum cryptography at V100's scale is measured in hundreds of cores. Every microsecond saved per operation translates directly to infrastructure cost savings measured in thousands of dollars per month.
Why Node.js Cannot Absorb the PQ Tax
The video industry has a strong preference for Node.js. Many video platforms (Twilio, Vonage, Daily) use Node.js extensively. For classical cryptography, this works fine — Node.js calls out to OpenSSL's C implementation for crypto operations, so the actual computation is native code. But post-quantum cryptography introduces new challenges that Node.js fundamentally cannot address:
1. GC pauses during crypto operations
Node.js's V8 garbage collector pauses all JavaScript execution periodically. At default settings, GC pauses average 5-20ms and can spike to 100ms+ under memory pressure. When your crypto operation takes 0.3ms and your GC pause takes 20ms, the GC dominates your p99 latency entirely. There is no way to prevent GC from interrupting a crypto operation in progress. Rust has no GC — zero pauses, deterministic latency.
2. Memory overhead for PQ key material
ML-KEM-768 requires 3,616 bytes of key material per session (1,184-byte public key + 2,400-byte secret key + 32-byte shared secret). ML-DSA-65 requires 5,984 bytes per keypair (1,952-byte public key + 4,032-byte secret key), plus 3,309 bytes per signature. In Node.js, each Buffer allocation carries V8 heap overhead (16-48 bytes per object header). With 100,000 concurrent sessions, the V8 heap overhead alone is significant. In Rust, key material is stack-allocated or arena-pooled with zero per-object overhead.
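The stack-allocation point can be made concrete. A minimal sketch, using the ML-KEM-768 key sizes from FIPS 203 (the struct name and layout are illustrative, not V100's actual types):

```rust
// Sketch: fixed-size ML-KEM-768 key material with no heap allocation.
// Sizes are from FIPS 203; the struct is illustrative, not V100's code.

const MLKEM768_PK: usize = 1184; // encapsulation (public) key
const MLKEM768_SK: usize = 2400; // decapsulation (secret) key
const SHARED_SECRET: usize = 32;

/// Per-session key material as plain fixed-size arrays. The whole
/// struct can live on the stack or in a pre-allocated arena slot:
/// no per-object header, no GC bookkeeping.
#[repr(C)]
pub struct SessionKeys {
    pub public_key: [u8; MLKEM768_PK],
    pub secret_key: [u8; MLKEM768_SK],
    pub shared_secret: [u8; SHARED_SECRET],
}

impl SessionKeys {
    pub fn zeroed() -> Self {
        SessionKeys {
            public_key: [0u8; MLKEM768_PK],
            secret_key: [0u8; MLKEM768_SK],
            shared_secret: [0u8; SHARED_SECRET],
        }
    }

    /// Exact in-memory footprint: the sum of the fields, nothing more.
    pub const fn footprint() -> usize {
        std::mem::size_of::<Self>()
    }
}
```

With `#[repr(C)]` and byte arrays there is no padding, so 100,000 sessions cost exactly 100,000 x 3,616 bytes of key material, with zero allocator or GC metadata on top.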
3. No SIMD access for NTT optimization
The Number Theoretic Transform (NTT) is the computational core of ML-KEM and ML-DSA. It consists of butterfly operations (modular multiply-accumulate) on arrays of 256 coefficients (512 for FALCON). ARM NEON and x86 AVX2 SIMD instructions can process 8-16 16-bit coefficients per instruction. Rust has stable SIMD intrinsics and auto-vectorization. Node.js has no SIMD access from JavaScript: you must use WebAssembly (significant overhead) or native addons (which defeats the purpose of using Node.js).
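To see why the NTT vectorizes so well, here is one butterfly layer in scalar Rust, written in the branch-free, contiguous-access shape that LLVM's auto-vectorizer (and hand-written NEON/AVX2 kernels) exploit. The modulus q = 3329 is ML-KEM's; the function is an illustrative sketch, not a production kernel:

```rust
// Sketch: one NTT butterfly layer over a 256-coefficient polynomial.
// Branch-free body, sequential memory access: the access pattern that
// SIMD lanes (and the auto-vectorizer) handle well. Illustrative only.

const Q: u32 = 3329; // ML-KEM modulus

/// Applies a' = a + zeta*b mod q, b' = a - zeta*b mod q across pairs
/// that are `half` apart. Real kernels replace the `%` reductions with
/// Montgomery/Barrett arithmetic; the data flow is the same.
pub fn butterfly_layer(coeffs: &mut [u32; 256], zeta: u32, half: usize) {
    for start in (0..256).step_by(2 * half) {
        for i in start..start + half {
            let t = (coeffs[i + half] * zeta) % Q;
            let a = coeffs[i];
            coeffs[i + half] = (a + Q - t) % Q; // a - t, kept non-negative
            coeffs[i] = (a + t) % Q;
        }
    }
}
```

Each iteration of the inner loop is independent of the others, which is exactly what lets a SIMD unit process many coefficient pairs at once.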
4. Single-threaded event loop bottleneck
Node.js is fundamentally single-threaded for JavaScript execution. Worker threads exist but have high communication overhead (structured clone for message passing). Post-quantum batch operations (signing 32 attestations simultaneously) benefit enormously from work-stealing parallelism. Rust's Rayon library provides zero-overhead parallel iterators that Node.js cannot match without significant architectural compromise.
In our benchmarks, a highly optimized Node.js implementation of ML-DSA-65 (using a native C addon for the core NTT) achieved approximately 3,200 signature verifications per second per core. V100's Rust implementation achieves approximately 47,000 verifications per second per core — a 14.7x advantage. For ML-KEM-768 encapsulation, the ratio is similar: approximately 12x faster in Rust than the best Node.js implementation.
Why Not Java or Go?
Java (via BouncyCastle or similar) and Go are also candidates for crypto-heavy workloads. Both are closer to Rust in raw performance than Node.js. But both share critical weaknesses:
Language comparison for PQ crypto workloads
| Factor | Rust | Java | Go | Node.js |
|---|---|---|---|---|
| GC pauses | None | 5-200ms (G1GC) | 0.1-1ms (low-latency) | 5-100ms (V8) |
| SIMD intrinsics | Stable, full access | Vector API (incubator) | No direct access | None (WASM only) |
| Memory layout control | Full (repr(C), align) | None (JVM decides) | Limited (unsafe) | None |
| Zero-copy buffers | Native (slices, borrows) | ByteBuffer (limited) | Slices (with GC pressure) | Buffer.from (copies) |
| Work-stealing parallelism | Rayon (zero overhead) | ForkJoinPool (GC pressure) | Goroutines (scheduler overhead) | Worker threads (high overhead) |
| Constant-time guarantees | Controllable (no JIT reorder) | JIT can break constant-time | Compiler may optimize | V8 JIT breaks constant-time |
Java's critical weakness is GC unpredictability. Even with low-latency collectors (ZGC, Shenandoah), Java cannot guarantee sub-millisecond pauses under high allocation pressure — which is exactly what PQ crypto operations create (temporary polynomial buffers, NTT intermediate arrays). The JVM's inability to provide stable memory layout also prevents cache-line-aware data structures that are critical for NTT performance.
Go's weaknesses are subtler. Go's GC is excellent for general-purpose workloads (sub-millisecond pauses), but Go lacks SIMD intrinsics, has limited control over memory layout, and its goroutine scheduler adds overhead to tight computational loops. For I/O-heavy microservices without crypto hotspots, Go is excellent. For services where every request involves 0.3 ms of dense computation, the scheduler overhead and lack of SIMD access cost 3-5x in throughput.
V100's 20 Rust Microservices
Every service in V100's architecture is written in Rust. This is not an aesthetic choice — it is a requirement driven by the fact that every service performs PQ attestation. In a system where any single service using classical crypto creates a weak link, there is no room for "the billing service can be Python."
V100 service architecture (all Rust, all PQ-attested)
Rust-Specific Optimizations That Make PQ Feasible
Rust is not just "fast C with safety." It provides specific language features that enable PQ crypto optimizations impossible in other languages:
1. Montgomery NTT with zero-division hot path
The Number Theoretic Transform (NTT) is the core operation in lattice cryptography. Every ML-KEM encapsulation and every ML-DSA signature performs multiple NTT forward and inverse transforms. The naive NTT uses modular reduction (division) after every butterfly operation. In Rust, we use Montgomery multiplication: all twiddle factors are pre-computed in Montgomery form, and the modular reduction is replaced with cheap shift-and-multiply operations. The hot loop contains zero division instructions.
This optimization requires exact control over integer representation (we use u64 with values in [0, 2q) via Harvey lazy reduction) and memory layout (twiddle factors in contiguous, cache-aligned arrays). Rust's type system and #[repr(align)] make this natural. In Java, the JVM decides your array layout and alignment for you. In Go, there is no supported way to request cache-line alignment for a heap-allocated twiddle table.
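The division-free reduction itself is compact. A minimal sketch of the well-known Kyber-style Montgomery reduction over q = 3329 (the text describes V100's u64 lazy-reduction variant; this 16-bit-scale version shows the same idea):

```rust
// Sketch: Montgomery reduction for q = 3329, as used in ML-KEM
// reference code. Replaces modular division with a multiply, a
// subtract, and an arithmetic shift. Illustrative, not V100's kernel.

const Q: i32 = 3329;
const QINV: i32 = 62209; // q^{-1} mod 2^16

/// Given a in (-q * 2^15, q * 2^15), returns r = a * 2^{-16} mod q
/// with |r| < q. The hot path contains no division instruction:
/// the low 16 bits of a - t*q are provably zero, so ">> 16" is exact.
fn montgomery_reduce(a: i32) -> i16 {
    let t = (a.wrapping_mul(QINV) & 0xFFFF) as i16; // a * q^{-1} mod 2^16
    ((a - (t as i32) * Q) >> 16) as i16
}

/// Modular multiply with the reduction folded in: returns
/// a * b * 2^{-16} mod q. With inputs kept in Montgomery form
/// (pre-scaled by 2^16), this computes an ordinary product mod q.
fn mont_mul(a: i16, b: i16) -> i16 {
    montgomery_reduce(a as i32 * b as i32)
}
```

The correctness invariant is that the result r satisfies r * 2^16 = a (mod q), which is what pre-computing twiddle factors in Montgomery form relies on.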
2. NEON SIMD for ARM acceleration
V100's production infrastructure runs on AWS Graviton4 (ARM64). Rust provides stable access to ARM NEON intrinsics through std::arch::aarch64. Our Galois rotation (used in key-switching during FHE operations) uses branchless NEON permutations that process 4 coefficients per instruction. The same operations in Go would require cgo (FFI overhead defeats the purpose) or assembly (no safety guarantees).
3. Work-stealing parallelism with Rayon
V100's batch attestation signs 32 user authentications in a single Dilithium operation. The preparation work (computing frame hashes, assembling Merkle trees) parallelizes perfectly. Rust's Rayon library provides zero-overhead parallel iterators: frames.par_iter().map(|f| sha3_256(f)) automatically distributes work across all available cores with work-stealing. No thread pool configuration, no executor boilerplate, no async/await coloring.
4. Zero-copy request processing
Every V100 API request arrives as a byte buffer containing an ML-DSA signature that must be verified. In Rust, we parse the request using zero-copy deserialization (the parsed signature is a reference into the original buffer, not a copy). The signature verification reads directly from the network buffer without any intermediate allocation. In Node.js, the Buffer would be copied to a V8 heap object, then copied again into the native addon's memory space. Two unnecessary copies per request at 220K RPS, with 3.3 KB signatures, is roughly 1.5 GB of wasted memory traffic per second.
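What borrowing buys here is easy to demonstrate. A minimal sketch with a hypothetical length-prefixed framing (V100's actual wire format is not shown in this post): the parsed fields are slices into the input buffer, so parsing allocates nothing and copies nothing.

```rust
// Sketch: zero-copy parse of a request framed as
// [2-byte big-endian signature length][signature][payload].
// The framing is illustrative, not V100's wire format.

pub struct Request<'a> {
    pub signature: &'a [u8], // borrows from the network buffer
    pub payload: &'a [u8],   // borrows from the network buffer
}

/// Returns None on truncated input. On success, both fields are
/// references into `buf`: no allocation, no memcpy, and the borrow
/// checker guarantees they cannot outlive the buffer.
pub fn parse_request(buf: &[u8]) -> Option<Request<'_>> {
    let len_bytes = buf.get(0..2)?;
    let sig_len = u16::from_be_bytes([len_bytes[0], len_bytes[1]]) as usize;
    let signature = buf.get(2..2 + sig_len)?;
    let payload = buf.get(2 + sig_len..)?;
    Some(Request { signature, payload })
}
```

The verifier then reads `request.signature` straight out of the network buffer, which is the allocation-free path the prose describes.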
5. Deterministic constant-time operations
Cryptographic operations must be constant-time to prevent side-channel attacks. In Rust, we control whether operations are constant-time through explicit types (subtle::ConstantTimeEq) and the compiler respects this because there is no JIT that might "optimize" a constant-time comparison into a short-circuit evaluation. Java's HotSpot JIT has been demonstrated to break constant-time guarantees. V8 (Node.js) absolutely breaks them. Rust's ahead-of-time compilation with explicit optimization barriers provides the only reliable constant-time guarantee outside of assembly.
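The pattern `subtle::ConstantTimeEq` implements can be sketched in plain Rust: examine every byte, accumulate differences with OR, and decide only at the end, so the loop's timing does not depend on where the first mismatch occurs. We use `std::hint::black_box` as a best-effort optimization barrier here; `subtle` uses stronger barriers, and this sketch is illustrative rather than audited:

```rust
// Sketch: constant-time byte-slice comparison. Illustrative only;
// use the `subtle` crate (ConstantTimeEq) in real code.

use std::hint::black_box;

/// Compares a and b without short-circuiting on the first mismatch.
/// Lengths are treated as public, so the early length check is fine.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // black_box discourages the optimizer from rewriting this
        // into an early-exit comparison.
        diff |= black_box(x ^ y);
    }
    diff == 0
}
```

The point of the article's argument is that this property survives Rust's ahead-of-time compilation, whereas a JIT is free to recompile the loop into a short-circuiting one after observing its behavior.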
Production Performance: The Numbers
V100's underlying cryptographic engine (H33) has been benchmarked extensively on production hardware. These are not microbenchmarks — they are sustained throughput measurements under realistic load:
Production benchmarks (Graviton4 c8g.metal-48xl, 192 vCPU)
The pipeline breakdown shows where Rust's optimizations pay off:
FHE Batch (943 µs for 32 users): BFV homomorphic encryption with Montgomery NTT, NEON-accelerated Galois rotation, and NTT-domain persistence (ciphertexts remain in NTT form between operations, eliminating redundant transforms).
Batch Attestation (391 µs for 32 users): One Dilithium sign + verify per 32-user batch (31x savings vs individual signing). SHA3-256 hash chain computation parallelized via Rayon.
ZKP Cached Verify (0.358 µs): In-process DashMap cache (44x faster than network-based alternatives). Cache hit returns pre-verified proof in sub-microsecond time.
These numbers are only possible in Rust. No garbage-collected language can sustain 1.67 million PQ-attested operations per second without GC-induced throughput degradation. We measured: Java (ZGC) achieves approximately 60% of this throughput with 3x higher p99 latency. Go achieves approximately 45% with erratic throughput due to goroutine scheduling overhead during NTT computation.
The Full Rust Stack: Tokio + Rayon + Custom Allocator
V100's Rust services use a carefully chosen runtime stack:
Tokio for async I/O (network, disk, timers). The multi-threaded runtime with work-stealing handles hundreds of thousands of concurrent connections without blocking.
Rayon for CPU-parallel computation (NTT, hash chains, batch operations). Separate thread pool from Tokio to prevent crypto work from starving I/O tasks.
System allocator (not jemalloc) on ARM64. We tested jemalloc extensively — it causes an 8% throughput regression on Graviton4 because glibc's malloc is already optimized for ARM's flat memory model, and jemalloc's arena bookkeeping is pure overhead with 96 workers doing tight NTT loops.
DashMap for concurrent caching. Lock-free concurrent hashmap that provides sub-microsecond lookups without mutex contention. Used for ZKP proof caching, session state, and rate limiting.
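The idea behind DashMap's lock-freedom from the caller's perspective is sharding: hash each key to one of N independent shards so that readers of different shards never contend on the same lock. A dependency-free stand-in using std types (DashMap's real implementation is more sophisticated; this sketch only shows the sharding principle):

```rust
// Sketch: a sharded concurrent cache, the principle behind DashMap,
// built from std types only. Illustrative, not V100's cache.

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Keys hash to one of N shards, each behind its own RwLock, so
/// lookups on different shards proceed fully in parallel.
pub struct ShardedCache<V> {
    shards: Vec<RwLock<HashMap<String, V>>>,
}

impl<V: Clone> ShardedCache<V> {
    pub fn new(num_shards: usize) -> Self {
        ShardedCache {
            shards: (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard(&self, key: &str) -> &RwLock<HashMap<String, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    pub fn insert(&self, key: String, value: V) {
        self.shard(&key).write().unwrap().insert(key, value);
    }

    pub fn get(&self, key: &str) -> Option<V> {
        self.shard(key).read().unwrap().get(key).cloned()
    }
}
```

DashMap refines this with per-shard RwLocks tuned to core count and an API that hands out guarded references instead of clones, which is where the sub-microsecond lookups come from.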
Conclusion: PQ Crypto Demands Rust
The decision to write V100 in Rust was not ideological. It was mathematical. Post-quantum cryptography introduces a performance tax that scales linearly with request volume. At 220,000+ RPS across 20 services, that tax must be minimized at every level: no GC pauses, no unnecessary copies, no scheduler overhead, no JIT recompilation of hot crypto loops, no memory layout surprises.
Rust is the only mainstream language that provides all of: zero-cost abstractions over hardware (SIMD, cache-aligned structures, stack allocation), fearless concurrency (Rayon for data parallelism, Tokio for I/O parallelism), deterministic performance (no GC, no JIT warmup), and constant-time guarantees (no optimizer rewriting crypto operations).
The result: V100 delivers post-quantum security at classical performance. Users experience no latency penalty for quantum protection. The PQ tax is absorbed by the language and architecture choices — invisible to the end user, but defining the entire engineering stack underneath.
Experience what PQ video performance feels like
20 Rust microservices. Three post-quantum algorithm families. 1.67 million authentications per second. Zero perceptible latency overhead. Start free and see the green PQ-E2E badge on your first call.