Peer-to-peer WebRTC is elegant for two people. Each participant sends their video and audio directly to the other participant. One connection. No server in the middle. The latency is as low as physics allows.
Add a third person and each participant now needs two connections. Four people: three connections each. Six people: five connections each, for a total of 30 directional streams. Ten people: 90 streams. Twenty people: 380 streams. The formula is N × (N-1), and it scales catastrophically. By the time you reach 6 participants, most consumer internet connections cannot sustain the upstream bandwidth required to send a separate video stream to each peer. At 10, it is impossible. At 200, the number of peer connections (39,800) is absurd on its face.
This is the fundamental scaling problem of WebRTC, and it is why every large video call in the world uses a Selective Forwarding Unit (SFU). The SFU sits between participants and changes the connection topology from "everyone connects to everyone" to "everyone connects to one server." Each participant sends one stream to the SFU. The SFU forwards each participant's stream to all other participants. The number of connections is 2N (one inbound, one outbound per participant) instead of N×(N-1). This is the architecture that makes Zoom, Teams, Meet, and every other large video call work.
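The quadratic-versus-linear difference is easy to check. A quick sketch of the arithmetic (illustrative only, not V100 source code):

```rust
// Mesh vs. SFU scaling for an N-participant call.

/// Directional media streams in a full-mesh P2P topology: N × (N − 1).
fn mesh_streams(n: u32) -> u32 {
    n * (n - 1)
}

/// Connections in an SFU topology: one inbound + one outbound per participant.
fn sfu_connections(n: u32) -> u32 {
    2 * n
}
```

At 200 participants, the mesh needs 39,800 streams while the SFU needs 400 connections.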
V100's SFU is written from scratch in Rust. It supports 200 participants per room, 1,000 rooms per server instance, three-layer simulcast with adaptive quality selection, and automatic P2P-to-SFU switching. This post is the complete architectural deep-dive: how the SFU is structured, how simulcast works, how adaptive quality selection decides what to forward, and why Rust gives us a structural advantage over Go and C++ alternatives.
Architecture: SfuRoom + SfuRouter
V100's SFU is organized around two primary structures: SfuRoom and SfuRouter. The SfuRoom manages a single room of up to 200 participants. The SfuRouter manages up to 1,000 SfuRooms on a single server instance. The separation of concerns is clean: the Room handles participant lifecycle, stream forwarding, and quality selection. The Router handles room creation, destruction, and resource allocation.
SFU Architecture
Participant state is stored in a DashMap&lt;ParticipantId, ParticipantState&gt;, a concurrent hash map built on sharded locking: the map is divided into shards (typically 64), and each shard has its own read-write lock. Operations on different participants almost never contend for the same shard. This means 200 participants can join, leave, publish, and unpublish streams concurrently without blocking each other. Compare this to a single sync.Mutex protecting a Go map, where every concurrent operation must acquire the same lock.
Simulcast: 3 Quality Layers
Simulcast is the technique that makes SFU-based video calls practical at scale. Without simulcast, the SFU forwards the single video stream from each participant to all recipients at the same quality. With 200 participants, the server must forward 200 high-quality streams to each recipient — an egress bandwidth explosion. Simulcast solves this by having each participant encode and upload three quality layers simultaneously. The SFU then selects the appropriate layer for each recipient based on their bandwidth and the sender's importance (active speaker vs. non-speaker).
| Layer | Resolution | Frame Rate | Bitrate | Use Case |
|---|---|---|---|---|
| Low | 320x180 | 15fps | 150 Kbps | Non-speaking thumbnails |
| Medium | 640x360 | 25fps | 500 Kbps | Visible but not speaking |
| High | 1280x720 | 30fps | 1.5 Mbps | Active speaker, pinned |
Each participant's browser encodes all three layers simultaneously, configured through the WebRTC RTCRtpEncodingParameters API. Each layer is carried as a separate RTP stream with its own SSRC, distinguished by an RTP stream ID (RID). The participant's upstream bandwidth requirement is the sum of all three layers: approximately 2.15 Mbps. That is about 40% more than sending a single 720p stream alone, but it gives the SFU the flexibility to select the optimal layer for each recipient independently.
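The layer table can be expressed directly as data. The struct and field names in this Rust sketch are illustrative assumptions, not V100's internal types, but the numbers match the table above:

```rust
// The three simulcast layers, as data. Bitrates are in Kbps.
struct SimulcastLayer {
    rid: &'static str,
    width: u32,
    height: u32,
    max_framerate: u32,
    max_bitrate_kbps: u32,
}

const LAYERS: [SimulcastLayer; 3] = [
    SimulcastLayer { rid: "q", width: 320,  height: 180, max_framerate: 15, max_bitrate_kbps: 150 },
    SimulcastLayer { rid: "h", width: 640,  height: 360, max_framerate: 25, max_bitrate_kbps: 500 },
    SimulcastLayer { rid: "f", width: 1280, height: 720, max_framerate: 30, max_bitrate_kbps: 1_500 },
];

/// Total upstream cost of publishing all three layers, in Kbps.
fn upstream_kbps() -> u32 {
    LAYERS.iter().map(|l| l.max_bitrate_kbps).sum()
}
```

The RID values ("q", "h", "f") match the room configuration shown in the code sample later in this post.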
Adaptive Quality Selection
The SFU's core intelligence is in layer selection: deciding which simulcast layer of each sender to forward to each recipient. This decision is made per-sender, per-recipient, and is re-evaluated continuously. The algorithm considers three factors: whether the sender is the active speaker, the recipient's available downstream bandwidth, and whether the sender's tile is visible in the recipient's viewport.
Layer selection rules
- Active speaker: Forward high layer (1280x720@30fps) to all recipients. The person talking deserves the best quality.
- Visible, not speaking, good bandwidth: Forward medium layer (640x360@25fps). Good enough for facial expressions without consuming high bandwidth.
- Visible, not speaking, constrained bandwidth: Forward low layer (320x180@15fps). Preserves presence without overwhelming the connection.
- Not visible (scrolled off-screen): Forward nothing. Zero bandwidth for tiles the recipient cannot see.
- Audio-only mode: Forward audio track only. Video paused to conserve bandwidth.
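The rules above amount to a small per-sender, per-recipient decision function. Here is a hedged sketch in Rust; the types, names, and exact rule ordering are illustrative assumptions, not V100's internal code:

```rust
/// What the SFU forwards for one sender to one recipient.
#[derive(PartialEq, Debug)]
enum Forward {
    High,      // 1280x720 @ 30fps
    Medium,    // 640x360 @ 25fps
    Low,       // 320x180 @ 15fps
    AudioOnly, // recipient opted out of video
    Nothing,   // tile off-screen: zero bandwidth
}

/// The recipient-side facts the selector considers.
struct RecipientView {
    tile_visible: bool,
    bandwidth_constrained: bool,
    audio_only_mode: bool,
}

fn select_layer(sender_is_active_speaker: bool, view: &RecipientView) -> Forward {
    if view.audio_only_mode {
        Forward::AudioOnly
    } else if !view.tile_visible {
        Forward::Nothing
    } else if sender_is_active_speaker {
        Forward::High
    } else if view.bandwidth_constrained {
        Forward::Low
    } else {
        Forward::Medium
    }
}
```

The key property is that the decision is cheap and stateless per pair, so it can be re-evaluated continuously as speakers, bandwidth, and viewports change.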
The "not visible" optimization is significant in large rooms. In a 200-person call, a typical participant sees 9-16 tiles on screen at a time (depending on layout and screen size). The remaining 184-191 participants are off-screen. For those participants, the SFU forwards zero video — only audio (which is tiny at ~30Kbps per stream with Opus). This means the recipient's downstream bandwidth scales with the number of visible tiles, not the number of total participants. A 200-person call consumes roughly the same downstream bandwidth as a 16-person call.
Bandwidth Math: 200 Participants
Let us work through the bandwidth numbers for a 200-participant call with adaptive quality selection. This is the math that determines whether the architecture is practical.
Per-recipient downstream bandwidth (200 participants)
Approximately 12 Mbps downstream per recipient in a 200-person call. This is well within the capacity of a typical broadband connection (25-100 Mbps). The server-side egress for the same room: 12 Mbps × 200 recipients = 2.4 Gbps. This is high but manageable on modern cloud instances with 10-25 Gbps network interfaces. And this is the ceiling — most real-world rooms do not have 200 participants all with video enabled. Webinar-style rooms with 1-3 presenters and 197 view-only participants require dramatically less server egress.
Compare this to naive forwarding without simulcast or visibility optimization: 200 participants × 1.5 Mbps each × 199 recipients = 59.7 Gbps per room. Simulcast and adaptive quality reduce server egress by approximately 25x.
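The egress arithmetic can be restated in code, using only figures from the text (~12 Mbps per recipient with adaptive selection, 1.5 Mbps per high-quality stream under naive forwarding):

```rust
const PARTICIPANTS: u32 = 200;
const PER_RECIPIENT_MBPS: f64 = 12.0; // with simulcast + visibility optimization
const HIGH_LAYER_MBPS: f64 = 1.5;     // naive forwarding sends this to everyone

/// Server egress with adaptive quality selection, in Gbps.
fn adaptive_egress_gbps() -> f64 {
    PER_RECIPIENT_MBPS * PARTICIPANTS as f64 / 1_000.0
}

/// Server egress forwarding every stream at full quality, in Gbps.
fn naive_egress_gbps() -> f64 {
    HIGH_LAYER_MBPS * PARTICIPANTS as f64 * (PARTICIPANTS - 1) as f64 / 1_000.0
}
```

The ratio of the two (59.7 / 2.4) is where the ~25x reduction figure comes from.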
DashMap: Sharded Concurrency in Rust
The SFU's performance under high participant counts depends on concurrent access to shared state: participant lists, stream metadata, quality selections, and room events. In Go, the standard approach is a sync.RWMutex protecting a map. This works at small scale but creates contention at large scale. When 200 participants are simultaneously joining, leaving, publishing, and unpublishing, every operation must acquire the mutex. Even a read-write lock creates contention when writes are frequent.
V100 uses DashMap, a concurrent hash map from the Rust ecosystem. DashMap uses sharded locking: the key space is divided into shards (64 on a typical server), each with its own RwLock. Two operations contend only if they hash to the same shard, so with operations spread uniformly across 64 shards, the chance that any two concurrent operations collide on the same lock is roughly 1/64, about 1.6%. In practice, this means the SFU handles concurrent participant operations with near-zero lock contention.
The DashMap advantage compounds at the Router level. With 1,000 rooms, the Router's room map is also a DashMap. Room creation, destruction, and lookup are all concurrent with per-shard locking. A Go implementation would need either a global mutex (creating a bottleneck) or manual sharding (adding complexity). DashMap provides optimal sharding out of the box.
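To make the sharding idea concrete, here is a minimal sharded map built from std types only. DashMap itself is considerably more refined (and faster), but the contention argument is the same: each key maps to one shard, and locking that shard leaves the other 63 available.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

const SHARDS: usize = 64;

/// A toy sharded map: one RwLock per shard instead of one for the whole map.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    /// Which shard a key belongs to, by hashing the key.
    fn shard_index(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % SHARDS
    }

    /// Only this key's shard is write-locked; the other 63 stay available.
    fn insert(&self, key: K, value: V) {
        let idx = self.shard_index(&key);
        self.shards[idx].write().unwrap().insert(key, value);
    }

    fn get_cloned(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        let idx = self.shard_index(key);
        self.shards[idx].read().unwrap().get(key).cloned()
    }
}
```

A Go implementation would have to hand-roll exactly this kind of sharding to match; DashMap ships it (with a far more optimized internal layout) out of the box.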
Room Events: Real-Time Participant Lifecycle
The SFU broadcasts room events to all participants when state changes occur. These events drive the client's UI: adding and removing video tiles, updating speaker indicators, showing mute state, and triggering auto-zoom speaker transitions.
| Event | Trigger | Payload |
|---|---|---|
| participant:join | New participant connects | participant_id, display_name, capabilities |
| participant:leave | Participant disconnects | participant_id, reason |
| track:publish | Participant starts sending video/audio | participant_id, track_id, kind, simulcast_layers |
| track:unpublish | Participant stops sending | participant_id, track_id |
| track:mute | Track muted (audio or video) | participant_id, track_id, muted |
| speaker:change | Active speaker changes | participant_id, audio_level |
Events are broadcast via the signaling channel (WebSocket), not through WebRTC data channels. This ensures event delivery is reliable and ordered, even when media streams are experiencing packet loss. The signaling server and SFU share the same process, so event dispatch has zero network latency — the event is sent directly from the SfuRoom to the WebSocket handler.
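The event taxonomy from the table above can be sketched as a Rust enum. The variant and field shapes here are assumptions for illustration, not V100's actual wire format; only the event names come from the table:

```rust
/// Room lifecycle events, mirroring the event table above.
enum RoomEvent {
    ParticipantJoin { participant_id: String, display_name: String },
    ParticipantLeave { participant_id: String, reason: String },
    TrackPublish { participant_id: String, track_id: String, kind: String },
    TrackUnpublish { participant_id: String, track_id: String },
    TrackMute { participant_id: String, track_id: String, muted: bool },
    SpeakerChange { participant_id: String, audio_level: f32 },
}

impl RoomEvent {
    /// Event name as broadcast on the signaling channel.
    fn name(&self) -> &'static str {
        match self {
            RoomEvent::ParticipantJoin { .. } => "participant:join",
            RoomEvent::ParticipantLeave { .. } => "participant:leave",
            RoomEvent::TrackPublish { .. } => "track:publish",
            RoomEvent::TrackUnpublish { .. } => "track:unpublish",
            RoomEvent::TrackMute { .. } => "track:mute",
            RoomEvent::SpeakerChange { .. } => "speaker:change",
        }
    }
}
```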
Auto P2P ↔ SFU Switching
An SFU adds a server hop. For 2-3 participants, this hop is unnecessary overhead — P2P gives the lowest latency and simplest topology. For 4+ participants, the SFU is essential. V100 handles this transition automatically.
When a room has 3 or fewer participants, V100 uses direct P2P WebRTC connections. When the 4th participant joins, V100 seamlessly transitions to SFU mode. Each existing participant's peer connection is renegotiated to route through the SFU instead of directly to peers. The transition happens in under 500ms and is invisible to the user — no reconnection dialog, no audio gap, no video freeze.
When participants leave and the room drops back below 4, V100 transitions back to P2P. The hysteresis threshold prevents oscillation: the SFU-to-P2P transition requires staying at 3 or fewer participants for 5 seconds, preventing flip-flopping when a 4th participant briefly connects and disconnects.
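The hysteresis described above reduces to a small state machine: upgrade to SFU immediately at the threshold, but fall back to P2P only after the count has stayed below it for the full hold period. A sketch (names and structure are illustrative, assuming the 4-participant threshold and 5-second hold from the text):

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Debug)]
enum Mode {
    P2p,
    Sfu,
}

const SFU_THRESHOLD: usize = 4;
const FALLBACK_HOLD: Duration = Duration::from_secs(5);

struct TopologySwitcher {
    mode: Mode,
    below_threshold_since: Option<Instant>,
}

impl TopologySwitcher {
    fn new() -> Self {
        Self { mode: Mode::P2p, below_threshold_since: None }
    }

    /// Call on every participant-count change and periodically on a timer.
    fn update(&mut self, participant_count: usize, now: Instant) -> Mode {
        if participant_count >= SFU_THRESHOLD {
            // Upgrade immediately; clear any pending fallback.
            self.mode = Mode::Sfu;
            self.below_threshold_since = None;
        } else if self.mode == Mode::Sfu {
            // Downgrade only after the hold period, to avoid flip-flopping
            // when a 4th participant briefly connects and disconnects.
            let since = *self.below_threshold_since.get_or_insert(now);
            if now.duration_since(since) >= FALLBACK_HOLD {
                self.mode = Mode::P2p;
                self.below_threshold_since = None;
            }
        }
        self.mode
    }
}
```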
Topology switching
Code Sample: Room Creation and Management
```javascript
// Create a room with SFU configuration
const room = await v100.createRoom({
  name: 'company-all-hands',
  maxParticipants: 200,
  mode: 'auto', // 'p2p' | 'sfu' | 'auto'
  simulcast: {
    enabled: true,
    layers: [
      { rid: 'q', width: 320, height: 180, maxBitrate: 150_000, maxFramerate: 15 },
      { rid: 'h', width: 640, height: 360, maxBitrate: 500_000, maxFramerate: 25 },
      { rid: 'f', width: 1280, height: 720, maxBitrate: 1_500_000, maxFramerate: 30 },
    ],
  },
  adaptiveQuality: true, // SFU selects layer per recipient
  autoSfuThreshold: 4,   // Switch to SFU at 4 participants
});

// Listen for participant events
room.on('participant:join', (p) => {
  console.log(`${p.displayName} joined (${room.participantCount}/200)`);
});

room.on('participant:leave', (p) => {
  console.log(`${p.displayName} left`);
});

room.on('speaker:change', (event) => {
  console.log(`Active speaker: ${event.participantId}`);
});

// Get room stats
const stats = await room.getStats();
console.log(`Mode: ${stats.mode}`); // 'p2p' or 'sfu'
console.log(`Participants: ${stats.count}`);
console.log(`Egress: ${stats.egressMbps} Mbps`);
```
V100 SFU vs. LiveKit vs. mediasoup
| Feature | V100 | LiveKit | mediasoup |
|---|---|---|---|
| Language | Rust | Go | C++ / Node.js |
| Concurrency model | DashMap (sharded locking) | sync.Mutex | Single-threaded event loop |
| GC pauses | None (no GC) | Yes (Go GC) | V8 GC (control plane) |
| Max participants/room | 200 | ~100 (docs) | Varies by deployment |
| Simulcast layers | 3 (configurable) | 3 | 3 |
| Adaptive quality | Bandwidth + viewport + speaker | Bandwidth + priority | Manual layer selection |
| Auto P2P/SFU switch | Yes (seamless) | No (SFU always) | No (SFU always) |
| Visibility optimization | Yes (zero BW off-screen) | Partial | Manual |
| Open source | No (proprietary) | Yes (Apache 2.0) | Yes (ISC) |
| Managed platform | Yes (V100 API) | Yes (LiveKit Cloud) | No (self-hosted only) |
LiveKit is the closest comparable SFU. It is well-engineered, open source, and powers many production applications. The architectural difference is the concurrency model. Go's goroutine scheduler is excellent for I/O-bound workloads, but the garbage collector introduces non-deterministic pauses that can cause frame drops during peak load. Rust eliminates GC pauses entirely. Additionally, DashMap's sharded locking outperforms Go's sync.Mutex under high contention — the scenario that occurs when 200 participants are simultaneously active.
mediasoup is the C++ heavyweight of the SFU world. Its media-handling layer (written in C++) is extremely fast. But its control plane runs on Node.js, which introduces V8 GC pauses and single-threaded event loop limitations for room management operations. V100's pure Rust stack has no such split — room management, media forwarding, and participant lifecycle all run in the same process with the same memory model.
The honest tradeoff: LiveKit and mediasoup are open source; V100's SFU is proprietary. If you need source code access or want to host the SFU yourself with full customization, LiveKit and mediasoup are strong choices. If you want a managed SFU with these performance characteristics and no infrastructure to maintain, V100's API gives you all of the above without deploying a single server.
33 Tests Passing
V100's SFU has a comprehensive test suite covering every aspect of room management, participant lifecycle, simulcast layer selection, and edge cases. The test suite runs on every commit and must pass before any deployment.
Test coverage
- ✓ Room creation and destruction
- ✓ Participant join/leave (including abrupt disconnection)
- ✓ Max participant enforcement (201st participant rejected)
- ✓ Track publish/unpublish lifecycle
- ✓ Simulcast layer negotiation
- ✓ Adaptive quality selection (speaker, bandwidth, visibility)
- ✓ Mute/unmute propagation
- ✓ Speaker detection accuracy
- ✓ P2P to SFU transition (forward and reverse)
- ✓ Hysteresis on SFU-to-P2P fallback
- ✓ Room event broadcast ordering
- ✓ Concurrent participant operations (DashMap stress test)
- ✓ Router room limit enforcement (1,001st room rejected)
What This Means for Developers
For developers building on V100, the SFU is invisible. You create a room, participants join, and V100 handles the rest: deciding when to use P2P vs. SFU, selecting simulcast layers, managing bandwidth, and broadcasting events. You do not configure SFU instances. You do not manage TURN servers. You do not tune simulcast parameters (unless you want to). The API abstracts the entire SFU layer behind room configuration options that express intent ("I want up to 200 participants with adaptive quality") rather than implementation ("route RTP packets through a media relay with these codec parameters").
The SFU works seamlessly with V100's other features: AI auto-zoom uses the SFU's speaker detection events. Per-tile zoom works on SFU-forwarded streams. Noise suppression processes the local audio before it reaches the SFU. These features compose because V100 owns the entire stack from client to server.
Build for 200 participants
Create a room, invite participants, and scale to 200 without configuring a single server. V100's SFU handles the infrastructure so you can focus on the product.