Open a video call with eight people. Look at any participant's tile. Their face occupies maybe 80 pixels across on your screen — smaller than a postage stamp, too small to read their expression, too small to see if they are nodding, frowning, or about to interrupt. You are in a meeting, but you are functionally blind to the nonverbal communication that makes in-person conversation work.
Every video platform in the world has this problem. Zoom, Microsoft Teams, Google Meet, Webex — they all render participants as fixed-size tiles in a grid. The more people who join, the smaller every tile gets. The facial expressions, body language, and micro-reactions that drive human communication shrink into irrelevance. On mobile, it is worse: eight faces on a 6-inch screen means each face is about 40 pixels wide. You are squinting at a grid of thumbnails while trying to have a conversation.
Manual zoom does not solve it. Even if a platform let you zoom in — and most do not — you would have to manually zoom, track who is speaking, and re-zoom every time the speaker changes. In a meeting where three or four people are having a discussion, you would spend the entire call adjusting your viewport instead of listening. The cognitive load of manually tracking the speaker defeats the purpose.
V100 solves this with AI auto-zoom: a system that automatically detects the active speaker, locates their face in the video frame, and smoothly zooms in on every participant's screen simultaneously. When the speaker changes, the zoom transitions to the new speaker. No manual interaction. No cognitive overhead. You see the speaker's face at up to 3.5x magnification, large enough to read every expression, while the system handles the tracking.
Nobody else has this. Not as a product, not as an API, not even as an experimental feature. This post explains the full architecture: client-side face detection, server-side speaker tracking, the smoothing and hysteresis algorithms that prevent jitter, and why this is harder than it sounds.
The Architecture: Client + Rust Server
Auto-zoom speaker tracking requires coordination between two systems: client-side face detection that knows where the face is in the video frame, and server-side speaker detection that knows who is speaking. Neither system alone is sufficient. Face detection without speaker detection does not know which face to zoom into. Speaker detection without face detection does not know where the face is in the frame to center the zoom. V100 combines both.
*Figure: the auto-zoom pipeline.*
The key insight is that speaker detection happens on the server, but face detection and zoom happen on the client. This division of labor is deliberate. The server has access to all audio streams and can compare levels across participants. The client has access to the video frame and can run face detection without sending video data back to the server. No additional bandwidth. No privacy concerns from server-side face analysis.
Client-Side Face Detection: Three Tiers of Compatibility
Face detection in the browser is a fragmented landscape. Chrome ships the FaceDetector API as part of the Shape Detection API — a native, hardware-accelerated face detector that returns bounding boxes in under 10ms. It is fast, accurate, and available in Chrome 70+ and Edge 79+. But it is not available in Safari, Firefox, or any WebKit-based browser.
For Safari and Firefox, V100 falls back to MediaPipe's Face Detection model, loaded as a WASM module. MediaPipe is slightly heavier — the model weighs about 1.5MB and detection takes 15-30ms per frame — but it is accurate and runs in all modern browsers. On devices where neither the FaceDetector API nor MediaPipe is available (older mobile browsers, embedded webviews), V100 uses a motion-detection fallback that tracks the region of highest pixel-delta activity in the video frame. This is less precise than face detection but still provides a reasonable zoom target for the speaker.
| Tier | Technology | Browsers | Detection Time | Accuracy |
|---|---|---|---|---|
| Tier 1 | FaceDetector API | Chrome, Edge | <10ms | Excellent |
| Tier 2 | MediaPipe WASM | Safari, Firefox | 15-30ms | Excellent |
| Tier 3 | Motion fallback | All browsers | <5ms | Good |
The tiered approach means auto-zoom works everywhere, with the best possible quality on each platform. Chrome users get hardware-accelerated face detection. Safari users get MediaPipe. Everyone else gets motion tracking. The zoom behavior is identical across all three tiers — only the precision of the face bounding box differs.
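The Tier 3 motion fallback can be sketched as a pure function: divide the frame into a grid, sum the absolute pixel deltas between consecutive frames in each cell, and treat the most active cell as the zoom target. This is an illustrative sketch, not V100's actual implementation; the grid size and the flat grayscale-array frame representation are assumptions made for clarity.

```javascript
// Sketch of the motion-detection fallback: find the grid cell with the
// highest pixel-delta activity between two consecutive grayscale frames.
// `prev` and `curr` are flat arrays of length width * height.
function motionTarget(prev, curr, width, height, grid = 4) {
  const cellW = Math.floor(width / grid);
  const cellH = Math.floor(height / grid);
  let best = { x: 0, y: 0, delta: -1 };
  for (let gy = 0; gy < grid; gy++) {
    for (let gx = 0; gx < grid; gx++) {
      // Sum absolute per-pixel differences within this cell
      let delta = 0;
      for (let y = gy * cellH; y < (gy + 1) * cellH; y++) {
        for (let x = gx * cellW; x < (gx + 1) * cellW; x++) {
          const i = y * width + x;
          delta += Math.abs(curr[i] - prev[i]);
        }
      }
      if (delta > best.delta) best = { x: gx * cellW, y: gy * cellH, delta };
    }
  }
  // Return the bounding box of the most active cell as the zoom target
  return { x: best.x, y: best.y, width: cellW, height: cellH };
}
```

In practice the grayscale frames would come from downsampling the video into an offscreen canvas; the bounding box then plays the same role as a face box in the higher tiers.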
Server-Side Speaker Tracking: The ActiveSpeakerTracker
Determining who is speaking sounds trivial: whoever has the highest audio level is the speaker. In practice, it is a signal processing problem. Audio levels fluctuate frame-to-frame. Background noise creates false positives. Two people talking simultaneously creates ambiguity. Brief pauses between sentences should not cause the zoom to snap away. And when the speaker does change, the transition should happen quickly enough to feel responsive but slowly enough to avoid jarring jumps.
V100's ActiveSpeakerTracker runs on the Rust signaling server and solves these problems with three mechanisms: exponential smoothing, hysteresis, and hold time.
Exponential Smoothing
Raw audio levels are noisy. A participant's microphone might report 0.8 one frame and 0.3 the next, even during continuous speech, due to consonant/vowel transitions, breath pauses, and ambient variation. The ActiveSpeakerTracker applies exponential smoothing with an alpha of 0.3 to each participant's audio level stream. This means each new sample contributes 30% to the smoothed value, and the historical average contributes 70%. The result is a stable signal that tracks sustained speech without reacting to transient spikes.
The formula is straightforward: smoothed = alpha * current + (1 - alpha) * previous. With beacons arriving every 100ms and alpha at 0.3, it takes approximately 300-500ms of sustained speech for a new speaker's smoothed level to overtake the previous speaker. This is fast enough to feel responsive but slow enough to ignore a brief cough or laugh.
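The formula above can be demonstrated directly. In this sketch, levels are assumed to be in the 0 to 1 range, sampled once per 100ms beacon:

```javascript
// Exponential smoothing as described: each new sample contributes
// alpha = 0.3, the running average contributes the remaining 0.7.
function smooth(previous, current, alpha = 0.3) {
  return alpha * current + (1 - alpha) * previous;
}

// One loud frame barely moves a quiet participant's level...
let spiky = smooth(0.1, 0.9); // ≈ 0.34

// ...but sustained speech overtakes within a few 100ms beacons:
let sustained = 0.1;
for (let i = 0; i < 4; i++) sustained = smooth(sustained, 0.9);
// sustained ≈ 0.71 after 4 beacons (~400ms)
```

This is where the 300-500ms figure comes from: at alpha 0.3, a new speaker needs three to five consecutive loud beacons before their smoothed level overtakes someone who has stopped talking.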
Hysteresis: Preventing Flip-Flop
Even with smoothing, two participants with similar audio levels can cause the speaker designation to oscillate rapidly — a phenomenon called flip-flopping. Imagine two people in a debate, both speaking at similar volumes. Without protection, the auto-zoom would snap back and forth between them multiple times per second, creating a nauseating visual experience.
V100 prevents this with hysteresis: a minimum time of 1 second between speaker switches. Once the system designates a new active speaker, it will not switch again for at least 1,000ms, regardless of audio levels. This means even in a rapid back-and-forth exchange, the zoom settles on one speaker at a time and transitions at a pace the viewer can follow. The 1-second threshold was determined through user testing — shorter intervals felt jittery, longer intervals felt sluggish.
Hold Time: Handling Pauses
People pause when they speak. They stop to think, to take a breath, to let a point land. A naive speaker tracker would interpret any pause as "this person stopped talking" and immediately switch to someone else — or worse, to background noise from another participant's microphone.
V100 implements a 500ms hold time that keeps the zoom focused on the current speaker even when their audio level drops to zero. If another participant starts speaking during the hold period with a smoothed level that exceeds the threshold, the system switches. But if the hold period expires without a new speaker emerging, the zoom stays on the last speaker. This produces the natural behavior users expect: the camera stays on the speaker while they pause for breath, and only moves when someone else genuinely takes over the conversation.
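Putting the three mechanisms together, the switching logic can be sketched in JavaScript. The production tracker runs in Rust on the signaling server, so this is an illustrative model, not V100's code; the class and field names and the 0.2 noise threshold are assumptions, while the 1,000ms hysteresis and 500ms hold values match the text above.

```javascript
// Illustrative model of the active-speaker switching logic.
class SpeakerSwitcher {
  constructor({ hysteresisMs = 1000, holdTimeMs = 500, threshold = 0.2 } = {}) {
    this.hysteresisMs = hysteresisMs;   // min time between switches
    this.holdTimeMs = holdTimeMs;       // grace period during pauses
    this.threshold = threshold;         // assumed noise floor
    this.active = null;                 // current active speaker id
    this.lastSwitchAt = -Infinity;
    this.lastActiveSpeechAt = -Infinity;
  }

  // levels: Map of participantId -> smoothed audio level; now: ms timestamp
  update(levels, now) {
    const activeLevel = levels.get(this.active) ?? 0;
    if (activeLevel >= this.threshold) this.lastActiveSpeechAt = now;

    // Find the loudest other participant above the noise threshold
    let challenger = null;
    let challengerLevel = this.threshold;
    for (const [id, level] of levels) {
      if (id !== this.active && level > challengerLevel) {
        challenger = id;
        challengerLevel = level;
      }
    }
    // Hold during silence: nobody else is speaking, keep the current speaker
    if (challenger === null) return this.active;
    // Hysteresis: never switch within 1s of the last switch
    if (now - this.lastSwitchAt < this.hysteresisMs) return this.active;
    // Within the hold window a challenger must overtake the active
    // speaker's smoothed level; after it expires any challenger wins
    const holding = this.active !== null &&
                    now - this.lastActiveSpeechAt < this.holdTimeMs;
    if (holding && challengerLevel <= activeLevel) return this.active;

    this.active = challenger;
    this.lastSwitchAt = now;
    this.lastActiveSpeechAt = now;
    return this.active;
  }
}
```

Feeding this model a debate between two participants shows the intended behavior: a brief spike from the second speaker is absorbed by the hysteresis window, and only a sustained, higher smoothed level after the window expires triggers a switch.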
The Zoom Mechanism: Canvas Viewport Transformation
When the server broadcasts an auto_zoom:focus event, every client receives the speaker ID simultaneously. The client then performs face detection on the speaker's video tile, obtains the face bounding box, and smoothly animates the canvas viewport to center on and zoom into that face.
The zoom is canvas-based, not CSS-based. This distinction matters. CSS transform: scale() on a video element creates aliased, blurry results because the browser scales the composited pixels. V100 renders each participant's video to a <canvas> element and applies the zoom as a viewport transformation on the canvas context. The video source remains at full resolution, and the canvas draws the zoomed region by sampling from the full-resolution frame. The result is sharp, artifact-free zoom at any magnification level.
The zoom animation uses easeOutCubic easing over 300ms, producing a smooth deceleration that feels natural. The zoom range is 1x to 5x, with auto-zoom typically settling between 2x and 3.5x depending on how much of the tile the face occupies. The algorithm calculates the optimal zoom level to fill approximately 60% of the tile with the speaker's face, leaving enough context to see head movement and hand gestures.
*Figure: auto-zoom target calculation.*
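The per-frame math behind this can be sketched with two pure functions: the standard easeOutCubic curve named above, and a helper that computes which region of the full-resolution frame to sample for a given zoom level and focus point. The helper name is illustrative, not V100's API; the commented drawImage call shows how the result would feed the canvas pipeline.

```javascript
// Standard easeOutCubic easing: fast start, smooth deceleration.
const easeOutCubic = (t) => 1 - Math.pow(1 - t, 3);

// Given a zoom level and a focus point (e.g. the face-box center), compute
// the source rectangle to sample from a frameW x frameH video frame.
// Clamping keeps the viewport inside the frame at the edges.
function zoomSourceRect(frameW, frameH, zoom, cx, cy) {
  const sw = frameW / zoom;
  const sh = frameH / zoom;
  const sx = Math.min(Math.max(cx - sw / 2, 0), frameW - sw);
  const sy = Math.min(Math.max(cy - sh / 2, 0), frameH - sh);
  return { sx, sy, sw, sh };
}

// Per frame (illustrative): sample the zoomed region at full resolution.
// const { sx, sy, sw, sh } =
//   zoomSourceRect(video.videoWidth, video.videoHeight, zoom, cx, cy);
// ctx.drawImage(video, sx, sy, sw, sh, 0, 0, canvas.width, canvas.height);
```

Because drawImage samples the source rectangle from the full-resolution frame and scales it to the canvas in one step, the zoomed output stays sharp at any magnification, which is the advantage over CSS scaling described above.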
Per-Tile Independent Zoom
A critical design decision: the auto-zoom applies to the speaker's tile only. Every other tile remains at its default zoom level. This means in a 6-person call, you see 5 participants at normal size and the active speaker zoomed in. When the speaker changes, the previous speaker's tile smoothly zooms back to 1x while the new speaker's tile zooms in.
Each tile maintains independent zoom state: its current zoom level, pan offset, and animation progress. This independence also means manual zoom works alongside auto-zoom. If you manually zoom into a non-speaking participant's tile (using pinch or scroll), your manual zoom is preserved even when auto-zoom is changing the active speaker's tile. Auto-zoom and manual zoom never conflict because they operate on separate per-tile state.
If a user manually zooms the active speaker's tile, manual zoom takes priority and auto-zoom is suppressed for that tile until the user resets to 1x. This "manual override" behavior ensures that auto-zoom enhances the experience without ever fighting the user's intent.
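The priority rule can be modeled as a small per-tile state machine. This is a minimal sketch of the behavior described above, with illustrative names rather than V100's actual internals:

```javascript
// Per-tile zoom state with the manual-override rule: a manual zoom
// suppresses auto-zoom for that tile until the user resets to 1x.
class TileZoomState {
  constructor() {
    this.zoom = 1;
    this.manual = false; // true once the user pinches/scrolls this tile
  }
  setManualZoom(level) {
    this.zoom = level;
    this.manual = level !== 1; // resetting to 1x re-enables auto-zoom
  }
  applyAutoZoom(level) {
    if (this.manual) return false; // manual override wins
    this.zoom = level;
    return true;
  }
}
```

Because every tile owns one of these state objects, auto-zoom on the speaker's tile and a manual zoom on any other tile never touch the same state, which is why the two never conflict.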
Enable Auto-Zoom via API
Auto-zoom is enabled with a single configuration option when creating a room. Participants automatically receive the zoomed experience with no additional client-side code required.
```javascript
// Create a room with auto-zoom enabled
const room = await v100.createRoom({
  name: 'deposition-2026-03-28',
  maxParticipants: 12,
  autoZoom: {
    enabled: true,
    smoothingAlpha: 0.3,  // Exponential smoothing factor
    hysteresisMs: 1000,   // Min time between speaker switches
    holdTimeMs: 500,      // Hold focus during brief pauses
    maxZoom: 3.5,         // Maximum auto-zoom level
    faceFillTarget: 0.6,  // Target face-to-tile ratio
    transitionMs: 300,    // Zoom animation duration
  },
});

// Listen for auto-zoom events
room.on('auto_zoom:focus', (event) => {
  console.log(`Speaker changed to: ${event.participantId}`);
  console.log(`Zoom level: ${event.zoomLevel}x`);
  console.log(`Face bounds: ${JSON.stringify(event.faceBounds)}`);
});

// Disable auto-zoom for a specific tile (manual override)
room.setAutoZoom(participantId, { enabled: false });
```
Use Cases: Where Auto-Zoom Changes Everything
Remote Depositions
In a legal deposition, every facial expression matters. The twitch of an eyebrow, the micro-hesitation before answering, the nervous glance away from the camera — these are the signals attorneys depend on to evaluate credibility. In a standard video deposition on Zoom or Teams, the witness's face occupies a tile barely 120 pixels wide in a gallery of 6-8 participants. You cannot read those signals. With V100's auto-zoom, the witness's face fills the screen when they are speaking. The attorney sees every expression at 3x magnification. The court reporter sees the witness clearly. Every participant sees the same zoomed view simultaneously, creating a shared visual record.
Telehealth
Physicians rely on visual assessment more than most people realize. Skin color, eye dilation, facial asymmetry, signs of pain or distress — these are diagnostic data points that require seeing the patient's face clearly. In a telehealth call with a patient, nurse, and specialist, V100's auto-zoom ensures that when the patient describes their symptoms, every clinician sees their face at high magnification. When the specialist asks a question, the zoom shifts to them. The patient always sees who is addressing them.
Education
In online classrooms, students need to see the instructor's face to stay engaged. Research consistently shows that visible facial expressions improve learning outcomes and reduce attention drift. Auto-zoom keeps the instructor's face prominent when they are lecturing, and automatically shifts to a student when they ask a question. In seminars with 10-20 participants, this transforms the experience from a grid of anonymous tiles to a focused conversation.
Sales Calls
The best sales professionals read their prospects. They detect hesitation, interest, confusion, and enthusiasm from facial cues — and adjust their pitch accordingly. On a video call with 4 stakeholders from the prospect's side, those cues are invisible in standard gallery view. Auto-zoom gives the salesperson a clear view of whoever is speaking, making video calls nearly as effective as in-person meetings for reading the room.
V100 Auto-Zoom vs. Everyone Else
| Feature | V100 | Zoom | Teams | Meet |
|---|---|---|---|---|
| AI auto-zoom to speaker face | Yes | No | No | No |
| Face detection in-browser | Yes (3-tier) | No | No | No |
| Smooth zoom transitions | 300ms easeOutCubic | N/A | N/A | N/A |
| Per-tile independent zoom | Yes | No | No | No |
| Manual zoom (pinch/scroll) | Yes (1x-5x) | No | No | No |
| Hysteresis (anti-flip-flop) | 1s minimum | N/A | N/A | N/A |
| Speaker view | Zoomed face | Full tile | Full tile | Full tile |
Zoom, Teams, and Meet all have "speaker view" that enlarges the active speaker's tile. But they enlarge the tile, not the face within it. If a participant is sitting 3 feet from their camera in a wide-angle frame, speaker view gives you a bigger version of a distant face. V100's auto-zoom gives you the face itself, cropped and magnified to fill the viewport. It is the difference between watching someone on a security camera and sitting across the table from them.
Microsoft Teams' "Together Mode" places participants in a virtual shared space, which is creative but does not address the fundamental problem: you still cannot see individual faces clearly. Google Meet's tiled layout is clean but offers no zoom at all. Zoom's gallery view is the industry standard for "everyone is a postage stamp." None of them attempt face-level zoom.
Why Nobody Else Has Built This
Auto-zoom sounds simple in concept: detect the speaker, find their face, zoom in. The reason nobody else has shipped it comes down to four engineering challenges that must be solved simultaneously.
First, cross-browser face detection is fragmented. The FaceDetector API exists only in Chromium. Building a reliable face detection pipeline that works in Safari and Firefox requires shipping a fallback model, managing its lifecycle, and handling the latency differences between native and WASM detection. Most video platforms avoid this complexity entirely.
Second, server-coordinated speaker tracking requires a signaling architecture that can broadcast speaker changes to all clients within 100ms. The signaling server must process audio level beacons from every participant at 10Hz (100ms intervals), run the smoothing and hysteresis algorithm, and broadcast the result — all while handling room management, presence, and media negotiation. V100's Rust signaling server handles this without breaking a sweat. A Node.js or Python signaling server would struggle at scale.
Third, canvas-based zoom requires rendering every video tile through a canvas pipeline instead of displaying raw <video> elements. This is more complex, more CPU-intensive, and requires careful frame scheduling to avoid jank. But it is the only way to achieve pixel-sharp zoom without aliasing artifacts.
Fourth, the interaction model is subtle. Auto-zoom must coexist with manual zoom, not fight it. The priority system (manual overrides auto, auto resumes when manual resets) requires per-tile state management that accounts for both zoom sources. Getting this wrong produces a confusing experience where the user feels the system is fighting them.
V100 invested in solving all four problems because auto-zoom is not a nice-to-have feature. It is the feature that makes video calls work like in-person meetings. When you can see the speaker's face clearly, the entire communication dynamic changes. Misunderstandings decrease. Engagement increases. Meeting fatigue decreases. It is the single most impactful feature we have built for the video conferencing experience.
See auto-zoom in action
Start a free trial, create a room with autoZoom enabled, and invite 3 people. Watch the zoom follow the conversation automatically. It is the feature that makes people say "wait, how did it do that?"