Audio quality is the single most important factor in video call satisfaction. Not video resolution, not latency, not layout — audio. Research from Microsoft and Stanford has consistently shown that poor audio quality causes participants to disengage faster than poor video quality. A 720p video with crystal-clear audio feels professional. A 4K video with keyboard clicking, fan whine, and HVAC rumble in the background feels like a disaster.
The problem is that real-world audio environments are hostile. Mechanical keyboards produce 60-80dB transients on every keystroke. Desktop fans generate continuous broadband noise at 30-45dB. Air conditioning creates a persistent 100-300Hz rumble. Construction outside adds low-frequency thumps and high-frequency grinding. A participant in a coffee shop brings all of these together with espresso machines, background conversation, and door chimes. The microphone captures everything. The other participants hear everything.
V100 solves this with a four-stage noise suppression pipeline built on the Web Audio API. It runs entirely in the browser, requires zero downloads, adds less than 10ms of latency, and is controlled through a simple API toggle with three suppression levels. This post explains exactly how it works, why we chose signal processing over machine learning for the current implementation, and where we are taking it next.
The Four-Stage Pipeline
V100's noise suppression pipeline chains four Web Audio API nodes in series. Each stage addresses a specific category of noise. The signal flows from the raw microphone input through all four stages and emerges as a clean, normalized audio stream that replaces the original track on all peer connections.
Audio signal chain: microphone input → high-pass filter → noise gate (compressor) → gain boost → limiter → MediaStreamAudioDestinationNode
Stage 1: High-Pass Filter
The first stage is a BiquadFilterNode configured as a high-pass filter. This attenuates audio energy below a cutoff frequency, eliminating the low-frequency noise that plagues most environments. HVAC systems generate continuous noise between 50Hz and 250Hz. Desktop fans produce broadband noise with dominant energy below 200Hz. Traffic rumble sits between 40Hz and 150Hz. A high-pass filter tuned to the right cutoff removes most of this while preserving the frequency range of the human voice (fundamental frequencies from roughly 85Hz in low male voices to 255Hz in higher female voices, with formants extending past 4kHz).
The cutoff frequency varies by suppression level. At low suppression, the cutoff is 200Hz, conservative enough to leave deep male voices largely intact while removing the worst of the HVAC rumble (the biquad's gradual 12dB-per-octave rolloff attenuates fundamentals just below the cutoff rather than cutting them abruptly). At medium, it rises to 300Hz, which catches more ambient noise at the cost of slightly thinning the deepest voices. At high, it goes to 400Hz, aggressively cutting everything below the primary voice band. Most users will not notice the difference between 200Hz and 300Hz on their voice. They will notice the dramatic reduction in background noise.
Stage 2: Dynamics Compressor as Noise Gate
The second stage is the core of the suppression pipeline. A DynamicsCompressorNode is configured with a threshold, ratio, attack, and release to function as a noise gate. Audio above the threshold passes through with minimal compression. Audio below the threshold is heavily compressed — effectively silenced.
When the user stops speaking, the ambient room noise (typically -50dB to -30dB) falls below the threshold. The compressor applies its ratio (12:1 at low, 16:1 at medium, 20:1 at high), reducing the noise to near-silence. When the user speaks, their voice (typically -20dB to 0dB) exceeds the threshold, and the compressor passes it through with minimal attenuation. The result: clean voice during speech, silence between speech.
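The gating behavior can be illustrated with an idealized transfer curve. The function below is a simplified model of downward expansion, not the exact soft-knee curve of DynamicsCompressorNode; names and numbers are illustrative:

```javascript
// Idealized noise-gate transfer curve (a simplified model, not the exact
// DynamicsCompressorNode behavior): signal at or above the threshold passes
// unchanged; signal below it is pushed down by the ratio.
function gateOutputDb(inputDb, thresholdDb, ratio) {
  if (inputDb >= thresholdDb) {
    return inputDb; // voice: passes with minimal attenuation
  }
  // ambient noise: every dB below threshold becomes `ratio` dB below it
  return thresholdDb - (thresholdDb - inputDb) * ratio;
}

// Medium level: threshold -40dB, ratio 16:1
console.log(gateOutputDb(-20, -40, 16)); // speech at -20dB  -> -20 (unchanged)
console.log(gateOutputDb(-45, -40, 16)); // room noise at -45dB -> -120 (silenced)
```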
| Parameter | Low | Medium | High |
|---|---|---|---|
| High-pass cutoff | 200Hz | 300Hz | 400Hz |
| Gate threshold | -50dB | -40dB | -30dB |
| Compression ratio | 12:1 | 16:1 | 20:1 |
| Attack time | 0.003s | 0.003s | 0.003s |
| Release time | 0.25s | 0.20s | 0.15s |
| Gain boost | 1.2x | 1.4x | 1.6x |
The attack time is uniformly fast at 3ms across all levels. This ensures the compressor engages almost instantly when the user starts speaking, preventing the first syllable from being clipped. The release time varies: 250ms at low (smooth, natural decay), 200ms at medium, and 150ms at high (faster gate close to catch noise between short pauses). The tradeoff is that faster release can make the gate audible as a slight "breathing" effect in quiet environments, which is why low suppression uses a slower release.
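The table above maps directly onto a parameter lookup. A minimal sketch (the object shape and property names are my own, not V100's internal representation):

```javascript
// Per-level pipeline parameters, taken from the table above.
// (Illustrative structure; V100's internals may differ.)
const SUPPRESSION_PRESETS = {
  low:    { cutoffHz: 200, thresholdDb: -50, ratio: 12, attack: 0.003, release: 0.25, gainBoost: 1.2 },
  medium: { cutoffHz: 300, thresholdDb: -40, ratio: 16, attack: 0.003, release: 0.20, gainBoost: 1.4 },
  high:   { cutoffHz: 400, thresholdDb: -30, ratio: 20, attack: 0.003, release: 0.15, gainBoost: 1.6 },
};

function presetFor(level) {
  const preset = SUPPRESSION_PRESETS[level];
  if (!preset) throw new Error(`Unknown suppression level: ${level}`);
  return preset;
}
```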
Stage 3: Gain Boost
The dynamics compressor reduces the overall level of the audio, even for speech that passes above the threshold, because the compressor's gain reduction is not fully transparent to loud material. The gain stage compensates for this level reduction, boosting the output to match the expected input level. Without gain compensation, the suppressed audio would sound noticeably quieter than unsuppressed audio, which would be jarring when toggling noise suppression on and off.
The gain boost is calibrated per level: 1.2x at low, 1.4x at medium, 1.6x at high. Higher suppression levels use more aggressive compression (higher ratio, higher threshold), which reduces the output level more, requiring more gain compensation. These values were calibrated by measuring the average voice level before and after compression and adjusting the gain to match.
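For intuition, those linear multipliers are modest in decibel terms (gain in dB equals 20·log10 of the linear factor):

```javascript
// Convert a linear gain factor to decibels: dB = 20 * log10(gain)
const toDb = (gain) => 20 * Math.log10(gain);

console.log(toDb(1.2).toFixed(2)); // 1.58 dB (low)
console.log(toDb(1.4).toFixed(2)); // 2.92 dB (medium)
console.log(toDb(1.6).toFixed(2)); // 4.08 dB (high)
```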
Stage 4: Limiter
The final stage is a second DynamicsCompressorNode, but configured as a hard limiter rather than a noise gate. Its threshold is set to -1dB with a ratio of 20:1 and a 0.001s attack. Any audio that approaches 0dBFS (digital full scale) is immediately compressed to prevent clipping. This protects against loud transients that the gain stage might boost into clipping territory: a sudden cough, a laugh, a desk slap, or the sharp consonant in an emphatic "absolutely."
Without the limiter, the gain boost in stage 3 could push loud speech segments past 0dBFS, causing digital clipping — the harsh, distorted sound that makes listeners wince. The limiter is a safety net that ensures the output never clips, regardless of input level. It engages rarely (only on genuine peaks), so it has no audible effect during normal speech.
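Putting the four stages together: the sketch below wires the chain with standard Web Audio nodes, using the medium preset from the table above. It is a minimal reconstruction from the parameters described in this post, not V100's actual source.

```javascript
// Build the four-stage suppression chain on a given AudioContext.
// Parameter values are the "medium" preset described above.
function buildSuppressionChain(ctx) {
  // Stage 1: high-pass filter removes low-frequency rumble
  const highpass = ctx.createBiquadFilter();
  highpass.type = 'highpass';
  highpass.frequency.value = 300;

  // Stage 2: compressor configured as a noise gate
  const gate = ctx.createDynamicsCompressor();
  gate.threshold.value = -40;
  gate.ratio.value = 16;
  gate.attack.value = 0.003;
  gate.release.value = 0.20;

  // Stage 3: gain compensates for the level reduction from the gate
  const makeup = ctx.createGain();
  makeup.gain.value = 1.4;

  // Stage 4: hard limiter prevents the boosted signal from clipping
  const limiter = ctx.createDynamicsCompressor();
  limiter.threshold.value = -1;
  limiter.ratio.value = 20;
  limiter.attack.value = 0.001;

  highpass.connect(gate);
  gate.connect(makeup);
  makeup.connect(limiter);

  // Feed the microphone source into `input`; take the cleaned signal from
  // `output` (e.g. into a MediaStreamAudioDestinationNode).
  return { input: highpass, output: limiter };
}
```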
Seamless Track Replacement: No Renegotiation
A critical implementation detail is how the processed audio replaces the original microphone track on active WebRTC peer connections. The naive approach — removing the old track and adding a new one — triggers a full SDP renegotiation on every peer connection. In a room with 8 participants, that means 7 renegotiation cycles every time someone toggles noise suppression. Each cycle adds 200-500ms of signaling overhead and risks a brief audio gap.
V100 uses RTCRtpSender.replaceTrack() instead. This method swaps the audio track on an existing sender without triggering renegotiation. The WebRTC connection continues uninterrupted. The remote participant's audio stream switches from raw microphone to processed audio (or back) with zero gap, zero renegotiation, and zero interruption. The toggle is instantaneous and seamless.
The processed audio comes from a MediaStreamAudioDestinationNode at the end of the Web Audio pipeline. This node produces a MediaStream whose audio track can be used directly with replaceTrack(). When noise suppression is disabled, V100 swaps the original microphone track back in. The pipeline nodes remain instantiated but disconnected, ready for instant reconnection.
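The toggle itself can be sketched with replaceTrack(). The function and variable names below are illustrative; V100 handles this internally:

```javascript
// Swap the outgoing audio track on every peer connection without
// renegotiation. `peerConnections` holds RTCPeerConnection-like objects;
// `newTrack` is either the processed track (suppression on) or the
// original microphone track (suppression off).
async function swapOutgoingAudio(peerConnections, newTrack) {
  const swaps = [];
  for (const pc of peerConnections) {
    for (const sender of pc.getSenders()) {
      if (sender.track && sender.track.kind === 'audio') {
        // replaceTrack() switches the media without touching SDP or ICE
        swaps.push(sender.replaceTrack(newTrack));
      }
    }
  }
  await Promise.all(swaps);
}
```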
Audio Level Monitoring
V100 exposes a getAudioLevel() method that returns the current RMS audio level (0.0 to 1.0) of either the raw or processed audio stream. This powers UI audio meters that give visual feedback of the suppression effect. Developers can show a before/after meter that demonstrates how the suppression pipeline reduces background noise while preserving voice level.
The audio level is computed using an AnalyserNode tapped at the output of the pipeline. The analyser provides frequency-domain data via getByteFrequencyData() and time-domain data via getByteTimeDomainData(). V100 computes RMS from the time-domain data for the level meter and uses frequency-domain data for the optional spectral visualization.
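The RMS computation from getByteTimeDomainData() samples looks roughly like this (the analyser returns unsigned bytes centered at 128, so silence is a flat line at 128):

```javascript
// Compute an RMS level (0.0 - 1.0) from AnalyserNode.getByteTimeDomainData()
// output: unsigned bytes where 128 represents the zero crossing.
function rmsFromTimeDomain(bytes) {
  let sumSquares = 0;
  for (let i = 0; i < bytes.length; i++) {
    const sample = (bytes[i] - 128) / 128; // normalize to roughly [-1, 1]
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / bytes.length);
}

// Silence: every byte is 128 -> RMS 0
console.log(rmsFromTimeDomain(new Uint8Array(1024).fill(128))); // 0
```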
Code Sample: Toggle and Level Selection
```javascript
// Enable noise suppression at medium level
await room.setNoiseSuppression({
  enabled: true,
  level: 'medium', // 'low' | 'medium' | 'high'
});

// Change level on the fly (no renegotiation)
await room.setNoiseSuppression({ level: 'high' });

// Disable noise suppression
await room.setNoiseSuppression({ enabled: false });

// Get current audio level for UI meter
const level = room.getAudioLevel(); // 0.0 - 1.0
console.log(`Current audio level: ${(level * 100).toFixed(0)}%`);

// Listen for suppression state changes
room.on('noiseSuppression:change', (event) => {
  console.log(`Suppression: ${event.enabled ? event.level : 'off'}`);
});

// Per-participant suppression control
await room.setNoiseSuppression({
  enabled: true,
  level: 'high',
  participantId: 'user_abc123', // Apply to specific user
});
```
V100 vs. Krisp vs. Zoom vs. Teams
| Feature | V100 | Krisp | Zoom | Teams |
|---|---|---|---|---|
| Approach | Web Audio pipeline | Deep neural network | Proprietary ML | Proprietary ML |
| Complex noise (dogs, babies) | Moderate | Excellent | Good | Good |
| Steady-state noise (HVAC, fans) | Excellent | Excellent | Good | Good |
| Keyboard clicks | Excellent | Excellent | Good | Good |
| Download required | None | Desktop app or SDK | Zoom client | Teams client |
| Added latency | <10ms | ~20ms | Undisclosed | Undisclosed |
| API controllable | Yes (per participant) | SDK only | User settings only | User settings only |
| Suppression levels | 3 (low/med/high) | 2 (on/off) | 3 (auto/low/high) | 3 (auto/low/high) |
| Cost | Included | $60/yr per user | Included in plan | Included in plan |
The honest assessment: Krisp's neural network approach is better at complex, non-stationary noise like barking dogs, baby cries, and overlapping human speech. These are sounds that occupy the same frequency range as the target voice and cannot be separated by a frequency filter or noise gate alone. V100's signal processing pipeline excels at steady-state noise (the types that dominate most professional environments) and transient mechanical noise (keyboards, mouse clicks, desk taps). For most office, home-office, and professional environments, V100's approach is more than sufficient.
The tradeoffs favor V100 in deployment simplicity. V100's pipeline runs natively in the browser — there is nothing to download, nothing to install, nothing to keep updated. Krisp requires either a desktop application (for end users) or an SDK integration (for developers). Zoom and Teams require their respective native clients. V100 works in any modern browser on any operating system with zero prerequisites.
Roadmap: RNNoise WASM for ML-Based Suppression
The Web Audio pipeline is V100's production noise suppression today. The next step is RNNoise, Mozilla's open-source recurrent neural network for noise suppression, compiled to WebAssembly. RNNoise is trained on a dataset of 100+ hours of noise and speech, and operates on 10ms audio frames in the frequency domain. It separates voice from noise with near-Krisp quality for common noise types, while running at under 5% CPU on modern hardware.
The WASM-compiled RNNoise model is approximately 200KB — small enough to load on demand without impacting page load time. V100 will integrate it as a new AudioWorkletNode that slots into the existing pipeline, replacing the high-pass filter and dynamics compressor with spectral noise estimation and gain calculation. The limiter stage will remain as a safety net.
The timeline: RNNoise integration is in internal testing now and will ship as an opt-in upgrade. Developers will be able to choose between the signal processing pipeline (lowest latency, zero download) and the ML pipeline (better complex noise removal, 200KB WASM download on first use). Both will be controllable through the same setNoiseSuppression() API.
Integration with V100's Audio Stack
Noise suppression is one component of V100's broader audio quality stack. It works alongside speaker diarization (which identifies who said what in a recording), active speaker detection (which determines who is currently speaking for auto-zoom), and echo cancellation (handled by the browser's native AEC). The noise suppression pipeline feeds into the audio level monitoring that drives speaker detection, meaning cleaner audio also means more accurate speaker tracking.
For developers building on V100, noise suppression is a single API call. Toggle it on, pick a level, and let the pipeline handle the rest. No AudioContext management, no node wiring, no track replacement logic. V100 abstracts all of the Web Audio complexity behind a clean interface that does the right thing.
Hear the difference
Start a V100 room, turn on a fan or type on your keyboard, and toggle noise suppression. The background noise disappears. The voice stays.