We analyzed 50,000 meeting recordings processed through V100's API in February 2026. The numbers tell a consistent story: the typical meeting recording is 40% silence and filler. That 40% is not an average skewed by a few bad outliers -- it is the median. The 90th percentile is 52%. For every hour of meeting video your company records, roughly 24 minutes are dead air, thinking pauses, "can you hear me?" troubleshooting, and filler words like "um," "uh," "like," and "you know."
Nobody watches those 24 minutes. They scrub past them, or they watch at 2x speed (which introduces its own problems -- pitch distortion, cognitive load, missing content during fast-forward). The better solution is to remove the silence before anyone watches. This article explains how silence detection works at the signal processing level, how V100's API implements it, and how to integrate it into your product.
The Problem in Numbers
Before diving into the technical implementation, let us quantify the problem. These numbers come from V100's production data across 50,000 meeting recordings in February 2026.
Silence Analysis: 50K Meeting Recordings (Feb 2026)
The watch-through improvement is the most commercially relevant metric. When you remove silence from a 60-minute meeting recording and deliver a focused 36-minute version, viewership completion rates increase 2.4x on average. People are willing to watch the content -- they just are not willing to sit through the dead air.
How Silence Detection Works
There are two complementary approaches to silence detection: energy-based and transcript-based. Using both together produces better results than either alone.
Energy-based detection operates on the raw audio waveform. The audio is divided into short frames (typically 50ms windows with 25ms overlap). For each frame, the algorithm computes the RMS (root mean square) energy -- a measure of the frame's loudness. Frames with energy below a threshold are classified as silence.
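The energy pass described above can be sketched in a few lines of JavaScript. This is an illustrative implementation, not V100's internal code; it assumes mono floating-point samples (a 16 kHz sample rate is used as a default) and reports per-frame energy in dBFS:

```javascript
// Frame the signal into 50 ms windows with a 25 ms hop and compute
// per-frame RMS energy in dBFS. Illustrative sketch only.
function frameRmsDb(samples, sampleRate = 16000, frameMs = 50, hopMs = 25) {
  const frameLen = Math.round((sampleRate * frameMs) / 1000);
  const hopLen = Math.round((sampleRate * hopMs) / 1000);
  const energies = [];
  for (let start = 0; start + frameLen <= samples.length; start += hopLen) {
    let sumSq = 0;
    for (let i = start; i < start + frameLen; i++) {
      sumSq += samples[i] * samples[i];
    }
    const rms = Math.sqrt(sumSq / frameLen);
    // 0 dBFS corresponds to an RMS of 1.0; clamp true digital silence.
    energies.push(rms > 0 ? 20 * Math.log10(rms) : -120);
  }
  return energies;
}
```

Frames whose energy falls below the silence threshold are then classified as silent; consecutive silent frames are merged into silence spans.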
The threshold is critical. A fixed absolute threshold (e.g., -40 dBFS) fails in practice because recording environments vary enormously. A quiet office has ambient noise at -60 dBFS; a cafe has ambient noise at -30 dBFS. V100 uses an adaptive threshold based on the noise floor of each specific recording. The algorithm samples the first 2 seconds of audio (which typically contains ambient noise before speech begins), computes the noise floor, and sets the silence threshold at 6 dB above the noise floor. This works reliably across recording environments from professional studios to laptop microphones in noisy rooms.
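A minimal sketch of that adaptive thresholding, operating on the per-frame energies (in dBFS) from an energy pass like the one above: the text says the noise floor is computed from the first 2 seconds of audio, but not how, so the median estimator here is an assumption chosen for robustness against brief noises in the lead-in.

```javascript
// Estimate the noise floor from the first ~2 seconds of frame energies
// (dBFS, 25 ms hop) and set the silence threshold 6 dB above it.
// The median is an assumed, robust noise-floor estimator.
function adaptiveThresholdDb(frameEnergiesDb, hopMs = 25, leadInSeconds = 2) {
  const leadFrames = Math.max(1, Math.floor((leadInSeconds * 1000) / hopMs));
  const lead = frameEnergiesDb.slice(0, leadFrames);
  const sorted = [...lead].sort((a, b) => a - b);
  const noiseFloor = sorted[Math.floor(sorted.length / 2)];
  return noiseFloor + 6; // frames quieter than this count as silence
}
```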
Transcript-based detection catches what energy analysis misses: filler words and semi-silent hesitations. The audio is transcribed with word-level timestamps. Gaps between words that exceed the threshold are classified as silence. More importantly, the transcript identifies filler words ("um," "uh," "like," "you know," "so," "basically," "actually," "right") that are not silence in the acoustic sense (they have audible energy) but contribute nothing to the content.
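A sketch of the transcript pass, assuming word objects with `text`, `start`, and `end` fields (a common shape for word-level transcripts, not necessarily V100's exact schema). For brevity it matches single-token fillers only; multi-word fillers like "you know" would need bigram matching:

```javascript
// Flag filler words and inter-word gaps longer than the threshold.
// Word shape {text, start, end} is assumed for illustration.
const FILLERS = new Set(['um', 'uh', 'like', 'so', 'basically', 'actually', 'right']);

function transcriptCuts(words, gapThresholdSec = 1.0) {
  const cuts = [];
  for (let i = 0; i < words.length; i++) {
    const w = words[i];
    if (FILLERS.has(w.text.toLowerCase())) {
      cuts.push({ start: w.start, end: w.end, kind: 'filler' });
    }
    const next = words[i + 1];
    if (next && next.start - w.end >= gapThresholdSec) {
      cuts.push({ start: w.end, end: next.start, kind: 'gap' });
    }
  }
  return cuts;
}
```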
The combination of both methods catches three categories of unwanted audio: hard silence (no sound at all), ambient silence (background noise without speech), and filler speech (audible but contentless utterances). Each category can be configured independently -- you might want to remove hard silence and fillers but keep ambient gaps shorter than 2 seconds for natural pacing.
Configurable Thresholds
Different content types need different silence detection parameters. A podcast interview has natural thinking pauses that should be shortened but not eliminated. A lecture recording has long pauses while the speaker writes on a whiteboard that should be removed entirely. A meeting recording has troubleshooting gaps ("can everyone see my screen?") that are pure waste.
Recommended thresholds by content type:
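In code, per-content-type tuning reduces to a small preset table. The specific numbers below are placeholders showing the shape of such a table, not V100's published recommendations:

```javascript
// Illustrative presets per content type; values are placeholders,
// not official recommendations.
const SILENCE_PRESETS = {
  podcast: { threshold_seconds: 2.0, remove_fillers: true,  max_removal_percent: 30 },
  lecture: { threshold_seconds: 1.5, remove_fillers: false, max_removal_percent: 50 },
  meeting: { threshold_seconds: 1.0, remove_fillers: true,  max_removal_percent: 60 },
};

function silenceOptionsFor(contentType) {
  // Fall back to the meeting preset for unknown content types.
  return SILENCE_PRESETS[contentType] ?? SILENCE_PRESETS.meeting;
}
```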
The V100 API Integration
V100's silence removal is available through both the natural language editor endpoint (where you simply say "remove silence") and the structured editor endpoint (where you configure exact parameters). Here is the structured approach for production use.
const job = await v100.editor.edit({
  source: 's3://recordings/standup-2026-03-16.mp4',
  instructions: 'Remove silence',
  silence_options: {
    threshold_seconds: 1.0,     // min pause to detect
    remove_fillers: true,       // also cut filler words
    filler_words: [             // customize filler list
      'um', 'uh', 'like', 'you know',
      'basically', 'actually', 'right'
    ],
    keep_padding_ms: 150,       // breathing room around cuts
    crossfade_ms: 80,           // smooth audio transition
    noise_floor_auto: true,     // auto-detect noise floor
    max_removal_percent: 60     // safety cap: never remove > 60%
  },
  output: { format: 'mp4', resolution: 'source' },
  webhook: 'https://your-app.com/api/webhooks/v100'
});
The max_removal_percent parameter is a safety guardrail. If silence detection would remove more than this percentage of the video, the job completes but flags a warning. This prevents edge cases where the entire video is misclassified as silence (e.g., very quiet speech recorded at extremely low gain).
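Your completion webhook should branch on that warning. A minimal sketch; the payload field names (`warnings`, the `max_removal_exceeded` flag) are assumptions for illustration, not V100's documented schema:

```javascript
// Decide which version to serve based on the safety-cap warning.
// Payload shape is assumed; consult the webhook docs for the real one.
function handleCompletion(payload) {
  if (payload.warnings?.includes('max_removal_exceeded')) {
    // The detector wanted to cut more than max_removal_percent:
    // serve the original and queue the job for human review.
    return { use: 'original', review: true };
  }
  return { use: 'cleaned', review: false };
}
```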
Before and After: Real Production Metrics
Here are real before/after metrics from three different content types processed through V100's API in production. These are median values across hundreds of recordings in each category.
The Crossfade Problem
Naive silence removal creates audible clicks and pops at every cut point. When you splice two audio segments together, the waveform discontinuity at the splice point produces a click. This is imperceptible in a single cut but becomes extremely distracting when you are making 200+ cuts in a 60-minute recording.
The solution is crossfading. At each cut point, the audio from the end of the preceding segment is faded out over a short window (typically 40-100ms), and the audio from the start of the following segment is faded in over the same window. The two fades overlap, creating a smooth transition. V100 defaults to 80ms crossfades, which is short enough to be imperceptible as a transition but long enough to eliminate clicks.
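The splice itself is simple to sketch. This uses a linear fade for clarity (an equal-power curve is another common choice) and operates on mono float sample arrays; it is an illustration of the technique, not V100's implementation:

```javascript
// Splice segment b onto segment a with an overlapping linear
// crossfade (80 ms default, matching the text). Mono Float32 samples.
function spliceWithCrossfade(a, b, sampleRate = 16000, fadeMs = 80) {
  const n = Math.min(Math.round((sampleRate * fadeMs) / 1000), a.length, b.length);
  const out = new Float32Array(a.length + b.length - n);
  out.set(a.subarray(0, a.length - n), 0);
  for (let i = 0; i < n; i++) {
    const t = i / n; // 0 → 1 across the fade window
    // Fade a's tail out while fading b's head in.
    out[a.length - n + i] = a[a.length - n + i] * (1 - t) + b[i] * t;
  }
  out.set(b.subarray(n), a.length);
  return out;
}
```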
The video track is handled differently. Video cannot crossfade at splice points without creating ghosting artifacts. Instead, V100 cuts the video at the nearest keyframe boundary within 100ms of the audio cut point. The slight (~33ms at 30fps) timing discrepancy between audio and video cuts is below the threshold of human perception.
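The keyframe-snapping step amounts to a nearest-neighbor search over the keyframe timestamps. A sketch; the fallback when no keyframe falls within the window is not specified in the text, so this version simply reports that case with `null`:

```javascript
// Snap a video cut to the nearest keyframe within ±100 ms of the
// audio cut point. Returns null if none is close enough (the
// fallback policy is an open question here).
function nearestKeyframe(keyframeTimes, audioCutSec, windowSec = 0.1) {
  let best = null;
  for (const t of keyframeTimes) {
    const d = Math.abs(t - audioCutSec);
    if (d <= windowSec && (best === null || d < Math.abs(best - audioCutSec))) {
      best = t;
    }
  }
  return best;
}
```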
Integration Patterns
The most common integration pattern is automatic silence removal on every meeting recording. Your conferencing system (Zoom, Teams, Google Meet, or a custom WebRTC implementation) fires a webhook when a recording is complete. Your backend receives the webhook, submits a silence removal job to V100, and stores the cleaned video alongside the original. Users always see the cleaned version by default, with the option to access the unedited original.
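The backend side of that pattern reduces to mapping the conferencing webhook's payload onto a V100 job request. A sketch of the request builder, reusing the parameters from the earlier example; inside your webhook handler you would then call `v100.editor.edit(jobRequestFor(recordingUrl))` and store the returned job id against the recording:

```javascript
// Build the silence-removal job request for a finished recording.
// The recording URL comes from the conferencing system's webhook.
function jobRequestFor(recordingUrl) {
  return {
    source: recordingUrl,
    instructions: 'Remove silence',
    silence_options: { threshold_seconds: 1.0, remove_fillers: true },
    output: { format: 'mp4', resolution: 'source' },
    webhook: 'https://your-app.com/api/webhooks/v100',
  };
}
```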
A more advanced pattern combines silence removal with other operations in a single request. "Remove silence over 1 second, remove filler words, add English captions, and generate a 2-minute highlight clip" executes as a single pipeline. The silence removal happens first (so subsequent operations work on the cleaned timeline), then captions are generated on the shortened audio, then the highlight clip is extracted based on transcript engagement scoring.
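Via the natural language endpoint shown earlier, the combined pipeline is a single request whose `instructions` string carries all four operations. The request below reuses only fields from the earlier example; whether additional structured options apply to captions and highlights is not covered here:

```javascript
// One request, four operations: silence removal, filler removal,
// captions, and a highlight clip.
const pipelineRequest = {
  source: 's3://recordings/standup-2026-03-16.mp4',
  instructions:
    'Remove silence over 1 second, remove filler words, ' +
    'add English captions, and generate a 2-minute highlight clip',
  output: { format: 'mp4', resolution: 'source' },
  webhook: 'https://your-app.com/api/webhooks/v100',
};
// const job = await v100.editor.edit(pipelineRequest);
```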
Silence removal is the single highest-impact video processing operation you can automate. It requires no creative judgment, produces consistently positive results, and dramatically improves the viewer experience. Every minute of dead air you remove is a minute your users do not have to skip through. At scale, that adds up to thousands of hours of saved viewer time per month.
Remove the Dead Air
V100's API removes silence and filler words with configurable thresholds. Free tier: 60 minutes/month.
Get API Key — Free Tier