Developer Guide

How to Remove Silence from Video Programmatically

The average hour-long meeting recording contains 24 minutes of silence and filler words. Here is how silence detection works at the signal level, how to integrate it into your product, and real before/after metrics from production workloads.

V100 Engineering
March 16, 2026

We analyzed 50,000 meeting recordings processed through V100's API in February 2026. The numbers tell a consistent story: the typical meeting recording is 40% silence and filler. That is not an outlier -- it is the median. The 90th percentile is 52%. For every hour of meeting video your company records, roughly 24 minutes are dead air, thinking pauses, "can you hear me?" troubleshooting, and filler words like "um," "uh," "like," and "you know."

Nobody watches those 24 minutes. They scrub past them, or they watch at 2x speed (which introduces its own problems -- pitch distortion, cognitive load, missing content during fast-forward). The better solution is to remove the silence before anyone watches. This article explains how silence detection works at the signal processing level, how V100's API implements it, and how to integrate it into your product.

The Problem in Numbers

Before diving into the technical implementation, let us quantify the problem. These numbers come from V100's production data across 50,000 meeting recordings in February 2026.

Silence Analysis: 50K Meeting Recordings (Feb 2026)

Median dead air: 40%
Average filler words: 12%
Average silence segments per hour: 284
Watch-through improvement: 2.4x

The watch-through improvement is the most commercially relevant metric. When you remove silence from a 60-minute meeting recording and deliver a focused 36-minute version, viewership completion rates increase 2.4x on average. People are willing to watch the content -- they just are not willing to sit through the dead air.

How Silence Detection Works

There are two complementary approaches to silence detection: energy-based and transcript-based. Using both together produces better results than either alone.

Energy-based detection operates on the raw audio waveform. The audio is divided into short frames (typically 50ms windows with 25ms overlap). For each frame, the algorithm computes the RMS (root mean square) energy -- a measure of the frame's loudness. Frames with energy below a threshold are classified as silence.
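As a concrete sketch, the framing and RMS computation could look like this, assuming mono PCM samples in the range [-1, 1] (the function name and frame shape are illustrative, not part of V100's API):

```javascript
// Frame a mono PCM signal (samples in [-1, 1]) and compute per-frame
// RMS energy in dBFS. A 50ms window with a 25ms hop gives the 25ms
// overlap described above.
function frameRmsDb(samples, sampleRate, frameMs = 50, hopMs = 25) {
  const frameLen = Math.round((sampleRate * frameMs) / 1000);
  const hopLen = Math.round((sampleRate * hopMs) / 1000);
  const frames = [];
  for (let start = 0; start + frameLen <= samples.length; start += hopLen) {
    let sumSq = 0;
    for (let i = start; i < start + frameLen; i++) {
      sumSq += samples[i] * samples[i];
    }
    const rms = Math.sqrt(sumSq / frameLen);
    // Clamp before log10 so digital silence maps to -200 dB, not -Infinity.
    frames.push({
      startSec: start / sampleRate,
      db: 20 * Math.log10(Math.max(rms, 1e-10)),
    });
  }
  return frames;
}
```

The per-frame energies feed directly into the threshold comparison: any frame whose dB value falls below the threshold is a silence candidate.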

The threshold is critical. A fixed absolute threshold (e.g., -40 dBFS) fails in practice because recording environments vary enormously. A quiet office has ambient noise at -60 dBFS; a cafe has ambient noise at -30 dBFS. V100 uses an adaptive threshold based on the noise floor of each specific recording. The algorithm samples the first 2 seconds of audio (which typically contains ambient noise before speech begins), computes the noise floor, and sets the silence threshold at 6 dB above the noise floor. This works reliably across recording environments from professional studios to laptop microphones in noisy rooms.
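Building on per-frame energies, the adaptive threshold could be derived like this. This is a sketch: V100 does not document its exact noise-floor estimator, so using the median of the early frames is an assumption here, chosen so a stray loud frame (a cough, a chair scrape) does not skew the estimate:

```javascript
// Adaptive silence threshold: estimate the noise floor from frames in
// the first noiseSampleSec seconds of the recording, then add a 6 dB
// margin, per the approach described above.
function adaptiveThresholdDb(frames, noiseSampleSec = 2, marginDb = 6) {
  const early = frames.filter(f => f.startSec < noiseSampleSec);
  if (early.length === 0) return -40; // fall back to a fixed default
  const sorted = early.map(f => f.db).sort((a, b) => a - b);
  const noiseFloor = sorted[Math.floor(sorted.length / 2)];
  return noiseFloor + marginDb;
}

// A frame is silence when its energy sits below the adaptive threshold.
const isSilent = (frame, thresholdDb) => frame.db < thresholdDb;
```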

Transcript-based detection catches what energy analysis misses: filler words and semi-silent hesitations. The audio is transcribed with word-level timestamps. Gaps between words that exceed the threshold are classified as silence. More importantly, the transcript identifies filler words ("um," "uh," "like," "you know," "so," "basically," "actually," "right") that are not silence in the acoustic sense (they have audible energy) but contribute nothing to the content.
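A minimal sketch of the transcript pass, assuming word-level timestamps with a { text, start, end } shape (illustrative; adapt it to whatever your transcription service returns):

```javascript
// Transcript-based detection: flag inter-word gaps longer than the
// threshold as silence, and flag known filler words for removal.
const DEFAULT_FILLERS = ['um', 'uh', 'like', 'you know'];

function findCuts(words, thresholdSec = 1.0, fillers = DEFAULT_FILLERS) {
  const cuts = [];
  for (let i = 0; i < words.length; i++) {
    const w = words[i];
    if (fillers.includes(w.text.toLowerCase())) {
      cuts.push({ type: 'filler', start: w.start, end: w.end });
    }
    if (i > 0) {
      const gap = w.start - words[i - 1].end;
      if (gap > thresholdSec) {
        cuts.push({ type: 'silence', start: words[i - 1].end, end: w.start });
      }
    }
  }
  return cuts;
}
```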

The combination of both methods catches three categories of unwanted audio: hard silence (no sound at all), ambient silence (background noise without speech), and filler speech (audible but contentless utterances). Each category can be configured independently -- you might want to remove hard silence and fillers but keep ambient gaps shorter than 2 seconds for natural pacing.

Configurable Thresholds

Different content types need different silence detection parameters. A podcast interview has natural thinking pauses that should be shortened but not eliminated. A lecture recording has long pauses while the speaker writes on a whiteboard that should be removed entirely. A meeting recording has troubleshooting gaps ("can everyone see my screen?") that are pure waste.

Recommended thresholds by content type:

Podcast / Interview 0.8 - 1.2 seconds
Meeting Recording 1.0 - 1.5 seconds
Lecture / Presentation 1.5 - 3.0 seconds
Screencast / Tutorial 0.5 - 1.0 seconds
Sales Call / Demo 1.0 - 2.0 seconds
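The table above can be encoded as a simple lookup for programmatic use. Picking the midpoint as the starting default is a convenience here, not a V100 recommendation:

```javascript
// Recommended threshold ranges (seconds) by content type, from the
// table above; defaultThreshold picks the midpoint as a starting point.
const SILENCE_THRESHOLDS = {
  podcast: [0.8, 1.2],
  meeting: [1.0, 1.5],
  lecture: [1.5, 3.0],
  screencast: [0.5, 1.0],
  sales_call: [1.0, 2.0],
};

function defaultThreshold(contentType) {
  const [lo, hi] = SILENCE_THRESHOLDS[contentType] ?? SILENCE_THRESHOLDS.meeting;
  return (lo + hi) / 2;
}
```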

The V100 API Integration

V100's silence removal is available through both the natural language editor endpoint (where you simply say "remove silence") and the structured editor endpoint (where you configure exact parameters). Here is the structured approach for production use.

Silence removal with full configuration
const job = await v100.editor.edit({
  source: 's3://recordings/standup-2026-03-16.mp4',
  instructions: 'Remove silence',
  silence_options: {
    threshold_seconds: 1.0,         // min pause to detect
    remove_fillers: true,            // also cut filler words
    filler_words: [                   // customize filler list
      'um', 'uh', 'like', 'you know',
      'basically', 'actually', 'right'
    ],
    keep_padding_ms: 150,            // breathing room around cuts
    crossfade_ms: 80,               // smooth audio transition
    noise_floor_auto: true,          // auto-detect noise floor
    max_removal_percent: 60          // safety cap: never remove > 60%
  },
  output: { format: 'mp4', resolution: 'source' },
  webhook: 'https://your-app.com/api/webhooks/v100'
});

The max_removal_percent parameter is a safety guardrail. If silence detection would remove more than this percentage of the video, the job completes but flags a warning. This prevents edge cases where the entire video is misclassified as silence (e.g., very quiet speech recorded at extremely low gain).
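If you compute cut lists client-side before submitting, the same guardrail is worth mirroring locally. A sketch (illustrative helper, not part of V100's SDK):

```javascript
// Client-side mirror of the max_removal_percent guardrail: measure how
// much a proposed cut list would remove and flag it before submitting.
function checkRemovalCap(originalSec, cuts, maxPercent = 60) {
  const removedSec = cuts.reduce((sum, c) => sum + (c.end - c.start), 0);
  const percent = (removedSec / originalSec) * 100;
  return { percent, exceedsCap: percent > maxPercent };
}
```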

Before and After: Real Production Metrics

Here are real before/after metrics from three different content types processed through V100's API in production. These are median values across hundreds of recordings in each category.

Team Standup Meetings (15 min avg) -34% duration
15:00 original --> 9:54 after silence removal. 284 silence segments removed. 23 filler words cut. Threshold: 1.0s.
Sales Demo Calls (45 min avg) -38% duration
45:00 original --> 27:54 after. Screen sharing transitions and "let me pull that up" delays account for most of the reduction. Threshold: 1.5s.
Podcast Interviews (60 min avg) -22% duration
60:00 original --> 46:48 after. More conservative threshold (0.8s) preserves conversational rhythm while cutting dead air and filler words.

The Crossfade Problem

Naive silence removal creates audible clicks and pops at every cut point. When you splice two audio segments together, the waveform discontinuity at the splice point produces a click. This is imperceptible in a single cut but becomes extremely distracting when you are making 200+ cuts in a 60-minute recording.

The solution is crossfading. At each cut point, the audio from the end of the preceding segment is faded out over a short window (typically 40-100ms), and the audio from the start of the following segment is faded in over the same window. The two fades overlap, creating a smooth transition. V100 defaults to 80ms crossfades, which is short enough to be imperceptible as a transition but long enough to eliminate clicks.
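The splice itself reduces to a short weighted sum. A minimal linear-crossfade sketch, assuming you already have the overlapping tail and head samples at the cut point:

```javascript
// Linear crossfade at a splice point: fade out the last fadeLen samples
// of the preceding segment while fading in the first fadeLen samples of
// the following one, summing the overlap. 80ms at 48kHz = 3840 samples.
function crossfadeSplice(tailA, headB, fadeLen) {
  const out = new Float32Array(fadeLen);
  const denom = fadeLen > 1 ? fadeLen - 1 : 1;
  for (let i = 0; i < fadeLen; i++) {
    const t = i / denom; // ramps 0 -> 1 across the fade window
    out[i] = tailA[i] * (1 - t) + headB[i] * t;
  }
  return out;
}
```

A linear fade dips slightly in perceived loudness when the two segments are uncorrelated; an equal-power curve (cosine/sine gains) is the usual refinement if that dip is audible.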

The video track is handled differently. Video cannot crossfade at splice points without creating ghosting artifacts. Instead, V100 cuts the video at the nearest keyframe boundary within 100ms of the audio cut point. The slight (~33ms at 30fps) timing discrepancy between audio and video cuts is below the threshold of human perception.
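The keyframe alignment step can be sketched as a nearest-neighbor search with a drift cap (illustrative; keyframe times would come from your demuxer):

```javascript
// Snap an audio cut time to the nearest video keyframe within
// maxDriftSec, keeping the audio time if no keyframe is close enough.
function snapToKeyframe(cutSec, keyframeTimes, maxDriftSec = 0.1) {
  let best = null;
  for (const k of keyframeTimes) {
    const d = Math.abs(k - cutSec);
    if (d <= maxDriftSec && (best === null || d < Math.abs(best - cutSec))) {
      best = k;
    }
  }
  return best ?? cutSec;
}
```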

Integration Patterns

The most common integration pattern is automatic silence removal on every meeting recording. Your conferencing system (Zoom, Teams, Google Meet, or a custom WebRTC implementation) fires a webhook when a recording is complete. Your backend receives the webhook, submits a silence removal job to V100, and stores the cleaned video alongside the original. Users always see the cleaned version by default, with the option to access the unedited original.
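The webhook-to-job handoff can be sketched as a small translation step. The payload fields below are illustrative; substitute whatever your conferencing provider actually sends:

```javascript
// Translate a recording-complete webhook payload into a V100 edit
// request, using the same options as the structured example above.
function buildSilenceJob(payload) {
  return {
    source: payload.recording_url,
    instructions: 'Remove silence',
    silence_options: { threshold_seconds: 1.0, remove_fillers: true },
    output: { format: 'mp4', resolution: 'source' },
    webhook: 'https://your-app.com/api/webhooks/v100',
  };
}

// In your webhook handler:
//   const job = await v100.editor.edit(buildSilenceJob(req.body));
//   then persist the job id alongside the original recording.
```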

A more advanced pattern combines silence removal with other operations in a single request. "Remove silence over 1 second, remove filler words, add English captions, and generate a 2-minute highlight clip" executes as a single pipeline. The silence removal happens first (so subsequent operations work on the cleaned timeline), then captions are generated on the shortened audio, then the highlight clip is extracted based on transcript engagement scoring.
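A combined-pipeline request for the natural language endpoint might be assembled like this (the request shape mirrors the structured example earlier; the instruction string and helper are illustrative):

```javascript
// Build a single request that chains silence removal, filler removal,
// captioning, and highlight extraction via the natural language endpoint.
function buildPipelineRequest(sourceUrl) {
  return {
    source: sourceUrl,
    instructions:
      'Remove silence over 1 second, remove filler words, ' +
      'add English captions, and generate a 2-minute highlight clip',
    webhook: 'https://your-app.com/api/webhooks/v100',
  };
}
```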

Silence removal is the single highest-impact video processing operation you can automate. It requires no creative judgment, produces consistently positive results, and dramatically improves the viewer experience. Every minute of dead air you remove is a minute your users do not have to skip through. At scale, that adds up to thousands of hours of saved viewer time per month.

Remove the Dead Air

V100's API removes silence and filler words with configurable thresholds. Free tier: 60 minutes/month.

Get API Key — Free Tier
