AI & LANGUAGE

Live Voice Dubbing in 40 Languages

March 27, 2026 · 9 min read · V100 Engineering

Imagine a La Liga commentator calling a goal — and viewers in Tokyo, Mumbai, Cairo, and São Paulo hearing the same voice, the same excitement, in their own language. Not a generic AI voice. The commentator's voice. In real time.

The Pipeline: Under 2 Seconds End-to-End

V100's voice dubbing engine is a four-stage pipeline running on GPU:

Stage               | Technology                              | Latency
1. Speech-to-Text   | Deepgram WebSocket (streaming)          | ~100ms
2. Translation      | V100 live_translation service           | ~200ms
3. Voice Synthesis  | XTTS-v2 ONNX (GPU, voice cloned)        | ~500ms
4. Audio Mix        | Ducking mixer (original audio reduced)  | ~50ms
Total               |                                         | ~850ms
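The serial pipeline's end-to-end latency is just the sum of its per-stage budgets. A minimal sketch, using the stage names and latencies from the table above (the `Stage` class and function names are illustrative, not V100's actual code):

```python
# Illustrative sketch of the four-stage dubbing pipeline.
# Stage names and latency budgets come from the table above;
# the class and function names are hypothetical.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    budget_ms: int

PIPELINE = [
    Stage("speech_to_text", 100),   # Deepgram WebSocket (streaming)
    Stage("translation", 200),      # V100 live_translation service
    Stage("voice_synthesis", 500),  # XTTS-v2 ONNX, conditioned on the voice profile
    Stage("audio_mix", 50),         # ducking mixer
]

def total_latency_ms(stages):
    """End-to-end latency of a serial pipeline is the sum of stage budgets."""
    return sum(s.budget_ms for s in stages)

print(total_latency_ms(PIPELINE))  # 850
```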

Voice Profiles: 6 Seconds to Clone

A voice profile is created from a 6-second audio sample of the speaker. The sample passes through a speaker encoder neural network that produces a 512-dimensional embedding vector — a mathematical fingerprint of the speaker's vocal characteristics (timbre, pitch range, speaking rhythm, accent qualities).

This embedding is stored and reused for all TTS synthesis. The XTTS-v2 model conditions its output on this embedding, producing speech that sounds like the original speaker in any of its 40+ supported languages.
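Because the embedding is unit-length, comparing two voice profiles reduces to a dot product. A minimal sketch of storing and comparing 512-dimensional profiles — the dimension comes from the text, but the encoder here is a mock stand-in, not the actual speaker-encoder network:

```python
# Sketch of how a stored voice profile might be represented and compared.
# The 512-dimensional embedding matches the article; the encoder is mocked
# (a real system would run a trained speaker-encoder network on the audio).

import numpy as np

EMBED_DIM = 512

def mock_speaker_encoder(audio: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for the speaker encoder: returns a unit-norm 512-d embedding."""
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(EMBED_DIM)
    return vec / np.linalg.norm(vec)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Embeddings are unit-norm, so cosine similarity is a plain dot product."""
    return float(a @ b)

# 6-second sample at a 16 kHz sample rate (zeros here; the mock ignores content).
profile = mock_speaker_encoder(np.zeros(16000 * 6), seed=1)
same    = mock_speaker_encoder(np.zeros(16000 * 6), seed=1)
other   = mock_speaker_encoder(np.zeros(16000 * 6), seed=2)
```

Identical speakers map to nearly identical embeddings (similarity near 1.0), while unrelated speakers land near 0 in a 512-dimensional space.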

The Audio Mix: Natural, Not Jarring

Simply replacing the original audio track sounds terrible. V100's audio mixer uses intelligent ducking: the original commentary and crowd audio are attenuated while the dubbed voice plays, then brought back up between segments, so the atmosphere of the broadcast stays underneath the translation.
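Ducking of this kind can be sketched as a gain envelope: detect where the dubbed track carries speech, smooth that activity signal, and use it to attenuate the original before summing. The duck depth and smoothing window below are illustrative assumptions, not V100's tuned values:

```python
# Sketch of a ducking mixer: the original track is attenuated wherever
# the dubbed track is active. duck_gain and smooth are assumed values.

import numpy as np

def duck(original: np.ndarray, dubbed: np.ndarray,
         duck_gain: float = 0.2, smooth: int = 480) -> np.ndarray:
    """Attenuate `original` wherever `dubbed` carries audio, then sum.

    duck_gain: gain applied to the original while dubbed speech is active
    smooth:    moving-average window (samples) to soften gain transitions
    """
    active = (np.abs(dubbed) > 1e-4).astype(float)       # crude activity gate
    kernel = np.ones(smooth) / smooth
    envelope = np.convolve(active, kernel, mode="same")  # smoothed 0..1 ramp
    gain = 1.0 - (1.0 - duck_gain) * envelope            # 1.0 down to duck_gain
    return original * gain + dubbed

# One second of original audio at 48 kHz, with dubbed speech in the middle:
original = np.ones(48000)
dubbed = np.zeros(48000)
dubbed[10000:20000] = 0.5
mixed = duck(original, dubbed)
```

The smoothing window turns the hard on/off gate into a short fade, which is what keeps the transitions from sounding jarring.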

Lip-Sync Metadata

For platforms that support it (AR glasses, Vision Pro, video overlays), V100 outputs phoneme-level timing metadata with every synthesized segment. Each phoneme maps to a viseme (mouth shape category): bilabial, labiodental, dental, velar, etc. Downstream systems can use this to drive avatar lip-sync or overlay mouth animations on the original speaker.
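A sketch of what that phoneme-to-viseme mapping might look like. The viseme categories (bilabial, labiodental, dental, velar) come from the text above; the phoneme inventory, field names, and timings are illustrative, not V100's actual metadata schema:

```python
# Sketch of phoneme-level timing metadata mapped to viseme categories.
# Viseme classes are from the article; phoneme symbols, field names,
# and timings here are hypothetical.

PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "th": "dental",
    "k": "velar", "g": "velar",
}

def viseme_track(phonemes):
    """phonemes: list of (symbol, start_ms, end_ms) -> viseme timing events."""
    return [
        {"viseme": PHONEME_TO_VISEME.get(sym, "neutral"),
         "start_ms": start, "end_ms": end}
        for sym, start, end in phonemes
    ]

track = viseme_track([("b", 0, 80), ("th", 80, 150), ("a", 150, 260)])
# first event: {"viseme": "bilabial", "start_ms": 0, "end_ms": 80}
```

A downstream avatar renderer only needs to key mouth shapes to these timed events; it never has to understand the phonemes themselves.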

The Sports Rights Revolution

Today, a broadcaster buys exclusive rights for a specific territory in a specific language. V100 makes this model obsolete. A single global broadcast — with automated multi-camera direction (AI Director) and real-time voice dubbing — can serve every market simultaneously. The content owner captures 40x the audience from a single production.

Combined with V100's PPV geo-gating, each territory can have its own pricing, its own blackout rules, and its own language track — all managed through a single API.

POST  /api/v1/dubbing/profiles             — Upload 6s sample, get voice profile
POST  /api/v1/dubbing/sessions             — Start dubbing with target languages
POST  /api/v1/dubbing/sessions/{id}/tracks — Add language tracks on the fly
GET   /api/v1/dubbing/sessions/{id}        — Pipeline latency + track status
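A client for these endpoints might be sketched as below. The paths match the list above, but the base URL, payload fields, and response handling are assumptions, since the request schema isn't documented here; the functions build request descriptions rather than sending them, to keep the sketch self-contained:

```python
# Hypothetical client sketch for the dubbing endpoints listed above.
# Base URL and payload field names are assumptions, not V100's schema.

BASE = "https://api.example.com/api/v1/dubbing"

def profile_request(sample_path: str) -> dict:
    """POST /profiles — upload a 6s sample to create a voice profile."""
    return {"method": "POST", "url": f"{BASE}/profiles",
            "files": {"sample": sample_path}}

def session_request(profile_id: str, languages: list[str]) -> dict:
    """POST /sessions — start dubbing into the given target languages."""
    return {"method": "POST", "url": f"{BASE}/sessions",
            "json": {"profile_id": profile_id, "languages": languages}}

def add_track_request(session_id: str, language: str) -> dict:
    """POST /sessions/{id}/tracks — add a language track mid-broadcast."""
    return {"method": "POST", "url": f"{BASE}/sessions/{session_id}/tracks",
            "json": {"language": language}}

req = session_request("prof_123", ["ja", "hi", "ar", "pt"])
```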