Imagine a La Liga commentator calling a goal — and viewers in Tokyo, Mumbai, Cairo, and São Paulo hearing the same voice, the same excitement, in their own language. Not a generic AI voice. The commentator's voice. In real time.
## The Pipeline: Under 2 Seconds End-to-End
V100's voice dubbing engine is a four-stage pipeline running on the GPU:
| Stage | Technology | Latency |
|---|---|---|
| 1. Speech-to-Text | Deepgram WebSocket (streaming) | ~100ms |
| 2. Translation | V100 live_translation service | ~200ms |
| 3. Voice Synthesis | XTTS-v2 ONNX (GPU, voice cloned) | ~500ms |
| 4. Audio Mix | Ducking mixer (original audio reduced) | ~50ms |
| Total | | ~850ms |
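The latency budget in the table can be sanity-checked with a small sketch. The stage names, per-stage figures, and the 2-second end-to-end budget come from this section; the dict structure and function name are illustrative:

```python
# Per-stage latency budget for the four-stage dubbing pipeline,
# using the figures from the table above.
STAGE_LATENCY_MS = {
    "speech_to_text": 100,   # Deepgram WebSocket (streaming)
    "translation": 200,      # V100 live_translation service
    "voice_synthesis": 500,  # XTTS-v2 ONNX on GPU
    "audio_mix": 50,         # ducking mixer
}

END_TO_END_BUDGET_MS = 2_000  # "under 2 seconds end-to-end"

def total_latency_ms(stages: dict[str, int]) -> int:
    """Sum the per-stage latencies into an end-to-end figure."""
    return sum(stages.values())

print(total_latency_ms(STAGE_LATENCY_MS))  # 850
```

The ~1.15s of remaining headroom absorbs network jitter and queueing between stages.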
## Voice Profiles: 6 Seconds to Clone
A voice profile is created from a 6-second audio sample of the speaker. The sample passes through a speaker encoder neural network that produces a 512-dimensional embedding vector — a mathematical fingerprint of the speaker's vocal characteristics (timbre, pitch range, speaking rhythm, accent qualities).
This embedding is stored and reused for all TTS synthesis. The XTTS-v2 model conditions its output on this embedding, producing speech that sounds like the original speaker in any of its 40+ supported languages.
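The speaker encoder itself is a neural network, but working with its output is simple vector math. A minimal sketch of comparing two stored 512-dimensional embeddings by cosine similarity (useful, for example, to detect duplicate profiles); the helper names are assumptions, not V100 API:

```python
import math

EMBEDDING_DIM = 512  # dimensionality of the speaker embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    Returns 1.0 for identical voices' embeddings, values near 0
    for unrelated ones.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

At synthesis time the stored embedding is simply passed to the TTS model as a conditioning input, so cloning cost is paid once per speaker, not per utterance.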
## The Audio Mix: Natural, Not Jarring
Simply replacing the original audio track sounds terrible. V100's audio mixer uses intelligent ducking:
- Original audio ducked to 15% when dubbed voice is active — ambient sound (crowd, music, effects) remains audible
- 50ms attack, 200ms release — ducking transitions are smooth enough to be imperceptible
- Soft clipping prevents distortion when dubbed voice and ambient peaks coincide
- Dubbed voice at 95% — clearly dominant without feeling artificial
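The ducking parameters above (15% duck level, 50ms attack, 200ms release, 95% voice level, soft clipping) can be sketched as a per-sample mixer. This is an illustrative one-pole-smoothed implementation, not V100's actual mixer code:

```python
import math

SAMPLE_RATE = 48_000
DUCK_LEVEL = 0.15    # original audio ducked to 15% while voice is active
VOICE_LEVEL = 0.95   # dubbed voice at 95%
ATTACK_MS = 50       # gain falls toward DUCK_LEVEL over ~50ms
RELEASE_MS = 200     # gain recovers toward 1.0 over ~200ms

def _smoothing_coeff(ms: float) -> float:
    # One-pole smoothing coefficient for the given transition time.
    return math.exp(-1.0 / (SAMPLE_RATE * ms / 1000.0))

def duck_and_mix(original, dubbed, voice_active):
    """Mix dubbed voice over the original track with smoothed ducking.

    All three arguments are per-sample sequences: floats in [-1, 1]
    for the two audio tracks, booleans for voice activity.
    """
    attack = _smoothing_coeff(ATTACK_MS)
    release = _smoothing_coeff(RELEASE_MS)
    gain = 1.0
    out = []
    for o, d, active in zip(original, dubbed, voice_active):
        target = DUCK_LEVEL if active else 1.0
        coeff = attack if target < gain else release  # duck fast, recover slow
        gain = target + (gain - target) * coeff
        mixed = o * gain + d * VOICE_LEVEL
        out.append(math.tanh(mixed))  # soft clip: no hard distortion on peaks
    return out
```

The asymmetric attack/release means the crowd dips quickly when the commentator speaks but swells back gradually, which is what makes the transitions imperceptible.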
## Lip-Sync Metadata
For platforms that support it (AR glasses, Vision Pro, video overlays), V100 outputs phoneme-level timing metadata with every synthesized segment. Each phoneme maps to a viseme (mouth shape category): bilabial, labiodental, dental, velar, etc. Downstream systems can use this to drive avatar lip-sync or overlay mouth animations on the original speaker.
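A sketch of what consuming that metadata might look like downstream. The viseme categories come from this section; the specific phoneme set, lookup table, and field names are assumptions for illustration, not V100's actual schema:

```python
# Illustrative phoneme-to-viseme lookup (mouth-shape categories).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "th": "dental",
    "k": "velar", "g": "velar",
}

def viseme_track(phonemes):
    """Convert phoneme timing metadata into a viseme animation track.

    `phonemes` is a list of (phoneme, start_ms, duration_ms) tuples,
    as might arrive alongside each synthesized segment.
    """
    return [
        {
            "viseme": PHONEME_TO_VISEME.get(p, "neutral"),
            "start_ms": start,
            "duration_ms": dur,
        }
        for p, start, dur in phonemes
    ]
```

An avatar renderer would then key mouth-shape blendshapes off `start_ms` and `duration_ms` per viseme.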
## The Sports Rights Revolution
Today, a broadcaster buys exclusive rights for a specific territory in a specific language. V100 makes this model obsolete. A single global broadcast — with automated multi-camera direction (AI Director) and real-time voice dubbing — can serve every market simultaneously. The content owner captures 40x the audience from a single production.
Combined with V100's PPV geo-gating, each territory can have its own pricing, its own blackout rules, and its own language track — all managed through a single API.
- `POST /api/v1/dubbing/profiles` — Upload a 6-second sample, get a voice profile
- `POST /api/v1/dubbing/sessions` — Start dubbing with target languages
- `POST /api/v1/dubbing/sessions/{id}/tracks` — Add language tracks on the fly
- `GET /api/v1/dubbing/sessions/{id}` — Pipeline latency and track status
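A minimal request-building sketch for those endpoints. Only the paths and methods come from the list above; the host, payload field names, and return shape are assumptions:

```python
# Placeholder host; the real API host is not documented here.
BASE_URL = "https://api.example.com"

def create_profile_request():
    # Body would carry the 6-second audio sample (e.g. a multipart upload).
    return ("POST", f"{BASE_URL}/api/v1/dubbing/profiles")

def start_session_request(profile_id: str, languages: list[str]):
    # Field names are assumed, not from the V100 docs.
    body = {"voice_profile_id": profile_id, "target_languages": languages}
    return ("POST", f"{BASE_URL}/api/v1/dubbing/sessions", body)

def add_track_request(session_id: str, language: str):
    body = {"language": language}
    return ("POST", f"{BASE_URL}/api/v1/dubbing/sessions/{session_id}/tracks", body)

def session_status_request(session_id: str):
    # Returns pipeline latency and per-track status.
    return ("GET", f"{BASE_URL}/api/v1/dubbing/sessions/{session_id}")
```

Because tracks are added per session, a broadcast can start in one language and fan out to new markets mid-stream without restarting the pipeline.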