Clone any speaker's voice from a 6-second sample. Translate and synthesize in real-time. Under 2 seconds end-to-end.
The commentator speaks English; audiences in Tokyo, Mumbai, Cairo, and São Paulo hear that same voice in their own language.
The Pipeline
Speech enters, dubbed audio exits. Every stage is optimized for real-time broadcast latency.
Deepgram streaming ASR with word-level timestamps and punctuation recovery.
Context-aware neural machine translation across 40+ language pairs simultaneously.
XTTS-v2 neural TTS conditioned on the speaker's 512-dim voice embedding. Preserves timbre, pitch, and accent.
Intelligent ducking blends dubbed voice over the original. Crowd noise, music, and ambience preserved.
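The four stages above can be sketched as a simple chain. This is a skeletal, non-authoritative view only: the stage bodies are placeholders, and in production the services (Deepgram ASR, NMT, XTTS-v2, ducking) stream and overlap rather than run sequentially.

```python
# Skeletal sketch of the dubbing pipeline. All stage implementations
# are placeholders standing in for the real streaming services.

def transcribe(audio_chunk: bytes) -> list[dict]:
    """ASR stage: audio in, word-level timestamps out (placeholder)."""
    return [{"word": "goal", "start_ms": 0, "end_ms": 400}]

def translate(words: list[dict], target_lang: str) -> str:
    """NMT stage: timed words in, target-language text out (placeholder)."""
    return "ゴール" if target_lang == "ja" else "goal"

def synthesize(text: str, voice_embedding: list[float]) -> bytes:
    """TTS stage: text plus speaker embedding in, audio out (placeholder PCM)."""
    return b"\x00" * 1600

def dub_chunk(audio_chunk: bytes, target_lang: str,
              voice_embedding: list[float]) -> bytes:
    """One pass through ASR -> translation -> voice-cloned synthesis."""
    words = transcribe(audio_chunk)
    text = translate(words, target_lang)
    return synthesize(text, voice_embedding)
```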
Global Coverage
Each language supports full voice cloning from a 6-second sample. The speaker's identity travels with the translation.
Voice Cloning
A brief audio sample is all it takes. The speaker encoder captures everything that makes a voice unique.
6-second audio clip
Any audio format. Studio or field quality.
512-dimensional embedding captured.
One-time extraction, unlimited use.
Every output sounds like the original speaker.
Captures timbre, pitch range, speaking rhythm, accent, and vocal texture. Listeners recognize the original speaker in every language.
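One common way to check that a 512-dim speaker embedding is preserved across outputs is cosine similarity between the reference embedding and one re-extracted from synthesized audio. This is a generic sketch, not the product's actual scoring method; how `quality_score` is computed internally is not documented here.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two voice embeddings (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical check: compare the stored profile embedding against an
# embedding extracted from a dubbed clip to confirm speaker identity.
reference = [0.1] * 512    # placeholder 512-dim embedding
resynth = [0.1] * 512
identity_score = cosine_similarity(reference, resynth)
```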
Audio Engineering
The dubbed voice blends seamlessly with the original broadcast. Crowd noise, music, and ambience stay intact.
All parameters are configurable per session via the API. Defaults tuned for sports broadcast with crowd ambience.
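A minimal sketch of how the session's ducking parameters (`voice_gain`, `duck_level`) could combine the two signals. The interpretation is an assumption: the original bed is attenuated to `duck_level` only while the dubbed voice is active, and passes through untouched otherwise.

```python
def duck_mix(original: list[float], dubbed: list[float],
             voice_gain: float = 0.95, duck_level: float = 0.15) -> list[float]:
    """Blend the dubbed voice over the original broadcast bed.

    Assumed semantics: while dubbed speech is present, scale the
    original down to duck_level and add the dubbed voice at voice_gain;
    during silence, the original (crowd, music, ambience) is untouched.
    """
    mixed = []
    for o, d in zip(original, dubbed):
        if d != 0.0:                            # dubbed voice active
            mixed.append(duck_level * o + voice_gain * d)
        else:                                   # no dubbed speech
            mixed.append(o)
    return mixed
```

A real implementation would smooth the gain transitions (attack/release envelopes) rather than switch per sample.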
Lip-Sync
Every synthesized syllable includes viseme metadata with millisecond timing. Feed it directly into avatar engines or AR overlays.
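A consumer of the viseme stream might parse each WebSocket frame into timed events like the following. The field names and viseme label set here are assumptions for illustration, not a documented schema.

```python
import json
from dataclasses import dataclass

@dataclass
class VisemeEvent:
    viseme: str        # viseme label, e.g. "M" or "AA" (label set is an assumption)
    start_ms: int      # onset relative to the track start
    duration_ms: int   # how long the mouth shape is held

def parse_viseme_frame(raw: str) -> list[VisemeEvent]:
    """Parse one hypothetical JSON frame from the viseme WebSocket."""
    payload = json.loads(raw)
    return [VisemeEvent(v["viseme"], v["start_ms"], v["duration_ms"])
            for v in payload.get("visemes", [])]
```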
The Business Case
Each territory gets its own pricing, blackout rules, and language track — all through a single API. Pair voice dubbing with V100 Broadcast and AI Director for the complete stack.
Developer API
Create a voice profile, start a dubbing session, and add language tracks. That's the entire integration.
POST /api/v1/dubbing/profiles
{
  "speaker_name": "Martin Tyler",
  "audio_sample": "base64://...",  // 6-second WAV/MP3
  "sample_rate": 16000
}

// Response
{
  "profile_id": "vp_mtyler_a1b2c3",
  "embedding_dims": 512,
  "quality_score": 0.94
}
POST /api/v1/dubbing/sessions
{
  "profile_id": "vp_mtyler_a1b2c3",
  "source_language": "en",
  "mode": "live",
  "input_stream": "rtmp://ingest.v100.ai/live/abc123",
  "ducking": { "voice_gain": 0.95, "duck_level": 0.15 }
}

// Response
{
  "session_id": "dub_live_x7y8z9",
  "status": "active",
  "ws_endpoint": "wss://dub.v100.ai/ws/x7y8z9"
}
POST /api/v1/dubbing/sessions/{id}/tracks
{
  "languages": [
    "ja", "hi", "ar", "pt-BR",
    "es", "fr", "de", "zh",
    "ko", "it"
  ],
  "lip_sync": true,
  "output_format": "hls"
}

// Response
{
  "tracks": [
    { "lang": "ja", "hls": "https://cdn.v100.ai/.../ja.m3u8" },
    ...
  ],
  "viseme_ws": "wss://dub.v100.ai/ws/x7y8z9/visemes"
}
Performance
Every millisecond measured. Every stage optimized for live broadcast.
One integration. 40+ languages. The speaker's own voice. Under 850 milliseconds.