40+ LANGUAGES | REAL-TIME | VOICE CLONING

Live Voice Dubbing. The Speaker's Voice. Every Language.

Clone any speaker's voice from a 6-second sample. Translate and synthesize in real time. Under 850 milliseconds end-to-end.

The commentator speaks English — Tokyo, Mumbai, Cairo, and São Paulo hear their voice in their language.

<850ms
End-to-End
Full pipeline latency
40+
Languages
Clone-ready voices
6s
Voice Sample
To clone any speaker
512
Dim Embedding
Vocal fingerprint

The Pipeline

Four Stages. One API Call.

Speech enters, dubbed audio exits. Every stage is optimized for real-time broadcast latency.

~100ms
STAGE 1

Speech-to-Text

Deepgram streaming ASR with word-level timestamps and punctuation recovery.

Deepgram Nova-2 Streaming
~200ms
STAGE 2

Translation

Context-aware neural machine translation across 40+ language pairs simultaneously.

40+ Language Pairs
~500ms
STAGE 3

Voice Synthesis

XTTS-v2 neural TTS conditioned on the speaker's 512-dim voice embedding. Preserves timbre, pitch, and accent.

XTTS-v2 Neural TTS
~50ms
STAGE 4

Audio Mix

Intelligent ducking blends dubbed voice over the original. Crowd noise, music, and ambience preserved.

Ducking + Blend Engine
TOTAL PIPELINE LATENCY
~850ms
end-to-end — speech in, dubbed audio out
STT 100ms · Translation 200ms · TTS 500ms · Mix 50ms
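The per-stage figures above sum to the ~850ms budget. As a quick sanity check (a hypothetical helper for illustration, not part of any SDK):

```python
# Per-stage latency budget from the pipeline figures above.
STAGE_LATENCY_MS = {
    "stt": 100,          # Deepgram Nova-2 streaming ASR
    "translation": 200,  # neural MT
    "tts": 500,          # XTTS-v2 synthesis
    "mix": 50,           # ducking + blend
}

def total_latency_ms(stages: dict) -> int:
    """Sum per-stage latencies to get the end-to-end budget."""
    return sum(stages.values())

print(total_latency_ms(STAGE_LATENCY_MS))  # 850
```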

Global Coverage

40+ Languages. Every Voice Cloned.

Each language supports full voice cloning from a 6-second sample. The speaker's identity travels with the translation.

🇬🇧
English
Clone Ready
🇪🇸
Spanish
Clone Ready
🇫🇷
French
Clone Ready
🇩🇪
German
Clone Ready
🇯🇵
Japanese
Clone Ready
🇨🇳
Chinese
Clone Ready
🇸🇦
Arabic
Clone Ready
🇮🇳
Hindi
Clone Ready
🇧🇷
Portuguese
Clone Ready
🇰🇷
Korean
Clone Ready
🇮🇹
Italian
Clone Ready
🇷🇺
Russian
Clone Ready
🇹🇷
Turkish
Clone Ready
🇵🇱
Polish
Clone Ready
🇳🇱
Dutch
Clone Ready
🇸🇪
Swedish
Clone Ready
🇳🇴
Norwegian
Clone Ready
🇩🇰
Danish
Clone Ready
🇫🇮
Finnish
Clone Ready
🇨🇿
Czech
Clone Ready
🇬🇷
Greek
Clone Ready
🇷🇴
Romanian
Clone Ready
🇭🇺
Hungarian
Clone Ready
🇹🇭
Thai
Clone Ready
🇻🇳
Vietnamese
Clone Ready
🇮🇩
Indonesian
Clone Ready
🇲🇾
Malay
Clone Ready
🇵🇭
Filipino
Clone Ready
🇺🇦
Ukrainian
Clone Ready
🇮🇱
Hebrew
Clone Ready
🇮🇷
Persian
Clone Ready
🇧🇩
Bengali
Clone Ready
🇵🇰
Urdu
Clone Ready
🇳🇬
Hausa
Clone Ready
🇿🇦
Zulu
Clone Ready
🇪🇹
Amharic
Clone Ready
🇬🇪
Georgian
Clone Ready
🇸🇰
Slovak
Clone Ready
🇧🇬
Bulgarian
Clone Ready
🇭🇷
Croatian
Clone Ready
🇷🇸
Serbian
Clone Ready
🇱🇹
Lithuanian
Clone Ready
🇱🇻
Latvian
Clone Ready
🇪🇪
Estonian
Clone Ready
🇸🇮
Slovenian
Clone Ready
🇬🇱
Galician
Clone Ready
🇰🇪
Swahili
Clone Ready
🇲🇲
Burmese
Clone Ready
🇳🇵
Nepali
Clone Ready

Voice Cloning

6 Seconds to Clone Any Voice

A brief audio sample is all it takes. The speaker encoder captures everything that makes a voice unique.

Audio Sample

6-second audio clip

Speaker Encoder

Extract mel-spectrogram features
3-layer LSTM encoder network
L2-normalized projection
512-dimensional output vector

Voice Embedding

float32[512] — Vocal Fingerprint
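The encoder's final step, the L2-normalized projection, can be illustrated in a few lines. This is a sketch with a random stand-in for the LSTM output, not the production encoder:

```python
import math
import random

EMBED_DIM = 512  # matches the float32[512] fingerprint described above

def l2_normalize(vec):
    """Project a raw encoder output onto the unit hypersphere,
    as the speaker encoder's final projection step does."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Stand-in for the raw encoder output (the real encoder maps
# mel-spectrogram frames through a 3-layer LSTM to 512 values).
raw = [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
embedding = l2_normalize(raw)

# Unit norm means similarity between two voices reduces to a dot product.
print(round(sum(x * x for x in embedding), 6))  # 1.0
```

Because every embedding lies on the unit hypersphere, comparing two voice fingerprints is a single dot product, which is why a one-time 6-second extraction can be reused cheaply for every subsequent synthesis call.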
1

Upload 6-second sample

Any audio format. Studio or field quality.

2

Speaker encoder extracts fingerprint

512-dimensional embedding captured.

3

Embedding stored for all future synthesis

One-time extraction, unlimited use.

4

TTS model conditions on embedding

Every output sounds like the original speaker.

Voice Identity Preserved

Captures timbre, pitch range, speaking rhythm, accent, and vocal texture. Listeners recognize the original speaker in every language.

Audio Engineering

Intelligent Audio Ducking

The dubbed voice blends seamlessly with the original broadcast. Crowd noise, music, and ambience stay intact.

Original Audio — 100% volume, ducked to 15% under speech
Dubbed Voice — 95% volume
Blended Output — broadcast ready

Mix Parameters

Voice Gain
95%
Duck Level
15%
Attack
50ms
Release
200ms
Soft Clipping
ENABLED

All parameters are configurable per session via the API. Defaults tuned for sports broadcast with crowd ambience.
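The default mix can be sketched over raw float samples. This is an illustration of the ducking idea using the default `voice_gain` and `duck_level` values above, not the production blend engine; attack/release envelope smoothing is omitted for brevity:

```python
import math

def duck_and_blend(original, dubbed, voice_gain=0.95, duck_level=0.15):
    """While the dubbed voice is active, attenuate the original bed to
    duck_level and overlay the dubbed voice at voice_gain, then
    soft-clip with tanh to avoid harsh digital clipping."""
    out = []
    for orig, dub in zip(original, dubbed):
        bed = orig * (duck_level if dub != 0.0 else 1.0)
        out.append(math.tanh(bed + dub * voice_gain))
    return out

# Original ambience at half scale; dubbed speech in the middle two samples.
mixed = duck_and_blend([0.5, 0.5, 0.5, 0.5], [0.0, 0.8, 0.8, 0.0])
```

When no dubbed speech is present, the crowd ambience passes through at full gain; the 50ms attack and 200ms release in the real engine smooth the transition between the two states instead of switching per-sample as this sketch does.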

Lip-Sync

Phoneme-Level Timing for Perfect Lip-Sync

Every synthesized syllable includes viseme metadata with millisecond timing. Feed it directly into avatar engines or AR overlays.

Silence
0ms
Mouth closed
👄
Bilabial
120ms
B, M, P sounds
🗣
Dental
250ms
D, T, N sounds
👀
Rounded
380ms
O, U, W sounds
📣
Open Wide
500ms
A, AH sounds
AR Glasses
Apple Vision Pro
Avatar Overlays
Real-Time 3D
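Client-side, the viseme timeline above can be consumed with a simple timestamp lookup. The event tuples below mirror the timeline for illustration; the actual wire format of the viseme WebSocket stream is an assumption, not documented here:

```python
import bisect

# Hypothetical viseme events mirroring the timeline above:
# (timestamp_ms, viseme).
EVENTS = [
    (0, "silence"),
    (120, "bilabial"),
    (250, "dental"),
    (380, "rounded"),
    (500, "open_wide"),
]

TIMES = [t for t, _ in EVENTS]

def viseme_at(ms):
    """Return the viseme active at playback time `ms` (the most recent
    event at or before `ms`), e.g. to drive an avatar's mouth shape
    once per rendered frame."""
    i = bisect.bisect_right(TIMES, ms) - 1
    return EVENTS[max(i, 0)][1]

print(viseme_at(300))  # dental
```

An avatar engine would call `viseme_at` with the audio clock each frame; binary search keeps the lookup cheap even for long utterances.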

The Business Case

One Global Broadcast.
40 Revenue Streams.

Before V100
  • Buy exclusive broadcast rights per territory, per language
  • $10M+ per market for premium sports rights
  • Hire separate commentary teams for each language
  • Weeks of lead time per new market entry
  • Small markets ignored because ROI doesn't justify cost
With V100 Dubbing
  • One broadcast + V100 dubbing covers every market simultaneously
  • Same commentator voice in 40+ languages in real-time
  • Per-territory pricing, blackout rules, and language tracks via single API
  • Launch new markets in minutes, not months
  • PPV geo-gating: each territory gets its own pricing and language track
Combined with PPV Geo-Gating

Each territory gets its own pricing, blackout rules, and language track — all through a single API. Pair voice dubbing with V100 Broadcast and AI Director for the complete stack.

Developer API

Three Endpoints. Full Dubbing.

Create a voice profile, start a dubbing session, and add language tracks. That's the entire integration.

Create Voice Profile
POST /api/v1/dubbing/profiles
{
  "speaker_name": "Martin Tyler",
  "audio_sample": "base64://...",   // 6-second WAV/MP3
  "sample_rate": 16000
}

// Response
{
  "profile_id": "vp_mtyler_a1b2c3",
  "embedding_dims": 512,
  "quality_score": 0.94
}
Start Dubbing Session
POST /api/v1/dubbing/sessions
{
  "profile_id": "vp_mtyler_a1b2c3",
  "source_language": "en",
  "mode": "live",
  "input_stream": "rtmp://ingest.v100.ai/live/abc123",
  "ducking": { "voice_gain": 0.95, "duck_level": 0.15 }
}

// Response
{
  "session_id": "dub_live_x7y8z9",
  "status": "active",
  "ws_endpoint": "wss://dub.v100.ai/ws/x7y8z9"
}
Add Language Tracks
POST /api/v1/dubbing/sessions/{id}/tracks
{
  "languages": [
    "ja", "hi", "ar", "pt-BR",
    "es", "fr", "de", "zh",
    "ko", "it"
  ],
  "lip_sync": true,
  "output_format": "hls"
}

// Response
{
  "tracks": [
    { "lang": "ja", "hls": "https://cdn.v100.ai/.../ja.m3u8" },
    ...
  ],
  "viseme_ws": "wss://dub.v100.ai/ws/x7y8z9/visemes"
}

Performance

Pipeline Latency Breakdown

Every millisecond measured. Every stage optimized for live broadcast.

Stage · Technology · Latency
Speech-to-Text · Deepgram Nova-2 Streaming · ~100ms
Translation · Neural MT (40+ pairs) · ~200ms
Voice Synthesis · XTTS-v2 + Voice Embedding · ~500ms
Audio Mix · Ducking + Soft Clip + Blend · ~50ms
TOTAL · End-to-End Pipeline · ~850ms
VOICE PROFILE
6 seconds
sample required
CONCURRENT TRACKS
40+
simultaneous languages
EMBEDDING
512-dim
float32 vector
LIVE NOW

Ready to broadcast in
every language?

One integration. 40+ languages. The speaker's own voice. Under 850 milliseconds.