40+ LANGUAGES | REAL-TIME | VOICE CLONING

Live Voice Dubbing. The Speaker's Voice. Every Language.

Clone any speaker's voice from a 6-second sample. Translate and synthesize in real time. Under 850 milliseconds end-to-end.

The commentator speaks English — Tokyo, Mumbai, Cairo, and São Paulo hear their voice in their language.

<850ms
End-to-End
Full pipeline latency
40+
Languages
Clone-ready voices
6s
Voice Sample
To clone any speaker
512
Dim Embedding
Vocal fingerprint

The Pipeline

Four Stages. One API Call.

Speech enters, dubbed audio exits. Every stage is optimized for real-time broadcast latency.

~100ms
STAGE 1

Speech-to-Text

Deepgram streaming ASR with word-level timestamps and punctuation recovery.

Deepgram Nova-2 Streaming
~200ms
STAGE 2

Translation

Context-aware neural machine translation across 40+ language pairs simultaneously.

40+ Language Pairs
~500ms
STAGE 3

Voice Synthesis

XTTS-v2 neural TTS conditioned on the speaker's 512-dim voice embedding. Preserves timbre, pitch, and accent.

XTTS-v2 Neural TTS
~50ms
STAGE 4

Audio Mix

Intelligent ducking blends dubbed voice over the original. Crowd noise, music, and ambience preserved.

Ducking + Blend Engine
TOTAL PIPELINE LATENCY
~850ms
end-to-end — speech in, dubbed audio out
STT 100ms · Translation 200ms · TTS 500ms · Mix 50ms
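The per-stage figures above sum to the ~850ms budget. As a quick sanity check (a hypothetical helper for illustration, not part of any SDK):

```python
# Per-stage latency budget from the pipeline figures above.
STAGE_LATENCY_MS = {
    "stt": 100,          # Deepgram Nova-2 streaming ASR
    "translation": 200,  # neural MT
    "tts": 500,          # XTTS-v2 synthesis
    "mix": 50,           # ducking + blend
}

def total_latency_ms(stages: dict) -> int:
    """Sum per-stage latencies to get the end-to-end budget."""
    return sum(stages.values())

print(total_latency_ms(STAGE_LATENCY_MS))  # 850
```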

Global Coverage

40+ Languages. Every Voice Cloned.

Each language supports full voice cloning from a 6-second sample. The speaker's identity travels with the translation.

🇬🇧
English
Clone Ready
🇪🇸
Spanish
Clone Ready
🇫🇷
French
Clone Ready
🇩🇪
German
Clone Ready
🇯🇵
Japanese
Clone Ready
🇨🇳
Chinese
Clone Ready
🇸🇦
Arabic
Clone Ready
🇮🇳
Hindi
Clone Ready
🇧🇷
Portuguese
Clone Ready
🇰🇷
Korean
Clone Ready
🇮🇹
Italian
Clone Ready
🇷🇺
Russian
Clone Ready
🇹🇷
Turkish
Clone Ready
🇵🇱
Polish
Clone Ready
🇳🇱
Dutch
Clone Ready
🇸🇪
Swedish
Clone Ready
🇳🇴
Norwegian
Clone Ready
🇩🇰
Danish
Clone Ready
🇫🇮
Finnish
Clone Ready
🇨🇿
Czech
Clone Ready
🇬🇷
Greek
Clone Ready
🇷🇴
Romanian
Clone Ready
🇭🇺
Hungarian
Clone Ready
🇹🇭
Thai
Clone Ready
🇻🇳
Vietnamese
Clone Ready
🇮🇩
Indonesian
Clone Ready
🇲🇾
Malay
Clone Ready
🇵🇭
Filipino
Clone Ready
🇺🇦
Ukrainian
Clone Ready
🇮🇱
Hebrew
Clone Ready
🇮🇷
Persian
Clone Ready
🇧🇩
Bengali
Clone Ready
🇵🇰
Urdu
Clone Ready
🇳🇬
Hausa
Clone Ready
🇿🇦
Zulu
Clone Ready
🇪🇹
Amharic
Clone Ready
🇬🇪
Georgian
Clone Ready
🇸🇰
Slovak
Clone Ready
🇧🇬
Bulgarian
Clone Ready
🇭🇷
Croatian
Clone Ready
🇷🇸
Serbian
Clone Ready
🇱🇹
Lithuanian
Clone Ready
🇱🇻
Latvian
Clone Ready
🇪🇪
Estonian
Clone Ready
🇸🇮
Slovenian
Clone Ready
🇬🇱
Galician
Clone Ready
🇰🇪
Swahili
Clone Ready
🇲🇲
Burmese
Clone Ready
🇳🇵
Nepali
Clone Ready

Voice Cloning

6 Seconds to Clone Any Voice

A brief audio sample is all it takes. The speaker encoder captures everything that makes a voice unique.

Audio Sample

6-second audio clip

Speaker Encoder

Extract mel-spectrogram features
3-layer LSTM encoder network
L2-normalized projection
512-dimensional output vector

Voice Embedding

float32[512] — Vocal Fingerprint
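The encoder's final step, the L2-normalized projection, can be illustrated in a few lines. This is a sketch with a random stand-in for the LSTM output, not the production encoder:

```python
import math
import random

EMBED_DIM = 512  # matches the float32[512] fingerprint described above

def l2_normalize(vec):
    """Project a raw encoder output onto the unit hypersphere,
    as the speaker encoder's final projection step does."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Stand-in for the raw encoder output (the real encoder maps
# mel-spectrogram frames through a 3-layer LSTM to 512 values).
raw = [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
embedding = l2_normalize(raw)

# Unit norm means similarity between two voices reduces to a dot product.
print(round(sum(x * x for x in embedding), 6))  # 1.0
```

Because every embedding lies on the unit hypersphere, comparing two voice fingerprints is a single dot product, which is why a one-time 6-second extraction can be reused cheaply for every subsequent synthesis call.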
1

Upload 6-second sample

Any audio format. Studio or field quality.

2

Speaker encoder extracts fingerprint

512-dimensional embedding captured.

3

Embedding stored for all future synthesis

One-time extraction, unlimited use.

4

TTS model conditions on embedding

Every output sounds like the original speaker.

Voice Identity Preserved

Captures timbre, pitch range, speaking rhythm, accent, and vocal texture. Listeners recognize the original speaker in every language.

Audio Engineering

Intelligent Audio Ducking

The dubbed voice blends seamlessly with the original broadcast. Crowd noise, music, and ambience stay intact.

Original Audio — 100% volume, ducked to 15% under speech
Dubbed Voice — 95% volume
Blended Output — broadcast ready

Mix Parameters

Voice Gain
95%
Duck Level
15%
Attack
50ms
Release
200ms
Soft Clipping
ENABLED

All parameters are configurable per session via the API. Defaults tuned for sports broadcast with crowd ambience.
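The default mix can be sketched over raw float samples. This is an illustration of the ducking idea using the default `voice_gain` and `duck_level` values above, not the production blend engine; attack/release envelope smoothing is omitted for brevity:

```python
import math

def duck_and_blend(original, dubbed, voice_gain=0.95, duck_level=0.15):
    """While the dubbed voice is active, attenuate the original bed to
    duck_level and overlay the dubbed voice at voice_gain, then
    soft-clip with tanh to avoid harsh digital clipping."""
    out = []
    for orig, dub in zip(original, dubbed):
        bed = orig * (duck_level if dub != 0.0 else 1.0)
        out.append(math.tanh(bed + dub * voice_gain))
    return out

# Original ambience at half scale; dubbed speech in the middle two samples.
mixed = duck_and_blend([0.5, 0.5, 0.5, 0.5], [0.0, 0.8, 0.8, 0.0])
```

When no dubbed speech is present, the crowd ambience passes through at full gain; the 50ms attack and 200ms release in the real engine smooth the transition between the two states instead of switching per-sample as this sketch does.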

Lip-Sync

Phoneme-Level Timing for Perfect Lip-Sync

Every synthesized syllable includes viseme metadata with millisecond timing. Feed it directly into avatar engines or AR overlays.

Silence
0ms
Mouth closed
👄
Bilabial
120ms
B, M, P sounds
🗣
Dental
250ms
D, T, N sounds
👀
Rounded
380ms
O, U, W sounds
📣
Open Wide
500ms
A, AH sounds
AR Glasses
Apple Vision Pro
Avatar Overlays
Real-Time 3D
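Client-side, the viseme timeline above can be consumed with a simple timestamp lookup. The event tuples below mirror the timeline for illustration; the actual wire format of the viseme WebSocket stream is an assumption, not documented here:

```python
import bisect

# Hypothetical viseme events mirroring the timeline above:
# (timestamp_ms, viseme).
EVENTS = [
    (0, "silence"),
    (120, "bilabial"),
    (250, "dental"),
    (380, "rounded"),
    (500, "open_wide"),
]

TIMES = [t for t, _ in EVENTS]

def viseme_at(ms):
    """Return the viseme active at playback time `ms` (the most recent
    event at or before `ms`), e.g. to drive an avatar's mouth shape
    once per rendered frame."""
    i = bisect.bisect_right(TIMES, ms) - 1
    return EVENTS[max(i, 0)][1]

print(viseme_at(300))  # dental
```

An avatar engine would call `viseme_at` with the audio clock each frame; binary search keeps the lookup cheap even for long utterances.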

The Business Case

One Global Broadcast.
40 Revenue Streams.

Before V100
  • Buy exclusive broadcast rights per territory, per language
  • $10M+ per market for premium sports rights
  • Hire separate commentary teams for each language
  • Weeks of lead time per new market entry
  • Small markets ignored because ROI doesn't justify cost
With V100 Dubbing
  • One broadcast + V100 dubbing covers every market simultaneously
  • Same commentator voice in 40+ languages in real-time
  • Per-territory pricing, blackout rules, and language tracks via single API
  • Launch new markets in minutes, not months
  • PPV geo-gating: each territory gets its own pricing and language track
Combined with PPV Geo-Gating

Each territory gets its own pricing, blackout rules, and language track — all through a single API. Pair voice dubbing with V100 Broadcast and AI Director for the complete stack.

Developer API

Three Endpoints. Full Dubbing.

Create a voice profile, start a dubbing session, and add language tracks. That's the entire integration.

Create Voice Profile
POST /api/v1/dubbing/profiles
{
  "speaker_name": "Martin Tyler",
  "audio_sample": "base64://...",   // 6-second WAV/MP3
  "sample_rate": 16000
}

// Response
{
  "profile_id": "vp_mtyler_a1b2c3",
  "embedding_dims": 512,
  "quality_score": 0.94
}
Start Dubbing Session
POST /api/v1/dubbing/sessions
{
  "profile_id": "vp_mtyler_a1b2c3",
  "source_language": "en",
  "mode": "live",
  "input_stream": "rtmp://ingest.v100.ai/live/abc123",
  "ducking": { "voice_gain": 0.95, "duck_level": 0.15 }
}

// Response
{
  "session_id": "dub_live_x7y8z9",
  "status": "active",
  "ws_endpoint": "wss://dub.v100.ai/ws/x7y8z9"
}
Add Language Tracks
POST /api/v1/dubbing/sessions/{id}/tracks
{
  "languages": [
    "ja", "hi", "ar", "pt-BR",
    "es", "fr", "de", "zh",
    "ko", "it"
  ],
  "lip_sync": true,
  "output_format": "hls"
}

// Response
{
  "tracks": [
    { "lang": "ja", "hls": "https://cdn.v100.ai/.../ja.m3u8" },
    ...
  ],
  "viseme_ws": "wss://dub.v100.ai/ws/x7y8z9/visemes"
}

Performance

Pipeline Latency Breakdown

Every millisecond measured. Every stage optimized for live broadcast.

Stage · Technology · Latency
Speech-to-Text · Deepgram Nova-2 Streaming · ~100ms
Translation · Neural MT (40+ pairs) · ~200ms
Voice Synthesis · XTTS-v2 + Voice Embedding · ~500ms
Audio Mix · Ducking + Soft Clip + Blend · ~50ms
TOTAL · End-to-End Pipeline · ~850ms
VOICE PROFILE
6 seconds
sample required
CONCURRENT TRACKS
40+
simultaneous languages
EMBEDDING
512-dim
float32 vector
LIVE NOW

Ready to broadcast in
every language?

One integration. 40+ languages. The speaker's own voice. Under 850 milliseconds.