Content localization is no longer optional. YouTube reports that 80% of its watch time comes from outside the United States. Netflix dubs every original series into at least 34 languages. Coursera, Udemy, and Khan Academy have all launched multilingual video initiatives. The audience is global, but most video content is produced in one language and stays that way because dubbing has historically been expensive, slow, and manual.
Traditional dubbing requires voice actors, recording studios, audio engineers, and weeks of turnaround per language. A 60-minute documentary dubbed into 10 languages through a traditional studio costs $50,000-$150,000 and takes 4-8 weeks. That timeline and price point mean that only the largest media companies can afford global localization. Everyone else publishes subtitles and accepts the engagement drop — viewers retain 30-40% less information from subtitled content than from native-language audio.
V100's voice dubbing module is designed to change that equation. We are building a pipeline that takes video in, extracts speech, translates it, and synthesizes dubbed audio in the target language — all through a single API call. This post covers what the module does today, what is still in development, and how the pricing compares to standalone TTS providers.
Architecture: How the Dubbing Pipeline Works
The voice dubbing pipeline has four stages: (1) speech extraction and diarization, (2) translation, (3) TTS synthesis, and (4) audio mixing and export. Each stage is a distinct microservice in V100's Rust backend, connected through the same event-driven architecture that powers our transcription and editing pipelines.
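The flow through the four stages can be pictured as a chain of handlers passing segments along. The sketch below is illustrative only — the type and function names are ours, not V100's actual service interfaces, and each stage is stubbed with toy logic:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One diarized speech segment flowing through the dubbing pipeline."""
    speaker: str
    start: float          # seconds into the video
    end: float
    text: str = ""        # transcript (stage 1 output)
    translation: str = "" # target-language text (stage 2 output)
    audio: bytes = b""    # synthesized audio (stage 3 output)

def extract_speech(video: bytes) -> list[Segment]:
    """Stage 1: transcription + diarization (stubbed)."""
    return [Segment(speaker="spk_0", start=0.0, end=2.4, text="Hello everyone")]

def translate(segments: list[Segment], target: str) -> list[Segment]:
    """Stage 2: context-aware translation (stubbed with a toy lookup)."""
    toy = {"Hello everyone": {"es": "Hola a todos"}}
    for s in segments:
        s.translation = toy.get(s.text, {}).get(target, s.text)
    return segments

def synthesize(segments: list[Segment]) -> list[Segment]:
    """Stage 3: TTS synthesis (stubbed)."""
    for s in segments:
        s.audio = s.translation.encode("utf-8")  # placeholder for PCM audio
    return segments

def mix(segments: list[Segment]) -> bytes:
    """Stage 4: mix synthesized segments back into one audio track (stubbed)."""
    return b"".join(s.audio for s in segments)

dubbed = mix(synthesize(translate(extract_speech(b"<video bytes>"), "es")))
```

In production each function boundary is an event on the message bus rather than a direct call, which is what lets stages scale independently.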
What Works Today
We believe in being transparent about the current state of our technology. Here is what is production-ready and what is still in development.
Production-ready today: Stages 1 (speech extraction and diarization) and 2 (translation) are fully operational in production. V100's transcription engine handles 40+ languages with per-word timestamps, speaker identification, and medical/technical vocabulary support. The translation pipeline supports all 40+ languages with context-aware translation. Stage 4 (audio mixing and export) is also production-ready — we have been mixing and exporting audio through our editing pipeline since launch.
In development: Stage 3 (TTS synthesis) is where we are investing the most engineering effort right now. The TTS integration framework is built and tested. We can accept any ONNX-exported TTS model and run inference through our Rust pipeline with sub-second latency per segment. What is pending is the final voice cloning model — the ONNX model that captures a speaker's voice characteristics from a short sample and reproduces them in the target language. We are evaluating multiple open-source and licensed models and expect to ship the first supported model in Q2 2026.
In the interim, V100's dubbing module works with standard TTS voices. You can dub a video into 40+ languages today using high-quality synthetic voices. What you cannot do yet is clone the original speaker's voice for the dubbed version. That is the gap we are closing.
The API
Voice dubbing is exposed as a single API endpoint that accepts a video and returns the dubbed version. The design philosophy is the same as every other V100 module: one call does the entire job.
```bash
# Dub a video into Spanish and French
curl -X POST https://api.v100.ai/v1/dub \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "video_url": "https://cdn.example.com/keynote.mp4",
        "source_language": "en",
        "target_languages": ["es", "fr"],
        "preserve_background_audio": true,
        "voice_mode": "standard",
        "output_format": "mp4"
      }'
```

The response:

```json
{
  "job_id": "dub_a1b2c3d4",
  "status": "processing",
  "estimated_duration_seconds": 45,
  "outputs": {
    "es": { "status": "pending" },
    "fr": { "status": "pending" }
  }
}
```
The `voice_mode` parameter controls the TTS voice selection. Today, `"standard"` uses high-quality synthetic voices matched to the detected speaker gender and age range. When voice cloning ships, `"clone"` will use the original speaker's voice characteristics. The API contract remains the same — only the output quality improves.
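From application code, the same call is one POST. Here is a minimal Python sketch using only the standard library — the helper name is ours, not an official SDK, but the endpoint, headers, and fields match the curl example above:

```python
import json
from urllib.request import Request

API_URL = "https://api.v100.ai/v1/dub"

def build_dub_request(api_key: str, video_url: str, source: str,
                      targets: list[str], voice_mode: str = "standard") -> Request:
    """Build the POST request for the /v1/dub endpoint."""
    payload = {
        "video_url": video_url,
        "source_language": source,
        "target_languages": targets,
        "preserve_background_audio": True,
        "voice_mode": voice_mode,
        "output_format": "mp4",
    }
    return Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_dub_request("YOUR_API_KEY", "https://cdn.example.com/keynote.mp4",
                        "en", ["es", "fr"])
# urllib.request.urlopen(req) would submit the job and return the JSON shown above.
```

The `job_id` in the response identifies the dub job for subsequent status checks.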
Use Cases
Voice dubbing serves four primary markets, each with distinct requirements for latency, quality, and language coverage.
Live broadcasts and events. Sports commentary, news broadcasts, and live conference presentations need dubbing with minimal delay. V100's pipeline processes segments as they arrive, producing dubbed audio within seconds of the original speech. This is not instantaneous — there is inherent latency from transcription, translation, and synthesis — but it is fast enough for near-live broadcast with a 3-5 second delay per segment. For live sports where commentary is continuous, this enables multilingual audio tracks that trail the original by a few seconds.
Content localization at scale. Media companies, e-learning platforms, and marketing teams produce video content in one language and need to distribute it globally. A course creator on Udemy can upload a 2-hour lecture and receive dubbed versions in 10 languages within minutes instead of weeks. The economics shift from $5,000-$15,000 per language to pennies per minute of processed audio.
Accessibility. Voice dubbing makes video content accessible to audiences who prefer listening in their native language over reading subtitles. This is particularly important for educational content, government communications, and healthcare information where comprehension is critical. Dubbed audio also serves visually impaired audiences who cannot read subtitles.
Enterprise internal communications. Global companies producing training videos, executive communications, and compliance content need localized versions for regional offices. V100's dubbing module integrates directly into existing video workflows through the API, so the localization step can be automated as part of the publishing pipeline.
Pricing: V100 vs. ElevenLabs vs. PlayHT vs. Resemble.ai
The voice dubbing and TTS market has several strong players. Here is an honest comparison of pricing and capabilities. We include competitor strengths because choosing the right tool depends on your specific use case.
| Factor | V100 | ElevenLabs | PlayHT | Resemble.ai |
|---|---|---|---|---|
| Primary focus | Full video platform with dubbing | Voice cloning & TTS | TTS & voice generation | Voice cloning & TTS |
| Voice cloning quality | Standard voices now; cloning Q2 2026 | Industry-leading | Strong | Strong, custom models |
| Languages | 40+ | 29+ | 140+ (TTS only) | 24+ |
| Pricing model | Per-minute video processed | $5-$330/mo (character quotas) | $31-$99/mo (word quotas) | $0.006/sec synthesized |
| Includes transcription | Yes | No | No | No |
| Includes video editing | Yes | No | No | No |
| Includes publishing/CDN | Yes | No | No | No |
When ElevenLabs wins: If your primary need is the highest-quality voice cloning available today, ElevenLabs is the leader. Their voice cloning technology is the most natural-sounding on the market, and their dubbing studio product handles the end-to-end dubbing workflow well. If you only need TTS/voice cloning and do not need video processing, transcription, or publishing, ElevenLabs is a strong standalone choice.
When PlayHT wins: PlayHT offers the broadest language coverage for TTS (140+ languages) and has a straightforward API. Their pricing is competitive for teams that need high-volume text-to-speech without video processing. If you are building a podcast or audio content platform and need pure TTS, PlayHT is worth evaluating.
When Resemble.ai wins: Resemble focuses on custom voice model training, which is valuable for brands that need a consistent synthetic voice across all content. Their per-second pricing is transparent and predictable. For enterprise voice branding use cases, Resemble is a strong contender.
When V100 wins: V100 wins when dubbing is part of a larger video workflow. If you need to record a video, transcribe it, translate it, dub it, add captions, edit out dead air, and publish it to a CDN — doing that through one API with one bill is significantly simpler than integrating V100 (or another video platform) with ElevenLabs for dubbing, Deepgram for transcription, and a CDN for delivery. The total cost of ownership drops because you eliminate integration engineering and multiple vendor contracts.
Technical Details: ONNX Runtime and the TTS Pipeline
V100's TTS inference runs on ONNX Runtime, which allows us to load any TTS model exported in the ONNX format and run inference in our Rust pipeline without Python overhead. The ONNX Runtime binding runs natively in our Rust process, eliminating the inter-process communication latency that Python-based TTS servers introduce.
The inference pipeline is designed for parallelism. When a video has multiple speaker segments that need dubbing, each segment is synthesized concurrently on separate threads. A 10-minute video with 60 segments can run TTS inference on all 60 segments in parallel, bounded only by available CPU cores. On production hardware, this means a 10-minute video can be fully dubbed in under 30 seconds.
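The per-segment parallelism is easy to picture with a thread pool. In the sketch below, the toy `synthesize_segment` function stands in for the real ONNX inference call (the actual Rust pipeline uses native threads, not Python):

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_segment(segment: tuple[int, str]) -> tuple[int, bytes]:
    """Stand-in for ONNX TTS inference on one translated segment."""
    index, text = segment
    return index, text.encode("utf-8")  # real code would return PCM samples

# (timeline index, translated text) pairs from the translation stage
segments = [(0, "Hola"), (1, "a"), (2, "todos")]

# Each segment is synthesized on its own thread; results are sorted by
# index so the mixing stage sees them in timeline order regardless of
# which thread finished first.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = sorted(pool.map(synthesize_segment, segments))

audio = b" ".join(chunk for _, chunk in results)
```

Because each segment is independent until the mixing stage, throughput scales close to linearly with core count.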
The audio mixing stage handles the non-trivial problem of time alignment. Translated text is rarely the same length as the original. German translations are typically 20-30% longer than English. Japanese translations are typically 10-20% shorter. The mixing stage time-stretches or compresses the synthesized audio to fit within the original segment's duration window, using a phase vocoder that preserves pitch while adjusting tempo. This keeps lip sync plausible for talking-head videos and maintains pacing for documentary narration.
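The alignment step reduces to computing a tempo ratio per segment before handing the audio to the phase vocoder. A sketch of that calculation — the clamping range here is an illustrative guess, not V100's actual limit:

```python
def stretch_ratio(original_duration: float, synthesized_duration: float,
                  max_stretch: float = 1.3) -> float:
    """Tempo factor for fitting synthesized audio into the original
    segment's window: >1.0 plays faster, <1.0 plays slower. Clamped so
    the phase vocoder never distorts speech beyond recognition."""
    ratio = synthesized_duration / original_duration
    return max(1.0 / max_stretch, min(ratio, max_stretch))

# A German segment synthesized 25% longer than its 4-second English
# original must play 25% faster to fit the same window:
print(stretch_ratio(4.0, 5.0))  # 1.25
```

A phase vocoder applies this factor to tempo while leaving pitch untouched, which is why the dubbed voice does not sound sped-up or chipmunked.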
What Is Next
The voice dubbing module roadmap has three milestones. First, shipping the voice cloning ONNX model that captures speaker characteristics from a 30-second voice sample and reproduces them across languages. Second, real-time streaming dubbing for live broadcasts, where the pipeline processes audio chunks as they arrive and produces dubbed output with less than 5 seconds of delay. Third, emotion preservation — ensuring that the synthesized voice matches the emotional tone (excitement, concern, humor) of the original speaker, not just the words.
We are building this in the open because we believe transparency about what works and what does not is more valuable than marketing claims. The dubbing module works today with standard voices in 40+ languages. Voice cloning is coming. We will publish benchmarks when it is ready.
Try voice dubbing today
Get a free API key and dub your first video into any of 40+ languages. Standard voices are available now. Voice cloning is coming in Q2 2026.