Content localization is no longer optional. YouTube reports that 80% of its watch time comes from outside the United States. Netflix dubs every original series into at least 34 languages. Coursera, Udemy, and Khan Academy have all launched multilingual video initiatives. The audience is global, but most video content is produced in one language and stays that way because dubbing has historically been expensive, slow, and manual.
Traditional dubbing requires voice actors, recording studios, audio engineers, and weeks of turnaround per language. A 60-minute documentary dubbed into 10 languages through a traditional studio costs $50,000-$150,000 and takes 4-8 weeks. That timeline and price point mean that only the largest media companies can afford global localization. Everyone else publishes subtitles and accepts the engagement drop — viewers retain 30-40% less information from subtitled content than from native-language audio.
V100's voice dubbing module is designed to change that equation. We are building a pipeline that takes video in, extracts speech, translates it, and synthesizes dubbed audio in the target language — all through a single API call. This post covers what the module does today, what is still in development, and how the pricing compares to standalone TTS providers.
Architecture: How the Dubbing Pipeline Works
The voice dubbing pipeline has four stages: (1) speech extraction and diarization, (2) translation, (3) TTS synthesis, and (4) audio mixing and export. Each stage is a distinct microservice in V100's Rust backend, connected through the same event-driven architecture that powers our transcription and editing pipelines.
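The flow through the four stages can be pictured as a chain of handlers passing segments along. The sketch below is illustrative only — the type and function names are ours, not V100's actual service interfaces, and each stage is stubbed with toy logic:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One diarized speech segment flowing through the dubbing pipeline."""
    speaker: str
    start: float          # seconds into the video
    end: float
    text: str = ""        # transcript (stage 1 output)
    translation: str = "" # target-language text (stage 2 output)
    audio: bytes = b""    # synthesized audio (stage 3 output)

def extract_speech(video: bytes) -> list[Segment]:
    """Stage 1: transcription + diarization (stubbed)."""
    return [Segment(speaker="spk_0", start=0.0, end=2.4, text="Hello everyone")]

def translate(segments: list[Segment], target: str) -> list[Segment]:
    """Stage 2: context-aware translation (stubbed with a toy lookup)."""
    toy = {"Hello everyone": {"es": "Hola a todos"}}
    for s in segments:
        s.translation = toy.get(s.text, {}).get(target, s.text)
    return segments

def synthesize(segments: list[Segment]) -> list[Segment]:
    """Stage 3: TTS synthesis (stubbed)."""
    for s in segments:
        s.audio = s.translation.encode("utf-8")  # placeholder for PCM audio
    return segments

def mix(segments: list[Segment]) -> bytes:
    """Stage 4: mix synthesized segments back into one audio track (stubbed)."""
    return b"".join(s.audio for s in segments)

dubbed = mix(synthesize(translate(extract_speech(b"<video bytes>"), "es")))
```

In production each function boundary is an event on the message bus rather than a direct call, which is what lets stages scale independently.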
What Works Today
We believe in being transparent about the current state of our technology. Here is what is production-ready and what is still in development.
Production-ready today: Stages 1 (speech extraction and diarization) and 2 (translation) are fully operational in production. V100's transcription engine handles 40+ languages with per-word timestamps, speaker identification, and medical/technical vocabulary support. The translation pipeline supports all 40+ languages with context-aware translation. Stage 4 (audio mixing and export) is also production-ready — we have been mixing and exporting audio through our editing pipeline since launch.
In development: Stage 3 (TTS synthesis) is where we are investing the most engineering effort right now. The TTS integration framework is built and tested. We can accept any ONNX-exported TTS model and run inference through our Rust pipeline with sub-second latency per segment. What is pending is the final voice cloning model — the ONNX model that captures a speaker's voice characteristics from a short sample and reproduces them in the target language. We are evaluating multiple open-source and licensed models and expect to ship the first supported model in Q2 2026.
In the interim, V100's dubbing module works with standard TTS voices. You can dub a video into 40+ languages today using high-quality synthetic voices. What you cannot do yet is clone the original speaker's voice for the dubbed version. That is the gap we are closing.
The API
Voice dubbing is exposed as a single API endpoint that accepts a video and returns the dubbed version. The design philosophy is the same as every other V100 module: one call does the entire job.
```bash
# Dub a video into Spanish and French
curl -X POST https://api.v100.ai/v1/dub \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "video_url": "https://cdn.example.com/keynote.mp4",
        "source_language": "en",
        "target_languages": ["es", "fr"],
        "preserve_background_audio": true,
        "voice_mode": "standard",
        "output_format": "mp4"
      }'
```

The response:

```json
{
  "job_id": "dub_a1b2c3d4",
  "status": "processing",
  "estimated_duration_seconds": 45,
  "outputs": {
    "es": { "status": "pending" },
    "fr": { "status": "pending" }
  }
}
```
The `voice_mode` parameter controls the TTS voice selection. Today, `"standard"` uses high-quality synthetic voices matched to the detected speaker gender and age range. When voice cloning ships, `"clone"` will use the original speaker's voice characteristics. The API contract remains the same — only the output quality improves.
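From application code, the same call is one POST. Here is a minimal Python sketch using only the standard library — the helper name is ours, not an official SDK, but the endpoint, headers, and fields match the curl example above:

```python
import json
from urllib.request import Request

API_URL = "https://api.v100.ai/v1/dub"

def build_dub_request(api_key: str, video_url: str, source: str,
                      targets: list[str], voice_mode: str = "standard") -> Request:
    """Build the POST request for the /v1/dub endpoint."""
    payload = {
        "video_url": video_url,
        "source_language": source,
        "target_languages": targets,
        "preserve_background_audio": True,
        "voice_mode": voice_mode,
        "output_format": "mp4",
    }
    return Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_dub_request("YOUR_API_KEY", "https://cdn.example.com/keynote.mp4",
                        "en", ["es", "fr"])
# urllib.request.urlopen(req) would submit the job and return the JSON shown above.
```

The `job_id` in the response identifies the dub job for subsequent status checks.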
Use Cases
Voice dubbing serves four primary markets, each with distinct requirements for latency, quality, and language coverage.
Live broadcasts and events. Sports commentary, news broadcasts, and live conference presentations need dubbing with minimal delay. V100's pipeline processes segments as they arrive, producing dubbed audio within seconds of the original speech. This is not instantaneous — there is inherent latency from transcription, translation, and synthesis — but it is fast enough for near-live broadcast with a 3-5 second delay per segment. For live sports where commentary is continuous, this enables multilingual audio tracks that trail the original by a few seconds.
Content localization at scale. Media companies, e-learning platforms, and marketing teams produce video content in one language and need to distribute it globally. A course creator on Udemy can upload a 2-hour lecture and receive dubbed versions in 10 languages within minutes instead of weeks. The economics shift from $5,000-$15,000 per language to pennies per minute of processed audio.
Accessibility. Voice dubbing makes video content accessible to audiences who prefer listening in their native language over reading subtitles. This is particularly important for educational content, government communications, and healthcare information where comprehension is critical. Dubbed audio also serves visually impaired audiences who cannot read subtitles.
Enterprise internal communications. Global companies producing training videos, executive communications, and compliance content need localized versions for regional offices. V100's dubbing module integrates directly into existing video workflows through the API, so the localization step can be automated as part of the publishing pipeline.
Pricing: V100 vs. ElevenLabs vs. PlayHT vs. Resemble.ai
The voice dubbing and TTS market has several strong players. Here is an honest comparison of pricing and capabilities. We include competitor strengths because choosing the right tool depends on your specific use case.
| Factor | V100 | ElevenLabs | PlayHT | Resemble.ai |
|---|---|---|---|---|
| Primary focus | Full video platform with dubbing | Voice cloning & TTS | TTS & voice generation | Voice cloning & TTS |
| Voice cloning quality | Standard voices now; cloning Q2 2026 | Industry-leading | Strong | Strong, custom models |
| Languages | 40+ | 29+ | 140+ (TTS only) | 24+ |
| Pricing model | Per-minute video processed | $5-$330/mo (character quotas) | $31-$99/mo (word quotas) | $0.006/sec synthesized |
| Includes transcription | Yes | No | No | No |
| Includes video editing | Yes | No | No | No |
| Includes publishing/CDN | Yes | No | No | No |
When ElevenLabs wins: If your primary need is the highest-quality voice cloning available today, ElevenLabs is the leader. Their voice cloning technology is the most natural-sounding on the market, and their dubbing studio product handles the end-to-end dubbing workflow well. If you only need TTS/voice cloning and do not need video processing, transcription, or publishing, ElevenLabs is a strong standalone choice.
When PlayHT wins: PlayHT offers the broadest language coverage for TTS (140+ languages) and has a straightforward API. Their pricing is competitive for teams that need high-volume text-to-speech without video processing. If you are building a podcast or audio content platform and need pure TTS, PlayHT is worth evaluating.
When Resemble.ai wins: Resemble focuses on custom voice model training, which is valuable for brands that need a consistent synthetic voice across all content. Their per-second pricing is transparent and predictable. For enterprise voice branding use cases, Resemble is a strong contender.
When V100 wins: V100 wins when dubbing is part of a larger video workflow. If you need to record a video, transcribe it, translate it, dub it, add captions, edit out dead air, and publish it to a CDN — doing that through one API with one bill is significantly simpler than integrating V100 (or another video platform) with ElevenLabs for dubbing, Deepgram for transcription, and a CDN for delivery. The total cost of ownership drops because you eliminate integration engineering and multiple vendor contracts.
Technical Details: ONNX Runtime and the TTS Pipeline
V100's TTS inference runs on ONNX Runtime, which allows us to load any TTS model exported in the ONNX format and run inference in our Rust pipeline without Python overhead. The ONNX Runtime binding runs natively in our Rust process, eliminating the inter-process communication latency that Python-based TTS servers introduce.
The inference pipeline is designed for parallelism. When a video has multiple speaker segments that need dubbing, each segment is synthesized concurrently on separate threads. A 10-minute video with 60 segments can run TTS inference on all 60 segments in parallel, bounded only by available CPU cores. On production hardware, this means a 10-minute video can be fully dubbed in under 30 seconds.
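The per-segment parallelism is easy to picture with a thread pool. In the sketch below, the toy `synthesize_segment` function stands in for the real ONNX inference call (the actual Rust pipeline uses native threads, not Python):

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_segment(segment: tuple[int, str]) -> tuple[int, bytes]:
    """Stand-in for ONNX TTS inference on one translated segment."""
    index, text = segment
    return index, text.encode("utf-8")  # real code would return PCM samples

# (timeline index, translated text) pairs from the translation stage
segments = [(0, "Hola"), (1, "a"), (2, "todos")]

# Each segment is synthesized on its own thread; results are sorted by
# index so the mixing stage sees them in timeline order regardless of
# which thread finished first.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = sorted(pool.map(synthesize_segment, segments))

audio = b" ".join(chunk for _, chunk in results)
```

Because each segment is independent until the mixing stage, throughput scales close to linearly with core count.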
The audio mixing stage handles the non-trivial problem of time alignment. Translated text is rarely the same length as the original. German translations are typically 20-30% longer than English. Japanese translations are typically 10-20% shorter. The mixing stage time-stretches or compresses the synthesized audio to fit within the original segment's duration window, using a phase vocoder that preserves pitch while adjusting tempo. This keeps lip sync plausible for talking-head videos and maintains pacing for documentary narration.
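The alignment step reduces to computing a tempo ratio per segment before handing the audio to the phase vocoder. A sketch of that calculation — the clamping range here is an illustrative guess, not V100's actual limit:

```python
def stretch_ratio(original_duration: float, synthesized_duration: float,
                  max_stretch: float = 1.3) -> float:
    """Tempo factor for fitting synthesized audio into the original
    segment's window: >1.0 plays faster, <1.0 plays slower. Clamped so
    the phase vocoder never distorts speech beyond recognition."""
    ratio = synthesized_duration / original_duration
    return max(1.0 / max_stretch, min(ratio, max_stretch))

# A German segment synthesized 25% longer than its 4-second English
# original must play 25% faster to fit the same window:
print(stretch_ratio(4.0, 5.0))  # 1.25
```

A phase vocoder applies this factor to tempo while leaving pitch untouched, which is why the dubbed voice does not sound sped-up or chipmunked.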
What Is Next
The voice dubbing module roadmap has three milestones. First, shipping the voice cloning ONNX model that captures speaker characteristics from a 30-second voice sample and reproduces them across languages. Second, real-time streaming dubbing for live broadcasts, where the pipeline processes audio chunks as they arrive and produces dubbed output with less than 5 seconds of delay. Third, emotion preservation — ensuring that the synthesized voice matches the emotional tone (excitement, concern, humor) of the original speaker, not just the words.
We are building this in the open because we believe transparency about what works and what does not is more valuable than marketing claims. The dubbing module works today with standard voices in 40+ languages. Voice cloning is coming. We will publish benchmarks when it is ready.
Try voice dubbing today
Get a free API key and dub your first video into any of 40+ languages. Standard voices are available now. Voice cloning is coming in Q2 2026.