
Auto-Caption Videos in 20 Languages via API: A Developer Guide

Captions are no longer optional. They are a legal requirement, an engagement multiplier, and an SEO signal. This guide covers how to add accurate captions to any video via API, with code examples for every output format.

V100 Engineering
March 18, 2026

In 2025, the European Accessibility Act made video captions mandatory for most commercial content distributed in the EU. In the US, the ADA has long required captions for public-facing video content, and under the CVAA the FCC has extended closed captioning requirements to much online video. Beyond legal compliance, the business case is clear: 85% of Facebook video is watched without sound. LinkedIn reports that captioned videos get 70% more engagement than uncaptioned ones. And Google can index caption text, making your video content discoverable through search.

Despite all of this, most video on the internet still lacks captions. The reason is simple: captioning has traditionally been expensive and slow. Manual captioning services charge $1-3 per minute of video and take 24-48 hours for turnaround. For a company producing 50 videos per week (averaging ~12 minutes each), that is $2,600-7,800 per month in captioning costs alone, plus the workflow overhead of uploading files, waiting for delivery, and quality-checking the results.

Automated captioning via API changes the economics entirely. V100's captioning endpoint processes video at roughly 10x real-time (a 60-minute video is captioned in ~6 minutes), supports 20 languages, and costs a fraction of manual captioning. This guide walks through how it works, how to integrate it, and how to get the best results.

How Auto-Captioning Works

Auto-captioning is a three-step process: speech recognition, alignment, and rendering. Understanding each step helps you make better decisions about configuration and quality tuning.

Speech recognition converts the audio track into text. V100 uses a fine-tuned variant of Whisper large-v3 that has been optimized for production throughput without sacrificing accuracy. The model produces raw text with word-level timestamps and confidence scores. For multi-speaker content, a separate diarization model identifies who is speaking at each point, enabling speaker-labeled captions.

Alignment groups individual words into caption segments. A naive approach (one word per caption) is unreadable. A better approach groups words into phrases that match natural speech rhythm, with each caption displayed for 2-5 seconds. The alignment algorithm considers sentence boundaries, clause boundaries, maximum character count per line (typically 42 characters for readability), and minimum display duration (1.5 seconds) to prevent captions from flashing too quickly.
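The grouping logic can be sketched in a few lines. This is an illustrative implementation, not V100's actual algorithm; it assumes word objects with `text`, `start`, and `end` fields in milliseconds:

```javascript
// Group word-level timestamps into readable caption segments.
// Start a new segment when adding the next word would exceed the
// character budget, when the segment would exceed the maximum
// display duration, or after a sentence boundary.
function groupWords(words, { maxChars = 42, minDurationMs = 1500, maxDurationMs = 5000 } = {}) {
  const segments = [];
  let current = null;
  for (const word of words) {
    const shouldBreak =
      current &&
      (current.text.length + 1 + word.text.length > maxChars ||
        word.end - current.start > maxDurationMs ||
        /[.?!]$/.test(current.text)); // prefer sentence boundaries
    if (!current || shouldBreak) {
      if (current) segments.push(current);
      current = { text: word.text, start: word.start, end: word.end };
    } else {
      current.text += ' ' + word.text;
      current.end = word.end;
    }
  }
  if (current) segments.push(current);
  // Enforce the minimum display duration so captions don't flash.
  for (const seg of segments) {
    if (seg.end - seg.start < minDurationMs) seg.end = seg.start + minDurationMs;
  }
  return segments;
}
```

A production aligner also has to handle line wrapping within a segment and avoid extending a short cue into the next one, but the break conditions above are the core of the problem.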

Rendering is the final step. For burned-in captions, the text is rendered directly onto the video frames using a configurable font, size, position, and background style. For sidecar files, the aligned segments are serialized into SRT or WebVTT format with millisecond-precision timestamps.
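Serializing aligned segments into SRT is straightforward. A minimal sketch, assuming segments with `start`/`end` in milliseconds and a `text` field:

```javascript
// Format a millisecond offset as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(ms) {
  const pad = (n, w = 2) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Serialize aligned segments into SRT: sequential cue number,
// timestamp range, caption text, blank line between cues.
function toSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text}`)
    .join('\n\n') + '\n';
}
```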

Three Output Formats, Three Use Cases

V100 supports three caption output modes, each serving different distribution needs. You can request all three in a single API call.

Burned-in captions are rendered directly into the video. The text becomes part of the visual content and cannot be turned off by the viewer. This is the standard for social media (Instagram Reels, TikTok, LinkedIn, Twitter/X) where native caption support is limited or inconsistent. Burned-in captions also work well for ads, where you want maximum control over the viewing experience.

SRT files (SubRip Text) are the most widely supported sidecar caption format. YouTube, Vimeo, Facebook, LinkedIn, and most video players accept SRT uploads. SRT is plain text with sequential numbering and timestamp ranges. It is easy to edit manually if you need to correct a word or adjust timing.
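A two-cue SRT file looks like this (content illustrative):

```
1
00:00:01,000 --> 00:00:03,200
Welcome to the webinar.

2
00:00:03,400 --> 00:00:06,100
Today we'll cover the captioning API.
```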

WebVTT files (Web Video Text Tracks) are the W3C standard for web-based captions. They support CSS styling, positioning, speaker identification via voice tags, and are natively supported by the HTML5 <track> element. If you are building a web video player, VTT is the right choice.
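The same cues in WebVTT, with a speaker voice tag and a cue position setting (the speaker name is illustrative):

```
WEBVTT

00:00:01.000 --> 00:00:03.200 line:85%
<v Maria>Welcome to the webinar.

00:00:03.400 --> 00:00:06.100
<v Maria>Today we'll cover the captioning API.
```

Note the differences from SRT: a mandatory `WEBVTT` header, dots instead of commas in timestamps, and optional cue settings after the timestamp range.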

Code Examples

Here is how to generate captions in each format using V100's API.

Burned-in captions with custom styling
const job = await v100.captions.create({
  source: 's3://content/webinar-2026-03.mp4',
  languages: ['en'],
  style: 'burned_in',
  caption_options: {
    font_family: 'Inter',
    font_size: 36,
    font_weight: 'bold',
    color: '#FFFFFF',
    background: 'semi_transparent',  // or 'none', 'solid', 'shadow'
    background_color: '#000000',
    background_opacity: 0.7,
    position: 'bottom_center',     // or 'top_center', 'middle'
    max_chars_per_line: 42,
    max_lines: 2,
    animation: 'fade'              // or 'none', 'pop', 'typewriter'
  }
});

SRT + VTT sidecar files in 3 languages
const job = await v100.captions.create({
  source: 'https://cdn.example.com/course-lecture-14.mp4',
  languages: ['en', 'es', 'pt'],
  style: 'sidecar',
  sidecar_formats: ['srt', 'vtt'],
  diarize: true  // label speakers in VTT output
});

const result = await v100.jobs.wait(job.id);
// result.outputs.en.srt_url → "https://cdn.v100.ai/out/.../en.srt"
// result.outputs.en.vtt_url → "https://cdn.v100.ai/out/.../en.vtt"
// result.outputs.es.srt_url → "https://cdn.v100.ai/out/.../es.srt"
// ...etc for each language

HTML5 video with VTT captions
<video controls width="100%">
  <source src="video.mp4" type="video/mp4">
  <track kind="captions" src="en.vtt"
         srclang="en" label="English" default>
  <track kind="captions" src="es.vtt"
         srclang="es" label="Español">
  <track kind="captions" src="pt.vtt"
         srclang="pt" label="Português">
</video>

Language Support and Accuracy

V100's captioning supports 20 languages. Accuracy varies by language, audio quality, and domain. English achieves 98.2% word-level accuracy on clear studio audio and 94-95% on noisy meeting recordings. CJK languages (Japanese, Korean, Chinese) achieve slightly lower accuracy (95-96%) due to the complexity of character segmentation and homophones. All accuracy figures are measured on V100's internal benchmark suite of 10,000 hours of diverse audio.

Factors that affect accuracy:

  • Audio quality. Studio-recorded audio with a good microphone achieves 97-98% accuracy. Laptop microphones in echoing conference rooms drop to 92-94%. Phone recordings are typically 90-93%.
  • Speaker clarity. Clear enunciation, moderate speaking pace, and minimal cross-talk produce the best results. Heavy accents, rapid speech, and multiple speakers talking simultaneously reduce accuracy.
  • Domain vocabulary. Medical, legal, and technical content may contain domain-specific terms that the general model has not seen frequently. V100 supports custom vocabulary lists to boost accuracy for specific terms.
  • Background noise. Music, ambient noise, and HVAC hum reduce accuracy. V100 applies noise reduction before transcription, but heavily degraded audio will still produce lower accuracy.

Accuracy Tuning: Custom Vocabulary

For domain-specific content, you can boost accuracy by providing a custom vocabulary list. This is particularly useful for product names, brand names, technical terminology, and proper nouns that the general model might not recognize.

Custom vocabulary for medical content
const job = await v100.captions.create({
  source: 's3://medical-ed/cardiology-lecture.mp4',
  languages: ['en'],
  style: 'burned_in',
  vocabulary: [
    'myocardial infarction',
    'troponin',
    'angioplasty',
    'electrocardiogram',
    'STEMI',
    'NSTEMI',
    'percutaneous coronary intervention'
  ]
});

Integration Patterns

There are three common patterns for integrating auto-captioning into your application.

Pattern 1: Caption-on-upload. Every video uploaded to your platform is automatically captioned. Wire your upload handler to submit a captioning job to V100 with a webhook callback. When the webhook fires, update the video record with the caption URLs. This is the simplest pattern and works well for platforms where every video needs captions.
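The webhook handler for this pattern reduces to a small function. The payload shape and store interface below are illustrative assumptions, not V100's documented schema:

```javascript
// Minimal webhook handler for the caption-on-upload pattern.
// Payload fields (status, video_id, outputs) and the videoStore
// interface are illustrative; adapt to the actual webhook schema.
function handleCaptionWebhook(payload, videoStore) {
  if (payload.status !== 'completed') {
    videoStore.markFailed(payload.video_id, payload.error);
    return false;
  }
  // Persist per-language caption URLs on the video record.
  for (const [lang, out] of Object.entries(payload.outputs)) {
    videoStore.addCaptions(payload.video_id, lang, {
      srt: out.srt_url,
      vtt: out.vtt_url,
    });
  }
  return true;
}
```

Remember to verify the webhook signature and return a 2xx response quickly, deferring any heavy processing to a background job.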

Pattern 2: Caption-on-demand. Users request captions for specific videos, or select specific languages. This pattern is common for platforms with a mix of internal and public content, where only public-facing videos need captions. The user clicks "Generate Captions" in your UI, your backend submits the job, and you display the result when the webhook arrives.

Pattern 3: Batch backfill. You have an existing library of uncaptioned videos and need to add captions to all of them. Use V100's batch API to submit up to 10,000 captioning jobs in a single request. The system processes them in parallel and fires webhooks as each completes. A 1,000-video backfill typically completes in 4-8 hours depending on video lengths.
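For a library larger than the per-request limit, split the backfill into multiple batch requests. A sketch of the chunking step; the job shape passed to the batch endpoint is illustrative:

```javascript
// Split a large backfill into batch requests that respect the
// 10,000-jobs-per-request limit. The per-job shape mirrors the
// single-job API but is illustrative here.
function buildBatchRequests(videoUrls, { languages = ['en'], maxPerBatch = 10000 } = {}) {
  const batches = [];
  for (let i = 0; i < videoUrls.length; i += maxPerBatch) {
    batches.push({
      jobs: videoUrls.slice(i, i + maxPerBatch).map((source) => ({
        source,
        languages,
        style: 'sidecar',
        sidecar_formats: ['srt', 'vtt'],
      })),
    });
  }
  return batches;
}
```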

Compliance Considerations

If your platform serves the EU market, the European Accessibility Act (EAA), which took full effect in June 2025, requires captions for most commercial video content. In the US, Section 508 of the Rehabilitation Act requires captions for federal agency content, and the ADA has been increasingly applied to commercial websites. WCAG 2.1 Level AA -- the most commonly referenced accessibility standard -- requires captions for all prerecorded audio content (Success Criterion 1.2.2) and audio descriptions for prerecorded video (Success Criterion 1.2.5).

Automated captions meet these requirements when accuracy is sufficiently high. The FCC and NAD (National Association of the Deaf) generally consider 99% accuracy the standard for broadcast captions. AI-generated captions typically achieve 95-98% accuracy, which is acceptable for most web content but may need human review for broadcast or legally sensitive material. V100's API returns confidence scores for each caption segment, allowing you to flag low-confidence segments for human review in a hybrid workflow.

The key takeaway for developers: automated captioning with 97%+ accuracy handles 95% of your captioning needs at 1/10th the cost and 100x the speed of manual captioning. For the remaining 5% (broadcast, legal, high-stakes content), use V100's confidence scores to build a human-review queue that only routes the segments that need attention.
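The review queue described above can be sketched as a simple partition over per-segment confidence scores. The 0.9 cutoff and segment shape are illustrative; tune the threshold against your own content:

```javascript
// Partition caption segments into auto-approved and needs-review
// buckets based on per-segment confidence scores. Threshold and
// field names are illustrative assumptions.
function buildReviewQueue(segments, threshold = 0.9) {
  const approved = [];
  const review = [];
  for (const seg of segments) {
    (seg.confidence >= threshold ? approved : review).push(seg);
  }
  return { approved, review };
}
```

Only the `review` bucket goes to a human editor, which is why the hybrid workflow costs a fraction of fully manual captioning.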

Start Captioning Today

V100's captioning API supports 20 languages with 97%+ accuracy. Free tier includes 60 minutes of processing per month.

Get API Key — Free Tier
