How do I add captions to a video automatically?

Upload your video to an AI-powered captioning service like V100. The AI transcribes the audio using speech recognition, generates word-level timestamps, and produces captions in SRT or VTT format. You can burn the captions directly into the video or download the subtitle file. V100's API does this in a single API call with support for 40+ languages and costs $0.006 per minute of video.

What is the most accurate auto-captioning tool?

Accuracy depends on audio quality, accents, and background noise. For clean audio (podcasts, presentations, studio recordings), modern AI captioning tools achieve 95-99% accuracy. V100 uses state-of-the-art speech recognition models optimized for video content and supports speaker diarization to label who is speaking. For noisy environments, accuracy drops to 85-92%, but V100's noise-aware models perform better than generic transcription services.

Can I add captions to 100 videos at once?

Yes. V100's batch captioning API lets you submit hundreds of videos in a single request. Each video is processed in parallel, and you receive a webhook notification as each video completes. A batch of 100 ten-minute videos typically completes in 10-15 minutes. You can export captions as SRT files, VTT files, or burn them directly into each video.

How much does automatic video captioning cost?

V100 charges $0.006 per minute of video for captioning. A 10-minute video costs $0.06. A 60-minute webinar costs $0.36. V100 also offers a free tier with 100 API calls per month. By comparison, Rev charges $1.50/min for human captioning and $0.25/min for AI, Descript includes captioning in its $24-33/month plan, and CapCut offers free captioning with watermarks on the free tier.

Can I auto-caption videos in multiple languages?

V100 supports auto-captioning in 40+ languages including English, Spanish, French, German, Portuguese, Japanese, Korean, Chinese (Simplified and Traditional), Arabic, Hindi, and more. You can also auto-detect the spoken language. For multi-language translation, V100 can transcribe in the original language and translate the captions into any of the supported languages, producing separate subtitle files for each.

How to Add Captions to Video Automatically

Video captions are no longer optional. Facebook reports that captioned video ads increase view time by 12% on average. On Instagram and TikTok, 80% of users watch videos with the sound off. Accessibility laws like the ADA and the European Accessibility Act require captions for public-facing video content. YouTube's algorithm ranks captioned videos higher because captions provide text for indexing. Every major platform now recommends or requires captions.

Yet adding captions to video remains painful. The manual process involves watching the video, typing every word, syncing the timestamps, and exporting the subtitle file. For a 10-minute video, this takes 40-60 minutes of work. Professional captioning services like Rev charge $1.50 per minute for human captioners. Even automated tools like YouTube's built-in captions are often inaccurate and require extensive manual correction.

Modern AI speech recognition has changed this. State-of-the-art models achieve 95-99% accuracy on clean audio, support 40+ languages, and process video faster than real time. This guide walks through three methods for captioning video, from manual subtitle editing to enterprise-scale batch APIs, with honest comparisons and pricing.

Why Captions Matter: The Numbers

The business case for captions is overwhelming. Captions are not just an accessibility feature; they are a growth lever. Every piece of data points in the same direction: captioned videos get more views, more engagement, and more conversions than uncaptioned videos.

Caption impact by the numbers

80% of videos on social media are watched on mute

People scroll through feeds in offices, on public transit, in bed next to sleeping partners. If your video requires audio to understand, you lose 80% of potential viewers immediately. Captions are the only way to communicate your message to the silent majority.

Captioned videos see 40% more views on average

PLYMedia and 3Play Media studies consistently show that captioned videos receive significantly more views. This is a combination of the muted-viewer effect and improved SEO from searchable caption text.

YouTube indexes caption text for search ranking

YouTube cannot watch your video. It reads the transcript. Videos with accurate captions rank higher for relevant keyword searches because YouTube has more text to index. Auto-generated captions from YouTube are often inaccurate, especially for technical terms, proper nouns, and non-English content. Uploading accurate SRT files gives you an SEO advantage.

Legal requirements are expanding

The ADA applies to public-facing web content in the US. The European Accessibility Act (effective June 2025) requires captions for video content from businesses serving EU customers. Universities, government agencies, and publicly traded companies have been sued for lacking video captions. The legal surface area is growing every year.

466 million people worldwide have disabling hearing loss

The WHO estimates that by 2050, nearly 700 million people will have disabling hearing loss. Captions are not a nice-to-have for this population; they are the primary way to consume video content. Building accessible content expands your addressable audience by hundreds of millions.

Method 1: Manual Captioning (SRT Files)

The traditional approach is to create captions manually using a subtitle editor. You watch the video, type the dialogue, set the start and end timestamps for each caption block, and export an SRT or VTT file. Tools like Subtitle Edit (free, Windows), Aegisub (free, cross-platform), and Kapwing (browser-based) provide the editing interface.

The SRT format is straightforward. Each caption block has a sequence number, a timestamp range, and the text. Here is an example of what an SRT file looks like:

example.srt

1
00:00:01,000 --> 00:00:04,500
Welcome to our product demo. Today we are
going to walk through the new dashboard.

2
00:00:05,200 --> 00:00:09,800
The first thing you will notice is the
redesigned navigation on the left side.

3
00:00:10,500 --> 00:00:15,000
We have consolidated the settings menu
into a single panel for easier access.
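
If you ever need to produce this format yourself, the structure is simple enough to generate in a few lines. The following is a minimal sketch, not part of any V100 SDK: the cue shape (`startMs`, `endMs`, `text`) and the function names are assumptions made for illustration.

```javascript
// Format integer milliseconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTimestamp(ms) {
  const pad = (n, width) => String(n).padStart(width, '0');
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(millis, 3)}`;
}

// Serialize cues into SRT blocks: sequence number, timestamp range, text,
// with one blank line between blocks. The cue shape is hypothetical.
function toSrt(cues) {
  const blocks = cues.map((cue, i) => {
    const range = `${formatSrtTimestamp(cue.startMs)} --> ${formatSrtTimestamp(cue.endMs)}`;
    return `${i + 1}\n${range}\n${cue.text}`;
  });
  return blocks.join('\n\n') + '\n';
}

const srt = toSrt([
  { startMs: 1000, endMs: 4500, text: 'Welcome to our product demo. Today we are\ngoing to walk through the new dashboard.' },
  { startMs: 5200, endMs: 9800, text: 'The first thing you will notice is the\nredesigned navigation on the left side.' }
]);
console.log(srt);
```

Note that SRT uses a comma before the milliseconds, which is the detail most hand-written files get wrong.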

The problem with manual captioning is time. Professional captioners work at approximately 4-6x real time, meaning a 10-minute video takes 40-60 minutes to caption. If you are captioning your own videos, it takes even longer because you are constantly pausing, rewinding, and adjusting timestamps. For a YouTube creator publishing three 15-minute videos per week, manual captioning adds 3-4.5 hours of work every week. Over 52 weeks, that is roughly 156-234 hours per year spent on subtitles alone.
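
The arithmetic behind those figures is easy to rerun for your own publishing schedule. This small sketch assumes the 4-6x real-time multiplier quoted above and 52 publishing weeks per year; the function name is made up for the example.

```javascript
// Hours of manual captioning work for a given schedule.
// multiplier = minutes of work per minute of video (4-6 for professionals).
function manualCaptioningHours(videoMinutes, videosPerWeek, multiplier, weeks = 1) {
  return (videoMinutes * videosPerWeek * multiplier * weeks) / 60;
}

const weeklyLow = manualCaptioningHours(15, 3, 4);       // 3 hours
const weeklyHigh = manualCaptioningHours(15, 3, 6);      // 4.5 hours
const yearlyLow = manualCaptioningHours(15, 3, 4, 52);   // 156 hours
const yearlyHigh = manualCaptioningHours(15, 3, 6, 52);  // 234 hours

console.log(`Weekly: ${weeklyLow}-${weeklyHigh} h, yearly: ${yearlyLow}-${yearlyHigh} h`);
```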

Manual captioning does produce the highest accuracy because a human is verifying every word. For legal depositions, medical content, and broadcast television, human captioning is still the standard. But for social media content, marketing videos, podcasts, and educational content, AI captioning now matches or exceeds the accuracy of rushed human captioners, at a fraction of the cost and time.

Method 2: Desktop Apps and SaaS Tools

The next tier is using a desktop application or SaaS tool with built-in AI captioning. These tools combine video editing with automatic speech recognition so you can generate, edit, and burn in captions within a single interface.

Descript ($24-33/month): Descript transcribes your video and lets you edit it by editing the transcript. Captions are generated automatically from the transcript and can be styled and burned into the video. Accuracy is excellent for English, with support for about 20 languages. The limitation is that Descript is a desktop application designed for individual users. There is no API for automation, and processing happens on your local machine, which can be slow for long videos on underpowered hardware.

CapCut (Free / $7.99/month): CapCut offers auto-captions with animated word highlighting, which is the style that has become popular on TikTok and Instagram Reels. The free tier includes watermarks and limited exports. CapCut is excellent for short-form social content but lacks batch processing, API access, and enterprise features. Accuracy is good for common languages but drops significantly for specialized vocabulary and less-common languages.

Premiere Pro ($22.99/month): Adobe added speech-to-text captioning in 2021. It works well within the Premiere ecosystem, but it requires a Premiere Pro subscription and is designed for manual editing workflows, not automation. There is no batch mode, and processing is tied to your desktop hardware.

These tools work well for individual creators editing one video at a time. The problem emerges at scale. If you need to caption 50 videos from a webinar series, or add captions to every video your marketing team produces, or build captioning into your own application, desktop tools do not have APIs, do not support batch processing, and cannot be integrated into automated workflows.

Method 3: V100 Auto-Caption API

V100 provides automatic captioning as an API. You send a video URL or upload a file, and V100 returns a transcript with word-level timestamps. You can export the transcript as SRT, VTT, or JSON, or burn the captions directly into the video with customizable styling. The entire process is a single API call.

Here is how to auto-caption a video with a single curl command:

auto-caption.sh

# Generate captions and burn them into the video
curl -X POST https://api.v100.ai/v1/captions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "video_url": "https://storage.example.com/my-video.mp4",
    "language": "auto",
    "burn_in": true,
    "style": {
      "font": "Inter",
      "size": 42,
      "color": "#FFFFFF",
      "background": "rgba(0,0,0,0.7)",
      "position": "bottom-center",
      "animation": "word-highlight"
    },
    "export_formats": ["srt", "vtt"],
    "webhook_url": "https://your-app.com/webhooks/captions-complete"
  }'

The response includes a job ID. When processing completes, V100 sends a webhook to your application with the captioned video URL, the SRT file URL, and the VTT file URL. For a 10-minute video, processing typically takes 30-60 seconds.
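
On the receiving end, your application needs to validate the webhook body before acting on it. The sketch below is an assumption-heavy illustration: the payload field names (`job_id`, `captioned_video_url`, `srt_url`, `vtt_url`) mirror the SDK result fields shown later in this guide, but the actual webhook schema is not documented here, so check it against V100's reference before relying on it.

```javascript
// Parse and validate a caption-complete webhook body.
// Payload field names are hypothetical; adjust them to the real schema.
function parseCaptionWebhook(rawBody) {
  let payload;
  try {
    payload = JSON.parse(rawBody);
  } catch (err) {
    return { ok: false, error: 'invalid JSON' };
  }
  if (payload === null || typeof payload !== 'object') {
    return { ok: false, error: 'invalid JSON' };
  }
  const required = ['job_id', 'srt_url', 'vtt_url'];
  const missing = required.filter((key) => typeof payload[key] !== 'string');
  if (missing.length > 0) {
    return { ok: false, error: `missing fields: ${missing.join(', ')}` };
  }
  return {
    ok: true,
    jobId: payload.job_id,
    // Assumed to be absent when burn_in was false.
    captionedVideoUrl: payload.captioned_video_url ?? null,
    subtitles: { srt: payload.srt_url, vtt: payload.vtt_url }
  };
}

const parsed = parseCaptionWebhook(JSON.stringify({
  job_id: 'job_123',
  captioned_video_url: 'https://storage.example.com/my-video.captioned.mp4',
  srt_url: 'https://storage.example.com/my-video.srt',
  vtt_url: 'https://storage.example.com/my-video.vtt'
}));
console.log(parsed.ok, parsed.jobId);
```

Keeping the parsing in a pure function makes it easy to test without starting an HTTP server; your route handler only has to pass in the raw request body.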

Here is the same operation using the V100 JavaScript SDK:

auto-caption.js

import { V100 } from 'v100-sdk';
const v100 = new V100('YOUR_API_KEY');

// Auto-caption a video
const result = await v100.captions.generate({
  video_url: 'https://storage.example.com/my-video.mp4',
  language: 'auto',               // Auto-detect language
  burn_in: true,                  // Burn captions into video
  style: {
    font: 'Inter',
    size: 42,
    color: '#FFFFFF',
    background: 'rgba(0,0,0,0.7)',
    position: 'bottom-center',
    animation: 'word-highlight'   // TikTok-style word highlighting
  },
  export_formats: ['srt', 'vtt']
});

// result.captioned_video_url — video with burned-in captions
// result.srt_url — downloadable SRT file
// result.vtt_url — downloadable VTT file
// result.transcript — full word-level timestamped transcript
console.log(`Captioned video: ${result.captioned_video_url}`);
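
If you want to lay out captions yourself instead of using the exported files, the word-level transcript can be grouped into caption cues. This is a rough sketch under stated assumptions: the shape of `result.transcript` is not specified in this guide, so the `{ word, start, end }` objects (times in seconds) and the 42-character and 0.8-second thresholds are illustrative choices, not V100 behavior.

```javascript
// Group word-level timestamps into caption cues.
// A new cue starts when the line would exceed maxChars or after a long pause.
function groupWordsIntoCues(words, maxChars = 42, maxGapSec = 0.8) {
  const cues = [];
  let current = null;
  for (const w of words) {
    const startsNewCue =
      current === null ||
      current.text.length + 1 + w.word.length > maxChars ||
      w.start - current.end > maxGapSec;
    if (startsNewCue) {
      current = { start: w.start, end: w.end, text: w.word };
      cues.push(current);
    } else {
      current.text += ' ' + w.word;
      current.end = w.end;
    }
  }
  return cues;
}

// Hypothetical transcript words for illustration.
const sampleWords = [
  { word: 'Welcome', start: 1.0, end: 1.4 },
  { word: 'to', start: 1.45, end: 1.55 },
  { word: 'our', start: 1.6, end: 1.8 },
  { word: 'demo.', start: 1.85, end: 2.3 },
  { word: 'Today', start: 3.5, end: 3.9 },
  { word: 'we', start: 3.95, end: 4.05 },
  { word: 'begin.', start: 4.1, end: 4.5 }
];
const cues = groupWordsIntoCues(sampleWords);
console.log(cues.map((c) => c.text));
```

Here the pause between "demo." and "Today" is longer than the gap threshold, so the words split into two cues.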

40+ Languages: Auto-Caption in Any Language

V100 supports automatic captioning in over 40 languages. The language parameter accepts an ISO 639-1 code (en, es, fr, de, ja, ko, zh, ar, hi, pt, etc.) or "auto" for automatic language detection. Auto-detection identifies the spoken language within the first 30 seconds of audio and applies the appropriate speech recognition model.
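
Because the parameter takes either a two-letter code or "auto", it can help to normalize user input before sending a request. The helper below is a hypothetical client-side convenience, not an SDK function; stripping region suffixes such as `pt-BR` down to `pt` is an assumption about what you would want, since this guide does not say whether regional tags are accepted.

```javascript
// Normalize a user-supplied language value to "auto" or a two-letter code.
// Only the format is checked; whether a code is actually supported is not known here.
function normalizeLanguage(input) {
  const value = String(input ?? 'auto').trim().toLowerCase();
  if (value === '' || value === 'auto') return 'auto';
  const primary = value.split(/[-_]/)[0]; // "pt-BR" -> "pt"
  if (!/^[a-z]{2}$/.test(primary)) {
    throw new RangeError(`Expected an ISO 639-1 code or "auto", got "${input}"`);
  }
  return primary;
}

console.log(normalizeLanguage('EN'));     // "en"
console.log(normalizeLanguage('pt-BR'));  // "pt"
console.log(normalizeLanguage(undefined)); // "auto"
```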

Accuracy varies by language. Tier 1 languages (English, Spanish, French, German, Portuguese, Italian, Dutch, Japanese, Korean, Chinese) achieve 95-99% accuracy on clean audio. Tier 2 languages (Arabic, Hindi, Turkish, Polish, Czech, Vietnamese, Thai, Indonesian) achieve 90-97% accuracy. Less-common languages achieve 85-95% accuracy. These numbers assume clean audio with minimal background noise and clear pronunciation. For noisy environments, all accuracy rates drop by 3-8 percentage points.
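
This guide does not state how these accuracy percentages are measured. A common convention in speech recognition is word accuracy, defined as one minus the word error rate (WER): the number of word substitutions, insertions, and deletions divided by the number of words in a reference transcript. Assuming that convention, here is a sketch you could use to check any captioning tool against a transcript you trust.

```javascript
// Word error rate via word-level edit distance (Levenshtein).
// Casing and punctuation are ignored; this normalization is a simplifying assumption.
function wordErrorRate(reference, hypothesis) {
  const tokenize = (s) =>
    s.toLowerCase().replace(/[^\p{L}\p{N}\s']/gu, '').split(/\s+/).filter(Boolean);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error('reference transcript is empty');

  // prev[j] = edit distance between the first i-1 reference words and first j hypothesis words
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

const reference = 'The first thing you will notice is the redesigned navigation on the left side of the new dashboard panel today.';
const hypothesis = 'The first thing you will notice is the designed navigation on the left side of the new dashboard panel today.';
const wer = wordErrorRate(reference, hypothesis); // one substitution in 20 words
console.log(`WER: ${wer}, word accuracy: ${Math.round((1 - wer) * 100)}%`);
```

At 95% word accuracy, a 20-word sentence contains about one wrong word, which is worth keeping in mind when deciding whether a human review pass is needed.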

For content that needs captions in multiple languages, V100 can transcribe in the original language and translate the captions into any other supported language. A Spanish-language podcast can be captioned in Spanish and simultaneously translated into English, French, and Portuguese. Each translation is a separate subtitle file with properly synced timestamps.

multi-language-captions.sh

# Caption in original language + translate to 3 others
curl -X POST https://api.v100.ai/v1/captions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "video_url": "https://storage.example.com/spanish-podcast.mp4",
    "language": "es",
    "translate_to": ["en", "fr", "pt"],
    "export_formats": ["srt", "vtt"],
    "burn_in": false
  }'

# Returns: SRT/VTT files for es, en, fr, pt
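
The number of subtitle files grows as languages times formats, which is worth planning for in your storage layout. As a sketch, the helper below lists the files you might expect from a request like the one above. The `name.lang.format` naming scheme is purely an assumption for illustration; the API's actual file naming is not described in this guide.

```javascript
// List the subtitle files a translation request would produce:
// one per (language, format) pair. The naming scheme is hypothetical.
function expectedSubtitleFiles(request, baseName) {
  const languages = [request.language, ...(request.translate_to ?? [])];
  const files = [];
  for (const lang of languages) {
    for (const format of request.export_formats) {
      files.push(`${baseName}.${lang}.${format}`);
    }
  }
  return files;
}

const files = expectedSubtitleFiles(
  { language: 'es', translate_to: ['en', 'fr', 'pt'], export_formats: ['srt', 'vtt'] },
  'spanish-podcast'
);
console.log(files.length, files); // 4 languages x 2 formats = 8 files
```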

Batch Captioning: 100 Videos in One Request

The real power of an API-based captioning solution appears at scale. If you have a backlog of 100 uncaptioned videos on YouTube, or your company produces 50 training videos per month, or your platform hosts thousands of user-generated videos that need captions for accessibility compliance, manual captioning is not feasible.

V100's batch API accepts an array of video URLs and processes them in parallel. Each video receives its own webhook notification when complete. A batch of 100 ten-minute videos (1,000 total minutes) typically completes in 10-15 minutes and costs $6.00 at V100's rate of $0.006/min.

batch-caption.js

import { V100 } from 'v100-sdk';
const v100 = new V100('YOUR_API_KEY');

// Batch caption 100 videos
const videoUrls = [
  'https://storage.example.com/video-001.mp4',
  'https://storage.example.com/video-002.mp4',
  // ... up to 100 videos per batch
  'https://storage.example.com/video-100.mp4'
];

const batch = await v100.captions.batch({
  videos: videoUrls.map(url => ({
    video_url: url,
    language: 'auto',
    burn_in: true,
    style: {
      font: 'Inter',
      size: 38,
      color: '#FFFFFF',
      background: 'rgba(0,0,0,0.75)',
      position: 'bottom-center'
    },
    export_formats: ['srt', 'vtt']
  })),
  webhook_url: 'https://your-app.com/webhooks/batch-complete'
});

console.log(`Batch ${batch.id}: ${batch.video_count} videos queued`);
// Each video triggers a webhook when its captions are ready

Batch captioning is particularly valuable for three use cases. First, backlog clearing: if you have hundreds of existing videos that lack captions, a single batch request can caption all of them overnight. Second, platform-wide compliance: if you run a video platform and need to ensure every video has captions for ADA compliance, batch captioning handles the initial backlog while webhook-driven captioning handles new uploads automatically. Third, content repurposing: translating an entire video library into multiple languages for international markets.
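
For a backlog larger than one batch, you will need to split the URL list before submitting. This is a minimal sketch that assumes the 100-videos-per-batch limit mentioned in the example comment above; the helper is generic JavaScript, not an SDK function, and each resulting chunk would be passed to a separate batch request.

```javascript
// Split a list into consecutive batches of at most batchSize items.
function chunkIntoBatches(items, batchSize = 100) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 250 placeholder URLs split into batches of 100, 100, and 50.
const backlog = Array.from({ length: 250 }, (_, i) =>
  `https://storage.example.com/video-${String(i + 1).padStart(3, '0')}.mp4`);
const batches = chunkIntoBatches(backlog);
console.log(batches.map((b) => b.length));
```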

SRT vs. VTT: Which Format to Use

V100 exports captions in both SRT and VTT formats. SRT (SubRip Subtitle) is the most widely supported format. YouTube, Vimeo, Facebook, LinkedIn, and most video players accept SRT files. The format is plain text with sequence numbers, timestamps, and caption text.

VTT (short for WebVTT, Web Video Text Tracks) is the HTML5 standard for web video captions. If you are building a web application with a video player, VTT is the format you need. VTT supports additional styling (font color, position, alignment) that SRT does not. The HTML5 <track> element accepts VTT files natively, making VTT the best choice for custom video players.

For most use cases, export both. SRT for uploading to social platforms and video hosting services. VTT for your own web video player. V100 generates both from a single transcription, so there is no additional cost for requesting both formats.
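
If you only have an SRT file on hand, a basic conversion to VTT is mechanical: add the `WEBVTT` header and switch the millisecond separator from a comma to a period. The sketch below handles plain SRT files only; it does not translate SRT formatting tags or add any VTT positioning, and the function name is made up for the example.

```javascript
// Convert plain SRT text to WebVTT.
// SRT sequence numbers are kept; WebVTT treats them as cue identifiers.
function srtToVtt(srt) {
  const lines = srt
    .replace(/^\uFEFF/, '')   // drop a byte-order mark if present
    .replace(/\r\n?/g, '\n')  // normalize line endings
    .trim()
    .split('\n')
    .map((line) =>
      // Only touch timing lines, so commas in caption text are left alone.
      line.includes('-->')
        ? line.replace(/(\d{2}:\d{2}:\d{2}),(\d{3})/g, '$1.$2')
        : line);
  return `WEBVTT\n\n${lines.join('\n')}\n`;
}

const vtt = srtToVtt('1\r\n00:00:01,000 --> 00:00:04,500\r\nHello, world\r\n');
console.log(vtt);
```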

Comparison: Manual vs. Descript vs. CapCut vs. V100

Feature	Manual	Descript	CapCut	V100
Accuracy (clean audio)	99%+ (human)	95-98%	90-95%	95-99%
Languages	Depends on captioner	~20	~15	40+
Time (10-min video)	40-60 min	2-3 min	1-2 min	30-60 sec
Batch processing	No	No	No	100+ videos/batch
API access	No	No	No	REST + SDK
SRT/VTT export	Manual	Yes	Limited	Both formats
Burned-in captions	Separate tool needed	Yes	Yes (with styles)	Yes (custom styles)
Word-level timestamps	Manual	Yes	Yes	Yes
Multi-language translation	No	Limited	Limited	40+ languages
Speaker diarization	Manual	Yes	No	Yes
Cost (10-min video)	$15 (Rev human)	$24-33/mo flat	Free (watermark)	$0.06

Pricing: What Auto-Captioning Actually Costs

V100 charges $0.006 per minute of video for captioning. This includes transcription, word-level timestamps, SRT/VTT export, and burned-in caption rendering. Translation to additional languages costs an additional $0.003 per minute per target language.

Real-world pricing examples

10-minute YouTube video: $0.06

60-minute webinar recording: $0.36

100 videos (10 min avg), one batch: $6.00

10-minute video + 3 language translations: $0.15

1,000 videos/month (10 min avg): $60/mo
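
To estimate your own bill, the rates above reduce to one formula: minutes times $0.006, plus minutes times $0.003 for each translation target. The sketch below works in thousandths of a dollar to avoid floating-point rounding surprises; the function and constant names are made up for the example.

```javascript
// Rates from this guide, in thousandths of a dollar per minute of video.
const CAPTION_MILLS_PER_MIN = 6;      // $0.006
const TRANSLATION_MILLS_PER_MIN = 3;  // $0.003 per target language

// Estimated cost in US dollars for a given number of video minutes.
function captioningCostUsd(totalMinutes, targetLanguages = 0) {
  const mills =
    totalMinutes * (CAPTION_MILLS_PER_MIN + TRANSLATION_MILLS_PER_MIN * targetLanguages);
  return mills / 1000;
}

console.log(captioningCostUsd(10).toFixed(2));        // 10-minute video: "0.06"
console.log(captioningCostUsd(60).toFixed(2));        // 60-minute webinar: "0.36"
console.log(captioningCostUsd(100 * 10).toFixed(2));  // 100 ten-minute videos: "6.00"
console.log(captioningCostUsd(10, 3).toFixed(2));     // plus 3 translations: "0.15"
console.log(captioningCostUsd(1000 * 10).toFixed(2)); // 1,000 videos/month: "60.00"
```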

V100 also offers a free tier with 100 API calls per month. For a creator publishing 3-4 videos per week, the free tier covers all captioning needs. For teams and platforms processing hundreds or thousands of videos, the per-minute pricing scales linearly with no minimum commitment.

Caption Styles: From Simple to Animated

V100 supports multiple caption styles, from traditional subtitles at the bottom of the screen to the animated word-highlighting style that has become the standard on TikTok and Instagram Reels. The style parameter in the API controls font, size, color, background, position, and animation type.

The animated word-highlight style displays 3-5 words at a time in the center of the screen, with the currently spoken word highlighted in a different color. This style originated on TikTok and has become the default expectation for short-form social video. CapCut popularized it, and now every major platform's audience expects it. V100's "word-highlight" animation mode reproduces this style with customizable colors and timing.
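
To make the mechanics concrete, here is a rough sketch of the logic behind this style: given word-level timestamps and the current playback time, pick the group of words on screen and the index of the word to highlight. This illustrates the general idea only, not V100's renderer; the `{ word, start }` shape (times in seconds) and the fixed group size of four are assumptions.

```javascript
// Decide which words to show and which one to highlight at a playback time.
// Words are shown in fixed groups; the active word is the latest one that has started.
function highlightFrame(words, timeSec, groupSize = 4) {
  let active = -1;
  for (let i = 0; i < words.length; i++) {
    if (words[i].start <= timeSec) active = i;
    else break; // words are assumed to be sorted by start time
  }
  if (active === -1) return null; // nothing has been spoken yet
  const groupStart = Math.floor(active / groupSize) * groupSize;
  const group = words.slice(groupStart, groupStart + groupSize);
  return { words: group.map((w) => w.word), activeIndex: active - groupStart };
}

// Hypothetical word timings for illustration.
const spoken = [
  { word: 'Welcome', start: 1.0 },
  { word: 'to', start: 1.45 },
  { word: 'our', start: 1.6 },
  { word: 'demo.', start: 1.85 },
  { word: 'Today', start: 3.5 },
  { word: 'we', start: 3.95 },
  { word: 'begin.', start: 4.1 }
];
const frame = highlightFrame(spoken, 1.7);
console.log(frame); // first group of four words, with "our" highlighted
```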

For professional content like webinars, presentations, and corporate video, the traditional bottom-of-screen subtitle style is more appropriate. V100's "standard" mode places captions in a semi-transparent background bar at the bottom of the frame, with configurable font, size, and color. For accessibility compliance, this is the recommended style because it follows established caption conventions and does not interfere with the visual content.

When to Use Each Method

Use manual captioning when:

You are captioning legal depositions, medical content, broadcast television, or any other content where near-perfect accuracy is legally required and AI accuracy of 95-99% is not sufficient. The cost and time are justified by the regulatory requirement.

Use desktop apps (Descript, CapCut) when:

You are a single creator editing one video at a time, you want a visual editor interface for caption styling, and you do not need batch processing or API integration. Descript is excellent for long-form editing; CapCut is excellent for short-form social clips.

Use V100 API when:

You need to caption videos at scale (batch processing), you are building captioning into your own application, you need multi-language support, you want to automate captioning in a CI/CD or content pipeline, or you need the lowest per-video cost. V100 is the backend; you build or choose the frontend.

Integrating Auto-Captions Into Your Workflow

For content teams that publish regularly, the highest-leverage setup is an automated captioning pipeline. When a new video is uploaded to your storage (S3, Google Cloud Storage, or any URL-accessible location), a webhook triggers V100's captioning API. Captions are generated, the SRT file is saved alongside the video, and the captioned version is uploaded to your publishing queue. The entire process is hands-free after initial setup.
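
The first step of such a pipeline, turning a storage upload event into a captioning request, might look like the sketch below. The event shape (`bucketUrl`, `key`) and the helper name are hypothetical; the request fields reuse the parameters from the curl example earlier in this guide. The function also derives the path where the SRT file would be saved alongside the video.

```javascript
// Build a captioning request from a storage upload event.
// Returns null for non-video uploads so the pipeline can skip them.
function buildCaptionJob(uploadEvent, webhookUrl) {
  const { bucketUrl, key } = uploadEvent; // hypothetical event shape
  if (!/\.(mp4|mov|webm|mkv)$/i.test(key)) return null;
  const baseKey = key.replace(/\.[^./]+$/, ''); // strip the file extension
  return {
    request: {
      video_url: `${bucketUrl}/${key}`,
      language: 'auto',
      burn_in: true,
      export_formats: ['srt', 'vtt'],
      webhook_url: webhookUrl
    },
    srtKey: `${baseKey}.srt` // sidecar file stored next to the video
  };
}

const job = buildCaptionJob(
  { bucketUrl: 'https://storage.example.com', key: 'webinars/q3-launch.mp4' },
  'https://your-app.com/webhooks/captions-complete'
);
console.log(job.request.video_url, job.srtKey);
```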

For YouTube creators specifically, the workflow is: record video, upload to V100 for captioning, download the SRT file, upload both the video and SRT to YouTube. The SRT file replaces YouTube's auto-generated captions with your accurate, AI-generated captions. This improves both accessibility and SEO because YouTube trusts uploaded captions more than its own auto-generated ones for search indexing.

For platforms and applications, V100's webhook-driven architecture means captioning is asynchronous. Your application does not block while captions are being generated. Upload the video, receive a webhook when captions are ready, and update the UI. Users see a "generating captions" state for 30-60 seconds, then the captions appear. This is the same pattern that Loom, Descript, and many other modern video platforms use.
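
On the application side, this pattern comes down to tracking each job's state between submission and webhook. The in-memory store below is a minimal sketch of that idea with hypothetical names; a real application would persist the state in its database and map job IDs to its own video records.

```javascript
// Track caption jobs between submission and the completion webhook.
class CaptionJobStore {
  constructor() {
    this.jobs = new Map();
  }

  // Call right after the captioning request returns a job ID.
  markSubmitted(jobId) {
    this.jobs.set(jobId, { state: 'generating', srtUrl: null });
  }

  // Call from the webhook handler once captions are ready.
  markReady(jobId, srtUrl) {
    const job = this.jobs.get(jobId);
    if (!job) return false; // unknown job: ignore the webhook
    job.state = 'ready';
    job.srtUrl = srtUrl;
    return true;
  }

  // Text the UI could show for a given job.
  statusLabel(jobId) {
    const job = this.jobs.get(jobId);
    if (!job) return 'No captions requested';
    return job.state === 'ready' ? 'Captions available' : 'Generating captions...';
  }
}

const store = new CaptionJobStore();
store.markSubmitted('job_123');
console.log(store.statusLabel('job_123')); // while processing
store.markReady('job_123', 'https://storage.example.com/my-video.srt');
console.log(store.statusLabel('job_123')); // after the webhook arrives
```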

Start adding captions to your videos today

V100's free tier includes 100 API calls per month. Upload a test video, generate captions in any language, and download the SRT file. No credit card required.
