What is a video transcription API?

A video transcription API converts spoken audio in video files to text programmatically. V100's transcription API returns word-level timestamps, speaker diarization, and confidence scores, and it supports 40+ languages. It works with both real-time video streams and uploaded files, and exports in SRT, VTT, and JSON formats.

Does V100 support real-time transcription?

Yes. V100 supports both real-time transcription via WebSocket for live meetings and conferencing, and asynchronous transcription for uploaded video files. Real-time transcripts are delivered with sub-second latency and include word-level timestamps and speaker identification.

How does V100 compare to Deepgram for video transcription?

V100 includes transcription as part of a full video platform (conferencing, editing, captioning, publishing), while Deepgram is a standalone speech-to-text API. V100 adds transcript-based video editing, where editing text edits the video itself. V100 also includes post-quantum signed transcripts for tamper-evident proof of authenticity.

TRANSCRIPTION API

Video Transcription API
40+ Languages, Word-Level Timestamps

V100's video transcription API converts spoken audio to text with word-level timestamps, speaker diarization, and per-word confidence scores. It supports 40+ languages, works in real time on live meetings or asynchronously on uploaded files, and exports transcripts as SRT, VTT, or structured JSON. Unlike standalone speech-to-text services, V100's transcription is integrated into a full video platform: edit the transcript text and the video edits itself.


How It Works

One API call. Upload a video or provide a URL. Get back a full transcript with timestamps, speakers, and confidence scores. Here is the request and response.

POST /v1/transcribe

curl -X POST https://api.v100.ai/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "https://storage.example.com/meeting-recording.mp4",
    "language": "auto",
    "features": {
      "diarization": true,
      "word_timestamps": true,
      "confidence_scores": true,
      "punctuation": true,
      "paragraphs": true
    },
    "export": ["json", "srt", "vtt"]
  }'

Response — 200 OK

{
  "transcript_id": "tr_8f3a2b1c",
  "status": "completed",
  "language": "en",
  "duration_seconds": 1847.3,
  "speakers": ["Speaker A", "Speaker B", "Speaker C"],
  "words": [
    {
      "text": "Welcome",
      "start": 0.240,
      "end": 0.680,
      "confidence": 0.98,
      "speaker": "Speaker A"
    },
    {
      "text": "to",
      "start": 0.680,
      "end": 0.800,
      "confidence": 0.99,
      "speaker": "Speaker A"
    },
    // ... every word in the video
  ],
  "paragraphs": [
    {
      "speaker": "Speaker A",
      "start": 0.240,
      "end": 14.920,
      "text": "Welcome to the quarterly product review..."
    }
  ],
  "exports": {
    "json": "https://api.v100.ai/v1/transcripts/tr_8f3a2b1c.json",
    "srt": "https://api.v100.ai/v1/transcripts/tr_8f3a2b1c.srt",
    "vtt": "https://api.v100.ai/v1/transcripts/tr_8f3a2b1c.vtt"
  },
  "pq_signature": "dilithium3:0x7a8b9c..."
}
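
As a rough illustration of consuming a response shaped like the one above, the sketch below flags low-confidence words for review and totals speaking time per speaker. The field names come from the sample response; the third word, the 0.85 threshold, and both helper functions are made up for the example and are not part of V100's API.

```python
import json

# Trimmed sample in the shape of the response above. The third word is invented.
SAMPLE = """
{
  "transcript_id": "tr_8f3a2b1c",
  "words": [
    {"text": "Welcome", "start": 0.240, "end": 0.680, "confidence": 0.98, "speaker": "Speaker A"},
    {"text": "to", "start": 0.680, "end": 0.800, "confidence": 0.99, "speaker": "Speaker A"},
    {"text": "Kubernetes", "start": 0.800, "end": 1.500, "confidence": 0.61, "speaker": "Speaker B"}
  ]
}
"""


def low_confidence_words(transcript, threshold=0.85):
    """Return (text, start) pairs for words scored below the threshold."""
    return [(w["text"], w["start"]) for w in transcript["words"]
            if w["confidence"] < threshold]


def talk_time_by_speaker(transcript):
    """Sum word durations (end - start) per speaker, in seconds."""
    totals = {}
    for w in transcript["words"]:
        totals[w["speaker"]] = totals.get(w["speaker"], 0.0) + (w["end"] - w["start"])
    return totals


transcript = json.loads(SAMPLE)
print(low_confidence_words(transcript))
print(talk_time_by_speaker(transcript))
```

Summing word durations counts only the time spent speaking, so pauses between words are excluded from each speaker's total.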

Real-Time vs File Upload

V100 supports two transcription modes. Choose based on whether your audio is live or pre-recorded.

Real-Time Transcription

For live meetings, webinars, and conferencing. Connect via WebSocket and receive transcript events as participants speak. Sub-second latency from speech to text. Speaker identification updates in real time as the AI learns voice signatures throughout the session.

WebSocket streaming

Sub-second latency

Interim + final results

Live speaker diarization
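
One way a client might handle interim and final results is sketched below. The event shape (a "type" of "interim" or "final" plus a "text" field) is an assumption for illustration; this page does not document the WebSocket message format, so check the real schema before relying on it. The frames are simulated strings, not a live connection.

```python
import json


class LiveTranscript:
    """Keeps committed (final) text separate from the latest interim guess."""

    def __init__(self):
        self.final_segments = []  # committed text, in arrival order
        self.pending = ""         # most recent interim hypothesis

    def apply(self, event):
        if event["type"] == "interim":
            # Interim text replaces the previous guess; it is never appended.
            self.pending = event["text"]
        elif event["type"] == "final":
            self.final_segments.append(event["text"])
            self.pending = ""

    def render(self):
        parts = self.final_segments + ([self.pending] if self.pending else [])
        return " ".join(parts)


# Simulated WebSocket text frames (hypothetical event shape).
frames = [
    '{"type": "interim", "text": "Welcome to"}',
    '{"type": "interim", "text": "Welcome to the quarterly"}',
    '{"type": "final", "text": "Welcome to the quarterly product review."}',
    '{"type": "interim", "text": "Thanks"}',
]

live = LiveTranscript()
for frame in frames:
    live.apply(json.loads(frame))
print(live.render())
```

Replacing rather than appending interim text keeps the display from showing the same words twice when the recognizer revises its guess.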

File Upload Transcription

For pre-recorded video and audio files. Upload via API or provide a URL. V100 processes the file asynchronously and delivers the transcript via webhook or polling. Accuracy is higher than in real-time mode because the model makes multiple passes over the audio to refine diarization and punctuation.

URL or direct file upload

Async with webhook callback

Higher accuracy (multi-pass)

Batch processing support
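
If you poll instead of using a webhook, a loop with exponential backoff is a common pattern. The sketch below assumes a caller-supplied fetch_status function that returns a dict with a "status" field. The "completed" value appears in the sample response on this page, while "processing" and "failed" are assumed names. The demo uses a fake status function and a fake sleep, so nothing touches the network.

```python
import time


def wait_for_transcript(fetch_status, transcript_id, max_wait=600.0,
                        first_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Poll until the job finishes, doubling the delay after each attempt."""
    delay, waited = first_delay, 0.0
    while True:
        result = fetch_status(transcript_id)
        if result["status"] in ("completed", "failed"):  # "failed" is an assumed value
            return result
        if waited + delay > max_wait:
            raise TimeoutError(f"{transcript_id} not finished after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # cap the backoff


# Demo: fake responses and a fake sleep that only records the requested delays.
_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "transcript_id": "tr_8f3a2b1c"},
])
delays = []
result = wait_for_transcript(lambda _id: next(_responses), "tr_8f3a2b1c",
                             sleep=delays.append)
print(result["status"], delays)
```

Passing the status function and sleep as parameters keeps the loop testable; in real use, fetch_status would wrap whatever HTTP call your client makes.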

Supported Languages

V100 transcribes in 40+ languages with automatic language detection. Here are the top 20 languages by accuracy and usage.

Language	Code	Real-Time	File Upload	Diarization
English	en	Yes	Yes	Yes
Spanish	es	Yes	Yes	Yes
French	fr	Yes	Yes	Yes
German	de	Yes	Yes	Yes
Portuguese	pt	Yes	Yes	Yes
Japanese	ja	Yes	Yes	Yes
Korean	ko	Yes	Yes	Yes
Mandarin Chinese	zh	Yes	Yes	Yes
Arabic	ar	Yes	Yes	Yes
Hindi	hi	Yes	Yes	Yes
Italian	it	Yes	Yes	Yes
Dutch	nl	Yes	Yes	Yes
Russian	ru	Yes	Yes	Yes
Turkish	tr	Yes	Yes	Yes
Polish	pl	Yes	Yes	Yes
Swedish	sv	Yes	Yes	Yes
Indonesian	id	Yes	Yes	Yes
Vietnamese	vi	Yes	Yes	Yes
Thai	th	Yes	Yes	Yes
Hebrew	he	Yes	Yes	Yes

Plus 20+ additional languages. Automatic language detection is available when "language": "auto" is specified.

Speaker Diarization

V100 automatically identifies and labels each speaker in a video. Every word in the transcript is tagged with a speaker identifier, enabling you to build features like color-coded transcripts, per-speaker summaries, and speaker-specific search.

Automatic Detection

No pre-registration required. The AI detects unique speakers from audio characteristics and assigns consistent labels throughout the transcript. Works with recordings of 2 to 20 or more speakers.

Per-Word Attribution

Diarization is not paragraph-level. Every individual word includes a speaker field, so overlapping speech and rapid speaker transitions are captured accurately. This enables precise speaker timelines.
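
To show how per-word speaker fields can become a speaker timeline, the sketch below merges consecutive words from the same speaker into turns. The word fields match the sample response on this page; the sample words after "to" and the speaker_turns helper are invented for the example.

```python
def speaker_turns(words):
    """Merge consecutive words by the same speaker into one turn each."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = w["end"]
            turns[-1]["text"] += " " + w["text"]
        else:
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["text"],
            })
    return turns


words = [
    {"text": "Welcome", "start": 0.24, "end": 0.68, "speaker": "Speaker A"},
    {"text": "to", "start": 0.68, "end": 0.80, "speaker": "Speaker A"},
    {"text": "Thanks", "start": 0.90, "end": 1.30, "speaker": "Speaker B"},
    {"text": "everyone", "start": 1.40, "end": 1.90, "speaker": "Speaker A"},
]

for turn in speaker_turns(words):
    print(f'{turn["start"]:.2f}-{turn["end"]:.2f} {turn["speaker"]}: {turn["text"]}')
```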

Speaker Naming

Speakers are labeled as "Speaker A", "Speaker B", etc. by default. Use the API to assign real names after transcription, which retroactively updates all exports (SRT, VTT, JSON) with named speakers.

Edit Video by Editing the Transcript

This is V100's most powerful feature and what separates it from standalone transcription APIs like Deepgram and AssemblyAI. Because V100 is a full video platform, the transcript and the video are linked at the word level. Delete a sentence from the transcript, and the corresponding video segment is removed. Rearrange paragraphs, and the video timeline rearranges. It works like Descript, but through an API.

Edit video via transcript manipulation

curl -X POST https://api.v100.ai/v1/editor/transcript-edit \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transcript_id": "tr_8f3a2b1c",
    "edits": [
      { "action": "delete", "word_range": [42, 67], "reason": "Remove tangent" },
      { "action": "delete_speaker", "speaker": "Speaker C", "reason": "Remove interviewer questions" },
      { "action": "delete_filler", "patterns": ["um", "uh", "like", "you know"] }
    ],
    "output": { "format": "mp4", "resolution": "1080p" }
  }'

// V100 cuts the video at the exact word boundaries.
// No timeline editor. No FFmpeg. Just text manipulation.

Delete words

Video frames removed automatically

Remove speakers

All segments from a speaker cut at once

Strip filler words

Um, uh, like, you know — gone
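
A client usually needs to work out which word indices to delete before it sends an edit request. The sketch below locates a phrase in a word list and builds a request body in the shape of the example above. It assumes that word_range is a zero-based, inclusive [first, last] pair of indices into the transcript's words array, which this page does not confirm. The find_phrase helper and the sample sentence are illustrative only.

```python
import json
import string


def _norm(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.strip(string.punctuation).lower()


def find_phrase(words, phrase):
    """Return [first, last] indices of the first match of phrase, or None."""
    target = [_norm(t) for t in phrase.split()]
    if not target:
        return None
    tokens = [_norm(w["text"]) for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return [i, i + len(target) - 1]
    return None


# Stand-in for the "words" array of a transcript (text only, timing omitted).
words = [{"text": t} for t in "Welcome to the, um, quarterly product review.".split()]

word_range = find_phrase(words, "quarterly product")
payload = {
    "transcript_id": "tr_8f3a2b1c",
    # ASSUMPTION: word_range is zero-based and inclusive.
    "edits": [{"action": "delete", "word_range": word_range, "reason": "Remove tangent"}],
    "output": {"format": "mp4", "resolution": "1080p"},
}
print(json.dumps(payload, indent=2))
```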

Export Formats

Every transcript can be exported in multiple formats simultaneously. Specify the formats you need in the API request and all are generated in a single pass.

.srt

SubRip Subtitle

The most widely supported subtitle format. Compatible with every major video player, YouTube, Vimeo, and social media platforms. Sequential numbering with start/end timestamps and text segments.
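
For reference, the SRT layout described above (a sequence number, a start and end timestamp in HH:MM:SS,mmm form, then the text) can be produced from the JSON paragraphs with a few lines. This is a sketch of the file format only. How V100 actually splits its own SRT export into cues is not stated here, and one cue per paragraph is an assumption.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(paragraphs):
    """Render one numbered SRT cue per paragraph, separated by blank lines."""
    blocks = []
    for number, p in enumerate(paragraphs, start=1):
        timing = f"{srt_timestamp(p['start'])} --> {srt_timestamp(p['end'])}"
        blocks.append(f"{number}\n{timing}\n{p['text']}")
    return "\n\n".join(blocks) + "\n"


paragraphs = [
    {"speaker": "Speaker A", "start": 0.240, "end": 14.920,
     "text": "Welcome to the quarterly product review..."},
]
print(to_srt(paragraphs))
```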

.vtt

WebVTT

The web-native subtitle format. Used with HTML5 video elements, HLS streams, and DASH manifests. Supports styling, positioning, and cue settings that SRT does not. The right choice for web-based video players.
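
The sketch below writes a minimal WebVTT file from JSON paragraphs to show two things SRT lacks: voice spans (<v Speaker A>) and cue settings such as line and align. It illustrates the format and is not a description of V100's own VTT export. The to_vtt helper and the chosen cue settings are examples.

```python
import html


def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a dot before milliseconds)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt(paragraphs, settings=""):
    """Render one cue per paragraph, tagging each with a voice span."""
    lines = ["WEBVTT", ""]
    for p in paragraphs:
        timing = f"{vtt_timestamp(p['start'])} --> {vtt_timestamp(p['end'])}"
        if settings:
            timing += " " + settings
        # Escape &, < and > so cue text is not parsed as markup.
        text = html.escape(p["text"], quote=False)
        lines += [timing, f"<v {p['speaker']}>{text}", ""]
    return "\n".join(lines)


paragraphs = [
    {"speaker": "Speaker A", "start": 0.240, "end": 14.920,
     "text": "Welcome to the quarterly product review..."},
]
vtt = to_vtt(paragraphs, settings="line:85% align:center")
print(vtt)
```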

.json

Structured JSON

Full transcript data with word-level timestamps, confidence scores, speaker labels, and paragraph segmentation. The format to use when building custom UIs, search indexes, or analytics on transcript content.
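
As one example of building search on the JSON export, the sketch below creates a small inverted index from normalized words to the times and speakers at which they occur. The word fields follow the sample response on this page; the sample words and the build_index helper are illustrative.

```python
import string
from collections import defaultdict


def build_index(words):
    """Map each lowercased, punctuation-stripped word to its (start, speaker) hits."""
    index = defaultdict(list)
    for w in words:
        key = w["text"].strip(string.punctuation).lower()
        if key:  # skip tokens that were punctuation only
            index[key].append((w["start"], w["speaker"]))
    return index


words = [
    {"text": "Welcome", "start": 0.24, "end": 0.68, "speaker": "Speaker A"},
    {"text": "to", "start": 0.68, "end": 0.80, "speaker": "Speaker A"},
    {"text": "Review.", "start": 2.10, "end": 2.60, "speaker": "Speaker A"},
    {"text": "review", "start": 9.50, "end": 9.95, "speaker": "Speaker B"},
]

index = build_index(words)
print(index["review"])
```

Each hit carries a start time, so a search result can link straight to the matching moment in the video.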

Transcription API Comparison

How V100's transcription API compares to Deepgram, AssemblyAI, and OpenAI Whisper.

Feature	V100	Deepgram	AssemblyAI	Whisper
Word-level timestamps	Yes	Yes	Yes	Yes
Speaker diarization	Yes	Yes	Yes	No
Real-time streaming	Yes	Yes	Yes	No
Languages	40+	36+	English-focused	99+
Confidence scores	Per-word	Per-word	Per-word	Per-segment
Transcript-based video editing	Yes	No	No	No
Video conferencing built in	Yes	No	No	No
Auto-captioning	Yes (burned-in + SRT/VTT)	No	SRT only	No
Publishing / CDN	Yes	No	No	No
Post-quantum signed transcripts	Yes (ML-DSA)	No	No	No
Translation	Yes (40+ target languages)	No	No	English output only
Self-hosted option	Enterprise	Enterprise	No	Yes (open-source)
HIPAA / BAA	All plans	Enterprise	Enterprise	Self-hosted only

Comparison data accurate as of March 2026. V100 is a full video platform; Deepgram and AssemblyAI are speech-to-text APIs; Whisper is an open-source model.

Pricing

Transcription is included in V100's platform pricing. No separate transcription bill. No per-word charges. No surprise overages.

Free Tier

10 hours/month of transcription

All 40+ languages
Word-level timestamps
Speaker diarization
SRT, VTT, JSON export

Growth

$0.006/min

Unlimited transcription

Everything in Free
Real-time streaming
Transcript-based editing
Translation
Batch processing
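
For budgeting, the per-minute rate converts to a cost per recording with simple arithmetic. The sketch below assumes usage is pro-rated by the second with no minimums or rounding to whole minutes. This page does not state how billing is rounded, so treat the result as an estimate. The estimate_cost helper is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

GROWTH_RATE_PER_MIN = Decimal("0.006")  # Growth tier rate listed above


def estimate_cost(duration_seconds, rate_per_min=GROWTH_RATE_PER_MIN):
    """Estimate cost in dollars, assuming per-second pro-rating (an assumption)."""
    cost = Decimal(str(duration_seconds)) * rate_per_min / Decimal(60)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(1847.3))     # duration_seconds from the sample response
print(estimate_cost(10 * 3600))  # ten hours of audio
```

Decimal is used instead of float so that small per-minute rates do not pick up binary rounding noise when summed across many recordings.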

Enterprise

Custom

Volume discounts + SLA

Everything in Growth
Custom vocabulary models
PQ-signed transcripts
HIPAA + BAA
Dedicated support

Start Transcribing in 5 Minutes

Get an API key, send a POST request with a video URL, and receive a full transcript with timestamps, speakers, and confidence scores. Free tier includes 10 hours per month. No credit card required.
