<300ms Transcription Latency · Word-Level Timestamp Granularity
What You Will Build
This tutorial produces a complete transcription system for video calls and recordings. You will build six pieces that work together: a live caption overlay that displays speech in real time, a synchronized transcript view with word-level highlighting during playback, speaker diarization that color-codes each participant, full-text search across the entire transcript, export to SRT, VTT, and JSON formats, and an edit-by-transcript interface where clicking any word seeks the video to that timestamp and deleting a word range cuts that segment from the video.
✓ Live caption overlay during calls
✓ Word-level timestamps
✓ Speaker diarization (color-coded)
✓ Full-text transcript search
✓ Export to SRT, VTT, JSON
✓ Edit-by-transcript (click to seek)
Transcription Data Flow
Audio Track V100 STT Engine React App
| | |
|--- audio stream --------->| |
| |--- interim result ------->| CaptionOverlay
| |--- final result --------->| TranscriptView
| | (word timestamps, |
| | speaker labels, |
| | confidence scores) |
| | |
| [Post-call: full transcript JSON with diarization]
Step 1 — V100 Transcription Setup
V100's transcription works in two modes: real-time during a live video call (captions arrive over the existing signaling WebSocket) and post-call on a recorded video file (submit the recording URL and get back a structured transcript). This tutorial covers both. Start by enabling transcription when you create a meeting.
Enable transcription on meeting creation
const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;
const createMeetingWithTranscription = async () => {
const res = await fetch(`${API}/api/meetings`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
title: 'Transcription Demo',
transcription: {
enabled: true,
language: 'en',
showCaptions: true,
wordTimestamps: true,
diarization: true,
},
}),
}).then(r => r.json());
return res;
};
For post-call transcription of a recorded file, use the dedicated transcription endpoint:
Transcribe a recorded video file
const transcribeRecording = async (videoUrl) => {
const res = await fetch(`${API}/api/transcription`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url: videoUrl,
language: 'auto',
wordTimestamps: true,
diarization: true,
format: 'verbose_json',
}),
}).then(r => r.json());
return res;
};
Language auto-detection identifies the spoken language within the first 5 seconds and switches models automatically. For multilingual meetings where speakers switch languages mid-sentence, set language: 'multi' to enable continuous language detection.
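As a sketch, the transcription block for a multilingual meeting would look like this (the same fields as the meeting-creation call above, with only the language value swapped):

```javascript
// Transcription settings for meetings where speakers switch languages
// mid-sentence: 'multi' enables continuous language detection; the other
// fields match the meeting-creation call above.
const multilingualTranscription = {
  enabled: true,
  language: 'multi',
  showCaptions: true,
  wordTimestamps: true,
  diarization: true,
};
```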
Step 2 — TranscriptionOverlay Component
The TranscriptionOverlay component renders live captions during a video call. It listens for transcription events on the WebSocket and displays both interim (partial) and final results. Interim results update in real time as the speaker talks; final results replace them when the phrase is complete.
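The handler below assumes WebSocket messages shaped roughly like this. This is an illustrative sketch based on the fields the handler reads (type, isFinal, text, speaker); the exact payload may include additional fields:

```javascript
// An interim (partial) result: replaced as the speaker keeps talking.
const interimMsg = {
  type: 'transcription',
  isFinal: false,
  text: 'so the next step is to',
  speaker: 'Speaker 1',
};

// A final result: the completed phrase that replaces the interim text.
const finalMsg = {
  type: 'transcription',
  isFinal: true,
  text: 'So the next step is to enable diarization.',
  speaker: 'Speaker 1',
};
```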
src/components/TranscriptionOverlay.jsx
import { useState, useEffect, useRef } from 'react';
export function TranscriptionOverlay({ wsRef }) {
const [lines, setLines] = useState([]);
const [interim, setInterim] = useState('');
const fadeTimers = useRef([]);
useEffect(() => {
const ws = wsRef.current;
if (!ws) return;
const handler = (event) => {
const msg = JSON.parse(event.data);
if (msg.type !== 'transcription') return;
if (!msg.isFinal) {
setInterim(msg.text);
return;
}
setInterim('');
const line = {
id: Date.now(),
speaker: msg.speaker,
text: msg.text,
};
setLines(prev => [...prev.slice(-2), line]);
const timer = setTimeout(() => {
setLines(prev => prev.filter(l => l.id !== line.id));
}, 6000);
fadeTimers.current.push(timer);
};
ws.addEventListener('message', handler);
return () => {
ws.removeEventListener('message', handler);
fadeTimers.current.forEach(clearTimeout);
};
}, [wsRef]);
return (
<div style={{
position: 'absolute', bottom: 80, left: '50%',
transform: 'translateX(-50%)',
maxWidth: '80%', textAlign: 'center',
zIndex: 50,
}}>
{lines.map(line => (
<div key={line.id} style={{
background: 'rgba(0,0,0,0.85)', color: '#fff',
padding: '6px 16px', borderRadius: 8,
marginBottom: 4, fontSize: 15,
}}>
<strong style={{ color: '#818cf8' }}>{line.speaker}:</strong> {line.text}
</div>
))}
{interim && (
<div style={{
background: 'rgba(0,0,0,0.6)', color: '#aaa',
padding: '6px 16px', borderRadius: 8,
fontSize: 14, fontStyle: 'italic',
}}>
{interim}
</div>
)}
</div>
);
}
The component keeps at most three final lines on screen (each fades after 6 seconds), with the current interim line shown below them. This matches the caption UX viewers expect from YouTube and Zoom. The speaker name is rendered in the V100 brand color to make it clear who is talking.
Step 3 — Word-Level Timestamps
V100's verbose_json format returns every word with its exact start and end time. This enables precise highlighting during video playback — the current word lights up as the video plays, similar to karaoke lyrics. The TranscriptView component below takes a full transcript and a video ref, and highlights the active word based on the current playback position.
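The component assumes a transcript shaped roughly like this. The sketch below is inferred from the fields the tutorial's components read (segments with speaker labels, words with start/end times and confidence scores), not an exact copy of the API response:

```javascript
// Illustrative verbose_json transcript shape.
const transcript = {
  segments: [
    {
      speaker: 'Speaker 1',
      start: 0.0,
      end: 1.1,
      text: 'Hello everyone',
      words: [
        { text: 'Hello', start: 0.0, end: 0.4, confidence: 0.98 },
        { text: 'everyone', start: 0.5, end: 1.1, confidence: 0.97 },
      ],
    },
  ],
};

// The same flatten used by TranscriptView: one flat word list, with each
// word carrying its segment's speaker label.
const words = transcript.segments.flatMap(seg =>
  seg.words.map(w => ({ ...w, speaker: seg.speaker }))
);
```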
src/components/TranscriptView.jsx
import { useState, useEffect, useMemo, useCallback } from 'react';
export function TranscriptView({ transcript, videoRef, onWordClick }) {
const [activeIdx, setActiveIdx] = useState(-1);
// Memoize the flattened word list so the timeupdate effect below
// does not re-subscribe on every render.
const words = useMemo(() => transcript.segments.flatMap(seg =>
seg.words.map(w => ({ ...w, speaker: seg.speaker }))
), [transcript]);
useEffect(() => {
const video = videoRef.current;
if (!video) return;
const onTimeUpdate = () => {
const t = video.currentTime;
const idx = words.findIndex(
w => t >= w.start && t < w.end
);
setActiveIdx(idx);
};
video.addEventListener('timeupdate', onTimeUpdate);
return () => video.removeEventListener('timeupdate', onTimeUpdate);
}, [videoRef, words]);
const handleClick = useCallback((word, idx) => {
if (videoRef.current) {
videoRef.current.currentTime = word.start;
}
onWordClick?.(word, idx);
}, [videoRef, onWordClick]);
return (
<div style={{ padding: 20, lineHeight: 2 }}>
{words.map((word, i) => (
<span
key={i}
onClick={() => handleClick(word, i)}
style={{
cursor: 'pointer',
padding: '2px 3px',
borderRadius: 4,
background: i === activeIdx ? 'rgba(99,102,241,0.3)' : 'transparent',
color: i === activeIdx ? '#fff' : '#888',
transition: 'all 0.15s',
}}>
{word.text}{' '}
</span>
))}
</div>
);
}
Each word is a clickable <span>. Clicking a word seeks the video to that word's start timestamp. During playback, the timeupdate event fires roughly four times per second, and the component highlights whichever word's time range contains the current playback position, so the highlight tracks the speech closely.
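If the quarter-second timeupdate cadence feels laggy for fast speech, one option is to poll currentTime on every animation frame instead. This is an optional refinement of ours, not part of any V100 SDK:

```javascript
// Pure helper: index of the word whose [start, end) range contains t,
// or -1 when t falls in a gap between words.
const findActiveIndex = (words, t) =>
  words.findIndex(w => t >= w.start && t < w.end);

// Poll currentTime on every animation frame rather than on timeupdate.
// Returns a cleanup function to call when the component unmounts.
function trackActiveWord(video, words, onChange) {
  let rafId;
  let lastIdx = -1;
  const tick = () => {
    const idx = findActiveIndex(words, video.currentTime);
    if (idx !== lastIdx) {
      lastIdx = idx;
      onChange(idx); // e.g. setActiveIdx(idx)
    }
    rafId = requestAnimationFrame(tick);
  };
  rafId = requestAnimationFrame(tick);
  return () => cancelAnimationFrame(rafId);
}
```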
Step 4 — Speaker Diarization
When diarization: true is set, V100 identifies distinct speakers and labels each segment. The transcript response includes a speaker field on every segment. Build a color-coded transcript by assigning each speaker a unique color.
src/components/DiarizedTranscript.jsx
const SPEAKER_COLORS = [
'#818cf8',
'#4ade80',
'#fb923c',
'#f87171',
'#22d3ee',
'#fbbf24',
];
export function DiarizedTranscript({ transcript }) {
const speakers = [...new Set(
transcript.segments.map(s => s.speaker)
)];
const colorMap = Object.fromEntries(
speakers.map((s, i) => [s, SPEAKER_COLORS[i % SPEAKER_COLORS.length]])
);
return (
<div style={{ padding: 20 }}>
{transcript.segments.map((seg, i) => (
<div key={i} style={{ marginBottom: 16 }}>
<div style={{ display: 'flex', gap: 10, alignItems: 'baseline' }}>
<span style={{
color: colorMap[seg.speaker],
fontWeight: 700,
fontSize: 13,
minWidth: 90,
}}>{seg.speaker}</span>
<span style={{
color: '#555', fontSize: 11,
fontFamily: 'JetBrains Mono, monospace',
}}>
{formatTime(seg.start)}
</span>
</div>
<p style={{
margin: '4px 0 0', color: '#ccc',
borderLeft: `3px solid ${colorMap[seg.speaker]}`,
paddingLeft: 12,
}}>{seg.text}</p>
</div>
))}
</div>
);
}
function formatTime(seconds) {
const m = Math.floor(seconds / 60);
const s = Math.floor(seconds % 60).toString().padStart(2, '0');
return `${m}:${s}`;
}
V100's diarization uses voice embeddings to distinguish speakers. It works without any prior enrollment — the model clusters voices automatically during the conversation. For meetings with 2-6 speakers, accuracy is typically above 95%. Speaker labels are consistent within a session: Speaker 1 is always the same person throughout the transcript.
Step 5 — Transcript Search
Full-text search across a transcript lets users find specific moments in a recording instantly. The TranscriptSearch component performs client-side substring matching across all segments and highlights the matches. Clicking a search result seeks the video to that timestamp.
src/components/TranscriptSearch.jsx
import { useState, useMemo } from 'react';
export function TranscriptSearch({ transcript, videoRef }) {
const [query, setQuery] = useState('');
const results = useMemo(() => {
if (!query.trim() || query.length < 2) return [];
const q = query.toLowerCase();
return transcript.segments
.filter(seg => seg.text.toLowerCase().includes(q))
.map(seg => {
const idx = seg.text.toLowerCase().indexOf(q);
return {
...seg,
before: seg.text.slice(Math.max(0, idx - 40), idx),
match: seg.text.slice(idx, idx + query.length),
after: seg.text.slice(idx + query.length, idx + query.length + 40),
};
});
}, [query, transcript]);
const seekTo = (time) => {
if (videoRef.current) videoRef.current.currentTime = time;
};
return (
<div style={{ padding: 16 }}>
<input
value={query}
onChange={e => setQuery(e.target.value)}
placeholder="Search transcript..."
style={{
width: '100%', padding: '10px 16px',
background: '#161616', border: '1px solid #252525',
borderRadius: 8, color: '#e8e8e8', fontSize: 14,
}}
/>
<div style={{ marginTop: 12 }}>
{results.map((r, i) => (
<div
key={i}
onClick={() => seekTo(r.start)}
style={{
padding: '10px 14px', cursor: 'pointer',
borderBottom: '1px solid #252525',
fontSize: 13,
}}
>
<span style={{ color: '#555', fontSize: 11 }}>
{formatTime(r.start)} — {r.speaker}
</span>
<div style={{ color: '#888' }}>
...{r.before}<mark style={{
background: 'rgba(99,102,241,0.3)',
color: '#fff', borderRadius: 3,
}}>{r.match}</mark>{r.after}...
</div>
</div>
))}
</div>
</div>
);
}
function formatTime(s) {
return `${Math.floor(s / 60)}:${Math.floor(s % 60).toString().padStart(2, '0')}`;
}
This is a pure client-side search that runs in the browser. For recordings under an hour, substring matching on the flat transcript text is instant. For longer recordings or multi-meeting search, V100's GET /api/transcription/search endpoint provides server-side full-text search with relevance ranking across all your transcripts.
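A minimal sketch of calling that server-side endpoint. The path comes from the paragraph above, but the query parameter names (q, limit) are assumptions to verify against the API reference:

```javascript
// Build the search URL; the 'q' and 'limit' parameter names are assumed.
const buildSearchUrl = (base, query, limit = 10) => {
  const params = new URLSearchParams({ q: query, limit: String(limit) });
  return `${base}/api/transcription/search?${params}`;
};

// Sketch of the call itself; the API key is passed in as an argument.
const searchTranscripts = async (query, apiKey) => {
  const res = await fetch(buildSearchUrl('https://api.v100.ai', query), {
    headers: { 'Authorization': `Bearer ${apiKey}` },
  });
  return res.json();
};
```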
Step 6 — Export to SRT, VTT, and JSON
Users expect to download transcripts in standard formats for use in video editors, LMS platforms, and compliance archives. The exportTranscript function converts V100's verbose JSON into SRT, VTT, or plain JSON downloads.
src/utils/exportTranscript.js
export function exportTranscript(transcript, format) {
let content, mime, ext;
if (format === 'srt') {
content = transcript.segments.map((seg, i) =>
[
i + 1,
`${srtTime(seg.start)} --> ${srtTime(seg.end)}`,
seg.text,
'',
].join('\n')
).join('\n');
mime = 'text/srt';
ext = 'srt';
}
if (format === 'vtt') {
const cues = transcript.segments.map(seg =>
`${vttTime(seg.start)} --> ${vttTime(seg.end)}\n${seg.speaker}: ${seg.text}`
).join('\n\n');
content = `WEBVTT\n\n${cues}`;
mime = 'text/vtt';
ext = 'vtt';
}
if (format === 'json') {
content = JSON.stringify(transcript, null, 2);
mime = 'application/json';
ext = 'json';
}
// Guard against typos in the format argument, which would otherwise
// produce an empty download.
if (content === undefined) throw new Error(`Unsupported export format: ${format}`);
const blob = new Blob([content], { type: mime });
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = `transcript.${ext}`;
a.click();
URL.revokeObjectURL(url);
}
function srtTime(s) {
const h = Math.floor(s / 3600).toString().padStart(2, '0');
const m = Math.floor((s % 3600) / 60).toString().padStart(2, '0');
const sec = Math.floor(s % 60).toString().padStart(2, '0');
const ms = Math.floor((s % 1) * 1000).toString().padStart(3, '0');
return `${h}:${m}:${sec},${ms}`;
}
function vttTime(s) {
return srtTime(s).replace(',', '.');
}
SRT and VTT are the standard subtitle formats accepted by YouTube, Vimeo, video editors (Premiere, DaVinci Resolve), and LMS platforms. The JSON export preserves all metadata including word-level timestamps, speaker labels, and confidence scores — useful for building custom integrations or feeding into downstream AI pipelines.
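As a quick sanity check of the timestamp helpers, 3661.25 seconds should format as 01:01:01,250 in SRT and 01:01:01.250 in VTT (helpers repeated here so the snippet runs standalone):

```javascript
// Same helpers as in exportTranscript.js above.
function srtTime(s) {
  const h = Math.floor(s / 3600).toString().padStart(2, '0');
  const m = Math.floor((s % 3600) / 60).toString().padStart(2, '0');
  const sec = Math.floor(s % 60).toString().padStart(2, '0');
  const ms = Math.floor((s % 1) * 1000).toString().padStart(3, '0');
  return `${h}:${m}:${sec},${ms}`;
}
const vttTime = (s) => srtTime(s).replace(',', '.');

console.log(srtTime(3661.25)); // "01:01:01,250"
console.log(vttTime(3661.25)); // "01:01:01.250"
```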
Tip: You can also request SRT or VTT directly from V100's API without client-side conversion. Pass format: 'srt' or format: 'vtt' to the POST /api/transcription endpoint and the response body will be the formatted subtitle file.
Step 7 — Edit by Transcript
Edit-by-transcript is the most powerful feature in this tutorial. Users select a range of words in the transcript and delete them, which creates a video edit that removes the corresponding time segment. This is how Descript and similar tools work. V100's Video Editing API handles the actual video cutting server-side.
src/components/EditableTranscript.jsx
import { useState, useEffect, useCallback } from 'react';
export function EditableTranscript({ transcript, videoRef, onCut }) {
const [selection, setSelection] = useState({ start: -1, end: -1 });
const [deleted, setDeleted] = useState(new Set());
const [dragging, setDragging] = useState(false);
const words = transcript.segments.flatMap(seg =>
seg.words.map(w => ({ ...w, speaker: seg.speaker }))
);
const handleMouseDown = (i) => {
setDragging(true);
setSelection({ start: i, end: i });
};
// Extend the selection only while the mouse button is held down;
// otherwise hovering after release would keep growing the range.
const handleMouseOver = (i) => {
if (dragging) setSelection(prev => ({ ...prev, end: i }));
};
useEffect(() => {
// End the drag on mouseup anywhere on the page.
const stopDrag = () => setDragging(false);
window.addEventListener('mouseup', stopDrag);
return () => window.removeEventListener('mouseup', stopDrag);
}, []);
const handleDelete = useCallback(() => {
const lo = Math.min(selection.start, selection.end);
const hi = Math.max(selection.start, selection.end);
const next = new Set(deleted);
for (let i = lo; i <= hi; i++) next.add(i);
setDeleted(next);
const cutStart = words[lo].start;
const cutEnd = words[hi].end;
onCut?.({ start: cutStart, end: cutEnd });
setSelection({ start: -1, end: -1 });
}, [selection, deleted, words, onCut]);
const isSelected = (i) => {
const lo = Math.min(selection.start, selection.end);
const hi = Math.max(selection.start, selection.end);
return i >= lo && i <= hi;
};
return (
<div style={{ padding: 20, lineHeight: 2.2, userSelect: 'none' }}>
{words.map((w, i) => (
<span
key={i}
onMouseDown={() => handleMouseDown(i)}
onMouseOver={() => handleMouseOver(i)}
style={{
cursor: 'pointer',
padding: '2px 3px',
borderRadius: 3,
background: isSelected(i) ? 'rgba(248,113,113,0.3)' : 'transparent',
textDecoration: deleted.has(i) ? 'line-through' : 'none',
color: deleted.has(i) ? '#555' : '#ccc',
}}>
{w.text}{' '}
</span>
))}
{selection.start >= 0 && (
<button
onClick={handleDelete}
style={{
position: 'fixed', bottom: 30, right: 30,
background: '#f87171', color: '#fff',
border: 'none', padding: '10px 20px',
borderRadius: 8, cursor: 'pointer',
fontWeight: 700,
}}>
Delete Selected
</button>
)}
</div>
);
}
When the user selects a word range and clicks Delete Selected, the onCut callback fires with the start and end timestamps. You can then call V100's video editing API to produce a new version of the recording with that segment removed:
Create a video cut via V100 API
const cutVideoSegment = async (recordingId, cuts) => {
const result = await fetch(`https://api.v100.ai/api/recordings/${recordingId}/edit`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
operations: cuts.map(c => ({
type: 'cut',
start: c.start,
end: c.end,
})),
}),
}).then(r => r.json());
return result;
};
The edited video is produced server-side and returned as a new URL. The original recording is never modified, so you can always undo by referencing the original. Multiple cuts can be batched in a single request for complex edits.
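One practical wrinkle: ranges deleted from the transcript can overlap or touch. A small helper (ours, not part of the V100 API) can merge them into a minimal operation list before sending the edit request:

```javascript
// Merge overlapping or touching cut ranges into a minimal set of
// { start, end } operations, sorted by start time.
function mergeCuts(cuts) {
  const sorted = [...cuts].sort((a, b) => a.start - b.start);
  const merged = [];
  for (const c of sorted) {
    const last = merged[merged.length - 1];
    if (last && c.start <= last.end) {
      // Overlap or adjacency: extend the previous range.
      last.end = Math.max(last.end, c.end);
    } else {
      merged.push({ start: c.start, end: c.end });
    }
  }
  return merged;
}
```

With this in place, `cutVideoSegment(recordingId, mergeCuts(pendingCuts))` sends one request covering all edits.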
V100 vs Deepgram vs AssemblyAI vs Whisper
Choosing a transcription provider for a React app means evaluating latency, accuracy, language support, and how much integration work you need to do. Here is how V100 compares to the main alternatives:
| | V100 | Deepgram | AssemblyAI | Whisper (self-hosted) |
| --- | --- | --- | --- | --- |
| Real-time latency | <300ms | ~300ms | ~400ms | 1-5s (depends on GPU) |
| Word-level timestamps | Yes | Yes | Yes | Yes (verbose mode) |
| Speaker diarization | Built-in | Built-in | Built-in | Requires pyannote (extra setup) |
| Video call integration | Native (same WebSocket) | Separate integration | Separate integration | Build everything yourself |
| Video editing from transcript | API endpoint | Not available | Not available | Not available |
| Languages | 40+ | 36+ | 30+ | 99+ |
| React components | This tutorial | None provided | None provided | None provided |
| GPU infrastructure | Managed | Managed | Managed | You provision and maintain |
| Pricing | Free tier, then $0.004/min | $0.0043/min | $0.0065/min | GPU cost ($0.50-2/hr) |
V100's key advantage for React developers is the unified platform. If you are already using V100 for video calls, transcription is a config flag — not a separate API integration. The captions arrive on the same WebSocket you use for signaling, and the transcript is linked to the recording. Deepgram and AssemblyAI are strong standalone transcription services, but integrating them with a video call means piping audio to a second API and correlating the results. Whisper is free and supports the most languages, but self-hosting requires GPU infrastructure and real-time performance is difficult to achieve without careful optimization.
Pricing
V100's free tier includes 100 API calls per month, which covers transcription for short meetings and testing. No credit card required.
- Free — 100 API calls/month, real-time + post-call transcription, 40+ languages. No credit card.
- Pro — $0.004/minute of audio. Word-level timestamps, diarization, SRT/VTT export, video editing API.
- Enterprise — Volume pricing, custom language models, on-prem deployment option, SLA.
See the pricing page for full details.
Add Transcription to Your App Today
Get your free API key and start building. Live captions, word-level timestamps, speaker diarization, and edit-by-transcript — all from one API.
Get Your Free API Key