<300ms Transcription Latency · Word-Level Timestamp Granularity
What You Will Build
This tutorial produces a complete transcription system for video calls and recordings. You will build six pieces that work together: a live caption overlay that displays speech in real time, a synchronized transcript view with word-level highlighting during playback, speaker diarization that color-codes each participant, full-text search across the entire transcript, export to SRT, VTT, and JSON formats, and an edit-by-transcript interface where clicking any word seeks the video to that timestamp and deleting a word range cuts that segment from the video.
✓ Live caption overlay during calls
✓ Word-level timestamps
✓ Speaker diarization (color-coded)
✓ Full-text transcript search
✓ Export to SRT, VTT, JSON
✓ Edit-by-transcript (click to seek)
Transcription Data Flow
Audio Track V100 STT Engine React App
| | |
|--- audio stream --------->| |
| |--- interim result ------->| CaptionOverlay
| |--- final result --------->| TranscriptView
| | (word timestamps, |
| | speaker labels, |
| | confidence scores) |
| | |
| [Post-call: full transcript JSON with diarization]
Step 1 — V100 Transcription Setup
V100's transcription works in two modes: real-time during a live video call (captions arrive over the existing signaling WebSocket) and post-call on a recorded video file (submit the recording URL and get back a structured transcript). This tutorial covers both. Start by enabling transcription when you create a meeting.
Enable transcription on meeting creation
const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;
const createMeetingWithTranscription = async () => {
const res = await fetch(`${API}/api/meetings`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
title: 'Transcription Demo',
transcription: {
enabled: true,
language: 'en',
showCaptions: true,
wordTimestamps: true,
diarization: true,
},
}),
}).then(r => r.json());
return res;
};
For post-call transcription of a recorded file, use the dedicated transcription endpoint:
Transcribe a recorded video file
const transcribeRecording = async (videoUrl) => {
const res = await fetch(`${API}/api/transcription`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url: videoUrl,
language: 'auto',
wordTimestamps: true,
diarization: true,
format: 'verbose_json',
}),
}).then(r => r.json());
return res;
};
Language auto-detection identifies the spoken language within the first 5 seconds and switches models automatically. For multilingual meetings where speakers switch languages mid-sentence, set language: 'multi' to enable continuous language detection.
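As a sketch, the transcription block for a multilingual meeting would look like this (the same fields as the meeting-creation call above, with only the language value swapped):

```javascript
// Transcription settings for meetings where speakers switch languages
// mid-sentence: 'multi' enables continuous language detection; the other
// fields match the meeting-creation call above.
const multilingualTranscription = {
  enabled: true,
  language: 'multi',
  showCaptions: true,
  wordTimestamps: true,
  diarization: true,
};
```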
Step 2 — TranscriptionOverlay Component
The TranscriptionOverlay component renders live captions during a video call. It listens for transcription events on the WebSocket and displays both interim (partial) and final results. Interim results update in real time as the speaker talks; final results replace them when the phrase is complete.
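The handler below assumes WebSocket messages shaped roughly like this. This is an illustrative sketch based on the fields the handler reads (type, isFinal, text, speaker); the exact payload may include additional fields:

```javascript
// An interim (partial) result: replaced as the speaker keeps talking.
const interimMsg = {
  type: 'transcription',
  isFinal: false,
  text: 'so the next step is to',
  speaker: 'Speaker 1',
};

// A final result: the completed phrase that replaces the interim text.
const finalMsg = {
  type: 'transcription',
  isFinal: true,
  text: 'So the next step is to enable diarization.',
  speaker: 'Speaker 1',
};
```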
src/components/TranscriptionOverlay.jsx
import { useState, useEffect, useRef } from 'react';
export function TranscriptionOverlay({ wsRef }) {
const [lines, setLines] = useState([]);
const [interim, setInterim] = useState('');
const fadeTimers = useRef([]);
useEffect(() => {
const ws = wsRef.current;
if (!ws) return;
const handler = (event) => {
const msg = JSON.parse(event.data);
if (msg.type !== 'transcription') return;
if (!msg.isFinal) {
setInterim(msg.text);
return;
}
setInterim('');
const line = {
id: Date.now(),
speaker: msg.speaker,
text: msg.text,
};
setLines(prev => [...prev.slice(-2), line]);
const timer = setTimeout(() => {
setLines(prev => prev.filter(l => l.id !== line.id));
}, 6000);
fadeTimers.current.push(timer);
};
ws.addEventListener('message', handler);
return () => {
ws.removeEventListener('message', handler);
fadeTimers.current.forEach(clearTimeout);
};
}, [wsRef]);
return (
<div style={{
position: 'absolute', bottom: 80, left: '50%',
transform: 'translateX(-50%)',
maxWidth: '80%', textAlign: 'center',
zIndex: 50,
}}>
{lines.map(line => (
<div key={line.id} style={{
background: 'rgba(0,0,0,0.85)', color: '#fff',
padding: '6px 16px', borderRadius: 8,
marginBottom: 4, fontSize: 15,
}}>
<strong style={{ color: '#818cf8' }}>{line.speaker}:</strong> {line.text}
</div>
))}
{interim && (
<div style={{
background: 'rgba(0,0,0,0.6)', color: '#aaa',
padding: '6px 16px', borderRadius: 8,
fontSize: 14, fontStyle: 'italic',
}}>
{interim}
</div>
)}
</div>
);
}
The component keeps at most three final lines on screen (each fades after 6 seconds), with the current interim line shown below them. This matches the caption UX viewers expect from YouTube and Zoom. The speaker name is rendered in the V100 brand color to make it clear who is talking.
Step 3 — Word-Level Timestamps
V100's verbose_json format returns every word with its exact start and end time. This enables precise highlighting during video playback — the current word lights up as the video plays, similar to karaoke lyrics. The TranscriptView component below takes a full transcript and a video ref, and highlights the active word based on the current playback position.
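The component assumes a transcript shaped roughly like this. The sketch below is inferred from the fields the tutorial's components read (segments with speaker labels, words with start/end times and confidence scores), not an exact copy of the API response:

```javascript
// Illustrative verbose_json transcript shape.
const transcript = {
  segments: [
    {
      speaker: 'Speaker 1',
      start: 0.0,
      end: 1.1,
      text: 'Hello everyone',
      words: [
        { text: 'Hello', start: 0.0, end: 0.4, confidence: 0.98 },
        { text: 'everyone', start: 0.5, end: 1.1, confidence: 0.97 },
      ],
    },
  ],
};

// The same flatten used by TranscriptView: one flat word list, with each
// word carrying its segment's speaker label.
const words = transcript.segments.flatMap(seg =>
  seg.words.map(w => ({ ...w, speaker: seg.speaker }))
);
```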
src/components/TranscriptView.jsx
import { useState, useEffect, useMemo, useCallback } from 'react';
export function TranscriptView({ transcript, videoRef, onWordClick }) {
const [activeIdx, setActiveIdx] = useState(-1);
// Memoize the flattened word list so the timeupdate effect below
// does not re-subscribe on every render.
const words = useMemo(() => transcript.segments.flatMap(seg =>
seg.words.map(w => ({ ...w, speaker: seg.speaker }))
), [transcript]);
useEffect(() => {
const video = videoRef.current;
if (!video) return;
const onTimeUpdate = () => {
const t = video.currentTime;
const idx = words.findIndex(
w => t >= w.start && t < w.end
);
setActiveIdx(idx);
};
video.addEventListener('timeupdate', onTimeUpdate);
return () => video.removeEventListener('timeupdate', onTimeUpdate);
}, [videoRef, words]);
const handleClick = useCallback((word, idx) => {
if (videoRef.current) {
videoRef.current.currentTime = word.start;
}
onWordClick?.(word, idx);
}, [videoRef, onWordClick]);
return (
<div style={{ padding: 20, lineHeight: 2 }}>
{words.map((word, i) => (
<span
key={i}
onClick={() => handleClick(word, i)}
style={{
cursor: 'pointer',
padding: '2px 3px',
borderRadius: 4,
background: i === activeIdx ? 'rgba(99,102,241,0.3)' : 'transparent',
color: i === activeIdx ? '#fff' : '#888',
transition: 'all 0.15s',
}}>
{word.text}{' '}
</span>
))}
</div>
);
}
Each word is a clickable <span>. Clicking a word seeks the video to that word's start timestamp. During playback, the timeupdate event fires roughly four times per second, and the component highlights whichever word's time range contains the current playback position, so the highlight tracks the speech closely.
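If the quarter-second timeupdate cadence feels laggy for fast speech, one option is to poll currentTime on every animation frame instead. This is an optional refinement of ours, not part of any V100 SDK:

```javascript
// Pure helper: index of the word whose [start, end) range contains t,
// or -1 when t falls in a gap between words.
const findActiveIndex = (words, t) =>
  words.findIndex(w => t >= w.start && t < w.end);

// Poll currentTime on every animation frame rather than on timeupdate.
// Returns a cleanup function to call when the component unmounts.
function trackActiveWord(video, words, onChange) {
  let rafId;
  let lastIdx = -1;
  const tick = () => {
    const idx = findActiveIndex(words, video.currentTime);
    if (idx !== lastIdx) {
      lastIdx = idx;
      onChange(idx); // e.g. setActiveIdx(idx)
    }
    rafId = requestAnimationFrame(tick);
  };
  rafId = requestAnimationFrame(tick);
  return () => cancelAnimationFrame(rafId);
}
```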
Step 4 — Speaker Diarization
When diarization: true is set, V100 identifies distinct speakers and labels each segment. The transcript response includes a speaker field on every segment. Build a color-coded transcript by assigning each speaker a unique color.
src/components/DiarizedTranscript.jsx
const SPEAKER_COLORS = [
'#818cf8',
'#4ade80',
'#fb923c',
'#f87171',
'#22d3ee',
'#fbbf24',
];
export function DiarizedTranscript({ transcript }) {
const speakers = [...new Set(
transcript.segments.map(s => s.speaker)
)];
const colorMap = Object.fromEntries(
speakers.map((s, i) => [s, SPEAKER_COLORS[i % SPEAKER_COLORS.length]])
);
return (
<div style={{ padding: 20 }}>
{transcript.segments.map((seg, i) => (
<div key={i} style={{ marginBottom: 16 }}>
<div style={{ display: 'flex', gap: 10, alignItems: 'baseline' }}>
<span style={{
color: colorMap[seg.speaker],
fontWeight: 700,
fontSize: 13,
minWidth: 90,
}}>{seg.speaker}</span>
<span style={{
color: '#555', fontSize: 11,
fontFamily: 'JetBrains Mono, monospace',
}}>
{formatTime(seg.start)}
</span>
</div>
<p style={{
margin: '4px 0 0', color: '#ccc',
borderLeft: `3px solid ${colorMap[seg.speaker]}`,
paddingLeft: 12,
}}>{seg.text}</p>
</div>
))}
</div>
);
}
function formatTime(seconds) {
const m = Math.floor(seconds / 60);
const s = Math.floor(seconds % 60).toString().padStart(2, '0');
return `${m}:${s}`;
}
V100's diarization uses voice embeddings to distinguish speakers. It works without any prior enrollment — the model clusters voices automatically during the conversation. For meetings with 2-6 speakers, accuracy is typically above 95%. Speaker labels are consistent within a session: Speaker 1 is always the same person throughout the transcript.
Step 5 — Transcript Search
Full-text search across a transcript lets users find specific moments in a recording instantly. The TranscriptSearch component performs client-side substring matching across all segments and highlights the matches. Clicking a search result seeks the video to that timestamp.
src/components/TranscriptSearch.jsx
import { useState, useMemo } from 'react';
export function TranscriptSearch({ transcript, videoRef }) {
const [query, setQuery] = useState('');
const results = useMemo(() => {
if (!query.trim() || query.length < 2) return [];
const q = query.toLowerCase();
return transcript.segments
.filter(seg => seg.text.toLowerCase().includes(q))
.map(seg => {
const idx = seg.text.toLowerCase().indexOf(q);
return {
...seg,
before: seg.text.slice(Math.max(0, idx - 40), idx),
match: seg.text.slice(idx, idx + query.length),
after: seg.text.slice(idx + query.length, idx + query.length + 40),
};
});
}, [query, transcript]);
const seekTo = (time) => {
if (videoRef.current) videoRef.current.currentTime = time;
};
return (
<div style={{ padding: 16 }}>
<input
value={query}
onChange={e => setQuery(e.target.value)}
placeholder="Search transcript..."
style={{
width: '100%', padding: '10px 16px',
background: '#161616', border: '1px solid #252525',
borderRadius: 8, color: '#e8e8e8', fontSize: 14,
}}
/>
<div style={{ marginTop: 12 }}>
{results.map((r, i) => (
<div
key={i}
onClick={() => seekTo(r.start)}
style={{
padding: '10px 14px', cursor: 'pointer',
borderBottom: '1px solid #252525',
fontSize: 13,
}}
>
<span style={{ color: '#555', fontSize: 11 }}>
{formatTime(r.start)} — {r.speaker}
</span>
<div style={{ color: '#888' }}>
...{r.before}<mark style={{
background: 'rgba(99,102,241,0.3)',
color: '#fff', borderRadius: 3,
}}>{r.match}</mark>{r.after}...
</div>
</div>
))}
</div>
</div>
);
}
function formatTime(s) {
return `${Math.floor(s / 60)}:${Math.floor(s % 60).toString().padStart(2, '0')}`;
}
This is a pure client-side search that runs in the browser. For recordings under an hour, substring matching on the flat transcript text is instant. For longer recordings or multi-meeting search, V100's GET /api/transcription/search endpoint provides server-side full-text search with relevance ranking across all your transcripts.
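A minimal sketch of calling that server-side endpoint. The path comes from the paragraph above, but the query parameter names (q, limit) are assumptions to verify against the API reference:

```javascript
// Build the search URL; the 'q' and 'limit' parameter names are assumed.
const buildSearchUrl = (base, query, limit = 10) => {
  const params = new URLSearchParams({ q: query, limit: String(limit) });
  return `${base}/api/transcription/search?${params}`;
};

// Sketch of the call itself; the API key is passed in as an argument.
const searchTranscripts = async (query, apiKey) => {
  const res = await fetch(buildSearchUrl('https://api.v100.ai', query), {
    headers: { 'Authorization': `Bearer ${apiKey}` },
  });
  return res.json();
};
```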
Step 6 — Export to SRT, VTT, and JSON
Users expect to download transcripts in standard formats for use in video editors, LMS platforms, and compliance archives. The exportTranscript function converts V100's verbose JSON into SRT, VTT, or plain JSON downloads.
src/utils/exportTranscript.js
export function exportTranscript(transcript, format) {
let content, mime, ext;
if (format === 'srt') {
content = transcript.segments.map((seg, i) =>
[
i + 1,
`${srtTime(seg.start)} --> ${srtTime(seg.end)}`,
seg.text,
'',
].join('\n')
).join('\n');
mime = 'text/srt';
ext = 'srt';
}
if (format === 'vtt') {
const cues = transcript.segments.map(seg =>
`${vttTime(seg.start)} --> ${vttTime(seg.end)}\n${seg.speaker}: ${seg.text}`
).join('\n\n');
content = `WEBVTT\n\n${cues}`;
mime = 'text/vtt';
ext = 'vtt';
}
if (format === 'json') {
content = JSON.stringify(transcript, null, 2);
mime = 'application/json';
ext = 'json';
}
// Guard against typos in the format argument, which would otherwise
// produce an empty download.
if (content === undefined) throw new Error(`Unsupported export format: ${format}`);
const blob = new Blob([content], { type: mime });
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = `transcript.${ext}`;
a.click();
URL.revokeObjectURL(url);
}
function srtTime(s) {
const h = Math.floor(s / 3600).toString().padStart(2, '0');
const m = Math.floor((s % 3600) / 60).toString().padStart(2, '0');
const sec = Math.floor(s % 60).toString().padStart(2, '0');
const ms = Math.floor((s % 1) * 1000).toString().padStart(3, '0');
return `${h}:${m}:${sec},${ms}`;
}
function vttTime(s) {
return srtTime(s).replace(',', '.');
}
SRT and VTT are the standard subtitle formats accepted by YouTube, Vimeo, video editors (Premiere, DaVinci Resolve), and LMS platforms. The JSON export preserves all metadata including word-level timestamps, speaker labels, and confidence scores — useful for building custom integrations or feeding into downstream AI pipelines.
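As a quick sanity check of the timestamp helpers, 3661.25 seconds should format as 01:01:01,250 in SRT and 01:01:01.250 in VTT (helpers repeated here so the snippet runs standalone):

```javascript
// Same helpers as in exportTranscript.js above.
function srtTime(s) {
  const h = Math.floor(s / 3600).toString().padStart(2, '0');
  const m = Math.floor((s % 3600) / 60).toString().padStart(2, '0');
  const sec = Math.floor(s % 60).toString().padStart(2, '0');
  const ms = Math.floor((s % 1) * 1000).toString().padStart(3, '0');
  return `${h}:${m}:${sec},${ms}`;
}
const vttTime = (s) => srtTime(s).replace(',', '.');

console.log(srtTime(3661.25)); // "01:01:01,250"
console.log(vttTime(3661.25)); // "01:01:01.250"
```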
Tip: You can also request SRT or VTT directly from V100's API without client-side conversion. Pass format: 'srt' or format: 'vtt' to the POST /api/transcription endpoint and the response body will be the formatted subtitle file.
Step 7 — Edit by Transcript
Edit-by-transcript is the most powerful feature in this tutorial. Users select a range of words in the transcript and delete them, which creates a video edit that removes the corresponding time segment. This is how Descript and similar tools work. V100's Video Editing API handles the actual video cutting server-side.
src/components/EditableTranscript.jsx
import { useState, useEffect, useCallback } from 'react';
export function EditableTranscript({ transcript, videoRef, onCut }) {
const [selection, setSelection] = useState({ start: -1, end: -1 });
const [deleted, setDeleted] = useState(new Set());
const [dragging, setDragging] = useState(false);
const words = transcript.segments.flatMap(seg =>
seg.words.map(w => ({ ...w, speaker: seg.speaker }))
);
const handleMouseDown = (i) => {
setDragging(true);
setSelection({ start: i, end: i });
};
// Extend the selection only while the mouse button is held down;
// otherwise hovering after release would keep growing the range.
const handleMouseOver = (i) => {
if (dragging) setSelection(prev => ({ ...prev, end: i }));
};
useEffect(() => {
// End the drag on mouseup anywhere on the page.
const stopDrag = () => setDragging(false);
window.addEventListener('mouseup', stopDrag);
return () => window.removeEventListener('mouseup', stopDrag);
}, []);
const handleDelete = useCallback(() => {
const lo = Math.min(selection.start, selection.end);
const hi = Math.max(selection.start, selection.end);
const next = new Set(deleted);
for (let i = lo; i <= hi; i++) next.add(i);
setDeleted(next);
const cutStart = words[lo].start;
const cutEnd = words[hi].end;
onCut?.({ start: cutStart, end: cutEnd });
setSelection({ start: -1, end: -1 });
}, [selection, deleted, words, onCut]);
const isSelected = (i) => {
const lo = Math.min(selection.start, selection.end);
const hi = Math.max(selection.start, selection.end);
return i >= lo && i <= hi;
};
return (
<div style={{ padding: 20, lineHeight: 2.2, userSelect: 'none' }}>
{words.map((w, i) => (
<span
key={i}
onMouseDown={() => handleMouseDown(i)}
onMouseOver={() => handleMouseOver(i)}
style={{
cursor: 'pointer',
padding: '2px 3px',
borderRadius: 3,
background: isSelected(i) ? 'rgba(248,113,113,0.3)' : 'transparent',
textDecoration: deleted.has(i) ? 'line-through' : 'none',
color: deleted.has(i) ? '#555' : '#ccc',
}}>
{w.text}{' '}
</span>
))}
{selection.start >= 0 && (
<button
onClick={handleDelete}
style={{
position: 'fixed', bottom: 30, right: 30,
background: '#f87171', color: '#fff',
border: 'none', padding: '10px 20px',
borderRadius: 8, cursor: 'pointer',
fontWeight: 700,
}}>
Delete Selected
</button>
)}
</div>
);
}
When the user selects a word range and clicks Delete Selected, the onCut callback fires with the start and end timestamps. You can then call V100's video editing API to produce a new version of the recording with that segment removed:
Create a video cut via V100 API
const cutVideoSegment = async (recordingId, cuts) => {
const result = await fetch(`https://api.v100.ai/api/recordings/${recordingId}/edit`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
operations: cuts.map(c => ({
type: 'cut',
start: c.start,
end: c.end,
})),
}),
}).then(r => r.json());
return result;
};
The edited video is produced server-side and returned as a new URL. The original recording is never modified, so you can always undo by referencing the original. Multiple cuts can be batched in a single request for complex edits.
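One practical wrinkle: ranges deleted from the transcript can overlap or touch. A small helper (ours, not part of the V100 API) can merge them into a minimal operation list before sending the edit request:

```javascript
// Merge overlapping or touching cut ranges into a minimal set of
// { start, end } operations, sorted by start time.
function mergeCuts(cuts) {
  const sorted = [...cuts].sort((a, b) => a.start - b.start);
  const merged = [];
  for (const c of sorted) {
    const last = merged[merged.length - 1];
    if (last && c.start <= last.end) {
      // Overlap or adjacency: extend the previous range.
      last.end = Math.max(last.end, c.end);
    } else {
      merged.push({ start: c.start, end: c.end });
    }
  }
  return merged;
}
```

With this in place, `cutVideoSegment(recordingId, mergeCuts(pendingCuts))` sends one request covering all edits.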
V100 vs Deepgram vs AssemblyAI vs Whisper
Choosing a transcription provider for a React app means evaluating latency, accuracy, language support, and how much integration work you need to do. Here is how V100 compares to the main alternatives:
| | V100 | Deepgram | AssemblyAI | Whisper (self-hosted) |
| --- | --- | --- | --- | --- |
| Real-time latency | <300ms | ~300ms | ~400ms | 1-5s (depends on GPU) |
| Word-level timestamps | Yes | Yes | Yes | Yes (verbose mode) |
| Speaker diarization | Built-in | Built-in | Built-in | Requires pyannote (extra setup) |
| Video call integration | Native (same WebSocket) | Separate integration | Separate integration | Build everything yourself |
| Video editing from transcript | API endpoint | Not available | Not available | Not available |
| Languages | 40+ | 36+ | 30+ | 99+ |
| React components | This tutorial | None provided | None provided | None provided |
| GPU infrastructure | Managed | Managed | Managed | You provision and maintain |
| Pricing | Free tier, then $0.004/min | $0.0043/min | $0.0065/min | GPU cost ($0.50-2/hr) |
V100's key advantage for React developers is the unified platform. If you are already using V100 for video calls, transcription is a config flag — not a separate API integration. The captions arrive on the same WebSocket you use for signaling, and the transcript is linked to the recording. Deepgram and AssemblyAI are strong standalone transcription services, but integrating them with a video call means piping audio to a second API and correlating the results. Whisper is free and supports the most languages, but self-hosting requires GPU infrastructure and real-time performance is difficult to achieve without careful optimization.
Pricing
V100's free tier includes 100 API calls per month, which covers transcription for short meetings and testing. No credit card required.
- Free — 100 API calls/month, real-time + post-call transcription, 40+ languages. No credit card.
- Pro — $0.004/minute of audio. Word-level timestamps, diarization, SRT/VTT export, video editing API.
- Enterprise — Volume pricing, custom language models, on-prem deployment option, SLA.
See the pricing page for full details.
Add Transcription to Your App Today
Get your free API key and start building. Live captions, word-level timestamps, speaker diarization, and edit-by-transcript — all from one API.
Get Your Free API Key