What is speaker diarization?

Speaker diarization is the process of identifying and labeling which speaker is talking at each point in an audio or video recording. It answers the question 'who spoke when?' by segmenting the audio into speaker-homogeneous regions and assigning each region a speaker label. In V100's API, diarization is performed at the word level, so every word in the transcript is tagged with the speaker who said it.

How accurate is V100's speaker diarization?

V100's speaker diarization achieves 90-95% accuracy on typical meeting recordings with 2-6 speakers and clear audio. Accuracy decreases in challenging conditions: overlapping speech, very short speaker turns, or more than 10 speakers in a single session. Noise suppression significantly improves diarization accuracy by reducing background interference that confuses speaker embeddings.

Can I name speakers in the V100 API?

Yes. V100 assigns generic labels (Speaker A, Speaker B) during transcription. You can map these labels to real names via the API after the transcript is generated. If participants are authenticated in your application, V100 can automatically match speaker labels to participant identities using the session's participant list.

SPEAKER DIARIZATION

Video API with Speaker Diarization

A transcript without speaker labels is a wall of text. You know what was said, but not who said it. V100's speaker diarization API identifies each speaker in a video session and tags every word in the transcript with the speaker who said it. The result is a color-coded, searchable, per-speaker transcript that powers meeting analytics, accountability tracking, and intelligent editing. Powered by Deepgram and Whisper, with word-level precision across up to 20 speakers.

Up to 20 speakers per session

Word-level speaker labels

Per-speaker analytics

Why Speaker Diarization Matters

Without diarization, a meeting transcript reads like a monologue. You can search for what was said, but you cannot attribute it. For meeting analytics, legal transcripts, and clinical documentation, knowing who said what is not optional -- it is the entire point.

Meeting Analytics

Who talked the most? Who barely spoke? Was the customer given enough time to ask questions? Speaker diarization turns meeting recordings into quantified data: talk-time percentages, interruption counts, question-to-statement ratios, and silence analysis per speaker. Meeting coaches and sales leaders use these metrics to improve team communication patterns.

Accountability and Attribution

In legal depositions, board meetings, and compliance discussions, who said what matters legally. Diarized transcripts provide per-speaker attribution that holds up to scrutiny. When someone says "I never agreed to that," the transcript shows exactly who said what, with timestamps, linked to the video for verification.

Filtered Search and Navigation

Search transcripts by speaker. "Show me everything the customer said about pricing" is a single API call with diarization. Without it, you get every mention of pricing from every speaker. With it, you get the customer's exact words, with timestamps, linked to the video moments where they expressed their concerns.
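The search call itself is not shown here, so the following is only a minimal client-side sketch of the same idea: filter the word list by speaker, then by keyword. The helper name `findMentions` is hypothetical and not part of the V100 SDK, and the word shape (`word`, `start`, `end`, `speaker`) is assumed from the transcript sample in the Integration Guide.

```javascript
// Hypothetical helper: return the words a given speaker said that match a keyword.
// Assumes each entry looks like { word, start, end, speaker }.
function findMentions(words, speaker, keyword) {
  const needle = keyword.toLowerCase();
  return words
    .filter((w) => w.speaker === speaker)
    // Strip punctuation so "pricing," and "Pricing?" still match "pricing".
    .filter((w) => w.word.toLowerCase().replace(/[^\p{L}\p{N}']/gu, '') === needle)
    .map((w) => ({ word: w.word, start: w.start }));
}

// Illustrative data only; speaker "B" plays the customer.
const sampleWords = [
  { word: 'Our', start: 0, end: 200, speaker: 'A' },
  { word: 'pricing', start: 200, end: 700, speaker: 'A' },
  { word: 'The', start: 1500, end: 1650, speaker: 'B' },
  { word: 'pricing', start: 1650, end: 2100, speaker: 'B' },
  { word: 'worries', start: 2100, end: 2600, speaker: 'B' },
  { word: 'me.', start: 2600, end: 2800, speaker: 'B' },
];

const customerMentions = findMentions(sampleWords, 'B', 'pricing');
console.log(customerMentions);
```

Each hit keeps its `start` timestamp, which is what lets a player jump to the matching moment in the video.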

How V100's Speaker Diarization Works

V100's diarization pipeline uses speaker embedding models to create a voice signature for each speaker, then clusters speech segments by voice similarity and labels every word with the speaker of its segment. The process runs as part of the transcription pipeline -- enable it with a single parameter.

Speaker Embeddings

The diarization model extracts a speaker embedding -- a numerical fingerprint of vocal characteristics -- for each segment of speech. Embeddings capture pitch, timbre, speaking rate, and accent patterns. These embeddings are compared across the recording to determine which segments belong to the same speaker, even when speakers have similar voices.
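To make "comparing embeddings" concrete, here is a small sketch using cosine similarity, a common way to compare embedding vectors. Which similarity measure V100 actually uses is not stated, so treat the choice as an assumption; the three-dimensional vectors are toy values, whereas real speaker embeddings have far more dimensions.

```javascript
// Cosine similarity: close to 1 for vectors pointing the same way, lower otherwise.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Embeddings must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy embeddings: two segments from one voice, one segment from another.
const firstVoiceSegment1 = [0.9, 0.1, 0.2];
const firstVoiceSegment2 = [0.8, 0.2, 0.1];
const secondVoiceSegment = [0.1, 0.9, 0.3];

const sameVoiceScore = cosineSimilarity(firstVoiceSegment1, firstVoiceSegment2);
const differentVoiceScore = cosineSimilarity(firstVoiceSegment1, secondVoiceSegment);
console.log({ sameVoiceScore, differentVoiceScore });
```

Segments from the same voice score close to 1, and segments from different voices score noticeably lower. That gap is what the clustering step relies on.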

Speaker Clustering

Embeddings are grouped into clusters using agglomerative clustering. Each cluster represents a unique speaker. The system automatically determines the number of speakers (up to 20) without requiring the count in advance. You can optionally provide a speaker count hint to improve accuracy when you know how many participants are in the session.
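The clustering step can be sketched as bottom-up merging. Only the use of agglomerative clustering is given above; the average linkage, the cosine distance, the `maxDistance` threshold, and the way the hint is applied in this sketch are all assumptions for illustration, not V100's implementation.

```javascript
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Average linkage: mean distance over every cross-cluster pair of points.
function averageLinkage(clusterA, clusterB, points) {
  let total = 0;
  for (const i of clusterA) {
    for (const j of clusterB) total += cosineDistance(points[i], points[j]);
  }
  return total / (clusterA.length * clusterB.length);
}

// Returns one label ("A", "B", ...) per input embedding.
// Without a hint, merging stops once the closest pair is farther than maxDistance.
// With a hint, merging stops as soon as that many clusters remain.
function agglomerativeCluster(points, maxDistance, speakerCountHint) {
  const clusters = points.map((_, i) => [i]); // start with one cluster per segment
  while (clusters.length > 1) {
    if (speakerCountHint !== undefined && clusters.length <= speakerCountHint) break;
    let best = { distance: Infinity, a: -1, b: -1 };
    for (let a = 0; a < clusters.length; a++) {
      for (let b = a + 1; b < clusters.length; b++) {
        const distance = averageLinkage(clusters[a], clusters[b], points);
        if (distance < best.distance) best = { distance, a, b };
      }
    }
    if (speakerCountHint === undefined && best.distance > maxDistance) break;
    clusters[best.a] = clusters[best.a].concat(clusters[best.b]);
    clusters.splice(best.b, 1);
  }
  const labels = new Array(points.length);
  clusters.forEach((members, index) => {
    for (const i of members) labels[i] = String.fromCharCode(65 + index);
  });
  return labels;
}

// Toy embeddings for four segments that alternate between two voices.
const segmentEmbeddings = [
  [0.9, 0.1, 0.2],
  [0.1, 0.9, 0.3],
  [0.8, 0.2, 0.1],
  [0.2, 0.8, 0.4],
];
const autoLabels = agglomerativeCluster(segmentEmbeddings, 0.3);
const hintedLabels = agglomerativeCluster(segmentEmbeddings, 0.3, 2);
console.log(autoLabels, hintedLabels);
```

In this sketch the hint replaces the distance threshold as the stopping rule, which is one plausible reason a known speaker count can help: the algorithm no longer has to guess where to stop merging.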

Word-Level Labeling

Speaker labels are assigned at the word level, not just the utterance level. Every word in the transcript includes a speaker field, so attribution stays precise even during rapid back-and-forth exchanges where speakers alternate mid-sentence.
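Word-level labels are easy to roll back up into readable speaker turns. The sketch below merges consecutive words from the same speaker; the `groupIntoTurns` helper is hypothetical, and the word shape and values mirror the transcript sample in the Integration Guide.

```javascript
// Merge consecutive words that share a speaker into one turn.
function groupIntoTurns(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      last.end = w.end;
      last.text += ' ' + w.word;
    } else {
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}

const labeledWords = [
  { word: "Let's", start: 1200, end: 1450, speaker: 'A' },
  { word: 'review', start: 1450, end: 1700, speaker: 'A' },
  { word: 'I', start: 2100, end: 2200, speaker: 'B' },
  { word: 'disagree', start: 2200, end: 2600, speaker: 'B' },
];

const turns = groupIntoTurns(labeledWords);
for (const turn of turns) {
  console.log(`Speaker ${turn.speaker}: ${turn.text}`);
}
```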

Speaker Naming

V100 initially assigns generic labels (Speaker A, Speaker B). After transcription, you can map these labels to participant names via the API. If participants are authenticated in your application, V100 can auto-match speaker labels to participant identities using the session's authenticated participant list -- making the transcript immediately usable without manual speaker assignment.

Per-Speaker Meeting Analytics

Speaker diarization enables analytics that are impossible without speaker attribution. V100 computes these metrics automatically from the diarized transcript and makes them available through the API and webhook events.

Talk Time per Speaker

Total speaking duration for each participant, as both absolute time and percentage of the session. Identify meetings dominated by one speaker, or sessions where the customer barely spoke.
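V100 returns this metric pre-computed, but a rough sketch shows what it measures. The version below sums word durations per speaker and divides by a session length that the caller supplies; both the function name and the choice to count only word durations (ignoring pauses inside a turn) are assumptions.

```javascript
// Sum spoken time per speaker and express it as a share of the session.
// Timestamps are assumed to be in milliseconds.
function talkTimeBySpeaker(words, sessionDurationMs) {
  const totals = new Map();
  for (const w of words) {
    totals.set(w.speaker, (totals.get(w.speaker) || 0) + (w.end - w.start));
  }
  return [...totals].map(([speaker, talkTimeMs]) => ({
    speaker,
    talkTimeMs,
    talkPercent: Math.round((talkTimeMs / sessionDurationMs) * 100),
  }));
}

const meetingWords = [
  { word: "Let's", start: 1200, end: 1450, speaker: 'A' },
  { word: 'review', start: 1450, end: 1700, speaker: 'A' },
  { word: 'I', start: 2100, end: 2200, speaker: 'B' },
  { word: 'disagree', start: 2200, end: 2600, speaker: 'B' },
];

const talkTime = talkTimeBySpeaker(meetingWords, 2000);
console.log(talkTime);
```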

Turn Count

How many times each speaker took a turn. High turn counts with short durations indicate dialogue. Low turn counts with long durations indicate monologue. Both patterns are useful signals for coaching.
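As an illustration of the dialogue-versus-monologue signal, this sketch counts a new turn whenever the speaker changes and reports the average spoken time per turn. The `turnStats` name and the definition of a turn as "a run of consecutive words by one speaker" are assumptions; V100's own definition is not given here.

```javascript
// Count turns per speaker and the average spoken time per turn (in ms).
function turnStats(words) {
  const stats = new Map();
  let previousSpeaker = null;
  for (const w of words) {
    if (!stats.has(w.speaker)) stats.set(w.speaker, { turns: 0, spokenMs: 0 });
    const entry = stats.get(w.speaker);
    if (w.speaker !== previousSpeaker) entry.turns += 1; // speaker change starts a turn
    entry.spokenMs += w.end - w.start;
    previousSpeaker = w.speaker;
  }
  return [...stats].map(([speaker, { turns, spokenMs }]) => ({
    speaker,
    turns,
    avgTurnMs: Math.round(spokenMs / turns),
  }));
}

// Illustrative data: A speaks twice (once at length), B speaks once briefly.
const callWords = [
  { word: 'So', start: 0, end: 400, speaker: 'A' },
  { word: 'anyway', start: 400, end: 900, speaker: 'A' },
  { word: 'Right', start: 1000, end: 1300, speaker: 'B' },
  { word: 'as', start: 1400, end: 2000, speaker: 'A' },
  { word: 'I-was-saying', start: 2000, end: 9000, speaker: 'A' },
];

const stats = turnStats(callWords);
console.log(stats);
```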

Interruption Count

Detects when a speaker begins talking before the previous speaker finishes. Frequent interruptions can signal engagement or dysfunction depending on context. The API provides interruption pairs (who interrupted whom).
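One way to derive interruption pairs from word timestamps is sketched below. A speaker is treated as starting a new turn when their previous word ended more than `turnGapMs` earlier, and that start counts as an interruption if another speaker's latest word is still in progress. The function name, the output shape, and the 500 ms default are assumptions, not the API's documented behavior.

```javascript
// Return { interrupter, interrupted, atMs } for each detected interruption.
function findInterruptions(words, turnGapMs = 500) {
  const sorted = [...words].sort((a, b) => a.start - b.start);
  const lastEnd = new Map(); // speaker -> end time of their most recent word
  const pairs = [];
  for (const w of sorted) {
    const previousEnd = lastEnd.get(w.speaker);
    const startsNewTurn = previousEnd === undefined || w.start - previousEnd > turnGapMs;
    if (startsNewTurn) {
      for (const [other, otherEnd] of lastEnd) {
        // The other speaker's word has not finished when this turn begins.
        if (other !== w.speaker && otherEnd > w.start) {
          pairs.push({ interrupter: w.speaker, interrupted: other, atMs: w.start });
        }
      }
    }
    lastEnd.set(w.speaker, w.end);
  }
  return pairs;
}

// Illustrative data: B cuts in at 900 ms while A is still mid-word.
const overlappingWords = [
  { word: 'We', start: 0, end: 500, speaker: 'A' },
  { word: 'should', start: 500, end: 1200, speaker: 'A' },
  { word: 'No', start: 900, end: 1300, speaker: 'B' },
  { word: 'wait', start: 1200, end: 1600, speaker: 'A' },
  { word: 'Sorry', start: 3000, end: 3400, speaker: 'B' },
];

const interruptions = findInterruptions(overlappingWords);
console.log(interruptions);
```

Checking for a new turn first keeps the speaker who was already talking from being counted as interrupting back when their next word overlaps the interrupter.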

Questions per Speaker

Counts interrogative sentences per speaker. In sales calls, the ratio of questions asked by the rep versus the prospect is a strong indicator of call quality. In interviews, it measures how much the interviewer lets the candidate speak.
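A deliberately naive sketch of question counting follows. It assumes the transcript words carry sentence punctuation (the sample in the Integration Guide does not show any), and it treats every word ending in a question mark as the end of one question. How V100 actually detects interrogative sentences is not specified here.

```javascript
// Count words that end a question, per speaker.
// Assumption: punctuation is attached to the last word of each sentence.
function questionsBySpeaker(words) {
  const counts = {};
  for (const w of words) {
    if (!(w.speaker in counts)) counts[w.speaker] = 0;
    // Allow a closing quote or bracket after the question mark.
    if (/\?["')\]]*$/.test(w.word)) counts[w.speaker] += 1;
  }
  return counts;
}

const salesCallWords = [
  { word: 'What', start: 0, end: 200, speaker: 'A' },
  { word: 'is', start: 200, end: 300, speaker: 'A' },
  { word: 'the', start: 300, end: 400, speaker: 'A' },
  { word: 'price?', start: 400, end: 800, speaker: 'A' },
  { word: 'Forty', start: 1200, end: 1500, speaker: 'B' },
  { word: 'dollars.', start: 1500, end: 2000, speaker: 'B' },
  { word: 'Monthly?', start: 2300, end: 2800, speaker: 'A' },
];

const questionCounts = questionsBySpeaker(salesCallWords);
console.log(questionCounts);
```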

Silence Analysis

Identifies gaps in conversation longer than a configurable threshold. Extended silence after a question might indicate confusion, thinking time, or disengagement. Each silence is attributed to the speaker who was talking immediately before it.
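The gap detection described above can be sketched as a single pass over the words. The function name, the output fields, and the millisecond units are assumptions for illustration.

```javascript
// Find gaps longer than thresholdMs and record who spoke just before each one.
function findSilences(words, thresholdMs) {
  const sorted = [...words].sort((a, b) => a.start - b.start);
  const silences = [];
  let latestEnd = null;     // latest point at which anyone was still speaking
  let latestSpeaker = null; // speaker whose word ended at latestEnd
  for (const w of sorted) {
    if (latestEnd !== null && w.start - latestEnd > thresholdMs) {
      silences.push({
        startMs: latestEnd,
        endMs: w.start,
        durationMs: w.start - latestEnd,
        afterSpeaker: latestSpeaker,
      });
    }
    if (latestEnd === null || w.end > latestEnd) {
      latestEnd = w.end;
      latestSpeaker = w.speaker;
    }
  }
  return silences;
}

// Illustrative data: a 3.5 second pause follows A's question.
const pausedWords = [
  { word: 'Any', start: 0, end: 300, speaker: 'A' },
  { word: 'questions?', start: 300, end: 1000, speaker: 'A' },
  { word: 'Um', start: 4500, end: 4800, speaker: 'B' },
  { word: 'no', start: 5000, end: 5300, speaker: 'B' },
];

const silences = findSilences(pausedWords, 2000);
console.log(silences);
```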

Speaker Timeline

A visual timeline showing when each speaker was active throughout the recording. Color-coded blocks represent each speaker's contributions. Click any block to jump to that moment in the video. Available as raw data via API or as a pre-built UI component.
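If you render the timeline yourself from the raw data, each block needs a horizontal position, a width, and a seek target. The sketch below assumes turn objects shaped `{ speaker, start, end }` in milliseconds; the format of V100's timeline data is not given here, so the field names are hypothetical.

```javascript
// Convert speaker turns into positioned blocks for a horizontal timeline.
function toTimelineBlocks(turns, durationMs) {
  return turns.map((turn) => ({
    speaker: turn.speaker,
    seekToMs: turn.start, // where the player should jump when the block is clicked
    leftPercent: (turn.start / durationMs) * 100,
    widthPercent: ((turn.end - turn.start) / durationMs) * 100,
  }));
}

// Illustrative data: a 60 second recording with two turns.
const sampleTurns = [
  { speaker: 'A', start: 0, end: 30000 },
  { speaker: 'B', start: 30000, end: 45000 },
];

const blocks = toTimelineBlocks(sampleTurns, 60000);
console.log(blocks);
```

Percentages rather than pixels keep the blocks correct at any container width.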

Integration Guide

Enable speaker diarization with a single parameter. The diarized transcript includes speaker labels on every word, plus pre-computed per-speaker analytics.

speaker-diarization.js

import { V100 } from 'v100-sdk';
const v100 = new V100('YOUR_API_KEY');

// Create a meeting with diarization enabled
const session = await v100.meetings.create({
  name: 'Product Strategy Review',
  transcription: {
    enabled: true,
    language: 'en',
    diarize: true,              // Enable speaker diarization
    speakerCountHint: 4          // Optional: improve accuracy
  }
});

// After meeting ends, get diarized transcript
const transcript = await v100.transcripts.get(session.transcriptId);

// transcript.words = [
//   { word: "Let's", start: 1200, end: 1450, speaker: "A" },
//   { word: "review", start: 1450, end: 1700, speaker: "A" },
//   { word: "I", start: 2100, end: 2200, speaker: "B" },
//   { word: "disagree", start: 2200, end: 2600, speaker: "B" },
//   ...
// ]

// Name the speakers
await v100.transcripts.nameSpeakers(session.transcriptId, {
  'A': 'Sarah Chen',
  'B': 'Marcus Johnson',
  'C': 'Priya Patel',
  'D': 'James Williams'
});

// Get per-speaker analytics
const analytics = await v100.transcripts.speakerAnalytics(session.transcriptId);
// analytics = {
//   speakers: [
//     { name: "Sarah Chen",     talkTimeMs: 420000, talkPercent: 35, turns: 42, questions: 8 },
//     { name: "Marcus Johnson", talkTimeMs: 360000, talkPercent: 30, turns: 38, questions: 3 },
//     { name: "Priya Patel",    talkTimeMs: 240000, talkPercent: 20, turns: 28, questions: 12 },
//     { name: "James Williams", talkTimeMs: 180000, talkPercent: 15, turns: 22, questions: 5 }
//   ],
//   interruptions: 7,
//   longestMonologue: { speaker: "Sarah Chen", durationMs: 45000 }
// }

Pricing

Speaker diarization is included with V100's transcription at no additional per-word or per-speaker cost. You pay for transcription minutes -- diarization and analytics are computed at no extra charge.

Free Tier ($0/mo)

Diarization included with transcription. Up to 20 speakers, word-level labels, speaker analytics, and 60 minutes/month.

Pro (usage-based, recommended)

Production diarization at scale. $0.006/min for transcription + diarization, auto speaker-participant matching, the speaker timeline component, and webhook events for analytics.

Enterprise (custom pricing)

Custom analytics and voice enrollment. Volume discounts, custom analytics dashboards, voice enrollment for known speakers, and dedicated support.
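For budgeting, the Pro rate of $0.006/min works out as simple multiplication. The sketch below assumes billing is strictly linear with no rounding to whole minutes and no free allowance on Pro; neither detail is stated here, so check your actual invoice terms.

```javascript
// $0.006 per minute, stored as 6 mills (tenths of a cent) to avoid float drift.
const PRO_RATE_MILLS_PER_MINUTE = 6;

// Estimated cost in dollars for a given number of transcribed minutes.
function estimateProCost(minutes) {
  if (!Number.isFinite(minutes) || minutes < 0) {
    throw new RangeError('minutes must be a non-negative number');
  }
  return (minutes * PRO_RATE_MILLS_PER_MINUTE) / 1000;
}

console.log(estimateProCost(1000)); // 1,000 minutes of meetings
console.log(estimateProCost(90));   // a single 90 minute session
```

At this rate, 1,000 minutes of diarized transcription comes to $6 and a 90 minute session to $0.54.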

Related Resources

Transcription API

AI Meeting Notes

Noise Suppression API
