Video editing is hard. Not because the creative decisions are hard — most people know what they want to keep and what they want to cut. It is hard because the tools are hard. Timeline editors require learning concepts like tracks, razor tools, ripple edits, and keyframes. Simple tasks like "remove the section where I said um for ten seconds" require scrubbing through video, finding the exact start and end points, splitting the clip, deleting the segment, and closing the gap. That workflow takes 30 seconds for an experienced editor and 5 minutes for everyone else.
Transcript-based editing eliminates the timeline entirely. Instead of scrubbing through video to find the right moment, you read the transcript. Instead of splitting clips at precise timestamps, you select and delete words. Instead of a multi-track timeline with clips, transitions, and keyframes, you have a document that you edit like text. The video follows the text.
V100 provides this capability as an API. You can embed transcript-based editing into any application — a video editing SaaS, a meeting recording platform, a podcast production tool, an e-learning authoring system. Your users get the most intuitive editing experience possible, and you do not have to build the transcription engine, the word-level alignment, the timeline management, or the video rendering pipeline.
How It Works: Word-Level Timestamps
The foundation of transcript-based editing is word-level timestamps. Every word in the transcript has a start time and an end time, precise to the millisecond. When V100 transcribes a video (using Deepgram for real-time transcription or Whisper for batch processing), the output is not just text — it is a structured document where each word is mapped to its exact position in the video timeline.
// Each word has a precise start and end timestamp (in seconds)
{
  "words": [
    { "word": "We",       "start": 0.240, "end": 0.480, "speaker": "Sarah" },
    { "word": "need",     "start": 0.480, "end": 0.720, "speaker": "Sarah" },
    { "word": "to",       "start": 0.720, "end": 0.840, "speaker": "Sarah" },
    { "word": "ship",     "start": 0.840, "end": 1.120, "speaker": "Sarah" },
    { "word": "the",      "start": 1.120, "end": 1.240, "speaker": "Sarah" },
    { "word": "update",   "start": 1.240, "end": 1.680, "speaker": "Sarah" },
    { "word": "by",       "start": 1.680, "end": 1.800, "speaker": "Sarah" },
    { "word": "Friday",   "start": 1.800, "end": 2.400, "speaker": "Sarah" },
    { "word": "um",       "start": 2.800, "end": 3.200, "speaker": "Sarah" },
    { "word": "actually", "start": 3.200, "end": 3.800, "speaker": "Sarah" },
    // ... delete "um actually" → cut 2.800-3.800s from video
  ]
}
When a user selects the word "um" in the transcript and deletes it, V100 marks the time segment from 2.800s to 3.200s as "removed." The video player skips that segment during playback. The original video file is never modified. The edit is stored as a list of kept segments — the inverse of the removed segments. This is fully non-destructive editing.
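Deriving the kept segments from the removed segments is a straightforward interval computation. A minimal sketch in plain JavaScript — this is an illustration of the idea, not V100's implementation:

```javascript
// Derive the kept segments of a video from its removed segments.
// `removed` is a list of non-overlapping { start, end } ranges in seconds.
function keptSegments(duration, removed) {
  const cuts = [...removed].sort((a, b) => a.start - b.start);
  const kept = [];
  let cursor = 0;
  for (const cut of cuts) {
    // Everything between the cursor and the next cut is kept
    if (cut.start > cursor) kept.push({ start: cursor, end: cut.start });
    cursor = Math.max(cursor, cut.end);
  }
  // The tail after the last cut is also kept
  if (cursor < duration) kept.push({ start: cursor, end: duration });
  return kept;
}

// Deleting "um actually" (2.800-3.800s) from a 10-second clip:
console.log(keptSegments(10.0, [{ start: 2.8, end: 3.8 }]));
// → [ { start: 0, end: 2.8 }, { start: 3.8, end: 10 } ]
```

Because the edit is just this list of ranges, undo is trivial: drop the cut from the removed list and recompute.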
The Feature Set
[Diagram: transcript-based editing capabilities]
The API: Building a Transcript Editor
V100's transcript editing API gives you everything you need to build a transcript-based video editor into your application. The API handles transcription with word-level timestamps, maintains the edit state, computes the kept segments, and renders the final video. You build the UI.
// 1. Upload video and get a word-level transcript
const video = await v100.videos.upload({
  file: videoFile,
  transcribe: true,
  transcription: {
    provider: "deepgram",      // or "whisper"
    word_timestamps: true,     // Required for editing
    speaker_diarization: true, // Speaker labels
  },
});

// 2. Get the transcript with word-level timestamps
const transcript = await v100.videos.transcript(video.id);
// Returns: { words: [{ word, start, end, speaker }], speakers: [...] }

// 3. Delete words by index (user selects words 45-52 in the UI)
await v100.videos.edit(video.id, {
  action: "delete",
  word_indices: [45, 46, 47, 48, 49, 50, 51, 52],
});

// 4. Get the final timeline (kept segments)
const segments = await v100.videos.getKeptSegments(video.id);
// Returns: [
//   { start: 0.000, end: 14.320 },  // Before the cut
//   { start: 18.750, end: 62.400 }, // After the cut
// ]

// 5. Undo the last edit
await v100.videos.edit(video.id, { action: "undo" });

// 6. Export an EDL for Premiere/DaVinci
const edl = await v100.videos.exportEdl(video.id, {
  format: "cmx3600", // CMX 3600 EDL (universal)
  framerate: 30,     // Project frame rate
});

// 7. Render the final edited video
const rendered = await v100.videos.render(video.id, {
  format: "mp4",
  codec: "h264",
  quality: "high",
});
Why This Is the Killer Feature for Video Editing SaaS
If you are building a video editing SaaS product, transcript-based editing is the single feature that differentiates you from FFmpeg wrappers and timeline clones. It is the feature that makes video editing accessible to non-editors. And it is the feature that users describe to their colleagues when recommending a product — "you just edit the text and the video changes."
Building this feature from scratch requires solving several hard problems simultaneously: real-time transcription with word-level accuracy (not just sentence-level), speaker diarization, edit state management with undo/redo, gap-free video rendering from non-contiguous segments, and export to professional editing formats. Each of these is a significant engineering effort. Together, they represent months of development time.
V100's API handles all of them. You make API calls. V100 returns transcript data, manages edit state, computes kept segments, and renders the final video. You build the UI — the part that differentiates your product — and V100 handles the infrastructure that powers it.
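The edit-state management mentioned above can be pictured as two stacks: applied operations and undone operations. A toy sketch of that bookkeeping — an illustration of the pattern, not V100's internal implementation:

```javascript
// Toy edit-state manager: a history of delete operations with undo/redo.
class EditState {
  constructor() {
    this.applied = [];   // operations currently in effect
    this.redoStack = []; // operations undone and available to redo
  }
  delete(wordIndices) {
    this.applied.push({ action: "delete", wordIndices });
    this.redoStack = []; // a fresh edit invalidates the redo history
  }
  undo() {
    const op = this.applied.pop();
    if (op) this.redoStack.push(op);
  }
  redo() {
    const op = this.redoStack.pop();
    if (op) this.applied.push(op);
  }
}

const state = new EditState();
state.delete([45, 46, 47]); // cut "um actually so"
state.delete([90]);         // cut a stray "uh"
state.undo();               // restore the "uh"
state.redo();               // cut it again
```

The kept segments at any point are a pure function of `applied`, which is what makes the whole scheme non-destructive.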
V100 API vs Descript: Build vs Buy
Descript is an excellent product that popularized transcript-based video editing. It is priced at $24/month (Pro) to $33/month (Business) per user and provides a complete desktop application for video editing, podcasting, and screen recording. For individuals and small teams who want a finished product, Descript is a strong choice.
V100 is not competing with Descript. V100 is the API that lets you build your own Descript — or a transcript-based editor tailored to a specific vertical that Descript does not serve. If you are building a legal video deposition platform, a corporate training authoring tool, a podcast production SaaS, or an e-learning content management system, you need transcript-based editing embedded in your application with your branding, your workflow, and your integrations. You cannot embed Descript.
V100 API vs Descript product
| Capability | V100 API | Descript |
|---|---|---|
| Transcript-based editing | Yes (API) | Yes (desktop app) |
| Embed in your app | Yes | No |
| Custom branding | Full control | Descript branding |
| Pricing model | Pay-per-use | $24-33/user/mo |
| EDL export | Yes (API) | Yes |
| Multi-platform publishing | 7 platforms, one call | Limited |
| PQ encryption | ML-KEM-768 + ML-DSA-65 | No |
| Real-time AI highlights | Claude Haiku | Post-recording only |
Use Cases: Who Builds With Transcript Editing
Legal video deposition platforms: Attorneys review depositions by reading the transcript, not watching hours of video. Click a disputed statement to jump to the video. Select and mark segments for court exhibits. Export an EDL for the legal videographer. The transcript is the natural interface for legal professionals who are already trained in transcript review.
Podcast production tools: Podcast editors spend most of their time removing filler words, long pauses, and tangential segments. With transcript editing, they read the conversation, delete the "ums" and "you knows," cut the 5-minute tangent about the host's weekend, and export a clean episode in minutes instead of hours.
Corporate training platforms: Subject matter experts record training videos and need to remove mistakes, false starts, and off-topic digressions. They are not video editors. They do not know what a razor tool is. But they can read a transcript and delete the parts that should not be in the final training module.
Meeting recording platforms: After a meeting, participants want to share the key segments — the decision about Q2 strategy, the product demo, the customer feedback. Transcript editing lets them select the relevant portions and share a clip, without scrubbing through a 45-minute recording to find the right timestamps.
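Several of these workflows reduce to programmatically selecting word indices to delete. For the podcast case, finding every filler word is a single pass over the transcript's word array. A sketch — the filler list here is illustrative, not a V100 constant:

```javascript
// Collect the indices of filler words in a word-level transcript.
const FILLERS = new Set(["um", "uh", "erm", "hmm"]);

function fillerIndices(words) {
  return words
    .map((w, i) => ({ i, text: w.word.toLowerCase().replace(/[.,!?]/g, "") }))
    .filter(({ text }) => FILLERS.has(text))
    .map(({ i }) => i);
}

const words = [
  { word: "We" }, { word: "need" }, { word: "um" },
  { word: "to" }, { word: "ship" }, { word: "Um," },
];
console.log(fillerIndices(words)); // → [2, 5]
```

The resulting array is exactly what the `word_indices` field of the delete call expects, so "remove all filler words" can be a one-click feature in your UI.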
getKeptSegments(): The Core Primitive
The fundamental output of transcript-based editing is a list of kept segments. The getKeptSegments() function returns an array of time ranges that represent the final edited video. Each segment has a start time and an end time. The segments are in chronological order and do not overlap.
// After the user deletes words at 14.32-18.75s and 42.10-45.30s
const segments = await v100.videos.getKeptSegments(videoId);
// Result: three kept segments
[
  { "start": 0.000, "end": 14.320 },  // Opening segment
  { "start": 18.750, "end": 42.100 }, // Middle segment
  { "start": 45.300, "end": 62.400 }, // Closing segment
]
// Total kept: 54.77s out of 62.40s original
// Removed: 7.63s (two cuts)

// Feed to the render endpoint for the final video
const rendered = await v100.videos.render(videoId, {
  segments: segments, // Optional: override with custom segments
  format: "mp4",
  include_captions: true,
});
This primitive is deliberately simple. An array of time ranges is the most portable and composable representation of a video edit. You can pass it to V100's render endpoint to produce a final video. You can convert it to an EDL for professional editing software. You can use it client-side to build a custom player that skips cut segments. You can serialize it to JSON and store it alongside the video as non-destructive edit metadata.
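For the client-side player case, the segment list drives a simple time mapping: a position in the edited timeline corresponds to a position in the original source video. A sketch of that mapping, independent of the V100 SDK:

```javascript
// Map a playback position in the edited timeline to a time in the
// original source video, given the kept segments in chronological order.
function editedToSourceTime(segments, editedTime) {
  let remaining = editedTime;
  for (const seg of segments) {
    const len = seg.end - seg.start;
    if (remaining <= len) return seg.start + remaining;
    remaining -= len; // skip past this segment in the edited timeline
  }
  // Past the end of the edited timeline: clamp to the last kept frame
  return segments.length ? segments[segments.length - 1].end : 0;
}

const segments = [
  { start: 0.0, end: 14.32 },
  { start: 18.75, end: 42.1 },
];
// 16s into the edited timeline is 1.68s into the second kept segment,
// i.e. roughly 20.43s in the source video
console.log(editedToSourceTime(segments, 16.0));
```

A custom player seeks the source video to `editedToSourceTime(...)` whenever playback crosses a cut, so the viewer sees a continuous edited timeline.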
Limitations
Transcript accuracy is the ceiling. If the transcription engine mis-transcribes a word, the transcript will not match the audio, and the user experience degrades. Deepgram and Whisper achieve 95-98% word accuracy on clear audio, but accuracy drops with heavy accents, overlapping speakers, background noise, or domain-specific jargon. V100 allows users to manually correct transcript text without affecting the underlying timestamps.
Cuts are at word boundaries, not frame boundaries. Because edits are defined by word timestamps, the smallest unit of editing is a word. You cannot cut at an arbitrary frame within a word. For most use cases (removing sentences, paragraphs, or filler words), word-level precision is more than sufficient. For frame-precise editing (cutting on a specific visual transition), users should export the EDL and fine-tune in a professional NLE.
No rearrangement in the current API. The current transcript editing API supports deletion and restoration but does not support reordering segments. You cannot drag paragraph 3 above paragraph 1 in the transcript and have the video rearrange accordingly. This is on the product roadmap. Workaround: use getKeptSegments() to get the timeline, reorder the segments manually, and pass the reordered array to the render endpoint.
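In concrete terms, the workaround is to fetch the kept segments, reorder the array client-side, and pass it back to the render call. A sketch, assuming the render endpoint plays segments in the order given:

```javascript
// Move one kept segment to the front of the timeline.
// Illustrative helper; assumes render() honors the order of `segments`.
function moveToFront(segments, index) {
  const reordered = [...segments]; // leave the original array untouched
  const [seg] = reordered.splice(index, 1);
  reordered.unshift(seg);
  return reordered;
}

const segments = [
  { start: 0.0, end: 14.32 },
  { start: 18.75, end: 42.1 },
  { start: 45.3, end: 62.4 },
];
// Play the closing segment first, then the opening and middle segments
const reordered = moveToFront(segments, 2);
```

The reordered array then goes into the `segments` field of `v100.videos.render(...)`, which is the same override shown in the getKeptSegments example above.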
Build transcript-based editing into your application
V100's API gives you word-level transcription, edit state management, undo/redo, EDL export, and video rendering. You build the UI. Your users get the most intuitive video editing experience available. Start a free trial and build your first transcript editor today.