Video editing is hard. Not because the creative decisions are hard — most people know what they want to keep and what they want to cut. It is hard because the tools are hard. Timeline editors require learning concepts like tracks, razor tools, ripple edits, and keyframes. Simple tasks like "remove the section where I said um for ten seconds" require scrubbing through video, finding the exact start and end points, splitting the clip, deleting the segment, and closing the gap. That workflow takes 30 seconds for an experienced editor and 5 minutes for everyone else.
Transcript-based editing eliminates the timeline entirely. Instead of scrubbing through video to find the right moment, you read the transcript. Instead of splitting clips at precise timestamps, you select and delete words. Instead of a multi-track timeline with clips, transitions, and keyframes, you have a document that you edit like text. The video follows the text.
V100 provides this capability as an API. You can embed transcript-based editing into any application — a video editing SaaS, a meeting recording platform, a podcast production tool, an e-learning authoring system. Your users get the most intuitive editing experience possible, and you do not have to build the transcription engine, the word-level alignment, the timeline management, or the video rendering pipeline.
How It Works: Word-Level Timestamps
The foundation of transcript-based editing is word-level timestamps. Every word in the transcript has a start time and an end time, precise to the millisecond. When V100 transcribes a video (using Deepgram for real-time transcription or Whisper for batch processing), the output is not just text — it is a structured document where each word is mapped to its exact position in the video timeline.
// Each word has a precise start and end timestamp (in seconds)
{
  "words": [
    { "word": "We",       "start": 0.240, "end": 0.480, "speaker": "Sarah" },
    { "word": "need",     "start": 0.480, "end": 0.720, "speaker": "Sarah" },
    { "word": "to",       "start": 0.720, "end": 0.840, "speaker": "Sarah" },
    { "word": "ship",     "start": 0.840, "end": 1.120, "speaker": "Sarah" },
    { "word": "the",      "start": 1.120, "end": 1.240, "speaker": "Sarah" },
    { "word": "update",   "start": 1.240, "end": 1.680, "speaker": "Sarah" },
    { "word": "by",       "start": 1.680, "end": 1.800, "speaker": "Sarah" },
    { "word": "Friday",   "start": 1.800, "end": 2.400, "speaker": "Sarah" },
    { "word": "um",       "start": 2.800, "end": 3.200, "speaker": "Sarah" },
    { "word": "actually", "start": 3.200, "end": 3.800, "speaker": "Sarah" },
    // ... delete "um actually" → cut 2.800-3.800s from video
  ]
}
When a user selects the word "um" in the transcript and deletes it, V100 marks the time segment from 2.800s to 3.200s as "removed." The video player skips that segment during playback. The original video file is never modified. The edit is stored as a list of kept segments — the inverse of the removed segments. This is fully non-destructive editing.
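Deriving the kept segments from the removed segments is a straightforward interval computation. A minimal sketch in plain JavaScript — this is an illustration of the idea, not V100's implementation:

```javascript
// Derive the kept segments of a video from its removed segments.
// `removed` is a list of non-overlapping { start, end } ranges in seconds.
function keptSegments(duration, removed) {
  const cuts = [...removed].sort((a, b) => a.start - b.start);
  const kept = [];
  let cursor = 0;
  for (const cut of cuts) {
    // Everything between the cursor and the next cut is kept
    if (cut.start > cursor) kept.push({ start: cursor, end: cut.start });
    cursor = Math.max(cursor, cut.end);
  }
  // The tail after the last cut is also kept
  if (cursor < duration) kept.push({ start: cursor, end: duration });
  return kept;
}

// Deleting "um actually" (2.800-3.800s) from a 10-second clip:
console.log(keptSegments(10.0, [{ start: 2.8, end: 3.8 }]));
// → [ { start: 0, end: 2.8 }, { start: 3.8, end: 10 } ]
```

Because the edit is just this list of ranges, undo is trivial: drop the cut from the removed list and recompute.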
The Feature Set
[Diagram: transcript-based editing capabilities]
The API: Building a Transcript Editor
V100's transcript editing API gives you everything you need to build a transcript-based video editor into your application. The API handles transcription with word-level timestamps, maintains the edit state, computes the kept segments, and renders the final video. You build the UI.
// 1. Upload video and get a word-level transcript
const video = await v100.videos.upload({
  file: videoFile,
  transcribe: true,
  transcription: {
    provider: "deepgram",      // or "whisper"
    word_timestamps: true,     // Required for editing
    speaker_diarization: true, // Speaker labels
  },
});

// 2. Get the transcript with word-level timestamps
const transcript = await v100.videos.transcript(video.id);
// Returns: { words: [{ word, start, end, speaker }], speakers: [...] }

// 3. Delete words by index (user selects words 45-52 in the UI)
await v100.videos.edit(video.id, {
  action: "delete",
  word_indices: [45, 46, 47, 48, 49, 50, 51, 52],
});

// 4. Get the final timeline (kept segments)
const segments = await v100.videos.getKeptSegments(video.id);
// Returns: [
//   { start: 0.000, end: 14.320 },  // Before the cut
//   { start: 18.750, end: 62.400 }, // After the cut
// ]

// 5. Undo the last edit
await v100.videos.edit(video.id, { action: "undo" });

// 6. Export an EDL for Premiere/DaVinci
const edl = await v100.videos.exportEdl(video.id, {
  format: "cmx3600", // CMX 3600 EDL (universal)
  framerate: 30,     // Project frame rate
});

// 7. Render the final edited video
const rendered = await v100.videos.render(video.id, {
  format: "mp4",
  codec: "h264",
  quality: "high",
});
Why This Is the Killer Feature for Video Editing SaaS
If you are building a video editing SaaS product, transcript-based editing is the single feature that differentiates you from FFmpeg wrappers and timeline clones. It is the feature that makes video editing accessible to non-editors. And it is the feature that users describe to their colleagues when recommending a product — "you just edit the text and the video changes."
Building this feature from scratch requires solving several hard problems simultaneously: real-time transcription with word-level accuracy (not just sentence-level), speaker diarization, edit state management with undo/redo, gap-free video rendering from non-contiguous segments, and export to professional editing formats. Each of these is a significant engineering effort. Together, they represent months of development time.
V100's API handles all of them. You make API calls. V100 returns transcript data, manages edit state, computes kept segments, and renders the final video. You build the UI — the part that differentiates your product — and V100 handles the infrastructure that powers it.
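The edit-state management mentioned above can be pictured as two stacks: applied operations and undone operations. A toy sketch of that bookkeeping — an illustration of the pattern, not V100's internal implementation:

```javascript
// Toy edit-state manager: a history of delete operations with undo/redo.
class EditState {
  constructor() {
    this.applied = [];   // operations currently in effect
    this.redoStack = []; // operations undone and available to redo
  }
  delete(wordIndices) {
    this.applied.push({ action: "delete", wordIndices });
    this.redoStack = []; // a fresh edit invalidates the redo history
  }
  undo() {
    const op = this.applied.pop();
    if (op) this.redoStack.push(op);
  }
  redo() {
    const op = this.redoStack.pop();
    if (op) this.applied.push(op);
  }
}

const state = new EditState();
state.delete([45, 46, 47]); // cut "um actually so"
state.delete([90]);         // cut a stray "uh"
state.undo();               // restore the "uh"
state.redo();               // cut it again
```

The kept segments at any point are a pure function of `applied`, which is what makes the whole scheme non-destructive.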
V100 API vs Descript: Build vs Buy
Descript is an excellent product that popularized transcript-based video editing. It is priced at $24/month (Pro) to $33/month (Business) per user and provides a complete desktop application for video editing, podcasting, and screen recording. For individuals and small teams who want a finished product, Descript is a strong choice.
V100 is not competing with Descript. V100 is the API that lets you build your own Descript — or a transcript-based editor tailored to a specific vertical that Descript does not serve. If you are building a legal video deposition platform, a corporate training authoring tool, a podcast production SaaS, or an e-learning content management system, you need transcript-based editing embedded in your application with your branding, your workflow, and your integrations. You cannot embed Descript.
V100 API vs Descript product
| Capability | V100 API | Descript |
|---|---|---|
| Transcript-based editing | Yes (API) | Yes (desktop app) |
| Embed in your app | Yes | No |
| Custom branding | Full control | Descript branding |
| Pricing model | Pay-per-use | $24-33/user/mo |
| EDL export | Yes (API) | Yes |
| Multi-platform publishing | 7 platforms, one call | Limited |
| PQ encryption | ML-KEM-768 + ML-DSA-65 | No |
| Real-time AI highlights | Claude Haiku | Post-recording only |
Use Cases: Who Builds With Transcript Editing
Legal video deposition platforms: Attorneys review depositions by reading the transcript, not watching hours of video. Click a disputed statement to jump to the video. Select and mark segments for court exhibits. Export an EDL for the legal videographer. The transcript is the natural interface for legal professionals who are already trained in transcript review.
Podcast production tools: Podcast editors spend most of their time removing filler words, long pauses, and tangential segments. With transcript editing, they read the conversation, delete the "ums" and "you knows," cut the 5-minute tangent about the host's weekend, and export a clean episode in minutes instead of hours.
Corporate training platforms: Subject matter experts record training videos and need to remove mistakes, false starts, and off-topic digressions. They are not video editors. They do not know what a razor tool is. But they can read a transcript and delete the parts that should not be in the final training module.
Meeting recording platforms: After a meeting, participants want to share the key segments — the decision about Q2 strategy, the product demo, the customer feedback. Transcript editing lets them select the relevant portions and share a clip, without scrubbing through a 45-minute recording to find the right timestamps.
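Several of these workflows reduce to programmatically selecting word indices to delete. For the podcast case, finding every filler word is a single pass over the transcript's word array. A sketch — the filler list here is illustrative, not a V100 constant:

```javascript
// Collect the indices of filler words in a word-level transcript.
const FILLERS = new Set(["um", "uh", "erm", "hmm"]);

function fillerIndices(words) {
  return words
    .map((w, i) => ({ i, text: w.word.toLowerCase().replace(/[.,!?]/g, "") }))
    .filter(({ text }) => FILLERS.has(text))
    .map(({ i }) => i);
}

const words = [
  { word: "We" }, { word: "need" }, { word: "um" },
  { word: "to" }, { word: "ship" }, { word: "Um," },
];
console.log(fillerIndices(words)); // → [2, 5]
```

The resulting array is exactly what the `word_indices` field of the delete call expects, so "remove all filler words" can be a one-click feature in your UI.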
getKeptSegments(): The Core Primitive
The fundamental output of transcript-based editing is a list of kept segments. The getKeptSegments() function returns an array of time ranges that represent the final edited video. Each segment has a start time and an end time. The segments are in chronological order and do not overlap.
// After the user deletes words at 14.32-18.75s and 42.10-45.30s
const segments = await v100.videos.getKeptSegments(videoId);
// Result: three kept segments
[
  { "start": 0.000, "end": 14.320 },  // Opening segment
  { "start": 18.750, "end": 42.100 }, // Middle segment
  { "start": 45.300, "end": 62.400 }, // Closing segment
]
// Total kept: 54.77s out of 62.40s original
// Removed: 7.63s (two cuts)

// Feed to the render endpoint for the final video
const rendered = await v100.videos.render(videoId, {
  segments: segments, // Optional: override with custom segments
  format: "mp4",
  include_captions: true,
});
This primitive is deliberately simple. An array of time ranges is the most portable and composable representation of a video edit. You can pass it to V100's render endpoint to produce a final video. You can convert it to an EDL for professional editing software. You can use it client-side to build a custom player that skips cut segments. You can serialize it to JSON and store it alongside the video as non-destructive edit metadata.
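For the client-side player case, the segment list drives a simple time mapping: a position in the edited timeline corresponds to a position in the original source video. A sketch of that mapping, independent of the V100 SDK:

```javascript
// Map a playback position in the edited timeline to a time in the
// original source video, given the kept segments in chronological order.
function editedToSourceTime(segments, editedTime) {
  let remaining = editedTime;
  for (const seg of segments) {
    const len = seg.end - seg.start;
    if (remaining <= len) return seg.start + remaining;
    remaining -= len; // skip past this segment in the edited timeline
  }
  // Past the end of the edited timeline: clamp to the last kept frame
  return segments.length ? segments[segments.length - 1].end : 0;
}

const segments = [
  { start: 0.0, end: 14.32 },
  { start: 18.75, end: 42.1 },
];
// 16s into the edited timeline is 1.68s into the second kept segment,
// i.e. roughly 20.43s in the source video
console.log(editedToSourceTime(segments, 16.0));
```

A custom player seeks the source video to `editedToSourceTime(...)` whenever playback crosses a cut, so the viewer sees a continuous edited timeline.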
Limitations
Transcript accuracy is the ceiling. If the transcription engine mis-transcribes a word, the transcript will not match the audio, and the user experience degrades. Deepgram and Whisper achieve 95-98% word accuracy on clear audio, but accuracy drops with heavy accents, overlapping speakers, background noise, or domain-specific jargon. V100 allows users to manually correct transcript text without affecting the underlying timestamps.
Cuts are at word boundaries, not frame boundaries. Because edits are defined by word timestamps, the smallest unit of editing is a word. You cannot cut at an arbitrary frame within a word. For most use cases (removing sentences, paragraphs, or filler words), word-level precision is more than sufficient. For frame-precise editing (cutting on a specific visual transition), users should export the EDL and fine-tune in a professional NLE.
No rearrangement in the current API. The current transcript editing API supports deletion and restoration but does not support reordering segments. You cannot drag paragraph 3 above paragraph 1 in the transcript and have the video rearrange accordingly. This is on the product roadmap. Workaround: use getKeptSegments() to get the timeline, reorder the segments manually, and pass the reordered array to the render endpoint.
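In concrete terms, the workaround is to fetch the kept segments, reorder the array client-side, and pass it back to the render call. A sketch, assuming the render endpoint plays segments in the order given:

```javascript
// Move one kept segment to the front of the timeline.
// Illustrative helper; assumes render() honors the order of `segments`.
function moveToFront(segments, index) {
  const reordered = [...segments]; // leave the original array untouched
  const [seg] = reordered.splice(index, 1);
  reordered.unshift(seg);
  return reordered;
}

const segments = [
  { start: 0.0, end: 14.32 },
  { start: 18.75, end: 42.1 },
  { start: 45.3, end: 62.4 },
];
// Play the closing segment first, then the opening and middle segments
const reordered = moveToFront(segments, 2);
```

The reordered array then goes into the `segments` field of `v100.videos.render(...)`, which is the same override shown in the getKeptSegments example above.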
Build transcript-based editing into your application
V100's API gives you word-level transcription, edit state management, undo/redo, EDL export, and video rendering. You build the UI. Your users get the most intuitive video editing experience available. Start a free trial and build your first transcript editor today.