~400 Lines of Code · 9 Features Built · SFU Multi-Party Architecture · Free Tier — 100 API Calls/mo

What You Will Build

By the end of this tutorial, you will have a fully functional Zoom-like application built with React and Vite. The app supports everything you would expect from a modern video conferencing tool: multi-participant video with a dynamic tile grid, screen sharing, real-time in-meeting chat, cloud recording, live transcription with caption overlays, and AI-generated post-meeting summaries with action items. Every feature is powered by V100's API — you write the frontend, V100 handles the infrastructure.

Multi-participant video (up to 50)
Screen sharing with replaceTrack
Real-time chat over WebSocket
Cloud recording to S3
Live transcription (40+ languages)
Active speaker detection
AI meeting summaries + action items
Post-quantum encrypted signaling
Architecture — SFU Topology
Participant A        V100 SFU Server        Participant B
    |                       |                       |
    |--- media track ------->|                       |
    |                       |--- media track ------->|
    |                       |                       |
    |<--- media track -------|                       |
    |                       |<--- media track -------|
    |                       |                       |
Participant C          (routes all streams)   Participant D
    |                       |                       |
    |--- media track ------->|--- media track ------->|
    |<--- media tracks ------|<--- media tracks ------|
    |                       |                       |
    Each participant sends 1 stream, receives N-1 streams
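
The note under the diagram is easy to sanity-check in a few lines of plain JavaScript. This sketch (the function names are ours, not part of any API) compares per-participant stream counts in a full mesh versus an SFU:

```javascript
// Per-participant stream counts in an n-person room.
// Full mesh: you upload one copy of your media to each of the other
// n-1 peers and receive one stream from each of them.
const mesh = (n) => ({ up: n - 1, down: n - 1 });

// SFU: you upload exactly one stream to the server, which fans it out;
// you still receive n-1 downstream tracks, one per other participant.
const sfu = (n) => ({ up: 1, down: n - 1 });

// In a 10-person call, mesh uploads 9 copies of your video; SFU uploads 1.
console.log(mesh(10)); // { up: 9, down: 9 }
console.log(sfu(10));  // { up: 1, down: 9 }
```

Upload bandwidth is the scarce resource on most consumer connections, which is why the SFU's constant single upstream matters more than the identical download count.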
            

Prerequisites

Node.js and npm installed
A V100 API key (the free tier covers everything in this tutorial)
Working knowledge of React hooks and modern JavaScript
Step 1 — Project Setup

Scaffold a new React project with Vite. We will keep the dependency footprint minimal — just React and the browser's native WebRTC APIs. No SDK installation required. V100 uses standard REST and WebSocket endpoints.

Terminal
npm create vite@latest zoom-clone -- --template react
cd zoom-clone
npm install
npm run dev

Create a .env file in the project root with your API key. Vite exposes environment variables prefixed with VITE_ to the client.

.env
VITE_V100_API_KEY=v100_sk_your_api_key_here

Production note: Never expose API keys in client-side code in production. Create meetings from your backend server and pass only the short-lived token to the browser. This tutorial uses VITE_ variables for simplicity during development. See the server-side guide.

Step 2 — Create a Meeting Room

Every video call starts with creating a meeting. The POST /api/meetings endpoint returns a meetingId, a short-lived token for WebSocket authentication, and ICE server credentials. We will wrap this in a custom hook that the rest of the app consumes.

src/hooks/useMeeting.js
import { useState, useCallback } from 'react';

const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;

export function useMeeting() {
  const [meeting, setMeeting] = useState(null);
  const [loading, setLoading] = useState(false);

  const createMeeting = useCallback(async (title = 'Zoom Clone Meeting') => {
    setLoading(true);
    const res = await fetch(`${API}/api/meetings`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        title,
        settings: {
          maxParticipants: 50,
          topology: 'sfu', // automatic SFU for 3+ participants
          recording: { autoStart: false },
          transcription: {
            enabled: true,
            language: 'en',
            showCaptions: true,
          },
        },
      }),
    }).then(r => r.json());
    // res = { meetingId, token, iceServers, joinUrl }
    setMeeting(res);
    setLoading(false);
    return res;
  }, []);

  const joinMeeting = useCallback(async (meetingId) => {
    setLoading(true);
    const res = await fetch(`${API}/api/meetings/${meetingId}/join`, {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${KEY}` },
    }).then(r => r.json());
    setMeeting(res);
    setLoading(false);
    return res;
  }, []);

  return { meeting, loading, createMeeting, joinMeeting };
}

The response from POST /api/meetings looks like this:

Response — 201 Created
{
  "meetingId": "mtg_a1b2c3d4e5f6",
  "token": "eyJhbGciOiJFZDI1NTE5...",
  "joinUrl": "https://api.v100.ai/join/mtg_a1b2c3d4e5f6",
  "iceServers": [
    { "urls": "stun:stun.v100.ai:3478" },
    {
      "urls": "turn:turn.v100.ai:443?transport=tcp",
      "username": "auto_generated",
      "credential": "short_lived_credential"
    }
  ]
}

Step 3 — Join with WebRTC

Once you have a meeting, the next step is connecting the user's camera and microphone, opening a WebSocket to the signaling server, and establishing peer connections. V100 handles the SFU routing automatically — each participant sends one upstream track and receives downstream tracks from every other participant.

src/hooks/useWebRTC.js
import { useRef, useState, useCallback } from 'react';

const WS_URL = 'wss://api.v100.ai/ws/signaling';

export function useWebRTC(meeting) {
  const [participants, setParticipants] = useState(new Map());
  const [localStream, setLocalStream] = useState(null);
  const [activeSpeaker, setActiveSpeaker] = useState(null);
  const wsRef = useRef(null);
  const pcsRef = useRef(new Map());

  const connect = useCallback(async () => {
    if (!meeting) return;

    // Capture local camera + mic
    const stream = await navigator.mediaDevices.getUserMedia({
      video: { width: 1280, height: 720, frameRate: 30 },
      audio: { echoCancellation: true, noiseSuppression: true },
    });
    setLocalStream(stream);

    // Open signaling WebSocket
    const ws = new WebSocket(
      `${WS_URL}?token=${meeting.token}&meetingId=${meeting.meetingId}`
    );
    wsRef.current = ws;

    ws.onmessage = async (event) => {
      const msg = JSON.parse(event.data);

      if (msg.type === 'peer-joined') {
        const pc = createPeerConnection(msg.peerId, stream, ws);
        const offer = await pc.createOffer();
        await pc.setLocalDescription(offer);
        ws.send(JSON.stringify({ type: 'offer', sdp: offer, to: msg.peerId }));
      }

      if (msg.type === 'offer') {
        const pc = createPeerConnection(msg.from, stream, ws);
        await pc.setRemoteDescription(msg.sdp);
        const answer = await pc.createAnswer();
        await pc.setLocalDescription(answer);
        ws.send(JSON.stringify({ type: 'answer', sdp: answer, to: msg.from }));
      }

      if (msg.type === 'answer') {
        const pc = pcsRef.current.get(msg.from);
        await pc?.setRemoteDescription(msg.sdp);
      }

      if (msg.type === 'ice-candidate') {
        const pc = pcsRef.current.get(msg.from);
        await pc?.addIceCandidate(msg.candidate);
      }

      if (msg.type === 'active-speaker') {
        setActiveSpeaker(msg.peerId);
      }

      if (msg.type === 'peer-left') {
        pcsRef.current.get(msg.peerId)?.close();
        pcsRef.current.delete(msg.peerId);
        setParticipants(prev => {
          const next = new Map(prev);
          next.delete(msg.peerId);
          return next;
        });
      }
    };
  }, [meeting]);

  function createPeerConnection(peerId, stream, ws) {
    const pc = new RTCPeerConnection({ iceServers: meeting.iceServers });
    pcsRef.current.set(peerId, pc);
    stream.getTracks().forEach(t => pc.addTrack(t, stream));

    pc.ontrack = (e) => {
      setParticipants(prev => {
        const next = new Map(prev);
        next.set(peerId, e.streams[0]);
        return next;
      });
    };

    pc.onicecandidate = (e) => {
      if (e.candidate) {
        ws.send(JSON.stringify({
          type: 'ice-candidate',
          candidate: e.candidate,
          to: peerId,
        }));
      }
    };

    return pc;
  }

  return { localStream, participants, activeSpeaker, connect, wsRef };
}

When a new participant joins, V100's signaling server sends a peer-joined event to everyone in the room. Each existing peer creates a new RTCPeerConnection and initiates the SDP handshake. For rooms with three or more participants, V100 automatically routes traffic through its SFU — each person uploads a single stream and receives separate downstream tracks, so per-client upload bandwidth stays constant and the total stream count grows linearly with room size, rather than quadratically as it would in a full mesh.

Step 4 — Participant Grid with Active Speaker

Zoom's signature tile layout resizes dynamically based on participant count. The ParticipantGrid component below calculates columns based on the number of streams and highlights the active speaker with a colored border.

src/components/ParticipantGrid.jsx
import { useRef, useEffect } from 'react';

function VideoTile({ stream, label, isActive, muted }) {
  const ref = useRef(null);

  useEffect(() => {
    if (ref.current && stream) ref.current.srcObject = stream;
  }, [stream]);

  return (
    <div style={{
      position: 'relative',
      borderRadius: 12,
      overflow: 'hidden',
      border: isActive ? '2px solid #4ade80' : '2px solid #252525',
      background: '#111',
    }}>
      <video
        ref={ref}
        autoPlay
        playsInline
        muted={muted}
        style={{ width: '100%', display: 'block' }}
      />
      <span style={{
        position: 'absolute',
        bottom: 8,
        left: 12,
        background: 'rgba(0,0,0,0.7)',
        color: '#fff',
        padding: '4px 10px',
        borderRadius: 6,
        fontSize: 12,
      }}>{label}</span>
    </div>
  );
}

export function ParticipantGrid({ localStream, participants, activeSpeaker }) {
  const count = participants.size + 1;
  const cols = count <= 1 ? 1 : count <= 4 ? 2 : count <= 9 ? 3 : 4;

  return (
    <div style={{
      display: 'grid',
      gridTemplateColumns: `repeat(${cols}, 1fr)`,
      gap: 8,
      padding: 16,
    }}>
      <VideoTile
        stream={localStream}
        label="You"
        isActive={activeSpeaker === 'local'}
        muted
      />
      {[...participants.entries()].map(([id, stream]) => (
        <VideoTile
          key={id}
          stream={stream}
          label={`Participant ${id.slice(0, 6)}`}
          isActive={activeSpeaker === id}
          muted={false}
        />
      ))}
    </div>
  );
}

V100 sends active-speaker events over the WebSocket whenever the dominant audio source changes. The grid highlights that participant's tile with a green border, exactly like Zoom does. No client-side audio analysis needed — the SFU computes audio levels server-side.

Step 5 — Screen Sharing

Screen sharing uses the browser's getDisplayMedia API to capture the user's screen and replaceTrack to swap the camera feed for the screen feed on the existing peer connection. No new signaling round is needed.

src/hooks/useScreenShare.js
import { useState, useCallback } from 'react';

export function useScreenShare(localStream, pcsRef) {
  const [sharing, setSharing] = useState(false);

  // Restore the camera feed on every peer connection.
  const stopScreenShare = useCallback(async () => {
    const camTrack = localStream.getVideoTracks()[0];
    for (const pc of pcsRef.current.values()) {
      const sender = pc.getSenders().find(
        s => s.track?.kind === 'video'
      );
      await sender?.replaceTrack(camTrack);
    }
    setSharing(false);
  }, [localStream, pcsRef]);

  const toggleScreenShare = useCallback(async () => {
    if (sharing) {
      await stopScreenShare();
      return;
    }

    // Capture screen
    const screenStream = await navigator.mediaDevices.getDisplayMedia({
      video: { width: 1920, height: 1080, frameRate: 30 },
    });
    const screenTrack = screenStream.getVideoTracks()[0];

    // Replace camera track on all peer connections
    for (const pc of pcsRef.current.values()) {
      const sender = pc.getSenders().find(
        s => s.track?.kind === 'video'
      );
      await sender?.replaceTrack(screenTrack);
    }

    // Handle the user clicking the browser's "Stop sharing" button.
    // Call stopScreenShare directly: calling toggleScreenShare here
    // would capture a stale `sharing` value (still false at this point).
    screenTrack.onended = () => stopScreenShare();
    setSharing(true);
  }, [sharing, stopScreenShare, pcsRef]);

  return { sharing, toggleScreenShare };
}

The key insight is replaceTrack — it swaps the video track on the existing RTCPeerConnection without renegotiating the SDP. All other participants see the screen share instantly with zero downtime. When the user clicks the browser's native "Stop sharing" button, the onended callback fires and the camera feed is restored automatically.

Step 6 — In-Meeting Chat

Chat messages travel over the same WebSocket connection used for signaling. V100 routes chat messages to all participants in the meeting. No separate chat server needed.

src/components/ChatPanel.jsx
import { useState, useEffect, useRef } from 'react';

export function ChatPanel({ wsRef }) {
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState('');
  const bottomRef = useRef(null);

  useEffect(() => {
    const ws = wsRef.current;
    if (!ws) return;
    const handler = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'chat') {
        setMessages(prev => [...prev, {
          from: msg.from,
          text: msg.text,
          time: new Date().toLocaleTimeString(),
        }]);
        bottomRef.current?.scrollIntoView({ behavior: 'smooth' });
      }
    };
    ws.addEventListener('message', handler);
    return () => ws.removeEventListener('message', handler);
  }, [wsRef]);

  const send = () => {
    if (!input.trim()) return;
    wsRef.current?.send(JSON.stringify({ type: 'chat', text: input }));
    setInput('');
  };

  return (
    <div style={{ width: 300, display: 'flex', flexDirection: 'column', height: '100%' }}>
      <div style={{ flex: 1, overflowY: 'auto', padding: 12 }}>
        {messages.map((m, i) => (
          <div key={i} style={{ marginBottom: 10 }}>
            <strong>{m.from}</strong>
            <span style={{ color: '#666', fontSize: 11 }}> {m.time}</span>
            <p style={{ margin: '4px 0 0' }}>{m.text}</p>
          </div>
        ))}
        <div ref={bottomRef} />
      </div>
      <div style={{ display: 'flex', gap: 8, padding: 12 }}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          onKeyDown={e => e.key === 'Enter' && send()}
          placeholder="Type a message..."
          style={{ flex: 1, padding: '8px 12px', borderRadius: 8 }}
        />
        <button onClick={send}>Send</button>
      </div>
    </div>
  );
}

Chat messages persist server-side. V100 stores chat history for the duration of the meeting. When a new participant joins mid-call, they receive the full chat history in a chat-history event on WebSocket connect, so they never miss context.
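
A minimal sketch of handling that event, assuming the chat-history payload carries a messages array of { from, text, ts } objects — the exact field names are an assumption, so check V100's WebSocket reference:

```javascript
// Merge a chat-history event into the ChatPanel's existing message state.
// Assumed payload shape: { type: 'chat-history', messages: [{ from, text, ts }] }
function mergeChatHistory(existing, event) {
  if (event.type !== 'chat-history') return existing;
  const history = (event.messages || []).map(m => ({
    from: m.from,
    text: m.text,
    time: new Date(m.ts).toLocaleTimeString(),
  }));
  // History happened before anything received live, so it goes first.
  return [...history, ...existing];
}
```

Inside the ChatPanel's message handler, add a branch for msg.type === 'chat-history' that calls setMessages(prev => mergeChatHistory(prev, msg)).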

Step 7 — Recording

Cloud recording is a single API call. V100's SFU captures all participant streams server-side, composites them into a grid layout, and encodes to MP4. No client-side recording means zero performance impact on participants.

src/hooks/useRecording.js
import { useState, useCallback } from 'react';

const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;

export function useRecording(meetingId) {
  const [recording, setRecording] = useState(false);

  const startRecording = useCallback(async () => {
    await fetch(`${API}/api/meetings/${meetingId}/recording/start`, {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${KEY}` },
    });
    setRecording(true);
  }, [meetingId]);

  const stopRecording = useCallback(async () => {
    const res = await fetch(
      `${API}/api/meetings/${meetingId}/recording/stop`,
      {
        method: 'POST',
        headers: { 'Authorization': `Bearer ${KEY}` },
      }
    ).then(r => r.json());
    setRecording(false);
    return res; // res = { recordingUrl, transcriptUrl, duration }
  }, [meetingId]);

  return { recording, startRecording, stopRecording };
}

The stopRecording response includes a recordingUrl (signed S3 URL for the MP4), a transcriptUrl (if transcription was enabled), and the total duration in seconds. For long meetings, V100 processes the recording asynchronously and sends a webhook when it is ready.
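
On the backend, that webhook needs parsing before you can store or share the recording. The sketch below makes assumptions this tutorial does not confirm — an event name of recording.ready and a flat JSON body with meetingId, recordingUrl, and duration — so consult V100's webhook documentation for the real payload:

```javascript
// Pull the fields we care about out of an assumed recording-ready webhook.
function parseRecordingWebhook(body) {
  const payload = typeof body === 'string' ? JSON.parse(body) : body;
  if (payload.event !== 'recording.ready') return null; // ignore other events
  const { meetingId, recordingUrl, duration } = payload;
  if (!meetingId || !recordingUrl) {
    throw new Error('malformed recording webhook');
  }
  return { meetingId, recordingUrl, duration: duration ?? 0 };
}
```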

Step 8 — Live Transcription

Since we enabled transcription in the meeting settings in Step 2, live captions arrive automatically over the WebSocket. The CaptionOverlay component renders them as a floating bar at the bottom of the video grid, styled exactly like Zoom's closed captions.

src/components/CaptionOverlay.jsx
import { useState, useEffect, useRef } from 'react';

export function CaptionOverlay({ wsRef }) {
  const [caption, setCaption] = useState('');
  const [speaker, setSpeaker] = useState('');
  const timerRef = useRef(null);

  useEffect(() => {
    const ws = wsRef.current;
    if (!ws) return;
    const handler = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'transcription') {
        setSpeaker(msg.speaker || 'Unknown');
        setCaption(msg.text);
        // Reset the hide timer so a fresh caption isn't cleared
        // early by the previous caption's timeout.
        clearTimeout(timerRef.current);
        timerRef.current = setTimeout(() => setCaption(''), 5000);
      }
    };
    ws.addEventListener('message', handler);
    return () => {
      ws.removeEventListener('message', handler);
      clearTimeout(timerRef.current);
    };
  }, [wsRef]);

  if (!caption) return null;

  return (
    <div style={{
      position: 'fixed',
      bottom: 100,
      left: '50%',
      transform: 'translateX(-50%)',
      background: 'rgba(0,0,0,0.85)',
      color: '#fff',
      padding: '10px 24px',
      borderRadius: 10,
      maxWidth: '70%',
      textAlign: 'center',
      fontSize: 15,
      lineHeight: 1.5,
      zIndex: 50,
    }}>
      <strong style={{ color: '#818cf8' }}>{speaker}:</strong> {caption}
    </div>
  );
}

Each transcription event includes the speaker identifier (mapped to the participant's display name), the transcribed text, a timestamp, and a confidence score. V100 uses server-side speech-to-text with speaker diarization, so captions are attributed to the correct participant automatically. Supported languages include English, Spanish, French, German, Japanese, Mandarin, Portuguese, and 30+ more.
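
Since each event carries a confidence score, it is worth filtering weak results before they reach the overlay. A small sketch — the 0.6 threshold is our choice, not a V100 recommendation, and the field names follow the event shape described above:

```javascript
// Decide whether a transcription event is worth showing as a caption.
// Assumed event shape: { type, speaker, text, timestamp, confidence }
function shouldShowCaption(msg, minConfidence = 0.6) {
  if (msg.type !== 'transcription') return false;
  if (!msg.text || !msg.text.trim()) return false;
  // Treat a missing confidence field as fully confident.
  return (msg.confidence ?? 1) >= minConfidence;
}
```

In the CaptionOverlay handler, gate the setSpeaker/setCaption calls on shouldShowCaption(msg).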

Step 9 — Post-Meeting Summary

After a meeting ends, you can request an AI-generated summary from the transcript. The summary includes key discussion points, decisions made, and action items assigned to specific participants. This is a single API call.

Post-meeting AI summary
const KEY = import.meta.env.VITE_V100_API_KEY;

const getMeetingSummary = async (meetingId) => {
  const summary = await fetch(
    `https://api.v100.ai/api/meetings/${meetingId}/summary`,
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        format: 'structured', // 'structured' | 'prose' | 'bullets'
        includeActionItems: true,
        includeDecisions: true,
      }),
    }
  ).then(r => r.json());

  return summary;
  // {
  //   title: "Sprint Planning — March 28",
  //   keyPoints: ["Discussed Q2 roadmap priorities...", ...],
  //   decisions: ["Move launch to April 15", ...],
  //   actionItems: [
  //     { assignee: "Alice", task: "Draft PRD by Friday", due: "2026-04-01" },
  //     { assignee: "Bob", task: "Set up staging env", due: "2026-03-31" },
  //   ],
  //   transcript: "https://storage.v100.ai/transcripts/mtg_a1b2c3..."
  // }
};

The structured format returns a JSON object with keyPoints, decisions, and actionItems arrays. Each action item includes the assignee (identified from the transcript's speaker diarization), the task description, and an optional due date extracted from conversational context. You can integrate this directly into your project management tool, send it via email, or display it in a post-call review screen.
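
For example, to hand the action items to anything that speaks Markdown (a Slack message, a GitHub issue, an email body), a tiny formatter over the actionItems array shown above is enough:

```javascript
// Render the summary's actionItems array as a Markdown task list.
function actionItemsToMarkdown(actionItems) {
  return actionItems
    .map(item => {
      const due = item.due ? ` (due ${item.due})` : '';
      return `- [ ] **${item.assignee}**: ${item.task}${due}`;
    })
    .join('\n');
}
```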

Summaries work retroactively. You can request a summary for any past meeting that had transcription enabled, even if the meeting ended hours or days ago. The transcript is stored on V100's servers for 30 days on the free tier and 90 days on Pro.

V100 vs Building from Scratch

Every feature in this tutorial is something you could build yourself. WebRTC is an open standard. SFUs like Janus and mediasoup are open-source. Whisper handles transcription. Here is what that project actually looks like in practice:

Feature | Build from Scratch | V100 API
Time to first video call | 3–6 months | 1 afternoon
SFU server | Deploy mediasoup/Janus, handle scaling, monitor | Managed (auto-scales to 50 participants)
TURN relay | Deploy coturn, configure TLS, geographic distribution | Included (RustTURN, global PoPs)
Screen sharing | Build track replacement logic, handle browser quirks | replaceTrack + SFU handles routing
Recording | FFmpeg composite pipeline, S3, transcoding queue | One API call, server-side capture
Live transcription | Whisper deployment, GPU infra, speaker diarization | One config flag in meeting settings
AI meeting summaries | LLM integration, prompt engineering, structured output | One POST call after meeting ends
Active speaker detection | Client-side audio analysis, heuristics | Server-side, delivered via WebSocket
Chat | Separate WebSocket server, persistence, history | Same signaling WebSocket, auto-persisted
Infrastructure cost | $2,000–$8,000+/mo for SFU + TURN + GPU | Free tier, then $0.004/participant-minute
Engineering time | 6 months, team of 3–5 | 2 days, solo developer

The reality of building a Zoom clone from scratch is not the WebRTC code — it is the infrastructure. TURN servers for NAT traversal. SFU servers that scale. Recording pipelines. Transcription models running on GPUs. Each of these is its own multi-month project. V100 collapses all of that into API calls so you can focus on your product.

Pricing

Everything in this tutorial works on V100's free tier, which includes 100 API calls per month. That is enough to build and test your entire Zoom clone without entering a credit card.

See the pricing page for full details.

Ship Your Zoom Clone This Weekend

Get your free API key and start building. Multi-participant video, screen sharing, chat, recording, and AI summaries — all from one API.

Get Your Free API Key