What You Will Build
By the end of this tutorial, you will have a fully functional Zoom-like application built with React and Vite. The app supports everything you would expect from a modern video conferencing tool: multi-participant video with a dynamic tile grid, screen sharing, real-time in-meeting chat, cloud recording, live transcription with caption overlays, and AI-generated post-meeting summaries with action items. Every feature is powered by V100's API — you write the frontend, V100 handles the infrastructure.
✓ Multi-participant video (up to 50)
✓ Screen sharing with replaceTrack
✓ Real-time chat over WebSocket
✓ Cloud recording to S3
✓ Live transcription (40+ languages)
✓ Active speaker detection
✓ AI meeting summaries + action items
✓ Post-quantum encrypted signaling
Architecture — SFU Topology
Participant A           V100 SFU Server          Participant B
     |                        |                        |
     |--- media track ------->|                        |
     |                        |--- media track ------->|
     |                        |                        |
     |<--- media track -------|                        |
     |                        |<--- media track -------|
     |                        |                        |
Participant C        (routes all streams)       Participant D
     |                        |                        |
     |--- media track ------->|--- media track ------->|
     |<--- media tracks ------|<--- media tracks ------|
     |                        |                        |

Each participant sends 1 stream and receives N-1 streams.
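That last line is worth quantifying. The arithmetic below (plain JavaScript, no V100 API involved) compares a peer-to-peer mesh, where every participant uploads a separate copy to every other participant, against the SFU topology in the diagram:

```javascript
// Stream counts for an N-person call, mesh vs. SFU.
// Mesh: every peer encodes and uploads one copy per other peer.
// SFU: every peer uploads once; the server fans out the copies.
const meshStreamsPerPeer = (n) => 2 * (n - 1); // send n-1, receive n-1
const sfuStreamsPerPeer = (n) => n;            // send 1, receive n-1
const meshTotalUplinks = (n) => n * (n - 1);   // grows quadratically
const sfuTotalUplinks = (n) => n;              // grows linearly

// At 10 participants a mesh needs 90 uplink streams across the room;
// the SFU needs 10.
```

At V100's 50-participant ceiling the gap is 2,450 mesh uplink streams versus 50, which is why mesh topologies tend to fall apart at around 4 to 6 participants while an SFU scales linearly.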
Prerequisites
- Node.js 18+ and npm
- React 18+ with Vite (this tutorial uses Vite, but the code works with any bundler)
- A V100 API key — free tier gives you 100 API calls per month, no credit card required
Step 1 — Project Setup
Scaffold a new React project with Vite. We will keep the dependency footprint minimal — just React and the browser's native WebRTC APIs. No SDK installation required. V100 uses standard REST and WebSocket endpoints.
Terminal
npm create vite@latest zoom-clone -- --template react
cd zoom-clone
npm install
npm run dev
Create a .env file in the project root with your API key. Vite exposes environment variables prefixed with VITE_ to the client.
.env
VITE_V100_API_KEY=v100_sk_your_api_key_here
Production note: Never expose API keys in client-side code in production. Create meetings from your backend server and pass only the short-lived token to the browser. This tutorial uses VITE_ variables for simplicity during development. See the server-side guide.
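For reference, the server-side flow that note describes can be sketched in a few lines. This is an illustration, not part of the tutorial's frontend code: `createMeetingForClient` is a hypothetical helper name, the injectable `fetchImpl` exists only to make the sketch testable, and the `meetingId`/`token`/`iceServers` fields match the POST /api/meetings response shown in Step 2.

```javascript
// Hypothetical server-side helper: create the meeting with the secret key,
// then hand the browser only what it needs to join.
async function createMeetingForClient(apiKey, title, fetchImpl = fetch) {
  const res = await fetchImpl('https://api.v100.ai/api/meetings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ title, settings: { topology: 'sfu' } }),
  });
  if (!res.ok) throw new Error(`V100 returned ${res.status}`);
  const { meetingId, token, iceServers } = await res.json();
  // Return only the short-lived credentials; the API key never leaves the server.
  return { meetingId, token, iceServers };
}
```

Expose this behind your own authenticated route (an Express handler, a Next.js API route, a Lambda); the browser then joins with the returned token exactly as in Step 3.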
Step 2 — Create a Meeting Room
Every video call starts with creating a meeting. The POST /api/meetings endpoint returns a meetingId, a short-lived token for WebSocket authentication, and ICE server credentials. We will wrap this in a custom hook that the rest of the app consumes.
src/hooks/useMeeting.js
import { useState, useCallback } from 'react';
const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;
export function useMeeting() {
const [meeting, setMeeting] = useState(null);
const [loading, setLoading] = useState(false);
const createMeeting = useCallback(async (title = 'Zoom Clone Meeting') => {
setLoading(true);
const res = await fetch(`${API}/api/meetings`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
title,
settings: {
maxParticipants: 50,
topology: 'sfu',
recording: { autoStart: false },
transcription: {
enabled: true,
language: 'en',
showCaptions: true,
},
},
}),
}).then(r => r.json());
setMeeting(res);
setLoading(false);
return res;
}, []);
const joinMeeting = useCallback(async (meetingId) => {
setLoading(true);
const res = await fetch(`${API}/api/meetings/${meetingId}/join`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${KEY}` },
}).then(r => r.json());
setMeeting(res);
setLoading(false);
return res;
}, []);
return { meeting, loading, createMeeting, joinMeeting };
}
The response from POST /api/meetings looks like this:
Response — 201 Created
{
"meetingId": "mtg_a1b2c3d4e5f6",
"token": "eyJhbGciOiJFZDI1NTE5...",
"joinUrl": "https://api.v100.ai/join/mtg_a1b2c3d4e5f6",
"iceServers": [
{ "urls": "stun:stun.v100.ai:3478" },
{
"urls": "turn:turn.v100.ai:443?transport=tcp",
"username": "auto_generated",
"credential": "short_lived_credential"
}
]
}
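The hook keeps error handling minimal for readability: `.then(r => r.json())` will happily parse an error body and never checks `res.ok`. Production code will want to surface failures; one way is a small wrapper (a sketch, and `v100Fetch` is not part of the V100 API) that centralizes the check:

```javascript
// Minimal fetch wrapper: prefix the base URL and surface non-2xx
// responses as thrown errors instead of silently parsed JSON.
async function v100Fetch(path, options = {}, fetchImpl = fetch) {
  const res = await fetchImpl(`https://api.v100.ai${path}`, options);
  if (!res.ok) {
    const body = await res.text();
    throw new Error(`V100 ${res.status}: ${body}`);
  }
  return res.json();
}
```

With this in place, `createMeeting` and `joinMeeting` can also move their `setLoading(false)` calls into a `try`/`finally` so a failed request does not leave the UI stuck in a loading state.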
Step 3 — Join with WebRTC
Once you have a meeting, the next step is connecting the user's camera and microphone, opening a WebSocket to the signaling server, and establishing peer connections. V100 handles the SFU routing automatically — each participant sends one upstream track and receives downstream tracks from every other participant.
src/hooks/useWebRTC.js
import { useRef, useState, useCallback } from 'react';
const WS_URL = 'wss://api.v100.ai/ws/signaling';
export function useWebRTC(meeting) {
const [participants, setParticipants] = useState(new Map());
const [localStream, setLocalStream] = useState(null);
const [activeSpeaker, setActiveSpeaker] = useState(null);
const wsRef = useRef(null);
const pcsRef = useRef(new Map());
const connect = useCallback(async () => {
if (!meeting) return;
const stream = await navigator.mediaDevices.getUserMedia({
video: { width: 1280, height: 720, frameRate: 30 },
audio: { echoCancellation: true, noiseSuppression: true },
});
setLocalStream(stream);
const ws = new WebSocket(
`${WS_URL}?token=${meeting.token}&meetingId=${meeting.meetingId}`
);
wsRef.current = ws;
ws.onmessage = async (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'peer-joined') {
const pc = createPeerConnection(msg.peerId, stream, ws);
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
ws.send(JSON.stringify({
type: 'offer', sdp: offer, to: msg.peerId,
}));
}
if (msg.type === 'offer') {
const pc = createPeerConnection(msg.from, stream, ws);
await pc.setRemoteDescription(msg.sdp);
const answer = await pc.createAnswer();
await pc.setLocalDescription(answer);
ws.send(JSON.stringify({
type: 'answer', sdp: answer, to: msg.from,
}));
}
if (msg.type === 'answer') {
const pc = pcsRef.current.get(msg.from);
await pc?.setRemoteDescription(msg.sdp);
}
if (msg.type === 'ice-candidate') {
const pc = pcsRef.current.get(msg.from);
await pc?.addIceCandidate(msg.candidate);
}
if (msg.type === 'active-speaker') {
setActiveSpeaker(msg.peerId);
}
if (msg.type === 'peer-left') {
pcsRef.current.get(msg.peerId)?.close();
pcsRef.current.delete(msg.peerId);
setParticipants(prev => {
const next = new Map(prev);
next.delete(msg.peerId);
return next;
});
}
};
}, [meeting]);
function createPeerConnection(peerId, stream, ws) {
const pc = new RTCPeerConnection({
iceServers: meeting.iceServers,
});
pcsRef.current.set(peerId, pc);
stream.getTracks().forEach(t => pc.addTrack(t, stream));
pc.ontrack = (e) => {
setParticipants(prev => {
const next = new Map(prev);
next.set(peerId, e.streams[0]);
return next;
});
};
pc.onicecandidate = (e) => {
if (e.candidate) {
ws.send(JSON.stringify({
type: 'ice-candidate',
candidate: e.candidate,
to: peerId,
}));
}
};
return pc;
}
return { localStream, participants, activeSpeaker, connect, wsRef };
}
When a new participant joins, V100's signaling server sends a peer-joined event to everyone in the room. Each existing peer creates a new RTCPeerConnection and initiates the SDP handshake. For rooms with three or more participants, V100 automatically routes traffic through its SFU — each person sends one upstream track and receives separate downstream tracks, keeping bandwidth usage linear instead of quadratic.
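One race the handler above glosses over: an ice-candidate message can arrive before setRemoteDescription has run for that peer, and addIceCandidate rejects when the connection has no remote description yet. Buffering candidates until the description is set is a standard WebRTC pattern, sketched here independent of any V100 specifics:

```javascript
// Buffer ICE candidates that arrive before the remote description is set,
// then flush them once the SDP handshake for that peer completes.
class CandidateQueue {
  constructor() {
    this.pending = [];
    this.ready = false;
  }
  // Call on every incoming 'ice-candidate' message.
  async add(pc, candidate) {
    if (this.ready) return pc.addIceCandidate(candidate);
    this.pending.push(candidate);
  }
  // Call right after pc.setRemoteDescription(...) resolves.
  async flush(pc) {
    this.ready = true;
    for (const candidate of this.pending.splice(0)) {
      await pc.addIceCandidate(candidate);
    }
  }
}
```

Keep one queue per peer (for example in a `Map` next to `pcsRef`), route `ice-candidate` messages through `add`, and call `flush` in the `offer` and `answer` branches once the remote description is applied.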
Step 4 — Participant Grid with Active Speaker
Zoom's signature tile layout resizes dynamically based on participant count. The ParticipantGrid component below calculates columns based on the number of streams and highlights the active speaker with a colored border.
src/components/ParticipantGrid.jsx
import { useRef, useEffect } from 'react';
function VideoTile({ stream, label, isActive, muted }) {
const ref = useRef(null);
useEffect(() => {
if (ref.current && stream) ref.current.srcObject = stream;
}, [stream]);
return (
<div style={{
position: 'relative',
borderRadius: 12,
overflow: 'hidden',
border: isActive ? '2px solid #4ade80' : '2px solid #252525',
background: '#111',
}}>
<video ref={ref} autoPlay playsInline muted={muted}
style={{ width: '100%', display: 'block' }}
/>
<span style={{
position: 'absolute', bottom: 8, left: 12,
background: 'rgba(0,0,0,0.7)', color: '#fff',
padding: '4px 10px', borderRadius: 6, fontSize: 12,
}}>{label}</span>
</div>
);
}
export function ParticipantGrid({ localStream, participants, activeSpeaker }) {
const count = participants.size + 1;
const cols = count <= 1 ? 1 : count <= 4 ? 2 : count <= 9 ? 3 : 4;
return (
<div style={{
display: 'grid',
gridTemplateColumns: `repeat(${cols}, 1fr)`,
gap: 8, padding: 16,
}}>
<VideoTile
stream={localStream}
label="You"
isActive={activeSpeaker === 'local'}
muted
/>
{[...participants.entries()].map(([id, stream]) => (
<VideoTile
key={id}
stream={stream}
label={`Participant ${id.slice(0, 6)}`}
isActive={activeSpeaker === id}
muted={false}
/>
))}
</div>
);
}
V100 sends active-speaker events over the WebSocket whenever the dominant audio source changes. The grid highlights that participant's tile with a green border, exactly like Zoom does. No client-side audio analysis needed — the SFU computes audio levels server-side.
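The grid shows video, but every conferencing UI also needs mute and camera-off buttons, which the tutorial does not wire up. The standard approach is the track's `enabled` flag: it sends silence or black frames without stopping the track, so the sender stays alive and no renegotiation is needed. A minimal helper, written against the plain `getAudioTracks`/`getVideoTracks` shape so it can be exercised outside a browser:

```javascript
// Enable or disable all tracks of one kind on a stream.
// enabled = false transmits silence/black frames; the track keeps running,
// so flipping back on is instant and needs no new getUserMedia call.
function setTrackEnabled(stream, kind, enabled) {
  const tracks =
    kind === 'audio' ? stream.getAudioTracks() : stream.getVideoTracks();
  tracks.forEach((track) => {
    track.enabled = enabled;
  });
  return enabled;
}
```

Wire it to toolbar buttons with `setTrackEnabled(localStream, 'audio', false)` to mute and `true` to unmute, and likewise for `'video'`.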
Step 5 — Screen Sharing
Screen sharing uses the browser's getDisplayMedia API to capture the user's screen and replaceTrack to swap the camera feed for the screen feed on the existing peer connection. No new signaling round is needed.
src/hooks/useScreenShare.js
import { useState, useCallback } from 'react';
export function useScreenShare(localStream, pcsRef) {
const [sharing, setSharing] = useState(false);
const toggleScreenShare = useCallback(async () => {
if (sharing) {
const camTrack = localStream.getVideoTracks()[0];
for (const pc of pcsRef.current.values()) {
const sender = pc.getSenders().find(
s => s.track?.kind === 'video'
);
await sender?.replaceTrack(camTrack);
}
setSharing(false);
return;
}
const screenStream = await navigator.mediaDevices.getDisplayMedia({
video: { width: 1920, height: 1080, frameRate: 30 },
});
const screenTrack = screenStream.getVideoTracks()[0];
for (const pc of pcsRef.current.values()) {
const sender = pc.getSenders().find(
s => s.track?.kind === 'video'
);
await sender?.replaceTrack(screenTrack);
}
    screenTrack.onended = async () => {
      // The browser's native "Stop sharing" button fires this. Restore the
      // camera track directly: calling toggleScreenShare here would close
      // over a stale `sharing` value and re-enter the share branch.
      const camTrack = localStream.getVideoTracks()[0];
      for (const pc of pcsRef.current.values()) {
        const sender = pc.getSenders().find(s => s.track?.kind === 'video');
        await sender?.replaceTrack(camTrack);
      }
      setSharing(false);
    };
setSharing(true);
}, [sharing, localStream, pcsRef]);
return { sharing, toggleScreenShare };
}
The key insight is replaceTrack — it swaps the video track on the existing RTCPeerConnection without renegotiating the SDP. All other participants see the screen share instantly with zero downtime. When the user clicks the browser's native "Stop sharing" button, the onended callback fires and the camera feed is restored automatically.
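The hook runs the same sender-lookup loop twice; if you would rather not repeat it, the loop factors cleanly into a helper. The sketch below is not part of the tutorial's code and depends only on the `getSenders`/`replaceTrack` shape, so it works against real `RTCPeerConnection`s and against mocks alike:

```javascript
// Swap the outgoing video track on every peer connection.
// Works for camera -> screen and screen -> camera; audio senders are untouched.
async function swapOutgoingVideoTrack(peerConnections, newTrack) {
  for (const pc of peerConnections) {
    const sender = pc.getSenders().find((s) => s.track?.kind === 'video');
    if (sender) await sender.replaceTrack(newTrack);
  }
}
```

In useScreenShare, `await swapOutgoingVideoTrack(pcsRef.current.values(), screenTrack)` would replace the share branch's loop, and the same call with the camera track replaces the restore branch.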
Step 6 — In-Meeting Chat
Chat messages travel over the same WebSocket connection used for signaling. V100 routes chat messages to all participants in the meeting. No separate chat server needed.
src/components/ChatPanel.jsx
import { useState, useEffect, useRef } from 'react';
export function ChatPanel({ wsRef }) {
const [messages, setMessages] = useState([]);
const [input, setInput] = useState('');
const bottomRef = useRef(null);
useEffect(() => {
const ws = wsRef.current;
if (!ws) return;
const handler = (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'chat') {
setMessages(prev => [...prev, {
from: msg.from,
text: msg.text,
time: new Date().toLocaleTimeString(),
}]);
bottomRef.current?.scrollIntoView({ behavior: 'smooth' });
}
};
ws.addEventListener('message', handler);
return () => ws.removeEventListener('message', handler);
}, [wsRef]);
const send = () => {
if (!input.trim()) return;
wsRef.current?.send(JSON.stringify({
type: 'chat',
text: input,
}));
setInput('');
};
return (
<div style={{ width: 300, display: 'flex', flexDirection: 'column', height: '100%' }}>
<div style={{ flex: 1, overflowY: 'auto', padding: 12 }}>
{messages.map((m, i) => (
<div key={i} style={{ marginBottom: 10 }}>
<strong>{m.from}</strong>
<span style={{ color: '#666', fontSize: 11 }}> {m.time}</span>
<p style={{ margin: '4px 0 0' }}>{m.text}</p>
</div>
))}
<div ref={bottomRef} />
</div>
<div style={{ display: 'flex', gap: 8, padding: 12 }}>
<input
value={input}
onChange={e => setInput(e.target.value)}
onKeyDown={e => e.key === 'Enter' && send()}
placeholder="Type a message..."
style={{ flex: 1, padding: '8px 12px', borderRadius: 8 }}
/>
<button onClick={send}>Send</button>
</div>
</div>
);
}
Chat messages persist server-side for the duration of the meeting. When a new participant joins mid-call, they receive the full chat history in a chat-history event on WebSocket connect, so they never miss context.
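To fold that history into the ChatPanel, the message-state update can be written as a pure reducer that handles both event types in one place. The chat-history payload shape below (a `messages` array with `from`/`text`/`time` fields) is an assumption for illustration; check the event reference for the exact schema.

```javascript
// Pure reducer over chat state: seed from history, append live messages.
// Payload field names are assumed, not confirmed.
function chatReducer(messages, event) {
  if (event.type === 'chat-history') {
    // Replace local state with the server's authoritative history.
    return event.messages.map((m) => ({ from: m.from, text: m.text, time: m.time }));
  }
  if (event.type === 'chat') {
    return [...messages, { from: event.from, text: event.text, time: event.time }];
  }
  return messages; // ignore signaling traffic
}
```

In the component, `setMessages(prev => chatReducer(prev, msg))` then covers both the seed and the live path.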
Step 7 — Recording
Cloud recording is a single API call. V100's SFU captures all participant streams server-side, composites them into a grid layout, and encodes to MP4. No client-side recording means zero performance impact on participants.
src/hooks/useRecording.js
import { useState, useCallback } from 'react';
const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;
export function useRecording(meetingId) {
const [recording, setRecording] = useState(false);
const startRecording = useCallback(async () => {
await fetch(`${API}/api/meetings/${meetingId}/recording/start`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${KEY}` },
});
setRecording(true);
}, [meetingId]);
const stopRecording = useCallback(async () => {
const res = await fetch(
`${API}/api/meetings/${meetingId}/recording/stop`,
{
method: 'POST',
headers: { 'Authorization': `Bearer ${KEY}` },
}
).then(r => r.json());
setRecording(false);
return res;
}, [meetingId]);
return { recording, startRecording, stopRecording };
}
The stopRecording response includes a recordingUrl (signed S3 URL for the MP4), a transcriptUrl (if transcription was enabled), and the total duration in seconds. For long meetings, V100 processes the recording asynchronously and sends a webhook when it is ready.
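If you consume that webhook, verify the request actually came from V100 before trusting the payload. This tutorial does not document V100's signing scheme, so everything below is a placeholder sketch assuming a common convention: an HMAC-SHA256 hex digest of the raw request body, delivered in a header such as `x-v100-signature`. Confirm the real header name and algorithm in the webhooks guide before shipping this.

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify an HMAC-SHA256 webhook signature in constant time.
// rawBody must be the exact bytes received, before any JSON parsing.
function verifyWebhookSignature(rawBody, signatureHex, secret) {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // Length check first: timingSafeEqual throws on mismatched lengths.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Reject the request with a 401 when verification fails, and only then parse the body and download the recording.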
Step 8 — Live Transcription
Since we enabled transcription in the meeting settings in Step 2, live captions arrive automatically over the WebSocket. The CaptionOverlay component renders them as a floating bar at the bottom of the video grid, styled exactly like Zoom's closed captions.
src/components/CaptionOverlay.jsx
import { useState, useEffect } from 'react';
export function CaptionOverlay({ wsRef }) {
const [caption, setCaption] = useState('');
const [speaker, setSpeaker] = useState('');
  useEffect(() => {
    const ws = wsRef.current;
    if (!ws) return;
    let hideTimer;
    const handler = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'transcription') {
        setSpeaker(msg.speaker || 'Unknown');
        setCaption(msg.text);
        // Reset the hide timer so an earlier caption's timeout
        // cannot clear a newer caption early.
        clearTimeout(hideTimer);
        hideTimer = setTimeout(() => setCaption(''), 5000);
      }
    };
    ws.addEventListener('message', handler);
    return () => {
      clearTimeout(hideTimer);
      ws.removeEventListener('message', handler);
    };
  }, [wsRef]);
if (!caption) return null;
return (
<div style={{
position: 'fixed', bottom: 100, left: '50%',
transform: 'translateX(-50%)',
background: 'rgba(0,0,0,0.85)', color: '#fff',
padding: '10px 24px', borderRadius: 10,
maxWidth: '70%', textAlign: 'center',
fontSize: 15, lineHeight: 1.5,
zIndex: 50,
}}>
<strong style={{ color: '#818cf8' }}>{speaker}:</strong> {caption}
</div>
);
}
Each transcription event includes the speaker identifier (mapped to the participant's display name), the transcribed text, a timestamp, and a confidence score. V100 uses server-side speech-to-text with speaker diarization, so captions are attributed to the correct participant automatically. Supported languages include English, Spanish, French, German, Japanese, Mandarin, Portuguese, and 30+ more.
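The overlay shows captions only transiently; if you also want a full transcript for the call, accumulate the same events into an array. The sketch below drops low-confidence fragments. Field names follow the description above (`speaker`, `text`, `timestamp`, `confidence`), though the exact schema should be confirmed against the event reference.

```javascript
// Append one transcription event to an in-memory transcript,
// skipping anything below a confidence floor.
function appendTranscriptEvent(transcript, evt, minConfidence = 0.6) {
  if (evt.type !== 'transcription') return transcript;
  if ((evt.confidence ?? 1) < minConfidence) return transcript;
  return [
    ...transcript,
    { speaker: evt.speaker || 'Unknown', text: evt.text, timestamp: evt.timestamp },
  ];
}
```

Run it in the same WebSocket message handler as the overlay, then render or export the accumulated array when the call ends.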
Step 9 — Post-Meeting Summary
After a meeting ends, you can request an AI-generated summary from the transcript. The summary includes key discussion points, decisions made, and action items assigned to specific participants. This is a single API call.
Post-meeting AI summary
const KEY = import.meta.env.VITE_V100_API_KEY;

const getMeetingSummary = async (meetingId) => {
const summary = await fetch(
`https://api.v100.ai/api/meetings/${meetingId}/summary`,
{
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
format: 'structured',
includeActionItems: true,
includeDecisions: true,
}),
}
).then(r => r.json());
return summary;
};
The structured format returns a JSON object with keyPoints, decisions, and actionItems arrays. Each action item includes the assignee (identified from the transcript's speaker diarization), the task description, and an optional due date extracted from conversational context. You can integrate this directly into your project management tool, send it via email, or display it in a post-call review screen.
Summaries work retroactively. You can request a summary for any past meeting that had transcription enabled, even if the meeting ended hours or days ago. The transcript is stored on V100's servers for 30 days on the free tier and 90 days on Pro.
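As one concrete example of "integrate this directly", the `actionItems` array converts naturally into a Markdown checklist for a follow-up email or issue tracker. The `assignee` field is described above; the `task` and `dueDate` names here are assumptions about the structured format.

```javascript
// Render AI-extracted action items as a Markdown task list.
// Field names beyond `assignee` are assumed, not confirmed.
function actionItemsToMarkdown(summary) {
  return (summary.actionItems || [])
    .map((item) => {
      const due = item.dueDate ? ` (due ${item.dueDate})` : '';
      return `- [ ] ${item.assignee}: ${item.task}${due}`;
    })
    .join('\n');
}
```

The same pattern works for `keyPoints` and `decisions`; each is just a map over an array in the structured response.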
V100 vs Building from Scratch
Every feature in this tutorial is something you could build yourself. WebRTC is an open standard. SFUs like Janus and mediasoup are open-source. Whisper handles transcription. Here is what that project actually looks like in practice:
| Feature | Build from Scratch | V100 API |
| --- | --- | --- |
| Time to first video call | 3–6 months | 1 afternoon |
| SFU server | Deploy mediasoup/Janus, handle scaling, monitor | Managed (auto-scales to 50 participants) |
| TURN relay | Deploy coturn, configure TLS, geographic distribution | Included (RustTURN, global PoPs) |
| Screen sharing | Build track replacement logic, handle browser quirks | replaceTrack + SFU handles routing |
| Recording | FFmpeg composite pipeline, S3, transcoding queue | One API call, server-side capture |
| Live transcription | Whisper deployment, GPU infra, speaker diarization | One config flag in meeting settings |
| AI meeting summaries | LLM integration, prompt engineering, structured output | One POST call after meeting ends |
| Active speaker detection | Client-side audio analysis, heuristics | Server-side, delivered via WebSocket |
| Chat | Separate WebSocket server, persistence, history | Same signaling WebSocket, auto-persisted |
| Infrastructure cost | $2,000–$8,000+/mo for SFU + TURN + GPU | Free tier, then $0.004/participant-minute |
| Engineering time | 6 months, team of 3–5 | 2 days, solo developer |
The reality of building a Zoom clone from scratch is not the WebRTC code — it is the infrastructure. TURN servers for NAT traversal. SFU servers that scale. Recording pipelines. Transcription models running on GPUs. Each of these is its own multi-month project. V100 collapses all of that into API calls so you can focus on your product.
Pricing
Everything in this tutorial works on V100's free tier, which includes 100 API calls per month. That is enough to build and test your entire Zoom clone without entering a credit card.
- Free — 100 API calls/month, up to 10 participants, 720p video, transcription, chat. No credit card.
- Pro — Usage-based at $0.004/participant-minute. 1080p, recording, AI summaries, 50 participants, priority TURN.
- Enterprise — Volume discounts, dedicated SFU clusters, custom TURN deployment, SLA, SSO.
See the pricing page for full details.
Ship Your Zoom Clone This Weekend
Get your free API key and start building. Multi-participant video, screen sharing, chat, recording, and AI summaries — all from one API.
Get Your Free API Key