What You Will Build
By the end of this tutorial, you will have a fully functional Zoom-like application built with React and Vite. The app supports everything you would expect from a modern video conferencing tool: multi-participant video with a dynamic tile grid, screen sharing, real-time in-meeting chat, cloud recording, live transcription with caption overlays, and AI-generated post-meeting summaries with action items. Every feature is powered by V100's API — you write the frontend, V100 handles the infrastructure.
✓ Multi-participant video (up to 50)
✓ Screen sharing with replaceTrack
✓ Real-time chat over WebSocket
✓ Cloud recording to S3
✓ Live transcription (40+ languages)
✓ Active speaker detection
✓ AI meeting summaries + action items
✓ Post-quantum encrypted signaling
Architecture — SFU Topology
Participant A           V100 SFU Server          Participant B
     |                        |                        |
     |--- media track ------->|                        |
     |                        |--- media track ------->|
     |                        |                        |
     |<--- media track -------|                        |
     |                        |<--- media track -------|
     |                        |                        |
Participant C        (routes all streams)       Participant D
     |                        |                        |
     |--- media track ------->|--- media track ------->|
     |<--- media tracks ------|<--- media tracks ------|
     |                        |                        |

Each participant sends 1 stream and receives N-1 streams.
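That last line is worth quantifying. The arithmetic below (plain JavaScript, no V100 API involved) compares a peer-to-peer mesh, where every participant uploads a separate copy to every other participant, against the SFU topology in the diagram:

```javascript
// Stream counts for an N-person call, mesh vs. SFU.
// Mesh: every peer encodes and uploads one copy per other peer.
// SFU: every peer uploads once; the server fans out the copies.
const meshStreamsPerPeer = (n) => 2 * (n - 1); // send n-1, receive n-1
const sfuStreamsPerPeer = (n) => n;            // send 1, receive n-1
const meshTotalUplinks = (n) => n * (n - 1);   // grows quadratically
const sfuTotalUplinks = (n) => n;              // grows linearly

// At 10 participants a mesh needs 90 uplink streams across the room;
// the SFU needs 10.
```

At V100's 50-participant ceiling the gap is 2,450 mesh uplink streams versus 50, which is why mesh topologies tend to fall apart at around 4 to 6 participants while an SFU scales linearly.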
Prerequisites
- Node.js 18+ and npm
- React 18+ with Vite (this tutorial uses Vite, but the code works with any bundler)
- A V100 API key — free tier gives you 100 API calls per month, no credit card required
Step 1 — Project Setup
Scaffold a new React project with Vite. We will keep the dependency footprint minimal — just React and the browser's native WebRTC APIs. No SDK installation required. V100 uses standard REST and WebSocket endpoints.
Terminal
npm create vite@latest zoom-clone -- --template react
cd zoom-clone
npm install
npm run dev
Create a .env file in the project root with your API key. Vite exposes environment variables prefixed with VITE_ to the client.
.env
VITE_V100_API_KEY=v100_sk_your_api_key_here
Production note: Never expose API keys in client-side code in production. Create meetings from your backend server and pass only the short-lived token to the browser. This tutorial uses VITE_ variables for simplicity during development. See the server-side guide.
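For reference, the server-side flow that note describes can be sketched in a few lines. This is an illustration, not part of the tutorial's frontend code: `createMeetingForClient` is a hypothetical helper name, the injectable `fetchImpl` exists only to make the sketch testable, and the `meetingId`/`token`/`iceServers` fields match the POST /api/meetings response shown in Step 2.

```javascript
// Hypothetical server-side helper: create the meeting with the secret key,
// then hand the browser only what it needs to join.
async function createMeetingForClient(apiKey, title, fetchImpl = fetch) {
  const res = await fetchImpl('https://api.v100.ai/api/meetings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ title, settings: { topology: 'sfu' } }),
  });
  if (!res.ok) throw new Error(`V100 returned ${res.status}`);
  const { meetingId, token, iceServers } = await res.json();
  // Return only the short-lived credentials; the API key never leaves the server.
  return { meetingId, token, iceServers };
}
```

Expose this behind your own authenticated route (an Express handler, a Next.js API route, a Lambda); the browser then joins with the returned token exactly as in Step 3.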
Step 2 — Create a Meeting Room
Every video call starts with creating a meeting. The POST /api/meetings endpoint returns a meetingId, a short-lived token for WebSocket authentication, and ICE server credentials. We will wrap this in a custom hook that the rest of the app consumes.
src/hooks/useMeeting.js
import { useState, useCallback } from 'react';
const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;
export function useMeeting() {
const [meeting, setMeeting] = useState(null);
const [loading, setLoading] = useState(false);
const createMeeting = useCallback(async (title = 'Zoom Clone Meeting') => {
setLoading(true);
const res = await fetch(`${API}/api/meetings`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
title,
settings: {
maxParticipants: 50,
topology: 'sfu',
recording: { autoStart: false },
transcription: {
enabled: true,
language: 'en',
showCaptions: true,
},
},
}),
}).then(r => r.json());
setMeeting(res);
setLoading(false);
return res;
}, []);
const joinMeeting = useCallback(async (meetingId) => {
setLoading(true);
const res = await fetch(`${API}/api/meetings/${meetingId}/join`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${KEY}` },
}).then(r => r.json());
setMeeting(res);
setLoading(false);
return res;
}, []);
return { meeting, loading, createMeeting, joinMeeting };
}
The response from POST /api/meetings looks like this:
Response — 201 Created
{
"meetingId": "mtg_a1b2c3d4e5f6",
"token": "eyJhbGciOiJFZDI1NTE5...",
"joinUrl": "https://api.v100.ai/join/mtg_a1b2c3d4e5f6",
"iceServers": [
{ "urls": "stun:stun.v100.ai:3478" },
{
"urls": "turn:turn.v100.ai:443?transport=tcp",
"username": "auto_generated",
"credential": "short_lived_credential"
}
]
}
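The hook keeps error handling minimal for readability: `.then(r => r.json())` will happily parse an error body and never checks `res.ok`. Production code will want to surface failures; one way is a small wrapper (a sketch, and `v100Fetch` is not part of the V100 API) that centralizes the check:

```javascript
// Minimal fetch wrapper: prefix the base URL and surface non-2xx
// responses as thrown errors instead of silently parsed JSON.
async function v100Fetch(path, options = {}, fetchImpl = fetch) {
  const res = await fetchImpl(`https://api.v100.ai${path}`, options);
  if (!res.ok) {
    const body = await res.text();
    throw new Error(`V100 ${res.status}: ${body}`);
  }
  return res.json();
}
```

With this in place, `createMeeting` and `joinMeeting` can also move their `setLoading(false)` calls into a `try`/`finally` so a failed request does not leave the UI stuck in a loading state.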
Step 3 — Join with WebRTC
Once you have a meeting, the next step is connecting the user's camera and microphone, opening a WebSocket to the signaling server, and establishing peer connections. V100 handles the SFU routing automatically — each participant sends one upstream track and receives downstream tracks from every other participant.
src/hooks/useWebRTC.js
import { useRef, useState, useCallback } from 'react';
const WS_URL = 'wss://api.v100.ai/ws/signaling';
export function useWebRTC(meeting) {
const [participants, setParticipants] = useState(new Map());
const [localStream, setLocalStream] = useState(null);
const [activeSpeaker, setActiveSpeaker] = useState(null);
const wsRef = useRef(null);
const pcsRef = useRef(new Map());
const connect = useCallback(async () => {
if (!meeting) return;
const stream = await navigator.mediaDevices.getUserMedia({
video: { width: 1280, height: 720, frameRate: 30 },
audio: { echoCancellation: true, noiseSuppression: true },
});
setLocalStream(stream);
const ws = new WebSocket(
`${WS_URL}?token=${meeting.token}&meetingId=${meeting.meetingId}`
);
wsRef.current = ws;
ws.onmessage = async (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'peer-joined') {
const pc = createPeerConnection(msg.peerId, stream, ws);
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
ws.send(JSON.stringify({
type: 'offer', sdp: offer, to: msg.peerId,
}));
}
if (msg.type === 'offer') {
const pc = createPeerConnection(msg.from, stream, ws);
await pc.setRemoteDescription(msg.sdp);
const answer = await pc.createAnswer();
await pc.setLocalDescription(answer);
ws.send(JSON.stringify({
type: 'answer', sdp: answer, to: msg.from,
}));
}
if (msg.type === 'answer') {
const pc = pcsRef.current.get(msg.from);
await pc?.setRemoteDescription(msg.sdp);
}
if (msg.type === 'ice-candidate') {
const pc = pcsRef.current.get(msg.from);
await pc?.addIceCandidate(msg.candidate);
}
if (msg.type === 'active-speaker') {
setActiveSpeaker(msg.peerId);
}
if (msg.type === 'peer-left') {
pcsRef.current.get(msg.peerId)?.close();
pcsRef.current.delete(msg.peerId);
setParticipants(prev => {
const next = new Map(prev);
next.delete(msg.peerId);
return next;
});
}
};
}, [meeting]);
function createPeerConnection(peerId, stream, ws) {
const pc = new RTCPeerConnection({
iceServers: meeting.iceServers,
});
pcsRef.current.set(peerId, pc);
stream.getTracks().forEach(t => pc.addTrack(t, stream));
pc.ontrack = (e) => {
setParticipants(prev => {
const next = new Map(prev);
next.set(peerId, e.streams[0]);
return next;
});
};
pc.onicecandidate = (e) => {
if (e.candidate) {
ws.send(JSON.stringify({
type: 'ice-candidate',
candidate: e.candidate,
to: peerId,
}));
}
};
return pc;
}
return { localStream, participants, activeSpeaker, connect, wsRef };
}
When a new participant joins, V100's signaling server sends a peer-joined event to everyone in the room. Each existing peer creates a new RTCPeerConnection and initiates the SDP handshake. For rooms with three or more participants, V100 automatically routes traffic through its SFU — each person sends one upstream track and receives separate downstream tracks, keeping bandwidth usage linear instead of quadratic.
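One race the handler above glosses over: an ice-candidate message can arrive before setRemoteDescription has run for that peer, and addIceCandidate rejects when the connection has no remote description yet. Buffering candidates until the description is set is a standard WebRTC pattern, sketched here independent of any V100 specifics:

```javascript
// Buffer ICE candidates that arrive before the remote description is set,
// then flush them once the SDP handshake for that peer completes.
class CandidateQueue {
  constructor() {
    this.pending = [];
    this.ready = false;
  }
  // Call on every incoming 'ice-candidate' message.
  async add(pc, candidate) {
    if (this.ready) return pc.addIceCandidate(candidate);
    this.pending.push(candidate);
  }
  // Call right after pc.setRemoteDescription(...) resolves.
  async flush(pc) {
    this.ready = true;
    for (const candidate of this.pending.splice(0)) {
      await pc.addIceCandidate(candidate);
    }
  }
}
```

Keep one queue per peer (for example in a `Map` next to `pcsRef`), route `ice-candidate` messages through `add`, and call `flush` in the `offer` and `answer` branches once the remote description is applied.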
Step 4 — Participant Grid with Active Speaker
Zoom's signature tile layout resizes dynamically based on participant count. The ParticipantGrid component below calculates columns based on the number of streams and highlights the active speaker with a colored border.
src/components/ParticipantGrid.jsx
import { useRef, useEffect } from 'react';
function VideoTile({ stream, label, isActive, muted }) {
const ref = useRef(null);
useEffect(() => {
if (ref.current && stream) ref.current.srcObject = stream;
}, [stream]);
return (
<div style={{
position: 'relative',
borderRadius: 12,
overflow: 'hidden',
border: isActive ? '2px solid #4ade80' : '2px solid #252525',
background: '#111',
}}>
<video ref={ref} autoPlay playsInline muted={muted}
style={{ width: '100%', display: 'block' }}
/>
<span style={{
position: 'absolute', bottom: 8, left: 12,
background: 'rgba(0,0,0,0.7)', color: '#fff',
padding: '4px 10px', borderRadius: 6, fontSize: 12,
}}>{label}</span>
</div>
);
}
export function ParticipantGrid({ localStream, participants, activeSpeaker }) {
const count = participants.size + 1;
const cols = count <= 1 ? 1 : count <= 4 ? 2 : count <= 9 ? 3 : 4;
return (
<div style={{
display: 'grid',
gridTemplateColumns: `repeat(${cols}, 1fr)`,
gap: 8, padding: 16,
}}>
<VideoTile
stream={localStream}
label="You"
isActive={activeSpeaker === 'local'}
muted
/>
{[...participants.entries()].map(([id, stream]) => (
<VideoTile
key={id}
stream={stream}
label={`Participant ${id.slice(0, 6)}`}
isActive={activeSpeaker === id}
muted={false}
/>
))}
</div>
);
}
V100 sends active-speaker events over the WebSocket whenever the dominant audio source changes. The grid highlights that participant's tile with a green border, exactly like Zoom does. No client-side audio analysis needed — the SFU computes audio levels server-side.
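The grid shows video, but every conferencing UI also needs mute and camera-off buttons, which the tutorial does not wire up. The standard approach is the track's `enabled` flag: it sends silence or black frames without stopping the track, so the sender stays alive and no renegotiation is needed. A minimal helper, written against the plain `getAudioTracks`/`getVideoTracks` shape so it can be exercised outside a browser:

```javascript
// Enable or disable all tracks of one kind on a stream.
// enabled = false transmits silence/black frames; the track keeps running,
// so flipping back on is instant and needs no new getUserMedia call.
function setTrackEnabled(stream, kind, enabled) {
  const tracks =
    kind === 'audio' ? stream.getAudioTracks() : stream.getVideoTracks();
  tracks.forEach((track) => {
    track.enabled = enabled;
  });
  return enabled;
}
```

Wire it to toolbar buttons with `setTrackEnabled(localStream, 'audio', false)` to mute and `true` to unmute, and likewise for `'video'`.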
Step 5 — Screen Sharing
Screen sharing uses the browser's getDisplayMedia API to capture the user's screen and replaceTrack to swap the camera feed for the screen feed on the existing peer connection. No new signaling round is needed.
src/hooks/useScreenShare.js
import { useState, useCallback } from 'react';
export function useScreenShare(localStream, pcsRef) {
const [sharing, setSharing] = useState(false);
const toggleScreenShare = useCallback(async () => {
if (sharing) {
const camTrack = localStream.getVideoTracks()[0];
for (const pc of pcsRef.current.values()) {
const sender = pc.getSenders().find(
s => s.track?.kind === 'video'
);
await sender?.replaceTrack(camTrack);
}
setSharing(false);
return;
}
const screenStream = await navigator.mediaDevices.getDisplayMedia({
video: { width: 1920, height: 1080, frameRate: 30 },
});
const screenTrack = screenStream.getVideoTracks()[0];
for (const pc of pcsRef.current.values()) {
const sender = pc.getSenders().find(
s => s.track?.kind === 'video'
);
await sender?.replaceTrack(screenTrack);
}
    screenTrack.onended = async () => {
      // The browser's native "Stop sharing" button fires this. Restore the
      // camera track directly: calling toggleScreenShare here would close
      // over a stale `sharing` value and re-enter the share branch.
      const camTrack = localStream.getVideoTracks()[0];
      for (const pc of pcsRef.current.values()) {
        const sender = pc.getSenders().find(s => s.track?.kind === 'video');
        await sender?.replaceTrack(camTrack);
      }
      setSharing(false);
    };
setSharing(true);
}, [sharing, localStream, pcsRef]);
return { sharing, toggleScreenShare };
}
The key insight is replaceTrack — it swaps the video track on the existing RTCPeerConnection without renegotiating the SDP. All other participants see the screen share instantly with zero downtime. When the user clicks the browser's native "Stop sharing" button, the onended callback fires and the camera feed is restored automatically.
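The hook runs the same sender-lookup loop twice; if you would rather not repeat it, the loop factors cleanly into a helper. The sketch below is not part of the tutorial's code and depends only on the `getSenders`/`replaceTrack` shape, so it works against real `RTCPeerConnection`s and against mocks alike:

```javascript
// Swap the outgoing video track on every peer connection.
// Works for camera -> screen and screen -> camera; audio senders are untouched.
async function swapOutgoingVideoTrack(peerConnections, newTrack) {
  for (const pc of peerConnections) {
    const sender = pc.getSenders().find((s) => s.track?.kind === 'video');
    if (sender) await sender.replaceTrack(newTrack);
  }
}
```

In useScreenShare, `await swapOutgoingVideoTrack(pcsRef.current.values(), screenTrack)` would replace the share branch's loop, and the same call with the camera track replaces the restore branch.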
Step 6 — In-Meeting Chat
Chat messages travel over the same WebSocket connection used for signaling. V100 routes chat messages to all participants in the meeting. No separate chat server needed.
src/components/ChatPanel.jsx
import { useState, useEffect, useRef } from 'react';
export function ChatPanel({ wsRef }) {
const [messages, setMessages] = useState([]);
const [input, setInput] = useState('');
const bottomRef = useRef(null);
useEffect(() => {
const ws = wsRef.current;
if (!ws) return;
const handler = (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'chat') {
setMessages(prev => [...prev, {
from: msg.from,
text: msg.text,
time: new Date().toLocaleTimeString(),
}]);
bottomRef.current?.scrollIntoView({ behavior: 'smooth' });
}
};
ws.addEventListener('message', handler);
return () => ws.removeEventListener('message', handler);
}, [wsRef]);
const send = () => {
if (!input.trim()) return;
wsRef.current?.send(JSON.stringify({
type: 'chat',
text: input,
}));
setInput('');
};
return (
<div style={{ width: 300, display: 'flex', flexDirection: 'column', height: '100%' }}>
<div style={{ flex: 1, overflowY: 'auto', padding: 12 }}>
{messages.map((m, i) => (
<div key={i} style={{ marginBottom: 10 }}>
<strong>{m.from}</strong>
<span style={{ color: '#666', fontSize: 11 }}> {m.time}</span>
<p style={{ margin: '4px 0 0' }}>{m.text}</p>
</div>
))}
<div ref={bottomRef} />
</div>
<div style={{ display: 'flex', gap: 8, padding: 12 }}>
<input
value={input}
onChange={e => setInput(e.target.value)}
onKeyDown={e => e.key === 'Enter' && send()}
placeholder="Type a message..."
style={{ flex: 1, padding: '8px 12px', borderRadius: 8 }}
/>
<button onClick={send}>Send</button>
</div>
</div>
);
}
Chat messages persist server-side for the duration of the meeting. When a new participant joins mid-call, they receive the full chat history in a chat-history event on WebSocket connect, so they never miss context.
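To fold that history into the ChatPanel, the message-state update can be written as a pure reducer that handles both event types in one place. The chat-history payload shape below (a `messages` array with `from`/`text`/`time` fields) is an assumption for illustration; check the event reference for the exact schema.

```javascript
// Pure reducer over chat state: seed from history, append live messages.
// Payload field names are assumed, not confirmed.
function chatReducer(messages, event) {
  if (event.type === 'chat-history') {
    // Replace local state with the server's authoritative history.
    return event.messages.map((m) => ({ from: m.from, text: m.text, time: m.time }));
  }
  if (event.type === 'chat') {
    return [...messages, { from: event.from, text: event.text, time: event.time }];
  }
  return messages; // ignore signaling traffic
}
```

In the component, `setMessages(prev => chatReducer(prev, msg))` then covers both the seed and the live path.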
Step 7 — Recording
Cloud recording is a single API call. V100's SFU captures all participant streams server-side, composites them into a grid layout, and encodes to MP4. No client-side recording means zero performance impact on participants.
src/hooks/useRecording.js
import { useState, useCallback } from 'react';
const API = 'https://api.v100.ai';
const KEY = import.meta.env.VITE_V100_API_KEY;
export function useRecording(meetingId) {
const [recording, setRecording] = useState(false);
const startRecording = useCallback(async () => {
await fetch(`${API}/api/meetings/${meetingId}/recording/start`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${KEY}` },
});
setRecording(true);
}, [meetingId]);
const stopRecording = useCallback(async () => {
const res = await fetch(
`${API}/api/meetings/${meetingId}/recording/stop`,
{
method: 'POST',
headers: { 'Authorization': `Bearer ${KEY}` },
}
).then(r => r.json());
setRecording(false);
return res;
}, [meetingId]);
return { recording, startRecording, stopRecording };
}
The stopRecording response includes a recordingUrl (signed S3 URL for the MP4), a transcriptUrl (if transcription was enabled), and the total duration in seconds. For long meetings, V100 processes the recording asynchronously and sends a webhook when it is ready.
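If you consume that webhook, verify the request actually came from V100 before trusting the payload. This tutorial does not document V100's signing scheme, so everything below is a placeholder sketch assuming a common convention: an HMAC-SHA256 hex digest of the raw request body, delivered in a header such as `x-v100-signature`. Confirm the real header name and algorithm in the webhooks guide before shipping this.

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify an HMAC-SHA256 webhook signature in constant time.
// rawBody must be the exact bytes received, before any JSON parsing.
function verifyWebhookSignature(rawBody, signatureHex, secret) {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // Length check first: timingSafeEqual throws on mismatched lengths.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Reject the request with a 401 when verification fails, and only then parse the body and download the recording.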
Step 8 — Live Transcription
Since we enabled transcription in the meeting settings in Step 2, live captions arrive automatically over the WebSocket. The CaptionOverlay component renders them as a floating bar at the bottom of the video grid, styled exactly like Zoom's closed captions.
src/components/CaptionOverlay.jsx
import { useState, useEffect } from 'react';
export function CaptionOverlay({ wsRef }) {
const [caption, setCaption] = useState('');
const [speaker, setSpeaker] = useState('');
  useEffect(() => {
    const ws = wsRef.current;
    if (!ws) return;
    let hideTimer;
    const handler = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'transcription') {
        setSpeaker(msg.speaker || 'Unknown');
        setCaption(msg.text);
        // Reset the hide timer so an earlier caption's timeout
        // cannot clear a newer caption early.
        clearTimeout(hideTimer);
        hideTimer = setTimeout(() => setCaption(''), 5000);
      }
    };
    ws.addEventListener('message', handler);
    return () => {
      clearTimeout(hideTimer);
      ws.removeEventListener('message', handler);
    };
  }, [wsRef]);
if (!caption) return null;
return (
<div style={{
position: 'fixed', bottom: 100, left: '50%',
transform: 'translateX(-50%)',
background: 'rgba(0,0,0,0.85)', color: '#fff',
padding: '10px 24px', borderRadius: 10,
maxWidth: '70%', textAlign: 'center',
fontSize: 15, lineHeight: 1.5,
zIndex: 50,
}}>
<strong style={{ color: '#818cf8' }}>{speaker}:</strong> {caption}
</div>
);
}
Each transcription event includes the speaker identifier (mapped to the participant's display name), the transcribed text, a timestamp, and a confidence score. V100 uses server-side speech-to-text with speaker diarization, so captions are attributed to the correct participant automatically. Supported languages include English, Spanish, French, German, Japanese, Mandarin, Portuguese, and 30+ more.
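The overlay shows captions only transiently; if you also want a full transcript for the call, accumulate the same events into an array. The sketch below drops low-confidence fragments. Field names follow the description above (`speaker`, `text`, `timestamp`, `confidence`), though the exact schema should be confirmed against the event reference.

```javascript
// Append one transcription event to an in-memory transcript,
// skipping anything below a confidence floor.
function appendTranscriptEvent(transcript, evt, minConfidence = 0.6) {
  if (evt.type !== 'transcription') return transcript;
  if ((evt.confidence ?? 1) < minConfidence) return transcript;
  return [
    ...transcript,
    { speaker: evt.speaker || 'Unknown', text: evt.text, timestamp: evt.timestamp },
  ];
}
```

Run it in the same WebSocket message handler as the overlay, then render or export the accumulated array when the call ends.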
Step 9 — Post-Meeting Summary
After a meeting ends, you can request an AI-generated summary from the transcript. The summary includes key discussion points, decisions made, and action items assigned to specific participants. This is a single API call.
Post-meeting AI summary
const KEY = import.meta.env.VITE_V100_API_KEY;

const getMeetingSummary = async (meetingId) => {
const summary = await fetch(
`https://api.v100.ai/api/meetings/${meetingId}/summary`,
{
method: 'POST',
headers: {
'Authorization': `Bearer ${KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
format: 'structured',
includeActionItems: true,
includeDecisions: true,
}),
}
).then(r => r.json());
return summary;
};
The structured format returns a JSON object with keyPoints, decisions, and actionItems arrays. Each action item includes the assignee (identified from the transcript's speaker diarization), the task description, and an optional due date extracted from conversational context. You can integrate this directly into your project management tool, send it via email, or display it in a post-call review screen.
Summaries work retroactively. You can request a summary for any past meeting that had transcription enabled, even if the meeting ended hours or days ago. The transcript is stored on V100's servers for 30 days on the free tier and 90 days on Pro.
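As one concrete example of "integrate this directly", the `actionItems` array converts naturally into a Markdown checklist for a follow-up email or issue tracker. The `assignee` field is described above; the `task` and `dueDate` names here are assumptions about the structured format.

```javascript
// Render AI-extracted action items as a Markdown task list.
// Field names beyond `assignee` are assumed, not confirmed.
function actionItemsToMarkdown(summary) {
  return (summary.actionItems || [])
    .map((item) => {
      const due = item.dueDate ? ` (due ${item.dueDate})` : '';
      return `- [ ] ${item.assignee}: ${item.task}${due}`;
    })
    .join('\n');
}
```

The same pattern works for `keyPoints` and `decisions`; each is just a map over an array in the structured response.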
V100 vs Building from Scratch
Every feature in this tutorial is something you could build yourself. WebRTC is an open standard. SFUs like Janus and mediasoup are open-source. Whisper handles transcription. Here is what that project actually looks like in practice:
| Feature | Build from Scratch | V100 API |
| --- | --- | --- |
| Time to first video call | 3–6 months | 1 afternoon |
| SFU server | Deploy mediasoup/Janus, handle scaling, monitor | Managed (auto-scales to 50 participants) |
| TURN relay | Deploy coturn, configure TLS, geographic distribution | Included (RustTURN, global PoPs) |
| Screen sharing | Build track replacement logic, handle browser quirks | replaceTrack + SFU handles routing |
| Recording | FFmpeg composite pipeline, S3, transcoding queue | One API call, server-side capture |
| Live transcription | Whisper deployment, GPU infra, speaker diarization | One config flag in meeting settings |
| AI meeting summaries | LLM integration, prompt engineering, structured output | One POST call after meeting ends |
| Active speaker detection | Client-side audio analysis, heuristics | Server-side, delivered via WebSocket |
| Chat | Separate WebSocket server, persistence, history | Same signaling WebSocket, auto-persisted |
| Infrastructure cost | $2,000–$8,000+/mo for SFU + TURN + GPU | Free tier, then $0.004/participant-minute |
| Engineering time | 6 months, team of 3–5 | 2 days, solo developer |
The reality of building a Zoom clone from scratch is not the WebRTC code — it is the infrastructure. TURN servers for NAT traversal. SFU servers that scale. Recording pipelines. Transcription models running on GPUs. Each of these is its own multi-month project. V100 collapses all of that into API calls so you can focus on your product.
Pricing
Everything in this tutorial works on V100's free tier, which includes 100 API calls per month. That is enough to build and test your entire Zoom clone without entering a credit card.
- Free — 100 API calls/month, up to 10 participants, 720p video, transcription, chat. No credit card.
- Pro — Usage-based at $0.004/participant-minute. 1080p, recording, AI summaries, 50 participants, priority TURN.
- Enterprise — Volume discounts, dedicated SFU clusters, custom TURN deployment, SLA, SSO.
See the pricing page for full details.
Ship Your Zoom Clone This Weekend
Get your free API key and start building. Multi-participant video, screen sharing, chat, recording, and AI summaries — all from one API.
Get Your Free API Key