Core Architecture and Video Generation Workflow
A deep dive into Vidgen's high-level architecture, detailing the multi-stage process from prompt to final video, including AI model selection strategies, audio generation, and video composition.
High-Level Architecture Overview
Vidgen is a sophisticated web application designed to automate the creation of short-form social media videos, such as TikToks, Instagram Reels, and YouTube Shorts, directly from a single text prompt. Its architecture is built around a multi-stage workflow, leveraging various AI services and a robust video rendering engine to transform a textual idea into a complete visual and auditory experience.
The system's core functionality is compartmentalized into distinct, interconnected flows:
- Script Generation (Flow 1): Harnesses advanced AI models to convert a user prompt into a structured video script, complete with content and metadata.
- Audio Generation (Flow 2): Transforms the AI-generated script into natural-sounding speech using state-of-the-art text-to-speech services.
- Video Compilation and Rendering (Flow 3): Integrates all generated assets—script, audio, captions, and visual overlays—into a final video using Remotion, a powerful video editing library.
This modular approach ensures resilience, scalability, and maintainability, allowing for independent optimization and upgrades of each stage. The system relies heavily on technologies like Next.js for the application framework, Remotion for server-side video processing, AI-SDK for seamless AI model integration, ElevenLabs for high-fidelity audio, and Whisper-CPP for local speech-to-text transcription.
Script Generation Module (Flow 1)
The Script Generation Module is the initial and foundational step in Vidgen's workflow. It is responsible for interpreting the user's text prompt and translating it into a detailed, structured video script suitable for further processing. This module is implemented as a server-side action within the Next.js application, typically found in app/actions/generate-script.ts.
Workflow Steps:
- Prompt Reception: The module receives a user-provided text prompt via the web interface.
- Input Validation: The incoming prompt undergoes rigorous validation and sanitization to prevent injection attacks and ensure it meets expected format and length requirements.
- AI Model Selection: Based on a predefined strategy (detailed below), the system selects the most appropriate AI model for script generation.
- LLM Invocation: The selected Large Language Model (LLM) is called via the AI-SDK, passing the validated user prompt and a specific system prompt (e.g., from `lib/prompts/reddit-story.ts`) to guide the script generation towards the desired format (e.g., a Reddit-style story).
- Schema Validation: The AI's output is then validated against a predefined schema to ensure it adheres to the expected structure, containing necessary fields like `title`, `narration`, and `backgroundMusic`. This step is crucial for maintaining data integrity and ensuring downstream modules can reliably process the script.
- Metadata Calculation: Important metadata, such as the estimated duration of the video based on narration length, is calculated and attached to the script object.
- Caching: To optimize performance and reduce API costs, generated scripts are cached for a specified duration (e.g., 1 hour). Subsequent identical prompts within this timeframe will retrieve the cached script instead of re-invoking the LLM.
- Output: A complete, structured script object is returned, ready for the Audio Generation Module.
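The validation and metadata steps above can be sketched as follows. This is an illustrative TypeScript sketch, not Vidgen's actual implementation: the field names (`title`, `narration`, `backgroundMusic`) come from the text, while the words-per-minute rate used for the duration estimate is an assumption.

```typescript
// Hypothetical sketch of the schema-validation and metadata-calculation
// steps. In Vidgen these would run inside app/actions/generate-script.ts.
interface VideoScript {
  title: string;
  narration: string;
  backgroundMusic: string;
  estimatedDurationSec?: number;
}

// Reject AI output that is missing required string fields.
function validateScript(raw: unknown): VideoScript {
  const obj = raw as Record<string, unknown>;
  for (const field of ["title", "narration", "backgroundMusic"]) {
    if (typeof obj?.[field] !== "string" || (obj[field] as string).length === 0) {
      throw new Error(`Script is missing required field: ${field}`);
    }
  }
  return obj as unknown as VideoScript;
}

// Estimate duration from narration length (~150 spoken words per minute
// is an assumed rate, not a documented Vidgen constant).
function withMetadata(script: VideoScript): VideoScript {
  const words = script.narration.trim().split(/\s+/).length;
  return { ...script, estimatedDurationSec: Math.ceil((words / 150) * 60) };
}
```

A real implementation would likely use a schema library (e.g., Zod) together with the AI-SDK's structured-output helpers rather than hand-rolled checks.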
Advanced AI Model Selection and Fallback Strategy
Vidgen implements a sophisticated AI model selection and fallback strategy to ensure high reliability, optimal cost-effectiveness, and consistent performance for script generation. This multi-tiered approach minimizes service interruptions and provides resilience against individual API failures or rate limits.
Prioritized Model Tiers:
- Primary: Google Gemini 2.5 Flash
  - Rationale: Chosen for its exceptional speed and cost-efficiency. Gemini 2.5 Flash is ideal for rapid content generation, making it the first choice for general script creation tasks.
- First Fallback: Grok Beta
  - Rationale: Serves as a robust alternative. Grok Beta offers a different generative model that can provide varied stylistic outputs and acts as a strong backup if Gemini encounters issues.
- Second Fallback: OpenAI GPT-4o Mini
  - Rationale: A highly capable and widely adopted model known for its versatility and strong performance across various tasks. GPT-4o Mini ensures a reliable fallback if both Gemini and Grok are unavailable or produce unsatisfactory results.
- Ultimate Fallback: Hardcoded Template
  - Rationale: In the rare event that all external AI APIs fail or become inaccessible, the system defaults to a pre-defined, hardcoded script template. This guarantees that a basic, functional video can still be generated, providing graceful degradation of service rather than a complete failure.
Important Note on Fallback
This layered approach significantly enhances the application's robustness, allowing it to adapt to the dynamic availability and performance of various AI services while maintaining a consistent user experience.
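The tiered strategy above can be expressed as a simple try-in-order loop. This is an illustrative sketch: the generator functions are placeholders, and in Vidgen the individual calls would go through the AI-SDK with the models listed above.

```typescript
// Illustrative sketch of the tiered model fallback. Each tier is tried in
// priority order; if every external model fails, a hardcoded template is
// returned so the pipeline never produces a hard failure.
type ScriptGenerator = (prompt: string) => Promise<string>;

const FALLBACK_SCRIPT = "Default hardcoded script."; // ultimate fallback (placeholder)

async function generateWithFallback(
  prompt: string,
  tiers: ScriptGenerator[], // e.g., [gemini, grok, gpt4oMini]
): Promise<string> {
  for (const generate of tiers) {
    try {
      return await generate(prompt); // first tier to succeed wins
    } catch {
      // Swallow the error (rate limit, outage, bad output) and
      // fall through to the next tier.
    }
  }
  return FALLBACK_SCRIPT; // graceful degradation, never a crash
}
```

In practice each `catch` branch would also log the failure so operators can see when the primary model is degraded.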
Audio Generation Module (Flow 2)
The Audio Generation Module is responsible for transforming the textual narration from the AI-generated script into high-quality, natural-sounding speech. This crucial step bridges the gap between the script and the auditory component of the final video.
Integration with ElevenLabs API:
Vidgen integrates with the ElevenLabs API for text-to-speech (TTS) conversion. ElevenLabs is renowned for its highly expressive and realistic AI voices, which significantly enhance the professional quality of the generated videos.
Workflow Steps:
- Script Input: The module receives the `narration` text extracted from the structured script object generated by Flow 1.
- Server-Side Action: The audio generation process is managed by a server-side action, typically `app/actions/generate-audio.ts`.
- ElevenLabs API Call: The `narration` text, along with specified voice parameters (e.g., voice ID, stability, clarity), is sent to the ElevenLabs API.
- Audio File Reception: Upon successful processing, ElevenLabs returns an audio stream or a direct link to an audio file (e.g., MP3 format).
- Temporary Storage: The generated audio file is temporarily stored on the server, making it accessible for the subsequent video compilation stage.
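A hedged sketch of the API call step: the endpoint shape follows ElevenLabs' public text-to-speech REST API, but the voice ID, model ID, and `voice_settings` values shown here are placeholders, not Vidgen's actual configuration.

```typescript
// Sketch of the ElevenLabs TTS request. The request shape follows the
// public ElevenLabs REST API; concrete values are assumptions.
interface TtsRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildTtsRequest(narration: string, voiceId: string, apiKey: string): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    headers: {
      "xi-api-key": apiKey, // read from ELEVENLABS_API_KEY, never hardcoded
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text: narration,
      model_id: "eleven_multilingual_v2", // assumed model; check current docs
      voice_settings: { stability: 0.5, similarity_boost: 0.75 }, // placeholder values
    }),
  };
}

// The actual call returns audio bytes (e.g., MP3) to write to a temp file.
async function synthesize(req: TtsRequest): Promise<ArrayBuffer> {
  const res = await fetch(req.url, { method: "POST", headers: req.headers, body: req.body });
  if (!res.ok) throw new Error(`ElevenLabs TTS failed: ${res.status}`);
  return res.arrayBuffer();
}
```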
API Key Configuration
Store your ElevenLabs API key securely as an environment variable (`ELEVENLABS_API_KEY`) to prevent unauthorized access and manage your API usage effectively.

Video Compilation and Rendering (Flow 3)
This is the final and most complex stage, where all previously generated assets—the script, audio, captions, and visual elements—are synthesized into a complete, high-quality video file. This module relies heavily on Remotion for programmatic video editing and rendering.
The Remotion Rendering Challenge with Next.js:
Integrating Remotion's server-side rendering (SSR) directly within a Next.js environment presents a significant challenge. Specifically, conflicts arise when using @remotion/tailwind-v4 with Next.js's internal bundling mechanisms, leading to build errors or unexpected behavior.
Workaround: To circumvent these conflicts and ensure reliable video generation, Vidgen currently employs a strategy of rendering videos via the Remotion Command Line Interface (CLI). This means rendering is triggered as a separate, server-side process, distinct from Next.js's request-response cycle.
Workflow Steps (Orchestrated by app/actions/render-video.ts):
- Asset Collection: Gathers the generated script object (from Flow 1) and the audio file path (from Flow 2).
- Caption Generation Trigger: Initiates the local transcription process using Whisper-CPP to generate word-level timed captions from the audio.
- Remotion Props Preparation: Prepares a `props` object containing all necessary data (script, audio path, captions, etc.) that Remotion's composition will use.
- CLI Invocation: Executes the Remotion CLI command (`npx remotion render`) on the server. This command targets a specific composition defined in `remotion/index.ts` (e.g., `MyVideo`) and passes the prepared props.
- Rendering Process: The Remotion CLI spins up a headless browser environment, renders the video frame by frame according to the composition logic, and outputs a video file (e.g., `output.mp4`).
- Video File Path Return: Once rendering is complete, the path to the final video file is returned, making it available for streaming or download.
Production Considerations
For production deployments, `@remotion/lambda` is the recommended solution. It allows offloading rendering tasks to AWS Lambda, providing a highly scalable and cost-effective approach compared to self-hosting the Remotion CLI.

Transcription with Whisper
Accurate and precisely timed subtitles are crucial for engaging short-form videos. Vidgen achieves this through local transcription using OpenAI Whisper, powered by whisper-cpp.
How it Works:
- Local Execution: Instead of relying on external API calls for transcription, Vidgen runs Whisper locally using `whisper-cpp`, a highly optimized C++ port of OpenAI's Whisper model. This significantly reduces latency and removes dependency on external transcription services.
- Installation: The `whisper-cpp` binaries are installed as part of the project setup via `@remotion/install-whisper-cpp` or by running the custom script `remotion/scripts/install-whisper.mjs`. This ensures the necessary executable is available in the local environment.
- Caption Generation Script: A dedicated script, `remotion/scripts/generate-captions.ts`, handles the execution of `whisper-cpp`. It takes the generated audio file as input.
- Word-Level Timestamps: Whisper processes the audio and outputs transcription data, critically including precise start and end timestamps for each word spoken. This granular timing is essential for creating dynamic, karaoke-style subtitles.
Model Selection for Whisper
You can choose from several Whisper model sizes (`tiny`, `base`, `small`, `medium`) based on accuracy requirements and available system resources. Larger models offer higher accuracy but require more memory and processing power.

Reddit-Style Overlay Generation
To give the generated videos a familiar and engaging context, Vidgen incorporates a Reddit-style overlay. This visual component, implemented as a Remotion component (remotion/RedditOverlay.tsx), dynamically displays elements commonly found on Reddit posts.
Key Features of the Overlay:
- Dynamic Content: The overlay's content, such as the post title, hypothetical author, subreddit, upvote count, and comment count, is populated from the metadata within the AI-generated script.
- Aesthetic Fidelity: Designed with Shadcn UI and Tailwind CSS, the overlay closely mimics the look and feel of a Reddit post, ensuring immediate recognition and relatability for viewers.
- Contextual Storytelling: By framing the narration within a Reddit-like interface, the video effectively tells a story as if it were a viral post, enhancing engagement and storytelling capabilities.
This overlay serves as a powerful visual anchor, providing narrative context and mimicking popular content formats found on social media platforms.
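To make the dynamic-content idea concrete, here is an illustrative helper for one overlay detail: Reddit-style compact counts (e.g., 12,400 upvotes shown as "12.4k"). Both the formatting rules and the props interface below are assumptions inferred from the description, not Vidgen's actual code.

```typescript
// Hypothetical props the overlay pulls from the script metadata
// (field names inferred from the description above).
interface RedditOverlayProps {
  title: string;
  author: string;
  subreddit: string;
  upvotes: number;
  comments: number;
}

// Reddit-style compact count formatting (assumed rules):
// one decimal under 10k ("1.2k"), whole thousands above ("15k").
function formatCount(n: number): string {
  if (n < 1000) return String(n);
  const k = n / 1000;
  return k < 10 ? `${k.toFixed(1).replace(/\.0$/, "")}k` : `${Math.round(k)}k`;
}
```

Inside the Remotion component, values like `formatCount(props.upvotes)` would be rendered into the Shadcn/Tailwind-styled post card.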
Remotion Integration for Final Composition
Remotion is the heart of Vidgen's video production, acting as a programmable video editor that orchestrates all media assets into a coherent final video. It provides a robust framework for defining video compositions using React components.
Composition Structure (remotion/Composition.tsx):
At its core, Remotion works by defining compositions. These are React components that declare how different visual and audio elements should be laid out and animated over time. The main entry point for Remotion compositions is typically remotion/index.ts, which registers all available compositions.
```tsx
// remotion/Composition.tsx (Simplified Structure)
import React from 'react';
import { AbsoluteFill, Series, Audio } from 'remotion';
import { RedditOverlay } from './RedditOverlay';
import { TiktokCaptions } from './TiktokCaptions';

interface MyVideoProps {
  script: { title: string; narration: string; /* ... */ };
  audioUrl: string;
  captions: { text: string; start: number; end: number }[];
  // ... other props
}

export const MyVideo: React.FC<MyVideoProps> = ({ script, audioUrl, captions }) => {
  return (
    <AbsoluteFill className="bg-gray-900">
      {/* Background element, e.g., a static image or subtle animation */}
      <Series>
        {/* Series allows sequential playback of scenes */}
        <Series.Sequence durationInFrames={30 * 60}> {/* Placeholder; derive from audio duration */}
          <Audio src={audioUrl} />
          <RedditOverlay script={script} />
          <TiktokCaptions captions={captions} />
        </Series.Sequence>
      </Series>
    </AbsoluteFill>
  );
};
```
```tsx
// remotion/index.ts
import { Composition, registerRoot } from 'remotion';
import { MyVideo } from './Composition';

registerRoot(() => (
  <Composition
    id="MyVideo"
    component={MyVideo}
    durationInFrames={30 * 60} // Max 1 minute; adjust dynamically from script metadata
    fps={30}
    width={1080}
    height={1920}
    defaultProps={{
      script: { title: 'Default Title', narration: 'Default narration.' },
      audioUrl: '',
      captions: [],
    }}
  />
));
```

Dynamic TikTok-Style Subtitles:
A key visual feature for social media videos is dynamic, engaging subtitles. Vidgen achieves this with custom Remotion components (remotion/TiktokCaptions.tsx and remotion/CaptionText.tsx) that leverage the word-level timestamps from Whisper transcription.
- `TiktokCaptions.tsx`: This component iterates through the array of timed `captions` data.
- `CaptionText.tsx`: For each word or phrase, it renders a text element, applying animations and styles (e.g., scaling, color changes, highlighting) based on its start and end timestamps relative to the current video frame. This creates the signature "karaoke-style" subtitle effect common in short-form videos.
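The timing logic behind the karaoke effect reduces to a frame-to-time comparison. A minimal sketch, assuming caption timestamps in seconds; inside the real component the frame would come from Remotion's `useCurrentFrame()` hook rather than a parameter.

```typescript
// Sketch: decide which caption word is "active" at the current frame.
interface Caption {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

function activeCaptionIndex(captions: Caption[], frame: number, fps: number): number {
  const t = frame / fps; // current playback time in seconds
  // A word is active while the playhead is inside its [start, end) window.
  return captions.findIndex((c) => t >= c.start && t < c.end);
}
```

`CaptionText` would then scale or highlight the element whose index matches, producing the word-by-word emphasis described above.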
Remotion Configuration and Rendering:
- `remotion.config.ts`: This file configures Remotion's build process, including settings for Webpack, Babel, and potentially Tailwind CSS integration (though manual configuration might be needed due to the aforementioned conflicts).
- Rendering Process: As detailed in Flow 3, the `npx remotion render` command is used. It takes the composition `id` (e.g., `MyVideo`), the desired output file path, and a JSON string of `props` to inject dynamic data into the Remotion composition. The command executes independently, producing the final `.mp4` video file.
This comprehensive integration with Remotion allows Vidgen to create highly customized, visually rich, and dynamically generated videos tailored for modern social media platforms.