skills/ad-creative/references/generative-tools.md
Reference for using AI image generators, video generators, and code-based video tools to produce ad visuals at scale.
| Need | Tool Category | Best Fit |
|---|---|---|
| Static ad images (banners, social) | Image generation | Nano Banana Pro, Flux, Ideogram |
| Ad images with text overlays | Image generation (text-capable) | Ideogram, Nano Banana Pro |
| Short video ads (6-30 sec) | Video generation | Veo, Kling, Runway, Sora, Seedance |
| Video ads with voiceover | Video gen + voice | Veo/Sora (native), or Runway + ElevenLabs |
| Voiceover tracks for ads | Voice generation | ElevenLabs, OpenAI TTS, Cartesia |
| Multi-language ad versions | Voice generation | ElevenLabs, PlayHT |
| Brand voice cloning | Voice generation | ElevenLabs, Resemble AI |
| Product mockups and variations | Image generation + references | Flux (multi-image reference) |
| Templated video ads at scale | Code-based video | Remotion |
| Personalized video (name, data) | Code-based video | Remotion |
| Brand-consistent variations | Image gen + style refs | Flux, Ideogram, Nano Banana Pro |
Google DeepMind's image generation model, available through the Gemini API.
Best for: High-quality ad images, product visuals, text rendering API: Gemini API (Google AI Studio, Vertex AI) Pricing: ~$0.04/image (Gemini 2.5 Flash Image), ~$0.24/4K image (Nano Banana Pro)
Strengths:
Ad creative use cases:
API example:
# Using the Gemini API for image generation
curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
-H "Content-Type: application/json" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-d '{
"contents": [{"parts": [{"text": "Create a clean, modern social media ad image for a project management tool. Show a laptop with a kanban board interface. Bright, professional, 16:9 ratio."}]}],
"generationConfig": {"responseModalities": ["TEXT", "IMAGE"]}
}'
Docs: Gemini Image Generation
Open-weight image generation models with API access through Replicate and BFL's native API.
Best for: Photorealistic images, brand-consistent variations, multi-reference generation API: Replicate, BFL API, fal.ai Pricing: ~$0.01-0.06/image depending on model and resolution
Model variants:
| Model | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| Flux 2 Pro | ~6 sec | Highest | $0.015/MP | Final production assets |
| Flux 2 Flex | ~22 sec | High + editing | $0.06/MP | Iterative editing |
| Flux 2 Dev | ~2.5 sec | Good | $0.012/MP | Rapid prototyping |
| Flux 2 Klein | Fastest | Good | Lowest | High-volume batch generation |
Strengths:
Ad creative use cases:
Docs: Replicate Flux, BFL API
Specialized in typography and text rendering within images.
Best for: Ad banners with text, branded graphics, social ad images with headlines API: Ideogram API, Runware Pricing: ~$0.06/image (API), ~$0.009/image (subscription)
Strengths:
Ad creative use cases:
Docs: Ideogram API, Ideogram
| Tool | Best For | API Status | Notes |
|---|---|---|---|
| DALL-E 3 (OpenAI) | General image generation | Official API | Integrated with ChatGPT, good text rendering |
| Midjourney | Artistic, high-aesthetic images | No official public API | Discord-based; unofficial APIs exist but risk bans |
| Stable Diffusion | Self-hosted, customizable | Open source | Best for teams with GPU infrastructure |
Google DeepMind's video generation model, available through the Gemini API and Vertex AI.
Best for: High-quality video ads with native audio, vertical video for social API: Gemini API, Vertex AI Pricing: ~$0.15/sec (Veo 3.1 Fast), ~$0.40/sec (Veo 3.1 Standard)
Capabilities:
Ad creative use cases:
Docs: Veo on Vertex AI
Video generation with simultaneous audio-visual generation and camera controls.
Best for: Cinematic video ads, longer-form content, audio-synced video API: Kling API, PiAPI, fal.ai Pricing: ~$0.09/sec (via fal.ai third-party)
Capabilities:
Ad creative use cases:
Docs: Kling AI Developer
Video generation and editing platform with strong controllability.
Best for: Controlled video generation, style-consistent content, editing existing footage API: Runway Developer Portal
Capabilities:
Ad creative use cases:
Docs: Runway API
OpenAI's video generation model with synchronized audio.
Best for: High-fidelity video with dialogue and sound API: OpenAI API Pricing: Free tier available; Pro from $0.10-0.50/sec depending on resolution
Capabilities:
Ad creative use cases:
Docs: OpenAI Video Generation
ByteDance's video generation model with simultaneous audio-visual generation and multimodal inputs.
Best for: Fast, affordable video ads with native audio, multimodal reference inputs API: BytePlus (official), Replicate, WaveSpeedAI, fal.ai (third-party); OpenAI-compatible API format Pricing: ~$0.10-0.80/min depending on resolution (estimated 10-100x cheaper than Sora 2 per clip)
Capabilities:
Ad creative use cases:
Docs: Seedance
Full-stack video creation platform with cinematic camera controls.
Best for: Social video ads, cinematic style, mobile-first content Platform: higgsfield.ai
Capabilities:
Ad creative use cases:
| Tool | Max Length | Audio | Resolution | API | Best For |
|---|---|---|---|---|---|
| Veo 3.1 | 60 sec | Native | 1080p/4K | Gemini | Vertical social video |
| Kling 2.6 | 3 min | Native | 1080p | Third-party | Longer cinematic |
| Runway Gen-4 | 10 sec | No | 1080p | Official | Controlled, consistent |
| Sora 2 | 60 sec | Native | 1080p | Official | Dialogue-heavy |
| Seedance 2.0 | 20 sec | Native | 2K | Official + third-party | Affordable high-volume |
| Higgsfield | Varies | Yes | 1080p | Web-based | Social, mobile-first |
For layering realistic voiceovers onto video ads, adding narration to product demos, or generating audio for Remotion-rendered videos. These tools turn ad scripts into natural-sounding voice tracks.
Many video generators (Veo, Kling, Sora, Seedance) now include native audio. Use standalone voice tools when you need:
The market leader in realistic voice generation and voice cloning.
Best for: Most natural-sounding voiceovers, brand voice cloning, multilingual API: REST API with streaming support Pricing: ~$0.12-0.30 per 1,000 characters depending on plan; starts at $5/month
Capabilities:
Ad creative use cases:
API example:
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}" \
-H "xi-api-key: $ELEVENLABS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"text": "Stop wasting hours on manual reporting. Try DataFlow free for 14 days.",
"model_id": "eleven_multilingual_v2",
"voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
}' --output voiceover.mp3
Docs: ElevenLabs API
Simple, affordable text-to-speech built into the OpenAI API.
Best for: Quick voiceovers, cost-effective at scale, simple integration API: OpenAI API (same SDK as GPT/DALL-E) Pricing: $15/million chars (standard), $30/million chars (HD); ~$0.015/min with gpt-4o-mini-tts
Capabilities:
Ad creative use cases:
Docs: OpenAI TTS
Ultra-low latency voice generation built for real-time applications.
Best for: Real-time voice, lowest latency, emotional expressiveness API: REST + WebSocket streaming Pricing: Starts at $5/month; pay-as-you-go from $0.03/min
Capabilities:
Ad creative use cases:
Docs: Cartesia Sonic
Free, local-first voice synthesis studio powered by Qwen3-TTS. The open-source alternative to ElevenLabs.
Best for: Free voice cloning, local/private generation, zero-cost batch production
API: Local REST API at http://localhost:8000
Pricing: Free (MIT license). Runs entirely on your machine.
Stack: Tauri (Rust) + React + FastAPI (Python)
Capabilities:
Ad creative use cases:
API example:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Stop wasting hours on manual reporting.", "profile_id": "abc123", "language": "en"}'
Install: Desktop apps for macOS and Windows at voicebox.sh, or build from source:
git clone https://github.com/jamiepine/voicebox.git
cd voicebox && make setup && make dev
Docs: GitHub
| Tool | Best For | Differentiator | API |
|---|---|---|---|
| PlayHT | Large voice library, low latency | 900+ voices, <300ms latency, ultra-realistic | play.ht |
| Resemble AI | Enterprise voice cloning | On-premise deployment, real-time speech-to-speech | resemble.ai |
| WellSaid Labs | Ethical, commercial-safe voices | Voices from compensated actors, safe for commercial use | wellsaid.io |
| Fish Audio | Budget-friendly, emotion control | ~50-70% cheaper than ElevenLabs, emotion tags | fish.audio |
| Murf AI | Non-technical teams | Browser-based studio, 200+ voices | murf.ai |
| Google Cloud TTS | Google ecosystem, scale | 220+ voices, 40+ languages, enterprise SLAs | Google TTS |
| Amazon Polly | AWS ecosystem, cost | Neural voices, SSML control, cheap at volume | Amazon Polly |
| Tool | Quality | Cloning | Languages | Latency | Price/1K chars |
|---|---|---|---|---|---|
| ElevenLabs | Best | Yes (instant + pro) | 29+ | ~200ms | $0.12-0.30 |
| OpenAI TTS | Good | No | 13+ | ~300ms | $0.015-0.030 |
| Cartesia Sonic | Very good | No | 15+ | ~40ms | ~$0.03/min |
| PlayHT | Very good | Yes | 140+ | <300ms | ~$0.10-0.20 |
| Fish Audio | Good | Yes | 13+ | ~200ms | ~$0.05-0.10 |
| WellSaid | Very good | No (actor voices) | English | ~300ms | Custom pricing |
| Voicebox | Good | Yes (local) | 2+ | Local | Free (open source) |
Need voiceover for ads?
├── Need to clone a specific brand voice?
│ ├── Best quality → ElevenLabs
│ ├── Enterprise/on-premise → Resemble AI
│ └── Budget-friendly → Fish Audio, PlayHT
├── Need multilingual (same ad, many languages)?
│ ├── Most languages → PlayHT (140+)
│ └── Best quality → ElevenLabs (29+)
├── Need free / open source / local?
│ └── Voicebox (MIT, runs on your machine)
├── Need cheap, fast, good-enough?
│ └── OpenAI TTS ($0.015/min)
├── Need commercially-safe licensing?
│ └── WellSaid Labs (actor-compensated voices)
└── Need real-time/interactive?
└── Cartesia Sonic (40ms TTFA)
1. Write ad script (use ad-creative skill for copy)
2. Generate voiceover with ElevenLabs/OpenAI TTS
3. Generate or render video:
a. Silent video from Runway/Remotion → layer voice track
b. Or use Veo/Sora/Seedance with native audio (skip separate VO)
4. Combine with ffmpeg if layering separately:
ffmpeg -i video.mp4 -i voiceover.mp3 -c:v copy -c:a aac output.mp4
5. Generate variations (different scripts, voices, or languages)
For templated, data-driven video ads at scale, Remotion is the best option. Unlike AI video generators that produce unique video from prompts, Remotion uses React code to render deterministic, brand-perfect video from templates and data.
Best for: Templated ad variations, personalized video, brand-consistent production Stack: React + TypeScript Pricing: Free for individuals/small teams; commercial license required for 4+ employees Docs: remotion.dev
| AI Video Generators | Remotion |
|---|---|
| Unique output each time | Deterministic, pixel-perfect |
| Prompt-based, less control | Full code control over every frame |
| Hard to match brand exactly | Exact brand colors, fonts, spacing |
| One-at-a-time generation | Batch render hundreds from data |
| No dynamic data insertion | Personalize with names, prices, stats |
1. Dynamic product ads Feed a JSON array of products and render a unique video ad for each:
// Simplified Remotion component for product ads
export const ProductAd: React.FC<{
productName: string;
price: string;
imageUrl: string;
tagline: string;
}> = ({productName, price, imageUrl, tagline}) => {
return (
<AbsoluteFill style={{backgroundColor: '#fff'}}>
<h1>{productName}</h1>
<p>{tagline}</p>
<div className="price">{price}</div>
<div className="cta">Shop Now</div>
</AbsoluteFill>
);
};
2. A/B test video variations Render the same template with different headlines, CTAs, or color schemes:
const variations = [
{headline: "Save 50% Today", cta: "Get the Deal", theme: "urgent"},
{headline: "Join 10K+ Teams", cta: "Start Free", theme: "social-proof"},
{headline: "Built for Speed", cta: "Try It Now", theme: "benefit"},
];
// Render all variations programmatically
3. Personalized outreach videos Generate videos addressing prospects by name for cold outreach or sales.
4. Social ad batch production Render the same content across different aspect ratios:
1. Design template in React (or use AI to generate the component)
2. Define data schema (products, headlines, CTAs, images)
3. Feed data array into template
4. Batch render all variations
5. Upload to ad platform
# Create a new Remotion project
npx create-video@latest
# Render a single video
npx remotion render src/index.ts MyComposition out/video.mp4
# Batch render from data
npx remotion render src/index.ts MyComposition --props='{"data": [...]}'
Need video ads?
├── Templated, data-driven (same structure, different data)
│ └── Use Remotion
├── Unique creative from prompts (exploratory)
│ ├── Need dialogue/voiceover? → Sora 2, Veo 3.1, Kling 2.6, Seedance 2.0
│ ├── Need consistency across scenes? → Runway Gen-4
│ ├── Need vertical social video? → Veo 3.1 (native 9:16)
│ ├── Need high volume at low cost? → Seedance 2.0
│ └── Need cinematic camera work? → Higgsfield, Kling
└── Both → Use AI gen for hero creative, Remotion for variations
Need image ads?
├── Need text/headlines in image? → Ideogram
├── Need product consistency across variations? → Flux (multi-ref)
├── Need quick iterations on existing images? → Nano Banana Pro
├── Need highest visual quality? → Flux Pro, Midjourney
└── Need high volume at low cost? → Flux Klein, Nano Banana
| Approach | Tool | Approximate Cost |
|---|---|---|
| 100 static images | Nano Banana Pro | ~$4-24 |
| 100 static images | Flux Dev | ~$1-2 |
| 100 static images | Ideogram API | ~$6 |
| 100 × 15-sec videos | Veo 3.1 Fast | ~$225 |
| 100 × 15-sec videos | Remotion (templated) | ~$0 (self-hosted render) |
| 10 hero videos + 90 templated | Veo + Remotion | ~$22 + render time |
This hybrid approach gives you the creative exploration of AI generators and the consistency and scale of code-based rendering.
When generating images for ads, request the correct dimensions:
| Platform | Placement | Aspect Ratio | Recommended Size |
|---|---|---|---|
| Meta Feed | Single image | 1:1 | 1080x1080 |
| Meta Stories/Reels | Vertical | 9:16 | 1080x1920 |
| Meta Carousel | Square | 1:1 | 1080x1080 |
| Google Display | Landscape | 1.91:1 | 1200x628 |
| Google Display | Square | 1:1 | 1200x1200 |
| LinkedIn Feed | Landscape | 1.91:1 | 1200x627 |
| LinkedIn Feed | Square | 1:1 | 1200x1200 |
| TikTok Feed | Vertical | 9:16 | 1080x1920 |
| Twitter/X Feed | Landscape | 16:9 | 1200x675 |
| Twitter/X Card | Landscape | 1.91:1 | 800x418 |
Include these dimensions in your generation prompts to avoid needing to crop or resize.