Documentation Embeddings Generation System

Overview

The documentation embeddings generation system processes several documentation sources and uploads their metadata and vector embeddings to a database that powers semantic search. The system lives in apps/docs/scripts/search/ and works by:

Discovering content sources from multiple types of documentation
Processing content into structured sections with checksums
Generating embeddings using OpenAI's text-embedding-ada-002 model
Storing the processed sections in the database with their vector embeddings for semantic search
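As a rough illustration of the embedding step, the sketch below builds a request for OpenAI's embeddings endpoint with the text-embedding-ada-002 model and reads the vector out of the response. The function names and the way the API key is passed are assumptions made for this example; the real script may use an OpenAI client library instead of a raw HTTP request.

```typescript
// Sketch only: these names are illustrative and not taken from generate-embeddings.ts.
const EMBEDDING_MODEL = 'text-embedding-ada-002'
const EMBEDDINGS_URL = 'https://api.openai.com/v1/embeddings'

interface EmbeddingRequest {
  url: string
  init: { method: string; headers: Record<string, string>; body: string }
}

// Build the HTTP request that asks OpenAI to embed one section of text.
function buildEmbeddingRequest(input: string, apiKey: string): EmbeddingRequest {
  return {
    url: EMBEDDINGS_URL,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model: EMBEDDING_MODEL, input }),
    },
  }
}

// Extract the embedding vector from the JSON body of a successful response.
function parseEmbeddingResponse(json: { data: { embedding: number[] }[] }): number[] {
  const first = json.data[0]
  if (first === undefined) throw new Error('Embedding response contained no data')
  return first.embedding
}
```

The `url` and `init` values can be passed to `fetch(url, init)`, and the parsed JSON body of the response handed to `parseEmbeddingResponse`.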

Architecture

Main Entry Point

apps/docs/scripts/search/generate-embeddings.ts - Main script that orchestrates the entire process
Supports a --refresh flag that forces regeneration of all content
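One way to picture the effect of --refresh is as a switch that bypasses the checksum comparison. The helper names below (`parseFlags`, `shouldRegenerate`) are hypothetical and only illustrate the idea; the actual script may parse its arguments differently.

```typescript
// Hypothetical helpers; not the script's actual argument handling.
interface Flags {
  refresh: boolean
}

// Look for the --refresh flag among the command-line arguments.
function parseFlags(argv: string[]): Flags {
  return { refresh: argv.some((arg) => arg === '--refresh') }
}

// A page is regenerated when it is new, when its checksum changed,
// or when --refresh forces regeneration regardless of the checksum.
function shouldRegenerate(
  storedChecksum: string | undefined,
  newChecksum: string,
  refresh: boolean
): boolean {
  return refresh || storedChecksum === undefined || storedChecksum !== newChecksum
}
```

In a Node.js script the arguments would typically come from `process.argv.slice(2)`.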

Content Sources (`sources/` directory)

Base Classes

BaseLoader - Abstract class for loading content from different sources
BaseSource - Abstract class for processing and formatting content
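The relationship between the two base classes might look roughly like the sketch below: a loader produces sources, and each source turns its raw content into a checksum, metadata, and sections. The member names (`load`, `process`, `path`) and the `InlineLoader`/`InlineSource` subclasses are assumptions for illustration, not the actual class definitions.

```typescript
// Illustrative shapes only; the real class members may differ.
interface Section {
  heading?: string
  content: string
}

interface ProcessedContent {
  checksum: string
  meta: Record<string, unknown>
  sections: Section[]
}

abstract class BaseSource {
  readonly path: string
  constructor(path: string) {
    this.path = path
  }
  // Turn raw content into a checksum, metadata, and sections.
  abstract process(): ProcessedContent
}

abstract class BaseLoader {
  readonly path: string
  constructor(path: string) {
    this.path = path
  }
  // Discover and load the content this loader is responsible for.
  abstract load(): Promise<BaseSource[]>
}

// Toy in-memory source used only to show how the pieces fit together.
class InlineSource extends BaseSource {
  private readonly text: string
  constructor(path: string, text: string) {
    super(path)
    this.text = text
  }
  process(): ProcessedContent {
    // Placeholder checksum; a real source would hash its content.
    return { checksum: 'placeholder', meta: {}, sections: [{ content: this.text }] }
  }
}

class InlineLoader extends BaseLoader {
  private readonly text: string
  constructor(path: string, text: string) {
    super(path)
    this.text = text
  }
  async load(): Promise<BaseSource[]> {
    return [new InlineSource(this.path, this.text)]
  }
}
```

Splitting loading from processing in this way lets each source type define how its content is fetched separately from how it is sectioned.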

Source Types

Markdown Sources (apps/docs/scripts/search/sources/markdown.ts)
- Processes .mdx files from guides and documentation
- Extracts frontmatter metadata and content sections
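To make the frontmatter and section extraction concrete, here is a deliberately simplified sketch that handles only flat `key: value` frontmatter and splits the body at heading lines. It is an assumption-laden stand-in: the real source presumably relies on a proper MDX/Markdown parser, and this version would misread `#` lines inside fenced code blocks.

```typescript
// Simplified stand-in for illustration; not the parser used by markdown.ts.
interface MarkdownSection {
  heading?: string
  content: string
}

interface ParsedMarkdown {
  meta: Record<string, string>
  sections: MarkdownSection[]
}

function parseMarkdown(source: string): ParsedMarkdown {
  const meta: Record<string, string> = {}
  let body = source

  // Frontmatter is the block between two '---' lines at the top of the file.
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(source)
  if (match) {
    for (const line of (match[1] ?? '').split(/\r?\n/)) {
      const idx = line.indexOf(':')
      if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim()
    }
    body = source.slice(match[0].length)
  }

  // Start a new section at every heading line (# through ######).
  const sections: MarkdownSection[] = []
  let current: MarkdownSection = { content: '' }
  const flush = () => {
    if (current.heading !== undefined || current.content.trim() !== '') {
      sections.push({ ...current, content: current.content.trim() })
    }
  }
  for (const line of body.split(/\r?\n/)) {
    const heading = /^#{1,6}\s+(.*)$/.exec(line)
    if (heading) {
      flush()
      current = { heading: (heading[1] ?? '').trim(), content: '' }
    } else {
      current.content += line + '\n'
    }
  }
  flush()
  return { meta, sections }
}
```

Text that appears before the first heading ends up in a section with no heading.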
Reference Documentation (apps/docs/scripts/search/sources/reference-doc.ts)
- OpenAPI References - Management API documentation from OpenAPI specs
- Client Library References - JavaScript, Dart, Python, C#, Swift, Kotlin SDKs
- CLI References - Command-line interface documentation
- Processes YAML/JSON specs and matches with common sections
GitHub Discussions (apps/docs/scripts/search/sources/github-discussion.ts)
- Fetches troubleshooting discussions from GitHub using the GraphQL API
- Uses GitHub App authentication for access
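A request for discussions through GitHub's GraphQL API could be assembled roughly as follows. The query fields come from GitHub's public schema, but the repository owner, name, category ID, and page size are placeholders, and the token is assumed to be an installation access token already obtained through the GitHub App flow (not shown here).

```typescript
// Sketch only: variables and page size are placeholders, not the script's values.
const GITHUB_GRAPHQL_URL = 'https://api.github.com/graphql'

const DISCUSSIONS_QUERY = `
  query Discussions($owner: String!, $name: String!, $categoryId: ID!, $cursor: String) {
    repository(owner: $owner, name: $name) {
      discussions(first: 100, after: $cursor, categoryId: $categoryId) {
        pageInfo { hasNextPage endCursor }
        nodes { id title body url updatedAt }
      }
    }
  }
`

interface DiscussionVariables {
  owner: string
  name: string
  categoryId: string
  cursor: string | null // null requests the first page
}

// Build the POST request; the token is assumed to be a GitHub App installation token.
function buildDiscussionsRequest(installationToken: string, variables: DiscussionVariables) {
  return {
    url: GITHUB_GRAPHQL_URL,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${installationToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ query: DISCUSSIONS_QUERY, variables }),
    },
  }
}
```

To page through all discussions, the `endCursor` from each response would be passed back as `cursor` until `hasNextPage` is false.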
Partner Integrations (apps/docs/scripts/search/sources/partner-integrations.ts)
- Fetches approved partner integration documentation from the Supabase database
- Technology integrations only (excludes agencies)
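The selection rule for partner integrations (approved, technology only) amounts to a simple filter. The record shape below, including the `approved` and `type` field names and the `'agency'` value, is hypothetical; the actual table columns may be named differently, and the filter may well run in the database query itself.

```typescript
// Hypothetical record shape; actual column names may differ.
interface PartnerRecord {
  slug: string
  title: string
  type: 'technology' | 'agency'
  approved: boolean
  overview: string
}

// Keep only approved technology integrations; agencies are excluded.
function selectIntegrations(partners: PartnerRecord[]): PartnerRecord[] {
  return partners.filter((partner) => partner.approved && partner.type === 'technology')
}
```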

Processing Flow

Content Discovery: Each source loader discovers and loads content files/data
Content Processing: Each source processes content into:
- Checksum for change detection
- Metadata (title, subtitle, etc.)
- Sections with headings and content
Change Detection: Compares checksums against existing database records
Embedding Generation: Uses OpenAI to generate embeddings for new/changed content
Database Storage: Stores pages and their sections in the page and page_section tables, with embeddings
Cleanup: Removes outdated pages using version tracking
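The change detection and cleanup steps above can be sketched with an in-memory map standing in for the page table. This assumes a SHA-256 checksum and a per-run version string, neither of which is confirmed by this document; `syncPages` and the record shape are illustrative names.

```typescript
import { createHash } from 'node:crypto'

// In-memory stand-in for a row of the page table (illustrative shape).
interface StoredPage {
  checksum: string
  version: string
}

// Assumption: SHA-256 over the page content; the real hash may differ.
function computeChecksum(content: string): string {
  return createHash('sha256').update(content).digest('base64')
}

// Returns the paths that need new embeddings and updates `store` in place.
function syncPages(
  store: Map<string, StoredPage>,
  incoming: Map<string, string>,
  runVersion: string
): string[] {
  const toEmbed: string[] = []

  incoming.forEach((content, path) => {
    const checksum = computeChecksum(content)
    const existing = store.get(path)
    // New or changed pages need fresh embeddings.
    if (existing === undefined || existing.checksum !== checksum) toEmbed.push(path)
    // Every page seen in this run is stamped with the current version.
    store.set(path, { checksum, version: runVersion })
  })

  // Cleanup: anything not stamped in this run no longer exists in the sources.
  store.forEach((page, path) => {
    if (page.version !== runVersion) store.delete(path)
  })

  return toEmbed
}
```

Stamping unchanged pages with the current version is what lets cleanup distinguish "unchanged" from "removed" without comparing path lists.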

Database Schema

page table: Stores page metadata, content, checksum, version
page_section table: Stores individual sections with embeddings, token counts
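For orientation, the two tables might map onto row types like the ones below. Only the kinds of data (metadata, content, checksum, version, embeddings, token counts) come from the description above; the column names, the `id`/`page_id` keys, and the `buildSectionRow` helper are assumptions for illustration.

```typescript
// Assumed column names; only the kinds of data are described in this document.
interface PageRow {
  id: number
  path: string
  meta: Record<string, unknown>
  content: string
  checksum: string
  version: string
}

interface PageSectionRow {
  id: number
  page_id: number // assumed foreign key to the page table
  heading: string | null
  content: string
  embedding: number[]
  token_count: number
}

// Assemble an insertable section row (without its generated id).
function buildSectionRow(
  pageId: number,
  section: { heading?: string; content: string },
  embedding: number[],
  tokenCount: number
): Omit<PageSectionRow, 'id'> {
  return {
    page_id: pageId,
    heading: section.heading ?? null,
    content: section.content,
    embedding,
    token_count: tokenCount,
  }
}
```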
