Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
:::caution Not Supported with Remote Executor
This source is not supported with the Remote Executor in DataHub Cloud. It must be run using a self-hosted ingestion setup.
:::
Ingest entire workspace documentation with semantic search:
```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    # Start from workspace root page
    page_ids:
      - "workspace_root_page_id"
    recursive: true
    # Enable semantic embeddings
    embedding:
      provider: "cohere"
      model: "embed-english-v3.0"
      api_key: "${COHERE_API_KEY}"
```
Ingest a specific Notion database (e.g., "Product Requirements"):
```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    # Only this database
    database_ids:
      - "product_requirements_db_id"
    recursive: false # Only database entries, not child pages
```
Ingest from multiple workspaces (requires multiple integrations):
```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    # Multiple root pages from different workspaces
    page_ids:
      - "workspace_1_page_id"
      - "workspace_2_page_id"
    recursive: true
```
Enterprise setup using AWS Bedrock for embeddings:
```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    page_ids:
      - "company_wiki_root"
    recursive: true
    # Use AWS Bedrock (no API key needed, uses IAM roles)
    embedding:
      provider: "bedrock"
      aws_region: "us-west-2"
      model: "cohere.embed-english-v3"
    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true
```
The source uses content-based change detection: with stateful ingestion enabled, incremental runs are driven by whether a document's content has changed.
```yaml
processing:
  parallelism:
    num_processes: 4 # Increase for faster processing (default: 2)
    max_connections: 20 # Concurrent API connections (default: 10)
```
Guidelines:
- `num_processes: 2`
- `num_processes: 4`
- `num_processes: 8`

```yaml
filtering:
  min_text_length: 100 # Skip short pages (default: 50)
  skip_empty_documents: true # Skip empty pages (default: true)
```
```yaml
chunking:
  strategy: "by_title" # Preserves document structure (recommended)
  max_characters: 500 # Chunk size (default: 500)
  combine_text_under_n_chars: 100 # Merge small chunks (default: 100)
```
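The tuning fragments above do not show where they sit in a recipe. The sketch below assumes that `processing`, `filtering`, and `chunking` are sibling keys under `source.config`, alongside `api_key` and `page_ids`. That placement is an assumption inferred from the other examples on this page, and the page ID is a placeholder, so check it against the config reference before relying on it.

```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    page_ids:
      - "workspace_root_page_id" # Placeholder ID
    recursive: true
    # Assumed placement: tuning sections as siblings of api_key and page_ids
    processing:
      parallelism:
        num_processes: 4
        max_connections: 20
    filtering:
      min_text_length: 100
      skip_empty_documents: true
    chunking:
      strategy: "by_title"
      max_characters: 500
      combine_text_under_n_chars: 100
```

Keeping all three sections in one recipe makes it easier to change one setting at a time and compare ingestion runs.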
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
"Integration not found" or "Unauthorized" errors:
- Verify that `api_key` is correct (it should start with `secret_`; newer Notion tokens may start with `ntn_` instead)

Empty or missing content:
- Check whether pages are being skipped as empty (`skip_empty_documents: true`)
- Check the `min_text_length` filter setting (default: 50 characters)
- Set `recursive: true` if expecting child pages

Slow ingestion:
- Increase `processing.parallelism.num_processes` (default: 2)
- Use `partition_by_api: false` for local processing (requires more memory)
- Limit the ingestion scope with `page_ids`

Embedding generation failures:
- Check the `embedding` settings: `provider`, `model`, and either `api_key` (Cohere) or `aws_region` with IAM role access (Bedrock)

Stateful ingestion not working:
- Verify `stateful_ingestion.enabled: true` in config

Missing hierarchy/parent relationships:
- Verify `hierarchy.enabled: true` (default)
- Use `recursive: true` to discover parent-child relationships

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
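For the stateful ingestion and hierarchy checks, the relevant settings could be laid out as in the sketch below. It assumes that `hierarchy` nests under `source.config` like the other options on this page, and that a top-level `pipeline_name` is set, which DataHub's stateful ingestion generally needs in order to track state across runs. The name `notion_docs` and the page ID are placeholders, not values from this source's documentation.

```yaml
# Hypothetical pipeline name; stateful ingestion uses it to identify saved state
pipeline_name: notion_docs
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    page_ids:
      - "workspace_root_page_id" # Placeholder ID
    recursive: true # Lets the source discover child pages and parent-child links
    hierarchy:
      enabled: true # Assumed nesting for hierarchy.enabled (default: true)
    stateful_ingestion:
      enabled: true
```

If state still does not carry over between runs, a changed `pipeline_name` is worth ruling out, since state is usually tied to that name.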