metadata-ingestion/docs/sources/confluence/confluence_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
By default, the connector discovers and ingests all accessible spaces:
source:
  type: confluence
  config:
    # Confluence Cloud
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # No filtering - discovers all accessible spaces
    # Optional: limit number of spaces for large instances
    max_spaces: 100
Ingest only specific Confluence spaces:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Include only these spaces
    spaces:
      allow:
        - "ENGINEERING"
        - "PRODUCT"
        - "DESIGN"
Ingest all spaces except specific ones:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Exclude personal spaces and archived content
    spaces:
      deny:
        - "~john.doe"
        - "~jane.smith"
        - "ARCHIVE"
        - "OLD_DOCS"
Ingest specific pages and their descendants:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Start from specific pages
    pages:
      allow:
        - "123456789" # API Documentation page tree
        - "987654321" # User Guides page tree
    recursive: true # Include all child pages
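To make the effect of `recursive: true` concrete, the following Python sketch walks a page tree breadth-first, starting from the allowed root pages. The `children` mapping and the function name `collect_pages` are hypothetical stand-ins for whatever the connector retrieves from the Confluence API. This illustrates the idea only and is not the connector's implementation.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def collect_pages(
    roots: Iterable[str],
    children: Dict[str, List[str]],
    recursive: bool = True,
) -> List[str]:
    """Return the root pages, plus all of their descendants when recursive is set."""
    seen: Set[str] = set()
    ordered: List[str] = []
    queue = deque(roots)
    while queue:
        page_id = queue.popleft()
        if page_id in seen:
            continue  # skip pages that were already visited
        seen.add(page_id)
        ordered.append(page_id)
        if recursive:
            # Only descend into child pages when recursion is enabled.
            queue.extend(children.get(page_id, []))
    return ordered


# Hypothetical page tree: page 123456789 has two children, one of which has a child.
tree = {"123456789": ["111", "222"], "111": ["333"]}
print(collect_pages(["123456789", "987654321"], tree))
print(collect_pages(["123456789", "987654321"], tree, recursive=False))
```

With recursion disabled, only the listed root pages are returned.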
Combine space and page filters for fine-grained control:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Include specific spaces
    spaces:
      allow:
        - "ENGINEERING"
        - "PRODUCT"
      # Exclude personal spaces even if in allow list
      deny:
        - "~admin"
    # Exclude specific pages (e.g., drafts, archived content)
    pages:
      deny:
        - "999999" # Draft page
        - "888888" # Archived page
Connect to Confluence Data Center or Server:
source:
  type: confluence
  config:
    # Data Center / Server
    cloud: false
    url: "https://confluence.company.com"
    personal_access_token: "${CONFLUENCE_PAT}"
    spaces:
      allow:
        - "WIKI"
        - "DOCS"
Enterprise setup with incremental updates:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    spaces:
      allow:
        - "COMPANY"
        - "PUBLIC"
    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true
Note: Embedding configuration is managed by your DataHub instance. See Semantic Search Configuration for setup.
You can specify spaces and pages using full URLs for both allow and deny lists:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "[email protected]"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Use full URLs - connector extracts keys/IDs automatically
    spaces:
      allow:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG"
        - "https://your-domain.atlassian.net/wiki/spaces/PRODUCT"
      deny:
        - "https://your-domain.atlassian.net/wiki/spaces/ARCHIVE"
        - "~john.doe" # Can mix URLs and keys
    pages:
      allow:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Getting+Started"
      deny:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/999999/Draft"
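As an illustration of what extracting keys and IDs from URLs could involve, here is a minimal Python sketch using only the standard library. The helper names `space_key` and `page_id` are hypothetical, and the connector's actual parsing logic may differ. The sketch assumes the URL shapes used in this document: `/spaces/<KEY>`, `/pages/<ID>/...`, and `?pageId=<ID>`.

```python
import re
from urllib.parse import parse_qs, urlparse


def space_key(value: str) -> str:
    """Return a space key from either a bare key or a .../spaces/<KEY> URL."""
    if not value.startswith(("http://", "https://")):
        return value  # already a bare key such as "ENGINEERING" or "~john.doe"
    match = re.search(r"/spaces/([^/]+)", urlparse(value).path)
    if match is None:
        raise ValueError(f"no space key found in {value!r}")
    return match.group(1)


def page_id(value: str) -> str:
    """Return a page ID from a bare ID, a /pages/<ID>/ URL, or a ?pageId=<ID> URL."""
    if not value.startswith(("http://", "https://")):
        return value  # already a bare numeric ID string
    parsed = urlparse(value)
    match = re.search(r"/pages/(\d+)", parsed.path)
    if match:
        return match.group(1)
    ids = parse_qs(parsed.query).get("pageId")
    if ids:
        return ids[0]
    raise ValueError(f"no page ID found in {value!r}")


print(space_key("https://your-domain.atlassian.net/wiki/spaces/ENG"))
print(page_id("https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Getting+Started"))
print(page_id("https://confluence.company.com/pages/viewpage.action?pageId=123456"))
```

Bare keys and IDs pass through unchanged, which is why URLs and keys can be mixed in the same list in this sketch.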
The connector provides flexible filtering options through allow and deny lists for both spaces and pages.
Control which Confluence spaces are ingested:
spaces.allow: Include only specific spaces (by default, all accessible spaces are discovered)
spaces:
  allow:
    - "ENGINEERING" # Space key
    - "PRODUCT"
    - "https://your-domain.atlassian.net/wiki/spaces/DESIGN" # Or full URL
spaces.deny: Exclude specific spaces (applied after spaces.allow)
spaces:
  deny:
    - "~john.doe" # Personal space
    - "ARCHIVE" # Archived content
    - "TEST" # Test space
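Since `spaces.deny` is applied after `spaces.allow`, the combined effect can be pictured with a short Python sketch. The function name `space_selected` is hypothetical, and exact, case-sensitive key matching is an assumption rather than documented connector behavior.

```python
from typing import Optional, Sequence


def space_selected(
    key: str,
    allow: Optional[Sequence[str]] = None,
    deny: Optional[Sequence[str]] = None,
) -> bool:
    """Apply the allow list first, then the deny list."""
    # An empty or missing allow list means every discovered space is a candidate.
    if allow and key not in allow:
        return False
    # The deny list is applied last, so it overrides the allow list.
    if deny and key in deny:
        return False
    return True


allow = ["ENGINEERING", "PRODUCT", "~admin"]
deny = ["~admin"]
for key in ["ENGINEERING", "~admin", "DESIGN"]:
    print(key, space_selected(key, allow, deny))
```

In this sketch a space listed in both lists is excluded, matching the "even if in allow list" note in the combined example earlier.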
Control which pages are ingested:
pages.allow: Include only specific pages (triggers page-based mode, bypasses space discovery)
pages:
  allow:
    - "123456789" # Page ID
    - "987654321"
    - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/111111/API+Docs" # Or full URL
recursive: true # Include child pages
pages.deny: Exclude specific pages (works in both space-based and page-based modes)
pages:
  deny:
    - "999999" # Draft page
    - "888888" # Archived page
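Precedence:
- Deny lists are applied after allow lists, so a space listed in both spaces.allow and spaces.deny is excluded.

Modes:
- Space-based mode (default): the connector discovers all accessible spaces, subject to the spaces.allow and spaces.deny filters.
- Page-based mode: when pages.allow is specified, the connector bypasses space discovery and fetches the specific page trees.

Format Support:
- Space keys: "ENGINEERING", "~username" (for personal spaces)
- Page IDs: "123456789" (numeric string)

Exclude all personal spaces: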
spaces:
  deny:
    # Wildcards such as "~*" are not supported; list each personal space explicitly:
    - "~john.doe"
    - "~jane.smith"
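Because wildcards are not supported, the explicit list has to be built by hand or generated. The Python sketch below generates it from a list of space keys you already have, relying only on the `~` prefix that marks personal spaces. The function name `personal_space_denies` and the sample keys are hypothetical, and how you obtain the full list of space keys is left open.

```python
from typing import Iterable, List


def personal_space_denies(space_keys: Iterable[str]) -> List[str]:
    """Return the keys that start with '~', the prefix used for personal spaces."""
    return sorted(key for key in space_keys if key.startswith("~"))


# Sample keys; in practice these would come from your own inventory of spaces.
keys = ["ENGINEERING", "~john.doe", "PRODUCT", "~jane.smith"]

# Print a YAML fragment that can be pasted into the recipe.
print("spaces:")
print("  deny:")
for key in personal_space_denies(keys):
    print(f'    - "{key}"')
```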
Ingest only documentation spaces:
spaces:
  allow:
    - "DOCS"
    - "API_DOCS"
    - "USER_GUIDES"
Focus on specific documentation trees:
pages:
  allow:
    - "123456" # API Documentation root page
    - "789012" # User Guides root page
recursive: true
Exclude drafts and WIP pages:
pages:
  deny:
    - "999999" # Draft page ID
    - "888888" # WIP page ID
The connector supports multiple input formats for spaces and pages in allow/deny lists:
Space Identifiers:
- Space key: "ENGINEERING", "~username" (for personal spaces)
- Full URL: "https://your-domain.atlassian.net/wiki/spaces/ENGINEERING"

Page Identifiers:
- Page ID: "123456789" (numeric string)
- Cloud URL: "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Page+Title"
- Data Center URL: "https://confluence.company.com/pages/viewpage.action?pageId=123456"

The connector automatically extracts space keys and page IDs from URLs, so you can use either format interchangeably in the spaces.allow, spaces.deny, pages.allow, and pages.deny lists.
The source uses content-based change detection:
This means:
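In general terms, content-based change detection can be pictured as comparing a hash of each page's text with the hash recorded on the previous run. The Python sketch below shows that idea with SHA-256. The function names, the choice of hash, and the in-memory state dictionary are all assumptions for illustration and do not describe the connector's internals.

```python
import hashlib
from typing import Dict


def content_hash(text: str) -> str:
    """Return a stable fingerprint of the page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(current: Dict[str, str], previous_hashes: Dict[str, str]) -> Dict[str, str]:
    """Return {page_id: new_hash} for pages that are new or whose text differs from the last run."""
    changed: Dict[str, str] = {}
    for page_id, text in current.items():
        digest = content_hash(text)
        if previous_hashes.get(page_id) != digest:
            changed[page_id] = digest
    return changed


# Hashes recorded on a previous run (hypothetical state).
state = {"123456": content_hash("Getting started guide")}

# Page 123456 is unchanged, page 999999 is new.
pages = {"123456": "Getting started guide", "999999": "Draft notes"}
print(sorted(changed_pages(pages, state)))
```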
processing:
  parallelism:
    num_processes: 4 # Increase for faster processing (default: 2)
    max_connections: 20 # Concurrent API connections (default: 10)
Guidelines:
- num_processes: 2 (default)
- num_processes: 4
- num_processes: 8

filtering:
  min_text_length: 100 # Skip short pages (default: 50)
  skip_empty_documents: true # Skip empty pages (default: true)
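The two filtering settings can be read as a simple predicate over the extracted page text. The Python sketch below is one plausible interpretation, assuming that length is counted in characters after trimming whitespace and that pages shorter than `min_text_length` are skipped. The function name `should_skip` is hypothetical, and the connector may measure length differently.

```python
def should_skip(
    text: str,
    min_text_length: int = 50,
    skip_empty_documents: bool = True,
) -> bool:
    """Decide whether a page would be filtered out under the assumed semantics."""
    stripped = text.strip()
    if not stripped:
        # Empty pages are governed by skip_empty_documents alone in this sketch.
        return skip_empty_documents
    # Non-empty pages are skipped when they fall below the length threshold.
    return len(stripped) < min_text_length


print(should_skip(""))
print(should_skip("x" * 60))
print(should_skip("x" * 60, min_text_length=100))
```

Raising `min_text_length` to 100, as in the fragment above, would filter out the 60-character page in this sketch.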
Instead of ingesting all spaces, select specific ones:
spaces:
  allow:
    - "ENGINEERING" # High-value documentation space
    - "PRODUCT" # Product requirements space
  deny:
    - "~john.doe" # Exclude personal spaces (wildcards are not supported; list specific users)
    - "ARCHIVE" # Exclude archived content
    - "TEST" # Exclude test spaces
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
:::caution Not Supported with Remote Executor
This source is not supported with the Remote Executor in DataHub Cloud. It must be run using a self-hosted ingestion setup.
:::

"401 Unauthorized" or "Authentication failed" errors:
- Cloud: verify that username (email) and api_token are correct
- Data Center: verify that personal_access_token is valid and not expired
- Check that cloud: true/false matches your Confluence type
- Include the /wiki suffix in the URL for Cloud (e.g., https://domain.atlassian.net/wiki)

"403 Forbidden" or "Space not found" errors:
- Validate the account's permissions on the affected space

Empty or missing content:
- Empty pages are skipped by default (skip_empty_documents: true)
- Check the min_text_length filter setting (default: 50 characters)
- Set recursive: true if expecting child pages

Slow ingestion:
- Increase processing.parallelism.num_processes (default: 2)

Embedding generation failures:
- Embedding configuration is managed by your DataHub instance; see Semantic Search Configuration

Stateful ingestion not working:
- Ensure stateful_ingestion.enabled: true in config

Missing hierarchy/parent relationships:
- Ensure hierarchy.enabled: true (default)
- Use recursive: true to discover parent-child relationships

Page IDs not working:
- Cloud: use the numeric ID from the URL (after /pages/)
- Data Center: use the value of the ?pageId= parameter
- Full page URLs can also be used in pages.allow or pages.deny

How to find space keys and page IDs:
- Space key: https://domain.atlassian.net/wiki/spaces/ENGINEERING → key is ENGINEERING
- Page ID (Cloud), the number after /pages/: https://domain.atlassian.net/wiki/spaces/ENG/pages/123456/Title → ID is 123456
- Page ID (Data Center): https://confluence.company.com/pages/viewpage.action?pageId=123456 → ID is 123456
- Personal spaces: ~username (e.g., ~john.doe for user john.doe)

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
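Several of the authentication checks above can be run before starting ingestion. The Python sketch below applies them to a plain dictionary holding the `config` section of a recipe. The function name `preflight` and its messages are hypothetical, and it only encodes the rules stated in this document (Cloud uses `username` and `api_token` with a `/wiki` URL suffix; Data Center or Server uses `personal_access_token`). It is not part of the connector.

```python
from typing import Any, Dict, List


def preflight(config: Dict[str, Any]) -> List[str]:
    """Return a list of likely configuration problems; an empty list means none were found."""
    problems: List[str] = []
    url = str(config.get("url", ""))
    if not url.startswith(("http://", "https://")):
        problems.append("url must be an absolute http(s) URL")

    cloud = config.get("cloud")
    if cloud is None:
        problems.append("cloud must be set to true or false")
    elif cloud:
        # Confluence Cloud: email + API token, URL ending in /wiki.
        if not config.get("username") or not config.get("api_token"):
            problems.append("Cloud requires username and api_token")
        if not url.rstrip("/").endswith("/wiki"):
            problems.append("Cloud url should end with /wiki")
    else:
        # Data Center / Server: personal access token.
        if not config.get("personal_access_token"):
            problems.append("Data Center / Server requires personal_access_token")
    return problems


# Hypothetical Data Center config with the token missing.
print(preflight({"cloud": False, "url": "https://confluence.company.com"}))
```

A check like this catches a mismatched `cloud` flag or a missing `/wiki` suffix early, before a 401 response has to be diagnosed from ingestion logs.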