The DataHub Documents source processes Document entities already stored in DataHub and enriches them with embeddings for semantic search. It is designed to work with DataHub's native Document entities created via GraphQL, the Python SDK, or other ingestion sources (such as Notion or Confluence).
The source automatically fetches the embedding configuration from your DataHub server, so the embeddings it generates stay aligned with the server's settings:

- Connection via environment variables (`DATAHUB_GMS_URL`, `DATAHUB_GMS_TOKEN`)
- `config: {}` in your recipe

Event-Driven Mode (Recommended):
Batch Mode:
- `["notion", "confluence"]`

You MUST configure semantic search on your DataHub server before using this source.
See the Semantic Search Configuration Guide for complete setup instructions.
Required server configuration:
- `application.yml`

This source processes existing Document entities. Documents can be created through:
- `datahub.sdk.document.Document`

If using AWS Bedrock for embeddings:
- `bedrock:InvokeModel`

```bash
# Required for DataHub connection
export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="your-token-here"

# Optional: AWS credentials (if not using instance/task roles)
export AWS_PROFILE="datahub-dev"
# OR
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-west-2"
```
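Because the source relies on these variables and an otherwise empty `config: {}`, a small preflight step can catch a missing variable before ingestion starts. The following is a minimal sketch, not part of DataHub itself: `check_datahub_env` and `write_minimal_recipe` are hypothetical helper names, and the source type `datahub-documents` is an assumption that should be checked against the name your DataHub version registers.

```bash
# Preflight sketch: verify the connection variables, then write a minimal recipe.

# Hypothetical helper: returns non-zero if a required variable is unset or empty.
check_datahub_env() {
  missing=0
  if [ -z "${DATAHUB_GMS_URL:-}" ]; then
    echo "missing required variable: DATAHUB_GMS_URL" >&2
    missing=1
  fi
  if [ -z "${DATAHUB_GMS_TOKEN:-}" ]; then
    echo "missing required variable: DATAHUB_GMS_TOKEN" >&2
    missing=1
  fi
  return "$missing"
}

# Hypothetical helper: writes a recipe with an empty config to the path in $1.
# Assumption: the source type is named "datahub-documents".
write_minimal_recipe() {
  cat > "$1" <<'EOF'
source:
  type: datahub-documents
  config: {}
EOF
}
```

One way to use the helpers is `check_datahub_env && write_minimal_recipe recipe.yml`, followed by `datahub ingest -c recipe.yml`. Keeping the check separate from the ingestion command means a missing token fails fast with a clear message instead of surfacing later as a connection error.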