AI agents that answer questions about data often face these challenges:
Agent Context Kit solves this by giving AI agents real-time access to your DataHub metadata catalog, enabling them to provide accurate, contextual answers about your data ecosystem.
DataHub Agent Context provides a collection of tools and utilities for building AI agents that interact with DataHub metadata. This package contains MCP (Model Context Protocol) tools that enable AI agents to search, retrieve, and manipulate metadata in DataHub. These can be used directly to create an agent, or be included in an MCP server such as DataHub's open source MCP server.
pip install datahub-agent-context
These tools are designed for use by an AI agent, with their responses passed directly to an LLM, so each tool returns a simple dict. They can also be called independently if desired.
from datahub.sdk.main_client import DataHubClient
from datahub_agent_context.context import DataHubContext
from datahub_agent_context.mcp_tools.search import search
from datahub_agent_context.mcp_tools.entities import get_entities

# Initialize DataHub client from environment (or specify server/token)
client = DataHubClient.from_env()
# Or: client = DataHubClient(server="http://localhost:8080", token="YOUR_TOKEN")

# Use DataHubContext to set up the client for tool calls
with DataHubContext(client):
    # Search for datasets
    results = search(
        query="user_data",
        filters={"entity_type": ["dataset"]},
        num_results=10,
    )
    print(f"Found {len(results['searchResults'])} datasets")
    for result in results["searchResults"]:
        print(f"- {result['entity']['name']} ({result['entity']['urn']})")

    # Get detailed entity information
    entity_urns = [result["entity"]["urn"] for result in results["searchResults"]]
    entities = get_entities(urns=entity_urns)
    print(f"\nDetailed info for {len(entities['entities'])} entities:")
    for entity in entities["entities"]:
        description = entity.get("properties", {}).get("description", "No description")
        print(f"- {entity['urn']}: {description}")
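Because tool responses are plain dicts meant to be handed to an LLM, it can help to trim them before adding them to a prompt. The following is a minimal sketch, not part of the package: `summarize_search_results` is a hypothetical helper, and the sample data is hand-written to mimic the `searchResults`, `entity`, `name`, and `urn` keys used in the snippet above. The real response may carry more fields or differ in shape.

```python
import json
from typing import Any


def summarize_search_results(results: dict[str, Any], limit: int = 5) -> str:
    """Return a compact JSON string with the name and URN of the top hits."""
    hits = results.get("searchResults", [])[:limit]
    summary = [
        {
            # .get() keeps the helper tolerant of hits that lack a field.
            "name": hit.get("entity", {}).get("name"),
            "urn": hit.get("entity", {}).get("urn"),
        }
        for hit in hits
    ]
    return json.dumps({"shown": len(summary), "results": summary}, indent=2)


# Hand-written sample (an assumption) that mimics the keys used above.
sample_results = {
    "searchResults": [
        {
            "entity": {
                "name": "users",
                "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.users,PROD)",
            }
        },
        {"entity": {"urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.orders,PROD)"}},
    ]
}

print(summarize_search_results(sample_results))
```

Capping the number of hits and dropping unused fields keeps the prompt small; the `limit` default of 5 is arbitrary.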
Before using Agent Context Kit, familiarize yourself with these DataHub concepts:
URN example: `urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.users,PROD)`. This is like a primary key for metadata.

| Platform | Status | Guide |
|---|---|---|
| Custom | Launched | See below |
| LangChain | Launched | LangChain Guide |
| Snowflake | Launched | Snowflake Guide |
| Google ADK | Launched | Google ADK Guide |
| Google Vertex AI | Launched | Google Vertex AI Guide |
| Microsoft Copilot Studio | Launched | Copilot Studio Guide |
| Crew AI | Coming Soon | - |
| OpenAI | Coming Soon | - |
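To make the URN format described under the DataHub concepts above more concrete, here is a small sketch that splits a dataset URN into platform, name, and environment. `ParsedDatasetUrn` and `parse_dataset_urn` are hypothetical names used only for illustration, and the regular expression assumes the three-part dataset URN layout shown in the example URN. The DataHub Python SDK may offer its own URN helpers, which would be preferable in real code.

```python
import re
from typing import NamedTuple, Optional


class ParsedDatasetUrn(NamedTuple):
    platform: str
    name: str
    env: str


# Assumes the layout urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$"
)


def parse_dataset_urn(urn: str) -> Optional[ParsedDatasetUrn]:
    """Return the URN parts, or None if the string is not a dataset URN."""
    match = _DATASET_URN.match(urn)
    return ParsedDatasetUrn(*match.groups()) if match else None


example = parse_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.users,PROD)")
print(example)
```

Returning `None` for anything that does not match lets calling code skip non-dataset URNs without raising.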
- `search(client, query, filters, num_results)`
  - Example: `search(client, "customer", {"entity_type": ["dataset"]}, 10)` to find datasets about customers
- `search_documents(client, query, semantic_query, num_results)`
  - Example: `search_documents(client, "*", "data retention policy", 5)` to find policy documents
- `grep_documents(client, pattern, num_results)`
  - Example: `grep_documents(client, "PII.*encrypted", 10)` to find docs mentioning PII encryption
- `get_entities(client, urns)`
- `list_schema_fields(client, urn, filters)`
  - Example: `list_schema_fields(client, dataset_urn, {"field_path": "customer_"})` to find customer-related columns
- `get_lineage(client, urn, direction, max_depth)`
  - Example: `get_lineage(client, dashboard_urn, "UPSTREAM", 3)` to trace data sources for a dashboard
- `get_lineage_paths_between(client, source_urn, destination_urn)`
- `get_dataset_queries(client, urn, column_name)`
  - Example: `get_dataset_queries(client, dataset_urn, "email")` to see how the email column is used

Note: These tools modify metadata. Use with caution in production environments.
- `add_tags(client, urn, tags)` / `remove_tags(client, urn, tags)`
  - Example: `add_tags(client, dataset_urn, ["PII", "Finance"])` to mark sensitive data
- `update_description(client, urn, description)`
- `set_domains(client, urn, domain_urns)` / `remove_domains(client, urn, domain_urns)`
- `add_owners(client, urn, owners)` / `remove_owners(client, urn, owners)`
  - Example: `add_owners(client, dataset_urn, [{"owner": user_urn, "type": "TECHNICAL_OWNER"}])`
- `add_glossary_terms(client, urn, term_urns)` / `remove_glossary_terms(client, urn, term_urns)`
- `add_structured_properties(client, urn, properties)` / `remove_structured_properties(client, urn, properties)`
- `save_document(document_type, title, content, urn, topics, related_documents, related_assets)`
  - `document_type`: Type of document (required)
  - `title`: Document title (required)
  - `content`: Full document content in markdown format (required)
  - `urn`: URN of existing document to update (optional, creates new if not provided)
  - `topics`: List of topic tags for categorization (optional)
  - `related_documents`: URNs of related documents (optional)
  - `related_assets`: URNs of related data assets like datasets or dashboards (optional)
  - Example: `save_document("Insight", "High Null Rate in Customer Emails", "## Finding\\n\\n23% of customer records have null email...", topics=["data-quality", "customer-data"], related_assets=["urn:li:dataset:(urn:li:dataPlatform:snowflake,customers,PROD)"])`
  - To change the description of an existing entity, use `update_description` instead
- `get_me(client)`
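Since the mutation tools listed above change metadata, one cautious pattern is to route them through a dry-run wrapper while developing an agent. The sketch below is an illustration under stated assumptions, not a feature of the package: `guarded` is a hypothetical helper, and `fake_add_tags` is a stand-in that only records calls in memory so the example runs without a DataHub instance.

```python
from typing import Any, Callable


def guarded(tool: Callable[..., Any], *, dry_run: bool = True) -> Callable[..., Any]:
    """Wrap a mutation tool so it only executes when dry_run is False."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if dry_run:
            # Describe the call instead of touching any metadata.
            return {"dry_run": True, "tool": tool.__name__, "args": args, "kwargs": kwargs}
        return tool(*args, **kwargs)

    return wrapper


# Stand-in for a real mutation tool (an assumption for this example).
applied: list[tuple[str, list[str]]] = []


def fake_add_tags(urn: str, tags: list[str]) -> dict[str, Any]:
    applied.append((urn, tags))
    return {"success": True}


dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,customers,PROD)"

# Default is dry-run: the wrapped tool is never invoked.
preview = guarded(fake_add_tags)(dataset_urn, ["PII"])
calls_after_preview = len(applied)

# Opting in explicitly performs the call.
result = guarded(fake_add_tags, dry_run=False)(dataset_urn, ["PII"])
print(preview["tool"], calls_after_preview, result, len(applied))
```

Making dry-run the default means a mutation only happens when the caller opts in explicitly.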
You can also connect your agent or tool directly to the DataHub MCP Server using your chosen framework.
Problem: Unauthorized or 401 errors when calling tools

Solutions:

- `datahub check metadata-service`

Problem: Connection refused or timeout errors

Solutions:

- `curl -I https://your-datahub-instance.com/api/gms/health`

Problem: Search returns no results or missing expected entities

Solutions:

- Entity type filter values are lowercase: `dataset`, not `Dataset`

Problem: `ModuleNotFoundError: No module named 'datahub_agent_context'`

Solutions:

- `pip install datahub-agent-context`
- `pip install datahub-agent-context[langchain]`
- `pip install datahub-agent-context[snowflake]`

Problem: 429 Too Many Requests errors

Solutions:
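A common client-side mitigation for 429 responses is to retry with exponential backoff. The sketch below shows the general technique only. `RateLimitError` and `call_with_backoff` are hypothetical names, and how a rate-limit failure actually surfaces from the tools (exception type, status code) is an assumption you would need to adapt.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for however a 429 response is reported."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
    raise AssertionError("unreachable")


# Demo with a fake call that is rate limited twice, then succeeds.
attempts = {"count": 0}


def flaky_search() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays: list[float] = []
# Record delays instead of sleeping so the demo finishes instantly.
outcome = call_with_backoff(flaky_search, sleep=delays.append)
print(outcome, attempts["count"], len(delays))
```

In real use, `fn` would be a zero-argument closure around a tool call, and the default `time.sleep` would be left in place.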
Enable debug logging to see detailed API calls:
import logging
logging.basicConfig(level=logging.DEBUG)
# Your agent code here
Check the DataHub server logs for more details on server-side errors.
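If global debug logging is too noisy, a narrower option is to raise the level only for selected loggers. This is a sketch using the standard `logging` module. The logger names `datahub` and `datahub_agent_context` are assumptions based on the package names and the common `logging.getLogger(__name__)` convention, so check which names actually appear in your log output.

```python
import logging

# Keep everything else at WARNING so third-party libraries stay quiet.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumed logger names; child loggers inherit the DEBUG level.
for name in ("datahub", "datahub_agent_context"):
    logging.getLogger(name).setLevel(logging.DEBUG)

logging.getLogger("datahub_agent_context.example").debug("debug logging enabled")
```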