metadata-models/docs/entities/notebook.md
A Notebook is a metadata entity that represents an interactive computational document combining code execution, text documentation, data visualizations, and query results. Notebooks are collaborative environments for data analysis, exploration, and documentation, commonly used in data science, analytics, and business intelligence workflows.
The Notebook entity captures both the structural components (cells containing text, queries, or charts) and the metadata about notebooks from platforms like Jupyter, Databricks, QueryBook, Hex, Mode, Deepnote, and other notebook-based tools.
⚠️ Notice: The Notebook entity is currently in BETA. While the core functionality is stable, the entity model and UI features may evolve based on community feedback. Notebook support is actively being developed and improved.
A Notebook is uniquely identified by two components, the notebook tool (notebookTool) and the notebook ID (notebookId):
The URN structure for a Notebook is:
urn:li:notebook:(<notebookTool>,<notebookId>)
urn:li:notebook:(querybook,773)
urn:li:notebook:(jupyter,analysis_2024_q1)
urn:li:notebook:(databricks,/Users/analyst/customer_segmentation)
urn:li:notebook:(hex,a8b3c5d7-1234-5678-90ab-cdef12345678)
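As a minimal sketch of the URN structure shown above, the following plain-Python helpers build and parse a notebook URN string. The names `make_notebook_urn`, `parse_notebook_urn`, and `NotebookUrn` are hypothetical and written only for illustration; they are not the DataHub SDK's API, and the parser assumes the notebook tool name contains no commas or parentheses.

```python
import re
from typing import NamedTuple


class NotebookUrn(NamedTuple):
    """Hypothetical container for the two URN components."""
    notebook_tool: str
    notebook_id: str


# Assumption: the tool name has no commas or parentheses; the ID may
# contain slashes, dashes, and other characters, as in the examples above.
_URN_PATTERN = re.compile(r"^urn:li:notebook:\(([^,()]+),(.+)\)$")


def make_notebook_urn(notebook_tool: str, notebook_id: str) -> str:
    """Format the two components into the documented URN structure."""
    return f"urn:li:notebook:({notebook_tool},{notebook_id})"


def parse_notebook_urn(urn: str) -> NotebookUrn:
    """Split a notebook URN string back into its two components."""
    match = _URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a notebook URN: {urn!r}")
    return NotebookUrn(match.group(1), match.group(2))
```

Round-tripping one of the examples, `parse_notebook_urn(make_notebook_urn("querybook", "773"))` yields the tool `querybook` and the ID `773`.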
The notebookId should be globally unique for a notebook tool, even when there are multiple deployments. Best practices include:
querybook.com/notebook/773)
The key requirement is that the same notebook should always produce the same URN across different ingestion runs.
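One way to meet both requirements (uniqueness across deployments and a stable URN across runs) is to derive the notebookId from values that do not change between runs. The sketch below is an assumption, not a documented convention: the helper names are hypothetical, and the `<deployment>/notebook/<id>` layout is chosen only for illustration.

```python
def qualified_notebook_id(deployment: str, local_id: str) -> str:
    """Prefix a deployment-local ID with a normalized deployment name.

    Normalizing case and trailing slashes keeps the result identical
    across ingestion runs even if the configured value varies slightly.
    """
    host = deployment.strip().lower().rstrip("/")
    return f"{host}/notebook/{local_id}"


def notebook_urn(notebook_tool: str, deployment: str, local_id: str) -> str:
    """Build a notebook URN whose ID is unique across deployments."""
    notebook_id = qualified_notebook_id(deployment, local_id)
    return f"urn:li:notebook:({notebook_tool},{notebook_id})"
```

Because the function is pure, two ingestion runs that see the same deployment and local ID produce the same URN, while the same local ID on a different deployment produces a different one.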
The notebookInfo aspect contains the core metadata about a notebook:
The following code snippet shows you how to create a Notebook with basic information.
<details>
<summary>Python SDK: Create a Notebook</summary>

{{ inline /metadata-ingestion/examples/library/notebook_create.py show_path_as_comment }}

</details>
The notebookContent aspect captures the actual structure and content of a notebook through a list of cells. Each cell represents a distinct block of content within the notebook.
Notebooks support three types of cells:
TEXT_CELL: Markdown or rich text content for documentation, explanations, and narrative
QUERY_CELL: SQL or other query language statements for data retrieval and transformation
CHART_CELL: Data visualizations and charts built from query results
Each cell in the notebookContent aspect includes:
The cell list represents the sequential structure of the notebook as it appears to users.
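To make the cell structure concrete, here is a small self-contained model of an ordered cell list. It is a sketch only: the three cell type names come from the list above, but the `Cell` class and its `cell_id` and `title` fields are hypothetical stand-ins and do not mirror the actual aspect schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class CellType(Enum):
    TEXT_CELL = "TEXT_CELL"
    QUERY_CELL = "QUERY_CELL"
    CHART_CELL = "CHART_CELL"


@dataclass(frozen=True)
class Cell:
    cell_type: CellType
    cell_id: str  # hypothetical field, for illustration only
    title: str    # hypothetical field, for illustration only


def count_by_type(cells: List[Cell]) -> Dict[str, int]:
    """Count how many cells of each type the notebook contains."""
    return dict(Counter(cell.cell_type.name for cell in cells))


def cell_ids_in_order(cells: List[Cell]) -> List[str]:
    """Return cell IDs in the order they appear in the notebook."""
    return [cell.cell_id for cell in cells]


# The list order is the notebook's reading order.
cells = [
    Cell(CellType.TEXT_CELL, "c1", "Overview"),
    Cell(CellType.QUERY_CELL, "c2", "Load orders"),
    Cell(CellType.CHART_CELL, "c3", "Orders by week"),
]
```

A plain list is used deliberately: the position of each cell carries the ordering, so no separate index field is needed in this sketch.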
<details>
<summary>Python SDK: Add content to a Notebook</summary>

{{ inline /metadata-ingestion/examples/library/notebook_add_content.py show_path_as_comment }}

</details>
The editableNotebookProperties aspect allows users to add or modify certain notebook properties through the DataHub UI without affecting the source system:
This separation allows DataHub users to enrich notebook metadata while preserving the original information from the source platform.
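The separation can be pictured as two independent records, one written by ingestion and one written by users. The sketch below is illustrative only: the class names and the `description` field are assumptions, and the rule that a user-edited value takes precedence when displayed is one plausible resolution, not a statement of how DataHub resolves it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceInfo:
    """Stand-in for metadata ingested from the source platform."""
    title: str
    description: Optional[str] = None


@dataclass
class EditableProperties:
    """Stand-in for metadata edited by users in the UI."""
    description: Optional[str] = None


def display_description(source: SourceInfo, editable: EditableProperties) -> Optional[str]:
    """Assumed rule: prefer the user-edited value, else the ingested one."""
    if editable.description is not None:
        return editable.description
    return source.description


def reingest(source: SourceInfo, new_description: Optional[str]) -> SourceInfo:
    """Simulate an ingestion run: only the source record is replaced."""
    return SourceInfo(title=source.title, description=new_description)
```

Since ingestion replaces only the source record, a later run does not discard what users entered, and a user edit never alters the ingested value.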
Notebooks support ownership through the ownership aspect, allowing you to track who is responsible for maintaining and governing each notebook. Ownership types include:
{{ inline /metadata-ingestion/examples/library/notebook_add_owner.py show_path_as_comment }}
Notebooks can be tagged and associated with glossary terms for organization and discovery:
Tags (globalTags aspect): Informal categorization labels like "exploratory", "production", "deprecated", "customer-analysis"
Glossary terms (glossaryTerms aspect): Formal business vocabulary linking notebooks to business concepts

{{ inline /metadata-ingestion/examples/library/notebook_add_tags.py show_path_as_comment }}
Notebooks can be assigned to one or more domains through the domains aspect, organizing them by business unit, team, or functional area. This helps with discovery and governance at scale.
The browsePaths and browsePathsV2 aspects enable hierarchical navigation of notebooks within DataHub, allowing users to browse notebooks by platform, workspace, folder, or other organizational structures.
The applications aspect allows linking notebooks to specific applications or use cases, helping track which business applications or workflows depend on particular notebooks.
The subTypes aspect enables classification of notebooks into categories like:
This helps users find notebooks relevant to their specific needs.
Through the institutionalMemory aspect, notebooks can have links to external documentation, wikis, runbooks, or other resources that provide additional context about their purpose and usage.
The testResults aspect can capture the results of data quality tests or validation checks performed within the notebook, integrating notebook-based testing into DataHub's data quality framework.
Notebooks have relationships with datasets through query cells:
When a notebook contains chart cells, those cells can reference chart entities, creating a relationship between the notebook and the visualizations it produces. This is particularly relevant for BI notebook tools like Mode or Hex where notebooks generate reusable charts.
Query cells in notebooks can be linked to query entities, enabling:
The dataPlatformInstance aspect associates a notebook with a specific instance of a notebook platform (e.g., a particular Databricks workspace or Hex account), which is essential when multiple instances of the same platform exist.
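As an illustration of why the instance matters, the sketch below formats platform instance URNs for two workspaces of the same platform. The helper is a plain-Python stand-in written for this example, the workspace names are hypothetical, and the URN layout shown is an assumption based on DataHub's usual `dataPlatformInstance` form.

```python
def make_platform_instance_urn(platform: str, instance: str) -> str:
    """Format a platform instance URN (assumed layout, for illustration)."""
    return (
        "urn:li:dataPlatformInstance:"
        f"(urn:li:dataPlatform:{platform},{instance})"
    )


# Two hypothetical Databricks workspaces that both contain a notebook
# at the same path; the instance tells them apart.
notebook_path = "/Users/analyst/customer_segmentation"
instances = {
    "emea-workspace": make_platform_instance_urn("databricks", "emea-workspace"),
    "us-workspace": make_platform_instance_urn("databricks", "us-workspace"),
}
```

Without the instance, the two notebooks would be distinguishable only if their notebookId values already differed.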
Several DataHub connectors extract notebook metadata:
These connectors typically:
Notebooks are accessible through DataHub's GraphQL API, supporting queries for:
As a BETA feature, notebooks have some limitations:
Users should expect ongoing improvements and potential schema changes as the feature matures.
Notebook cells store structural information and metadata but may not capture:
The focus is on capturing the notebook's code, structure, and metadata rather than execution artifacts.
Different notebook platforms have unique features that may not map perfectly to DataHub's model:
Ingestion connectors capture common features while platform-specific capabilities may be stored in customProperties.
The notebookContent cells array preserves the order of cells as they appear in the source notebook. However, notebooks with complex branching logic or non-linear execution flows may not be fully represented by a simple ordered list.
The current notebook model doesn't natively track notebook versions or revision history. The changeAuditStamps field captures last-modified information, but full version control requires integration with the source platform's versioning system (e.g., Git for Jupyter, platform version history for Databricks).
Very large notebooks with hundreds of cells may raise performance concerns:
Notebooks in DataHub enable several important use cases:
By bringing notebooks into DataHub's metadata graph, organizations can treat analysis code with the same rigor as production data assets.