metadata-models/docs/entities/dataContract.md
A Data Contract is an agreement between a data asset's producer and consumer that defines expectations and guarantees about the quality, structure, and operational characteristics of data. Data Contracts serve as formal commitments that help establish trust and reliability in data pipelines by making explicit what data consumers can expect from data producers.
Data Contracts in DataHub are built on top of assertions and represent a curated set of verifiable guarantees about a physical data asset. They are producer-oriented, meaning each physical data asset has one contract owned by its producer, which declares the standards and SLAs that consumers can rely on.
Data Contracts are identified by a single unique string identifier:
The URN structure for a Data Contract is: urn:li:dataContract:<contract-id>
Example URNs:
- urn:li:dataContract:my-critical-dataset-contract
- urn:li:dataContract:a1b2c3d4-e5f6-7890-abcd-ef1234567890

When creating Data Contracts programmatically, the contract ID can be explicitly specified, or it can be auto-generated based on the entity being contracted. The auto-generation creates a stable, deterministic ID using a GUID derived from the entity URN, ensuring that contracts are reproducible across multiple runs.
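To illustrate why a deterministic ID makes contract creation reproducible, here is a minimal sketch that derives an ID by hashing the entity URN. The helper names `contract_id_for` and `contract_urn_for`, the hashed payload, and the choice of MD5 are assumptions made for illustration; the GUID produced by DataHub's own tooling may be computed differently.

```python
import hashlib
import json


def contract_id_for(entity_urn: str) -> str:
    """Derive a stable ID from the URN of the contracted entity (hypothetical helper)."""
    # Serialize with sorted keys so the same input always produces the same bytes.
    payload = json.dumps({"entity": entity_urn}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def contract_urn_for(entity_urn: str) -> str:
    """Build a Data Contract URN from the derived ID."""
    return f"urn:li:dataContract:{contract_id_for(entity_urn)}"


# Example dataset URN, used only for illustration.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.orders,PROD)"

first_run = contract_urn_for(dataset_urn)
second_run = contract_urn_for(dataset_urn)
print(first_run == second_run)  # the same entity always maps to the same contract URN
```

Because the ID depends only on the entity URN, re-running the same ingestion updates the existing contract instead of creating a duplicate.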
Data Contracts provide three main types of guarantees through assertions: schema, freshness, and data quality.
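The dataContractProperties aspect defines the core characteristics of a contract, including the entity the contract applies to and the schema, freshness, and data quality assertions it references.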
Each contract type (schema, freshness, data quality) contains references to assertion entities, which define the actual validation logic and evaluation criteria.
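As a rough picture of how a contract points at its assertions, the sketch below models the dataContractProperties aspect as a plain dictionary. The field names (`entity`, `schema`, `freshness`, `dataQuality`, `assertion`) and the assertion URNs are assumptions chosen to mirror the description above, not a verified copy of the aspect schema.

```python
import json

# Example dataset URN, used only for illustration.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.orders,PROD)"

# Assumed shape: one entity reference plus a list of assertion references per contract type.
contract_properties = {
    "entity": dataset_urn,
    "schema": [{"assertion": "urn:li:assertion:orders-schema"}],
    "freshness": [{"assertion": "urn:li:assertion:orders-freshness"}],
    "dataQuality": [
        {"assertion": "urn:li:assertion:orders-id-unique"},
        {"assertion": "urn:li:assertion:orders-amount-not-null"},
    ],
}


def referenced_assertions(properties: dict) -> list[str]:
    """Collect every assertion URN the contract refers to, grouped in a fixed order."""
    urns = []
    for contract_type in ("schema", "freshness", "dataQuality"):
        for entry in properties.get(contract_type, []):
            urns.append(entry["assertion"])
    return urns


print(json.dumps(contract_properties, indent=2))
print(referenced_assertions(contract_properties))
```

The contract itself carries no validation logic; it only names the assertions that do.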
The following code snippet shows how to create a basic Data Contract with schema, freshness, and data quality assertions.
<details>
<summary>Python SDK: Create a Data Contract</summary>

{{ inline /metadata-ingestion/examples/library/datacontract_create_basic.py show_path_as_comment }}

</details>
Schema contracts ensure that the structure of your data asset matches expectations. They are particularly important for:
A schema contract references a schema assertion entity, which contains the actual schema specification and validation logic. The assertion can be created using DataHub's built-in assertion framework or by integrating with external tools like dbt or Great Expectations.
<details>
<summary>Python SDK: Add a schema contract to an existing Data Contract</summary>

{{ inline /metadata-ingestion/examples/library/datacontract_add_schema_contract.py show_path_as_comment }}

</details>
Freshness contracts define SLAs for how recent data should be. They help answer questions like:
Freshness contracts are critical for time-sensitive applications where stale data can lead to incorrect decisions or missed opportunities. They typically specify thresholds like "data should be no more than 2 hours old" or "data should be updated at least once per day."
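The "no more than 2 hours old" threshold can be expressed as a simple comparison. The following is a minimal sketch of the kind of check a freshness assertion evaluates; the function `is_fresh` is hypothetical and is not part of the DataHub SDK.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the data was updated within the allowed age window."""
    return now - last_updated <= max_age


# Fixed timestamps keep the example reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=2)

updated_recently = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)  # 1.5 hours old
updated_too_long_ago = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # 3 hours old

print(is_fresh(updated_recently, now, max_age))      # True
print(is_fresh(updated_too_long_ago, now, max_age))  # False
```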
<details>
<summary>Python SDK: Add a freshness contract to an existing Data Contract</summary>

{{ inline /metadata-ingestion/examples/library/datacontract_add_freshness_contract.py show_path_as_comment }}

</details>
Data quality contracts define expectations about the quality characteristics of your data. These can include:
Unlike schema and freshness contracts (which are typically singular per dataset), a Data Contract can contain multiple data quality assertions, each targeting different aspects of data quality.
<details>
<summary>Python SDK: Add data quality contracts to an existing Data Contract</summary>

{{ inline /metadata-ingestion/examples/library/datacontract_add_quality_contract.py show_path_as_comment }}

</details>
The dataContractStatus aspect tracks the current state of the contract. A contract can be in one of two states:
The contract status can also include custom properties for additional metadata about the contract's state.
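A status value with custom properties might be modeled as in the sketch below. The state names `ACTIVE` and `PENDING`, the field names `state` and `customProperties`, and the helper `build_status` are assumptions for illustration; confirm the actual values against the dataContractStatus aspect definition.

```python
from enum import Enum
from typing import Optional


class ContractState(Enum):
    # Assumed state names; verify against the aspect definition before relying on them.
    ACTIVE = "ACTIVE"
    PENDING = "PENDING"


def build_status(state: ContractState, custom_properties: Optional[dict] = None) -> dict:
    """Assemble a status payload with optional free-form string properties."""
    return {
        "state": state.value,
        # Copy the mapping so later changes by the caller do not leak into the payload.
        "customProperties": dict(custom_properties or {}),
    }


status = build_status(ContractState.ACTIVE, {"reviewedBy": "data-platform-team"})
print(status)
```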
<details>
<summary>Python SDK: Update the status of a Data Contract</summary>

{{ inline /metadata-ingestion/examples/library/datacontract_update_status.py show_path_as_comment }}

</details>
Like other DataHub entities, Data Contracts can have tags and glossary terms attached to them. These help with:
Tags and terms on Data Contracts follow the same patterns as other entities, using the globalTags and glossaryTerms aspects.
Data Contracts support structured properties, which allow you to attach custom, strongly-typed metadata to contracts. This is useful for:
Structured properties are defined at the platform level and can be applied to any Data Contract entity.
Data Contracts are built on top of the assertion entity. Each contract contains references to assertion URNs that define the actual validation logic. This separation allows:
The relationship between contracts and assertions is established through the ContractFor relationship (contract to entity) and IncludesSchemaAssertion, IncludesFreshnessAssertion, and IncludesDataQualityAssertion relationships (contract to assertions).
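The relationships named above can be pictured as edges derived from the contract's properties. This sketch assumes the properties are available as a dictionary with `entity`, `schema`, `freshness`, and `dataQuality` fields (assumed names); the helper `contract_edges` and the URNs are hypothetical.

```python
def contract_edges(contract_urn: str, properties: dict) -> list[tuple[str, str, str]]:
    """Return (source, relationship, destination) triples for one contract."""
    # Contract to entity.
    edges = [(contract_urn, "ContractFor", properties["entity"])]

    # Contract to assertions, one relationship type per contract type.
    relationship_by_field = {
        "schema": "IncludesSchemaAssertion",
        "freshness": "IncludesFreshnessAssertion",
        "dataQuality": "IncludesDataQualityAssertion",
    }
    for field, relationship in relationship_by_field.items():
        for entry in properties.get(field, []):
            edges.append((contract_urn, relationship, entry["assertion"]))
    return edges


example_properties = {
    "entity": "urn:li:dataset:example",
    "schema": [{"assertion": "urn:li:assertion:schema-check"}],
    "dataQuality": [
        {"assertion": "urn:li:assertion:quality-check-1"},
        {"assertion": "urn:li:assertion:quality-check-2"},
    ],
}

edges = contract_edges("urn:li:dataContract:example-contract", example_properties)
for edge in edges:
    print(edge)
```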
Currently, Data Contracts are primarily associated with dataset entities. The dataContractProperties aspect includes an entity field that references the dataset URN. This relationship is captured using the ContractFor relationship type.
A dataset can have one active Data Contract at a time, though the contract can be updated or replaced. Consumers can query a dataset to retrieve its associated contract and understand the guarantees they can expect.
Data Contracts integrate with external data quality and testing tools:
Data Contracts are accessible through DataHub's GraphQL API, which provides:
- upsertDataContract mutation

The GraphQL API is particularly useful for integrating Data Contracts into CI/CD pipelines, custom UIs, or workflow orchestration systems.
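For a CI/CD integration, a request body for the upsertDataContract mutation could be assembled as sketched here. Only the mutation name comes from this page; the input type name `UpsertDataContractInput`, the field names `entityUrn`, `schema`, `freshness`, `dataQuality`, and `assertionUrn`, and the helper `build_upsert_payload` are assumptions that should be checked against your DataHub GraphQL schema.

```python
import json

# Assumed mutation shape; verify the input type and fields against the GraphQL schema.
UPSERT_MUTATION = """
mutation upsertDataContract($input: UpsertDataContractInput!) {
  upsertDataContract(input: $input) {
    urn
  }
}
"""


def build_upsert_payload(
    entity_urn: str,
    schema_urns: list[str],
    freshness_urns: list[str],
    quality_urns: list[str],
) -> dict:
    """Build the JSON body for a GraphQL POST without sending it."""
    contract_input = {
        "entityUrn": entity_urn,
        "schema": [{"assertionUrn": urn} for urn in schema_urns],
        "freshness": [{"assertionUrn": urn} for urn in freshness_urns],
        "dataQuality": [{"assertionUrn": urn} for urn in quality_urns],
    }
    return {"query": UPSERT_MUTATION, "variables": {"input": contract_input}}


payload = build_upsert_payload(
    "urn:li:dataset:example",
    schema_urns=["urn:li:assertion:schema-check"],
    freshness_urns=["urn:li:assertion:freshness-check"],
    quality_urns=["urn:li:assertion:quality-check"],
)
print(json.dumps(payload["variables"], indent=2))
```

Building the payload separately from sending it keeps the contract definition easy to review and test before it reaches a live instance.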
Data Contracts can be created, read, updated, and deleted using DataHub's REST API. The standard entity CRUD operations apply:
- POST /entities - Create or update a Data Contract entity
- GET /entities/urn:li:dataContract:<id> - Retrieve a Data Contract by URN
- DELETE /entities/urn:li:dataContract:<id> - Remove a Data Contract

Aspects can be individually updated using the aspect-specific endpoints, allowing fine-grained control over contract properties and status.
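A retrieval request for the GET endpoint listed above could be prepared as follows. This is a sketch under assumptions: the base URL `http://localhost:8080`, the bearer-token header, and the helper `build_get_request` are illustrative, and the request is only constructed, not sent.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def build_get_request(base_url: str, contract_id: str, token: Optional[str] = None) -> Request:
    """Prepare a GET request for a Data Contract without sending it."""
    urn = f"urn:li:dataContract:{contract_id}"
    # URNs contain colons, so percent-encode the whole URN before placing it in the path.
    url = f"{base_url.rstrip('/')}/entities/{quote(urn, safe='')}"
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(url, headers=headers, method="GET")


# Assumed local address; replace with the address of your own deployment.
request = build_get_request("http://localhost:8080", "my-critical-dataset-contract")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the call against a running instance.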
DataHub Data Contracts are producer-oriented, meaning each physical data asset has one contract owned by the producer. This design choice keeps contracts manageable and ensures clear ownership.
However, this may not fit all use cases. Some organizations prefer consumer-oriented contracts where each consumer defines their own expectations for a shared data asset. While DataHub doesn't directly support consumer-oriented contracts, you can achieve similar functionality by:
Data Contracts in DataHub define expectations but do not automatically enforce them. Enforcement depends on:
DataHub provides the framework for defining and tracking contracts, but actual enforcement requires additional integration work specific to your data infrastructure.
When you delete a Data Contract, the associated assertions are not automatically deleted. This is by design: assertions can exist independently and may be used by other contracts or monitored separately.
If you want to remove both a contract and its assertions, you must delete them separately. This ensures that assertion definitions and their historical results are preserved even when contracts change.
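Since the contract and its assertions are removed separately, it can help to compute the full list of URNs up front. The sketch below assumes the contract's properties are available as a dictionary with `schema`, `freshness`, and `dataQuality` fields (assumed names); `deletion_plan` is a hypothetical helper that only builds the list and deletes nothing.

```python
def deletion_plan(contract_urn: str, properties: dict, include_assertions: bool = False) -> list[str]:
    """List the URNs to delete: the contract first, then optionally its assertions."""
    plan = [contract_urn]
    if include_assertions:
        for field in ("schema", "freshness", "dataQuality"):
            for entry in properties.get(field, []):
                urn = entry["assertion"]
                # Skip duplicates in case one assertion is referenced more than once.
                if urn not in plan:
                    plan.append(urn)
    return plan


example_properties = {
    "entity": "urn:li:dataset:example",
    "freshness": [{"assertion": "urn:li:assertion:freshness-check"}],
    "dataQuality": [{"assertion": "urn:li:assertion:quality-check"}],
}

contract_only = deletion_plan("urn:li:dataContract:example-contract", example_properties)
contract_and_assertions = deletion_plan(
    "urn:li:dataContract:example-contract", example_properties, include_assertions=True
)
print(contract_only)
print(contract_and_assertions)
```

Before deleting any assertion in the plan, confirm that it is not monitored on its own or relied on elsewhere, because its historical results are removed with it.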
DataHub supports defining Data Contracts in YAML files using the DataContract Python model. This provides a simpler, declarative way to define contracts that can be version-controlled and reviewed like code.
The YAML format is particularly useful for:
However, YAML-based contracts are converted to the underlying MCP (Metadata Change Proposal) format when ingested, so all operations ultimately use the same underlying entity and aspect structure.