metadata-models/docs/entities/schemaField.md
The schemaField entity represents an individual column or field within a dataset's schema. While schema information is typically ingested as part of a dataset's schemaMetadata aspect, schemaField entities exist as first-class entities to enable direct attachment of metadata like tags, glossary terms, documentation, and structured properties at the field level.
SchemaField entities are automatically created by DataHub when datasets with schemas are ingested. They serve as the link between dataset-level metadata and column-level metadata, enabling fine-grained data governance and lineage tracking at the field level.
SchemaField entities are uniquely identified by two components:
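- The URN of the parent dataset that contains the field
- The field path within that dataset (e.g., user_id, or address.zipcode for nested fields)
The URN structure for a schemaField follows this pattern: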
urn:li:schemaField:(<parent_dataset_urn>,<encoded_field_path>)
Simple field:
urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD),user_id)
Nested field:
urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD),address.zipcode)
Field with special characters (URL encoded):
urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD),first%20name)
Note that the field path component may be URL-encoded if it contains special characters. The v1 field path uses . notation for nested structures, while v2 field paths include type information (e.g., [version=2.0].[type=struct].address.[type=string].zipcode).
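The URN layout above can be illustrated with a small standard-library sketch. The helper names `build_schema_field_urn` and `split_schema_field_urn` are hypothetical and are not part of the DataHub SDK. The sketch also assumes that plain percent-encoding is applied to the whole field path, which may differ from the exact set of characters DataHub's own encoder escapes.

```python
from typing import Tuple
from urllib.parse import quote, unquote

SCHEMA_FIELD_PREFIX = "urn:li:schemaField:("


def build_schema_field_urn(dataset_urn: str, field_path: str) -> str:
    """Hypothetical helper: wrap a dataset URN and an encoded field path."""
    # Assumption: percent-encode everything outside letters, digits and "_.-~".
    encoded_path = quote(field_path, safe="")
    return f"{SCHEMA_FIELD_PREFIX}{dataset_urn},{encoded_path})"


def split_schema_field_urn(urn: str) -> Tuple[str, str]:
    """Hypothetical helper: recover (dataset_urn, decoded_field_path)."""
    if not (urn.startswith(SCHEMA_FIELD_PREFIX) and urn.endswith(")")):
        raise ValueError(f"not a schemaField URN: {urn}")
    inner = urn[len(SCHEMA_FIELD_PREFIX):-1]
    # The dataset URN itself contains commas, so split on the last one only.
    # This relies on the field path having its own commas percent-encoded.
    dataset_urn, encoded_path = inner.rsplit(",", 1)
    return dataset_urn, unquote(encoded_path)


dataset = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"
urn = build_schema_field_urn(dataset, "first name")
print(urn)
print(split_schema_field_urn(urn))
```

Splitting on the last comma works only because the encoded field path cannot contain a raw comma. In real ingestion code, prefer the SDK's own URN utilities over string handling like this.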
The schemafieldInfo aspect contains basic identifying information about the schema field:
This aspect is primarily used internally by DataHub to support field path variations and search functionality.
The documentation aspect stores field-level documentation from multiple sources. Unlike the dataset-level description pattern, which uses separate aspects (datasetProperties and editableDatasetProperties), field-level documentation uses a single unified aspect that can hold multiple documentation entries from different sources.
Each documentation entry includes:
{{ inline /metadata-ingestion/examples/library/schemafield_add_documentation.py show_path_as_comment }}
Tags can be added directly to schema fields using the globalTags aspect. This is separate from tags added at the dataset level, allowing for fine-grained classification of individual columns.
Tags on fields are commonly used to:
{{ inline /metadata-ingestion/examples/library/schemafield_add_tag.py show_path_as_comment }}
Glossary terms can be attached to schema fields via the glossaryTerms aspect, enabling semantic annotation at the column level. This helps users understand the business meaning of individual fields.
{{ inline /metadata-ingestion/examples/library/schemafield_add_term.py show_path_as_comment }}
The businessAttributes aspect allows association of business attribute definitions with schema fields. Business attributes provide a way to attach enterprise-specific metadata dimensions (like data classification, retention policies, or business rules) directly to fields.
This is particularly useful for organizations that need to track custom governance metadata at the field level that isn't covered by standard aspects.
Schema fields support structured properties via the structuredProperties aspect, allowing organizations to extend the metadata model with custom typed properties. This is useful for tracking field-level metadata like:
{{ inline /metadata-ingestion/examples/library/schemafield_add_structured_properties.py show_path_as_comment }}
The schemaFieldAliases aspect stores alternative URNs for a schema field. This is useful when:
Fields can be marked as deprecated using the deprecation aspect, indicating they should not be used in new applications or analyses. The deprecation aspect includes:
The logicalParent aspect can associate a schema field with a logical parent entity (like a container or domain), enabling organizational hierarchies that differ from the physical dataset structure.
Forms can be attached to schema fields via the forms aspect, enabling structured data collection and validation at the field level. This is useful for capturing field-level certifications, approvals, or additional metadata.
The status aspect indicates whether a schema field is active or has been soft-deleted.
The testResults aspect can store results of data quality tests run on specific fields, linking test outcomes directly to the columns they validate.
The subTypes aspect allows categorization of schema fields beyond their data type, enabling custom classification schemes.
The standard GET API can be used to retrieve schema field entities and their aspects:
<details>
<summary>Fetch a schemaField entity</summary>
{{ inline /metadata-ingestion/examples/library/schemafield_query_entity.py show_path_as_comment }}
Example API call:
curl 'http://localhost:8080/entities/urn%3Ali%3AschemaField%3A(urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Apostgres%2Cpublic.users%2CPROD)%2Cuser_id)'
This returns all aspects associated with the schema field, including tags, terms, documentation, and structured properties.
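The URL in the curl call can also be assembled programmatically. The following is a minimal sketch using only the Python standard library. The function names `entity_url` and `fetch_entity` are hypothetical, the base URL is the local address from the example above, and authentication is not handled.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

GMS_URL = "http://localhost:8080"  # assumption: local instance, as in the curl example


def entity_url(urn: str, base_url: str = GMS_URL) -> str:
    """Hypothetical helper: build the GET URL for an entity URN."""
    # Encode ':' and ',' but keep parentheses literal, matching the curl example.
    return f"{base_url}/entities/{quote(urn, safe='()')}"


def fetch_entity(urn: str, base_url: str = GMS_URL) -> dict:
    """Hypothetical helper: fetch the entity and decode the JSON response."""
    with urlopen(entity_url(urn, base_url)) as response:
        return json.load(response)


field_urn = (
    "urn:li:schemaField:"
    "(urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD),user_id)"
)
print(entity_url(field_urn))
```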
</details>
Schema fields are central to fine-grained (column-level) lineage. When defining lineage between datasets, you can specify which fields flow from upstream to downstream:
<details>
<summary>Example lineage query showing field-level relationships</summary>
# Find upstream fields of a specific schema field
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AschemaField%3A(urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Apostgres%2Cpublic.orders%2CPROD)%2Cuser_id)&types=DownstreamOf'
This shows which upstream fields contribute to this field's values, enabling impact analysis at the column level.
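The query string of the relationships call can be generated rather than hand-encoded. This sketch assumes the same local endpoint and the three query parameters shown in the curl example. `relationships_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

GMS_URL = "http://localhost:8080"  # assumption: local instance, as in the curl example


def relationships_url(
    urn: str,
    direction: str = "OUTGOING",
    types: str = "DownstreamOf",
    base_url: str = GMS_URL,
) -> str:
    """Hypothetical helper: build the relationships query for a schema field."""
    # Parameter order follows the curl example; parentheses are left unencoded.
    query = urlencode({"direction": direction, "urn": urn, "types": types}, safe="()")
    return f"{base_url}/relationships?{query}"


field_urn = (
    "urn:li:schemaField:"
    "(urn:li:dataset:(urn:li:dataPlatform:postgres,public.orders,PROD),user_id)"
)
print(relationships_url(field_urn))
```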
</details>
Schema fields have a parent-child relationship with datasets. The dataset's schemaMetadata aspect defines the structure and metadata of fields, while individual schemaField entities allow direct metadata attachment at the field level.
Key integration points:
- schemaMetadata and editableSchemaMetadata aspects of datasets
- schemaMetadata) or directly on schemaField entities
- editableSchemaMetadata on the dataset, while ingestion connectors set schemaMetadata
Schema fields are essential for column-level lineage:
- The dataJobInputOutput aspect can specify inputDatasetFields and outputDatasetFields
- The upstreamLineage aspect on datasets can include fineGrainedLineages that map specific fields
The GraphQL API exposes schema field entities as first-class entities with the SchemaFieldEntity type. Key resolvers include:
Note: Field fetching via GraphQL is controlled by the schemaFieldEntityFetchEnabled feature flag. When disabled, schema field metadata is accessed only through the parent dataset's schema aspects.
Schema fields are indexed for search, enabling users to:
Schema field metadata can be accessed and modified in two ways:
- The schemaMetadata or editableSchemaMetadata aspects on the dataset
- The globalTags, glossaryTerms, and documentation aspects on the schemaField URN
Best practices:
- schemaMetadata)
- editableSchemaMetadata)
The ability to fetch schemaField entities via GraphQL depends on the schemaFieldEntityFetchEnabled feature flag. When disabled:
This flag exists for performance reasons, as materializing individual field entities can be expensive for datasets with hundreds of columns.
Field paths in schemaField URNs must be URL-encoded if they contain special characters (spaces, special symbols, etc.). Always use the make_schema_field_urn utility function from datahub.emitter.mce_builder to construct URNs correctly:
from datahub.emitter.mce_builder import make_schema_field_urn

# Automatically handles encoding
field_urn = make_schema_field_urn(
    parent_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
    field_path="first name",  # Will be encoded as "first%20name"
)
DataHub supports two field path formats:
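- v1: dot notation for nested structures (e.g., address.zipcode)
- v2: dot notation with embedded type information (e.g., [version=2.0].[type=struct].address.[type=string].zipcode)
V2 field paths are required for: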
Most simple schemas can use v1 field paths. Use v2 when dealing with complex types or when ingestion connectors generate them.
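To make the relationship between the two formats concrete, the sketch below reduces a v2 field path to its v1 form by dropping the bracketed annotation segments. `simplify_field_path` is a hypothetical name used for illustration. The sketch assumes every v2 path starts with `[version=2.0]` and that real field names neither start with `[` nor end with `]`.

```python
V2_PREFIX = "[version=2.0]"


def simplify_field_path(field_path: str) -> str:
    """Hypothetical helper: strip v2 annotations, leaving the v1 dot path."""
    if not field_path.startswith(V2_PREFIX):
        # Assumption: anything without the version marker is already a v1 path.
        return field_path
    # Splitting on "." also cuts "[version=2.0]" into "[version=2" and "0]";
    # both pieces are discarded by the bracket checks below.
    tokens = [
        token
        for token in field_path.split(".")
        if not (token.startswith("[") or token.endswith("]"))
    ]
    return ".".join(tokens)


print(simplify_field_path("[version=2.0].[type=struct].address.[type=string].zipcode"))
print(simplify_field_path("address.zipcode"))
```

The conversion is lossy: the type annotations cannot be recovered from the v1 path.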