Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
The business glossary source file should be a .yml file with the following top-level keys:
Glossary: the top-level keys of the business glossary file
Example Glossary:
version: "1" # the version of the business glossary file config that this file conforms to. Currently the only released version is `1`.
source: DataHub # the source format of the terms. Currently only supports `DataHub`
owners: # owners contains two nested fields
  users: # (optional) a list of user IDs
    - njones
  groups: # (optional) a list of group IDs
    - logistics
url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
  ...
GlossaryNode: a container of GlossaryNode and GlossaryTerm objects
Example GlossaryNode:
- name: "Shipping" # name of the node
  id: "Shipping-Logistics" # (optional) custom identifier for the node
  description: Provides terms related to the shipping domain # description of the node
  owners: # (optional) owners contains two nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  nodes: # list of child **GlossaryNode** objects
    ...
  knowledge_links: # (optional) list of **KnowledgeCard** objects
    - label: Wiki link for shipping
      url: "https://en.wikipedia.org/wiki/Freight_transport"
GlossaryTerm: a term in your business glossary
Example GlossaryTerm:
- name: "Full Address" # name of the term
  id: "Full-Address-Details" # (optional) custom identifier for the term
  description: A collection of information to give the location of a building or plot of land. # description of the term
  owners: # (optional) owners contains two nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization.
  source_ref: FIBO # (optional) if external, what is the name of the source the glossary term is coming from?
  source_url: "https://www.google.com" # (optional) if external, what is the url of the source definition?
  inherits: # (optional) list of **GlossaryTerm** that this term inherits from
    - Privacy.PII
  contains: # (optional) a list of **GlossaryTerm** that this term contains
    - Shipping.ZipCode
    - Shipping.CountryCode
    - Shipping.StreetAddress
  custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
    is_used_for_compliance_tracking: "true"
  knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary term's page
    - url: "https://en.wikipedia.org/wiki/Address"
      label: Wiki link
  domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
The business glossary provides two primary ways to manage term and node identifiers:
Custom IDs: You can explicitly specify an ID for any term or node using the id field. This is recommended for terms that need stable, predictable identifiers:
terms:
  - name: "Response Time"
    id: "support-response-time" # Explicit ID
    description: "Target time to respond to customer inquiries"
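Automatic ID Generation: When no ID is specified, the system generates one based on the enable_auto_id setting:
With enable_auto_id: false (default): the ID is built from the node and term names along the path, cleaned and joined with periods.
With enable_auto_id: true: the ID is generated as a GUID instead of a readable path.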
Here's how path-based ID generation works:
nodes:
  - name: "Customer Support" # Node ID: Customer-Support
    terms:
      - name: "Response Time" # Term ID: Customer-Support.Response-Time
        description: "Response SLA"
      - name: "First Reply" # Term ID: Customer-Support.First-Reply
        description: "Initial response"
  - name: "Product Feedback" # Node ID: Product-Feedback
    terms:
      - name: "Response Time" # Term ID: Product-Feedback.Response-Time
        description: "Feedback response"
Important Notes:
- References to terms (in inherits, contains, etc.) must use the term's correct ID
- Use enable_auto_id: true if you plan to reorganize your glossary
Example of how different names are handled:
nodes:
  - name: "Data Services" # Node ID: Data-Services
    terms:
      # Basic term name
      - name: "Response Time" # Term ID: Data-Services.Response-Time
        description: "SLA metrics"
      # Term name with special characters
      - name: "API @ Response" # Term ID: Data-Services.API-Response
        description: "API metrics"
      # Term with non-ASCII (triggers GUID)
      - name: "パフォーマンス" # Term ID will be a 32-character GUID
        description: "Performance"
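The naming rules illustrated above can be approximated in a few lines of Python. This is a minimal sketch that reproduces the IDs shown in the examples, not DataHub's actual implementation: the function names `sketch_clean_component` and `sketch_make_id` are hypothetical, the exact cleaning rules are inferred from the examples, and the MD5 hex digest is only a stand-in for however the real GUID is computed.

```python
import hashlib
import re
from typing import List, Optional


def sketch_clean_component(name: str) -> str:
    """Clean one node or term name (rules inferred from the examples)."""
    text = name.replace(" ", "-")              # spaces become hyphens
    text = re.sub(r"[^A-Za-z0-9-]", "", text)  # drop other special characters
    text = re.sub(r"-+", "-", text)            # collapse repeated hyphens
    return text.strip("-")                     # no leading or trailing hyphen


def sketch_make_id(path: List[str], custom_id: Optional[str] = None,
                   enable_auto_id: bool = False) -> str:
    """Return an ID for the node or term located at `path` (root first)."""
    if custom_id is not None:
        return custom_id  # an explicit id always wins
    joined = ".".join(path)
    if enable_auto_id or not joined.isascii():
        # Assumption: a 32-character hex digest stands in for the GUID.
        return hashlib.md5(joined.encode("utf-8")).hexdigest()
    # Clean each path component on its own, then join with periods.
    return ".".join(sketch_clean_component(part) for part in path)


print(sketch_make_id(["Data Services", "API @ Response"]))
# -> Data-Services.API-Response
```

Cleaning each component separately keeps the period reserved as the path separator, which is why a term named "Response Time" can appear under two nodes without its IDs colliding.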
To see how these all work together, check out this comprehensive example business glossary file below:
version: "1"
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: "Data Classification"
    id: "Data-Classification" # Custom ID for stable references
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: "Sensitive Data" # Will generate: Data-Classification.Sensitive-Data
        description: Sensitive Data
        custom_properties:
          is_confidential: "false"
      - name: "Confidential Information" # Will generate: Data-Classification.Confidential-Information
        description: Confidential Data
        custom_properties:
          is_confidential: "true"
      - name: "Highly Confidential" # Will generate: Data-Classification.Highly-Confidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: "true"
        domain: Marketing
  - name: "Personal Information"
    description: All terms related to personal information
    owners:
      users:
        - mjames
    terms:
      - name: "Email" # Will generate: Personal-Information.Email
        description: An individual's email address
        inherits:
          - Data-Classification.Confidential-Information # References a term in another node
        owners:
          groups:
            - Trust and Safety
      - name: "Address" # Will generate: Personal-Information.Address
        description: A physical address
      - name: "Gender" # Will generate: Personal-Information.Gender
        description: The gender identity of the individual
        inherits:
          - Data-Classification.Sensitive-Data # References a term in another node
  - name: "Clients And Accounts"
    description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
    owners:
      groups:
        - finance
      type: DATAOWNER
    terms:
      - name: "Account" # Will generate: Clients-And-Accounts.Account
        description: Container for records associated with a business arrangement for regular transactions and services
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
        inherits:
          - Data-Classification.Highly-Confidential # References a term in another node
        contains:
          - Clients-And-Accounts.Balance # References term in same node
      - name: "Balance" # Will generate: Clients-And-Accounts.Balance
        description: Amount of money available or owed
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
  - name: "KPIs"
    description: Common Business KPIs
    terms:
      - name: "CSAT %" # Will generate: KPIs.CSAT
        description: Customer Satisfaction Score
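Because every inherits and contains entry has to match a term ID exactly, it can help to check references before ingesting a file. The following is a rough sketch of such a check, not part of DataHub: the helper names are hypothetical, the glossary is passed in as already-parsed Python dictionaries, IDs are derived only by replacing spaces with hyphens, and it assumes every referenced term is defined in the same file.

```python
from typing import Dict, List, Tuple


def sketch_walk_terms(nodes: List[dict], path: Tuple[str, ...] = ()) -> Dict[str, dict]:
    """Map each term ID to its term dict, using `id` if set, else the name path."""
    found: Dict[str, dict] = {}
    for node in nodes:
        node_path = path + (node["name"].replace(" ", "-"),)
        for term in node.get("terms", []):
            default_id = ".".join(node_path + (term["name"].replace(" ", "-"),))
            found[term.get("id") or default_id] = term
        # Recurse into child nodes, carrying the path along.
        found.update(sketch_walk_terms(node.get("nodes", []), node_path))
    return found


def sketch_dangling_references(nodes: List[dict]) -> List[Tuple[str, str]]:
    """Return (term_id, missing_reference) pairs for inherits/contains entries."""
    terms = sketch_walk_terms(nodes)
    missing = []
    for term_id, term in terms.items():
        for ref in term.get("inherits", []) + term.get("contains", []):
            if ref not in terms:
                missing.append((term_id, ref))
    return missing


glossary_nodes = [
    {"name": "Data Classification", "terms": [{"name": "Sensitive Data"}]},
    {"name": "Personal Information", "terms": [
        {"name": "Gender", "inherits": ["Data-Classification.Sensitive"]},
        {"name": "Email", "inherits": ["Data-Classification.Sensitive-Data"]},
    ]},
]
print(sketch_dangling_references(glossary_nodes))
# -> [('Personal-Information.Gender', 'Data-Classification.Sensitive')]
```

Here the Gender term is flagged because its reference stops at "Sensitive" while the generated ID is "Data-Classification.Sensitive-Data".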
Custom IDs can be specified in two ways, and both are fully supported.
Just the ID portion:
terms:
  - name: "Email"
    id: "company-email" # Will become urn:li:glossaryTerm:company-email
    description: "Company email address"
Full URN format:
terms:
  - name: "Email"
    id: "urn:li:glossaryTerm:company-email"
    description: "Company email address"
If you specify just the ID portion, the system adds the URN prefix automatically.
The same applies for nodes:
nodes:
  - name: "Communications"
    id: "internal-comms" # Will become urn:li:glossaryNode:internal-comms
    description: "Internal communication methods"
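The prefix handling described above amounts to a small normalization step. As an illustrative sketch only (the function name `sketch_to_urn` is hypothetical and this is not the code DataHub runs), a bare ID gets the prefix added while a full URN passes through unchanged:

```python
def sketch_to_urn(entity_type: str, id_or_urn: str) -> str:
    """Return a full URN for a glossary term or node ID.

    `entity_type` is "glossaryTerm" or "glossaryNode", matching the URN
    prefixes shown in the examples above.
    """
    prefix = f"urn:li:{entity_type}:"
    if id_or_urn.startswith(prefix):
        return id_or_urn       # already a full URN, leave it alone
    return prefix + id_or_urn  # bare ID, add the prefix


print(sketch_to_urn("glossaryTerm", "company-email"))
# -> urn:li:glossaryTerm:company-email
print(sketch_to_urn("glossaryNode", "urn:li:glossaryNode:internal-comms"))
# -> urn:li:glossaryNode:internal-comms
```

Since both spellings resolve to the same URN, choosing one form and using it consistently within a file mostly matters for readability.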
Note: Once you select a custom ID, it cannot be easily changed.
Compatible with version 1 of the business glossary format. The source will evolve as newer versions of this format are published.
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.