metadata-ingestion/src/datahub/ingestion/source/rdf/README.md
A lightweight RDF ontology ingestion system for DataHub focused on business glossaries. This source enables ingestion of SKOS-based glossaries with term definitions, hierarchical organization, and relationships.
The RDF ingestion source provides:
- skos:broader and skos:narrower term relationships

pip install acryl-datahub[rdf]
Create a recipe file (rdf_glossary.yml):
source:
  type: rdf
  config:
    source: path/to/glossary.ttl
    environment: PROD

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: "${DATAHUB_TOKEN}"
Run ingestion:
# Ingest glossary
datahub ingest -c rdf_glossary.yml
# Dry run (preview without ingesting)
datahub ingest -c rdf_glossary.yml --dry-run
Filter large ontologies by namespace/module:
source:
  type: rdf
  config:
    source: https://spec.edmcouncil.org/fibo/ontology/master/latest/fibo-all.ttl
    sparql_filter: |
      CONSTRUCT { ?s ?p ?o }
      WHERE {
        ?s ?p ?o .
        FILTER(STRSTARTS(STR(?s), "https://spec.edmcouncil.org/fibo/ontology/FBC/"))
      }
    environment: PROD

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: "${DATAHUB_TOKEN}"
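
To make the effect of the filter concrete, here is a minimal plain-Python sketch of what the CONSTRUCT query selects: every triple whose subject IRI starts with the FBC namespace. The sample triples and the `filter_by_subject_prefix` helper are hypothetical and only illustrate the selection rule; this is not how the source evaluates the query.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

FBC_NAMESPACE = "https://spec.edmcouncil.org/fibo/ontology/FBC/"


def filter_by_subject_prefix(triples: Iterable[Triple], prefix: str) -> List[Triple]:
    """Keep triples whose subject starts with `prefix`, like STRSTARTS(STR(?s), prefix)."""
    return [t for t in triples if t[0].startswith(prefix)]


# Hypothetical triples, for illustration only.
triples = [
    (FBC_NAMESPACE + "ExampleModule/ConceptA", "rdf:type", "skos:Concept"),
    ("https://spec.edmcouncil.org/fibo/ontology/SEC/ExampleModule/ConceptB",
     "rdf:type", "skos:Concept"),
]

kept = filter_by_subject_prefix(triples, FBC_NAMESPACE)
print(len(kept))
```

Because the filter tests only `?s`, a triple about a concept outside the namespace is dropped even if an FBC concept refers to it as an object.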
RDF concepts are mapped to DataHub glossary terms:
- skos:Concept → GlossaryTerm
- skos:prefLabel OR rdfs:label → term name
- skos:definition OR rdfs:comment → term definition

IRI path hierarchies are automatically converted to glossary node hierarchies:
https://example.com/finance/credit-risk
→ Glossary Node: finance
   └─ Glossary Node: credit-risk
      └─ Glossary Term: (final segment)
Note: Domains are used internally as a data structure to organize glossary terms. They are not ingested as DataHub domain entities (which are for datasets/products).
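
The path-to-hierarchy conversion can be sketched with the standard library alone. This is a minimal illustration, assuming the final path segment names the term and every preceding segment becomes a glossary node; `split_iri_path` is a hypothetical helper, not a function of the source, and fragment-style IRIs (`#Term`) are not handled.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_iri_path(iri: str) -> Tuple[List[str], str]:
    """Return (parent node names, final segment) for an IRI path."""
    # Ignore empty segments produced by leading or trailing slashes.
    segments = [s for s in urlparse(iri).path.split("/") if s]
    if not segments:
        raise ValueError(f"IRI has no path segments: {iri}")
    return segments[:-1], segments[-1]


nodes, leaf = split_iri_path("https://example.com/finance/credit-risk")
print(nodes, leaf)  # ['finance'] credit-risk
```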
- skos:broader → creates isRelatedTerms relationships in DataHub
- skos:narrower → creates isRelatedTerms relationships (inverse direction)

http://example.com/finance/credit-risk
→ urn:li:glossaryTerm:finance/credit-risk
fibo:FinancialInstrument
→ urn:li:glossaryTerm:fibo:FinancialInstrument
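
The two conversions above can be reproduced with a small sketch. It assumes that http(s) IRIs contribute only their path and that any other identifier, such as a prefixed name, is kept verbatim; `term_urn` is a hypothetical helper written to match these two examples, not the source's own URN generator.

```python
from urllib.parse import urlparse

URN_PREFIX = "urn:li:glossaryTerm:"


def term_urn(identifier: str) -> str:
    """Build a glossary term URN from an http(s) IRI or a prefixed name."""
    parsed = urlparse(identifier)
    if parsed.scheme in ("http", "https"):
        # Drop scheme and host; keep the path without surrounding slashes.
        return URN_PREFIX + parsed.path.strip("/")
    # Anything else (e.g. fibo:FinancialInstrument) is used as-is.
    return URN_PREFIX + identifier


print(term_urn("http://example.com/finance/credit-risk"))
print(term_urn("fibo:FinancialInstrument"))
```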
| Parameter | Description | Default |
|---|---|---|
| source | RDF source (file, folder, URL) | required |
| environment | DataHub environment | PROD |
| format | RDF format (turtle, xml, n3, etc.) | auto-detect |
| dialect | RDF dialect (default, fibo, generic) | auto-detect |
| export_only | Export only specified types | all |
| skip_export | Skip specified types | none |
| sparql_filter | SPARQL CONSTRUCT query to filter graph | null |
| recursive | Recursive folder processing | true |
| extensions | File extensions to process | .ttl, .rdf, .owl, .n3, .nt |
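
As an illustration of how `recursive` and `extensions` might interact when `source` is a folder, the sketch below collects matching files with `pathlib`. The `discover_rdf_files` helper and its case-insensitive suffix matching are assumptions made for this example; the source's actual file discovery may differ.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

DEFAULT_EXTENSIONS = (".ttl", ".rdf", ".owl", ".n3", ".nt")


def discover_rdf_files(
    folder: Path,
    recursive: bool = True,
    extensions: Iterable[str] = DEFAULT_EXTENSIONS,
) -> List[Path]:
    """Return files under `folder` whose suffix is in `extensions`, sorted by path."""
    allowed = {ext.lower() for ext in extensions}
    # rglob descends into subfolders; glob stays at the top level.
    candidates = folder.rglob("*") if recursive else folder.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in allowed)


# Build a throwaway folder to show the difference between the two modes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.ttl").write_text("")
    (root / "sub" / "b.owl").write_text("")
    (root / "notes.txt").write_text("")
    found_all = [p.name for p in discover_rdf_files(root)]
    found_top = [p.name for p in discover_rdf_files(root, recursive=False)]

print(found_all, found_top)
```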
- glossary or glossary_terms - Glossary terms only
- relationship or relationships - Term relationships only

Note: The domain option is not available in MVP. Domains are used internally as a data structure for organizing glossary terms into hierarchies.
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<https://example.com/finance/credit-risk>
    a skos:Concept ;
    skos:prefLabel "Credit Risk" ;
    skos:definition "The risk of loss due to a borrower's failure to repay a loan" ;
    skos:broader <https://example.com/finance/risk> .

<https://example.com/finance/risk>
    a skos:Concept ;
    skos:prefLabel "Risk" ;
    skos:definition "General category of financial risk" .
This will create:
- finance (glossary node)
- Risk (under finance node)
- Credit Risk (under finance node, with relationship to Risk)

RDF uses a modular, pluggable entity architecture:
- skos:broader/narrower support

Current MVP includes:
Not included in MVP:
- rdflib, acryl-datahub
- datahub ingest --help for command options