# RDF Ingestion: User Stories and Acceptance Criteria
This document provides detailed user stories with precise acceptance criteria for implementing RDF ingestion. Each story includes specific technical requirements, mapping rules, and validation criteria to ensure consistent implementation.
**Status:** This document has been updated to reflect the current implementation status. Checked items `[x]` indicate completed features; unchecked items `[ ]` indicate features not yet implemented or requiring verification.

**Last Updated:** December 2024
- ✅ Core Glossary Management (Stories 1-8): ~95% complete
- ✅ Advanced Dataset and Lineage (Stories 9-11): ~100% complete
- ✅ Experimental Features (Story 12): ~100% complete
- ✅ Technical Implementation (Stories 13-15): ~95% complete
## Core Glossary Management (Stories 1-8)

### Story 1: Multi-Source RDF Ingestion

As a data steward, I want to ingest RDF glossaries from various sources and formats, so that I can import my existing ontology into DataHub without manual configuration.
#### AC1.1: Format Support
#### AC1.2: Source Support
- [x] Ingest from a single file (`--source file.ttl`)
- [x] Ingest from a directory of files (`--source /path/to/glossary/`)
- [x] Ingest from a SPARQL endpoint (`--source http://sparql.endpoint.com`)

#### AC1.4: Error Handling
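To make the source handling concrete, here is a minimal sketch of a loader covering AC1.2's three source kinds with the graceful error handling AC1.4 calls for. It assumes `rdflib`; the `load_rdf_graph` name is taken from AC14.1, but everything else (the `*.ttl` directory glob, the endpoint heuristic, the skip-and-report behavior) is illustrative rather than the shipped implementation.

```python
from pathlib import Path

from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore


def load_rdf_graph(source: str) -> Graph:
    """Load RDF from a single file, a directory of files, or a SPARQL endpoint."""
    if source.startswith(("http://", "https://")):
        # Treat HTTP(S) sources as SPARQL endpoints backed by a remote store.
        return Graph(store=SPARQLStore(source))

    path = Path(source)
    files = sorted(path.glob("*.ttl")) if path.is_dir() else [path]
    graph = Graph()
    for f in files:
        try:
            # rdflib infers the serialization (Turtle, RDF/XML, JSON-LD) from the suffix.
            graph.parse(str(f))
        except Exception as exc:
            # Graceful error handling: report the bad file and keep going.
            print(f"Skipping {f}: {exc}")
    return graph
```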
### Story 2: Automatic Term Detection

As a data steward, I want to automatically detect glossary terms from RDF, so that I don't need to manually specify which resources are terms.
#### AC2.1: Term Detection Criteria
- [x] Detect `skos:Concept` resources as glossary terms
- [x] Detect `owl:Class` resources as glossary terms
- [x] Detect `owl:NamedIndividual` resources as glossary terms
- [x] Exclude `owl:Ontology` declarations from term detection
- [x] Require a label (`rdfs:label` OR `skos:prefLabel`) of ≥3 characters

#### AC2.2: Property Extraction
- [x] Use `skos:prefLabel` as primary name (preferred)
- [x] Fall back to `rdfs:label` if `skos:prefLabel` is not available
- [x] Use `skos:definition` as primary description (preferred)
- [x] Fall back to `rdfs:comment` if `skos:definition` is not available

#### AC2.3: Validation Rules
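The extraction and validation rules above amount to a short preference chain; a sketch using `rdflib` (the `extract_term` helper and the returned dict shape are illustrative):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS, SKOS


def extract_term(graph: Graph, term: URIRef) -> dict | None:
    """Extract name/description with the fallback order described above."""
    name = graph.value(term, SKOS.prefLabel) or graph.value(term, RDFS.label)
    # Validation rule sketch: require a label of at least 3 characters.
    if name is None or len(str(name)) < 3:
        return None
    description = graph.value(term, SKOS.definition) or graph.value(term, RDFS.comment)
    return {"name": str(name), "description": str(description) if description else ""}


graph = Graph().parse("glossary.ttl")
terms = [extract_term(graph, t) for t in graph.subjects(RDF.type, SKOS.Concept)]
```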
### Story 3: SKOS Relationship Mapping

As a data steward, I want to map SKOS relationships to DataHub glossary relationships, so that my glossary hierarchy is preserved in DataHub.
#### AC3.1: Hierarchical Relationships
- [x] Map `skos:broader` to DataHub parent relationships
- [x] Map `skos:narrower` to DataHub child relationships
- [x] Map `skos:broadMatch` and `skos:narrowMatch` to hierarchy relationships

#### AC3.2: Associative Relationships
- [x] Map `skos:related` to DataHub related terms
- [x] Map `skos:closeMatch` to DataHub related terms

#### AC3.3: External References
- [x] Map `skos:exactMatch` to DataHub external references
- [x] Map `owl:sameAs` to DataHub external references

#### AC3.4: Relationship Validation
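All three relationship groups reduce to a predicate-to-kind lookup. A sketch with `rdflib`, including a basic validation step checking that both ends of a relationship are IRIs (the kind labels are illustrative, not DataHub API values):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL, SKOS

# Predicate → relationship kind, per AC3.1-AC3.3.
RELATIONSHIP_MAP = {
    SKOS.broader: "parent",
    SKOS.narrower: "child",
    SKOS.broadMatch: "parent",
    SKOS.narrowMatch: "child",
    SKOS.related: "related",
    SKOS.closeMatch: "related",
    SKOS.exactMatch: "external_reference",
    OWL.sameAs: "external_reference",
}


def extract_relationships(graph: Graph):
    """Yield (subject, kind, object) for every mapped SKOS/OWL predicate."""
    for predicate, kind in RELATIONSHIP_MAP.items():
        for subj, obj in graph.subject_objects(predicate):
            # Validation sketch: both ends must be IRIs, not blank nodes or literals.
            if isinstance(subj, URIRef) and isinstance(obj, URIRef):
                yield subj, kind, obj
```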
### Story 4: IRI-to-URN Conversion

As a data steward, I want to convert RDF IRIs to DataHub URNs, so that my glossary terms have proper DataHub identifiers.
#### AC4.1: IRI Processing
- [x] Handle the `:` character during IRI processing

#### AC4.2: URN Generation
#### AC4.3: Validation and Error Handling
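A sketch of one plausible IRI-to-URN scheme: take the fragment if present, otherwise the last path segment, and fail loudly when neither exists. Whether the real implementation uses the bare local name or a more qualified form is defined in the RDF Specification; the choice here is an assumption for illustration.

```python
from urllib.parse import urlparse


def iri_to_glossary_term_urn(iri: str) -> str:
    """Derive a DataHub glossary term URN from an RDF IRI (illustrative scheme)."""
    parsed = urlparse(iri)
    # Prefer the fragment (http://example.com/ont#Account); otherwise take the
    # last path segment (http://example.com/finance/accounts → accounts).
    local_name = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if not local_name:
        # Validation and error handling: refuse IRIs with no usable name part.
        raise ValueError(f"Cannot derive a term name from IRI: {iri}")
    return f"urn:li:glossaryTerm:{local_name}"


assert iri_to_glossary_term_urn("https://example.com/finance/accounts") == (
    "urn:li:glossaryTerm:accounts"
)
```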
### Story 5: Domain Creation from IRI Hierarchy

As a data steward, I want to automatically create DataHub domains from the IRI hierarchy, so that my glossary terms are organized in DataHub.
#### AC5.1: Domain Hierarchy Creation
- [x] Create `urn:li:domain:example_com` for `https://example.com/finance/accounts`
- [x] Create `urn:li:domain:finance` for `https://example.com/finance/accounts`
- [x] Assign the term `accounts` to `urn:li:domain:finance`

#### AC5.2: Domain Naming Convention
- [x] `example.com` → `urn:li:domain:example_com`
- [x] `finance` → `urn:li:domain:finance`
- [x] `loan-trading` → `urn:li:domain:loan_trading`

#### AC5.3: Domain Assignment
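The hierarchy and naming rules above can be sketched as follows, where the last returned URN is the domain the term is assigned to (the sanitization regex is an assumption consistent with the examples):

```python
import re
from urllib.parse import urlparse


def domain_urns_for_iri(iri: str) -> list[str]:
    """Derive the domain chain for an IRI, following the naming convention above."""

    def sanitize(segment: str) -> str:
        # example.com → example_com, loan-trading → loan_trading
        return re.sub(r"[^a-zA-Z0-9]+", "_", segment).strip("_").lower()

    parsed = urlparse(iri)
    # Host first, then every path segment except the final one (the term itself).
    segments = [parsed.netloc] + parsed.path.strip("/").split("/")[:-1]
    return [f"urn:li:domain:{sanitize(s)}" for s in segments if s]


assert domain_urns_for_iri("https://example.com/finance/accounts") == [
    "urn:li:domain:example_com",
    "urn:li:domain:finance",
]
```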
### Story 6: Concept Schemes and Collections

As a data steward, I want to process SKOS concept schemes and collections, so that I can organize my glossary terms in DataHub.
#### AC6.1: Concept Scheme Processing
- [x] Detect `skos:ConceptScheme` resources as glossary nodes
- [x] `skos:prefLabel` → DataHub glossary node name
- [x] `skos:definition` → DataHub glossary node description
- [x] Create `GlossaryNode` entities

#### AC6.2: Collection Processing
- [x] Detect `skos:Collection` resources as glossary nodes

#### AC6.3: Node Relationships
- [x] Process `skos:broader` relationships for nodes

### Story 7: Structured Properties

As a data steward, I want to attach structured properties to glossary terms, so that I can add domain-specific metadata.
#### AC7.1: Property Detection
- [x] Detect `rdf:Property` declarations with `rdfs:domain`
- [x] Map `rdfs:domain` to the appropriate DataHub entity types
- [x] Use `rdfs:label` as the property name
- [x] Use `rdfs:comment` as the property description
- [x] Process `rdfs:range` class instances

#### AC7.2: Entity Type Mapping
- [x] `dcat:Dataset` domain → `dataset` entity type
- [x] `skos:Concept` domain → `glossaryTerm` entity type
- [x] `schema:Person` domain → `user` entity type
- [x] `schema:Organization` domain → `corpGroup` entity type

#### AC7.3: Property Application
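AC7.2's table is effectively a small lookup keyed by the `rdfs:domain` class. A sketch (whether the source binds `schema:` to the `http://` or `https://` schema.org namespace is an assumption):

```python
from rdflib import Namespace
from rdflib.namespace import DCAT, SKOS

SCHEMA = Namespace("https://schema.org/")

# rdfs:domain class → DataHub entity type, per AC7.2.
ENTITY_TYPE_MAP = {
    DCAT.Dataset: "dataset",
    SKOS.Concept: "glossaryTerm",
    SCHEMA.Person: "user",
    SCHEMA.Organization: "corpGroup",
}
```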
### Story 8: CLI and Python API

As a developer, I want to use CLI commands and a Python API, so that I can integrate RDF ingestion into my workflows.
#### AC8.1: CLI Commands
- [x] `ingest` command with `--source`, `--export`, `--server`, `--token` options
- [x] `list` command to show existing glossary items
- [x] `delete` command to remove glossary terms/domains
- [x] `--dry-run` flag for safe testing

#### AC8.2: Python API
- [x] `DataHubClient` class for API interactions
- [x] `OntologyToDataHub` class for processing

#### AC8.3: Export Targets
- [x] `entities` target (datasets, glossary terms, properties)
- [x] `links` target (relationships, associations)
- [x] `lineage` target (lineage activities and relationships)
- [x] `all` target (comprehensive export)

## Advanced Dataset and Lineage (Stories 9-11)

### Story 9: Dataset Processing with Platform Integration

As a data steward, I want to process RDF datasets with platform integration, so that I can manage my data assets in DataHub.
#### AC9.1: Dataset Detection
- [x] Detect `void:Dataset` resources as datasets
- [x] Detect `dcterms:Dataset` resources as datasets
- [x] Detect `schema:Dataset` resources as datasets
- [x] Detect `dh:Dataset` resources as datasets

#### AC9.2: Dataset Properties
- [x] `dcterms:title` → dataset name (preferred)
- [x] `schema:name` → dataset name
- [x] `rdfs:label` → dataset name
- [x] `dcterms:description` → dataset description
- [x] `dcterms:creator` → dataset ownership
- [x] `dcterms:created` → creation timestamp
- [x] `dcterms:modified` → modification timestamp
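A sketch of the name preference chain, plus URN construction through the DataHub SDK helper `make_dataset_urn`. The fallback platform value is an assumption; actual platform resolution is covered by AC9.3 below.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS, RDFS, SDO

from datahub.emitter.mce_builder import make_dataset_urn


def dataset_name(graph: Graph, dataset: URIRef) -> str | None:
    """Apply the preference chain: dcterms:title, then schema:name, then rdfs:label."""
    for predicate in (DCTERMS.title, SDO.name, RDFS.label):
        value = graph.value(dataset, predicate)
        if value is not None:
            return str(value)
    return None


def dataset_urn(graph: Graph, dataset: URIRef, platform: str = "file") -> str:
    # make_dataset_urn yields urn:li:dataset:(urn:li:dataPlatform:<p>,<name>,PROD)
    name = dataset_name(graph, dataset) or str(dataset)
    return make_dataset_urn(platform=platform, name=name)
```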
#### AC9.3: Platform Integration

- [x] `dcat:accessService` → platform identifier (preferred)
- [x] `schema:provider` → platform identifier
- [x] `void:sparqlEndpoint` → SPARQL platform
- [x] `void:dataDump` → file platform

### Story 10: PROV-O Lineage Processing

As a data steward, I want to process PROV-O lineage relationships, so that I can track data flow and dependencies.
#### AC10.1: Activity Processing
- [x] Process `prov:Activity` resources as DataHub DataJobs
- [x] `rdfs:label` → activity name
- [x] `dcterms:description` → activity description
- [x] `prov:startedAtTime` → activity start time
- [x] `prov:endedAtTime` → activity end time
- [x] `prov:wasAssociatedWith` → user attribution

#### AC10.2: Lineage Relationships
- [x] `prov:used` → upstream data dependencies
- [x] `prov:generated` → downstream data products
- [x] `prov:wasDerivedFrom` → direct derivation relationships
- [x] `prov:wasGeneratedBy` → activity-to-entity relationships
- [x] `prov:wasInfluencedBy` → downstream influences

#### AC10.3: Field-Level Lineage
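A sketch of how AC10.2's core predicates translate into per-activity upstream/downstream sets with `rdflib`; per AC10.1, the real source goes on to emit DataHub DataJob and lineage aspects from this information:

```python
from rdflib import Graph
from rdflib.namespace import PROV, RDF


def extract_lineage(graph: Graph):
    """Collect per-activity inputs/outputs from prov:used and prov:generated."""
    lineage = {}
    for activity in graph.subjects(RDF.type, PROV.Activity):
        lineage[activity] = {
            "upstream": list(graph.objects(activity, PROV.used)),
            "downstream": list(graph.objects(activity, PROV.generated)),
        }
    # Direct entity-to-entity derivations (prov:wasDerivedFrom) bypass activities.
    derivations = list(graph.subject_objects(PROV.wasDerivedFrom))
    return lineage, derivations
```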
### Story 11: Schema Field Extraction

As a data steward, I want to extract and map dataset schema fields, so that I can document my data structure.
#### AC11.1: Field Detection
- [x] Detect schema fields via `dh:hasSchemaField`
- [x] Resolve field names from `dh:hasName`, `rdfs:label`, or a custom `hasName`

#### AC11.2: Field Properties
- [x] `dh:hasName` → field path
- [x] `rdfs:label` → field display name
- [x] `dh:hasDataType` → field data type
- [x] `dh:isNullable` → nullable constraint
- [x] `dh:hasGlossaryTerm` → associated glossary terms
- [x] `rdfs:comment` → field description
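The data type mapping in AC11.3 below reduces to a lookup into the DataHub SDK's schema type classes, with `StringTypeClass` as the default. A minimal sketch:

```python
from datahub.metadata.schema_classes import (
    BooleanTypeClass,
    DateTypeClass,
    NumberTypeClass,
    StringTypeClass,
)

# Lowercased RDF type name → DataHub schema field type (per AC11.3 below).
DATA_TYPE_MAP = {
    "varchar": StringTypeClass,
    "string": StringTypeClass,
    "date": DateTypeClass,
    "datetime": DateTypeClass,
    "int": NumberTypeClass,
    "number": NumberTypeClass,
    "decimal": NumberTypeClass,
    "bool": BooleanTypeClass,
    "boolean": BooleanTypeClass,
}


def map_data_type(raw: str):
    # Unknown types default to StringTypeClass, as the criteria require.
    return DATA_TYPE_MAP.get(raw.lower(), StringTypeClass)()
```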
#### AC11.3: Data Type Mapping

- [x] `varchar`, `string` → `StringTypeClass`
- [x] `date`, `datetime` → `DateTypeClass`
- [x] `int`, `number`, `decimal` → `NumberTypeClass`
- [x] `bool`, `boolean` → `BooleanTypeClass`
- [x] Default to `StringTypeClass` for unknown types

## Experimental Features (Story 12)

### Story 12: SPARQL-Based Entity Detection

As a developer, I want to use SPARQL queries for dynamic entity detection, so that I can process any RDF pattern without hardcoded logic.
#### AC12.1: Query-Based Detection
- [x] Determine entity types from the `entity_type` field in query results

#### AC12.2: Query Registry
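AC12.1 in miniature: a single SPARQL query classifies resources by binding an `entity_type` variable, so supporting a new RDF pattern means registering a new query rather than writing new code. The query itself is illustrative:

```python
from rdflib import Graph

# Illustrative detection query: classify SKOS concepts and DCAT datasets by
# binding an entity_type field, instead of hardcoding per-class logic.
DETECTION_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?entity ?entity_type WHERE {
  { ?entity a skos:Concept . BIND("glossaryTerm" AS ?entity_type) }
  UNION
  { ?entity a dcat:Dataset . BIND("dataset" AS ?entity_type) }
}
"""

graph = Graph().parse("ontology.ttl")
for row in graph.query(DETECTION_QUERY):
    print(row.entity, row.entity_type)
```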
## Technical Implementation (Stories 13-15)

### Story 13: Separation of Concerns

As a developer, I want to implement a clean separation of concerns with minimal abstraction, so that the system is maintainable, testable, and easy to understand.
#### AC13.1: RDF to DataHub AST (Simplified)
- [x] Build a `DataHubGraph` representation

#### AC13.2: DataHub AST to MCPs
#### AC13.3: Output Strategy
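A sketch of the AST-to-MCP step from AC13.2: a minimal intermediate representation is flattened into `MetadataChangeProposalWrapper` objects. `MetadataChangeProposalWrapper` and `GlossaryTermInfoClass` are real DataHub SDK classes; the `DataHubGraph` dataclass shape here is an assumption (and is distinct from the SDK client class of the same name):

```python
from dataclasses import dataclass, field

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import GlossaryTermInfoClass


@dataclass
class GlossaryTermNode:
    urn: str
    name: str
    definition: str


@dataclass
class DataHubGraph:  # illustrative AST shape, not the SDK client of the same name
    terms: list[GlossaryTermNode] = field(default_factory=list)


def to_mcps(ast: DataHubGraph) -> list[MetadataChangeProposalWrapper]:
    """Flatten the AST into one MCP per aspect."""
    return [
        MetadataChangeProposalWrapper(
            entityUrn=term.urn,
            aspect=GlossaryTermInfoClass(
                name=term.name, definition=term.definition, termSource="EXTERNAL"
            ),
        )
        for term in ast.terms
    ]
```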
### Story 14: Dependency Injection

As a developer, I want to use dependency injection for a modular architecture, so that components can be easily swapped and tested.
#### AC14.1: RDF Loading (Simplified)
- [x] `load_rdf_graph()` function for RDF source loading

#### AC14.2: Query Factory
- [x] `QueryFactory` for query processing
- [x] `SPARQLQuery`, `PassThroughQuery`, `FilterQuery` implementations
- [x] `QueryInterface` for a consistent API
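A sketch of the factory-plus-interface pattern these criteria describe; only the class names come from this document, and the internals are assumptions:

```python
from typing import Protocol

from rdflib import Graph


class QueryInterface(Protocol):
    """Consistent API every query implementation must satisfy."""

    def execute(self, graph: Graph) -> list[dict]: ...


class PassThroughQuery:
    def execute(self, graph: Graph) -> list[dict]:
        # No filtering: every subject passes through.
        return [{"entity": s} for s in set(graph.subjects())]


class SPARQLQuery:
    def __init__(self, sparql: str) -> None:
        self.sparql = sparql

    def execute(self, graph: Graph) -> list[dict]:
        # ResultRow.asdict() maps each bound variable name to its value.
        return [row.asdict() for row in graph.query(self.sparql)]


class QueryFactory:
    """Injection point: callers depend on QueryInterface, not concrete classes."""

    @staticmethod
    def create(kind: str, **kwargs) -> QueryInterface:
        registry = {"sparql": SPARQLQuery, "passthrough": PassThroughQuery}
        return registry[kind](**kwargs)
```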
#### AC14.3: Target Factory

- [x] `TargetFactory` for output targets
- [x] `DataHubTarget`, `PrettyPrintTarget`, `FileTarget` implementations
- [x] `TargetInterface` for a consistent API

### Story 15: Comprehensive Validation

As a developer, I want to implement comprehensive validation, so that the system provides clear error messages and graceful recovery.
#### AC15.1: RDF Validation
#### AC15.2: Entity Validation
#### AC15.3: DataHub Validation
#### AC15.4: Error Recovery
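A sketch of the validate-report-continue pattern the story calls for, where invalid entities are skipped with precise messages rather than aborting the run (the `ValidationError` shape and the specific rules are illustrative):

```python
from dataclasses import dataclass


@dataclass
class ValidationError:
    entity: str
    message: str


def validate_term(term: dict) -> list[ValidationError]:
    """Entity-level checks; each failure becomes a precise, actionable message."""
    errors = []
    urn = term.get("urn", "?")
    if len(term.get("name", "")) < 3:
        errors.append(ValidationError(urn, "label shorter than 3 characters"))
    if not term.get("urn", "").startswith("urn:li:"):
        errors.append(ValidationError(urn, "malformed DataHub URN"))
    return errors


def validate_all(terms: list[dict]) -> tuple[list[dict], list[ValidationError]]:
    """Graceful recovery: invalid entities are skipped and reported, not fatal."""
    valid, errors = [], []
    for term in terms:
        term_errors = validate_term(term)
        if term_errors:
            errors.extend(term_errors)
        else:
            valid.append(term)
    return valid, errors
```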
For detailed technical specifications, see the RDF Specification.