RDF User Stories and Acceptance Criteria

Overview

This document provides detailed user stories with precise acceptance criteria for implementing RDF ingestion. Each story includes specific technical requirements, mapping rules, and validation criteria to ensure consistent implementation.

Status: This document has been updated to reflect current implementation status. Checked items [x] indicate completed features. Unchecked items [ ] indicate features not yet implemented or requiring verification.

Last Updated: December 2024

Implementation Status Summary

  • Core Glossary Management (Stories 1-8): ~95% complete
    • Format support: TTL, RDF/XML, JSON-LD (N-Triples pending)
    • Source support: File, folder (server sources pending)
    • Term detection, relationships, IRI-to-URN conversion: Complete
    • Domain management, glossary nodes, structured properties: Complete
    • CLI/API: Ingest command complete (list/delete commands pending)
  • Advanced Dataset and Lineage (Stories 9-11): 100% complete
    • Dataset processing, platform integration: Complete
    • Comprehensive lineage processing: Complete
    • Schema field processing: Complete
  • Experimental Features (Story 12): 100% complete
    • Dynamic routing with SPARQL queries: Complete
  • Technical Implementation (Stories 13-15): ~95% complete
    • Streamlined architecture: Complete (simplified from three-phase)
    • Dependency injection framework: Complete
    • Validation and error handling: Complete (rollback/retry pending)

Table of Contents

  1. Core Glossary Management Stories
  2. Advanced Dataset and Lineage Stories
  3. Experimental Features Stories
  4. Technical Implementation Stories

Core Glossary Management Stories

Story 1: RDF Glossary Ingestion

As a data steward
I want to ingest RDF glossaries from various sources and formats
So that I can import my existing ontology into DataHub without manual configuration

Acceptance Criteria

AC1.1: Format Support

  • System supports TTL (Turtle) format with proper namespace handling
  • System supports RDF/XML format with namespace preservation
  • System supports JSON-LD format with context handling
  • System supports N-Triples format with proper parsing
  • System validates RDF syntax and reports specific parsing errors
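
A minimal sketch of the parsing step, assuming rdflib as the underlying parser; the function name and error wrapping are illustrative, not the source's actual API:

```python
from rdflib import Graph
from rdflib.util import guess_format


def load_graph(path: str) -> Graph:
    """Parse one RDF file, guessing the serialization from its extension."""
    fmt = guess_format(path)  # "turtle", "xml", "json-ld", "nt", or None
    if fmt is None:
        raise ValueError(f"Unrecognized RDF file extension: {path}")
    graph = Graph()
    try:
        graph.parse(path, format=fmt)
    except Exception as exc:  # rdflib raises parser-specific exceptions
        # Report the file and the parser's message, per AC1.4
        raise ValueError(f"Failed to parse {path} as {fmt}: {exc}") from exc
    return graph
```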

AC1.2: Source Support

  • System handles single file sources (--source file.ttl)
  • System handles directory sources (--source /path/to/glossary/)
  • System handles server sources (--source http://sparql.endpoint.com)
  • System processes multiple files in directory recursively
  • System handles mixed-format directories (TTL + RDF/XML)

AC1.4: Error Handling

  • System provides detailed error messages for malformed RDF
  • System continues processing after encountering non-fatal errors
  • System logs all processing steps for debugging
  • System validates file permissions and accessibility

Story 2: Glossary Term Detection and Processing

As a data steward
I want to automatically detect glossary terms from RDF
So that I don't need to manually specify which resources are terms

Acceptance Criteria

AC2.1: Term Detection Criteria

  • System detects skos:Concept resources as glossary terms
  • System detects owl:Class resources as glossary terms
  • System detects owl:NamedIndividual resources as glossary terms
  • System detects custom class instances (any resource typed as instance of custom class)
  • System excludes owl:Ontology declarations from term detection
  • System requires terms to have labels (rdfs:label or skos:prefLabel, ≥3 characters)

AC2.2: Property Extraction

  • System extracts skos:prefLabel as primary name (preferred)
  • System falls back to rdfs:label if skos:prefLabel not available
  • System extracts skos:definition as primary description (preferred)
  • System falls back to rdfs:comment if skos:definition not available
  • System preserves language tags for multilingual support
  • System extracts custom properties and stores as metadata
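
Taken together, AC2.1 and AC2.2 amount to a type filter plus a preference chain. A sketch in rdflib terms, with illustrative helper names:

```python
from typing import Optional

from rdflib import Graph, RDF, RDFS, URIRef
from rdflib.namespace import OWL, SKOS

TERM_TYPES = (SKOS.Concept, OWL.Class, OWL.NamedIndividual)


def first_literal(graph: Graph, subject, *predicates) -> Optional[str]:
    """Return the first value found, honoring the stated preference order."""
    for predicate in predicates:
        value = graph.value(subject, predicate)
        if value is not None:
            return str(value)
    return None


def detect_terms(graph: Graph):
    for term_type in TERM_TYPES:
        for subject in graph.subjects(RDF.type, term_type):
            if not isinstance(subject, URIRef):
                continue  # AC2.3: blank nodes are not valid terms
            name = first_literal(graph, subject, SKOS.prefLabel, RDFS.label)
            if name is None or len(name) < 3:
                continue  # AC2.1: a label of at least 3 characters is required
            definition = first_literal(graph, subject, SKOS.definition, RDFS.comment)
            yield subject, name, definition
```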

AC2.3: Validation Rules

  • System validates that terms have valid URI references (not blank nodes)
  • System validates that labels are non-empty strings (≥3 characters)
  • System validates that definitions are non-empty strings
  • System reports validation errors with specific term URIs

Story 3: SKOS Relationship Mapping

As a data steward
I want to map SKOS relationships to DataHub glossary relationships
So that my glossary hierarchy is preserved in DataHub

Acceptance Criteria

AC3.1: Hierarchical Relationships

  • System maps skos:broader to DataHub parent relationships
  • System maps skos:narrower to DataHub child relationships
  • System maps skos:broadMatch and skos:narrowMatch to hierarchy relationships
  • System creates bidirectional relationships automatically
  • System validates no circular references in hierarchy

AC3.2: Associative Relationships

  • System maps skos:related to DataHub related terms
  • System maps skos:closeMatch to DataHub related terms
  • System preserves relationship directionality
  • System handles multiple related terms per term

AC3.3: External References

  • System maps skos:exactMatch to DataHub external references
  • System maps owl:sameAs to DataHub external references
  • System preserves external reference URIs
  • System validates external reference format
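
The three groups above reduce to a predicate lookup table. A sketch; the right-hand labels are illustrative relationship kinds, not literal DataHub aspect names:

```python
from rdflib.namespace import OWL, SKOS

# SKOS/OWL predicate -> relationship kind (AC3.1-AC3.3); labels illustrative
SKOS_RELATIONSHIP_MAP = {
    SKOS.broader: "parent",
    SKOS.narrower: "child",
    SKOS.broadMatch: "parent",
    SKOS.narrowMatch: "child",
    SKOS.related: "related",
    SKOS.closeMatch: "related",
    SKOS.exactMatch: "external-reference",
    OWL.sameAs: "external-reference",
}
```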

AC3.4: Relationship Validation

  • System validates that referenced terms exist in the glossary
  • System reports broken relationship references
  • System handles missing referenced terms gracefully

Story 4: IRI-to-URN Conversion

As a data steward
I want to convert RDF IRIs to DataHub URNs
So that my glossary terms have proper DataHub identifiers

Acceptance Criteria

AC4.1: IRI Processing

  • System processes HTTP/HTTPS IRIs by removing scheme and preserving path structure
  • System processes custom scheme IRIs by splitting on first : character
  • System handles various scheme formats (http://, https://, ftp://, custom:)
  • System preserves fragments as part of path structure
  • System handles empty path segments gracefully

AC4.2: URN Generation

  • System generates DataHub-compliant URNs for all entity types
  • System preserves original case and structure from IRI
  • System validates URN format compliance
  • System handles edge cases and error conditions
  • System follows consistent URN generation algorithm
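
The complete algorithm lives in the technical specification; a simplified sketch of the IRI-processing rules in AC4.1, with an illustrative helper name:

```python
from urllib.parse import urlparse


def iri_to_path(iri: str) -> str:
    """Reduce an IRI to the path structure used when building URNs (AC4.1)."""
    if iri.startswith(("http://", "https://")):
        parsed = urlparse(iri)
        segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s]
        if parsed.fragment:  # fragments become part of the path structure
            segments.append(parsed.fragment)
        return "/".join(segments)
    # Custom schemes: split on the first ":" and keep the remainder
    scheme, _, rest = iri.partition(":")
    if not scheme or not rest:
        raise ValueError(f"Cannot derive a URN path from IRI: {iri}")
    return rest


# e.g. iri_to_path("https://example.com/finance/accounts")
#   -> "example.com/finance/accounts"
```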

AC4.3: Validation and Error Handling

  • System validates IRI format and scheme requirements
  • System provides detailed error messages for invalid IRIs
  • System handles malformed IRIs gracefully
  • System reports specific validation failures

Story 5: Domain Management

As a data steward
I want to automatically create DataHub domains from IRI hierarchy
So that my glossary terms are organized in DataHub

Acceptance Criteria

AC5.1: Domain Hierarchy Creation

  • System creates domains for parent segments only (excludes term name)
  • System creates urn:li:domain:example_com for https://example.com/finance/accounts
  • System creates urn:li:domain:finance for https://example.com/finance/accounts
  • System assigns dataset accounts to urn:li:domain:finance
  • System handles deep hierarchies correctly

AC5.2: Domain Naming Convention

  • System converts example.com → urn:li:domain:example_com
  • System converts finance → urn:li:domain:finance
  • System converts loan-trading → urn:li:domain:loan_trading
  • System preserves original segment names for display
  • System validates domain URN format
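
A sketch matching the examples in AC5.1 and AC5.2; the helper name is illustrative, and AC5.3's leaf-domain assignment falls out of the last list element:

```python
from typing import List
from urllib.parse import urlparse


def domain_urns_for(iri: str) -> List[str]:
    """Derive a domain URN for every parent segment of an IRI (AC5.1/AC5.2)."""
    parsed = urlparse(iri)
    # Parent segments only: the host plus all path segments except the term name
    segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s][:-1]
    return [
        "urn:li:domain:" + segment.replace(".", "_").replace("-", "_")
        for segment in segments
    ]


# domain_urns_for("https://example.com/finance/accounts")
#   -> ["urn:li:domain:example_com", "urn:li:domain:finance"]
# "accounts" is then assigned to the leaf domain, urn:li:domain:finance (AC5.3)
```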

AC5.3: Domain Assignment

  • System assigns glossary terms to leaf domain (most specific parent)
  • System creates parent-child relationships between domains
  • System handles shared domains correctly
  • System validates domain assignment logic

Story 6: Glossary Node Support

As a data steward
I want to process SKOS concept schemes and collections
So that I can organize my glossary terms in DataHub

Acceptance Criteria

AC6.1: Concept Scheme Processing

  • System detects skos:ConceptScheme resources as glossary nodes
  • System maps skos:prefLabel → DataHub glossary node name
  • System maps skos:definition → DataHub glossary node description
  • System creates proper DataHub GlossaryNode entities
  • System generates URNs for concept schemes

AC6.2: Collection Processing

  • System detects skos:Collection resources as glossary nodes
  • System processes collection metadata (labels, descriptions)
  • System handles collection membership relationships
  • System creates DataHub glossary nodes for collections

AC6.3: Node Relationships

  • System maps skos:broader relationships for nodes
  • System creates parent-child relationships between nodes
  • System links terms to their containing nodes
  • System validates node hierarchy consistency

Story 7: Structured Properties Support

As a data steward
I want to attach structured properties to glossary terms
So that I can add domain-specific metadata

Acceptance Criteria

AC7.1: Property Detection

  • System detects rdf:Property declarations with rdfs:domain
  • System maps rdfs:domain to appropriate DataHub entity types
  • System extracts rdfs:label as property name
  • System extracts rdfs:comment as property description
  • System identifies enum values from rdfs:range class instances

AC7.2: Entity Type Mapping

  • System maps dcat:Dataset domain → dataset entity type
  • System maps skos:Concept domain → glossaryTerm entity type
  • System maps schema:Person domain → user entity type
  • System maps schema:Organization domain → corpGroup entity type
  • System handles multiple domains per property
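
These rules reduce to a small lookup table. A sketch; the schema.org namespace binding is an assumption, and "corpuser" is DataHub's identifier for the user entity type:

```python
from rdflib import Namespace
from rdflib.namespace import DCAT, SKOS

SCHEMA = Namespace("http://schema.org/")  # assumed binding for schema:*

# rdfs:domain class -> DataHub entity type (AC7.2)
DOMAIN_TO_ENTITY_TYPE = {
    DCAT.Dataset: "dataset",
    SKOS.Concept: "glossaryTerm",
    SCHEMA.Person: "corpuser",        # the "user" entity type
    SCHEMA.Organization: "corpGroup",
}
```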

AC7.3: Property Application

  • System applies structured properties to appropriate entities
  • System validates property values against allowed values
  • System creates DataHub structured property definitions
  • System generates proper URNs for structured properties

Story 8: CLI and API Interface

As a developer
I want to use CLI commands and Python API
So that I can integrate RDF into my workflows

Acceptance Criteria

AC8.1: CLI Commands

  • System provides ingest command with --source, --export, --server, --token options
  • System provides list command to show existing glossary items
  • System provides delete command to remove glossary terms/domains
  • System supports --dry-run flag for safe testing
  • System provides comprehensive help and usage examples

AC8.2: Python API

  • System provides DataHubClient class for API interactions
  • System provides OntologyToDataHub class for processing
  • System supports both dry run and live execution modes
  • System provides clear error handling and logging
  • System includes comprehensive API documentation
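
A hedged usage sketch of the two classes named above; the import path, constructor arguments, and method names are illustrative assumptions, not the documented signatures:

```python
# Illustrative only; consult the API documentation for real signatures.
from datahub.ingestion.source.rdf import DataHubClient, OntologyToDataHub  # hypothetical path

client = DataHubClient(server="http://localhost:8080", token="<token>")  # assumed args
processor = OntologyToDataHub(client)

processor.process("glossary.ttl", dry_run=True)   # dry run first (AC8.2)
processor.process("glossary.ttl", dry_run=False)  # then live execution
```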

AC8.3: Export Targets

  • System supports entities target (datasets, glossary terms, properties)
  • System supports links target (relationships, associations)
  • System supports lineage target (lineage activities and relationships)
  • System supports all target (comprehensive export)

Advanced Dataset and Lineage Stories

Story 9: Dataset Processing

As a data steward
I want to process RDF datasets with platform integration
So that I can manage my data assets in DataHub

Acceptance Criteria

AC9.1: Dataset Detection

  • System detects void:Dataset resources as datasets
  • System detects dcterms:Dataset resources as datasets
  • System detects schema:Dataset resources as datasets
  • System detects dh:Dataset resources as datasets
  • System validates dataset metadata requirements

AC9.2: Dataset Properties

  • System maps dcterms:title → dataset name (preferred)
  • System falls back to schema:name → dataset name
  • System falls back to rdfs:label → dataset name
  • System maps dcterms:description → dataset description
  • System maps dcterms:creator → dataset ownership
  • System maps dcterms:created → creation timestamp
  • System maps dcterms:modified → modification timestamp

AC9.3: Platform Integration

  • System maps dcat:accessService → platform identifier (preferred)
  • System maps schema:provider → platform identifier
  • System maps void:sparqlEndpoint → SPARQL platform
  • System maps void:dataDump → file platform
  • System extracts platform information from service URIs
  • System validates platform connection configurations
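
A sketch of the preference chain in AC9.3; the namespaces are bound explicitly, predicate order encodes the stated preference, and the URI-to-platform derivation is simplified:

```python
from typing import Optional

from rdflib import Graph, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")
SCHEMA = Namespace("http://schema.org/")
VOID = Namespace("http://rdfs.org/ns/void#")


def platform_for(graph: Graph, dataset) -> Optional[str]:
    """Resolve a platform identifier, most-preferred source first (AC9.3)."""
    for predicate, fixed_platform in (
        (DCAT.accessService, None),    # platform derived from the service URI
        (SCHEMA.provider, None),
        (VOID.sparqlEndpoint, "sparql"),
        (VOID.dataDump, "file"),
    ):
        value = graph.value(dataset, predicate)
        if value is not None:
            return fixed_platform or str(value)
    return None
```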

Story 10: Comprehensive Lineage Processing

As a data steward
I want to process PROV-O lineage relationships
So that I can track data flow and dependencies

Acceptance Criteria

AC10.1: Activity Processing

  • System detects prov:Activity resources as DataHub DataJobs
  • System maps rdfs:label → activity name
  • System maps dcterms:description → activity description
  • System maps prov:startedAtTime → activity start time
  • System maps prov:endedAtTime → activity end time
  • System maps prov:wasAssociatedWith → user attribution

AC10.2: Lineage Relationships

  • System maps prov:used → upstream data dependencies
  • System maps prov:generated → downstream data products
  • System maps prov:wasDerivedFrom → direct derivation relationships
  • System maps prov:wasGeneratedBy → activity-to-entity relationships
  • System maps prov:wasInfluencedBy → downstream influences
  • System preserves activity mediation in lineage edges
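
A sketch of walking these edges with rdflib's PROV namespace; the yielded tuple shape is illustrative:

```python
from rdflib import Graph, RDF
from rdflib.namespace import PROV


def extract_lineage(graph: Graph):
    """Yield (activity, upstream, downstream) tuples from PROV-O edges."""
    for activity in graph.subjects(RDF.type, PROV.Activity):
        upstreams = list(graph.objects(activity, PROV.used))        # AC10.2
        downstreams = list(graph.objects(activity, PROV.generated))
        for upstream in upstreams:
            for downstream in downstreams:
                # Activity mediation is preserved: each edge carries the DataJob
                yield activity, upstream, downstream
```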

AC10.3: Field-Level Lineage

  • System processes field-to-field mappings between datasets
  • System tracks data transformations at column level
  • System identifies unauthorized data flows
  • System supports complex ETL process documentation
  • System generates proper DataHub lineage URNs

Story 11: Schema Field Processing

As a data steward
I want to extract and map dataset schema fields
So that I can document my data structure

Acceptance Criteria

AC11.1: Field Detection

  • System detects fields referenced via dh:hasSchemaField
  • System detects custom field properties
  • System requires field name via dh:hasName, rdfs:label, or custom hasName
  • System validates field identification criteria

AC11.2: Field Properties

  • System maps dh:hasName → field path
  • System maps rdfs:label → field display name
  • System maps dh:hasDataType → field data type
  • System maps dh:isNullable → nullable constraint
  • System maps dh:hasGlossaryTerm → associated glossary terms
  • System maps rdfs:comment → field description

AC11.3: Data Type Mapping

  • System maps varchar, string → StringTypeClass
  • System maps date, datetime → DateTypeClass
  • System maps int, number, decimal → NumberTypeClass
  • System maps bool, boolean → BooleanTypeClass
  • System defaults to StringTypeClass for unknown types
  • System validates data type constraints
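
The type classes above are DataHub's real names under datahub.metadata.schema_classes; a sketch of the lookup, with the helper itself illustrative:

```python
from datahub.metadata.schema_classes import (
    BooleanTypeClass,
    DateTypeClass,
    NumberTypeClass,
    StringTypeClass,
)

# RDF-declared type string -> DataHub schema field type (AC11.3)
DATA_TYPE_MAP = {
    "varchar": StringTypeClass,
    "string": StringTypeClass,
    "date": DateTypeClass,
    "datetime": DateTypeClass,
    "int": NumberTypeClass,
    "number": NumberTypeClass,
    "decimal": NumberTypeClass,
    "bool": BooleanTypeClass,
    "boolean": BooleanTypeClass,
}


def map_data_type(raw: str):
    # Unknown types default to StringTypeClass, per the final rule above
    return DATA_TYPE_MAP.get(raw.lower(), StringTypeClass)()
```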

Experimental Features Stories

Story 12: Dynamic Routing

As a developer
I want to use SPARQL queries for dynamic entity detection
So that I can process any RDF pattern without hardcoded logic

Acceptance Criteria

AC12.1: Query-Based Detection

  • System executes SPARQL queries to extract entities with types
  • System routes processing based on entity_type field in results
  • System processes generically using appropriate handlers
  • System eliminates need for separate processing methods per entity type
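
A sketch of the routing loop using rdflib's SPARQL engine; the query and the handler registry are illustrative:

```python
from rdflib import Graph

# Illustrative registry query: every typed resource with its entity_type
ENTITY_QUERY = """
SELECT ?entity ?entity_type WHERE {
    ?entity a ?entity_type .
}
"""


def route_entities(graph: Graph, handlers: dict):
    """Dispatch each query result to the handler registered for its type."""
    for row in graph.query(ENTITY_QUERY):
        handler = handlers.get(str(row.entity_type))
        if handler is not None:
            handler(graph, row.entity)
```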

AC12.2: Query Registry

  • System maintains centralized SPARQL queries for each export target
  • System supports query customization for specialized use cases
  • System validates query syntax and execution
  • System provides query performance optimization

Technical Implementation Stories

Story 13: Streamlined Architecture (Simplified from Three-Phase)

As a developer
I want to implement clean separation of concerns with minimal abstraction
So that the system is maintainable, testable, and easy to understand

Acceptance Criteria

AC13.1: RDF to DataHub AST (Simplified)

  • System extracts entities directly from RDF graphs
  • System creates internal DataHubGraph representation
  • System extracts datasets, glossary terms, activities, properties
  • System handles various RDF patterns (SKOS, OWL, DCAT, PROV-O)
  • Extractors can return DataHub AST directly (no RDF AST layer required)
  • RDF AST layer is optional and used only when needed

AC13.2: DataHub AST to MCPs

  • System implements MCP builders for DataHub ingestion
  • System generates DataHub URNs with proper format
  • System converts RDF types to DataHub types
  • System prepares DataHub-specific metadata
  • System handles DataHub naming conventions
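
The MCP wrapper and aspect classes below are DataHub's real SDK names; the URN and field values are placeholders:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import GlossaryTermInfoClass

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:glossaryTerm:example_com.finance.accounts",  # placeholder
    aspect=GlossaryTermInfoClass(
        definition="Customer account balances",  # placeholder description
        termSource="EXTERNAL",  # imported from an external ontology
    ),
)
# The MCP can then be emitted to DataHub or written to a file (AC13.3)
```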

AC13.3: Output Strategy

  • System supports DataHub ingestion target
  • System supports pretty print output for debugging
  • System supports file export
  • System enables easy addition of new output formats

Story 14: Dependency Injection Framework

As a developer
I want to use dependency injection for modular architecture
So that components can be easily swapped and tested

Acceptance Criteria

AC14.1: RDF Loading (Simplified)

  • System implements load_rdf_graph() function for RDF source loading
  • System supports file, folder, and URL sources
  • System provides consistent API for loading RDF graphs
  • System enables easy addition of new source types
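
A sketch of what load_rdf_graph() could look like for the three source kinds; the dispatch logic is illustrative:

```python
from pathlib import Path

from rdflib import Graph
from rdflib.util import guess_format


def load_rdf_graph(source: str) -> Graph:
    """Load RDF from a single file, a folder of RDF files, or a URL (AC14.1)."""
    graph = Graph()
    path = Path(source)
    if path.is_dir():
        for child in sorted(path.rglob("*")):
            fmt = guess_format(str(child))
            if child.is_file() and fmt is not None:  # skip non-RDF files
                graph.parse(str(child), format=fmt)
    elif path.is_file():
        graph.parse(str(path), format=guess_format(source))
    else:
        graph.parse(source)  # rdflib fetches http(s) URLs directly
    return graph
```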

AC14.2: Query Factory

  • System implements QueryFactory for query processing
  • System supports SPARQLQuery, PassThroughQuery, FilterQuery
  • System provides QueryInterface for consistent API
  • System enables query customization and optimization

AC14.3: Target Factory

  • System implements TargetFactory for output targets
  • System supports DataHubTarget, PrettyPrintTarget, FileTarget
  • System provides TargetInterface for consistent API
  • System enables easy addition of new output formats
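
A sketch of the factory shape implied by AC14.3; the class names come from the list above, while the method is an assumption:

```python
from abc import ABC, abstractmethod
from typing import Iterable


class TargetInterface(ABC):
    @abstractmethod
    def write(self, records: Iterable) -> None:
        """Consume processed records (the method name is illustrative)."""


class PrettyPrintTarget(TargetInterface):
    def write(self, records: Iterable) -> None:
        for record in records:
            print(record)


class TargetFactory:
    _registry = {"pretty-print": PrettyPrintTarget}

    @classmethod
    def create(cls, name: str) -> TargetInterface:
        # New output formats register here without touching callers
        return cls._registry[name]()
```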

Story 15: Validation and Error Handling

As a developer
I want to implement comprehensive validation
So that the system provides clear error messages and graceful recovery

Acceptance Criteria

AC15.1: RDF Validation

  • System validates RDF syntax and structure
  • System reports specific parsing errors with line numbers
  • System validates namespace declarations
  • System handles malformed RDF gracefully

AC15.2: Entity Validation

  • System validates entity identification criteria
  • System validates property mappings and constraints
  • System validates relationship references
  • System reports validation errors with specific entity URIs

AC15.3: DataHub Validation

  • System validates DataHub URN format
  • System validates DataHub entity properties
  • System validates DataHub relationship constraints
  • System provides detailed error messages for DataHub API failures

AC15.4: Error Recovery

  • System continues processing after non-fatal errors
  • System logs all errors with appropriate severity levels
  • System provides rollback capabilities for failed operations
  • System supports retry mechanisms for transient failures

Implementation Notes

Technical Specifications

For detailed technical specifications including:

  • IRI-to-URN Conversion Algorithm: Complete algorithm with pseudocode
  • Relationship Mapping Tables: SKOS and PROV-O to DataHub mappings
  • Property Mapping Rules: Priority chains and fallback rules
  • Validation Rules: Comprehensive validation criteria
  • DataHub Integration: Complete entity type mappings

See: RDF Specification

Development Guidelines

  • User Stories: Focus on functional requirements and user value
  • Technical Specs: Reference the technical specifications document for implementation details
  • Testing: Each acceptance criterion should have corresponding test cases
  • Documentation: Keep user stories focused on "what" and "why", not "how"