RDF User Stories and Acceptance Criteria

Overview

This document provides detailed user stories with precise acceptance criteria for implementing RDF ingestion. Each story includes specific technical requirements, mapping rules, and validation criteria to ensure consistent implementation.

Status: This document has been updated to reflect current implementation status. Checked items [x] indicate completed features. Unchecked items [ ] indicate features not yet implemented or requiring verification.

Last Updated: December 2024

Implementation Status Summary

  • Core Glossary Management (Stories 1-8): ~95% complete
    • Format support: TTL, RDF/XML, JSON-LD (N-Triples pending)
    • Source support: File, folder (server sources pending)
    • Term detection, relationships, IRI-to-URN conversion: Complete
    • Domain management, glossary nodes, structured properties: Complete
    • CLI/API: Ingest command complete (list/delete commands pending)
  • Advanced Dataset and Lineage (Stories 9-11): 100% complete
    • Dataset processing, platform integration: Complete
    • Comprehensive lineage processing: Complete
    • Schema field processing: Complete
  • Experimental Features (Story 12): 100% complete
    • Dynamic routing with SPARQL queries: Complete
  • Technical Implementation (Stories 13-15): ~95% complete
    • Streamlined architecture: Complete (simplified from three-phase)
    • Dependency injection framework: Complete
    • Validation and error handling: Complete (rollback/retry pending)

Table of Contents

  1. Core Glossary Management Stories
  2. Advanced Dataset and Lineage Stories
  3. Experimental Features Stories
  4. Technical Implementation Stories

Core Glossary Management Stories

Story 1: RDF Glossary Ingestion

As a data steward
I want to ingest RDF glossaries from various sources and formats
So that I can import my existing ontology into DataHub without manual configuration

Acceptance Criteria

AC1.1: Format Support

  • System supports TTL (Turtle) format with proper namespace handling
  • System supports RDF/XML format with namespace preservation
  • System supports JSON-LD format with context handling
  • System supports N-Triples format with proper parsing
  • System validates RDF syntax and reports specific parsing errors
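
A minimal sketch of the parsing step, assuming rdflib as the underlying parser; the function name and error wrapping are illustrative, not the source's actual API:

```python
from rdflib import Graph
from rdflib.util import guess_format


def load_graph(path: str) -> Graph:
    """Parse one RDF file, guessing the serialization from its extension."""
    fmt = guess_format(path)  # "turtle", "xml", "json-ld", "nt", or None
    if fmt is None:
        raise ValueError(f"Unrecognized RDF file extension: {path}")
    graph = Graph()
    try:
        graph.parse(path, format=fmt)
    except Exception as exc:  # rdflib raises parser-specific exceptions
        # Report the file and the parser's message, per AC1.4
        raise ValueError(f"Failed to parse {path} as {fmt}: {exc}") from exc
    return graph
```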

AC1.2: Source Support

  • System handles single file sources (--source file.ttl)
  • System handles directory sources (--source /path/to/glossary/)
  • System handles server sources (--source http://sparql.endpoint.com)
  • System processes multiple files in directory recursively
  • System handles mixed-format directories (TTL + RDF/XML)

AC1.4: Error Handling

  • System provides detailed error messages for malformed RDF
  • System continues processing after encountering non-fatal errors
  • System logs all processing steps for debugging
  • System validates file permissions and accessibility

Story 2: Glossary Term Detection and Processing

As a data steward
I want to automatically detect glossary terms from RDF
So that I don't need to manually specify which resources are terms

Acceptance Criteria

AC2.1: Term Detection Criteria

  • System detects skos:Concept resources as glossary terms
  • System detects owl:Class resources as glossary terms
  • System detects owl:NamedIndividual resources as glossary terms
  • System detects custom class instances (any resource typed as instance of custom class)
  • System excludes owl:Ontology declarations from term detection
  • System requires terms to have labels (rdfs:label or skos:prefLabel, ≥3 characters)

AC2.2: Property Extraction

  • System extracts skos:prefLabel as primary name (preferred)
  • System falls back to rdfs:label if skos:prefLabel not available
  • System extracts skos:definition as primary description (preferred)
  • System falls back to rdfs:comment if skos:definition not available
  • System preserves language tags for multilingual support
  • System extracts custom properties and stores as metadata
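
Taken together, AC2.1 and AC2.2 amount to a type filter plus a preference chain. A sketch in rdflib terms, with illustrative helper names:

```python
from typing import Optional

from rdflib import Graph, RDF, RDFS, URIRef
from rdflib.namespace import OWL, SKOS

TERM_TYPES = (SKOS.Concept, OWL.Class, OWL.NamedIndividual)


def first_literal(graph: Graph, subject, *predicates) -> Optional[str]:
    """Return the first value found, honoring the stated preference order."""
    for predicate in predicates:
        value = graph.value(subject, predicate)
        if value is not None:
            return str(value)
    return None


def detect_terms(graph: Graph):
    for term_type in TERM_TYPES:
        for subject in graph.subjects(RDF.type, term_type):
            if not isinstance(subject, URIRef):
                continue  # AC2.3: blank nodes are not valid terms
            name = first_literal(graph, subject, SKOS.prefLabel, RDFS.label)
            if name is None or len(name) < 3:
                continue  # AC2.1: a label of at least 3 characters is required
            definition = first_literal(graph, subject, SKOS.definition, RDFS.comment)
            yield subject, name, definition
```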

AC2.3: Validation Rules

  • System validates that terms have valid URI references (not blank nodes)
  • System validates that labels are non-empty strings (≥3 characters)
  • System validates that definitions are non-empty strings
  • System reports validation errors with specific term URIs

Story 3: SKOS Relationship Mapping

As a data steward
I want to map SKOS relationships to DataHub glossary relationships
So that my glossary hierarchy is preserved in DataHub

Acceptance Criteria

AC3.1: Hierarchical Relationships

  • System maps skos:broader to DataHub parent relationships
  • System maps skos:narrower to DataHub child relationships
  • System maps skos:broadMatch and skos:narrowMatch to hierarchy relationships
  • System creates bidirectional relationships automatically
  • System validates no circular references in hierarchy

AC3.2: Associative Relationships

  • System maps skos:related to DataHub related terms
  • System maps skos:closeMatch to DataHub related terms
  • System preserves relationship directionality
  • System handles multiple related terms per term

AC3.3: External References

  • System maps skos:exactMatch to DataHub external references
  • System maps owl:sameAs to DataHub external references
  • System preserves external reference URIs
  • System validates external reference format
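
The three groups above reduce to a predicate lookup table. A sketch; the right-hand labels are illustrative relationship kinds, not literal DataHub aspect names:

```python
from rdflib.namespace import OWL, SKOS

# SKOS/OWL predicate -> relationship kind (AC3.1-AC3.3); labels illustrative
SKOS_RELATIONSHIP_MAP = {
    SKOS.broader: "parent",
    SKOS.narrower: "child",
    SKOS.broadMatch: "parent",
    SKOS.narrowMatch: "child",
    SKOS.related: "related",
    SKOS.closeMatch: "related",
    SKOS.exactMatch: "external-reference",
    OWL.sameAs: "external-reference",
}
```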

AC3.4: Relationship Validation

  • System validates that referenced terms exist in the glossary
  • System reports broken relationship references
  • System handles missing referenced terms gracefully

Story 4: IRI-to-URN Conversion

As a data steward
I want to convert RDF IRIs to DataHub URNs
So that my glossary terms have proper DataHub identifiers

Acceptance Criteria

AC4.1: IRI Processing

  • System processes HTTP/HTTPS IRIs by removing scheme and preserving path structure
  • System processes custom scheme IRIs by splitting on first : character
  • System handles various scheme formats (http://, https://, ftp://, custom:)
  • System preserves fragments as part of path structure
  • System handles empty path segments gracefully

AC4.2: URN Generation

  • System generates DataHub-compliant URNs for all entity types
  • System preserves original case and structure from IRI
  • System validates URN format compliance
  • System handles edge cases and error conditions
  • System follows consistent URN generation algorithm
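
The complete algorithm lives in the technical specification; a simplified sketch of the IRI-processing rules in AC4.1, with an illustrative helper name:

```python
from urllib.parse import urlparse


def iri_to_path(iri: str) -> str:
    """Reduce an IRI to the path structure used when building URNs (AC4.1)."""
    if iri.startswith(("http://", "https://")):
        parsed = urlparse(iri)
        segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s]
        if parsed.fragment:  # fragments become part of the path structure
            segments.append(parsed.fragment)
        return "/".join(segments)
    # Custom schemes: split on the first ":" and keep the remainder
    scheme, _, rest = iri.partition(":")
    if not scheme or not rest:
        raise ValueError(f"Cannot derive a URN path from IRI: {iri}")
    return rest


# e.g. iri_to_path("https://example.com/finance/accounts")
#   -> "example.com/finance/accounts"
```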

AC4.3: Validation and Error Handling

  • System validates IRI format and scheme requirements
  • System provides detailed error messages for invalid IRIs
  • System handles malformed IRIs gracefully
  • System reports specific validation failures

Story 5: Domain Management

As a data steward
I want to automatically create DataHub domains from IRI hierarchy
So that my glossary terms are organized in DataHub

Acceptance Criteria

AC5.1: Domain Hierarchy Creation

  • System creates domains for parent segments only (excludes term name)
  • System creates urn:li:domain:example_com for https://example.com/finance/accounts
  • System creates urn:li:domain:finance for https://example.com/finance/accounts
  • System assigns dataset accounts to urn:li:domain:finance
  • System handles deep hierarchies correctly

AC5.2: Domain Naming Convention

  • System converts example.com → urn:li:domain:example_com
  • System converts finance → urn:li:domain:finance
  • System converts loan-trading → urn:li:domain:loan_trading
  • System preserves original segment names for display
  • System validates domain URN format
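
A sketch matching the examples in AC5.1 and AC5.2; the helper name is illustrative, and AC5.3's leaf-domain assignment falls out of the last list element:

```python
from typing import List
from urllib.parse import urlparse


def domain_urns_for(iri: str) -> List[str]:
    """Derive a domain URN for every parent segment of an IRI (AC5.1/AC5.2)."""
    parsed = urlparse(iri)
    # Parent segments only: the host plus all path segments except the term name
    segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s][:-1]
    return [
        "urn:li:domain:" + segment.replace(".", "_").replace("-", "_")
        for segment in segments
    ]


# domain_urns_for("https://example.com/finance/accounts")
#   -> ["urn:li:domain:example_com", "urn:li:domain:finance"]
# "accounts" is then assigned to the leaf domain, urn:li:domain:finance (AC5.3)
```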

AC5.3: Domain Assignment

  • System assigns glossary terms to leaf domain (most specific parent)
  • System creates parent-child relationships between domains
  • System handles shared domains correctly
  • System validates domain assignment logic

Story 6: Glossary Node Support

As a data steward
I want to process SKOS concept schemes and collections
So that I can organize my glossary terms in DataHub

Acceptance Criteria

AC6.1: Concept Scheme Processing

  • System detects skos:ConceptScheme resources as glossary nodes
  • System maps skos:prefLabel → DataHub glossary node name
  • System maps skos:definition → DataHub glossary node description
  • System creates proper DataHub GlossaryNode entities
  • System generates URNs for concept schemes

AC6.2: Collection Processing

  • System detects skos:Collection resources as glossary nodes
  • System processes collection metadata (labels, descriptions)
  • System handles collection membership relationships
  • System creates DataHub glossary nodes for collections

AC6.3: Node Relationships

  • System maps skos:broader relationships for nodes
  • System creates parent-child relationships between nodes
  • System links terms to their containing nodes
  • System validates node hierarchy consistency

Story 7: Structured Properties Support

As a data steward
I want to attach structured properties to glossary terms
So that I can add domain-specific metadata

Acceptance Criteria

AC7.1: Property Detection

  • System detects rdf:Property declarations with rdfs:domain
  • System maps rdfs:domain to appropriate DataHub entity types
  • System extracts rdfs:label as property name
  • System extracts rdfs:comment as property description
  • System identifies enum values from rdfs:range class instances

AC7.2: Entity Type Mapping

  • System maps dcat:Dataset domain → dataset entity type
  • System maps skos:Concept domain → glossaryTerm entity type
  • System maps schema:Person domain → user entity type
  • System maps schema:Organization domain → corpGroup entity type
  • System handles multiple domains per property
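
These rules reduce to a small lookup table. A sketch; the schema.org namespace binding is an assumption, and "corpuser" is DataHub's identifier for the user entity type:

```python
from rdflib import Namespace
from rdflib.namespace import DCAT, SKOS

SCHEMA = Namespace("http://schema.org/")  # assumed binding for schema:*

# rdfs:domain class -> DataHub entity type (AC7.2)
DOMAIN_TO_ENTITY_TYPE = {
    DCAT.Dataset: "dataset",
    SKOS.Concept: "glossaryTerm",
    SCHEMA.Person: "corpuser",        # the "user" entity type
    SCHEMA.Organization: "corpGroup",
}
```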

AC7.3: Property Application

  • System applies structured properties to appropriate entities
  • System validates property values against allowed values
  • System creates DataHub structured property definitions
  • System generates proper URNs for structured properties

Story 8: CLI and API Interface

As a developer
I want to use CLI commands and Python API
So that I can integrate RDF into my workflows

Acceptance Criteria

AC8.1: CLI Commands

  • System provides ingest command with --source, --export, --server, --token options
  • System provides list command to show existing glossary items
  • System provides delete command to remove glossary terms/domains
  • System supports --dry-run flag for safe testing
  • System provides comprehensive help and usage examples

AC8.2: Python API

  • System provides DataHubClient class for API interactions
  • System provides OntologyToDataHub class for processing
  • System supports both dry run and live execution modes
  • System provides clear error handling and logging
  • System includes comprehensive API documentation
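
A hedged usage sketch of the two classes named above; the import path, constructor arguments, and method names are illustrative assumptions, not the documented signatures:

```python
# Illustrative only; consult the API documentation for real signatures.
from datahub.ingestion.source.rdf import DataHubClient, OntologyToDataHub  # hypothetical path

client = DataHubClient(server="http://localhost:8080", token="<token>")  # assumed args
processor = OntologyToDataHub(client)

processor.process("glossary.ttl", dry_run=True)   # dry run first (AC8.2)
processor.process("glossary.ttl", dry_run=False)  # then live execution
```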

AC8.3: Export Targets

  • System supports entities target (datasets, glossary terms, properties)
  • System supports links target (relationships, associations)
  • System supports lineage target (lineage activities and relationships)
  • System supports all target (comprehensive export)

Advanced Dataset and Lineage Stories

Story 9: Dataset Processing

As a data steward
I want to process RDF datasets with platform integration
So that I can manage my data assets in DataHub

Acceptance Criteria

AC9.1: Dataset Detection

  • System detects void:Dataset resources as datasets
  • System detects dcterms:Dataset resources as datasets
  • System detects schema:Dataset resources as datasets
  • System detects dh:Dataset resources as datasets
  • System validates dataset metadata requirements

AC9.2: Dataset Properties

  • System maps dcterms:title → dataset name (preferred)
  • System falls back to schema:name → dataset name
  • System falls back to rdfs:label → dataset name
  • System maps dcterms:description → dataset description
  • System maps dcterms:creator → dataset ownership
  • System maps dcterms:created → creation timestamp
  • System maps dcterms:modified → modification timestamp

AC9.3: Platform Integration

  • System maps dcat:accessService → platform identifier (preferred)
  • System maps schema:provider → platform identifier
  • System maps void:sparqlEndpoint → SPARQL platform
  • System maps void:dataDump → file platform
  • System extracts platform information from service URIs
  • System validates platform connection configurations
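
A sketch of the preference chain in AC9.3; the namespaces are bound explicitly, predicate order encodes the stated preference, and the URI-to-platform derivation is simplified:

```python
from typing import Optional

from rdflib import Graph, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")
SCHEMA = Namespace("http://schema.org/")
VOID = Namespace("http://rdfs.org/ns/void#")


def platform_for(graph: Graph, dataset) -> Optional[str]:
    """Resolve a platform identifier, most-preferred source first (AC9.3)."""
    for predicate, fixed_platform in (
        (DCAT.accessService, None),    # platform derived from the service URI
        (SCHEMA.provider, None),
        (VOID.sparqlEndpoint, "sparql"),
        (VOID.dataDump, "file"),
    ):
        value = graph.value(dataset, predicate)
        if value is not None:
            return fixed_platform or str(value)
    return None
```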

Story 10: Comprehensive Lineage Processing

As a data steward
I want to process PROV-O lineage relationships
So that I can track data flow and dependencies

Acceptance Criteria

AC10.1: Activity Processing

  • System detects prov:Activity resources as DataHub DataJobs
  • System maps rdfs:label → activity name
  • System maps dcterms:description → activity description
  • System maps prov:startedAtTime → activity start time
  • System maps prov:endedAtTime → activity end time
  • System maps prov:wasAssociatedWith → user attribution

AC10.2: Lineage Relationships

  • System maps prov:used → upstream data dependencies
  • System maps prov:generated → downstream data products
  • System maps prov:wasDerivedFrom → direct derivation relationships
  • System maps prov:wasGeneratedBy → activity-to-entity relationships
  • System maps prov:wasInfluencedBy → downstream influences
  • System preserves activity mediation in lineage edges
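
A sketch of walking these edges with rdflib's PROV namespace; the yielded tuple shape is illustrative:

```python
from rdflib import Graph, RDF
from rdflib.namespace import PROV


def extract_lineage(graph: Graph):
    """Yield (activity, upstream, downstream) tuples from PROV-O edges."""
    for activity in graph.subjects(RDF.type, PROV.Activity):
        upstreams = list(graph.objects(activity, PROV.used))        # AC10.2
        downstreams = list(graph.objects(activity, PROV.generated))
        for upstream in upstreams:
            for downstream in downstreams:
                # Activity mediation is preserved: each edge carries the DataJob
                yield activity, upstream, downstream
```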

AC10.3: Field-Level Lineage

  • System processes field-to-field mappings between datasets
  • System tracks data transformations at column level
  • System identifies unauthorized data flows
  • System supports complex ETL process documentation
  • System generates proper DataHub lineage URNs

Story 11: Schema Field Processing

As a data steward
I want to extract and map dataset schema fields
So that I can document my data structure

Acceptance Criteria

AC11.1: Field Detection

  • System detects fields referenced via dh:hasSchemaField
  • System detects custom field properties
  • System requires field name via dh:hasName, rdfs:label, or custom hasName
  • System validates field identification criteria

AC11.2: Field Properties

  • System maps dh:hasName → field path
  • System maps rdfs:label → field display name
  • System maps dh:hasDataType → field data type
  • System maps dh:isNullable → nullable constraint
  • System maps dh:hasGlossaryTerm → associated glossary terms
  • System maps rdfs:comment → field description

AC11.3: Data Type Mapping

  • System maps varchar, string → StringTypeClass
  • System maps date, datetime → DateTypeClass
  • System maps int, number, decimal → NumberTypeClass
  • System maps bool, boolean → BooleanTypeClass
  • System defaults to StringTypeClass for unknown types
  • System validates data type constraints
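
The type classes above are DataHub's real names under datahub.metadata.schema_classes; a sketch of the lookup, with the helper itself illustrative:

```python
from datahub.metadata.schema_classes import (
    BooleanTypeClass,
    DateTypeClass,
    NumberTypeClass,
    StringTypeClass,
)

# RDF-declared type string -> DataHub schema field type (AC11.3)
DATA_TYPE_MAP = {
    "varchar": StringTypeClass,
    "string": StringTypeClass,
    "date": DateTypeClass,
    "datetime": DateTypeClass,
    "int": NumberTypeClass,
    "number": NumberTypeClass,
    "decimal": NumberTypeClass,
    "bool": BooleanTypeClass,
    "boolean": BooleanTypeClass,
}


def map_data_type(raw: str):
    # Unknown types default to StringTypeClass, per the final rule above
    return DATA_TYPE_MAP.get(raw.lower(), StringTypeClass)()
```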

Experimental Features Stories

Story 12: Dynamic Routing

As a developer
I want to use SPARQL queries for dynamic entity detection
So that I can process any RDF pattern without hardcoded logic

Acceptance Criteria

AC12.1: Query-Based Detection

  • System executes SPARQL queries to extract entities with types
  • System routes processing based on entity_type field in results
  • System processes generically using appropriate handlers
  • System eliminates need for separate processing methods per entity type
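
A sketch of the routing loop using rdflib's SPARQL engine; the query and the handler registry are illustrative:

```python
from rdflib import Graph

# Illustrative registry query: every typed resource with its entity_type
ENTITY_QUERY = """
SELECT ?entity ?entity_type WHERE {
    ?entity a ?entity_type .
}
"""


def route_entities(graph: Graph, handlers: dict):
    """Dispatch each query result to the handler registered for its type."""
    for row in graph.query(ENTITY_QUERY):
        handler = handlers.get(str(row.entity_type))
        if handler is not None:
            handler(graph, row.entity)
```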

AC12.2: Query Registry

  • System maintains centralized SPARQL queries for each export target
  • System supports query customization for specialized use cases
  • System validates query syntax and execution
  • System provides query performance optimization

Technical Implementation Stories

Story 13: Streamlined Architecture (Simplified from Three-Phase)

As a developer
I want to implement clean separation of concerns with minimal abstraction
So that the system is maintainable, testable, and easy to understand

Acceptance Criteria

AC13.1: RDF to DataHub AST (Simplified)

  • System extracts entities directly from RDF graphs
  • System creates internal DataHubGraph representation
  • System extracts datasets, glossary terms, activities, properties
  • System handles various RDF patterns (SKOS, OWL, DCAT, PROV-O)
  • Extractors can return DataHub AST directly (no RDF AST layer required)
  • RDF AST layer is optional and used only when needed

AC13.2: DataHub AST to MCPs

  • System implements MCP builders for DataHub ingestion
  • System generates DataHub URNs with proper format
  • System converts RDF types to DataHub types
  • System prepares DataHub-specific metadata
  • System handles DataHub naming conventions
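
The MCP wrapper and aspect classes below are DataHub's real SDK names; the URN and field values are placeholders:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import GlossaryTermInfoClass

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:glossaryTerm:example_com.finance.accounts",  # placeholder
    aspect=GlossaryTermInfoClass(
        definition="Customer account balances",  # placeholder description
        termSource="EXTERNAL",  # imported from an external ontology
    ),
)
# The MCP can then be emitted to DataHub or written to a file (AC13.3)
```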

AC13.3: Output Strategy

  • System supports DataHub ingestion target
  • System supports pretty print output for debugging
  • System supports file export
  • System enables easy addition of new output formats

Story 14: Dependency Injection Framework

As a developer
I want to use dependency injection for modular architecture
So that components can be easily swapped and tested

Acceptance Criteria

AC14.1: RDF Loading (Simplified)

  • System implements load_rdf_graph() function for RDF source loading
  • System supports file, folder, and URL sources
  • System provides consistent API for loading RDF graphs
  • System enables easy addition of new source types
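
A sketch of what load_rdf_graph() could look like for the three source kinds; the dispatch logic is illustrative:

```python
from pathlib import Path

from rdflib import Graph
from rdflib.util import guess_format


def load_rdf_graph(source: str) -> Graph:
    """Load RDF from a single file, a folder of RDF files, or a URL (AC14.1)."""
    graph = Graph()
    path = Path(source)
    if path.is_dir():
        for child in sorted(path.rglob("*")):
            fmt = guess_format(str(child))
            if child.is_file() and fmt is not None:  # skip non-RDF files
                graph.parse(str(child), format=fmt)
    elif path.is_file():
        graph.parse(str(path), format=guess_format(source))
    else:
        graph.parse(source)  # rdflib fetches http(s) URLs directly
    return graph
```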

AC14.2: Query Factory

  • System implements QueryFactory for query processing
  • System supports SPARQLQuery, PassThroughQuery, FilterQuery
  • System provides QueryInterface for consistent API
  • System enables query customization and optimization

AC14.3: Target Factory

  • System implements TargetFactory for output targets
  • System supports DataHubTarget, PrettyPrintTarget, FileTarget
  • System provides TargetInterface for consistent API
  • System enables easy addition of new output formats
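
A sketch of the factory shape implied by AC14.3; the class names come from the list above, while the method is an assumption:

```python
from abc import ABC, abstractmethod
from typing import Iterable


class TargetInterface(ABC):
    @abstractmethod
    def write(self, records: Iterable) -> None:
        """Consume processed records (the method name is illustrative)."""


class PrettyPrintTarget(TargetInterface):
    def write(self, records: Iterable) -> None:
        for record in records:
            print(record)


class TargetFactory:
    _registry = {"pretty-print": PrettyPrintTarget}

    @classmethod
    def create(cls, name: str) -> TargetInterface:
        # New output formats register here without touching callers
        return cls._registry[name]()
```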

Story 15: Validation and Error Handling

As a developer
I want to implement comprehensive validation
So that the system provides clear error messages and graceful recovery

Acceptance Criteria

AC15.1: RDF Validation

  • System validates RDF syntax and structure
  • System reports specific parsing errors with line numbers
  • System validates namespace declarations
  • System handles malformed RDF gracefully

AC15.2: Entity Validation

  • System validates entity identification criteria
  • System validates property mappings and constraints
  • System validates relationship references
  • System reports validation errors with specific entity URIs

AC15.3: DataHub Validation

  • System validates DataHub URN format
  • System validates DataHub entity properties
  • System validates DataHub relationship constraints
  • System provides detailed error messages for DataHub API failures

AC15.4: Error Recovery

  • System continues processing after non-fatal errors
  • System logs all errors with appropriate severity levels
  • System provides rollback capabilities for failed operations
  • System supports retry mechanisms for transient failures

Implementation Notes

Technical Specifications

For detailed technical specifications including:

  • IRI-to-URN Conversion Algorithm: Complete algorithm with pseudocode
  • Relationship Mapping Tables: SKOS and PROV-O to DataHub mappings
  • Property Mapping Rules: Priority chains and fallback rules
  • Validation Rules: Comprehensive validation criteria
  • DataHub Integration: Complete entity type mappings

See: RDF Specification

Development Guidelines

  • User Stories: Focus on functional requirements and user value
  • Technical Specs: Reference the technical specifications document for implementation details
  • Testing: Each acceptance criterion should have corresponding test cases
  • Documentation: Keep user stories focused on "what" and "why", not "how"