docs/developers/java-sdk-v2-design.md
This document describes the design of DataHub Java SDK V2, a modern, user-friendly Java client library that provides feature parity with the Python SDK V2. The new SDK addresses feedback from enterprise Java customers who want a first-class SDK experience on par with what Python developers already have.
This document is organized into two main sections:
Why Hand-Crafted? For a deep dive into why we chose to hand-craft this SDK instead of using OpenAPI code generation, see Java SDK V2 Philosophy.
Currently, DataHub's Java SDK (datahub-client) provides only low-level emission capabilities:
This gap has created friction for enterprise customers, particularly Java shops that feel like "second-class citizens" compared to Python developers.
- `datahub.client.v2.*` namespace for new APIs

This section describes the public API that SDK users interact with - the patterns, behaviors, and interfaces that define the developer experience.
Intuitive entity construction through method chaining:
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .description("My dataset")
    .build();

// Fluent metadata operations with type-safe method chaining
dataset.addTag("pii")
    .addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER)
    .setDomain("urn:li:domain:Analytics")
    .setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5);

client.entities().upsert(dataset);
```
Leverage Java's strong typing:
- Typed URN classes (`DatasetUrn`, `ChartUrn`, etc.)

SDK Mode vs INGESTION Mode for proper separation of concerns:
In SDK mode, user edits are routed to editable aspects (e.g., `editableDatasetProperties`); in INGESTION mode, writes go to system aspects (e.g., `datasetProperties`):

```java
// SDK mode - user edits go to editable aspects
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .mode(OperationMode.SDK) // Default
    .build();

// INGESTION mode - pipeline writes go to system aspects
DataHubClientV2 ingestionClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .mode(OperationMode.INGESTION)
    .build();
```
Design Decision: Prioritize patches over full aspect replacement
The SDK V2 is designed around patch-based operations because they represent the most common and intuitive way to make metadata changes:
```java
Dataset dataset = client.entities().get(datasetUrn);
Dataset mutable = dataset.mutable(); // Get mutable copy

// These create patches internally - no server calls yet
mutable.addTag("pii")
    .addTag("sensitive")
    .addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER);

// Single call emits all accumulated patches atomically
client.entities().update(mutable);
```
Why patches?
When to use low-level SDK:
If you need to completely replace an aspect (full PUT/upsert semantics), use the V1 SDK's RestEmitter directly with MetadataChangeProposalWrapper. The V2 SDK focuses on making common operations simple, not exposing every low-level primitive.
Shared metadata operations via type-safe mixin interfaces:
- `HasTags<T>` - Add, remove, set tags
- `HasOwners<T>` - Manage ownership
- `HasGlossaryTerms<T>` - Associate glossary terms
- `DomainOperations<T>` - Domain assignment
- `HasContainer<T>` - Parent-child hierarchies

All mixins use the CRTP pattern for type-safe method chaining that returns the concrete entity type.
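The pattern can be illustrated with a minimal, self-contained sketch. The `HasLabels` interface and `Doc` class below are invented for the demo (they are not SDK types); the mechanics mirror the mixin design described above:

```java
import java.util.ArrayList;
import java.util.List;

// CRTP mixin: T is the implementing class, so chaining keeps the concrete type.
interface HasLabels<T extends HasLabels<T>> {
    List<String> labels();

    @SuppressWarnings("unchecked")
    default T addLabel(String label) {
        labels().add(label);
        return (T) this; // safe as long as T is the implementing class
    }
}

class Doc implements HasLabels<Doc> {
    private final List<String> labels = new ArrayList<>();
    private String owner;

    public List<String> labels() { return labels; }

    // Entity-specific method: reachable after a mixin call because addLabel returns Doc
    public Doc setOwner(String owner) { this.owner = owner; return this; }
    public String owner() { return owner; }
}

public class CrtpDemo {
    public static void main(String[] args) {
        // addLabel returns Doc, so the entity-specific setOwner can follow in the chain
        Doc doc = new Doc().addLabel("pii").setOwner("jdoe");
        System.out.println(doc.labels() + " " + doc.owner()); // prints "[pii] jdoe"
    }
}
```

Without the recursive bound, `addLabel` would have to return the interface type and the chain would lose access to `setOwner`.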
```
datahub-client/
├── src/main/java/
│   ├── datahub/client/                      # Existing v1 (unchanged)
│   │   ├── Emitter.java
│   │   ├── rest/RestEmitter.java
│   │   └── ...
│   │
│   └── datahub/client/v2/                   # New v2 namespace
│       ├── DataHubClientV2.java             # Main client entry point
│       │
│       ├── entity/                          # Entity classes
│       │   ├── Entity.java                  # Base entity class (490 lines)
│       │   ├── AspectCache.java             # Unified cache with dirty tracking (184 lines)
│       │   ├── CachedAspect.java            # Aspect wrapper with metadata (68 lines)
│       │   ├── AspectSource.java            # SERVER vs LOCAL enum (23 lines)
│       │   ├── ReadMode.java                # ALLOW_DIRTY vs SERVER_ONLY (28 lines)
│       │   ├── Dataset.java                 # Dataset entity (564 lines)
│       │   ├── Chart.java                   # Chart entity (587 lines)
│       │   ├── Dashboard.java               # Dashboard entity (671 lines)
│       │   ├── DataJob.java                 # DataJob entity (597 lines)
│       │   ├── DataFlow.java                # DataFlow entity (467 lines)
│       │   ├── Container.java               # Container entity (500 lines)
│       │   ├── MLModel.java                 # ML Model entity NEW
│       │   ├── MLModelGroup.java            # ML Model Group entity NEW
│       │   ├── HasTags.java                 # Tag operations mixin
│       │   ├── HasOwners.java               # Ownership operations mixin
│       │   ├── HasGlossaryTerms.java        # Terms operations mixin
│       │   ├── HasDomains.java              # Domain operations mixin
│       │   ├── HasContainer.java            # Container hierarchy mixin
│       │   └── HasStructuredProperties.java # Structured properties mixin
│       │
│       ├── operations/                      # CRUD operation clients
│       │   └── EntityClient.java            # Entity CRUD operations (570 lines)
│       │
│       └── config/                          # Configuration
│           └── DataHubClientConfigV2.java   # Config with mode support
│
└── src/test/java/                           # Tests mirror structure
    └── datahub/client/v2/
        ├── DataHubClientV2Test.java         # Client tests
        ├── entity/                          # 378 unit tests
        │   ├── AspectCacheTest.java         # 30 tests (cache infrastructure)
        │   ├── CachedAspectTest.java        # 13 tests (cache infrastructure)
        │   ├── DatasetTest.java             # 37 tests
        │   ├── ChartTest.java               # 43 tests
        │   ├── DashboardTest.java           # 52 tests
        │   ├── DataJobTest.java             # 45 tests
        │   ├── DataFlowTest.java            # 40 tests
        │   ├── ContainerTest.java           # 40 tests
        │   ├── MLModelTest.java             # 44 tests
        │   └── MLModelGroupTest.java        # 38 tests
        └── integration/                     # 79 integration tests
            ├── DatasetIntegrationTest.java
            ├── ChartIntegrationTest.java
            ├── DashboardIntegrationTest.java
            ├── DataJobIntegrationTest.java
            ├── DataFlowIntegrationTest.java
            ├── ContainerIntegrationTest.java
            ├── MLModelIntegrationTest.java
            └── MLModelGroupIntegrationTest.java
```
Key Design Decisions:
- No separate `patch/` package - patches accumulate internally within entities
- Mixins live in the `entity/` package, using the CRTP pattern for type safety

File: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java` (266 lines)
```java
package datahub.client.v2;

/**
 * Main entry point for DataHub Java SDK V2.
 * Provides high-level operations for entity management with mode-aware behavior.
 *
 * <p>Example usage:
 * <pre>
 * DataHubClientV2 client = DataHubClientV2.builder()
 *     .server("http://localhost:8080")
 *     .token("my-token")
 *     .mode(OperationMode.SDK) // SDK or INGESTION mode
 *     .build();
 *
 * Dataset dataset = Dataset.builder()
 *     .platform("snowflake")
 *     .name("my_table")
 *     .env("PROD")
 *     .description("My dataset")
 *     .build();
 *
 * client.entities().upsert(dataset);
 * </pre>
 */
public class DataHubClientV2 implements AutoCloseable {
  private final RestEmitter emitter;
  private final DataHubClientConfigV2 config;
  private final EntityClient entityClient;

  // Builder for client configuration
  public static Builder builder() { ... }

  // Entity operations
  public EntityClient entities() { return entityClient; }

  // Low-level emitter access (for advanced users)
  public RestEmitter emitter() { return emitter; }

  // Configuration access
  public DataHubClientConfigV2 config() { return config; }

  @Override
  public void close() throws IOException { ... }

  public static class Builder {
    public Builder server(String serverUrl) { ... }
    public Builder token(String token) { ... }
    public Builder timeout(int timeoutMs) { ... }
    public Builder mode(OperationMode mode) { ... } // NEW
    public Builder config(DataHubClientConfigV2 config) { ... }
    public DataHubClientV2 build() { ... }
  }
}
```
Design Features:
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines)
The Entity base class provides a unified interface for all DataHub entities. From a user perspective, all entities support:
Public API Methods:
```java
// URN access
public Urn getUrn()
public abstract String getEntityType()

// Convert to MCPs for emission (primarily internal)
public List<MetadataChangeProposalWrapper> toMCPs()
```
Entity Construction:
Entities are constructed via fluent builders:
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .description("My dataset")
    .build();
```
Fluent Metadata Operations:
All entities support method chaining for metadata operations (via mixin interfaces):
```java
dataset.addTag("pii")
    .addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
    .setDomain(domainUrn)
    .addTerm(termUrn);
```
Lazy Loading:
Entities loaded from the server fetch aspects on-demand:
```java
Dataset dataset = client.entities().get(datasetUrn); // Only URN loaded
String description = dataset.getDescription();       // Aspect fetched now
List<String> tags = dataset.getTags();               // Another aspect fetch
```
Patch Accumulation:
Metadata operations create patches that accumulate until save:
```java
Dataset dataset = client.entities().get(datasetUrn);
Dataset mutable = dataset.mutable(); // Get mutable copy
mutable.addTag("pii");               // Creates patch (not sent yet)
mutable.addTag("sensitive");         // Another patch (not sent yet)
client.entities().update(mutable);   // Emits all patches atomically
```
Immutability-by-Default:
Entities fetched from the server are read-only to prevent accidental mutations:
```java
Dataset dataset = client.entities().get(datasetUrn);
dataset.isReadOnly(); // true
dataset.isMutable();  // false

// Attempting mutation throws ReadOnlyEntityException
// dataset.addTag("pii"); // ERROR!

// Get mutable copy for updates
Dataset mutable = dataset.mutable();
mutable.isMutable();  // true
mutable.addTag("pii"); // Works
client.entities().upsert(mutable);
```
Entity Lifecycle:
Builder-created entities - Mutable from creation
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
dataset.isMutable(); // true - can mutate immediately
```
Server-fetched entities - Immutable by default
```java
Dataset dataset = client.entities().get(urn);
dataset.isReadOnly(); // true - must call .mutable()
```
Mutable copies - Created via .mutable()
```java
Dataset mutable = dataset.mutable();
mutable.isMutable(); // true - can mutate
```
The .mutable() method:
Why immutability-by-default?
See "Developer-Facing Implementation Design" section below for internal architecture details.
The SDK V2 implements 8 entity types with full metadata support:
Data Entities: `Dataset`, `Container`
Pipeline Entities: `DataFlow`, `DataJob`
Visualization Entities: `Chart`, `Dashboard`
ML Entities: `MLModel`, `MLModelGroup`
Common Entity Operations:
All entities support these fluent operations (via mixin interfaces):
```java
// Tags
entity.addTag("pii")
    .removeTag("deprecated")
    .setTags(Arrays.asList("tag1", "tag2"))
    .clearTags();

// Owners
entity.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
    .removeOwner(ownerUrn)
    .setOwners(ownerList)
    .clearOwners();

// Glossary Terms
entity.addTerm(termUrn)
    .removeTerm(termUrn)
    .setTerms(termList)
    .clearTerms();

// Domains
entity.setDomain(domainUrn)
    .removeDomain(domainUrn)
    .clearDomains();

// Container (for hierarchical entities)
entity.setContainer(containerUrn)
    .clearContainer();

// Structured Properties (custom typed metadata)
entity.setStructuredProperty("io.acryl.dataManagement.replicationSLA", "24h")
    .setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5)
    .setStructuredProperty("io.acryl.dataManagement.certifications",
        Arrays.asList("SOC2", "HIPAA", "GDPR"))
    .setStructuredProperty("io.acryl.privacy.retentionDays", 90, 180, 365)
    .removeStructuredProperty("io.acryl.dataManagement.deprecated");
```
Entity-Specific Documentation:
See comprehensive guides in metadata-integration/java/docs/sdk-v2/:
- `dataset-entity.md` - Dataset with schema support
- `chart-entity.md` - Chart with lineage
- `dashboard-entity.md` - Dashboard with chart relationships
- `container-entity.md` - Container hierarchies
- `dataflow-entity.md` - DataFlow pipelines
- `datajob-entity.md` - DataJob with inlet/outlet lineage
- `mlmodel-entity.md` - MLModel with metrics
- `mlmodelgroup-entity.md` - MLModelGroup with versions

File: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java` (570 lines)
```java
package datahub.client.v2.operations;

/**
 * Client for entity CRUD operations.
 * Provides create, read, update, and upsert operations.
 */
public class EntityClient {
  private final RestEmitter emitter;
  private final DataHubClientConfigV2 config;

  /**
   * Create a new entity (convenience method - same as upsert).
   */
  public <T extends Entity> void create(T entity) throws IOException, ExecutionException, InterruptedException {
    upsert(entity);
  }

  /**
   * Upsert an entity (create or update).
   * Emits all aspects and accumulated patches.
   */
  public <T extends Entity> void upsert(T entity) throws IOException, ExecutionException, InterruptedException {
    List<MetadataChangeProposalWrapper> mcps = entity.toMCPs();
    // Emit all MCPs asynchronously and wait for completion
    // ...
  }

  /**
   * Update an existing entity.
   * Emits only accumulated patches (not full aspects).
   */
  public <T extends Entity> void update(T entity) throws IOException, ExecutionException, InterruptedException {
    // Emit only pending patches
    // ...
  }

  /**
   * Get an entity by URN.
   * Returns entity with lazy-loaded aspects.
   */
  public <T extends Entity> T get(Urn urn, Class<T> entityClass) throws IOException {
    // Fetch entity aspects from server
    // Construct entity with lazy loading support
    // ...
  }

  // Note: delete(Urn) and exists(Urn) operations deferred to future releases
}
```
Supported Operations:
- `create()` - Create new entities (wrapper for upsert)
- `upsert()` - Create or update entities (emits all aspects + patches)
- `update()` - Update existing entities (emits only patches)
- `get()` - Retrieve entities with lazy loading
- `delete()` and `exists()` - Deferred to future releases

Patch Behavior:
Patches are accumulated inside entities during metadata operations and emitted automatically during upsert()/update():
```java
Dataset dataset = client.entities().get(datasetUrn);
Dataset mutable = dataset.mutable(); // Get mutable copy
mutable.addTag("pii");               // Creates internal patch
mutable.addTag("sensitive");         // Creates another internal patch
client.entities().update(mutable);   // Emits both patches atomically
```
There is no separate patch() method - patches are managed internally by entities.
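The accumulate-then-drain flow can be modeled with a stand-alone sketch. `SketchEntity` and its string-encoded patches are invented for illustration; the real SDK accumulates `MetadataChangeProposal` patches and emits them through the client:

```java
import java.util.ArrayList;
import java.util.List;

// Invented class: models "record patches locally, emit them all on update()".
class SketchEntity {
    private final List<String> pendingPatches = new ArrayList<>();

    SketchEntity addTag(String tag) {
        pendingPatches.add("ADD /globalTags " + tag); // recorded locally, nothing emitted
        return this;
    }

    int pendingCount() { return pendingPatches.size(); }

    /** Stand-in for what update()/upsert() does: emit everything at once, then clear. */
    List<String> drainPatches() {
        List<String> emitted = new ArrayList<>(pendingPatches);
        pendingPatches.clear();
        return emitted;
    }
}

public class PatchAccumulationSketch {
    public static void main(String[] args) {
        SketchEntity entity = new SketchEntity().addTag("pii").addTag("sensitive");
        System.out.println(entity.pendingCount());    // prints 2
        List<String> emitted = entity.drainPatches(); // single "atomic" emission
        System.out.println(emitted.size() + " " + entity.pendingCount()); // prints "2 0"
    }
}
```

The key property mirrored here is that mutation methods never talk to the server; only the drain step does.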
Files: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Has*.java
Mixin interfaces provide reusable metadata operations across entities using the Curiously Recurring Template Pattern (CRTP) for type-safe method chaining:
```java
/**
 * Interface for entities that support tags.
 * Uses CRTP for type-safe method chaining.
 */
public interface HasTags<T extends Entity & HasTags<T>> {
  /**
   * Add a tag to this entity.
   * Creates a patch that will be emitted on save.
   */
  default T addTag(@Nonnull String tagUrn) {
    // Implementation creates patch internally
    return (T) this;
  }

  default T removeTag(@Nonnull String tagUrn) { ... }
  default T setTags(@Nonnull List<String> tagUrns) { ... }
  default T clearTags() { ... }

  // Getter methods
  default List<String> getTags() { ... }
}
```
Available Mixin Interfaces:
- `HasTags<T>` - Tag operations (addTag, removeTag, setTags, clearTags)
- `HasOwners<T>` - Ownership operations (addOwner, removeOwner, setOwners, clearOwners)
- `HasGlossaryTerms<T>` - Glossary term operations (addTerm, removeTerm, setTerms, clearTerms)
- `DomainOperations<T>` - Domain operations (setDomain, removeDomain, clearDomains)
- `HasContainer<T>` - Container hierarchy (setContainer, clearContainer)
- `HasStructuredProperties<T>` - Structured properties operations (setStructuredProperty, removeStructuredProperty)

Why CRTP?
The CRTP pattern enables type-safe method chaining that returns the concrete entity type:
```java
// Without CRTP: returns Entity
Entity entity = dataset.addTag("pii"); // Loses Dataset type!

// With CRTP: returns Dataset
Dataset result = dataset.addTag("pii")
    .addOwner(ownerUrn, type) // Still Dataset type!
    .setDomain(domainUrn);    // Still Dataset type!
```
Entity Implementations:
Entities implement mixin interfaces by declaring them in the class signature:
```java
public class Dataset extends Entity
    implements HasTags<Dataset>,
               HasOwners<Dataset>,
               HasGlossaryTerms<Dataset>,
               DomainOperations<Dataset>,
               HasContainer<Dataset>,
               HasStructuredProperties<Dataset> {
  // Mixin methods provided by default implementations
}
```
This section describes the internal architecture and implementation details for developers contributing to the SDK.
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines)
The Entity base class implements three core subsystems:
Unified Cache Architecture: The SDK uses a unified AspectCache that provides read-your-own-writes semantics with proper dirty tracking. This architecture fixes bugs where fetched aspects would override patches.
Core Implementation Files:
- `AspectCache.java` (184 lines) - Main cache with dirty tracking
- `CachedAspect.java` (68 lines) - Aspect wrapper with metadata
- `AspectSource.java` (23 lines) - Enum for SERVER vs LOCAL aspects
- `ReadMode.java` (28 lines) - Enum for ALLOW_DIRTY vs SERVER_ONLY reads

Key Architectural Features:
AspectSource Tracking: Distinguishes between SERVER-fetched aspects (subject to TTL) and LOCAL-created aspects (no expiration)
Dirty Tracking: Explicit marking of aspects that need write-back to server via markDirty() method
Read-Your-Own-Writes: Default ReadMode.ALLOW_DIRTY returns local modifications immediately, SERVER_ONLY mode skips dirty aspects
TTL Management: 60-second TTL enforced only for SERVER-sourced aspects, LOCAL aspects never expire
Thread Safety: Uses ConcurrentHashMap for safe concurrent access
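The cache semantics above can be sketched in isolation. The class, field, and method names below are invented simplifications for illustration, not the actual `AspectCache` API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Invented sketch: SERVER entries expire after a TTL, LOCAL/dirty entries never do,
// and only dirty entries are eligible for emission.
class AspectCacheSketch {
    enum Source { SERVER, LOCAL }

    static final long TTL_MS = 60_000; // 60-second TTL for SERVER-sourced aspects

    static final class Entry {
        final Object value;
        final Source source;
        final long storedAtMs;
        volatile boolean dirty;
        Entry(Object value, Source source, long storedAtMs, boolean dirty) {
            this.value = value; this.source = source;
            this.storedAtMs = storedAtMs; this.dirty = dirty;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    void putServer(String aspect, Object value, long nowMs) { // clean fetched aspect
        entries.put(aspect, new Entry(value, Source.SERVER, nowMs, false));
    }

    void putLocal(String aspect, Object value, long nowMs) {  // locally created, dirty
        entries.put(aspect, new Entry(value, Source.LOCAL, nowMs, true));
    }

    void markDirty(String aspect) {
        Entry e = entries.get(aspect);
        if (e != null) e.dirty = true;
    }

    /** Fresh = dirty, LOCAL, or SERVER-sourced and still within the TTL. */
    boolean isFresh(String aspect, long nowMs) {
        Entry e = entries.get(aspect);
        if (e == null) return false;
        if (e.dirty || e.source == Source.LOCAL) return true;
        return nowMs - e.storedAtMs < TTL_MS;
    }

    /** Only dirty entries are emitted, so clean fetched aspects never override patches. */
    Map<String, Object> dirtyAspects() {
        Map<String, Object> out = new ConcurrentHashMap<>();
        entries.forEach((k, e) -> { if (e.dirty) out.put(k, e.value); });
        return out;
    }
}
```

This captures why a fetched-then-patched entity behaves correctly: the clean SERVER entry is excluded from `dirtyAspects()` while the patches still go out.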
Internal State (Entity.java):
```java
protected final AspectCache cache; // Unified cache with dirty tracking
protected final Map<String, List<MetadataChangeProposal>> pendingPatches;
private DataHubClientV2 boundClient = null;
```
Cache Operations:
- `getAspectLazy()` - Lazy loads from server, stores as clean SERVER-sourced aspect
- `getOrCreateAspect()` - Gets from cache or creates new LOCAL-sourced aspect (marked dirty)
- `markAspectDirty()` - Marks aspect dirty after in-place modification (used by domain operations)
- `toMCPs()` - Returns only dirty aspects for emission (excludes clean fetched aspects)

Why This Architecture?
The unified cache solves a critical bug: when entities are fetched from the server and then patch operations are applied (e.g., removeTerm()), the cached aspect would be included in toMCPs() and override the patches. With dirty tracking, toMCPs() only returns modified aspects, allowing patches to work correctly.
Metadata operations create patches that accumulate until emission. The system supports two types of operations:
Patch-Based Operations (incremental updates):
- Built with existing `PatchBuilder` classes
- Accumulated in the `pendingPatches` map (aspect name → list of patches)

Cache-Based Operations (full aspect replacement):
- Aspects modified in place are marked with `markAspectDirty()` after modification
- Included in the `toMCPs()` output

MCP Generation:
The toMCPs() method returns only dirty aspects and accumulated patches:
```java
public List<MetadataChangeProposalWrapper> toMCPs() {
  // 1. Add dirty aspects from cache (excludes clean fetched aspects)
  for (Map.Entry<String, RecordTemplate> entry : cache.getDirtyAspects().entrySet()) {
    mcps.add(createMCP(entry.getKey(), entry.getValue()));
  }
  // 2. Add accumulated patches
  for (PatchBuilder builder : patchBuilders.values()) {
    mcps.add(builder.build());
  }
  // 3. Add pending MCPs
  mcps.addAll(pendingMCPs);
  return mcps;
}
```
Critical Design Point: toMCPs() uses cache.getDirtyAspects() instead of all cached aspects. This ensures that fetched aspects don't override patches - only locally modified aspects are emitted.
SDK mode vs INGESTION mode for proper aspect selection:
```java
/**
 * Get aspect name based on operation mode.
 * SDK mode: prefer editable aspects
 * INGESTION mode: use system aspects
 */
protected String getAspectName(Class<? extends RecordTemplate> aspectClass, OperationMode mode) {
  if (mode == OperationMode.SDK) {
    // Check if an editable variant exists
    String editableAspectName = getEditableAspectName(aspectClass);
    if (editableAspectName != null) {
      return editableAspectName;
    }
  }
  return aspectClass.getSimpleName();
}

/**
 * Getter preference order: editable aspects first, then system aspects.
 */
protected <T extends RecordTemplate> T getAspectWithPreference(
    Class<T> editableClass,
    Class<T> systemClass
) {
  // Try editable aspect first
  T editable = getAspectLazy(editableClass);
  if (editable != null) {
    return editable;
  }
  // Fall back to system aspect
  return getAspectLazy(systemClass);
}
```
## Implementation Phases
### Phase 1: Core Framework
Base functionality for all entities:
- Base `Entity` class with aspect management, lazy loading, and patch accumulation
- `DataHubClientV2` main client class with mode-aware behavior
- `EntityClient` with create, read, update, upsert operations
- Configuration classes with environment variable support
- Mixin interfaces using CRTP pattern for type safety
### Phase 2: Dataset Entity
Reference implementation demonstrating all patterns:
- `Dataset` entity with fluent builder
- Dataset-specific aspects (properties, schema, lineage)
- Mixin interface implementations
- Comprehensive unit tests
### Phase 3: Additional Entities
Seven additional entity types:
- `Chart` - Visualizations with lineage
- `Dashboard` - Dashboards with chart relationships
- `Container` - Hierarchical data structures
- `DataJob` - Pipeline tasks with inlet/outlet lineage
- `DataFlow` - Pipeline workflows
- `MLModel` - Machine learning models
- `MLModelGroup` - ML model families
### Phase 4: Patch Capabilities
Patch-based updates for efficient metadata changes:
- Internal patch accumulation within entities (not separate patch builders)
- Automatic patch emission on `update()` and `upsert()`
- Leverages existing `PatchBuilder` classes from entity-registry module
- Patches tested via entity unit tests
### Phase 5: Testing & Documentation
Comprehensive validation and user guides:
- Integration tests with live DataHub server
- API documentation (Javadoc) and 13 comprehensive Markdown guides
- 19 working example files demonstrating real-world usage
- Migration guide from V1
- Design principles document
- Patch operations deep-dive
- Entity-specific guides for all 8 entities
## Testing Strategy
### Unit Tests
Each entity and component has comprehensive unit tests:
- Builder validation (required fields, optional fields, validation logic)
- Aspect management (getters, setters, mode-aware routing)
- MCP generation (full aspects + patches)
- Patch operations (accumulation, emission)
- Fluent API chaining (type safety via CRTP)
- Mixin operations (tags, owners, terms, domains)
**Test Coverage by Entity:**
- Dataset: 37 tests
- Chart: 43 tests
- Dashboard: 52 tests
- DataJob: 45 tests
- DataFlow: 40 tests
- Container: 40 tests
- MLModel: 44 tests
- MLModelGroup: 38 tests
### Integration Tests
Full end-to-end tests against a real DataHub instance:
```java
@Test
public void testDatasetCreateAndRead() throws Exception {
// Create client
DataHubClientV2 client = DataHubClientV2.builder()
.server(TEST_SERVER)
.token(TEST_TOKEN)
.build();
// Create dataset
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("db.schema.test_table_" + System.currentTimeMillis())
.env("PROD")
.description("Test dataset created by Java SDK V2")
.build();
dataset.addTag("test-tag")
.addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER);
// Upsert
client.entities().upsert(dataset);
// Read back
Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
assertNotNull(retrieved);
assertEquals("Test dataset created by Java SDK V2", retrieved.getDescription());
}
@Test
public void testDatasetPatchOperations() throws Exception {
DataHubClientV2 client = DataHubClientV2.builder()
.server(TEST_SERVER)
.token(TEST_TOKEN)
.build();
// Create dataset first
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("db.schema.test_table_patch_" + System.currentTimeMillis())
.env("PROD")
.build();
client.entities().upsert(dataset);
// Retrieve and apply patches
Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
Dataset mutable = retrieved.mutable(); // Get mutable copy
mutable.addTag("pii") // Creates patch
.addTag("sensitive") // Another patch
.addTerm("urn:li:glossaryTerm:CustomerData"); // Another patch
// All patches emitted atomically
client.entities().update(mutable);
// Verify patches were applied
Dataset verified = client.entities().get(dataset.getUrn(), Dataset.class);
assertTrue(verified.getTags().contains("urn:li:tag:pii"));
}
```
Integration Test Coverage:
Running Integration Tests:
```shell
export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your_token
./gradlew :metadata-integration:java:datahub-client:test --tests "*Integration*"
```
All public classes and methods have comprehensive Javadoc plus extensive Markdown documentation:
Javadoc Coverage:
Markdown Documentation (13 files):
Located in metadata-integration/java/docs/sdk-v2/:
Working Examples (19 files):
Located in metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/:
For users of the existing Java SDK:
V1 (before):

```java
RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"));

DatasetUrn urn = new DatasetUrn(
    new DataPlatformUrn("postgres"),
    "my_table",
    FabricType.PROD
);

DatasetProperties props = new DatasetProperties();
props.setDescription("My dataset");

MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn(urn)
    .upsert()
    .aspect(props)
    .build();

emitter.emit(mcpw).get();
```
V2 (after):

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("my_table")
    .description("My dataset")
    .build();

client.entities().upsert(dataset);
```
Decision: Use Pegasus models (com.linkedin.*) for aspect classes.
Rationale:
Result: Proven correct - seamless integration with existing infrastructure.
Decision: Use datahub.client.v2.* namespace.
Rationale:
Result: 100% backward compatibility achieved - v1 code unchanged.
Decision: Use nested static Builder classes.
Rationale:
Result: Excellent developer experience with fluent API.
Decision: Provide synchronous API that wraps async operations.
Rationale:
Result: Simplified API widely adopted in examples and tests.
Decision: Throw checked exceptions for I/O operations.
Rationale:
Result: Clear error handling patterns in all code.
Exception Hierarchy:
The SDK introduces custom exceptions for common error conditions:
ReadOnlyEntityException - Thrown when attempting to mutate a read-only entity:
```java
Dataset dataset = client.entities().get(urn);
try {
  dataset.addTag("pii"); // Throws ReadOnlyEntityException
} catch (ReadOnlyEntityException e) {
  // Exception message explains the issue and provides the fix
  System.err.println(e.getMessage());

  // Fix: Get a mutable copy first
  Dataset mutable = dataset.mutable();
  mutable.addTag("pii");
  client.entities().upsert(mutable);
}
```
PendingMutationsException - Thrown when reading from entity with pending mutations:
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("New description");
// dataset.getDescription(); // Throws PendingMutationsException!

// Fix: Save first, then read
client.entities().upsert(dataset); // Clears dirty flag
String desc = dataset.getDescription(); // Now works
```
Why these restrictions?
Both restrictions enforce clear separation between read and write workflows. These may be relaxed in future versions as the API matures and usage patterns emerge.
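The pending-mutations guard can be modeled with a stand-alone sketch. `GuardedEntity` and its methods are invented names, and `IllegalStateException` stands in for the SDK's `PendingMutationsException`:

```java
// Invented sketch: reads fail while unsaved writes exist; save() clears the flag.
class GuardedEntity {
    private String description;
    private boolean hasPendingMutations;

    GuardedEntity setDescription(String description) {
        this.description = description;
        hasPendingMutations = true; // write recorded but not yet persisted
        return this;
    }

    String getDescription() {
        if (hasPendingMutations) {
            // The real SDK throws PendingMutationsException here
            throw new IllegalStateException("Save pending mutations before reading");
        }
        return description;
    }

    void save() { // stand-in for client.entities().upsert(entity)
        hasPendingMutations = false;
    }
}

public class PendingMutationsSketch {
    public static void main(String[] args) {
        GuardedEntity entity = new GuardedEntity().setDescription("New description");
        try {
            entity.getDescription(); // rejected: unsaved write outstanding
        } catch (IllegalStateException ex) {
            System.out.println("read rejected: " + ex.getMessage());
        }
        entity.save();
        System.out.println(entity.getDescription()); // prints "New description"
    }
}
```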
Decision: Prioritize patch-based operations as the primary API, defer full aspect replacement to V1 SDK.
Rationale:
- Full aspect replacement remains available via the V1 SDK's `RestEmitter` directly

Why not both? The V2 SDK focuses on making common operations simple, not exposing every low-level primitive. This keeps the API focused and prevents confusion about when to use patches vs full replacement.
Result: Clean, intuitive API for 95% of use cases. Power users can drop to V1 SDK for remaining 5%.
Decision: Accumulate patches inside entities rather than separate patch builder classes.
Rationale:
Original Design: Separate DatasetPatch, ChartPatch builder classes
Actual Implementation: Patches accumulate in Entity.pendingPatches and emit via toMCPs()
Result: Superior developer experience - no need to learn separate patch API.
Decision: Use Curiously Recurring Template Pattern for type-safe mixin interfaces.
Rationale:
Original Design: Simple interfaces returning Entity
Actual Implementation:
```java
public interface HasTags<T extends Entity & HasTags<T>> {
  default T addTag(String tagUrn) { return (T) this; }
}
```
Result: Excellent type safety and developer experience.
Decision: Support SDK mode and INGESTION mode for aspect routing.
Rationale:
Original Design: Not specified
Actual Implementation: OperationMode enum with aspect routing logic
Result: Clear separation of concerns, aligns with DataHub's aspect model.
Decision: Implement lazy loading for aspects when entities are retrieved.
Rationale:
Original Design: Not specified (GET deferred)
Actual Implementation: Full lazy loading with getAspectLazy() and client binding
Result: Efficient entity retrieval with on-demand aspect fetching.
GET operation implementation: Should we implement REST client for reading entities, or defer to future?
Search client: Should we include search functionality in V2?
Lineage client: Should we include lineage management?
Schema field builders: Should we provide fluent builders for schema fields?
Start Here:
1. `metadata-integration/java/docs/sdk-v2/getting-started.md` - Quick start guide
2. `metadata-integration/java/docs/sdk-v2/design-principles.md` - Architecture overview

Core Implementation:

3. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java` (490 lines) - Base entity class
4. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java` (570 lines) - CRUD operations
5. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java` (266 lines) - Main client

Sample Entities:

6. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java` (564 lines) - Reference implementation
7. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/HasTags.java` (145 lines) - CRTP mixin example

Examples:

8. `metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java` - Complete workflow
9. `metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/ChartLineageExample.java` - Lineage relationships

Tests:

10. `metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/entity/DatasetTest.java` (37 unit tests)
11. `metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/integration/DatasetIntegrationTest.java` - End-to-end validation
Document Status: Design document reflecting the implemented architecture (includes the AspectCache refactoring)
Author: DataHub OSS Team
Last Updated: 2025-01-06