metadata-integration/java/docs/sdk-v2/design-principles.md
This document provides an architectural overview of DataHub Java SDK V2, exploring the engineering principles and design patterns that enable its type-safe, efficient metadata management capabilities.
SDK V2 is built on a foundation of pragmatic reuse, intelligent caching, and layered abstractions. Rather than reinventing infrastructure, it composes proven components into a coherent, intuitive API while introducing new patterns for efficient metadata operations.
SDK V2 employs a three-layer architecture with clear separation of responsibilities:
┌─────────────────────────────────────────────────────────────┐
│                        Entity Layer                         │
│  (Dataset, Chart, Dashboard - Business Logic)               │
│  - Fluent builders for entity construction                  │
│  - Patch accumulation and aspect management                 │
│  - Mode-aware behavior (SDK vs INGESTION)                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                      Operations Layer                       │
│  (EntityClient - CRUD Operations)                           │
│  - Entity lifecycle management                              │
│  - Patch vs full aspect emission logic                      │
│  - Lazy loading coordination                                │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                       Transport Layer                       │
│  (RestEmitter, Patch Builders)                              │
│  - HTTP communication with DataHub                          │
│  - MCP serialization and emission                           │
│  - Patch builder integration                                │
└─────────────────────────────────────────────────────────────┘
Entity construction follows a fluent builder pattern that guides developers through required fields and provides IDE autocomplete support:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("analytics.public.events")
.env("PROD")
.description("User events")
.build();
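The builder idea can be illustrated with a minimal, self-contained sketch. `SketchDataset` and its methods are hypothetical stand-ins, not the SDK's `Dataset` class; the required-field checks and the `PROD` default are assumptions made for illustration. The URN string follows the standard DataHub dataset URN shape.

```java
import java.util.Objects;

// Hypothetical stand-in for an SDK entity; NOT the real Dataset class.
final class SketchDataset {
    private final String platform;
    private final String name;
    private final String env;
    private final String description;

    private SketchDataset(Builder b) {
        this.platform = b.platform;
        this.name = b.name;
        this.env = b.env;
        this.description = b.description;
    }

    static Builder builder() {
        return new Builder();
    }

    // The URN is derived from builder inputs, so callers never assemble it by hand.
    String urn() {
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }

    String description() {
        return description;
    }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD"; // assumed default for this sketch
        private String description;

        Builder platform(String platform) { this.platform = platform; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder env(String env) { this.env = env; return this; }
        Builder description(String description) { this.description = description; return this; }

        // Fails fast when a required field is missing; all entity fields are final afterwards.
        SketchDataset build() {
            Objects.requireNonNull(platform, "platform is required");
            Objects.requireNonNull(name, "name is required");
            return new SketchDataset(this);
        }
    }
}
```

Because every field is final and set only in the private constructor, the built object cannot be altered through the builder afterwards.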
Engineering Benefits:
- build() creates an immutable entity

Rather than modifying aspects directly, mutations create patch MCPs that accumulate in a pending list:
dataset.addTag("pii") // Creates patch MCP
.addOwner("user", TECHNICAL_OWNER) // Creates patch MCP
.addCustomProperty("retention", "90"); // Creates patch MCP
client.entities().upsert(dataset); // Emits all accumulated patches in one upsert() call
Engineering Benefits:
- Reuses the existing datahub.client.patch builders

Implementation Detail: The Entity base class maintains multiple change-tracking mechanisms:
// From Entity.java
protected final Map<String, RecordTemplate> aspectCache; // Cached aspects from builder
protected final List<MetadataChangeProposalWrapper> pendingMCPs; // Full aspect replacements
protected final List<MetadataChangeProposal> pendingPatches; // Incremental patches
Each mutation (addTag, addOwner) creates a patch using existing builders:
// From Dataset.java
public Dataset addTag(@Nonnull String tagUrn) {
// tag is the TagUrn derived from tagUrn (conversion elided in this excerpt)
GlobalTagsPatchBuilder patch = new GlobalTagsPatchBuilder()
.urn(getUrn())
.addTag(tag, null);
addPatchMcp(patch.build()); // Adds to pendingPatches list
return this;
}
When EntityClient.upsert() is called, it emits everything accumulated on the entity in order:
// From EntityClient.upsert()
// Step 1: Emit cached aspects (from builder)
if (!entity.toMCPs().isEmpty()) {
for (MetadataChangeProposalWrapper mcp : entity.toMCPs()) {
emitter.emit(mcp);
}
}
// Step 2: Emit pending full aspect MCPs (from set*() methods)
if (entity.hasPendingMCPs()) {
for (MetadataChangeProposalWrapper mcp : entity.getPendingMCPs()) {
emitter.emit(mcp);
}
entity.clearPendingMCPs();
}
// Step 3: Emit all pending patches (from add*/remove* methods)
if (entity.hasPendingPatches()) {
for (MetadataChangeProposal patchMcp : entity.getPendingPatches()) {
emitter.emit(patchMcp, null);
}
entity.clearPendingPatches();
}
Key insight: upsert() is not an either/or operation - it emits all accumulated changes. What gets sent depends on what you've accumulated on the entity, not which method you call.
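The accumulate-then-emit flow described above can be sketched without any DataHub dependencies. Everything here is hypothetical: `SketchEntity` and `SketchClient` are illustrative names, and plain strings stand in for MCPs. The sketch only mirrors the three-step ordering and the clearing of pending lists shown in the excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of an entity's three change-tracking lists.
final class SketchEntity {
    private final List<String> cachedAspects = new ArrayList<>();
    private final List<String> pendingMcps = new ArrayList<>();
    private final List<String> pendingPatches = new ArrayList<>();

    // Builder-time aspect (stays cached on the entity).
    SketchEntity withAspect(String name) { cachedAspects.add("aspect:" + name); return this; }
    // set*() style mutation: queues a full aspect replacement.
    SketchEntity setOwners(String owner) { pendingMcps.add("full:ownership=" + owner); return this; }
    // add*() style mutation: queues an incremental patch.
    SketchEntity addTag(String tag) { pendingPatches.add("patch:tag+" + tag); return this; }

    List<String> cachedAspects() { return cachedAspects; }
    List<String> pendingMcps() { return pendingMcps; }
    List<String> pendingPatches() { return pendingPatches; }
    void clearPendingMcps() { pendingMcps.clear(); }
    void clearPendingPatches() { pendingPatches.clear(); }
}

// Hypothetical client that records what it would emit instead of doing HTTP.
final class SketchClient {
    final List<String> emitted = new ArrayList<>();

    void upsert(SketchEntity entity) {
        emitted.addAll(entity.cachedAspects());   // Step 1: cached aspects
        emitted.addAll(entity.pendingMcps());     // Step 2: full aspect replacements
        entity.clearPendingMcps();
        emitted.addAll(entity.pendingPatches());  // Step 3: incremental patches
        entity.clearPendingPatches();
    }
}
```

In this sketch the order in which mutations were called does not matter; the emission order is fixed by `upsert`, and a second `upsert` sends no pending items because the lists were cleared.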
Entities support lazy aspect loading to minimize network calls while ensuring data freshness:
// Entity maintains aspect cache with timestamps
protected final Map<String, RecordTemplate> aspectCache;
protected final Map<String, Long> aspectTimestamps;
protected long cacheTtlMs = 60000; // 60-second default TTL
Loading Strategy:
- getAspectCached - Returns the cached aspect or null
- getAspectLazy - Checks cache freshness and fetches from the server if stale
- getOrCreateAspect - Returns the cached aspect or creates a new empty aspect locally

Implementation:
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
String aspectName = getAspectName(aspectClass);
// Check cache freshness
if (aspectCache.containsKey(aspectName)) {
Long timestamp = aspectTimestamps.get(aspectName);
if (timestamp != null && System.currentTimeMillis() - timestamp < cacheTtlMs) {
return aspectClass.cast(aspectCache.get(aspectName));
}
}
// Fetch from server if client is bound
if (client != null) {
T aspect = client.getAspect(urn, aspectClass);
if (aspect != null) {
aspectCache.put(aspectName, aspect);
aspectTimestamps.put(aspectName, System.currentTimeMillis());
}
return aspect;
}
return null;
}
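The TTL check is easier to reason about with a controllable clock. The following is a minimal sketch, not the SDK's implementation: `TtlAspectCache`, the injected `LongSupplier` clock, and the `Function` fetcher are all assumptions introduced so the freshness logic can be exercised without a server.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical TTL cache mirroring the freshness check in getAspectLazy.
final class TtlAspectCache {
    private final Map<String, Object> values = new HashMap<>();
    private final Map<String, Long> timestamps = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clock;               // injected so time can be faked
    private final Function<String, Object> fetcher; // stands in for the server call
    int fetchCount = 0;                             // exposed for inspection in this sketch

    TtlAspectCache(long ttlMs, LongSupplier clock, Function<String, Object> fetcher) {
        this.ttlMs = ttlMs;
        this.clock = clock;
        this.fetcher = fetcher;
    }

    Object get(String aspectName) {
        Long ts = timestamps.get(aspectName);
        // Fresh entry: serve from cache without calling the fetcher.
        if (ts != null && clock.getAsLong() - ts < ttlMs) {
            return values.get(aspectName);
        }
        // Missing or stale entry: fetch and record the fetch time.
        fetchCount++;
        Object fresh = fetcher.apply(aspectName);
        if (fresh != null) {
            values.put(aspectName, fresh);
            timestamps.put(aspectName, clock.getAsLong());
        }
        return fresh;
    }
}
```

With a 60000 ms TTL, a read at 59999 ms after the fetch is served from the cache, while a read at exactly 60000 ms triggers a refetch, because the comparison is strict.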
Engineering Benefits:
SDK V2 distinguishes between user-initiated edits (SDK mode) and system/pipeline writes (INGESTION mode):
public enum OperationMode {
SDK, // Interactive use - writes to editable aspects
INGESTION // ETL pipelines - writes to system aspects
}
Aspect Routing:
- SDK mode: editableDatasetProperties, editableSchemaMetadata
- INGESTION mode: datasetProperties, schemaMetadata

Implementation:
public Dataset setDescription(@Nonnull String description) {
if (isIngestionMode()) {
return setSystemDescription(description); // datasetProperties
} else {
return setEditableDescription(description); // editableDatasetProperties
}
}
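The routing rule can be shown in isolation. This is a hypothetical sketch: `SketchMode` and `ModeAwareDataset` are illustrative names, and a map keyed by aspect name stands in for real aspect objects. Only the aspect names and the mode-to-aspect mapping come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mirror of the SDK / INGESTION distinction.
enum SketchMode { SDK, INGESTION }

// Hypothetical entity that records which aspect a description was written to.
final class ModeAwareDataset {
    private final SketchMode mode;
    private final Map<String, String> descriptionsByAspect = new HashMap<>();

    ModeAwareDataset(SketchMode mode) {
        this.mode = mode;
    }

    // One public setter; the target aspect depends on the operation mode.
    ModeAwareDataset setDescription(String description) {
        String aspect = (mode == SketchMode.INGESTION)
                ? "datasetProperties"           // system aspect
                : "editableDatasetProperties";  // user-editable aspect
        descriptionsByAspect.put(aspect, description);
        return this;
    }

    // Returns the description stored under the given aspect, or null if none.
    String descriptionIn(String aspectName) {
        return descriptionsByAspect.get(aspectName);
    }
}
```

The caller uses the same `setDescription` call in both modes; only the aspect that receives the value differs.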
Engineering Benefits:
Entities can be instantiated in two ways, each with distinct semantics:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
// aspectCache populated with builder-provided aspects
// aspectTimestamps empty - indicates new entity
Use case: Creating new entities from scratch
Dataset dataset = client.entities().get(urn);
// aspectCache populated with server aspects
// aspectTimestamps records fetch time for each aspect
// Entity automatically bound to client for lazy loading
Use case: Modifying existing entities with current server state. When you access aspects not already cached, the entity will automatically fetch them from the server (lazy loading).
Entities are automatically bound to an EntityClient when loaded from the server or during upsert() to enable lazy aspect fetching:
public void bindToClient(@Nonnull EntityClient client,
@Nonnull OperationMode mode) {
if (this.client == null) {
this.client = client;
}
if (this.mode == null) {
this.mode = mode;
}
}
Binding occurs automatically during upsert():
// From EntityClient.upsert()
entity.bindToClient(this, config.getMode());
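One consequence of the null checks in `bindToClient` is that the first binding wins and later calls are no-ops. A minimal sketch of that behavior, using a hypothetical `BindableEntity` with `Object` and `String` standing in for the client and mode types:

```java
// Hypothetical stand-in showing bind-once semantics; not the SDK's Entity class.
final class BindableEntity {
    private Object client;
    private String mode;

    // Only sets each field if it has not been set before.
    void bindToClient(Object client, String mode) {
        if (this.client == null) {
            this.client = client;
        }
        if (this.mode == null) {
            this.mode = mode;
        }
    }

    Object client() { return client; }
    String mode() { return mode; }
}
```

Under this sketch, an entity loaded through one client and later passed to another client's upsert keeps its original binding.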
Engineering Benefits:
- Binding happens automatically during get() or upsert() operations

SDK V2 leverages Java generics to provide compile-time type safety for aspects:
// Type-safe aspect retrieval (simplified here to the cache lookup only)
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
String aspectName = getAspectName(aspectClass);
RecordTemplate aspect = aspectCache.get(aspectName);
return aspectClass.cast(aspect);
}
// Usage - compiler enforces type correctness
DatasetProperties props = dataset.getAspectLazy(DatasetProperties.class);
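The `Class<T>` token technique works independently of DataHub types. The sketch below is an assumption-laden illustration: `TypedAspectStore` is a hypothetical name, and it keys the map by `Class` object, whereas the excerpt above keys the cache by aspect name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical type-safe container: the Class token ties the key to the value type.
final class TypedAspectStore {
    private final Map<Class<?>, Object> aspects = new HashMap<>();

    // The compiler rejects put(String.class, 42): value must match the token's type.
    <T> void put(Class<T> type, T value) {
        aspects.put(type, type.cast(value));
    }

    // Returns the stored value as T, or null when nothing is stored for this type.
    <T> T get(Class<T> type) {
        return type.cast(aspects.get(type));
    }
}
```

Callers receive the requested type directly and never write a cast themselves; `Class.cast` returns null for a null input, so a missing entry yields null rather than an exception.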
Engineering Benefits:
- ClassCastException is impossible with correct usage

Entity-specific URN types prevent incorrect URN usage:
public class Dataset extends Entity {
public DatasetUrn getDatasetUrn() {
return (DatasetUrn) urn;
}
}
// Compile-time enforcement
DatasetUrn urn = dataset.getDatasetUrn(); // Type-safe
Urn genericUrn = dataset.getUrn(); // Also available
SDK V2 reuses existing patch builders from datahub.client.patch rather than creating new implementations:
- OwnershipPatchBuilder - Owner additions/removals
- GlobalTagsPatchBuilder - Tag management
- GlossaryTermsPatchBuilder - Term associations
- DomainsPatchBuilder - Domain assignment
- DatasetPropertiesPatchBuilder - Property updates
- EditableDatasetPropertiesPatchBuilder - Editable property updates

Engineering Benefits:
Example Integration:
public Dataset addOwner(@Nonnull String ownerUrn, @Nonnull OwnershipType type) {
Urn owner = Urn.createFromString(ownerUrn);
OwnershipPatchBuilder patch = new OwnershipPatchBuilder()
.urn(getUrn())
.addOwner(owner, type);
addPatchMcp(patch.build()); // Stores patch MCP
return this;
}
Transport layer reuses RestEmitter for HTTP communication:
No changes to RestEmitter - SDK V2 is purely additive.
Multiple patches are accumulated on the entity and emitted together by a single upsert() call:
dataset.addTag("tag1").addTag("tag2").addOwner("user1", TECHNICAL_OWNER);
client.entities().upsert(dataset); // One upsert() call emits all 3 pending patches
RestEmitter uses CloseableHttpAsyncClient with connection pooling for efficient HTTP reuse.
Lazy loading failures are logged but do not crash the caller:
catch (Exception e) {
log.warn("Failed to lazy-load aspect {}: {}", aspectName, e.getMessage());
return null; // Graceful degradation
}
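The catch fragment above can be placed in context with a small wrapper. This is a sketch under assumptions: `SafeLoader.loadOrNull` is a hypothetical helper, `Callable` stands in for the server fetch, and `System.err` replaces the logger so the example has no dependencies.

```java
import java.util.concurrent.Callable;

// Hypothetical helper showing the log-and-return-null pattern for lazy loads.
final class SafeLoader {
    private SafeLoader() {}

    // Runs the fetch; on any exception, reports it and degrades to null.
    static <T> T loadOrNull(String aspectName, Callable<T> fetch) {
        try {
            return fetch.call();
        } catch (Exception e) {
            System.err.println("Failed to lazy-load aspect " + aspectName + ": " + e.getMessage());
            return null; // Graceful degradation
        }
    }
}
```

The trade-off is that callers cannot distinguish "aspect absent" from "fetch failed" by the return value alone; the log entry is the only trace of the failure.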
| Aspect | V1 (RestEmitter) | V2 (DataHubClientV2) |
|---|---|---|
| Abstraction Level | Low - MCPs | High - Entities |
| URN Construction | Manual strings | Automatic from builder |
| Aspect Wiring | Manual MCP building | Hidden in entity methods |
| Updates | Full aspect replacement | Patch-based incremental |
| Type Safety | Minimal - generic MCPs | Strong - typed entities |
| Lazy Loading | Not supported | TTL-based caching |
| Mode Awareness | Not supported | SDK vs INGESTION modes |
| Learning Curve | Steep - requires MCP knowledge | Gentle - intuitive builders |
SDK V2 is designed for extensibility:
- Entity base class
- getAspectLazy / getOrCreateAspect
- cacheTtlMs

Java SDK V2 achieves its goals through principled design:
The result is an SDK that feels natural to Java developers while providing the efficiency and correctness required for production metadata management at scale.