Updating DataHub

This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.

Draft release notes for the upcoming version. Add entries here as breaking changes, deprecations, and notable fixes land on master; promote this section to a versioned heading (for example ## v1.6.1 or ## v1.7.0) at release cut.

Requirements:

CLI / Python SDK: (TBD)
Helm Chart: (TBD)

Breaking Changes

Known Issues

Potential Downtime

Deprecations

Other Notable Changes

v1.6.0

Requirements:

CLI / Python SDK: 1.6.0
Helm Chart: TBD (set at release from acryldata/datahub-helm)

Breaking Changes

#16816 / #17351: (GMS / Java services) Spring Boot upgraded from 3.5.6 to 4.0.5 (later patched to 4.0.6 for CVE fixes), including Spring Framework 7.0 and Spring Kafka 4.0. Official Docker images and Helm deployments are updated together. Action: If you ship custom GMS plugins, Spring @Configuration extensions, or Kafka listener customizations against DataHub's classpath, recompile and retest against Spring Boot 4 before upgrading production.
#16625 / #16680: (Frontend) The V1 UI has been fully removed from the codebase (not merely deprecated). THEME_V2_ENABLED, THEME_V2_DEFAULT, and related env vars are required; there is no fallback to V1. Helm users should set datahub-gms.theme_v2.enabled and datahub-gms.theme_v2.default to true and datahub-gms.theme_v2.toggleable to false.
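As a rough illustration, the Helm settings named above might look like this in a values file. The nesting is inferred from the dotted paths in this note and is an assumption; check it against your chart version.

```yaml
# Sketch of Helm values; key nesting is inferred from the dotted paths above.
datahub-gms:
  theme_v2:
    enabled: true      # V2 UI on (V1 no longer exists)
    default: true      # V2 is the default theme
    toggleable: false  # no switch back to V1
```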
#17214: (Frontend / Play 3 + Apache Pekko) datahub-frontend is upgraded from Play 2 / Akka to Play 3 and Apache Pekko. Session and flash cookies use Play 3's JWT-based signing; error responses from the frontend proxy use a custom JSON HttpErrorHandler instead of Play's default HTML pages. See the DATAHUB_SECRET requirement below.
#16982: (Auth) The deprecated corpUserInfo.active field is no longer considered for session eligibility. Users gated on active for login must migrate to supported CorpUser status mechanisms.
#16887: (Operations / Elasticsearch) An optional Elasticsearch-side zero-downtime upgrade (ZDU) is available for OpenSearch/Elasticsearch version bumps. When enabled via Helm (global.datahub.systemUpdate.zdu), system-update runs staged index upgrades with reduced downtime, and scale-down during system-update is disabled automatically. See the Helm chart values for preEnable / enable and index upgrade tuning (global.elasticsearch.index.upgrade, global.elasticsearch.buildIndices).
#16930: (Operations / Upgrades) A background aspect schema version migration sweep runs during upgrades on large deployments. The sweep migrates stale aspect schema versions and avoids overwriting aspects that were updated concurrently during the sweep window.
#17493: (Operations / Retention) Retention policies from boot/retention.yaml are re-applied during system-update when retention is configured, so policy changes shipped with an upgrade take effect without a separate manual ingest step.
Structured properties write validation (GMS): Upserts, patches, and MCP writes to the structuredProperties aspect are now handled in GMS by StructuredPropertiesValidator (at validateProposed) and StructuredPropertiesAssignmentMutator (before commit), for all APIs (GraphQL, OpenAPI, ingestion). With the default STRUCTURED_PROPERTIES_DROP_MISSING_PROPERTY_VALUES_WITH_WARNING=true, assignments whose property definition entity is missing or has no propertyDefinition aspect are removed from the proposal (server WARN per dropped URN) instead of failing the entire write at definition lookup. If the proposal had assignments but none remain valid after filtering, the write is rejected. Clearing structured properties (empty aspect) is unchanged. Action: Pipelines that must fail when any orphaned assignment is present should set STRUCTURED_PROPERTIES_DROP_MISSING_PROPERTY_VALUES_WITH_WARNING=false on GMS and restart; otherwise expect stale property URNs to be stripped on the next write. See Structured Properties — orphaned assignments and STRUCTURED_PROPERTIES_DROP_MISSING_PROPERTY_VALUES_WITH_WARNING.
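For Docker Compose deployments, the opt-out described above could be expressed roughly as follows. The compose layout and the datahub-gms service name are assumptions; Helm users would set the same variable through their chart's environment mechanism.

```yaml
# Sketch only: make GMS reject writes that contain orphaned
# structured property assignments instead of stripping them.
services:
  datahub-gms:
    environment:
      - STRUCTURED_PROPERTIES_DROP_MISSING_PROPERTY_VALUES_WITH_WARNING=false
```

GMS must be restarted for the new value to take effect.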
#17465 The default ingestion profiler for SQL connectors is now sqlalchemy instead of ge (Great Expectations). The SQLAlchemy profiler provides equivalent functionality without requiring the Great Expectations library. To keep using the Great Expectations profiler, install the new optional extra and set the method explicitly in your recipe:
```bash
pip install 'acryl-datahub[profiling-ge]'
```
```yaml
source:
  config:
    profiling:
      enabled: true
      method: ge
```
Note: acryl-datahub[snowflake] (and other SQL connector extras) no longer install acryl-great-expectations by default. Recipes that previously relied on the GE default profiler will now use the SQLAlchemy profiler automatically (no action needed in most cases). Recipes that explicitly set profiling.method: ge without installing [profiling-ge] will fail with a ConfigurationError pointing at the fix.

Deprecation notice: The Great Expectations profiler is now considered legacy and is planned for removal in a future release. New deployments should rely on the default SQLAlchemy profiler. Existing users that still depend on method: ge should plan to migrate.
Search filters (REST, GraphQL, SDK): The Pegasus Criterion record and GraphQL FacetFilterInput no longer expose a singular value field. Send match values only via values (a string array; use a one-element array where you previously passed a single string). Clients that still serialize value must migrate; server-side auto-conversion and related logging have been removed.
#16842: (Ingestion / Athena) Upstream lineage URNs changed for Glue-backed catalogs. Athena tables backed by the AWS Glue Data Catalog (i.e. using awsdatacatalog or any catalog whose type resolves to GLUE) now emit upstream lineage to Glue dataset URNs (urn:li:dataPlatform:glue) instead of S3 URNs (urn:li:dataPlatform:s3). Iceberg tables in non-Glue catalogs (e.g. Hive Metastore, Lambda, Federated) now emit upstream lineage to Iceberg dataset URNs (urn:li:dataPlatform:iceberg). If your upstream Glue or Iceberg connector is configured with a non-default platform_instance, set the new glue_platform_instance / iceberg_platform_instance options on the Athena source so the upstream URNs match. If you have downstream dashboards, saved searches, or lineage queries that reference the old S3 upstream URNs for Athena tables, update them after re-running ingestion.
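A minimal sketch of an Athena recipe using the two new options is shown below. The instance names are hypothetical placeholders, and the other options an Athena recipe needs are omitted.

```yaml
# Sketch only: connection options for the Athena source are omitted.
source:
  type: athena
  config:
    # Hypothetical names; use the platform_instance values from your
    # Glue and Iceberg recipes so the upstream URNs line up.
    glue_platform_instance: my_glue_instance
    iceberg_platform_instance: my_iceberg_instance
```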
(Docker / Gradle) Java service images (datahub-gms, datahub-mce-consumer, datahub-mae-consumer, datahub-upgrade, datahub-frontend-react) use a default base image and apk-style packages at build time. Override the base with Docker BASE_IMAGE / Gradle -PdockerBaseImage=..., and the apk repository line with APK_REPOSITORY_URL / -PapkRepositoryUrl=.... If you previously passed a legacy Gradle property for a custom package mirror, migrate to apkRepositoryUrl (and ensure the mirror layout matches your chosen base image). datahub-actions uses the same BASE_IMAGE build arg.
#17407: BigQuery extract_policy_tags_from_catalog: The policy tag extraction feature has been rewritten to use INFORMATION_SCHEMA.COLUMN_FIELD_PATHS combined with batch Data Catalog API calls instead of per-column get_table() and get_policy_tag() calls. This eliminates the bulk of redundant API calls (e.g. 434k → ~10 for a 29k-table warehouse), reducing extraction time from hours to seconds. The old code path has been removed with no fallback. Action required if you use extract_policy_tags_from_catalog: true: (1) Verify BigQuery API v2 is available in your environment (available since ~2020). (2) If the Data Catalog API is blocked by VPC Service Controls, policy tag resource names will be stored instead of display names; this is graceful degradation, not a failure. (3) Review ingestion logs for any new warnings about a missing policy_tags column in INFORMATION_SCHEMA results.
Bootstrap Steps Moved to System-Update Job. The following metadata ingestion steps now run during system-update instead of GMS startup:
- Entity Types ingestion
On the first deploy after this change, these entities will be re-ingested across the cluster. Per-entity existence checks prevent duplicate data writes.
#16473: Vertex AI Source - ML Model version set URNs now scoped per project: Previously, all Vertex AI model versions shared a version set based solely on the numeric model ID, which caused conflicts when multiple GCP projects contained models with the same numeric ID (GCP model IDs are project-local, not globally unique). Version sets are now scoped per project, preventing 422 ValidationException errors during ingestion. After upgrading, existing model version entities will be orphaned in their old version sets; new ingestion runs will create correctly scoped version sets. Old orphaned entities can be cleaned up by enabling stateful ingestion with stale entity removal.
#17196: (Ingestion / Sigma) Chart InputFields now emit resolved upstream fields only. Previously, Sigma charts emitted self-referential InputFields entries for every chart column, which the DataHub UI ignored as dead edges. Sigma now derives chart field lineage from column formulas and emits only fields that resolve to an upstream workbook element or warehouse table. Downstream consumers that read raw InputFields aspects may see self-referential entries disappear, especially when extract_lineage is disabled, formulas are unavailable, or chart columns are formula-less direct passthroughs; those charts can now emit fields: [] instead of one self-reference per chart column.
#17296: (Ingestion / Sigma) New connection_to_platform_map config for per-connection warehouse overrides. Sigma recipes now support a connection_to_platform_map field keyed by Sigma connectionId (UUID from /v2/connections). Use it to set env, platform_instance, convert_urns_to_lowercase, default_database, and default_schema per connection so emitted lineage URNs match your warehouse connector. This is required for Redshift (and other warehouses where Sigma's /v2/connections API returns null for database/schema): without default_database/default_schema overrides, the SQL parser produces under-qualified URNs that will not match your warehouse connector's datasets. These overrides apply to all lineage paths that have a connectionId on hand — DM element → warehouse table, DM customSQL, and workbook customSQL. No action required if you use Snowflake or BigQuery with default env/platform_instance settings.
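The per-connection overrides described above might be written roughly like this for a Redshift-backed connection. The UUID, instance name, database, and schema are placeholders, and the exact shape of each map entry is an assumption based on the field names listed in this note.

```yaml
# Sketch only: Sigma API credentials and other options are omitted.
source:
  type: sigma
  config:
    connection_to_platform_map:
      # Placeholder connectionId; copy the real UUID from /v2/connections.
      "00000000-0000-0000-0000-000000000000":
        env: PROD
        platform_instance: my_redshift_instance  # hypothetical
        convert_urns_to_lowercase: true
        default_database: dev                    # hypothetical
        default_schema: public                   # hypothetical
```

The values should mirror what your warehouse connector's recipe uses, since the goal is for both sources to emit identical dataset URNs.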
#17347: (Ingestion / Sigma) Workbook customSQL chart lineage is now emitted automatically. When extract_lineage: true (default), charts whose data source is a customSQL definition will emit warehouse UpstreamLineage and column-level FineGrainedLineage on the next ingestion run. No configuration change is required for Snowflake and BigQuery. For Redshift, set default_database/default_schema in connection_to_platform_map (see #17296 above).
#17370: (Ingestion / Sigma) Two bug fixes with operator-visible side effects for warehouse-backed chart column lineage and DM element schema. When extract_lineage: true (default): (1) Charts whose column IDs carry warehouse-native names were incorrectly emitting Sigma display names as InputFields.fields[].schemaField.fieldPath (e.g. Customer Id instead of customer_id). This is now fixed. Downstream consumers that matched on the display-name spelling will need to update those references after the next ingestion run. (2) A bug in DM element schema extraction was silently suppressing valid columns when sibling DM elements shared the same columnId. This is now fixed. Affected DM element Datasets may gain previously suppressed schema columns on the next ingestion run; column-count alerts and soft-delete policies scoped to those Datasets may fire. No configuration change is required.
#17369: (Ingestion / Sigma) Workbook charts that pull directly from warehouse tables now emit entity-level ChartInfo.inputs upstreams. When extract_lineage: true (default), Sigma's BFS node type=table upstreams (direct warehouse tables, not Sigma Datasets) are now resolved to warehouse Dataset URNs and added to chartInfo.inputs. This is a visible behavior change: on the next ingestion run, charts that previously had empty inputs will now show warehouse table lineage edges in the DataHub UI if the workbook contains direct table references. For Snowflake and BigQuery, no configuration change is required. For Redshift, set default_database and default_schema in connection_to_platform_map (see #17296) to ensure URNs match your warehouse connector. Additionally, the 2-segment /files path (e.g. Redshift Connection Root/<SCHEMA>) is now accepted in the workbook warehouse table index; previously only 3-segment paths (e.g. Snowflake Connection Root/<DB>/<SCHEMA>) were accepted. The counter chart_warehouse_unknown_connection has been renamed to chart_warehouse_table_name_unmatched to reflect its actual semantics (workbook table short name not found in the index, not a connection error).
#17276: (Ingestion / Sigma) Sigma Data Model ingestion is now on by default. ingest_data_models defaults to true. On the next ingestion run, new Data Model Containers and element Datasets will appear in DataHub for any Sigma tenant that has Data Models. Workbook charts that reference a DM element will also gain new ChartInfo.inputs entries. Action required to opt out: set ingest_data_models: false in your Sigma recipe. If you keep it enabled, update any soft-delete policy to account for the new Container and Dataset entities, and silence or raise fan-in cardinality alerts before the first run.
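For reference, opting out in a Sigma recipe would look something like the fragment below; all other Sigma options are omitted.

```yaml
# Sketch only: restores the pre-1.6.0 behaviour of not ingesting Data Models.
source:
  type: sigma
  config:
    ingest_data_models: false
```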
(Ingestion / Sigma, action required) A null DatasetUpstream.name is now tolerated. Previously it raised a Pydantic ValidationError warning; it is now counted under chart_dataset_upstream_name_missing and reported as a SourceReport.warning titled Sigma workbook dataset upstream dropped (name missing). Lineage semantics are unchanged. Action: migrate any alert keyed on the old Pydantic warning text to the new counter or warning title.
(Actions / Kafka) The default Kafka offset commit strategy for Actions has changed from synchronous (async_commit_enabled: false) to asynchronous (async_commit_enabled: true). With asynchronous commits, offsets are stored locally after each event and committed periodically by a background thread (default interval: 10 seconds, set via async_commit_interval). This can improve throughput by up to 25x for high-volume action pipelines. The tradeoff: if a consumer crashes, events processed since the last commit (up to one async_commit_interval) may be redelivered. All built-in actions are idempotent or tolerate redelivery, so no behavior change is expected for standard deployments. To restore the previous synchronous default for any action, add async_commit_enabled: false to the source.config section of your action YAML.
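A sketch of an action file that pins the synchronous commit behaviour is shown below. Only the relevant part of source.config is shown; connection settings and the rest of the action definition are omitted.

```yaml
# Sketch only: Kafka connection settings and the action block are omitted.
source:
  type: kafka
  config:
    async_commit_enabled: false  # commit offsets synchronously, as before
```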
Docker / Prometheus (Micrometer): When using the stock DataHub Docker start.sh entrypoints, GMS, MAE consumer, MCE consumer, and the frontend expose Micrometer Actuator (including GET /actuator/prometheus and /actuator/health) on container port 4319 by default (MANAGEMENT_SERVER_PORT), separate from the main API/UI port. Profile compose (docker/profiles/docker-compose.gms.yml) exposes 4319 for on-network scraping and maps ${DATAHUB_MAPPED_GMS_MANAGEMENT_PORT:-4319}:4319 on the host for GMS (same pattern as ${DATAHUB_MAPPED_GMS_PORT:-8080}:8080 for HTTP). Override DATAHUB_GMS_MANAGEMENT_URL in smoke tests if needed, or change the host port via DATAHUB_MAPPED_GMS_MANAGEMENT_PORT. Scrape from another container on the same compose network (e.g. http://datahub-gms:4319/actuator/prometheus as in docker/monitoring/prometheus.yaml). MAE and MCE image HEALTHCHECK probes the management port first, then falls back to the main consumer port for older layouts. The JMX Prometheus Java agent remains on 4318. If you previously scraped Micrometer from the GMS HTTP port (e.g. 8080), update scrapes to the management listener. To keep Actuator on the main port, unset MANAGEMENT_SERVER_PORT in the container or set it equal to the main server port.
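A Prometheus scrape job for the management listener could look roughly like the following. The job name is hypothetical; the target and path are the ones quoted above, and they assume Prometheus runs on the same compose network as GMS.

```yaml
# Sketch only: scrape Micrometer metrics from the GMS management port.
scrape_configs:
  - job_name: datahub-gms-actuator   # hypothetical job name
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["datahub-gms:4319"]
```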
(Operations / monitoring) Docker images that bundle the JMX Prometheus Java agent (datahub-gms, datahub-frontend-react, datahub-mae-consumer, datahub-mce-consumer, datahub-upgrade) now ship jmx_prometheus_javaagent 1.0.1 (previously 0.20.0). The agent uses Prometheus client_java 1.x; upgrade scrapers and dashboards accordingly.
- HTTP scrape path: Metrics are exposed at /metrics, not at /. If you scrape the JMX port directly (often 4318 when Prometheus export is enabled), set your Prometheus metrics_path (or Kubernetes ServiceMonitor / PodMonitor path) to /metrics. Anything still requesting / will receive the default HTML page, not the metrics exposition.
- JVM metrics: Some built-in JVM metric names changed to align with OpenMetrics (for example, memory-related series). Update Grafana panels, recording rules, and alerts that referenced the old names. See the client_java JVM migration notes.
- Labels: When multiple MBeans map to the same metric name, series may include an _objectname label; adjust queries or aggregations if you previously assumed a single series per name.
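For scrapers that hit the JMX agent port directly, a scrape job along these lines should pick up the new path. The job name and target host are assumptions; 4318 is the agent port mentioned above.

```yaml
# Sketch only: scrape the JMX Prometheus Java agent at its new path.
scrape_configs:
  - job_name: datahub-gms-jmx   # hypothetical job name
    metrics_path: /metrics      # the agent no longer serves metrics at /
    static_configs:
      - targets: ["datahub-gms:4318"]
```

Prometheus already defaults metrics_path to /metrics, so only jobs that explicitly overrode it to / need to change.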
#16723 (Ingestion) Dataplex source configuration cleanup: filter_config.entries.dataset_pattern was removed, use filter_config.entries.pattern instead; entries_location was removed, use entries_locations (list) instead.
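A migrated Dataplex recipe might look roughly like this. The location and regex are hypothetical, the allow list assumes the usual DataHub allow/deny pattern shape, and the placement of entries_locations directly under config is also an assumption.

```yaml
# Sketch only: project and credential options are omitted.
source:
  type: dataplex
  config:
    entries_locations:       # replaces the removed entries_location
      - us                   # hypothetical location
    filter_config:
      entries:
        pattern:             # replaces the removed dataset_pattern
          allow:
            - "sales_.*"     # hypothetical regex
```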
(Ingestion / dbt) Assertion result mapping for dbt tests is more faithful to dbt's status semantics:
- dbt test results with status error or runtime error (the test itself could not run due to a compilation, SQL, or infrastructure problem) are now emitted with AssertionResult.type = ERROR instead of FAILURE. Previously these were conflated with real data-quality failures.
- dbt source freshness results with status runtime error are similarly emitted with AssertionResult.type = ERROR. Freshness status error (meaning the error_after threshold was exceeded) continues to map to FAILURE since it represents a legitimate verdict.
- A new optional AssertionResult.severity field is populated on failure results: LOW for dbt warn (only emitted as FAILURE when test_warnings_are_errors: true), HIGH for fail / freshness error.
- Downstream consumers (alerting, dashboards, saved searches) that filter assertion runs by result.type == FAILURE will no longer see infrastructure-level test failures in that set; include type == ERROR if you want to continue capturing them.
#16912: (Build / Java 21) DataHub now compiles with Java 21 JDK and produces Java 8 bytecode for broad runtime compatibility. Self-hosted builds require Java 21+; JVM runtimes consuming the published artifacts remain Java 8+ compatible.
- Spark integration upgraded from 3.3.4 to 3.5.0 (enables Spark 4.x forward compatibility and improves lineage handling for complex query plans)
- Hadoop upgraded from 2.7.2 to 3.3.6 (addresses Hadoop CVEs, bundled with Spark 3.5+)
- For self-hosted deployments: Build/compile requires Java 21 JDK. Runtime environments can use Java 8+ (DataHub produces Java 8 bytecode for compatibility). Spark lineage users must upgrade to Spark 3.5.0+ if using DataHub's Spark integration.
- Java 21 runtime configuration: When running DataHub services with Java 21, add --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED to JAVA_OPTS or your JVM startup command to allow Spark lineage listener reflection-based operations. DataHub Docker images include this automatically.
#17340: (Docker Runtime / Java 25 LTS) Docker images (datahub-gms, datahub-mce-consumer, datahub-mae-consumer, datahub-upgrade, datahub-frontend-react, datahub-actions, and ingestion images) now use a Java 25 LTS runtime while compilation stays on the Java 21 JDK for maximum compatibility. This enables JEP 491 (virtual thread pinning fix in synchronized blocks) and other Java 25 runtime benefits without breaking build toolchain compatibility. Self-hosted users: no action required if using official Docker images. The build system remains on Java 21 (Gradle 8.14.3 compatibility); custom deployments compiling from source should follow AGENTS.md for Java version strategy.
(Frontend / Play 3 — DATAHUB_SECRET) With the Play 3 upgrade (#17214), DATAHUB_SECRET is wired to play.http.secret.key. Play 3 rejects application secrets that are unset, left as the default changeme, or shorter than the minimum required for session/flash JWT signing (default algorithm HS256 needs at least 32 bytes of key material; see Play Application secret). If DATAHUB_SECRET is too short, datahub-frontend will fail on startup. Before upgrading, set DATAHUB_SECRET to a random value of at least 32 characters (for example openssl rand -base64 32 or head -c 32 /dev/urandom | base64). Helm and other production layouts typically already generate a long secret; this mainly affects ad hoc Docker Compose, copied env files, or hand-crafted short values.
(Ingestion / Fivetran — Lineage URN shift for multi-destination accounts). Per-destination URN routing now reflects each destination's actual platform instead of defaulting every destination to fivetran_log_config.destination_platform. Previous behavior: in hybrid mode (when both fivetran_log_config and api_config are configured), every connector's destination URN was emitted with the log warehouse's platform — e.g. if the Fivetran log lives in BigQuery, all destinations were urn:li:dataset:(bigquery, ...) regardless of where the data actually went. New behavior: the connector calls GET /v1/destinations/{id} per destination and emits a URN matching the actual service. Action required if you are running Fivetran in hybrid mode (fivetran_log_config + api_config) AND your account has connectors writing to multiple destination types (e.g. some to Snowflake, some to BigQuery, some to a Managed Data Lake): existing lineage URNs will shift to the correct platform/database on the next ingest, which may break dashboards, queries, or saved searches keyed on the old (incorrectly-conflated) URNs. Audit affected URNs against your Fivetran destinations console before upgrading. To opt out and preserve the old behavior, pin each non-default destination explicitly via destination_to_platform_instance.<destination_id>.platform (and database for relational platforms) — declarative overrides always win over discovery. Single-destination accounts and recipes that already pin every destination explicitly are unaffected.
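Pinning a destination explicitly, as described above, might look like the sketch below. The destination id and database name are hypothetical, and fivetran_log_config / api_config are omitted.

```yaml
# Sketch only: fivetran_log_config and api_config are omitted.
source:
  type: fivetran
  config:
    destination_to_platform_instance:
      my_destination_id:       # hypothetical Fivetran destination id
        platform: snowflake
        database: analytics    # hypothetical; needed for relational platforms
```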
#16873 (Ingestion) Unity Catalog (Databricks) connector now emits ownership and datasetProperties as UPSERT (full replace) by default, instead of incremental PATCH. This means each ingestion run will re-state the complete ownership and properties for each dataset. If you rely on the previous additive/merge behavior (e.g., manually added owners surviving ingestion runs), set incremental_ownership: true and/or incremental_properties: true in your recipe config.
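To keep the earlier additive behaviour, the recipe would carry both flags, roughly as follows; workspace and authentication options are omitted.

```yaml
# Sketch only: restores incremental PATCH semantics for both aspects.
source:
  type: unity-catalog
  config:
    incremental_ownership: true    # manually added owners survive ingestion
    incremental_properties: true   # datasetProperties are merged, not replaced
```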

Known Issues

Potential Downtime

System-update / Elasticsearch: Upgrades that trigger reindexing, incremental index migration, or optional Elasticsearch ZDU (#16887) can run for an extended period depending on catalog size. Reindex duration was improved (#16949), and a fix for a perpetual ES8 reindex loop was included (#17402). Helm upgrades without --atomic may report failure while background reindex jobs continue; allow system-update to finish before retrying. See Helm Notes under the 0.10.0 section below for timeout guidance patterns.
First deploy after bootstrap moves: Entity Types ingestion moved to system-update (see Breaking Changes). Expect additional system-update work on the first upgrade after this release.
Aspect schema version sweep (#16930): Large deployments may observe background migration activity during upgrade; plan maintenance windows accordingly.

Deprecations

(Build / Gradle) Legacy Gradle properties for overriding artifact download URLs are deprecated in favour of the datahub.dependencies.* namespace. Old properties are still honoured but will be removed in a future release. See Custom Repositories.

| Legacy property | Replacement |
| --- | --- |
| `apacheMavenRepositoryUrl`, `mavenCentralRepositoryUrl` | `datahub.dependencies.maven.central` |
| `confluentMavenRepositoryUrl` | `datahub.dependencies.maven.confluent` |
| `linkedinOpenSourceRepositoryUrl` | `datahub.dependencies.maven.linkedinOpenSource` |
| `nodeDistBaseUrl` | `datahub.dependencies.node.distBaseUrl` |
| `githubMirrorUrl` | `datahub.dependencies.github.baseURL` |
| `pipMirrorUrl` | `datahub.dependencies.python.pipMirrorUrl` |
| `pipExtraIndexUrls` | `datahub.dependencies.python.pipExtraIndexUrl` |

(Operations / Helm) Per-workload monitoring configuration is deprecated in favor of cluster-wide settings (global.datahub.monitoring).
(Operations / Helm) Enable global.datahub.systemUpdate.consolidatedUpgrade so upgrades run through the consolidated system-update path and the chart no longer relies on separate one-off setup jobs (e.g. mysql/ES setup jobs) for that flow.
(Ingestion / dbt) The test_warnings_are_errors default will change from false to true in a future release. Once assertion result consumers can filter by severity, warn test statuses will always emit as FAILURE with severity: LOW, and the flag itself will be deprecated. To adopt the forthcoming behavior today, set test_warnings_are_errors: true in your dbt recipe.
(Automations / Glossary Term AI) The Glossary Term AI automation has been deprecated. This capability will be replaced in an upcoming release.

Other Notable Changes

(Security / CWE-200) Error responses from GMS (GraphQL, OpenAPI) and the Play frontend proxy no longer leak internal Java class names, Jackson source locations, or stack traces. Previously, a malformed JSON request to /api/v2/graphql could return detailed internal information such as com.fasterxml.jackson.core.JsonParseException and StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION in the HTTP response body. All error responses now return sanitized messages (e.g. "Invalid JSON format", "Malformed request body", "Bad Request"). The frontend now uses a custom HttpErrorHandler that returns JSON instead of Play's default HTML error pages. Configuration: The frontend error handler is enabled automatically via play.http.errorHandler in application.conf. GMS error sanitization is applied in GlobalControllerExceptionHandler and the global ObjectMapper explicitly disables StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION for defense in depth. No operator action is required; internal exception details remain in server-side logs for debugging.
#17086 (Ingestion / Sigma) Intra-workbook element-to-element lineage is now emitted via ChartInfo.inputEdges. On the next ingestion run, new chart→chart edges will appear in the lineage graph for any Sigma workbook where one element feeds another. This is additive — no existing dataset or warehouse lineage edges are removed.
#16358 (Ingestion / dbt) dbt URN lowercasing is now configurable via convert_urns_to_lowercase. Previously, dbt unconditionally lowercased all target-platform URNs with no way to disable it. This caused lineage breaks with case-sensitive platforms like BigQuery when table names contained uppercase characters (e.g. AM100 lowercased to am100). The default is true (preserving existing behavior). Users on case-sensitive platforms like BigQuery should set convert_urns_to_lowercase: false in their dbt recipe to preserve original identifier casing.
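For a BigQuery-backed dbt project, the relevant part of the recipe might look like this; manifest and catalog paths are omitted.

```yaml
# Sketch only: artifact paths and other dbt options are omitted.
source:
  type: dbt
  config:
    target_platform: bigquery
    convert_urns_to_lowercase: false  # keep original casing, e.g. AM100
```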
(Frontend) The OIDC post-login redirect cookie (REDIRECT_URL) format has changed. It now stores a plain Base64-encoded URL string instead of the previous serialized format. Any in-flight REDIRECT_URL cookies set before the upgrade will fail to decode; affected users will be redirected to the DataHub home page (/) after their next SSO login instead of the page they originally requested. No action is required — users simply log in again and the new cookie format takes effect automatically.
(Operations / Helm) Added global.datahub.monitoring.metricsMode with three modes: legacy (default), jmx_and_actuator, and actuator_only, so JMX vs Spring Boot Actuator scraping can be chosen cluster-wide. See the Micrometer transition plan.
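In a values file, selecting a mode might look like the sketch below. The nesting is inferred from the dotted path above and is an assumption.

```yaml
# Sketch of Helm values; one of legacy, jmx_and_actuator, actuator_only.
global:
  datahub:
    monitoring:
      metricsMode: jmx_and_actuator
```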
#16619 (Operations / Helm) Added a Cleanup upgrade that tears down all DataHub-owned infrastructure resources (Elasticsearch indices, Kafka topics, SQL database and users). It is designed to run as a Helm pre-delete hook so that helm uninstall leaves no DataHub-specific state on shared infrastructure. Each component (ES, Kafka, SQL) can be disabled independently via environment variables (CLEANUP_ELASTICSEARCH_ENABLED, CLEANUP_KAFKA_ENABLED, CLEANUP_SQL_ENABLED; all default to true). Elasticsearch cleanup uses scoped IndexConvention patterns to enumerate only DataHub-owned indices — it does not issue a wildcard DELETE /* that would be dangerous on shared clusters without an index prefix configured.
#16879 (Ingestion) PowerBI: When convert_lineage_urns_to_lowercase is enabled, column-level lineage and upstream dataset URNs are now consistently lowercased. Previously, only parts of the URN were lowercased, which could cause lineage mismatches. After upgrading, re-running ingestion will emit corrected URNs.
#17033 (Ingestion / MongoDB) system.* collections (e.g. system.profile, system.views) are now excluded from ingestion by default via the new excludeSystemCollections option (default: true). If you were relying on ingesting system collections, set excludeSystemCollections: false in your recipe. The read role is now sufficient for all standard ingestion use cases; dbAdmin is only required if excludeSystemCollections is disabled and MongoDB profiling is enabled.
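Recipes that still need system collections could opt back in along these lines; connection options are omitted.

```yaml
# Sketch only: re-enable ingestion of system.* collections.
source:
  type: mongodb
  config:
    excludeSystemCollections: false
```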
#16871 (Ingestion) Postgres stored procedures: The dataTransformLogic.queryStatement.value for Postgres stored procedures now contains only the SQL body (from pg_proc.prosrc) instead of the full CREATE OR REPLACE PROCEDURE ... AS $$ ... $$ wrapper previously returned by pg_get_functiondef(). This change enables lineage extraction (the parser previously skipped the CREATE wrapper), but may affect custom tooling that parsed the full DDL from this field. Additionally, the language field is now normalized to uppercase (e.g. "SQL" instead of "sql").
#17271 (Ingestion / ClickHouse) New opt-in column profiling backend (profiling.method: sqlalchemy). The default Great Expectations profiler issues CREATE TEMPORARY TABLE ... AS SELECT (when custom_sql, limit, or offset is set), but ClickHouse temporary tables are session-scoped, so the follow-up read on a different pooled connection raises UNKNOWN_TABLE. Set profiling: { method: sqlalchemy } in your ClickHouse recipe to route through a SQLAlchemy-based profiler that never creates temp tables and uses ClickHouse-native aggregates (uniq(), quantile(), stddevSamp()). The default is unchanged (ge); existing recipes are not affected. Note: custom_sql, limit, and offset are not supported on the SQLAlchemy path — a warning is emitted and full-table profiling is used instead.
#17191 (Ingestion / Airflow plugin) OpenLineage Dataset names that contain a literal "None" segment — typically produced by Apache Airflow providers that build dataset names with f-strings like f"{database}.{self.schema}.{self.table}" without a None guard (notably S3ToRedshiftOperator when schema is unset) — are now sanitized before the DataHub URN is constructed. Lineage edges from these operators will therefore stitch to the same URN that DataHub's native ingestion sources (e.g. the Redshift source) emit for the same physical table, instead of producing orphan urn:li:dataset:(...,db.None.schema.tbl,PROD) entities. A WARNING is logged at most once per worker process from datahub_airflow_plugin._datahub_ol_adapter so the offending operator can be tracked down upstream without log spam. Orphan URNs created before upgrading are not removed automatically — see the new Orphan dataset URNs containing .None. section in the Airflow lineage docs for cleanup guidance.
(Ingestion / Fabric OneLake) The queryinsights schema is now excluded from table and view discovery alongside INFORMATION_SCHEMA and sys. Fabric Warehouse's Microsoft-managed Query Insights views (exec_requests_history, long_running_queries, etc.) are no longer ingested as user datasets. This is additive cleanup for most deployments.
#17380 (Ingestion / Databricks Unity Catalog) New opt-in flag include_metric_views (default false) for Unity Catalog Metric Views. When enabled, metric views are emitted with subtype Metric View (instead of Table), their YAML body is preserved as a ViewProperties aspect with viewLanguage: YAML, and upstream lineage is derived from the YAML source and joins[].source fields. Lineage resolution supports 3-part (catalog.schema.table) and 2-part (schema.table, resolved against the metric view's own catalog) identifiers as well as backtick-quoted parts; 1-part names and SQL subqueries fall back to the existing table-lineage REST API. Column-level lineage is emitted from each dimensions[].expr / measures[].expr using the Databricks SQL dialect (controlled by the existing include_column_lineage flag, which is on by default). Schema fields are tagged Dimension or Measure when their names match the YAML; per-column descriptions are taken from dimensions[].description / measures[].description when present and otherwise fall back to the Unity Catalog column comment; the top-level filter field surfaces as a metric_view_filter custom property; and materialization: materialized drives ViewProperties.materialized = true. On the opt-in path, the noisy Spark engine config snapshot Unity Catalog injects as view.sqlConfig.spark.* custom properties (~150 entries per view) is filtered out — other view.* properties and Databricks' own metric_view.* properties pass through unchanged. The flag remains false by default so the upgrade is byte-for-byte identical to the previous release. Requires a databricks-sdk recent enough to expose TableType.METRIC_VIEW (older SDKs short-circuit silently and the flag has no effect; a startup warning is emitted in that case).
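The identifier handling described for metric view sources can be sketched roughly as below. This is an illustration only, not the connector's code: split_identifier and resolve_source are hypothetical helpers, and returning None for 1-part names stands in for "fall back to the table-lineage REST API".

```python
import re
from typing import List, Optional

# One identifier part: either backtick-quoted or a run of non-dot characters.
_PART = re.compile(r"`([^`]+)`|([^.`]+)")


def split_identifier(source: str) -> List[str]:
    """Split a dotted identifier into parts, honoring backtick quoting."""
    parts = []
    pos = 0
    while pos < len(source):
        match = _PART.match(source, pos)
        if match is None:
            raise ValueError(f"cannot parse identifier at offset {pos}: {source!r}")
        parts.append(match.group(1) or match.group(2))
        pos = match.end()
        if pos < len(source):
            if source[pos] != ".":
                raise ValueError(f"expected '.' at offset {pos}: {source!r}")
            pos += 1  # skip the separator
    return parts


def resolve_source(source: str, own_catalog: str) -> Optional[str]:
    """Return catalog.schema.table, or None when a fallback path is needed."""
    parts = split_identifier(source)
    if len(parts) == 3:
        return ".".join(parts)
    if len(parts) == 2:
        # 2-part names are resolved against the metric view's own catalog.
        return ".".join([own_catalog, *parts])
    return None  # 1-part names (and anything else) are left to the fallback


resolved = resolve_source("sales.orders", own_catalog="main")
```

Here resolved is "main.sales.orders"; a bare name such as "orders" yields None.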

New ingestion sources

#11838 Aerospike
#13217 Airbyte
#15913 StarRocks
#15966 Matillion
#16218 Apache Flink (metadata and lineage)
#16426 dlt (data load tool)
#16472 Pinecone Vector DB
#16564 Omni BI platform (INCUBATING)
#16646 Microsoft Fabric Data Factory
#17467 pg_queue metadata-ingestion sink and Actions event source

Connector and platform enhancements

#16505 (Ingestion / Glue) JDBC upstream lineage for Glue jobs
#16562 (Ingestion / Glue) Iceberg lineage
#16636 (Ingestion / Glue) Job subtype on Glue jobs
#16515 (Ingestion / Kafka Connect) Column-level lineage for sink connectors and field-level transform support
#16348 (Ingestion / BigQuery) External table metadata enrichment (format, URIs, compression, max bad records)
#16621 (Ingestion / Power BI) browsePathsV2 hierarchy for tables, tiles, and pages under parent dashboards
#16616 (Ingestion / Power BI) Sql.Databases M-Query data access function support
#16486 (SDK) Configurable connection pool settings on REST emitter
#16798 (Ingestion) Multi-entity domain and ownership transformers
#16044 (Ingestion / dbt) Extract and emit stats from catalog.json
#17284 (Ingestion / Fabric OneLake) Query usage extraction from queryinsights
#17475 (Ingestion / Confluence) Page body HTML converted to Markdown via markdownify

Frontend and security

#17277 (Frontend) Content-Security-Policy via Play CSPFilter
#17468 (Frontend) Collapsible nav redesign and home hero toggler
(Frontend) Semantic color tokens migration across entity, lineage, home, and settings surfaces (use theme.colors.* tokens per theming guidelines)
#17487 (Security) Home page templates/module scope validation
#17489 (Security) URL validation before rendering external links in the app

Operations

#17393 (Kubernetes) KEDA-aware scaling support in KubernetesController
#17431 (Search) Timeseries write throttle for Elasticsearch index updates
#17456 (Docker) Published DataHub Postgres extensions image

v1.5.0

Requirements:

Helm Chart: 0.9.0

Breaking Changes

#16341 (Ingestion) SQL parsing: View query IDs are now generated using a SHA-256 hash instead of URL-encoding the view URN. This affects all connectors that use view lineage tracking (Snowflake, Oracle, BigQuery, Postgres, MySQL, Hive, Trino, ClickHouse, DB2, Dremio, SQL Server, and others). Previously, query entities had URNs like urn:li:query:view_urn%3Ali%3Adataset%3A%28...%29; they now use urn:li:query:view_<sha256hash>. After upgrading, the old URL-encoded query entities will become stale and orphaned. To clean them up, enable stateful ingestion with stale entity removal in your recipe and re-run ingestion.
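The two query ID shapes can be compared with a short sketch. The helper names are hypothetical, and the exact input DataHub hashes (and whether the digest is truncated) is an assumption rather than something taken from the connector code.

```python
import hashlib
from urllib.parse import quote


def old_style_view_query_id(view_urn: str) -> str:
    # Previous shape: "view_" followed by the URL-encoded view URN.
    return "view_" + quote(view_urn, safe="")


def new_style_view_query_id(view_urn: str) -> str:
    # Assumed new shape: "view_" followed by a SHA-256 hex digest of the URN.
    digest = hashlib.sha256(view_urn.encode("utf-8")).hexdigest()
    return "view_" + digest


view_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.my_view,PROD)"
old_id = old_style_view_query_id(view_urn)
new_id = new_style_view_query_id(view_urn)
```

Because the two IDs share no structure, the old query entities cannot be mapped to the new ones in place, which is why stale entity removal is the suggested cleanup.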
#16685: (Ingestion) PowerBI M-Query lineage extraction has been rewritten using Microsoft's official @microsoft/powerquery-parser. As part of this change, the native_query_parsing: false configuration flag now suppresses only expressions containing Value.NativeQuery. Previously it suppressed all M-Query lineage extraction. Users who set this flag to block all lineage extraction will now see lineage produced for non-NativeQuery sources (Snowflake, PostgreSQL, MSSQL, etc.). To restore the suppress-all behaviour, add a table_pattern.deny rule in your recipe.
#16396: Oracle connector: When connecting via service_name to a multitenant Oracle database, the database name used in URNs will now reflect the Pluggable Database (PDB) name instead of the Container Database (CDB) name. In Oracle Multitenant architecture, a CDB is the top-level container (e.g. cdb) and a PDB is an individual tenant database within it (e.g. mypdb); service_name typically routes to the PDB, so the PDB name is the correct identifier for your datasets. This affects both dataset URNs (when add_database_name_to_urn: true) and database/schema container URNs (always, since containers always include the database name). If your existing metadata was ingested with the old CDB-based URNs, re-ingesting will create new entities under the corrected URNs. To preserve the old URN shape and avoid re-creating entities, set urn_db_name explicitly in your recipe to match your previous CDB name.
#16628 (Ingestion) Fabric OneLake source: Workspace containers now use the fabric platform instead of fabric-onelake. This changes workspace container URNs and the dataPlatformInstance.platform emitted for workspace entities. Lakehouse, warehouse, schema, and dataset entities remain on fabric-onelake.
Retention service disabled: only current version retained. When the retention service is not enabled (not configured or unavailable), the write path now retains only the current version (version 0) and does not create version-history rows. Previously, version history was still written when retention was disabled. Impact: Deployments that run without retention enabled will no longer accumulate aspect version history; only the latest aspect value is stored. Migration: Enable and configure the retention service (e.g. ingest retention policies from boot/retention.yaml) if you need version history for any entity/aspect.
#16134: Java 17 Runtime Required - DataHub now compiles to Java 17 bytecode (previously Java 11). All Docker images already ship with Java 17 runtime. Self-hosted users must ensure their runtime environment uses Java 17+. The Maven artifact name remains datahub-client-java8 for backward compatibility, but now requires Java 17+ at runtime. This change also includes:
- Spark integration upgraded from 3.0.3 to 3.3.4 (minimum version required for Java 17 support)
- Hadoop upgraded from 2.7.2 to 3.3.6 (addresses Hadoop CVEs, bundled with Spark 3.3+)
- Note: If you're a self-hosted user still running Java 11, you must upgrade to Java 17 before deploying this release. Spark lineage users must upgrade to Spark 3.3.0+.
(Frontend) CustomHttpClientFactory now restricts TLS to 1.2 and 1.3 only when using a custom truststore; TLS 1.0 and 1.1 are disabled for security. If the frontend uses a custom truststore to reach GMS or an IdP that only supports TLS 1.0/1.1, those connections will fail until the server is upgraded to support at least TLS 1.2.
#16176: Vertex AI Source - Two breaking changes:
1. Pipeline URN structure changed: Previously, each pipeline run created a new DataFlow entity. Now, Vertex AI pipelines are modeled as stable DataFlow entities (one per pipeline template), with DataJobs for tasks and DataProcessInstances for individual runs. This enables proper incremental lineage aggregation. Pipeline URNs now use the compiled pipeline spec name (pipelineInfo.name, e.g. the @pipeline(name="...") argument in Kubeflow Pipelines) as the stable identifier; non-Kubeflow pipelines fall back to display_name with any timestamp suffix stripped. After upgrading, existing pipeline entities will appear as separate entities from new ingestion runs. To clean up old entities, enable stateful ingestion with stale entity removal.
2. ML Metadata extraction now enabled by default: Enhanced ML Metadata extraction for CustomJob lineage, hyperparameters, metrics, and model evaluations is now enabled by default. This requires additional GCP permissions: aiplatform.metadataStores.get, aiplatform.metadataStores.list, aiplatform.executions.get, and aiplatform.executions.list. If these permissions are missing, the connector gracefully falls back with warnings. To disable these features, set use_ml_metadata_for_lineage: false, extract_execution_metrics: false, and include_evaluations: false.
#16149 (Ingestion) DataHub source: now uses URN pattern filtering to exclude environment-specific entities by default. This prevents copying credentials and creating invalid entities.
- Default behavior change: Previously exclude_aspects contained dataHubIngestionSourceInfo, dataHubSecretValue, dataHubExecutionRequestInput, globalSettingsInfo. Now urn_pattern.deny contains urn:li:dataHubIngestionSource:.*, urn:li:dataHubSecret:.*, urn:li:globalSettings:.*, urn:li:dataHubExecutionRequest:.* to exclude entire entity types instead of individual aspects.
- Note: If you override urn_pattern or exclude_aspects in your recipe, configure them carefully to avoid syncing sensitive data or creating invalid entities. Keeping the new defaults is recommended.
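To check which URNs the default deny patterns listed above would exclude, the regexes can be tried directly. This is a sketch: is_denied is a hypothetical helper, and matching from the start of the URN (re.match) is an assumption about how the patterns are applied.

```python
import re

# Default deny patterns quoted in the entry above.
DEFAULT_DENY = [
    r"urn:li:dataHubIngestionSource:.*",
    r"urn:li:dataHubSecret:.*",
    r"urn:li:globalSettings:.*",
    r"urn:li:dataHubExecutionRequest:.*",
]


def is_denied(urn: str, deny_patterns=DEFAULT_DENY) -> bool:
    # Assumption: a URN is excluded when any deny regex matches from its start.
    return any(re.match(pattern, urn) for pattern in deny_patterns)


secret_denied = is_denied("urn:li:dataHubSecret:my-secret")
dataset_denied = is_denied("urn:li:dataset:(urn:li:dataPlatform:hive,db.tbl,PROD)")
```

Running a few representative URNs through a check like this before overriding urn_pattern helps confirm that secrets and ingestion sources stay excluded.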
#16333 (Ingestion) Sigma: Owner URNs are now created from the user's email address. Existing references to URNs of the form firstName_lastName will break.
(Ingestion) Kafka Connect: The Debezium SQL Server connector now reports its platform as mssql instead of sqlserver, matching DataHub's canonical platform name. This fixes column-level lineage when using use_schema_resolver: true, as the SchemaResolver now correctly queries for platform=mssql. If you have platform_instance_map or connect_to_platform_map entries keyed on "sqlserver" for Debezium SQL Server connectors, update them to use "mssql" instead.
#16385 Default token signing key and salt have been removed from metadata-service/configuration/src/main/resources/application.yaml. It is recommended to set authentication.tokenService.signingKey (or env var DATAHUB_TOKEN_SERVICE_SIGNING_KEY) and authentication.tokenService.salt (or env var DATAHUB_TOKEN_SERVICE_SALT) before starting DataHub. Refer to the linked pages to see how this is handled for local development and CLI quickstart.
- If you deploy DataHub with Helm, you should be unaffected: the Helm charts do not use the default values in application.yaml and instead generate a random secret.
- IMPACT: Due to the change in signing keys for local development and quickstart, PATs generated before this release will be invalidated and will need to be regenerated in those instances.
#16453 (Ingestion) dbt Cloud: Auto-discovery now ingests all production jobs by default, regardless of the "Generate docs on run" setting. Previously, jobs without generate_docs=True were silently skipped. If you rely on the old behavior (only ingesting jobs with doc generation enabled), add require_generate_docs: true under auto_discovery in your recipe.
#16342 (Timeline API) The backend ChangeCategory enum value OWNER has been renamed to OWNERSHIP to match the GraphQL ChangeCategoryType enum (which has always used OWNERSHIP). Clients calling the REST API (/timeline) with OWNER will still work (backward-compatible alias). The GraphQL schema is unchanged and only recognizes OWNERSHIP.
#15744: The emit_mcps() method on DataHubRestEmitter now returns List[TraceData] instead of int. Previously it returned the number of chunks/batches sent. Now it returns a list of TraceData objects (one per batch) containing trace IDs for debugging and status checking. To get the previous chunk count, use len(result) on the returned list. Additionally, emit_mcp() now returns Optional[TraceData] instead of None.
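Code that has to run against both older and newer SDK versions can normalize the return value. The helper below is hypothetical and deliberately does not import the SDK; it relies only on the fact stated above that the old return type was int and the new one is a list.

```python
from typing import Sequence, Union


def batch_count(result: Union[int, Sequence[object]]) -> int:
    # Older SDKs returned the number of batches directly as an int.
    if isinstance(result, int):
        return result
    # Newer SDKs return one TraceData object per batch, so count the items.
    return len(result)
```

A caller would pass the value returned by emit_mcps() straight into batch_count() wherever the previous integer was used.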
#16680 (Frontend) The V1 UI theme is now officially sunset. All new features and patches going forward will be on the V2 UI (THEME_V2_ENABLED=true). If you are a customer of DataHub Cloud, or started using DataHub after February 2025 (or kept your clone/fork's environment variables in-sync with upstream since then), then this is already your default experience and you do not need to change anything.
- To get the latest DataHub features, make sure the GMS environment variables THEME_V2_ENABLED and THEME_V2_DEFAULT are set to true and THEME_V2_TOGGLEABLE is set to false.
- If you deploy DataHub with Helm, ensure datahub-gms.theme_v2.enabled and datahub-gms.theme_v2.default are both set to true in values.yaml.

Known Issues

Potential Downtime

Deprecations

#16176: Vertex AI Source - The region configuration field is deprecated in favor of regions (list type). The region field continues to work for backward compatibility.
#16176: Vertex AI Source - Partition handling behavior will change in a future major version. Currently, normalize_external_dataset_paths defaults to false, meaning partitioned paths like gs://bucket/data/year=2024/month=01/ create separate dataset entities. In the next major version, this will default to true, normalizing paths to gs://bucket/data/ for stable dataset URNs with lineage aggregation across partitions. To opt-in to the new behavior now, set normalize_external_dataset_paths: true in your configuration.
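The effect of path normalization can be approximated as follows. This is a sketch under the assumption that normalization drops everything from the first Hive-style key=value segment onward; the connector's actual rule may differ, and normalize_partitioned_path is a hypothetical helper that expects a URI with a scheme.

```python
def normalize_partitioned_path(path: str) -> str:
    scheme, sep, rest = path.partition("://")
    segments = [s for s in rest.split("/") if s]
    kept = []
    for segment in segments:
        # Stop at the first "key=value" partition segment.
        if "=" in segment:
            break
        kept.append(segment)
    return scheme + sep + "/".join(kept) + "/"


normalized = normalize_partitioned_path("gs://bucket/data/year=2024/month=01/")
```

With this rule, every partition under gs://bucket/data/ collapses to the same path and therefore to a single dataset URN.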
#16375: Vertex AI Source - The project_id (singular) configuration field is deprecated in favor of project_ids (list). The field is automatically migrated at startup — no immediate action required. Update your recipe to use project_ids: ["my-project"] to silence the deprecation warning. project_id will be removed in a future major release.
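Recipes held as dictionaries (for example, when generated programmatically) can be updated ahead of the removal of project_id. The helper is hypothetical, and how the connector itself merges the two fields when both are set is an assumption.

```python
def migrate_project_id(config: dict) -> dict:
    """Return a copy of a source config that uses project_ids only."""
    migrated = dict(config)
    if "project_id" in migrated:
        single = migrated.pop("project_id")
        ids = list(migrated.get("project_ids", []))
        # Assumption: append the singular value unless it is already listed.
        if single not in ids:
            ids.append(single)
        migrated["project_ids"] = ids
    return migrated


updated = migrate_project_id({"project_id": "my-project"})
```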

Other Notable Changes

#16317 (Ingestion) Looker/LookML: The tag_measures_and_dimensions config is now correctly honored. Previously, setting it to false had no effect — Dimension/Measure/Temporal tags were always added to schema fields. After this fix, setting tag_measures_and_dimensions: false will remove those tags and prepend the field type to the description instead (e.g., "Dimension. My field"). If you had this config set to false but were relying on the tags being present, be aware they will now be removed as intended.
#15270 (Ingestion) Browse paths: When platform_instance is configured, DataFlow and DataJob entities now receive a browsePathsV2 aspect with the platform instance as the root path. Previously, these entities had no browse path from ingestion and the backend would place them in a generic "Default" folder, causing entities from multiple platform instances to be mixed together. This affects sources like Fivetran, Glue, and Kafka-Connect that emit DataFlow/DataJob entities with platform_instance. Sources without platform_instance configured are unaffected.
#16339 (Ingestion) Migrating Python dependency declarations from setup.py to pyproject.toml (PEP 621). setup.py remains the source of truth for editing dependencies for now; pyproject.toml is auto-generated from it via ./gradlew :metadata-ingestion:updateLockFile. Sync verification (./gradlew :metadata-ingestion:verifyPyprojectSync) runs automatically during check and buildWheel. setup.py will be deprecated in a future release in favor of pyproject.toml as the sole dependency source.
#16358 (Ingestion) dbt: Added convert_urns_to_lowercase config option for dbt ingestion (both dbt-core and dbt-cloud). When enabled, dbt platform URNs are lowercased, preventing duplicate entities caused by schema name casing differences (e.g., app_sales vs APP_SALES). This is an opt-in flag (default: false for all platforms). Recommended for case-insensitive platforms like Snowflake or BigQuery where dbt manifests may contain mixed-case identifiers.
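Why casing differences produce duplicates can be shown with a small sketch. make_dataset_urn is a hypothetical helper, and lowercasing only the dataset name portion of the URN is an assumption about what the flag touches.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD",
                     lowercase: bool = False) -> str:
    # Hypothetical helper; lowercases only the dataset name when requested.
    if lowercase:
        name = name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


names = ["db.app_sales.orders", "db.APP_SALES.orders"]
without_flag = {make_dataset_urn("dbt", n) for n in names}
with_flag = {make_dataset_urn("dbt", n, lowercase=True) for n in names}
```

Without lowercasing, the two spellings yield two distinct URNs; with it, they collapse to one.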
Kubernetes: optional scale-down during system-update. When the system-update job runs in a Kubernetes cluster, an optional Java-based step can scale down deployments (by label selector, e.g. MAE/MCE) and set environment variables on other deployments (by label selector, e.g. GMS) before blocking upgrades (e.g. reindex), then restore them after. Scale-down is conditional: it runs only when a blocking upgrade (such as BuildIndices when reindex is needed) requires it. Rollout and scale-down operations run in parallel. State is stored in a ConfigMap; if retries are exceeded, the step restores state, deletes the ConfigMap, and fails. The Java implementation is disabled by default; enable via Helm (datahubSystemUpdate.scaleDown.useJavaImplementation) and the job env vars described in Environment variables. Outside Kubernetes or when disabled, the step is a no-op.
#16176: Vertex AI Source Connector - Additional improvements beyond breaking changes:
- Cross-platform lineage to external data sources (GCS, BigQuery, S3, Azure Blob Storage, Snowflake) referenced in training jobs, with configurable platform instances via platform_instance_map to ensure URNs match native connectors
- UI organization improved with hierarchical folders: model versions appear under their model group folders, and pipeline tasks nest under their parent pipelines
- Enhanced lineage relationships: direct model→dataset relationships via TrainingData aspect; experiment run→model outputs and pipeline task run→model/dataset lineage via DataProcessInstance aspects
- For performance with large projects (1000+ training jobs), use stateful ingestion (recommended) which only processes new/changed resources on subsequent runs. Limits like max_training_jobs_per_type control which resources are processed and sent to DataHub (most recently updated N items)
- Optional platform_instance configuration field for environments running multiple Vertex AI instances
- Existing ingestion configurations continue to work without modification, and URNs remain unchanged unless platform_instance is explicitly configured
#16142: Oracle ingestion: Fixed database container naming when using service_name instead of database configuration. Previously, Oracle containers had no name (only an ID) when using service_name with data_dictionary_mode: ALL (the default). Container URNs will change for affected users as the database name is now properly populated in the container GUID. If stateful ingestion is enabled, running ingestion with the latest CLI version will automatically clean up the old containers and create properly named ones. This fix also ensures database names are normalized consistently with schema and table names.
#16300 (Ingestion) Mode connector: Performance improvements for large Mode workspaces including concurrent API fetching, response caching, and SQL parsing optimizations. Column-level lineage now emits table-level lineage as a fallback when column-level parsing times out.
#16300 (Ingestion) SQL parsing: Join resolution for queries involving CTEs and subqueries was optimized to avoid expensive deep-copy operations that caused significant slowdowns on complex queries. This affects all connectors that use SQL-based lineage (Snowflake, BigQuery, dbt, Mode, Redshift, etc.). The new approach directly walks SQL scope sources to resolve physical tables, replacing the previous indirect column-tracing fallback. This produces more accurate join metadata — for example, unused CTEs are now correctly excluded from join tables. In rare edge cases with unusual CTE patterns, join metadata may differ from previous results, but the new output better reflects the actual query structure.
#15741: New RDF ingestion source.
#15735: New Snowplow ingestion source.
#15249: New Apache Doris ingestion source.
#16236 (Ingestion) dbt: Support for ingesting dbt semantic models.
#16483 / #16445 (Ingestion) Kafka Connect: Added support for Debezium and Confluent JDBC sink connectors; the connector image bundles a JVM via jdk4py so a system Java installation is not required for Kafka Connect ingestion.
#16292 (Ingestion) Trino: Column-level lineage on upstream datasets via the connector.
#16443 (Ingestion) Iceberg: Optional ingestion-time domain assignment configuration.
#13232 (Ingestion) BigQuery: Added convert_column_urns_to_lowercase configuration option (opt-in).
#16568 (Ingestion) Power BI: extract_column_level_lineage is enabled by default for richer lineage; disable in the recipe if you need the previous behavior.
#16387 / #16388: Assets can be associated with multiple data products (backend and UI).
#16365: Policies can target resources by Glossary Terms and Groups.
#16207: Domain-scoped policies can include assets in child domains.
#16303 (Ingestion) Sigma: Workbook filtering in addition to workspace filtering.
#16100 (Ingestion) Snowflake: Metadata pattern pushdown and table type filtering to reduce warehouse metadata queries.
#16310 (Ingestion) Kafka source: Optional configuration to disable Avro schema name validation.
#16471 (CLI) datahub search with semantic search, query projection, and agent-context integration.
#15744 (SDK) Trace IDs are exposed for SYNC_PRIMARY and ASYNC emit modes on the REST emitter.
#16529 (Ingestion) Great Expectations and SQLAlchemy-based profilers are brought to feature parity.
#16355: Elasticsearch reindex and index-creation paths include additional retry behavior for resilience.
#16489 / #16498: Ingestion Docker images and builds use pinned transitive dependencies (uv.lock, constraints.txt) for more reproducible installs.
#16165 (Ingestion) Configurable report sample sizes and richer failure logging for ingestion reports.
#15855 (Ingestion) dbt: Support for ingesting dbt exposures.
#16058 (Ingestion) Snowflake: Support for ingesting external data metric function (DMF) assertions.
#16265 (Ingestion) Azure Data Factory: Column lineage extraction for Copy activity.
#16235 (Ingestion) Airflow plugin: Multi-statement SQL parsing support for lineage extraction.

v1.5.0.1

Patch release for the Actions image packaging.

#16781 (Actions): The datahub-actions Docker image bundles dedicated virtualenvs for datahub-gc and datahub-documents.

v1.5.0.2

Patch release focused on dependency and security updates (Java stack, Python ingestion, observability).

#16829 / #16828: Spring bumped to 6.2.17 and Netty to 4.1.132.Final.
#16950: Patched Rhino, Logback, and Commons Lang3 for CVEs.
OpenTelemetry API/SDK and instrumentation upgraded (Docker images use opentelemetry-javaagent 2.27.0; JMX Prometheus agent unchanged at 0.20.0); addresses CVE-2026-33701.
#16822 / #16868 / #16955: Python ingestion dependency fixes (CVE-2024-27459), raised pyOpenSSL floor with expanded security constraints, and LiteLLM 1.83.0 for CVE-2026-35030.

Bug Fixes

#15748 (Ingestion) PowerBI: Reverted create_corp_user default back to True (was changed to False in v1.5.0). Customers on v1.5.0 or v1.5.0.1 with stateful ingestion enabled and without explicitly setting ownership.create_corp_user: true in their recipe may have had PowerBI-discovered users soft-deleted.

Remediation — if users were already soft-deleted, restore them using one of:
- Re-ingest from your authoritative source (LDAP/Okta/SCIM) — recommended.
- CLI: datahub delete --urn "urn:li:corpuser:<email>" --soft --undelete for each affected user.
- Python SDK: emit StatusClass(removed=False) for each affected user URN.
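For the CLI option, the per-user commands can be generated from a list of affected emails. The helper and the email list are hypothetical; the command shape is copied from the remediation step above.

```python
import shlex


def undelete_commands(emails):
    commands = []
    for email in emails:
        urn = f"urn:li:corpuser:{email}"
        # shlex.quote guards against shell metacharacters in the URN.
        commands.append(
            f"datahub delete --urn {shlex.quote(urn)} --soft --undelete"
        )
    return commands


commands = undelete_commands(["alice@example.com", "bob@example.com"])
```

Review the generated list before running it, since it should contain only users that were soft-deleted by the PowerBI run.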
Behavior with create_corp_user: true (new default): PowerBI creates CorpUser entities with display name and email. These are marked as non-primary (is_primary_source=False), so stateful ingestion will not soft-delete them if they disappear from PowerBI. Note: PowerBI user info may overwrite richer profiles from authoritative sources like LDAP/Okta. If this is a concern, follow the migration steps below to switch to false.

If you want create_corp_user: false (to avoid overwriting LDAP/Okta profiles): Do not switch directly to false — this will cause the same soft-deletion bug. Instead:
1. Upgrade to this version and run at least one ingestion with create_corp_user: true (the default). This updates the stateful ingestion checkpoint to mark user URNs as non-primary.
2. After that run completes, set ownership.create_corp_user: false in your recipe.
3. On subsequent runs, users will no longer be emitted, but stateful ingestion will safely skip them (they were already marked non-primary in the checkpoint).
Skipping step 1 and going directly to false means the checkpoint still has users marked as primary from the old code, and stateful ingestion will soft-delete them.

v1.5.0.7

Patch release focused on dependency and security updates.

OpenTelemetry bumped to 1.62.0 (CVE-2026-45292).
Jackson jackson-core 2.21.3; Parquet stack aligned to 1.17.1.
#17504: Exclude grpc-netty-shaded; align reactor-netty.
#17386: Vert.x 4.5.27 (CVE-2026-6860).
#17452: Sanitize GMS/frontend API error responses (CWE-200).
#17489: Validate URLs before rendering links in the app.
#17487: Fix home page templates/module scope hijacking.
#17453: Python ingestion ujson >= 5.12.1 (CVE-2026-44660).
#17277: Content-Security-Policy via Play CSPFilter on the frontend.

v1.5.0.6

Patch release focused on image maintenance.

Remove bundled kubectl from the datahub-upgrade Docker image.

v1.5.0.5

Patch release focused on Java dependency CVE fixes.

#17352: Netty 4.2.13.Final (CVE-2026-41417 and related).
#17330: Hadoop third-party, Netty, and gRPC CVE hygiene.
Spring Boot 3.5.14 (CVE-2026-40973).
Spring Framework 6.2.18 (CVE-2026-22740).

v1.5.0.4

Patch release focused on HTTP client CVE remediation.

#17323: Apache httpclient5 5.6.1 (CVE-2026-40542).

v1.5.0.3

Patch release focused on security and dependency updates (Play frontend and Java stack, Python ingestion), OIDC and native sign-up fixes, and Docker/CI maintenance.

#17030: Jetty upgrade with refreshed Gradle lockfiles.
#17114: Reactor Netty 1.2.17.
#17112: Apache Shiro 2.1.0 (CVE-2026-23903).
#17110: pac4j 6.4.1 (CVE-2026-40458).
#17113: Hibernate Validator upgrade (CVE-2025-35036).
#17132: jakarta.mail 1.6.8 (CVE-2025-7962).
#17140: OpenTelemetry Java agent 2.27.0.
#17152: maven-artifact 3.9.15.
#17092: Python cryptography raised to >= 46.0.7.
#17116: Ingestion lockfile updates for aiohttp, pytest, and tighter declared bounds.
#17131: Ingestion constraints for Authlib (>= 1.6.11) and pypdf (>= 6.10.2) addressing GHSA-jj8c-mmj3-mmgv and GHSA-4pxv-j86v-mhcw.
Docker-bundled Python virtualenvs: lxml version floor for CVE-2026-41066.
#17045: constraints.txt included in lock file materialization for more reproducible ingestion installs.
#17182: Locked datahub-actions removes pip/uv.

Notable Changes

datahub-actions default base image updated (override with BASE_IMAGE).

Bug Fixes

#17032: OIDC post-login redirect cookie handling.
Play Framework: add play-java-forms at runtime for native sign-up email validation; smoke tests use valid email addresses for those flows.
Shiro/pac4j: AuthModule refactored so the runtime no longer depends on master-only build artifacts for authentication wiring.

Build, Docker, and Tooling

#17156: Slimmer datahub-actions image build; test-only dependency moves; documentation updates for redacting certain Excel-sourced content.
#17133: datahub-upgrade image bundles kubectl v1.35.4.
macOS: pin JPype1 below 1.7.0 so dependency resolution keeps compatible wheels on Apple Silicon (reflected in ingestion constraints.txt).
CI: pin remaining GitHub Actions to full commit SHAs for supply-chain hardening.

v1.4.0

Breaking Changes

Python 3.9 support has been dropped. All Python modules now require Python 3.10 or later. This affects acryl-datahub, acryl-datahub-airflow-plugin, acryl-datahub-dagster-plugin, acryl-datahub-gx-plugin, prefect-datahub, and acryl-datahub-actions. Upgrade to Python 3.10+ before upgrading these packages.
#15930: The Airflow plugin's kill switch for Airflow 2.x now uses environment variables instead of Airflow Variables. If you were using airflow variables set datahub_airflow_plugin_disable_listener true to disable the plugin, you must now use export AIRFLOW_VAR_DATAHUB_AIRFLOW_PLUGIN_DISABLE_LISTENER=true instead.
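The environment variable name follows Airflow's convention of exposing Variables as AIRFLOW_VAR_<KEY>. A deployment script can check the switch as sketched below; both helpers are hypothetical, and treating any casing of "true" as enabled is an assumption about how the plugin parses the value.

```python
import os


def airflow_var_env_name(key: str) -> str:
    # Airflow reads Variables from environment variables named AIRFLOW_VAR_<KEY>.
    return "AIRFLOW_VAR_" + key.upper()


def listener_disabled(environ=os.environ) -> bool:
    # Assumption: any case variant of "true" disables the listener.
    name = airflow_var_env_name("datahub_airflow_plugin_disable_listener")
    return environ.get(name, "false").strip().lower() == "true"


env_name = airflow_var_env_name("datahub_airflow_plugin_disable_listener")
```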
#15877: The LDAP ingestion source now enforces TLS certificate verification by default (tls_verify: true). This prevents Man-in-the-Middle attacks (CWE-295) but may break existing configurations using self-signed certificates. To restore the previous behavior, explicitly set tls_verify: false in your recipe.
#15748: PowerBI ingestion source create_corp_user default changed from True to False. Previously, PowerBI would create user entities by default, potentially overwriting existing user profiles from LDAP/Okta. Now, PowerBI emits ownership URNs only (soft references) by default. To restore previous behavior, explicitly set ownership.create_corp_user: true in your recipe. Migration note: If upgrading and users appear with incomplete profiles, re-ingest from your authoritative source (LDAP/Okta/SCIM). Additionally, when create_corp_user: true, the connector now emits both CorpUserKeyClass and CorpUserInfoClass (was only CorpUserKeyClass), providing complete user metadata. Stateful ingestion note: User entities created by PowerBI are now marked as non-primary (is_primary_source=False), so they will NOT be soft-deleted by stateful ingestion when they disappear. This prevents accidental deletion of users who may also exist in LDAP/Okta/SCIM.
#15415: MLModel and MLModelGroup search field mapping has been updated to resolve duplicate field name conflicts. Existing entities will be automatically reindexed in the background after upgrade. New MLModel and MLModelGroup entities ingested after the upgrade will work immediately.
#15397: Grafana ingestion source dataset granularity changed from per-datasource to per-panel (per visual). This improves lineage accuracy by ensuring each panel's query results in a unique dataset entity with precise upstream/downstream connections. Dataset URN format changed from {ds_type}.{ds_uid} to {ds_type}.{ds_uid}.{dashboard_uid}.{panel_id}. This means all existing Grafana dataset entities will have different URNs. If stateful ingestion is enabled, running ingestion with the latest CLI version will automatically clean up old entities and create new ones. Otherwise, we recommend soft deleting all Grafana datasets via the DataHub CLI: datahub delete --platform grafana --soft and then re-ingesting with the latest CLI version.
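The granularity change can be illustrated with the two name formats quoted above. The helper names and sample values are hypothetical.

```python
def old_dataset_name(ds_type: str, ds_uid: str) -> str:
    # Previous format: one dataset per datasource.
    return f"{ds_type}.{ds_uid}"


def new_dataset_name(ds_type: str, ds_uid: str, dashboard_uid: str, panel_id: int) -> str:
    # New format: one dataset per panel.
    return f"{ds_type}.{ds_uid}.{dashboard_uid}.{panel_id}"


# Two panels on the same dashboard that query the same datasource.
panels = [
    ("prometheus", "abc123", "dash-1", 2),
    ("prometheus", "abc123", "dash-1", 7),
]
old_names = {old_dataset_name(t, u) for t, u, _, _ in panels}
new_names = {new_dataset_name(*p) for p in panels}
```

Both panels previously shared a single dataset; they now map to two, so none of the old URNs carry over.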
#15005: SqlParsingBuilder has been removed; use SqlParsingAggregator instead.
#14710: LookML ingestion source migrated to SDKv2, resulting in:
- browsePaths aspect replaced with browsePathsV2
- Only emits MCPs
#14693: Looker ingestion source migrated to SDKv2, resulting in:
- browsePaths aspect replaced with browsePathsV2
- In Dashboard entity and dashboardInfo aspect, charts property (deprecated) is replaced with chartEdges
- Only emits MCPs
#16023: In PowerBI, Container URNs change for users with platform_instance configured, as this PR now passes platform_instance to dataset containers (affects GUID generation). The env parameter addition is harmless as it's excluded from GUID calculation. Stateful ingestion will soft-delete old containers and create new ones on the next run. Dataset entities and their lineage are unaffected.
#16067: Oracle stored procedure URN format has been corrected to match table URN format. For most users (using service_name or database without add_database_name_to_urn: true), stored procedure DataJob URNs will change from database.schema.stored_procedures to schema.stored_procedures. This fixes a URN mismatch that prevented stored procedure lineage from working. Stateful ingestion will soft-delete old stored procedure entities and create new ones with correct lineage on the next run. Users with database config parameter and add_database_name_to_urn: true are unaffected.

Deprecations

(CLI) The --use-password flag in datahub init command is deprecated. Token generation is now automatically detected when both --username and --password are provided together. The flag continues to work for backward compatibility but will be removed in a future release.

Other Notable Changes

(Ingestion) BigQuery source: Improved dataset_pattern filtering to apply earlier in the ingestion pipeline, reducing unnecessary API calls to BigQuery for datasets that will be filtered out.
#15714: Kafka topic partition counts can now automatically be increased during upgrades if configured values exceed existing partition counts. Set DATAHUB_AUTO_INCREASE_PARTITIONS=true to enable.
(CLI) Added --extra-env option to datahub ingest deploy command to pass environment variables as comma-separated KEY=VALUE pairs (e.g., --extra-env "VAR1=value1,VAR2=value2"). These are stored in the ingestion source configuration and made available to the executor at runtime.
#14968: Added an ingestion source for IBM Db2 databases.
#15824: The Databricks ingestion source now additionally supports authentication via M2M OAuth or Databricks unified authentication. The Azure authentication method now supports profiling as well as ingesting lineage from system tables.
#16067: Oracle ingestion source improvements: Added procedure-to-procedure lineage support, improved PL/SQL parsing for control flow keywords (WHILE, LOOP, EXCEPTION, etc.), fixed overloaded procedure handling, and added distinct subtypes for functions vs procedures. Stored procedures and functions now have proper container hierarchy (Database → Schema → Flow) matching tables and views. Both functions and procedures are organized in the same {schema}.stored_procedures container (consistent with PostgreSQL, MySQL, and Snowflake), with individual subtypes to distinguish them.
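
For the --extra-env option described above, the sketch below shows how a comma-separated KEY=VALUE string maps to individual variables. This is not the CLI's own parser; parse_extra_env is a hypothetical helper, and it assumes values do not themselves contain commas.

```python
def parse_extra_env(value: str) -> dict[str, str]:
    """Split a comma-separated KEY=VALUE string into a dict."""
    env: dict[str, str] = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled commas
        key, sep, val = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected KEY=VALUE, got {pair!r}")
        env[key] = val
    return env


print(parse_extra_env("VAR1=value1,VAR2=value2"))
```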

1.3.0

Breaking Changes

#14580: (Ingestion) The Redshift lineage v1 implementation (RedshiftLineageExtractor) has been removed, as the lineage v2 implementation (RedshiftSqlLineageV2) has been the default for a while already. As a result, the use_lineage_v2 config has also been removed along with all lineage v1 references, and tests have been updated to the v2 implementation. This should not impact most users, as the change is isolated to the Redshift ingestion source.
#14014: The acryl-datahub package now requires pydantic v2. Support for pydantic v1 has been dropped, and users must upgrade to pydantic v2 when using the DataHub Python package.
- As a side effect, this pydantic upgrade changes the behavior of the iceberg ingestion source. When the datahub CLI was installed with all extras (acryl-datahub[all]), pyiceberg was previously held at version 0.4.0 solely to satisfy the pydantic v1 restriction. pyiceberg will now be installed at the newest available version. Although this is a breaking change in behavior, versions >0.4.0 have been used by Managed Ingestion for some time.
- Additionally, the catalog connection configuration has changed. In particular, for AWS-based catalogs and warehouses, the properties profile_name, region_name, aws_access_key_id, aws_secret_access_key, and aws_session_token were deprecated and then removed in version 0.8.0. To check whether your configuration will work, consult https://py.iceberg.apache.org/configuration/#catalogs. For this reason, the pyiceberg dependency is now restricted to at least 0.8.0.
- No changes are needed for iceberg recipes orchestrated via the Managed Ingestion UI.
The acryl-datahub-airflow-plugin now requires specifying the appropriate installation extra based on your Airflow version due to different OpenLineage dependencies:
- For Airflow 2.x (2.7+): Install with pip install 'acryl-datahub-airflow-plugin[airflow2]'
- For Airflow 3.x (3.0+): Install with pip install 'acryl-datahub-airflow-plugin[airflow3]'
- For Airflow 3.0.x specifically: You'll also need to bump pydantic: pip install 'acryl-datahub-airflow-plugin[airflow3]' 'pydantic>=2.11.8'
- Installing without the appropriate extra will result in missing OpenLineage dependencies and lineage extraction will not work properly. The plugin automatically detects your Airflow version and uses the appropriate integration once the correct extra is installed.
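
The version-to-extra mapping above can be sketched as a small helper, for example in a deployment script that generates requirements. The function name is hypothetical; the package extras and the pydantic pin are taken from the list above.

```python
def airflow_plugin_requirements(airflow_version: str) -> list[str]:
    """Return the pip requirements matching the given Airflow version."""
    parts = airflow_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if major == 2 and minor >= 7:
        return ["acryl-datahub-airflow-plugin[airflow2]"]
    if major == 3:
        reqs = ["acryl-datahub-airflow-plugin[airflow3]"]
        if minor == 0:
            # Airflow 3.0.x additionally needs a newer pydantic.
            reqs.append("pydantic>=2.11.8")
        return reqs
    raise ValueError(f"Unsupported Airflow version: {airflow_version}")


print(airflow_plugin_requirements("3.0.2"))
```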
Auto-detection for SearchClientShim requires permission to call the cluster info endpoint on Elasticsearch/OpenSearch. If you access your cluster with a restricted account, you may need to configure the engine type directly rather than relying on auto-detection. Two environment variables control this for GMS and for the MCE and MAE consumers (both are also exposed in the Helm charts):
- ELASTICSEARCH_SHIM_ENGINE_TYPE and ELASTICSEARCH_SHIM_AUTO_DETECT (.Values.global.elasticsearch.engineType and .Values.global.elasticsearch.autoDetect, respectively, in Helm)
- Allowed values for ELASTICSEARCH_SHIM_ENGINE_TYPE: "ELASTICSEARCH_7", "ELASTICSEARCH_8", "OPENSEARCH_2", "AUTO_DETECT". Allowed values for ELASTICSEARCH_SHIM_AUTO_DETECT: "true", "false".
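
A minimal sketch of validating these two settings before deploying, for instance from a provisioning script. The shim_env helper is hypothetical; the variable names and allowed values are the ones listed above.

```python
ALLOWED_ENGINE_TYPES = {
    "ELASTICSEARCH_7",
    "ELASTICSEARCH_8",
    "OPENSEARCH_2",
    "AUTO_DETECT",
}


def shim_env(engine_type: str, auto_detect: bool) -> dict[str, str]:
    """Build the environment variables for GMS and the MCE/MAE consumers."""
    if engine_type not in ALLOWED_ENGINE_TYPES:
        raise ValueError(f"Unknown engine type: {engine_type}")
    return {
        "ELASTICSEARCH_SHIM_ENGINE_TYPE": engine_type,
        "ELASTICSEARCH_SHIM_AUTO_DETECT": "true" if auto_detect else "false",
    }


# Pin the engine explicitly for a restricted account.
print(shim_env("OPENSEARCH_2", auto_detect=False))
```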

Known Issues

Potential Downtime

#15226: Requires reindexing the query entity index on upgrade. Depending on the size of this index, upgrade times may vary.

Deprecations

Other Notable Changes

Airflow 3.x Support: The acryl-datahub-airflow-plugin now supports Apache Airflow 3.x while maintaining backward compatibility with Airflow 2.x (2.7+). Key changes include:
- Kill Switch Migration: The plugin's kill switch behavior has changed between versions:
  - Airflow 2.x: Continue using Airflow Variables to disable the plugin (set the Airflow Variable datahub_airflow_plugin_disable_listener to true)
  - Airflow 3.x: Use environment variable AIRFLOW_VAR_DATAHUB_AIRFLOW_PLUGIN_DISABLE_LISTENER=true to disable the plugin. This change is required to comply with Airflow 3.x's strict database access restrictions during listener initialization.
- Configuration Changes: Airflow 3.x moved some configuration keys (e.g., WEBSERVER__BASE_URL → API__BASE_URL). The plugin automatically detects and uses the correct configuration for each version.
- SubDAG Removal: SubDAG support has been removed in Airflow 3.x. Users should migrate to TaskGroups for visual grouping of tasks.
- New SQLParser Integration: Airflow 3.x uses a unified SQLParser patch architecture for lineage extraction, replacing operator-specific extractors. This provides better consistency and column-level lineage across all SQL operators.
- For detailed migration instructions and compatibility information, see metadata-ingestion-modules/airflow-plugin/AIRFLOW_3_MIGRATION.md in the DataHub repository.
#15118: (Ingestion) The Oracle source now includes stored procedures, functions, packages, and materialized views with automatic lineage generation. Use procedure_pattern to filter procedures if needed. See the Oracle source documentation for permissions and configuration details.
#14717: The Tableau ingestion source now enables extract_lineage_from_unsupported_custom_sql_queries by default. This improves the quality of lineage extracted by using DataHub's SQL parser in cases where the Tableau Catalog API fails to return lineage for Custom SQL queries.
#14824: DataHub now supports CDC (Change Data Capture) mode for generating MetadataChangeLogs with guaranteed ordering based on database transaction commits. CDC mode is optional and disabled by default. When enabled via CDC_MCL_PROCESSING_ENABLED=true, MCLs are generated from Debezium-captured database changes rather than directly from GMS. This provides stronger ordering guarantees and decoupled processing. Requires MySQL 5.7+ or PostgreSQL 10+ with replication enabled. See CDC Configuration Guide for setup instructions.
Added multi-client search engine shim for Elasticsearch and OpenSearch support. This enables DataHub to work with ES 7.17 (with API compatibility mode for ES 8.x servers), ES 8.x, and OpenSearch 2.x through a unified interface. The shim includes auto-detection of search engine types and backward compatibility with existing RestHighLevelClient usage. See elasticsearch-search-client-shim.md for configuration details.
#14953: Unity Catalog ingestion source now supports extracting usage query history from the system.query.history table for improved performance with large query volumes. The new usage_data_source configuration (default: AUTO) automatically uses system tables when warehouse_id is configured, otherwise falls back to the REST API. This change is not breaking as the AUTO mode gracefully handles configurations without warehouse_id by using the existing REST API approach. Users can explicitly force system tables mode by setting usage_data_source: SYSTEM_TABLES (requires SELECT permission on system.query.history) or continue using the REST API with usage_data_source: API.
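
The AUTO fallback described for #14953 can be summarized in a few lines. This is a sketch of the documented behavior, not the connector's code; the function name is hypothetical, and the mode names and the warehouse_id condition come from the entry above.

```python
from typing import Optional


def resolve_usage_data_source(
    usage_data_source: str = "AUTO", warehouse_id: Optional[str] = None
) -> str:
    """Return the effective usage source for a given configuration."""
    if usage_data_source == "AUTO":
        # System tables are only used when a warehouse is configured.
        return "SYSTEM_TABLES" if warehouse_id else "API"
    if usage_data_source in ("SYSTEM_TABLES", "API"):
        return usage_data_source
    raise ValueError(f"Unknown usage_data_source: {usage_data_source}")


print(resolve_usage_data_source())                       # no warehouse_id
print(resolve_usage_data_source(warehouse_id="abc123"))  # hypothetical id
```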

1.2.0

Breaking Changes

All DataHub Python packages now require Python 3.9+. This affects the following packages:
- acryl-datahub (DataHub CLI and SDK)
- acryl-datahub-actions
- acryl-datahub-airflow-plugin
- acryl-datahub-prefect-plugin
- acryl-datahub-gx-plugin
- acryl-datahub-dagster-plugin (already required Python 3.9+)
#13619: The acryl-datahub-airflow-plugin has dropped support for Airflow versions less than 2.7.
#14054: The v1 plugin in acryl-datahub-airflow-plugin has been removed. The v2 plugin has been the default for a while already, so this should not impact most users. Users who were explicitly setting DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true will need to either upgrade or pin to an older version to continue using the v1 plugin.
#14015: In the sql-queries source, the default_dialect configuration parameter has been renamed to override_dialect. This also affects the Python SDK methods:
- DataHubGraph.parse_sql_lineage(default_dialect=...) → DataHubGraph.parse_sql_lineage(override_dialect=...)
- LineageClient.add_lineage_via_sql(default_dialect=...) → LineageClient.add_lineage_via_sql(override_dialect=...)
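
If call arguments or recipe fragments are built programmatically, the rename can be applied with a small shim like the one below. migrate_dialect_kwarg is a hypothetical helper and not part of the DataHub SDK; only the two parameter names come from the entry above.

```python
def migrate_dialect_kwarg(kwargs: dict) -> dict:
    """Rename default_dialect to override_dialect in a kwargs/config dict."""
    migrated = dict(kwargs)  # leave the caller's dict untouched
    if "default_dialect" in migrated:
        if "override_dialect" in migrated:
            raise ValueError("Both default_dialect and override_dialect are set")
        migrated["override_dialect"] = migrated.pop("default_dialect")
    return migrated


print(migrate_dialect_kwarg({"default_dialect": "snowflake"}))
```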
#14059: The acryl-datahub-gx-plugin now requires pydantic v2, which means the effective minimum supported version of GX is 0.17.15 (from Sept 2023).
#13601: The use_queries_v2 flag is now enabled by default for Snowflake and BigQuery ingestion. This improves the quality of lineage and quantity of queries extracted.

Known Issues

Internal Schema Registry - The internal schema registry does not supply a compatible schema for older MCP messages. The short term recommendation is to process all MCPs before upgrading to this release.

Potential Downtime

Deprecations

#13858: If you use the bigquery or redshift connectors, update schema_pattern to match against the fully qualified schema name <database_name>.<schema_name> and set match_fully_qualified_names: True. The current default, match_fully_qualified_names: False, exists only for backward compatibility. The match_fully_qualified_names option will be removed in a future release, and the default behavior will then be that of match_fully_qualified_names: True.
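
To see why existing patterns need updating, the sketch below approximates how a schema_pattern allow list is evaluated with and without fully qualified names. It is a simplified model (prefix-anchored, case-insensitive regex matching is assumed) and not the connectors' actual code; schema_allowed and the sample names are hypothetical.

```python
import re


def schema_allowed(
    database: str,
    schema: str,
    allow_patterns: list[str],
    match_fully_qualified_names: bool,
) -> bool:
    """Check a schema against allow patterns, optionally fully qualified."""
    target = f"{database}.{schema}" if match_fully_qualified_names else schema
    return any(re.match(p, target, re.IGNORECASE) for p in allow_patterns)


# A bare schema pattern stops matching once names are fully qualified.
print(schema_allowed("analytics", "public", ["public"], False))
print(schema_allowed("analytics", "public", ["public"], True))
print(schema_allowed("analytics", "public", [r"analytics\.public"], True))
```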

Other Notable Changes

The acryl-datahub-actions package now requires Pydantic V2, while it previously was compatible with both Pydantic V1 and V2.
#14123: Adds a new environment variable DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES to control batch size limits when using the RestEmitter in ingestions. Default is 15MB but configurable.
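
As a rough illustration of what a payload-size limit on batches means, the sketch below groups JSON-serializable items so that each batch stays under a byte limit. It is not the RestEmitter's implementation; chunk_by_payload_size is hypothetical, and treating 15MB as 15 * 1024 * 1024 bytes is an assumption.

```python
import json
import os

# Assumption: the 15MB default is interpreted as 15 * 1024 * 1024 bytes.
DEFAULT_MAX_PAYLOAD_BYTES = 15 * 1024 * 1024


def max_payload_bytes() -> int:
    """Read the limit from the environment, falling back to the default."""
    return int(
        os.environ.get(
            "DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES",
            DEFAULT_MAX_PAYLOAD_BYTES,
        )
    )


def chunk_by_payload_size(items: list[dict], limit: int) -> list[list[dict]]:
    """Greedily group items so each batch's serialized size stays within limit."""
    batches: list[list[dict]] = []
    current: list[dict] = []
    current_size = 0
    for item in items:
        size = len(json.dumps(item).encode("utf-8"))
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(item)  # an oversized single item still gets its own batch
        current_size += size
    if current:
        batches.append(current)
    return batches
```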

1.1.0

Breaking Changes

#12795: slack source v2 now ingests all users and channels. Make sure the ingestion source and GMS are on the same version of DataHub.
#13004: The acryl-datahub-airflow-plugin dropped support for Airflow 2.3 and 2.4.
#13186: NoCode Migration Removed - This code hasn't been required in many years. If needed, a user should upgrade to DataHub 1.0.x prior to upgrading to a later version.
#13397: async_flag has been removed from the REST emitter and replaced with emit mode ASYNC.
#13120: DataHub Actions Integration: Moved datahub-actions into OSS DataHub. Users building their own DataHub Actions docker images can now build from the OSS project.

Known Issues

#13397: Ingestion Rest Emitter
- ASYNC_WAIT/ASYNC - Async modes are impacted by kafka lag.
- SYNC_WAIT - Only available with OpenAPI ingestion
OpenAPI Reports OpenAPI Spec 3.1.0 when it only supports 3.0.1

Potential Downtime

Deprecations

Other Notable Changes

SDK Improvements: Added URN to URL helpers, lineage client, ML model support, and search enhancements.
Connector Improvements: Significant enhancements to PowerBI, Superset, MLflow, and Redshift connectors.
New Connectors: Added connectors for Hex notebooks and VertexAI.
Iceberg Connector Refactoring: The Iceberg connector has been completely refactored to use MCPWs instead of MCEs.
Entity Subtypes Model Change: The addition of subtypes to most entities.
#13151 #13255: UI Search Bar Redesign: search bar with new autocomplete functionality has been added.
#12976: Edit Lineage Feature: Added ability to edit lineage directly in the UI.
#12983 #13107: Manage Tags Interface: Added a dedicated "Manage Tags" navigation page.
#13361: Theme Support: Added foundation for basic theme support with primary color configuration.
#13165 - OpenAPI v3 Patching Improvements
#13397 - Ingestion Rest Emitter - Added EmitMode parameter for write guarantees.
- SYNC_WAIT: Synchronously updates both the primary storage (SQL) and the search storage (Elasticsearch) before returning. Provides the strongest consistency guarantee at the highest cost. Best for critical operations where immediate searchability and consistent reads are required.
- SYNC_PRIMARY: Synchronously updates the primary storage (SQL) but asynchronously updates search storage (Elasticsearch). Provides a balance between consistency and performance. Suitable for updates that need to be immediately reflected in direct entity retrievals but where search index consistency can be slightly delayed.
- ASYNC: Queues the metadata change for asynchronous processing and returns immediately. The client continues execution without waiting for the change to be fully processed. Best for high-throughput scenarios where eventual consistency is acceptable.
- ASYNC_WAIT: Queues the metadata change asynchronously but blocks until confirmation that the write has been fully persisted. More efficient than fully synchronous operations due to backend parallelization and batching while still providing strong consistency guarantees. Useful when you need confirmation of successful persistence without sacrificing performance.
#13426 - Added support for extracting column transformation logic when using use_queries_v2 with warehouse ingestion.
#13499 - Added ELASTICSEARCH_MIN_SEARCH_FILTER_LENGTH configuration for the Elasticsearch index config. Changing this value from the default can significantly impact search performance and will trigger reindexing, causing large delays in updates. Most users will not want to modify this.

1.0.0

Breaking Changes

#12673: Business Glossary ID generation has been modified to handle special characters and URL cleaning. When enable_auto_id is false (default), IDs are now generated by cleaning the name (converting spaces to hyphens, removing special characters except periods which are used as path separators) while preserving case. This may result in different IDs being generated for terms with special characters.
#12580: The OpenAPI source handled nesting incorrectly. #12580 fixes it to create proper nested field paths; however, this will rewrite the incorrect schemas of existing OpenAPI runs.
#12408: The platform field in the DataPlatformInstance GraphQL type is removed. Clients need to retrieve the platform via the optional dataPlatformInstance field.
#12671: The priority field of the Incident entity is changed from an integer to an enum. This field was previously completely unused in UI and API, so this change should not affect existing deployments.
#12716: Fix the platform_instance being added twice to the URN. If you want to have the previous behavior back, you need to add your platform_instance twice (i.e. plat.plat).
#12797: Previously, endpoints used in ASYNC mode did not immediately validate URNs or entity and aspect names. Starting with this release, invalid requests are rejected with an HTTP 4xx code even in ASYNC mode. Validation covers URNs, entity type names, and entity and aspect names.
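
For the Business Glossary ID change in #12673 above, the following sketch approximates the described cleaning rule: spaces become hyphens, special characters are dropped, periods are kept as path separators, and case is preserved. It is an approximation, not the ingestion source's code; the exact character set (for example, keeping underscores and hyphens) is an assumption, and clean_glossary_id is a hypothetical name.

```python
import re


def clean_glossary_id(name: str) -> str:
    """Approximate the cleaned ID for a glossary term or node name."""
    # Spaces become hyphens; case is preserved.
    cleaned = name.strip().replace(" ", "-")
    # Assumption: keep letters, digits, hyphens, underscores, and periods.
    return re.sub(r"[^A-Za-z0-9._-]", "", cleaned)


print(clean_glossary_id("Customer Data (PII)"))
print(clean_glossary_id("Finance.Net Revenue!"))
```

Comparing the output against the IDs of existing terms is one way to spot names whose IDs may change on upgrade.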

Known Issues

#12601: Jetty 12 introduces a stricter handling of url encoding. We are currently applying a workaround to prevent a regression, while technically breaking the official specifications.
#12714: API Tracing requires at least one mutation of the aspect being updated using this version of DataHub.
#12797: See Breaking Change above. Entity Type names are case sensitive, this will result in 4xx exceptions when this rule is violated.
Python SDK v1.0.0.3 - direct accesses to the server_config property on DataHubRestEmitter can throw an unknown attribute error if test_connection is not called prior to directly accessing it as the default empty map initialization was removed. This is resolved in v1.1.0.

Potential Downtime

Deprecations

Other Notable Changes

#12641: Adds a new MCP validator that prevents deletes of any CorpUser entity that has a newly introduced CorpUserInfo#system flag set to true.
#12433: Fixes the searchable annotations in the model supporting Dashboard to Dashboard lineage within the DashboardInfo aspect. Mainly, users of Sigma and PowerBI Apps ingestion may be affected by this adjustment. Consequently, a reindex will be automatically triggered during the system upgrade.

0.15.0

OpenAPI Update: PIT Keep Alive parameter added to scroll endpoints. NOTE: This parameter requires the pointInTimeCreationEnabled feature flag to be enabled and the elasticSearch.implementation configuration to be elasticsearch. This feature is not supported for OpenSearch at this time and the parameter will not be respected without both of these set.
OpenAPI Update 2: Previously there was an incorrectly marked parameter named sort on the generic list entities endpoint for v3. This parameter is deprecated and only supports a single string value while the documentation indicates it supports a list of strings. This documentation error has been fixed and the correct field, sortCriteria, is now documented which supports a list of strings.

Known Issues

Persistence Exception: No Rows Updated may occur if a transaction does not change any aspect's data.

Breaking Changes

#12223: For dbt Cloud ingestion, the "View in dbt" link will point at the "Explore" page in the dbt Cloud UI. You can revert to the old behavior of linking to the dbt Cloud IDE by setting external_url_mode: ide.
#12191 - Configs include_view_lineage and include_view_column_lineage are removed from snowflake ingestion source. View and External Table DDL lineage will always be ingested when definitions are available.
#12181 - Configs include_view_lineage, include_view_column_lineage and lineage_parse_view_ddl are removed from bigquery ingestion source. View and Snapshot lineage will always be ingested when definitions are available.
#12077: Kafka source no longer ingests schemas from schema registry as separate entities by default. Set ingest_schemas_as_entities to true to ingest them.
#11486 - Criterion's value parameter has been previously deprecated. Use of value instead of values is no longer supported and will be completely removed on the next major version.
#11484 - Metadata service authentication enabled by default
#11484 - Rest API authorization enabled by default
#10472 - SANDBOX added as a FabricType. Once metadata with this fabric type has been added, rollbacks are not possible without manual cleanup in the databases.
#11619 - schema field/column paths can no longer be empty strings
#11619 - schema field/column paths can no longer be duplicated within the schema
#11570 - The DatahubClientConfig's server field no longer defaults to http://localhost:8080. Be sure to explicitly set this.
#11570 - If a datahub_api is explicitly passed to a stateful ingestion config provider, it will be used. We previously ignored it if the pipeline context also had a graph object.
#11518 - DataHub Garbage Collection: Various entities that are soft-deleted (after 10d) or are timeseries entities (dataprocess, execution requests) will be removed automatically using logic in the datahub-gc ingestion source.
#12020 - Removed sql_parser configuration from the Redash source, as Redash now exclusively uses the sqlglot-based parser for lineage extraction.
#12020 - Removed datahub.utilities.sql_parser, datahub.utilities.sql_parser_base and datahub.utilities.sql_lineage_parser_impl module along with SqlLineageSQLParser and DefaultSQLParser. Use create_lineage_sql_parsed_result from datahub.sql_parsing.sqlglot_lineage module instead.
#12067 - Default behavior of DataJobPatchBuilder in Python sdk has been changed to NOT fill out created and lastModified auditstamps by default for input and output dataset edges. This should not have any user-observable impact (time-based lineage viz will still continue working based on observed time), but could break assumptions previously being made by clients.
#12158 - Users provisioned with user.props will need to be enabled before login in order to be granted access to DataHub.
#13273 - When using Opensearch, we will use 'zstd-no-dict' codec instead of 'default'. Elasticsearch still uses 'default'

Potential Downtime

Deprecations

#12056: The DataHub Airflow plugin no longer supports Airflow 2.1 and Airflow 2.2.
#11701: The Fivetran sources_to_database field is deprecated in favor of setting directly within sources_to_platform_instance.<key>.database.
#11560 - The PowerBI ingestion source configuration option include_workspace_name_in_dataset_urn determines whether the workspace name is included in the PowerBI dataset's URN. PowerBI allows semantic models and their tables to have identical names across workspaces, so in a multi-workspace ingestion one semantic model can overwrite another when the workspace name is not part of the URN.

Entity urn with include_workspace_name_in_dataset_urn: false
```
 urn:li:dataset:(urn:li:dataPlatform:powerbi,[<PlatformInstance>.]<SemanticModelName>.<TableName>,<ENV>)
```
Entity urn with include_workspace_name_in_dataset_urn: true
```
 urn:li:dataset:(urn:li:dataPlatform:powerbi,[<PlatformInstance>.]<WorkspaceName>.<SemanticModelName>.<TableName>,<ENV>)
```
The include_workspace_name_in_dataset_urn config defaults to false for backward compatibility. However, we recommend enabling this flag after performing the necessary cleanup. If stateful ingestion is enabled, running ingestion with the latest CLI version will handle the cleanup automatically. Otherwise, we recommend soft deleting all powerbi data via the DataHub CLI: datahub delete --platform powerbi --soft and then re-ingesting with the latest CLI version, ensuring the include_workspace_name_in_dataset_urn configuration is set to true.
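
The collision that this flag avoids can be shown with a schematic URN builder that follows the two templates above. This is a simplified sketch, not the connector's code (it ignores details such as name casing); powerbi_dataset_urn and the sample workspace, model, and table names are hypothetical.

```python
from typing import Optional


def powerbi_dataset_urn(
    semantic_model: str,
    table: str,
    env: str = "PROD",
    workspace: Optional[str] = None,
    platform_instance: Optional[str] = None,
    include_workspace_name_in_dataset_urn: bool = False,
) -> str:
    """Assemble a dataset URN following the templates shown above."""
    parts = []
    if platform_instance:
        parts.append(platform_instance)
    if include_workspace_name_in_dataset_urn:
        if not workspace:
            raise ValueError("workspace is required when it is part of the URN")
        parts.append(workspace)
    parts.extend([semantic_model, table])
    name = ".".join(parts)
    return f"urn:li:dataset:(urn:li:dataPlatform:powerbi,{name},{env})"


# Same model and table in two workspaces: identical URNs without the flag.
print(powerbi_dataset_urn("Sales", "Orders", workspace="Finance"))
print(powerbi_dataset_urn("Sales", "Orders", workspace="Marketing"))
print(powerbi_dataset_urn("Sales", "Orders", workspace="Finance",
                          include_workspace_name_in_dataset_urn=True))
```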

Other Notable Changes

#12236: Data flow and data job entities may additionally produce a container aspect, which requires a corresponding server upgrade. Otherwise, the server can reject the aspect.
#12056: The DataHub Airflow plugin now defaults to the v2 plugin implementation.
#11742: For PowerBi ingestion, use_powerbi_email is now enabled by default when extracting ownership information.
#11549 - Manage Operations Privilege is extended from throttle control to all system management and operations APIs.

0.14.1

Breaking Changes

#9857 (#10773) The lower method was removed from get_db_name of the SQLAlchemySource class. This change will affect the URNs of all entities related to SQLAlchemySource.

Old urn, where data_base_name is DemoData:
```
- urn:li:dataJob:(urn:li:dataFlow:(mssql,demodata.Foo.stored_procedures,PROD),Proc.With.SpecialChar)
```
New urn, where data_base_name is DemoData:
```
- urn:li:dataJob:(urn:li:dataFlow:(mssql,DemoData.Foo.stored_procedures,PROD),Proc.With.SpecialChar)
```
Re-running with stateful ingestion should automatically clear up the entities with old URNS and add entities with new URNs, therefore not duplicating the containers or jobs.
#11313 - datahub get will no longer return a key aspect for entities that don't exist.
#11369 - The default datahub-rest sink mode has been changed to ASYNC_BATCH. This requires a server with version 0.14.0+.
#11214 The container properties aspect will produce an additional field that requires a corresponding server upgrade. Otherwise, the server can reject the aspects.
#10190 - extractor_config.set_system_metadata of datahub source has been moved to be a top level config in the recipe under flags.set_system_metadata

Potential Downtime

Deprecations

Other Notable Changes

Downgrade to previous version is not automatically supported.
Data Product Properties Unset side effect introduced
- Previously, Data Products could be set as linked to multiple Datasets if modified directly via the REST API rather than linked through the UI or GraphQL. The new side effect aligns the REST API behavior with the GraphQL behavior by enforcing the 1-to-1 constraint between Data Products and Datasets.
- NOTE: A pathological write pattern for Data Products can cause issues with write processing under this side effect. If you constantly move all of the Datasets associated with a Data Product back and forth between multiple Data Products, the need to unset previous associations will result in a high volume of writes.

0.14.0.2

Breaking Changes

Protobuf CLI will no longer create binary-encoded protoc custom properties. The -protocProp flag has been added in case this behavior is required.
#10814 The data flow info and data job info aspects will produce an additional field that requires a corresponding server upgrade. Otherwise, the server can reject the aspects.
#10868 - OpenAPI V3 - Creation of aspects will need to be wrapped within a value key and the API is now symmetric with respect to input and outputs.

Example Global Tags Aspect:

json

{
  "tags": [
    {
      "tag": "string",
      "context": "string"
    }
  ]
}

New (optional fields systemMetadata and headers):

json

{
  "value": {
    "tags": [
      {
        "tag": "string",
        "context": "string"
      }
    ]
  },
  "systemMetadata": {},
  "headers": {}
}
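
For clients that build request bodies in code, the wrapping shown above can be sketched as follows. wrap_aspect is a hypothetical helper, not an SDK function; the key names value, systemMetadata, and headers come from the example above.

```python
import json
from typing import Any, Optional


def wrap_aspect(
    aspect: dict[str, Any],
    system_metadata: Optional[dict[str, Any]] = None,
    headers: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Wrap an aspect payload in the OpenAPI v3 envelope."""
    body: dict[str, Any] = {"value": aspect}
    # systemMetadata and headers are optional, so only add them when given.
    if system_metadata is not None:
        body["systemMetadata"] = system_metadata
    if headers is not None:
        body["headers"] = headers
    return body


global_tags = {"tags": [{"tag": "string", "context": "string"}]}
print(json.dumps(wrap_aspect(global_tags), indent=2))
```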

#10858 Profiling configuration for Glue source has been updated.

Previously, the configuration was:
```yaml
profiling: {}
```
Now, it needs to be:
```yaml
profiling:
  enabled: true
```

Potential Downtime

Deprecations

OpenAPI v1: OpenAPI v1 is collectively defined as all endpoints which are not prefixed with /v2 or /v3. The v1 endpoints will be deprecated in no less than 6 months. Endpoints will be replaced with equivalents in the /v2 or /v3 APIs. No loss of functionality expected unless explicitly mentioned in Breaking Changes.

Other Notable Changes

#10498 - Tableau ingestion can now be configured to ingest multiple sites at once and add the sites as containers. The feature is currently only available for Tableau Server.
#10466 - Extends configuration in ~/.datahubenv to match the DatahubClientConfig object definition. See the full configuration in https://docs.datahub.com/docs/python-sdk/clients/. The CLI should now respect the updated configurations specified in ~/.datahubenv across its functions and utilities. This means that for systems where SSL certificate verification is disabled, setting disable_ssl_verification: true in ~/.datahubenv will apply to all CLI calls.
#11002 - We will not auto-generate a ~/.datahubenv file. You must either run datahub init to create that file, or set environment variables so that the config is loaded.
#11023 - Added a new parameter to datahub's put CLI command: --run-id. This parameter is useful for associating a given write with an ingestion process. One use case is mimicking a transformer when no transformer exists for the aspect being written.
#11051 - Ingestion reports will now trim the summary text to a maximum of 800k characters to avoid generating dataHubExecutionRequestResult that are too large for GMS to handle.

0.13.3

Breaking Changes

#10419 - aws_region is now a required configuration in the DynamoDB connector. The connector will no longer loop through all AWS regions; instead, it will only use the region passed into the recipe configuration.
#10389 - Custom validators, mutators, side-effects dropped a previously required constructor
#10472 - RVW added as a FabricType. Once metadata with this fabric type has been added, rollbacks are not possible without manual cleanup in the databases.

Potential Downtime

Deprecations

Other Notable Changes

0.13.1

Breaking Changes

#9934 and #10075 - Stateful ingestion is now enabled by default if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified. It will still be disabled by default when any other sink type is used or if there is no pipeline name set.
#10002 - The DataHubGraph client no longer makes a request to the backend during initialization. If you want to preserve the old behavior, call graph.test_connection() after constructing the client.
#10026 - The dbt use_compiled_code option has been removed, because we now support capturing both source and compiled dbt SQL. This can be configured using include_compiled_code, which is enabled by default in 0.13.1.
#10055 - Assertion entities generated by dbt are now associated with the dbt dataset entity, and not the entity in the data warehouse.
#10090 - For Redshift ingestion, use_lineage_v2 is now enabled by default.
#10147 - For looker ingestion, the browse paths for looker Dashboard, Chart, View, Explore have been updated to align with Looker UI. This does not affect URNs or lineage but primarily affects (improves) browsing experience.
#10164 - For dbt ingestion, entities_enabled.model_performance and include_compiled_code are now both enabled by default. Upgrading dbt ingestion will also require upgrading the backend to 0.13.1.
#10066 - For view access controls, SEARCH_AUTHORIZATION_ENABLED replaced by VIEW_AUTHORIZATION_ENABLED to more accurately represent the feature.
#8231 - Google Analytics 3 has been fully sunsetted by Google as of July 2023, so we now support GA4 thanks to this PR and no longer support GA3 (which would have been broken since last year anyways).
#10278 - Renamed the Presto-On-Hive source to the Hive Metastore source to better reflect its purpose.
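
The stateful ingestion default from #9934 and #10075 above reduces to a simple condition, sketched below. The function name and argument shapes are hypothetical; the rule itself is the one stated in that entry.

```python
from typing import Optional


def stateful_ingestion_enabled_by_default(
    pipeline_name: Optional[str], sink_type: str, has_datahub_api: bool
) -> bool:
    """Return True when stateful ingestion is on by default for a recipe."""
    # Requires a pipeline name plus either a datahub-rest sink or datahub_api.
    return bool(pipeline_name) and (sink_type == "datahub-rest" or has_datahub_api)


print(stateful_ingestion_enabled_by_default("nightly_snowflake", "datahub-rest", False))
print(stateful_ingestion_enabled_by_default(None, "datahub-rest", False))
print(stateful_ingestion_enabled_by_default("nightly_snowflake", "file", False))
```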

Potential Downtime

Deprecations

Other Notable Changes

0.13.0

Breaking Changes

The MySQL version for quickstarts has been updated to 8.2, which may cause quickstart issues for existing instances.
Neo4j has been updated to 5.x, which may require migration from 4.x.
Build requires JDK17 (Runtime Java 11)
Build requires Docker Compose > 2.20
#9731 - The acryl-datahub CLI now requires Python 3.8+
#9601 - The Unity Catalog(UC) ingestion source config include_metastore is now disabled by default. This change will affect the urns of all entities in the workspace.

Entity Hierarchy with include_metastore: true (Old)
```
- UC Metastore
  - Catalog
    - Schema
      - Table
```
Entity Hierarchy with include_metastore: false (New)
```
- Catalog
  - Schema
    - Table
```
We recommend using platform_instance for differentiating across metastores.

If stateful ingestion is enabled, running ingestion with latest cli version will perform all required cleanup. Otherwise, we recommend soft deleting all databricks data via the DataHub CLI: datahub delete --platform databricks --soft and then reingesting with latest cli version.
#9601 - The Unity Catalog(UC) ingestion source config include_hive_metastore is now enabled by default. This requires config warehouse_id to be set. You can disable include_hive_metastore by setting it to False to avoid ingesting legacy hive metastore catalog in Databricks.
#9904 - The default Redshift table_lineage_mode is now MIXED, instead of STL_SCAN_BASED. Improved lineage generation is also available by enabling use_lineage_v2. This v2 implementation will become the default in a future release.

Potential Downtime

Deprecations

Spark 2.x (including previous JDK8 build requirements)

Other Notable Changes

0.12.1

Breaking Changes

#9244: The redshift-legacy and redshift-legacy-usage sources, which have been deprecated for >6 months, have been removed. The new redshift source is a superset of the functionality provided by those legacy sources.
The database_alias config is no longer supported in the following SQL sources: Redshift, MySQL, Oracle, Postgres, Trino, Presto-on-hive. The config will automatically be ignored if it's present in your recipe. It has been deprecated since v0.9.6.
#9257: The Python SDK urn types are now autogenerated. The new classes are largely backwards compatible with the previous, manually written classes, but many older methods are now deprecated in favor of a more uniform interface. The only breaking change is that the signature for the direct constructor, e.g. TagUrn("tag", ["tag_name"]), is no longer supported, and the simpler TagUrn("tag_name") should be used instead. The canonical place to import the urn classes from is datahub.metadata.urns.*. Other import paths, like datahub.utilities.urns.corpuser_urn.CorpuserUrn, are retained for backwards compatibility, but are considered deprecated.
#9286: The DataHubRestEmitter.emit method no longer returns anything. It previously returned a tuple of timestamps.
#8951: A Great Expectations-based profiler has been added for the Unity Catalog source. To use the old profiler, set method: analyze under the profiling section in your recipe. To use the new profiler, set method: ge. Profiling is disabled by default, so to enable it, one of these methods must be specified.

Potential Downtime

Deprecations

Other Notable Changes

0.12.0

Breaking Changes

#8687 (datahub-helm #365 #353) - If Helm is used for installation and Neo4j is enabled, update the prerequisites Helm chart to version >=0.1.2 and adjust your value overrides in the neo4j: section according to the new structure.
#9044 - GraphQL APIs for adding ownership now expect either an ownershipTypeUrn referencing a custom ownership type or a (deprecated) type. Adding an ownership without a concrete type was previously allowed; this is no longer the case. For simplicity, you can use the type parameter, which will be translated to a custom ownership type internally if one exists for the type being added.
#9010 - In the Redshift source's config, incremental_lineage is now off by default.
#8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
#8942 - Removed urn:li:corpuser:datahub owner for the Measure, Dimension and Temporal tags emitted by Looker and LookML source connectors.
#8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
#8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include pip install 'acryl-datahub-airflow-plugin[plugin-v2]'. To continue using the v1 plugin, set the DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN environment variable to true.
#8943 - The Unity Catalog ingestion source has a new option include_metastore, which will cause all urns to be changed when disabled. This is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future. If stateful ingestion is enabled, simply setting include_metastore: false will perform all required cleanup. Otherwise, we recommend soft deleting all databricks data via the DataHub CLI: datahub delete --platform databricks --soft and then reingesting with include_metastore: false.
#8846 - Changed enum values in resource filters used by policies. RESOURCE_TYPE became TYPE and RESOURCE_URN became URN. Any existing policies using these filters (i.e. defined for particular urns or types such as dataset) need to be upgraded manually, for example by retrieving their respective dataHubPolicyInfo aspect and changing the part that uses the filter, i.e. from

```yaml
   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "RESOURCE_TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }
```

into

```yaml
   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }
```

for example, using the datahub put command. Policies can also be removed and re-created via the UI.
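
The rename for #8846 can also be scripted. The sketch below is only an illustration, not DataHub tooling: migrate_policy_info is a hypothetical helper that takes a dataHubPolicyInfo aspect already loaded as a Python dict and rewrites the two renamed field values. Fetching the aspect and writing it back are left out.

```python
import copy
from typing import Any, Dict

# The two renames described in #8846.
FIELD_RENAMES = {"RESOURCE_TYPE": "TYPE", "RESOURCE_URN": "URN"}


def migrate_policy_info(policy_info: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the aspect with renamed filter fields (hypothetical helper)."""
    migrated = copy.deepcopy(policy_info)
    resources = migrated.get("resources") or {}
    criteria = (resources.get("filter") or {}).get("criteria") or []
    for criterion in criteria:
        field = criterion.get("field")
        if field in FIELD_RENAMES:
            criterion["field"] = FIELD_RENAMES[field]
    return migrated


old_policy = {
    "resources": {
        "filter": {
            "criteria": [
                {"field": "RESOURCE_TYPE", "condition": "EQUALS", "values": ["dataset"]}
            ]
        }
    }
}
new_policy = migrate_policy_info(old_policy)
print(new_policy["resources"]["filter"]["criteria"][0]["field"])  # TYPE
```

The helper works on a deep copy, so the original aspect stays available for comparison before anything is written back.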

#9077 - The BigQuery ingestion source by default sets match_fully_qualified_names: true. This means that any dataset_pattern or schema_pattern specified will be matched on the fully qualified dataset name, i.e. <project_name>.<dataset_name>. We attempt to support the old pattern format by prepending .*\\. to dataset patterns lacking a period, so in most cases this should not cause any issues. However, if you have a complex dataset pattern, we recommend you manually convert it to the fully qualified format to avoid any potential issues.
#9110 - The Unity Catalog source will now generate urns based on env properly. If you have been setting env in your recipe to something besides PROD, we will now generate urns with that new env variable, invalidating your existing urns.
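
As an illustration of the BigQuery pattern handling described in #9077 above, the sketch below prepends `.*\.` to a dataset pattern that contains no period. The function name and the literal "no period character" check are assumptions for this example; the connector's actual logic may differ.

```python
import re


def to_fully_qualified_pattern(pattern: str) -> str:
    """Prefix the pattern so it can match <project_name>.<dataset_name>.

    Naive assumption: a pattern "lacks a period" when it contains no "."
    character at all.
    """
    if "." in pattern:
        return pattern
    return r".*\." + pattern


converted = to_fully_qualified_pattern("sales_data")
is_match = re.match(converted, "my-project.sales_data") is not None

# A regex such as "sales_.*" already contains a period character, so this
# naive check leaves it untouched. Patterns like that are the ones worth
# converting to the fully qualified format by hand.
untouched = to_fully_qualified_pattern("sales_.*")
still_matches = re.match(untouched, "my-project.sales_data") is not None
```

In this sketch the simple pattern keeps working after conversion, while the untouched regex no longer matches the fully qualified name.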

Potential Downtime

Deprecations

Other Notable Changes

Session token configuration has changed, all previously created session tokens will be invalid and users will be prompted to log in. Expiration time has also been shortened which may result in more login prompts with the default settings. There should be no other interruption due to this change.

0.11.0

Breaking Changes

Potential Downtime

#8611 Search improvements require reindexing. A system-update job will run which will set indices to read-only and create a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure, this process can take from 5 minutes to several hours; as a rough estimate, allow 1 hour for every 2.3 million entities.
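
The rough estimate above can be turned into a quick calculation. This is a back-of-the-envelope sketch only: the function name is hypothetical, and treating 5 minutes as a floor is an assumption drawn from the range quoted in the note.

```python
ENTITIES_PER_HOUR = 2_300_000  # rough rate quoted in the note above
MINIMUM_MINUTES = 5.0          # assumed floor, from "5 minutes to hours"


def estimate_reindex_minutes(entity_count: int) -> float:
    """Estimate reindex duration in minutes for a given number of entities."""
    minutes = entity_count / ENTITIES_PER_HOUR * 60
    return max(MINIMUM_MINUTES, minutes)


print(estimate_reindex_minutes(4_600_000))  # 120.0
print(estimate_reindex_minutes(10_000))     # 5.0
```

Actual duration depends on index sizes and infrastructure, so treat the result as a planning figure rather than a guarantee.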

Deprecations

#8525: In the LDAP ingestor, manager_pagination_enabled has been replaced by the more general pagination_enabled.
MAE Events are no longer produced. MAE events have been deprecated for over a year.

Other Notable Changes

In this release we now enable you to create and delete pinned announcements on your DataHub homepage! If you have the “Manage Home Page Posts” platform privilege you’ll see a new section in settings called “Home Page Posts” where you can create and delete text posts and link posts that your users see on the home page.
The new search and browse experience, which was first made available in the previous release behind a feature flag, is now on by default. Check out our release notes for v0.10.5 to get more information and documentation on this new Browse experience.
In addition to the ranking changes mentioned above, this release includes changes to the highlighting of search entities to understand why they match your query. You can also sort your results alphabetically or by last updated times, in addition to relevance. In this release, we suggest a correction if your query has a typo in it.
#8300: The ClickHouse source now inherits from TwoTierSQLAlchemy. The old hierarchy was platform_instance -> container -> co container db (None) -> container schema; the new hierarchy is platform_instance -> container database.
#8300: Added the uri_opts argument, which allows passing arbitrary options to the ClickHouse client.
#8659: BigQuery ingestion no longer creates DataPlatformInstance aspects by default. This will only affect users that were depending on this aspect for custom functionality, and can be enabled via the include_data_platform_instance config option.
OpenAPI entity and aspect endpoints expanded to improve developer experience when using this API with additional aspects to be added in the near future.
The CLI now supports recursive deletes.
Batching of default aspects on initial ingestion (SQL)
Improvements to multi-threading. Ingestion recipes, if previously reduced to 1 thread, can be restored to the 15 thread default.
Gradle 7 upgrade moderately improves build speed
DataHub Ingestion slim images reduced in size by 2GB+
Glue Schema Registry fixed

0.10.5

Breaking Changes

#8201: Python SDK: In the DataFlow class, the cluster argument is deprecated in favor of env.
#8263: The Okta source config option okta_profile_to_username_attr default changed from login to email. This determines which Okta profile attribute is used for the corresponding DataHub user and thus may change which DataHub users are generated by the Okta source. In a follow-up, okta_profile_to_username_regex has been set to .*, which, taken together with the previous change, brings the defaults in line with OIDC.
#8331: For all sql-based sources that support profiling, you can no longer specify profile_table_level_only together with include_field_xyz config options to ingest certain column-level metrics. Instead, set profile_table_level_only to false and individually enable / disable desired field metrics.
#8451: The bigquery-beta and snowflake-beta source aliases have been dropped. Use bigquery and snowflake as the source type instead.
#8472: Ingestion runs created with Pipeline.create will show up in the DataHub ingestion tab as CLI-based runs. To revert to the previous behavior of not showing these runs in DataHub, pass no_default_report=True.
#8513: The snowflake connector will use the user's email attribute as-is in the urn. To revert to the previous behavior, disable email_as_user_identifier in the recipe.

Potential Downtime

The BrowsePathsV2 upgrade will now be handled by the system-update job in non-blocking mode. This process generates data needed for the new search and browse feature. It must complete before enabling the new search and browse UI; while the upgrade is running, entities will be missing from the UI. If you are not using the new search and browse UI, there will be no impact and the update will complete in the background.

Deprecations

#8198: In the Python SDK, the PlatformKey class has been renamed to ContainerKey.

Other Notable Changes

0.10.5 introduces the new Unified Search & Browse experience, which is disabled by default. You can control whether you want to see just the new search filtering experience, the new search and browse experience together, or keep the existing search and browse experiences by toggling the two environment variable feature flags SHOW_SEARCH_FILTERS_V2 and SHOW_BROWSE_V2 in your GMS container.

Upgrade Considerations:

With the release of Browse V2, we have created a job to run in GMS that will backfill your existing data with new browsePathsV2 aspects. This job loops over entity types that need a browsePathsV2 aspect (Dataset, Dashboard, Chart, DataJob, DataFlow, MLModel, MLModelGroup, MLFeatureTable, and MLFeature) and generates one for them. For entities that may have Container parents (Datasets and Dashboards) we will try to fetch their parent containers in order to generate this new aspect. For those deployments with large amounts of data, consider whether running this upgrade job makes sense as it may be a heavy operation and take some time to complete. If you wish to skip this job, simply set the BACKFILL_BROWSE_PATHS_V2 environment variable flag to false in your GMS container. Without this backfill job, though, you will need to rely on the newest ingestion CLI to create these browsePathsV2 aspects when running ingestion; otherwise your browse sidebar will be out of sync.
Since the new browse experience replaces the old, consider whether having the SHOW_BROWSE_V2 environment variable feature flag on is the right decision for your organization. If you’re creating custom browse paths with the browsePaths aspect, you can continue to do the same with the new experience; however, you will have to generate browsePathsV2 aspects instead, which are documented here.

0.10.4

Breaking Changes

Potential Downtime

Deprecations

#8045: With the introduction of custom ownership types, the Owner aspect has been updated where the type field is deprecated in favor of a new field typeUrn. This latter field is an urn reference to the new OwnershipType entity. GraphQL endpoints have been updated to use the new field. For pre-existing ownership aspect records, DataHub now has logic to map the old field to the new field.

Other notable Changes

#8191: Updates GMS's health check endpoint to account for its dependency on external components, notably (at this time) Elasticsearch. This means that DataHub operators can now use GMS health status more reliably.

0.10.3

Breaking Changes

#7900: The catalog_pattern and schema_pattern options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using regex ^ in your patterns, this should not affect you.
#7942: Renaming the containerPath aspect to browsePathsV2. This means any data with the aspect name containerPath will be invalid. We had not exposed this in the UI or used it anywhere, but it was a model we recently merged to open up other work. This should not affect many people if anyone at all unless you were manually creating containerPath data through ingestion on your instance.
#8068: In the datahub delete CLI, if an --entity-type filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of dataset.
#8068: In the datahub delete CLI, the --start-time and --end-time parameters are not required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use --start-time min --end-time max.

Potential Downtime

Deprecations

The signature of Source.get_workunits() is changed from Iterable[WorkUnit] to the more restrictive Iterable[MetadataWorkUnit].
Legacy usage creation via the UsageAggregation aspect, /usageStats?action=batchIngest GMS endpoint, and UsageStatsWorkUnit metadata-ingestion class are all deprecated.

Other notable Changes

0.10.2

Breaking Changes

#7016 Added the add_database_name_to_urn flag to the Oracle source, which ensures that Dataset urns have the DB name as a prefix to prevent collisions (e.g. {database}.{schema}.{table}). This is ONLY breaking if you set this flag to true; otherwise behavior remains the same.
The Airflow plugin no longer includes the DataHub Kafka emitter by default. Use pip install acryl-datahub-airflow-plugin[datahub-kafka] for Kafka support.
The Airflow lineage backend no longer includes the DataHub Kafka emitter by default. Use pip install acryl-datahub[airflow,datahub-kafka] for Kafka support.
Java SDK PatchBuilders have been modified in a backwards incompatible way to align more with the Python SDK and support more use cases. Any application utilizing the Java SDK for patch building may be affected on upgrading this dependency.

Deprecations

The docker image and script for updating from Elasticsearch 6 to 7 is no longer being maintained and will be removed from the /contrib section of the repository. Please refer to older releases if needed.

0.10.0

Breaking Changes

#7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics in the kafka-setup docker image have been updated to be in line with other DataHub components; for more info see our docs on Configuring Kafka in DataHub. They were previously suffixed with _TOPIC, whereas the correct suffix is now _TOPIC_NAME. This change should not affect any user who is using default Kafka names.
#6906 The Redshift source has been reworked and now also includes usage capabilities. The old Redshift source was renamed to redshift-legacy. The redshift-usage source has also been renamed to redshift-usage-legacy and will be removed in the future.
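
For the kafka-setup suffix change (#7103, above), a small script can help spot overrides that still use the old suffix. The sketch below is a hedged illustration: rename_topic_env_vars is a hypothetical helper and the variable names in the example are placeholders, not real DataHub settings. Check the Configuring Kafka in DataHub docs for the actual names before applying anything.

```python
from typing import Dict


def rename_topic_env_vars(env: Dict[str, str]) -> Dict[str, str]:
    """Rename keys ending in _TOPIC so they end in _TOPIC_NAME; keep the rest."""
    renamed: Dict[str, str] = {}
    for key, value in env.items():
        if key.endswith("_TOPIC"):
            key = key + "_NAME"
        renamed[key] = value
    return renamed


# Placeholder variable names, for illustration only.
old_env = {
    "EXAMPLE_EVENT_TOPIC": "custom_topic_v1",
    "EXAMPLE_OTHER_SETTING": "unchanged",
}
new_env = rename_topic_env_vars(old_env)
print(new_env)
```

A blanket rename like this is only a starting point; review each renamed key against the documented variable list.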

Potential Downtime

#6894 Search improvements require reindexing. A system-update job will run which will set indices to read-only and create a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure, this process can take from 5 minutes to several hours; as a rough estimate, allow 1 hour for every 2.3 million entities.

Helm Notes

Helm without --atomic: The default timeout for an upgrade command is 5 minutes. If the reindex takes longer (depending on data size) it will continue to run in the background even though helm will report a failure. Allow this job to finish and then re-run the helm upgrade command.

Helm with --atomic: In general, it is recommended to not use the --atomic setting for this particular upgrade since the system update job will be terminated before completion. If --atomic is preferred, then increase the timeout using the --timeout flag to account for the reindexing time (see note above for estimating this value).
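
If --atomic is used, the --timeout value has to cover the whole reindex. The sketch below builds such a value from an estimated duration in minutes. The helper name and the safety_factor parameter are assumptions for illustration; the 5-minute floor mirrors Helm's default timeout mentioned above.

```python
import math


def helm_timeout_flag(estimated_minutes: float, safety_factor: float = 1.5) -> str:
    """Build a --timeout value in whole minutes, never below the 5-minute default."""
    minutes = max(5, math.ceil(estimated_minutes * safety_factor))
    return f"--timeout {minutes}m"


flag = helm_timeout_flag(120)
print(flag)  # --timeout 180m
```

The resulting string can be appended to the helm upgrade command; choose the safety factor based on how variable your infrastructure is.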

Deprecations

0.9.6

Breaking Changes

#6742 The metadata file sink's output format no longer contains nested JSON strings for MCP aspects, but instead unpacks the stringified JSON into a real JSON object. The previous sink behavior can be recovered using the legacy_nested_json_string option. The file source is backwards compatible and supports both formats.
#6901 The env and database_alias fields have been marked deprecated across all sources. We recommend using platform_instance where possible instead.
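
To illustrate the file sink format change (#6742, above), the sketch below converts a record whose aspect is a stringified JSON value into one holding a real JSON object. The key names "value" and "json", the helper name, and the sample record are assumptions for this example; compare against a file produced by your own sink before relying on them.

```python
import json
from typing import Any, Dict


def unpack_legacy_aspect(mcp: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a stringified aspect into a real JSON object (hypothetical helper)."""
    aspect = mcp["aspect"]
    if isinstance(aspect.get("value"), str):
        unpacked = dict(mcp)
        unpacked["aspect"] = {"json": json.loads(aspect["value"])}
        return unpacked
    return mcp  # already in the unpacked form


# Sample record with assumed key names.
legacy_record = {
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)",
    "aspectName": "status",
    "aspect": {
        "value": json.dumps({"removed": False}),
        "contentType": "application/json",
    },
}
new_record = unpack_legacy_aspect(legacy_record)
print(new_record["aspect"])  # {'json': {'removed': False}}
```

Since the file source supports both formats, a conversion like this is only needed for downstream tools that parse the files directly.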

Potential Downtime

Deprecations

#6851 - Sources bigquery-legacy and bigquery-usage-legacy have been removed

Other notable Changes

If anyone faces issues with login please clear your cookies. Some security updates are part of this release. That may cause login issues until cookies are cleared.

0.9.4 / 0.9.5

Breaking Changes

#6243 The apache-ranger authorizer is no longer a core part of DataHub GMS and has been moved to a plugin. Please refer to the updated documentation, Configuring Authorization with Apache Ranger, for configuring apache-ranger-plugin in DataHub GMS.
#6243 The apache-ranger authorizer as a plugin is not supported in DataHub Kubernetes deployments.
#6243 Authentication and Authorization plugin configurations have been removed from application.yaml. Refer to the documentation, Migration Of Plugins From application.yaml, for migrating any existing custom plugins.
The datahub check graph-consistency command has been removed. It was a beta API that we had considered, but we decided there are better solutions for this.
The graphql_url option of the powerbi-report-server source is deprecated, as the option is not used.
#6789 BigQuery ingestion: If enable_legacy_sharded_table_support is set to False, sharded table names will be suffixed with _yyyymmdd to make sure they don't clash with non-sharded tables. This means if stateful ingestion is enabled then old sharded tables will be recreated with a new id and attached tags/glossary terms/etc will need to be added again. This behavior is not enabled by default yet, but will be enabled by default in a future release.

Potential Downtime

Deprecations

Other notable Changes

#6611 - Snowflake schema_pattern now accepts patterns for fully qualified schema names in the format <catalog_name>.<schema_name> when the config match_fully_qualified_names: True is set. The current default, match_fully_qualified_names: False, exists only to maintain backward compatibility. The config option match_fully_qualified_names will be deprecated in the future and the default behavior will assume match_fully_qualified_names: True.
#6636 - Sources snowflake-legacy and snowflake-usage-legacy have been removed.

0.9.3

Breaking Changes

The beta datahub check graph-consistency command has been removed.

Potential Downtime

Deprecations

PowerBI source: workspace_id_pattern is introduced in place of workspace_id. workspace_id is now deprecated and set for removal in a future version.

Other notable Changes

0.9.2

LookML source will only emit views that are reachable from explores while scanning your git repo. Previous behavior can be achieved by setting emit_reachable_views_only to False.
LookML source will always lowercase urns for lineage edges from views to upstream tables. There is no fallback provided to previous behavior because it was inconsistent in application of lower-casing earlier.
dbt config node_type_pattern which was previously deprecated has been removed. Use entities_enabled instead to control whether to emit metadata for sources, models, seeds, tests, etc.
The dbt source will always lowercase urns for lineage edges to the underlying data platform.
The DataHub Airflow lineage backend and plugin no longer support Airflow 1.x. You can still run DataHub ingestion in Airflow 1.x using the PythonVirtualenvOperator.

Breaking Changes

#6570 The snowflake connector now populates created and last modified timestamps for snowflake datasets and containers. This version of the snowflake connector will not work with datahub-gms versions older than v0.9.3.

Potential Downtime

Deprecations

Other notable Changes

0.9.1

Breaking Changes

We have promoted bigquery-beta to bigquery. If you are using bigquery-beta then change your recipes to use the type bigquery.

Potential Downtime

Deprecations

Other notable Changes

0.9.0

Breaking Changes

Java version 11 or greater is required.
For any of the GraphQL search queries, the input no longer supports value but instead now accepts a list of values. These values represent an OR relationship where the field value must match any of the values.
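
The move from value to values can be sketched as follows. Both helpers and the sample field name are hypothetical; they only illustrate the shape change and the OR semantics described above, not the GraphQL schema itself.

```python
from typing import Any, Dict


def upgrade_filter(old_filter: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a single-value filter into the list-based form."""
    new_filter = {k: v for k, v in old_filter.items() if k != "value"}
    if "value" in old_filter:
        new_filter["values"] = [old_filter["value"]]
    return new_filter


def matches(field_value: str, search_filter: Dict[str, Any]) -> bool:
    """OR semantics: the field value must equal any one of the listed values."""
    return field_value in search_filter.get("values", [])


# Hypothetical field name and values.
upgraded = upgrade_filter({"field": "platform", "value": "snowflake"})
either = {"field": "platform", "values": ["snowflake", "bigquery"]}
print(upgraded)                      # {'field': 'platform', 'values': ['snowflake']}
print(matches("bigquery", either))   # True
print(matches("redshift", either))   # False
```

A single-element list preserves the old behavior, so existing queries can be migrated mechanically.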

Potential Downtime

Deprecations

Other notable Changes

`v0.8.45`

Breaking Changes

The getNativeUserInviteToken and createNativeUserInviteToken GraphQL endpoints have been renamed to getInviteToken and createInviteToken respectively. Additionally, both now accept an optional roleUrn parameter. Both endpoints also now require the MANAGE_POLICIES privilege to execute, rather than MANAGE_USER_CREDENTIALS privilege.
One of the default policies shipped with DataHub (urn:li:dataHubPolicy:7, or All Users - All Platform Privileges) has been edited to no longer include MANAGE_POLICIES. Its name has consequently been changed to All Users - All Platform Privileges (EXCEPT MANAGE POLICIES). This change was made to prevent all users from effectively acting as superusers by default.

Potential Downtime

Deprecations

Other notable Changes

`v0.8.44`

Breaking Changes

Browse Paths have been upgraded to a new format to align more closely with the intention of the feature. Learn more about the changes, including steps on upgrading, here: https://docs.datahub.com/docs/advanced/browse-paths-upgrade
The dbt ingestion source's disable_dbt_node_creation and load_schema options have been removed. They were no longer necessary due to the recently added sibling entities functionality.
The snowflake source now uses a newer, faster implementation (previously snowflake-beta). The config properties provision_role and check_role_grants are not supported. The older snowflake and snowflake-usage sources are available as snowflake-legacy and snowflake-usage-legacy respectively.

Potential Downtime

[Helm] If you're using Helm, please ensure that your version of the datahub-actions container is bumped to v0.0.7 or head. This version contains changes to support running ingestion in debug mode. Previous versions are not compatible with this release. Upgrading to helm chart version 0.2.103 will ensure that you have the compatible versions by default.

Deprecations

Other notable Changes

`v0.8.42`

Breaking Changes

Python 3.6 is no longer supported for metadata ingestion
#5451 GMS_HOST and GMS_PORT environment variables deprecated in v0.8.39 have been removed. Use DATAHUB_GMS_HOST and DATAHUB_GMS_PORT instead.
#5478 The DataHub CLI delete command, when used with the --hard option, will delete soft-deleted entities which match the other filters given.
#5471 Looker now populates userEmail in dashboard user usage stats. This version of the looker connector will not work with older versions of datahub-gms if you have the extract_usage_history looker config enabled.
#5529 - ANALYTICS_ENABLED environment variable in datahub-gms is now deprecated. Use DATAHUB_ANALYTICS_ENABLED instead.
#5485 --include-removed option was removed from delete CLI

Potential Downtime

Deprecations

Other notable Changes

`v0.8.41`

Breaking Changes

The should_overwrite flag in csv-enricher has been replaced with write_semantics to match the format used for other sources. See the documentation for more details
To close an authorization hole in creating tags, a Platform Privilege called Create Tags has been added for creating tags. This is assigned to the datahub root user, along with the default All Users policy. Notice: You may need to add this privilege (or Manage Tags) to existing users that need the ability to create tags on the platform.
#5329 The following profiling config parameters are now supported in BigQuery:
- profiling.profile_if_updated_since_days (default=1)
- profiling.profile_table_size_limit (default=1GB)
- profiling.profile_table_row_limit (default=50000)
Set the above parameters to null if you want the older behaviour.
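
One way to read the three parameters above is as gates that a table must pass before it is profiled. The sketch below encodes that reading; it is an assumption based on the parameter names and defaults, not the connector's implementation, and the exact boundary handling (for example, whether a table of exactly 1 GB is profiled) is a guess.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProfilingLimits:
    # Defaults taken from the list above; None disables a gate (older behaviour).
    profile_if_updated_since_days: Optional[float] = 1
    profile_table_size_limit: Optional[float] = 1  # in GB
    profile_table_row_limit: Optional[int] = 50000


def should_profile(
    last_modified: datetime,
    size_gb: float,
    row_count: int,
    limits: ProfilingLimits,
    now: datetime,
) -> bool:
    """Return True when the table passes every enabled gate (assumed semantics)."""
    if limits.profile_if_updated_since_days is not None:
        cutoff = now - timedelta(days=limits.profile_if_updated_since_days)
        if last_modified < cutoff:
            return False
    if limits.profile_table_size_limit is not None and size_gb > limits.profile_table_size_limit:
        return False
    if limits.profile_table_row_limit is not None and row_count > limits.profile_table_row_limit:
        return False
    return True


now = datetime(2022, 7, 1, 12, 0)
defaults = ProfilingLimits()
no_limits = ProfilingLimits(None, None, None)

fresh_small = should_profile(datetime(2022, 7, 1, 0, 0), 0.5, 10_000, defaults, now)
stale_small = should_profile(datetime(2022, 6, 1), 0.5, 10_000, defaults, now)
stale_unlimited = should_profile(datetime(2022, 6, 1), 0.5, 10_000, no_limits, now)
```

Under these assumptions, a table that stopped being profiled after the upgrade is most likely stale, too large, or too long for the new defaults.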

Potential Downtime

Deprecations

Other notable Changes

`v0.8.40`

Breaking Changes

#5240 lineage_client_project_id in bigquery source is removed. Use storage_project_id instead.

Potential Downtime

Deprecations

Other notable Changes

`v0.8.39`

Breaking Changes

Refactored the health field of the Dataset GraphQL Type to be of type list of HealthStatus (was type HealthStatus). See this PR for more details.

Potential Downtime

Deprecations

#4875 Lookml view file contents will no longer be populated in custom_properties, instead view definitions will be always available in the View Definitions tab.
#5208 GMS_HOST and GMS_PORT environment variables being set in various containers are deprecated in favour of DATAHUB_GMS_HOST and DATAHUB_GMS_PORT.
KAFKA_TOPIC_NAME environment variable in datahub-mae-consumer and datahub-gms is now deprecated. Use METADATA_AUDIT_EVENT_NAME instead.
KAFKA_MCE_TOPIC_NAME environment variable in datahub-mce-consumer and datahub-gms is now deprecated. Use METADATA_CHANGE_EVENT_NAME instead.
KAFKA_FMCE_TOPIC_NAME environment variable in datahub-mce-consumer and datahub-gms is now deprecated. Use FAILED_METADATA_CHANGE_EVENT_NAME instead.
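
The renames listed above can be checked mechanically. The following sketch scans a mapping of environment variables and reports any deprecated names together with their replacements. The mapping comes from the entries above; the helper name and the sample environment are hypothetical.

```python
from typing import Dict, List, Tuple

# Deprecated name -> replacement, as listed in the entries above.
DEPRECATED_ENV_VARS = {
    "GMS_HOST": "DATAHUB_GMS_HOST",
    "GMS_PORT": "DATAHUB_GMS_PORT",
    "KAFKA_TOPIC_NAME": "METADATA_AUDIT_EVENT_NAME",
    "KAFKA_MCE_TOPIC_NAME": "METADATA_CHANGE_EVENT_NAME",
    "KAFKA_FMCE_TOPIC_NAME": "FAILED_METADATA_CHANGE_EVENT_NAME",
}


def find_deprecated(env: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (deprecated_name, replacement) pairs for variables set in env."""
    return [(old, new) for old, new in DEPRECATED_ENV_VARS.items() if old in env]


# Sample environment, for illustration only.
sample_env = {"GMS_HOST": "datahub-gms", "DATAHUB_GMS_PORT": "8080"}
findings = find_deprecated(sample_env)
print(findings)  # [('GMS_HOST', 'DATAHUB_GMS_HOST')]
```

To check a running container, pass dict(os.environ) instead of the sample mapping.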

Other notable Changes

#5132 The snowflake source now profiles tables only if they have been updated within the configured number of days (default: 1). Update the config profiling.profile_if_updated_since_days as per your profiling schedule, or set it to None if you want the older behaviour.

`v0.8.38`

Breaking Changes

Potential Downtime

Deprecations

Other notable Changes

Create & Revoke Access Tokens via the UI
Create and Manage new users via the UI
Improvements to Business Glossary UI
FIX - Do not require reindexing to migrate to using the UI business glossary

`v0.8.36`

Breaking Changes

In this release we introduce a brand new Business Glossary experience. With this new experience comes some new ways of indexing data in order to make viewing and traversing the different levels of your Glossary possible. Therefore, you will have to restore your indices in order for the new Glossary experience to work for users that already have existing Glossaries. If this is your first time using DataHub Glossaries, you're all set!

Potential Downtime

Deprecations

Other notable Changes

#4961 Dropped profiling is not reported by default as that caused a lot of spurious logging in some cases. Set profiling.report_dropped_profiles to True if you want older behaviour.

`v0.8.35`

Breaking Changes

Potential Downtime

Deprecations

#4875 Lookml view file contents will no longer be populated in custom_properties, instead view definitions will be always available in the View Definitions tab.

Other notable Changes

`v0.8.34`

Breaking Changes

#4644 Remove database option from snowflake source which was deprecated since v0.8.5
#4595 Rename confusing config report_upstream_lineage to upstream_lineage_in_report in snowflake connector which was added in 0.8.32

Potential Downtime

Deprecations

#4644 host_port option of snowflake and snowflake-usage sources deprecated as the name was confusing. Use account_id option instead.

Other notable Changes

#4760 check_role_grants option was added in snowflake to disable checking roles in snowflake as some people were reporting long run times when checking roles.
