# Updating DataHub
This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.
- When the Oracle source connects via `service_name` to a multitenant Oracle database, the database name used in URNs will now reflect the Pluggable Database (PDB) name instead of the Container Database (CDB) name. In Oracle Multitenant architecture, a CDB is the top-level container (e.g. `cdb`) and a PDB is an individual tenant database within it (e.g. `mypdb`); `service_name` typically routes to the PDB, so the PDB name is the correct identifier for your datasets. This affects both dataset URNs (when `add_database_name_to_urn: true`) and database/schema container URNs (always, since containers always include the database name). If your existing metadata was ingested with the old CDB-based URNs, re-ingesting will create new entities under the corrected URNs. To preserve the old URN shape and avoid re-creating entities, set `urn_db_name` explicitly in your recipe to match your previous CDB name.
- … (`boot/retention.yaml`) if you need version history for any entity/aspect.
- … is still published as `datahub-client-java8` for backward compatibility, but now requires Java 17+ at runtime. This change also includes: …
- Pipeline entities now use the pipeline's logical name (`pipelineInfo.name`, e.g. the `@pipeline(name="...")` argument in Kubeflow Pipelines) as the stable identifier; non-Kubeflow pipelines fall back to `display_name` with any timestamp suffix stripped. After upgrading, existing pipeline entities will appear as separate entities from new ingestion runs. To clean up old entities, enable stateful ingestion with stale entity removal.
- The Vertex AI connector now uses additional permissions: `aiplatform.metadataStores.get`, `aiplatform.metadataStores.list`, `aiplatform.executions.get`, and `aiplatform.executions.list`. If these permissions are missing, the connector gracefully falls back with warnings. To disable these features, set `use_ml_metadata_for_lineage: false`, `extract_execution_metrics: false`, and `include_evaluations: false`.
- Previously, `exclude_aspects` contained `dataHubIngestionSourceInfo`, `dataHubSecretValue`, `dataHubExecutionRequestInput`, and `globalSettingsInfo`. Now `urn_pattern.deny` contains `urn:li:dataHubIngestionSource:.*`, `urn:li:dataHubSecret:.*`, `urn:li:globalSettings:.*`, and `urn:li:dataHubExecutionRequest:.*`, excluding entire entity types instead of individual aspects. If you set `urn_pattern` or `exclude_aspects` in your recipe, configure them carefully based on your requirements to avoid syncing sensitive data or creating invalid entities. We recommend keeping the new defaults.
- Debezium SQL Server connectors now map to the platform `mssql` instead of `sqlserver`, matching DataHub's canonical platform name. This fixes column-level lineage when using `use_schema_resolver: true`, as the SchemaResolver now correctly queries for `platform=mssql`. If you have `platform_instance_map` or `connect_to_platform_map` entries keyed on "sqlserver" for Debezium SQL Server connectors, update them to use "mssql" instead.
- Default token service settings are defined in `metadata-service/configuration/src/main/resources/application.yaml`. It is recommended to set `authentication.tokenService.signingKey` (or the env var `DATAHUB_TOKEN_SERVICE_SIGNING_KEY`) and `authentication.tokenService.salt` (or the env var `DATAHUB_TOKEN_SERVICE_SALT`) before starting DataHub. Refer to the linked pages to see how this is handled for local development and the CLI quickstart.
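  As a minimal sketch, the YAML below shows the corresponding `application.yaml` override for the configuration paths named above; the secret values are illustrative placeholders, and the environment variables can be used instead.

  ```yml
  # Sketch only: override DataHub's default token-service secrets before startup.
  # Values are placeholders; alternatively export DATAHUB_TOKEN_SERVICE_SIGNING_KEY
  # and DATAHUB_TOKEN_SERVICE_SALT as environment variables.
  authentication:
    tokenService:
      signingKey: "<replace-with-a-random-secret>"
      salt: "<replace-with-a-random-salt>"
  ```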
- The `emit_mcps()` method on `DataHubRestEmitter` now returns `List[TraceData]` instead of `int`. Previously it returned the number of chunks/batches sent; now it returns a list of `TraceData` objects (one per batch) containing trace IDs for debugging and status checking. To get the previous chunk count, use `len(result)` on the returned list. Additionally, `emit_mcp()` now returns `Optional[TraceData]` instead of `None`.
- The `region` configuration field is deprecated in favor of `regions` (list type). The `region` field continues to work for backward compatibility.
- `normalize_external_dataset_paths` defaults to `false`, meaning partitioned paths like `gs://bucket/data/year=2024/month=01/` create separate dataset entities. In the next major version, this will default to `true`, normalizing paths to `gs://bucket/data/` for stable dataset URNs with lineage aggregation across partitions. To opt in to the new behavior now, set `normalize_external_dataset_paths: true` in your configuration.
- The SSO redirect cookie (`REDIRECT_URL`) format has changed. It now stores a plain Base64-encoded URL string instead of the previous serialized format. Any in-flight `REDIRECT_URL` cookies set before the upgrade will fail to decode; affected users will be redirected to the DataHub home page (`/`) after their next SSO login instead of the page they originally requested. No action is required; users simply log in again and the new cookie format takes effect automatically.
- When `platform_instance` is configured, DataFlow and DataJob entities now receive a `browsePathsV2` aspect with the platform instance as the root path. Previously, these entities had no browse path from ingestion and the backend would place them in a generic "Default" folder, causing entities from multiple platform instances to be mixed together. This affects sources like Fivetran, Glue, and Kafka-Connect that emit DataFlow/DataJob entities with `platform_instance`. Sources without `platform_instance` configured are unaffected.
- metadata-ingestion has migrated from `setup.py` to `pyproject.toml` (PEP 621). `setup.py` remains the source of truth for editing dependencies for now; `pyproject.toml` is auto-generated from it via `./gradlew :metadata-ingestion:updateLockFile`. Sync verification (`./gradlew :metadata-ingestion:verifyPyprojectSync`) runs automatically during `check` and `buildWheel`. `setup.py` will be deprecated in a future release in favor of `pyproject.toml` as the sole dependency source.
- Added a `convert_urns_to_lowercase` config option for dbt ingestion (both dbt-core and dbt-cloud). When enabled, dbt platform URNs are lowercased, preventing duplicate entities caused by schema name casing differences (e.g., `app_sales` vs `APP_SALES`). This is an opt-in flag (default: `false` for all platforms). Recommended for case-insensitive platforms like Snowflake or BigQuery where dbt manifests may contain mixed-case identifiers.
- … (`datahubSystemUpdate.scaleDown.useJavaImplementation`) and the job env vars described in Environment Variables. Outside Kubernetes, or when disabled, the step is a no-op.
- … `platform_instance_map` to ensure URNs match native connectors.
- … `max_training_jobs_per_type` controls which resources are processed and sent to DataHub (the most recently updated N items).
- Added a `platform_instance` configuration field for environments running multiple Vertex AI instances.
- … when `platform_instance` is explicitly configured.
- Oracle container names are now populated when using `service_name` instead of the `database` configuration. Previously, Oracle containers had no name (only an ID) when using `service_name` with `data_dictionary_mode: ALL` (the default). Container URNs will change for affected users, as the database name is now properly populated in the container GUID. If stateful ingestion is enabled, running ingestion with the latest CLI version will automatically clean up the old containers and create properly named ones. This fix also ensures database names are normalized consistently with schema and table names.
- Python 3.10+ is now required for `acryl-datahub`, `acryl-datahub-airflow-plugin`, `acryl-datahub-dagster-plugin`, `acryl-datahub-gx-plugin`, `prefect-datahub`, and `acryl-datahub-actions`. Upgrade to Python 3.10+ before upgrading these packages.
- If you previously used `airflow variables set datahub_airflow_plugin_disable_listener true` to disable the plugin, you must now use `export AIRFLOW_VAR_DATAHUB_AIRFLOW_PLUGIN_DISABLE_LISTENER=true` instead.
- TLS certificate verification is now enabled by default (`tls_verify: true`). This prevents man-in-the-middle attacks (CWE-295) but may break existing configurations using self-signed certificates. To restore the previous behavior, explicitly set `tls_verify: false` in your recipe.
- PowerBI: the `ownership.create_corp_user` default changed from `True` to `False`. Previously, PowerBI would create user entities by default, potentially overwriting existing user profiles from LDAP/Okta. Now, PowerBI emits ownership URNs only (soft references) by default. To restore the previous behavior, explicitly set `ownership.create_corp_user: true` in your recipe. Migration note: if upgrading and users appear with incomplete profiles, re-ingest from your authoritative source (LDAP/Okta/SCIM). Additionally, when `create_corp_user: true`, the connector now emits both `CorpUserKeyClass` and `CorpUserInfoClass` (previously only `CorpUserKeyClass`), providing complete user metadata. Stateful ingestion note: user entities created by PowerBI are now marked as non-primary (`is_primary_source=False`), so they will NOT be soft-deleted by stateful ingestion when they disappear. This prevents accidental deletion of users who may also exist in LDAP/Okta/SCIM.
- Grafana dataset URNs changed from `{ds_type}.{ds_uid}` to `{ds_type}.{ds_uid}.{dashboard_uid}.{panel_id}`, which means all existing Grafana dataset entities will have different URNs. If stateful ingestion is enabled, running ingestion with the latest CLI version will automatically clean up old entities and create new ones. Otherwise, we recommend soft deleting all Grafana datasets via the DataHub CLI: `datahub delete --platform grafana --soft`, and then re-ingesting with the latest CLI version.
- `SqlParsingBuilder` is removed; use `SqlParsingAggregator` instead.
- `browsePaths` aspect replaced with `browsePathsV2`.
- Dashboard entity's `dashboardInfo` aspect: the deprecated `charts` property is replaced with `chartEdges`.
- Container URNs will change for sources with `platform_instance` configured, as this PR now passes `platform_instance` to dataset containers (affecting GUID generation). The `env` parameter addition is harmless, as it is excluded from GUID calculation. Stateful ingestion will soft-delete old containers and create new ones on the next run. Dataset entities and their lineage are unaffected.
- For Oracle users with `service_name`, or `database` without `add_database_name_to_urn: true`, stored procedure DataJob URNs will change from `database.schema.stored_procedures` to `schema.stored_procedures`. This fixes a URN mismatch that prevented stored procedure lineage from working. Stateful ingestion will soft-delete old stored procedure entities and create new ones with correct lineage on the next run. Users with the `database` config parameter and `add_database_name_to_urn: true` are unaffected.
- The `--use-password` flag in the `datahub init` command is deprecated. Token generation is now automatically detected when both `--username` and `--password` are provided together. The flag continues to work for backward compatibility but will be removed in a future release.
- BigQuery: `dataset_pattern` filtering now applies earlier in the ingestion pipeline, reducing unnecessary API calls for datasets that will be filtered out.
- … set `DATAHUB_AUTO_INCREASE_PARTITIONS=true` to enable.
- Added an `--extra-env` option to the `datahub ingest deploy` command to pass environment variables as comma-separated KEY=VALUE pairs (e.g., `--extra-env "VAR1=value1,VAR2=value2"`). These are stored in the ingestion source configuration and made available to the executor at runtime.
- Stored procedures are now grouped under a `{schema}.stored_procedures` container (consistent with PostgreSQL, MySQL, and Snowflake), with individual subtypes to distinguish them.
- The Redshift lineage v1 implementation (`RedshiftLineageExtractor`) has been removed, as the lineage v2 implementation (`RedshiftSqlLineageV2`) has been the default for a while already. As a result, the `use_lineage_v2` config has also been removed, along with all lineage v1 references, and tests have been updated to the v2 implementation. This should not impact most users, as the change is isolated to the Redshift ingestion source.
- `acryl-datahub` now requires pydantic v2. Support for pydantic v1 has been dropped, and users must upgrade to pydantic v2 when using the DataHub Python package.
- This also affects the `iceberg` ingestion source. If it is run from the CLI and the DataHub CLI was installed with all extras (`acryl-datahub[all]`), pyiceberg had been kept at version 0.4.0 in such environments, just to satisfy the pydantic v1 restriction. Now, however, pyiceberg will be installed at the newest available version. While this is a breaking change in behaviour, versions >0.4.0 have been used for some time by Managed Ingestion.
- The pyiceberg catalog configuration options `profile_name`, `region_name`, `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token` were deprecated and removed in version 0.8.0. To check whether your configuration will work, consult https://py.iceberg.apache.org/configuration/#catalogs. Because of that, the pyiceberg dependency has been restricted to at least 0.8.0.
- `acryl-datahub-airflow-plugin` now requires specifying the appropriate installation extra based on your Airflow version, due to different OpenLineage dependencies:
  - `pip install 'acryl-datahub-airflow-plugin[airflow2]'`
  - `pip install 'acryl-datahub-airflow-plugin[airflow3]'`
  - `pip install 'acryl-datahub-airflow-plugin[airflow3]' 'pydantic>=2.11.8'`
- `acryl-datahub-airflow-plugin` now supports Apache Airflow 3.x while maintaining backward compatibility with Airflow 2.5+. Key changes include:
  - Use the environment variable `AIRFLOW_VAR_DATAHUB_AIRFLOW_PLUGIN_DISABLE_LISTENER=true` to disable the plugin (instead of setting the Airflow variable `datahub_airflow_plugin_disable_listener` to `true`). This change is required to comply with Airflow 3.x's strict database access restrictions during listener initialization.
  - Some configuration keys have been renamed in Airflow 3.x (e.g. `WEBSERVER__BASE_URL` → `API__BASE_URL`). The plugin automatically detects and uses the correct configuration for each version.
  - For migration details, see `metadata-ingestion-modules/airflow-plugin/AIRFLOW_3_MIGRATION.md` in the DataHub repository.
- Oracle: use `procedure_pattern` to filter stored procedures if needed. See the Oracle source documentation for permissions and configuration details.
- Tableau now enables `extract_lineage_from_unsupported_custom_sql_queries` by default. This improves the quality of extracted lineage by using DataHub's SQL parser in cases where the Tableau Catalog API fails to return lineage for Custom SQL queries.
- When `CDC_MCL_PROCESSING_ENABLED=true`, MCLs are generated from Debezium-captured database changes rather than directly from GMS. This provides stronger ordering guarantees and decoupled processing. Requires MySQL 5.7+ or PostgreSQL 10+ with replication enabled. See the CDC Configuration Guide for setup instructions.
- Usage extraction can now read from the `system.query.history` table for improved performance with large query volumes. The new `usage_data_source` configuration (default: `AUTO`) automatically uses system tables when `warehouse_id` is configured, and otherwise falls back to the REST API. This change is not breaking, as `AUTO` mode gracefully handles configurations without `warehouse_id` by using the existing REST API approach. Users can explicitly force system-tables mode by setting `usage_data_source: SYSTEM_TABLES` (requires SELECT permission on `system.query.history`) or continue using the REST API with `usage_data_source: API`.
- The following packages now require Python 3.9+:
  - `acryl-datahub` (DataHub CLI and SDK)
  - `acryl-datahub-actions`
  - `acryl-datahub-airflow-plugin`
  - `acryl-datahub-prefect-plugin`
  - `acryl-datahub-gx-plugin`
  - `acryl-datahub-dagster-plugin` (already required Python 3.9+)
- `acryl-datahub-airflow-plugin` has dropped support for Airflow versions less than 2.7.
- The v1 plugin in `acryl-datahub-airflow-plugin` has been removed. The v2 plugin has been the default for a while already, so this should not impact most users. Users who were explicitly setting `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true` will need to either upgrade or pin to an older version to continue using the v1 plugin.
- The `default_dialect` configuration parameter has been renamed to `override_dialect`. This also affects the Python SDK methods:
  - `DataHubGraph.parse_sql_lineage(default_dialect=...)` → `DataHubGraph.parse_sql_lineage(override_dialect=...)`
  - `LineageClient.add_lineage_via_sql(default_dialect=...)` → `LineageClient.add_lineage_via_sql(override_dialect=...)`
- `acryl-datahub-gx-plugin` now requires pydantic v2, which means the effective minimum supported version of GX is 0.17.15 (from Sept 2023).
- The `use_queries_v2` flag is now enabled by default for Snowflake and BigQuery ingestion. This improves the quality of lineage and the quantity of queries extracted.
- For the `bigquery` and `redshift` connectors, please update `schema_pattern` to match against the fully qualified schema name `<database_name>.<schema_name>` and set the config `match_fully_qualified_names: true`. The current default `match_fully_qualified_names: false` is only to maintain backward compatibility. The `match_fully_qualified_names` config option will be removed in the future, and the default behavior will then be that of `match_fully_qualified_names: true`.
- The `acryl-datahub-actions` package now requires pydantic v2, while it was previously compatible with both pydantic v1 and v2.
- Added `DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES` to control batch size limits when using the RestEmitter in ingestion. The default is 15MB, but it is configurable.
- `acryl-datahub-airflow-plugin` dropped support for Airflow 2.3 and 2.4.
- `async_flag` was removed from the REST emitter, replaced with emit mode `ASYNC`.
- … `use_queries_v2` with warehouse ingestion.
- #12673: Business Glossary ID generation has been modified to handle special characters and URL cleaning. When `enable_auto_id` is `false` (the default), IDs are now generated by cleaning the name (converting spaces to hyphens, and removing special characters except periods, which are used as path separators) while preserving case. This may result in different IDs being generated for terms with special characters (see the sketch below).
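  As a minimal sketch of where this default applies, the recipe below uses the business glossary source with the default ID behavior; the glossary file path is a placeholder. Under the cleaning rules described above, a term named "Customer Email (PII)" would now get an ID like `Customer-Email-PII` (spaces to hyphens, parentheses dropped, case preserved).

  ```yml
  # Sketch: business glossary recipe with the default ID-generation behavior.
  # The glossary file path is a placeholder.
  source:
    type: datahub-business-glossary
    config:
      file: ./business_glossary.yml
      enable_auto_id: false # default; IDs derived from cleaned term names
  ```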
- #12580: The OpenAPI source handled nesting incorrectly. This fix creates proper nested field paths; however, it will re-write the incorrect schemas of existing OpenAPI runs.
- #12408: The `platform` field in the `DataPlatformInstance` GraphQL type is removed. Clients need to retrieve the platform via the optional `dataPlatformInstance` field.
- #12671: The `priority` field of the Incident entity is changed from an integer to an enum. This field was previously completely unused in the UI and API, so this change should not affect existing deployments.
- #12716: Fixed the `platform_instance` being added twice to the URN. If you want the previous behavior back, you need to add your `platform_instance` twice (i.e. `plat.plat`).
- #12797: Previously, endpoints used in ASYNC mode would not immediately validate URNs and entity & aspect names. Starting with this release, even in ASYNC mode, invalid requests will be rejected with an HTTP 4xx code. This includes URNs, entity type names, and entity & aspect names.
- The `server_config` property on `DataHubRestEmitter` can throw an unknown-attribute error if `test_connection` is not called prior to directly accessing it, as the default empty-map initialization was removed. This is resolved in v1.1.0.
- … a `CorpUser` entity that has a newly introduced `CorpUserInfo#system` flag set to `true`.
- … Dashboard-to-Dashboard lineage within the `DashboardInfo` aspect. Mainly users of Sigma and PowerBI Apps ingestion may be affected by this adjustment. Consequently, a reindex will be automatically triggered during the system upgrade.
- … requires the `pointInTimeCreationEnabled` feature flag to be enabled and the `elasticSearch.implementation` configuration to be `elasticsearch`. This feature is not supported for OpenSearch at this time, and the parameter will not be respected without both of these set.
- The `sort` parameter on the generic list-entities endpoint for v3 is deprecated. It only supports a single string value, while the documentation indicated it supports a list of strings. This documentation error has been fixed, and the correct field, `sortCriteria`, which supports a list of strings, is now documented.
- `include_view_lineage` and `include_view_column_lineage` are removed from the Snowflake ingestion source. View and External Table DDL lineage will always be ingested when definitions are available.
- `include_view_lineage`, `include_view_column_lineage`, and `lineage_parse_view_ddl` are removed from the BigQuery ingestion source. View and Snapshot lineage will always be ingested when definitions are available.
- The Kafka source no longer ingests schemas from the schema registry as separate entities by default; set `ingest_schemas_as_entities` to `true` to ingest them.
- The `value` parameter was previously deprecated. Use of `value` instead of `values` is no longer supported and will be completely removed in the next major version.
- `SANDBOX` added as a `FabricType`. No rollbacks are allowed once metadata with this fabric type is added, without manual cleanup in the databases.
- `DatahubClientConfig`'s `server` field no longer defaults to `http://localhost:8080`. Be sure to set this explicitly.
- If `datahub_api` is explicitly passed to a stateful ingestion config provider, it will be used. We previously ignored it if the pipeline context also had a graph object.
- … the `datahub-gc` ingestion source.
- Removed the `sql_parser` configuration from the Redash source, as Redash now exclusively uses the sqlglot-based parser for lineage extraction.
- Removed the `datahub.utilities.sql_parser`, `datahub.utilities.sql_parser_base`, and `datahub.utilities.sql_lineage_parser_impl` modules, along with `SqlLineageSQLParser` and `DefaultSQLParser`. Use `create_lineage_sql_parsed_result` from the `datahub.sql_parsing.sqlglot_lineage` module instead.
- … the `datahub-gc` ingestion source.
- … `created` and `lastModified` auditstamps by default for input and output dataset edges. This should not have any user-observable impact (time-based lineage viz will still continue working based on observed time), but could break assumptions previously being made by clients.
- … `user.props` will need to be enabled before login in order to be granted access to DataHub.
- #12056: The DataHub Airflow plugin no longer supports Airflow 2.1 and Airflow 2.2.
- #11701: The Fivetran `sources_to_database` field is deprecated in favor of setting the database directly within `sources_to_platform_instance.<key>.database` (see the sketch below).
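  A minimal sketch of the new layout, assuming a hypothetical connector key `my_connector` and database `my_database`:

  ```yml
  # Sketch: per-connector database now lives under sources_to_platform_instance.
  # "my_connector" and "my_database" are placeholders.
  source:
    type: fivetran
    config:
      sources_to_platform_instance:
        my_connector:
          database: my_database
  ```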
- #11560 - The PowerBI ingestion source configuration option `include_workspace_name_in_dataset_urn` determines whether the workspace name is included in the PowerBI dataset's URN. PowerBI allows identical names for semantic models and their tables across workspaces, so in multi-workspace ingestion one semantic model can overwrite another.
  Entity URN with `include_workspace_name_in_dataset_urn: false`:
  `urn:li:dataset:(urn:li:dataPlatform:powerbi,[<PlatformInstance>.]<SemanticModelName>.<TableName>,<ENV>)`
  Entity URN with `include_workspace_name_in_dataset_urn: true`:
  `urn:li:dataset:(urn:li:dataPlatform:powerbi,[<PlatformInstance>.]<WorkspaceName>.<SemanticModelName>.<TableName>,<ENV>)`
  The `include_workspace_name_in_dataset_urn` config defaults to `false` for backward compatibility; however, we recommend enabling this flag after performing the necessary cleanup.
  If stateful ingestion is enabled, running ingestion with the latest CLI version will handle the cleanup automatically. Otherwise, we recommend soft deleting all PowerBI data via the DataHub CLI:
  `datahub delete --platform powerbi --soft`, and then re-ingesting with the latest CLI version, ensuring the `include_workspace_name_in_dataset_urn` configuration is set to `true` (see the sketch below).
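  For reference, a minimal sketch of a PowerBI recipe opting into the recommended setting:

  ```yml
  # Sketch: include the workspace name in dataset URNs going forward.
  source:
    type: powerbi
    config:
      include_workspace_name_in_dataset_urn: true
  ```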
- `use_powerbi_email` is now enabled by default when extracting ownership information.
- #9857 (#10773): The `lower` method was removed from `get_db_name` of the `SQLAlchemySource` class. This change will affect the URNs of all entities related to `SQLAlchemySource`.
  Old URN, where the database name `DemoData` was lowercased:
  `urn:li:dataJob:(urn:li:dataFlow:(mssql,demodata.Foo.stored_procedures,PROD),Proc.With.SpecialChar)`
  New URN, where the database name's casing is preserved:
  `urn:li:dataJob:(urn:li:dataFlow:(mssql,DemoData.Foo.stored_procedures,PROD),Proc.With.SpecialChar)`
  Re-running with stateful ingestion should automatically clean up the entities with old URNs and add entities with new URNs, thereby not duplicating the containers or jobs.
- #11313 - `datahub get` will no longer return a key aspect for entities that don't exist.
- #11369 - The default `datahub-rest` sink mode has been changed to `ASYNC_BATCH`. This requires a server with version 0.14.0+ (see the sketch below).
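  If you need to pin a non-default mode explicitly (for example against an older server), a minimal sketch; the sink's `mode` option accepts `SYNC`, `ASYNC`, and `ASYNC_BATCH` (the new default), and the server URL is a placeholder:

  ```yml
  # Sketch: explicitly pin the datahub-rest sink mode instead of relying on
  # the new ASYNC_BATCH default.
  sink:
    type: datahub-rest
    config:
      server: http://localhost:8080
      mode: ASYNC
  ```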
- #11214 - The container properties aspect will produce an additional field, which will require a corresponding upgrade of the server; otherwise, the server can reject the aspects.
- #10190 - `extractor_config.set_system_metadata` of the `datahub` source has been moved to be a top-level config in the recipe, under `flags.set_system_metadata` (see the sketch below).
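  A minimal sketch of the relocated flag in a recipe for the `datahub` source; the source config and sink details are placeholders:

  ```yml
  # Sketch: set_system_metadata now lives under a top-level `flags` block,
  # not under source.extractor_config.
  source:
    type: datahub
    config: {}
  flags:
    set_system_metadata: false
  sink:
    type: datahub-rest
    config:
      server: http://localhost:8080
  ```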
- … pass `-protocProp` in case this behavior is required.
- … nested under the `value` key, and the API is now symmetric with respect to inputs and outputs. Example GlobalTags aspect:
  Previous:
  ```json
  {
    "tags": [
      {
        "tag": "string",
        "context": "string"
      }
    ]
  }
  ```
  New (optional fields `systemMetadata` and `headers`):
  ```json
  {
    "value": {
      "tags": [
        {
          "tag": "string",
          "context": "string"
        }
      ]
    },
    "systemMetadata": {},
    "headers": {}
  }
  ```
- #10858 - Profiling configuration for the Glue source has been updated.
  Previously, the configuration was:
  ```yml
  profiling: {}
  ```
  Now, it needs to be:
  ```yml
  profiling:
    enabled: true
  ```
- … `/v2` or `/v3`. The v1 endpoints will be deprecated in no less than 6 months and will be replaced with equivalents in the `/v2` or `/v3` APIs. No loss of functionality is expected unless explicitly mentioned in Breaking Changes.
- The structure of `~/.datahubenv` has changed to match the `DatahubClientConfig` object definition. See the full configuration at https://docs.datahub.com/docs/python-sdk/clients/. The CLI should now respect the updated configurations specified in `~/.datahubenv` across its functions and utilities. This means that for systems where SSL certification is disabled, setting `disable_ssl_verification: true` in `~/.datahubenv` will apply to all CLI calls.
- … a `~/.datahubenv` file. You must either run `datahub init` to create that file, or set environment variables so that the config is loaded.
- A new parameter was added to the `put` CLI command: `--run-id`. This parameter is useful to associate a given write with an ingestion process. One use case is mimicking transformers when a transformer for the aspect being written does not exist.
- … `dataHubExecutionRequestResult` aspects that are too large for GMS to handle.
- `aws_region` is now a required configuration in the DynamoDB connector. The connector will no longer loop through all AWS regions; instead, it will only use the region passed into the recipe configuration (see the sketch after this list).
- `RVW` added as a `FabricType`. No rollbacks are allowed once metadata with this fabric type is added, without manual cleanup in the databases.
- … enabled by default when `pipeline_name` is set and either a `datahub-rest` sink or `datahub_api` is specified. It will still be disabled by default when any other sink type is used or if no pipeline name is set.
- The `DataHubGraph` client no longer makes a request to the backend during initialization. If you want to preserve the old behavior, call `graph.test_connection()` after constructing the client.
- The dbt `use_compiled_code` option has been removed, because we now support capturing both source and compiled dbt SQL. This can be configured using `include_compiled_code`, which will be enabled by default in 0.13.1.
- `use_lineage_v2` is now enabled by default.
- dbt: `entities_enabled.model_performance` and `include_compiled_code` are now both enabled by default. Upgrading dbt ingestion will also require upgrading the backend to 0.13.1.
- `SEARCH_AUTHORIZATION_ENABLED` was replaced by `VIEW_AUTHORIZATION_ENABLED` to more accurately represent the feature.
- The MySQL version for quickstarts has been updated to 8.2, which may cause quickstart issues for existing instances.
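A minimal sketch for the DynamoDB change above; the region value is a placeholder:

```yml
# Sketch: aws_region is now required; only this region is scanned.
source:
  type: dynamodb
  config:
    aws_region: us-west-2
```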
- Neo4j was updated to 5.x, which may require a migration from 4.x.
- Build requires JDK 17 (runtime Java 11).
- Build requires Docker Compose > 2.20.
- #9731 - The `acryl-datahub` CLI now requires Python 3.8+.
- #9601 - The Unity Catalog (UC) ingestion source config `include_metastore` is now disabled by default. This change will affect the URNs of all entities in the workspace.
Entity Hierarchy with include_metastore: true (Old)
- UC Metastore
- Catalog
- Schema
- Table
Entity Hierarchy with include_metastore: false (New)
- Catalog
- Schema
- Table
We recommend using platform_instance for differentiating across metastores.
  If stateful ingestion is enabled, running ingestion with the latest CLI version will perform all required cleanup. Otherwise, we recommend soft deleting all Databricks data via the DataHub CLI:
  `datahub delete --platform databricks --soft` and then re-ingesting with the latest CLI version (see the sketch below).
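  A minimal sketch of a Unity Catalog recipe reflecting the new default, using `platform_instance` to differentiate across metastores as recommended above (the instance name is a placeholder):

  ```yml
  # Sketch: include_metastore now defaults to false; use platform_instance
  # to keep entities from different metastores distinct.
  source:
    type: unity-catalog
    config:
      include_metastore: false
      platform_instance: my_metastore
  ```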
- #9601 - The Unity Catalog (UC) ingestion source config `include_hive_metastore` is now enabled by default. This requires the config `warehouse_id` to be set. You can disable `include_hive_metastore` by setting it to `false` to avoid ingesting the legacy hive metastore catalog in Databricks.
- #9904 - The default Redshift `table_lineage_mode` is now `MIXED`, instead of `STL_SCAN_BASED`. Improved lineage generation is also available by enabling `use_lineage_v2`. This v2 implementation will become the default in a future release (see the sketch below).
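  For reference, a minimal sketch showing how to pin the previous mode, or opt in to the v2 implementation early:

  ```yml
  # Sketch: restore the old default lineage mode, or try lineage v2 ahead of time.
  source:
    type: redshift
    config:
      table_lineage_mode: STL_SCAN_BASED # previous default; new default is MIXED
      use_lineage_v2: false # set to true to opt in to the v2 implementation
  ```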
- The `redshift-legacy` and `redshift-legacy-usage` sources, which had been deprecated for more than 6 months, have been removed. The new `redshift` source is a superset of the functionality provided by those legacy sources.
- The `database_alias` config is no longer supported in SQL sources, namely Redshift, MySQL, Oracle, Postgres, Trino, and Presto-on-hive. The config will automatically be ignored if it's present in your recipe. It has been deprecated since v0.9.6.
- `TagUrn("tag", ["tag_name"])` is no longer supported; the simpler `TagUrn("tag_name")` should be used instead. The canonical place to import the urn classes from is `datahub.metadata.urns.*`. Other import paths, like `datahub.utilities.urns.corpuser_urn.CorpuserUrn`, are retained for backwards compatibility, but are considered deprecated.
- The `DataHubRestEmitter.emit` method no longer returns anything. It previously returned a tuple of timestamps.
- … `method: analyze` under the `profiling` section in your recipe. To use the new profiler, set `method: ge`. Profiling is disabled by default, so to enable it, one of these methods must be specified.
- … the `neo4j:` section according to the new structure.
- … an `ownershipTypeUrn` referencing a custom ownership type or a (deprecated) type. Whereas before, adding an ownership without a concrete type was allowed, this is no longer the case. For simplicity, you can use the `type` parameter, which will get translated to a custom ownership type internally if one exists for the type being added.
- `incremental_lineage` now defaults to off.
- … the `urn:li:corpuser:datahub` owner for the Measure, Dimension, and Temporal tags emitted by the Looker and LookML source connectors.
- … `pip install 'acryl-datahub-airflow-plugin[plugin-v2]'`. To continue using the v1 plugin, set the `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN` environment variable to `true`.
- The Unity Catalog source introduces the `include_metastore` option, which will cause all URNs to change when disabled. It is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future.
If stateful ingestion is enabled, simply setting include_metastore: false will perform all required cleanup.
  Otherwise, we recommend soft deleting all Databricks data via the DataHub CLI:
  `datahub delete --platform databricks --soft` and then re-ingesting with `include_metastore: false`.
- In policy filters, `RESOURCE_TYPE` became `TYPE` and `RESOURCE_URN` became `URN`.
Any existing policies using these filters (i.e. defined for particular urns or types such as dataset) need to be upgraded
  manually, for example by retrieving their respective `dataHubPolicyInfo` aspect and changing the filter section, i.e. turning
  ```json
  "resources": {
    "filter": {
      "criteria": [
        {
          "field": "RESOURCE_TYPE",
          "condition": "EQUALS",
          "values": ["dataset"]
        }
      ]
    }
  }
  ```
  into
  ```json
  "resources": {
    "filter": {
      "criteria": [
        {
          "field": "TYPE",
          "condition": "EQUALS",
          "values": ["dataset"]
        }
      ]
    }
  }
  ```
  for example by using the `datahub put` command. Policies can also be removed and re-created via the UI.
- The BigQuery source now defaults to `match_fully_qualified_names: true`. This means that any `dataset_pattern` or `schema_pattern` specified will be matched against the fully qualified dataset name, i.e. `<project_name>.<dataset_name>`. We attempt to support the old pattern format by prepending `.*\\.` to dataset patterns lacking a period, so in most cases this should not cause any issues. However, if you have a complex dataset pattern, we recommend manually converting it to the fully qualified format to avoid any potential issues.
- … handles `env` properly. If you have been setting `env` in your recipe to something besides `PROD`, we will now generate URNs with that new env variable, invalidating your existing URNs.
- During this upgrade, a `system-update` job will run which will set indices to read-only and create a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure, this process can take 5 minutes to hours; as a rough estimate, allow 1 hour for every 2.3 million entities.
- `manager_pagination_enabled` changed to the general `pagination_enabled`.
- … the `uri_opts` argument; now any options can be added for the ClickHouse client.
- … the `include_data_platform_instance` config option.
- The `cluster` argument is deprecated in favor of `env`.
- The `okta_profile_to_username_attr` default changed from `login` to `email`. This determines which Okta profile attribute is used for the corresponding DataHub user, and thus may change which DataHub users are generated by the Okta source. In a follow-up, `okta_profile_to_username_regex` has been set to `.*`, which, taken together with the previous change, brings the defaults in line with OIDC.
- It is no longer allowed to set `profile_table_level_only` together with the `include_field_xyz` config options to ingest certain column-level metrics. Instead, set `profile_table_level_only` to `false` and individually enable/disable the desired field metrics.
- The `bigquery-beta` and `snowflake-beta` source aliases have been dropped. Use `bigquery` and `snowflake` as the source type instead.
- … `no_default_report=True`.
- The `snowflake` connector will now use the user's `email` attribute as-is in the URN. To revert to the previous behavior, disable `email_as_user_identifier` in the recipe.
- … the `system-update` job in non-blocking mode. This process generates data needed for the new search and browse feature. It must complete before enabling the new search and browse UI, and while it is running, upgraded entities will be missing from the UI. If you are not using the new search and browse UI, there will be no impact, and the update will complete in the background.
- The `PlatformKey` class has been renamed to `ContainerKey`.
- 0.10.5 introduces the new Unified Search & Browse experience, which is disabled by default. You can control whether you want to see just the new search filtering experience, the new search and browse experience together, or keep the existing search and browse experiences, by toggling the two environment variable feature flags `SHOW_SEARCH_FILTERS_V2` and `SHOW_BROWSE_V2` in your GMS container.
Upgrade Considerations:
- During the upgrade, a job will backfill `browsePathsV2` aspects. This job loops over the entity types that need a `browsePathsV2` aspect (Dataset, Dashboard, Chart, DataJob, DataFlow, MLModel, MLModelGroup, MLFeatureTable, and MLFeature) and generates one for them. For entities that may have Container parents (Datasets and Dashboards), we will try to fetch their parent containers in order to generate this new aspect. For deployments with large amounts of data, consider whether running this upgrade job makes sense, as it may be a heavy operation and take some time to complete. If you wish to skip this job, simply set the `BACKFILL_BROWSE_PATHS_V2` environment variable flag to `false` in your GMS container. Without this backfill job, though, you will need to rely on the newest ingestion CLI to create these `browsePathsV2` aspects when running ingestion; otherwise your browse sidebar will be out of sync.
- … whether turning the `SHOW_BROWSE_V2` environment variable feature flag on is the right decision for your organization. If you're creating custom browse paths with the `browsePaths` aspect, you can continue to do the same with the new experience; however, you will have to generate `browsePathsV2` aspects instead, which are documented here.
- The `Owner` aspect has been updated: the `type` field is deprecated in favor of a new field, `typeUrn`. This latter field is an urn reference to the new `OwnershipType` entity. GraphQL endpoints have been updated to use the new field. For pre-existing ownership aspect records, DataHub now has logic to map the old field to the new field.
- The `catalog_pattern` and `schema_pattern` options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using a regex `^` in your patterns, this should not affect you.
- We have renamed the `containerPath` aspect to `browsePathsV2`. This means any data with the aspect name `containerPath` will be invalid. We had not exposed this in the UI or used it anywhere, but it was a model we recently merged to open up other work. This should not affect many people, if anyone at all, unless you were manually creating `containerPath` data through ingestion on your instance.
- In the `datahub delete` CLI, if an `--entity-type` filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of `dataset`.
- In the `datahub delete` CLI, the `--start-time` and `--end-time` parameters are not required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use `--start-time min --end-time max`.
- The return type of `Source.get_workunits()` is changed from `Iterable[WorkUnit]` to the more restrictive `Iterable[MetadataWorkUnit]`.
- The `UsageAggregation` aspect, the `/usageStats?action=batchIngest` GMS endpoint, and the `UsageStatsWorkUnit` metadata-ingestion class are all deprecated.
- Added the `add_database_name_to_urn` flag to the Oracle source, which ensures that dataset URNs have the DB name as a prefix to prevent collisions (e.g. `{database}.{schema}.{table}`). This is ONLY breaking if you set this flag to `true`; otherwise behavior remains the same.
- … `pip install 'acryl-datahub-airflow-plugin[datahub-kafka]'` for Kafka support.
- … `pip install 'acryl-datahub[airflow,datahub-kafka]'` for Kafka support.
- … the `/contrib` section of the repository. Please refer to older releases if needed.
- The environment variables in the `kafka-setup` docker image have been updated to be in line with other DataHub components; for more info, see our docs on Configuring Kafka in DataHub. They were previously suffixed with `_TOPIC`, whereas now the correct suffix is `_TOPIC_NAME`. This change should not affect any user who is using default Kafka names.
- … has been renamed to `redshift-legacy`. The `redshift-usage` source has also been renamed, to `redshift-usage-legacy`, and will be removed in the future.
- During this upgrade, a `system-update` job will run which will set indices to read-only and create a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate a % complete per index. Depending on index sizes and infrastructure, this process can take 5 minutes to hours; as a rough estimate, allow 1 hour for every 2.3 million entities.
  - Helm without `--atomic`: the default timeout for an upgrade command is 5 minutes. If the reindex takes longer (depending on data size), it will continue to run in the background even though helm will report a failure. Allow this job to finish and then re-run the helm upgrade command.
  - Helm with `--atomic`: in general, it is recommended not to use the `--atomic` setting for this particular upgrade, since the system-update job would be terminated before completion. If `--atomic` is preferred, then increase the timeout using the `--timeout` flag to account for the reindexing time (see the note above for estimating this value).
- … the `legacy_nested_json_string` option. The file source is backwards compatible and supports both formats.
- The `env` and `database_alias` fields have been marked deprecated across all sources. We recommend using `platform_instance` where possible instead.
- … `apache-ranger-plugin` in DataHub GMS.
- The `datahub check graph-consistency` command has been removed. It was a beta API that we had considered, but we decided there are better solutions for this, so it has been removed.
- The `graphql_url` option of the `powerbi-report-server` source is deprecated, as the option is not used.
- If `enable_legacy_sharded_table_support` is set to `false`, sharded table names will be suffixed with `_yyyymmdd` to make sure they don't clash with non-sharded tables. This means that if stateful ingestion is enabled, old sharded tables will be recreated with a new ID, and attached tags/glossary terms/etc. will need to be added again. This behavior is not enabled by default yet, but will be enabled by default in a future release.
- `schema_pattern` now accepts a pattern for the fully qualified schema name in the format `<catalog_name>.<schema_name>` by setting the config `match_fully_qualified_names: true`. The current default `match_fully_qualified_names: false` is only to maintain backward compatibility. The `match_fully_qualified_names` config option will be deprecated in the future, and the default behavior will then assume `match_fully_qualified_names: true`.
- `snowflake-legacy` and `snowflake-usage-legacy` have been removed.
- `workspace_id_pattern` is introduced in place of `workspace_id`. `workspace_id` is now deprecated and set for removal in a future version.
- … `emit_reachable_views_only` to `False`.
- `node_type_pattern`, which was previously deprecated, has been removed. Use `entities_enabled` instead to control whether to emit metadata for sources, models, seeds, tests, etc.
- The `snowflake` connector now populates created and last-modified timestamps for Snowflake datasets and containers. This version of the Snowflake connector will not work with datahub-gms versions older than v0.9.3.
- … renamed from `bigquery-beta` to `bigquery`. If you are using `bigquery-beta`, change your recipes to use the type `bigquery`.

## v0.8.45

- The `getNativeUserInviteToken` and `createNativeUserInviteToken` GraphQL endpoints have been renamed to
  `getInviteToken` and `createInviteToken` respectively. Additionally, both now accept an optional `roleUrn` parameter. Both endpoints also now require the `MANAGE_POLICIES` privilege to execute, rather than the `MANAGE_USER_CREDENTIALS` privilege.
- The default policy (`urn:li:dataHubPolicy:7`, or All Users - All Platform Privileges) has been edited to no longer include `MANAGE_POLICIES`. Its name has consequently been changed to All Users - All Platform Privileges (EXCEPT MANAGE POLICIES). This change was made to prevent all users from effectively acting as superusers by default.

## v0.8.44

- The `disable_dbt_node_creation` and `load_schema` options have been removed. They were no longer necessary due to the recently added sibling-entities functionality.
- The `snowflake` source now uses a newer, faster implementation (formerly `snowflake-beta`). The config properties `provision_role` and `check_role_grants` are not supported. The older `snowflake` and `snowflake-usage` sources are available as the `snowflake-legacy` and `snowflake-usage-legacy` sources, respectively.
- Ensure the `datahub-actions` container is bumped to v0.0.7 or head. This version contains changes to support running ingestion in debug mode. Previous versions are not compatible with this release. Upgrading to helm chart version 0.2.103 will ensure that you have the compatible versions by default.

## v0.8.42

- The `GMS_HOST` and `GMS_PORT` environment variables deprecated in v0.8.39 have been removed. Use `DATAHUB_GMS_HOST` and `DATAHUB_GMS_PORT` instead.
- The `delete` command, when used with the `--hard` option, will delete soft-deleted entities which match the other filters given.
- … `userEmail` in dashboard user usage stats. This version of the Looker connector will not work with older versions of datahub-gms if you have the `extract_usage_history` Looker config enabled.
- The `ANALYTICS_ENABLED` environment variable in datahub-gms is now deprecated. Use `DATAHUB_ANALYTICS_ENABLED` instead.
- The `--include-removed` option was removed from the `delete` CLI.

## v0.8.41

- The `should_overwrite` flag in `csv-enricher` has been replaced with `write_semantics` to match the format used for other sources. See the documentation for more details (and the sketch below).
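A minimal sketch of the renamed option in a recipe; the filename is a placeholder:

```yml
# Sketch: write_semantics replaces should_overwrite for csv-enricher.
source:
  type: csv-enricher
  config:
    filename: ./enrichment.csv
    write_semantics: PATCH # or OVERRIDE
```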
- Closed an authorization hole in tag creation by adding a Platform Privilege called Create Tags. This is assigned to the datahub root user, along with the default All Users policy. Notice: you may need to add this privilege (or Manage Tags) to existing users that need the ability to create tags on the platform.
- #5329 - The following profiling config parameters are now supported in BigQuery: …
  Set these parameters to `null` if you want the older behaviour.
## v0.8.40

- `lineage_client_project_id` in the `bigquery` source is removed. Use `storage_project_id` instead.

## v0.8.39

- Refactored the `health` field of the `Dataset` GraphQL type to be a list of `HealthStatus` (it was previously a single `HealthStatus`). See this PR for more details.
- The `GMS_HOST` and `GMS_PORT` environment variables being set in various containers are deprecated in favour of `DATAHUB_GMS_HOST` and `DATAHUB_GMS_PORT`.
- The `KAFKA_TOPIC_NAME` environment variable in datahub-mae-consumer and datahub-gms is now deprecated. Use `METADATA_AUDIT_EVENT_NAME` instead.
- The `KAFKA_MCE_TOPIC_NAME` environment variable in datahub-mce-consumer and datahub-gms is now deprecated. Use `METADATA_CHANGE_EVENT_NAME` instead.
- The `KAFKA_FMCE_TOPIC_NAME` environment variable in datahub-mce-consumer and datahub-gms is now deprecated. Use `FAILED_METADATA_CHANGE_EVENT_NAME` instead.
- Tables are profiled in the `snowflake` source only if they have been updated within the configured (default: 1) number of days. Update the config `profiling.profile_if_updated_since_days` as per your profiling schedule, or set it to `None` if you want the older behaviour.

## v0.8.38

## v0.8.36

- … set `profiling.report_dropped_profiles` to `True` if you want the older behaviour.

## v0.8.35

## v0.8.34

- Removed the `database` option from the `snowflake` source, which had been deprecated since v0.8.5.
- Renamed `report_upstream_lineage` to `upstream_lineage_in_report` in the `snowflake` connector (this option was added in 0.8.32).
- The `host_port` option of the `snowflake` and `snowflake-usage` sources is deprecated, as the name was confusing. Use the `account_id` option instead.
- The `check_role_grants` option was added in `snowflake` to disable checking roles, as some people were reporting long run times when checking roles.