docs/how/backup-datahub.md
DataHub stores metadata in two key storage systems that require separate backup approaches:
metadata_aspect_v2 tableThis guide outlines how to properly back up both components to ensure complete recoverability of your DataHub instance.
The recommended backup strategy is to periodically dump the metadata_aspect_v2 table from the datahub database. This table contains all versioned aspects and can be restored in case of database failure. Most managed database services (e.g., AWS RDS) provide automated backup capabilities.
Option 1: Automated RDS Snapshots
datahub-backup-YYYY-MM-DD)Option 2: SQL Dump (MySQL)
For a targeted backup of only the essential metadata:
mysqldump -h <rds-endpoint> -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql
To compress the backup:
mysqldump -h <rds-endpoint> -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz
mysqldump -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql
Compressed version:
mysqldump -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz
Time Series Aspects power important features like usage statistics, dataset profiles, and assertion runs. These are stored in Elasticsearch/OpenSearch and require a separate backup strategy.
Create an IAM Role for Snapshots
Create an IAM role with permissions to write to an S3 bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": ["s3:ListBucket"],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::your-backup-bucket"]
},
{
"Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::your-backup-bucket/*"]
}
]
}
Ensure the trust relationship allows OpenSearch to assume this role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "es.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
PUT _snapshot/datahub_s3_backup
{
"type": "s3",
"settings": {
"bucket": "your-backup-bucket",
"region": "us-east-1",
"role_arn": "arn:aws:iam::<account-id>:role/<snapshot-role>"
}
}
⚠️ Important: The S3 bucket must be in the same AWS region as your OpenSearch domain.
Create a Regular Snapshot Schedule
Set up an automated schedule using the OpenSearch Snapshot Management:
PUT _plugins/_sm/policies/datahub_backup_policy
{
"schedule": {
"cron": {
"expression": "0 0 * * *",
"timezone": "UTC"
}
},
"name": "<snapshot-{now/d}>",
"repository": "datahub_s3_backup",
"config": {
"partial": false
},
"retention": {
"expire_after": "15d",
"min_count": 5,
"max_count": 30
}
}
This configures daily snapshots with a 15-day retention period.
Take a Manual Snapshot (if needed)
PUT _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD?wait_for_completion=true
Verify Snapshot Status
GET _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD
Create a Local Repository
First, add path.repo setting to elasticsearch.yml on all nodes:
path.repo: ["/mnt/es-backups"]
Ensure /mnt/es-backups is a shared or mounted path on all Elasticsearch nodes.
Register the Repository
PUT _snapshot/datahub_fs_backup
{
"type": "fs",
"settings": {
"location": "/mnt/es-backups",
"compress": true
}
}
Create a Snapshot
PUT \_snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD?wait_for_completion=true
Check Snapshot Status
GET \_snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD
Restore from an RDS Snapshot (if using AWS RDS)
In the AWS Console, go to RDS > Snapshots, select your snapshot, and choose "Restore Snapshot".
Restore from SQL Dump
mysql -h <host> -u <user> -p datahub < metadata_aspect_v2_backup.sql
After restoring the database, you need to restore the search and graph indices using your snapshots.
Note that you can also rebuild the index from scratch after restoring the MySQL / Postgres Document Store, as outlined here.
To restore search indexes from a snapshot:
POST _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD/_restore
{
"indices": "datastream*,metadataindex*",
"include_global_state": false
}
Regularly test your backup and restore procedures to ensure they work when needed:
A good practice is to test restore procedures quarterly or after significant infrastructure changes.