docs/managed-datahub/observe/assertion-backfill.md
import FeatureAvailability from '@site/src/components/FeatureAvailability';
<div align="center"><iframe width="640" height="444" src="https://www.loom.com/embed/61a201aea8464f58826c965fdbfbe255" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe></div>

The Backfill Assertion History feature is available as part of the DataHub Cloud Observe module. If you are interested in learning more about DataHub Cloud Observe or trying it out, please visit our website.
When you create a new Smart Assertion, it needs historical data to learn what "normal" looks like before it can start making accurate predictions. Without historical context, the assertion's AI model has nothing to train on, meaning it will take days or weeks of real-time evaluations before it can reliably detect anomalies.
Backfill Assertion History solves this by running the assertion against historical data at the time of creation. Instead of waiting for the model to accumulate enough data points through scheduled evaluations, the system queries your warehouse for past data and populates the assertion's metrics history in one go. This means you get accurate anomaly detection thresholds from day one, with full awareness of daily, weekly, or monthly seasonality in your data.
Backfill is available for the following assertion types:
| Assertion Type | Backfill Support |
|---|---|
| Smart Volume Assertion | Yes (requires time-series bucketing) |
| Smart Column Metric Assertion | Yes (requires time-series bucketing) |
| Freshness Assertion | No |
| Schema Assertion | No |
| Custom SQL Assertion | No |
When you create a bucketed assertion with backfill enabled, the following process occurs:
1. A backfill job is created in the `PENDING` state and queued for execution.
2. The system splits the lookback window into chunks (approximately 28 days each) and computes historical metrics with `GROUP BY` queries. This balances query cost against resilience: if a single chunk fails, only that chunk needs to be retried rather than the entire backfill.

The maximum amount of historical data that can be backfilled depends on the bucket interval:
| Bucket Interval | Maximum Lookback |
|---|---|
| Daily | 365 days (1 year) |
| Weekly | 156 weeks (3 years) |
This lookback window is relative to the assertion's creation date.
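The chunked execution described above can be sketched as follows. This is an illustrative sketch, not the actual DataHub implementation; it only shows how a lookback window splits into roughly 28-day chunks, each of which would map to one `GROUP BY` query against the warehouse.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical chunk size, matching the approximate 28-day batching
# described in this document.
CHUNK = timedelta(days=28)

def backfill_chunks(start: datetime, end: datetime):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + CHUNK, end)
        yield cursor, chunk_end
        cursor = chunk_end

start = datetime(2023, 7, 1, tzinfo=timezone.utc)
end = datetime(2023, 10, 1, tzinfo=timezone.utc)
chunks = list(backfill_chunks(start, end))
# 92 days of lookback -> 4 chunks (28 + 28 + 28 + 8 days)
```

Because each chunk is an independent unit of work, a failure partway through leaves the completed chunks intact, which is what makes per-chunk retries possible.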
You can track the progress of a backfill from the assertion detail page, which shows the job's current state (for example, `PENDING` while queued, or a failed state that can be retried).
:::info
Backfilling large tables (> 1 TB) can be expensive in terms of warehouse compute. Consider starting with a shorter lookback period and extending it if needed.
:::
When creating a new smart assertion with time-series bucketing enabled:
After creation, you can update the backfill start date, but you cannot change the bucketing configuration (timestamp column, bucket interval, or timezone).
You can configure backfill using the `backfill_config` parameter on the `sync_smart_volume_assertion` and `sync_smart_column_metric_assertion` methods.
```python
from datahub.sdk import DataHubClient
from datahub.metadata.urns import DatasetUrn

client = DataHubClient(server="<your_server>", token="<your_token>")

dataset_urn = DatasetUrn.from_string(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,database.schema.table,PROD)"
)

# Smart volume assertion with daily bucketing and 6-month backfill
assertion = client.assertions.sync_smart_volume_assertion(
    dataset_urn=dataset_urn,
    display_name="Daily Volume Anomaly Monitor",
    detection_mechanism="information_schema",
    sensitivity="medium",
    time_bucketing_strategy={
        "timestamp_field_path": "created_at",
        "bucket_interval": {"unit": "DAY", "multiple": 1},
        "timezone": "America/Los_Angeles",
    },
    backfill_config={
        "backfill_start_date_ms": 1688169600000,  # 2023-07-01T00:00:00Z
    },
    tags=["automated", "volume"],
    enabled=True,
)
```
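If you prefer not to hard-code the epoch value for `backfill_start_date_ms`, it can be derived with the Python standard library. A quick sketch (the helper name `to_epoch_ms` is illustrative, not part of the SDK):

```python
from datetime import datetime, timezone

def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)

start_ms = to_epoch_ms(datetime(2023, 7, 1, tzinfo=timezone.utc))
# start_ms == 1688169600000, i.e. 2023-07-01T00:00:00Z
```

Always attach an explicit timezone; a naive datetime would be interpreted in the machine's local timezone and could shift the backfill window.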
The `backfill_config` parameter accepts:
- A dict with `backfill_start_date_ms` (epoch milliseconds)
- A `BackfillConfig` Pydantic model (supports datetime objects)
- An `AssertionMonitorBootstrapConfigClass` GMS model

```python
from datetime import datetime

from acryl_datahub_cloud.sdk import BackfillConfig

# Using a BackfillConfig with a datetime object
backfill = BackfillConfig(backfill_start_date_ms=datetime(2024, 1, 1))

assertion = client.assertions.sync_smart_column_metric_assertion(
    dataset_urn=dataset_urn,
    column_name="user_id",
    metric_type="null_count",
    display_name="Smart Null Count - user_id",
    detection_mechanism="all_rows_query_datahub_dataset_profile",
    sensitivity="medium",
    time_bucketing_strategy={
        "timestamp_field_path": "created_at",
        "bucket_interval": {"unit": "WEEK", "multiple": 1},
    },
    backfill_config=backfill,
    enabled=True,
)
```
:::note
`backfill_config` requires `time_bucketing_strategy` to also be set. If you provide `backfill_config` without `time_bucketing_strategy` on a column metric assertion, the configuration will be rejected and the assertion will not be created.
:::
If a backfill fails (due to a warehouse timeout, network error, etc.), you can retry it from the assertion detail page. There are two retry modes: a soft retry, which resumes from the last successfully completed chunk, and a hard reset, which re-runs the entire backfill from scratch.
Navigate to the assertion detail page. The backfill status will appear near the top of the page, alongside the error encountered. Click Retry.
```graphql
mutation retryMonitorBackfill {
  retryMonitorBackfill(
    input: { monitorUrn: "urn:li:monitor:your-monitor-id", hardReset: false }
  )
}
```
Set `hardReset: true` to perform a full re-backfill from scratch. This is useful if you recently ran a job that added or updated entries and backdated them.
Retrying a backfill requires the **Edit Assertions** and **Edit Monitors** privileges for the target dataset.

**Q: Does backfill run queries against my warehouse?**
Yes. For bucketed assertions, the backfill process issues `GROUP BY` queries against your warehouse to compute historical metrics. Queries are batched in chunks (approximately 28 days per chunk) to balance cost and resilience.
**Q: Can I change the backfill start date after creation?**

Yes. The backfill start date can be updated after creation. However, the corresponding bucketing parameters (timestamp column, bucket interval, timezone) cannot be changed without recreating the assertion.
**Q: What happens if my warehouse goes down during a backfill?**

The backfill will fail and can be retried. Because progress is tracked per-chunk, a soft retry will resume from the last successful chunk rather than starting over.
**Q: Does backfill affect my scheduled assertion evaluations?**

No. Backfill runs do not interfere with your normal assertion evaluation schedule.
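The per-chunk resume behavior described above can be sketched in a few lines. This is a hypothetical illustration, not the DataHub implementation: a soft retry simply skips chunks that already succeeded and resumes from the first incomplete one.

```python
# Hypothetical sketch of per-chunk progress tracking with soft retry.
def run_backfill(chunks, completed, run_chunk):
    """Attempt every chunk not yet in `completed`; record successes."""
    for chunk in chunks:
        if chunk in completed:
            continue  # soft retry: skip work that already succeeded
        run_chunk(chunk)        # would issue the chunk's GROUP BY query
        completed.add(chunk)    # only reached if the chunk succeeded

completed = set()
chunks = [1, 2, 3, 4]
attempts = []

def run_chunk(chunk):
    """Simulated warehouse call that fails the first time chunk 3 runs."""
    attempts.append(chunk)
    if chunk == 3 and attempts.count(3) == 1:
        raise RuntimeError("warehouse timeout")

try:
    run_backfill(chunks, completed, run_chunk)
except RuntimeError:
    pass  # first run fails at chunk 3; chunks 1 and 2 are recorded

run_backfill(chunks, completed, run_chunk)  # soft retry resumes at chunk 3
# attempts == [1, 2, 3, 3, 4]: chunks 1 and 2 were never re-queried
```

A hard reset would correspond to clearing `completed` before retrying, so every chunk is recomputed from scratch.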