# Analytics Backfill Scripts
Populate DataHub analytics dashboards with synthetic user activity data for testing and demonstration purposes.
These scripts generate realistic user profiles and activity events with historical timestamps to populate the DataHub analytics charts (for example, Weekly Active Users).
## Prerequisites

- `acryl-datahub` package installed
- `jq` and `curl` installed

## Quick Start

```bash
# Set your DataHub token
export DATAHUB_TOKEN="your-token-here"

# Run the master script with defaults (20 users, 30 days of activity)
./populate_analytics.sh --token $DATAHUB_TOKEN
```
This will generate 20 synthetic users, generate 30 days of activity events (about 200 per day), and load them into Elasticsearch. View the results at http://localhost:9002/analytics.
## Scripts

### populate_analytics.sh (Master Script)

Orchestrates the entire pipeline.
Usage:

```bash
./populate_analytics.sh --token TOKEN [OPTIONS]
```

Options:
- `--num-users N` - Number of users to generate (default: 20)
- `--num-days N` - Days of history to generate (default: 30)
- `--events-per-day N` - Target events per day (default: 200)
- `--gms-url URL` - DataHub GMS URL (default: http://localhost:8080)
- `--token TOKEN` - DataHub auth token (required)
- `--elasticsearch-url URL` - Elasticsearch URL (default: http://localhost:9200)
- `--email-domain DOMAIN` - Email domain for users (default: example.com)
- `--skip-users` - Skip user generation
- `--skip-events` - Skip event generation
- `--skip-load` - Skip loading to Elasticsearch

Examples:
```bash
# Generate 50 users with 60 days of high activity
./populate_analytics.sh --token $TOKEN --num-users 50 --num-days 60 --events-per-day 500

# Only generate events (users already exist)
./populate_analytics.sh --token $TOKEN --skip-users

# Generate data but don't load yet (for review)
./populate_analytics.sh --token $TOKEN --skip-load
```
### generate_users.py

Creates synthetic user profiles and emits them to DataHub.
Usage:

```bash
python generate_users.py \
    --num-users 20 \
    --token $TOKEN \
    --output-file users.json
```
Output:

```json
[
  {
    "username": "alice.anderson",
    "first_name": "Alice",
    "last_name": "Anderson",
    "email": "[email protected]",
    "display_name": "Alice Anderson",
    "title": "Data Engineer",
    "department": "Engineering",
    "team": "Platform"
  },
  ...
]
```
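For reference, here is a minimal sketch of how a profile like the one above can be pushed to DataHub with the `acryl-datahub` REST emitter; this is a sketch only, and the script's actual implementation may differ.

```python
from datahub.emitter.mce_builder import make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import CorpUserInfoClass

# Assumes a local GMS and a valid token (sketch, not the script's internals).
emitter = DatahubRestEmitter(gms_server="http://localhost:8080", token="your-token-here")

# Build the CorpUserInfo aspect from one generated profile.
user_info = CorpUserInfoClass(
    active=True,
    displayName="Alice Anderson",
    email="[email protected]",
    firstName="Alice",
    lastName="Anderson",
    fullName="Alice Anderson",
    title="Data Engineer",
)

# Emit the aspect against the user's URN.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_user_urn("alice.anderson"),
        aspect=user_info,
    )
)
```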
### generate_glossary_terms.py

Creates business glossary terms and attaches them to datasets.
Usage:

```bash
python generate_glossary_terms.py \
    --token $TOKEN \
    --entity-urns-file entity_urns.json \
    --output-file glossary_terms.json
```
Output:

```json
{
  "terms": [
    {
      "name": "Customer ID",
      "urn": "urn:li:glossaryTerm:customer_id",
      "category": "Customer Data",
      "definition": "Unique identifier for a customer"
    },
    ...
  ],
  "attachments": [
    {
      "dataset_urn": "urn:li:dataset:...",
      "term_urns": ["urn:li:glossaryTerm:customer_id", ...],
      "num_terms": 2
    },
    ...
  ]
}
```
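The attachment step can be pictured as emitting a `glossaryTerms` aspect against each dataset. A hedged sketch using the `acryl-datahub` emitter (not necessarily how the script does it):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080", token="your-token-here")

# Attach one term to one dataset (URNs here are placeholders).
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)"
terms = GlossaryTermsClass(
    terms=[GlossaryTermAssociationClass(urn="urn:li:glossaryTerm:customer_id")],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:datahub"),
)

# Note: emitting this aspect replaces any glossary terms already on the dataset.
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms))
```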
### backfill_activity_events.py

Generates realistic user activity events with backdated timestamps.
Event Types Generated:
- `EntityViewEvent` - Dataset, dashboard, chart views
- `EntityActionEvent` - Tab views, actions on entities
- `SearchEvent` - Search queries
- `SearchResultsViewEvent` - Search results viewed
- `HomePageViewEvent` - Home page visits
- `LogInEvent` - User login events
Usage:

```bash
# Generate and save to file
python backfill_activity_events.py \
    --users-file users.json \
    --days 30 \
    --events-per-day 200 \
    --output-file activity_events.json

# Generate and load directly to Elasticsearch (for CI/automated tests)
python backfill_activity_events.py \
    --users-file users.json \
    --days 30 \
    --events-per-day 200 \
    --elasticsearch-url http://localhost:9200 \
    --load-to-elasticsearch

# Both save to file AND load to Elasticsearch
python backfill_activity_events.py \
    --users-file users.json \
    --days 30 \
    --events-per-day 200 \
    --output-file activity_events.json \
    --elasticsearch-url http://localhost:9200 \
    --load-to-elasticsearch
```
**Key Feature: Relative Timestamps**
The script generates events with timestamps relative to execution time, ensuring "Past Week" and "Past Month" analytics always have fresh data regardless of when tests run. This is critical for CI environments where tests must pass consistently over time.
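For illustration, the core idea can be sketched as follows (the helper name and structure here are hypothetical, not the script's actual internals):

```python
import random
import time

def backdated_timestamp_ms(days_ago: int) -> int:
    """Epoch milliseconds `days_ago` days before *now*, at a random
    time of day, so events spread naturally across each day."""
    day_ms = 24 * 60 * 60 * 1000
    now_ms = int(time.time() * 1000)
    return now_ms - days_ago * day_ms - random.randint(0, day_ms - 1)

# An event backdated 6 days always falls inside the "Past Week" window,
# no matter when the test suite runs.
event = {
    "type": "EntityViewEvent",
    "timestamp": backdated_timestamp_ms(6),
    "actorUrn": "urn:li:corpuser:alice.anderson",
}
```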
Output:

```json
[
  {
    "type": "EntityViewEvent",
    "timestamp": 1696780800000,
    "actorUrn": "urn:li:corpuser:alice.anderson",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)",
    "entityType": "dataset",
    "usageSource": "web"
  },
  ...
]
```
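To sanity-check a generated file before loading, a small ad-hoc helper (not part of these scripts) can tally the events:

```python
import json
from collections import Counter

with open("activity_events.json") as f:
    events = json.load(f)

# Count events per type; the total should approximate NUM_DAYS × EVENTS_PER_DAY.
print(Counter(e["type"] for e in events))
```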
### load_events_to_elasticsearch.sh

Loads events into Elasticsearch using the bulk API.
Usage:

```bash
./load_events_to_elasticsearch.sh activity_events.json

# Custom Elasticsearch
./load_events_to_elasticsearch.sh \
    -e http://elasticsearch:9200 \
    -i datahub_usage_event \
    activity_events.json
```
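Under the hood, the bulk API expects newline-delimited JSON: an action line followed by the document itself, for every event. A minimal Python equivalent of the loader's core (assuming the default `datahub_usage_event` index):

```python
import json
import requests

def bulk_load(events, es_url="http://localhost:9200", index="datahub_usage_event"):
    """Send events to Elasticsearch via the _bulk endpoint."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                         # document line
    body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline
    resp = requests.post(
        f"{es_url}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    with open("activity_events.json") as f:
        print(bulk_load(json.load(f)))
```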
## Configuration

Environment variables:

```bash
# DataHub configuration
export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_TOKEN="your-token-here"

# Elasticsearch configuration
export ELASTICSEARCH_URL="http://localhost:9200"

# Generation parameters
export NUM_USERS=20
export NUM_DAYS=30
export EVENTS_PER_DAY=200
export EMAIL_DOMAIN="example.com"
```
## Custom Entity URNs

Provide your own entity URNs for more realistic activity:

```bash
# Extract entities from your DataHub instance
datahub get --urn "urn:li:dataset:*" > my_entities.json

# Use in event generation
python backfill_activity_events.py \
    --entity-urns-file my_entities.json \
    ...
```
## Event Patterns

Events are generated with realistic daily patterns. The per-day target is the `EVENTS_PER_DAY` count, so the total volume is `NUM_DAYS × EVENTS_PER_DAY` (default: 6,000 events).

## Troubleshooting

**DataHub GMS unreachable:**
- Check health: `curl http://localhost:8080/health`
- Point the scripts at your instance with the `--gms-url` parameter

**Elasticsearch unreachable:**
- Check health: `curl http://localhost:9200/_cluster/health`
- Point the scripts at your instance with the `--elasticsearch-url` parameter
- Verify the container is running: `docker ps | grep elasticsearch`

**Missing `acryl-datahub` package:**
- `pip install 'acryl-datahub[datahub-rest]'`

**Missing `jq`** (install it for your platform):
```bash
# macOS
brew install jq

# Ubuntu/Debian
sudo apt-get install jq

# RHEL/CentOS
sudo yum install jq
```
**No data in the analytics charts:**
- Check that events were loaded: `curl http://localhost:9200/datahub_usage_event/_count`
- Make sure `usageSource` is set to `"web"` (the scripts handle this)

## Integration with Tests

Analytics tests now use pytest fixtures for automatic data loading with relative timestamps:
```python
# In smoke-test/tests/analytics/test_analytics.py
def test_weekly_active_users_chart(auth_session, analytics_events_loaded):
    """Test Weekly Active Users chart - fixture ensures fresh data."""
    # Query analytics charts...
    # Assertions will pass because data is generated relative to execution time
```
The `analytics_events_loaded` fixture (defined in `tests/analytics/conftest.py`) generates the events relative to execution time and loads them into Elasticsearch before the tests run.
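A simplified sketch of what such a fixture can look like (illustrative only; the real fixture in `tests/analytics/conftest.py` may differ):

```python
import subprocess

import pytest

@pytest.fixture(scope="session")
def analytics_events_loaded():
    """Generate fresh, relative-timestamp events and bulk-load them once per session."""
    subprocess.run(
        [
            "python", "backfill_activity_events.py",
            "--users-file", "users.json",
            "--days", "30",
            "--events-per-day", "200",
            "--elasticsearch-url", "http://localhost:9200",
            "--load-to-elasticsearch",
        ],
        check=True,
    )
    yield  # tests run with data in place
```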
For ongoing testing, schedule a daily top-up:

```bash
# Cron job to add daily activity
0 0 * * * cd /path/to/analytics_backfill && ./populate_analytics.sh --token $TOKEN --num-days 1 --skip-users
```
## Custom Event Patterns

Modify `backfill_activity_events.py` to add custom patterns:

```python
# Add specific events for your use case
def generate_dashboard_migration_events(self, date):
    """Simulate a dashboard migration project."""
    events = []
    for dashboard_urn in self.dashboard_urns:
        # Heavy activity on specific dashboards
        for _ in range(50):
            events.append(self.generate_entity_view_event(date, entity_urn=dashboard_urn))
    return events
```
Tip: run with `--skip-load` to review the generated events before loading them.

## Cleanup

To remove generated data:
```bash
# Delete test users
datahub delete --urn "urn:li:corpuser:alice.anderson"

# Delete usage events index
curl -X DELETE http://localhost:9200/datahub_usage_event

# Recreate empty index (will be auto-created on next event)
```
## Support

For issues or questions, check the service logs:

```bash
docker logs datahub-gms
docker logs elasticsearch
```

## License

These tools are part of the DataHub project and follow the same Apache 2.0 license.