examples/quickstart/releases/Druid26.ipynb
This notebook highlights some of the new features released in Druid 26.0.
Before you begin, ensure you have the following:
pandas Python package
requests Python package
import requests
druid_host = "http://localhost:8888"
session = requests.Session()
endpoint = druid_host + '/status'
response = session.get(endpoint)
json = response.json()
print("Running on Druid version: "+ json["version"])
Druid already supported string-based schema auto-discovery, but with some limitations: all newly discovered columns were stored as strings. This means aggregation queries on numeric columns could be slow (the values had to be parsed as numbers first), and some fields, such as multi-value dimensions with null values, could misbehave.
With the introduction of type-aware schema auto-discovery, Druid now properly infers data types. Enable it in an ingestion job by including "useSchemaDiscovery": True in the dimensionsSpec object. In the example below, you run a batch ingestion job and instruct Druid to automatically infer the input data types as long, float, string, and so on. Run the following cell, then go to the web console to check the progress of your ingestion task.
import json
from IPython.display import JSON
ingestion_spec = {
"type": "index_parallel",
"spec": {
"ioConfig": {
"type": "index_parallel",
"inputSource": {
"type": "http",
"uris": ["https://druid.apache.org/data/wikipedia.json.gz"],
"filter": "*"
},
"inputFormat": {
"type": "json"
}
},
"tuningConfig": {
"type": "index_parallel",
"partitionsSpec": {
"type": "dynamic"
},
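# The indexSpec below opts in to front-coded string dictionary encoding, an
# experimental option that can reduce segment size for string columns.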
"indexSpec": {
"stringDictionaryEncoding": {
"type": "frontCoded",
"bucketSize": 16
}
}
},
"dataSchema": {
"dataSource": "wikipedia",
"timestampSpec": {
"missingValue": "2010-01-01T00:00:00Z"
},
"dimensionsSpec": {
"dimensions": [],
"dimensionExclusions": [],
"spatialDimensions": [],
"useSchemaDiscovery": True
},
"granularitySpec": {
"queryGranularity": "none",
"rollup": False
}
}
}
}
JSON(ingestion_spec,expanded=True)
endpoint = druid_host + '/druid/indexer/v1/task/'
response = session.post(endpoint,json = ingestion_spec)
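If you'd rather stay in the notebook than switch to the web console, you can also poll the task status from here. This is a minimal sketch that assumes the POST response returns the task ID under the "task" key and reuses the task status endpoint that appears again later in this notebook.
# Poll the native ingestion task until it finishes (sketch; adjust the interval as needed).
import time
task_id = response.json()['task']
status_endpoint = druid_host + f"/druid/indexer/v1/task/{task_id}/status"
task_status = session.get(status_endpoint).json()['status']['status']
while task_status == "RUNNING":
    time.sleep(2)
    task_status = session.get(status_endpoint).json()['status']['status']
print("Task finished with status:", task_status)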
Note that because we've set "useSchemaDiscovery": True in the ingestion spec, the column data types are correctly inferred even though we didn't specify any. The following cell queries the INFORMATION_SCHEMA metadata table and displays the data types of the columns in the wikipedia table you just ingested.
import pandas as pd
endpoint = druid_host + '/druid/v2/sql'
sql = '''
SELECT *
FROM "INFORMATION_SCHEMA"."COLUMNS"
WHERE "TABLE_NAME" = 'wikipedia'
'''
sql_request = {'query': sql}
json_data = session.post(endpoint, json=sql_request).json()
result_df = pd.json_normalize(json_data)
result_df.head()
As you can see in the DATA_TYPE column, the different data types are correctly detected. With string-based schema auto-discovery, Druid would have stored all of these columns as strings.
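Because the numeric columns were inferred as numbers, you can aggregate them directly without string parsing. As an illustrative check (assuming the ingestion task above has finished successfully), the following query sums commentLength by channel, both of which are columns in the sample wikipedia data:
sql = '''
SELECT "channel", SUM("commentLength") AS "total_comment_length"
FROM "wikipedia"
GROUP BY "channel"
ORDER BY "total_comment_length" DESC
LIMIT 5
'''
# Sum a numeric column per channel; no casting needed because the type was inferred at ingestion.
sql_request = {'query': sql}
json_data = session.post(endpoint, json=sql_request).json()
pd.json_normalize(json_data).head()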
Before shuffle joins were supported, you needed another tool to prepare and join the data before ingesting it into Druid. With shuffle join support, you can do the same transformation in a single query. For example, the query below performs a self-join on the wikipedia dataset. You can apply the same pattern to a typical star-schema dataset.
query = '''
REPLACE INTO "wikipedia" OVERWRITE ALL
WITH "wikipedia_main" AS (SELECT *
FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}'
)
) EXTEND ("channel" VARCHAR, "timestamp" VARCHAR,"user" VARCHAR))
,
"wikipedia_dim" AS (SELECT *
FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}'
)
) EXTEND ("timestamp" VARCHAR,"user" VARCHAR,"comment" VARCHAR, "commentLength" BIGINT, "cityName" VARCHAR, "countryName" VARCHAR))
SELECT
TIME_PARSE("wikipedia_main"."timestamp") AS "__time",
"wikipedia_main".*,
"wikipedia_dim".*
FROM "wikipedia_main"
LEFT JOIN "wikipedia_dim"
ON
"wikipedia_main"."user" = "wikipedia_dim"."user"
AND
"wikipedia_main"."timestamp" = "wikipedia_dim"."timestamp"
PARTITIONED BY MONTH
'''
Submit the preceding query and monitor the ingestion job by running the following cells. This may take a while. You can check the status of the ingestion task in the web console.
# This block submits the ingestion query
sql_request={'query': query}
endpoint = druid_host + '/druid/v2/sql/task'
response = session.post(endpoint, json=sql_request)
# This block monitors the ingestion query (Takes about 25-35 seconds)
ingestion_taskId = response.json()['taskId']
endpoint = druid_host + f"/druid/indexer/v1/task/{ingestion_taskId}/status"
import time
json = session.get(endpoint).json()
ingestion_status = json['status']['status']
print("The ingestion is running...")
while ingestion_status == "RUNNING":
time.sleep(1)
json = session.get(endpoint).json()
ingestion_status = json['status']['status']
print('.', end='')
if ingestion_status == "SUCCESS":
print("\nThe ingestion is complete")
else:
print("\nThe ingestion task failed:", json)
UNNEST is useful for working with array data: it lets you "explode" an array into individual rows.
In this example, we look at an array of tags containing almond, blue_berry, and muffin. We can use UNNEST to explode the array into individual rows and then, for example, GROUP BY the individual tags (see the grouping example at the end of this section).
import pandas as pd
endpoint = druid_host + '/druid/v2/sql'
sql = '''
SELECT 'post_id_123' AS "POST_ID", ARRAY['almond','blue_berry','muffin'] as "Tags"
'''
sql_request = {'query': sql}
json_data = session.post(endpoint, json=sql_request).json()
result_df = pd.json_normalize(json_data)
result_df.head()
For more examples and details on UNNEST, see Unnest arrays within a column.
import pandas as pd
endpoint = druid_host + '/druid/v2/sql'
sql = '''SELECT 'post_id_123' as "POST_ID", * FROM UNNEST(ARRAY['almond','blue_berry','muffin'])
'''
sql_request = {'query': sql, 'context':{'enableUnnest': 'true'}}
json_data = session.post(endpoint, json=sql_request).json()
JSON(json_data)
result_df = pd.json_normalize(json_data)
result_df.head()
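To follow through on the GROUP BY mentioned earlier, you can give the unnested column an alias and count each tag. This is a sketch that assumes the UNNEST(...) AS table(column) alias form is accepted by your Druid version; a duplicate 'almond' is included only to make the counts interesting:
sql = '''
SELECT "tag", COUNT(*) AS "cnt"
FROM UNNEST(ARRAY['almond','blue_berry','muffin','almond']) AS "t" ("tag")
GROUP BY "tag"
'''
# Group and count the exploded tag values (sketch; requires the enableUnnest context flag as above).
sql_request = {'query': sql, 'context': {'enableUnnest': 'true'}}
json_data = session.post(endpoint, json=sql_request).json()
pd.json_normalize(json_data).head()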
Now that you've made it this far, try out some of the new features and let us know what you think!