docs/documentation/aggregates/tantivy.mdx
In addition to plain SQL aggregates, ParadeDB also has the ability to compute aggregates over a single BM25 index by accepting JSON query strings.
These aggregates can be more performant than plain SQL aggregates over some datasets.
paradedb.aggregate accepts three arguments: the name of the BM25 index, a full text search query builder function,
and a Tantivy aggregate JSON.
SELECT * FROM paradedb.aggregate(
'<index_name>',
<search_query>,
'<aggregate_query>'
);
A count aggregation tallies the number of values for the specified field across all documents.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"rating_total": {
"value_count": {"field": "rating"}
}
}'
);
An average aggregation calculates the mean of the specified numeric field values across all documents.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"avg_rating": {
"avg": {"field": "rating"}
}
}'
);
A sum aggregation computes the total sum of the specified numeric field values across all documents.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"rating_total": {
"sum": {"field": "rating"}
}
}'
);
A min aggregation finds the smallest value for the specified numeric field across all documents.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"min_rating": {
"min": {"field": "rating"}
}
}'
);
A max aggregation finds the largest value for the specified numeric field across all documents.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"max_rating": {
"max": {"field": "rating"}
}
}'
);
A stats aggregation provides a collection of statistical metrics for the specified numeric field, including count, sum, average, min, and max.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"rating_stats": {
"stats": {"field": "rating"}
}
}'
);
The percentiles aggregation calculates the values below which given percentages of the data fall, providing insights into the distribution of a dataset.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"rating_percentiles": {
"percentiles": {"field": "rating"}
}
}'
);
A cardinality aggregation estimates the number of unique values in the specified field using the HyperLogLog++ algorithm. This is useful for understanding the uniqueness of values in a large dataset.
<Note> The cardinality aggregation provides an approximate count, which is accurate within a small error range. This trade-off allows for efficient computation even on very large datasets. </Note>SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"unique_users": {
"cardinality": {"field": "user_id", "missing": "unknown"}
}
}'
);
Histogram is a bucket aggregation where buckets are created dynamically based on a specified interval. Each document value is rounded down to its bucket. For example, if you have a price of 18 and an interval of 5, the document will fall into the bucket with the key 15. The formula used for this is: ((val - offset) / interval).floor() * interval + offset.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"rating_histogram": {
"histogram": {"field": "rating", "interval": 1}
}
}'
);
Similar to histogram, but can only be used with datetime types. Currently, only fixed time intervals are supported.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"created_at_histogram": {
"date_histogram": {"field": "created_at", "fixed_interval": "1h"}
}
}'
);
Range allows you to define custom buckets for specific ranges.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"ranges": {
"range": {"field": "rating", "ranges": [
{ "to": 3.0 },
{ "from": 3.0, "to": 7.0 },
{ "from": 7.0, "to": 20.0 },
{ "from": 20.0 }
]}
}
}'
);
Terms creates a bucket for every unique term and counts the number of occurrences.
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"rating_terms": {
"terms": {"field": "rating"}
}
}'
);
Buckets can contain sub-aggregations. For example, creating buckets with the range aggregation and then calculating the average on each bucket:
SELECT * FROM paradedb.aggregate(
'search_idx',
paradedb.all(),
'{
"range_rating": {
"range": {
"field": "rating",
"ranges": [
{ "from": 1, "to": 3 },
{ "from": 3, "to": 5 }
]
},
"aggs": {
"average_in_range": { "avg": { "field": "rating"} }
}
}
}'
);