Back to Paradedb

Tantivy Aggregates

docs/documentation/aggregates/tantivy.mdx

0.23.310.4 KB
Original Source

In addition to plain SQL aggregates, ParadeDB also has the ability to compute aggregates over a single BM25 index by accepting JSON query strings.

These aggregates can be more performant than plain SQL aggregates over some datasets.

Syntax

paradedb.aggregate accepts three arguments: the name of the BM25 index, a full text search query builder function, and a Tantivy aggregate JSON.

sql
SELECT * FROM paradedb.aggregate(
    '<index_name>',
    <search_query>,
    '<aggregate_query>'
);
<ParamField body="index_name" required> The name of the BM25 index as a string. </ParamField> <ParamField body="search_query" required> A full text search query builder function. The aggregate will be computed over the results of this function. </ParamField> <ParamField body="aggregate_query" required> A Tantivy aggregate JSON string. See the sections below for how to construct these JSONs. </ParamField>

Count

A count aggregation tallies the number of values for the specified field across all documents.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "rating_total": {
            "value_count": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the count on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Average

An average aggregation calculates the mean of the specified numeric field values across all documents.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "avg_rating": {
            "avg": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the average on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Sum

A sum aggregation computes the total sum of the specified numeric field values across all documents.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "rating_total": {
            "sum": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the sum on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Min

A min aggregation finds the smallest value for the specified numeric field across all documents.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "min_rating": {
            "min": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the minimum on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Max

A max aggregation finds the largest value for the specified numeric field across all documents.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "max_rating": {
            "max": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the maximum on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Stats

A stats aggregation provides a collection of statistical metrics for the specified numeric field, including count, sum, average, min, and max.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "rating_stats": {
            "stats": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the stats on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Percentiles

The percentiles aggregation calculates the values below which given percentages of the data fall, providing insights into the distribution of a dataset.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "rating_percentiles": {
            "percentiles": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the percentiles on. </ParamField> <ParamField body="percents" default={[1.0, 5.0, 25.0, 50.0, 75.0, 95.0, 99.0]}> The percentiles to compute. </ParamField> <ParamField body="keyed" default={false}> Whether to return the percentiles as a hash map. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Cardinality

A cardinality aggregation estimates the number of unique values in the specified field using the HyperLogLog++ algorithm. This is useful for understanding the uniqueness of values in a large dataset.

<Note> The cardinality aggregation provides an approximate count, which is accurate within a small error range. This trade-off allows for efficient computation even on very large datasets. </Note>
sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "unique_users": {
            "cardinality": {"field": "user_id", "missing": "unknown"}
        }
    }'
);
<ParamField body="field" required> The field name to compute the cardinality on. </ParamField> <ParamField body="missing"> The value to use for documents missing the field. By default, missing values are ignored. </ParamField>

Histogram

Histogram is a bucket aggregation where buckets are created dynamically based on a specified interval. Each document value is rounded down to its bucket. For example, if you have a price of 18 and an interval of 5, the document will fall into the bucket with the key 15. The formula used for this is: ((val - offset) / interval).floor() * interval + offset.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "rating_histogram": {
            "histogram": {"field": "rating", "interval": 1}
        }
    }'
);
<ParamField body="field" required> The field to aggregate on. </ParamField> <ParamField body="interval" required> The interval to chunk your data range. Each bucket spans a value range of [0..interval). Must be a positive value. </ParamField> <ParamField body="offset" default={0.0}> Shift the grid of buckets by the specified offset. </ParamField> <ParamField body="min_doc_count" default={0}> The minimum number of documents in a bucket to be returned. </ParamField> <ParamField body="hard_bounds"> Limits the data range to [min, max] closed interval. </ParamField> <ParamField body="extended_bounds"> Extends the value range of the buckets. </ParamField> <ParamField body="keyed" default={false}> Whether to return the buckets as a hash map. </ParamField> <ParamField body="is_normalized_to_ns" default={false}> Whether the values are normalized to ns for date time values. </ParamField>

Date Histogram

Similar to histogram, but can only be used with datetime types. Currently, only fixed time intervals are supported.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "created_at_histogram": {
            "date_histogram": {"field": "created_at", "fixed_interval": "1h"}
        }
    }'
);
<ParamField body="field" required> The field to aggregate on. </ParamField> <ParamField body="fixed_interval" required> The interval to chunk your data range. Each bucket spans a value range of [0..fixed_interval). Accepted values should end in `ms`, `s`, `m`, `h`, or `d`. </ParamField> <ParamField body="offset" default={0}> Shift the grid of buckets by the specified offset. </ParamField> <ParamField body="min_doc_count" default={0}> The minimum number of documents in a bucket to be returned. </ParamField> <ParamField body="hard_bounds"> Limits the data range to [min, max] closed interval. </ParamField> <ParamField body="extended_bounds"> Extends the value range of the buckets. </ParamField> <ParamField body="keyed" default={false}> Whether to return the buckets as a hash map. </ParamField>

Range

Range allows you to define custom buckets for specific ranges.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "ranges": {
            "range": {"field": "rating", "ranges": [
                { "to": 3.0 },
                { "from": 3.0, "to": 7.0 },
                { "from": 7.0, "to": 20.0 },
                { "from": 20.0 }
            ]}
        }
    }'
);
<ParamField body="field" required> The field to aggregate on. </ParamField> <ParamField body="ranges" required> A list of ranges to aggregate on. </ParamField> <ParamField body="keyed" default={false}> Whether to return the buckets as a hash map. </ParamField>

Terms

Terms creates a bucket for every unique term and counts the number of occurrences.

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "rating_terms": {
            "terms": {"field": "rating"}
        }
    }'
);
<ParamField body="field" required> The field to aggregate on. </ParamField> <ParamField body="size" default={10}> The number of terms to return. </ParamField> <ParamField body="segment_size" default={100}> The number of terms to fetch from each segment. </ParamField> <ParamField body="show_term_doc_count_error" default={false}> Whether to include the document count error. </ParamField> <ParamField body="min_doc_count" default={1}> The minimum number of documents in a term to be returned. </ParamField> <ParamField body="order">The order in which to return the terms.</ParamField> <ParamField body="missing"> The value to use for documents missing the field. </ParamField>

Nested Aggregations

Buckets can contain sub-aggregations. For example, creating buckets with the range aggregation and then calculating the average on each bucket:

sql
SELECT * FROM paradedb.aggregate(
    'search_idx',
    paradedb.all(),
    '{
        "range_rating": {
            "range": {
            "field": "rating",
            "ranges": [
                { "from": 1, "to": 3 },
                { "from": 3, "to": 5 }
            ]
            },
            "aggs": {
            "average_in_range": { "avg": { "field": "rating"} }
            }
        }
    }'
);