Back to Datafusion

Aggregate Functions

docs/source/user-guide/sql/aggregate_functions.md

53.1.033.8 KB
Original Source
<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> <!--- This file was generated by the dev/update_function_docs.sh script. Do not edit it manually as changes will be overwritten. Instead, edit the AggregateUDFImpl's documentation() function to update documentation for an individual UDF or the dev/update_function_docs.sh file for updating surrounding text. -->

Aggregate Functions

Aggregate functions operate on a set of values to compute a single result.

Filter clause

Aggregate functions support the SQL FILTER (WHERE ...) clause to restrict which input rows contribute to the aggregate result.

sql
function([exprs]) FILTER (WHERE condition)

Example:

sql
SELECT
  sum(salary) FILTER (WHERE salary > 0) AS sum_positive_salaries,
  count(*)    FILTER (WHERE active)     AS active_count
FROM employees;

Note: When no rows pass the filter, COUNT returns 0 while SUM/AVG/MIN/MAX return NULL.

WITHIN GROUP / Ordered-set aggregates

Some aggregate functions accept the SQL WITHIN GROUP (ORDER BY ...) clause to specify the ordering the aggregate relies on. In DataFusion this is opt-in: only aggregate functions whose implementation returns true from AggregateUDFImpl::supports_within_group_clause() accept the WITHIN GROUP clause. Attempting to use WITHIN GROUP with a regular aggregate (for example, SELECT SUM(x) WITHIN GROUP (ORDER BY x)) will fail during planning with an error: "WITHIN GROUP is only supported for ordered-set aggregate functions".

Currently, the built-in aggregate functions that support WITHIN GROUP are:

  • percentile_cont — exact percentile aggregate (also available as percentile_cont(column, percentile))
  • approx_percentile_cont — approximate percentile using the t-digest algorithm
  • approx_percentile_cont_with_weight — approximate weighted percentile using the t-digest algorithm

Note: rank-like functions such as rank(), dense_rank(), and percent_rank() are window functions and use the OVER (...) clause; they are not ordered-set aggregates that accept WITHIN GROUP in DataFusion.

Example (ordered-set aggregate):

sql
percentile_cont(0.5) WITHIN GROUP (ORDER BY value)

Example (invalid usage — planner will error):

sql
-- This will fail: SUM is not an ordered-set aggregate
SELECT SUM(x) WITHIN GROUP (ORDER BY x) FROM t;

General Functions

array_agg

Returns an array created from the expression elements. If ordering is required, elements are inserted in the specified order. This aggregation function can only mix DISTINCT and ORDER BY if the ordering expression is exactly the same as the argument expression.

sql
array_agg(expression [ORDER BY expression])

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT array_agg(column_name ORDER BY other_column) FROM table_name;
+-----------------------------------------------+
| array_agg(column_name ORDER BY other_column)  |
+-----------------------------------------------+
| [element1, element2, element3]                |
+-----------------------------------------------+
> SELECT array_agg(DISTINCT column_name ORDER BY column_name) FROM table_name;
+--------------------------------------------------------+
| array_agg(DISTINCT column_name ORDER BY column_name)  |
+--------------------------------------------------------+
| [element1, element2, element3]                         |
+--------------------------------------------------------+

avg

Returns the average of numeric values in the specified column.

sql
avg(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT avg(column_name) FROM table_name;
+---------------------------+
| avg(column_name)           |
+---------------------------+
| 42.75                      |
+---------------------------+

Aliases

  • mean

bit_and

Computes the bitwise AND of all non-null input values.

sql
bit_and(expression)

Arguments

  • expression: Integer expression to operate on. Can be a constant, column, or function, and any combination of operators.

bit_or

Computes the bitwise OR of all non-null input values.

sql
bit_or(expression)

Arguments

  • expression: Integer expression to operate on. Can be a constant, column, or function, and any combination of operators.

bit_xor

Computes the bitwise exclusive OR of all non-null input values.

sql
bit_xor(expression)

Arguments

  • expression: Integer expression to operate on. Can be a constant, column, or function, and any combination of operators.

bool_and

Returns true if all non-null input values are true, otherwise false.

sql
bool_and(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT bool_and(column_name) FROM table_name;
+----------------------------+
| bool_and(column_name)       |
+----------------------------+
| true                        |
+----------------------------+

bool_or

Returns true if all non-null input values are true, otherwise false.

sql
bool_and(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT bool_and(column_name) FROM table_name;
+----------------------------+
| bool_and(column_name)       |
+----------------------------+
| true                        |
+----------------------------+

count

Returns the number of non-null values in the specified column. To include null values in the total count, use count(*).

sql
count(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT count(column_name) FROM table_name;
+-----------------------+
| count(column_name)     |
+-----------------------+
| 100                   |
+-----------------------+

> SELECT count(*) FROM table_name;
+------------------+
| count(*)         |
+------------------+
| 120              |
+------------------+

first_value

Returns the first element in an aggregation group according to the requested ordering. If no ordering is given, returns an arbitrary element from the group.

sql
first_value(expression [ORDER BY expression])

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT first_value(column_name ORDER BY other_column) FROM table_name;
+-----------------------------------------------+
| first_value(column_name ORDER BY other_column)|
+-----------------------------------------------+
| first_element                                 |
+-----------------------------------------------+

grouping

Returns 1 if the data is aggregated across the specified column, or 0 if it is not aggregated in the result set.

sql
grouping(expression)

Arguments

  • expression: Expression to evaluate whether data is aggregated across the specified column. Can be a constant, column, or function.

Example

sql
> SELECT column_name, GROUPING(column_name) AS group_column
  FROM table_name
  GROUP BY GROUPING SETS ((column_name), ());
+-------------+-------------+
| column_name | group_column |
+-------------+-------------+
| value1      | 0           |
| value2      | 0           |
| NULL        | 1           |
+-------------+-------------+

last_value

Returns the last element in an aggregation group according to the requested ordering. If no ordering is given, returns an arbitrary element from the group.

sql
last_value(expression [ORDER BY expression])

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT last_value(column_name ORDER BY other_column) FROM table_name;
+-----------------------------------------------+
| last_value(column_name ORDER BY other_column) |
+-----------------------------------------------+
| last_element                                  |
+-----------------------------------------------+

max

Returns the maximum value in the specified column.

sql
max(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT max(column_name) FROM table_name;
+----------------------+
| max(column_name)      |
+----------------------+
| 150                  |
+----------------------+

mean

Alias of avg.

median

Returns the median value in the specified column.

sql
median(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT median(column_name) FROM table_name;
+----------------------+
| median(column_name)   |
+----------------------+
| 45.5                 |
+----------------------+

min

Returns the minimum value in the specified column.

sql
min(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT min(column_name) FROM table_name;
+----------------------+
| min(column_name)      |
+----------------------+
| 12                   |
+----------------------+

percentile_cont

Returns the exact percentile of input values, interpolating between values if needed.

sql
percentile_cont(percentile) WITHIN GROUP (ORDER BY expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • percentile: Percentile to compute. Must be a float value between 0 and 1 (inclusive).

Example

sql
> SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) FROM table_name;
+----------------------------------------------------------+
| percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) |
+----------------------------------------------------------+
| 45.5                                                     |
+----------------------------------------------------------+

An alternate syntax is also supported:

sql
> SELECT percentile_cont(column_name, 0.75) FROM table_name;
+---------------------------------------+
| percentile_cont(column_name, 0.75)    |
+---------------------------------------+
| 45.5                                  |
+---------------------------------------+

Aliases

  • quantile_cont

quantile_cont

Alias of percentile_cont.

string_agg

Concatenates the values of string expressions and places separator values between them. If ordering is required, strings are concatenated in the specified order. This aggregation function can only mix DISTINCT and ORDER BY if the ordering expression is exactly the same as the first argument expression.

sql
string_agg([DISTINCT] expression, delimiter [ORDER BY expression])

Arguments

  • expression: The string expression to concatenate. Can be a column or any valid string expression.
  • delimiter: A literal string used as a separator between the concatenated values.

Example

sql
> SELECT string_agg(name, ', ') AS names_list
  FROM employee;
+--------------------------+
| names_list               |
+--------------------------+
| Alice, Bob, Bob, Charlie |
+--------------------------+
> SELECT string_agg(name, ', ' ORDER BY name DESC) AS names_list
  FROM employee;
+--------------------------+
| names_list               |
+--------------------------+
| Charlie, Bob, Bob, Alice |
+--------------------------+
> SELECT string_agg(DISTINCT name, ', ' ORDER BY name DESC) AS names_list
  FROM employee;
+--------------------------+
| names_list               |
+--------------------------+
| Charlie, Bob, Alice |
+--------------------------+

sum

Returns the sum of all values in the specified column.

sql
sum(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT sum(column_name) FROM table_name;
+-----------------------+
| sum(column_name)       |
+-----------------------+
| 12345                 |
+-----------------------+

var

Returns the statistical sample variance of a set of numbers.

sql
var(expression)

Arguments

  • expression: Numeric expression to operate on. Can be a constant, column, or function, and any combination of operators.

Aliases

  • var_sample
  • var_samp

var_pop

Returns the statistical population variance of a set of numbers.

sql
var_pop(expression)

Arguments

  • expression: Numeric expression to operate on. Can be a constant, column, or function, and any combination of operators.

Aliases

  • var_population

var_population

Alias of var_pop.

var_samp

Alias of var.

var_sample

Alias of var.

Statistical Functions

corr

Returns the coefficient of correlation between two numeric values.

sql
corr(expression1, expression2)

Arguments

  • expression1: First expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression2: Second expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT corr(column1, column2) FROM table_name;
+--------------------------------+
| corr(column1, column2)         |
+--------------------------------+
| 0.85                           |
+--------------------------------+

covar

Alias of covar_samp.

covar_pop

Returns the sample covariance of a set of number pairs.

sql
covar_samp(expression1, expression2)

Arguments

  • expression1: First expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression2: Second expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT covar_samp(column1, column2) FROM table_name;
+-----------------------------------+
| covar_samp(column1, column2)      |
+-----------------------------------+
| 8.25                              |
+-----------------------------------+

covar_samp

Returns the sample covariance of a set of number pairs.

sql
covar_samp(expression1, expression2)

Arguments

  • expression1: First expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression2: Second expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT covar_samp(column1, column2) FROM table_name;
+-----------------------------------+
| covar_samp(column1, column2)      |
+-----------------------------------+
| 8.25                              |
+-----------------------------------+

Aliases

  • covar

nth_value

Returns the nth value in a group of values.

sql
nth_value(expression, n ORDER BY expression)

Arguments

  • expression: The column or expression to retrieve the nth value from.
  • n: The position (nth) of the value to retrieve, based on the ordering.

Example

sql
> SELECT dept_id, salary, NTH_VALUE(salary, 2) OVER (PARTITION BY dept_id ORDER BY salary ASC) AS second_salary_by_dept
  FROM employee;
+---------+--------+-------------------------+
| dept_id | salary | second_salary_by_dept   |
+---------+--------+-------------------------+
| 1       | 30000  | NULL                    |
| 1       | 40000  | 40000                   |
| 1       | 50000  | 40000                   |
| 2       | 35000  | NULL                    |
| 2       | 45000  | 45000                   |
+---------+--------+-------------------------+

regr_avgx

Computes the average of the independent variable (input) expression_x for the non-null paired data points.

sql
regr_avgx(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table daily_sales(day int, total_sales int) as values (1,100), (2,150), (3,200), (4,NULL), (5,250);
select * from daily_sales;
+-----+-------------+
| day | total_sales |
| --- | ----------- |
| 1   | 100         |
| 2   | 150         |
| 3   | 200         |
| 4   | NULL        |
| 5   | 250         |
+-----+-------------+

SELECT regr_avgx(total_sales, day) AS avg_day FROM daily_sales;
+----------+
| avg_day  |
+----------+
|   2.75   |
+----------+

regr_avgy

Computes the average of the dependent variable (output) expression_y for the non-null paired data points.

sql
regr_avgy(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table daily_temperature(day int, temperature int) as values (1,30), (2,32), (3, NULL), (4,35), (5,36);
select * from daily_temperature;
+-----+-------------+
| day | temperature |
| --- | ----------- |
| 1   | 30          |
| 2   | 32          |
| 3   | NULL        |
| 4   | 35          |
| 5   | 36          |
+-----+-------------+

-- temperature as Dependent Variable(Y), day as Independent Variable(X)
SELECT regr_avgy(temperature, day) AS avg_temperature FROM daily_temperature;
+-----------------+
| avg_temperature |
+-----------------+
| 33.25           |
+-----------------+

regr_count

Counts the number of non-null paired data points.

sql
regr_count(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table daily_metrics(day int, user_signups int) as values (1,100), (2,120), (3, NULL), (4,110), (5,NULL);
select * from daily_metrics;
+-----+---------------+
| day | user_signups  |
| --- | ------------- |
| 1   | 100           |
| 2   | 120           |
| 3   | NULL          |
| 4   | 110           |
| 5   | NULL          |
+-----+---------------+

SELECT regr_count(user_signups, day) AS valid_pairs FROM daily_metrics;
+-------------+
| valid_pairs |
+-------------+
| 3           |
+-------------+

regr_intercept

Computes the y-intercept of the linear regression line. For the equation (y = kx + b), this function returns b.

sql
regr_intercept(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table weekly_performance(week int, productivity_score int) as values (1,60), (2,65), (3, 70), (4,75), (5,80);
select * from weekly_performance;
+------+---------------------+
| week | productivity_score  |
| ---- | ------------------- |
| 1    | 60                  |
| 2    | 65                  |
| 3    | 70                  |
| 4    | 75                  |
| 5    | 80                  |
+------+---------------------+

SELECT regr_intercept(productivity_score, week) AS intercept FROM weekly_performance;
+----------+
|intercept|
|intercept |
+----------+
|  55      |
+----------+

regr_r2

Computes the square of the correlation coefficient between the independent and dependent variables.

sql
regr_r2(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table weekly_performance(day int ,user_signups int) as values (1,60), (2,65), (3, 70), (4,75), (5,80);
select * from weekly_performance;
+-----+--------------+
| day | user_signups |
+-----+--------------+
| 1   | 60           |
| 2   | 65           |
| 3   | 70           |
| 4   | 75           |
| 5   | 80           |
+-----+--------------+

SELECT regr_r2(user_signups, day) AS r_squared FROM weekly_performance;
+---------+
|r_squared|
+---------+
| 1.0     |
+---------+

regr_slope

Returns the slope of the linear regression line for non-null pairs in aggregate columns. Given input column Y and X: regr_slope(Y, X) returns the slope (k in Y = k*X + b) using minimal RSS fitting.

sql
regr_slope(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table weekly_performance(day int, user_signups int) as values (1,60), (2,65), (3, 70), (4,75), (5,80);
select * from weekly_performance;
+-----+--------------+
| day | user_signups |
+-----+--------------+
| 1   | 60           |
| 2   | 65           |
| 3   | 70           |
| 4   | 75           |
| 5   | 80           |
+-----+--------------+

SELECT regr_slope(user_signups, day) AS slope FROM weekly_performance;
+--------+
| slope  |
+--------+
| 5.0    |
+--------+

regr_sxx

Computes the sum of squares of the independent variable.

sql
regr_sxx(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table study_hours(student_id int, hours int, test_score int) as values (1,2,55), (2,4,65), (3,6,75), (4,8,85), (5,10,95);
select * from study_hours;
+------------+-------+------------+
| student_id | hours | test_score |
+------------+-------+------------+
| 1          | 2     | 55         |
| 2          | 4     | 65         |
| 3          | 6     | 75         |
| 4          | 8     | 85         |
| 5          | 10    | 95         |
+------------+-------+------------+

SELECT regr_sxx(test_score, hours) AS sxx FROM study_hours;
+------+
| sxx  |
+------+
| 40.0 |
+------+

regr_sxy

Computes the sum of products of paired data points.

sql
regr_sxy(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table employee_productivity(week int, productivity_score int) as values(1,60), (2,65), (3,70);
select * from employee_productivity;
+------+--------------------+
| week | productivity_score |
+------+--------------------+
| 1    | 60                 |
| 2    | 65                 |
| 3    | 70                 |
+------+--------------------+

SELECT regr_sxy(productivity_score, week) AS sum_product_deviations FROM employee_productivity;
+------------------------+
| sum_product_deviations |
+------------------------+
|       10.0             |
+------------------------+

regr_syy

Computes the sum of squares of the dependent variable.

sql
regr_syy(expression_y, expression_x)

Arguments

  • expression_y: Dependent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • expression_x: Independent variable expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
create table employee_productivity(week int, productivity_score int) as values (1,60), (2,65), (3,70);
select * from employee_productivity;
+------+--------------------+
| week | productivity_score |
+------+--------------------+
| 1    | 60                 |
| 2    | 65                 |
| 3    | 70                 |
+------+--------------------+

SELECT regr_syy(productivity_score, week) AS sum_squares_y FROM employee_productivity;
+---------------+
| sum_squares_y |
+---------------+
|    50.0       |
+---------------+

stddev

Returns the standard deviation of a set of numbers.

sql
stddev(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT stddev(column_name) FROM table_name;
+----------------------+
| stddev(column_name)   |
+----------------------+
| 12.34                |
+----------------------+

Aliases

  • stddev_samp

stddev_pop

Returns the population standard deviation of a set of numbers.

sql
stddev_pop(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT stddev_pop(column_name) FROM table_name;
+--------------------------+
| stddev_pop(column_name)   |
+--------------------------+
| 10.56                    |
+--------------------------+

stddev_samp

Alias of stddev.

Approximate Functions

approx_distinct

Returns the approximate number of distinct input values calculated using the HyperLogLog algorithm.

sql
approx_distinct(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT approx_distinct(column_name) FROM table_name;
+-----------------------------------+
| approx_distinct(column_name)      |
+-----------------------------------+
| 42                                |
+-----------------------------------+

approx_median

Returns the approximate median (50th percentile) of input values. It is an alias of approx_percentile_cont(0.5) WITHIN GROUP (ORDER BY x).

sql
approx_median(expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.

Example

sql
> SELECT approx_median(column_name) FROM table_name;
+-----------------------------------+
| approx_median(column_name)        |
+-----------------------------------+
| 23.5                              |
+-----------------------------------+

approx_percentile_cont

Returns the approximate percentile of input values using the t-digest algorithm.

sql
approx_percentile_cont(percentile [, centroids]) WITHIN GROUP (ORDER BY expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • percentile: Percentile to compute. Must be a float value between 0 and 1 (inclusive).
  • centroids: Number of centroids to use in the t-digest algorithm. Default is 100. A higher number results in more accurate approximation but requires more memory.

Example

sql
> SELECT approx_percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) FROM table_name;
+------------------------------------------------------------------+
| approx_percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) |
+------------------------------------------------------------------+
| 65.0                                                             |
+------------------------------------------------------------------+
> SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;
+-----------------------------------------------------------------------+
| approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) |
+-----------------------------------------------------------------------+
| 65.0                                                                  |
+-----------------------------------------------------------------------+

An alternate syntax is also supported:

sql
> SELECT approx_percentile_cont(column_name, 0.75) FROM table_name;
+-----------------------------------------------+
| approx_percentile_cont(column_name, 0.75)     |
+-----------------------------------------------+
| 65.0                                          |
+-----------------------------------------------+

> SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
+----------------------------------------------------------+
| approx_percentile_cont(column_name, 0.75, 100)           |
+----------------------------------------------------------+
| 65.0                                                     |
+----------------------------------------------------------+

approx_percentile_cont_with_weight

Returns the weighted approximate percentile of input values using the t-digest algorithm.

sql
approx_percentile_cont_with_weight(weight, percentile [, centroids]) WITHIN GROUP (ORDER BY expression)

Arguments

  • expression: The expression to operate on. Can be a constant, column, or function, and any combination of operators.
  • weight: Expression to use as weight. Can be a constant, column, or function, and any combination of arithmetic operators.
  • percentile: Percentile to compute. Must be a float value between 0 and 1 (inclusive).
  • centroids: Number of centroids to use in the t-digest algorithm. Default is 100. A higher number results in more accurate approximation but requires more memory.

Example

sql
> SELECT approx_percentile_cont_with_weight(weight_column, 0.90) WITHIN GROUP (ORDER BY column_name) FROM table_name;
+---------------------------------------------------------------------------------------------+
| approx_percentile_cont_with_weight(weight_column, 0.90) WITHIN GROUP (ORDER BY column_name) |
+---------------------------------------------------------------------------------------------+
| 78.5                                                                                        |
+---------------------------------------------------------------------------------------------+
> SELECT approx_percentile_cont_with_weight(weight_column, 0.90, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;
+--------------------------------------------------------------------------------------------------+
| approx_percentile_cont_with_weight(weight_column, 0.90, 100) WITHIN GROUP (ORDER BY column_name) |
+--------------------------------------------------------------------------------------------------+
| 78.5                                                                                             |
+--------------------------------------------------------------------------------------------------+

An alternative syntax is also supported:

sql
> SELECT approx_percentile_cont_with_weight(column_name, weight_column, 0.90) FROM table_name;
+--------------------------------------------------+
| approx_percentile_cont_with_weight(column_name, weight_column, 0.90) |
+--------------------------------------------------+
| 78.5                                             |
+--------------------------------------------------+