Help Center/ MapReduce Service/ Component Operation Guide (LTS)/ Using HetuEngine/ Common HetuEngine SQL Syntax/ HetuEngine SQL Functions and Operators/ Aggregate Functions

Updated on 2024-12-13 GMT+08:00

View PDF

Aggregate Functions

The aggregate function operates on a set of values to obtain a single value.

Except count(), count_if(), max_by(), min_by(), and approx_distinct(), other aggregate functions ignore null values and return null values when there is no input row or all values are null. For example, sum() returns null instead of 0, and avg() does not take null values during statistics collection. The coalesce function converts null to 0.

Clause of an Aggregate Function

Order by
Some aggregate functions may generate different results due to different sequences of input values. You can use the order by clause in the aggregate function to specify the sequence.
```
array_agg(x ORDER BY y DESC);
array_agg(x ORDER BYx,y,z);
```

Filter
You can use the filter keyword to filter out unnecessary rows by using the where condition expression during aggregation. All aggregate functions support this function.
```
aggregate_function(...) FILTER (WHERE <condition>)
```

Example:

--Create a table.
create table fruit (name varchar, price int);
--Insert data.
insert into fruit values ('peach',5),('apple',2);
--Sorting:
select array_agg (name order by price) from fruit;-- [apple, peach]
--Filtering:
select array_agg(name) filter (where price<10) from fruit;-- [peach, apple]

Common Aggregate Functions

An aggregate function usually applies to a specific field in a data set (table or view). The following parameter x is used to refer to the field.

arbitrary(x)
Description: Returns a non-null value of X. The return type is the same as X.
```
select arbitrary(price) from fruit;-- 5
```
array_agg(x)
Description: Returns an array of input x fields of the same type as the input field.
```
select array_agg(price) from fruit;-- [5,2]
```
avg(x)
Description: Returns the average of all input values in double type.
```
select avg(price) from fruit;-- 3.5
```

avg(time interval type)

Description: Returns the average length of all input intervals. The return type is interval.

select avg(last_login) from (values ('admin',interval '0 06:15:30' day to second),('user1',interval '0 07:15:30' day to second),('user2',interval '0 08:15:30' day to second)) as login_log(user,last_login);
-- 0 07:15:30.000  Assume that a log table records the time since the last login. The result indicates that the average login interval is 0 days, 7 hours, 15 minutes, and 30 seconds.

bool_and(boolean value)

Description: Returns true if each input value is true, otherwise returns false.

select bool_and(isfruit) from (values ('item1',true), ('item2',false),('item3',true)) as items(item,isfruit);--false
select bool_and(isfruit) from (values ('item1',true), ('item2',true),('item3',true)) as items(item,isfruit);-- true

bool_or(boolean value)

Description: Returns true if any of the input values is true, otherwise returns false.

select bool_or(isfruit) from (values ('item1',false), ('item2',false),('item3',false)) as items(item,isfruit);-- false
select bool_or(isfruit) from (values ('item1',true), ('item2',false),('item3',false)) as items(item,isfruit); --true

checksum(x)
Description: Returns the checksum of the input value, which is not affected by the input sequence. The result type is varbinary.
```
select checksum(price) from fruit; -- fb 28 f3 9a 9a fb bf 86
```
count(*)
Description: Returns the number of input records. The result type is bigint.
```
select count(*) from fruit; -- 2
```
count(x)
Description: Returns the number of records whose input field is not null. The result type is bigint.
```
select count(name) from fruit;-- 2
```
count_if(x)
Description: Returns the number of records whose input value is true. This function is similar to count(CASE WHEN x THEN 1 END) and is of the bigint type.
```
select count_if(price>7) from fruit;-- 0
```
every(boolean)
Description: Is an alias for bool_and().
geometric_mean(x)
Description: Returns the geometric mean of the input field value. The value is of the double type.
```
select geometric_mean(price) from fruit; -- 3.162277660168379
```

listagg(x, separator) → varchar

Description: Returns a string concatenated by input values separated by specified separators.

Syntax:

LISTAGG( expression [, separator] [ON OVERFLOW overflow_behaviour])    WITHIN GROUP (ORDER BY sort_item, [...])

If separator is not specified, the null character is used as the separator by default.

SELECT listagg(value, ',') WITHIN GROUP (ORDER BY value) csv_value FROM (VALUES 'a', 'c', 'b') t(value);
csv_value
-----------
'a,b,c'

When the return value of this function exceeds 1048576 bytes, you can use overflow_behaviour to specify the action to take in this case. By default, an error will be thrown.

SELECT listagg(value, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY value) csv_value FROM (VALUES 'a', 'b', 'c') t(value);

If the return value exceeds 1048576 bytes, truncate the extra non-null string and replace it with the string specified by TRUNCATE. WITH COUNT and WITHOUT COUNT indicate whether the return value contains the count.

SELECT LISTAGG(value, ',' ON OVERFLOW TRUNCATE '.....' WITH COUNT) WITHIN GROUP (ORDER BY value)FROM (VALUES 'a', 'b', 'c') t(value);

The listagg function can also be used for grouping strings. The following is an example:

SELECT id, LISTAGG(value, ',') WITHIN GROUP (ORDER BY o) csv_value FROM (VALUES     
     (100, 1, 'a'),
     (200, 3, 'c'), 
     (200, 2, 'b') ) t(id, o, value)
 GROUP BY id
 ORDER BY id;
 id   | csv_value
 -----+------------- 
 100  | a 
 200  | b,c

max_by(x, y)
Description: Returns the value of x associated with the maximum value of the y field in all input values.
```
select max_by(name,price) from fruit; -- peach    
```
max_by(x, y, n)
Description: Returns n x values sorted by y in descending order.
```
select max_by(name,price,2) from fruit;-- [peach, apple]
```
min_by(x,y)
Description: Returns the value of x associated with the minimum value of the y field in all input values.
```
select min_by(name,price) from fruit;-- apple
```
min_by(x, y, n)
Description: Returns n x values sorted by y in ascending order.
```
select min_by(name,price,2) from fruit;-- [apple, peach] 
```
max(x)
Description: Returns the maximum value of the input field x.
```
select max(price) from fruit;-- 5
```
max(x, n)
Description: Returns the first n values of the input field x in descending order.
```
select max(price,2) from fruit; -- [5, 2]
```
min(x)
Description: Returns the minimum value of the input field x.
```
select min(price) from fruit;-- 2
```
min(x, n)
Description: Returns the first n values of the input field x in ascending order.
```
select min(price,2) from fruit;-- [2, 5]
```
sum(T, x)
Description: Sums up the input field x. T is of the numeric type, for example, int, double, or interval day to second.
```
select sum(price) from fruit;-- 7
```

regr_avgx(T independent, T dependent) → double

Description: Calculates the average value of the independent variable (expr2) of the regression line. After the empty pair (expr1, expr2) is removed, the value is AVG(expr2).

create table sample_collection(id int,time_cost int,weight decimal(5,2));
 
insert into sample_collection values 
(1,5,86.38),
(2,10,281.17),
(3,15,89.91),
(4,20,17.5),
(5,25,88.76),
(6,30,83.94),
(7,35,44.26),
(8,40,17.4),
(9,45,5.6),
(10,50,145.68);

select regr_avgx(time_cost,weight) from sample_collection;
       _col0       
-------------------
 86.06000000000002 
(1 row)

regr_avgy(T independent, T dependent) → double
Description: Calculates the average value of the dependent variable (expr1) of the regression line. After the empty pair (expr1, expr2) is removed, the value is AVG(expr1).
```
select regr_avgy(time_cost,weight) from sample_collection;
 _col0 
-------
  27.5 
(1 row)
```

regr_count(T independent, T dependent) → double
Description: Returns the non-null logarithm used to fit a linear regression line.
```
select regr_count(time_cost,weight) from sample_collection;
 _col0 
-------
    10 
(1 row)
```

regr_r2(T independent, T dependent) → double

Description: Returns the coefficient of determination for regression.

select regr_r2(time_cost,weight) from sample_collection;
       _col0        
--------------------
 0.1446739237728169 
(1 row)

regr_sxx(T independent, T dependent) → double

Description: Returns the value of REGR_COUNT(expr1, expr2) * VAR_POP(expr2).

select regr_sxx(time_cost,weight) from sample_collection;
       _col0        
--------------------
 59284.886600000005 
(1 row)

regr_sxy(T independent, T dependent) → double
Description: Returns the value of REGR_COUNT(expr1, expr2) * COVAR_POP(expr1, expr2).
```
select regr_sxy(time_cost,weight) from sample_collection;
  _col0   
----------
 -4205.95 
(1 row)
```

regr_syy(T independent, T dependent) → double
Description: Returns the value of REGR_COUNT(expr1, expr2) * VAR_POP(expr1).
```
select regr_syy(time_cost,weight) from sample_collection;
 _col0  
--------
 2062.5 
(1 row)
```

Bitwise Aggregate Function

bitwise_and_agg(x)
Description: Uses two's complement to represent the bitwise AND operation on the input field x. The return type is bigint.
```
select bitwise_and_agg(x) from (values (31),(32)) as t(x);-- 0
```

bitwise_or_agg(x)
Description: Uses two's complement to represent the bitwise OR operation on the input field x. The return type is bigint.
```
select bitwise_or_agg(x) from (values (31),(32)) as t(x);-- 63
```

Map Aggregate Function

histogram(x) -> map(K, bigint)

Description: Returns a map containing the number of occurrences of all input field x.

select histogram(x),histogram(y) from (values (15,17),(15,18),(15,19),(15,20)) as t(x,y);-- {15=4},{17=1, 18=1, 19=1, 20=1}

map_agg(key, value) -> map(K, V)
Description: Returns a map whose input field key and input field value are key-value pairs.
```
select map_agg(name,price) from fruit;-- {apple=2, peach=5}
```

map_union(x(K, V)) -> map(K, V)

Description: Returns the union of all input maps. If a key appears multiple times in the input set, the corresponding value is any value corresponding to the key in the input set.

select map_union(x) from (values (map(array['banana'],array[10.0])),(map(array['apple'],array[7.0]))) as t(x);-- {banana=10.0, apple=7.0}
select map_union(x) from (values (map(array['banana'],array[10.0])),(map(array['banana'],array[7.0]))) as t(x);-- {banana=10.0}

multimap_agg(key, value) -> map(K, array(V))
Description: Returns a map consisting of input key-value pairs. Each key can correspond to multiple values.
```
select multimap_agg(key, value) from (values ('apple',7),('apple',8),('apple',8),('lemon',5) ) as t(key,value); - {apple=[7, 8, 8], lemon=[5]} 
```

Approximation Aggregate Function

In actual situations, when we collect statistics on a large amount of data, sometimes we only care about an approximate value instead of a specific value. For example, when we collect statistics on the sales volume of a product, the approximation aggregate function is useful because it uses less memory and CPU resources, in this way, you can obtain data results without any problems, such as overflowing to disks or CPU peaks. This is useful for billions of rows of data calculations.

approx_median(x) → bigint
Description: Calculates the median value of a distribution of values.
```
select approx_median(price) from fruit; -- 10.0
```
approx_distinct(x) → bigint
Description: The return type of this function is bigint, which provides an approximate count of count(distinct x). If all inputs are null, 0 is returned.

The errors of all possible values of this function relative to the correct values follow an approximate normal distribution with a standard deviation of 2.3%. It does not guarantee the upper limit of any specific input set error.
```
select approx_distinct(price) from fruit; -- 2
```
approx_distinct(x, e)→ bigint
Description: The return type of this function is bigint, which provides an approximate count of count(distinct x). If all inputs are null, 0 is returned.

The errors of all possible values of this function relative to the correct values follow an approximate normal distribution with a standard deviation of less than e. It does not guarantee the upper limit of any specific input set error.

In the current implementation of the function, the value range of e is [0.0040625, 0.26000].
```
select approx_distinct(weight,0.0040625) from sample_collection; -- 10 
select approx_distinct(weight,0.26) from sample_collection; -- 8
```
approx_most_frequent(buckets, value, capacity) → map<[same as value], bigint>
Description: Approximately collects statistics on the top buckets elements that appear most frequently. Less memory will be used for approximate estimation of high-frequency values. A larger capacity value indicates a more accurate result, which consumes more memory. The return value of this function is a map that consists of key-value pairs of high-frequency values and their frequencies.
```
SELECT approx_most_frequent(3, x, 15) FROM (values 'A', 'B', 'A', 'C', 'A', 'B', 'C', 'D', 'E') t(x); --  {A=3, B=2, C=2}
SELECT approx_most_frequent(3, x, 100) FROM (values 1, 2, 1, 3, 1, 2, 3, 4, 5) t(x); -- {1=3, 2=2, 3=2}
```
The commonly used quantiles are binary, quaternary, decimal, and percentile. This means that the input set is evenly divided into equal parts, and then the value at the corresponding position is found. For example, approx_percentile(x, 0.5) is used to find a value that is about 50% of the x value, that is, a 2-quantile value.
approx_percentile(x, percentage)→[same as x]
Description: Returns the approximate percentile based on the given percentage. The percentage value must be a constant between 0 and 1 for all input rows.
```
select approx_percentile(x, 0.5) from ( values (2),(3),(7),(8),(9)) as t(x); --7
```
approx_percentile(x, percentages)→ array<[same as x]>
Description: Returns an approximate percentile of the x value of all input fields in a given percentage array. Each value in this percentage array must be a constant between 0 and 1 for all input rows.
```
select approx_percentile(x, array[0.1,0.2,0.3,0.5]) from ( values (2),(3),(7),(8),(9)) as t(x); --[2, 3, 3, 7]
```
approx_percentile(x, w, percentage)→array<[same as x]>
Description: Returns the approximate percentile of all x input values by percentage. The weight of each item is w and must be a positive number. Set a valid percentile for x. The value of percentage must range from 0 to 1, and all input rows must be constants.
```
select approx_percentile(x, 5,array[0.1,0.2,0.3,0.5]) from ( values (2),(3),(7),(8),(9)) as t(x); --[2, 3, 3, 7]
```
approx_percentile(x, w, percentage, accuracy) →[same as x]
Description: Returns the approximate percentile of all x input values by percentage. The weight of each item is w and must be a positive number. Set a valid percentile for x. The value of percentage must range from 0 to 1, and all input rows must be constants. The maximum progress error of the approximate value is specified by accuracy.
```
select approx_percentile(x, 5,0.5,0.97) from ( values (2),(3),(7),(8),(9)) as t(x); --7
```
approx_percentile(x, w, percentages)→[same as x]
Description: Returns an approximate percentile of all x input values based on each percentage value in the percentage array. The weight of each item is w and must be a positive number. Set a valid percentile for x. Each element value in the percentage array must range from 0 to 1, and all input rows must be constants.
```
select approx_percentile(x,5, array[0.1,0.2,0.3,0.5]) from ( values (2),(3),(7),(8),(9)) as t(x);  -- [2, 3, 3, 7]
```
The preceding approx_percentile function also supports the percentile_approx function of the same parameter set.

numeric_histogram(buckets, value, weight)

Description: Calculates the approximate histogram for all values based on the number of buckets. The width of each item uses weight. This algorithm is based on:

Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872.

The value of buckets must be bigint. The values of value and weight must be numbers.

select numeric_histogram(20,x,4) from ( values (2),(3),(7),(8),(9)) as t(x);
_col0                     
-----------------------------------------------
 {2.0=4.0, 3.0=4.0, 7.0=4.0, 8.0=4.0, 9.0=4.0} 
(1 row)

numeric_histogram(buckets, value)

Description: Compared with numeric_histogram(buckets, value,weight), this function sets weight to 1.

select numeric_histogram(20,x) from ( values (2),(3),(7),(8),(9)) as t(x);
_col0                     
-----------------------------------------------
 {2.0=1.0, 3.0=1.0, 7.0=1.0, 8.0=1.0, 9.0=1.0} 
(1 row)

Statistical Aggregate Function

corr(y,x)
Description: Returns the correlation coefficient of the input value.
```
select corr(y,x) from (values (1,5),(2,6),(3,7),(4,8)) as t(x,y);-- 1.0
```
covar_pop(y, x)
Description: Returns the population covariance of the input value.
```
select covar_pop(y,x) from (values (1,5),(2,6),(3,7),(4,8)) as t(x,y); --1.25
```

covar_samp(y, x)

Description: Returns the sample covariance of the input value.

select covar_samp(y,x) from (values (1,5),(2,6),(3,7),(4,8)) as t(x,y);-- 1.6666666

kurtosis(x)
Description: Kurtosis, also called peak-state coefficient, indicates the number of peak values at the average value of the probability density distribution curve, that is, the statistical value that describes the steepness of all value distribution forms in the entire system. Intuitively, kurtosis reflects the sharpness of the peak. This statistic needs to be compared with the normal distribution.

The kurtosis is defined as the 4th standardized central moment of the sample.

The kurtosis of a random variable is the ratio of the fourth-order central moment of the random variable to the square of the variance.

The calculation formula is as follows:
```
select kurtosis(x) from (values (1),(2),(3),(4)) as t(x); -- -1.1999999999999993
```
regr_intercept(y, x)
Description: Returns the linear regression intercept of the input value. y is a subordinate value. x is an independent value.
```
select regr_intercept(y,x) from (values (1,5),(2,6),(3,7),(4,8)) as t(x,y);-- 4.0
```
regr_slope(y, x)
Description: Returns the linear regression slope of the input value. y is a subordinate value. x is an independent value.
```
select regr_slope(y,x) from (values (1,5),(2,6),(3,7),(4,8)) as t(x,y);--  1.0
```
skewness(x)
Description: Returns the skew degree of all input values.
```
select skewness(x) from (values (1),(2),(3),(4)) as t(x); -- 0.0
```
stddev(x)
Description: Alias of stedev_samp()
stddev_pop(x)
Description: Returns the population standard deviation of all input values.
```
select stddev_pop(x) from (values (1),(2),(3),(4)) as t(x);--  1.118033988749895
```
stddev_samp(x)
Description: Returns the sample standard deviation of all input values.
```
select stddev_samp(x) from (values (1),(2),(3),(4)) as t(x);--  1.2909944487358056
```
variance(x)
Description: Alias of var_samp()
var_pop(x)
Description: Returns the population variance of all input values.
```
select var_pop(x) from (values (1),(2),(3),(4)) as t(x);-- 1.25
```

var_samp(x)

Description: Returns the sample variance of all input values.

select var_samp(x) from (values (1),(2),(3),(4)) as t(x);--  1.6666666666666667

Lambda Aggregation Function

reduce_agg(inputValue T, initialState S, inputFunction(S, T, S), combineFunction(S, S, S))

inputFunction is called for each non-null input value. In addition to obtaining the input value, inputFunction also obtains the current status, which is initialState initially, and then returns the new status. CombineFunction is invoked to combine the two states into a new state. Return the final status:

SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b)
FROM (
    VALUES
        (1, 3),
        (1, 4),
        (1, 5),
        (2, 6),
        (2, 7)
) AS t(id, value)
GROUP BY id;
-- (1, 12)
-- (2, 13)

SELECT id, reduce_agg(value, 1, (a, b) -> a * b, (a, b) -> a * b)
FROM (
    VALUES
        (1, 3),
        (1, 4),
        (1, 5),
        (2, 6),
        (2, 7)
) AS t(id, value)
GROUP BY id;
-- (1, 60)
-- (2, 42)

The status value must be Boolean, integer, floating-point, date, time, or interval.

Parent topic: HetuEngine SQL Functions and Operators

Previous topic: Date and Time Functions and Operators

Next topic: Window Functions