Aggregate Functions_Functions and Operators_SQL Syntax Reference_Developer Guide

sum(expression)

Description: Sum of expressions across all input rows.

Return type:

Generally, the input data type is the same as the output data type. However, type conversion occurs in the following cases:

For SMALLINT or INT inputs, the output type is BIGINT.
For BIGINT input, the output type is NUMBER.
For floating-point input, the output type is DOUBLE PRECISION.

Example:

    
         SELECT SUM(ss_ext_tax) FROM tpcds.STORE_SALES;
  sum      
--------------
 213267594.69
(1 row)

max(expression)

Description: Specifies the maximum value of expression across all input values.

Parameter type: any array, numeric, string, or date/time type.

Return type: same as the parameter type.

Example:

    
         SELECT MAX(inv_quantity_on_hand) FROM tpcds.inventory;
   max   
---------
 1000000
(1 row)

min(expression)

Description: Specifies the minimum value of expression across all input values.

Parameter type: any array, numeric, string, or date/time type.

Return type: same as the parameter type.

Example:

    
         SELECT MIN(inv_quantity_on_hand) FROM tpcds.inventory;
 min 
-----
   0
(1 row)

avg(expression)

Description: Average (arithmetic mean) of all input values.

Return type:

For any integer input, the result is of the NUMBER type.
For any floating-point input, the result is of the DOUBLE PRECISION type.
Others: same as the input data type.

Example:

    
         SELECT AVG(inv_quantity_on_hand) FROM tpcds.inventory;
         avg          
----------------------
 500.0387129084044604
(1 row)

median(expression)

Description: Median of all input values. Currently, only the numeric and interval types are supported. Null values are not used for calculation.

Return type:

For any integer input, the result is of the NUMERIC type. Otherwise, it is the same as the input data type.
In Teradata compatibility mode, if the input is an integer, the returned data precision is only integer bits.

Example:

    
         SELECT MEDIAN(inv_quantity_on_hand) FROM tpcds.inventory;
 median 
--------
    500
(1 row)

percentile_cont(const) within group(order by expression)

Description: Returns a value corresponding to the specified quantile in the target column sort, and inserts values between adjacent input items if necessary. Null values are not used for calculation.

Input: const is a number ranging from 0 to 1. Currently, only numeric and interval expressions are supported.

Return type:

For any integer input, the result is of the NUMERIC type. Otherwise, it is the same as the input data type.
In Teradata compatibility mode, if the input is an integer, the returned data precision is only integer bits.

Example:

    
         SELECT percentile_cont(0.3) within group(order by x) FROM (SELECT generate_series(1,5) AS x) AS t;
percentile_cont
-----------------
2.2
(1 row)
SELECT percentile_cont(0.3) within group(order by x desc) FROM (SELECT generate_series(1,5) AS x) AS t;
percentile_cont
-----------------
3.8
(1 row)

percentile_disc(const) within group(order by expression)

Description: Returns the first input value whose position in the sort is equal to or greater than the specified score.

Input: const is a number ranging from 0 to 1. Currently, only numeric and interval expressions are supported. Null values are not used for calculation.

Return type: For any integer input, the result is of the NUMERIC type. Otherwise, it is the same as the input data type.

Example:

    
         SELECT percentile_disc(0.3) within group(order by x) FROM (SELECT generate_series(1,5) AS x) AS t;
percentile_disc
-----------------
2
(1 row)
SELECT percentile_disc(0.3) within group(order by x desc) FROM (SELECT generate_series(1,5) AS x) AS t;
percentile_disc
-----------------
4
(1 row)

count(expression)

Description: Returns the number of rows in the table where the expression is not NULL.

Return type: bigint.

Example:

    
         SELECT COUNT(inv_quantity_on_hand) FROM tpcds.inventory;
  count   
----------
 11158087
(1 row)

count(*)

Description: Returns the total number of records in the table.

Return type: bigint.

Example:

    
         SELECT COUNT(*) FROM tpcds.inventory;
  count   
----------
 11745000
(1 row)

array_agg(expression)

Description: Concatenates all input values (including nulls) into an array. Function inputs do not support array formats.

Return type: Array of the parameter type.

Example:

Create a table named employeeinfo and insert data:

    
         CREATE EXTERNAL TABLE employeeinfo (empno smallint, ename varchar(20), job varchar(20), hiredate date,deptno smallint) store AS orc;
INSERT INTO employeeinfo VALUES (7155, 'JACK', 'SALESMAN', '2018-12-01', 30);
INSERT INTO employeeinfo VALUES (7003, 'TOM', 'FINANCE', '2016-06-15', 20);
INSERT INTO employeeinfo VALUES (7357, 'MAX', 'SALESMAN', '2020-10-01', 30);

SELECT * FROM employeeinfo;
 empno | ename |   job    |      hiredate       | deptno
-------+-------+----------+---------------------+--------
  7155 | JACK  | SALESMAN | 2018-12-01 00:00:00 |     30
  7357 | MAX   | SALESMAN | 2020-10-01 00:00:00 |     30
  7003 | TOM   | FINANCE  | 2016-06-15 00:00:00 |     20
(3 rows)

Query the names of all employees in the department with ID 30:

    
         SELECT array_agg(ename) FROM employeeinfo where deptno = 30;
 array_agg
------------
 {JACK,MAX}
(1 row)

Query all employees belonging to the same department:

    
         SELECT deptno, array_agg(ename) FROM employeeinfo group by deptno;
 deptno | array_agg
--------+------------
     30 | {JACK,MAX}
     20 | {TOM}
(2 rows)

SELECT distinct array_agg(ename) OVER (PARTITION BY deptno) FROM employeeinfo;
 array_agg
------------
 {TOM}
 {JACK,MAX}
(2 rows)

Query all unique department IDs:

    
         SELECT array_agg(distinct deptno) FROM employeeinfo group by deptno;
 array_agg
-----------
 {20}
 {30}
(2 rows)

Query all unique department IDs sorted in descending order:

    
         SELECT array_agg(distinct deptno order by deptno desc) FROM employeeinfo;
 array_agg
-----------
 {30,20}
(1 row)

string_agg(expression, delimiter)

Description: Joins input values into a single string separated by the specified delimiter.

Return type: same as the parameter type.

Example:

Based on the created table employeeinfo, query all employees belonging to the same department:

    
         SELECT deptno, string_agg(ename,',') FROM employeeinfo group by deptno;
 deptno | string_agg
--------+------------
     30 | JACK,MAX
     20 | TOM
(2 rows)

Query employees with employee IDs less than 7156:

    
         SELECT string_agg(ename,',') FROM employeeinfo where empno < 7156;
 string_agg
------------
 TOM,JACK
(1 row)

listagg(expression [, delimiter]) WITHIN GROUP(ORDER BY order-list)

Description: Orders the aggregated column data according to the sorting method specified by WITHIN GROUP and joins it into a string using the specified delimiter.

expression: Mandatory. Specifies the aggregation column name or a valid column-based expression. DISTINCT keywords and VARIADIC parameters are not supported.
delimiter: Optional. Specifies the delimiter, which can be a string constant or a deterministic expression based on grouping columns. It defaults to an empty string if omitted.
order-list: Mandatory. Specifies the sorting method within the group.

Return type: text.

listagg is a column-to-row aggregation function compatible with Oracle 11g2 and can include an OVER clause to act as a window function. To prevent ambiguity with the ORDER BY in the function's own WITHIN GROUP clause, the OVER clause does not support ORDER BY window sorting or framing when listagg is used as a window function.

Example:

The aggregation column is of text charset type:

    
         SELECT deptno, listagg(ename, ',') WITHIN GROUP(ORDER BY ename) AS employees FROM emp GROUP BY deptno;
 deptno |              employees               
--------+--------------------------------------
     10 | CLARK,KING,MILLER
     20 | ADAMS,FORD,JONES,SCOTT,SMITH
     30 | ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
(3 rows)

The aggregation column is of integer type:

    
         SELECT deptno, listagg(mgrno, ',') WITHIN GROUP(ORDER BY mgrno NULLS FIRST) AS mgrnos FROM emp GROUP BY deptno;
 deptno |            mgrnos             
--------+-------------------------------
     10 | 7782,7839
     20 | 7566,7566,7788,7839,7902
     30 | 7698,7698,7698,7698,7698,7839
(3 rows)

The aggregation column is of floating-point type:

    
         SELECT job, listagg(bonus, '($); ') WITHIN GROUP(ORDER BY bonus DESC) || '($)' AS bonus FROM emp GROUP BY job;
    job     |                      bonus                      
------------+-------------------------------------------------
 CLERK      | 10234.21($); 2000.80($); 1100.00($); 1000.22($)
 PRESIDENT  | 23011.88($)
 ANALYST    | 2002.12($); 1001.01($)
 MANAGER    | 10000.01($); 2399.50($); 999.10($)
 SALESMAN   | 1000.01($); 899.00($); 99.99($); 9.00($)
(5 rows)

The aggregation column is of date/time type:

    
         SELECT deptno, listagg(hiredate, ', ') WITHIN GROUP(ORDER BY hiredate DESC) AS hiredates FROM emp GROUP BY deptno;
 deptno |                                                          hiredates                                                           
--------+------------------------------------------------------------------------------------------------------------------------------
     10 | 1982-01-23 00:00:00, 1981-11-17 00:00:00, 1981-06-09 00:00:00
     20 | 2001-04-02 00:00:00, 1999-12-17 00:00:00, 1987-05-23 00:00:00, 1987-04-19 00:00:00, 1981-12-03 00:00:00
     30 | 2015-02-20 00:00:00, 2010-02-22 00:00:00, 1997-09-28 00:00:00, 1981-12-03 00:00:00, 1981-09-08 00:00:00, 1981-05-01 00:00:00
(3 rows)

The aggregation column is of interval type:

    
         SELECT deptno, listagg(vacationTime, '; ') WITHIN GROUP(ORDER BY vacationTime DESC) AS vacationTime FROM emp GROUP BY deptno;
 deptno |                                    vacationtime                                    
--------+------------------------------------------------------------------------------------
     10 | 1 year 30 days; 40 days; 10 days
     20 | 70 days; 36 days; 9 days; 5 days
     30 | 1 year 1 mon; 2 mons 10 days; 30 days; 12 days 12:00:00; 4 days 06:00:00; 24:00:00
(3 rows)

By default, the delimiter is empty:

    
         SELECT deptno, listagg(job) WITHIN GROUP(ORDER BY job) AS jobs FROM emp GROUP BY deptno;
 deptno |                     jobs                     
--------+----------------------------------------------
     10 | CLERKMANAGERPRESIDENT
     20 | ANALYSTANALYSTCLERKCLERKMANAGER
     30 | CLERKMANAGERSALESMANSALESMANSALESMANSALESMAN
(3 rows)

When listagg acts as a window function, the OVER clause does not support ORDER BY window sorting. The listagg column indicates the ordered aggregation of the respective group:

    
     
       
       
         SELECT deptno, mgrno, bonus, listagg(ename,'; ') WITHIN GROUP(ORDER BY hiredate) OVER(PARTITION BY deptno) AS employees FROM emp;
 deptno | mgrno |  bonus   |                 employees                 
--------+-------+----------+-------------------------------------------
|  7839 | 10000.01 | CLARK; KING; MILLER
|       | 23011.88 | CLARK; KING; MILLER
|  7782 | 10234.21 | CLARK; KING; MILLER
|  7566 |  2002.12 | FORD; SCOTT; ADAMS; SMITH; JONES
|  7566 |  1001.01 | FORD; SCOTT; ADAMS; SMITH; JONES
|  7788 |  1100.00 | FORD; SCOTT; ADAMS; SMITH; JONES
|  7902 |  2000.80 | FORD; SCOTT; ADAMS; SMITH; JONES
|  7839 |   999.10 | FORD; SCOTT; ADAMS; SMITH; JONES
|  7839 |  2399.50 | BLAKE; TURNER; JAMES; MARTIN; WARD; ALLEN
|  7698 |     9.00 | BLAKE; TURNER; JAMES; MARTIN; WARD; ALLEN
|  7698 |  1000.22 | BLAKE; TURNER; JAMES; MARTIN; WARD; ALLEN
|  7698 |    99.99 | BLAKE; TURNER; JAMES; MARTIN; WARD; ALLEN
|  7698 |  1000.01 | BLAKE; TURNER; JAMES; MARTIN; WARD; ALLEN
|  7698 |   899.00 | BLAKE; TURNER; JAMES; MARTIN; WARD; ALLEN
(14 rows)

        

      

    
   

group_concat(expression [ORDER BY {col_name | expr} [ASC | DESC]] [SEPARATOR str_val])

Description: Joins column data into a string using the specified str_val delimiter, ordered according to the ORDER BY clause specification. The ORDER BY clause must specify the ordering method, and ORDER BY 1 syntax is not supported.

expression: Mandatory. Specifies the column name or a valid column-based expression. DISTINCT keywords and VARIADIC parameters are not supported.
str_val: Optional. Specifies the delimiter, which can be a string constant or a deterministic expression based on grouping columns. The default value indicates that commas (,) are used as delimiters.

Return type: text.

The group_concat function is only supported in version 8.1.2 or later.

Example:

The default delimiter is a comma (,).

    
         SELECT group_concat(sname) FROM group_concat_test;
               group_concat
------------------------------------------
 ADAMS,FORD,JONES,KING,MILLER,SCOTT,SMITH
(1 row)

The group_concat function supports user-defined delimiters:

    
         SELECT group_concat(sname separator ';') from group_concat_test;
               group_concat
------------------------------------------
 ADAMS;FORD;JONES;KING;MILLER;SCOTT;SMITH
(1 row)

The group_concat function supports the ORDER BY clause for ordered concatenation of column data:

    
         SELECT group_concat(sname order by snumber separator ';') FROM group_concat_test;
               group_concat
------------------------------------------
 MILLER;FORD;SCOTT;SMITH;KING;JONES;ADAMS
(1 row)

covar_pop(Y, X)

Description: Population covariance.

Return type: double precision.

Example:

    
          SELECT COVAR_POP(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
    covar_pop     
------------------
 829.749627587403
(1 row)

covar_samp(Y, X)

Description: Sample covariance.

Return type: double precision.

Example:

    
         SELECT COVAR_SAMP(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
    covar_samp    
------------------
 830.052235037289
(1 row)

stddev_pop(expression)

Description: Population standard deviation.

Return type: double precision for floating-point inputs, and numeric for others.

Example:

    
         SELECT STDDEV_POP(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
    stddev_pop    
------------------
 289.224294957556
(1 row)

stddev_samp(expression)

Description: Sample standard deviation.

Return type: double precision for floating-point inputs, and numeric for others.

Example:

    
         SELECT STDDEV_SAMP(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
   stddev_samp    
------------------
 289.224359757315
(1 row)

var_pop(expression)

Description: Population variance (square of population standard deviation).

Return type: double precision for floating-point inputs, and numeric for others.

Example:

    
         SELECT VAR_POP(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
      var_pop       
--------------------
 83650.692793695475
(1 row)

var_samp(expression)

Description: Sample variance (square of sample standard deviation).

Return type: double precision for floating-point inputs, and numeric for others.

Example:

    
         SELECT VAR_SAMP(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
      var_samp      
--------------------
 83650.730277028768
(1 row)

bit_and(expression)

Description: Bitwise AND of all non-NULL input values. NULL if all inputs are NULL.

Return type: same as the parameter type.

Example:

    
         SELECT BIT_AND(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
 bit_and 
---------
       0
(1 row)

bit_or(expression)

Description: Bitwise OR of all non-NULL input values. NULL if all inputs are NULL.

Return type: same as the parameter type.

Example:

    
         SELECT BIT_OR(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
 bit_or 
--------
   1023
(1 row)

bool_and(expression)

Description: The value is true if all input values are true. Otherwise, the value is false.

Return type: Boolean.

Example:

    
         SELECT bool_and(100 <2500);
 bool_and
----------
 t
(1 row)

bool_or(expression)

Description: The value is true if at least one input value is true. Otherwise, the value is false.

Return type: Boolean.

Example:

    
         SELECT bool_or(100 <2500);
 bool_or
----------
 t
(1 row)

corr(Y, X)

Description: Correlation coefficient.

Return type: double precision.

Example:

    
         SELECT CORR(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
       corr        
-------------------
 0.0381383624904186
(1 row)

every(expression)

Description: Equivalent to bool_and.

Return type: Boolean.

Example:

    
         SELECT every(100 <2500);
 every
-------
 t
(1 row)

rank(expression)

Description: The tuples in different groups are sorted non-consecutively by expression.

Return type: bigint.

Example:

    
     
       
       
         SELECT d_moy, d_fy_week_seq, rank() OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2;
   d_moy | d_fy_week_seq | rank 
-------+---------------+------
|             1 |    1
|             1 |    1
|             1 |    1
|             1 |    1
|             1 |    1
|             1 |    1
|             1 |    1
|             2 |    8
|             2 |    8
|             2 |    8
|             2 |    8
|             2 |    8
|             2 |    8
|             2 |    8
|             3 |   15
|             3 |   15
|             3 |   15
|             3 |   15
|             3 |   15
|             3 |   15
|             3 |   15
|             4 |   22
|             4 |   22
|             4 |   22
|             4 |   22
|             4 |   22
|             4 |   22
|             4 |   22
|             5 |   29
|             5 |   29
|             5 |    1
|             5 |    1
|             5 |    1
|             5 |    1
|             5 |    1
|             6 |    6
|             6 |    6
|             6 |    6
|             6 |    6
|             6 |    6
|             6 |    6
|             6 |    6
(42 rows)

        

      

    
   

regr_avgx(Y, X)

Description: Mean of independent variables (sum(X)/N).

Return type: double precision.

Example:

    
         SELECT REGR_AVGX(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
    regr_avgx     
------------------
 578.606576740795
(1 row)

regr_avgy(Y, X)

Description: Mean of dependent variables (sum(Y)/N).

Return type: double precision.

Example:

    
         SELECT REGR_AVGY(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
    regr_avgy     
------------------
 50.0136711629602
(1 row)

regr_count(Y, X)

Description: Number of input rows where both expressions are NOT NULL.

Return type: bigint.

Example:

    
         SELECT REGR_COUNT(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
 regr_count 
------------
       2743
(1 row)

regr_intercept(Y, X)

Description: Fits all input points (X, Y) into a linear equation using least squares and return the Y-intercept.

Return type: double precision.

Example:

    
         SELECT REGR_INTERCEPT(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
  regr_intercept  
------------------
 49.2040847848607
(1 row)

regr_r2(Y, X)

Description: Square of the correlation coefficient.

Return type: double precision.

Example:

    
         SELECT REGR_R2(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
      regr_r2       
--------------------
 0.00145453469345058
(1 row)

regr_slope(Y, X)

Description: Fits all input points (X, Y) into a linear equation using the least square method and returns the slope of the straight line.

Return type: double precision.

Example:

    
         SELECT REGR_SLOPE(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
     regr_slope     
--------------------
 0.00139920009665259
(1 row)

regr_sxx(Y, X)

Description: sum(X^2) - sum(X)^2/N (sum of squares of the independent variable).

Return type: double precision.

Example:

    
         SELECT REGR_SXX(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
     regr_sxx     
------------------
 1626645991.46135
(1 row)

regr_sxy(Y, X)

Description: sum(X*Y) - sum(X) * sum(Y)/N (product of sums of the independent and dependent variables).

Return type: double precision.

Example:

    
         SELECT REGR_SXY(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
     regr_sxy     
------------------
 2276003.22847225
(1 row)

regr_syy(Y, X)

Description: sum(Y^2) - sum(Y)^2/N (sum of squares of the dependent variable).

Return type: double precision.

Example:

    
         SELECT REGR_SYY(sr_fee, sr_net_loss) FROM tpcds.store_returns WHERE sr_customer_sk < 1000;
    regr_syy     
-----------------
 2189417.6547314
(1 row)

stddev(expression)

Description: Alias of stddev_samp.

Return type: double precision for floating-point inputs, and numeric for others.

Example:

    
         SELECT STDDEV(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
      stddev      
------------------
 289.224359757315
(1 row)

variance(expexpression,ression)

Description: Alias of var_samp.

Return type: double precision for floating-point inputs, and numeric for others.

Example:

    
         SELECT VARIANCE(inv_quantity_on_hand) FROM tpcds.inventory WHERE inv_warehouse_sk = 1;
      variance      
--------------------
 83650.730277028768
(1 row)

checksum(expression)

Description: Returns the CHECKSUM value of all input values. This function can be used to verify if the data in the tables remains the same before and after backup, restoration, or data migration operations for the DataArts Fabric SQL database (other databases besides DataArts Fabric SQL are not supported). You need to manually execute SQL commands to obtain the results before and after these operations and compare them to determine if the data in the tables has changed.

For large tables, the CHECKSUM function might take a significant amount of time.
If the CHECKSUM values of two tables differ, it indicates that their contents are different. However, due to potential hash collisions in the CHECKSUM function, there is a small chance that two tables with different content could have the same CHECKSUM value. The same applies to checksums computed for individual columns.
When computing CHECKSUM values for timestamp, timestamptz, and smalldatetime types, ensure consistent timezone settings.

To compute the CHECKSUM value of a column that can be automatically converted to TEXT, use the column name as the expression.
To compute the CHECKSUM value of a column that cannot be automatically converted to TEXT, use <column-name>::TEXT as the expression.
To compute the CHECKSUM value of all columns in a table, use <table-name>::TEXT as the expression.

Types that can be automatically converted to TEXT include: char, name, int8, int2, int1, int4, raw, pg_node_tree, float4, float8, bpchar, varchar, nvarchar2, date, timestamp, timestamptz, numeric, smalldatetime. Other types require explicit conversion to TEXT.

Return type: numeric.

Example:

CHECKSUM value of a column that can be automatically converted to TEXT:

    
         SELECT CHECKSUM(inv_quantity_on_hand) FROM tpcds.inventory;
     checksum      
-------------------
 24417258945265247
(1 row)

CHECKSUM value of a column that cannot be automatically converted to TEXT (note that the CHECKSUM parameter is <column-name>::TEXT):

    
         SELECT CHECKSUM(inv_quantity_on_hand::TEXT) FROM tpcds.inventory;
     checksum      
-------------------
 24417258945265247
(1 row)

CHECKSUM value of all columns in a table. Note that the CHECKSUM parameter is <table-name>::TEXT, and the table name does not include the schema prefix.

    
         SELECT CHECKSUM(inventory::TEXT) FROM tpcds.inventory;                    
     checksum      
-------------------
 25223696246875800
(1 row)

approx_count_distinct(col_name)

Description: Estimates the cardinality (the number of rows in a column after deduplication) using the HyperLogLog++ (HLL++) algorithm. This function is only supported by clusters of version 8.3.0 or later.

Input parameter description: col_name indicates the column whose cardinality needs to be estimated.

You can adjust the error rate using the GUC parameter approx_count_distinct_precision.

The value range is [10, 20]. The default value is 17. The theoretical error rate is 3‰.

This parameter indicates the number of buckets in the HyperLogLog++ (HLL++) algorithm. A larger value indicates a larger number of buckets and a smaller theoretical error rate.
A larger value of this parameter results in more computing time and memory resource overhead, but is still far less than the overhead of the accurate count distinct statement. You are advised to use this function to replace count distinct when the estimated cardinality is large.

Example:

    
         CREATE EXTERNAL TABLE employeeinfo (empno smallint, ename varchar(20), job varchar(20), hiredate date,deptno smallint) store AS orc;
INSERT INTO employeeinfo VALUES (7155, 'JACK', 'SALESMAN', '2018-12-01', 30);
INSERT INTO employeeinfo VALUES (7003, 'TOM', 'FINANCE', '2016-06-15', 20);
INSERT INTO employeeinfo VALUES (7357, 'MAX', 'SALESMAN', '2020-10-01', 30);

SELECT APPROX_COUNT_DISTINCT(empno) from employeeinfo;
 approx_count_distinct
-----------------------
                     3
(1 row)

SELECT COUNT(DISTINCT empno) FROM employeeinfo GROUP BY ename;
 count
-------
     1
     1
     1
(3 rows)

UNIQ(col_name)

Description: Aggregate function for calculating the number of non-duplicate values. Similar to COUNT DISTINCT, this function calculates the number of rows in a column after deduplication and returns a deduplicated value. This function is only supported by clusters of version 8.3.0 or later.

Input parameter description: col_name indicates the column whose number of rows after deduplication needs to be calculated. The SMALLINT, INTEGER, BIGINT, REAL, DOUBLE PRECISION, TEXT, VARCHAR, TIMESTAMP, TIMESTAMPTZ, DATE, TIMETZ and UUID types are supported.

When the UNIQ function is used, the SQL statement must contain GROUP BY. To achieve better performance, the GROUP BY fields must be evenly distributed.
When the number of columns to be deduplicated in the SQL statement is greater than or equal to 3, you are advised to use the UNIQ function for deduplication to achieve a better effect than COUNT DISTINCT.
Generally, the memory usage of the UNIQ function is lower than that of the COUNT DISTINCT function. If the memory usage exceeds the threshold when the COUNT DISTINCT function is used, you can use the UNIQ function instead.

Example:

    
         CREATE EXTERNAL TABLE employeeinfo (empno smallint, ename varchar(20), job varchar(20), hiredate date,deptno smallint) store AS orc;
INSERT INTO employeeinfo VALUES (7155, 'JACK', 'SALESMAN', '2018-12-01', 30);
INSERT INTO employeeinfo VALUES (7003, 'TOM', 'FINANCE', '2016-06-15', 20);
INSERT INTO employeeinfo VALUES (7357, 'MAX', 'SALESMAN', '2020-10-01', 30);

SELECT UNIQ(deptno) FROM employeeinfo GROUP BY ename;
 uniq
------
    1
    1
    1
(3 rows)

Aggregate Functions

sum(expression)

max(expression)

min(expression)

avg(expression)

median(expression)

percentile_cont(const) within group(order by expression)

percentile_disc(const) within group(order by expression)

count(expression)

count(*)

array_agg(expression)

string_agg(expression, delimiter)

listagg(expression [, delimiter]) WITHIN GROUP(ORDER BY order-list)

group_concat(expression [ORDER BY {col_name | expr} [ASC | DESC]] [SEPARATOR str_val])

covar_pop(Y, X)

covar_samp(Y, X)

stddev_pop(expression)

stddev_samp(expression)

var_pop(expression)

var_samp(expression)

bit_and(expression)

bit_or(expression)

bool_and(expression)

bool_or(expression)

corr(Y, X)

every(expression)

rank(expression)

regr_avgx(Y, X)

regr_avgy(Y, X)

regr_count(Y, X)

regr_intercept(Y, X)

regr_r2(Y, X)

regr_slope(Y, X)

regr_sxx(Y, X)

regr_sxy(Y, X)

regr_syy(Y, X)

stddev(expression)

variance(expexpression,ression)

checksum(expression)

approx_count_distinct(col_name)

UNIQ(col_name)

Feedback

Was this page helpful?