窗口函数
普通的聚集函数只能用来计算一行内的结果,或者把所有行聚集成一行结果。而窗口函数可以跨行计算,并且把结果填到每一行中。
- 通过查询筛选出的行的某些部分,窗口调用函数实现了类似于聚集函数的功能,所以聚集函数也可以作为窗口函数使用。 窗口函数可以扫描所有的行,根据窗口函数的PARTITION BY选项将查询的行分为一组。
- 列存表目前只支持窗口函数rank(expression)和row_number(expression),以及聚集函数的sum,count,avg,min和max,而行存表没有限制。
- 窗口函数需要特殊的关键字OVER语句来指定窗口即触发一个窗口函数。OVER语句用于对数据进行分组,并对组内元素进行排序。窗口函数用于给组内的值生成序号。
- 窗口函数中的order by后面必须跟字段名,若order by后面跟数字,该数字会被按照常量处理,对目标列没有起到排序的作用。
窗口函数的语法格式
function_name ([expression [, expression ... ]]) OVER ( window_definition ) function_name ([expression [, expression ... ]]) OVER window_namefunction_name ( * ) OVER ( window_definition ) function_name ( * ) OVER window_name
其中window_definition子句option为:
[ existing_window_name ] [ PARTITION BY expression [, ...] ] [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ] [ frame_clause ]
frame_clause子句option为:
[ RANGE | ROWS ] frame_start [ RANGE | ROWS ] BETWEEN frame_start AND frame_end
窗口区间支持RANGE、ROWS两种模式,ROWS 以物理单位(行)指定窗口。RANGE将窗口指定为逻辑偏移量。
RANGE、ROWS中可以使用BETWEEN frame_start AND frame_end指定边界可取值。如果省略了frame_end默认为CURRENT ROW。
BETWEEN frame_start AND frame_end取值为:
- CURRENT ROW,当前行。
- N PRECEDING,当前行向前第n行。
- UNBOUNDED PRECEDING,当前PARTITION的第1行。
- N FOLLOWING,当前行向后第n行。
- UNBOUNDED FOLLOWING,当前PARTITION的最后1行。
需要注意,frame_start不能为UNBOUNDED FOLLOWING,frame_end不能为UNBOUNDED PRECEDING,并且frame_end选项不能比上面取值中出现的frame_start选项早。例如RANGE BETWEEN CURRENT ROW AND value PRECEDING是不被允许的。
窗口函数
- RANK()
描述:RANK函数为各组内值生成跳跃排序序号,其中,相同的值具有相同序号。
返回值类型:BIGINT
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, rank() OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | rank -------+---------------+------ 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 8 1 | 2 | 8 1 | 2 | 8 1 | 2 | 8 1 | 2 | 8 1 | 2 | 8 1 | 2 | 8 1 | 3 | 15 1 | 3 | 15 1 | 3 | 15 1 | 3 | 15 1 | 3 | 15 1 | 3 | 15 1 | 3 | 15 1 | 4 | 22 1 | 4 | 22 1 | 4 | 22 1 | 4 | 22 1 | 4 | 22 1 | 4 | 22 1 | 4 | 22 1 | 5 | 29 1 | 5 | 29 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 (42 rows)
- ROW_NUMBER()
描述:ROW_NUMBER函数为各组内值生成连续排序序号,其中,相同的值其序号也不相同。
返回值类型:BIGINT
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, Row_number() OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | row_number -------+---------------+------------ 1 | 1 | 1 1 | 1 | 2 1 | 1 | 3 1 | 1 | 4 1 | 1 | 5 1 | 1 | 6 1 | 1 | 7 1 | 2 | 8 1 | 2 | 9 1 | 2 | 10 1 | 2 | 11 1 | 2 | 12 1 | 2 | 13 1 | 2 | 14 1 | 3 | 15 1 | 3 | 16 1 | 3 | 17 1 | 3 | 18 1 | 3 | 19 1 | 3 | 20 1 | 3 | 21 1 | 4 | 22 1 | 4 | 23 1 | 4 | 24 1 | 4 | 25 1 | 4 | 26 1 | 4 | 27 1 | 4 | 28 1 | 5 | 29 1 | 5 | 30 2 | 5 | 1 2 | 5 | 2 2 | 5 | 3 2 | 5 | 4 2 | 5 | 5 2 | 6 | 6 2 | 6 | 7 2 | 6 | 8 2 | 6 | 9 2 | 6 | 10 2 | 6 | 11 2 | 6 | 12 (42 rows)
- DENSE_RANK()
描述:DENSE_RANK函数为各组内值生成连续排序序号,其中,相同的值具有相同序号。
返回值类型:BIGINT
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, dense_rank() OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | dense_rank -------+---------------+------------ 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 5 | 5 1 | 5 | 5 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 (42 rows)
- PERCENT_RANK()
描述:PERCENT_RANK函数为各组内对应值生成相对序号,即根据公式 (rank - 1) / (total rows - 1)计算所得的值。其中rank为该值依据RANK函数所生成的对应序号,totalrows为该分组内的总元素个数。
返回值类型:DOUBLE PRECISION
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, percent_rank() OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | percent_rank -------+---------------+------------------ 1 | 1 | 0 1 | 1 | 0 1 | 1 | 0 1 | 1 | 0 1 | 1 | 0 1 | 1 | 0 1 | 1 | 0 1 | 2 | .241379310344828 1 | 2 | .241379310344828 1 | 2 | .241379310344828 1 | 2 | .241379310344828 1 | 2 | .241379310344828 1 | 2 | .241379310344828 1 | 2 | .241379310344828 1 | 3 | .482758620689655 1 | 3 | .482758620689655 1 | 3 | .482758620689655 1 | 3 | .482758620689655 1 | 3 | .482758620689655 1 | 3 | .482758620689655 1 | 3 | .482758620689655 1 | 4 | .724137931034483 1 | 4 | .724137931034483 1 | 4 | .724137931034483 1 | 4 | .724137931034483 1 | 4 | .724137931034483 1 | 4 | .724137931034483 1 | 4 | .724137931034483 1 | 5 | .96551724137931 1 | 5 | .96551724137931 2 | 5 | 0 2 | 5 | 0 2 | 5 | 0 2 | 5 | 0 2 | 5 | 0 2 | 6 | .454545454545455 2 | 6 | .454545454545455 2 | 6 | .454545454545455 2 | 6 | .454545454545455 2 | 6 | .454545454545455 2 | 6 | .454545454545455 2 | 6 | .454545454545455 (42 rows)
- CUME_DIST()
描述:CUME_DIST函数为各组内对应值生成累积分布序号。即根据公式(小于等于当前值的数据行数)/(该分组总行数totalrows)计算所得的相对序号。
返回值类型:DOUBLE PRECISION
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, cume_dist() OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim e_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | cume_dist -------+---------------+------------------ 1 | 1 | .233333333333333 1 | 1 | .233333333333333 1 | 1 | .233333333333333 1 | 1 | .233333333333333 1 | 1 | .233333333333333 1 | 1 | .233333333333333 1 | 1 | .233333333333333 1 | 2 | .466666666666667 1 | 2 | .466666666666667 1 | 2 | .466666666666667 1 | 2 | .466666666666667 1 | 2 | .466666666666667 1 | 2 | .466666666666667 1 | 2 | .466666666666667 1 | 3 | .7 1 | 3 | .7 1 | 3 | .7 1 | 3 | .7 1 | 3 | .7 1 | 3 | .7 1 | 3 | .7 1 | 4 | .933333333333333 1 | 4 | .933333333333333 1 | 4 | .933333333333333 1 | 4 | .933333333333333 1 | 4 | .933333333333333 1 | 4 | .933333333333333 1 | 4 | .933333333333333 1 | 5 | 1 1 | 5 | 1 2 | 5 | .416666666666667 2 | 5 | .416666666666667 2 | 5 | .416666666666667 2 | 5 | .416666666666667 2 | 5 | .416666666666667 2 | 6 | 1 2 | 6 | 1 2 | 6 | 1 2 | 6 | 1 2 | 6 | 1 2 | 6 | 1 2 | 6 | 1 (42 rows)
- NTILE(num_buckets integer)
描述:NTILE函数根据num_buckets integer将有序的数据集合平均分配到num_buckets所指定数量的桶中,并将桶号分配给每一行。分配时应尽量做到平均分配。
返回值类型:INTEGER
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, ntile(3) OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | ntile -------+---------------+------- 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 3 | 2 1 | 3 | 2 1 | 3 | 2 1 | 3 | 2 1 | 3 | 2 1 | 3 | 2 1 | 3 | 3 1 | 4 | 3 1 | 4 | 3 1 | 4 | 3 1 | 4 | 3 1 | 4 | 3 1 | 4 | 3 1 | 4 | 3 1 | 5 | 3 1 | 5 | 3 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 5 | 1 2 | 5 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 3 2 | 6 | 3 2 | 6 | 3 2 | 6 | 3 (42 rows)
- LAG(value any [, offset integer [, default any ]])
描述:LAG函数为各组内对应值生成滞后值。即当前值对应的行数往前偏移offset位后所得行的value值作为序号。若经过偏移后行数不存在,则对应结果取为default值。若无指定,在默认情况下,offset取为1,default值取为NULL。
返回值类型:与参数数据类型相同。
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, lag(d_moy,3,null) OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | lag -------+---------------+----- 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 5 | 1 1 | 5 | 1 2 | 5 | 2 | 5 | 2 | 5 | 2 | 5 | 2 2 | 5 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 2 | 6 | 2 (42 rows)
- LEAD(value any [, offset integer [, default any ]])
描述:LEAD函数为各组内对应值生成提前值。即当前值对应的行数向后偏移offset位后所得行的value值作为序号。若经过向后偏移后行数超过当前组内的总行数,则对应结果取为default值。若无指定,在默认情况下,offset取为1,default值取为NULL。
返回值类型:与参数数据类型相同。
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
SELECT d_moy, d_fy_week_seq, lead(d_fy_week_seq,2) OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | lead -------+---------------+------ 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 2 1 | 1 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 2 1 | 2 | 3 1 | 2 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 3 1 | 3 | 4 1 | 3 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 4 1 | 4 | 5 1 | 4 | 5 1 | 5 | 1 | 5 | 2 | 5 | 5 2 | 5 | 5 2 | 5 | 5 2 | 5 | 6 2 | 5 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 6 2 | 6 | 2 | 6 | (42 rows)
- FIRST_VALUE(value any)
描述:FIRST_VALUE函数取各组内的第一个值作为返回结果。
返回值类型:与参数数据类型相同。
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
SELECT d_moy, d_fy_week_seq, first_value(d_fy_week_seq) OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 7 ORDER BY 1,2; d_moy | d_fy_week_seq | first_value -------+---------------+------------- 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 5 | 1 1 | 5 | 1 2 | 5 | 5 2 | 5 | 5 2 | 5 | 5 2 | 5 | 5 2 | 5 | 5 2 | 6 | 5 2 | 6 | 5 2 | 6 | 5 2 | 6 | 5 2 | 6 | 5 2 | 6 | 5 2 | 6 | 5 (42 rows)
- LAST_VALUE(value any)
描述:LAST_VALUE函数取各组内的最后一个值作为返回结果。
返回值类型:与参数数据类型相同。
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
SELECT d_moy, d_fy_week_seq, last_value(d_moy) OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 6 ORDER BY 1,2; d_moy | d_fy_week_seq | last_value -------+---------------+------------ 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 5 | 1 1 | 5 | 1 2 | 5 | 2 2 | 5 | 2 2 | 5 | 2 2 | 5 | 2 2 | 5 | 2 (35 rows)
- NTH_VALUE(value any, nth integer)
描述:NTH_VALUE函数返回该组内的第nth行作为结果。若该行不存在,则默认返回NULL。
返回值类型:与参数数据类型相同。
示例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
SELECT d_moy, d_fy_week_seq, nth_value(d_fy_week_seq,6) OVER(PARTITION BY d_moy ORDER BY d_fy_week_seq) FROM tpcds.date_dim WHERE d_moy < 4 AND d_fy_week_seq < 6 ORDER BY 1,2; d_moy | d_fy_week_seq | nth_value -------+---------------+----------- 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 1 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 2 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 3 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 4 | 1 1 | 5 | 1 1 | 5 | 1 2 | 5 | 2 | 5 | 2 | 5 | 2 | 5 | 2 | 5 | (35 rows)