Updated on 2024-12-13 GMT+08:00

HyperLogLog Functions

HetuEngine uses the HyperLogLog data structure to implement the rox_distinct () function.

Data Structure

HyperLogLog (hll) is a statistical base algorithm. It does not store the number of occurrences of each element. It uses the probability algorithm to calculate the number of elements by storing the location of the first 1 in the 32-bit hash value of the element. Generally, there are two types of storage structures: a sparse storage structure and a dense storage structure. HLL is created in a sparse storage structure. When more efficient processing is required, HLL is converted to intensive data structures. P4HyperLogLog is a dense data structure in its rectification life cycle. If necessary, run cast(hll as P4HyperLogLog) to convert it explicitly. In the current implementation of the data engine, the HLL data sketch uses a group of 32-bit buckets to store the maximum hash value.

Serialization

Data sketches can be serialized and deserialized using varbinary. This allows them to be easily stored for later use. By combining multiple sketches, we can query approx_distinct() of all elements in the partition, that is, the approximate number of occurrences of each element, and then complete the entire query with a small overhead.

For example, if you only need to calculate the number of times that each user browses web pages every day, you can calculate the weekly and yearly data by accumulating the data. This is similar to calculating the weekly revenue by summarizing the daily revenue.

You can use approx_distinct() together with GROUPING SETS to convert it to HyperLogLog. The following is an example:

CREATE TABLE visit_summaries(visit_date date,hll varbinary);

INSERT INTO visit_summaries 
SELECT visit_date,cast(approx_set(user_id) AS varbinary) 
FROM user_visits 
GROUP BY visit_date; 

SELECT cardinality(merge(cast(hll AS HyperLogLog)))AS weekly_unique_users 
FROM visit_summaries
WHERE visit_date>=current_date-interval'7'day;

Function

  • approx_set(x) → HyperLogLog

    Description: Returns HyperLogLog. This data sketch is the basis of approx distinct() and can be stored and used by calling cardinality().

    select approx_set(cookieid) from cookies_log;--02 0c 02 00 c0 77 15 40 c1 2f 1b c2 
  • cardinality(hll) → bigint

    Description: Calculates the data summarized by HLL.

    select cardinality(approx_set(cookieid)) from cookies_log; --2
  • empty_approx_set()→ HyperLogLog

    Description: Returns an empty HyperLogLog.

    select empty_approx_set();--02 0c 00 00
  • merge(HyperLogLog) → HyperLogLog

    Description: Summarizes the union set of each independent HLL data sketch.

    CREATE TABLE visit_summaries (  visit_date date,  hll varbinary);
    
    insert into visit_summaries select createtime,cast(approx_set(cookieid) as varbinary) from cookies_log group by createtime;
    
    SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS weekly_unique_users FROM visit_summaries WHERE visit_date >=date '2020-07-11';
     weekly_unique_users 
    ---------------------
                       2 
    (1 row)