HyperLogLog Functions
HetuEngine uses the HyperLogLog data structure to implement the rox_distinct () function.
Data Structure
HyperLogLog (hll) is a statistical base algorithm. It does not store the number of occurrences of each element. It uses the probability algorithm to calculate the number of elements by storing the location of the first 1 in the 32-bit hash value of the element. Generally, there are two types of storage structures: a sparse storage structure and a dense storage structure. HLL is created in a sparse storage structure. When more efficient processing is required, HLL is converted to intensive data structures. P4HyperLogLog is a dense data structure in its rectification life cycle. If necessary, run cast(hll as P4HyperLogLog) to convert it explicitly. In the current implementation of the data engine, the HLL data sketch uses a group of 32-bit buckets to store the maximum hash value.
Serialization
Data sketches can be serialized and deserialized using varbinary. This allows them to be easily stored for later use. By combining multiple sketches, we can query approx_distinct() of all elements in the partition, that is, the approximate number of occurrences of each element, and then complete the entire query with a small overhead.
For example, if you only need to calculate the number of times that each user browses web pages every day, you can calculate the weekly and yearly data by accumulating the data. This is similar to calculating the weekly revenue by summarizing the daily revenue.
You can use approx_distinct() together with GROUPING SETS to convert it to HyperLogLog. The following is an example:
CREATE TABLE visit_summaries(visit_date date,hll varbinary); INSERT INTO visit_summaries SELECT visit_date,cast(approx_set(user_id) AS varbinary) FROM user_visits GROUP BY visit_date; SELECT cardinality(merge(cast(hll AS HyperLogLog)))AS weekly_unique_users FROM visit_summaries WHERE visit_date>=current_date-interval'7'day;
Function
- approx_set(x) → HyperLogLog
Description: Returns HyperLogLog. This data sketch is the basis of approx distinct() and can be stored and used by calling cardinality().
select approx_set(cookieid) from cookies_log;--02 0c 02 00 c0 77 15 40 c1 2f 1b c2
- cardinality(hll) → bigint
Description: Calculates the data summarized by HLL.
select cardinality(approx_set(cookieid)) from cookies_log; --2
- empty_approx_set()→ HyperLogLog
Description: Returns an empty HyperLogLog.
select empty_approx_set();--02 0c 00 00
- merge(HyperLogLog) → HyperLogLog
Description: Summarizes the union set of each independent HLL data sketch.
CREATE TABLE visit_summaries ( visit_date date, hll varbinary); insert into visit_summaries select createtime,cast(approx_set(cookieid) as varbinary) from cookies_log group by createtime; SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS weekly_unique_users FROM visit_summaries WHERE visit_date >=date '2020-07-11'; weekly_unique_users --------------------- 2 (1 row)
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot