Updated on 2024-05-29 GMT+08:00

Set Digest Functions

Overview

HetuEngine provides several functions that use the MinHash technique.

MinHash quickly estimates how similar two sets based on their Jaccard similarity coefficients. It is usually used in data mining to detect almost the same web pages at scale. By using this information, search engines effectively avoid displaying two web pages that are nearly identical in search results.

The following example shows how to use set digest functions to estimate the similarity between texts. The input texts are split into 4-shingles by the ngrams() function. (The texts are divided into consecutive subsequences each is 4 characters long, and each subsequence is called a shingle or gram.) They are used to create a set digest of each initial text. The set digests are compared with each other to obtain an approximation of similarity of their initial texts.

WITH text_input(id, text) AS (
         VALUES
             (1, 'The quick brown fox jumps over the lazy dog'),
             (2, 'The quick and the lazy'),
             (3, 'The quick brown fox jumps over the dog')
     ), 
    text_ngrams(id, ngrams) AS (
        SELECT id,
                transform(
                  ngrams(
                    split(text, ' '),
                    4
                  ),
                  token -> array_join(token, ' ')
                )
         FROM text_input
     ),
     minhash_digest(id, digest) AS (
         SELECT id,
                (SELECT make_set_digest(v) FROM unnest(ngrams) u(v))
         FROM text_ngrams
     ),
     setdigest_side_by_side(id1, digest1, id2, digest2) AS (
         SELECT m1.id as id1,
                m1.digest as digest1,
                m2.id as id2,
                m2.digest as digest2
         FROM (SELECT id, digest FROM minhash_digest) m1
         JOIN (SELECT id, digest FROM minhash_digest) m2
           ON m1.id != m2.id AND m1.id < m2.id
     )
SELECT id1,
       id2,
       intersection_cardinality(digest1, digest2) AS intersection_cardinality,
       jaccard_index(digest1, digest2)
            AS jaccard_index
FROM setdigest_side_by_side
ORDER BY id1, id2;
 id1 | id2 | intersection_cardinality | jaccard_index 
-----|-----|--------------------------|---------------
   1 |   2 |                        0 |           0.0 
   1 |   3 |                        4 |           0.6 
   2 |   3 |                        0 |           0.0 
(3 rows)

The above list indicates that, as expected, the texts with IDs 1 and 3 are similar.

Data sketches can be serialized to varbinary or deserialized from varbinary. Varbinary can be used to store data sketches.

Function

  • make_set_digest(x) → setdigest

    Description: Composes all input values of x into a setdigest.

    SELECT make_set_digest(value) FROM (VALUES 1, 2, 3) T(value);
                          _col0                      
    -------------------------------------------------
     01 10 00 00 00 02 0b 03 00 80 03 44 00 00 58 3d 
     5b 80 20 08 de 00 20 00 00 03 00 00 00 a8 c0 76 
     6c a0 20 08 de 4a c4 05 fb b7 03 44 00 0c 8b 48 
     b2 39 58 3d 5b 01 00 01 00 01 00                
    (1 row)
    
    SELECT make_set_digest(value) FROM (VALUES 'Trino', 'SQL', 'on', 'everything') T(value);
                          _col0                      
    -------------------------------------------------
     01 14 00 00 00 02 0b 04 00 c0 8c 7d 1e c0 75 c9 
     2d c0 1a 1a 66 03 11 c3 a5 00 20 00 00 04 00 00 
     00 06 e5 2d 45 05 11 c3 a5 48 85 6b d5 e0 8c 7d 
     1e b9 1a 8a 39 ff 75 c9 2d 02 ad 0c 7c ed 1a 1a 
     66 01 00 01 00 01 00 01 00                      
    (1 row)
    
  • merge_set_digest(setdigest) → setdigest

    Description: Returns the setdigest of the aggregate union of the individual setdigest structures.

  • cardinality(setdigest) → long

    Description: Returns the cardinality of the setdigest from its internal HyperLogLog component.

    SELECT cardinality(make_set_digest(value)) FROM (VALUES 1, 2, 2, 3, 3,4, 4, 4, 5) T(value); -- 5
  • intersection_cardinality(x, y) → long

    Description: Returns the estimation for the cardinality of the intersection of the two set digests. x and y must be of the setdigest type.

    SELECT intersection_cardinality(make_set_digest(v1), make_set_digest(v2)) FROM (VALUES (1, 1), (NULL, 2), (2, 3), (3, 4)) T(v1, v2); -- 3
  • jaccard_index(x, y) → double

    Description: Returns the estimation of Jaccard index for the two set digests.aries. x and y must be of the setdigest type.

    SELECT jaccard_index(make_set_digest(v1), make_set_digest(v2)) FROM (VALUES (1, 1), (NULL,2), (2, 3), (NULL, 4)) T(v1, v2); -- 0.5
  • hash_counts(x)

    Description: Returns a map containing the Murmur3Hash128 hashed values and the count of their occurrences within the internal MinHash structure belonging to x. x must be of the setdigest type.

    SELECT hash_counts(make_set_digest(value)) FROM (VALUES 1, 1, 1, 2, 2) T(value); -- {19144387141682250=3, -2447670524089286488=2}