Set Digest函数
概述
HetuEngine提供了几个处理MinHash技术的函数。
MinHash用于估计两个集合的Jaccard相似系数。它通常用于数据挖掘,用于大规模检测近乎相同的网页。通过使用这些信息,搜索引擎有效地避免了在搜索结果中显示两个几乎相同的网页。
以下示例展示了如何使用Set Digest函数来简单估计文本之间的相似性。通过使用函数ngrams()将输入文本分割为4-shingles(文本被分成长度为4的连续子序列,每个子序列称为一个shingle或者gram),它们被用于创建每个初始文本的集合摘要。将集合摘要相互比较,以获得其相应初始文本相似性的近似值。
WITH text_input(id, text) AS ( VALUES (1, 'The quick brown fox jumps over the lazy dog'), (2, 'The quick and the lazy'), (3, 'The quick brown fox jumps over the dog') ), text_ngrams(id, ngrams) AS ( SELECT id, transform( ngrams( split(text, ' '), 4 ), token -> array_join(token, ' ') ) FROM text_input ), minhash_digest(id, digest) AS ( SELECT id, (SELECT make_set_digest(v) FROM unnest(ngrams) u(v)) FROM text_ngrams ), setdigest_side_by_side(id1, digest1, id2, digest2) AS ( SELECT m1.id as id1, m1.digest as digest1, m2.id as id2, m2.digest as digest2 FROM (SELECT id, digest FROM minhash_digest) m1 JOIN (SELECT id, digest FROM minhash_digest) m2 ON m1.id != m2.id AND m1.id < m2.id ) SELECT id1, id2, intersection_cardinality(digest1, digest2) AS intersection_cardinality, jaccard_index(digest1, digest2) AS jaccard_index FROM setdigest_side_by_side ORDER BY id1, id2; id1 | id2 | intersection_cardinality | jaccard_index -----|-----|--------------------------|--------------- 1 | 2 | 0 | 0.0 1 | 3 | 4 | 0.6 2 | 3 | 0 | 0.0 (3 rows)
上述结果列表指出,正如预期的那样,id为1和3的文本非常相似。
Data sketches(数据草图)可以序列化为varbinary,也可以从varbinary反序列化。因此可以用varbinary来存储数据草图。
函数
- make_set_digest(x) → setdigest
SELECT make_set_digest(value) FROM (VALUES 1, 2, 3) T(value); _col0 ------------------------------------------------- 01 10 00 00 00 02 0b 03 00 80 03 44 00 00 58 3d 5b 80 20 08 de 00 20 00 00 03 00 00 00 a8 c0 76 6c a0 20 08 de 4a c4 05 fb b7 03 44 00 0c 8b 48 b2 39 58 3d 5b 01 00 01 00 01 00 (1 row) SELECT make_set_digest(value) FROM (VALUES 'Trino', 'SQL', 'on', 'everything') T(value); _col0 ------------------------------------------------- 01 14 00 00 00 02 0b 04 00 c0 8c 7d 1e c0 75 c9 2d c0 1a 1a 66 03 11 c3 a5 00 20 00 00 04 00 00 00 06 e5 2d 45 05 11 c3 a5 48 85 6b d5 e0 8c 7d 1e b9 1a 8a 39 ff 75 c9 2d 02 ad 0c 7c ed 1a 1a 66 01 00 01 00 01 00 01 00 (1 row)
- cardinality(setdigest) → long
描述:基于内部HyperLogLog组件返回setdigest的基数。
SELECT cardinality(make_set_digest(value)) FROM (VALUES 1, 2, 2, 3, 3,4, 4, 4, 5) T(value); -- 5
- intersection_cardinality(x, y) → long
描述:返回两个集合摘要交集的基数估计。其中x,y 都是setdigest类型。
SELECT intersection_cardinality(make_set_digest(v1), make_set_digest(v2)) FROM (VALUES (1, 1), (NULL, 2), (2, 3), (3, 4)) T(v1, v2); -- 3