Set Digest函数
概述
HetuEngine提供了几个处理MinHash技术的函数。
MinHash用于估计两个集合的Jaccard相似系数。它通常用于数据挖掘,用于大规模检测近乎相同的网页。通过使用这些信息,搜索引擎有效地避免了在搜索结果中显示两个几乎相同的网页。
以下示例展示了如何使用Set Digest函数来简单估计文本之间的相似性。通过使用函数ngrams()将输入文本分割为4-shingles(文本被分成长度为4的连续子序列,每个子序列称为一个shingle或者gram),它们被用于创建每个初始文本的集合摘要。将集合摘要相互比较,以获得其相应初始文本相似性的近似值。
WITH text_input(id, text) AS (
VALUES
(1, 'The quick brown fox jumps over the lazy chicken'),
(2, 'The quick and the lazy'),
(3, 'The quick brown fox jumps over the chicken')
),
text_ngrams(id, ngrams) AS (
SELECT id,
transform(
ngrams(
split(text, ' '),
4
),
token -> array_join(token, ' ')
)
FROM text_input
),
minhash_digest(id, digest) AS (
SELECT id,
(SELECT make_set_digest(v) FROM unnest(ngrams) u(v))
FROM text_ngrams
),
setdigest_side_by_side(id1, digest1, id2, digest2) AS (
SELECT m1.id as id1,
m1.digest as digest1,
m2.id as id2,
m2.digest as digest2
FROM (SELECT id, digest FROM minhash_digest) m1
JOIN (SELECT id, digest FROM minhash_digest) m2
ON m1.id != m2.id AND m1.id < m2.id
)
SELECT id1,
id2,
intersection_cardinality(digest1, digest2) AS intersection_cardinality,
jaccard_index(digest1, digest2)
AS jaccard_index
FROM setdigest_side_by_side
ORDER BY id1, id2;
id1 | id2 | intersection_cardinality | jaccard_index
-----|-----|--------------------------|---------------
1 | 2 | 0 | 0.0
1 | 3 | 4 | 0.6
2 | 3 | 0 | 0.0
(3 rows)
上述结果列表指出,正如预期的那样,id为1和3的文本非常相似。
Data sketches(数据草图)可以序列化为varbinary,也可以从varbinary反序列化。因此可以用varbinary来存储数据草图。
函数
- make_set_digest(x)→setdigest
SELECT make_set_digest(value) FROM (VALUES 1, 2, 3) T(value); _col0 ------------------------------------------------- 01 10 00 00 00 02 0b 03 00 80 03 44 00 00 58 3d 5b 80 20 08 de 00 20 00 00 03 00 00 00 a8 c0 76 6c a0 20 08 de 4a c4 05 fb b7 03 44 00 0c 8b 48 b2 39 58 3d 5b 01 00 01 00 01 00 (1 row) SELECT make_set_digest(value) FROM (VALUES 'Trino', 'SQL', 'on', 'everything') T(value); _col0 ------------------------------------------------- 01 14 00 00 00 02 0b 04 00 c0 8c 7d 1e c0 75 c9 2d c0 1a 1a 66 03 11 c3 a5 00 20 00 00 04 00 00 00 06 e5 2d 45 05 11 c3 a5 48 85 6b d5 e0 8c 7d 1e b9 1a 8a 39 ff 75 c9 2d 02 ad 0c 7c ed 1a 1a 66 01 00 01 00 01 00 01 00 (1 row)
- cardinality(setdigest)→long
描述:基于内部HyperLogLog组件返回setdigest的基数。
SELECT cardinality(make_set_digest(value)) FROM (VALUES 1, 2, 2, 3, 3,4, 4, 4, 5) T(value); -- 5
- intersection_cardinality(x,y)→long
描述:返回两个集合摘要交集的基数估计。其中x,y都是setdigest类型。
SELECT intersection_cardinality(make_set_digest(v1), make_set_digest(v2)) FROM (VALUES (1, 1), (NULL, 2), (2, 3), (3, 4)) T(v1, v2); -- 3