Updated: 2024-05-27 GMT+08:00

PMI

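Overview

Takes word-segmentation results as input and computes the pointwise mutual information (PMI) of every pair of words in a document. PMI is calculated as follows:

$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$

where p(x, y) is the probability that words x and y co-occur, and p(x) and p(y) are the probabilities of x and y individually.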

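Input

Parameter	Sub-parameter	Description
inputs	input_table	Input data table containing the tokenized sentences; required

Input parameters

Parameter name	Description	Requirements
doc_col_name	Tokenized text column	string; required; when multiple columns are given, each column is treated as a separate sentence
doc_sep	Word separator used in the tokenized column	string; required; default is " "
min_count	Minimum word frequency	integer; optional; default is 5; words whose frequency is below this value are filtered out; treated as 0 if left empty; value range [0, 2147483647]
window_size	Sliding window size	integer; optional; default is the whole line; value range [1, 2147483647]
partitions	Number of data repartitions	integer; optional; value range [1, 5000]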

partitions

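For large data volumes, set partitions to a larger value: 1000 is recommended for 1 million long-text records, and 2000 for 5 million long-text records. In either of these two scenarios, if the user-defined partitions value is lower than the required value, the system automatically replaces it with the required value (1000 or 2000, respectively).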
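The replacement rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating the two figures as lower-bound thresholds (at least 1 million rows, at least 5 million rows) is an assumption, since the text gives only these two data points and does not say how the system measures "long text".

```python
def required_partitions(row_count):
    """Floor suggested by the text, or None when no floor is stated.

    Assumption: the two documented sizes act as lower-bound thresholds.
    """
    if row_count >= 5_000_000:
        return 2000
    if row_count >= 1_000_000:
        return 1000
    return None


def effective_partitions(row_count, user_partitions):
    """Partition count that would be used under the assumed rule."""
    if not 1 <= user_partitions <= 5000:
        raise ValueError("partitions must be in [1, 5000]")
    floor = required_partitions(row_count)
    if floor is None:
        return user_partitions
    # A user value below the floor is replaced; a larger one is kept.
    return max(user_partitions, floor)


print(effective_partitions(1_000_000, 200))
print(effective_partitions(5_000_000, 3000))
```

Under these assumptions the first call yields the floor of 1000, while the second keeps the user's larger value.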

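Resource configuration

For large data volumes, use a larger resource configuration and, in particular, increase executor memory. Reference configurations: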

cluster 32 configuration:

--executor-memory 8G \
--executor-cores 2 \
--num-executors 14 \
--driver-cores 4 \
--driver-memory 15G \

cluster 64 configuration:

--executor-memory 24G \
--executor-cores 6 \
--num-executors 10 \
--driver-cores 4 \
--driver-memory 15G \
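As a quick sanity check on the two reference configurations, the sketch below adds up the requested cores and memory. The class and constant names are hypothetical. The totals come out to 32 and 64 cores, which suggests that "cluster 32" and "cluster 64" refer to the total core count; that reading is inferred from the arithmetic and is not stated in the text. The memory totals ignore any per-process overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkResources:
    executor_memory_gb: int
    executor_cores: int
    num_executors: int
    driver_cores: int
    driver_memory_gb: int

    def total_cores(self):
        # Cores requested by all executors plus the driver.
        return self.executor_cores * self.num_executors + self.driver_cores

    def total_memory_gb(self):
        # Heap memory requested by all executors plus the driver.
        return self.executor_memory_gb * self.num_executors + self.driver_memory_gb


# Values copied from the two reference configurations.
CLUSTER_32 = SparkResources(8, 2, 14, 4, 15)
CLUSTER_64 = SparkResources(24, 6, 10, 4, 15)

for name, cfg in (("cluster 32", CLUSTER_32), ("cluster 64", CLUSTER_64)):
    print(name, cfg.total_cores(), "cores,", cfg.total_memory_gb(), "GB")
```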

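Parameter configuration

If the job runs too slowly, consider increasing the resource configuration or adjusting min_count and window_size: a larger min_count and a smaller window_size both reduce the number of word pairs to be counted.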
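To see why a smaller window helps, the sketch below counts the candidate word pairs produced by a single line. It assumes that two tokens co-occur when their positions differ by less than window_size, and that the default (whole line) pairs every token with every other token in the line. The exact window semantics of the operator are not specified in the text, so the function is a hypothetical illustration.

```python
def pairs_per_line(n_tokens, window_size=None):
    """Number of token pairs counted for one line of n_tokens tokens.

    Assumption: tokens at positions i < j form a pair when
    j - i < window_size; window_size=None means the whole line.
    """
    if window_size is None:
        max_dist = n_tokens - 1
    else:
        max_dist = min(window_size - 1, n_tokens - 1)
    # There are n_tokens - d position pairs at distance d.
    return sum(n_tokens - d for d in range(1, max_dist + 1))


print(pairs_per_line(1000))
print(pairs_per_line(1000, window_size=5))
```

With the whole-line default the pair count grows quadratically with line length, whereas a fixed window keeps it roughly linear. Raising min_count has a similar effect, because it shortens every line before pairs are formed.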

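Output

Parameter	Sub-parameter	Description
output	output_port_1	Name of the output table; labeled as dataframe

Output table

Column name	Description
word1	First word of the co-occurring word pair
word2	Second word of the co-occurring word pair
word1_count	Number of times word1 appears across all co-occurring word pairs
word2_count	Number of times word2 appears across all co-occurring word pairs
co_occurrence_count	Number of occurrences of the co-occurring pair (word1, word2)
pmi	PMI value of word1 and word2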

Sample

Data input

input_table

input
Try to try it how to try it
Need to try it
You try to do do something
How can you these days still not try it not do anything
It is a good chance to try also you can do it
You are right that it is a good chance to try

Configure the workflow

Run the workflow

Input parameters

Output results

word1	word2	word1_count	word2_count	co_occurrences_count	pmi
You	a	11	16	1	-0.36646
You	chance	11	16	1	-0.36646
You	do	11	23	2	-0.03622
You	good	11	16	1	-0.36646
You	is	11	16	1	-0.36646
You	it	11	34	1	-1.12023
You	to	11	32	2	-0.36646
You	try	11	38	2	-0.53831
a	can	16	15	1	-0.67662
a	chance	16	16	2	-0.04801
a	do	16	23	1	-1.10406
a	good	16	16	2	-0.04801
a	is	16	16	2	-0.04801
a	it	16	34	2	-0.80178
a	to	16	32	2	-0.74116
a	try	16	38	2	-0.91301
a	you	16	15	1	-0.67662
can	chance	15	16	1	-0.67662
can	do	15	23	2	-0.34638
can	good	15	16	1	-0.67662
can	is	15	16	1	-0.67662
can	it	15	34	2	-0.73724
can	not	15	12	2	0.304211
can	to	15	32	1	-1.36977
can	try	15	38	2	-0.84847
can	you	15	15	2	0.081068
chance	do	16	23	1	-1.10406
chance	good	16	16	2	-0.04801
chance	is	16	16	2	-0.04801
chance	it	16	34	2	-0.80178
chance	to	16	32	2	-0.74116
chance	try	16	38	2	-0.91301
chance	you	16	15	1	-0.67662
do	do	23	23	1	-1.46697
do	good	23	16	1	-1.10406
do	is	23	16	1	-1.10406
do	it	23	34	2	-1.16469
do	not	23	12	2	-0.12323
do	to	23	32	3	-0.6986
do	try	23	38	4	-0.58276
do	you	23	15	2	-0.34638
good	is	16	16	2	-0.04801
good	it	16	34	2	-0.80178
good	to	16	32	2	-0.74116
good	try	16	38	2	-0.91301
good	you	16	15	1	-0.67662
is	it	16	34	2	-0.80178
is	to	16	32	2	-0.74116
is	try	16	38	2	-0.91301
is	you	16	15	1	-0.67662
it	it	34	34	1	-2.2487
it	not	34	12	2	-0.5141
it	to	34	32	7	-0.24217
it	try	34	38	8	-0.28048
it	you	34	15	2	-0.73724
not	not	12	12	1	-0.16579
not	try	12	38	2	-0.62532
not	you	12	15	2	0.304211
to	to	32	32	1	-2.12745
to	try	32	38	8	-0.21986
to	you	32	15	1	-1.36977
try	try	38	38	1	-2.47115
try	you	38	15	2	-0.84847
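The sample output can be reproduced with a short Python sketch, which may help in reading the columns. It rests on assumptions inferred from the numbers above rather than stated by the operator: min_count is taken to be 2 (the parameters used for the sample are not shown in the text), the window is the whole line, matching is case-sensitive, every pair of token positions in a line counts as one co-occurrence, and the PMI value is ln(co_occurrence_count × N / (word1_count × word2_count)), where N is the total number of co-occurrences. The function name pmi_rows is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

SAMPLE = [
    "Try to try it how to try it",
    "Need to try it",
    "You try to do do something",
    "How can you these days still not try it not do anything",
    "It is a good chance to try also you can do it",
    "You are right that it is a good chance to try",
]


def pmi_rows(lines, doc_sep=" ", min_count=2):
    docs = [line.split(doc_sep) for line in lines]
    # Corpus-wide token frequency, used for the min_count filter.
    freq = Counter(tok for doc in docs for tok in doc)
    pair_counts = Counter()
    word_counts = Counter()
    for doc in docs:
        kept = [tok for tok in doc if freq[tok] >= min_count]
        # Whole-line window: every pair of positions co-occurs once.
        for w1, w2 in combinations(kept, 2):
            pair_counts[tuple(sorted((w1, w2)))] += 1
            word_counts[w1] += 1
            word_counts[w2] += 1
    total = sum(pair_counts.values())
    rows = {}
    for (w1, w2), n in sorted(pair_counts.items()):
        c1, c2 = word_counts[w1], word_counts[w2]
        rows[(w1, w2)] = (c1, c2, n, math.log(n * total / (c1 * c2)))
    return rows


rows = pmi_rows(SAMPLE)
print(len(rows))                    # 63 word pairs
print(rows[("You", "a")][:3])       # (11, 16, 1)
print(round(rows[("You", "a")][3], 5))
```

Under these assumptions N is 122, and the row for ("You", "a") evaluates to ln(1 × 122 / (11 × 16)), which agrees with the value -0.36646 in the table. A word paired with itself, such as ("do", "do"), contributes twice to that word's count.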
