更新时间:2024-08-29 GMT+08:00
分享

长文本摘要

应用介绍

切割长文本,利用大模型逐步总结,如对会议/报告/文章等总结概述。涉及长文本分割、摘要等相关特性。

环境准备

  • python3.9 及以上版本。
  • 安装依赖的组件包, pip install pangu_kits_app_dev_py gradio python-docx。
  • 盘古大语言模型。

开发实现

  • 创建配置文件llm.properties, 正确配置iam和pangu配置项。信息收集请参考准备工作
    #
    # Copyright (c) Huawei Technologies Co., Ltd. 2023-2023. All rights reserved.
    #
    
    ################################ GENERIC CONFIG ###############################
    
    ## If necessary, you can specify the http proxy configuration.
    # sdk.proxy.enabled=true
    # sdk.proxy.url=
    # sdk.proxy.user=
    # sdk.proxy.password=
    
    
    ## Generic IAM info. This config is used when the specified IAM is not configured.
    ## Either user password authentication or AK/SK authentication.
    #
    sdk.iam.url=
    sdk.iam.domain=
    sdk.iam.user=
    sdk.iam.password=
    sdk.iam.project=
    
    
    ## Pangu
    # Examples: https://{endPoint}/v1/{projectId}/deployments/{deploymentId} ;
    #
    sdk.llm.pangu.url=
  • 创建代码文件(doc_summary.py),示例如下:
    import os 
    import gradio as gr
    import docx
    import time
    
    from pangukitsappdev.skill.doc.summary import DocSummaryMapReduceSkill
    from pangukitsappdev.api.llms.factory import LLMs
    
    # 设置SDK使用的配置文件
    os.environ["SDK_CONFIG_PATH"] = "./llm.properties"
    
    # 初始化文档问答Skill
    doc_skill = DocSummaryMapReduceSkill(LLMs.of("pangu"))
    
    # 合并长度过小的段落,不超过maxLength
    def merge_docs(docs, maxLength, separator):
        result = []
        current = []
        length = 0
        for doc in docs:
            if (length + len(doc) > maxLength):
                if current:
                    result.append(separator.join(current))
                current = [doc]
                length = len(doc)
            else:
                current.append(doc)
                length = length + len(doc)
        if current:
            result.append(separator.join(current))
        return result
    
    # 加载文档,支持word和txt文本文件
    def load_file(name):
        docs = []
        if (name.endswith("doc") or name.endswith("docx")):
            doc = docx.Document(name)
            for grah in doc.paragraphs:
                docs.append(grah.text)
        else:
            data = ""
            with open(name, 'r', encoding='utf-8') as f:
                data = f.read()
            docs = data.split("\n")
        return docs
    
    
    def summary(files):
        sum_texts = []
        for file in files:
            docs = load_file(file.name)
            # print(docs)
            docs_merge = merge_docs(docs, 1000, "\n")
    
            # 大模型对文档做摘要
            sum_texts.append(doc_skill.execute_with_texts(docs_merge))
    
            # 设置延时,避免访问太频繁
            time.sleep(10)
    
        return sum_texts[0] if len(sum_texts) == 1 else doc_skill.execute_with_texts(sum_texts)
    
    def upload_file(files):
        file_paths = [file.name for file in files]
        return file_paths
    
    with gr.Blocks() as demo:
        upload_bt = gr.UploadButton("选择文档", file_types=["text"], file_count="multiple")
        file_output = gr.Files()
        upload_bt.upload(upload_file, upload_bt, file_output)
    
        greet_btn = gr.Button("生成摘要")
        output = gr.Textbox(label="输出")
    
        greet_btn.click(fn=summary, inputs=file_output, outputs=output, api_name="summary")
    
    
    demo.launch()
  • 终端命令行下执行python3 doc_summary.py运行应用,效果如下。

相关文档