Updated: 2025-07-28 GMT+08:00

Collecting and Uploading NPU Logs

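Scenario

When an NPU fails, you can use this solution to collect its log information. The logs generated by this solution are saved on the node and automatically uploaded to an OBS bucket provided by technical support. The logs are used only for fault locating and analysis. For this reason, you need to provide your AK/SK to technical support for authentication and authorization.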

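Constraints

This function is currently available only in the CN Southwest-Guiyang1, CN North-Ulanqab1, CN East 2, CN East-Shanghai1, CN North-Beijing4, and CN South-Guangzhou regions.

Supported models: 300IDuo, Snt9B, and Snt9B23.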

Procedure

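  1. Obtain an AK/SK. The AK/SK is used later in the script configuration for authentication and authorization.

    If you have already generated an AK/SK, skip this step and locate the AK/SK file that you downloaded earlier. The file is usually named credentials.csv.

    As shown in the following figure, the file contains the tenant name (User Name), the AK (Access Key Id), and the SK (Secret Access Key).

    Figure 1 Content of the credentials.csv file
    To generate an AK/SK:
    1. Log in to the Huawei Cloud management console.
    2. Click the username in the upper right corner and choose “My Credentials” from the drop-down list.
    3. Click “Access Keys”.
    4. Click “Create Access Key”.
    5. Download the key and keep it secure.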
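  2. Prepare the tenant ID and IAM user ID, which are used to configure the OBS bucket.

    Provide your tenant ID and IAM user ID to technical support. Based on this information, technical support configures an OBS bucket policy so that the logs you collect can be uploaded to the corresponding OBS bucket.

    After the configuration is complete, technical support gives you the OBS bucket directory “obs_dir”, which is used in the script configured in the next step.

    Figure 2 Tenant ID and IAM user ID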

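  3. Prepare the log collection and upload script.
    In the following script, modify the parameters passed to NpuLogCollection: replace ak, sk, and obs_dir with the values obtained in the previous steps, and set is_300_iduo to True if the model is 300IDuo. Then upload the script to the node whose NPU logs you want to collect. The script calls subprocess.run with capture_output, so it must be run with Python 3.7 or later.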
    import json
    import os
    import sys
    import hashlib
    import hmac
    import binascii
    import subprocess
    import re
    from datetime import datetime
    
    class NpuLogCollection(object):
        NPU_LOG_PATH = "/var/log/npu_log_collect"
        SUPPORT_REGIONS = ['cn-southwest-2', 'cn-north-9', 'cn-east-4', 'cn-east-3', 'cn-north-4', 'cn-south-1']
        OPENSTACK_METADATA = "http://169.254.169.254/openstack/latest/meta_data.json"
        OBS_BUCKET_PREFIX = "npu-log-"
    
        def __init__(self, ak, sk, obs_dir, is_300_iduo=False):
            self.ak = ak
            self.sk = sk
            self.obs_dir = obs_dir
            self.is_300_iduo = is_300_iduo
            self.region_id = self.get_region_id()
            self.card_ids, self.chip_count = self.get_card_ids()
    
        def get_region_id(self):
            meta_data = os.popen("curl {}".format(self.OPENSTACK_METADATA))
            json_meta_data = json.loads(meta_data.read())
            meta_data.close()
            region_id = json_meta_data["region_id"]
            if region_id not in self.SUPPORT_REGIONS:
                    print("current region {} is not supported.".format(region_id))
                raise Exception('region exception')
            return region_id
    
        def gen_collect_npu_log_shell(self):
            # hccn_tool commands are not supported on 300IDuo, so they are kept in a separate snippet.
            # Note: nothing may follow a line-continuation backslash, not even a space.
            hccn_tool_log_shell = "echo {npu_network_info}\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -net_health -g >> {npu_log_path}/npu-smi_net-health.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -link -g >> {npu_log_path}/npu-smi_link.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -tls -g |grep switch >> {npu_log_path}/npu-smi_switch.log;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -optical -g | grep prese >> {npu_log_path}/npu-smi_present.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -link_stat -g >> {npu_log_path}/npu_link_history.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -ip -g >> {npu_log_path}/npu_roce_ip_info.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -lldp -g >> {npu_log_path}/npu_nic_switch_info.log ;done\n" \
                .format(npu_log_path=self.NPU_LOG_PATH,
                        npu_card_ids=self.card_ids,
                        npu_network_info="collect npu network info")
    
            # The inner chip loop uses a POSIX while loop: os.system runs the text with /bin/sh,
            # which does not accept the bash-only "for ((...))" form on every distribution.
            collect_npu_log_shell = "# !/bin/sh\n" \
                                    "step=1\n" \
                                    "rm -rf {npu_log_path}\n" \
                                    "mkdir -p {npu_log_path}\n" \
                                    "echo {echo_npu_driver_info}\n" \
                                    "npu-smi info > {npu_log_path}/npu-smi_info.log\n" \
                                    "cat /usr/local/Ascend/driver/version.info > {npu_log_path}/npu-smi_driver-version.log\n" \
                                    "/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version > {npu_log_path}/npu-smi_firmware-version.log\n" \
                                    "for i in {npu_card_ids}; do j=0; while [ $j -lt {chip_count} ]; do npu-smi info -t health -i $i -c $j; j=$((j+1)); done >> {npu_log_path}/npu-smi_health-code.log;done;\n" \
                                    "for i in {npu_card_ids}; do npu-smi info -t board -i $i >> {npu_log_path}/npu-smi_board.log; done;\n" \
                                    "echo {echo_npu_ecc_info}\n" \
                                    "for i in {npu_card_ids};do npu-smi info -t ecc -i $i >> {npu_log_path}/npu-smi_ecc.log; done;\n" \
                                    "lspci | grep acce > {npu_log_path}/Device-info.log\n" \
                                    "echo {echo_npu_device_log}\n" \
                                    "cd {npu_log_path} && msnpureport -f > /dev/null\n" \
                                    "tar -czvPf {npu_log_path}/log_messages.tar.gz /var/log/message*  > /dev/null\n" \
                                    "tar -czvPf {npu_log_path}/ascend_install.tar.gz /var/log/ascend_seclog/*  > /dev/null\n" \
                                    "echo {echo_npu_tools_log}\n" \
                                    "tar -czvPf {npu_log_path}/ascend_toollog.tar.gz /var/log/nputools_LOG_*  > /dev/null\n" \
                .format(npu_log_path=self.NPU_LOG_PATH,
                        npu_card_ids=self.card_ids,
                        chip_count=self.chip_count,
                        echo_npu_driver_info="collect npu driver info.",
                        echo_npu_ecc_info="collect npu ecc info.",
                        echo_npu_device_log="collect npu device log.",
                        echo_npu_tools_log="collect npu tools log.")
            if self.is_300_iduo:
                return collect_npu_log_shell
            return collect_npu_log_shell + hccn_tool_log_shell
    
        def collect_npu_log(self):
            print("begin to collect npu log")
            os.system(self.gen_collect_npu_log_shell())
            date_collect = datetime.now().strftime('%Y%m%d%H%M%S')
            instance_ip_obj = os.popen("curl http://169.254.169.254/latest/meta-data/local-ipv4")
            # strip() keeps stray whitespace or a trailing newline out of the archive file name
            instance_ip = instance_ip_obj.read().strip()
            instance_ip_obj.close()
            log_tar = "%s-npu-log-%s.tar.gz" % (instance_ip, date_collect)
            os.system("tar -czvPf %s %s > /dev/null" % (log_tar, self.NPU_LOG_PATH))
            print("success to collect npu log with {}".format(log_tar))
            return log_tar
    
        def upload_log_to_obs(self, log_tar):
            obs_bucket = "{}{}".format(self.OBS_BUCKET_PREFIX, self.region_id)
            print("begin to upload {} to obs bucket {}".format(log_tar, obs_bucket))
            obs_url = "https://%s.obs.%s.myhuaweicloud.com/%s/%s" % (obs_bucket, self.region_id, self.obs_dir, log_tar)
            date = datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S GMT')
            canonicalized_headers = "x-obs-acl:public-read"
            obs_sign = self.gen_obs_sign(date, canonicalized_headers, obs_bucket, log_tar)
    
            auth = "OBS " + self.ak + ":" + obs_sign
            header_date = '\"' + "Date:" + date + '\"'
            header_auth = '\"' + "Authorization:" + auth + '\"'
            header_obs_acl = '\"' + canonicalized_headers + '\"'
    
            cmd = "curl -X PUT -T " + log_tar + " -w %{http_code} " + obs_url + " -H " + header_date + " -H " + header_auth + " -H " + header_obs_acl
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            http_code = result.stdout.strip()
            if result.returncode == 0 and http_code == "200":
                print("success to upload {} to obs bucket {}".format(log_tar, obs_bucket))
            else:
                print("failed to upload {} to obs bucket {}".format(log_tar, obs_bucket))
                print(result)
    
        #  calculate obs auth sign
        def gen_obs_sign(self, date, canonicalized_headers, obs_bucket, log_tar):
            http_method = "PUT"
            canonicalized_resource = "/%s/%s/%s" % (obs_bucket, self.obs_dir, log_tar)
            IS_PYTHON2 = sys.version_info.major == 2 or sys.version < '3'
            canonical_string = http_method + "\n" + "\n" + "\n" + date + "\n" + canonicalized_headers + "\n" + canonicalized_resource
            if IS_PYTHON2:
                hashed = hmac.new(self.sk, canonical_string, hashlib.sha1)
                obs_sign = binascii.b2a_base64(hashed.digest())[:-1]
            else:
                hashed = hmac.new(self.sk.encode('UTF-8'), canonical_string.encode('UTF-8'), hashlib.sha1)
                obs_sign = binascii.b2a_base64(hashed.digest())[:-1].decode('UTF-8')
            return obs_sign
    
        # get NPU Id and Chip count
        def get_card_ids(self):
            card_ids = []
            cmd = "npu-smi info -l"
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            if result.returncode != 0:
                print("failed to execute command[{}]".format(cmd))
                # the caller unpacks two values (card ids, chip count), so return a pair
                return "", 1
            match = re.search(r'Chip Count\s*:\s*(\d+)', result.stdout)
            # default chip count is 1, 300IDUO or 910C is 2
            chip_count = 1
            if match and int(match.group(1)) > 0:
                chip_count=int(match.group(1))
    
            # filter NPU ID Regex
            pattern = re.compile(r'NPU ID(.*?): (.*?)\n', re.DOTALL)
            matches = pattern.findall(result.stdout)
            for match in matches:
                if len(match) != 2:
                    continue
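                card_id = int(match[1])  # named card_id to avoid shadowing the built-in id()
                # a negative ID indicates that the card may have dropped off
                if card_id < 0:
                    print("Card may not be found, NPU ID: {}".format(card_id))
                    continue
                card_ids.append(card_id)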
            print("success to get card id {}, Chip Count {}".format(card_ids, chip_count))
            return " ".join(str(x) for x in card_ids), chip_count
    
        def execute(self):
            if self.obs_dir == "":
                print("the obs_dir is null, please enter a correct dir")
            else:
                log_tar = self.collect_npu_log()
                self.upload_log_to_obs(log_tar)
    
    
    if __name__ == '__main__':
        npu_log_collection = NpuLogCollection(ak='ak',
                                              sk='sk',
                                              obs_dir='obs_dir',
                                              is_300_iduo=False)
        npu_log_collection.execute()
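  4. Run the script to collect logs.

    Run the script on the node. The following output indicates that the logs have been collected and uploaded to OBS.

    Figure 3 Log collection completed
  5. Check the directory that contains the script. The collected log package is stored there. (The script writes the package to its current working directory, which is the script directory when the script is run from there.)
    Figure 4 Viewing the result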
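
The package checked in step 5 can also be inspected from Python. The following is a minimal sketch, not part of the official procedure. It relies only on the file name pattern used by the script above (`<instance IP>-npu-log-<timestamp>.tar.gz`); the helper names `find_npu_log_archives` and `list_archive_members` are hypothetical and introduced here for illustration.

```python
import glob
import os
import tarfile


def find_npu_log_archives(directory="."):
    """Return archives named like '<ip>-npu-log-<timestamp>.tar.gz', oldest first."""
    pattern = os.path.join(directory, "*-npu-log-*.tar.gz")
    return sorted(glob.glob(pattern), key=os.path.getmtime)


def list_archive_members(archive_path):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    archives = find_npu_log_archives(".")
    if not archives:
        print("no NPU log archive found in the current directory")
    else:
        latest = archives[-1]
        print("latest archive: {}".format(latest))
        for name in list_archive_members(latest):
            print("  " + name)
```

Run it from the directory that holds the package. An empty listing, or a listing without the npu-smi log files, suggests that the collection commands did not run as expected on the node.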
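
For step 1, the AK and SK can be read from credentials.csv instead of being copied by hand. The sketch below is an illustration under stated assumptions: it assumes the file has a header row with the column names `Access Key Id` and `Secret Access Key` given in step 1, followed by at least one data row. The function name `read_ak_sk` is hypothetical.

```python
import csv


def read_ak_sk(csv_path):
    """Return (ak, sk) from the first data row that contains both values.

    Assumption: the header row uses the column names 'Access Key Id' and
    'Secret Access Key'; adjust the names if your file differs.
    """
    # utf-8-sig transparently drops a byte order mark if the file has one
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            ak = (row.get("Access Key Id") or "").strip()
            sk = (row.get("Secret Access Key") or "").strip()
            if ak and sk:
                return ak, sk
    raise ValueError("no AK/SK pair found in {}".format(csv_path))
```

The returned pair could then be passed to the script as `NpuLogCollection(ak=ak, sk=sk, obs_dir=..., is_300_iduo=...)`, which keeps the secret out of the script file itself.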