Updated on 2025-11-27 GMT+08:00

Collecting and Uploading NPU Logs

Scenario

When an NPU on a Lite Server node is faulty, you can use this solution to quickly deliver an NPU log collection task. The collected logs are stored in a specified directory on Lite Server. Once authorized, your logs will be automatically uploaded to the OBS bucket provided by Huawei Cloud for fault locating and analysis.

The following logs can be collected:

  • Device logs: system logs on the control CPU of the device, event-level system logs, system logs on non-control CPUs, and black box logs. When a device is abnormal, crashes, or has performance problems, these logs help you accurately locate the root cause, shorten troubleshooting time, and improve O&M efficiency.
  • Host logs: kernel message logs on the host, system monitoring files on the host, OS log files on the host, and kernel message log files saved on the host when the system crashes. With the host monitoring data, O&M engineers can quickly determine whether a problem is caused by the application or by underlying host resources, such as exhausted CPU, memory, or disk space. Accurate locating greatly improves troubleshooting efficiency.
  • NPU environment logs: logs collected using tools such as npu-smi and hccn. By collecting such logs, you can improve O&M efficiency, ensure system stability, optimize resource usage, and facilitate root cause analysis.
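
For reference, the NPU environment items can also be spot-checked by hand. The following is a minimal Python sketch of such a check; it assumes the Ascend driver's npu-smi and hccn_tool binaries are installed on the node and that NPU 0 exists (hccn_tool queries do not apply to 300IDuo):

    import subprocess

    # Minimal sketch: preview the NPU environment information that a collection
    # task gathers. Assumes npu-smi and hccn_tool are installed and NPU 0 exists.
    commands = [
        ["npu-smi", "info"],                            # overall card and chip status
        ["hccn_tool", "-i", "0", "-net_health", "-g"],  # RoCE network health of NPU 0
        ["hccn_tool", "-i", "0", "-link", "-g"],        # link status of NPU 0
    ]
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print("$ " + " ".join(cmd))
        print(result.stdout if result.returncode == 0 else result.stderr)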

Constraints

  • One-click log collection is supported only for Snt9b nodes and Snt9b23 supernodes. Manual log collection is supported for 300IDuo, Snt9b, and Snt9b23.
  • You can select a maximum of 50 common nodes or supernode subnodes for the same task.
  • Only one task can be executed on a node at a time, and a task cannot be interrupted once started. Plan task priorities accordingly.
  • Ensure that no services are running on the node. Commands executed by the log collection task may interrupt running services or cause service exceptions.
  • Ensure that the directory used for log collection has more than 1 GB of free space.
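
To confirm the free space before creating a task, the following minimal sketch can be used; it assumes the default directory /root/log_collection listed in Table 1:

    import os
    import shutil

    # Minimal sketch: check that the node directory used for log collection
    # has more than 1 GB of free space. /root/log_collection is the default
    # directory mentioned in this document; adjust it as needed.
    log_dir = "/root/log_collection"
    # disk_usage() needs an existing path, so fall back to the parent directory
    check_path = log_dir if os.path.isdir(log_dir) else os.path.dirname(log_dir)
    free_gb = shutil.disk_usage(check_path).free / 1024 ** 3
    print("free space under {}: {:.1f} GB".format(check_path, free_gb))
    if free_gb < 1:
        raise SystemExit("less than 1 GB free, choose another directory")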

Prerequisites

Quick log collection depends on the AI plug-in preinstalled on the Lite Server node. If the plug-in is not installed, install it by referring to Managing Lite Server AI Plug-ins.

One-Click Log Collection

  1. Log in to the ModelArts console.
  2. In the navigation pane on the left, choose Lite Servers under Resource Management. On the displayed page, click the Task Center tab.
    Figure 1 Task center

  3. Click Create Task in the upper left corner. On the displayed Job Templates page, locate Log Collection, and click Create Task.
    Figure 2 Task template

  4. On the Log Collection page, enter the task name and description. Set the server model and node type, select the collection items, and agree to the terms of use. Table 1 describes the parameters.
    Table 1 Parameters for creating a task

    • Name: The system automatically generates a task name. You can change the name as required.
    • Description: Enter the task description for quick search.
    • Template Parameters: Enter the node directory on the Lite Server for storing logs. The default value is /root/log_collection.
    • Server Model: Snt9b and Snt9b23 supernodes are supported.
    • Collection Items: Select Device side log, Host side log, NPU environment log, or all of them.
      • Device logs: system logs on the control CPU of the device, event-level system logs, system logs on non-control CPUs, and black box logs.
      • Host logs: kernel message logs on the host, system monitoring files on the host, OS log files on the host, and kernel message log files saved on the host when the system crashes.
      • NPU environment logs: logs collected using tools such as npu-smi and hccn.
  5. Enable Log upload to upload the collected logs to OBS so that Huawei engineers can analyze them, and then click Create now.
  6. View the task execution status in the Task Center tab.
  7. Click the task name to access its details page, where you can view the task details.
  8. On the task details page, locate the target node and click View Logs in the Operation column. In the window displayed on the right, view the detailed log of the task execution. On the Task Center tab, you can view all log collection results and check whether the logs were collected and uploaded.

Manually Collecting Logs

  1. Obtain an AK/SK pair, which is used in the script for authentication and authorization.

    If an AK/SK pair is already available, skip this step. Find the downloaded AK/SK file, which is usually named credentials.csv.

    The file contains the username, AK, and SK.

    Figure 3 credentials.csv
    To generate an AK/SK pair, follow these steps:
    1. Log in to the Huawei Cloud console.
    2. Hover the cursor over the username in the upper right corner and choose My Credentials from the drop-down list.
    3. In the navigation pane on the left, choose Access Keys.
    4. Click Create Access Key.
    5. Download the key and keep it secure.
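    If you want to read the keys programmatically rather than copying them by hand, the following is a minimal sketch; it assumes the usual credentials.csv layout: a header row followed by one data row containing the user name, AK, and SK.
    import csv

    # Minimal sketch: read the AK/SK from the downloaded credentials.csv.
    # Assumes a header row followed by one data row containing the user name,
    # access key (AK), and secret access key (SK), in that order.
    with open("credentials.csv", newline="") as f:
        rows = list(csv.reader(f))
    user_name, ak, sk = rows[1][:3]
    print("user: {}, AK: {}".format(user_name, ak))  # do not print the SK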
  2. Prepare the tenant ID and IAM ID for OBS bucket configuration.

    Send the prepared information to technical support, who will configure an OBS bucket policy based on it. You can then upload the collected logs to the corresponding OBS bucket.

    After the configuration is complete, you will be provided with an OBS directory (obs_dir) to use when configuring the script in the next step.

    Figure 4 Tenant and IAM accounts

  3. Prepare the log collection and upload script.
    Modify the NpuLogCollection parameters in the script below: replace ak, sk, and obs_dir with the values obtained in the previous steps, and set is_300_iduo to True for the 300IDuo model. Then upload the script to the node whose NPU logs need to be collected.
    import json
    import os
    import sys
    import hashlib
    import hmac
    import binascii
    import subprocess
    import re
    from datetime import datetime
    
    class NpuLogCollection(object):
        NPU_LOG_PATH = "/var/log/npu_log_collect"
        SUPPORT_REGIONS = ['cn-southwest-2', 'cn-north-9', 'cn-east-4', 'cn-east-3', 'cn-north-4', 'cn-south-1']
        OPENSTACK_METADATA = "http://169.254.169.254/openstack/latest/meta_data.json"
        OBS_BUCKET_PREFIX = "npu-log-"
    
        def __init__(self, ak, sk, obs_dir, is_300_iduo=False):
            self.ak = ak
            self.sk = sk
            self.obs_dir = obs_dir
            self.is_300_iduo = is_300_iduo
            self.region_id = self.get_region_id()
            self.card_ids, self.chip_count = self.get_card_ids()
    
        def get_region_id(self):
            meta_data = os.popen("curl {}".format(self.OPENSTACK_METADATA))
            json_meta_data = json.loads(meta_data.read())
            meta_data.close()
            region_id = json_meta_data["region_id"]
            if region_id not in self.SUPPORT_REGIONS:
                print("current region {} is not support.".format(region_id))
                raise Exception('region exception')
            return region_id
    
        def gen_collect_npu_log_shell(self):
            # 300IDUO does not support
            hccn_tool_log_shell = "echo {npu_network_info}\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -net_health -g >> {npu_log_path}/npu-smi_net-health.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -link -g >> {npu_log_path}/npu-smi_link.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -tls -g |grep switch >> {npu_log_path}/npu-smi_switch.log;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -optical -g | grep prese >> {npu_log_path}/npu-smi_present.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -link_stat -g >> {npu_log_path}/npu_link_history.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -ip -g >> {npu_log_path}/npu_roce_ip_info.log ;done\n" \
                                  "for i in {npu_card_ids}; do hccn_tool -i $i -lldp -g >> {npu_log_path}/npu_nic_switch_info.log ;done\n" \
                .format(npu_log_path=self.NPU_LOG_PATH,
                        npu_card_ids=self.card_ids,
                        npu_network_info="collect npu network info")
    
            collect_npu_log_shell = "# !/bin/sh\n" \
                                    "step=1\n" \
                                    "rm -rf {npu_log_path}\n" \
                                    "mkdir -p {npu_log_path}\n" \
                                    "echo {echo_npu_driver_info}\n" \
                                    "npu-smi info > {npu_log_path}/npu-smi_info.log\n" \
                                    "cat /usr/local/Ascend/driver/version.info > {npu_log_path}/npu-smi_driver-version.log\n" \
                                    "/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version > {npu_log_path}/npu-smi_firmware-version.log\n" \
                                    "for i in {npu_card_ids}; do for ((j=0;j<{chip_count};j++)); do npu-smi info -t health -i $i -c $j; done >> {npu_log_path}/npu-smi_health-code.log;done;\n" \
                                    "for i in {npu_card_ids}; do npu-smi info -t board -i $i >> {npu_log_path}/npu-smi_board.log; done;\n" \
                                    "echo {echo_npu_ecc_info}\n" \
                                    "for i in {npu_card_ids};do npu-smi info -t ecc -i $i >> {npu_log_path}/npu-smi_ecc.log; done;\n" \
                                    "lspci | grep acce > {npu_log_path}/Device-info.log\n" \
                                    "echo {echo_npu_device_log}\n" \
                                    "cd {npu_log_path} && msnpureport -f > /dev/null\n" \
                                    "tar -czvPf {npu_log_path}/log_messages.tar.gz /var/log/message*  > /dev/null\n" \
                                    "tar -czvPf {npu_log_path}/ascend_install.tar.gz /var/log/ascend_seclog/*  > /dev/null\n" \
                                    "echo {echo_npu_tools_log}\n" \
                                    "tar -czvPf {npu_log_path}/ascend_toollog.tar.gz /var/log/nputools_LOG_*  > /dev/null\n" \
                .format(npu_log_path=self.NPU_LOG_PATH,
                        npu_card_ids=self.card_ids,
                        chip_count=self.chip_count,
                        echo_npu_driver_info="collect npu driver info.",
                        echo_npu_ecc_info="collect npu ecc info.",
                        echo_npu_device_log="collect npu device log.",
                        echo_npu_tools_log="collect npu tools log.")
            if self.is_300_iduo:
                return collect_npu_log_shell
            return collect_npu_log_shell + hccn_tool_log_shell
    
        def collect_npu_log(self):
            print("begin to collect npu log")
            os.system(self.gen_collect_npu_log_shell())
            date_collect = datetime.now().strftime('%Y%m%d%H%M%S')
            instance_ip_obj = os.popen("curl http://169.254.169.254/latest/meta-data/local-ipv4")
            instance_ip = instance_ip_obj.read()
            instance_ip_obj.close()
            log_tar = "%s-npu-log-%s.tar.gz" % (instance_ip, date_collect)
            os.system("tar -czvPf %s %s > /dev/null" % (log_tar, self.NPU_LOG_PATH))
            print("success to collect npu log with {}".format(log_tar))
            return log_tar
    
        def upload_log_to_obs(self, log_tar):
            obs_bucket = "{}{}".format(self.OBS_BUCKET_PREFIX, self.region_id)
            print("begin to upload {} to obs bucket {}".format(log_tar, obs_bucket))
            obs_url = "https://%s.obs.%s.myhuaweicloud.com/%s/%s" % (obs_bucket, self.region_id, self.obs_dir, log_tar)
            date = datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S GMT')
            canonicalized_headers = "x-obs-acl:public-read"
            obs_sign = self.gen_obs_sign(date, canonicalized_headers, obs_bucket, log_tar)
    
            auth = "OBS " + self.ak + ":" + obs_sign
            header_date = '\"' + "Date:" + date + '\"'
            header_auth = '\"' + "Authorization:" + auth + '\"'
            header_obs_acl = '\"' + canonicalized_headers + '\"'
    
            cmd = "curl -X PUT -T " + log_tar + " -w %{http_code} " + obs_url + " -H " + header_date + " -H " + header_auth + " -H " + header_obs_acl
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            http_code = result.stdout.strip()
            if result.returncode == 0 and http_code == "200":
                print("success to upload {} to obs bucket {}".format(log_tar, obs_bucket))
            else:
                print("failed to upload {} to obs bucket {}".format(log_tar, obs_bucket))
                print(result)
    
        #  calculate obs auth sign
        def gen_obs_sign(self, date, canonicalized_headers, obs_bucket, log_tar):
            http_method = "PUT"
            canonicalized_resource = "/%s/%s/%s" % (obs_bucket, self.obs_dir, log_tar)
            IS_PYTHON2 = sys.version_info.major == 2 or sys.version < '3'
            canonical_string = http_method + "\n" + "\n" + "\n" + date + "\n" + canonicalized_headers + "\n" + canonicalized_resource
            if IS_PYTHON2:
                hashed = hmac.new(self.sk, canonical_string, hashlib.sha1)
                obs_sign = binascii.b2a_base64(hashed.digest())[:-1]
            else:
                hashed = hmac.new(self.sk.encode('UTF-8'), canonical_string.encode('UTF-8'), hashlib.sha1)
                obs_sign = binascii.b2a_base64(hashed.digest())[:-1].decode('UTF-8')
            return obs_sign
    
        # get NPU Id and Chip count
        def get_card_ids(self):
            card_ids = []
            cmd = "npu-smi info -l"
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            if result.returncode != 0:
                print("failed to execute command[{}]".format(cmd))
                # return an empty card list and the default chip count so the caller's tuple unpacking does not fail
                return "", 1
            match = re.search(r'Chip Count\s*:\s*(\d+)', result.stdout)
            # default chip count is 1, 300IDUO or 910C is 2
            chip_count = 1
            if match and int(match.group(1)) > 0:
                chip_count=int(match.group(1))
    
            # filter NPU ID Regex
            pattern = re.compile(r'NPU ID(.*?): (.*?)\n', re.DOTALL)
            matches = pattern.findall(result.stdout)
            for match in matches:
                if len(match) != 2:
                    continue
                id = int(match[1])
                # if drop card
                if id < 0:
                    print("Card may not be found, NPU ID: {}".format(id))
                    continue
                card_ids.append(id)
            print("success to get card id {}, Chip Count {}".format(card_ids, chip_count))
            return " ".join(str(x) for x in card_ids), chip_count
    
        def execute(self):
            if self.obs_dir == "":
                print("the obs_dir is null, please enter a correct dir")
            else:
                log_tar = self.collect_npu_log()
                self.upload_log_to_obs(log_tar)
    
    
    if __name__ == '__main__':
        npu_log_collection = NpuLogCollection(ak='ak',
                                              sk='sk',
                                              obs_dir='obs_dir',
                                              is_300_iduo=False)
        npu_log_collection.execute()
  4. Run the script to collect logs.

    Run the script on the node. If the following information is displayed, the logs have been collected and uploaded to OBS.

    Figure 5 Log uploaded
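    Note that the script calls subprocess.run() with the capture_output and text arguments, so it must be run with Python 3.7 or later, for example python3 <script file>.py.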
  5. View the uploaded log package in the directory where the script is stored.
    Figure 6 Viewing the result