Updated: 2026-02-06 GMT+08:00

Configuring Driver/Firmware Consistency and UDP Port Hashing on NPU Servers

Scenario

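The public operating system images for Lite Server Snt9b nodes and Snt9b23 supernodes are preconfigured with both driver/firmware consistency and UDP port hashing. The driver/firmware consistency configuration refreshes the firmware automatically after the server is shut down and restarted, which keeps the HDK driver in the operating system consistent with the firmware. The UDP port hashing configuration refreshes the uplink port settings of the parameter plane network automatically after a restart, to prevent congestion on the parameter plane network.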

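A private operating system that you build yourself has neither configuration, so it cannot refresh these settings automatically after the server is shut down and restarted. You are therefore advised to configure both driver/firmware consistency and UDP port hashing on your private operating system.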

Constraints

Driver/firmware consistency and UDP port hashing apply only to Lite Server Snt9b nodes and Snt9b23 supernodes.

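Both configurations depend on the bms-network-config service. Complete the cloud adaptation of the operating system and install the required software packages first. For details, see the BMS image creation documentation.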

Configuring Driver/Firmware Consistency

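Obtain the NPU driver and firmware packages from the official Huawei Support website, choosing the packages that match your server model and chip architecture. The following steps use HDK 25.2.1 on an Snt9b23 supernode (ARM64) as an example.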

Table 1 Driver and firmware version mapping

Type              Package name                                        Version
Driver package    Atlas-A3-hdk-npu-driver_25.2.1_linux-aarch64.run    25.2.1
Firmware package  Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run           7.7.0.9.220
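
The versions in Table 1 are the same strings that are embedded in the package file names and that the consistency script in this section stores as PRESET_SOFTWARE_VERSION and PRESET_FIRMWARE_VERSION. As a rough illustration only, the following Python sketch extracts the version from a package file name so the two can be compared before you edit the script. The helper name package_version and the regular expression are assumptions made for this example; they are not part of the HDK tooling.

```python
import re

# Hypothetical helper for illustration: pull the version out of an HDK package
# file name. Assumes the name looks like "<prefix>_<version>[_<platform>].run",
# which is the pattern of the two example packages in Table 1.
_VERSION_RE = re.compile(r"_(\d+(?:\.\d+)+)(?:_[^_]+)?\.run$")


def package_version(file_name):
    """Return the dotted version embedded in file_name, or None if absent."""
    match = _VERSION_RE.search(file_name)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in ("Atlas-A3-hdk-npu-driver_25.2.1_linux-aarch64.run",
                 "Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run"):
        print(name, "->", package_version(name))
```

If your package names follow a different pattern, adjust the expression rather than relying on this one.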

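  1. Upload the NPU driver and firmware packages that you downloaded to any directory on the Lite Server, for example by using Xftp or an OBS bucket tool.
  2. Log in to the Lite Server and go to the directory where the packages were uploaded.
  3. Install the NPU driver.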

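    Install the NPU driver by following the driver and firmware installation part of the guide to configuring the Lite Server software environment on NPU servers.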

  4. Run the following command to create the directory for the driver/firmware consistency configuration.
    mkdir -p /opt/huawei/firmware_check
  5. Run the following commands to create the driver/firmware consistency script.
    cd /opt/huawei/firmware_check
    touch firmware_check.sh
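  6. Run the following command to move the firmware package into the driver/firmware consistency directory and rename it. The source path is relative, so run the command from the directory that contains the uploaded firmware package.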
    mv Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run /opt/huawei/firmware_check/Ascend-hdk-npu-firmware_7.7.0.9.220.run

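    Note: If your firmware version differs from the example, or if you change versions later, update the version number in both package names in the command above. Keep the target name in the form Ascend-hdk-npu-firmware_<version>.run, because that is the file name the script in the next step looks for.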

  7. Open the script file firmware_check.sh and write the following content to it.
    Open the script file.
    vim firmware_check.sh
    Write the following content.
    #!/bin/bash
    HOME_DIR="/opt/huawei/firmware_check"
    MAX_LOG_LINES=500
    LOG_FILE="${HOME_DIR}/firmware_check.log"
    ASCEND_INSTALL_INFO="/etc/ascend_install.info"
    ASCEND_PATH="/usr/local/Ascend"
    MIN_DURATION=3600
    # *** Update the two values below if the driver or firmware version changes when you build the image ***
    PRESET_SOFTWARE_VERSION="25.2.1"
    PRESET_FIRMWARE_VERSION="7.7.0.9.220"
    function init_log() {
        if [ ! -f ${LOG_FILE} ]; then
            cat /dev/null > $LOG_FILE
            return
        fi
        line_count=$(wc -l < "${LOG_FILE}")
        if [ $line_count -gt $MAX_LOG_LINES ]; then
            tail -n "${MAX_LOG_LINES}" "${LOG_FILE}" > "${LOG_FILE}.tmp"
            mv "${LOG_FILE}.tmp" "${LOG_FILE}"
        fi
    }
    function log_info() {
        echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Info]$@" >> $LOG_FILE && echo -e "\033[32m[INFO]\033[0m: $@" > /dev/tty
    }
    function log_warning() {
        echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Warning]$@" >> $LOG_FILE && echo -e "\033[33m[WARN]\033[0m: $@" > /dev/tty
    }
    function log_error() {
        echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Error]$@" >> $LOG_FILE && echo -e "\033[31m[ERROR]\033[0m: $@" > /dev/tty
    }
    function get_param_from_config() {
        file=$1
        wanted=$2
        if [ ! -e $file ]; then
            log_error "File ${file} does not exist."
            return 1
        fi
        while IFS="=" read -r key val; do
            key=$(echo "$key" | tr -d '[:space:]')
            if [[ "$key" == "$wanted" ]]; then
                echo $val
                return 0
            fi
        done < "$file"
        return 1
    }
    function get_object_version() {
        install_info=$1
        install_path_key=$2
        default_install_path=$3
        object_type=$4
        install_path=$(get_param_from_config "${ASCEND_INSTALL_INFO}" "${install_path_key}")
        if [ $? -ne 0 ]; then
            log_warning "Failed to get value of ${install_path_key} from ${install_info}, use ${default_install_path}."
            install_path=${default_install_path}
        fi
        version_info_file="${install_path}/${object_type}/version.info"
        version=$(get_param_from_config "${version_info_file}" "Version")
        if [ $? -ne 0 ]; then
            log_error "Failed to get Version value from ${version_info_file}."
            return 1
        fi
        echo $version
        return 0
    }
    function get_versions_with_tool() {
        output=$(npu-smi info -t board -i ${1})
        if [ $? -ne 0 ]; then
            log_error "Run command 'npu-smi info -t board -i ${1}' error:\n$output"
            return 1
        fi
        software_version=$(echo "$output" | awk -F':' '/Software Version/ {print $2}' | tr -d '[:space:]')
        firmware_version=$(echo "$output" | awk -F':' '/Firmware Version/ {print $2}' | tr -d '[:space:]')
        if [ -n "${software_version}" ] && [ -n "${firmware_version}" ]; then
            echo "${software_version}|${firmware_version}"
            return 0
        fi
        return 1
    }
    function check_versoins() {
        software_version=$1
        firmware_version=$2
        if [ "${software_version}" != "${PRESET_SOFTWARE_VERSION}" ]; then
            log_warning "The current software version is not preset version '${PRESET_SOFTWARE_VERSION}', does not need to upgrade the firmware."
            echo "false"
            return
        fi
        if [ -n "${firmware_version}" ] && [ "${firmware_version}" == "${PRESET_FIRMWARE_VERSION}" ]; then
            log_info "The current firmware version match preset version '${PRESET_FIRMWARE_VERSION}', does not need to be upgraded."
            echo "false"
            return
        fi
        log_info "The current firmware '${firmware_version}' does not match preset version '${PRESET_FIRMWARE_VERSION}', need to be upgraded."
        echo "true"
        return
    }
    function upgrade_firmware() {
        firmware_version=$1
        firmware_package="${HOME_DIR}/Ascend-hdk-npu-firmware_${firmware_version}.run"
        if [ ! -e ${firmware_package} ]; then
            log_error "Firmware package ${firmware_package} does not exist."
            return 1
        fi
        bash ${firmware_package} --full --quiet >> $LOG_FILE 2>&1
        return $?
    }
    function upgrade_mcu() {
        bash ${HOME_DIR}/upgrade_mcu.sh ${HOME_DIR}/Ascend-hdk-mcu_${PRESET_MCU_VERSION}.hpm
    }
    function permit_check() {
        if [ ! -e ${TIME_TAG} ]; then
            log_info "First check of this machine, permit."
            return 0
        fi
        this_time=$(date +%s)
        rc=$?
        if [ $rc -ne 0 ]; then
            log_error "Failed to get current time, error code ${rc}."
            return 1
        fi
        last_time=$(stat -c %Y ${TIME_TAG})
        rc=$?
        if [ $rc -ne 0 ]; then
            log_error "Failed to get the timestamp of last check, error code ${rc}."
            return 1
        fi
        duration=$(expr $this_time - $last_time)
        if [ $? -ne 0 ]; then
            log_error "Failed to calculate duration, forbidden."
            return 1
        fi
        if [ $duration -lt ${MIN_DURATION} ]; then
            log_info "Duration less than ${MIN_DURATION} sec, forbidden."
            return 1
        fi
        log_info "Duration from this time to last time is ${duration} sec."
        return 0
    }
    function main() {
        log_info "Start to check the firmware."
        init_log
        permit_check
        if [ $? -ne 0 ]; then
            log_error "This check is too frequent, try later."
            return
        fi
        firmware_reboot="false"
        need_upgrade="false"
        for i in $npu_ids; do
            result=$(get_versions_with_tool ${i})
            if [ $? -eq 0 ]; then
                software_version=$(echo $result | awk -F'|' '{print $1}')
                firmware_version=$(echo $result | awk -F'|' '{print $2}')
                log_info "npu ${i} software version and firmware version from npu-smi are ${software_version} and ${firmware_version}."
            else
                log_warning "npu ${i} failed to get versions using npu-smi, try to read the software version from config file."
                if [ -z "${config_software_version}" ]; then
                    log_error "failed to get software version from config file."
                    return
                fi
                software_version=${config_software_version}
                log_info "npu ${i} software version from config is ${software_version}."
            fi
            need_upgrade=$(check_versoins "${software_version}" "${firmware_version}")
            if [ "${need_upgrade}" == "true" ]; then
                log_info "npu ${i} firmware need upgrade."
                break
            fi
        done
        if [ "${need_upgrade}" == "true" ]; then
            log_info "Start upgrading the firmware."
            upgrade_firmware ${PRESET_FIRMWARE_VERSION}
            if [ $? -ne 0 ]; then
                log_error "Failed to upgrade firmware."
            else
                log_info "The firmware upgrade to '${PRESET_FIRMWARE_VERSION}' succeeded, reboot now."
                firmware_reboot="true"
            fi
        fi
        if [[ "${firmware_reboot}" == "true" ]]; then
            log_info "Start to reboot."
            touch ${TIME_TAG}
            reboot
        fi
        log_info "The firmware check completed."
    }
    metadata=$(/usr/bin/curl http://169.254.169.254/openstack/latest/meta_data.json)
    rc=$?
    if [ $rc -ne 0 ]; then
        log_error "Failed to get metadata, error code ${rc}."
        exit
    fi
    if [ -z "${metadata}" ]; then
        log_error "Metadata is empty, abort."
        exit
    fi
    uuid=$(echo ${metadata} | grep -o '"uuid": "[^"]*' | sed 's/"uuid": "//')
    TIME_TAG="${HOME_DIR}/timetag${uuid}"
    config_software_version=$(get_object_version "${ASCEND_INSTALL_INFO}" "Driver_Install_Path_Param" "${ASCEND_PATH}" "driver")
    if [ $? -ne 0 ]; then
        log_error "Failed to get software version from config file."
    fi
    npu_ids=$(npu-smi info -l | awk '/NPU ID/{print $NF}')
    npu_count=$(echo "$npu_ids" | wc -l)
    main
    echo "Check $LOG_FILE for details."

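    Note: If your driver or firmware version differs from the example, or if you change versions later, update PRESET_SOFTWARE_VERSION and PRESET_FIRMWARE_VERSION in the script.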

  8. Run the following commands to make the firmware package and the script executable.
    cd /opt/huawei/firmware_check
    chmod 700 firmware_check.sh
    chmod 700 Ascend-hdk-npu-firmware_7.7.0.9.220.run
  9. Run the following commands to add a service that runs the driver/firmware consistency check at startup.
    cd /etc/systemd/system
    vim firmware_check.service

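    Write the following content. The unit is ordered after, and requires, config-hash.service, which is created in the UDP port hashing section of this document, so complete that section before you restart the server.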

    [Unit]
    Description=check and upgrade firmware
    After=config-hash.service
    Requires=config-hash.service
    [Service]
    Type=oneshot
    ExecStart=/opt/huawei/firmware_check/firmware_check.sh
    RemainAfterExit=yes
    User=root
    [Install]
    WantedBy=cloud-init.target

    Enable the service so that it starts at boot.

    systemctl daemon-reload
    systemctl enable firmware_check.service
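
The decision rules in firmware_check.sh can be hard to follow in shell. The following Python sketch restates two of them using the same example preset values: the firmware is reflashed only when the installed driver matches the preset driver version and the firmware differs from the preset firmware version, and a check is permitted only on the first run or when at least MIN_DURATION seconds have passed since the time tag file was last touched. The function names, and the use of plain arguments instead of npu-smi output, are choices made for this illustration; the shell script remains the reference.

```python
import os
import time

# Same example values as the shell script; change them together with the script.
PRESET_SOFTWARE_VERSION = "25.2.1"
PRESET_FIRMWARE_VERSION = "7.7.0.9.220"
MIN_DURATION = 3600  # seconds between two permitted checks


def needs_firmware_upgrade(software_version, firmware_version):
    """Restates check_versoins: upgrade only if the driver matches the preset
    version and the firmware (possibly empty) does not."""
    if software_version != PRESET_SOFTWARE_VERSION:
        return False
    return firmware_version != PRESET_FIRMWARE_VERSION


def check_permitted(time_tag_path, now=None):
    """Restates permit_check: allow the first run, then throttle by the
    modification time of the time tag file."""
    if not os.path.exists(time_tag_path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(time_tag_path) >= MIN_DURATION


if __name__ == "__main__":
    # Matching driver and firmware: nothing to do.
    print(needs_firmware_upgrade("25.2.1", "7.7.0.9.220"))
    # Matching driver, unknown firmware: an upgrade is needed.
    print(needs_firmware_upgrade("25.2.1", ""))
```

The throttle matters because the script reboots the server after a successful upgrade: the time tag is touched just before the reboot, so a node whose firmware still does not match is not upgraded and rebooted again within the next hour.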

Configuring UDP Port Hashing

  1. Run the following command to create the directory for the UDP port hashing configuration.
    mkdir -p /opt/huawei/port_config
  2. Run the following commands to create the hashing configuration files in that directory.
    cd /opt/huawei/port_config
    touch port_config.json
    touch uplink_hash_config.py
  3. Run the following command to open the hashing configuration script for editing.
    vim uplink_hash_config.py

    Write the following content.

    # -*- coding: UTF-8 -*-
    import base64
    import logging
    import time
    import json
    import shutil
    import os
    import requests
    try:
        import commands
    except ImportError:
        import subprocess as commands
    A2_NPU_COUNT = 8
    A3_NPU_COUNT = 16
    METADATA_URL = 'http://169.254.169.254/openstack/latest/meta_data.json'
    A3_32K_CLUSTER = "cluster4"
    A3_64K_CLUSTER = "cluster5"
    A3_128K_CLUSTER = "cluster6"
    A3_9866_CLUSTER = "cluster8"
    A3_9866_REGION_ID = "cn-guian02"
    log = logging.getLogger(__name__)
    class UplinkHashConfig:
        def __init__(self, config_dir="/opt/huawei/port_config/",
                     log_file="/opt/huawei/port_config/uplink_hash_config.log"):
            self.DIR = config_dir
            self.setup_log(log_file)
        def setup_log(self, log_file):
            handler = logging.FileHandler(log_file)
            formatter = logging.Formatter('%(asctime)s - %(filename)s[%(levelname)s]: %(message)s')
            handler.setFormatter(formatter)
            log.addHandler(handler)
            log.setLevel(logging.INFO)
        def read_url(self, url, timeout=None, retries=0, sec_between=1, check_status=True):
            manual_tries = 1
            if retries:
                manual_tries = max(int(retries) + 1, 1)
            if sec_between is None:
                sec_between = -1
            for i in range(0, manual_tries):
                try:
                    r = requests.get(url, timeout=timeout)
                    if check_status:
                        r.raise_for_status()
                    return r
                except requests.exceptions.RequestException as e:
                    if i + 1 < manual_tries and sec_between > 0:
                        log.warning("Get metadata failed, wait %s seconds to try again", sec_between)
                        time.sleep(sec_between)
            log.error("Get metadata failed, please check network and security group.")
            return None
        def get_meta_json(self, timeout=5, retries=5):
            try:
                resp = self.read_url(url=METADATA_URL,
                    timeout=timeout,
                    retries=retries)
                metadata = resp.json()
            except Exception as e:
                log.error(
                    "Get metadata failed. Error: %s", e)
                return False, None
            return True, metadata
        def get_npu_count(self, meta_json):
            hyperinstance_type = meta_json.get("meta", {}).get("_sys_hyperinstance_type")
            if hyperinstance_type:
                log.info("Node type is hyperinstance.")
                return A3_NPU_COUNT
            return A2_NPU_COUNT
        def get_config_url(self, region):
            if region == "cn-north-7":
                url = "https://cnnorth7-modelarts-sdk.obs.cn-north-7.ulanqab.huawei.com"
            elif region == "cn-north-9":
                url = "https://cnnorth9-modelarts-sdk.obs.cn-north-9.myhuaweicloud.com"
            elif region == "cn-south-1":
                url = "https://cnsouth1-modelarts-sdk.obs.cn-south-1.myhuaweicloud.com"
            elif region == "cn-east-3":
                url = "https://cneast3-modelarts-sdk.obs.cn-east-3.myhuaweicloud.com"
            elif region == "ap-southeast-1":
                url = "https://ap-southeast1-modelarts-sdk.obs.ap-southeast-1.myhuaweicloud.com"
            elif region == "cn-north-11":
                url = "https://cnnorth11-modelarts-sdk.obs.cn-north-11.myhuaweicloud.com"
            elif region == "cn-southwest-2":
                url = "https://cn-southwest-2-modelarts-sdk.obs.cn-southwest-2.myhuaweicloud.com"
            elif region == "cn-east-4":
                url = "https://cneast4-modelarts-sdk.obs.dualstack.cn-east-4.myhuaweicloud.com"
            elif region == "la-south-2":
                url = "https://la-south2-modelarts-sdk.obs.la-south-2.myhuaweicloud.com"
            elif region == "ap-southeast-3":
                url = "https://ap-southeast3-modelarts-sdk.obs.ap-southeast-3.myhuaweicloud.com"
            elif region == "me-east-1":
                url = "https://me-east-1-modelarts-sdk.obs.me-east-1.myhuaweicloud.com"
            else:
                url = "https://{0}-modelarts-sdk.obs.{1}.myhuaweicloud.com".format(region.replace("-", ""), region)
            return url + "/devserver/port_config.json"
        def download_file(self, url, destination):
            log.info("Downloading file from %s to %s", url, destination)
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                with open(destination, "wb") as f:
                    f.write(response.content)
                return True
            except requests.exceptions.RequestException as e:
                log.error("Failed to download file from %s: %s", url, str(e))
                return False
        def get_port_config(self, url):
            config_file = os.path.join(self.DIR, "port_config.json")
            backup_file = os.path.join(self.DIR, "port_config.json.bak")
            if url is None:
                log.warning("Url is none, use local config file.")
                try:
                    with open(config_file, "r") as f:
                        hash_config = json.load(f)
                    log.info("Loaded config from %s", config_file)
                    return hash_config
                except FileNotFoundError:
                    log.error("Config file %s not found.", config_file)
                    return None
                except json.JSONDecodeError:
                    log.error("Failed to decode JSON from %s", config_file)
                    return None
            try:
                shutil.copy2(config_file, backup_file)
                log.info("Backed up %s to %s", config_file, backup_file)
            except Exception as e:
                log.error("Failed to backup %s: %s", config_file, str(e))
                return None
            if self.download_file(url, config_file):
                try:
                    with open(config_file, "r") as f:
                        hash_config = json.load(f)
                    log.info("Loaded new config from %s", config_file)
                    os.remove(backup_file)
                    return hash_config
                except json.JSONDecodeError:
                    log.error("Failed to decode json from downloaded file %s, restored backup file.", config_file)
            else:
                log.warning("Download failed, restored local backup config file.")
            # Fall back to the backed-up local file. If it cannot be parsed
            # (for example, the empty file created by touch), return None so
            # that the caller switches to auto mode instead of crashing.
            shutil.copy2(backup_file, config_file)
            os.remove(backup_file)
            try:
                with open(config_file, "r") as f:
                    return json.load(f)
            except ValueError:
                log.error("Failed to decode JSON from local config file %s", config_file)
                return None
        def wait_until_npu_ready(self, npu_count):
            for _ in range(0, 10):
                count = 0
                for i in range(0, npu_count):
                    cmd = "hccn_tool -i {} -lldp -g | grep Ifname | awk -F ': ' '{{print $2}}'".format(i)
                    (status, output) = commands.getstatusoutput(cmd)
                    if not output:
                        log.error("result: get ifname failed, try again after 30s, id:%s", i)
                        time.sleep(30)
                        break
                    count += 1
                if count == npu_count:
                    log.info("result: get all ifname success.")
                    return
            log.warning("Failed to get ifname after 10 attempts.")
        def print_current_port(self, npu_count):
            for i in range(0, npu_count):
                cmd = "hccn_tool -i {} -udp -g".format(i)
                (status, output) = commands.getstatusoutput(cmd)
                log.info("port %s: %s", i, output)
        def config_udp_port_auto(self, npu_count):
            log.info("Configuring ports in auto mode.")
            for i in range(0, npu_count):
                cmd = "hccn_tool -i {} -udp -s auto".format(i)
                (status, _) = commands.getstatusoutput(cmd)
        def config_udp_port(self, port_config, flavor, npu_count, region_id):
            if port_config is None:
                self.config_udp_port_auto(npu_count)
                return
            cmd = "echo -n {} | sha256sum | awk '{{print $1}}'".format(flavor)
            (status, sha256_flavor) = commands.getstatusoutput(cmd)
            log.info("Sha256 flavor is %s", sha256_flavor)
            cur_cluster = self.get_cluster(port_config, sha256_flavor, region_id)
            if not cur_cluster:
                if A3_NPU_COUNT == npu_count:
                    log.warning("Get cluster from port_config failed, flavor is %s, set default value %s", flavor, A3_32K_CLUSTER)
                    cur_cluster = A3_32K_CLUSTER
                else:
                    log.error("Get cluster from port_config failed, flavor is %s", flavor)
                    self.config_udp_port_auto(npu_count)
                    return
            log.info("Cluster is %s", cur_cluster)
            desire_port_map = self.get_desire_port_map(cur_cluster, port_config)
            for i in range(0, npu_count):
                cmd = "hccn_tool -i {} -lldp -g | grep Ifname | awk -F ': ' '{{print $2}}'".format(i)
                (status, ifname_tmp) = commands.getstatusoutput(cmd)
                key = "{}|{}".format(i, ifname_tmp)
                desire_port = desire_port_map.get(key, None)
                if not desire_port:
                    log.error("Get desire port from port_config failed, id %s, ifname %s", i, ifname_tmp)
                    self.config_udp_port_auto(npu_count)
                    return
                cmd = "hccn_tool -i {} -udp -g | awk -F ':' '{{print $2}}' | grep -v auto".format(i)
                (status, current_port) = commands.getstatusoutput(cmd)
                if desire_port != current_port:
                    cmd = "hccn_tool -i {} -udp -s port {}".format(i, desire_port)
                    (status, _) = commands.getstatusoutput(cmd)
        def get_cluster(self, port_config, sha256_flavor, region_id):
            if region_id == A3_9866_REGION_ID:
                return A3_9866_CLUSTER
            return port_config.get("flavors", {}).get(sha256_flavor, None)
        def get_desire_port_map(self, cur_cluster, port_config):
            if cur_cluster == A3_32K_CLUSTER:
                return self.get_a3_desire_port_map(5120, 5183, 32)
            elif cur_cluster == A3_64K_CLUSTER:
                return self.get_a3_desire_port_map(5184, 5327, 36)
            elif cur_cluster == A3_128K_CLUSTER:
                return self.get_a3_desire_port_map(5184, 5471, 36)
            else:
                return port_config.get("clusters", {}).get(cur_cluster, {})
        def get_a3_desire_port_map(self, start, end, max_tor_port):
            udp_port_map = {}
            tor_ge_id = 1
            tor_port = 0
            udp_port = start
            while udp_port <= end:
                udp_port_map["6|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port)] = udp_port
                udp_port_map["14|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port)] = udp_port
                udp_port_map["4|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 1)] = udp_port + 1
                udp_port_map["12|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 1)] = udp_port + 1
                udp_port_map["2|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 2)] = udp_port + 2
                udp_port_map["10|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 2)] = udp_port + 2
                udp_port_map["0|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 3)] = udp_port + 3
                udp_port_map["8|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 3)] = udp_port + 3
                udp_port_map["7|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port)] = udp_port
                udp_port_map["15|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port)] = udp_port
                udp_port_map["5|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 1)] = udp_port + 1
                udp_port_map["13|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 1)] = udp_port + 1
                udp_port_map["3|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 2)] = udp_port + 2
                udp_port_map["11|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 2)] = udp_port + 2
                udp_port_map["1|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 3)] = udp_port + 3
                udp_port_map["9|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 3)] = udp_port + 3
                udp_port += 4
                tor_port += 4
                if tor_port >= max_tor_port:
                    tor_port = 0
                    tor_ge_id += 1
            return udp_port_map
        def run(self):
            ret, meta_json = self.get_meta_json()
            if not ret:
                log.error("Get meta_json failed, metadata json is %s", meta_json)
                return
            log.info("Get meta_json success, metadata json is %s", meta_json)
            region_id = meta_json.get('region_id', None)
            if not region_id:
                log.error("Get region_id from metadata failed.")
                return
            log.info("Region is %s", region_id)
            flavor = meta_json.get('instance_type', None)
            if not flavor:
                log.error("Get instance_type from metadata failed.")
                return
            log.info("Flavor is %s", flavor)
            npu_count = self.get_npu_count(meta_json)
            log.info("Npu count is %s", npu_count)
            url = self.get_config_url(region_id)
            port_config = self.get_port_config(url)
            self.wait_until_npu_ready(npu_count)
            log.info("Before config uplink udp hash, Port is:")
            self.print_current_port(npu_count)
            log.info("=====================================================")
            self.config_udp_port(port_config, flavor, npu_count, region_id)
            log.info("Config uplink udp hash done. Port is:")
            self.print_current_port(npu_count)
    if __name__ == "__main__":
        configurator = UplinkHashConfig()
        configurator.run()
  4. Run the following commands to make the script executable.
    cd /opt/huawei/port_config
    chmod +x uplink_hash_config.py
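  5. Run the following command to execute the script once, which fetches the latest configuration file for the current region. The command uses python3, the same interpreter as the startup service.
    python3 uplink_hash_config.py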
  6. Run the following commands to add a service that applies the hashing configuration at startup.
    cd /etc/systemd/system
    vim config-hash.service

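    Write the following content. Older systemd releases accept only an absolute path as the first ExecStart argument; if the unit is rejected for that reason, replace python3 with its full path (for example, the output of command -v python3).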

    [Unit]
    Description=Run uplink hash config
    After=bms-network-config.service
    Requires=bms-network-config.service
    [Service]
    Type=oneshot
    ExecStart=python3 /opt/huawei/port_config/uplink_hash_config.py
    RemainAfterExit=yes
    User=root
    [Install]
    WantedBy=cloud-init.target

    Enable the service so that it starts at boot.

    cd /etc/systemd/system
    systemctl daemon-reload
    systemctl enable config-hash.service
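
Two parts of uplink_hash_config.py are easier to follow in isolation: how the flavor name becomes the lookup key into the "flavors" section of port_config.json, and how the port map for the built-in A3 clusters is generated. The Python sketch below is an illustrative restatement, not a replacement for the script. The names flavor_key and a3_port_map are introduced here, and hashlib is used in place of the script's echo -n ... | sha256sum pipeline.

```python
import hashlib


def flavor_key(flavor):
    """Pure-Python equivalent of: echo -n <flavor> | sha256sum | awk '{print $1}'"""
    return hashlib.sha256(flavor.encode("utf-8")).hexdigest()


def a3_port_map(start, end, max_tor_port):
    """Compact restatement of get_a3_desire_port_map.

    Keys have the form "<npu id>|400GE<switch>/0/<port>:<lane>" and values are
    UDP ports. Each step of four UDP ports covers all 16 NPUs: NPUs (6, 7),
    (4, 5), (2, 3), (0, 1) use lane 1 on four consecutive switch ports, and
    NPUs 8 higher (14, 15, ...) use lane 2 on the same ports.
    """
    port_map = {}
    tor_ge_id, tor_port, udp_port = 1, 0, start
    while udp_port <= end:
        for offset in range(4):
            low = 6 - 2 * offset
            for npu in (low, low + 1):
                lane1 = "{}|400GE{}/0/{}:1".format(npu, tor_ge_id, tor_port + offset)
                lane2 = "{}|400GE{}/0/{}:2".format(npu + 8, tor_ge_id, tor_port + offset)
                port_map[lane1] = udp_port + offset
                port_map[lane2] = udp_port + offset
        udp_port += 4
        tor_port += 4
        if tor_port >= max_tor_port:
            # Switch port range exhausted: continue on the next 400GE slot.
            tor_port = 0
            tor_ge_id += 1
    return port_map


if __name__ == "__main__":
    # Range used by the script for A3_32K_CLUSTER ("cluster4").
    cluster4 = a3_port_map(5120, 5183, 32)
    print(len(cluster4), "entries")
    print(cluster4["6|400GE1/0/0:1"])
```

At run time the script builds the same kind of key from the NPU index and the Ifname reported by hccn_tool -i <id> -lldp -g, and falls back to auto mode when a key is missing from the map.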

Verification

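To verify that the driver/firmware consistency and UDP port hashing configurations have taken effect, restart the server. For details, see Restarting a Lite Server.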

Note: After a firmware upgrade, the server restarts automatically so that the new firmware takes effect.

  1. Verify the driver and firmware versions.

    Run the following command to view the NPU driver and firmware versions.

    ascend-dmi -c

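    If the firmware version matches the expected version, driver/firmware consistency is configured successfully.

  2. Verify the UDP port hashing configuration.

    Run the following commands to view the uplink ports configured for the parameter plane network.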

    npu_ids=$(npu-smi info -l | awk '/NPU ID/{print $NF}')
    for i in $npu_ids; do hccn_tool -i $i -udp -g;done

    If, in the command output, udp_port is not Unknown and status is custom, the configuration is successful.

Cleaning Up and Saving the Image

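  1. Run the following commands to remove the logs of the driver/firmware consistency check and UDP port hashing, together with the time tag files.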
    rm -rf /opt/huawei/port_config/uplink_hash_config.log
    rm -rf /opt/huawei/firmware_check/firmware_check.log
    rm -rf /opt/huawei/firmware_check/time*
  2. Clear the remaining traces and create a new image. For details, see Creating a Lite Server Operating System Image.
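
If you script image builds, the log cleanup in step 1 can be expressed as a small function. The following Python sketch is one possible way to do it, under the assumption that the files live under /opt/huawei as in this document. The function name cleanup_logs and the root parameter are introduced here for illustration and testing. Unlike rm -rf, the sketch removes only regular files.

```python
import glob
import os


def cleanup_logs(root="/opt/huawei"):
    """Remove the two log files and any time tag files; return the removed paths."""
    patterns = [
        os.path.join(root, "port_config", "uplink_hash_config.log"),
        os.path.join(root, "firmware_check", "firmware_check.log"),
        # Time tag files are named timetag<uuid> by firmware_check.sh.
        os.path.join(root, "firmware_check", "time*"),
    ]
    removed = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            if os.path.isfile(path):
                os.remove(path)
                removed.append(path)
    return removed


if __name__ == "__main__":
    for path in cleanup_logs():
        print("removed", path)
```

Files that do not exist are skipped silently, so the function can be run more than once.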
