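Configuring Driver-Firmware Consistency and UDP Port Hashing on NPU Servers
Scenario
The public operating systems for Lite Server Snt9b and supernode Snt9b23 are preconfigured with driver-firmware consistency and UDP port hashing. The driver-firmware consistency configuration automatically refreshes the firmware after the server is shut down and restarted, which keeps the HDK driver in the operating system consistent with the firmware. The UDP port hashing configuration automatically refreshes the uplink port settings of the parameter-plane network after a restart, so that the parameter-plane network does not become congested.
A private operating system that you build yourself has neither configuration and therefore cannot refresh the firmware or the uplink port settings automatically after a restart. You are advised to configure both on your private operating system.
Constraints
Driver-firmware consistency and UDP port hashing apply only to Lite Server Snt9b and supernode Snt9b23.
Both configurations depend on the bms-network-config service as a prerequisite. Complete the cloud adaptation of the operating system and install the required software packages first; see the BMS image creation documentation.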
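Configuring Driver-Firmware Consistency
Go to the official Huawei Support website and obtain the NPU driver package and firmware package that match your server model and chip architecture. The following steps use HDK 25.2.1 on a supernode Snt9b23 (ARM64) as an example.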
| Type | Package name | Version |
|---|---|---|
| Driver package | Atlas-A3-hdk-npu-driver_25.2.1_linux-aarch64.run | 25.2.1 |
| Firmware package | Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run | 7.7.0.9.220 |
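- Upload the downloaded NPU driver package and firmware package to any directory on the Lite Server, for example with Xftp or an OBS bucket tool.
- Log in to the Lite Server and go to the directory that contains the uploaded packages.
- Install the NPU driver.
Follow the driver and firmware installation part of "Configuring the Software Environment of Lite Server Resources on NPU Servers" to install the NPU driver.
- Run the following command to create the directory for the driver-firmware consistency configuration.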
mkdir -p /opt/huawei/firmware_check
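- Run the following commands to create the driver-firmware consistency script.
cd /opt/huawei/firmware_check
touch firmware_check.sh
- From the directory that contains the uploaded packages, run the following command to move the firmware package to the driver-firmware consistency directory and rename it.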
mv Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run /opt/huawei/firmware_check/Ascend-hdk-npu-firmware_7.7.0.9.220.run
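Note: If your firmware version differs from the example, or if you change versions later, update the version number in the package names in the preceding command accordingly.
- Open firmware_check.sh and write the following content to it.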
#!/bin/bash

HOME_DIR="/opt/huawei/firmware_check"
MAX_LOG_LINES=500
LOG_FILE="${HOME_DIR}/firmware_check.log"
ASCEND_INSTALL_INFO="/etc/ascend_install.info"
ASCEND_PATH="/usr/local/Ascend"
MIN_DURATION=3600

# *** Update these values if the driver or firmware version changes when you build the image ***
PRESET_SOFTWARE_VERSION="25.2.1"
PRESET_FIRMWARE_VERSION="7.7.0.9.220"

# Create the log file if it is missing; otherwise keep only the last MAX_LOG_LINES lines.
function init_log() {
    if [ ! -f ${LOG_FILE} ]; then
        cat /dev/null > $LOG_FILE
        return
    fi
    line_count=$(wc -l < "${LOG_FILE}")
    if [ $line_count -gt $MAX_LOG_LINES ]; then
        tail -n "${MAX_LOG_LINES}" "${LOG_FILE}" > "${LOG_FILE}.tmp"
        mv "${LOG_FILE}.tmp" "${LOG_FILE}"
    fi
}

# Each log helper appends to LOG_FILE and echoes a colored line to the terminal.
function log_info() {
    echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Info]$@" >> $LOG_FILE && echo -e "\033[32m[INFO]\033[0m: $@" > /dev/tty
}

function log_warning() {
    echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Warning]$@" >> $LOG_FILE && echo -e "\033[33m[WARN]\033[0m: $@" > /dev/tty
}

function log_error() {
    echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Error]$@" >> $LOG_FILE && echo -e "\033[31m[ERROR]\033[0m: $@" > /dev/tty
}

# Print the value of key $2 from the key=value file $1.
function get_param_from_config() {
    file=$1
    wanted=$2
    if [ ! -e $file ]; then
        log_error "File ${file} does not exist."
        return 1
    fi
    while IFS="=" read -r key val; do
        key=$(echo "$key" | tr -d '[:space:]')
        if [[ "$key" == "$wanted" ]]; then
            echo $val
            return 0
        fi
    done < "$file"
    return 1
}

# Print the Version value from <install path>/<object type>/version.info.
function get_object_version() {
    install_info=$1
    install_path_key=$2
    default_install_path=$3
    object_type=$4
    install_path=$(get_param_from_config "${install_info}" "${install_path_key}")
    if [ $? -ne 0 ]; then
        log_warning "Failed to get value of ${install_path_key} from ${install_info}, use ${default_install_path}."
        install_path=${default_install_path}
    fi
    version_info_file="${install_path}/${object_type}/version.info"
    version=$(get_param_from_config "${version_info_file}" "Version")
    if [ $? -ne 0 ]; then
        log_error "Failed to get Version value from ${version_info_file}."
        return 1
    fi
    echo $version
    return 0
}

# Print "<software version>|<firmware version>" reported by npu-smi for NPU $1.
function get_versions_with_tool() {
    output=$(npu-smi info -t board -i ${1})
    if [ $? -ne 0 ]; then
        log_error "Run command 'npu-smi info -t board -i ${1}' error:\n$output"
        return 1
    fi
    software_version=$(echo "$output" | awk -F':' '/Software Version/ {print $2}' | tr -d '[:space:]')
    firmware_version=$(echo "$output" | awk -F':' '/Firmware Version/ {print $2}' | tr -d '[:space:]')
    if [ -n "${software_version}" ] && [ -n "${firmware_version}" ]; then
        echo "${software_version}|${firmware_version}"
        return 0
    fi
    return 1
}

# Print "true" only when the driver matches the preset version and the firmware does not.
function check_versions() {
    software_version=$1
    firmware_version=$2
    if [ "${software_version}" != "${PRESET_SOFTWARE_VERSION}" ]; then
        log_warning "The current software version is not preset version '${PRESET_SOFTWARE_VERSION}', does not need to upgrade the firmware."
        echo "false"
        return
    fi
    if [ -n "${firmware_version}" ] && [ "${firmware_version}" == "${PRESET_FIRMWARE_VERSION}" ]; then
        log_info "The current firmware version match preset version '${PRESET_FIRMWARE_VERSION}', does not need to be upgraded."
        echo "false"
        return
    fi
    log_info "The current firmware '${firmware_version}' does not match preset version '${PRESET_FIRMWARE_VERSION}', need to be upgraded."
    echo "true"
    return
}

function upgrade_firmware() {
    firmware_version=$1
    firmware_package="${HOME_DIR}/Ascend-hdk-npu-firmware_${firmware_version}.run"
    if [ ! -e ${firmware_package} ]; then
        log_error "Firmware package ${firmware_package} does not exist."
        return 1
    fi
    bash ${firmware_package} --full --quiet >> $LOG_FILE 2>&1
    return $?
}

# Not called by main. PRESET_MCU_VERSION and upgrade_mcu.sh are not defined in this document.
function upgrade_mcu() {
    bash ${HOME_DIR}/upgrade_mcu.sh ${HOME_DIR}/Ascend-hdk-mcu_${PRESET_MCU_VERSION}.hpm
}

# Allow a check only if at least MIN_DURATION seconds have passed since the last upgrade reboot.
function permit_check() {
    if [ ! -e ${TIME_TAG} ]; then
        log_info "First check of this machine, permit."
        return 0
    fi
    this_time=$(date +%s)
    rc=$?
    if [ $rc -ne 0 ]; then
        log_error "Failed to get current time, error code ${rc}."
        return 1
    fi
    last_time=$(stat -c %Y ${TIME_TAG})
    rc=$?
    if [ $rc -ne 0 ]; then
        log_error "Failed to get the timestamp of last check, error code ${rc}."
        return 1
    fi
    duration=$(expr $this_time - $last_time)
    if [ $? -ne 0 ]; then
        log_error "Failed to calculate duration, forbidden."
        return 1
    fi
    if [ $duration -lt ${MIN_DURATION} ]; then
        log_info "Duration less than ${MIN_DURATION} sec, forbidden."
        return 1
    fi
    log_info "Duration from this time to last time is ${duration} sec."
    return 0
}

function main() {
    log_info "Start to check the firmware."
    init_log
    permit_check
    if [ $? -ne 0 ]; then
        log_error "This check is too frequent, try later."
        return
    fi
    firmware_reboot="false"
    need_upgrade="false"
    for i in $npu_ids; do
        result=$(get_versions_with_tool ${i})
        if [ $? -eq 0 ]; then
            software_version=$(echo $result | awk -F'|' '{print $1}')
            firmware_version=$(echo $result | awk -F'|' '{print $2}')
            log_info "npu ${i} software version and firmware version from npu-smi are ${software_version} and ${firmware_version}."
        else
            log_warning "npu ${i} failed to get versions using npu-smi, try to read the software version from config file."
            if [ -z "${config_software_version}" ]; then
                log_error "failed to get software version from config file."
                return
            fi
            software_version=${config_software_version}
            log_info "npu ${i} software version from config is ${software_version}."
        fi
        need_upgrade=$(check_versions "${software_version}" "${firmware_version}")
        if [ "${need_upgrade}" == "true" ]; then
            log_info "npu ${i} firmware need upgrade."
            break
        fi
    done
    if [ "${need_upgrade}" == "true" ]; then
        log_info "Start upgrading the firmware."
        upgrade_firmware ${PRESET_FIRMWARE_VERSION}
        if [ $? -ne 0 ]; then
            log_error "Failed to upgrade firmware."
        else
            log_info "The firmware upgrade to '${PRESET_FIRMWARE_VERSION}' succeeded, reboot now."
            firmware_reboot="true"
        fi
    fi
    if [[ "${firmware_reboot}" == "true" ]]; then
        log_info "Start to reboot."
        touch ${TIME_TAG}
        reboot
    fi
    log_info "The firmware check completed."
}

metadata=$(/usr/bin/curl http://169.254.169.254/openstack/latest/meta_data.json)
rc=$?
if [ $rc -ne 0 ]; then
    log_error "Failed to get metadata, error code ${rc}."
    exit
fi
if [ -z "${metadata}" ]; then
    log_error "Metadata is empty, abort."
    exit
fi
uuid=$(echo ${metadata} | grep -o '"uuid": "[^"]*' | sed 's/"uuid": "//')
TIME_TAG="${HOME_DIR}/timetag${uuid}"
config_software_version=$(get_object_version "${ASCEND_INSTALL_INFO}" "Driver_Install_Path_Param" "${ASCEND_PATH}" "driver")
if [ $? -ne 0 ]; then
    log_error "Failed to get software version from config file."
fi
npu_ids=$(npu-smi info -l | awk '/NPU ID/{print $NF}')
npu_count=$(echo "$npu_ids" | wc -l)
main
echo "Check $LOG_FILE for details."
Note: If your driver and firmware versions differ from the example, or if you change versions later, update the PRESET_SOFTWARE_VERSION and PRESET_FIRMWARE_VERSION settings in the script.

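- Run the following commands to make the firmware package and the script executable.
cd /opt/huawei/firmware_check
chmod 700 firmware_check.sh
chmod 700 Ascend-hdk-npu-firmware_7.7.0.9.220.run
- Run the following commands to add a service that runs the driver-firmware consistency check at boot.
cd /etc/systemd/system
vim firmware_check.service
Write the following content to the file.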
[Unit]
Description=check and upgrade firmware
After=config-hash.service
Requires=config-hash.service

[Service]
Type=oneshot
ExecStart=/opt/huawei/firmware_check/firmware_check.sh
RemainAfterExit=yes
User=root

[Install]
WantedBy=cloud-init.target
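Enable the service so that it starts at boot. The unit requires config-hash.service, which is created in the UDP port hashing section that follows.
systemctl daemon-reload
systemctl enable firmware_check.service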
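The upgrade decision that firmware_check.sh makes at each boot can be summarized in a few lines. The following Python sketch is only an illustration of that logic, not part of the procedure: the function names are hypothetical, the preset values are copied from the example script, and the version "7.5.0.1.129" is a made-up sample value.

```python
import posixpath

# Values copied from the example script; adjust them to your own versions.
PRESET_SOFTWARE_VERSION = "25.2.1"
PRESET_FIRMWARE_VERSION = "7.7.0.9.220"
HOME_DIR = "/opt/huawei/firmware_check"


def needs_upgrade(software_version, firmware_version):
    """Return True when the firmware should be re-flashed (mirrors check_versions)."""
    # The script skips the upgrade when the driver is not the preset version.
    if software_version != PRESET_SOFTWARE_VERSION:
        return False
    # Firmware already matches the preset version: nothing to do.
    if firmware_version and firmware_version == PRESET_FIRMWARE_VERSION:
        return False
    # Preset driver with a different (or empty) firmware version.
    return True


def firmware_package_path(firmware_version):
    """Path where the script expects the renamed firmware package."""
    name = "Ascend-hdk-npu-firmware_{}.run".format(firmware_version)
    return posixpath.join(HOME_DIR, name)


def any_npu_needs_upgrade(versions):
    """versions: iterable of (software_version, firmware_version), one pair per NPU."""
    return any(needs_upgrade(sw, fw) for sw, fw in versions)


if __name__ == "__main__":
    # Hypothetical readings for two NPUs; the second one reports older firmware.
    reported = [("25.2.1", "7.7.0.9.220"), ("25.2.1", "7.5.0.1.129")]
    if any_npu_needs_upgrade(reported):
        print("would run:", firmware_package_path(PRESET_FIRMWARE_VERSION))
```

The path helper shows why the firmware package has to be renamed to the Ascend-hdk-npu-firmware_<version>.run pattern: the script builds the package path from the preset firmware version and reports an error if no file exists under that name.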
Configuring UDP Port Hashing
- Run the following command to create the directory for the UDP port hashing configuration.
mkdir -p /opt/huawei/port_config
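- Run the following commands to create the configuration file and the script file for port hashing.
cd /opt/huawei/port_config
touch port_config.json
touch uplink_hash_config.py
- Run the following command to edit the port hashing script.
vim uplink_hash_config.py
Write the following content to the file.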
# -*- coding: UTF-8 -*-
import base64
import logging
import time
import json
import shutil
import os
import requests

try:
    import commands
except ImportError:
    import subprocess as commands

A2_NPU_COUNT = 8
A3_NPU_COUNT = 16
METADATA_URL = 'http://169.254.169.254/openstack/latest/meta_data.json'
A3_32K_CLUSTER = "cluster4"
A3_64K_CLUSTER = "cluster5"
A3_128K_CLUSTER = "cluster6"
A3_9866_CLUSTER = "cluster8"
A3_9866_REGION_ID = "cn-guian02"

log = logging.getLogger(__name__)


class UplinkHashConfig:
    def __init__(self, config_dir="/opt/huawei/port_config/",
                 log_file="/opt/huawei/port_config/uplink_hash_config.log"):
        self.DIR = config_dir
        self.setup_log(log_file)

    def setup_log(self, log_file):
        handler = logging.FileHandler(log_file)
        formatter = logging.Formatter('%(asctime)s - %(filename)s[%(levelname)s]: %(message)s')
        handler.setFormatter(formatter)
        log.addHandler(handler)
        log.setLevel(logging.INFO)

    def read_url(self, url, timeout=None, retries=0, sec_between=1, check_status=True):
        # Try once, plus `retries` more times, sleeping sec_between seconds between attempts.
        manual_tries = 1
        if retries:
            manual_tries = max(int(retries) + 1, 1)
        if sec_between is None:
            sec_between = -1
        for i in range(0, manual_tries):
            try:
                r = requests.get(url, timeout=timeout)
                if check_status:
                    r.raise_for_status()
                return r
            except requests.exceptions.RequestException as e:
                if i + 1 < manual_tries and sec_between > 0:
                    log.warning("Get metadata failed, wait %s seconds to try again", sec_between)
                    time.sleep(sec_between)
        log.error("Get metadata failed, please check network and security group.")
        return None

    def get_meta_json(self, timeout=5, retries=5):
        try:
            resp = self.read_url(url=METADATA_URL, timeout=timeout, retries=retries)
            metadata = resp.json()
        except Exception as e:
            log.error("Get metadata failed. Error: %s", e)
            return False, None
        return True, metadata

    def get_npu_count(self, meta_json):
        hyperinstance_type = meta_json.get("meta", {}).get("_sys_hyperinstance_type")
        if hyperinstance_type:
            log.info("Node type is hyperinstance.")
            return A3_NPU_COUNT
        return A2_NPU_COUNT

    def get_config_url(self, region):
        if region == "cn-north-7":
            url = "https://cnnorth7-modelarts-sdk.obs.cn-north-7.ulanqab.huawei.com"
        elif region == "cn-north-9":
            url = "https://cnnorth9-modelarts-sdk.obs.cn-north-9.myhuaweicloud.com"
        elif region == "cn-south-1":
            url = "https://cnsouth1-modelarts-sdk.obs.cn-south-1.myhuaweicloud.com"
        elif region == "cn-east-3":
            url = "https://cneast3-modelarts-sdk.obs.cn-east-3.myhuaweicloud.com"
        elif region == "ap-southeast-1":
            url = "https://ap-southeast1-modelarts-sdk.obs.ap-southeast-1.myhuaweicloud.com"
        elif region == "cn-north-11":
            url = "https://cnnorth11-modelarts-sdk.obs.cn-north-11.myhuaweicloud.com"
        elif region == "cn-southwest-2":
            url = "https://cn-southwest-2-modelarts-sdk.obs.cn-southwest-2.myhuaweicloud.com"
        elif region == "cn-east-4":
            url = "https://cneast4-modelarts-sdk.obs.dualstack.cn-east-4.myhuaweicloud.com"
        elif region == "la-south-2":
            url = "https://la-south2-modelarts-sdk.obs.la-south-2.myhuaweicloud.com"
        elif region == "ap-southeast-3":
            url = "https://ap-southeast3-modelarts-sdk.obs.ap-southeast-3.myhuaweicloud.com"
        elif region == "me-east-1":
            url = "https://me-east-1-modelarts-sdk.obs.me-east-1.myhuaweicloud.com"
        else:
            url = "https://{0}-modelarts-sdk.obs.{1}.myhuaweicloud.com".format(region.replace("-", ""), region)
        return url + "/devserver/port_config.json"

    def download_file(self, url, destination):
        log.info("Downloading file from %s to %s", url, destination)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            with open(destination, "wb") as f:
                f.write(response.content)
            return True
        except requests.exceptions.RequestException as e:
            log.error("Failed to download file from %s: %s", url, str(e))
            return False

    def get_port_config(self, url):
        config_file = os.path.join(self.DIR, "port_config.json")
        backup_file = os.path.join(self.DIR, "port_config.json.bak")
        if url is None:
            log.warning("Url is none, use local config file.")
            try:
                with open(config_file, "r") as f:
                    hash_config = json.load(f)
                log.info("Loaded config from %s", config_file)
                return hash_config
            except FileNotFoundError:
                log.error("Config file %s not found.", config_file)
                return None
            except json.JSONDecodeError:
                log.error("Failed to decode JSON from %s", config_file)
                return None
        # Back up the local file so it can be restored if the download or parsing fails.
        try:
            shutil.copy2(config_file, backup_file)
            log.info("Backed up %s to %s", config_file, backup_file)
        except Exception as e:
            log.error("Failed to backup %s: %s", config_file, str(e))
            return None
        if self.download_file(url, config_file):
            try:
                with open(config_file, "r") as f:
                    hash_config = json.load(f)
                log.info("Loaded new config from %s", config_file)
                os.remove(backup_file)
                return hash_config
            except json.JSONDecodeError:
                log.error("Failed to decode json from downloaded file %s, restored backup file.", config_file)
                shutil.copy2(backup_file, config_file)
                os.remove(backup_file)
                with open(config_file, "r") as f:
                    hash_config = json.load(f)
                return hash_config
        else:
            log.warning("Download failed, restored local backup config file.")
            shutil.copy2(backup_file, config_file)
            os.remove(backup_file)
            with open(config_file, "r") as f:
                hash_config = json.load(f)
            return hash_config

    def wait_until_npu_ready(self, npu_count):
        # Up to 10 rounds; a round succeeds when every NPU reports an LLDP Ifname.
        for _ in range(0, 10):
            count = 0
            for i in range(0, npu_count):
                cmd = "hccn_tool -i {} -lldp -g | grep Ifname | awk -F ': ' '{{print $2}}'".format(i)
                (status, output) = commands.getstatusoutput(cmd)
                if not output:
                    log.error("result: get ifname failed, try again after 30s, id:%s", i)
                    time.sleep(30)
                    break
                count += 1
            if count == npu_count:
                log.info("result: get all ifname success.")
                return
        log.warning("Failed to get ifname after 10 attempts.")

    def print_current_port(self, npu_count):
        for i in range(0, npu_count):
            cmd = "hccn_tool -i {} -udp -g".format(i)
            (status, output) = commands.getstatusoutput(cmd)
            log.info("port %s: %s", i, output)

    def config_udp_port_auto(self, npu_count):
        log.info("Configuring ports in auto mode.")
        for i in range(0, npu_count):
            cmd = "hccn_tool -i {} -udp -s auto".format(i)
            (status, _) = commands.getstatusoutput(cmd)

    def config_udp_port(self, port_config, flavor, npu_count, region_id):
        if port_config is None:
            self.config_udp_port_auto(npu_count)
            return
        # The config file is keyed by the SHA-256 digest of the flavor name.
        cmd = "echo -n {} | sha256sum | awk '{{print $1}}'".format(flavor)
        (status, sha256_flavor) = commands.getstatusoutput(cmd)
        log.info("Sha256 flavor is %s", sha256_flavor)
        cur_cluster = self.get_cluster(port_config, sha256_flavor, region_id)
        if not cur_cluster:
            if A3_NPU_COUNT == npu_count:
                log.warning("Get cluster from port_config failed, flavor is %s, set default value %s",
                            flavor, A3_32K_CLUSTER)
                cur_cluster = A3_32K_CLUSTER
            else:
                log.error("Get cluster from port_config failed, flavor is %s", flavor)
                self.config_udp_port_auto(npu_count)
                return
        log.info("Cluster is %s", cur_cluster)
        desire_port_map = self.get_desire_port_map(cur_cluster, port_config)
        for i in range(0, npu_count):
            cmd = "hccn_tool -i {} -lldp -g | grep Ifname | awk -F ': ' '{{print $2}}'".format(i)
            (status, ifname_tmp) = commands.getstatusoutput(cmd)
            key = "{}|{}".format(i, ifname_tmp)
            desire_port = desire_port_map.get(key, None)
            if not desire_port:
                log.error("Get desire port from port_config failed, id %s, ifname %s", i, ifname_tmp)
                self.config_udp_port_auto(npu_count)
                return
            cmd = "hccn_tool -i {} -udp -g | awk -F ':' '{{print $2}}' | grep -v auto".format(i)
            (status, current_port) = commands.getstatusoutput(cmd)
            if desire_port != current_port:
                cmd = "hccn_tool -i {} -udp -s port {}".format(i, desire_port)
                (status, _) = commands.getstatusoutput(cmd)

    def get_cluster(self, port_config, sha256_flavor, region_id):
        if region_id == A3_9866_REGION_ID:
            return A3_9866_CLUSTER
        return port_config.get("flavors", {}).get(sha256_flavor, None)

    def get_desire_port_map(self, cur_cluster, port_config):
        if cur_cluster == A3_32K_CLUSTER:
            return self.get_a3_desire_port_map(5120, 5183, 32)
        elif cur_cluster == A3_64K_CLUSTER:
            return self.get_a3_desire_port_map(5184, 5327, 36)
        elif cur_cluster == A3_128K_CLUSTER:
            return self.get_a3_desire_port_map(5184, 5471, 36)
        else:
            return port_config.get("clusters", {}).get(cur_cluster, {})

    def get_a3_desire_port_map(self, start, end, max_tor_port):
        # Keys are "<npu id>|<LLDP ifname>"; values are the UDP ports to set.
        udp_port_map = {}
        tor_ge_id = 1
        tor_port = 0
        udp_port = start
        while udp_port <= end:
            udp_port_map["6|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["14|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["4|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["12|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["2|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["10|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["0|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port_map["8|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port_map["7|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["15|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["5|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["13|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["3|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["11|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["1|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port_map["9|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port += 4
            tor_port += 4
            if tor_port >= max_tor_port:
                tor_port = 0
                tor_ge_id += 1
        return udp_port_map

    def run(self):
        ret, meta_json = self.get_meta_json()
        if not ret:
            log.error("Get meta_json failed, metadata json is %s", meta_json)
            return
        log.info("Get meta_json success, metadata json is %s", meta_json)
        region_id = meta_json.get('region_id', None)
        if not region_id:
            log.error("Get region_id from metadata failed.")
            return
        log.info("Region is %s", region_id)
        flavor = meta_json.get('instance_type', None)
        if not flavor:
            log.error("Get instance_type from metadata failed.")
            return
        log.info("Flavor is %s", flavor)
        npu_count = self.get_npu_count(meta_json)
        log.info("Npu count is %s", npu_count)
        url = self.get_config_url(region_id)
        port_config = self.get_port_config(url)
        self.wait_until_npu_ready(npu_count)
        log.info("Before config uplink udp hash, Port is:")
        self.print_current_port(npu_count)
        log.info("=====================================================")
        self.config_udp_port(port_config, flavor, npu_count, region_id)
        log.info("Config uplink udp hash done. Port is:")
        self.print_current_port(npu_count)


if __name__ == "__main__":
    configurator = UplinkHashConfig()
    configurator.run()
- Run the following commands to make the script executable.
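cd /opt/huawei/port_config
chmod +x uplink_hash_config.py
- Run the following command to execute the script once, which downloads the latest configuration file for the current region. The command uses python3, the same interpreter as the service unit below.
python3 uplink_hash_config.py
- Run the following commands to add a service that applies the hashing configuration at boot.
cd /etc/systemd/system
vim config-hash.service
Write the following content to the file.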
[Unit]
Description=Run uplink hash config
After=bms-network-config.service
Requires=bms-network-config.service

[Service]
Type=oneshot
ExecStart=python3 /opt/huawei/port_config/uplink_hash_config.py
RemainAfterExit=yes
User=root

[Install]
WantedBy=cloud-init.target
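Enable the service so that it starts at boot.
cd /etc/systemd/system
systemctl daemon-reload
systemctl enable config-hash.service
- Verify the driver and firmware versions.
ascend-dmi -c
If the firmware version matches the expected version, driver-firmware consistency is configured successfully.
- Verify the UDP port hashing configuration.
npu_ids=$(npu-smi info -l | awk '/NPU ID/{print $NF}')
for i in $npu_ids; do hccn_tool -i $i -udp -g; done
If, in the command output, udp_port is not Unknown and status is custom, UDP port hashing is configured successfully.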
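To see which port a given NPU should end up with, it helps to look at how uplink_hash_config.py builds its lookup table for the built-in A3 clusters. The sketch below is a compact restatement of get_a3_desire_port_map together with the flavor digest used as the lookup key. The names flavor_key and a3_port_map are hypothetical, and the closed-form lane and offset rules are derived from the 16 hard-coded assignments in the script, so treat this as an aid for reading the script rather than a replacement for it.

```python
import hashlib


def flavor_key(flavor):
    """Same digest as `echo -n <flavor> | sha256sum`, the key looked up under "flavors"."""
    return hashlib.sha256(flavor.encode("utf-8")).hexdigest()


def a3_port_map(start, end, max_tor_port):
    """Build {"<npu id>|<ifname>": udp_port} for a 16-NPU node.

    Assumes max_tor_port is a multiple of 4, which holds for the values
    used in the script (32 and 36).
    """
    port_map = {}
    for step, base in enumerate(range(start, end + 1, 4)):
        # Every step consumes four switch ports; a new switch ID starts
        # once max_tor_port ports have been used.
        tor_ge_id = 1 + (4 * step) // max_tor_port
        tor_port = (4 * step) % max_tor_port
        for npu_id in range(16):
            lane = 1 if npu_id < 8 else 2      # NPUs 0-7 use ":1", NPUs 8-15 use ":2"
            offset = 3 - (npu_id % 8) // 2     # pairs (6,7) -> 0, (4,5) -> 1, (2,3) -> 2, (0,1) -> 3
            ifname = "400GE{}/0/{}:{}".format(tor_ge_id, tor_port + offset, lane)
            port_map["{}|{}".format(npu_id, ifname)] = base + offset
    return port_map


if __name__ == "__main__":
    cluster4 = a3_port_map(5120, 5183, 32)  # same arguments as A3_32K_CLUSTER in the script
    print(len(cluster4))
    print(cluster4["6|400GE1/0/0:1"])
```

For example, with the cluster4 arguments, NPU 6 attached to 400GE1/0/0:1 maps to UDP port 5120. If the key built from an NPU ID and its LLDP interface name is not present in the table, the script falls back to auto mode for all NPUs.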
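Cleaning Up and Saving the Image
- Run the following commands to remove the logs and timestamp files produced by the driver-firmware consistency and UDP port hashing configurations.
rm -rf /opt/huawei/port_config/uplink_hash_config.log
rm -rf /opt/huawei/firmware_check/firmware_check.log
rm -rf /opt/huawei/firmware_check/time*
- Clear the remaining traces and create a new image. For details, see the guide on creating a Lite Server operating system.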
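The time* files removed above are the timestamp tags (timetag followed by the instance uuid) that firmware_check.sh touches before it reboots after an upgrade. The script uses the tag's modification time to refuse another check within MIN_DURATION seconds. A minimal Python sketch of that throttle, with the hypothetical function name check_permitted, looks like this:

```python
import os
import time

MIN_DURATION = 3600  # seconds, same value as in firmware_check.sh


def check_permitted(time_tag, now=None, min_duration=MIN_DURATION):
    """Return True if a firmware check may run (mirrors permit_check)."""
    # No tag file: the script treats this as the first check on the machine.
    if not os.path.exists(time_tag):
        return True
    now = time.time() if now is None else now
    # The script forbids the check only while the tag is younger than min_duration.
    return now - os.path.getmtime(time_tag) >= min_duration


if __name__ == "__main__":
    # Hypothetical tag path; the real name ends with the uuid from the metadata service.
    print(check_permitted("/opt/huawei/firmware_check/timetag-demo"))
```

Removing the tags before saving the image means that servers created from the image start without leftover timestamps from the build machine.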