Configuring the NPU Driver Firmware Consistency and UDP Port Hash
Description
The public OSs of the Lite Server Snt9b and supernode Snt9b23 come with driver-firmware consistency and UDP port hash already configured. The driver-firmware consistency configuration automatically refreshes the firmware after the server is shut down and restarted, which keeps the OS HDK driver and the firmware consistent. The UDP port hash configuration automatically refreshes the uplink port configuration of the parameter plane network after the server is shut down and restarted, which prevents congestion in the parameter plane network.
On a privately built OS, neither the firmware nor the uplink port configuration is refreshed automatically after the server is powered off and restarted unless you configure driver-firmware consistency and UDP port hash yourself. Configuring both on your private OS is therefore recommended.
Constraints
Driver firmware consistency and UDP port hash can be configured only on Lite Server Snt9b and supernode Snt9b23.
Both configurations depend on bms-network-config. Complete the cloud-based adaptation of the OS and install the software package first. For details, see the BMS image creation documentation.
Configuring Driver Firmware Consistency
Obtain the NPU driver and firmware packages that match your model and chip architecture from Huawei Support. The following steps use HDK 25.2.1 as an example to configure driver firmware consistency on an Arm64 Snt9b23 supernode.
| Type | Package | Version |
|---|---|---|
| Driver package | Atlas-A3-hdk-npu-driver_25.2.1_linux-aarch64.run | 25.2.1 |
| Firmware package | Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run | 7.7.0.9.220 |
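The version strings in these file names are the same values that the consistency script configured later in this section stores in PRESET_SOFTWARE_VERSION and PRESET_FIRMWARE_VERSION. As a minimal sketch, the Python helper below extracts them with a regular expression. The pattern and the `package_version` name are assumptions inferred only from the two example file names in the table, not a documented naming rule.

```python
import re

# Assumed layout, inferred from the two example names only:
#   <prefix>-hdk-npu-<driver|firmware>_<version>[_<platform>].run
PACKAGE_RE = re.compile(r"-hdk-npu-(driver|firmware)_(\d+(?:\.\d+)*)(?:_[\w-]+)?\.run$")


def package_version(filename):
    """Return (kind, version) parsed from an HDK package file name."""
    match = PACKAGE_RE.search(filename)
    if match is None:
        raise ValueError("unrecognized package name: " + filename)
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(package_version("Atlas-A3-hdk-npu-driver_25.2.1_linux-aarch64.run"))
    print(package_version("Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run"))
```

If you use a different HDK release, the two values returned here are the ones to carry into the script's preset version settings.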
- Upload the downloaded NPU driver package and firmware package to any directory on the Lite Server. You can use Xftp or an OBS bucket.
- Log in to the Lite Server and go to the directory to which you uploaded the packages.
- Install the NPU driver.
For details, see "Installing the Driver and Firmware" in Configuring the Resource Software Environment for NPU-based Lite Servers.
- Create a driver firmware consistency configuration directory:
mkdir -p /opt/huawei/firmware_check
- Create a driver firmware consistency configuration script:
cd /opt/huawei/firmware_check
touch firmware_check.sh
- Move the firmware package to the driver firmware consistency directory and rename it. The firmware_check.sh script looks for a file named Ascend-hdk-npu-firmware_<version>.run in this directory:
mv Atlas-A3-hdk-npu-firmware_7.7.0.9.220.run /opt/huawei/firmware_check/Ascend-hdk-npu-firmware_7.7.0.9.220.run
Note: If the firmware version is different from the example or the version is changed later, modify the software package version number in the preceding commands accordingly.
- Open the firmware_check.sh script:
vim firmware_check.sh
Add the following content:
#!/bin/bash

HOME_DIR="/opt/huawei/firmware_check"
MAX_LOG_LINES=500
LOG_FILE="${HOME_DIR}/firmware_check.log"
ASCEND_INSTALL_INFO="/etc/ascend_install.info"
ASCEND_PATH="/usr/local/Ascend"
MIN_DURATION=3600

# ***If the driver version is changed during image creation, update the content below.***
PRESET_SOFTWARE_VERSION="25.2.1"
PRESET_FIRMWARE_VERSION="7.7.0.9.220"

# Create the log file if it is missing; otherwise keep only the last MAX_LOG_LINES lines.
function init_log() {
    if [ ! -f ${LOG_FILE} ]; then
        cat /dev/null > $LOG_FILE
        return
    fi
    line_count=$(wc -l < "${LOG_FILE}")
    if [ $line_count -gt $MAX_LOG_LINES ]; then
        tail -n "${MAX_LOG_LINES}" "${LOG_FILE}" > "${LOG_FILE}.tmp"
        mv "${LOG_FILE}.tmp" "${LOG_FILE}"
    fi
}

function log_info() {
    echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Info]$@" >> $LOG_FILE && echo -e "\033[32m[INFO]\033[0m: $@" > /dev/tty
}

function log_warning() {
    echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Warning]$@" >> $LOG_FILE && echo -e "\033[33m[WARN]\033[0m: $@" > /dev/tty
}

function log_error() {
    echo -e "$(date +%Y-%m-%d" "%H:%M:%S):[Error]$@" >> $LOG_FILE && echo -e "\033[31m[ERROR]\033[0m: $@" > /dev/tty
}

# Print the value of key $2 from the key=value file $1.
function get_param_from_config() {
    file=$1
    wanted=$2
    if [ ! -e $file ]; then
        log_error "File ${file} does not exist."
        return 1
    fi
    while IFS="=" read -r key val; do
        key=$(echo "$key" | tr -d '[:space:]')
        if [[ "$key" == "$wanted" ]]; then
            echo $val
            return 0
        fi
    done < "$file"
    return 1
}

# Print the installed version of the object (for example, driver) recorded in its version.info file.
function get_object_version() {
    install_info=$1
    install_path_key=$2
    default_install_path=$3
    object_type=$4
    install_path=$(get_param_from_config "${ASCEND_INSTALL_INFO}" "${install_path_key}")
    if [ $? -ne 0 ]; then
        log_warning "Failed to get value of ${install_path_key} from ${install_info}, use ${default_install_path}."
        install_path=${default_install_path}
    fi
    version_info_file="${install_path}/${object_type}/version.info"
    version=$(get_param_from_config "${version_info_file}" "Version")
    if [ $? -ne 0 ]; then
        log_error "Failed to get Version value from ${version_info_file}."
        return 1
    fi
    echo $version
    return 0
}

# Print "<software version>|<firmware version>" of NPU $1 as reported by npu-smi.
function get_versions_with_tool() {
    output=$(npu-smi info -t board -i ${1})
    if [ $? -ne 0 ]; then
        log_error "Run command 'npu-smi info -t board -i ${1}' error:\n$output"
        return 1
    fi
    software_version=$(echo "$output" | awk -F':' '/Software Version/ {print $2}' | tr -d '[:space:]')
    firmware_version=$(echo "$output" | awk -F':' '/Firmware Version/ {print $2}' | tr -d '[:space:]')
    if [ -n "${software_version}" ] && [ -n "${firmware_version}" ]; then
        echo "${software_version}|${firmware_version}"
        return 0
    fi
    return 1
}

# Print "true" if the firmware must be upgraded, "false" otherwise.
function check_versions() {
    software_version=$1
    firmware_version=$2
    if [ "${software_version}" != "${PRESET_SOFTWARE_VERSION}" ]; then
        log_warning "The current software version is not preset version '${PRESET_SOFTWARE_VERSION}', does not need to upgrade the firmware."
        echo "false"
        return
    fi
    if [ -n "${firmware_version}" ] && [ "${firmware_version}" == "${PRESET_FIRMWARE_VERSION}" ]; then
        log_info "The current firmware version match preset version '${PRESET_FIRMWARE_VERSION}', does not need to be upgraded."
        echo "false"
        return
    fi
    log_info "The current firmware '${firmware_version}' does not match preset version '${PRESET_FIRMWARE_VERSION}', need to be upgraded."
    echo "true"
    return
}

function upgrade_firmware() {
    firmware_version=$1
    firmware_package="${HOME_DIR}/Ascend-hdk-npu-firmware_${firmware_version}.run"
    if [ ! -e ${firmware_package} ]; then
        log_error "Firmware package ${firmware_package} does not exist."
        return 1
    fi
    bash ${firmware_package} --full --quiet >> $LOG_FILE 2>&1
    return $?
}

# Not called by main. PRESET_MCU_VERSION and upgrade_mcu.sh are not defined in this procedure.
function upgrade_mcu() {
    bash ${HOME_DIR}/upgrade_mcu.sh ${HOME_DIR}/Ascend-hdk-mcu_${PRESET_MCU_VERSION}.hpm
}

# Allow a check only if the last upgrade reboot is at least MIN_DURATION seconds old.
function permit_check() {
    if [ ! -e ${TIME_TAG} ]; then
        log_info "First check of this machine, permit."
        return 0
    fi
    this_time=$(date +%s)
    rc=$?
    if [ $rc -ne 0 ]; then
        log_error "Failed to get current time, error code ${rc}."
        return 1
    fi
    last_time=$(stat -c %Y ${TIME_TAG})
    rc=$?
    if [ $rc -ne 0 ]; then
        log_error "Failed to get the timestamp of last check, error code ${rc}."
        return 1
    fi
    duration=$(expr $this_time - $last_time)
    if [ $? -ne 0 ]; then
        log_error "Failed to calculate duration, forbidden."
        return 1
    fi
    if [ $duration -lt ${MIN_DURATION} ]; then
        log_info "Duration less than ${MIN_DURATION} sec, forbidden."
        return 1
    fi
    log_info "Duration from this time to last time is ${duration} sec."
    return 0
}

function main() {
    log_info "Start to check the firmware."
    init_log
    permit_check
    if [ $? -ne 0 ]; then
        log_error "This check is too frequent, try later."
        return
    fi
    firmware_reboot="false"
    need_upgrade="false"
    for i in $npu_ids; do
        result=$(get_versions_with_tool ${i})
        if [ $? -eq 0 ]; then
            software_version=$(echo $result | awk -F'|' '{print $1}')
            firmware_version=$(echo $result | awk -F'|' '{print $2}')
            log_info "npu ${i} software version and firmware version from npu-smi are ${software_version} and ${firmware_version}."
        else
            log_warning "npu ${i} failed to get versions using npu-smi, try to read the software version from config file."
            if [ -z "${config_software_version}" ]; then
                log_error "failed to get software version from config file."
                return
            fi
            software_version=${config_software_version}
            log_info "npu ${i} software version from config is ${software_version}."
        fi
        need_upgrade=$(check_versions "${software_version}" "${firmware_version}")
        if [ "${need_upgrade}" == "true" ]; then
            log_info "npu ${i} firmware need upgrade."
            break
        fi
    done
    if [ "${need_upgrade}" == "true" ]; then
        log_info "Start upgrading the firmware."
        upgrade_firmware ${PRESET_FIRMWARE_VERSION}
        if [ $? -ne 0 ]; then
            log_error "Failed to upgrade firmware."
        else
            log_info "The firmware upgrade to '${PRESET_FIRMWARE_VERSION}' succeeded, reboot now."
            firmware_reboot="true"
        fi
    fi
    if [[ "${firmware_reboot}" == "true" ]]; then
        log_info "Start to reboot."
        touch ${TIME_TAG}
        reboot
    fi
    log_info "The firmware check completed."
}

metadata=$(/usr/bin/curl http://169.254.169.254/openstack/latest/meta_data.json)
rc=$?
if [ $rc -ne 0 ]; then
    log_error "Failed to get metadata, error code ${rc}."
    exit
fi
if [ -z "${metadata}" ]; then
    log_error "Metadata is empty, abort."
    exit
fi
uuid=$(echo ${metadata} | grep -o '"uuid": "[^"]*' | sed 's/"uuid": "//')
TIME_TAG="${HOME_DIR}/timetag${uuid}"
config_software_version=$(get_object_version "${ASCEND_INSTALL_INFO}" "Driver_Install_Path_Param" "${ASCEND_PATH}" "driver")
if [ $? -ne 0 ]; then
    log_error "Failed to get software version from config file."
fi
npu_ids=$(npu-smi info -l | awk '/NPU ID/{print $NF}')
npu_count=$(echo "$npu_ids" | wc -l)
main
echo "Check $LOG_FILE for details."
Note: If the driver and firmware versions are different from the example or the version is changed later, modify PRESET_SOFTWARE_VERSION and PRESET_FIRMWARE_VERSION in the script accordingly.
- Add the execute permission for the firmware package and configuration script:
cd /opt/huawei/firmware_check
chmod 700 firmware_check.sh
chmod 700 Ascend-hdk-npu-firmware_7.7.0.9.220.run
- Add the automatic startup service for driver firmware consistency:
cd /etc/systemd/system
vim firmware_check.service
Add the following content:
[Unit]
Description=check and upgrade firmware
After=config-hash.service
Requires=config-hash.service

[Service]
Type=oneshot
ExecStart=/opt/huawei/firmware_check/firmware_check.sh
RemainAfterExit=yes
User=root

[Install]
WantedBy=cloud-init.target
Configure the service to automatically start upon system startup.
systemctl daemon-reload
systemctl enable firmware_check.service
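The decision that firmware_check.sh makes on each boot can be summarized in a few lines. The Python sketch below is an illustration only, not part of the procedure: it mirrors the script's version check, and the function names and the sample text are hypothetical. The only assumption about npu-smi output is the one the script itself relies on, namely lines that contain "Software Version" and "Firmware Version" followed by a colon and a value.

```python
PRESET_SOFTWARE_VERSION = "25.2.1"
PRESET_FIRMWARE_VERSION = "7.7.0.9.220"


def parse_versions(board_info):
    """Read the 'Software Version' and 'Firmware Version' fields from 'key : value' text."""
    fields = {}
    for line in board_info.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields.get("Software Version", ""), fields.get("Firmware Version", "")


def needs_upgrade(software, firmware):
    """Upgrade only when the installed driver is the preset one and the firmware differs."""
    if software != PRESET_SOFTWARE_VERSION:
        return False  # a different driver is installed, so the firmware is left alone
    return firmware != PRESET_FIRMWARE_VERSION  # an empty firmware string also triggers an upgrade


if __name__ == "__main__":
    # Hypothetical sample text, not real npu-smi output.
    sample = "Software Version : 25.2.1\nFirmware Version : 7.7.0.9.100"
    software, firmware = parse_versions(sample)
    print(software, firmware, needs_upgrade(software, firmware))
```

The first branch is why both preset values must be updated together when the driver version changes: with a driver that does not match the preset software version, the script never upgrades the firmware.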
Configuring UDP Port Hash
- Create a UDP port hash configuration directory:
mkdir -p /opt/huawei/port_config
- Add the hash configuration script:
cd /opt/huawei/port_config
touch port_config.json
touch uplink_hash_config.py
- Write the hash configuration script:
vim uplink_hash_config.py
Add the following content:
# -*- coding: UTF-8 -*-
import base64
import logging
import time
import json
import shutil
import os
import requests

try:
    import commands
except ImportError:
    import subprocess as commands

A2_NPU_COUNT = 8
A3_NPU_COUNT = 16
METADATA_URL = 'http://169.254.169.254/openstack/latest/meta_data.json'
A3_32K_CLUSTER = "cluster4"
A3_64K_CLUSTER = "cluster5"
A3_128K_CLUSTER = "cluster6"
A3_9866_CLUSTER = "cluster8"
A3_9866_REGION_ID = "cn-guian02"

log = logging.getLogger(__name__)


class UplinkHashConfig:
    def __init__(self, config_dir="/opt/huawei/port_config/", log_file="/opt/huawei/port_config/uplink_hash_config.log"):
        self.DIR = config_dir
        self.setup_log(log_file)

    def setup_log(self, log_file):
        handler = logging.FileHandler(log_file)
        formatter = logging.Formatter('%(asctime)s - %(filename)s[%(levelname)s]: %(message)s')
        handler.setFormatter(formatter)
        log.addHandler(handler)
        log.setLevel(logging.INFO)

    def read_url(self, url, timeout=None, retries=0, sec_between=1, check_status=True):
        manual_tries = 1
        if retries:
            manual_tries = max(int(retries) + 1, 1)
        if sec_between is None:
            sec_between = -1
        for i in range(0, manual_tries):
            try:
                r = requests.get(url, timeout=timeout)
                if check_status:
                    r.raise_for_status()
                return r
            except requests.exceptions.RequestException as e:
                if i + 1 < manual_tries and sec_between > 0:
                    log.warning("Get metadata failed, wait %s seconds to try again", sec_between)
                    time.sleep(sec_between)
        log.error("Get metadata failed, please check network and security group.")
        return None

    def get_meta_json(self, timeout=5, retries=5):
        try:
            resp = self.read_url(url=METADATA_URL, timeout=timeout, retries=retries)
            metadata = resp.json()
        except Exception as e:
            log.error("Get metadata failed. Error: %s", e)
            return False, None
        return True, metadata

    def get_npu_count(self, meta_json):
        hyperinstance_type = meta_json.get("meta", {}).get("_sys_hyperinstance_type")
        if hyperinstance_type:
            log.info("Node type is hyperinstance.")
            return A3_NPU_COUNT
        return A2_NPU_COUNT

    def get_config_url(self, region):
        if region == "cn-north-7":
            url = "https://cnnorth7-modelarts-sdk.obs.cn-north-7.ulanqab.huawei.com"
        elif region == "cn-north-9":
            url = "https://cnnorth9-modelarts-sdk.obs.cn-north-9.myhuaweicloud.com"
        elif region == "cn-south-1":
            url = "https://cnsouth1-modelarts-sdk.obs.cn-south-1.myhuaweicloud.com"
        elif region == "cn-east-3":
            url = "https://cneast3-modelarts-sdk.obs.cn-east-3.myhuaweicloud.com"
        elif region == "ap-southeast-1":
            url = "https://ap-southeast1-modelarts-sdk.obs.ap-southeast-1.myhuaweicloud.com"
        elif region == "cn-north-11":
            url = "https://cnnorth11-modelarts-sdk.obs.cn-north-11.myhuaweicloud.com"
        elif region == "cn-southwest-2":
            url = "https://cn-southwest-2-modelarts-sdk.obs.cn-southwest-2.myhuaweicloud.com"
        elif region == "cn-east-4":
            url = "https://cneast4-modelarts-sdk.obs.dualstack.cn-east-4.myhuaweicloud.com"
        elif region == "la-south-2":
            url = "https://la-south2-modelarts-sdk.obs.la-south-2.myhuaweicloud.com"
        elif region == "ap-southeast-3":
            url = "https://ap-southeast3-modelarts-sdk.obs.ap-southeast-3.myhuaweicloud.com"
        elif region == "me-east-1":
            url = "https://me-east-1-modelarts-sdk.obs.me-east-1.myhuaweicloud.com"
        else:
            url = "https://{0}-modelarts-sdk.obs.{1}.myhuaweicloud.com".format(region.replace("-", ""), region)
        return url + "/devserver/port_config.json"

    def download_file(self, url, destination):
        log.info("Downloaded file from %s to %s", url, destination)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            with open(destination, "wb") as f:
                f.write(response.content)
            return True
        except requests.exceptions.RequestException as e:
            log.error("Failed to download file from %s: %s", url, str(e))
            return False

    def get_port_config(self, url):
        config_file = os.path.join(self.DIR, "port_config.json")
        backup_file = os.path.join(self.DIR, "port_config.json.bak")
        if url is None:
            log.warning("Url is none, use local config file.")
            try:
                with open(config_file, "r") as f:
                    hash_config = json.load(f)
                log.info("Loaded config from %s", config_file)
                return hash_config
            except FileNotFoundError:
                log.error("Config file %s not found.", config_file)
                return None
            except json.JSONDecodeError:
                log.error("Failed to decode JSON from %s", config_file)
                return None
        try:
            shutil.copy2(config_file, backup_file)
            log.info("Backed up %s to %s", config_file, backup_file)
        except Exception as e:
            log.error("Failed to backup %s: %s", config_file, str(e))
            return None
        if self.download_file(url, config_file):
            try:
                with open(config_file, "r") as f:
                    hash_config = json.load(f)
                log.info("Loaded new config from %s", config_file)
                os.remove(backup_file)
                return hash_config
            except json.JSONDecodeError:
                log.error("Failed to decode json from downloaded file %s, restored backup file.", config_file)
                shutil.copy2(backup_file, config_file)
                os.remove(backup_file)
                with open(config_file, "r") as f:
                    hash_config = json.load(f)
                return hash_config
        else:
            log.warning("Download failed, restored local backup config file.")
            shutil.copy2(backup_file, config_file)
            os.remove(backup_file)
            with open(config_file, "r") as f:
                hash_config = json.load(f)
            return hash_config

    def wait_until_npu_ready(self, npu_count):
        for _ in range(0, 10):
            count = 0
            for i in range(0, npu_count):
                cmd = "hccn_tool -i {} -lldp -g | grep Ifname | awk -F ': ' '{{print $2}}'".format(i)
                (status, output) = commands.getstatusoutput(cmd)
                if not output:
                    log.error("result: get ifname failed, try again after 30s, id:%s", i)
                    time.sleep(30)
                    break
                count += 1
            if count == npu_count:
                log.info("result: get all ifname success.")
                return
        log.warning("Failed to get ifname after 10 attempts.")

    def print_current_port(self, npu_count):
        for i in range(0, npu_count):
            cmd = "hccn_tool -i {} -udp -g".format(i)
            (status, output) = commands.getstatusoutput(cmd)
            log.info("port %s: %s", i, output)

    def config_udp_port_auto(self, npu_count):
        log.info("Configuring ports in auto mode.")
        for i in range(0, npu_count):
            cmd = "hccn_tool -i {} -udp -s auto".format(i)
            (status, _) = commands.getstatusoutput(cmd)

    def config_udp_port(self, port_config, flavor, npu_count, region_id):
        if port_config is None:
            self.config_udp_port_auto(npu_count)
            return
        cmd = "echo -n {} | sha256sum | awk '{{print $1}}'".format(flavor)
        (status, sha256_flavor) = commands.getstatusoutput(cmd)
        log.info("Sha256 flavor is %s", sha256_flavor)
        cur_cluster = self.get_cluster(port_config, sha256_flavor, region_id)
        if not cur_cluster:
            if A3_NPU_COUNT == npu_count:
                log.warning("Get cluster from port_config failed, flavor is %s, set default value %s", flavor, A3_32K_CLUSTER)
                cur_cluster = A3_32K_CLUSTER
            else:
                log.error("Get cluster from port_config failed, flavor is %s", flavor)
                self.config_udp_port_auto(npu_count)
                return
        log.info("Cluster is %s", cur_cluster)
        desire_port_map = self.get_desire_port_map(cur_cluster, port_config)
        for i in range(0, npu_count):
            cmd = "hccn_tool -i {} -lldp -g | grep Ifname | awk -F ': ' '{{print $2}}'".format(i)
            (status, ifname_tmp) = commands.getstatusoutput(cmd)
            key = "{}|{}".format(i, ifname_tmp)
            desire_port = desire_port_map.get(key, None)
            if not desire_port:
                log.error("Get desire port from port_config failed, id %s, ifname %s", i, ifname_tmp)
                self.config_udp_port_auto(npu_count)
                return
            cmd = "hccn_tool -i {} -udp -g | awk -F ':' '{{print $2}}' | grep -v auto".format(i)
            (status, current_port) = commands.getstatusoutput(cmd)
            if desire_port != current_port:
                cmd = "hccn_tool -i {} -udp -s port {}".format(i, desire_port)
                (status, _) = commands.getstatusoutput(cmd)

    def get_cluster(self, port_config, sha256_flavor, region_id):
        if region_id == A3_9866_REGION_ID:
            return A3_9866_CLUSTER
        return port_config.get("flavors", {}).get(sha256_flavor, None)

    def get_desire_port_map(self, cur_cluster, port_config):
        if cur_cluster == A3_32K_CLUSTER:
            return self.get_a3_desire_port_map(5120, 5183, 32)
        elif cur_cluster == A3_64K_CLUSTER:
            return self.get_a3_desire_port_map(5184, 5327, 36)
        elif cur_cluster == A3_128K_CLUSTER:
            return self.get_a3_desire_port_map(5184, 5471, 36)
        else:
            return port_config.get("clusters", {}).get(cur_cluster, {})

    def get_a3_desire_port_map(self, start, end, max_tor_port):
        udp_port_map = {}
        tor_ge_id = 1
        tor_port = 0
        udp_port = start
        while udp_port <= end:
            udp_port_map["6|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["14|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["4|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["12|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["2|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["10|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["0|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port_map["8|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port_map["7|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["15|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port)] = udp_port
            udp_port_map["5|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["13|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 1)] = udp_port + 1
            udp_port_map["3|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["11|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 2)] = udp_port + 2
            udp_port_map["1|400GE{0}/0/{1}:1".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port_map["9|400GE{0}/0/{1}:2".format(tor_ge_id, tor_port + 3)] = udp_port + 3
            udp_port += 4
            tor_port += 4
            if tor_port >= max_tor_port:
                tor_port = 0
                tor_ge_id += 1
        return udp_port_map

    def run(self):
        ret, meta_json = self.get_meta_json()
        if not ret:
            log.error("Get meta_json failed, metadata json is %s", meta_json)
            return
        log.info("Get meta_json success, metadata json is %s", meta_json)
        region_id = meta_json.get('region_id', None)
        if not region_id:
            log.error("Get region_id from metadata failed.")
            return
        log.info("Region is %s", region_id)
        flavor = meta_json.get('instance_type', None)
        if not flavor:
            log.error("Get instance_type from metadata failed.")
            return
        log.info("Flavor is %s", flavor)
        npu_count = self.get_npu_count(meta_json)
        log.info("Npu count is %s", npu_count)
        url = self.get_config_url(region_id)
        port_config = self.get_port_config(url)
        self.wait_until_npu_ready(npu_count)
        log.info("Before config uplink udp hash, Port is:")
        self.print_current_port(npu_count)
        log.info("=====================================================")
        self.config_udp_port(port_config, flavor, npu_count, region_id)
        log.info("Config uplink udp hash done. Port is:")
        self.print_current_port(npu_count)


if __name__ == "__main__":
    configurator = UplinkHashConfig()
    configurator.run()
- Add the script execution permission:
cd /opt/huawei/port_config
chmod +x uplink_hash_config.py
- Obtain the latest configuration file of the current site:
python3 uplink_hash_config.py
- Add the automatic startup service for hash configuration:
cd /etc/systemd/system
vim config-hash.service
Add the following content:
[Unit]
Description=Run uplink hash config
After=bms-network-config.service
Requires=bms-network-config.service

[Service]
Type=oneshot
ExecStart=python3 /opt/huawei/port_config/uplink_hash_config.py
RemainAfterExit=yes
User=root

[Install]
WantedBy=cloud-init.target
Configure the service to automatically start upon system startup.
cd /etc/systemd/system
systemctl daemon-reload
systemctl enable config-hash.service
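To make the role of port_config.json clearer: uplink_hash_config.py hashes the instance flavor with SHA-256 and uses the digest as a key under "flavors" to find a cluster name, then reads the per-port map for that cluster under "clusters". The sketch below reproduces only this generic lookup path in Python, using hashlib in place of the script's `echo -n ... | sha256sum` pipeline. The function names, the flavor name, the cluster name, and the port values are hypothetical, and the sketch leaves out the script's special cases (the generated maps for cluster4, cluster5, and cluster6, and the region override).

```python
import hashlib


def flavor_key(flavor):
    """Hex SHA-256 digest of the flavor name, equivalent to `echo -n <flavor> | sha256sum`."""
    return hashlib.sha256(flavor.encode("utf-8")).hexdigest()


def lookup_ports(port_config, flavor):
    """Return (cluster, port_map) for a flavor, or (None, {}) if the flavor is not listed."""
    cluster = port_config.get("flavors", {}).get(flavor_key(flavor))
    if cluster is None:
        return None, {}
    return cluster, port_config.get("clusters", {}).get(cluster, {})


if __name__ == "__main__":
    # Hypothetical configuration; keys of the port map have the form "<npu id>|<ifname>".
    demo_config = {
        "flavors": {flavor_key("demo.flavor"): "cluster1"},
        "clusters": {"cluster1": {"0|demo-ifname": 4001}},
    }
    print(lookup_ports(demo_config, "demo.flavor"))
    print(lookup_ports(demo_config, "unknown.flavor"))
```

When the lookup finds no cluster or no port for an interface, the script falls back to `hccn_tool -i <id> -udp -s auto`, except on 16-NPU nodes, where it defaults to cluster4.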
Verification
To verify whether driver firmware consistency and port hash are configured, restart the server. For details, see Restarting a Lite Server.
Note: The server restarts automatically for the firmware upgrade to take effect.
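The automatic restart happens at most once per hour. Before rebooting, firmware_check.sh touches a time tag file, and on the next boot its permit_check function refuses to run again until MIN_DURATION (3600 seconds) has passed since that file was modified, which keeps a failed upgrade from causing a reboot loop. A minimal Python sketch of the same guard, with a hypothetical file path and an injectable clock for testing:

```python
import os
import tempfile
import time

MIN_DURATION = 3600  # seconds, same value as in firmware_check.sh


def permit_check(time_tag, now=None, min_duration=MIN_DURATION):
    """Return True if a firmware check may run, based on the time tag file's mtime."""
    if not os.path.exists(time_tag):
        return True  # first check on this machine
    now = time.time() if now is None else now
    return now - os.path.getmtime(time_tag) >= min_duration


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        tag = os.path.join(workdir, "timetag-demo")  # hypothetical tag file
        print(permit_check(tag))                     # no tag yet
        open(tag, "w").close()                       # what `touch` does before reboot
        print(permit_check(tag))                     # tag was just written
```

This is also why the cleanup step before saving the image removes the time tag files: a leftover tag would suppress the first check on servers created from the image for up to an hour.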
- Verify the driver firmware version.
View the NPU driver and firmware versions:
ascend-dmi -c
If the firmware version in the output matches the version preset in firmware_check.sh (7.7.0.9.220 in this example), driver firmware consistency is configured correctly.
- Verify the UDP port hash configuration.
View the uplink port on the parameter plane network:
npu_ids=$(npu-smi info -l | awk '/NPU ID/{print $NF}')
for i in $npu_ids; do hccn_tool -i $i -udp -g; done
In the command output, if udp_port is not Unknown and status is custom, the configuration is successful.
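On a Snt9b23 supernode you can go one step further and compare the reported ports with the ranges that uplink_hash_config.py uses for its three built-in clusters. The sketch below takes port numbers you have already read from the command output; it does not parse hccn_tool output, and the function name is hypothetical. The ranges are copied from the script's get_desire_port_map function. Other clusters take their ports from port_config.json and are not covered here.

```python
# Inclusive UDP port ranges per built-in cluster, as passed to
# get_a3_desire_port_map in uplink_hash_config.py.
A3_PORT_RANGES = {
    "cluster4": (5120, 5183),
    "cluster5": (5184, 5327),
    "cluster6": (5184, 5471),
}


def ports_outside_range(cluster, ports_by_npu):
    """Return the NPU IDs whose UDP port is outside the cluster's expected range."""
    start, end = A3_PORT_RANGES[cluster]
    return sorted(npu for npu, port in ports_by_npu.items() if not start <= port <= end)


if __name__ == "__main__":
    # Hypothetical readings: NPU ID -> udp_port taken from the command output.
    readings = {0: 5123, 1: 5123, 2: 5122, 3: 6000}
    print(ports_outside_range("cluster4", readings))
```

An empty result means every reported port falls inside the range for that cluster. It does not prove that each NPU got the exact port the script intended, only that no port is obviously wrong.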
Clearing and Saving the Image
- Clear the logs about driver firmware consistency and UDP port hash.
rm -rf /opt/huawei/port_config/uplink_hash_config.log
rm -rf /opt/huawei/firmware_check/firmware_check.log
rm -rf /opt/huawei/firmware_check/time*
- Clear the traces and create a new image. For details, see Creating the OS of a Lite Server.