Updated on 2024-12-16 GMT+08:00
NPU Log Collection and Upload
Scenario
If an NPU is faulty, you can use this solution to collect the NPU logs. The logs generated by this solution are stored on the node and automatically uploaded to an OBS bucket provided by Huawei Cloud technical support. The logs are used only for fault locating and analysis, so you need to provide your AK/SK to Huawei Cloud technical support for authentication and authorization.
Constraints
Currently, this function is available only in the CN Southwest-Guiyang1 and CN North-Ulanqab1 regions.
Procedure
- Obtain an AK/SK. The AK/SK is used in the subsequent script configuration for authentication and authorization.
If you have already generated an AK/SK, skip this step and locate the AK/SK file you downloaded earlier, which is typically named credentials.csv.
As shown in the following figure, the file contains the tenant name (User Name), AK (Access Key Id), and SK (Secret Access Key). A minimal sketch for reading these values from the file is provided after the generation steps below.
Figure 1 Content of the credentials.csv file
To generate an AK/SK, perform the following steps:
- Log in to the management console.
- Click the username in the upper right corner and choose My Credentials from the drop-down list.
- Click Access Keys.
- Click Create Access Key.
- Download the key file and keep it secure.
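If you prefer to pull the AK and SK out of the downloaded file programmatically instead of copying them by hand, the following minimal sketch parses credentials.csv. It assumes the column headers shown in Figure 1 (User Name, Access Key Id, Secret Access Key) and that the file sits next to the script; adjust the path if needed.

import csv

# Minimal sketch: read the AK and SK from the downloaded credentials.csv.
# Assumes the column headers shown in Figure 1; the file path is an example.
with open("credentials.csv") as f:
    row = next(csv.DictReader(f))  # the file contains a single credential row
    ak = row["Access Key Id"].strip()
    sk = row["Secret Access Key"].strip()
print("AK loaded for user:", row["User Name"])  # do not print the SK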
- Prepare your account ID (tenant ID) and IAM user ID for OBS bucket configuration.
Provide your account ID and IAM user ID to Huawei Cloud technical support. Based on this information, Huawei Cloud technical support will configure the OBS bucket policy so that the collected logs can be uploaded to the corresponding OBS bucket.
After the configuration is complete, Huawei Cloud technical support will provide you with the corresponding OBS bucket directory (obs_dir), which is used in the script configured later. A sketch of how this directory appears in the upload URL follows Figure 2.
Figure 2 Account ID and IAM user ID
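For reference, the script in the next step builds the upload destination from the region and the obs_dir value. The following sketch reproduces that composition; the obs_dir value and the log file name here are hypothetical placeholders, and in practice you should use only the obs_dir provided by Huawei Cloud technical support.

# Sketch of how the script below composes the OBS upload URL.
# "demo-dir" and the log file name are hypothetical placeholders.
region_id = "cn-southwest-2"          # CN Southwest-Guiyang1
obs_bucket = "npu-log-" + region_id   # OBS_BUCKET_PREFIX + region ID
obs_dir = "demo-dir"
log_tar = "192.168.0.1-npu-log-20241216000000.tar.gz"
obs_url = "https://%s.obs.%s.myhuaweicloud.com/%s/%s" % (obs_bucket, region_id, obs_dir, log_tar)
print(obs_url)
# -> https://npu-log-cn-southwest-2.obs.cn-southwest-2.myhuaweicloud.com/demo-dir/192.168.0.1-npu-log-20241216000000.tar.gz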
- Prepare the log collection and upload script.
Modify the parameters of NpuLogCollection in the following script by replacing ak, sk, and obs_dir with the values obtained in the preceding steps, and then upload the script to the node whose NPU logs you want to collect.
import json
import os
import sys
import hashlib
import hmac
import binascii
from datetime import datetime


class NpuLogCollection(object):
    NPU_LOG_PATH = "/var/log/npu_log_collect"
    SUPPORT_REGIONS = ['cn-southwest-2', 'cn-north-9']
    OPENSTACK_METADATA = "http://169.254.169.254/openstack/latest/meta_data.json"
    OBS_BUCKET_PREFIX = "npu-log-"

    def __init__(self, ak, sk, obs_dir):
        self.ak = ak
        self.sk = sk
        self.obs_dir = obs_dir
        self.region_id = self.get_region_id()

    def get_region_id(self):
        # Query the instance metadata service to determine the current region.
        meta_data = os.popen("curl {}".format(self.OPENSTACK_METADATA))
        json_meta_data = json.loads(meta_data.read())
        meta_data.close()
        region_id = json_meta_data["region_id"]
        if region_id not in self.SUPPORT_REGIONS:
            print("current region {} is not supported.".format(region_id))
            raise Exception('region exception')
        return region_id

    def gen_collect_npu_log_shell(self):
        # Shell commands that collect NPU driver, ECC, device, and tool logs.
        collect_npu_log_shell = "#!/bin/sh\n" \
            "rm -rf {npu_log_path}\n" \
            "mkdir -p {npu_log_path}\n" \
            "echo {echo_npu_driver_info}\n" \
            "npu-smi info > {npu_log_path}/npu-smi_info.log\n" \
            "cat /usr/local/Ascend/driver/version.info > {npu_log_path}/npu-smi_driver-version.log\n" \
            "/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version > {npu_log_path}/npu-smi_firmware-version.log\n" \
            "for i in $(seq 0 7) ; do npu-smi info -t health -i $i -c 0 >> {npu_log_path}/npu-smi_health-code.log;done;\n" \
            "for i in $(seq 0 7);do hccn_tool -i $i -net_health -g >> {npu_log_path}/npu-smi_net-health.log ;done\n" \
            "for i in $(seq 0 7);do hccn_tool -i $i -link -g >> {npu_log_path}/npu-smi_link.log ;done\n" \
            "for i in $(seq 0 7);do hccn_tool -i $i -tls -g |grep switch >> {npu_log_path}/npu-smi_switch.log;done\n" \
            "for i in $(seq 0 7);do npu-smi info -t board -i $i >> {npu_log_path}/npu-smi_board.log; done;\n" \
            "echo {echo_npu_ecc_info}\n" \
            "for i in $(seq 0 7);do npu-smi info -t ecc -i $i >> {npu_log_path}/npu-smi_ecc.log; done;\n" \
            "for i in $(seq 0 7);do hccn_tool -i $i -optical -g | grep prese >> {npu_log_path}/npu-smi_present.log ;done\n" \
            "lspci | grep acce > {npu_log_path}/Device-info.log\n" \
            "echo {echo_npu_device_log}\n" \
            "cd {npu_log_path} && msnpureport -f > /dev/null\n" \
            "tar -czvPf {npu_log_path}/log_messages.tar.gz /var/log/message* > /dev/null\n" \
            "tar -czvPf {npu_log_path}/ascend_install.tar.gz /var/log/ascend_seclog/* > /dev/null\n" \
            "echo {echo_npu_tools_log}\n" \
            "tar -czvPf {npu_log_path}/ascend_toollog.tar.gz /var/log/nputools_LOG_* > /dev/null" \
            .format(npu_log_path=self.NPU_LOG_PATH,
                    echo_npu_driver_info="collect npu driver info.",
                    echo_npu_ecc_info="collect npu ecc info.",
                    echo_npu_device_log="collect npu device log.",
                    echo_npu_tools_log="collect npu tools log.")
        return collect_npu_log_shell

    def collect_npu_log(self):
        print("begin to collect npu log")
        os.system(self.gen_collect_npu_log_shell())
        # Name the package after the local IP address and the collection time.
        date_collect = datetime.now().strftime('%Y%m%d%H%M%S')
        instance_ip_obj = os.popen("curl http://169.254.169.254/latest/meta-data/local-ipv4")
        instance_ip = instance_ip_obj.read()
        instance_ip_obj.close()
        log_tar = "%s-npu-log-%s.tar.gz" % (instance_ip, date_collect)
        os.system("tar -czvPf %s %s > /dev/null" % (log_tar, self.NPU_LOG_PATH))
        print("success to collect npu log with {}".format(log_tar))
        return log_tar

    def upload_log_to_obs(self, log_tar):
        obs_bucket = "{}{}".format(self.OBS_BUCKET_PREFIX, self.region_id)
        print("begin to upload {} to obs bucket {}".format(log_tar, obs_bucket))
        obs_url = "https://%s.obs.%s.myhuaweicloud.com/%s/%s" % (obs_bucket, self.region_id, self.obs_dir, log_tar)
        date = datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S GMT')
        canonicalized_headers = "x-obs-acl:public-read"
        obs_sign = self.gen_obs_sign(date, canonicalized_headers, obs_bucket, log_tar)
        auth = "OBS " + self.ak + ":" + obs_sign
        header_date = '\"' + "Date:" + date + '\"'
        header_auth = '\"' + "Authorization:" + auth + '\"'
        header_obs_acl = '\"' + canonicalized_headers + '\"'
        # Upload the package with a signed PUT request.
        cmd = "curl -X PUT -T " + log_tar + " " + obs_url + " -H " + header_date + " -H " + header_auth + " -H " + header_obs_acl
        os.system(cmd)
        print("success to upload {} to obs bucket {}".format(log_tar, obs_bucket))

    # calculate obs auth sign
    def gen_obs_sign(self, date, canonicalized_headers, obs_bucket, log_tar):
        http_method = "PUT"
        canonicalized_resource = "/%s/%s/%s" % (obs_bucket, self.obs_dir, log_tar)
        IS_PYTHON2 = sys.version_info.major == 2 or sys.version < '3'
        canonical_string = http_method + "\n" + "\n" + "\n" + date + "\n" + canonicalized_headers + "\n" + canonicalized_resource
        if IS_PYTHON2:
            hashed = hmac.new(self.sk, canonical_string, hashlib.sha1)
            obs_sign = binascii.b2a_base64(hashed.digest())[:-1]
        else:
            hashed = hmac.new(self.sk.encode('UTF-8'), canonical_string.encode('UTF-8'), hashlib.sha1)
            obs_sign = binascii.b2a_base64(hashed.digest())[:-1].decode('UTF-8')
        return obs_sign

    def execute(self):
        log_tar = self.collect_npu_log()
        self.upload_log_to_obs(log_tar)


if __name__ == '__main__':
    # Replace ak, sk, and obs_dir with the values obtained in the preceding steps.
    npu_log_collection = NpuLogCollection(ak='ak', sk='sk', obs_dir='obs_dir')
    npu_log_collection.execute()
- Run the script to collect logs.
Run the script on the node. Output similar to the following indicates that the logs have been collected and uploaded to OBS successfully.
Figure 3 Log collection completed
- Check the directory where the script is located. The compressed package of the collected logs is generated in the same directory. A sketch for listing the package contents follows Figure 4.
Figure 4 Viewing the result
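If you want to confirm what was collected before handing the package over, the following minimal sketch lists the files in the archive with Python's tarfile module. The file name is a hypothetical example following the <local-IP>-npu-log-<timestamp>.tar.gz pattern produced by the script.

import tarfile

# Minimal sketch: list the contents of the collected log package.
# The file name is a hypothetical example; use the name printed by the script.
log_tar = "192.168.0.1-npu-log-20241216000000.tar.gz"
with tarfile.open(log_tar, "r:gz") as tar:
    for member in tar.getmembers():
        print(member.name, member.size)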
Parent topic: Lite Server Resource Management