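NPUErrorCodeWarning Event Handling Suggestions
Fault Impact
The NPU currently has a fault, which may cause customer services to terminate.
Alarm Description
The NPU has reported an error code alarm.
Alarm Parameters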
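| Parameter | Description |
|---|---|
| Name | NPU: ErrorCode alarm |
| Type | Fault alarm |
| Occurrence time | Time at which the alarm was triggered |
| Location information | Site, cloud service, microservice, VM ID, VM name, alarm information |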
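Possible Causes
This event covers a large number of NPU error codes of major severity and above. The reported error codes can be used to further locate the cause of the fault.
Handling Procedure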
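1. For the specific fault impact and repair suggestions, see the "Black Box Error Code Information List" (《黑匣子错误码信息列表》) and "Health Management Fault Definition" (《健康管理故障定义》) documents for Snt9b or the Snt9b23 supernode. When looking up a code, make sure the document matches the HDK version in use.
2. In addition to the handling suggestions in those documents, you can also refer to Table 1 when handling the fault.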
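Before a code can be looked up in those documents, it usually has to be pulled out of the alarm information and normalized. The following is a minimal Python sketch of that step. It assumes the codes appear in the alarm text as `0x` followed by exactly 8 hexadecimal digits, which matches the codes listed in this document; the function name `extract_error_codes` and the sample alarm string are hypothetical and not part of any official tool.

```python
import re
from typing import List

# Assumption: an error code is "0x" followed by exactly 8 hex digits,
# as in the codes listed in this document.
_CODE_PATTERN = re.compile(r"0[xX][0-9A-Fa-f]{8}(?![0-9A-Fa-f])")


def extract_error_codes(alarm_text: str) -> List[str]:
    """Return unique error codes in order of appearance, as 0x + uppercase hex."""
    seen = set()
    codes = []
    for match in _CODE_PATTERN.findall(alarm_text):
        normalized = "0x" + match[2:].upper()
        if normalized not in seen:
            seen.add(normalized)
            codes.append(normalized)
    return codes


if __name__ == "__main__":
    # Hypothetical alarm text; real alarm information may be formatted differently.
    sample = "NPU ErrorCode alarm: 0x80c98008, 0x80E01801, 0x80C98008"
    print(extract_error_codes(sample))
```

Normalizing the letter case and removing duplicates makes the codes easier to match against a document that lists them in uppercase.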
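Alarm Clearance
After this alarm has recovered, select "Clear network management alarm" as the clearance method when closing the ticket.
Reference Information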
| Level | Fault handling type | Fault description | Training services: impact and handling suggestions | Inference services: impact and handling suggestions | Error codes |
|---|---|---|---|---|---|
| L1 | NotHandleFault | Chip fault can be ignored | No service impact; no action required. | No service impact; no action required. | 0x80E21007,0x80E38003,0x80F78006,0x80C98006,0x80CB8006,0x81318006,0x80A18006,0x80A18005,0x8C1F8609,0x80F38009,0x80CD8006,0x80CD8003,0x80A38006,0x80A38003,0x80A58006,0x80A58003,0x80DE1805,0x80F18006,0x80F18003,0x80DF8006,0x80E01805,0x80E18400,0x80E01809,0x80E18401,0x80E00209,0x80F38006,0x80F38003,0x80E18006,0x80D38009,0x819B800D,0x80DD8008,0x80DD8007,0x80B98006,0x80BD8006,0x819B8006,0x80DE1803,0x819D8000,0x81998006,0x81978006,0x81978004,0x815F8006,0x815F8004,0x81338006,0x81338004,0x817F8006,0x817F8004,0x816F8006,0x816F8004,0x814F8006,0x814F8004,0x81938006,0x81938004,0x81478006,0x81478004,0x813B8006,0x813B8004,0x81578006,0x81578004,0x81958006,0x81958004,0x81078603,0x8C2FA009,0xA4025021,0xA60250C1,0xA4025081,0xA214000D,0xA414000D,0xA4028801,0xA4025101,0xA2140007,0xA4140007,0xA2140008,0xA4140008,0xA40250E1,0xA214000A,0xA414000A,0xA4025061,0xA4025041,0xA214000B,0xA414000B,0xA414000C,0xA2140009,0xA4140009,0xA4303002,0x80B78006,0x80B78005,0x80E1800F,0x819B8003,0x80FB8000,0x81B78009,0x814D8006,0x80DE0200,0x80DF8400,0x80DF8401,0x81B18605,0x81B58004,0x80F78009,0x4C1F8608,0x8C1F8608,0x8C1F860B,0x8C1F860A |
| L2 | RestartRequest | Chip can self-heal automatically at the low level | Service affected. Suggested handling: stop the training job, wait 90 s, then check the chip status. | Service affected. Suggested handling: do not stop the inference process; re-run the inference request, wait 90 s, then check the chip status. | 0x80C98008,0x80C98002,0x80C98003,0x80C98009,0x80CB8002,0x80CB8008,0x80CB8009,0x80CF8003,0x81318008,0x80D58000,0x80D58009,0x80D98008,0x80DB800A,0x80DB8000,0x80DD8000,0x80DD8003,0x80C98000,0x81B18008,0x81B1800D,0x81B18603 |
| L3 | RestartBusiness | Chip self-heals with cooperation from the service | Proactively notify the customer and confirm whether the service is affected. If it is not affected, handle the fault after the job finishes; if it is affected, handle it immediately. Suggested handling: stop the training job, wait 90 s, then check the chip status. | Proactively notify the customer and confirm whether the service is affected. Service affected. Suggested handling: migrate the inference instance processes off this node, wait 90 s, then check the chip status. | 0x8C204E00,0xA8028802,0xA4302003,0xA4302004,0xA4302005,0xA4302006,0xA4302009,0xA430200A,0xA6301002,0xB406009C,0xB4060008,0xB4060009,0xB406000E,0xA60250A1,0xA2301002,0xA2303001,0xB4060006,0xB4060007,0xB406000D,0xB4060014,0xB4060010,0xB4060011,0x80E01801,0x80C9800A,0x80CB800A,0x81AF8009,0x81AF8004,0x80C98005,0x80CF8009,0x80CF8008,0x80E38009,0x80CB8005 |
| L4 | FreeRestartNPU | Chip is sub-healthy | The current service is not affected. Reset the chip once it is idle: npu-smi set -t reset -i id -c chip_id | The current service is not affected. Reset the chip once it is idle: npu-smi set -t reset -i id -c chip_id | 0x8C0E4E00,0x8C104E00,0x8C0C4E00,0x8C044E00,0x8C064E00,0x8C17A005,0x8C1DA005,0x8C19A005,0x80E58E03,0x80E58E02,0xA4193217,0xA4193218,0xA42A0000,0xA42F3917,0xA42F3918,0x80818C06,0x81A3880C,0x80DE0207,0x80E44E00,0x8C084E00,0x819B8605,0x81078605,0x81AD8605,0x8C464E00,0x80E20207,0x8C0A4E00,0x8C124E00 |
| L5 | RestartNPU | Chip fault can be recovered by a reset | Service affected. Reset the chip directly and re-run the service. Chip reset command: npu-smi set -t reset -i id -c chip_id, where id is the card id in the alarm and chip_id is the device id in the alarm. | Service affected. Reset the chip directly and re-run the service. Chip reset command: npu-smi set -t reset -i id -c chip_id, where id is the card id in the alarm and chip_id is the device id in the alarm. | 0x8C03A000,0x8C1FA006,0x8C2FA001,0x40F84E00,0x80E24E00,0x80E21E01,0x80E38008,0x80E3A202,0x80E3A203,0x80E39200,0x80E2120D,0x80E78000,0x80E78008,0x80FA4E00,0x812E4E00,0x80C78008,0x80F78008,0x80F78003,0x80E18404,0x80C98001,0x80FB8005,0x80A18008,0x80CD8008,0x80A38008,0x80A58008,0x80DE1801,0x80F18008,0x80F18000,0x80F1800A,0x80CF8000,0x80DF8000,0x80DF8009,0x80DF8008,0x80DF800A,0x80F38008,0x80F2180D,0x80E18005,0x80E18008,0x80E1800A,0x80E21008,0x80B98000,0x80B98008,0x80BD8008,0x80BD8000,0x80BD8003,0x80BD8009,0x80BB8008,0x80BB8000,0x80BB8009,0x80BB8003,0x80BB800A,0x81998009,0x81998008,0x81978008,0x815F8008,0x81338008,0x817F8008,0x816F8008,0x814F8008,0x81938008,0x81478008,0x813B8008,0x81578008,0x81958008,0xA2141004,0xA2141006,0xA2142004,0xA2142006,0xA2145004,0xA4183200,0xA6023001,0xA6023002,0xA6023003,0xA6023004,0xA6060000,0xA6060001,0xA6060002,0xA6060003,0xA6060004,0xA6060005,0xA606000A,0xA606000B,0xA606000C,0xA606000F,0xA606009D,0xA6060FFF,0xA607FFFF,0xA6140001,0xA6140002,0xA6140003,0xA6140004,0xA6140005,0xA6140006,0xA6141003,0xA6142003,0xA6143003,0xA6144003,0xA6145003,0xA6192D15,0xA6193206,0xA6193215,0xA6193248,0xA62F3905,0xA62FFFFF,0xA6303003,0xA6303004,0xA6360000,0xA6361000,0xA6362000,0xA8021004,0xA8060FFF,0xA807FFFF,0xA82A0000,0x80B78000,0x80B58000,0x81498004,0x80F78C02,0x80F78C03,0x80F78C04,0x80E3A207,0x81AB800D,0x81AB8003,0x81AB8008,0x81AF8008,0x81AF8000,0x81B78004,0x81B58002,0x819B800A,0x80E18000,0x80CB8001,0x80818C00 |
| L6 | SeparateNPU | Chip fault requires replacement | Service affected. Isolate the node and restart the job. Submit a ticket so that O&M can perform HA on the faulty node. | Service affected. Isolate the node and restart the job. Submit a ticket so that O&M can perform HA on the faulty node. | 0x80E3A201,0x80E18402,0x80E0020B,0x81978002,0x815F8002,0x81338002,0x817F8002,0x816F8002,0x814F8002,0x81938002,0x81478002,0x813B8002,0x81578002,0x81958002,0x9419321B,0xA2301000,0xA2301001,0xA2302001,0xA4192C1A,0xA4193216,0xA419321B,0xA419321C,0xA42F390F,0xA42F3916,0xA42F391A,0xA6183207,0xA62F3934,0xA8028801,0xA819320F,0xA8193234,0xA8193235,0x80818C05,0x80DF8402,0x80E1880C |
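The table above is effectively a lookup from error code to handling level. A minimal Python sketch of that lookup follows. It carries only two codes per level as an excerpt (a real lookup would need every code listed above), and the names `LEVELS`, `CODE_TO_LEVEL`, `classify`, and `reset_command` are hypothetical. Treating the highest level as the deciding one when several codes are reported together is an assumption made for illustration, not something this document states. The reset command string is the one given in the L4 and L5 rows; the sketch only builds the string and does not execute it.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Level -> fault handling type, as listed in the table.
LEVELS: Dict[int, str] = {
    1: "NotHandleFault",
    2: "RestartRequest",
    3: "RestartBusiness",
    4: "FreeRestartNPU",
    5: "RestartNPU",
    6: "SeparateNPU",
}

# Excerpt only: two codes per level copied from the table.
_EXCERPT: Dict[int, List[str]] = {
    1: ["0x80E21007", "0x80E38003"],
    2: ["0x80C98008", "0x80C98002"],
    3: ["0x8C204E00", "0xA8028802"],
    4: ["0x8C0E4E00", "0x8C104E00"],
    5: ["0x8C03A000", "0x8C1FA006"],
    6: ["0x80E3A201", "0x80E18402"],
}

# Keys are integers so that hex letter case does not matter on lookup.
CODE_TO_LEVEL: Dict[int, int] = {
    int(code, 16): level for level, codes in _EXCERPT.items() for code in codes
}


def classify(codes: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (level, handling type) for the highest level among known codes.

    Assumption: when several codes are reported, the highest level decides.
    Returns None if no code is in the lookup.
    """
    levels = []
    for code in codes:
        value = int(code, 16)
        if value in CODE_TO_LEVEL:
            levels.append(CODE_TO_LEVEL[value])
    if not levels:
        return None  # unknown code: consult the documents for the HDK version in use
    top = max(levels)
    return top, LEVELS[top]


def reset_command(card_id: int, device_id: int) -> str:
    """Build the reset command from the L4/L5 rows (string only, not executed)."""
    return f"npu-smi set -t reset -i {card_id} -c {device_id}"
```

For example, `classify(["0x80E21007", "0x80E3A201"])` yields level 6 under the highest-level assumption, and `reset_command(0, 1)` fills the card id and device id from the alarm into the command string.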