Updated on 2025-08-01 GMT+08:00

Periodic Stress Test on Lite Server Supernodes

Scenario

For Snt9b23 supernodes, you can run periodic performance tests and fault diagnosis on Huawei Cloud AI servers to detect NPU faults early and reduce their impact on services.

Table 1 Performance test

Scenario

Description

Bandwidth test

Test the bus bandwidth, memory bandwidth, and total time consumption.

Compute test

Construct a matrix multiplication A(m,k)*B(k,n) and execute it a certain number of times. Calculate the AI Core compute and the real-time power at full computational load of the entire PU or processor based on the computation workload and the time taken to execute the matrix multiplication.

Power consumption test

Detect the power consumption of the entire PU by running the single-operator model.

Eye pattern test

Test the network and query the current signal quality.

Stream test

One-click traffic testing and custom traffic testing are supported.

Software and hardware version compatibility test

The software and hardware compatibility tool obtains the hardware information, architecture, driver version, firmware version, and software version.

Table 2 Fault diagnosis

Scenario

Description

Network diagnosis

Diagnose the health status of the network and output the diagnosis result.

Signal quality diagnosis

Diagnose the signal quality and output the diagnosis result.

On-chip memory diagnosis

Diagnose the high bandwidth memory and output the diagnosis result.

On-chip memory stress test

Run a stress test on the high-bandwidth memory and output the diagnosis result.

On-chip memory high-risk address stress test

Run a stress test on the high-risk addresses of the high bandwidth memory and output the diagnosis result.

AI Core diagnosis

Diagnose the AI Core error and output the diagnosis result.

AI FLOPs diagnosis

Diagnose the chip compute and output the test result.

Bandwidth diagnosis

Diagnose the local bandwidth and output the diagnosis result.

P2P stress test

Check whether the HCCS communication link from the source device to the target device has hardware faults and output the test result.

Power consumption stress test

Perform the EDP/TDP power consumption stress test and output the diagnosis result.

Constraints

  • Only Snt9b23 supernodes are supported.
  • Ascend DMI is used for the stress tests. Starting multiple processes on the same device to collect performance data is not supported. If a multi-process test is performed, the result may be inaccurate or unpredictable.
  • Performance tests and fault diagnosis affect training and inference services. Ensure that no service is running before you start.
  • To ensure the correctness and accuracy of the test result, run each detection command separately.
  • Ascend DMI can check only the NPUs that are properly installed. To ensure the accuracy of the test result, run the npu-smi info command first to confirm that the NPUs are detected.
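
The constraints above (check the NPUs with npu-smi info first, then run each detection command on its own) can be sketched as a small sequential runner. This is a minimal sketch and not part of Ascend DMI: run_checks and the injected runner callable are hypothetical names, and the sketch assumes that a non-zero exit code means the step failed.

```python
import shlex
import subprocess
from typing import Callable, List, Sequence, Tuple


def default_runner(argv: Sequence[str]) -> int:
    """Run one command to completion and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_checks(
    commands: Sequence[str],
    runner: Callable[[Sequence[str]], int] = default_runner,
) -> List[Tuple[str, int]]:
    """Run `npu-smi info` first, then each command one at a time.

    Commands are never started in parallel, because multi-process testing
    on the same device is not supported. The run stops at the first
    non-zero exit code (an assumption about how failures are reported).
    """
    results: List[Tuple[str, int]] = []
    for command in ["npu-smi info", *commands]:
        code = runner(shlex.split(command))
        results.append((command, code))
        if code != 0:
            break
    return results
```

Passing a fake runner makes it possible to dry-run the sequence on a machine without NPUs before scheduling it on a supernode.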

Performance Test 1: Bandwidth Test

Test the bus bandwidth, memory bandwidth, and total time consumption. Table 3 describes the parameters of the bandwidth test command.

ascend-dmi --bw -h
Table 3 Parameters for the bandwidth test

Parameter

Description

Mandatory

[-bw, --bw, --bandwidth]

This parameter is used to test the bandwidth of the chip. -bw is supported, but --bw and --bandwidth are recommended.

Yes

[-t, --type]

Specify the data flows to be tested. The data flows are classified into the directions listed below. If this parameter is not specified, the bandwidth and total duration in the h2d, d2h, and d2d directions are returned by default.

  • h2d: Data is transferred from the host memory to the device memory through the PCIe bus. The overall bandwidth and total duration are tested.
  • d2h: Data is transferred from the device memory to the host memory through the PCIe bus. The overall bandwidth and total duration are tested.
  • d2d: Data is transferred from the device memory to the same device memory. (This mode is used to test the device memory bandwidth.) The overall bandwidth and total duration are tested.
  • p2p: Data is transferred from the specified source device to the target device. The transfer rate and total duration are tested.

No

[-s, --size]

Specify the size of the data to be transferred and the test result display mode. For supernodes, the transfer size ranges from 1 byte to 4 GB in d2h, h2d, and p2p modes.

  • The -s parameter must be followed by a number, which specifies the size of the data to be transferred. If no number is specified, the syntax is incorrect.
    • In h2d, d2h, d2d, and p2p modes where -ds and -dd are specified, specifying -s selects the fixed-length mode. If -s is not specified, the step mode is used, and the data transfer size ranges from 2 bytes to 32 MB by default.

No

[-et, --et, --execute-times]

Specify the number of iterations, that is, the number of memory copy times. The value ranges from 1 to 1000. If this parameter is not specified, the default value 5 is used in step mode, and the default value 40 is used in fixed-length mode.

No

[-d, --device]

Specify the ID of the device whose bandwidth needs to be tested. The device ID is the logical ID of the Huawei Cloud AI processor. If the device ID is not specified, the bandwidth information of device 0 is returned by default.

No

[-ds, --ds, --device-src]

Specify the ID of the source device for a P2P test. This parameter must be specified together with [-dd, --dd, --device-dst]. Otherwise, all Huawei Cloud AI NPU chips are tested.

No

[-dd, --dd, --device-dst]

Specify the ID of the target device for a P2P test. This parameter must be specified together with [-ds, --ds, --device-src]. Otherwise, all Huawei Cloud AI NPU chips are tested.

No

[-fmt, --fmt, --format]

Specify the output format. The value can be normal or json. If not specified, the default value normal is used.

No

[-q, --quiet]

If this parameter is specified, the foolproof (confirmation) prompt is not displayed and the operation is allowed by default.

No
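
The constraints in Table 3 can be checked before a command is launched. The following is a minimal sketch under stated assumptions: build_bandwidth_command is a hypothetical helper, it only covers the options and ranges listed in the table, and it does not check the upper bound of -s because that bound depends on the mode.

```python
from typing import List, Optional

VALID_TYPES = {"h2d", "d2h", "d2d", "p2p"}


def build_bandwidth_command(
    flow_type: Optional[str] = None,
    device: Optional[int] = None,
    size: Optional[int] = None,
    execute_times: Optional[int] = None,
    device_src: Optional[int] = None,
    device_dst: Optional[int] = None,
    quiet: bool = False,
) -> List[str]:
    """Return the argv for a bandwidth test, applying the Table 3 rules."""
    argv = ["ascend-dmi", "--bw"]
    if flow_type is not None:
        if flow_type not in VALID_TYPES:
            raise ValueError(f"unsupported type: {flow_type}")
        argv += ["-t", flow_type]
    # -ds and -dd are only meaningful as a pair.
    if (device_src is None) != (device_dst is None):
        raise ValueError("-ds and -dd must be specified together")
    if device is not None:
        argv += ["-d", str(device)]
    if device_src is not None:
        argv += ["-ds", str(device_src), "-dd", str(device_dst)]
    if size is not None:
        if size < 1:
            raise ValueError("size must be a positive number of bytes")
        argv += ["-s", str(size)]
    if execute_times is not None:
        if not 1 <= execute_times <= 1000:
            raise ValueError("execute times must be in 1..1000")
        argv += ["-et", str(execute_times)]
    if quiet:
        argv.append("-q")
    return argv
```

For example, build_bandwidth_command(flow_type="d2d", device=0) produces the argument list for a device-to-device test on device 0.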

The following uses data transmission from a device to the same device as an example to describe how to test the bandwidth and total duration.

ascend-dmi --bw -t d2d -d 0
Figure 1 Bandwidth test example
Table 4 Parameters for the bandwidth test output

Parameter

Description

Host to Device Test

Data flow direction of the bandwidth. The possible outputs are as follows:

  • Host to Device Test
  • Device to Host Test
  • Device to Device Test
  • Unidirectional Peer to Peer Test
  • Bidirectional Peer to Peer Test

Device X: Ascend XXX

X indicates the ID of the device to be tested, and XXX indicates the processor type. 0 indicates the source device, and 1 indicates the target device.

ID

0→1 indicates the unidirectional P2P bandwidth from device 0 to device 1.

0↔1 indicates the bidirectional P2P bandwidth between device 0 and device 1.

Size(Bytes)

Size of the data to be transferred, in bytes.

Execute Times

Number of iterations.

Bandwidth(GB/s)

Bandwidth of the chip.

Elapsed Time(us)

Total execution duration.

Performance Test 2: Compute Test

Construct a matrix multiplication A(m,k)*B(k,n) and execute it a certain number of times. Calculate the AI Core compute and the real-time power at full computational load of the entire PU or processor based on the computation workload and the time taken to execute the matrix multiplication. For details about the parameters for compute test, see Table 5.

Table 5 Parameters for the compute test

Parameter

Description

Mandatory

[-f, --flops]

This parameter is used to test the compute capability of the entire PU or chip.

Yes

[-t, --type]

Specify the operator type. The value can be fp16, fp32, hf32, bf16, or int8. The default value is fp16.

No

[-d, --device]

Specify the device ID. The entire PU where the device ID is located is tested. The device ID is the logical ID of the Huawei Cloud AI chip. If this parameter is not specified, the compute of device 0 is returned by default.

No

[-et, --et, --execute-times]

Specify the number of times that the matrix multiplication is executed on a single AI Core of the chip.

  • Training: If this parameter is not specified, the default value 60 is used. In this case, the unit is hundred thousand, and the value ranges from 10 to 80.
  • Inference: If this parameter is not specified, the default value 10 is used. In this case, the unit is million, and the value ranges from 10 to 80.

No

The following shows an example of running the compute test with the int8 operator on device 7, with 6 million matrix multiplication executions (-et 60, in units of hundred thousand).

ascend-dmi -f -t int8 -d 7 -et 60 -q
Figure 2 Compute test example
Table 6 Parameters for the compute test output

Parameter

Description

Device

Device ID.

Execute Times

Number of times that a single AI Core performs matrix multiplication multiplied by the number of AI Cores.

Duration(ms)

Time taken to perform matrix multiplication multiple times.

TFLOPS@FP16

Compute obtained through compute test. FP16 is the specified operator run type.

Power(W)

Real-time power at full compute.
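
The relationship between the computation workload, the execution time, and the reported compute can be sketched as follows. This is an illustration only: the matrix dimensions m, k, and n used by the tool are not stated here, so they are hypothetical inputs, and the sketch uses the usual convention that one A(m,k)*B(k,n) multiplication costs 2*m*k*n operations. For int8, the result is operations per second rather than floating-point operations.

```python
def executions_from_et(et: int, scenario: str = "training") -> int:
    """Convert an -et value to the number of executions per AI Core.

    Training uses units of hundred thousand, inference units of million,
    and the value must be in the range 10 to 80.
    """
    if not 10 <= et <= 80:
        raise ValueError("-et must be in 10..80")
    unit = {"training": 100_000, "inference": 1_000_000}[scenario]
    return et * unit


def matmul_tera_ops(m: int, k: int, n: int,
                    total_executions: int, duration_ms: float) -> float:
    """Tera-operations per second for repeated A(m,k)*B(k,n) products.

    total_executions corresponds to the Execute Times output column
    (executions per AI Core multiplied by the number of AI Cores).
    """
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    operations = 2 * m * k * n * total_executions
    return operations / (duration_ms / 1000.0) / 1e12
```

With -et 60 in a training scenario, executions_from_et(60) gives 6,000,000 executions per AI Core, which matches the example above.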

Performance Test 3: Power Consumption Test

Detect the power consumption of the entire PU by running the single-operator model. For details about the parameters for power consumption test, see Table 7.

ascend-dmi -p -h
Table 7 Parameters for power consumption test

Parameter

Description

Mandatory

[-p, --power]

This parameter is used to test the power consumption of the entire PU.

Yes

[-t, --type]

Specify the operator type. The value can be fp16 or int8. The default value is fp16.

No

[-pt, --pt, --pressure-type]

Specify the stress test type.

  • Currently, the following types are supported:
    • Estimated design power (edp): EDP power consumption stress test
    • Thermal design power (tdp): TDP power consumption stress test
  • It can be used together with --dur, --it, --pm, and -q.
  • It cannot be used together with -t.
  • If this parameter is not specified, the power consumption of the entire PU is tested by default.

No

[-dur, --dur, --duration]

Specify the running time. If this parameter is not specified, the default value 600 is used. The unit is second. The value ranges from 60 to 604800.

No

[-it, --it, --interval-times]

Specify the interval for refreshing the screen information. If this parameter is not specified, the default value 5 is used. The unit is second. The value ranges from 1 to 5.

No

[--skip-check]

If this parameter is specified, the device health check is skipped. Otherwise, the device health status is checked by default.

No

[-pm, --pm, --print-mode]

Specify the screen output mode. If this parameter is not specified, the default value refresh is used. Print modes:

  • refresh: Clear the historical information each time the information is printed.
  • history: Print the saved historical information.

    Note: In refresh mode, if there are a large number of chips, you are advised to reduce the font size so that all results are displayed on one screen. Otherwise, the display may be abnormal and some content may be repeatedly printed.

No
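
The value ranges and the mutual exclusion between -t and -pt in Table 7 can be validated before the test starts. The following is a minimal sketch; build_power_command is a hypothetical helper that only encodes the rules listed in the table.

```python
from typing import List, Optional


def build_power_command(
    op_type: Optional[str] = None,
    pressure_type: Optional[str] = None,
    duration: Optional[int] = None,
    interval: Optional[int] = None,
    print_mode: Optional[str] = None,
    skip_check: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Return the argv for a power consumption test per Table 7."""
    if op_type is not None and pressure_type is not None:
        raise ValueError("-pt cannot be used together with -t")
    argv = ["ascend-dmi", "-p"]
    if op_type is not None:
        if op_type not in ("fp16", "int8"):
            raise ValueError("type must be fp16 or int8")
        argv += ["-t", op_type]
    if pressure_type is not None:
        if pressure_type not in ("edp", "tdp"):
            raise ValueError("pressure type must be edp or tdp")
        argv += ["-pt", pressure_type]
    if duration is not None:
        if not 60 <= duration <= 604800:
            raise ValueError("duration must be in 60..604800 seconds")
        argv += ["--dur", str(duration)]
    if interval is not None:
        if not 1 <= interval <= 5:
            raise ValueError("interval must be in 1..5 seconds")
        argv += ["--it", str(interval)]
    if print_mode is not None:
        if print_mode not in ("refresh", "history"):
            raise ValueError("print mode must be refresh or history")
        argv += ["--pm", print_mode]
    if skip_check:
        argv.append("--skip-check")
    if quiet:
        argv.append("-q")
    return argv
```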

The following shows an example of running the test for 60 seconds with a 5-second refresh interval in refresh mode.

ascend-dmi -p --dur 60 --it 5 --pm refresh
Figure 3 Power consumption test example

Table 8 Parameters for power consumption test output

Parameter

Description

Type

PU model

Card

PU ID

Chip

Processor No.

Name

Processor name

Type

Processor model

Chip Name

Chip name

NPU Count

Number of NPUs

Power

Actual power consumption of the entire PU or chip.

Health

Processor health status

Temperature

Current processor temperature

Device ID

Device logical ID of the processor

AI Core Usage

AI Core usage

Voltage

Current processor voltage

Frequency

Current processor frequency

Performance Test 4: Eye Pattern Test

Test the network and query the current signal quality. This function is used to query the specific signal quality. To check whether the signal quality of the current port is normal, perform signalQuality diagnosis. If CDR loopback has been configured for an NPU, disable the loopback before performing the eye pattern test. For details about the parameters for eye pattern test, see Table 9.

ascend-dmi --sq -h
Table 9 Parameters for the eye pattern test

Parameter

Description

Mandatory

[-sq, --sq, --signal-quality]

Query the signal quality of the PCIe, HCCS, and RoCE communication ports on the NPU.

Yes

[-d, --device]

Specify the device ID to be queried. If multiple device IDs are specified, use commas (,) to separate them. If this parameter is not specified, all NPUs on the device are queried by default.

No

[-t, --type]

Specify the type of the communication port. Currently, HCCS and RoCE are supported. If multiple communication port types are specified, use commas (,) to separate them. If this parameter is not specified, the signal quality of the RoCE communication port is queried.

No

The following shows an example of querying the HCCS and RoCE signal quality of device 0 and device 1.

Figure 4 Eye pattern test example
Table 10 Parameters of the HCCS signal quality

Parameter

Description

type

Specify the type of the communication port.

device

Logical ID of an NPU.

M* (macro port)

Macro port, for example, M0 and M1 indicate macro ports 0 and 1, respectively.

L* (LANE)

Lane number of the HCCS link, for example, L0 and L1 indicate lane 0 and lane 1, respectively.

S (SNR)

Signal-to-noise ratio (SNR) of a lane.

H (HEH)

Half eye height of a lane.

Table 11 Parameters of the RoCE signal quality

Parameter

Description

type

Specify the type of the communication port.

device

Logical ID of the NPU.

M* (macro port)

Macro port, for example, M0 indicates macro port 0.

S (SNR)

Signal-to-noise ratio (SNR) of a lane.

H (HEH)

Half eye height of a lane.

L* (LANE)

Lane number of the RoCE link, for example, L0 and L1 indicate lane 0 and lane 1, respectively.

Performance Test 5: Stream Test

One-click traffic testing and custom traffic testing are supported.

Table 12 Stream test

Test Name

Supported Traffic Mode

Usage

One-click traffic testing

Stream test in a CDR loopback or fiber optic circulator (loopback device)

Execute the one-click traffic test command. The Ascend DMI tool automatically sends and receives the streams of all lanes of the specified device. After a period of time, the streams are disabled and the results are queried.

Custom traffic testing

Stream test in a CDR loopback, fiber optic circulator (loopback device), and traffic test using direct connection between NPUs

Custom traffic testing splits one-click traffic testing into individual steps. You can flexibly control the TX and RX directions and specify the lanes for traffic testing.

There are three traffic testing modes:

  • CDR loopback traffic test: A single device sends and receives traffic at the same time. This test can be used to check the signal quality from the physical SerDes port of the NPU to the CDR unit. Before starting traffic testing, ensure that the optical module is in position. Then, run the commands below to configure or disable CDR loopback.

    Run the commands below to configure CDR loopback. In the commands, -t takes the values 3 and then 0, in that order, and -i specifies the NPU ID.

    hccn_tool -i 0 -scdr -t 3

    hccn_tool -i 0 -scdr -t 0

    Run the commands below to disable CDR loopback. In the commands, -t takes the values 2 and then 1, in that order, and -i specifies the NPU ID.

    hccn_tool -i 0 -scdr -t 2

    hccn_tool -i 0 -scdr -t 1

  • Traffic testing using a fiber optic circulator (loopback device) connected to an optical module: A single device sends and receives traffic at the same time. This method can be used to check the signal quality of the physical SerDes port of the NPU to the optical module. No loopback needs to be set.
  • Traffic testing using direct connection between NPUs: After traffic is sent in the TX direction of the SerDes port on NPU A, data flows reach the SerDes port on NPU B through the tested link. NPU B compares the received data with the code pattern in the RX direction and collects statistics on bit errors. This method can be used to check the signal quality of the link between two NPUs (only custom traffic testing is supported).
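
The CDR loopback command sequences described above can be generated per NPU so that the order of the -t values is not mixed up. This is a minimal sketch: cdr_loopback_commands is a hypothetical helper that only formats the hccn_tool commands shown in this section and does not run them.

```python
from typing import List

# Order of -t values taken from the text above.
CONFIGURE_SEQUENCE = (3, 0)
DISABLE_SEQUENCE = (2, 1)


def cdr_loopback_commands(npu_id: int, enable: bool) -> List[str]:
    """Return the hccn_tool commands, in order, for one NPU."""
    if npu_id < 0:
        raise ValueError("NPU ID must be non-negative")
    values = CONFIGURE_SEQUENCE if enable else DISABLE_SEQUENCE
    return [f"hccn_tool -i {npu_id} -scdr -t {t}" for t in values]
```

Calling cdr_loopback_commands(0, enable=False) before an eye pattern test yields the two commands that disable the loopback on NPU 0.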

For details about parameters for stream tests, see Table 13.

ascend-dmi --prbs-check -h
Table 13 Parameters for stream test

Parameter

Description

Mandatory

[-pc, --pc, --prbs-check]

This parameter is used for PRBS stream tests.

Yes

[-d, --device]

Specify the ID of the device for which the stream test is to be performed.

  • The device ID is the logical ID of the Huawei Cloud AI processor. If this parameter is not specified, the stream of all Huawei Cloud AI NPUs is tested.
  • Multiple device IDs can be specified at the same time. Use commas (,) to separate them.

No

[-dur, --dur, --duration]

Specify the duration of the stream test.

  • The value ranges from 3 to 10, in seconds.
  • If this parameter is not specified, the default value 3 is used.

No

[--prbs-mode]

Whether to switch the traffic testing status.

EN: enabled

DS: disabled

  • The value is case-sensitive.
  • If --prbs-mode is set to EN or DS, the configuration takes effect in both the signal TX and RX directions, regardless of whether --generator-pattern, --generator-lanes, --checker-pattern, or --checker-lanes is specified.
  • If --prbs-mode is set to EN, --generator-pattern, --checker-pattern, --generator-lanes, and --checker-lanes can be specified.
  • If --prbs-mode is set to DS, the traffic testing stops. In this case, --generator-pattern, --checker-pattern, --generator-lanes, and --checker-lanes cannot be specified.
  • This parameter cannot be specified together with --show or --clear.

Yes

[--generator-pattern]

Specify the stream type of the TX end.

  • Currently, the following stream types are supported: prbs7, prbs9, prbs10, prbs11, prbs15, prbs20, prbs23, and prbs31.
  • If this parameter is not specified, the default value prbs31 is used.
  • The parameter value is case-insensitive. For example, prbs7 and PRBS7 both are supported.
  • This parameter cannot be specified together with --show or --clear.

No

[--generator-lanes]

Specify the lane of the TX end.

  • You can specify one or more lanes at a time. Use commas (,) to separate multiple lanes. If multiple lanes are specified, the lanes must be consecutive, for example, 0, 1, 2 or 2, 1, 3. Non-consecutive lanes are not supported.
  • If this parameter is not specified, all lanes are tested by default.
  • This parameter cannot be specified together with --show or --clear.
  • The value can be 0, 1, 2, or 3.

No

[--checker-pattern]

Specify the stream type of the RX end.

  • Currently, the following types are supported: prbs7, prbs9, prbs10, prbs11, prbs15, prbs20, prbs23, and prbs31.
  • If this parameter is not specified, the default value prbs31 is used.
  • The parameter value is case-insensitive. For example, prbs7 and PRBS7 both are supported.
  • This parameter cannot be specified together with --show or --clear.

No

[--checker-lanes]

Specify the lane of the RX end.

  • You can specify one or more lanes at a time. Use commas (,) to separate multiple lanes. If multiple lanes are specified, the lanes must be consecutive, for example, 0, 1, 2 or 2, 1, 3. Non-consecutive lanes are not supported.
  • If this parameter is not specified, all lanes are tested by default.
  • This parameter cannot be specified together with --show or --clear.
  • The value can be 0, 1, 2, or 3.

No

[-show, --show, --show-diagnostic-info]

Display the stream test result.

  • This parameter cannot be specified together with --clear, --prbs-mode, --generator-pattern, --generator-lanes, --checker-pattern, and --checker-lanes.
  • After the information is displayed, the result of the current stream test is cleared.

No

[-clear, --clear, --clear-diagnostic-info]

Clear the stream test result.

  • This parameter cannot be specified together with --show, --prbs-mode, --generator-pattern, --generator-lanes, --checker-pattern, and --checker-lanes.
  • You can specify other parameters at the same time.

No
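
The lane and option rules in Table 13 are easy to violate when commands are assembled by hand. The following is a minimal sketch of a pre-check under stated assumptions: parse_lanes and validate_prbs_options are hypothetical helpers, and the lane check treats "consecutive" as order-independent because the table lists both 0, 1, 2 and 2, 1, 3 as valid.

```python
from typing import Iterable, List, Optional

VALID_LANES = (0, 1, 2, 3)
TUNING_FLAGS = frozenset({
    "--generator-pattern", "--generator-lanes",
    "--checker-pattern", "--checker-lanes",
})


def parse_lanes(text: str) -> List[int]:
    """Parse a comma-separated lane list and check that it is consecutive."""
    lanes = [int(part) for part in text.split(",")]
    if any(lane not in VALID_LANES for lane in lanes):
        raise ValueError("lanes must be 0, 1, 2, or 3")
    ordered = sorted(lanes)
    expected = list(range(ordered[0], ordered[0] + len(ordered)))
    if ordered != expected:
        raise ValueError("lanes must be consecutive")
    return lanes


def validate_prbs_options(
    prbs_mode: Optional[str] = None,
    show: bool = False,
    clear: bool = False,
    tuning_flags: Iterable[str] = (),
) -> None:
    """Raise ValueError if the option combination breaks a Table 13 rule."""
    flags = set(tuning_flags)
    if flags - TUNING_FLAGS:
        raise ValueError("unknown pattern or lane option")
    if prbs_mode is not None and prbs_mode not in ("EN", "DS"):
        raise ValueError("--prbs-mode must be EN or DS (case-sensitive)")
    if show and clear:
        raise ValueError("--show and --clear cannot be combined")
    if (show or clear) and (prbs_mode is not None or flags):
        raise ValueError("--show/--clear exclude mode, pattern, and lane options")
    if prbs_mode == "DS" and flags:
        raise ValueError("pattern and lane options are not allowed with DS")
```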

The following is an example of one-click traffic testing:

ascend-dmi -pc -d 9 --pattern prbs15 -dur 5
Figure 5 One-click traffic testing example
Table 14 Parameters of one-click traffic testing output

Parameter

Description

device

Logical ID of the NPU.

lane

Lane ID of the RoCE link.

error count

Number of bit errors. The maximum value is 67092480, indicating full bit errors.

error rate

Bit error rate. If the bit error rate is less than 10^-5, the signal quality is normal.

alos

The value 0 indicates normal, and the value 1 indicates that the input signal amplitude is too low.

times

Traffic testing duration.
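
The fields in Table 14 can be combined into a single pass/fail decision per lane. The following is a minimal sketch, assuming the three values have already been read from the command output; lane_is_normal is a hypothetical helper, and requiring all three conditions at once is an assumption rather than a rule stated by the tool.

```python
FULL_BIT_ERRORS = 67092480       # maximum error count (full bit errors)
MAX_NORMAL_ERROR_RATE = 1e-5     # signal quality is normal below this rate


def lane_is_normal(error_count: int, error_rate: float, alos: int) -> bool:
    """Return True if a lane looks healthy according to Table 14."""
    return (
        alos == 0
        and error_count < FULL_BIT_ERRORS
        and error_rate < MAX_NORMAL_ERROR_RATE
    )
```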

The following shows an example of custom traffic testing:

# Clear any previous stream test results on Device8 and Device9.
ascend-dmi -pc --clear --device 8,9 -q
# Enable the stream test on Device8 and Device9. The TX ends are lane 0 and lane 1, and the code pattern is PRBS20. The RX ends are lane 2 and lane 3, and the code pattern is PRBS23.
ascend-dmi -pc --prbs-mode EN -q --device 8,9 --generator-pattern prbs20 --generator-lanes 0,1 --checker-pattern prbs23 --checker-lanes 2,3
# Display the stream test results of Device8 and Device9.
ascend-dmi -pc --show-diagnostic-info -d 8,9 -q
# Disable traffic testing on Device8 and Device9.
ascend-dmi -pc --prbs-mode DS -d 8,9 -q
# Clear the traffic testing results on Device8 and Device9.
ascend-dmi -pc --clear-diagnostic-info -d 8,9 -q
Figure 6 Custom traffic testing example
Table 15 Parameters of custom traffic testing output

Parameter

Description

Lane

Lane ID of the RoCE link.

Check Enable

Check status of the RX end. The value 0 indicates disabled and 1 indicates enabled.

Pattern

Code pattern of check in the RX direction.

Error-Bits

Number of bit errors. The upper limit is 67092480 (full bit errors).

Bit-Error Rate (BER)

Bit error rate, which is the number of bit errors divided by the total number of transmitted bits multiplied by 100%.

ALOS

The value must be 0 for normal traffic testing. If the value is 1, the signal amplitude is too low. If no traffic test is running, this parameter is meaningless.

Period

Time when traffic testing was last controlled or the check result was last read.
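
The Bit-Error Rate definition in Table 15 can be written out as arithmetic. This is a minimal sketch: ber_percent and is_saturated are hypothetical helpers, and the total number of transmitted bits is taken as an input because this section does not say how to derive it.

```python
ERROR_BITS_LIMIT = 67092480  # upper limit of Error-Bits (full bit errors)


def ber_percent(error_bits: int, total_bits: int) -> float:
    """Bit error rate as a percentage: error bits / total bits * 100."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    if error_bits < 0:
        raise ValueError("error_bits must be non-negative")
    return error_bits / total_bits * 100.0


def is_saturated(error_bits: int) -> bool:
    """True if the counter reached its limit, so the real count may be higher."""
    return error_bits >= ERROR_BITS_LIMIT
```

For example, 5 error bits in 1,000,000 transmitted bits give 0.0005%. To compare against a fractional threshold, divide the percentage by 100.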

Performance Test 6: Software and Hardware Version Compatibility Test

The software and hardware compatibility tool obtains the hardware information, architecture, driver version, firmware version, and software version. For details about the parameters for software and hardware compatibility tests, see Table 16.

ascend-dmi -c -h
Table 16 Software and hardware version compatibility test

Parameter

Description

Mandatory

[-c, --compatible]

This parameter is used to check the software and hardware version compatibility.

  • If driver 22.0.0 or later, or CANN 6.2.RC1 or later, has been installed, the -c parameter is used to check the compatibility between the NPU firmware and driver, and between the driver and CANN.
  • If the driver version is earlier than 22.0.0 and the CANN version is earlier than 6.2.RC1, the -c parameter is used to check whether the corresponding driver, firmware, and software package are installed.

Yes

[-p, --path]

You can specify the installation path of the CANN software package to be tested. If the installation path is not specified, the default installation path is used.

Example command for specifying the installation path of the software package:

ascend-dmi -c -p /home/xxx/Ascend

No

The following shows an example of the hardware and software version compatibility test:

ascend-dmi -c
Figure 7 Software and hardware version compatibility test example
Table 17 Parameters for software and hardware version compatibility test output

Parameter

Description

System Information

System information

Architecture

Architecture

Type

PU or chip model

Compatibility Check Result

Compatibility check result

Package

Package name

Version

Version

Status

Status. Possible values are as follows:

  • OK: compatible
  • INCOMPATIBLE PACKAGE: incompatible
  • NA: Unknown status. The software version may fail to be obtained.

    Non-root users cannot query the firmware compatibility, and the status of NPU firmware is displayed as NA.

Innerversion

Internal version number

Dependencies

Dependencies

Fault Diagnosis

View the parameters of the fault diagnosis command.

ascend-dmi --dg -h
Table 18 Fault diagnosis parameters

Parameter

Description

Mandatory

[-dg, --dg, --diagnosis]

This parameter is used to perform a fault diagnosis test on the entire PU.

Yes

[-i, --items]

Specify the diagnosis check items.

  • You can specify one or more of driver, CANN, device, network, bandwidth, AI FLOPs, HBM, and signalQuality. Use commas (,) to separate multiple items.
  • If this parameter is not specified, check items except AI Core and PRBS are diagnosed by default.

No

[-d, --device]

Specify the ID of the device to be diagnosed. The device ID is the logical ID of the Huawei Cloud AI chip.

  • You can specify one or more device IDs. Use commas (,) to separate them.
  • If no device ID is specified, the diagnosis results of all devices are returned by default.

No

[-r, --result]

Specify the path for storing the stress test result and information collection result, for example, /test. The specified path must meet security requirements and cannot contain the wildcard (*).

  • To specify a path for storing results, you need to create the ascend_check folder in the specified path. The path specified by user root will be created in the root directory, and the path specified by non-root users will be created in the $HOME directory.
  • If the path is not specified, the results will be stored in the default path, which is /var/log/ascend_check for user root and $HOME/var/log/ascend_check for non-root users.

No

[-s, --stress]

This parameter is used to perform stress tests. Currently, the following stress tests are supported: on-chip memory stress test, AI Core stress test, P2P stress test, and power consumption stress test.

  • If on-chip memory and power consumption are included, this parameter can be used together with the -st parameter. The time for performing the stress test is specified by --st.
  • If AI Core check item is included, this parameter can be used together with the -sc parameter. The number of performed stress tests is specified by --sc.
  • If bandwidth check is specified, this parameter can be used together with -t to perform a P2P stress test.

No

[-st, --st, --stress-time]

Specify the time for the EDP and TDP stress tests.

  • The value ranges from 60 to 604800, in seconds.
  • This parameter must be used together with [-s, --stress] when the EDP and TDP stress tests are included.
  • This parameter must be used together with [-s, --stress] when the on-chip memory diagnosis is included.

No

[-fmt, --fmt, --format]

Specify the output format. The value can be normal or json.

  • If not specified, the default value normal is used.
  • If this parameter is set to json, the stress test result will be stored in the ascend_check/environment_check_before.txt file. However, if json format is not specified, the fault diagnosis result will not be stored.

No

[-h, --help]

View the parameters of the fault diagnosis command.

No
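
The rules in Table 18 for -s, -st, and the default result path can be sketched as follows. This is a minimal sketch under stated assumptions: build_diagnosis_command and default_result_dir are hypothetical helpers, item names are passed through unchecked, and only the constraints listed in the table are enforced.

```python
from typing import List, Optional, Sequence


def default_result_dir(is_root: bool, home: str) -> str:
    """Default location of the results when -r is not specified."""
    if is_root:
        return "/var/log/ascend_check"
    return f"{home}/var/log/ascend_check"


def build_diagnosis_command(
    items: Sequence[str] = (),
    devices: Sequence[int] = (),
    stress: bool = False,
    stress_time: Optional[int] = None,
    fmt: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Return the argv for a fault diagnosis run per Table 18."""
    argv = ["ascend-dmi", "-dg"]
    if items:
        argv += ["-i", ",".join(items)]
    if devices:
        argv += ["-d", ",".join(str(d) for d in devices)]
    if stress_time is not None:
        if not stress:
            raise ValueError("-st must be used together with -s")
        if not 60 <= stress_time <= 604800:
            raise ValueError("stress time must be in 60..604800 seconds")
    if stress:
        argv.append("-s")
    if stress_time is not None:
        argv += ["-st", str(stress_time)]
    if fmt is not None:
        if fmt not in ("normal", "json"):
            raise ValueError("format must be normal or json")
        argv += ["-fmt", fmt]
    if quiet:
        argv.append("-q")
    return argv
```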

Fault Diagnosis 1: Network Diagnosis

Diagnose the health status of the network and output the diagnosis result.

# Example of diagnosing the network health status of Device0
ascend-dmi -dg -i network -d 0
Figure 8 Network diagnosis example

The parameters in the command output are as follows:

  • PASS: The network is healthy.
  • SKIP: The current product form does not support this check item.
  • INFO: The network check result is informational.
  • WARN: The network check result is an alarm.
  • FAIL: The network check fails.

Fault Diagnosis 2: SignalQuality Diagnosis

Diagnose the signal quality and output the diagnosis result.

# Example of SignalQuality diagnosis
ascend-dmi -dg -i signalQuality -q
Figure 9 SignalQuality diagnosis example

The parameters in the command output are as follows:

  • PASS: The HCCS and RoCE communication ports on the NPU pass the test, and the signal quality is normal.
  • SKIP: The current device does not support eye pattern diagnosis.
  • IMPORTANT_WARN: Important warning. The signal quality of one or more of the HCCS and RoCE ports is abnormal. Contact Huawei engineers.
  • FAIL: The eye pattern detection fails.

Fault Diagnosis 3: On-Chip Memory Diagnosis

Diagnose the high bandwidth memory and output the diagnosis result.

# Example of on-chip memory diagnosis
ascend-dmi -dg -i hbm
Figure 10 On-chip memory diagnosis example
Table 19 Parameters for on-chip diagnosis output

Output Status

Description

PASS

The on-chip memory passes the check and is normal.

SKIP

The current hardware form does not support on-chip memory check.

GENERAL_WARN

There are isolated pages with historical multi-bit errors. The NPU chip triggering the alarm has a health management fault code of 0x80E18401. The NPU chip can still be used.

IMPORTANT_WARN

The number of isolated pages now differs from the previous count. Restart to reset the NPU chip.

EMERGENCY_WARN

  • There are too many isolated pages with historical multi-bit errors, and too many device isolation rows. The NPU chip has a health management fault code of 0x80E18402. Replace the faulty part.
  • If there are four or more isolation rows in the same stack but different banks, the device is at high risk. Replace the spare part.
  • If the number of isolation rows in the same stack, same SID, and different PCs is greater than or equal to four, the device is at high risk. Replace the spare part.
  • If the number of isolation rows in the same stack, SID, PC, and bank is greater than 16, the device is at high risk. Replace the spare part.
  • If the number of different addresses is greater than five (excluding adjacent error addresses of four bits or less) in the same stack, SID, PC, and bank, the device is at high risk. Replace the spare part.

FAIL

The on-chip memory detection fails. Contact Huawei engineers.
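
When the diagnosis runs periodically, the statuses in Table 19 can be mapped to follow-up actions automatically. The following is a minimal sketch: the mapping restates the table, while hbm_action, needs_intervention, and the choice of which statuses count as requiring intervention are assumptions made for illustration.

```python
HBM_ACTIONS = {
    "PASS": "No action required.",
    "SKIP": "No action required; the hardware form does not support the check.",
    "GENERAL_WARN": "The NPU chip can still be used (fault code 0x80E18401).",
    "IMPORTANT_WARN": "Restart to reset the NPU chip.",
    "EMERGENCY_WARN": "Replace the faulty part.",
    "FAIL": "Contact Huawei engineers.",
}

# Assumption: these statuses should stop scheduling work on the device.
INTERVENTION_STATUSES = {"IMPORTANT_WARN", "EMERGENCY_WARN", "FAIL"}


def hbm_action(status: str) -> str:
    """Return the recommended action for an on-chip memory status."""
    key = status.strip().upper()
    if key not in HBM_ACTIONS:
        raise ValueError(f"unknown status: {status}")
    return HBM_ACTIONS[key]


def needs_intervention(status: str) -> bool:
    """True if the status calls for a restart, replacement, or support."""
    return status.strip().upper() in INTERVENTION_STATUSES
```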

Fault Diagnosis 4: On-Chip Memory Stress Test

Run a stress test on the high-bandwidth memory and output the diagnosis result.

# Example
ascend-dmi -dg -i hbm -s -st 60 -q
Figure 11 On-chip memory diagnosis example

Output parameters:

  • PASS: The on-chip memory stress test is passed.
  • SKIP: The current device does not support the on-chip memory stress test.
  • FAIL: The on-chip memory stress test fails because new multi-bit isolation pages are added. The software execution fails.

Fault Diagnosis 5: On-Chip Memory High-Risk Address Stress Test

Run a stress test on the high-risk addresses of the high bandwidth memory and output the diagnosis result.

Table 20 Parameters for on-chip memory high-risk address stress tests

Parameter

Description

Mandatory

[-s, --stress]

This parameter is used to perform stress tests. Currently, the following stress tests are supported: on-chip memory stress test, AI Core stress test, P2P stress test, and power consumption stress test.

Yes

[-qs, --qs, --quick-stress]

Specify the range of fast stress test for high-bandwidth memory high-risk addresses.

  • The value ranges from 0 to 100. The recommended value is 100.
  • If the value is 0, the fast stress test is performed on all high-bandwidth memory addresses by default.
  • If HBM diagnosis is included, this parameter must be used together with [-s, --stress] and cannot be used together with [-st, --st, --stress-time] or [--sc, --stress-count].

Yes

# Example of stress test on on-chip memory high-risk addresses
ascend-dmi -dg -i hbm -s -qs 60 -q
Figure 12 On-chip memory high-risk address stress test example

Output parameters:

  • PASS: The high-bandwidth memory high-risk address fast stress test is passed. No isolation page is added.
  • SKIP: The current device does not support the on-chip memory high-risk address stress test.
  • FAIL: The high-bandwidth memory high-risk address fast stress test fails. New isolation pages are added.
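
The rules in Table 20 (value range of -qs, mandatory -s, and no combination with the stress time or stress count options) can be checked before the command is issued. This is a minimal sketch; build_quick_stress_command is a hypothetical helper, and its stress_time and stress_count arguments exist only to reject the unsupported combinations.

```python
from typing import List, Optional


def build_quick_stress_command(
    quick_stress: int,
    stress_time: Optional[int] = None,
    stress_count: Optional[int] = None,
    quiet: bool = False,
) -> List[str]:
    """Return the argv for the high-risk address stress test per Table 20."""
    if not 0 <= quick_stress <= 100:
        raise ValueError("-qs must be in 0..100")
    if stress_time is not None or stress_count is not None:
        raise ValueError("-qs cannot be used with stress time or stress count")
    # -s is always included because -qs must be used together with it.
    argv = ["ascend-dmi", "-dg", "-i", "hbm", "-s", "-qs", str(quick_stress)]
    if quiet:
        argv.append("-q")
    return argv
```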

Fault Diagnosis 6: AI Core Diagnosis

Diagnose the AI Core error and output the diagnosis result.

# Example of AI Core diagnosis
ascend-dmi -dg -i aicore -q
Figure 13 AI Core diagnosis example

Output parameters:

  • PASS: The diagnosis result is normal.
  • SKIP: The diagnosis is performed by a non-root user and AI Core diagnosis is not supported.
  • EMERGENCY_WARN: emergency warning. Replace the hardware.
  • FAIL: AI Core diagnosis fails. Contact Huawei engineers.

Fault Diagnosis 7: AI FLOPs Diagnosis

Diagnose the chip compute and output the test result.

# Example of AI FLOPs diagnosis
ascend-dmi -dg -i aiflops -q
Figure 14 AI FLOPs diagnosis example

Output parameters:

  • PASS: The compute test result is normal (greater than the reference value).
  • WARN: The chip overheats during the compute test.
  • FAIL: The compute test fails. The test result is smaller than the reference value.

Fault Diagnosis 8: Bandwidth Diagnosis

Diagnose the local bandwidth and output the diagnosis result.

# Example of performing bandwidth diagnosis on Device0
ascend-dmi --dg -i bandwidth -d 0
Figure 15 Bandwidth diagnosis example

Output parameters:

  • PASS: The bandwidth test result is normal.
  • FAIL: The bandwidth test fails. The test result is smaller than the reference value. Contact Huawei engineers.

Fault Diagnosis 9: P2P Stress Test

Check whether the HCCS communication link from the source device to the target device has hardware faults and output the test result.

Table 21 Parameters for P2P stress tests

Parameter

Description

Mandatory

[-s, --stress]

This parameter is used to perform stress tests. Currently, the following stress tests are supported: on-chip memory stress test, AI Core stress test, P2P stress test, and power consumption stress test.

  • If bandwidth check is specified, this parameter can be used together with -t to perform a P2P stress test.

Yes

[-t, --type]

Specify the data flows to be tested.

  • This parameter takes effect only when -i is set to bandwidth and -s is specified, indicating that the P2P stress test is performed.
  • Currently, only the P2P bandwidth type is supported.

    p2p: Data is transferred from the specified source device to the target device. The transfer rate and total duration are tested.

Yes

# Example of P2P stress test
ascend-dmi -dg -i bandwidth --type p2p -s
Figure 16 P2P stress test example

Output parameters:

  • PASS: The stress test is passed, and the result is normal.
  • SKIP: The current device does not support the P2P stress test.
  • EMERGENCY_WARN: emergency warning. The stress test fails. Contact Huawei engineers to replace the hardware.
  • FAIL: P2P stress test fails. Contact Huawei engineers.

Fault Diagnosis 10: Power Consumption Stress Test

Perform the EDP/TDP power consumption stress test and output the diagnosis result.

# Example of power consumption stress test
ascend-dmi -dg -i edp -s -st 60 -q
ascend-dmi -dg -i tdp -s -st 60 -q
Figure 17 TDP power consumption stress test example
Figure 18 EDP power consumption stress test example

Output parameters:

  • PASS: The power consumption stress test result is normal.
  • SKIP: The current device does not support the power consumption stress test.
  • IMPORTANT_WARN: A chip alarm is generated during the stress test. Handle the alarm based on the description. If the fault persists, contact Huawei engineers.
  • FAIL: The power consumption stress test fails. Contact Huawei engineers.