Basic Metrics: ModelArts Metrics
This section describes the ModelArts metrics reported to AOM through the Agent.
Category | Metric | Metric Name | Description | Value Range | Unit
---|---|---|---|---|---
CPU | ma_container_cpu_util | CPU Usage | CPU usage of a measured object | 0–100 | %
CPU | ma_container_cpu_used_core | Used CPU Cores | Number of CPU cores used by a measured object | ≥ 0 | Cores
CPU | ma_container_cpu_limit_core | Total CPU Cores | Total number of CPU cores requested for a measured object | ≥ 1 | Cores
Memory | ma_container_memory_capacity_megabytes | Memory | Total physical memory requested for a measured object | ≥ 0 | MB
Memory | ma_container_memory_util | Physical Memory Usage | Percentage of used physical memory relative to the total physical memory requested for a measured object | 0–100 | %
Memory | ma_container_memory_used_megabytes | Used Physical Memory | Physical memory used by a measured object, taken from container_memory_working_set_bytes (the current working set). Working set memory = active anonymous pages and cache, plus file-backed pages; it is ≤ container_memory_usage_bytes. | ≥ 0 | MB
Storage I/O | ma_container_disk_read_kilobytes | Disk Read Rate | Volume of data read from a disk per second | ≥ 0 | KB/s
Storage I/O | ma_container_disk_write_kilobytes | Disk Write Rate | Volume of data written to a disk per second | ≥ 0 | KB/s
GPU memory | ma_container_gpu_mem_total_megabytes | GPU Memory Capacity | Total GPU memory of a training job | > 0 | MB
GPU memory | ma_container_gpu_mem_util | GPU Memory Usage | Percentage of used GPU memory relative to the total GPU memory | 0–100 | %
GPU memory | ma_container_gpu_mem_used_megabytes | Used GPU Memory | GPU memory used by a measured object | ≥ 0 | MB
GPU | ma_container_gpu_util | GPU Usage | GPU usage of a measured object | 0–100 | %
GPU | ma_container_gpu_mem_copy_util | GPU Memory Bandwidth Usage | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | 0–100 | %
GPU | ma_container_gpu_enc_util | GPU Encoder Usage | GPU encoder usage of a measured object | 0–100 | %
GPU | ma_container_gpu_dec_util | GPU Decoder Usage | GPU decoder usage of a measured object | 0–100 | %
GPU | DCGM_FI_DEV_GPU_TEMP | GPU Temperature | GPU temperature | > 0 | °C
GPU | DCGM_FI_DEV_POWER_USAGE | GPU Power | GPU power | > 0 | W
GPU | DCGM_FI_DEV_MEMORY_TEMP | Memory Temperature | Memory temperature | > 0 | °C
GPU | DCGM_FI_PROF_GR_ENGINE_ACTIVE | Graphics Engine Activity | Fraction of the time during which the graphics or compute engine is active within a period. This is an average across all graphics and compute engines. An engine is active when a graphics or compute context is bound to a thread and that context is busy. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_SM_OCCUPANCY | SM Occupancy | Ratio of the number of thread bundles (warps) resident on an SM to the maximum number of thread bundles that can reside on the SM within a period. This is an average across all SMs within a period. A high value does not necessarily mean high GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor Activity | Fraction of the period during which the tensor (HMMA/IMMA) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of tensor cores. A value of 1 (100%) indicates that a tensor instruction is issued every instruction cycle throughout the period (one instruction is completed in two cycles). A value of 0.2 (20%) can result from any of the following: 20% of the SM tensor cores run at 100% utilization during the entire period; all SM tensor cores run at 20% utilization during the entire period; all SM tensor cores run at 100% utilization during 1/5 of the period; or other combinations. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_DRAM_ACTIVE | Memory BW Utilization | Fraction of the time during which data is sent to or received from the device memory within a period. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of device memory. A value of 1 (100%) indicates that a DRAM instruction is executed once per cycle throughout a period (in practice, the achievable peak is about 0.8). A value of 0.2 (20%) indicates that data is read from or written to the device memory during 20% of the cycles within a period. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_PIPE_FP16_ACTIVE | FP16 Engine Activity | Fraction of the period during which the FP16 (half-precision) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher usage of FP16 cores. A value of 1 (100%) indicates that an FP16 instruction is executed every two cycles (for example, on Volta cards) throughout a period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP16 cores run at 100% utilization during the entire period; all SM FP16 cores run at 20% utilization during the entire period; all SM FP16 cores run at 100% utilization during 1/5 of the period; or other combinations. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_PIPE_FP32_ACTIVE | FP32 Engine Activity | Fraction of the period during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single precision) and integers. This is an average value within a period, not an instantaneous value. A higher value indicates higher usage of FP32 cores. A value of 1 (100%) indicates that an FP32 instruction is executed every two cycles (for example, on Volta cards) throughout a period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP32 cores run at 100% utilization during the entire period; all SM FP32 cores run at 20% utilization during the entire period; all SM FP32 cores run at 100% utilization during 1/5 of the period; or other combinations. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_PIPE_FP64_ACTIVE | FP64 Engine Activity | Fraction of the period during which the FP64 (double precision) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher usage of FP64 cores. A value of 1 (100%) indicates that an FP64 instruction is executed every four cycles (for example, on Volta cards) throughout a period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP64 cores run at 100% utilization during the entire period; all SM FP64 cores run at 20% utilization during the entire period; all SM FP64 cores run at 100% utilization during 1/5 of the period; or other combinations. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_SM_ACTIVE | SM Activity | Fraction of the time during which at least one thread bundle (warp) is active on an SM within a period. This is an average across all SMs and is insensitive to the number of threads in each block. A thread bundle is active once it has been scheduled and allocated resources; it may be computing or not computing (for example, waiting for a memory request). A value below 0.5 indicates that GPUs are not used efficiently. The value should be greater than 0.8. For example, for a GPU with N SMs: a kernel function that runs N thread blocks on all SMs throughout a period gives a value of 1 (100%); a kernel function that runs N/5 thread blocks throughout a period gives 0.2; a kernel function that uses N thread blocks but runs for only 1/5 of the cycles in a period gives 0.2. | 0–1.0 | Percentage (fraction)
GPU | DCGM_FI_PROF_PCIE_TX_BYTES, DCGM_FI_PROF_PCIE_RX_BYTES | PCIe Bandwidth | Rate of data transmitted or received over the PCIe bus, including the protocol header and data payload. This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. | ≥ 0 | Bytes/s
GPU | DCGM_FI_PROF_NVLINK_RX_BYTES, DCGM_FI_PROF_NVLINK_TX_BYTES | NVLink Bandwidth | Rate of data transmitted or received through NVLink, excluding the protocol header. This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. | ≥ 0 | Bytes/s
Network I/O | ma_container_network_receive_bytes | Downlink Rate (BPS) | Inbound traffic rate of a measured object | ≥ 0 | Bytes/s
Network I/O | ma_container_network_receive_packets | Downlink Rate (PPS) | Number of data packets received by a NIC per second | ≥ 0 | Packets/s
Network I/O | ma_container_network_receive_error_packets | Downlink Error Rate | Number of error packets received by a NIC per second | ≥ 0 | Count/s
Network I/O | ma_container_network_transmit_bytes | Uplink Rate (BPS) | Outbound traffic rate of a measured object | ≥ 0 | Bytes/s
Network I/O | ma_container_network_transmit_error_packets | Uplink Error Rate | Number of error packets sent by a NIC per second | ≥ 0 | Count/s
Network I/O | ma_container_network_transmit_packets | Uplink Rate (PPS) | Number of data packets sent by a NIC per second | ≥ 0 | Packets/s
NPU | ma_container_npu_util | NPU Usage | NPU usage of a measured object | 0–100 | %
NPU | ma_container_npu_memory_util | NPU Memory Usage | Percentage of used NPU memory relative to the total NPU memory | 0–100 | %
NPU | ma_container_npu_memory_used_megabytes | Used NPU Memory | NPU memory used by a measured object | ≥ 0 | MB
NPU | ma_container_npu_memory_total_megabytes | Total NPU Memory | Total NPU memory of a measured object | ≥ 0 | MB
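
The table mixes two scales: the `ma_container_*_util` metrics are percentages (0–100), while the `DCGM_FI_PROF_*` activity metrics are fractions (0–1.0). The following is a minimal sketch of how a consumer of these values might normalize and interpret them. All function names are hypothetical helpers written for illustration and are not part of any AOM or ModelArts API; the labels returned by `sm_activity_hint` are also assumptions, derived only from the 0.5 and 0.8 thresholds stated for DCGM_FI_PROF_SM_ACTIVE.

```python
def usage_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, as in the *_util metrics (0-100)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total * 100.0


def fraction_to_percent(fraction: float) -> float:
    """Convert a DCGM_FI_PROF_* activity fraction (0-1.0) to a percentage."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1.0")
    return fraction * 100.0


def sm_activity_hint(value: float) -> str:
    """Classify a DCGM_FI_PROF_SM_ACTIVE value using the table's thresholds.

    The label strings are illustrative only (hypothetical, not from AOM).
    """
    if value < 0.5:
        return "inefficient"   # table: below 0.5, GPUs are not efficiently used
    if value > 0.8:
        return "on target"     # table: the value should be greater than 0.8
    return "below target"      # between the two thresholds


# Memory bandwidth example from the table: 450 GB/s out of a 900 GB/s maximum.
print(usage_percent(450, 900))  # → 50.0
print(sm_activity_hint(0.2))    # → inefficient
```

Keeping the two conversions separate avoids accidentally multiplying a value that is already a percentage by 100 when both kinds of metric appear on the same dashboard.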