Updated on 2025-08-15 GMT+08:00

ModelArts Metric Collector

Description

ModelArts Metric Collector, a default built-in plug-in of ModelArts, runs as a node daemon to collect node and job metrics and report them to AOM. For details about the metrics, see Viewing Lite Cluster Metrics on AOM.

Figure 1 ModelArts Metric Collector

Constraints

  • The plug-in is automatically installed during resource pool creation and cannot be uninstalled.
  • For an existing resource pool, the plug-in is automatically installed when ModelArts Node Agent is upgraded to the latest version.
  • During a plug-in upgrade, the metric collection pod restarts, so metrics may not be reported for a short period. Exercise caution when upgrading the plug-in.

Components

Component                  | Description                           | Resource Type
modelarts-metric-collector | Node and container metrics collection | DaemonSet
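
Because the collector runs as a DaemonSet and its pod restarts during an upgrade, it can be useful to confirm that the pods are ready again before relying on the reported metrics. The following is a minimal sketch, not an official procedure. It assumes that you have kubectl access to the cluster; the namespace is not stated in this document, so it is left as a parameter that you must supply. The status fields used (desiredNumberScheduled, numberReady, updatedNumberScheduled) are standard Kubernetes DaemonSet status fields.

```python
import json
import subprocess


def daemonset_is_ready(ds: dict) -> bool:
    """Return True when every scheduled pod is both updated and ready."""
    status = ds.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # updatedNumberScheduled may be absent; fall back to the desired count.
    updated = status.get("updatedNumberScheduled", desired)
    # Zero desired pods is treated as "not ready": no node is collecting metrics.
    return desired > 0 and ready == desired and updated == desired


def fetch_daemonset(name: str, namespace: str) -> dict:
    """Read the DaemonSet object through kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "daemonset", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

To use it, call fetch_daemonset("modelarts-metric-collector", namespace) with the namespace that applies to your cluster and pass the result to daemonset_is_ready.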

Parameters

Parameter | Description

Standby Node Metric Reporting | Whether the standby node of a dedicated pool reports metrics. The default value is false.

Enable Exporter | Whether third-party monitoring systems such as Prometheus can obtain the metrics collected by ModelArts. If this function is disabled, such systems cannot collect the metrics. This function is enabled by default. For a dedicated pool, enable this function if you want to use inference job metrics for scaling.

Report Metrics to a Custom Common Prometheus Instance on AOM | Whether metrics are reported to a custom Prometheus common instance on AOM. If this function is enabled, metrics are reported to the custom Prometheus common instance, as shown in Figure 2. If this function is disabled (the default), metrics are reported to the default Prometheus instance of AOM, that is, the Prometheus_AOM_Default instance, as shown in Figure 3.

Figure 2 Custom Prometheus common instance
Figure 3 Prometheus_AOM_Default instance
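
When Enable Exporter is turned on, a third-party system reads the metrics itself. The sketch below illustrates how such a consumer might parse the samples, assuming the exporter serves the standard Prometheus text exposition format. This document does not state the exporter address or the metric names, so the sample text and the names in it (example_node_gpu_util, example_pod_count) are hypothetical. The parser is deliberately minimal: it skips comment lines, ignores optional timestamps, and does not unescape label values.

```python
import re
from typing import Dict, List, Tuple

# One sample line: name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
# One label pair inside the braces: key="value"
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format samples into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical exporter output, for illustration only.
SAMPLE_TEXT = """\
# HELP example_node_gpu_util Hypothetical GPU utilization metric.
# TYPE example_node_gpu_util gauge
example_node_gpu_util{node="node-1",gpu="0"} 87.5
example_node_gpu_util{node="node-1",gpu="1"} 12
example_pod_count 4
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

In practice, a Prometheus server or an existing client library would usually handle this parsing; the sketch only shows what the exported data looks like to a consumer.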