Updated on 2025-08-15 GMT+08:00

Cluster Autoscaler

Description

Cluster Autoscaler is a plug-in for elastic scaling of ModelArts resource pools. It scales node pools in or out based on user-defined rules.

Constraints

  • This plug-in is supported only for nodes in a pay-per-use or yearly/monthly Lite Cluster resource pool.
  • If the resource specifications are sold out or the underlying capacity is insufficient, the scale-out will fail.
  • This plug-in is not supported for Lite Cluster resource pools purchased by rack.
  • This plug-in uses the permissions of the global agency to perform operations on resource pools. If the global agency contains blacklist policies related to resource pool operations, delete them first. To do so, perform the following steps:
    1. Log in to the ModelArts console. In the navigation pane on the left, choose Permission Management. Locate the target authorization and obtain the content in the Authorization Content column, which is the name of the agency granted to the current user.
      Figure 1 Authorization content
    2. Go to the IAM console. In the navigation pane on the left, choose Agencies. Locate the obtained agency and click Modify in the Operation column.
      Figure 2 IAM agency
    3. Click the Permissions tab.
    4. Locate ModelArts CommonOperations and click Delete in the Operation column. In the displayed dialog box, click OK.
      Figure 3 Deleting the ModelArts CommonOperations permission

Installing a Plug-in

  1. Log in to the ModelArts console. In the navigation pane on the left, choose Lite Cluster under Resource Management.
  2. Click the resource pool name to access its details page.
  3. In the navigation pane on the left, choose Plug-ins.
  4. Locate the plug-in to be installed and click Install.

    If Cluster Autoscaler was not manually installed in a newly created resource pool but is displayed as installed in the plug-in list, Cluster Autoscaler has been installed in the CCE cluster used by the resource pool. In this case, uninstall Cluster Autoscaler and then reinstall it.

    Figure 4 Installing a plug-in
  5. In the displayed dialog box, configure the parameters.
    The following table lists the related parameters.
    Table 1 Cluster Autoscaler parameters

    • Specifications
      • Plug-in Version: Specify the version of Cluster Autoscaler to be deployed.
      • Plug-in Specifications: Specify the specifications of the plug-in to be deployed. You can select preset specifications or customize one.

  6. Read "Usage Notes" and select I have read and understand the preceding information.
  7. Click OK.

Configuring Auto Scaling Policies for a Node Pool

After you install Cluster Autoscaler, you need to configure auto scaling policies for node pools.

Only pay-per-use nodes can be added during scale-out.

If auto scale-in is performed, nodes may be deleted and cannot be restored. Exercise caution.

  1. On the resource pool details page, in the navigation pane on the left, choose Node Pool Management.
  2. Locate the target node pool and click AS Configuration in the Operation column.
  3. In the displayed dialog box, configure the node pool scaling policy.
    • Auto Scale-Out

      If this function is enabled, the node pool can be automatically scaled out. Each node pool can have a maximum of six scale-out rules.

      Table 2 Auto scale-out parameters

      • Custom Scale-out Rules: Click Add Rule. In the displayed dialog box, set Rule Type to Period or Metric Trigger. Of the six rules allowed per node pool, up to five can be periodic scale-out rules and only one can be a metric-triggered scale-out rule; the same periodic scale-out rule cannot be added repeatedly. For details, see Table 3.
      • Max. Nodes: The node pool will not be scaled out once the number of nodes reaches the configured maximum. If the current number of nodes plus the number of nodes to be added would exceed this limit, the scale-out is not triggered at all. This ensures the atomicity of a scale-out, as illustrated below.
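      For example, with hypothetical values: if Max. Nodes is 10, the node pool currently has 8 nodes, and a rule would add 3 nodes, the rule is not triggered because 8 + 3 = 11 exceeds the limit; the plug-in does not partially add 2 nodes instead.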

      Table 3 Scale-out rule types

      • Periodic: Automatically adds nodes to the node pool at a specified time, optimizing resource allocation and reducing costs.
        • Trigger Time: Specify a time as required. This time is the local time of the region where the node is deployed.
        • New Nodes: Set the number of nodes to be added to the node pool during elastic scaling.
      • Metric Trigger: Dynamically adds nodes to the node pool based on the NPU usage, improving task execution efficiency.
        • Trigger: Currently, only NPU usage-triggered scale-out is supported. When the NPU usage of a node is low, the system may migrate tasks to the node or adjust the number of nodes to better match the requirements.

          NPU usage = Resources requested by pods in the node pool/Allocatable resources of the node pool (Node Allocatable)

          The threshold you set must be greater than the scale-in percentage configured in Autoscaler.

        • Action
          • Customization: Customize the number of nodes to be added during auto scaling.
          • Automatic calculation: When the trigger condition is met, nodes are added automatically so that the usage falls back below the threshold.

            Number of nodes to be added = Resources requested by pods in the node pool/(Allocatable resources of a single node x Number of target nodes) - Current number of nodes + 1
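      As an illustration with hypothetical numbers: if pods in the node pool request 6 NPUs in total and the node pool's allocatable capacity (Node Allocatable) is 8 NPUs, the NPU usage is 6/8 = 75%. A trigger threshold of 70% would therefore trigger a scale-out, and such a threshold is valid only if the scale-in percentage configured in Autoscaler is below 70%.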

    • Auto Scale-In

      Once this function is enabled, the system checks the resource status of the entire cluster. If it confirms that workload pods can still be scheduled and run properly, it automatically selects nodes for scale-in.

      Table 4 Auto scale-in parameters

      • Min. Nodes: The node pool will not be scaled in once the number of nodes reaches the configured minimum. You must set the minimum number of nodes (minCount) for the node pool; otherwise, auto scale-in will fail.
      • Cooldown Period (Min): The period, in minutes, that must elapse after an auto scale-out is triggered before scale-in evaluation starts again.

  4. Click OK.

Configuring Metric-Triggered Auto Scaling

When you configure an auto scaling policy for a node pool and the ma_node_pool_allocate_card_util metric is used to trigger scaling, you need to complete the following configurations first.

  1. Install the cloud native plug-in, select local storage, and enable custom metric collection. For details, see Creating an HPA Policy with Custom Metrics.
  2. Create an external APIService and use kubectl apply to apply the configuration to the Kubernetes cluster.

    a. Log in to the CCE console. Go to the shell page of the cluster by clicking the CLI tool in the upper right corner.

    b. Create the external.yaml file and save the following YAML content to it:

    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      labels:
        app: external-metrics-apiserver
        release: cceaddon-prometheus
      name: v1beta1.external.metrics.k8s.io
    spec:
      group: external.metrics.k8s.io
      groupPriorityMinimum: 100
      insecureSkipTLSVerify: true
      service:
        name: custom-metrics-apiserver
        namespace: monitoring
        port: 443
      version: v1beta1
      versionPriority: 100

    c. Run the kubectl apply -f external.yaml command.
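    To verify that the APIService has been registered and is reachable (a quick check added here, using the standard kubectl get apiservice command), run:

    # The AVAILABLE column should show True once the custom-metrics-apiserver
    # service in the monitoring namespace is serving requests.
    kubectl get apiservice v1beta1.external.metrics.k8s.io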
  3. Add custom external metrics to the Prometheus plug-in. For details, see Step 3: Modify the Configuration File.
    1. Log in to the CCE console and click the cluster name to access its details page. In the navigation pane on the left, choose ConfigMaps and Secrets and switch to the monitoring namespace.
    2. Update user-adapter-config. You can modify the rules field in user-adapter-config to convert the metrics exposed by Prometheus to metrics that can be associated with HPA.

      Add the following example rules:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: user-adapter-config
        namespace: monitoring
      data:
        config.yaml: |
          rules: []
          ...
          # The externalRules section below is the added content.
          externalRules:
          - seriesQuery: '{__name__="ma_node_allocate_card_util",pool_id!=""}'
            metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (pool_id,node_pool)
            resources:
              overrides:
                pool_id:
                  resource: namespace
            name:
              as: ma_node_pool_allocate_card_util
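      In this rule, resources.overrides maps the pool_id metric label to the Kubernetes namespace resource, which is why the query command in 6 addresses the metric under namespaces/{{pool_id}}. If you prefer the CLI tool to the console, a standard alternative (assumed here, not part of the original procedure) is to edit the ConfigMap directly:

      # Open user-adapter-config in an editor and add the externalRules section.
      kubectl -n monitoring edit configmap user-adapter-config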
  4. On the CCE console, choose Clusters from the navigation pane.
  5. Click the cluster name. Then, in the navigation pane on the left, choose Workload. Switch to the monitoring namespace. Locate the custom-metrics-apiserver instance and choose More > Redeploy next to the workload.
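    If you are working in the CLI tool instead, an equivalent restart (assuming the custom-metrics-apiserver workload is a Deployment, as shown on the console) is:

    # Trigger a rolling restart so that the adapter reloads user-adapter-config.
    kubectl -n monitoring rollout restart deployment custom-metrics-apiserver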
  6. After the redeployment is complete, you can use the CLI tool on the CCE console to view the current metric values with the command below. In the command, pool_id indicates the resource pool ID, and node_pool indicates the node pool name. When querying the default node pool, leave node_pool blank.
    kubectl get --raw /apis/external.metrics.k8s.io/v1beta1/namespaces/{{pool_id}}/ma_node_pool_allocate_card_util?labelSelector=node_pool={{node_pool_name}}
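    For example, with a hypothetical resource pool ID pool-abcd1234 and node pool name my-node-pool:

    # Query the current value of ma_node_pool_allocate_card_util for one node pool.
    kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/pool-abcd1234/ma_node_pool_allocate_card_util?labelSelector=node_pool=my-node-pool"

    A successful query returns an ExternalMetricValueList JSON object; the items[].value field carries the current metric value.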

Components

Table 5 Nodescaler component of Cluster Autoscaler

• nodescaler-controller-manager (resource type: Deployment): Manages auto scaling of resource pools.

Change History

Table 6 Release history

• 0.1.20: Supported auto scale-out at a scheduled time, scale-out based on the NPU allocation rate, and auto scale-in based on the load of idle nodes.