
Bursting from On-Premises Plug-in

Introduction

The bursting from on-premises plug-in communicates with Slurm and with the Huawei Cloud Auto Scaling (AS), Cloud Eye, and Elastic Cloud Server (ECS) services to dynamically adjust the number of Slurm nodes as needed.

Preparations

  1. Creating an AS configuration

    Log in to the AS console, and create an AS configuration based on the required ECS configurations.

  2. Creating an AS group

    Create an AS group using the AS configuration you created.

    Record the AS group ID, which will be used in the Scaler configuration file.

  3. Purchasing a yearly/monthly ECS

    Log in to the ECS console, purchase a yearly/monthly ECS as needed, and record the name that this ECS uses as a node in the Slurm cluster. This node will be used as a stable node in the Scaler configuration file.
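
    The Slurm node name typically matches the short hostname of the ECS (this depends on how NodeName is defined in slurm.conf). For example, you can confirm the name as follows:

    hostname -s    # on the purchased ECS: print the short hostname
    sinfo -N       # on the Slurm master node: list the nodes registered in the cluster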

  4. Creating an access key pair
    1. On the management console, choose My Credentials.

    2. In the navigation pane, choose Access Keys and click Create Access Key. After the access key is created, download the file that contains the Access Key ID (AK) and Secret Access Key (SK).

  5. Determining the region ID and the project ID

    Go to the My Credentials page, choose API Credentials in the navigation pane, and find the target region ID and the project ID.

  6. Determining the endpoints of dependent services

    The Scaler program depends on the AS, Cloud Eye, and ECS services. Their endpoints are in the following format:

    as.{Region ID}.myhuaweicloud.com

    ces.{Region ID}.myhuaweicloud.com

    ecs.{Region ID}.myhuaweicloud.com

    Replace {Region ID} with the actual region ID, for example, as.cn-north-9.myhuaweicloud.com.
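
    You can also verify that the endpoints are reachable from the master node, for example (this assumes curl is installed; replace cn-north-9 with your own region ID):

    curl -sI https://as.cn-north-9.myhuaweicloud.com
    curl -sI https://ces.cn-north-9.myhuaweicloud.com
    curl -sI https://ecs.cn-north-9.myhuaweicloud.com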

Constraints

  • Slurm user quotas can only be configured at the node level.
  • Instances cannot be manually added to an AS group.
  • To update the stable node information, you need to update the Scaler configuration file, restart the Scaler, and then create an instance.
  • The short name of a stable node must consist of a prefix followed by a number, for example, hpc-0001 or slurm-02.
  • The Scaler program can run on only one node.

Software Deployment

  • Installation node

    Master node of the Slurm cluster

  • Installation directory

    /data

  • Downloading software

    cd /data

    Download scaler-0.0.1-SNAPSHOT.jar.

  • Creating a configuration file

    Create the scalerConfig.yaml file in the directory that contains scaler-0.0.1-SNAPSHOT.jar and edit the file. (Ensure that the file complies with the YAML specifications.)

    Configure the file as follows:

    user:
      # AK of the account for logging in to the console
      ak:
      # SK of the account for logging in to the console
      sk:
      # Project ID of the target region
      project: cc515cbccbc04b78b29a30f5c47fc99a
      # Proxy address, port, username, and password. They are not required if no proxy is needed.
      proxy-address:
      proxy-port:
      proxy-username:
      proxy-password:
    as:
      # AS endpoint corresponding to the target region
      endpoint: as.cn-north-9.myhuaweicloud.com
      # ID of the preconfigured AS group
      group: de2aa26c-12f7-4882-824f-6c2886bb91e1
      # Maximum number of instances displayed on a page. The default value is 100. You do not need to change the value.
      list-instance-limit: 100
      # Maximum number of instances that can be deleted at a time. The maximum allowed by AS is 50. You do not need to change the value.
      delete-instance-limit: 50
    ecs:
      # ECS endpoint corresponding to the target region
      endpoint: ecs.cn-north-9.myhuaweicloud.com
    metric:
      # Namespace of the custom monitoring metric. You do not need to change the value.
      namespace: HPC.SLURM
      # Name of a custom metric.
      name: workload
      # Name of a custom metric dimension. You do not need to change the value.
      dimension-name: hpcslurm01
      # ID of a custom metric dimension. It can be set to the AS group ID. This value does not affect functions.
      dimension-id: de2aa26c-12f7-4882-824f-6c2886bb91e1
      # Time to live (TTL) reported by the metric. You do not need to change the value.
      report-ttl: 172800
      # Cloud Eye endpoint corresponding to the target region
      endpoint: ces.cn-north-9.myhuaweicloud.com
    task:
      # Period for checking the Slurm node status, in seconds
      health-audit-period: 30
      # Period for reporting custom metrics, in seconds
      metric-report-period: 30
      # Period for checking whether scale-in is required, in seconds
      scale-in-period: 120
      # Period for automatically deleting nodes, in seconds
      delete-instance-period: 10
      # Period for automatically discovering new nodes, in seconds
      discover-instance-period: 20
      # Period for comparing the number of nodes in the AS group with that of nodes in the Slurm cluster, in seconds
      diff-instance-and-node-period: 60
    slurm:
      # Name of the stable node. Separate multiple names with commas (,).
      stable-nodes: node-ind07urq,node-or7qht4s
      # Slurm partition where stable nodes are located
      stable-partition: dyn1
      # Slurm partition where variable nodes are located
      variable-partition: dyn1
      # Period of idle time after which idle nodes will be deleted, in seconds
      scale-in-time: 300
      # Job waiting time. If the waiting time of a job is longer than this value, the job is considered to be in a queue and will be counted in related metrics. The recommended value is 0.
      job-wait-time: 0
      # Timeout interval for registering a new node with Slurm. If a new node fails to be registered after the timeout interval expires, the node will be deleted by AS. The recommended value is 10 minutes.
      register-timeout-minutes: 10
      # Number of CPU cores of the elastic node
      cpu: 4
      # Memory size of the elastic node. This is a reserved field. It can be set to any value greater than 0.
      memory: 12600

    Change the parameter values as needed for your environment. Retain the default values for the parameters whose comments state that you do not need to change them.
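
    Before starting Scaler, you can optionally check that the file is syntactically valid YAML, for example (this assumes Python 3 with the PyYAML module is available on the master node):

    python3 -c "import yaml; yaml.safe_load(open('scalerConfig.yaml')); print('scalerConfig.yaml is valid YAML')"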

  • Installing JDK

    If JDK 8 is not installed on the master node, manually install it.
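
    For example, on a yum-based OS image you can check the current version and install OpenJDK 8 as follows (the package names are an assumption and may differ on your OS):

    java -version
    yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel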

Software Management

  • Starting the software

    Run the nohup java -jar scaler-0.0.1-SNAPSHOT.jar --spring.config.name=scalerConfig > /dev/null 2>&1 & command as the root user.
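
    To confirm that Scaler is running, you can check the process and the run log, for example (the log path assumes Scaler was started from /data):

    ps -ef | grep java | grep scaler
    tail -n 50 /data/scaler.log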

  • Stopping the software
    1. Run the ps -ef | grep java | grep scaler command to query the process ID of the software.
    2. Run the kill -9 {Process ID} command.
  • Updating the configuration file

    After updating the scalerConfig.yaml configuration file, restart Scaler to apply the modification.
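
    For example, a restart can be performed with the stop and start commands above:

    cd /data
    ps -ef | grep java | grep scaler        # note the process ID
    kill -9 {Process ID}
    nohup java -jar scaler-0.0.1-SNAPSHOT.jar --spring.config.name=scalerConfig > /dev/null 2>&1 &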

Troubleshooting

The log prints the following warning:

node: [Node name] status isn't DOWN

This message indicates that the node is not in the AS group and is not in the DOWN state.

Check whether the node is a stable node. If it is, add it to the stable-nodes list in the scalerConfig.yaml file and restart Scaler.
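
You can also query the node state in Slurm from the master node before editing the configuration, for example:

scontrol show node {Node name}
sinfo -N -l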

Relevant Files

scaler-0.0.1-SNAPSHOT.jar: main software of Scaler

scalerConfig.yaml: configuration file

scaler.log/scaler.yyyy-mm-dd.log: program run log