Updated on 2024-04-19 GMT+08:00

Solution Overview

Scenarios

This solution helps you quickly set up a scalable high-performance computing (HPC) environment on Huawei Cloud based on the open-source software Slurm and Huawei's open-source Gearbox. Slurm runs in "configless" mode on the cloud servers that function as compute nodes. The Gearbox program interconnects with Huawei Cloud Auto Scaling and Cloud Eye to monitor the job status of the Slurm cluster and automatically scale the cluster's cloud servers in or out in real time. During scale-out, new cloud servers are automatically registered with and added to the cluster. During scale-in, cloud servers are automatically deregistered from the cluster and then destroyed.
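
To illustrate what "configless" mode means for a newly created compute node, the following minimal sketch builds the command line a node could use to start slurmd so that it fetches its configuration from the scheduling node instead of reading a local slurm.conf. It assumes the controller has `SlurmctldParameters=enable_configless` set and listens on the default slurmctld port 6817. The host name `slurm-master` and the helper `slurmd_command` are hypothetical and are not taken from this solution.

```python
from typing import List


def slurmd_command(controller_host: str, port: int = 6817) -> List[str]:
    """Build the slurmd start command for a configless compute node.

    With --conf-server, slurmd pulls its configuration from slurmctld,
    so the node image does not need its own copy of slurm.conf.
    """
    return ["slurmd", "--conf-server", f"{controller_host}:{port}"]


# Hypothetical scheduling-node host name, used only for illustration.
print(" ".join(slurmd_command("slurm-master")))
```

Because the node image carries no cluster-specific configuration, the same image can be reused for every server that Auto Scaling creates.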

Solution Architecture

The solution architecture is illustrated below.

Figure 1 Architecture

This solution will:

  • Create two Linux Elastic Cloud Servers (ECSs), install the open-source software Slurm, install the Gearbox program on the scheduling node, and configure the Java environment.
  • Create one Elastic IP (EIP) for internal and external communication.
  • Create security groups and configure rules to control access to ECSs so as to secure the ECS environment.
  • Use Image Management Service (IMS) to prepare the initialization environment for compute node servers during auto scaling.
  • Use Auto Scaling to create and configure an auto scaling group as well as define scaling policies to automatically scale in or out cluster resources.
  • Use Cloud Eye for resource monitoring. The Gearbox program monitors the job status, calculates the workload value of custom metrics, and reports the metrics to Cloud Eye.
  • Use Scalable File Service (SFS) to mount SFS file systems to the ECSs to provide shared file storage for clusters.
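
The monitoring step in the list above (job status in, workload metric out) can be sketched as follows. This is not Gearbox's actual algorithm: the formula, the function names, and the one-job-per-node simplification are assumptions made for illustration. The input is assumed to be the text printed by `squeue -h -o "%T"`, which lists one job state per line.

```python
from collections import Counter


def count_job_states(squeue_output: str) -> Counter:
    """Count jobs per state from `squeue -h -o "%T"` output (one state per line)."""
    return Counter(
        line.strip() for line in squeue_output.splitlines() if line.strip()
    )


def workload(states: Counter, current_nodes: int) -> float:
    """Hypothetical workload metric: jobs that want a node, per existing node.

    Assumes each job needs exactly one compute node. A value above 1.0
    suggests that more nodes are wanted than the cluster currently has.
    """
    demand = states.get("RUNNING", 0) + states.get("PENDING", 0)
    # Avoid division by zero when the cluster has no compute nodes.
    return demand / max(current_nodes, 1)


# Sample squeue output, invented for this example.
sample = "RUNNING\nRUNNING\nPENDING\nPENDING\nPENDING\nCOMPLETING\n"
states = count_job_states(sample)
print(workload(states, current_nodes=2))
```

A value computed this way would then be reported to Cloud Eye as a custom metric. The reporting call is omitted here because it depends on the Cloud Eye API and on credentials that this overview does not describe.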

Advantages

  • Auto scaling

    In this solution, auto scaling groups are configured and the Gearbox program is built into the server that functions as the scheduling node. The program periodically monitors cluster metrics, summarizes the metric data, and reports the data to Cloud Eye. Cloud Eye alarm rules then trigger auto scaling so that the cluster size tracks the workload, which reduces costs.

  • Personalized customization

    This solution and the built-in Gearbox program are both open source and free for commercial use. You can also customize them based on the source code.

  • Easy deployment

    You can deploy a scalable HPC cluster in just a few clicks.
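
The alarm-driven scaling described under "Auto scaling" can be pictured with a small threshold check. The sketch below is only an illustration of the idea: the thresholds, the number of consecutive periods, and the function name `scaling_action` are assumptions and do not reflect the alarm rules this solution actually creates in Cloud Eye.

```python
from typing import Sequence


def scaling_action(
    samples: Sequence[float],
    scale_out_above: float = 1.0,
    scale_in_below: float = 0.5,
    periods: int = 3,
) -> str:
    """Return 'scale_out', 'scale_in', or 'none' from recent metric samples.

    An action is chosen only when the last `periods` samples all cross the
    same threshold, which avoids reacting to a single short spike or dip.
    """
    recent = list(samples[-periods:])
    if len(recent) < periods:
        return "none"  # Not enough data yet to make a decision.
    if all(value > scale_out_above for value in recent):
        return "scale_out"
    if all(value < scale_in_below for value in recent):
        return "scale_in"
    return "none"


# Workload samples invented for this example, oldest first.
print(scaling_action([0.8, 1.2, 1.5, 2.0]))
print(scaling_action([0.2, 0.1, 0.3]))
```

Requiring several consecutive periods is a common way to keep a scaling group from repeatedly adding and removing servers when the metric hovers around a threshold.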

Constraints

  • Before deploying this solution, register a HUAWEI ID, enable Huawei Cloud services, and complete real-name authentication. If you select the yearly/monthly billing mode, ensure that your account has sufficient balance. If you do not have sufficient balance, you can go to the Billing Center to manually pay for the order.
  • Before deploying this solution, ensure that your account has sufficient IAM permissions. For details, see (Optional) Creating the rf_admin_trust Agency.
  • Ensure that you have sufficient quotas. Specifically, log in to the Huawei Cloud management console and choose Resources > My Quotas to check your quotas. If the quotas are insufficient, submit a service ticket to increase the quotas.
    • Compute: number of ECSs, number of CPU cores, and RAM capacity
    • Storage: Elastic Volume Service (EVS) and Scalable File Service (SFS)
    • Network: VPC, subnets, EIPs, and security groups