HPC Management and Scheduling Plug-in

Product Overview

The HPC management and scheduling plug-in is a Slurm-based, one-stop platform for using and managing cluster resources on Huawei Cloud. It delivers clusters in one click from a visual interface and integrates SFS Turbo file systems for high-performance shared storage. You can manage cluster users, compute resources, and service jobs on the UI. The plug-in supports quick modeling and computing in scenarios such as structural mechanics, fluid analysis, thermal simulation, and gene sequencing.

Core Functions

Function	Description
Partition Management	Divides logical resource pools and isolates resources of different teams or projects.
Cluster Management	Creates, destroys, and manages compute resources and monitors cluster metrics.
Topology Management	Defines the physical topology structure (such as racks and switches) of a cluster and optimizes job scheduling policies.
Job Management	Submits tasks based on user requirements and queries task logs, job status, completion time, and scheduled nodes.
Job Templates	Sets standard job configurations and submits tasks in one click.
Elastic Resource Supply	Configures at least one scaling policy for each partition so that compute nodes are automatically scaled in or out based on the policy.
Elastic Job Scheduling	Schedules jobs based on policies such as priority, first in first out (FIFO), and backfill.
Quota Management	Restricts the resource usage of users or groups by QoS, account, and partition to ensure fair access and prioritized use.
Data Management	Mounts SFS Turbo file systems to provide high-performance shared storage. Files smaller than 1 GB can be uploaded and downloaded on the cockpit UI.
Tag Management	Adds tags to nodes for fine-grained management of resources in the same partition.
Auditing Management	Logs user operations and resource usage.
Cluster O&M	Allows you to view node processes, system configurations, environment variables, and downloaded logs.
User Management	Provides a built-in administrator account for creating and deleting regular users. Users are assigned different roles for accessing the cluster UIs.
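
Because the plug-in is based on Slurm, a job template can be thought of as a set of fields that is eventually rendered into a Slurm batch script. The following is a minimal sketch of that idea, not the plug-in's actual template format: the JobTemplate class, its field names, and the example partition name team_a are hypothetical. Only the #SBATCH directives are standard Slurm options, and the working directory reuses the SFS Turbo mount path given in this document.

```python
from dataclasses import dataclass


@dataclass
class JobTemplate:
    """Hypothetical template fields; the plug-in's real template schema may differ."""
    job_name: str
    partition: str
    nodes: int
    ntasks_per_node: int
    command: str
    workdir: str = "/mnt/sfs_turbo_1"  # SFS Turbo mount path stated in this document


def render_batch_script(t: JobTemplate) -> str:
    """Render the template as a Slurm batch script using standard #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={t.job_name}",
        f"#SBATCH --partition={t.partition}",
        f"#SBATCH --nodes={t.nodes}",
        f"#SBATCH --ntasks-per-node={t.ntasks_per_node}",
        f"#SBATCH --chdir={t.workdir}",
        "#SBATCH --output=%x-%j.out",  # %x expands to the job name, %j to the job ID
        "",
        f"srun {t.command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example values; replace them with a real partition and command.
    template = JobTemplate(
        job_name="cfd-demo",
        partition="team_a",
        nodes=2,
        ntasks_per_node=16,
        command="./solver input.cfg",
    )
    print(render_batch_script(template))
```

On a cluster where you have shell access, a script rendered this way could be written to a file and submitted with sbatch. On the plug-in UI, the equivalent values would be entered in the job template form instead.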

System Architecture and Deployment Requirements

Architecture Topology

Management and control nodes
- Master node: has 16 vCPUs, 32 GB of memory, and a 300 GB disk. It is responsible for cluster scheduling, user management, and audit log storage.
- SFS Turbo: provides a shared file system. The mount path is /mnt/sfs_turbo_1.
Compute nodes: Pay-per-use or yearly/monthly compute nodes are created on the cockpit UI or by elastic scaling policies.

Deployment Requirements

Component	Configuration Requirements
Master node	16 vCPUs, 32 GB of memory, a 300 GB SSD, and an associated EIP
SFS Turbo	On-demand capacity expansion by at least 1 TB and bandwidth of at least 1 Gbit/s
Compute nodes	You can create compute nodes on the cockpit UI and select specifications as required.
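
Before deployment, the master node requirements in the table can be checked against a planned configuration. The sketch below is illustrative only: the MasterNodeSpec class and check_master_node function are hypothetical names, and treating the listed values as minimums rather than exact values is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Values copied from the Deployment Requirements table.
# Assumption: they are treated as minimums, not exact values.
MIN_VCPUS = 16
MIN_MEMORY_GB = 32
MIN_DISK_GB = 300


@dataclass
class MasterNodeSpec:
    """Hypothetical description of a planned master node."""
    vcpus: int
    memory_gb: int
    disk_gb: int
    disk_is_ssd: bool
    has_eip: bool


def check_master_node(spec: MasterNodeSpec) -> List[str]:
    """Return a list of problems; an empty list means the spec meets the table."""
    problems = []
    if spec.vcpus < MIN_VCPUS:
        problems.append(f"vCPUs: {spec.vcpus} < {MIN_VCPUS}")
    if spec.memory_gb < MIN_MEMORY_GB:
        problems.append(f"memory: {spec.memory_gb} GB < {MIN_MEMORY_GB} GB")
    if spec.disk_gb < MIN_DISK_GB:
        problems.append(f"disk: {spec.disk_gb} GB < {MIN_DISK_GB} GB")
    if not spec.disk_is_ssd:
        problems.append("disk: an SSD is required")
    if not spec.has_eip:
        problems.append("network: no EIP is associated")
    return problems


if __name__ == "__main__":
    planned = MasterNodeSpec(vcpus=8, memory_gb=32, disk_gb=300,
                             disk_is_ssd=True, has_eip=False)
    for problem in check_master_node(planned):
        print(problem)
```

Returning a list of problems instead of a single pass/fail value makes it possible to report every unmet requirement at once.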