
Managing Lite Cluster Resource Pools

Renewal Management of Lite Cluster Resource Pools

You can renew yearly/monthly Lite Cluster resource pools, enable auto-renewal, and modify the auto-renewal settings.

In the navigation pane of the ModelArts console, choose AI Dedicated Resource Pools > Elastic Clusters. On the displayed page, locate the target resource pool and perform the desired renewal operation.

Viewing Basic Information About a Lite Cluster Resource Pool

In the navigation pane of the ModelArts console, choose AI Dedicated Resource Pools > Elastic Clusters. On the displayed page, click the target resource pool to view more information.

Figure 1 Viewing basic information about a Lite Cluster resource pool

Managing Lite Cluster Resource Pool Tags

You can add tags to a resource pool for quick search.

  1. Log in to the ModelArts console. In the navigation pane, choose AI Dedicated Resource Pools > Elastic Clusters.
  2. In the Lite resource pool list, click the name of the target resource pool to view its details.
  3. On the resource pool details page, click the Tags tab to view the tag information.

    Tags can be added, modified, and deleted. For details about how to use tags, see Using TMS Tags to Manage Resources by Group.

    Figure 2 Tags

    You can add up to 20 tags.
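    Tags are key-value pairs, and each resource pool can carry up to 20 of them. As a minimal illustration of managing resources by tag group, the following Python sketch filters a made-up inventory of resource pools by a tag key-value pair and checks the 20-tag limit. The pool names and tags are hypothetical, and the sketch does not call any ModelArts or TMS API.

        # Hypothetical inventory; in practice these are the tags you set on the Tags tab.
        pools = [
            {"name": "pool-train-a", "tags": {"team": "nlp", "env": "prod"}},
            {"name": "pool-train-b", "tags": {"team": "cv", "env": "dev"}},
            {"name": "pool-infer-a", "tags": {"team": "nlp", "env": "dev"}},
        ]

        MAX_TAGS_PER_POOL = 20  # documented limit: up to 20 tags per resource pool

        def validate_tag_count(tags):
            """Reject tag sets that exceed the documented 20-tag limit."""
            if len(tags) > MAX_TAGS_PER_POOL:
                raise ValueError(f"at most {MAX_TAGS_PER_POOL} tags allowed, got {len(tags)}")

        def pools_with_tag(pools, key, value):
            """Return the names of pools whose tags contain the given key-value pair."""
            return [p["name"] for p in pools if p["tags"].get(key) == value]

        for p in pools:
            validate_tag_count(p["tags"])

        print(pools_with_tag(pools, "team", "nlp"))  # ['pool-train-a', 'pool-infer-a']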

Configuration Management of Lite Cluster Resource Pools

On the resource pool details page, click Configuration Management. From there, you can modify the namespace to be monitored, cluster configuration, and image pre-provisioning information.

  • Click the icon next to Monitoring to enable or disable monitoring and to set the namespace to be monitored. For details about how to use monitoring, see Viewing Lite Cluster Metrics Using Prometheus.
  • Click the icon next to Cluster Configuration to set the core binding, dropcache, and hugepage memory parameters. If a parameter is not set, the default value from the resource pool image is used. A read-only sketch of the underlying Linux mechanisms follows this list.
    • Core Binding: If core binding is enabled, workload pods are allocated exclusive CPUs, which improves performance (for example, training and inference performance) and reduces scheduling delay. This is ideal for workloads that are sensitive to CPU cache and scheduling delay. If core binding is disabled, workload pods are not allocated exclusive CPUs. Disable this function if you want a large pool of shareable CPUs; you can also disable core binding and use taskset to bind cores flexibly inside service containers.
    • Dropcache: If this function is enabled, Linux cache clearing is enabled, which improves application performance in most scenarios. However, clearing the cache can cause container startup failures or degrade system performance, because the system must reload data from disk into memory. If this function is disabled, the cache is not cleared.
    • Hugepage Memory: If this function is enabled, transparent huge pages (THP) are used. THP is a memory management technique that improves system performance by increasing the memory page size; it allocates huge pages dynamically, which simplifies their management. Enabling hugepage memory improves application performance in most cases, but it may trigger node restarts due to the soft lockup mechanism. If this function is disabled, huge pages are not used.
  • Click the icon next to Image Pre-provisioning to set the image source, add an image key, and configure image pre-provisioning. For details, see (Optional) Configuring Image Pre-provisioning.
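
The three cluster configuration switches above map onto standard Linux mechanisms: CPU affinity for core binding, the page cache for dropcache, and transparent huge pages (THP) for hugepage memory. The following Python sketch is a read-only troubleshooting aid, assuming a Linux environment, that you could run inside a service container or on a node to inspect the current state of these mechanisms; it does not change any resource pool configuration.

    # Read-only sketch (assumes Linux): inspect the mechanisms behind the core
    # binding, dropcache, and hugepage memory switches. It modifies nothing.
    import os

    # Core binding: the set of CPUs this process may run on. With core binding
    # enabled, a workload pod sees only its exclusive CPUs here; taskset or
    # os.sched_setaffinity() can narrow the set further inside the container.
    print("CPUs available to this process:", sorted(os.sched_getaffinity(0)))

    # Dropcache: clearing works through the standard Linux page cache. The
    # Cached value in /proc/meminfo shows roughly how much memory clearing
    # the cache would reclaim.
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    print("Page cache size:", meminfo["Cached"].strip())

    # Hugepage memory: the current transparent huge page (THP) mode. The
    # active value is shown in brackets, for example "always madvise [never]".
    thp_path = "/sys/kernel/mm/transparent_hugepage/enabled"
    if os.path.exists(thp_path):
        with open(thp_path) as f:
            print("THP mode:", f.read().strip())
    else:
        print("THP interface not present on this kernel")

For example, a THP mode ending in [never] indicates that huge page memory is currently disabled on that node.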

More Operations

For more operations, see the following: