Updated on 2025-08-28 GMT+08:00

Kuberay

Introduction

Kuberay is a native Kubernetes add-on designed to seamlessly manage and operate the Ray distributed computing framework within Kubernetes clusters, including CCE standard and Turbo clusters. Ray is a high-performance distributed computing library commonly used in machine learning, reinforcement learning, and data processing scenarios. The goal of Kuberay is to integrate Ray into Kubernetes for easy deployment, management, and scaling of RayClusters.

As a Kubernetes operator, Kuberay handles the lifecycle of RayClusters through CRDs. Its core functions include:

  • RayCluster deployment: Kuberay handles the creation and management of both head and worker nodes in RayClusters automatically.
  • Auto scaling: The number of worker node pods within a RayCluster adjusts dynamically based on workload demands.
  • Resource management: Kuberay seamlessly integrates with Kubernetes' resource management system, including CPU, memory, and GPU resources.
  • Fault recovery: Kuberay monitors the health of RayClusters and automatically recovers any faulty nodes.
  • Logging and monitoring: Kuberay incorporates Kubernetes logging and monitoring tools to simplify debugging and optimization of Ray applications.

Open-source community: https://github.com/ray-project/kuberay
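
The core functions above revolve around the RayCluster custom resource. The following is a minimal RayCluster manifest for illustration only; its structure mirrors the rayClusterSpec used in the RayJob example later in this section, and the image tag and resource values are assumptions that you can adjust:

apiVersion: ray.io/v1
kind: RayCluster          # Custom resource managed by the kuberay-operator
metadata:
  name: my-raycluster
spec:
  headGroupSpec:          # One head node pod
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
            resources:
              requests:
                cpu: "200m"
              limits:
                cpu: "1"
  workerGroupSpecs:       # Worker node pod group; minReplicas and maxReplicas bound how far the group can scale
    - groupName: small-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
              resources:
                requests:
                  cpu: "200m"
                limits:
                  cpu: "1"

After such a resource is created (for example, with kubectl create -f my-raycluster.yaml), the kuberay-operator creates the head and worker node pods and manages their lifecycle.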

Notes and Constraints

  • The CCE cluster (standard or Turbo) version must be v1.27 or later.
  • This add-on is being rolled out progressively. For details about the regions where this add-on is available, see the console.

Billing

To access the Ray Dashboard, you need to bind an EIP to a node in the cluster. EIPs are billed. For pricing details, see EIP Price Calculator.

Installing the Add-on

  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane, choose Add-ons. Locate Kuberay on the right and click Install.
  3. On the Install Add-on page, configure the specifications. You can configure the CPU and memory quotas as required.

    • CPU quotas are measured in cores and can be specified in millicores by appending the suffix m, for example, 100m (0.1 cores).
      Table 1 CPU quotas

      Parameter | Example Value | Description
      Request | 100m | The minimum number of CPUs required by a container.
      Limit | 500m | The maximum number of CPUs available for a container. If the CPU usage exceeds this limit, the instance may restart, which affects normal operation of the add-on.

    • Memory quotas are measured in bytes and can be specified in mebibytes by appending the suffix Mi, for example, 100Mi. Managing 500 Ray pods requires approximately 500 MiB of memory. Adjust the settings based on the actual memory usage. The example values correspond to standard Kubernetes resource requests and limits, as shown in the sketch below.
      Table 2 Memory quotas

      Parameter | Example Value | Description
      Request | 100Mi | The minimum amount of memory required by a container.
      Limit | 500Mi | The maximum amount of memory available for a container. If the memory usage exceeds this limit, the instance may restart, which affects normal operation of the add-on.
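
      The following is a minimal sketch of the equivalent Kubernetes resources stanza, for illustration only; the add-on's actual pod specification is generated by the console:

      resources:
        requests:
          cpu: 100m        # Minimum CPU guaranteed to the container
          memory: 100Mi    # Minimum memory guaranteed to the container
        limits:
          cpu: 500m        # CPU ceiling for the container
          memory: 500Mi    # Memory ceiling; exceeding it can cause a restart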

  4. Configure the parameters. For details, see Table 3.

    Table 3 Parameters

    Parameter | Example Value | Description
    Add-on Namespace | default | Namespace where the Kuberay add-on is deployed. Defaults to default; the value can be customized.
    Batch Scheduler | Do not use | Whether an external scheduler is used to schedule multiple RayJobs simultaneously. Defaults to Do not use. Do not use: no external scheduler is used. Volcano: Volcano schedules multiple RayJobs simultaneously; if Volcano is not installed in the cluster, install the Volcano Scheduler add-on first. For details, see Volcano Scheduler.
    Service Port | 8080 | Port through which Kuberay provides services for external systems. Defaults to 8080.

  5. Click Install. When the add-on enters the Running state, it has been installed.

Components

Table 4 Add-on components

Component | Description | Resource Type
kuberay-operator | Used to deploy Ray and manage its lifecycle. | Deployment

How to Use the Add-on

There are different ways to submit a Ray job using the Kuberay add-on. For a comparison of the workflows and application scenarios, see Table 5. The following procedure demonstrates how to submit a job using a RayJob custom resource. For more details about how to use Kuberay, see the official documentation.

Table 5 Comparison of different ways to submit a job

How to Submit a Job | Description
Using a RayCluster custom resource | Create a RayCluster custom resource, log in to the head node pod, and submit a job. The head node then automatically distributes the job to a worker node for execution.
Using a RayJob custom resource | Mount a Python script directly to the head node of a RayCluster through a RayJob. The head node then distributes the job to a worker node for execution. This section provides an example YAML file for creating a RayJob custom resource, which you can customize as needed.

The RayJob in this example works as follows:

  1. The RayJob starts a RayCluster with a head node and a group of worker nodes that scale based on the job status.
  2. The RayJob mounts the Python script to the head node through a ConfigMap.
  3. The RayJob automatically runs the Python script to execute a distributed counter job. When the job finishes, the RayJob is marked complete, but the RayCluster remains until it is deleted manually (see the note after this list).
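
If you do not want to keep the RayCluster after the job finishes, the upstream KubeRay RayJob spec also provides automatic cleanup fields. The following is a minimal sketch for illustration only and is not used in this example; verify the field names against the CRD version installed in your cluster:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-rayjob
spec:
  shutdownAfterJobFinishes: true    # Delete the RayCluster automatically after the job finishes
  ttlSecondsAfterFinished: 600      # Optional: keep the finished RayCluster for 600 seconds, then delete it
  # entrypoint, runtimeEnvYAML, and rayClusterSpec are configured as in the example below

To submit a job using a RayJob custom resource, perform the following steps: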
  1. Install kubectl on an existing ECS and access a cluster using kubectl. For details, see Accessing a Cluster Using kubectl.
  2. Create a YAML file for configuring a RayJob custom resource. In this example, the file name is my-rayjob.yaml. You can change it as needed.

    vim my-rayjob.yaml

    Due to network restrictions, the official Ray image may fail to be pulled. To avoid this, you are advised to push the image to an SWR image repository in advance and replace the image address in the file with the SWR image address. For details about how to push an image to SWR, see Migrating Images to SWR Using Docker Commands.

    The file content is as follows:

    apiVersion: ray.io/v1
    kind: RayJob    # The resource type is RayJob.
    metadata:
      name: my-rayjob
    spec:           # Define the specifications of the RayJob, including the job entry point, runtime environment, and RayCluster configuration.
      entrypoint: python /home/ray/samples/sample_code.py     # Specify the entry point command of the job. In this example, a Python script will be run.
      runtimeEnvYAML: |
        env_vars:   # Define the environment variables.
          counter_name: "test_counter"
      rayClusterSpec:     # Define the configuration of the RayCluster.
        headGroupSpec:    # Define the configuration of the head node in the RayCluster.
          rayStartParams: {}
          template:
            spec:
              containers:
                - name: ray-head
                  image: rayproject/ray:2.41.0
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265 # Ray dashboard
                      name: dashboard
                    - containerPort: 10001
                      name: client
                  resources:
                    limits:
                      cpu: "1"
                    requests:
                      cpu: "200m"
                  volumeMounts:
                    - mountPath: /home/ray/samples
                      name: code-sample
              volumes:
                - name: code-sample
                  configMap:
                    name: ray-job-code-sample
                    items:
                      - key: sample_code.py
                        path: sample_code.py
        workerGroupSpecs:    # Define the configuration of the worker nodes in the RayCluster.
          - replicas: 1
            minReplicas: 1
            maxReplicas: 5
            groupName: small-group
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: ray-worker
                    image: rayproject/ray:2.41.0
                    resources:
                      limits:
                        cpu: "1"
                      requests:
                        cpu: "200m"
    ---
    # ConfigMap that stores the job code: the Python script sample_code.py, which implements a basic distributed counter.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ray-job-code-sample
    data:
      sample_code.py: |
        import ray
        import os
    
        ray.init()
    
        @ray.remote
        class Counter:
            def __init__(self):
                # Used to verify runtimeEnv
                self.name = os.getenv("counter_name")
                assert self.name == "test_counter"
                self.counter = 0
    
            def inc(self):
                self.counter += 1
    
            def get_counter(self):
                return "{} got {}".format(self.name, self.counter)
    
        counter = Counter.remote()
    
        for _ in range(1000):
            ray.get(counter.inc.remote())
            print(ray.get(counter.get_counter.remote()))

  3. Create the RayJob and ConfigMap (used to store job code).

    kubectl create -f my-rayjob.yaml

    If information similar to the following is displayed, the resources have been created:

    rayjob.ray.io/my-rayjob created
    configmap/ray-job-code-sample created

  4. Check the deployment of the RayJob by performing the following operations in sequence:

    1. Check whether the RayCluster has been created.
      kubectl get pod

      If information similar to the following is displayed, the resources have been created. In the following output, my-rayjob-raycluster-4464z-head-hb5qx is the head node pod, my-rayjob-raycluster-4464z-small-group-worker-csqb2 is the worker node pod, and my-rayjob-x2tv6 is the job submitter pod, whose STATUS reflects the job execution status. If the STATUS of my-rayjob-x2tv6 is Completed, the job is complete.

      NAME                                                  READY    STATUS    RESTARTS    AGE
      my-rayjob-raycluster-4464z-head-hb5qx                 1/1      Running   0           24s
      my-rayjob-raycluster-4464z-small-group-worker-csqb2   1/1      Running   0           24s
      my-rayjob-x2tv6                                       1/1      Running   0           4s
    2. Check the status of the RayJob.
      kubectl get rayjob

      If information similar to the following is displayed, the job is still in progress. When JOB STATUS changes to SUCCEEDED, the job is complete.

      NAME        JOB STATUS    DEPLOYMENT STATUS    RAY CLUSTER NAME              START TIME            END TIME    AGE
      my-rayjob   RUNNING       Running              my-rayjob-raycluster-4464z    2025-02-10T07:16:26Z              28
    3. View the job execution status.
      kubectl logs my-rayjob-x2tv6

      Information similar to the following is displayed:

      2025-02-21 03:32:23,631 INFO cli.py:36 -- Job submission server address: http://my-rayjob-raycluster-4464z-head-svc.default.svc.cluster.local:8265
      2025-02-21 03:32:24,142 SUCC cli.py:60 -- --------------------------------------------
      2025-02-21 03:32:24,143 SUCC cli.py:61 -- Job 'my-rayjob-x2tv6' submitted successfully
      2025-02-21 03:32:24,143 SUCC cli.py:62 -- --------------------------------------------
      2025-02-21 03:32:24,143 INFO cli.py:286 -- Next steps
      2025-02-21 03:32:24,143 INFO cli.py:287 -- Query the logs of the job:
      2025-02-21 03:32:24,143 INFO cli.py:289 -- ray job logs my-rayjob-x2tv6
      2025-02-21 03:32:24,143 INFO cli.py:291 -- Query the status of the job:
      2025-02-21 03:32:24,143 INFO cli.py:293 -- ray job status my-rayjob-x2tv6
      2025-02-21 03:32:24,143 INFO cli.py:295 -- Request the job to be stopped:
      2025-02-21 03:32:24,143 INFO cli.py:297 -- ray job stop my-rayjob-x2tv6
      2025-02-21 03:32:24,147 INFO cli.py:304 -- Tailing logs until the job exits (disable with --no-wait):
      2025-02-21 03:32:25,011 INFO worker.py:1429 -- Using address 172.20.0.10:6379 set in the environment variable RAY_ADDRESS
      2025-02-21 03:32:25,012 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 172.20.0.10:6379...
      2025-02-21 03:32:25,017 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 172.20.0.10:8265 
      test_counter got 1
      test_counter got 2
      test_counter got 3
      test_counter got 4
      ...

  5. Monitor and manage the RayCluster status, resource usage, and job execution in real time on the Ray Dashboard.

    1. Obtain the labels of the head node pod in the RayCluster so that the pod can be identified and selected by the Service created in the next substep. Replace my-rayjob-raycluster-4464z-head-hb5qx with the actual head node pod name as needed.
      kubectl describe pod my-rayjob-raycluster-4464z-head-hb5qx

      The following information is displayed. You are advised to use the ray.io/identifier=my-rayjob-raycluster-4464z-head label to select the head node pod.

      ...
      Labels:           app.kubernetes.io/created-by=kuberay-operator
                        app.kubernetes.io/name=kuberay
                        ray.io/cluster=my-rayjob-raycluster-4464z
                        ray.io/group=headgroup
                        ray.io/identifier=my-rayjob-raycluster-4464z-head
                        ray.io/is-ray-node=yes
                        ray.io/node-type=head
      ...
    2. Create a YAML file for configuring a NodePort Service. In this example, the file name is ray-dashboard.yaml. You can change it as needed. This Service is used to expose services to external systems so that you can directly access the Ray Dashboard from a browser.
      vim ray-dashboard.yaml

      The file content is as follows:

      apiVersion: v1
      kind: Service
      metadata:
        name: ray-dashboard
        labels:
          ray.io/identifier: my-rayjob-raycluster-4464z-head
        namespace: default
      spec:
        ports:
        - name: cce-service-0
          port: 8265          # Port for accessing the Service, which is set to 8265
          protocol: TCP       # Protocol used for accessing the Service. The value can be TCP or UDP.
          targetPort: 8265    # Port used by the Service to access the target container. This port is closely related to the application running in a container and must be 8265.
        selector:             # Label selector
          ray.io/identifier: my-rayjob-raycluster-4464z-head
        externalTrafficPolicy: Cluster
        type: NodePort        # Service type. NodePort indicates that services are accessed through a node port.
    3. Create the Service.
      kubectl create -f ray-dashboard.yaml

      If information similar to the following is displayed, the Service has been created:

      service/ray-dashboard created
    4. Obtain the node port of the Service.
      kubectl get services

      The following information is displayed, and 32638 specifies the node port.

      NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
      ray-dashboard                         NodePort    10.247.147.102   <none>        8265:32638/TCP                         8s
    5. Enter http://<EIP of a cluster node>:32638 in the address bar of a browser to access the Ray Dashboard. Ensure that an EIP has been bound to a node in the cluster. To bind an EIP to a node, log in to the CCE console, click the cluster name to access the cluster console, choose Nodes in the navigation pane, click the Nodes tab in the right pane, and click the node name to go to the ECS console. For details, see Binding an EIP.

      If the page displayed is similar to Figure 2, the access is successful. You can monitor and manage the status of the RayCluster, track resource usage, and oversee job execution in real time using the Ray Dashboard. For example, you can check the status of job execution in the Recent jobs area.

      Figure 2 Accessing the Ray Dashboard

  6. Delete related resources by taking the following steps:

    1. Delete resources related to the RayJob.
      kubectl delete -f my-rayjob.yaml

      Information similar to the following is displayed:

      rayjob.ray.io "my-rayjob" deleted
      configmap "ray-job-code-sample" deleted
    2. Delete the Service resources.
      kubectl delete -f ray-dashboard.yaml

      Information similar to the following is displayed:

      service "ray-dashboard" deleted

Common Issues

After the Kuberay add-on is uninstalled from a CCE cluster, the RayCluster, RayJob, and RayService CRDs remain in the cluster. You can delete these remaining resources by performing the following steps:

  1. On the ECS that has been connected to the cluster, search for the CRDs related to Ray.

    kubectl get crd | grep ray

    If information similar to the following is displayed, there are three related CRDs:

    rayclusters.ray.io                          2025-02-01T12:00:00Z 
    rayjobs.ray.io                              2025-02-01T12:00:00Z
    rayservices.ray.io                          2025-02-01T12:00:00Z

  2. Delete the CRDs one by one. Replace rayclusters.ray.io, rayjobs.ray.io, and rayservices.ray.io in the commands as needed. Note that deleting a CRD also deletes any remaining custom resources of that type.

    kubectl delete crd rayclusters.ray.io
    kubectl delete crd rayjobs.ray.io
    kubectl delete crd rayservices.ray.io

    If information similar to the following is displayed, the resources have been deleted.

    customresourcedefinition.apiextensions.k8s.io "rayclusters.ray.io" deleted
    customresourcedefinition.apiextensions.k8s.io "rayjobs.ray.io" deleted
    customresourcedefinition.apiextensions.k8s.io "rayservices.ray.io" deleted

Release History

Table 6 Kuberay add-on

Add-on Version | Supported Cluster Version | New Feature | Community Version
1.2.3 | v1.27, v1.28, v1.29, v1.30, v1.31, v1.32 | Clusters of v1.32 are supported. | v1.2.2
1.2.2 | v1.27, v1.28, v1.29, v1.30, v1.31 | The Kuberay add-on is now available. | v1.2.2