Updated on 2026-04-28 GMT+08:00

Example: Creating a Ray Cluster

Reinforcement learning (RL) training tasks typically require a Ray cluster to run.

ModelArts Standard supports Ray distributed jobs, provided that the Ray cluster is started before training begins. To start the cluster, run ray start --head on the master node; each worker node then joins by running ray start --address="master_ip:6379". ModelArts containers provide environment variables that indicate whether the current node is the master or a worker, so a single script can run the appropriate command on every node. For example, the script below treats the node whose VC_TASK_INDEX is 0 as the master and takes the master address from the first entry of VC_WORKER_HOSTS. The following is a script example for starting a Ray cluster across multiple nodes:

#!/bin/bash
# Kill leftover Python processes and stop any Ray instance from a previous run
pkill -9 python
ray stop --force

# Total number of nodes used for training
NNODES=${VC_WORKER_NUM:-1}
# Number of NPUs per node
NPUS_PER_NODE=$(lspci | grep d80 | wc -l)
MASTER_ADDR=$(python -c "import os; print(os.getenv('VC_WORKER_HOSTS','127.0.0.1').split(',')[0])")

if [ "$VC_TASK_INDEX" == "0" ]; then
	# Start the master node
	ray start --head --resources='{"NPU": '$NPUS_PER_NODE'}'

	while true; do
		ray_status_output=$(ray status)
		npu_count=$(echo "$ray_status_output" | grep -oP '(?<=/)\d+\.\d+(?=\s*NPU)' | head -n 1)
		npu_count_int=$(echo "$npu_count" | awk '{print int($1)}')
		device_count=$((npu_count_int / $NPUS_PER_NODE))

		# Check if device_count matches NNODES
		if [ "$device_count" -eq "$NNODES" ]; then
			echo "Ray cluster is ready with $device_count devices (from $npu_count NPU resources), starting Python script."
			ray status
			break
		else
			echo "Waiting for Ray to allocate $NNODES devices. Current device count: $device_count"
			sleep 5
		fi
	done
else
	# Worker nodes keep trying to register with the master node until registration succeeds
	while true; do
		# Attempt to connect to the Ray cluster.
		ray start --address="$MASTER_ADDR:6379" --resources='{"NPU": '$NPUS_PER_NODE'}'
		# Check if the connection was successful.
		if ray status; then
			echo "Successfully connected to the Ray cluster!"
			break
		else
			echo "Failed to connect to the Ray cluster. Retrying in 5 seconds..."
			sleep 5
		fi
	done
fi
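
The two parsing steps in the script (choosing the master address and converting the NPU total reported by ray status into a node count) can be tried in isolation. The following is a minimal sketch, not part of the ModelArts tooling: the function names first_host and nodes_from_status are hypothetical, the host list and status text are made-up sample data, and the status lines are assumed to follow the "used/total NPU" layout that the script's grep pattern expects. It uses awk instead of grep -P so that it does not depend on PCRE support, and it returns 0 instead of failing when no NPU line is found or when the per-node NPU count is 0.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; the names are not part of ModelArts or Ray.

# Print the first entry of a comma-separated host list (the master node address).
# Falls back to 127.0.0.1 when the list is empty, as the script above does.
first_host() {
    local hosts="${1:-127.0.0.1}"
    echo "${hosts%%,*}"
}

# Read `ray status` text on stdin and print how many nodes have joined.
# $1 is the number of NPUs per node.
nodes_from_status() {
    local npus_per_node="$1"
    local total
    # Take the total from the first "used/total NPU" line, e.g. "0.0/16.0 NPU".
    total=$(awk '$2 == "NPU" && index($1, "/") > 0 {
        split($1, parts, "/"); print int(parts[2]); exit
    }')
    # Guard against a missing NPU line or a zero divisor.
    if [ -z "$total" ] || [ "${npus_per_node:-0}" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( total / npus_per_node ))
}

# Made-up sample data; real `ray status` output contains more lines.
sample_status=' 0.0/384.0 CPU
 0.0/16.0 NPU'

first_host "10.0.0.1,10.0.0.2"                 # prints 10.0.0.1
echo "$sample_status" | nodes_from_status 8    # prints 2
```

Returning 0 for the error cases keeps the master's wait loop running with a visible "Current device count: 0" message rather than aborting on a division-by-zero error.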

Save the script above as a file (for example, start_ray_cluster.sh) and upload it to OBS.

To create a Ray cluster when starting a training task through a custom method in ModelArts, run bash start_ray_cluster.sh. After the script finishes, run ray status to verify the cluster status.
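
If the verification is automated, a bounded retry avoids waiting forever when a node never joins. The following is a minimal sketch under that assumption; the helper name retry_until and its argument order are hypothetical and are not provided by Ray or ModelArts.

```shell
#!/bin/bash
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
retry_until() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command exactly as given; stop at the first success.
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt/$max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
        # Sleep only if another attempt remains.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, retry_until 12 5 ray status would call ray status up to 12 times, 5 seconds apart, and return a non-zero exit code if the cluster is still unreachable after the last attempt.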