Example: Creating a Ray Cluster
Reinforcement learning (RL) training tasks typically require a Ray cluster.
ModelArts Standard supports Ray distributed jobs, provided that the Ray cluster is started before training begins. To start a Ray cluster, run ray start --head on the master node; worker nodes then join the cluster by running ray start --address="master_ip:6379", where master_ip is the address of the master node and 6379 is the default Ray head port. ModelArts containers provide environment variables that indicate whether the current node is the master or a worker, so a single script can run the appropriate command on each node. The following example script starts a Ray cluster across multiple nodes:
#!/bin/bash
pkill -9 python
ray stop --force

# Total number of nodes used for training
NNODES=${VC_WORKER_NUM:-1}
# Number of NPUs per node
NPUS_PER_NODE=$(lspci | grep d80 | wc -l)
MASTER_ADDR=$(python -c "import os; print(os.getenv('VC_WORKER_HOSTS','127.0.0.1').split(',')[0])")

if [ "$VC_TASK_INDEX" == "0" ]; then
    # Start the master node
    ray start --head --resources='{"NPU": '$NPUS_PER_NODE'}'
    while true; do
        ray_status_output=$(ray status)
        npu_count=$(echo "$ray_status_output" | grep -oP '(?<=/)\d+\.\d+(?=\s*NPU)' | head -n 1)
        npu_count_int=$(echo "$npu_count" | awk '{print int($1)}')
        device_count=$((npu_count_int / $NPUS_PER_NODE))
        # Check if device_count matches NNODES
        if [ "$device_count" -eq "$NNODES" ]; then
            echo "Ray cluster is ready with $device_count devices (from $npu_count NPU resources), starting Python script."
            ray status
            break
        else
            echo "Waiting for Ray to allocate $NNODES devices. Current device count: $device_count"
            sleep 5
        fi
    done
else
    # Worker nodes attempt to register ray with the master node until the registration is successful
    while true; do
        # Attempt to connect to the Ray cluster.
        ray start --address="$MASTER_ADDR:6379" --resources='{"NPU": '$NPUS_PER_NODE'}'
        # Check if the connection was successful.
        ray status
        if [ $? -eq 0 ]; then
            echo "Successfully connected to the Ray cluster!"
            break
        else
            echo "Failed to connect to the Ray cluster. Retrying in 5 seconds..."
            sleep 5
        fi
    done
fi
fi

Save the script above as a file (for example, start_ray_cluster.sh) and upload it to OBS.
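The script above reads VC_WORKER_HOSTS and VC_TASK_INDEX to decide which node is the master and where workers should connect. As a minimal sketch of that same logic in pure bash (without the Python one-liner), the helper names first_host and node_role below are hypothetical and not part of ModelArts or Ray; the sketch assumes VC_WORKER_HOSTS is a comma-separated host list, as the script above does.

```shell
#!/bin/bash
# Sketch only: first_host and node_role are illustrative helper names.

# Print the first entry of a comma-separated host list.
# Falls back to 127.0.0.1 when the list is empty, like the script above.
first_host() {
    local hosts="${1:-127.0.0.1}"
    printf '%s\n' "${hosts%%,*}"
}

# Print "head" for task index 0 and "worker" for anything else
# (including an unset index, which matches the script above).
node_role() {
    if [ "$1" = "0" ]; then
        echo "head"
    else
        echo "worker"
    fi
}

MASTER_ADDR=$(first_host "${VC_WORKER_HOSTS:-}")
ROLE=$(node_role "${VC_TASK_INDEX:-}")
echo "role=$ROLE master=$MASTER_ADDR"
```

Using shell parameter expansion here avoids starting a Python interpreter right after the script has killed all Python processes, but the behavior is intended to match the original one-liner.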
To create a Ray cluster and start a training task through a custom method in ModelArts, run bash start_ray_cluster.sh. Once the script finishes, run ray status to verify the cluster status.
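When checking the cluster by hand, the same node-count arithmetic the script uses can be applied to the ray status output. The following is a rough sketch, assuming the output contains a resource line of the form "used/total NPU" (the format the script above parses); the function names npu_total_from_status and nodes_registered are hypothetical, and the sample text is canned rather than real ray status output.

```shell
#!/bin/bash
# Sketch only: helper names and sample text are illustrative.

# Read `ray status` text on stdin and print the total NPU count as an integer.
# Assumes a resource line such as " 0.0/16.0 NPU"; prints 0 if none is found.
npu_total_from_status() {
    awk '$NF == "NPU" { split($1, parts, "/"); print int(parts[2]); found = 1; exit }
         END { if (!found) print 0 }'
}

# Print how many nodes have registered, given total NPUs and NPUs per node.
# Prints 0 when the per-node count is not positive, avoiding division by zero.
nodes_registered() {
    local total="$1" per_node="$2"
    if [ "$per_node" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( total / per_node ))
}

# Canned text standing in for real `ray status` output.
sample_status=' 0.0/384.0 CPU
 0.0/16.0 NPU
 0B/100.00GiB memory'

total=$(printf '%s\n' "$sample_status" | npu_total_from_status)
echo "nodes registered: $(nodes_registered "$total" 8)"
```

On a live cluster, the sample text would be replaced by piping ray status into npu_total_from_status, and the result compared with the expected node count (VC_WORKER_NUM in the script above).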