Example: Creating a Ray Cluster
When running reinforcement-learning training tasks, users typically need a Ray cluster.
The ModelArts environment supports Ray distributed jobs: simply start the Ray cluster before the actual training job begins. To start the cluster, run ray start --head on the master node; worker nodes then join by running ray start --address="master_ip:6379". ModelArts containers provide environment variables that indicate whether the current node is the master or a worker, so each node can run the appropriate command. The following is an example script for starting a Ray cluster across multiple nodes.
#!/bin/bash
# Clean up any leftover processes from a previous run
pkill -9 python
ray stop --force

# Total number of nodes used for training
NNODES=${VC_WORKER_NUM:-1}
# Number of NPUs on each node (counted from the PCI device list)
NPUS_PER_NODE=$(lspci | grep d80 | wc -l)
# The first host in VC_WORKER_HOSTS is the master node
MASTER_ADDR=$(python -c "import os; print(os.getenv('VC_WORKER_HOSTS','127.0.0.1').split(',')[0])")

if [ "$VC_TASK_INDEX" == "0" ]; then
    # Start the master node
    ray start --head --resources='{"NPU": '$NPUS_PER_NODE'}'
    # Wait until every node has registered its NPUs with the cluster
    while true; do
        ray_status_output=$(ray status)
        npu_count=$(echo "$ray_status_output" | grep -oP '(?<=/)\d+\.\d+(?=\s*NPU)' | head -n 1)
        npu_count_int=$(echo "$npu_count" | awk '{print int($1)}')
        device_count=$((npu_count_int / NPUS_PER_NODE))
        # Check whether device_count equals NNODES
        if [ "$device_count" -eq "$NNODES" ]; then
            echo "Ray cluster is ready with $device_count devices (from $npu_count NPU resources), starting Python script."
            ray status
            break
        else
            echo "Waiting for Ray to allocate $NNODES devices. Current device count: $device_count"
            sleep 5
        fi
    done
else
    # Worker nodes keep trying to register with the master until they succeed
    while true; do
        # Try to join the Ray cluster
        ray start --address="$MASTER_ADDR:6379" --resources='{"NPU": '$NPUS_PER_NODE'}'
        # Check whether the connection succeeded
        if ray status; then
            echo "Successfully connected to the Ray cluster!"
            break
        else
            echo "Failed to connect to the Ray cluster. Retrying in 5 seconds..."
            sleep 5
        fi
    done
fi
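The readiness check in the script works by parsing the resources section of the ray status output. Here is a minimal sketch of that extraction, run against a hypothetical status line (the actual ray status output format varies by Ray version, and the 8-NPU-per-node figure is an assumption for illustration):

```shell
# Hypothetical line from the Resources section of `ray status`,
# e.g. a 2-node cluster with 8 NPUs per node:
sample="0.0/16.0 NPU"

# Same extraction as in the script: take the total NPU count after the slash...
npu_count=$(echo "$sample" | grep -oP '(?<=/)\d+\.\d+(?=\s*NPU)' | head -n 1)
# ...truncate it to an integer, then divide by the per-node NPU count
npu_count_int=$(echo "$npu_count" | awk '{print int($1)}')
device_count=$((npu_count_int / 8))
echo "$npu_count_int NPUs registered, $device_count nodes"   # 16 NPUs registered, 2 nodes
```

When device_count reaches NNODES, every node has registered its NPUs and training can begin.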
Save the script above to a file, for example start_ray_cluster.sh, and upload it to OBS.
To start a training job in ModelArts using a custom startup command and create the Ray cluster, simply run bash start_ray_cluster.sh. Once the script completes, you can check the cluster state with ray status.
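The head/worker branching driven by the scheduler variables can be sketched with hypothetical values (in a real job, ModelArts sets VC_WORKER_HOSTS, VC_TASK_INDEX, and VC_WORKER_NUM automatically); as a side note, the master address can also be extracted with pure-bash parameter expansion instead of the python one-liner used in the script:

```shell
# Hypothetical values; ModelArts sets these automatically in real jobs.
export VC_WORKER_HOSTS="job-host-0,job-host-1"
export VC_TASK_INDEX="1"

# The first host in the comma-separated list is the master
MASTER_ADDR=${VC_WORKER_HOSTS%%,*}
echo "$MASTER_ADDR"   # job-host-0

if [ "$VC_TASK_INDEX" == "0" ]; then
    echo "role: head -> ray start --head"
else
    echo "role: worker -> ray start --address=$MASTER_ADDR:6379"
fi
```

With VC_TASK_INDEX set to "1", this node takes the worker branch and would join the master at job-host-0:6379.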