TensorFlow
ModelArts supports multiple AI frameworks (engines). When you use these engines for model training, the boot commands must be adapted accordingly. This section describes how to adapt the boot command for the TensorFlow engine.
TensorFlow Startup Principle
Specifications and number of nodes
In this section, the flavor GPU: 8 × GP-Vnt1 | CPU: 72 cores | Memory: 512 GB is used as an example to describe how ModelArts allocates resources for single-node and distributed jobs.
For a single-node job (running on only one node), ModelArts starts a training container that exclusively uses the resources on the node.
For a distributed job (running on more than one node), ModelArts starts a parameter server (PS) and a worker on the same node. The PS owns the compute resources of CPU: 36 cores | Memory: 256 GB, and the worker owns GPU: 8 × GP-Vnt1 | CPU: 36 cores | Memory: 256 GB.
Only CPU and memory resources are allocated to a PS, whereas a worker can also own acceleration cards (except for CPU-only specifications). In this example, each worker owns eight GP-Vnt1 acceleration cards. If a PS and a worker are started on the same node, they share the disk resources.
Network communication
- For a single-node job, no network communication is required.
- For a distributed job, network communication is required both within nodes and between nodes.
Within nodes
Within a node, a PS and a worker communicate through either the container network or the host network.
- The container network is used when you run a training job on nodes that use shared resources.
- When you run a training job on nodes in a dedicated resource pool, the host network is used if the node is configured with RoCE NICs, and the container network is used if the node is configured with InfiniBand NICs.
Between nodes
For a distributed job, a PS and a worker communicate across nodes. ModelArts provides InfiniBand and RoCE NICs with a bandwidth of up to 100 Gbit/s.
Boot Commands
By default, the training service uses the Python interpreter in the job image to start the training script. To locate this interpreter, run the which python command. The working directory at startup is /home/ma-user/user-job-dir/<The code directory name>, which is the directory returned by pwd in the shell or os.getcwd() in Python.
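For reference, the following minimal snippet, placed at the top of a startup file, prints the interpreter and working directory described above. It is an illustrative sketch, not part of the ModelArts boot command.

import os
import sys

# Interpreter used to start the job; equivalent to running `which python` in the image.
print("Python interpreter:", sys.executable)

# Working directory at startup; expected to be
# /home/ma-user/user-job-dir/<The code directory name> as described above.
print("Working directory:", os.getcwd())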
- Boot command for a single-node, single-card job
python <Relative path of the startup file> <Job parameters>
- Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
- Job parameters: running parameters configured for a training job
Figure 1 Creating a training job
Configure the parameters by referring to the preceding figure. The console then runs the following command in the background (a sketch of a startup file that accepts this parameter is provided after these boot command examples):
python /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --epochs 5
- Boot command for distributed jobs
python --task_index ${VC_TASK_INDEX} --PS_hosts ${TF_PS_HOSTS} --worker_hosts ${TF_WORKER_HOSTS} --job_name ${MA_TASK_NAME} <Relative path of the startup file> <Job parameters>
- VC_TASK_INDEX: task index (serial number), for example, 0, 1, or 2.
- TF_PS_HOSTS: address array of the PS nodes, for example, [xx-PS-0.xx:TCP_PORT,xx-PS-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
- TF_WORKER_HOSTS: address array of the worker nodes, for example, [xx-worker-0.xx:TCP_PORT,xx-worker-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
- MA_TASK_NAME: task name, which can be PS or worker.
- Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
- Job parameters: running parameters configured for a training job
Figure 2 Creating a training job
Configure the parameters by referring to the preceding figure. The console then runs the following command in the background (a distributed startup file sketch is provided after these boot command examples):
python --task_index "$VC_TASK_INDEX" --PS_hosts "$TF_PS_HOSTS" --worker_hosts "$TF_WORKER_HOSTS" --job_name "$MA_TASK_NAME" /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --epochs 5
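For reference, the following is a minimal sketch of a startup file that matches the single-node boot command above. Only the --epochs parameter comes from the example; the default value and the training loop placeholder are illustrative assumptions.

import argparse

def main():
    parser = argparse.ArgumentParser()
    # --epochs is the job parameter from the single-node example; the default is an assumption.
    parser.add_argument("--epochs", type=int, default=5)
    args, _ = parser.parse_known_args()

    for epoch in range(args.epochs):
        # Replace with the actual training logic of your model.
        print("Running epoch", epoch)

if __name__ == "__main__":
    main()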
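Similarly, the following sketch shows one way a startup file might consume the parameters injected by the distributed boot command, using the common TensorFlow parameter-server pattern (tf.train.ClusterSpec plus a server). It is illustrative only: the bracket stripping of the host lists, the lowercase normalization of the job name, and the default values are assumptions, and tf.distribute.Server is the TensorFlow 2.x name (TensorFlow 1.x uses tf.train.Server).

import argparse
import tensorflow as tf

def main():
    parser = argparse.ArgumentParser()
    # Parameters injected by the distributed boot command shown above.
    parser.add_argument("--task_index", type=int, default=0)
    parser.add_argument("--PS_hosts", type=str, default="")
    parser.add_argument("--worker_hosts", type=str, default="")
    parser.add_argument("--job_name", type=str, default="worker")
    parser.add_argument("--epochs", type=int, default=5)
    args, _ = parser.parse_known_args()

    # Assumption: the host lists arrive as "[host0:port,host1:port]"; strip the
    # brackets and split on commas to obtain plain "host:port" entries.
    ps_hosts = args.PS_hosts.strip("[]").split(",")
    worker_hosts = args.worker_hosts.strip("[]").split(",")

    # Assumption: normalize the job name ("PS" or "worker") to the lowercase
    # keys used in the cluster definition.
    job_name = args.job_name.lower()

    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    server = tf.distribute.Server(cluster, job_name=job_name, task_index=args.task_index)

    if job_name == "ps":
        # A parameter server only serves variables; this call never returns.
        server.join()
    else:
        # Worker: build the model and run the actual training here.
        for epoch in range(args.epochs):
            print("Worker", args.task_index, "epoch", epoch)

if __name__ == "__main__":
    main()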