TensorFlow
ModelArts supports multiple AI frameworks (engines). When you use these engines for model training, the boot commands must be adapted accordingly. This section describes how to adapt the boot command for the TensorFlow engine.
TensorFlow Startup Principle
Specifications and number of nodes
In this section, the flavor GPU: 8 × GP-Vnt1 | CPU: 72 cores | Memory: 512 GB is used as an example to describe how ModelArts allocates resources for single-node and distributed jobs.
For a single-node job (running on only one node), ModelArts starts a training container that exclusively uses the resources on the node.
For a distributed job (running on more than one node), ModelArts starts a parameter server (PS) and a worker on the same node. The PS owns the compute resources of CPU: 36 cores | Memory: 256 GB, and the worker owns GPU: 8 × GP-Vnt1 | CPU: 36 cores | Memory: 256 GB.
Only CPU and memory resources are allocated to a PS, whereas a worker can also own acceleration cards (except for CPU-only specifications). In this example, each worker owns eight GP-Vnt1 acceleration cards. If a PS and a worker are started on the same node, they share the disk resources.
Network communication
- For a single-node job, no network communication is required.
- For a distributed job, network communication is required both within nodes and between nodes.
Within nodes
Within a node, a PS and a worker communicate through either the container network or the host network.
- The container network is used when you run a training job on nodes that use shared resources.
- When you run a training job on nodes in a dedicated resource pool, the host network is used if the node is configured with RoCE NICs, and the container network is used if the node is configured with InfiniBand NICs.
Between nodes
For a distributed job, a PS and a worker communicate across nodes. ModelArts provides InfiniBand and RoCE NICs with a bandwidth of up to 100 Gbit/s.
Boot Commands
By default, the training service uses the Python interpreter in the job image to start the training script. To locate this interpreter, run the which python command. The working directory at startup is /home/ma-user/user-job-dir/<The code directory name>, which is the directory returned by pwd in the shell or os.getcwd() in Python.
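For reference, the following minimal snippet, placed at the top of a startup file, prints the interpreter and working directory described above. It is an illustrative sketch, not part of the ModelArts boot command.

import os
import sys

# Interpreter used to start the job; equivalent to running `which python` in the image.
print("Python interpreter:", sys.executable)

# Working directory at startup; expected to be
# /home/ma-user/user-job-dir/<The code directory name> as described above.
print("Working directory:", os.getcwd())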
- Boot command for a single-node, single-card job
python <Relative path of the startup file> <Job parameters>
- Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
- Job parameters: running parameters configured for a training job
Figure 1 Creating a training job
Configure the parameters by referring to the preceding figure. The console then runs the following command in the background (a sketch of a startup file that accepts this parameter is provided after these boot command examples):
python /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --epochs 5
- Boot command for distributed jobs
python --task_index ${VC_TASK_INDEX} --PS_hosts ${TF_PS_HOSTS} --worker_hosts ${TF_WORKER_HOSTS} --job_name ${MA_TASK_NAME} <Relative path of the startup file> <Job parameters>
- VC_TASK_INDEX: task index (serial number), for example, 0, 1, or 2.
- TF_PS_HOSTS: address array of the PS nodes, for example, [xx-PS-0.xx:TCP_PORT,xx-PS-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
- TF_WORKER_HOSTS: address array of the worker nodes, for example, [xx-worker-0.xx:TCP_PORT,xx-worker-1.xx:TCP_PORT]. TCP_PORT is a random port ranging from 5000 to 10000.
- MA_TASK_NAME: task name, which can be PS or worker.
- Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
- Job parameters: running parameters configured for a training job
Figure 2 Creating a training job
Configure the parameters by referring to the preceding figure. The console then runs the following command in the background (a distributed startup file sketch is provided after these boot command examples):
python --task_index "$VC_TASK_INDEX" --PS_hosts "$TF_PS_HOSTS" --worker_hosts "$TF_WORKER_HOSTS" --job_name "$MA_TASK_NAME" /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --epochs 5
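For reference, the following is a minimal sketch of a startup file that matches the single-node boot command above. Only the --epochs parameter comes from the example; the default value and the training loop placeholder are illustrative assumptions.

import argparse

def main():
    parser = argparse.ArgumentParser()
    # --epochs is the job parameter from the single-node example; the default is an assumption.
    parser.add_argument("--epochs", type=int, default=5)
    args, _ = parser.parse_known_args()

    for epoch in range(args.epochs):
        # Replace with the actual training logic of your model.
        print("Running epoch", epoch)

if __name__ == "__main__":
    main()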
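Similarly, the following sketch shows one way a startup file might consume the parameters injected by the distributed boot command, using the common TensorFlow parameter-server pattern (tf.train.ClusterSpec plus a server). It is illustrative only: the bracket stripping of the host lists, the lowercase normalization of the job name, and the default values are assumptions, and tf.distribute.Server is the TensorFlow 2.x name (TensorFlow 1.x uses tf.train.Server).

import argparse
import tensorflow as tf

def main():
    parser = argparse.ArgumentParser()
    # Parameters injected by the distributed boot command shown above.
    parser.add_argument("--task_index", type=int, default=0)
    parser.add_argument("--PS_hosts", type=str, default="")
    parser.add_argument("--worker_hosts", type=str, default="")
    parser.add_argument("--job_name", type=str, default="worker")
    parser.add_argument("--epochs", type=int, default=5)
    args, _ = parser.parse_known_args()

    # Assumption: the host lists arrive as "[host0:port,host1:port]"; strip the
    # brackets and split on commas to obtain plain "host:port" entries.
    ps_hosts = args.PS_hosts.strip("[]").split(",")
    worker_hosts = args.worker_hosts.strip("[]").split(",")

    # Assumption: normalize the job name ("PS" or "worker") to the lowercase
    # keys used in the cluster definition.
    job_name = args.job_name.lower()

    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    server = tf.distribute.Server(cluster, job_name=job_name, task_index=args.task_index)

    if job_name == "ps":
        # A parameter server only serves variables; this call never returns.
        server.join()
    else:
        # Worker: build the model and run the actual training here.
        for epoch in range(args.epochs):
            print("Worker", args.task_index, "epoch", epoch)

if __name__ == "__main__":
    main()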