Using TensorFlow to Train Neural Networks

This topic describes how to create a GPU-accelerated workload in CCI and use it to train a simple neural network in a container. TensorFlow image classification is used as the example.

Such container-based AI training and inference have the following advantages:

  • Environment differences are eliminated. You do not need to install software such as Python, TensorFlow, and the CUDA toolkit yourself.
  • You do not need to install the GPU driver.
  • Resource costs are low because resources are billed by the second.
  • The serverless architecture does not require VM O&M.

Creating an Image

The TensorFlow community provides base images with the TensorFlow library preinstalled. There are GPU-enabled and CPU-enabled images, which can be pulled under the following names:

  • GPU-enabled images: tensorflow/tensorflow:1.15.0-gpu
  • CPU-enabled images: tensorflow/tensorflow:1.13.0

This example uses Inception-v3, a pre-trained model from the official TensorFlow website, to classify images. Inception-v3 was trained on the data set of the 2012 ImageNet Challenge, in which a large image set is sorted into 1000 classes. The code for classifying images with Inception-v3 is available on GitHub.

The model training code is packaged in the project https://gpu-demo.obs.cn-north-1.myhuaweicloud.com/gpu-demo.zip. Download and decompress the package, and then add the project to an image. The Dockerfile for creating the image is as follows:

FROM tensorflow/tensorflow:1.15.0-gpu
ADD gpu-demo /home/project/gpu-demo

The ADD instruction copies the gpu-demo project to /home/project/gpu-demo in the image. You can change the destination directory as required.

Run the docker build -t tensorflow/tensorflow:v1 . command to create an image. The dot (.) indicates the current directory, that is, the directory where the Dockerfile is located.

After the image is created, push it to the SoftWare Repository for Container (SWR). For details about how to push an image, see Introduction.
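
The download and packaging steps above can be scripted. The following is a minimal Python 3 sketch for your workstation, not part of the original demo. It downloads gpu-demo.zip, unpacks it into a build directory, and writes the two-line Dockerfile shown earlier. The function names and the build directory are hypothetical, and the sketch assumes that the archive unpacks to a top-level gpu-demo directory, as the ADD instruction implies.

```python
import urllib.request
import zipfile
from pathlib import Path

# The URL and the Dockerfile content are taken from this topic.
DEMO_URL = "https://gpu-demo.obs.cn-north-1.myhuaweicloud.com/gpu-demo.zip"
DOCKERFILE = (
    "FROM tensorflow/tensorflow:1.15.0-gpu\n"
    "ADD gpu-demo /home/project/gpu-demo\n"
)


def download_demo(zip_path, url=DEMO_URL):
    """Download the demo archive to zip_path (requires network access)."""
    urllib.request.urlretrieve(url, str(zip_path))
    return Path(zip_path)


def prepare_build_context(zip_path, context_dir):
    """Unpack the archive into context_dir and write the Dockerfile there."""
    context = Path(context_dir)
    context.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(str(zip_path)) as archive:
        # Assumption: the archive contains a top-level gpu-demo directory.
        archive.extractall(str(context))
    (context / "Dockerfile").write_text(DOCKERFILE)
    return context
```

Calling download_demo("gpu-demo.zip") and then prepare_build_context("gpu-demo.zip", "build_context") should leave a directory in which the docker build command above can be run.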

Creating a TensorFlow Workload

  1. Log in to the CCI console.
  2. In the navigation pane, choose Namespaces. On the page displayed on the right, click Create in the GPU-accelerated area. In the displayed dialog box, enter the namespace name, set the VPC and subnet CIDR blocks, and click Create.

    Figure 1 GPU-accelerated namespace

  3. In the navigation pane, choose Workloads > Deployments. On the page displayed on the right, click Create Deployment.
  4. Configure workload information.

    1. Specify the workload name, select the namespace created in step 2, set Pods to 1, select GPU-accelerated for Pod Specifications, and select 418.126 for the GPU driver version.
      For details about GPU-accelerated pod specifications and GPU drivers, see Pod Specifications.
      Figure 2 Selecting GPU-accelerated pod specifications
    2. Select an image. In this example, select the TensorFlow image pushed to SWR.
    3. In the Advanced Settings area, mount an SFS volume of the NFS type to store the training results.
      Figure 3 Mounting NFS-type volume
    4. In the Startup Commands area, enter an executable command and parameters.
      • Executable command: /bin/bash
      • Parameter 1: -c
      • Parameter 2: python /home/project/gpu-demo/cifar10/cifar10_multi_gpu_train.py --num_gpus=1 --data_dir=/home/project/gpu-demo/cifar10/data --max_steps=10000 --train_dir=/tmp/sfs0/train_data; while true; do sleep 10; done

        --train_dir specifies the path for storing the training result. The path prefix /tmp/sfs0 must be the same as the Container Path specified in step 4.3. Otherwise, the training result cannot be written to the NFS-type volume.

        --max_steps specifies the number of training iterations. In this example, it is set to 10000, and model training takes about 3 minutes. If this parameter is not specified, the default value 1000000 is used and training takes much longer. A larger max_steps value lengthens training and generally yields a more accurate model.

      The preceding command trains the image classification model. The sleep loop after the training command keeps the container running once training is complete. After setting the command and parameters, click Next.

      Figure 4 Setting the container startup command
    5. Configure workload access settings.

      In this example, select Do not use. Then, click Next.

    6. Click Submit, and then click Back to Deployment List.

      If the workload is in the Running state in the workload list, it has been created successfully.
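
The relationship between the startup command and the mounted volume in step 4 can be checked before the values are entered on the console. The following Python 3 sketch is only an illustration: the helper names are hypothetical, while the script path, options, and the /tmp/sfs0 container path come from this example. It assembles the three console fields and rejects a --train_dir that does not lie under the container path.

```python
import posixpath

# Paths taken from the startup command in this topic.
TRAIN_SCRIPT = "/home/project/gpu-demo/cifar10/cifar10_multi_gpu_train.py"
DATA_DIR = "/home/project/gpu-demo/cifar10/data"


def is_under(path, mount_path):
    """Return True if path is mount_path itself or lies below it."""
    path = posixpath.normpath(path)
    mount = posixpath.normpath(mount_path)
    return path == mount or path.startswith(mount.rstrip("/") + "/")


def build_startup_command(mount_path, train_subdir="train_data",
                          max_steps=10000, num_gpus=1):
    """Return [executable command, parameter 1, parameter 2]."""
    train_dir = posixpath.join(mount_path, train_subdir)
    if not is_under(train_dir, mount_path):
        raise ValueError("train_dir must be inside the mounted volume")
    train = (
        f"python {TRAIN_SCRIPT} --num_gpus={num_gpus} "
        f"--data_dir={DATA_DIR} --max_steps={max_steps} "
        f"--train_dir={train_dir}"
    )
    # The endless sleep loop keeps the container alive after training ends.
    keep_alive = "while true; do sleep 10; done"
    return ["/bin/bash", "-c", f"{train}; {keep_alive}"]


for field in build_startup_command("/tmp/sfs0"):
    print(field)
```

With the container path /tmp/sfs0, the third printed field matches Parameter 2 above. A subdirectory such as ../other raises ValueError because it would resolve to a path outside the volume.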

Using an Existing Model to Classify Images

  1. Click the TensorFlow workload name. In the Pod List area of the workload details page, click the arrow icon to the left of the pod and then click the CLI tab. If the prompt (#) is displayed on the CLI, you have logged in to the pod.

    Figure 5 Accessing the pod using the web-terminal

  2. Switch to the directory where the project is located, and run classify_image.py with the --model_dir and --image_file options to view the classification result.

    # cd /home/project/gpu-demo                                                     
    # ls -l                                                                         
    total 96                                                                        
    -rw-r--r-- 1 root root  6874 Aug 30 08:09 airplane.jpg                          
    drwxr-xr-x 3 root root  4096 Sep  4 07:54 cifar10                               
    drwxr-xr-x 3 root root  4096 Aug 30 08:09 cifar10_estimator                     
    -rw-r--r-- 1 root root 30836 Aug 30 08:09 dog.jpg                               
    -rw-r--r-- 1 root root 43675 Aug 30 08:09 flower.jpg                            
    drwxr-xr-x 4 root root  4096 Sep  4 02:14 inception                             
    # cd inception                                                                  
    # python classify_image.py --model_dir=model --image_file=/home/project/gpu-demo/airplane.jpg                                  
    ...
    2019-01-02 08:05:24.891201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:0a.0, compute capability: 6.0)           
    airliner (score = 0.84250)                                                      
    wing (score = 0.03228)                                                          
    space shuttle (score = 0.02524)                                                 
    warplane, military plane (score = 0.00691)                                      
    airship, dirigible (score = 0.00664)

    In the preceding command, --image_file specifies the image to be classified (shown in the following figure). The last lines of the output list the classification labels and their scores. A higher score means that the model is more confident in that label. The line airliner (score = 0.84250) indicates that the model recognizes the image as an airliner.

    Figure 6 airliner

    You can also omit --image_file. If no image is specified, the following default image is used.

    Figure 7 Default image

    In this case, run the python classify_image.py --model_dir=model command to view the classification result.

    # python classify_image.py --model_dir=model
    ...
    2019-01-02 08:02:33.271527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:0a.0, compute capability: 6.0)                                   
    giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)                                                                             
    indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)               
    lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296)                                                                           
    custard apple (score = 0.00147)                                                 
    earthstar (score = 0.00117)

    The result shows that the model recognizes the image as a panda.
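
If the classification output needs to be processed by a script, the label and score lines can be extracted with a regular expression. The following Python sketch is an illustration only, with a hypothetical function name. It assumes that every result line has the form label (score = value), as in the listings above, and it ignores all other log lines.

```python
import re

# Matches lines such as "airliner (score = 0.84250)".
SCORE_LINE = re.compile(r"^\s*(?P<label>.+?) \(score = (?P<score>[0-9.]+)\)\s*$")


def parse_predictions(output):
    """Return (label, score) pairs in the order they appear in the output."""
    predictions = []
    for line in output.splitlines():
        match = SCORE_LINE.match(line)
        if match:
            predictions.append((match.group("label"), float(match.group("score"))))
    return predictions


# Result lines copied from the airplane example above.
sample = """\
airliner (score = 0.84250)
wing (score = 0.03228)
space shuttle (score = 0.02524)
warplane, military plane (score = 0.00691)
airship, dirigible (score = 0.00664)
"""

predictions = parse_predictions(sample)
best_label, best_score = max(predictions, key=lambda item: item[1])
print(best_label, best_score)  # prints: airliner 0.8425
```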

Using the Trained Image Classification Model

The official TensorFlow website provides the model code and training data for a deep convolutional neural network (DCNN) trained on CIFAR-10. CIFAR-10 is a small image classification data set whose images fall into 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The training data consists of images from these 10 categories.

  1. On the CCI console, click the workload name. In the Pod List area of the workload details page, click the arrow icon to the left of the pod and then click the CLI tab. Then, use cifar10_eval.py provided in the code to check the accuracy of the model. In the following command, set --checkpoint_dir to the directory that contains the newly trained model.

    # cd /home/project/gpu-demo/cifar10
    # python cifar10_eval.py --data_dir=data --checkpoint_dir=/tmp/sfs0/train_data --run_once
    ...
    2019-01-02 08:25:43.914186: precision @1 = 0.817

  2. Use the preceding airplane image again for testing. In the following command, set --checkpoint_dir to the directory that contains the newly trained model and --test_file to the image to be tested.

    # python label_image.py --checkpoint_dir=/tmp/sfs0/train_data --test_file=/home/project/gpu-demo/airplane.jpg
     ...
    2019-01-02 08:36:42.149700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15131 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:0a.0, compute capability: 6.0)                                   
    airplane (score = 4.28143)                                                      
    ship (score = 1.92319)                                                          
    cat (score = 0.03095)

    The result shows that the model correctly recognizes the image as an airplane. label_image.py is the script that classifies an image with the newly trained model.

    You can also view resource usage on the Monitoring tab in the Pod List area.
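
The precision @1 value printed by cifar10_eval.py in step 1 can be read as the share of evaluation images whose highest-scoring class equals the true label, assuming the demo follows the upstream TensorFlow CIFAR-10 tutorial code. The following plain Python sketch illustrates that calculation. It is not the demo code, and the function name and sample scores are made up for illustration.

```python
def precision_at_1(scores, labels):
    """Fraction of samples whose top-scoring class index equals the label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score in this row is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical scores for four images over three classes.
scores = [
    [4.3, 1.9, 0.0],   # predicts class 0
    [0.2, 0.1, 2.5],   # predicts class 2
    [1.0, 3.0, 0.5],   # predicts class 1
    [0.9, 0.3, 0.1],   # predicts class 0
]
labels = [0, 2, 1, 1]   # the last image is misclassified

print(precision_at_1(scores, labels))  # prints 0.75
```

Read this way, the value 0.817 reported in step 1 means that the top prediction was correct for about 82% of the evaluation images.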