
Using Kubeflow and Volcano to Train an AI Model

Kubernetes has become the de facto standard for cloud native application orchestration and management, and an increasing number of applications are being migrated to it. AI and machine learning workloads inherently involve large numbers of compute-intensive tasks. Kubernetes has become a preferred tool for developers building AI platforms because of its excellent capabilities in resource management, application orchestration, and O&M monitoring.

Emergence and Constraints of Kubeflow

Building an end-to-end AI computing platform on Kubernetes is complex and demanding. As shown in Figure 1, more than a dozen phases are required. Apart from the familiar model training phase, the process also includes data collection, preprocessing, resource management, feature extraction, data verification, model management, model release, and monitoring. If AI algorithm engineers want to run a model training task, they first have to build an entire AI computing platform. Imagine how time-consuming and labor-intensive that is, and how much knowledge and experience it requires.

Figure 1 Model training

This is where Kubeflow comes in. Created in 2017, Kubeflow is a platform built on containers and Kubernetes for agile development, training, release, deployment, and management in the machine learning field. It leverages cloud native technologies to make it faster and easier for data scientists, machine learning engineers, and system O&M personnel to deploy, use, and manage popular machine learning software.

Kubeflow 1.0 is now available, providing capabilities in development, building, training, and deployment that cover the entire process of machine learning and deep learning for enterprise users.


With Kubeflow 1.0, you first develop a model using Jupyter, and then build container images using tools such as Fairing (SDK). Next, you create Kubernetes resources to train the model. After training is complete, you create and deploy servers for inference using KFServing. In this way, Kubeflow gives you an end-to-end, agile workflow for machine learning tasks. The entire process can be automated using pipelines, which help achieve DevOps in the AI field.

Kubernetes Pain Points

Does that mean we can now sit back and relax? Not yet. Kubeflow uses the default Kubernetes scheduler, which was originally designed for long-running services. Its scheduling capabilities are inadequate for the batch computing and elastic scheduling required in AI and big data scenarios. The main constraints are as follows:

Resource preemption

A TensorFlow job involves two roles: parameter server (ps) and worker. A job can be executed normally only when the pods of both roles are running at the same time. However, the default scheduler is insensitive to the roles of pods in a TensorFlow job: pods are treated identically and scheduled one by one. This causes problems when there are multiple jobs to schedule and cluster resources are scarce, because each job may be allocated only part of the resources it needs to finish execution. That is, resources are used up, yet no job can run to completion. To illustrate this dilemma, assume that you want to run two TensorFlow jobs, TFJob1 and TFJob2. Each job has four workers, and each worker needs one GPU, so each job requires four GPUs to run. However, your cluster has only four available GPUs in total. In this case, the default scheduler could allocate two GPUs to TFJob1 and two to TFJob2. Each job then waits for the other to finish and release resources, which never happens without manual intervention. The resulting deadlock wastes resources and lowers job execution efficiency.
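
To make the arithmetic concrete, the following is a minimal sketch of the per-worker resource request behind this scenario. It is illustrative only and is not part of the MNIST example later in this section; the image placeholder follows the same convention as that example.

  # Hypothetical worker pod template fragment: each worker requests one GPU,
  # so a job with four such workers needs four GPUs before it can make progress.
  spec:
    containers:
    - name: tensorflow
      image: {image}            # placeholder image, as in the example later in this section
      resources:
        limits:
          nvidia.com/gpu: 1     # one GPU per worker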

Lack of affinity-based scheduling

In distributed training, data is frequently exchanged between parameter servers and workers. For higher efficiency, the parameter servers and workers of the same job should be scheduled to the same node so that they can communicate over the faster local network. However, the default scheduler is insensitive to this affinity between the parameter servers and workers of a job, and pods end up scheduled essentially at random. As shown in the following figure, assume that you want to run two TensorFlow jobs, each with one ps and two workers. With the default scheduler, the scheduling result could be any of the three situations shown, but only result (c) delivers the highest efficiency: in (c), each ps and its workers communicate over the local network, which shortens the training time.
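
For comparison, the closest you can get with the default scheduler alone is to hand-write pod affinity rules for every job. The sketch below is illustrative only (the job-name label is an assumption, not part of the MNIST example); it shows how much per-job boilerplate this approach requires:

  # Hypothetical fragment added to each worker pod spec so that it prefers
  # nodes already running pods of the same job (identified by an assumed label).
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              job-name: tfjob1                 # assumed label shared by pods of the job
          topologyKey: kubernetes.io/hostname  # co-locate on the same node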

Volcano, a Perfect Batch Scheduling System for Accelerating AI Computing

Volcano is an enhanced batch scheduling system for high-performance computing workloads running on Kubernetes. It complements Kubernetes in machine learning, deep learning, HPC, and big data computing scenarios, providing capabilities such as gang scheduling, computing task queue management, task topology, and GPU affinity scheduling. In addition, Volcano enhances batch job creation and lifecycle management and adds scheduling policies such as fair-share and binpack on top of Kubernetes-native capabilities. It fully addresses the constraints of Kubeflow in distributed training described above.
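
As a quick illustration of gang scheduling, the following is a minimal sketch of a Volcano PodGroup; the name, namespace, and queue are assumptions rather than part of the MNIST example. The minMember field tells the scheduler not to start any pod of the group until all four can be placed, which prevents the partial-allocation deadlock described earlier. In practice, the PodGroup for a job is usually created for you when the job is handled by the Volcano scheduler, so you rarely write one by hand.

  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: tfjob1-group          # illustrative name
    namespace: default          # illustrative namespace
  spec:
    minMember: 4                # schedule the group only when all four pods can run
    queue: default              # Volcano queue to submit the group to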

For more information about Volcano, visit https://github.com/volcano-sh/volcano.

Using Volcano in Huawei Cloud

The convergence of Kubeflow and Volcano, two open-source projects, greatly simplifies and accelerates AI computing workloads on Kubernetes. The two projects have been recognized by an increasing number of players in the field and are used in production environments. Volcano is used in Huawei Cloud CCE, Cloud Container Instance (CCI), and the Kubernetes-Native Batch Computing Solution. Volcano will continue to iterate with optimized algorithms, enhanced capabilities such as intelligent scheduling, and new features such as GPU sharing for inference scenarios, to further improve the efficiency of Kubeflow batch training and inference.

Implementing Typical Distributed AI Training Jobs

This section describes how to perform distributed training of a handwritten digit classification model on the MNIST dataset using Kubeflow and Volcano.

  1. Log in to the CCE console and create a CCE cluster. For details, see Buying a CCE Cluster.
  2. Deploy Volcano on the created CCE cluster.

    In the navigation pane on the left, choose Add-ons. On the Add-on Marketplace tab page, click Install Add-on under volcano. In the Basic Information area on the Install Add-on page, select the cluster and Volcano version, and click Next: Configuration.

    Figure 2 Installing the volcano add-on

    The volcano add-on has no configuration parameters. Click Install and wait until the installation is complete.
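
    After the installation is complete, you can optionally confirm that the Volcano components are running before you continue. The namespace depends on how the add-on is installed, so the following check simply filters pods by name:

    # Optional check: list the Volcano scheduler and controller pods in any namespace.
    kubectl get pods --all-namespaces | grep volcano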

  3. Deploy the Kubeflow environment.

    1. Install kfctl and set environment variables.
      1. Set environment variables as follows:
        export KF_NAME=<your choice of name for the Kubeflow deployment>
        export BASE_DIR=<path to a base directory>
        export KF_DIR=${BASE_DIR}/${KF_NAME}
        export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml"
      2. Install kfctl.
        Download kfctl from https://github.com/kubeflow/kfctl/releases/tag/v1.0.2.
        tar -xvf kfctl_v1.0.2_<platform>.tar.gz
        chmod +x kfctl
        mv kfctl /usr/local/bin/
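
        If the binary is installed correctly, the following command should print the release you downloaded (a quick sanity check; it does not change anything):
        kfctl version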
    2. Deploy Kubeflow.
      mkdir -p ${KF_DIR}
      cd ${KF_DIR}
      kfctl apply -V -f ${CONFIG_URI} 

      The following PVCs remain in the Pending state. Delete them and create four PVCs with the same names in CCE; a sample manifest is sketched after the listing below.

      # kubectl get pvc -n kubeflow
      NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      katib-mysql      Pending                                                     3m56s
      metadata-mysql   Pending                                                     4m2s
      minio-pv-claim   Pending                                                     3m55s
      mysql-pv-claim   Pending                                                     3m54s
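
      The following is a minimal sketch of one such PVC. The storage class (csi-disk, the CCE EVS disk class) and the 10Gi size are assumptions; adjust them to the storage you actually want to use, and repeat for the other three names.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: katib-mysql            # repeat for metadata-mysql, minio-pv-claim, mysql-pv-claim
        namespace: kubeflow
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi            # assumed size
        storageClassName: csi-disk   # assumed CCE EVS storage class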

  4. Deploy the MNIST dataset.

    1. Download kubeflow/examples to the local host and select an operation guide based on the environment.
      yum install git
      git clone https://github.com/kubeflow/examples.git
    2. Install python3.
      wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tgz
      tar -zxvf Python-3.6.8.tgz
      cd Python-3.6.8
      ./configure
      make
      make install

      After the installation, run the following commands to check whether the installation is successful:

      python3 -V 
      pip3 -V
    3. Install and start Jupyter Notebook.
      pip3 install jupyter notebook
      jupyter notebook --allow-root
    4. Configure an SSH tunnel on PuTTY and remotely connect to the notebook, for example, by forwarding local port 8000 to the port the notebook listens on (8888 by default).
    5. After the connection is successful, enter localhost:8000 in the address bar of a browser to log in to the notebook.

    6. Create a distributed training job by following the prompts in the Jupyter notebook. Set schedulerName to volcano to use the Volcano scheduler.
      apiVersion: kubeflow.org/v1
      kind: TFJob
      metadata:
        name: {train_name}  
      spec:
        schedulerName: volcano
        tfReplicaSpecs:
          Ps:
            replicas: {num_ps}
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                serviceAccount: default-editor
                containers:
                - name: tensorflow
                  command:
                  ...
                  env:
                  ...
                  image: {image}
                  workingDir: /opt
                restartPolicy: OnFailure
          Worker:
            replicas: 1
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                serviceAccount: default-editor
                containers:
                - name: tensorflow
                  command:
                  ...
                  env:
                  ...
                  image: {image}
                  workingDir: /opt
                restartPolicy: OnFailure

  5. Submit the job and start the training.

    kubectl apply -f mnist.yaml
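
    To watch the job while it runs, you can query the TFJob, its pods, and the PodGroup created for gang scheduling. The kubeflow namespace is an assumption; use whichever namespace the notebook submitted the job to.

    # Check the TFJob, its ps/worker pods, and the associated Volcano PodGroup.
    kubectl get tfjobs -n kubeflow
    kubectl get pods -n kubeflow
    kubectl get podgroups -n kubeflow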

    After the training job is complete, you can query the training results on the Kubeflow UI. This is how you run a simple distributed training job using Kubeflow and Volcano. Kubeflow simplifies TensorFlow job configuration, and Volcano, with just one more line of configuration, saves you significant time and effort in large-scale distributed training by providing capabilities such as gang scheduling and task topology to eliminate deadlocks and achieve affinity-based scheduling.