Using Kubeflow and Volcano to Train an AI Model
Kubernetes has become the de facto standard for cloud native application orchestration and management, and an increasing number of applications are being migrated to it. AI and machine learning workloads inherently involve large numbers of compute-intensive tasks. Kubernetes is a preferred tool for developers building AI platforms because of its excellent capabilities in resource management, application orchestration, and O&M monitoring.
Problems of the Default Kubernetes Scheduler in Batch Computing
Kubeflow uses the default Kubernetes scheduler, which was originally designed for long-running services. Its scheduling capabilities fall short for the batch computing and elastic scheduling required in AI and big data scenarios. The main constraints are as follows:
Resource preemption
A TensorFlow job consists of parameter servers (PS) and workers that must run together to complete a training task; if only part of its pods are running, the job cannot make progress. The default Kubernetes scheduler places pods one by one and does not recognize the PS-worker dependency within a Kubeflow TFJob, which can cause resource allocation problems when a cluster is under heavy load. For example, suppose a cluster has four GPUs available, and TFJob1 and TFJob2 are each configured with four workers but are each allocated only two GPUs. Because each job needs all four GPUs before training can start, both jobs wait indefinitely for the other to release resources. The result is a deadlock and wasted GPU capacity.
Lack of affinity-based scheduling
In distributed training, data is exchanged frequently between the PS and workers, and the bandwidth between them directly affects training performance. However, the default Kubernetes scheduler is unaware of the logical relationship between the PS and workers in a job, so their pods end up placed essentially at random. For example, scheduling two TFJobs, each with one PS and two workers, can produce several different placements. The best outcome is when each job's PS and workers are co-located, so they can exchange data over the local network and training runs faster.
Volcano, a Perfect Batch Scheduling System for Accelerating AI Computing
Volcano is an enhanced batch scheduling system for high-performance computing workloads running on Kubernetes. It complements Kubernetes in machine learning, deep learning, HPC, and big data computing scenarios, providing capabilities such as gang scheduling, computing task queue management, task-topology, and GPU affinity scheduling. In addition, Volcano enhances batch task creation and lifecycle management, fair-share, binpack, and other Kubernetes-native capabilities. It fully addresses the constraints of Kubeflow in distributed training mentioned above.
For more information about Volcano, visit https://github.com/volcano-sh/volcano.
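As a minimal sketch of how gang scheduling is expressed, the following Volcano Job runs a one-PS, two-worker TensorFlow job. The job name and image are placeholders chosen for illustration; minAvailable: 3 tells Volcano to place the job only when all three pods can be scheduled at once, which avoids the partial-allocation deadlock described above.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-gang-demo                        # placeholder job name
spec:
  minAvailable: 3                           # gang scheduling: all 3 pods are placed, or none
  schedulerName: volcano
  queue: default
  tasks:
  - name: ps
    replicas: 1
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:latest   # placeholder image
        restartPolicy: OnFailure
  - name: worker
    replicas: 2
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:latest   # placeholder image
        restartPolicy: OnFailure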
Using Volcano in Huawei Cloud
The convergence of Kubeflow and Volcano, two open-source projects, greatly simplifies and accelerates AI computing workloads on Kubernetes. The two projects have been recognized by an increasing number of players in the field and are used in production environments. Volcano is used in Huawei Cloud CCE, CCI, and the Kubernetes-Native Batch Computing Solution. Volcano will continue to iterate with optimized algorithms, enhanced capabilities such as intelligent scheduling, and new inference features such as GPU Share, to further improve the efficiency of Kubeflow batch training and inference.
Implementing Typical Distributed AI Training Jobs
This section describes how to perform distributed training of a digital image classification model using the MNIST dataset based on Kubeflow and Volcano.
- Log in to the CCE console and click the cluster name to access the cluster console.
- Deploy Volcano on the cluster.
In the navigation pane, choose Add-ons. In the right pane, find Volcano Scheduler and click Install. In the window that slides out from the right, configure the specifications and click Install.
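(Optional) To confirm that the scheduler components are up, you can check them from the command line. In a community installation they run in the volcano-system namespace; the namespace used by the CCE add-on may differ, so adjust the command accordingly.
kubectl get pods -n volcano-system    # volcano-scheduler, volcano-controllers, and volcano-admission should be Running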
- Deploy the MNIST dataset.
- Download kubeflow/examples to the local host and select a guide based on the environment. The command is as follows:
yum install git
git clone https://github.com/kubeflow/examples.git
- Install Python3. For details, see Getting and installing the latest version of Python.
wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tgz
tar -zxvf Python-3.6.8.tgz
cd Python-3.6.8
./configure
make
make install
After the installation, run the following commands to check whether the installation is successful:
python3 -V
pip3 -V
- Install and start Jupyter Notebook.
pip3 install jupyter notebook
jupyter notebook --allow-root
- Configure an SSH tunnel on PuTTY and remotely connect to the notebook.
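If you are not using PuTTY, an equivalent tunnel can be created with the OpenSSH client. The example below assumes the notebook is listening on port 8888 on the server and forwards it to local port 8000, matching the address used in the next step; replace the user and host with your own.
ssh -L 8000:127.0.0.1:8888 root@<server-ip>    # forward local port 8000 to the remote notebook port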
- After the connection is successful, enter localhost:8000 in the address bar of a browser to log in to the notebook.
- Create a distributed training job as prompted by Jupyter. Set schedulerName to volcano to enable the Volcano scheduler.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: {train_name}
spec:
  schedulerName: volcano
  tfReplicaSpecs:
    Ps:
      replicas: {num_ps}
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command: ...
            env: ...
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          containers:
          - name: tensorflow
            command: ...
            env: ...
            image: {image}
            workingDir: /opt
          restartPolicy: OnFailure
- Submit the job and start the training.
kubectl apply -f mnist.yaml
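(Optional) While the job is running, you can verify from the command line that its pods were gang-scheduled together. These are generic kubectl queries and assume the job was submitted to the current namespace; resource names will vary.
kubectl get tfjobs          # job status reported by the Kubeflow training operator
kubectl get podgroups       # PodGroup created by Volcano for gang scheduling
kubectl get pods -o wide    # check that the PS and worker pods are all placed and running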
After the training job is complete, you can view the training results on the Kubeflow UI. This is how you run a simple distributed training job using Kubeflow and Volcano. Kubeflow simplifies TensorFlow job configuration. Volcano, with just one more line of configuration, saves you significant time and effort in large-scale distributed training by providing capabilities such as gang scheduling and task topology, eliminating deadlocks and enabling affinity-based scheduling.