Training TensorFlow Models
After Kubeflow is deployed, you can train TensorFlow models in parameter server (ps)-worker mode. This section walks through a TensorFlow training example published on the official Kubeflow website. For more information, see https://www.kubeflow.org/docs/guides/components/tftraining/.
Creating a TfCnn Training Job
Run the following commands to create a TfCnn training job:
CNN_JOB_NAME=mycnnjob
VERSION=v0.4.0
ks init ${CNN_JOB_NAME}
cd ${CNN_JOB_NAME}
ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow-git/examples
ks generate tf-job-simple-v1beta1 ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
${KF_ENV} must be set before you run ks apply. Run the ks env list command to obtain its value; in this example, the value of ${KF_ENV} is default. After the commands complete, run the kubectl get po command to view the result.
Using a Single GPU for Training
The preceding training job can also run on GPUs. To use a GPU, open the TFJob configuration file for editing:
vi ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet
Here ${KS_APP} is the ksonnet application directory, which in this example is the ${CNN_JOB_NAME} directory created by ks init. Replace the content of the mycnnjob.jsonnet file with the following:
local env = std.extVar("__ksonnet/environments");
local params = std.extVar("__ksonnet/params").components.mycnnjob;

local k = import "k.libsonnet";

local name = params.name;
local namespace = env.namespace;
// Not referenced below: each replica sets its own image explicitly.
local image = "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3";

local tfjob = {
  apiVersion: "kubeflow.org/v1beta1",
  kind: "TFJob",
  metadata: {
    name: name,
    namespace: namespace,
  },
  spec: {
    tfReplicaSpecs: {
      // Worker replica: runs the benchmark on one GPU.
      Worker: {
        replicas: 1,
        template: {
          metadata: {
            annotations: {
              // The key contains dots and a slash, so it must be quoted in jsonnet.
              "sidecar.istio.io/inject": "false",
            },
          },
          spec: {
            containers: [
              {
                args: [
                  "python",
                  "tf_cnn_benchmarks.py",
                  "--batch_size=64",
                  "--num_batches=100",
                  "--model=resnet50",
                  "--variable_update=parameter_server",
                  "--flush_stdout=true",
                  "--num_gpus=1",
                  "--local_parameter_device=cpu",
                  "--device=gpu",
                  "--data_format=NHWC",
                ],
                image: "swr.ap-southeast-1.myhuaweicloud.com/wubowen585/tf-benchmarks-gpu:v0",
                name: "tensorflow",
                ports: [
                  {
                    containerPort: 2222,
                    name: "tfjob-port",
                  },
                ],
                resources: {
                  limits: {
                    "nvidia.com/gpu": 1,
                  },
                },
                workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
              },
            ],
            restartPolicy: "OnFailure",
          },
        },
      },
      // Parameter server replica: runs on CPU.
      Ps: {
        replicas: 1,
        template: {
          spec: {
            containers: [
              {
                args: [
                  "python",
                  "tf_cnn_benchmarks.py",
                  "--batch_size=64",
                  "--num_batches=100",
                  "--model=resnet50",
                  "--variable_update=parameter_server",
                  "--flush_stdout=true",
                  "--num_gpus=1",
                  "--local_parameter_device=cpu",
                  "--device=cpu",
                  "--data_format=NHWC",
                ],
                image: "swr.ap-southeast-1.myhuaweicloud.com/wubowen585/tf-benchmarks-cpu:v0",
                name: "tensorflow",
                ports: [
                  {
                    containerPort: 2222,
                    name: "tfjob-port",
                  },
                ],
                resources: {
                  limits: {
                    cpu: 4,
                  },
                },
                workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
              },
            ],
            restartPolicy: "OnFailure",
          },
        },
        tfReplicaType: "PS",
      },
    },
  },
};

k.core.v1.list.new([
  tfjob,
])
After replacing the file content, restart the TFJob by deleting it and creating it again. After running the ks delete command, wait about 30 seconds and confirm that the TFJob has been deleted before running the ks apply command.
ks delete ${KF_ENV} -c ${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
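The wait between ks delete and ks apply can be scripted instead of timed by hand. The following is a minimal sketch, not part of the original procedure: wait_until_fails is a hypothetical helper that polls a command once per second until it fails or a timeout elapses. It assumes that kubectl get tfjob fails once the job no longer exists; note that the command can also fail for unrelated reasons, such as an unreachable API server.

```shell
# Hypothetical helper: poll a command until it fails or the timeout elapses.
# Usage: wait_until_fails <timeout_seconds> <command> [args...]
# Returns 0 when the command fails (resource gone), 1 on timeout.
wait_until_fails() {
  timeout_s=$1
  shift
  elapsed=0
  while "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example (requires a cluster, so it is left commented out):
# ks delete "${KF_ENV}" -c "${CNN_JOB_NAME}"
# wait_until_fails 60 kubectl get tfjob "${CNN_JOB_NAME}" \
#   && ks apply "${KF_ENV}" -c "${CNN_JOB_NAME}"
```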
After the worker finishes the job (about 5 minutes when a GPU is used), run the following commands to view the result:
kubectl get po
kubectl logs ${CNN_JOB_NAME}-worker-0
In this example, the ResNet-50 CNN model is trained on randomly generated images using the distributed TensorFlow architecture. Each training step processes 64 images (specified by batch_size), and a total of 100 training steps (specified by num_batches) are performed. The training throughput (images/sec) at each step is recorded. The result shows that a single P100 GPU trains at 158.62 images/sec.
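To pull the final throughput figure out of a long worker log, a small filter can help. This is a sketch under one assumption: that the benchmark log ends with a summary line of the form "total images/sec: <number>". The helper name total_images_per_sec is hypothetical.

```shell
# Hypothetical helper: read benchmark log text on stdin and print the value
# from the last line that starts with "total images/sec:".
# Prints nothing if no such line is present.
total_images_per_sec() {
  awk -F': ' '/^total images\/sec:/ { value = $2 }
              END { if (value != "") print value }'
}

# Example (requires a cluster, so it is left commented out):
# kubectl logs "${CNN_JOB_NAME}-worker-0" | total_images_per_sec
```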
Using Multiple GPUs for Training
To demonstrate the advantages of distributed TensorFlow jobs, two GPUs are used to run the same training job. In this example, the number of workers is changed to 2. You can perform the following procedure to change the number of workers:
vi ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet
Change the number of worker replicas to 2.
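If you prefer not to edit the file by hand, the change can be scripted. The sketch below assumes the component file is laid out with one field per line and that the replicas field is the first one after the "Worker: {" line. set_worker_replicas is a hypothetical helper that reads the file on stdin and writes the modified text to stdout, leaving the Ps replica count unchanged.

```shell
# Hypothetical helper: set the replica count of the Worker block only.
# The sed range starts at the "Worker: {" line and ends at the next line
# containing "replicas:", so the Ps block is not touched.
set_worker_replicas() {
  sed "/Worker: {/,/replicas:/ s/replicas: [0-9][0-9]*/replicas: $1/"
}

# Example (writes to a temporary file, then replaces the original):
# f="components/${CNN_JOB_NAME}.jsonnet"
# set_worker_replicas 2 < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
```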
Save the modification and restart the TFJob.
ks delete ${KF_ENV} -c ${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
Wait for about 5 minutes and query the training results of the two workers.
kubectl get po
kubectl logs ${CNN_JOB_NAME}-worker-0
kubectl logs ${CNN_JOB_NAME}-worker-1
The training results show that the training performance of two workers is almost twice that of a single worker.
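The scaling claim can be checked with a short calculation. The sketch below assumes that each worker log reports that worker's own throughput, so the aggregate is the sum across workers. speedup is a hypothetical helper, and the per-worker values in the example (150 images/sec each) are made-up placeholders, not measured results.

```shell
# Hypothetical helper: ratio of summed per-worker throughput to a baseline.
# Usage: speedup <single_worker_images_per_sec> <worker0> [worker1 ...]
speedup() {
  baseline=$1
  shift
  printf '%s\n' "$@" |
    awk -v base="$baseline" '{ sum += $1 } END { printf "%.2f\n", sum / base }'
}

# Example with placeholder per-worker values (not measured results):
# speedup 158.62 150 150
```

A result close to 2.00 would correspond to near-linear scaling from one worker to two.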