
Training TensorFlow Models

After Kubeflow is deployed, you can easily use the ps-worker (parameter server and worker) mode to train TensorFlow models. This section provides a TensorFlow training example released on the official Kubeflow website. For more information, see https://www.kubeflow.org/docs/guides/components/tftraining/.

Creating a TfCnn Training Job

Run the following commands to create a TfCnn training job:

CNN_JOB_NAME=mycnnjob
VERSION=v0.4.0

ks init ${CNN_JOB_NAME}
cd ${CNN_JOB_NAME}
ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow-git/examples

ks generate tf-job-simple-v1beta1 ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}

You can run the ks env list command to obtain the value of ${KF_ENV}. In this example, the value of ${KF_ENV} is default. After the commands are executed, run the kubectl get po command to check that the TFJob pods have been created.
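
For example (the environment name and the pod names mentioned in the comments are assumptions based on the setup described in this section):

# List the ksonnet environments; the environment name is used as ${KF_ENV} (default in this example).
ks env list
KF_ENV=default

# Check that the TFJob pods (for example, mycnnjob-worker-0 and mycnnjob-ps-0) have been created.
kubectl get po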

Using a Single GPU for Training

The preceding training job can also run on GPUs. Perform the following steps to modify the TFJob configuration file, where ${KS_APP} is the ksonnet application directory created by ks init (mycnnjob in this example):

vi ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet

Replace the content of the mycnnjob.jsonnet file with the following:

local env = std.extVar("__ksonnet/environments");
local params = std.extVar("__ksonnet/params").components.mycnnjob;

local k = import "k.libsonnet";

local name = params.name;
local namespace = env.namespace;
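// Note: this local is not referenced below; the worker and PS containers specify their own images.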
local image = "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3";

local tfjob = {
  apiVersion: "kubeflow.org/v1beta1",
  kind: "TFJob",
  metadata: {
    name: name,
    namespace: namespace,
  },
  spec: {
    tfReplicaSpecs: {
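      // Worker replica: runs tf_cnn_benchmarks on a GPU (--device=gpu).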
      Worker: {
        replicas: 1,
        template: {
          metadata:{
            annotations: {
              "sidecar.istio.io/inject": "false"
            }
          },
          spec: {
            containers: [
              {
                args: [
                  "python",
                  "tf_cnn_benchmarks.py",
                  "--batch_size=64",
                  "--num_batches=100",
                  "--model=resnet50",
                  "--variable_update=parameter_server",
                  "--flush_stdout=true",
                  "--num_gpus=1",
                  "--local_parameter_device=cpu",
                  "--device=gpu",
                  "--data_format=NHWC",
                ],
                image: "swr.ap-southeast-1.myhuaweicloud.com/wubowen585/tf-benchmarks-gpu:v0",
                name: "tensorflow",
                ports: [
                  {
                    containerPort: 2222,
                    name: "tfjob-port",
                  },
                ],
                resources: {
                  limits: {
                    "nvidia.com/gpu": 1,
                  },   
                },
                workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
              },
            ],
            restartPolicy: "OnFailure",
          },
        },
      },
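      // Parameter server (PS) replica: runs on CPU and holds the model variables (--device=cpu).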
      Ps: {
        replicas: 1,
        template: {
          spec: {
            containers: [
              {
                args: [
                  "python",
                  "tf_cnn_benchmarks.py",
                  "--batch_size=64",
                  "--num_batches=100",
                  "--model=resnet50",
                  "--variable_update=parameter_server",
                  "--flush_stdout=true",
                  "--num_gpus=1",
                  "--local_parameter_device=cpu",
                  "--device=cpu",
                  "--data_format=NHWC",
                ],
                image: "swr.ap-southeast-1.myhuaweicloud.com/wubowen585/tf-benchmarks-cpu:v0",
                name: "tensorflow",
                ports: [
                  {
                    containerPort: 2222,
                    name: "tfjob-port",
                  },
                ],
                resources: {
                  limits: {
                    cpu: 4,
                  },   
                },
                workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
              },
            ],
            restartPolicy: "OnFailure",
          },
        },
        tfReplicaType: "PS",
      },
    },
  },
};

k.core.v1.list.new([
  tfjob,
])

After the replacement is complete, restart the TFJob by deleting it and applying it again. After running the ks delete command, wait about 30 seconds for the TFJob to be completely deleted before running ks apply.

ks delete ${KF_ENV} -c ${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
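
If you want to confirm that the old pods are gone instead of waiting a fixed 30 seconds between the delete and the apply, a minimal check such as the following can be used (it assumes the ${CNN_JOB_NAME}-worker-N and ${CNN_JOB_NAME}-ps-N pod naming convention used in this section):

# When this command prints nothing, the old TFJob pods have been removed and ks apply can be run.
kubectl get po | grep ${CNN_JOB_NAME}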

After the worker completes the job (about 5 minutes when a GPU is used), run the following commands to view the running result:

kubectl get po
kubectl logs ${CNN_JOB_NAME}-worker-0

In this example, the ResNet50 CNN model is trained on randomly generated images using the TensorFlow distributed architecture. 64 images are trained in each step (specified by batch_size), and a total of 100 training steps are performed (specified by num_batches). The training throughput (images/sec) is recorded at each step. The training result shows that the performance of a single P100 GPU is 158.62 images/sec.
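
If you only want the throughput figures, you can filter the worker log. The following sketch assumes that the benchmark output contains lines with "images/sec", as in the result quoted above:

# Show only the throughput lines from the worker log.
kubectl logs ${CNN_JOB_NAME}-worker-0 | grep "images/sec"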

Using Multiple GPUs for Training

To demonstrate the advantages of distributed TensorFlow jobs, the same training job is run with two GPUs by changing the number of workers to 2. Perform the following procedure to change the number of workers:

vi ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet

In the Worker section of tfReplicaSpecs, change replicas from 1 to 2.

Save the modification and restart the TFJob.

ks delete ${KF_ENV} -c ${CNN_JOB_NAME} 
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}

Wait for about 5 minutes and query the training results of the two workers.

kubectl get po 
kubectl logs ${CNN_JOB_NAME}-worker-0
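
Because there are now two workers, the log of the second worker can be viewed in the same way (the pod name assumes the same -worker-N naming convention as above):

kubectl logs ${CNN_JOB_NAME}-worker-1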

The training results show that the training performance of two workers is almost twice that of a single worker.