Cloud Container Engine (CCE)

Updated on: 2021/03/18 GMT+08:00

TensorFlow Training

Once Kubeflow has been deployed, running TensorFlow training in the parameter server (PS)/worker mode becomes straightforward. This section walks through an official Kubeflow TensorFlow training example. See https://www.kubeflow.org/docs/guides/components/tftraining/ for more details.

Building the TfCnn Training Job

Run the following commands:

CNN_JOB_NAME=mycnnjob
VERSION=v0.4.0

ks init ${CNN_JOB_NAME}
cd ${CNN_JOB_NAME}
ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow-git/examples

ks generate tf-job-simple-v1beta1 ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}

Here ${KF_ENV} is the ksonnet environment, which you can list with the ks env list command; default is used in this example. After the commands complete, check the result with kubectl get po.
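For reference, a minimal verification sketch is shown below. It assumes the default ksonnet environment created by ks init and that Kubeflow has installed the TFJob CRD (so kubectl get tfjob works):

# List the ksonnet environments; "default" is used for ${KF_ENV} in this example.
ks env list
KF_ENV=default

# After "ks apply", check the TFJob object and its pods.
kubectl get tfjob ${CNN_JOB_NAME}
kubectl get po | grep ${CNN_JOB_NAME}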

Training with a GPU

The training above can also run on GPUs. To do so, modify the configuration file of the TFJob created earlier as follows:

vi ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet

Replace its contents with the following mycnnjob.jsonnet file:

local env = std.extVar("__ksonnet/environments");
local params = std.extVar("__ksonnet/params").components.mycnnjob;

local k = import "k.libsonnet";

local name = params.name;
local namespace = env.namespace;
local image = "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3";

local tfjob = {
  apiVersion: "kubeflow.org/v1beta1",
  kind: "TFJob",
  metadata: {
    name: name,
    namespace: namespace,
  },
  spec: {
    tfReplicaSpecs: {
      Worker: {
        replicas: 1,
        template: {
          spec: {
            containers: [
              {
                args: [
                  "python",
                  "tf_cnn_benchmarks.py",
                  "--batch_size=64",
                  "--num_batches=100",
                  "--model=resnet50",
                  "--variable_update=parameter_server",
                  "--flush_stdout=true",
                  "--num_gpus=1",
                  "--local_parameter_device=cpu",
                  "--device=gpu",
                  "--data_format=NHWC",
                ],
                image: "swr.cn-east-2.myhuaweicloud.com/wubowen585/tf-benchmarks-gpu:v0",
                name: "tensorflow",
                ports: [
                  {
                    containerPort: 2222,
                    name: "tfjob-port",
                  },
                ],
                resources: {
                  limits: {
                    "nvidia.com/gpu": 1,
                  },   
                },
                workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
              },
            ],
            restartPolicy: "OnFailure",
          },
        },
      },
      Ps: {
        replicas: 1,
        template: {
          spec: {
            containers: [
              {
                args: [
                  "python",
                  "tf_cnn_benchmarks.py",
                  "--batch_size=64",
                  "--num_batches=100",
                  "--model=resnet50",
                  "--variable_update=parameter_server",
                  "--flush_stdout=true",
                  "--num_gpus=1",
                  "--local_parameter_device=cpu",
                  "--device=cpu",
                  "--data_format=NHWC",
                ],
                image: "swr.cn-east-2.myhuaweicloud.com/wubowen585/tf-benchmarks-cpu:v0",
                name: "tensorflow",
                ports: [
                  {
                    containerPort: 2222,
                    name: "tfjob-port",
                  },
                ],
                resources: {
                  limits: {
                    cpu: 4,
                  },   
                },
                workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
              },
            ],
            restartPolicy: "OnFailure",
          },
        },
        tfReplicaType: "PS",
      },
    },
  },
};

k.core.v1.list.new([
  tfjob,
])
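Before applying the GPU variant, it may be worth checking that the cluster actually exposes the nvidia.com/gpu resource requested in the Worker limits above, for example (a sketch):

# Optional pre-check: confirm at least one node advertises nvidia.com/gpu
# in its capacity/allocatable resources.
kubectl describe nodes | grep -i "nvidia.com/gpu"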

After replacing the content, restart the TFJob. After running the ks delete command, wait about 30 seconds to make sure the old TFJob has been fully deleted before re-applying.

ks delete ${KF_ENV} -c ${CNN_JOB_NAME}
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
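If you prefer to confirm the deletion rather than wait a fixed 30 seconds, a check like the following can be run between the ks delete and ks apply commands above (a sketch; the pod names follow the ${CNN_JOB_NAME}-worker-N pattern used below):

# Repeat until both commands return nothing / NotFound, then run "ks apply".
kubectl get tfjob ${CNN_JOB_NAME}
kubectl get po | grep ${CNN_JOB_NAME}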

After the worker finishes (GPU training typically takes about 5 minutes), run the following commands to view the result:

kubectl get po
kubectl logs ${CNN_JOB_NAME}-worker-0

This example uses TensorFlow's distributed architecture to train the ResNet50 convolutional neural network (CNN) model on randomly generated images: 64 images per step (batch_size) for 100 steps (num_batches), recording the training throughput (images/sec) for each step. In this run, a single P100 GPU delivers a training throughput of 158.62 images/sec.

Training with Multiple GPUs

To demonstrate the advantage of distributed TensorFlow, train the same scenario on two GPU cards by increasing the number of workers to 2. With a TFJob, changing the number of workers is straightforward:

vi ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet

Set the Worker replicas value in the file to 2 (a command-line sketch of this edit follows).
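If you would rather make the edit from the command line, the following GNU sed sketch changes only the first "replicas: 1" occurrence, which is the Worker block in the mycnnjob.jsonnet shown above; verify the result before re-applying:

# Sketch: bump the Worker replica count from 1 to 2 (assumes GNU sed and that
# the Worker block precedes the Ps block, as in the file above).
sed -i '0,/replicas: 1,/s//replicas: 2,/' ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet

# Confirm that only the Worker replicas value changed.
grep -n "replicas" ${KS_APP}/components/${CNN_JOB_NAME}.jsonnet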

Save the file and restart the TFJob:

ks delete ${KF_ENV} -c ${CNN_JOB_NAME} 
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}

After about 5 minutes, you can view the training results of both workers:

kubectl get po 
kubectl logs ${CNN_JOB_NAME}-worker-0

The output shows that the aggregate throughput of the two workers is nearly twice that of a single worker.
