Last updated: 2024-05-10 GMT+08:00
TensorFlow Training
Once Kubeflow is deployed, running TensorFlow training in PS-worker mode is straightforward. This section walks through an official Kubeflow TensorFlow training example. For details, see TensorFlow Training (TFJob).
Creating the MNIST Example
- Deploy a TFJob resource to start training.
Create a file named tf-mnist.yaml with the following content:
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
- Create the TFJob.
kubectl apply -f tf-mnist.yaml
- After the workers finish running, view the logs.
kubectl -n kubeflow logs tfjob-simple-worker-0
The output is similar to the following:
...
Accuracy at step 900: 0.964
Accuracy at step 910: 0.9653
Accuracy at step 920: 0.9665
Accuracy at step 930: 0.9681
Accuracy at step 940: 0.9664
Accuracy at step 950: 0.9667
Accuracy at step 960: 0.9694
Accuracy at step 970: 0.9683
Accuracy at step 980: 0.9687
Accuracy at step 990: 0.966
Adding run metadata for 999
- Delete the TFJob.
kubectl delete -f tf-mnist.yaml
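The "wait, then read the logs" step above can be scripted rather than polled by hand. The following is a minimal sketch, not part of the official example: the function names wait_for_tfjob and last_accuracy are hypothetical, the 10-minute default timeout is arbitrary, and it assumes the installed training operator reports a Succeeded condition on the TFJob. Run it before deleting the TFJob.

```shell
# Hypothetical helper: block until the TFJob reports the Succeeded condition.
# $1 = TFJob name, $2 = namespace, $3 = timeout (optional, defaults to 10m).
# Assumption: the training operator sets a "Succeeded" condition in .status.conditions.
wait_for_tfjob() {
  kubectl -n "$2" wait --for=condition=Succeeded "tfjob/$1" --timeout="${3:-10m}"
}

# Hypothetical helper: read worker logs on stdin and print the accuracy value
# from the last "Accuracy at step N: X" line. Prints nothing if no line matches.
last_accuracy() {
  awk '/Accuracy at step [0-9]+:/ { acc = $NF } END { if (acc != "") print acc }'
}

# Example usage (requires a running cluster):
#   wait_for_tfjob tfjob-simple kubeflow
#   kubectl -n kubeflow logs tfjob-simple-worker-0 | last_accuracy
```

Both helpers only wrap commands already used in this section, so they can be dropped into a CI script without changing the TFJob definition.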
Training with GPUs
A TFJob can also run on GPUs. This requires the cluster to contain GPU nodes with an appropriate driver installed.
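Before submitting a GPU job, it can help to confirm that at least one node actually advertises the nvidia.com/gpu resource. The sketch below is illustrative only: the function names list_gpu_capacity and gpu_nodes are hypothetical, and it assumes the GPU device plugin exposes GPUs under the nvidia.com/gpu resource name, which is the name requested in the job definition in this section.

```shell
# Hypothetical helper: print a two-column table of node name and allocatable GPUs.
# Nodes without the resource show "<none>" in the GPU column.
list_gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Hypothetical helper: read that table on stdin and print only the names of
# nodes whose GPU column is a positive integer. The header row is skipped.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Example usage (requires a running cluster):
#   list_gpu_capacity | gpu_nodes
```

If the pipeline prints nothing, the worker pod requesting nvidia.com/gpu would likely stay in the Pending state until a GPU node with a working driver and device plugin is available.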
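- Specify GPU resources in the TFJob.
This example uses TensorFlow's distributed architecture to train the ResNet50 convolutional neural network (CNN) model on randomly generated images. Each step trains on 32 images (batch_size), for 100 steps in total, and the throughput (images/sec) is recorded during training.
Create a file named tf-gpu.yaml with the following content: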
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
              image: docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  cpu: "1"
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
              image: docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  nvidia.com/gpu: 1 # Number of GPUs
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
- Create the TFJob.
kubectl apply -f tf-gpu.yaml
- After the worker finishes running (GPU training typically takes about 5 minutes), run the following command to view the result:
kubectl logs tf-smoke-gpu-worker-0
The output is similar to the following:
...
INFO|2023-09-02T12:04:25|/opt/launcher.py|27| Running warm up
INFO|2023-09-02T12:08:55|/opt/launcher.py|27| Done warm up
INFO|2023-09-02T12:08:55|/opt/launcher.py|27| Step Img/sec loss
INFO|2023-09-02T12:08:56|/opt/launcher.py|27| 1 images/sec: 68.8 +/- 0.0 (jitter = 0.0) 8.777
INFO|2023-09-02T12:09:00|/opt/launcher.py|27| 10 images/sec: 70.4 +/- 0.4 (jitter = 1.8) 8.557
INFO|2023-09-02T12:09:04|/opt/launcher.py|27| 20 images/sec: 70.5 +/- 0.3 (jitter = 1.5) 8.090
INFO|2023-09-02T12:09:09|/opt/launcher.py|27| 30 images/sec: 70.3 +/- 0.3 (jitter = 1.6) 8.041
INFO|2023-09-02T12:09:13|/opt/launcher.py|27| 40 images/sec: 70.1 +/- 0.2 (jitter = 1.7) 9.464
INFO|2023-09-02T12:09:18|/opt/launcher.py|27| 50 images/sec: 70.1 +/- 0.2 (jitter = 1.6) 7.797
INFO|2023-09-02T12:09:23|/opt/launcher.py|27| 60 images/sec: 70.1 +/- 0.2 (jitter = 1.6) 8.595
INFO|2023-09-02T12:09:27|/opt/launcher.py|27| 70 images/sec: 70.0 +/- 0.2 (jitter = 1.7) 7.853
INFO|2023-09-02T12:09:32|/opt/launcher.py|27| 80 images/sec: 69.9 +/- 0.2 (jitter = 1.7) 7.849
INFO|2023-09-02T12:09:36|/opt/launcher.py|27| 90 images/sec: 69.8 +/- 0.2 (jitter = 1.7) 7.911
INFO|2023-09-02T12:09:41|/opt/launcher.py|27| 100 images/sec: 69.7 +/- 0.1 (jitter = 1.7) 7.853
INFO|2023-09-02T12:09:41|/opt/launcher.py|27| ----------------------------------------------------------------
INFO|2023-09-02T12:09:41|/opt/launcher.py|27| total images/sec: 69.68
INFO|2023-09-02T12:09:41|/opt/launcher.py|27| ----------------------------------------------------------------
INFO|2023-09-02T12:09:42|/opt/launcher.py|80| Finished: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-smoke-gpu-ps-0.default.svc:2222 --worker_hosts=tf-smoke-gpu-worker-0.default.svc:2222 --task_index=0
INFO|2023-09-02T12:09:42|/opt/launcher.py|84| Command ran successfully sleep for ever.
The output shows that the training throughput on a single GPU is 69.68 images/sec.
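The throughput figure can be pulled out of the log and cross-checked against the job parameters. The sketch below is illustrative only: the function names total_images_per_sec and estimated_seconds are hypothetical, and the parsing assumes the log keeps the "total images/sec:" line format shown above.

```shell
# Hypothetical helper: read benchmark logs on stdin and print the value from
# the "total images/sec:" summary line. Prints nothing if the line is absent.
total_images_per_sec() {
  awk '/total images\/sec:/ { rate = $NF } END { if (rate != "") print rate }'
}

# Hypothetical helper: estimate the timed training duration in seconds.
# $1 = batch size, $2 = number of steps, $3 = throughput in images/sec.
estimated_seconds() {
  awk -v b="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.1f\n", b * s / r }'
}

# Example usage (requires a running cluster):
#   rate=$(kubectl logs tf-smoke-gpu-worker-0 | total_images_per_sec)
#   estimated_seconds 32 100 "$rate"
```

With the values from this run, 32 images per step over 100 steps is 3200 images, and 3200 / 69.68 is roughly 45.9 seconds. That is close to the interval between the "Done warm up" timestamp (12:08:55) and the step 100 line (12:09:41) in the log, so most of the roughly 5-minute wall-clock time is spent in warm-up rather than in the timed steps.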