Kubeflow Add-on
As the volume of models and data grows, efficiently building, deploying, and managing complex machine learning workflows becomes increasingly challenging.
- There is no unified platform that can effectively manage complex machine learning tasks.
- It is difficult to track and manage dependencies and data flows between various components.
- Repeated execution of the same tasks wastes computing resources.
- Workflows cannot be easily migrated between different computing environments, which limits flexibility.
To address these challenges, CCE standard and Turbo clusters offer the Kubeflow add-on. Kubeflow is an open-source machine learning platform designed for Kubernetes that simplifies the development, deployment, and management of machine learning workflows. It offers developers and data scientists a consistent, scalable framework for running machine learning tasks and applications on Kubernetes. Built on Kubeflow, this add-on offers the following features:
- Simplified machine learning workflow development and management: You can quickly define and manage complex machine learning workflows through an intuitive UI and Python SDKs.
- Improved resource utilization: Redundant computing is reduced through task parallelization and caching to enhance resource efficiency.
- Enhanced workflow portability: You can easily migrate workflows between different environments using platform-neutral IR YAML definitions.
- Enhanced management and visualization capabilities: This add-on provides robust tools for managing and visualizing workflows to help you track and optimize them more effectively.
Basic Concepts
| Concept | Description | Function |
|---|---|---|
| Pipeline | A sequence of machine learning steps, typically including data preprocessing, training, evaluation, and inference. The Kubeflow add-on uses pipelines to automate and manage the full machine learning lifecycle. | Workflows can be defined via GUI or YAML files, automating step execution and enabling efficient scheduling and execution of complex tasks on Kubernetes. |
| Run | An instance of pipeline execution, indicating the running of a specific workflow. Each run is an execution of the pipeline, recording its input parameters, execution status, and output results. | Runs execute machine learning tasks with varying inputs or datasets. They allow you to track execution progress and results, manage versions, and compare experiments. |
| Experiment | A logical group of multiple runs, usually with varied settings or hyperparameters for the same problem. You can use experiments to organize runs and compare results. | Experiments allow you to compare and track the execution results of different machine learning tasks. By organizing multiple runs, you can analyze the impact of various parameters and settings on model performance, enabling precise model and performance optimization. |
| Component | The smallest unit in a pipeline, representing a specific machine learning task or step, like data preprocessing, model training, or evaluation. Each component is a container holding the code and dependencies needed to execute the task. | Components enable independent definition and management of each pipeline step, which can be reused across different pipelines. This modular design simplifies pipeline building, debugging, and extension. |
| DAG | A directed acyclic graph that represents task dependencies in Kubeflow pipelines. It defines the execution sequence and component dependencies. | A DAG ensures tasks run in the correct order without cyclic dependencies. It enables Kubeflow pipelines to manage dependencies efficiently for optimal workflow execution. |
For more details, see Kubeflow Pipeline Concepts.
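These concepts map directly onto the KFP Python SDK. The following is a minimal sketch (KFP SDK v2 syntax is assumed, and the component and pipeline names are illustrative), showing how components compose into a pipeline whose DAG is inferred from data dependencies:

```python
from kfp import dsl

# A component is the smallest unit of a pipeline: a containerized step.
@dsl.component(base_image="python:3.9")
def preprocess(raw: str) -> str:
    return raw.strip().lower()

@dsl.component(base_image="python:3.9")
def train(data: str) -> str:
    return f"model trained on: {data}"

# A pipeline wires components into a DAG. Passing one task's output to
# another creates the dependency edge, so train runs after preprocess.
@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(raw: str) -> str:
    pre = preprocess(raw=raw)
    model = train(data=pre.output)
    return model.output
```

Each execution of this pipeline is a run; grouping several runs with different inputs under one experiment lets you compare their results.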
Prerequisites
- A CCE standard or Turbo cluster of v1.27 or later is available. For details, see Buying a CCE Standard/Turbo Cluster.
- Before installing the add-on, ensure that the cluster's pods can access the Internet to download the required images. You can configure an SNAT rule for the cluster to enable this access. The SNAT rule is billed. For details, see NAT Gateway Price Calculator.
- In the use case described in this section, a NodePort Service is required to access the Kubeflow web UI. Ensure that an EIP is bound to a node in the cluster. The EIP is billed. For details, see Elastic IP Price Calculator.
Notes and Constraints
- The Kubeflow add-on provides only pipeline capabilities.
- This add-on is being rolled out by region. To view the regions where this add-on is available, see the console.
- This add-on is in the OBT phase, so you can try out its latest features. However, the stability of this version has not been fully verified, and the CCE SLA does not apply to it.
Installing the Add-on
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Add-ons. On the displayed page, locate Kubeflow and click Install.
- On the Install Add-on page, configure the add-on.
Table 2 Add-on settings

| Parameter | Description |
|---|---|
| Version | Select a version as needed. |
| Add-on Specifications | Only the default specifications are supported. |
- Click Install in the lower right corner. If the status is Running, the add-on has been installed.
Components
| Component | Description | Resource Type |
|---|---|---|
| cache-deployer | Sets up cache-related configurations, including PVCs and ConfigMaps, and injects them into the cluster to enable caching for pipeline runs. | Deployment |
| cache-server | Provides caching to prevent repeated execution of completed pipeline steps, improving running efficiency. | Deployment |
| metadata-envoy | Collects, processes, and forwards metadata to the specified metadata store during pipeline execution. | Deployment |
| ml_metadata_store_server | A metadata storage server that stores information related to pipeline execution, including inputs, outputs, and models. | Deployment |
| metadata-writer | Records pipeline execution metadata to the metadata store. | Deployment |
| minio | An object storage service used to store artifacts generated during pipeline execution, such as models and data. | Deployment |
| api-server | Provides REST and gRPC APIs to enable communication between the SDK, UI, and backend. | Deployment |
| persistenceagent | Stores pipeline execution records in the database and restores running statuses. | Deployment |
| scheduledworkflow | Periodically triggers pipelines to automate workflow execution. | Deployment |
| ml-pipeline-ui | Provides a frontend web interface for viewing and managing pipelines and their execution records. | Deployment |
| viewer-crd-controller | Manages viewer Custom Resource Definitions (CRDs) and visualizes artifacts such as images and tables on the UI. | Deployment |
| visualization-server | Used with the viewer to visualize data, including TensorBoard and ROC curves. | Deployment |
| mysql | The relational database used by Kubeflow Pipelines to store running records and user information. | Deployment |
| workflow-controller | The core component of Argo, which schedules and executes each task (step) in a pipeline. | Deployment |
- All preceding components are deployed in the kubeflow namespace.
- Components like MinIO and MySQL provide dependencies for Kubeflow Pipelines and can be replaced as needed.

The MySQL and MinIO backend PVs are bound to emptyDir volumes, so data may be lost if the pods are migrated or restarted.
Use Case
This use case describes how to use the Kubeflow add-on to run a simple pipeline, as shown in Figure 1. The pipeline includes a comp task that receives a string message and outputs it unchanged. The task is implemented by a Python 3.9 pod, which prints the message and writes it to a specified output file.
- Install kubectl on an existing ECS and access a cluster using kubectl. For details, see Accessing a Cluster Using kubectl.
- After the Kubeflow add-on is installed, CCE automatically creates the ml-pipeline-ui component to provide the frontend web interface. This component is bound to a ClusterIP Service named ml-pipeline-ui. To access the UI, change the Service type of ml-pipeline-ui from ClusterIP to NodePort or LoadBalancer. The following uses a NodePort Service as an example.
Run the following command to change the ClusterIP Service to a NodePort Service:
kubectl edit svc ml-pipeline-ui -n kubeflow
Set type to NodePort in the file and specify a node port that does not conflict with other services in the cluster. For details about NodePort Services, see NodePort.
```yaml
...
spec:
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 3000
    nodePort: 30083          # Node port number
  selector:
    app: ml-pipeline-ui
    application-crd-id: kubeflow-pipelines
  clusterIP: 10.247.101.116
  clusterIPs:
  - 10.247.101.116
  type: NodePort             # Change ClusterIP to NodePort.
  sessionAffinity: None
...
```
- Enter http://<EIP of the target node>:<Node port>/ in the address bar of a browser to access the Kubeflow UI. On the UI, you can create and manage machine learning pipelines and define complete workflows, including data preprocessing, model training, evaluation, and deployment. You can also configure and launch runs, monitor task execution in real time, and view logs and visualizations. This enables the automatic orchestration and execution of machine learning workflows.
If the following page is displayed, the UI has been accessed.
Figure 2 Kubeflow add-on UI

- Create a pipeline.yaml file locally to configure a pipeline. This section provides a basic pipeline example that extracts input parameters at runtime, prints them, and writes them to a specified output file. For more information about pipelines, see Compile a Pipeline.
pipeline.yaml file content:
```yaml
# PIPELINE DEFINITION
# Name: my-pipeline
# Description: My ML pipeline running on HuaweiCloud CCE.
# Inputs:
#   message: str
# Outputs:
#   Output: str
components:
  comp-comp:                   # Define a component named comp-comp and specify its inputs (message) and outputs.
    executorLabel: exec-comp   # Executor label, which is used to match the executor
    inputDefinitions:          # Component input definitions
      parameters:
        message:
          parameterType: STRING
    outputDefinitions:         # Component output definitions
      parameters:
        Output:
          parameterType: STRING
deploymentSpec:
  executors:                   # Configure executors.
    exec-comp:
      container:               # Container configuration, which specifies the task executed by pods in the CCE cluster
        image: python:3.9      # Use a Python 3.9 image.
        command:               # Container startup command
        - python3
        - -c
        - |
          import sys, json
          # Python script: extract the message from the input parameters, print it,
          # and write it to the output file (the path is provided by Kubeflow).
          executor_input = json.loads(sys.argv[1])
          inputs = executor_input.get("inputs", {})
          parameters = inputs.get("parameterValues", {})
          message = parameters.get("message", "Success Message From pipeline")
          print("Received message:", message)
          outputs = executor_input.get("outputs", {})
          output_file = outputs.get("parameters", {}).get("Output", {}).get("outputFile")
          if output_file:
              with open(output_file, "w") as f:
                  f.write(message)
        - '{{$}}'
pipelineInfo:
  description: My ML pipeline running on HuaweiCloud CCE.
  name: my-pipeline
root:
  dag:                         # Define a DAG to determine the execution sequence and dependencies of the components, whether parallel or sequential.
    outputs:
      parameters:
        Output:
          valueFromParameter:
            outputParameterKey: Output   # The output of the comp task serves as the final output of the entire DAG.
            producerSubtask: comp
    tasks:                     # Define a task named comp that calls the comp-comp component, passes the pipeline's message input to it, and uses its output as the final output of the pipeline.
      comp:
        cachingOptions:
          enableCache: true
        componentRef:
          name: comp-comp
        inputs:
          parameters:
            message:
              componentInputParameter: message
        taskInfo:
          name: comp
  inputDefinitions:            # Declare the pipeline's input (message) and output (Output) parameters, both strings.
    parameters:
      message:
        parameterType: STRING
  outputDefinitions:
    parameters:
      Output:
        parameterType: STRING
schemaVersion: 2.1.0
sdkVersion: kfp-2.12.1
```
The overall execution logic of the preceding YAML file is as follows.
Table 3 Execution logic

| Phase | Step | Description | Associated YAML Field |
|---|---|---|---|
| Input | Submit parameters. | Message parameters are input through the UI. | inputDefinitions.parameters.message |
| | The pipeline receives parameters. | The pipeline centrally validates and standardizes the input parameters required for execution, and stores valid parameters persistently. | root.inputDefinitions |
| | The DAG distributes parameters. | The DAG distributes parameters to the specified task. | tasks.comp.inputs.parameters.message |
| Task execution | The task calls the component. | The task comp calls component comp-comp. | tasks.comp.componentRef |
| | Associate the component with an executor. | The component finds containers by executorLabel and uses them for task execution. | components.comp-comp.executorLabel, deploymentSpec.exec-comp |
| | Run containers. | The API server of the CCE cluster receives the execution request and starts the pods to run the Python script task. | deploymentSpec.exec-comp.container.command |
| Output | Define component outputs. | The component declares output parameter Output. | components.comp-comp.outputDefinitions |
| | The DAG collects results. | The DAG collects the output from the comp task and passes it as the final output of the pipeline. | root.dag.outputs.parameters.Output |
| | Provide the final output. | You can view the global output result. | outputDefinitions.parameters.Output |
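The IR YAML above does not have to be written by hand: the sdkVersion field shows it came from the KFP SDK (kfp 2.12.1). As a minimal sketch assuming KFP SDK v2, an equivalent pipeline can be authored in Python and compiled; the generated executor script will differ in detail from the hand-written one above:

```python
from kfp import compiler, dsl

@dsl.component(base_image="python:3.9")
def comp(message: str) -> str:
    # Print the received message and return it unchanged, mirroring the comp task.
    print("Received message:", message)
    return message

@dsl.pipeline(name="my-pipeline",
              description="My ML pipeline running on HuaweiCloud CCE.")
def my_pipeline(message: str) -> str:
    task = comp(message=message)
    return task.output   # Becomes the pipeline-level Output parameter.

# Compile to a platform-neutral IR YAML file that can be uploaded on the UI.
compiler.Compiler().compile(my_pipeline, package_path="pipeline.yaml")
```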
- Return to the UI and click Upload pipeline in the upper right corner to upload the pipeline.yaml file. After the configuration is complete, click Create in the lower part to create a pipeline.
Figure 3 Uploading a pipeline file
- After the pipeline is created, the comp task is automatically displayed on the UI. Click Create run in the upper right corner to create a run. After the configuration is complete, click Start in the lower part to run the pipeline.
Figure 4 Creating a run
- Run parameters: automatically generated based on the pipeline.yaml file. Enter the message to be output.
After you click Start, the API server of the CCE cluster receives the task request from the executor, launches the pods to execute it, and updates the execution status and logs on the UI in real time. You can monitor the progress and view the final result there. Once the pods have run to completion and stopped, the task execution is complete. In the navigation pane, choose Runs. If a success icon is displayed before the target run instance, the pipeline has been executed.
Figure 5 Starting pods for task execution

- Click the comp task and view its inputs and outputs. As shown in the following figure, the input message is "My ML pipeline running on HuaweiCloud CCE." and the output is the same string, which matches the pipeline definition.
Figure 6 Input and output
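Besides the UI, runs can also be created programmatically. The following is a minimal sketch using the KFP SDK client; the host URL, node port, and experiment name are placeholders for your environment:

```python
import kfp

# Point the client at the ml-pipeline-ui NodePort Service exposed earlier.
client = kfp.Client(host="http://<EIP of the target node>:30083")

# Upload the compiled package and start a run inside an experiment.
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"message": "My ML pipeline running on HuaweiCloud CCE."},
    experiment_name="demo-experiment",
)
print("Run ID:", run.run_id)
```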
Change History
| Add-on Version | Supported Cluster Version | Updated Feature |
|---|---|---|
| 1.1.4 | v1.27 or later | CCE standard and Turbo clusters support the Kubeflow add-on. |