Updated on 2025-11-04 GMT+08:00

Operator Package Development Specifications

Python Operator Package Directory Specifications

Assume that the operator package name is video_clip.tar. The directory structure after the operator package is decompressed is as follows:

+--- video_clip # The directory name must be the same as the tar package name.
| +--- program_package # Python operator directory
| | +--- install.sh # (Optional) Installation script
| | +--- process.py # (Mandatory) Operator code
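
The layout above can be produced with Python's standard tarfile module. The following is a minimal sketch, not an official packaging tool: the function name build_operator_package and the local directory names are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def build_operator_package(package_name: str, source_dir: Path, out_dir: Path) -> Path:
    """Pack source_dir as <package_name>/program_package/ inside <package_name>.tar."""
    # process.py is mandatory; install.sh is optional and is packed if present.
    if not (source_dir / "process.py").is_file():
        raise FileNotFoundError("process.py is mandatory in the operator package")
    out_dir.mkdir(parents=True, exist_ok=True)
    tar_path = out_dir / f"{package_name}.tar"
    with tarfile.open(tar_path, "w") as tar:
        # The top-level directory inside the archive must match the tar package name.
        tar.add(source_dir, arcname=f"{package_name}/program_package")
    return tar_path
```

Calling build_operator_package("video_clip", Path("src"), Path("dist")) would write dist/video_clip.tar, provided that src contains process.py. Keep out_dir outside source_dir so that the archive does not pack itself.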

process.py File Development Specifications

The operator package must contain a script named process.py. How the script is developed depends on the value of auto-data-loading in the operator configuration file.

Mode 1 (auto-data-loading: true)

Applicable scenarios: This mode is recommended for all cases except the following three:

  1. Multimodal dataset: For example, datasets consisting of images and text, or videos and text.
  2. The output dataset modality is not one of text, image, video, or audio.
  3. The entire dataset needs to be used as the input, for example, deduplication operators.
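
The three exceptions above can be written down as a small decision helper. This is only a sketch of the rule as stated in this section: the function name and its parameters are hypothetical and are not part of the operator framework.

```python
# Modalities that Mode 1 can output, as listed in this section.
SINGLE_MODALITIES = {"text", "image", "video", "audio"}


def recommend_auto_data_loading(input_modalities, output_modality, needs_whole_dataset):
    """Return True if Mode 1 (auto-data-loading: true) fits, else False (Mode 2)."""
    if len(set(input_modalities)) > 1:
        return False  # exception 1: multimodal dataset
    if output_modality not in SINGLE_MODALITIES:
        return False  # exception 2: output modality outside the supported range
    if needs_whole_dataset:
        return False  # exception 3: operator needs the entire dataset at once
    return True
```

For example, an image-and-text dataset or a deduplication operator would lead to Mode 2, while a per-file video operator would stay in Mode 1.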

process.py is developed as follows. The file can contain up to three classes, of which only Process is mandatory.

1. PreProcess: (optional) operator preprocessing logic.

This class holds the computations that run before model inference. Separating the preprocessing logic from the GPU/NPU inference computations improves GPU/NPU utilization.

2. Process: (mandatory) operator inference logic.

Keep only model loading and inference in this class, and write the preprocessing and postprocessing in PreProcess and PostProcess.

3. PostProcess: (optional) operator postprocessing logic.

If the operator has heavy postprocessing logic that still runs on the CPU after inference, split that logic out and write it in PostProcess.

The operator framework calls the classes in the following sequence: PreProcess -> Process -> PostProcess.

import pandas as pd
import ma_utils as utils

logger = utils.FileLogger.get_logger()


class PreProcess():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file:
                The framework extracts the sample content of JSONL/CSV files and transfers the content to the operator. The operator is called once for each JSONL/CSV file. The field names in the DataFrame are the same as those defined in the JSONL/CSV files.
            Text-other file:
                The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).

            Image/Video/Audio-original file:
                The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).

            Image/Video/Audio-Parquet file:
                The framework extracts the sample content of Parquet files and transfers the content to the operator. The operator is called once for each Parquet file. The field names in the DataFrame are the same as those defined in the Parquet files.
                There are two system predefined fields: file_path (full path of the input data on the operator (the operator framework downloads the dataset to the local directory of the operator)) and file_name (file name of the data file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is processed by the framework.
        """
        pass


class Process():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file:
                The framework extracts the sample content of JSONL/CSV files and transfers the content to the operator. The operator is called once for each JSONL/CSV file. The field names in the DataFrame are the same as those defined in the JSONL/CSV files.
            Text-other file:
                The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).

            Image/Video/Audio-original file:
                The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).

            Image/Video/Audio-Parquet file:
                The framework extracts the sample content of Parquet files and transfers the content to the operator. The operator is called once for each Parquet file. The field names in the DataFrame are the same as those defined in the Parquet files.
                There are two system predefined fields: file_path (full path of the input data on the operator (the operator framework downloads the dataset to the local directory of the operator)) and file_name (file name of the data file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is processed by the framework.
        """
        pass


class PostProcess():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file:
                The framework extracts the sample content of JSONL/CSV files and transfers the content to the operator. The operator is called once for each JSONL/CSV file. The field names in the DataFrame are the same as those defined in the JSONL/CSV files.
            Text-other file:
                The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).

            Image/Video/Audio-original file:
                The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).

            Image/Video/Audio-Parquet file:
                The framework extracts the sample content of Parquet files and transfers the content to the operator. The operator is called once for each Parquet file. The field names in the DataFrame are the same as those defined in the Parquet files.
                There are two system predefined fields: file_path (full path of the input data on the operator (the operator framework downloads the dataset to the local directory of the operator)) and file_name (file name of the data file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is processed by the framework.
        """
        pass
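
As an illustration of Mode 1 for a text JSONL dataset, the sketch below fills in two of the classes and runs them locally. Several things here are assumptions rather than documented behavior: args is treated as dict-like, the samples are assumed to carry a "text" field, min_length is a made-up service parameter, and run_locally is a hypothetical test harness, not the operator framework. A plain length filter stands in for model inference.

```python
import pandas as pd


class PreProcess:
    """Optional CPU-side cleanup that runs before the main step."""

    def __init__(self, args):
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        out = input.copy()
        # ASSUMPTION: the JSONL samples carry a "text" field.
        out["text"] = out["text"].astype(str).str.strip()
        return out


class Process:
    """Mandatory step; a length filter stands in for model inference."""

    def __init__(self, args):
        # ASSUMPTION: args is dict-like; "min_length" is a made-up service parameter.
        self.min_length = int(args.get("min_length", 1))

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        keep = input["text"].str.len() >= self.min_length
        return input[keep].reset_index(drop=True)


def run_locally(args, batch: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical local harness (not the framework): PreProcess, then Process."""
    for stage_cls in (PreProcess, Process):
        batch = stage_cls(args)(batch)
    return batch


# One DataFrame stands in for the content of one JSONL file.
sample = pd.DataFrame({"text": ["  hello world ", "   ", "ok"]})
result = run_locally({"min_length": 3}, sample)
```

Each stage receives the DataFrame returned by the previous one, so the field names a stage adds or drops are what the next stage sees.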

Mode 2 (auto-data-loading: false)

This mode applies to the following scenarios:

  1. Multimodal dataset: For example, datasets consisting of images and text, or videos and text.
  2. The output dataset modality is not one of text, image, video, or audio.
  3. The entire dataset needs to be used as the input, for example, deduplication operators.

process.py is developed as follows. The file can contain up to three classes, of which only Process is mandatory.

1. PreProcess: (optional) operator preprocessing logic.

This class holds the computations that run before model inference. Separating the preprocessing logic from the GPU/NPU inference computations improves GPU/NPU utilization.

2. Process: (mandatory) operator inference logic.

Keep only model loading and inference in this class, and write the preprocessing and postprocessing in PreProcess and PostProcess.

3. PostProcess: (optional) operator postprocessing logic.

If the operator has heavy postprocessing logic that still runs on the CPU after inference, split that logic out and write it in PostProcess.

The operator framework calls the classes in the following sequence: PreProcess -> Process -> PostProcess.

import pandas as pd


class Process():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame):
        """
        :param input: input parameter. In mode 2, input is an empty DataFrame.
        :return: No return value.
        """
        pass
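
Because input is empty in Mode 2, the operator has to read and write the dataset itself. The sketch below shows the shape of a deduplication operator under strong assumptions: args is treated as dict-like, obs_input_path and obs_output_path are treated as local directories purely for illustration (real OBS access is not shown), the samples are assumed to be JSONL with a "text" field, and the helper load_jsonl_dir and the output file name are hypothetical.

```python
import json
from pathlib import Path

import pandas as pd


def load_jsonl_dir(directory: Path) -> pd.DataFrame:
    """Hypothetical helper: read every *.jsonl file in a directory into one DataFrame."""
    rows = []
    for path in sorted(directory.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    return pd.DataFrame(rows)


class Process:
    def __init__(self, args):
        # ASSUMPTION: args is dict-like, and local directories stand in for OBS paths.
        self.input_dir = Path(args["obs_input_path"])
        self.output_dir = Path(args["obs_output_path"])

    def __call__(self, input: pd.DataFrame):
        # input is an empty DataFrame in Mode 2, so the whole dataset is loaded here.
        data = load_jsonl_dir(self.input_dir)
        if data.empty:
            return
        # ASSUMPTION: duplicates are identified by the "text" field; the first copy is kept.
        deduped = data.drop_duplicates(subset="text").reset_index(drop=True)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        deduped.to_json(
            self.output_dir / "deduplicated.jsonl",
            orient="records",
            lines=True,
            force_ascii=False,
        )
```

The method returns nothing, matching the Mode 2 contract above; the result exists only as the file the operator writes.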

Preconfigured Dependencies of the Operator Base Image

The following table lists the dependencies preconfigured in the base image of the operator.

Table 1 Preconfigured Dependencies of the Custom Base Image

Dependency Package Name | Version
absl-py | 2.1.0
accelerate | 1.0.1
aclruntime | 0.0.2
addict | 2.4.0
aiohappyeyeballs | 2.4.3
aiohttp | 3.11.1
aiosignal | 1.3.1
ais-bench | 0.0.2
akg | 2.2
albumentations | 1.3.1
antlr4-python3-runtime | 4.9.3
apex | 0.1.dev20241029+ascend
ascend-faultdiag | 6.0.0
ascend-training-accuracy-tools | 1.0
ascendebug | 0.1.0
astroid | 3.0.3
asttokens | 2.4.1
async-timeout | 5.0.1
attrs | 23.2.0
audioread | 3.0.1
auto-tune | 0.1.0
av | 12.0.0
blinker | 1.9.0
blobfile | 3.0.0
blosc2 | 2.5.1
boto3 | 1.23.10
botocore | 1.26.10
certifi | 2024.12.14
cffi | 1.16.0
charset-normalizer | 3.4.1
click | 8.1.7
click-aliases | 1.0.5
cloudpickle | 3.1.0
coloredlogs | 15.0.1
contourpy | 1.3.0
coverage | 7.3.0
crc32c | 2.7.1
cycler | 0.12.1
Cython | 3.0.2
dask | 2024.2.1
dataflow | 0.0.1
datasets | 3.0.1
debugpy | 1.8.8
decorator | 4.4.2
decord | 0.6.0
dill | 0.3.8
easydict | 1.12
einops | 0.8.0
entrypoints | 0.4
esdk-obs-python | 3.23.12
et-xmlfile | 2.0.0
exceptiongroup | 1.2.2
executing | 2.1.0
filelock | 3.16.1
flask | 2.3.3
flatbuffers | 24.12.23
fonttools | 4.55.0
frozenlist | 1.5.0
fsspec | 2024.6.1
fuzzywuzzy | 0.18.0
gnureadline | 8.2.10
gpytorch | 1.12
greenlet | 3.1.1
grpcio | 1.60.0
grpcio-tools | 1.60.0
gunicorn | 21.2.0
h5py | 3.9.0
hccl | 0.1.0
hccl-parser | 0.1
huaweicloud-sdk-python-modelarts-dataset | 0.1.5
huggingface-hub | 0.26.2
humanfriendly | 10.0
idna | 3.10
ijson | 3.3.0
imageio | 2.36.1
imageio-ffmpeg | 0.5.1
importlib-metadata | 8.5.0
importlib-resources | 6.4.5
iniconfig | 2.0.0
iopath | 0.1.10
ipykernel | 6.7.0
ipython | 8.18.1
isort | 5.13.2
itsdangerous | 2.2.0
jaxtyping | 0.2.19
jedi | 0.19.2
jieba | 0.42.1
jinja2 | 3.1.4
jmespath | 1.0.1
joblib | 1.4.2
jsonlines | 4.0.0
jupyter-client | 7.4.9
jupyter-core | 5.7.2
keras | 3.2.1
kiwisolver | 1.4.5
lazy-import | 0.2.2
lazy-loader | 0.4
levenshtein | 0.26.0
libclang | 18.1.1
libcst | 1.2.0
librosa | 0.10.2.post1
linear-operator | 0.5.3
llm-datadist | 0.0.1
llvmlite | 0.41.1
locket | 1.0.0
lxml | 5.1.0
Markdown | 3.7
markdown-it-py | 3.0.0
MarkupSafe | 3.0.2
matplotlib | 3.7.3
matplotlib-inline | 0.1.7
mccabe | 0.7.0
mdurl | 0.1.2
mindspore-lite | 2.4.0
mindstudio-probe | 1.0.2
mindx-elastic | 0.0.1
ml-dtypes | 0.5.0
mmcv | 2.0.1
mmengine | 0.10.5
modelarts-pytorch-model-server | 1.0.6
moviepy | 2.1.2
moxing-framework | 2.2.10
mpi4py | 3.1.6
mpmath | 1.3.0
msgpack | 1.1.0
msit | 7.0.0rc630
msit-benchmark | 7.0.0rc2
msit-compare | 7.0.0rc2
msit-llm | 7.0.0rc2
msit-surgeon | 7.0.0rc2
msprof-analyze | 1.2.0
multidict | 6.1.0
multiprocess | 0.70.16
mypy-extensions | 1.0.0
msobjdump | 0.1.0
namex | 0.0.8
ndindex | 1.9.2
nest-asyncio | 1.6.0
netifaces | 0.11.0.post20240306102544
networkx | 3.2.1
numba | 0.58.1
numexpr | 2.8.6
numpy | 1.26.4
omegaconf | 2.3.0
onnx | 1.17.0
onnxconverter-common | 1.14.0
onnxruntime | 1.18.0
op-compile-tool | 0.1.0
op-gen | 0.1
op-test-frame | 0.1
opc-tool | 0.1.0
opencv-python | 4.9.0.80
opencv-python-headless | 4.8.1.78
openpyxl | 3.1.5
optree | 0.13.1
orjson | 3.10.7
packaging | 24.2
pandas | 1.3.5
parso | 0.8.4
partd | 1.4.2
pathlib2 | 2.3.7.post1
peft | 0.7.1
pexpect | 4.9.0
pillow | 10.4.0
pip | 21.0.1
platformdirs | 4.3.6
pluggy | 1.5.0
ply | 3.11
pooch | 1.8.2
portalocker | 2.10.1
prettytable | 3.12.0
proglog | 0.1.10
prometheus-client | 0.14.1
prompt-toolkit | 3.0.48
propcache | 0.2.0
protobuf | 3.20.2
psutil | 6.0.0
ptyprocess | 0.7.0
pure-eval | 0.2.3
py-cpuinfo | 9.0.0
pyarrow | 18.0.0
pybind11 | 2.13.6
pyclipper | 1.3.0.post6
pycocotools | 2.0.7
pycparser | 2.22
pycryptodome | 3.21.0
pycryptodomex | 3.21.0
pygments | 2.18.0
pylint | 3.0.2
pynndescent | 0.5.13
pyparsing | 3.2.0
pypng | 0.20220715.0
pytest | 7.4.3
python-dateutil | 2.9.0.post0
python-dotenv | 1.0.1
pytz | 2024.2
PyWavelets | 1.4.1
PyYAML | 6.0.1
pyzmq | 26.2.0
pydub | 0.25.1
qudida | 0.0.4
rapidfuzz | 3.10.1
redis | 5.1.1
regex | 2024.9.11
requests | 2.32.2
rich | 13.9.4
rouge-metric | 1.0.1
s3transfer | 0.5.2
safetensors | 0.4.5
schedule-search | 0.0.1
scikit-image | 0.22.0
scikit-learn | 1.5.1
scipy | 1.10.1
sentencepiece | 0.2.0
setuptools | 75.8.0
shapely | 2.0.3
show-kernel-debug-data | 0.1.0
shyaml | 0.6.2
six | 1.16.0
skl2onnx | 1.17.0
soundfile | 0.12.1
soxr | 0.5.0.post1
SQLAlchemy | 2.0.36
stack-data | 0.6.3
sympy | 1.13.0
setuptools | 57.5.0
soundfile | 0.13.1
tables | 3.9.2
tabulate | 0.9.0
tailor | 0.3.4
te | 0.4.0
tensorboard | 2.18.0
tensorboard-data-server | 0.7.2
termcolor | 2.5.0
terminaltables | 3.1.10
tfrecord | 1.14.5
threadpoolctl | 3.5.0
tifffile | 2024.8.30
tiktoken | 0.7.0
timm | 1.0.9
tokenizers | 0.20.3
tomli | 2.1.0
tomlkit | 0.13.2
toolz | 1.0.0
torch | 2.1.0
torch-npu | 2.1.0
torchvision | 0.16.0
tornado | 6.4.1
tqdm | 4.66.3
traitlets | 5.14.3
transformers | 4.45.0
transformers-stream-generator | 0.0.5
typeguard | 4.4.1
typing-extensions | 4.8.0
typing-inspect | 0.9.0
tzdata | 2024.2
tenacity | 9.1.2
umap-learn | 0.5.6
urllib3 | 1.26.7
wcwidth | 0.2.13
werkzeug | 3.0.3
wheel | 0.45.1
XlsxWriter | 3.2.0
xmltodict | 0.13.0
xxhash | 3.5.0
yacs | 0.1.8
yapf | 0.43.0
yarl | 1.17.1
ydata-profiling | 4.16.1
zhdate | 0.1
zipp | 3.21.0
zstandard | 0.22.0
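
When developing an operator locally, it can help to check whether the local environment matches the versions listed in Table 1. The following is a minimal sketch using the standard importlib.metadata module; the function name find_version_mismatches is hypothetical, and the three sample entries are copied from the table.

```python
from importlib import metadata


def find_version_mismatches(expected: dict) -> dict:
    """Map each mismatching package to its installed version (None if not installed)."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Exact string comparison: a local build tag such as "+cpu" counts as a mismatch.
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# A few rows from Table 1; extend the dict with the packages your operator imports.
table_versions = {"numpy": "1.26.4", "pandas": "1.3.5", "torch": "2.1.0"}
report = find_version_mismatches(table_versions)
```

An empty report means the checked packages match the table. Any entry in the report is a candidate for pinning in the optional install.sh script or for adjusting the local environment.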