Operator Package Development Specifications

Python Operator Package Directory Specifications

Assume that the operator package name is video_clip.tar. The directory structure after the operator package is decompressed is as follows:

+--- video_clip # The directory name must be the same as the tar package name.
| +--- program_package # Python operator directory
| | +--- install.sh # (Optional) Installation script
| | +--- process.py # (Mandatory) Operator code
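As an illustration of the layout above, the following sketch packs an operator directory so that the archive name and the top-level directory name match. The helper names build_operator_package and top_level_names are hypothetical and are not part of the operator framework; only the Python standard library is used.

```python
import tarfile
from pathlib import Path


def build_operator_package(operator_dir: str, output_dir: str = ".") -> Path:
    """Pack operator_dir into <name>.tar with <name>/ as the only top-level entry."""
    src = Path(operator_dir).resolve()
    process_py = src / "program_package" / "process.py"
    if not process_py.is_file():
        raise FileNotFoundError(f"missing mandatory file: {process_py}")

    tar_path = Path(output_dir) / f"{src.name}.tar"
    with tarfile.open(tar_path, "w") as tar:
        # arcname keeps only the directory name, so the archive unpacks to ./<name>/...
        tar.add(src, arcname=src.name)
    return tar_path


def top_level_names(tar_path: Path) -> set:
    """Return the set of top-level entries stored in the archive."""
    with tarfile.open(tar_path) as tar:
        return {member.name.split("/")[0] for member in tar.getmembers()}
```

Calling build_operator_package("video_clip") in the parent directory of video_clip would write video_clip.tar, and top_level_names can be used to confirm that the archive contains a single top-level directory named video_clip.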

process.py File Development Specifications

The operator package must contain a script named process.py. The development mode depends on the value of auto-data-loading in the operator configuration file.

Mode 1 (auto-data-loading: true)

Applicable scenarios: This mode is recommended except for the following three scenarios:

Multimodal dataset: For example, datasets consisting of images and text, or videos and text.
The output dataset modality is not within the following range: text, image, video, and audio.
Scenario where the entire dataset needs to be used as the input, for example, deduplication operators.
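The three exceptions above can be written as a small decision helper. This is only a sketch for reasoning about which mode to pick: the function use_auto_data_loading and its parameters are hypothetical, and the framework itself reads the auto-data-loading value from the operator configuration file.

```python
SINGLE_MODALITIES = {"text", "image", "video", "audio"}


def use_auto_data_loading(input_modalities, output_modality, needs_whole_dataset):
    """Return True when Mode 1 (auto-data-loading: true) fits, False for Mode 2."""
    # Exception 1: the dataset mixes modalities, for example images and text.
    multimodal = len(set(input_modalities)) > 1
    # Exception 2: the output modality is not text, image, video, or audio.
    unsupported_output = output_modality not in SINGLE_MODALITIES
    # Exception 3: the operator needs the entire dataset at once (for example, deduplication).
    return not (multimodal or unsupported_output or needs_whole_dataset)
```

For example, an image-to-image operator that works file by file would get True, while a deduplication operator would get False.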

The development specifications of process.py are as follows. The process.py file contains three classes.

1. PreProcess: (optional) operator preprocessing logic.

Before model inference, some operator computation runs on the CPU rather than on the GPU/NPU. Moving this CPU-bound preprocessing into PreProcess separates it from the GPU/NPU computation and improves GPU/NPU utilization.

2. Process: (mandatory) operator inference logic.

It is recommended that only the model loading and inference parts be included, and the preprocessing and postprocessing be written in PreProcess and PostProcess.

3. PostProcess: (optional) operator post-processing logic.

If the operator has heavy postprocessing logic that is still calculated on the CPU after inference, you are advised to split the logic and write the postprocessing logic in PostProcess.

The operator framework calls the classes in the following sequence: PreProcess -> Process -> PostProcess.
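The calling sequence can be pictured with a small driver. This is a sketch under stated assumptions, not the framework's actual code: run_stages is a hypothetical function, args is assumed to be dict-like, and optional stages are assumed to be skipped when process.py does not define them.

```python
import pandas as pd


class Process:
    """Only the mandatory stage is defined here; PreProcess and PostProcess are omitted."""

    def __init__(self, args):
        self.args = args

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        return input.assign(processed=True)


def run_stages(namespace: dict, args: dict, batch: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical driver: PreProcess -> Process -> PostProcess, skipping absent optional stages."""
    stages = []
    for name in ("PreProcess", "Process", "PostProcess"):
        stage_cls = namespace.get(name)
        if stage_cls is None:
            if name == "Process":
                raise AttributeError("process.py must define a Process class")
            continue
        stages.append(stage_cls(args))  # each stage is constructed once with the operator parameters
    for stage in stages:
        batch = stage(batch)  # the output DataFrame of one stage feeds the next
    return batch


result = run_stages({"Process": Process}, {}, pd.DataFrame({"text": ["a", "b"]}))
```

Here result keeps the text column and gains a processed column, because only the Process stage ran.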

import pandas as pd
import ma_utils as utils

logger = utils.FileLogger.get_logger()


class PreProcess():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator
            and the default parameters obs_input_path (input OBS path) and
            obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file:
                The framework extracts the sample content of JSONL/CSV files and transfers the
                content to the operator. The operator is called once for each JSONL/CSV file.
                The field names in the DataFrame are the same as those defined in the
                JSONL/CSV files.
            Text-other file:
                The framework obtains the dataset file list and transfers the list to the
                operator. The operator is called once for each file. The field names in the
                DataFrame are file_path (full path of the input data on the operator, which is
                downloaded by the operator framework to the local directory of the operator)
                and file_name (data file name).
            Image/Video/Audio-original file:
                The framework obtains the dataset file list and transfers the list to the
                operator. The operator is called once for each file. The field names in the
                DataFrame are file_path (full path of the input data on the operator, which is
                downloaded by the operator framework to the local directory of the operator)
                and file_name (data file name).
            Image/Video/Audio-Parquet file:
                The framework extracts the sample content of Parquet files and transfers the
                content to the operator. The operator is called once for each Parquet file.
                The field names in the DataFrame are the same as those defined in the Parquet
                files. There are two system predefined fields: file_path (full path of the
                input data on the operator (the operator framework downloads the dataset to
                the local directory of the operator)), and file_name (file name of the data
                file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is
            processed by the framework.
        """
        pass


class Process():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator
            and the default parameters obs_input_path (input OBS path) and
            obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file:
                The framework extracts the sample content of JSONL/CSV files and transfers the
                content to the operator. The operator is called once for each JSONL/CSV file.
                The field names in the DataFrame are the same as those defined in the
                JSONL/CSV files.
            Text-other file:
                The framework obtains the dataset file list and transfers the list to the
                operator. The operator is called once for each file. The field names in the
                DataFrame are file_path (full path of the input data on the operator, which is
                downloaded by the operator framework to the local directory of the operator)
                and file_name (data file name).
            Image/Video/Audio-original file:
                The framework obtains the dataset file list and transfers the list to the
                operator. The operator is called once for each file. The field names in the
                DataFrame are file_path (full path of the input data on the operator, which is
                downloaded by the operator framework to the local directory of the operator)
                and file_name (data file name).
            Image/Video/Audio-Parquet file:
                The framework extracts the sample content of Parquet files and transfers the
                content to the operator. The operator is called once for each Parquet file.
                The field names in the DataFrame are the same as those defined in the Parquet
                files. There are two system predefined fields: file_path (full path of the
                input data on the operator (the operator framework downloads the dataset to
                the local directory of the operator)), and file_name (file name of the data
                file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is
            processed by the framework.
        """
        pass


class PostProcess():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator
            and the default parameters obs_input_path (input OBS path) and
            obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file:
                The framework extracts the sample content of JSONL/CSV files and transfers the
                content to the operator. The operator is called once for each JSONL/CSV file.
                The field names in the DataFrame are the same as those defined in the
                JSONL/CSV files.
            Text-other file:
                The framework obtains the dataset file list and transfers the list to the
                operator. The operator is called once for each file. The field names in the
                DataFrame are file_path (full path of the input data on the operator, which is
                downloaded by the operator framework to the local directory of the operator)
                and file_name (data file name).
            Image/Video/Audio-original file:
                The framework obtains the dataset file list and transfers the list to the
                operator. The operator is called once for each file. The field names in the
                DataFrame are file_path (full path of the input data on the operator, which is
                downloaded by the operator framework to the local directory of the operator)
                and file_name (data file name).
            Image/Video/Audio-Parquet file:
                The framework extracts the sample content of Parquet files and transfers the
                content to the operator. The operator is called once for each Parquet file.
                The field names in the DataFrame are the same as those defined in the Parquet
                files. There are two system predefined fields: file_path (full path of the
                input data on the operator (the operator framework downloads the dataset to
                the local directory of the operator)), and file_name (file name of the data
                file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is
            processed by the framework.
        """
        pass
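To make the template more concrete, here is a minimal sketch of a Mode 1 operator for the Text-JSONL/CSV case, where each call receives the samples of one file as a DataFrame. Several things are assumptions for illustration only: args is treated as dict-like, text_field and min_score are hypothetical service parameters, the character count stands in for real model inference, and the ma_utils logger is left out so that the sketch runs outside the operator image.

```python
import pandas as pd


class PreProcess:
    def __init__(self, args):
        # Assumption: args behaves like a dict; "text_field" is a hypothetical service parameter.
        self.field = args.get("text_field", "text")

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        output = input.copy()
        # CPU-side cleanup before inference: normalize to str and trim whitespace.
        output[self.field] = output[self.field].astype(str).str.strip()
        return output


class Process:
    def __init__(self, args):
        self.field = args.get("text_field", "text")
        # A real operator would load its model here, once per instance.

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        output = input.copy()
        # Stand-in for model inference: score each sample by its character count.
        output["score"] = output[self.field].str.len()
        return output


class PostProcess:
    def __init__(self, args):
        # "min_score" is a hypothetical service parameter.
        self.min_score = args.get("min_score", 1)

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        # CPU-side filtering after inference; the helper column is dropped again.
        kept = input[input["score"] >= self.min_score]
        return kept.drop(columns=["score"]).reset_index(drop=True)


# Local simulation of one framework call chain for a single JSONL file.
args = {"text_field": "text", "min_score": 3}
batch = pd.DataFrame({"text": ["  hello ", "hi", "   "]})
result = PostProcess(args)(Process(args)(PreProcess(args)(batch)))
```

Running the three stages by hand like this is a convenient way to test an operator locally before packaging it. For the file-list cases, the same structure applies, except that the DataFrame carries file_path and file_name instead of sample fields.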

Mode 2 (auto-data-loading: false)

This mode applies to the following scenarios:

Multimodal dataset: For example, datasets consisting of images and text, or videos and text.
The output dataset modality is not within the following range: text, image, video, and audio.
Scenario where the entire dataset needs to be used as the input, for example, deduplication operators.

The development specifications of process.py are as follows. The process.py file contains three classes.

1. PreProcess: (optional) operator preprocessing logic.

2. Process: (mandatory) operator inference logic.

It is recommended that only the model loading and inference parts be included, and the preprocessing and postprocessing be written in PreProcess and PostProcess.

3. PostProcess: (optional) operator post-processing logic.

If the operator has heavy postprocessing logic that is still calculated on the CPU after inference, you are advised to split the logic and write the postprocessing logic in PostProcess.

The operator framework calls the classes in the following sequence: PreProcess -> Process -> PostProcess.

import pandas as pd


class Process():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator
            and the default parameters obs_input_path (input OBS path) and
            obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame):
        """
        :param input: input parameter. In mode 2, input is an empty DataFrame.
        :return: No return value.
        """
        pass
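Because the input DataFrame is empty in Mode 2, the operator has to locate, read, and write the data itself. The sketch below shows the idea with an exact-duplicate filter over a whole dataset. It rests on loud assumptions: args is treated as dict-like, and obs_input_path and obs_output_path are treated as directories already reachable on the local file system. Real OBS paths would need a download and upload step that is not shown here.

```python
import hashlib
import shutil
from pathlib import Path

import pandas as pd


class Process:
    def __init__(self, args):
        # Assumption: args is dict-like and both paths are usable as local directories.
        self.input_dir = Path(args["obs_input_path"])
        self.output_dir = Path(args["obs_output_path"])

    def __call__(self, input: pd.DataFrame):
        # input is an empty DataFrame in Mode 2, so the operator lists the data itself.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        seen = set()
        for path in sorted(p for p in self.input_dir.iterdir() if p.is_file()):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in seen:
                continue  # exact duplicate of a file that was already kept
            seen.add(digest)
            shutil.copy2(path, self.output_dir / path.name)
        # No return value: results are written to the output path directly.
```

Hashing whole files only catches byte-identical duplicates; a production deduplication operator would likely use a more tolerant similarity measure.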

Preconfigured Dependencies of the Operator Base Image

The following table lists the dependencies preconfigured in the base image of the operator.

**Table 1** Preconfigured Dependencies of the Custom Base Image
Dependency Package Name	Version
absl-py	2.1.0
accelerate	1.0.1
aclruntime	0.0.2
addict	2.4.0
aiohappyeyeballs	2.4.3
aiohttp	3.11.1
aiosignal	1.3.1
ais-bench	0.0.2
akg	2.2
albumentations	1.3.1
antlr4-python3-runtime	4.9.3
apex	0.1.dev20241029+ascend
ascend-faultdiag	6.0.0
ascend-training-accuracy-tools	1.0
ascendebug	0.1.0
astroid	3.0.3
asttokens	2.4.1
async-timeout	5.0.1
attrs	23.2.0
audioread	3.0.1
auto-tune	0.1.0
av	12.0.0
blinker	1.9.0
blobfile	3.0.0
blosc2	2.5.1
boto3	1.23.10
botocore	1.26.10
certifi	2024.12.14
cffi	1.16.0
charset-normalizer	3.4.1
click	8.1.7
click-aliases	1.0.5
cloudpickle	3.1.0
coloredlogs	15.0.1
contourpy	1.3.0
coverage	7.3.0
crc32c	2.7.1
cycler	0.12.1
Cython	3.0.2
dask	2024.2.1
dataflow	0.0.1
datasets	3.0.1
debugpy	1.8.8
decorator	4.4.2
decord	0.6.0
dill	0.3.8
easydict	1.12
einops	0.8.0
entrypoints	0.4
esdk-obs-python	3.23.12
et-xmlfile	2.0.0
exceptiongroup	1.2.2
executing	2.1.0
filelock	3.16.1
flask	2.3.3
flatbuffers	24.12.23
fonttools	4.55.0
frozenlist	1.5.0
fsspec	2024.6.1
fuzzywuzzy	0.18.0
gnureadline	8.2.10
gpytorch	1.12
greenlet	3.1.1
grpcio	1.60.0
grpcio-tools	1.60.0
gunicorn	21.2.0
h5py	3.9.0
hccl	0.1.0
hccl-parser	0.1
huaweicloud-sdk-python-modelarts-dataset	0.1.5
huggingface-hub	0.26.2
humanfriendly	10.0
idna	3.10
ijson	3.3.0
imageio	2.36.1
imageio-ffmpeg	0.5.1
importlib-metadata	8.5.0
importlib-resources	6.4.5
iniconfig	2.0.0
iopath	0.1.10
ipykernel	6.7.0
ipython	8.18.1
isort	5.13.2
itsdangerous	2.2.0
jaxtyping	0.2.19
jedi	0.19.2
jieba	0.42.1
jinja2	3.1.4
jmespath	1.0.1
joblib	1.4.2
jsonlines	4.0.0
jupyter-client	7.4.9
jupyter-core	5.7.2
keras	3.2.1
kiwisolver	1.4.5
lazy-import	0.2.2
lazy-loader	0.4
levenshtein	0.26.0
libclang	18.1.1
libcst	1.2.0
librosa	0.10.2.post1
linear-operator	0.5.3
llm-datadist	0.0.1
llvmlite	0.41.1
locket	1.0.0
lxml	5.1.0
Markdown	3.7
markdown-it-py	3.0.0
MarkupSafe	3.0.2
matplotlib	3.7.3
matplotlib-inline	0.1.7
mccabe	0.7.0
mdurl	0.1.2
mindspore-lite	2.4.0
mindstudio-probe	1.0.2
mindx-elastic	0.0.1
ml-dtypes	0.5.0
mmcv	2.0.1
mmengine	0.10.5
modelarts-pytorch-model-server	1.0.6
moviepy	2.1.2
moxing-framework	2.2.10
mpi4py	3.1.6
mpmath	1.3.0
msgpack	1.1.0
msit	7.0.0rc630
msit-benchmark	7.0.0rc2
msit-compare	7.0.0rc2
msit-llm	7.0.0rc2
msit-surgeon	7.0.0rc2
msprof-analyze	1.2.0
multidict	6.1.0
multiprocess	0.70.16
mypy-extensions	1.0.0
msobjdump	0.1.0
namex	0.0.8
ndindex	1.9.2
nest-asyncio	1.6.0
netifaces	0.11.0.post20240306102544
networkx	3.2.1
numba	0.58.1
numexpr	2.8.6
numpy	1.26.4
omegaconf	2.3.0
onnx	1.17.0
onnxconverter-common	1.14.0
onnxruntime	1.18.0
op-compile-tool	0.1.0
op-gen	0.1
op-test-frame	0.1
opc-tool	0.1.0
opencv-python	4.9.0.80
opencv-python-headless	4.8.1.78
openpyxl	3.1.5
optree	0.13.1
orjson	3.10.7
packaging	24.2
pandas	1.3.5
parso	0.8.4
partd	1.4.2
pathlib2	2.3.7.post1
peft	0.7.1
pexpect	4.9.0
pillow	10.4.0
pip	21.0.1
platformdirs	4.3.6
pluggy	1.5.0
ply	3.11
pooch	1.8.2
portalocker	2.10.1
prettytable	3.12.0
proglog	0.1.10
prometheus-client	0.14.1
prompt-toolkit	3.0.48
propcache	0.2.0
protobuf	3.20.2
psutil	6.0.0
ptyprocess	0.7.0
pure-eval	0.2.3
py-cpuinfo	9.0.0
pyarrow	18.0.0
pybind11	2.13.6
pyclipper	1.3.0.post6
pycocotools	2.0.7
pycparser	2.22
pycryptodome	3.21.0
pycryptodomex	3.21.0
pygments	2.18.0
pylint	3.0.2
pynndescent	0.5.13
pyparsing	3.2.0
pypng	0.20220715.0
pytest	7.4.3
python-dateutil	2.9.0.post0
python-dotenv	1.0.1
pytz	2024.2
PyWavelets	1.4.1
PyYAML	6.0.1
pyzmq	26.2.0
pydub	0.25.1
qudida	0.0.4
rapidfuzz	3.10.1
redis	5.1.1
regex	2024.9.11
requests	2.32.2
rich	13.9.4
rouge-metric	1.0.1
s3transfer	0.5.2
safetensors	0.4.5
schedule-search	0.0.1
scikit-image	0.22.0
scikit-learn	1.5.1
scipy	1.10.1
sentencepiece	0.2.0
setuptools	75.8.0
shapely	2.0.3
show-kernel-debug-data	0.1.0
shyaml	0.6.2
six	1.16.0
skl2onnx	1.17.0
soundfile	0.12.1
soxr	0.5.0.post1
SQLAlchemy	2.0.36
stack-data	0.6.3
sympy	1.13.0
setuptools	57.5.0
soundfile	0.13.1
tables	3.9.2
tabulate	0.9.0
tailor	0.3.4
te	0.4.0
tensorboard	2.18.0
tensorboard-data-server	0.7.2
termcolor	2.5.0
terminaltables	3.1.10
tfrecord	1.14.5
threadpoolctl	3.5.0
tifffile	2024.8.30
tiktoken	0.7.0
timm	1.0.9
tokenizers	0.20.3
tomli	2.1.0
tomlkit	0.13.2
toolz	1.0.0
torch	2.1.0
torch-npu	2.1.0
torchvision	0.16.0
tornado	6.4.1
tqdm	4.66.3
traitlets	5.14.3
transformers	4.45.0
transformers-stream-generator	0.0.5
typeguard	4.4.1
typing-extensions	4.8.0
typing-inspect	0.9.0
tzdata	2024.2
tenacity	9.1.2
umap-learn	0.5.6
urllib3	1.26.7
wcwidth	0.2.13
werkzeug	3.0.3
wheel	0.45.1
XlsxWriter	3.2.0
xmltodict	0.13.0
xxhash	3.5.0
yacs	0.1.8
yapf	0.43.0
yarl	1.17.1
ydata-profiling	4.16.1
zhdate	0.1
zipp	3.21.0
zstandard	0.22.0
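When deciding what install.sh still needs to install, it can help to compare the versions an operator expects against what is actually present in the running image. The following sketch does that with the standard library; find_mismatches is a hypothetical helper, and the expected versions passed to it are whatever the operator author chooses to check.

```python
from importlib import metadata


def find_mismatches(expected: dict) -> dict:
    """Map each package whose installed version differs to (expected, installed).

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

For example, find_mismatches({"pandas": "1.3.5", "torch": "2.1.0"}) would return an empty dict inside an image that matches the table for those two packages.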