Operator Package Development Specifications
Python Operator Package Directory Specifications
Assume that the operator package name is video_clip.tar. The directory structure after the operator package is decompressed is as follows:
+--- video_clip                  # The directory name must be the same as the tar package name.
|   +--- program_package         # Python operator directory
|   |   +--- install.sh          # (Optional) Installation script
|   |   +--- process.py          # (Mandatory) Operator code
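As a minimal sketch of the layout above, the following Python helper packs an operator directory so that the top-level directory inside the archive matches the tar file name. The function name build_operator_package and the pre-check for process.py are illustrative assumptions, not part of any platform tooling.

```python
import tarfile
from pathlib import Path


def build_operator_package(package_dir: str, output_dir: str) -> Path:
    """Pack <name>/ into <output_dir>/<name>.tar (hypothetical helper)."""
    src = Path(package_dir).resolve()

    # process.py is mandatory, so fail early if it is missing.
    process_py = src / "program_package" / "process.py"
    if not process_py.is_file():
        raise FileNotFoundError(f"mandatory file missing: {process_py}")

    tar_path = Path(output_dir) / f"{src.name}.tar"
    with tarfile.open(tar_path, "w") as tar:
        # arcname keeps the top-level directory name equal to the tar name.
        tar.add(src, arcname=src.name)
    return tar_path
```

Calling build_operator_package("video_clip", "dist") would write dist/video_clip.tar, provided the dist directory already exists and lies outside video_clip.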
process.py File Development Specifications
The operator package must contain a script named process.py. How you develop this script depends on the value of auto-data-loading in the operator configuration file, which selects one of the two modes below.
Mode 1 (auto-data-loading: true)
Applicable scenarios: This mode is recommended except for the following three scenarios:
- Multimodal dataset: For example, datasets consisting of images and text, or videos and text.
- The output dataset modality is not within the following range: text, image, video, and audio.
- Scenario where the entire dataset needs to be used as the input, for example, deduplication operators.
The development specifications of process.py are as follows. The process.py file contains three classes.
1. PreProcess: (optional) operator preprocessing logic.
Some computation runs on the CPU before model inference is handed to the GPU/NPU. Separating this CPU preprocessing from the GPU/NPU computation improves GPU/NPU utilization.
2. Process: (mandatory) operator inference logic.
Include only model loading and inference in this class, and put preprocessing and postprocessing in PreProcess and PostProcess.
3. PostProcess: (optional) operator post-processing logic.
If the operator has heavy postprocessing logic that still runs on the CPU after inference, split it out and write it in PostProcess.
The operator framework calls the classes in the following sequence: PreProcess -> Process -> PostProcess.
import pandas as pd
import ma_utils as utils

logger = utils.FileLogger.get_logger()


class PreProcess():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default
                     parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file: The framework extracts the sample content of JSONL/CSV files and transfers the content to the operator. The operator is called once for each JSONL/CSV file. The field names in the DataFrame are the same as those defined in the JSONL/CSV files.
            Text-other file: The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).
            Image/Video/Audio-original file: The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).
            Image/Video/Audio-Parquet file: The framework extracts the sample content of Parquet files and transfers the content to the operator. The operator is called once for each Parquet file. The field names in the DataFrame are the same as those defined in the Parquet files. There are two system predefined fields: file_path (full path of the input data on the operator (the operator framework downloads the dataset to the local directory of the operator)), and file_name (file name of the data file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is processed by the framework.
        """
        pass


class Process():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default
                     parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file: The framework extracts the sample content of JSONL/CSV files and transfers the content to the operator. The operator is called once for each JSONL/CSV file. The field names in the DataFrame are the same as those defined in the JSONL/CSV files.
            Text-other file: The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).
            Image/Video/Audio-original file: The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).
            Image/Video/Audio-Parquet file: The framework extracts the sample content of Parquet files and transfers the content to the operator. The operator is called once for each Parquet file. The field names in the DataFrame are the same as those defined in the Parquet files. There are two system predefined fields: file_path (full path of the input data on the operator (the operator framework downloads the dataset to the local directory of the operator)), and file_name (file name of the data file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is processed by the framework.
        """
        pass


class PostProcess():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default
                     parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        """
        :param input: input parameter
            Text-JSONL/CSV file: The framework extracts the sample content of JSONL/CSV files and transfers the content to the operator. The operator is called once for each JSONL/CSV file. The field names in the DataFrame are the same as those defined in the JSONL/CSV files.
            Text-other file: The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).
            Image/Video/Audio-original file: The framework obtains the dataset file list and transfers the list to the operator. The operator is called once for each file. The field names in the DataFrame are file_path (full path of the input data on the operator, which is downloaded by the operator framework to the local directory of the operator) and file_name (data file name).
            Image/Video/Audio-Parquet file: The framework extracts the sample content of Parquet files and transfers the content to the operator. The operator is called once for each Parquet file. The field names in the DataFrame are the same as those defined in the Parquet files. There are two system predefined fields: file_path (full path of the input data on the operator (the operator framework downloads the dataset to the local directory of the operator)), and file_name (file name of the data file (relative path of the file)).
        :return: output parameter, data sample list DataFrame output by the operator, which is processed by the framework.
        """
        pass
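To make the three-class contract more concrete, here is a minimal sketch of a Mode 1 text operator that can be exercised locally. Everything specific in it is an assumption for illustration: args is treated as a dict, the text_column and min_score parameters and the score column are invented, the length-based "inference" only stands in for a real model, and run_locally merely imitates the documented call order rather than the real framework.

```python
import pandas as pd


class PreProcess:
    """CPU-side preparation: strip whitespace from the text column."""

    def __init__(self, args):
        # Assumption: args behaves like a dict of operator parameters.
        self.column = args.get("text_column", "text")

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        out = input.copy()
        out[self.column] = out[self.column].astype(str).str.strip()
        return out


class Process:
    """Stand-in for model inference: score each sample by its length."""

    def __init__(self, args):
        self.column = args.get("text_column", "text")

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        out = input.copy()
        out["score"] = out[self.column].str.len()
        return out


class PostProcess:
    """CPU-side cleanup: drop low-scoring samples and the helper column."""

    def __init__(self, args):
        self.min_score = int(args.get("min_score", 1))

    def __call__(self, input: pd.DataFrame) -> pd.DataFrame:
        kept = input[input["score"] >= self.min_score]
        return kept.drop(columns=["score"]).reset_index(drop=True)


def run_locally(df: pd.DataFrame, args: dict) -> pd.DataFrame:
    """Imitate the documented order: PreProcess -> Process -> PostProcess."""
    for stage in (PreProcess, Process, PostProcess):
        df = stage(args)(df)
    return df
```

A local harness like run_locally is useful for unit-testing each stage on a small DataFrame before packaging the operator.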
Mode 2 (auto-data-loading: false)
This mode applies to the following scenarios:
- Multimodal dataset: For example, datasets consisting of images and text, or videos and text.
- The output dataset modality is not within the following range: text, image, video, and audio.
- Scenario where the entire dataset needs to be used as the input, for example, deduplication operators.
The development specifications of process.py are as follows. The process.py file contains three classes.
1. PreProcess: (optional) operator preprocessing logic.
Some computation runs on the CPU before model inference is handed to the GPU/NPU. Separating this CPU preprocessing from the GPU/NPU computation improves GPU/NPU utilization.
2. Process: (mandatory) operator inference logic.
Include only model loading and inference in this class, and put preprocessing and postprocessing in PreProcess and PostProcess.
3. PostProcess: (optional) operator post-processing logic.
If the operator has heavy postprocessing logic that still runs on the CPU after inference, split it out and write it in PostProcess.
The operator framework calls the classes in the following sequence: PreProcess -> Process -> PostProcess.
import pandas as pd


class Process():
    def __init__(self, args):
        """
        :param args: operator parameters, including the service parameters of the operator and the default
                     parameters obs_input_path (input OBS path) and obs_output_path (output OBS path).
        """
        pass

    def __call__(self, input: pd.DataFrame):
        """
        :param input: input parameter. In mode 2, input is an empty DataFrame.
        :return: No return value.
        """
        pass
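Because the framework passes an empty DataFrame in Mode 2, the operator has to read and write the dataset itself. The sketch below illustrates this with a whole-dataset deduplication step. It rests on several assumptions that the specification does not confirm: args is treated as a dict, obs_input_path and obs_output_path are treated as local directories (a real operator would go through an OBS client), and the dedup_key parameter, the text field, and the deduped.jsonl file name are invented for the example.

```python
import json
from pathlib import Path

import pandas as pd


class Process:
    """Mode 2 sketch: load the whole dataset, deduplicate, write the result."""

    def __init__(self, args):
        # Assumption: dict-like args; paths are reachable as local directories.
        self.input_dir = Path(args["obs_input_path"])
        self.output_dir = Path(args["obs_output_path"])
        self.key = args.get("dedup_key", "text")  # hypothetical parameter

    def __call__(self, input: pd.DataFrame):
        # In mode 2 the input is an empty DataFrame, so it is ignored here.
        records = []
        for path in sorted(self.input_dir.glob("*.jsonl")):
            with path.open(encoding="utf-8") as fh:
                records.extend(json.loads(line) for line in fh if line.strip())

        merged = pd.DataFrame(records)
        if merged.empty:
            return

        # Deduplication needs every sample at once, hence mode 2.
        deduped = merged.drop_duplicates(subset=self.key)

        self.output_dir.mkdir(parents=True, exist_ok=True)
        with (self.output_dir / "deduped.jsonl").open("w", encoding="utf-8") as fh:
            for row in deduped.to_dict(orient="records"):
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
```

The method returns nothing, matching the Mode 2 signature above; all output reaches the dataset through the files the operator writes.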
Preconfigured Dependencies of the Operator Base Image
The following table lists the dependencies preconfigured in the base image of the operator.
| Dependency Package Name | Version |
|---|---|
| absl-py | 2.1.0 |
| accelerate | 1.0.1 |
| aclruntime | 0.0.2 |
| addict | 2.4.0 |
| aiohappyeyeballs | 2.4.3 |
| aiohttp | 3.11.1 |
| aiosignal | 1.3.1 |
| ais-bench | 0.0.2 |
| akg | 2.2 |
| albumentations | 1.3.1 |
| antlr4-python3-runtime | 4.9.3 |
| apex | 0.1.dev20241029+ascend |
| ascend-faultdiag | 6.0.0 |
| ascend-training-accuracy-tools | 1.0 |
| ascendebug | 0.1.0 |
| astroid | 3.0.3 |
| asttokens | 2.4.1 |
| async-timeout | 5.0.1 |
| attrs | 23.2.0 |
| audioread | 3.0.1 |
| auto-tune | 0.1.0 |
| av | 12.0.0 |
| blinker | 1.9.0 |
| blobfile | 3.0.0 |
| blosc2 | 2.5.1 |
| boto3 | 1.23.10 |
| botocore | 1.26.10 |
| certifi | 2024.12.14 |
| cffi | 1.16.0 |
| charset-normalizer | 3.4.1 |
| click | 8.1.7 |
| click-aliases | 1.0.5 |
| cloudpickle | 3.1.0 |
| coloredlogs | 15.0.1 |
| contourpy | 1.3.0 |
| coverage | 7.3.0 |
| crc32c | 2.7.1 |
| cycler | 0.12.1 |
| Cython | 3.0.2 |
| dask | 2024.2.1 |
| dataflow | 0.0.1 |
| datasets | 3.0.1 |
| debugpy | 1.8.8 |
| decorator | 4.4.2 |
| decord | 0.6.0 |
| dill | 0.3.8 |
| easydict | 1.12 |
| einops | 0.8.0 |
| entrypoints | 0.4 |
| esdk-obs-python | 3.23.12 |
| et-xmlfile | 2.0.0 |
| exceptiongroup | 1.2.2 |
| executing | 2.1.0 |
| filelock | 3.16.1 |
| flask | 2.3.3 |
| flatbuffers | 24.12.23 |
| fonttools | 4.55.0 |
| frozenlist | 1.5.0 |
| fsspec | 2024.6.1 |
| fuzzywuzzy | 0.18.0 |
| gnureadline | 8.2.10 |
| gpytorch | 1.12 |
| greenlet | 3.1.1 |
| grpcio | 1.60.0 |
| grpcio-tools | 1.60.0 |
| gunicorn | 21.2.0 |
| h5py | 3.9.0 |
| hccl | 0.1.0 |
| hccl-parser | 0.1 |
| huaweicloud-sdk-python-modelarts-dataset | 0.1.5 |
| huggingface-hub | 0.26.2 |
| humanfriendly | 10.0 |
| idna | 3.10 |
| ijson | 3.3.0 |
| imageio | 2.36.1 |
| imageio-ffmpeg | 0.5.1 |
| importlib-metadata | 8.5.0 |
| importlib-resources | 6.4.5 |
| iniconfig | 2.0.0 |
| iopath | 0.1.10 |
| ipykernel | 6.7.0 |
| ipython | 8.18.1 |
| isort | 5.13.2 |
| itsdangerous | 2.2.0 |
| jaxtyping | 0.2.19 |
| jedi | 0.19.2 |
| jieba | 0.42.1 |
| jinja2 | 3.1.4 |
| jmespath | 1.0.1 |
| joblib | 1.4.2 |
| jsonlines | 4.0.0 |
| jupyter-client | 7.4.9 |
| jupyter-core | 5.7.2 |
| keras | 3.2.1 |
| kiwisolver | 1.4.5 |
| lazy-import | 0.2.2 |
| lazy-loader | 0.4 |
| levenshtein | 0.26.0 |
| libclang | 18.1.1 |
| libcst | 1.2.0 |
| librosa | 0.10.2.post1 |
| linear-operator | 0.5.3 |
| llm-datadist | 0.0.1 |
| llvmlite | 0.41.1 |
| locket | 1.0.0 |
| lxml | 5.1.0 |
| Markdown | 3.7 |
| markdown-it-py | 3.0.0 |
| MarkupSafe | 3.0.2 |
| matplotlib | 3.7.3 |
| matplotlib-inline | 0.1.7 |
| mccabe | 0.7.0 |
| mdurl | 0.1.2 |
| mindspore-lite | 2.4.0 |
| mindstudio-probe | 1.0.2 |
| mindx-elastic | 0.0.1 |
| ml-dtypes | 0.5.0 |
| mmcv | 2.0.1 |
| mmengine | 0.10.5 |
| modelarts-pytorch-model-server | 1.0.6 |
| moviepy | 2.1.2 |
| moxing-framework | 2.2.10 |
| mpi4py | 3.1.6 |
| mpmath | 1.3.0 |
| msgpack | 1.1.0 |
| msit | 7.0.0rc630 |
| msit-benchmark | 7.0.0rc2 |
| msit-compare | 7.0.0rc2 |
| msit-llm | 7.0.0rc2 |
| msit-surgeon | 7.0.0rc2 |
| msprof-analyze | 1.2.0 |
| multidict | 6.1.0 |
| multiprocess | 0.70.16 |
| mypy-extensions | 1.0.0 |
| msobjdump | 0.1.0 |
| namex | 0.0.8 |
| ndindex | 1.9.2 |
| nest-asyncio | 1.6.0 |
| netifaces | 0.11.0.post20240306102544 |
| networkx | 3.2.1 |
| numba | 0.58.1 |
| numexpr | 2.8.6 |
| numpy | 1.26.4 |
| omegaconf | 2.3.0 |
| onnx | 1.17.0 |
| onnxconverter-common | 1.14.0 |
| onnxruntime | 1.18.0 |
| op-compile-tool | 0.1.0 |
| op-gen | 0.1 |
| op-test-frame | 0.1 |
| opc-tool | 0.1.0 |
| opencv-python | 4.9.0.80 |
| opencv-python-headless | 4.8.1.78 |
| openpyxl | 3.1.5 |
| optree | 0.13.1 |
| orjson | 3.10.7 |
| packaging | 24.2 |
| pandas | 1.3.5 |
| parso | 0.8.4 |
| partd | 1.4.2 |
| pathlib2 | 2.3.7.post1 |
| peft | 0.7.1 |
| pexpect | 4.9.0 |
| pillow | 10.4.0 |
| pip | 21.0.1 |
| platformdirs | 4.3.6 |
| pluggy | 1.5.0 |
| ply | 3.11 |
| pooch | 1.8.2 |
| portalocker | 2.10.1 |
| prettytable | 3.12.0 |
| proglog | 0.1.10 |
| prometheus-client | 0.14.1 |
| prompt-toolkit | 3.0.48 |
| propcache | 0.2.0 |
| protobuf | 3.20.2 |
| psutil | 6.0.0 |
| ptyprocess | 0.7.0 |
| pure-eval | 0.2.3 |
| py-cpuinfo | 9.0.0 |
| pyarrow | 18.0.0 |
| pybind11 | 2.13.6 |
| pyclipper | 1.3.0.post6 |
| pycocotools | 2.0.7 |
| pycparser | 2.22 |
| pycryptodome | 3.21.0 |
| pycryptodomex | 3.21.0 |
| pygments | 2.18.0 |
| pylint | 3.0.2 |
| pynndescent | 0.5.13 |
| pyparsing | 3.2.0 |
| pypng | 0.20220715.0 |
| pytest | 7.4.3 |
| python-dateutil | 2.9.0.post0 |
| python-dotenv | 1.0.1 |
| pytz | 2024.2 |
| PyWavelets | 1.4.1 |
| PyYAML | 6.0.1 |
| pyzmq | 26.2.0 |
| pydub | 0.25.1 |
| qudida | 0.0.4 |
| rapidfuzz | 3.10.1 |
| redis | 5.1.1 |
| regex | 2024.9.11 |
| requests | 2.32.2 |
| rich | 13.9.4 |
| rouge-metric | 1.0.1 |
| s3transfer | 0.5.2 |
| safetensors | 0.4.5 |
| schedule-search | 0.0.1 |
| scikit-image | 0.22.0 |
| scikit-learn | 1.5.1 |
| scipy | 1.10.1 |
| sentencepiece | 0.2.0 |
| setuptools | 75.8.0 |
| shapely | 2.0.3 |
| show-kernel-debug-data | 0.1.0 |
| shyaml | 0.6.2 |
| six | 1.16.0 |
| skl2onnx | 1.17.0 |
| soundfile | 0.12.1 |
| soxr | 0.5.0.post1 |
| SQLAlchemy | 2.0.36 |
| stack-data | 0.6.3 |
| sympy | 1.13.0 |
| setuptools | 57.5.0 |
| soundfile | 0.13.1 |
| tables | 3.9.2 |
| tabulate | 0.9.0 |
| tailor | 0.3.4 |
| te | 0.4.0 |
| tensorboard | 2.18.0 |
| tensorboard-data-server | 0.7.2 |
| termcolor | 2.5.0 |
| terminaltables | 3.1.10 |
| tfrecord | 1.14.5 |
| threadpoolctl | 3.5.0 |
| tifffile | 2024.8.30 |
| tiktoken | 0.7.0 |
| timm | 1.0.9 |
| tokenizers | 0.20.3 |
| tomli | 2.1.0 |
| tomlkit | 0.13.2 |
| toolz | 1.0.0 |
| torch | 2.1.0 |
| torch-npu | 2.1.0 |
| torchvision | 0.16.0 |
| tornado | 6.4.1 |
| tqdm | 4.66.3 |
| traitlets | 5.14.3 |
| transformers | 4.45.0 |
| transformers-stream-generator | 0.0.5 |
| typeguard | 4.4.1 |
| typing-extensions | 4.8.0 |
| typing-inspect | 0.9.0 |
| tzdata | 2024.2 |
| tenacity | 9.1.2 |
| umap-learn | 0.5.6 |
| urllib3 | 1.26.7 |
| wcwidth | 0.2.13 |
| werkzeug | 3.0.3 |
| wheel | 0.45.1 |
| XlsxWriter | 3.2.0 |
| xmltodict | 0.13.0 |
| xxhash | 3.5.0 |
| yacs | 0.1.8 |
| yapf | 0.43.0 |
| yarl | 1.17.1 |
| ydata-profiling | 4.16.1 |
| zhdate | 0.1 |
| zipp | 3.21.0 |
| zstandard | 0.22.0 |
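When an operator relies on specific versions from the table above, it can help to compare them against what is actually installed before deciding whether install.sh needs to add anything. The following is a small sketch using only the Python standard library; the function name find_mismatches and the shape of its return value are illustrative assumptions.

```python
from importlib import metadata


def find_mismatches(expected: dict) -> dict:
    """Return {package: (expected_version, installed_version_or_None)}
    for every package whose installed version differs from the expected one."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

For example, find_mismatches({"pandas": "1.3.5", "numpy": "1.26.4"}) returns an empty dict only when both packages are installed at exactly those versions.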