Updated on 2025-07-28 GMT+08:00

Custom Data Processing Operators

In addition to the preset processing operators, ModelArts Studio allows you to create custom processing operators for specific data processing requirements and service scenarios. You can define the processing logic to suit your own requirements, which can improve model training results and adaptability.

Constraints

This function is available only to yearly/monthly subscribers.

Figure 1 Applying for trial
Figure 2 Yearly/Monthly resources

Creating a Custom Processing Operator

To create a custom processing operator, perform the following steps:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 3 My Spaces
  2. In the navigation pane, choose Data Engineering > Data Processing > Processing Tasks. Click Manage Processing Operator in the upper right corner.
  3. On the Manage Processing Operators page, click the Custom tab, and click Create Custom Operator in the upper right corner.
  4. On the Create Custom Operator page, click Download samples to view the specifications of the operator configuration file and operator package. Use OBS to upload the operator configuration file and operator package, set the workspace visibility, and click OK in the lower right corner.
    Figure 4 Creating a custom operator
  5. After a custom operator is created, it can be used in Processing Text Datasets, Processing Image Datasets, Processing Video Datasets, Processing Weather Datasets, and Processing Other Datasets.

Viewing operator details:

Click the operator name to view its details.

Operator Configuration File Specifications

Table 1 Basic information configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| id | string | Yes | Operator name in English. | The operator name must start with a letter and can contain up to 128 characters, including letters, digits, and underscores (_). If the length exceeds 128 characters, the excess part is automatically truncated. The ID cannot be changed when the operator is updated. The ID of an operator that is visible in all workspaces must be unique across all workspaces, and the ID of an operator that is visible only in the current workspace must be unique within that workspace. |
| name | string | Yes | Operator display name. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| description | string | No | Operator description. | The length cannot exceed 2000 characters. The excess part is automatically truncated. |
| author | string | No | Developer name. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| tags | tags object | Yes | Operator tags, which are used for classification and filtering. | For details, see Table 2. |
| labels | Array of label objects | No | Labels output by a labeling operator. | For details, see Table 7. |
| runtime | runtime object | Yes | Operator running configuration. | For details, see Table 3. |
| arguments | Array of argument objects | No | List of operator input parameters. | For details, see Table 5. |
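
The id constraints in Table 1 can be checked locally before the configuration file is uploaded. The following is a minimal sketch, not the platform's own validation logic. The function name normalize_operator_id is hypothetical, and the sketch assumes that "letters" means ASCII letters and that truncation happens before the character check.

```python
import re

MAX_ID_LENGTH = 128
# Assumption: "letters" means ASCII letters. The first character must be a
# letter; the remaining characters may be letters, digits, or underscores.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def normalize_operator_id(raw_id: str) -> str:
    """Truncate an operator id to 128 characters, then check its characters.

    Hypothetical helper that mirrors the constraints listed for id in Table 1.
    """
    candidate = raw_id[:MAX_ID_LENGTH]  # the excess part is dropped
    if not ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"invalid operator id: {raw_id!r}")
    return candidate
```

For example, normalize_operator_id("video_clip") returns the ID unchanged, whereas an ID that starts with a digit raises ValueError.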

Table 2 Tag configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| language | Array of strings | Yes | List of languages that can be processed by the operator, for example, zh and en. | The length of a single character string cannot exceed 32 characters. The excess part is automatically truncated. Enter an international language code. |
| format | Array of strings | Yes | List of dataset file name extensions supported by the operator, for example, JSON, CSV, and MP4. | The length of a single character string cannot exceed 32 characters. The excess part is automatically truncated. |
| category | string | Yes | Operator type. | Select only one option. The options are as follows: Data extraction, Data sampling, Data conversion, Data filtering, Data deduplication, Data labeling, Other. |
| modal | Array of strings | Yes | List of data modalities supported by the operator. | Select one or multiple options. The options are as follows: TEXT, IMAGE, VIDEO, AUDIO, OTHER (including weather and prediction). |
| custom | Array of strings | No | List of custom operator tags. | The length of a single character string cannot exceed 32 characters. The excess part is automatically truncated. |

Table 3 runtime configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| cpu-arch | Array of strings | Yes | List of CPU architectures supported by the operator. | Select one or multiple options. The options are as follows: Arm, x86. |
| xpu-devices | Array of strings | No | List of device models supported by the operator. | This parameter is mandatory when the operator runs on NPUs, that is, when the number of NPUs of a resource in runtime.resources is greater than 0. The only option is SNT9B. |
| environment | string | Yes | Operator package type. Only pure Python operator packages are supported. | Select only one option. The options are as follows: PYTHON. |
| entrypoint | string | Yes | Operator startup command. | The value contains a maximum of 128 characters. The excess part is automatically truncated. If environment is set to python, enter the fixed name process.py. The name cannot be changed. |
| auto-data-loading | boolean | Yes | Whether to automatically load data. | If the value is true, the framework processes the input and output. If the value is false, the user processes the input and output. |
| resources | Array of resource objects | No | List of resource sizes required for running a single operator instance. | This parameter is mandatory when environment is set to python. For details, see Table 4. |

Table 4 resource configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| cpu | integer | Yes | Default CPU resources of a single instance. Unit: vCPU. | Configure it based on the actual number of general computing units. |
| memory | integer | Yes | Default memory size of a single instance. Unit: MB. | Configure it based on the actual number of general computing units. |
| npu | integer | No | Default number of NPUs of a single instance. Unit: card. | Set this parameter based on the specifications and quantity of the subscribed intelligent computing units. |
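
The conditional rules in Table 3 and Table 4 (for example, xpu-devices becomes mandatory once any resource requests NPUs) are easy to miss when editing the file by hand. The following sketch checks them against a runtime section that has already been loaded into a Python dict. The function name check_runtime and the message texts are hypothetical, the comparison of environment is case-insensitive by assumption, and the platform's actual validation may differ.

```python
def check_runtime(runtime: dict) -> list[str]:
    """Return a list of problems found in a parsed runtime section.

    Hypothetical helper based only on the rules stated in Table 3 and Table 4.
    """
    problems = []

    # Fields marked "Mandatory: Yes" in Table 3.
    for key in ("cpu-arch", "environment", "entrypoint", "auto-data-loading"):
        if key not in runtime:
            problems.append(f"missing mandatory field: {key}")

    # Assumption: the environment value is compared case-insensitively.
    is_python = str(runtime.get("environment", "")).lower() == "python"
    resources = runtime.get("resources") or []

    if is_python and not resources:
        problems.append("resources is mandatory for a Python operator package")
    if is_python and "entrypoint" in runtime and runtime["entrypoint"] != "process.py":
        problems.append("entrypoint must be process.py for a Python operator package")

    # cpu and memory are mandatory in every resource entry (Table 4).
    for index, resource in enumerate(resources):
        for key in ("cpu", "memory"):
            if key not in resource:
                problems.append(f"resources[{index}] is missing {key}")

    # xpu-devices is mandatory when any resource requests at least one NPU.
    uses_npu = any(resource.get("npu", 0) > 0 for resource in resources)
    if uses_npu and not runtime.get("xpu-devices"):
        problems.append("xpu-devices is mandatory when a resource requests NPUs")

    return problems
```

An empty list means that none of the rules covered by the sketch is violated.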

Table 5 argument configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| key | string | Yes | Parameter ID, which must be unique in the parameter list. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| name | string | Yes | Parameter display name. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| type | string | Yes | Parameter type. | Select only one option. The options are as follows: STRING, FLOAT, INT, ENUM (radio button), LIST (check box), OBS (front-end component parameter for selecting an OBS file), BOOLEAN. |
| tips | string | No | Parameter description. | The length cannot exceed 2000 characters. The excess part is automatically truncated. |
| min | float | No | Minimum value of the parameter. This parameter is optional when type is set to INT or FLOAT. | A maximum of four decimal places are reserved for floating-point numbers. |
| max | float | No | Maximum value of the parameter. This parameter is optional when type is set to INT or FLOAT. | A maximum of four decimal places are reserved for floating-point numbers. |
| between | boolean | No | Whether the parameter is a range-type numeric parameter. This parameter is optional when type is set to INT or FLOAT. The default value is false. | - |
| items | Array of item objects | No | List of enumerated values. | This parameter is mandatory when type is set to ENUM or LIST. The number of enumerated items must be at least 1. For details, see Table 6. |
| required | boolean | No | Whether the parameter is mandatory. | Set it to true or false. |
| visible | boolean | No | Whether the parameter is visible to the frontend. | Set it to true or false. |
| default | string | No | Default parameter value. | If visible is set to false and required is set to true, the default value must be set. If there are multiple default values, separate them with commas (,), for example, SD,HD. The default value of a range-type numeric parameter (between set to true) is in the format of min;max, for example, 100.0;300.0. |
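
The way the default field encodes numeric values (a single number, or min;max for a range-type parameter) can be illustrated with a short parser. This is a sketch under assumptions, not the platform's implementation: the function name parse_numeric_default is hypothetical, the argument is assumed to be already loaded into a dict, and the min and max bounds are assumed to be inclusive.

```python
def parse_numeric_default(argument: dict) -> tuple:
    """Parse the default of an INT or FLOAT argument into a tuple of numbers.

    Returns a 1-tuple for a plain numeric argument and a (low, high) tuple
    when between is true. Hypothetical helper; bounds are assumed inclusive.
    """
    if argument["type"] not in ("INT", "FLOAT"):
        raise ValueError("only INT and FLOAT arguments are handled")
    cast = int if argument["type"] == "INT" else float

    is_range = argument.get("between", False)  # between defaults to false
    raw = str(argument["default"])
    parts = raw.split(";") if is_range else [raw]
    if is_range and len(parts) != 2:
        raise ValueError(f"range default must look like min;max, got {raw!r}")

    values = tuple(cast(part) for part in parts)

    # Every value must lie within the optional min/max bounds.
    low, high = argument.get("min"), argument.get("max")
    for value in values:
        if low is not None and value < low:
            raise ValueError(f"{value} is below min {low}")
        if high is not None and value > high:
            raise ValueError(f"{value} is above max {high}")

    if is_range and values[0] > values[1]:
        raise ValueError("range default must be ordered as min;max")
    return values
```

With the range-type parameter from the sample configuration (min 1.0, max 500.0, default 100.0;300.0), the function returns the pair (100.0, 300.0).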

Table 6 item configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| name | string | Yes | Name of the enumerated item. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| value | string | Yes | Value of the enumerated item. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| tips | string | No | Description of the enumerated item. | The length cannot exceed 2000 characters. The excess part is automatically truncated. |

Table 7 label configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| key | string | Yes | Key of a tag. | The key can contain a maximum of 128 characters. If the key exceeds 128 characters, the excess part is automatically truncated. The key must be unique among operators. |
| name | string | Yes | Tag name. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| type | string | Yes | Tag type. | Enumerated value. The options are STRING, NUMERIC, ENUM, and OBJECT. Only one option can be selected. |
| min | float | No | Minimum tag value. | This parameter is mandatory when type is set to NUMERIC. |
| max | float | No | Maximum tag value. | This parameter is mandatory when type is set to NUMERIC. |
| items | Array of labelItem objects | No | Tag enumeration list. | This parameter is mandatory when type is set to ENUM. |
| dimensions | Array of labelDimension objects | No | Level-1 tag dimension. | This parameter is mandatory when type is set to OBJECT. |

Table 8 labelItem configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| name | string | Yes | Enumerated item name. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| value | string | Yes | Enumerated value. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |

Table 9 labelDimension configuration specifications

| Parameter Name | Type | Mandatory | Description | Constraints |
| --- | --- | --- | --- | --- |
| key | string | Yes | Key of the level-2 tag. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| name | string | Yes | Level-2 tag name. | The value contains a maximum of 128 characters. The excess part is automatically truncated. |
| type | string | Yes | Level-2 tag type. | Enumerated value. The options are STRING, NUMERIC, and ENUM. Only one option can be selected. |
| min | float | No | Minimum value of the level-2 tag. | This parameter is mandatory when type is set to NUMERIC. |
| max | float | No | Maximum value of the level-2 tag. | This parameter is mandatory when type is set to NUMERIC. |
| items | Array of labelItem objects | No | Enumeration list of level-2 tags. | This parameter is mandatory when type is set to ENUM. |
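
Tables 7 and 9 make several label fields mandatory only for certain values of type. The sketch below lists the fields that are missing from a label definition loaded into a dict, including its level-2 dimensions. The function name missing_label_fields and the dotted path format in the result are hypothetical, and the check covers field presence only, not lengths or value ranges.

```python
# Fields that become mandatory for a given label type (Tables 7 and 9).
REQUIRED_BY_TYPE = {
    "NUMERIC": ("min", "max"),
    "ENUM": ("items",),
    "OBJECT": ("dimensions",),
}


def missing_label_fields(label: dict, allow_object: bool = True) -> list[str]:
    """Return the names of mandatory fields missing from a label definition.

    Hypothetical helper. Level-2 dimensions are checked recursively with
    allow_object=False because Table 9 does not list OBJECT as an option.
    """
    missing = [key for key in ("key", "name", "type") if key not in label]

    label_type = label.get("type")
    allowed = {"STRING", "NUMERIC", "ENUM"}
    if allow_object:
        allowed.add("OBJECT")
    if label_type is not None and label_type not in allowed:
        raise ValueError(f"unsupported label type: {label_type}")

    # Conditionally mandatory fields for the selected type.
    missing += [key for key in REQUIRED_BY_TYPE.get(label_type, ()) if key not in label]

    # Check each level-2 dimension and report its gaps with a path prefix.
    for index, dimension in enumerate(label.get("dimensions", [])):
        nested = missing_label_fields(dimension, allow_object=False)
        missing += [f"dimensions[{index}].{key}" for key in nested]
    return missing
```

For instance, a label of type ENUM that has no items field yields ["items"].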

An operator configuration file is a YAML file that describes the basic information, function parameters, operating environment, and resource requirements of an operator. A sample configuration file is as follows:

id: video_clip    # (Mandatory) Operator abbreviation. The value must start with a letter and can contain a maximum of 128 characters, including letters, digits, and underscores (_). If the value contains more than 128 characters, the excess characters will be truncated.
name: Video clip # (Mandatory) Operator display name. The value can contain a maximum of 128 characters. If the value contains more than 128 characters, the excess characters will be truncated.
description: Splits a long video into multiple clips. # (Optional) Operator description. The value can contain a maximum of 2,000 characters.
author: "xxx Technology Co., Ltd." # (Optional) Developer or team name, for example, Data Team. The value can contain a maximum of 128 characters.
tags:  # (Mandatory) Operator tag, which is used for classification and filtering in specific scenarios.
  language: # (Mandatory) Language that can be processed by the operator. The value is a language code. Multiple options can be selected. Only international standard language codes are supported. The value contains a maximum of 32 characters.
    - zh
    - en
    -...
  format: # (Mandatory) Dataset file format supported by the operator. Multiple options are supported. The value contains a maximum of 32 characters.
      - JSONL
      - TXT
      - CSV
      - HTML
      - MOBI
      - EPUB
      - DOCX
      - PDF
      - MP4
      - AVI
      -...
  category: # (Mandatory) Operator type, which is used to display operators by category on the GUI. Only one option can be selected. The options are as follows:
    - Data extraction
    - Data sampling
    - Data conversion
    - Data filtering
    - Data deduplication
    - Data labeling
    - Other
  modal:   # (Mandatory) Data modality supported by the operator. Multiple options can be selected. The options are as follows:
    - TEXT
    - IMAGE
    - VIDEO
    - AUDIO
    - OTHER
  custom: # Tag of a custom operator. Multiple options can be selected. The value contains a maximum of 32 characters.
    - Data augmentation
    - Pre-labeling
    -...
runtime:
  cpu-arch: # (Mandatory) Supported CPUs
    - ARM
    - X86
  xpu-devices: # XPU models supported by the operator. This parameter is mandatory if NPUs are involved. Multiple options can be selected. The options are as follows:
    - SNT9B
  resources: # Default resources of a single instance. This parameter is mandatory when the operator package type is python.
    - cpu: 16 # CPU processor type
      memory: 256 # Unit: MB
    - cpu: 8 # NPU processor type
      memory: 1024 # Unit: MB
      npu: 1
  environment: python # (Mandatory) Operator package type. The value can be python (pure Python operator package).
  entrypoint: process.py # (Mandatory) Fixed file name process.py when environment is set to python. The value cannot be changed.
# All service parameters are input parameters of the operator. The data types of the parameters include STRING, FLOAT, INT, ENUM (radio button), LIST (check box), OBS, and BOOLEAN.
arguments:
  # Example of parameters of the STRING type
  - key: filter_keywords # [Mandatory]
    name: Filter keyword.
    type: STRING
    tips: The samples that match the keywords will be filtered. Multiple keywords are separated by commas (,).   #Tips on the GUI
    required: true
    visible: true
    default: gambling  # Default value of the parameter. This parameter is optional. Use commas (,) to separate multiple default values.

  # Example of the INT/FLOAT value range type
  - key: length_of_characters # [Mandatory]
    name: filtering duration range
    type: FLOAT # [Mandatory]
    between: true # Whether the parameter is a value range. This field is optional when type is set to INT or FLOAT. The default value is false.
    min: 1.0   # Minimum value of the parameter, which is optional. When type is set to FLOAT, the value can contain a maximum of four decimal places. Extra decimal places are truncated.
    max: 500.0    # Maximum value of the parameter, which is optional.
    tips: The filtering duration range includes the entered boundary value. The value is of the float type, in characters. Samples whose text length is within the specified range are retained. #GUI tips
    required: true
    visible: true
    default: 100.0;300.0 # (Optional) Default maximum and minimum parameter values. If this parameter is not set, min;max is used as the default value.

  # Example of parameters of the INT/FLOAT numeric type
  - key: max_cropping_area_ratio
    name: Maximum cropping area ratio
    type: FLOAT
    between: false # Whether the parameter is of the range type. The default value is false.
    min: 0.0   # Range of the minimum value of the parameter
    max: 100.0    # Range of the maximum value of the parameter
    tips: The value is a float ranging from 0.0 to 100.0, in percentage (%). Samples whose cropped area ratio (cropped video area/original video area) is greater than the value will be filtered out.
    visible: true
    required: true 
    default: 100 # Default value

  # Example of an ENUM parameter
  - key: font_conversion
    name: Text font conversion
    type: ENUM # Only one option can be selected.
    items:
      - name: Simplified Chinese to Traditional Chinese
        value: traditional
      - name: Convert Traditional Chinese to Simplified Chinese
        value: simplified
    required: true
    visible: true
    default: simplified

  # Example of a LIST parameter
  - key: resolution
    name: Resolution
    type: LIST # Multiple options can be selected.
    items: # Parameter options. This parameter is mandatory when type is set to ENUM or LIST.
      - name: Smoothness   #[Mandatory]
        value: SM   #[Mandatory]
        tips: 480 > resolution ≥ 360
      - name: SD
        value: SD
        tips: 720 > resolution ≥ 480
      - name: HD
        value: HD
        tips: 1080 > resolution ≥ 720
    required: true
    visible: true
    default: SD,HD # Use commas (,) to separate multiple default values.

  # Example of an OBS parameter
  - key: sensitive_word
    name: OBS path of the sensitive word dictionary file
    type: OBS
    tips: sensitive word dictionary file
    required: true
    visible: true
    default: NLP/system_resource/sensitive_word.csv # OBS path of the default word dictionary
  # Example of a BOOLEAN parameter
  - key: parse_all
    name: Whether to parse all files
    type: BOOLEAN
    items:
      - name: Yes
        value: true
      - name: No
        value: false
    visible: true
    required: true
    default: false

Operator Package Specifications

Python operator package

Assume that the operator package name is video_clip.tar. The directory structure after the operator package is decompressed is as follows:
+--- video_clip # The directory name must be the same as the tar package name.
| +--- program_package # Python operator directory
| | +--- install.sh # (Optional) Installation script
| | +--- process.py # (Mandatory) Operator code