Updated on 2025-05-28 GMT+08:00

Deploying a Model as a Real-Time Service

After a model is ready, deploy it as a real-time service. Then, you can call the service for prediction.

Prerequisites

  • A ModelArts model in the Normal state is available.
  • To deploy a service, the image owner must have the te_admin permission. Otherwise, the error message "failed to set image shared, please check the agency permission" will be displayed during service deployment.

Procedure

  1. Log in to the ModelArts console. In the navigation pane, choose Service Deployment > Real-Time Services.
  2. In the real-time service list, click Deploy in the upper left corner.
  3. Configure parameters.
    1. Configure basic parameters. For details, see Table 1.
      Table 1 Basic parameters

      • Name: Name of the real-time service.
      • Description: Brief description of the real-time service.

    2. Enter key information including the resource pool and model configurations. For details, see Table 2.
      Table 2 Parameters

      • Resource Pool
        • Public Resource Pool: Select from the available public CPU/GPU resource pools.
        • Dedicated Resource Pool: Select a dedicated resource pool flavor. Physical pools that contain logical subpools are not currently supported.

      • Model and Configuration
        • Model Source: Choose My Model as needed.
        • Model and Version: Select a model and version in the Normal state.
        • Traffic Ratio (%): Proportion of service requests routed to the current model version. If you deploy only one version, set this parameter to 100. If you select multiple versions for a gray release, the traffic ratios of all selected versions must add up to 100%.
        • Instance Flavor: Select an available flavor from the list displayed on the console. Flavors shown in gray cannot be used in the current environment. If no public resource pool flavor is available, use a dedicated resource pool instead.
          NOTE:
          Deploying the service with the selected flavor incurs necessary system overhead, so the resources actually consumed are greater than the flavor specification.
        • Instances: Number of instances for the current model version. If you set this parameter to 1, the standalone computing mode is used; if you set it to a value greater than 1, the distributed computing mode is used. Select a computing mode based on your actual needs.
        • Timeout: Timeout of a single model, covering both deployment and startup. The default value is 20 minutes. The value ranges from 3 to 120 minutes.
        • Add Model and Configuration: If the selected model has multiple versions, you can add several versions and set a traffic ratio for each, then use gray release to smoothly upgrade the model version. For an API-based example of deploying multiple versions with traffic ratios, see the sketch after this procedure.
          NOTE:
          Free compute specifications do not support gray release of multiple versions.

      • Mount Storage
        This parameter is displayed only when a dedicated resource pool is used. It mounts a storage volume to the compute nodes (instances) as a local directory while the service is running, which is useful for large input data or models.
        • OBS Bucket
          - Source Path: Select an OBS bucket path. Cross-region OBS buckets cannot be selected. You can add up to 10 paths.
          - Mount Path: Enter the container mount path, for example, /obs-mount/. It is good practice to mount to a new directory. Avoid existing directories or system directories with strict permissions.
        • OBS Parallel File System
          - Source Path: Select a storage path. A cross-region OBS parallel file system cannot be selected.
          - Mount Path: Enter the container mount path, for example, /obs-mount/.
            • Select a new directory. If you select an existing directory, its files will be overwritten. OBS mounting allows you to add, view, and modify files in the mount directory, but not delete them. To delete files, delete them manually in the OBS parallel file system.
            • Mount an empty directory to the container. If the directory is not empty, ensure that it contains no files that affect container startup; otherwise those files are replaced, the container cannot start normally, and the workload may fail to deploy.
            • The mount path must start with a slash (/) and can contain a maximum of 1,024 characters, including letters, digits, and the special characters _ and -.
        • SFS Turbo
          - File System Name: Select the target SFS Turbo file system. A cross-region SFS Turbo file system cannot be selected.
          - Mount Path: Enter the container mount path, for example, /sfs-turbo-mount/. Select a new directory; if you select an existing directory, any existing files in it will be replaced.
          NOTE:
          - A file system can be mounted only once and to only one path. Each mount path must be unique. A maximum of 8 disks can be mounted.
          - Storage mounting is allowed only for services deployed in a dedicated resource pool whose network has been interconnected with a VPC or associated with SFS Turbo.
            • Interconnecting with a VPC means interconnecting the VPC to which SFS Turbo belongs with the dedicated resource pool network. For details, see Interconnect with a VPC.
            • You can associate HPC SFS Turbo file systems with dedicated resource pool networks.
          - If you mount multiple file systems, do not use identical or nested paths, for example, /obs-mount/ and /obs-mount/tmp/.
          - After selecting SFS Turbo, do not delete the interconnected VPC or disassociate the SFS Turbo file system; otherwise, mounting will fail. If you mount backend OBS storage on the SFS Turbo page, set the umask permission of the client to 777 for normal use.
  4. Confirm the entered information and deploy the service as prompted. Deployment takes some time, from several minutes to tens of minutes, depending on the amount of your data and resources.

    Once a real-time service is deployed, it will start immediately.

    You can go to the real-time service list to check if the deployment is complete. Once the service status changes from Deploying to Running, the service is deployed.
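If you prefer to script deployments, the same configuration can be submitted through the ModelArts service management API. The following Python sketch is a minimal, hedged example assuming the POST /v1/{project_id}/services API described in the ModelArts API Reference; the endpoint, project ID, model IDs, and flavor name are placeholders, and the exact request fields should be verified against the API reference for your region.

import time
import requests

# Placeholders: replace with your region endpoint, project ID, and IAM token.
ENDPOINT = "https://modelarts.cn-north-4.myhuaweicloud.com"
PROJECT_ID = "<project-id>"
TOKEN = "<IAM token>"  # see the token example under Testing Real-Time Service Prediction
HEADERS = {"X-Auth-Token": TOKEN, "Content-Type": "application/json"}

# Two versions of the same model deployed for gray release;
# the traffic ratios (weight) of all versions must add up to 100.
body = {
    "service_name": "demo-realtime-service",
    "infer_type": "real-time",
    "config": [
        {"model_id": "<model-version-1-id>", "weight": 80,
         "specification": "modelarts.vm.cpu.2u", "instance_count": 1},
        {"model_id": "<model-version-2-id>", "weight": 20,
         "specification": "modelarts.vm.cpu.2u", "instance_count": 1},
    ],
}

resp = requests.post(f"{ENDPOINT}/v1/{PROJECT_ID}/services", json=body, headers=HEADERS)
resp.raise_for_status()
service_id = resp.json()["service_id"]

# Poll the service until its status changes from "deploying" to another state,
# for example "running".
while True:
    detail = requests.get(f"{ENDPOINT}/v1/{PROJECT_ID}/services/{service_id}",
                          headers=HEADERS).json()
    if detail["status"] != "deploying":
        print("Service status:", detail["status"])
        break
    time.sleep(30)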

Testing Real-Time Service Prediction

After a model is deployed as a real-time service, you can debug code or add files for testing in the Prediction tab. Depending on the input type defined by the model, you can test the service with either JSON text or a file.

  • JSON text prediction: If the input of the deployed model is JSON text, you can enter JSON code in the Prediction tab for testing.
  • File Prediction: If your model uses files as input, you can add images, audios, or videos into the Prediction tab to test the service.
  • The size of an input image must be less than 8 MB.
  • The maximum size of a request body for JSON text prediction is 8 MB.
  • Due to the limitation of API Gateway, the duration of a single prediction cannot exceed 40s.
  • The following image types are supported: png, psd, jpg, jpeg, bmp, gif, webp, svg, and tiff.
  • If you use Ascend flavors for service deployment, you cannot predict transparent .png images because Ascend only supports RGB-3 images.
  • This feature is intended for commissioning only. In production, call the service through its APIs instead, for example, by Accessing a Real-Time Service Through Token-based Authentication (see the token sketch after this list).
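As a hedged illustration of token-based authentication, the following Python sketch obtains an IAM token that can later be passed in the X-Auth-Token header when calling the service. The IAM endpoint, region, account, and user names are placeholders; for the authoritative request format, see Accessing a Real-Time Service Through Token-based Authentication.

import requests

# Placeholder region; use the IAM endpoint of the region where the service is deployed.
IAM_ENDPOINT = "https://iam.cn-north-4.myhuaweicloud.com"

body = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "<iam-user-name>",
                    "password": "<iam-password>",
                    "domain": {"name": "<account-name>"},
                }
            },
        },
        # Scope the token to the project (region) where the real-time service runs.
        "scope": {"project": {"name": "cn-north-4"}},
    }
}

resp = requests.post(f"{IAM_ENDPOINT}/v3/auth/tokens", json=body)
resp.raise_for_status()
token = resp.headers["X-Subject-Token"]  # pass this value as the X-Auth-Token header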

After a service is deployed, obtain the input parameters of the service on the Usage Guides tab of the service details page.

The input parameters displayed in the Usage Guides tab depend on the model source that you select.

  • If you use a custom meta model with your own inference code and configuration file (see Specifications for Writing the Model Configuration File), the Usage Guides tab displays only the content of your configuration file; the input parameters shown there map directly to the fields defined in that file.
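For reference, the sketch below shows what such a configuration file (config.json) might look like for a model that takes an image file as input. The field values are purely illustrative assumptions; the authoritative field definitions are in Specifications for Writing the Model Configuration File.

{
  "model_type": "TensorFlow",
  "model_algorithm": "image_classification",
  "apis": [
    {
      "protocol": "https",
      "url": "/",
      "method": "post",
      "request": {
        "Content-type": "multipart/form-data",
        "data": {
          "type": "object",
          "properties": {
            "images": {"type": "file"}
          }
        }
      },
      "response": {
        "Content-type": "application/json",
        "data": {
          "type": "object",
          "properties": {
            "predicted_label": {"type": "string"},
            "scores": {"type": "array"}
          }
        }
      }
    }
  ]
}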

The prediction methods for different input requests are as follows:

  • JSON Text Prediction
    1. Log in to the ModelArts console and choose Service Deployment > Real-Time Services.
    2. Click the name of the target service to access its details page. Enter the JSON prediction code in the Prediction tab and click Predict to perform prediction. To call the same API outside the console, see the sketch after these steps.

  • File Prediction
    1. Log in to the ModelArts console and choose Service Deployment > Real-Time Services.
    2. Click the name of the target service to access its details page. In the Prediction tab, click Upload and select a test file. After the file is uploaded, click Predict to perform a prediction test.
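Outside the console, both prediction styles can be exercised against the API URL shown on the Usage Guides tab. The Python sketch below is a hedged example: the API URL, the JSON payload, and the form field name (images) are placeholders that must match the input defined in your model's configuration file, and the token comes from the token-based authentication example above.

import requests

API_URL = "<API URL from the Usage Guides tab>"   # placeholder
TOKEN = "<IAM token>"                             # obtained as shown earlier
headers = {"X-Auth-Token": TOKEN}

# JSON text prediction: the body must follow the request schema defined by the model.
json_resp = requests.post(API_URL,
                          json={"data": {"req_data": [{"feature_1": 1.0}]}},  # illustrative payload
                          headers=headers)
print(json_resp.status_code, json_resp.text)

# File prediction: upload a test image (smaller than 8 MB) as multipart/form-data.
with open("test.jpg", "rb") as f:
    file_resp = requests.post(API_URL, files={"images": f}, headers=headers)
print(file_resp.status_code, file_resp.text)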