Updated on 2025-08-18 GMT+08:00

Deploying a Model as a Real-Time Service

Real-time inference takes user inputs or queries over the Internet and immediately returns results computed by AI or machine learning models hosted on remote servers or cloud platforms. Because the models run in the cloud, you get fast, reliable analysis and predictions without deploying models locally. This approach works best for tasks that require quick responses and interactions.

ModelArts allows you to deploy a model as a web service that provides a real-time test UI and monitoring capabilities. The deployed real-time service exposes an API that you can call for predictions. Real-time inference is used in situations that need fast responses, such as online intelligent customer service and autonomous driving decisions.

This section describes how to deploy your model as a real-time service on ModelArts and use it for predictions.

Billing

Deploying a service in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running the inference service. Storage resources are billed for storing data in OBS. For details, see Table 1.

Table 1 Billing items

Compute resources (public resource pool)

  • Description: Usage of compute resources. For details, see ModelArts Pricing Details.
  • Billing mode: Pay-per-use
  • Billing formula: Specification unit price x Number of compute nodes x Usage duration (a worked example follows this table)

Compute resources (dedicated resource pool)

  • Description: Fees for dedicated resource pools are paid upfront upon purchase. There are no additional charges for service deployment. For details about dedicated resource pool fees, see Dedicated Resource Pool Billing Items.
  • Billing mode: N/A
  • Billing formula: N/A

Event notification (billed only when enabled)

  • Description: This function uses Simple Message Notification (SMN) to send a message to you when the event you selected occurs. To use this function, enable event notification when creating a training job. For pricing details, see SMN Pricing Details.
  • Billing mode: Pay by actual usage
  • Billing formula:
    • SMS: SMS notifications
    • Email: Email notifications + downstream Internet traffic
    • HTTP or HTTPS: HTTP or HTTPS notifications + downstream Internet traffic

Run logs (billed only when enabled)

  • Description: Log Tank Service (LTS) collects, analyzes, and stores logs. If Runtime Log Output is enabled during service deployment, you are billed when the log data exceeds the LTS free quota. For details, see Log Tank Service Pricing Details.
  • Billing mode: Pay by actual log size
  • Billing formula: After the free quota is exceeded, you are billed based on the actual log volume and retention duration.
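
To make the pay-per-use formula concrete, the following is a minimal sketch of the calculation. The unit price, node count, and usage duration below are assumed placeholder values; real prices depend on the flavor listed in ModelArts Pricing Details.

  # Hypothetical pay-per-use estimate for a public resource pool deployment.
  # The unit price is an assumed placeholder; look up the actual price of your
  # instance flavor on the ModelArts Pricing Details page.
  unit_price_per_hour = 2.50   # assumed specification unit price per node-hour
  compute_nodes = 2            # number of instances running the service
  usage_hours = 8              # how long the service keeps running

  fee = unit_price_per_hour * compute_nodes * usage_hours
  print(f"Estimated compute fee: {fee:.2f}")   # 40.00 under these assumptions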

Constraints

A user can create up to 20 real-time services.

Prerequisites

  • A ModelArts model in the Normal state is available. For details about how to create models, see Creating a Model.
  • The account is not in arrears to ensure available resources for service running.
  • To mount SFS Turbo to a real-time service, first create an SFS Turbo file system and associate it with the service. Follow these steps:
    1. Create an SFS Turbo file system. For details, see Create a File System.
    2. On the Standard Cluster page, click the resource pool where you want to deploy the service and copy the value of the Network field on its details page.
    3. Exit the details page, click the Network tab, and search for the target network using the copied value. Click Interconnect VPC, select the VPC and subnet where your SFS Turbo file system is located, and click OK.

      Alternatively, choose More > Add sfsturbo and select the SFS Turbo file system you want to mount. In this step, the ECS specifications of the SFS Turbo file system must support multiple NICs; otherwise, attaching a NIC fails.

Deploying a Real-Time Service (Synchronous Request)

  1. Log in to the ModelArts console. In the navigation pane, choose Model Deployment > Real-Time Services.
  2. In the real-time service list, click Deploy in the upper left corner.
  3. Configure parameters.
    1. Configure basic parameters. For details, see Table 2.
      Table 2 Basic parameters

      Parameter

      Description

      Name

      Name of a real-time service.

      Auto Stop

      Time for your service to automatically stop running. This helps you avoid unnecessary billing. If you disable this feature, your real-time service will continue running and you will be billed accordingly. By default, this feature is enabled and set to stop the service 1 hour after it starts.

      The options are 1 hour, 2 hours, 4 hours, 6 hours, and Custom. If you select Custom, you can enter any integer from 1 to 24 hours.

      Description

      Brief description for a real-time service.

    2. Enter key information including the resource pool and model configurations. For details, see Table 3.
      Table 3 Parameters

      Parameter

      Sub-Parameter

      Description

      Resource Pool

      Public Resource Pool

      Public resource pool for deploying the real-time service. Public resource pools provide large-scale public computing clusters, which are allocated based on job parameter settings. Resources are isolated by job.

      CPU/GPU resource pools are available for you to select. The pricing for resource pools varies depending on their flavors. For details, see Product Pricing Details. Public resource pools only support the pay-per-use billing mode.

      Dedicated Resource Pool

      Dedicated resource pool for deploying the real-time service. The resources provided in a dedicated resource pool are exclusive and more controllable.

      Select a dedicated resource pool flavor. Physical pools that have logical subpools created are currently not supported.

      Model and Configuration

      Model Source

      Model source for deploying the real-time service. Choose My Model or My Subscriptions as needed.

      • My Model: Models either trained on ModelArts or developed elsewhere and then uploaded to ModelArts.
      • My Subscriptions: Models subscribed from AI Gallery.

      Model and Version

      Model and version that are in the Normal state.

      Traffic Ratio (%)

      Traffic percentage of the current model version. Service call requests are allocated to the current version based on this proportion.

      If you deploy only one version of a model, set this parameter to 100. If you select multiple versions for gray release, ensure that the sum of the traffic ratios of these versions is 100%.

      Instance Flavor

      Instance flavor on which the real-time service runs.

      Select available flavors based on the list displayed on the console. The flavors in gray cannot be used in the current environment.

      If no public resource pool flavors are available, use a dedicated resource pool.

      Deploying the service with the selected flavor involves necessary system overhead, so the actual resources required will be greater than the flavor itself.

      Instances

      Number of instances for the current model version. If you set the number of instances to 1, the standalone computing mode is used. If you set the number of instances to a value greater than 1, the distributed computing mode is used. Select a computing mode based on your actual needs.

      Environment Variable

      Environment variables you need to inject to the pod.

      To ensure data security, do not enter sensitive information, such as plaintext passwords, in environment variables.

      Timeout

      Timeout of a single model, covering both deployment and startup time. The default value is 20 minutes. The value must range from 3 to 120 minutes.

      Add Model and Configuration

      If the selected model has multiple versions, you can add multiple versions and configure traffic ratios for each version. You can use gray release to smoothly upgrade the model version.

      Free compute specifications do not support gray release of multiple versions.

      Mount Storage

      This parameter is displayed when the resource pool is a dedicated resource pool. This feature will mount a storage volume to compute nodes (instances) as a local directory when the service is running. This is a good option to consider when dealing with large input data or models.

      SFS Turbo

      Preparations for mounting SFS Turbo:

      Storage can be mounted only for services deployed in a dedicated resource pool that has been interconnected with a VPC or associated with SFS Turbo.

      • Interconnecting a VPC means connecting the VPC that the SFS Turbo file system belongs to with the dedicated resource pool network. For details, see Interconnect with a VPC.
      • You can associate HPC SFS Turbo file systems with dedicated resource pool networks.

      Parameters:

      • File System Name: Select the target SFS Turbo file system. A cross-region SFS Turbo file system cannot be selected.
      • Mount Path: Enter the mount path of the container, for example, /sfs-turbo-mount/. Select a new directory. If you select an existing directory, any existing files within it will be replaced.

      Notes:

      • A file system can be mounted only once and to only one path. Each mount path must be unique. A maximum of 8 disks can be mounted to a training job.
      • If you need to mount multiple file systems, do not use the same or nested paths, for example, /obs-mount/ and /obs-mount/tmp/.
      • Once you have chosen SFS Turbo, avoid deleting the interconnected VPC or disassociating SFS Turbo. Otherwise, mounting will not be possible. When you mount the backend OBS storage on the SFS Turbo page, make sure to set the client's umask permission to 777 for normal use.

      Priority

      N/A

      This function is supported only for dedicated resource pools, including logical resource pools, new physical resource pools, and logical subpools. Existing physical resource pools do not support this function.

      You can set this parameter to preferentially schedule high-priority services.

      Priority values range from 1 (lowest) to 3 (highest). When training and inference jobs share the same pool and Preemption is enabled on the training job creation page (for details, see Creating a Production Training Job (New Version)), a higher-priority inference task can preempt a lower-priority training job.

      Traffic Limit

      N/A

      Maximum number of times a service can be accessed within a second. You can configure this parameter as needed.

      WebSocket

      N/A

      Specifies whether to deploy the real-time service as a WebSocket service, which changes the communication protocol of the service from HTTP/HTTPS to WebSocket.

      WebSocket enables bidirectional, instant communication between clients and servers, making it ideal for applications like real-time predictions and chatbot interactions.

      Once the protocol switches to WebSocket, the service's API URL changes to a WebSocket address. Clients can then connect to the service and exchange data using a WebSocket client (a minimal client sketch is provided after this procedure).

      Constraints:

      This feature is supported only if the model is WebSocket-compliant and comes from a container image.

      After this feature is enabled, Traffic Limit and Data Collection cannot be set.

      This parameter cannot be modified after the service is deployed.

      For details about WebSocket real-time services, see Full-Process Development of WebSocket Real-Time Services.

      Application Authentication

      Application

      Specifies whether to control access to the real-time service through app authentication.

      App authentication verifies a client's identity using their AppCode and AppSecret. It allows only authorized apps to access service APIs.

      App authentication allows better access control and boosts service security.

      This feature is disabled by default. To enable this feature, see Accessing a Real-Time Service Through App Authentication for details and configure parameters as required.

    3. (Optional) Configure advanced settings.
      Table 4 Advanced settings

      Parameter

      Description

      Auto Restart

      Specifies whether to automatically restart a service instance when a fault occurs.

      After this function is enabled, the system automatically redeploys the real-time service when detecting that the real-time service is abnormal. For details, see Configuring Auto Restart upon a Real-Time Service Fault.

      Auto restart boosts service reliability, minimizes downtime, and handles hardware failures efficiently. Use this function for tasks needing reliable and stable performance.

      Tags

      ModelArts can work with Tag Management Service (TMS). When creating resource-consuming tasks in ModelArts, for example, training jobs, configure tags for these tasks so that ModelArts can use tags to manage resources by group.

      You can select a predefined TMS tag from the tag drop-down list or customize a tag. Predefined tags are available to all service resources that support tags. Custom tags are available only to the service resources of the user who has created the tags.

      For details about how to use tags, see Using TMS Tags to Manage Resources by Group.

  4. After confirming the entered information, deploy the service as prompted. Deployment generally takes several minutes to tens of minutes, depending on the amount of your data and resources.

    You can go to the real-time service list to check if the deployment is complete. Once the service status changes from Deploying to Running, the service is deployed.

    Once a real-time service is deployed, it will start immediately.
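
If you enabled the WebSocket parameter during deployment, clients connect to the service over WebSocket instead of HTTP/HTTPS. The following is a minimal client sketch using the third-party websocket-client package; the service address, authentication header, and message format are assumptions here, since the actual WebSocket URL is shown on the service details page and the message format is defined by your model.

  # Minimal WebSocket client sketch (pip install websocket-client).
  # The URL, token, and message payload below are placeholders, not actual values.
  from websocket import create_connection

  ws_url = "wss://<modelarts-endpoint>/<service-path>"   # copy the real address from the service details page
  headers = {"X-Auth-Token": "<IAM-token>"}              # assumed token-based authentication

  ws = create_connection(ws_url, header=headers, timeout=40)
  try:
      ws.send('{"prompt": "hello"}')   # request message; format is defined by your model
      reply = ws.recv()                # response message from the service
      print(reply)
  finally:
      ws.close()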

Testing Real-Time Service Prediction

After a model is deployed as a real-time service, you can debug code or add files for testing in the Prediction tab. Due to the limitation of API Gateway, the duration of a single prediction cannot exceed 40s.

This feature is intended for commissioning. For actual production, call the service API instead. Depending on the authentication method, see Accessing a Real-Time Service Through Token-based Authentication, Accessing a Real-Time Service Through AK/SK-based Authentication, or Accessing a Real-Time Service Through App Authentication.
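
For reference, a production client might call the service API as in the following minimal sketch, which assumes token-based authentication. The API URL, token, and request body are placeholders; the real URL and input format come from the service's Usage Guides tab.

  # Minimal sketch of calling a real-time service API with token-based authentication.
  # api_url, token, and the payload below are placeholders; copy the real API URL from
  # the Usage Guides tab and build the body according to your model's input definition.
  import requests

  api_url = "https://<modelarts-inference-endpoint>/<service-path>"  # placeholder
  token = "<IAM-token>"  # obtained as described in the token-based authentication guide

  headers = {"X-Auth-Token": token, "Content-Type": "application/json"}
  payload = {"data": {"req_data": [{"feature_1": 1.0}]}}  # example body; actual schema depends on your model

  response = requests.post(api_url, json=payload, headers=headers, timeout=40)
  response.raise_for_status()
  print(response.json())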

You can test the service in either of two ways, depending on the input request defined by the model: with JSON text or with a file.

  • JSON text prediction: If the input of the deployed model is JSON text, you can enter JSON code in the Prediction tab for testing.
  • File Prediction: If your model uses files as input, you can add images, audios, or videos into the Prediction tab to test the service.
    • The size of an input image must be less than 8 MB.
    • The maximum size of a request body for JSON text prediction is 8 MB.
    • Due to the limitation of API Gateway, the duration of a single prediction cannot exceed 40s.
    • The following image types are supported: png, psd, jpg, jpeg, bmp, gif, webp, svg, and tiff.
    • If you use Ascend flavors for service deployment, you cannot predict transparent .png images because Ascend only supports RGB-3 images.

After a service is deployed, obtain its input parameters on the Usage Guides tab of the service details page.

Figure 1 Usage Guides

The input parameters displayed in the Usage Guides tab depend on the model source that you select.

  • If your meta model comes from a built-in algorithm, the input and output parameters are defined by ModelArts. For details, see the Usage Guides tab. In the Prediction tab, enter the corresponding JSON text or file for service testing.
  • If you use a custom meta model and your own inference code and configuration file (see Specifications for Writing the Model Configuration File), the Usage Guides tab will only display your configuration file. The following figure shows the mapping between the input parameters in the Usage Guides tab and the configuration file.
    Figure 2 Mapping between the configuration file and Usage Guides

The prediction methods for different input requests are as follows:

  • JSON Text Prediction
    1. Log in to the ModelArts console and choose Model Deployment > Real-Time Services.
    2. Click the name of the target service to access its details page. Enter the inference code in the Prediction tab, and click Predict to perform prediction.

  • File Prediction
    1. Log in to the ModelArts console and choose Model Deployment > Real-Time Services.
    2. Click the name of the target service to access its details page. In the Prediction tab, click Upload and select a test file. After the file is uploaded, click Predict to perform a prediction test. In Figure 3, the label, position coordinates, and confidence score are displayed.
      Figure 3 Image prediction
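
For file-based models, the same prediction can also be performed programmatically. The following minimal sketch uploads an image as multipart form data; the API URL, authentication token, and form field name are assumptions, since the actual field name is defined by your model's configuration file, and the image must be smaller than 8 MB.

  # Minimal sketch of a file-based prediction request with token-based authentication.
  # The URL, token, and form field name ("images") are placeholders; the real field
  # name comes from your model's configuration file.
  import requests

  api_url = "https://<modelarts-inference-endpoint>/<service-path>"  # placeholder
  headers = {"X-Auth-Token": "<IAM-token>"}

  with open("test.jpg", "rb") as f:                       # test image, smaller than 8 MB
      files = {"images": ("test.jpg", f, "image/jpeg")}   # field name defined by the model
      response = requests.post(api_url, files=files, headers=headers, timeout=40)

  print(response.status_code, response.text)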

Using Cloud Shell to Debug a Real-Time Service Instance Container

You can use Cloud Shell provided by the ModelArts console to log in to the instance container of a running real-time service.

Constraints:

  • Cloud Shell can only access a container when the associated real-time service is deployed within a dedicated resource pool.
  • Cloud Shell can only access a container when the associated real-time service is running.
  1. Log in to the ModelArts console. In the navigation pane, choose Model Deployment > Real-Time Services.
  2. On the real-time service list page, click the name or ID of the target service.
  3. Click the Cloud Shell tab and select the target model version and compute node. When the connection status indicates a successful connection, you have logged in to the instance container.

    If the server disconnects due to an error or remains idle for 10 minutes, you can select Reconnect to regain access to the pod.
    Figure 4 Cloud Shell

    If you encounter a path display issue when logging in to Cloud Shell, press Enter to resolve the problem.
    Figure 5 Path display issue

  4. After logging in to the container, execute the necessary debugging commands in its terminal. Example:

    View logs:

    tail -f /var/log/app.log

    Check the service status:

    systemctl status app

    Run a custom script:

    ./debug_script.sh

  5. After the debugging, exit the container:

    exit

    After returning to the Cloud Shell terminal, you can view the debugging result or log file.