Quasi-Real-Time Inference

This section describes what quasi-real-time inference is and how to build a cost-effective quasi-real-time inference service with on-demand GPU instances.

Characteristics

In quasi-real-time inference scenarios, a workload has one or more of the following characteristics:

  • Infrequent calls

    The number of daily calls varies widely, from a few to tens of thousands, and the GPUs are typically in use for less than 6 to 10 hours per day, leaving a large share of the resources idle.

  • Time-consuming processing

    Processing a quasi-real-time inference request usually takes seconds to minutes. For example, a typical computer vision task completes within seconds, while a typical video processing or AIGC task may take a few minutes.

  • Tolerance of cold start

    The service can tolerate GPU cold start latency, or its traffic pattern keeps cold starts rare. A common mitigation is to load the model once per instance and reuse it for subsequent requests, as the sketch after this list shows.
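
For cold-start-sensitive workloads, a useful pattern is to cache the model at module scope so that the load cost is paid once per instance instead of once per request. The following is a minimal sketch, assuming a TorchScript model; the model path, input format, and response shape are hypothetical placeholders, and handler(event, context) follows the common FunctionGraph Python entry point signature.

  # Minimal sketch: cache the model across warm invocations of one instance.
  # The model path and input format are hypothetical placeholders.
  import json
  import torch

  _model = None  # survives between requests handled by the same instance

  def _get_model():
      # Loaded at the first (cold) request; warm requests skip the load.
      global _model
      if _model is None:
          _model = torch.jit.load("/opt/model/model.pt", map_location="cuda")
          _model.eval()
      return _model

  def handler(event, context):
      model = _get_model()
      # Hypothetical input: a JSON body carrying a nested list of floats.
      data = torch.tensor(json.loads(event["body"])["input"], device="cuda")
      with torch.no_grad():
          output = model(data)
      return {"statusCode": 200, "body": json.dumps({"result": output.tolist()})}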

Advantages

FunctionGraph offers the following benefits for quasi-real-time inference workloads:

  • Cloud-native serverless

    FunctionGraph provisions on-demand GPU resources by default. GPU instances scale automatically with the number of service requests, from a minimum of 0 instances to a configurable maximum.

  • Optimal specifications

    FunctionGraph provides a range of GPU specifications, allowing you to select a GPU type and configure the GPU memory based on service requirements.

  • Cost-effectiveness

    The pay-per-use billing mode can cut costs by more than 70% for workloads with low GPU resource utilization, as the sketch below illustrates.
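
The saving comes from paying only for the seconds in which instances are actually processing requests, since idle instances scale in to zero. The following back-of-the-envelope sketch illustrates the arithmetic with purely hypothetical prices (not actual FunctionGraph or GPU list prices):

  # All prices are hypothetical placeholders, not real list prices.
  DEDICATED_PRICE_PER_HOUR = 2.0        # dedicated GPU VM, billed 24 hours a day
  SERVERLESS_PRICE_PER_SECOND = 0.0006  # pay-per-use GPU, billed only while busy

  busy_hours_per_day = 6                # GPU busy time (see "Infrequent calls")

  dedicated_daily = DEDICATED_PRICE_PER_HOUR * 24
  serverless_daily = SERVERLESS_PRICE_PER_SECOND * busy_hours_per_day * 3600

  saving = 1 - serverless_daily / dedicated_daily
  print(f"dedicated: {dedicated_daily:.2f}/day, "
        f"serverless: {serverless_daily:.2f}/day, saving: {saving:.0%}")
  # With these numbers: 48.00/day vs 12.96/day, a 73% saving.

The break-even point shifts with the per-second price and the busy hours; the pay-per-use model wins whenever the GPU would otherwise sit idle for most of the day.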