Updated on 2024-05-23 GMT+08:00

Real-Time Inference

Characteristics

In real-time inference application scenarios, a workload has one or more of the following characteristics:

  • Low latency

    Each request must be handled promptly, within a strict response time (RT) limit. The P90 long-tail latency is typically within hundreds of milliseconds.
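
    The P90 long-tail latency mentioned above can be illustrated with a minimal sketch (the sample values are hypothetical, not measured data):

    ```python
    import math

    def p90_latency(samples_ms):
        """Return the 90th-percentile latency using the nearest-rank method."""
        ordered = sorted(samples_ms)
        # Nearest-rank: ceil(0.9 * n) gives the 1-based rank of the P90 value.
        rank = math.ceil(0.9 * len(ordered))
        return ordered[rank - 1]

    # Illustrative per-request response times in milliseconds.
    samples = [120, 95, 300, 150, 110, 480, 130, 105, 250, 140]
    print(p90_latency(samples))  # 300: 90% of requests finish within 300 ms
    ```

    A service meets the characteristic above when this P90 value stays within hundreds of milliseconds under production load.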

Advantages

FunctionGraph offers the following benefits for real-time inference workloads:

  • Reserved GPU instances

    In addition to the default on-demand GPU instances, FunctionGraph provides reserved GPU instances. You can use them to eliminate the impact of cold start latency to meet the low-latency response requirements of real-time inference services. For details, see Reserved Instance Management.

  • Prioritized quality with optimized cost

    The billing cycle of a reserved GPU instance differs from that of an on-demand GPU instance: a reserved instance is billed for its entire lifecycle, whether active or idle, rather than per request. Its overall cost is higher than that of on-demand GPU instances, but more than 50% lower than that of a self-built long-term GPU cluster.

  • Optimal specifications

    FunctionGraph offers flexible GPU specifications: you can select a GPU type and configure GPU memory based on service requirements, with minimum specifications as small as 1 GB.

  • Scaling for traffic surge

    FunctionGraph provides ample GPU resources that can be scaled out quickly, preventing service disruptions caused by insufficient or delayed GPU computing power during traffic surges.
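
The lifecycle-based billing trade-off described under "Prioritized quality with optimized cost" can be sketched with hypothetical unit prices. The rates below are illustrative assumptions only, not actual FunctionGraph or hardware pricing:

```python
# Hedged cost sketch: a reserved GPU instance is billed for its whole
# lifecycle (active or idle, not per request), while a self-built
# long-term GPU cluster incurs cost around the clock regardless of use.
# All rates are HYPOTHETICAL, chosen only to illustrate the comparison.

HOURS_PER_MONTH = 730

reserved_rate = 2.0    # assumed cost per reserved-instance hour
self_built_rate = 6.0  # assumed per-GPU hour for a self-built cluster
                       # (hardware, power, and operations amortized)

reserved_cost = reserved_rate * HOURS_PER_MONTH
self_built_cost = self_built_rate * HOURS_PER_MONTH

savings = 1 - reserved_cost / self_built_cost
print(f"Monthly savings vs. self-built cluster: {savings:.0%}")
```

Under these assumed rates the reserved instance costs about 67% less per month, consistent with the document's claim of a reduction of over 50% relative to a self-built long-term GPU cluster; the actual figure depends on real pricing and utilization.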