Real-Time Inference

Updated on 2024-05-23 GMT+08:00

In real-time inference application scenarios, a workload has one or more of the following characteristics:

Low latency
Each request must be handled promptly, within a strict response time (RT) limit. The 90th-percentile (P90) long-tail latency is typically within hundreds of milliseconds.
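
As a rough illustration, assuming the 90% latency figure above refers to the 90th-percentile (P90) response time, the following sketch computes a P90 value from a set of measured latencies using the nearest-rank method. The function name and the sample values are hypothetical and are not part of FunctionGraph.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank position (1-based), clamped to at least 1.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Hypothetical per-request response times in milliseconds.
latencies_ms = [120, 95, 180, 240, 110, 130, 900, 150, 170, 140]
p90 = percentile(latencies_ms, 90)
print(f"P90 latency: {p90} ms")
```

In this made-up sample, a single slow request (900 ms) does not affect the P90 value, which is why a percentile is commonly used to describe long-tail latency instead of the maximum.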

FunctionGraph offers the following benefits for real-time inference workloads:

Reserved GPU instances
In addition to the default on-demand GPU instances, FunctionGraph provides reserved GPU instances. You can use them to eliminate the impact of cold start latency to meet the low-latency response requirements of real-time inference services. For details, see Reserved Instance Management.
Prioritized quality with optimized cost
A reserved GPU instance is billed differently from an on-demand GPU instance: it is billed for its entire lifecycle, whether it is active or idle, rather than per request. The overall cost is therefore higher than that of on-demand GPU instances, but more than 50% lower than that of a self-built, long-running GPU cluster.
Optimal specifications
FunctionGraph lets you select the GPU type that best fits your service and configure GPU memory in specifications as small as 1 GB, based on service requirements.
Scaling for traffic surge
FunctionGraph offers ample GPU resources that can be quickly scaled out to prevent service disruptions due to insufficient or delayed GPU computing power during traffic surges.
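
The difference between the two billing models described above can be sketched with simple arithmetic. This is only an illustration under stated assumptions: the rate and durations are made-up placeholder values, not FunctionGraph prices, and the on-demand model is assumed here to charge only for the time spent serving requests.

```python
def reserved_cost(lifetime_s, rate_per_s):
    """Reserved instance: billed for its whole lifecycle, active or idle."""
    return lifetime_s * rate_per_s


def on_demand_cost(busy_s, rate_per_s):
    """On-demand instance (assumption): billed only while serving requests."""
    return busy_s * rate_per_s


# Hypothetical values: one hour of lifetime, 15 minutes of actual work,
# and an arbitrary rate of 2 cost units per second.
RATE_PER_S = 2
LIFETIME_S = 3600
BUSY_S = 900

reserved = reserved_cost(LIFETIME_S, RATE_PER_S)
on_demand = on_demand_cost(BUSY_S, RATE_PER_S)
print(f"reserved: {reserved} units, on-demand: {on_demand} units")
```

With these placeholder numbers the reserved instance costs more because its idle time is also billed, which is the trade-off for avoiding cold starts.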