
Batch and Timeout

For most models, especially small ones, batching the input improves inference performance on the chip. Batch inference greatly increases data throughput and chip utilization. Although it introduces some latency, the overall performance of the system improves. Therefore, to build a high-performance application, use the largest batch size that the latency requirement allows.

To make batch operations simpler and more flexible, the Matrix module supports a timeout mechanism. You can set is_repeat_timeout_flag in the config file to enable or disable timeout waiting, and set wait_inputdata_max_time to specify the timeout duration. If the timeout parameters are set, the system passes a null pointer to the Process function once the timeout expires. You need to implement the timeout handling logic yourself: store the received data in a queue and perform inference only when enough data has accumulated to form a batch. You can use the hiai::MultiTypeQueue queue provided by the Matrix module for this purpose. To prevent data starvation, a timeout duration must be specified through the timeout setting API based on the latency requirement of the application. When the timeout duration expires, the Matrix module calls the main processing flow of the engine, where you can fetch and process the data already in the queue so that it does not starve.
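The following C++ sketch illustrates this pattern. It is not the exact Matrix engine interface: the InferenceEngine class, the BatchData type, the RunBatchInference() helper, the kBatchSize constant, and the use of a plain std::queue (instead of hiai::MultiTypeQueue) are assumptions used for illustration. Only the behavior follows the description above: a null pointer signals a timeout, and data is queued until a full batch is available.

```
#include <cstddef>
#include <memory>
#include <queue>
#include <vector>

// Hypothetical per-sample input; the real type depends on your engine definition.
struct BatchData { std::vector<float> tensor; };

constexpr size_t kBatchSize = 8;  // Largest batch size allowed by the latency requirement.

class InferenceEngine {
public:
    // Called by the framework for each input. When the configured timeout
    // (wait_inputdata_max_time) expires, Process is invoked with a null pointer.
    void Process(const std::shared_ptr<BatchData>& input) {
        if (input == nullptr) {
            // Timeout: infer on whatever has accumulated to avoid data starvation.
            if (!pending_.empty()) {
                RunBatchInference(DrainQueue());
            }
            return;
        }

        pending_.push(input);
        // Normal path: wait until a full batch is available before inferring.
        if (pending_.size() >= kBatchSize) {
            RunBatchInference(DrainQueue());
        }
    }

private:
    std::vector<std::shared_ptr<BatchData>> DrainQueue() {
        std::vector<std::shared_ptr<BatchData>> batch;
        while (!pending_.empty()) {
            batch.push_back(pending_.front());
            pending_.pop();
        }
        return batch;
    }

    // Placeholder: hand the assembled batch to the model manager for inference.
    void RunBatchInference(const std::vector<std::shared_ptr<BatchData>>& batch) {
        (void)batch;
    }

    std::queue<std::shared_ptr<BatchData>> pending_;
};
```

In an actual Matrix application, the timeout itself is enabled through is_repeat_timeout_flag and wait_inputdata_max_time in the config file, and the hiai::MultiTypeQueue provided by the module can take the place of the std::queue shown here.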

If the model input consists of multiple batches and you want to send the data of each batch to the model manager (inference engine) for inference, add the following code logic (see the sketch after this list):

  1. Allocate a buffer on the device side to store the data of each batch.
  2. As the inference engine on the device side receives the data of each batch, combine the data and store it in the buffer allocated in step 1.
  3. Use the batched data stored in the buffer for inference only when the number of batches received on the device side equals the number of batches required for model inference.
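A minimal sketch of this accumulation logic is shown below. The BatchCollector class and the OnBatchReceived()/RunModelInference() helpers are hypothetical names used for illustration, and a host-side std::vector stands in for the device-side buffer; the real allocation and inference calls depend on the model manager API you use.

```
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical collector that assembles per-batch data into one buffer and
// triggers inference only when all batches required by the model have arrived.
class BatchCollector {
public:
    BatchCollector(size_t batchesPerInference, size_t bytesPerBatch)
        : required_(batchesPerInference),
          bytesPerBatch_(bytesPerBatch),
          buffer_(batchesPerInference * bytesPerBatch) {}  // Step 1: allocate the buffer.

    // Step 2: copy the data of one received batch into its slot in the buffer.
    void OnBatchReceived(const void* data, size_t size) {
        if (received_ >= required_ || size != bytesPerBatch_) {
            return;  // Unexpected batch; real code should report an error here.
        }
        std::memcpy(buffer_.data() + received_ * bytesPerBatch_, data, size);
        ++received_;

        // Step 3: infer only when the received batch count matches what the model needs.
        if (received_ == required_) {
            RunModelInference(buffer_.data(), buffer_.size());
            received_ = 0;  // Reset for the next round of batches.
        }
    }

private:
    // Placeholder: pass the combined multi-batch input to the model manager.
    void RunModelInference(const void* combined, size_t size) {
        (void)combined;
        (void)size;
    }

    size_t required_;
    size_t bytesPerBatch_;
    std::vector<unsigned char> buffer_;  // In real code, this would be device memory.
    size_t received_ = 0;
};
```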