How Do I Solve the Memory Overflow Problem During Model Evaluation?
Resources such as memory and GPU memory are limited. If a large dataset is read into memory all at once, an out-of-memory error may be reported. The following message in the training job log indicates that memory was exhausted and the process was forcibly killed:
"/home/ma-user/run_train_v2.sh: line 113: 94 Killed stdbuf -oL -eL ${ma_program_executor} "$boot_file" $prog_args >> "${training_log_file}" 2>&1"
Solution
- Method 1: Modify the code for loading data to the memory.
Read one image or one batch of images at a time (instead of reading all images into memory at once) and then perform inference, so that only a single image or batch occupies memory at any moment. For details, see the following code sample.
Advanced features of image classification require that all images be read into memory at a time, so memory overflow can still occur on large datasets. To work around this, you can randomly select a subset of the dataset for model evaluation.
Original code, which preprocesses all data and then loads the data to the memory:
  for img_path in data_file_list:
      img = _preprocess(img_path)
      img_list.append(img)

  New code, which reads one image (or one batch of images) at a time and runs inference before loading the next:

  for img_path in data_file_list:
      img = _preprocess(img_path)
      pred_output = sess.run([y], {x: img})
      pred_output = softmax(pred_output[0])
      pred_list.append(pred_output[0].tolist())

- Method 2: Randomly select some datasets for model evaluation.
Randomly select a subset of the dataset for model evaluation. If the sample size is large enough, the randomly selected subset can effectively reflect the characteristics of the overall dataset while reducing the evaluation time.
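Method 2 can be sketched as follows. This is a minimal illustration, not code from the original article: the sample_dataset helper, the sample_ratio parameter, and the example file names are all assumptions introduced here; data_file_list is the list of image paths from the Method 1 sample.

```python
import random

def sample_dataset(data_file_list, sample_ratio=0.1, seed=42):
    """Randomly select a subset of the dataset for evaluation.

    A fixed seed makes the selection reproducible across runs,
    so repeated evaluations compare like with like.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(data_file_list) * sample_ratio))
    return rng.sample(data_file_list, sample_size)

# Example: evaluate on 10% of a 10,000-image dataset (hypothetical file names).
data_file_list = ["img_%05d.jpg" % i for i in range(10000)]
eval_subset = sample_dataset(data_file_list, sample_ratio=0.1)
print(len(eval_subset))  # 1000
```

The selected subset can then be fed to the per-image inference loop from Method 1, combining both mitigations.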