Why Is Data Read Efficiency Low When a Large Number of Data Files Are Read During Training?
If a dataset consists of a large number of small files and is stored in OBS, each file must be read from OBS separately during training. The training process then spends much of its time waiting for these reads, which results in low read efficiency.
Solution
- Compress the massive number of small files into a single package on your local PC, for example, a .zip package (a packaging sketch follows this list).
- Upload the package to OBS.
- During training, download this package from OBS to the /cache directory of the training environment. Perform this operation only once.
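The packaging step can be done with standard tools. As a minimal sketch (the dataset directory and package name below are placeholders, not part of the original example), Python's standard zipfile module can bundle the small files into one package:

import os
import zipfile

# Hypothetical paths; replace with your own dataset directory and package name.
DATASET_DIR = 'dataset'
PACKAGE_PATH = 'data.zip'

# Add every file under DATASET_DIR to a single .zip package, preserving
# relative paths so the directory structure survives decompression.
with zipfile.ZipFile(PACKAGE_PATH, 'w', zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(DATASET_DIR):
        for name in files:
            file_path = os.path.join(root, name)
            zf.write(file_path, os.path.relpath(file_path, DATASET_DIR))

The resulting package can then be uploaded to OBS, for example, using OBS Browser+ or the obsutil tool.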
For example, you can use mox.file.copy_parallel to download the .zip package to the /cache directory, decompress it, and then read the files for training.
...
import os
import moxing as mox
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

tf.flags.DEFINE_string('data_url', '<obs_file_path>/data.zip', 'dataset directory.')
FLAGS = tf.flags.FLAGS

TMP_CACHE_PATH = '/cache/data'
# Download the package from OBS to the local /cache directory in a single parallel copy.
mox.file.copy_parallel(FLAGS.data_url, TMP_CACHE_PATH)

zip_data_path = os.path.join(TMP_CACHE_PATH, '*.zip')  # assumes a single .zip package in /cache/data
unzip_data_path = os.path.join(TMP_CACHE_PATH, 'unzip')
# Decompress the package; .zip Python packages can be decompressed in the same way.
os.system('unzip ' + zip_data_path + ' -d ' + unzip_data_path)

mnist = input_data.read_data_sets(unzip_data_path, one_hot=True)
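If the unzip command is not available in your training image (whether it is depends on the image you use), Python's standard zipfile module can perform the same decompression step:

import glob
import zipfile

# Equivalent of the os.system('unzip ...') call above: extract every .zip
# package downloaded to /cache/data into /cache/data/unzip.
for package in glob.glob('/cache/data/*.zip'):
    with zipfile.ZipFile(package) as zf:
        zf.extractall('/cache/data/unzip')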