Updated on 2024-06-15 GMT+08:00

Why Is Data Read Efficiency Low When a Large Number of Data Files Are Read During Training?

If a dataset contains a large number of data files (that is, massive small files) and is stored in OBS, the files must be read from OBS one by one during training. As a result, the training process spends much of its time waiting for file reads, which makes the overall read efficiency low.

Solution

  1. Compress the massive small files into a single package on your local PC, for example, a .zip package. (If you want to script this step, see the packing sketch at the end of this section.)
  2. Upload the package to OBS.
  3. During training, download the package from OBS to the local /cache directory of the training environment. Perform this operation only once.
    For example, you can use mox.file.copy_parallel to download the .zip package to the /cache directory, decompress it, and then read the files for training.
    ...
    import os
    import moxing as mox
    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    # data_url is the OBS directory that contains the data.zip package,
    # for example <obs_file_path>.
    tf.flags.DEFINE_string('data_url', '', 'dataset directory.')
    FLAGS = tf.flags.FLAGS

    # Download the package from OBS to the local cache. Perform this
    # operation only once.
    TMP_CACHE_PATH = '/cache/data'
    mox.file.copy_parallel(FLAGS.data_url, TMP_CACHE_PATH)

    # Decompress the package, then read the extracted files for training.
    zip_data_path = os.path.join(TMP_CACHE_PATH, 'data.zip')
    unzip_data_path = os.path.join(TMP_CACHE_PATH, 'unzip')
    # Alternatively, decompress the package in Python (see the sketch below).
    os.system('unzip ' + zip_data_path + ' -d ' + unzip_data_path)
    mnist = input_data.read_data_sets(unzip_data_path, one_hot=True)
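
As noted in the comment above, the package can also be decompressed in Python instead of calling the external unzip command. The following is a minimal sketch using the standard zipfile module; it assumes the same TMP_CACHE_PATH and data.zip names as the example above.

    import os
    import zipfile

    TMP_CACHE_PATH = '/cache/data'
    zip_data_path = os.path.join(TMP_CACHE_PATH, 'data.zip')
    unzip_data_path = os.path.join(TMP_CACHE_PATH, 'unzip')

    # Extract all files from the package into the target directory.
    # zipfile is part of the Python standard library, so no external
    # unzip binary is required in the training image.
    with zipfile.ZipFile(zip_data_path, 'r') as zf:
        zf.extractall(unzip_data_path)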
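
For step 1, any archiving tool can create the package. If you prefer to script it, the sketch below packs a directory of small files into data.zip using the same zipfile module; the ./dataset source directory is a hypothetical example path.

    import os
    import zipfile

    src_dir = './dataset'    # hypothetical local directory holding the small files
    zip_path = './data.zip'  # package to upload to OBS afterwards

    # Walk the dataset directory and add every file to the archive,
    # storing paths relative to src_dir.
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                file_path = os.path.join(root, name)
                zf.write(file_path, os.path.relpath(file_path, src_dir))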