Why Is Data Read Efficiency Low When a Large Number of Data Files Are Read During Training?

Updated on 2024-06-11 GMT+08:00

If a dataset consists of a large number of small files stored in OBS, the files must be read from OBS one by one throughout training. As a result, the training process spends much of its time waiting for file reads, which makes data reading inefficient.

Solution

  1. Compress the massive small files into a single package on your local PC, for example, a .zip package (a packaging sketch follows the code below).
  2. Upload the package to OBS.
  3. During training, download the package from OBS to the /cache directory of the training container. Perform this operation only once.
    For example, you can use mox.file.copy_parallel to download the .zip package to the /cache directory, decompress the package, and then read files for training.
    import os

    import moxing as mox
    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    # data_url holds the OBS path of the package, for example <obs_file_path>/data.zip.
    tf.flags.DEFINE_string('data_url', '', 'dataset directory.')
    FLAGS = tf.flags.FLAGS

    TMP_CACHE_PATH = '/cache/data'
    # Download the package from OBS to the local /cache directory (done only once).
    mox.file.copy_parallel(FLAGS.data_url, TMP_CACHE_PATH)
    zip_data_path = os.path.join(TMP_CACHE_PATH, 'data.zip')
    unzip_data_path = os.path.join(TMP_CACHE_PATH, 'unzip')
    # Decompress the package. You can also decompress it in Python; see the sketch below.
    os.system('unzip ' + zip_data_path + ' -d ' + unzip_data_path)
    # Read the decompressed files for training.
    mnist = input_data.read_data_sets(unzip_data_path, one_hot=True)
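
    If the unzip binary is not available in the training image, the decompression step can also be done with Python's standard zipfile module, as the comment above notes. A minimal sketch, assuming the same TMP_CACHE_PATH layout as in the code above:

    import os
    import zipfile

    TMP_CACHE_PATH = '/cache/data'
    zip_data_path = os.path.join(TMP_CACHE_PATH, 'data.zip')
    unzip_data_path = os.path.join(TMP_CACHE_PATH, 'unzip')

    # Extract every file in the archive into the unzip directory.
    with zipfile.ZipFile(zip_data_path, 'r') as archive:
        archive.extractall(unzip_data_path)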
    
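For step 1, any archiver can create the package on the local PC. A minimal sketch using Python's standard zipfile module, assuming the small files sit under a hypothetical local ./data directory:

import os
import zipfile

src_dir = './data'       # hypothetical local directory holding the small files
zip_path = './data.zip'  # package to upload to OBS in step 2

# Walk the directory tree and add every file to one .zip archive.
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
    for root, _, files in os.walk(src_dir):
        for name in files:
            file_path = os.path.join(root, name)
            # Store paths relative to src_dir so the archive unpacks cleanly.
            archive.write(file_path, os.path.relpath(file_path, src_dir))

The resulting data.zip is what you upload to OBS in step 2, for example with OBS Browser+ or the obsutil tool.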