Updated on 2024-06-11 GMT+08:00

TensorFlow Stops Writing TensorBoard to OBS When the Size of Written Data Reaches 5 GB

Symptom

The following error message is displayed for a ModelArts training job:

Encountered Unknown Error EntityTooLarge
Your proposed upload exceeds the maximum allowed object size.:
If the signature check failed. This could be because of a time skew. Attempting to adjust the signer

Possible Cause

OBS limits the size of a single upload to 5 GB. TensorFlow may keep the summary file in a local cache, so each time a flush is triggered, the entire summary file is uploaded again and overwrites the original file on OBS. Once the summary file exceeds 5 GB, the upload is rejected and TensorFlow stops writing.
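
To see whether a job is getting close to this limit, the summary files can be checked by size. The following is a minimal sketch, not part of ModelArts or MoXing. It assumes that the summary files are on a local path, that their names contain "tfevents" (the usual TensorFlow event file naming), and that 5 GB means 5 x 1024^3 bytes. The function name find_large_event_files and the 80% warning threshold are hypothetical choices made for illustration.

```python
import os

# Assumption: the 5 GB OBS limit is interpreted as 5 * 1024^3 bytes.
OBS_SINGLE_UPLOAD_LIMIT = 5 * 1024 ** 3


def find_large_event_files(log_dir, threshold=0.8 * OBS_SINGLE_UPLOAD_LIMIT):
    """Return sorted (path, size_in_bytes) pairs for event files at or above threshold.

    Hypothetical helper: walks log_dir and inspects every file whose name
    contains "tfevents".
    """
    large = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if "tfevents" not in name:
                continue  # skip files that are not TensorFlow event files
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size >= threshold:
                large.append((path, size))
    return sorted(large)
```

Calling find_large_event_files on the summary directory at regular intervals, for example once per epoch, gives an early warning before a flush is rejected.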

Solution

If this problem occurs while a training job is running, troubleshoot it as follows.

  1. Use the following local cache method:
    import moxing.tensorflow as mox
    mox.cache()