MemArts

Overview

The current standard training workflow of ModelArts involves calling MoXing to download data from OBS to the local SSD and then reading the data from the local SSD for training. However, this process has the following significant issues:

Long download time: The download throughput for small files in a certain region is typically around 70 MB/s. For example, downloading 2 TB of data can take up to 8 hours, which significantly degrades the training experience.
Insufficient disk space: The SSD size on the ModelArts training server in a certain region of Huawei Cloud is 4.7 TB. When the data volume exceeds 4.7 TB, the training server cannot store all the data.
Bandwidth bottleneck: When multiple training nodes download data simultaneously, the download time becomes uncertain due to the OBS bandwidth bottleneck.
Limited advanced features: ModelArts has developed advanced features such as mixed deployment, elasticity, and fault tolerance. However, due to the long data download time, the cost of re-downloading data when the training topology changes is extremely high, significantly affecting your experience of these advanced features.
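
The 8-hour figure in the first item can be checked with simple arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a sustained throughput of 70 megabytes per second, which is the rate that the 8-hour figure implies. The function name is illustrative only.

```python
def download_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Return the hours needed to transfer size_bytes at a steady rate."""
    return size_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = download_hours(2 * TB, 70 * MB)
print(f"{hours:.1f} hours")  # 7.9 hours
```

Real downloads of many small files rarely sustain a constant rate, so treat the result as an estimate rather than a guarantee.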

To solve these problems, ModelArts provides MemArts, a near-compute cache that stores data on the training service nodes. MemArts networks the SSDs of multiple nodes into a distributed cache pool that holds the training dataset. The first time training data is loaded, it must be downloaded from OBS into the cache. Once the data is cached, subsequent downloads and reads are served directly from the MemArts cache pool, which significantly improves training efficiency and the user experience.
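
The load-once, read-many behavior described above is the read-through cache pattern. The following toy sketch illustrates only that pattern. It is a single-process, in-memory stand-in and not how MemArts is implemented. The class name and the fake fetch function are hypothetical.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy cache: the first read of a key hits the backing store, later reads do not."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}
        self.backend_reads = 0  # counts how often the backing store is accessed

    def read(self, key: str) -> bytes:
        if key not in self._store:
            # Cache miss: fetch from the backing store and keep a copy.
            self.backend_reads += 1
            self._store[key] = self._fetch(key)
        return self._store[key]


def fake_obs_fetch(key: str) -> bytes:
    """Stand-in for a slow OBS download (hypothetical)."""
    return f"contents of {key}".encode()


cache = ReadThroughCache(fake_obs_fetch)
first = cache.read("obs://bucket_name/obs_file.txt")   # goes to the backing store
second = cache.read("obs://bucket_name/obs_file.txt")  # served from the cache
```

Only the first read pays the cost of the backing store, which is why repeated epochs over the same dataset benefit most from the cache.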

For basic usage, first complete the configuration described in Constraints, and then follow Usage Instructions.

For additional operations, see Other Environment Variable Configurations and Usage Examples.

Constraints

MemArts is an advanced feature. Before using it, contact technical support to have MemArts installed on your dedicated resource pool.
Before using this feature, set the environment variable USE_MEMARTS=1, and download or read data through the data APIs provided by MoXing.
The setting takes effect when MoXing is imported, so set the environment variable before running import moxing.
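
The ordering requirement in the last constraint can be sketched as follows. The helper names are hypothetical. The sketch assumes only what is stated above: USE_MEMARTS must be set before moxing is imported. The import is deferred into a function so that the module is loaded only after the variable is set.

```python
import importlib
import os
from typing import MutableMapping


def enable_memarts(env: MutableMapping[str, str] = os.environ) -> None:
    """Set USE_MEMARTS=1 in the given environment mapping."""
    env["USE_MEMARTS"] = "1"


def import_moxing():
    """Prepare the environment first, then import and return the moxing module."""
    enable_memarts()
    return importlib.import_module("moxing")
```

In a training script, calling mox = import_moxing() at the top and then using mox.file as usual keeps the ordering correct. You can also set the variable in the job configuration or shell before the Python process starts.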

Usage Instructions

MemArts can be used to accelerate data reading. The following describes how to use basic functions.

copy

# Copy a file from OBS to a local path.
import moxing as mox
mox.file.copy('obs://bucket_name/obs_file.txt', '/cache/obs_file.txt')

copy_parallel

# Copy a directory from OBS to a local path in parallel.
import moxing as mox
mox.file.copy_parallel('obs://bucket_name/sub_dir_0', '/cache/sub_dir_0')

read

# Read the file as text and obtain a string.
import moxing as mox
file_str = mox.file.read('obs://bucket_name/obs_file.txt')
# Read the file in binary mode and obtain bytes.
file_data = mox.file.read('obs://bucket_name/aaa.jpg', binary=True)

read_meta_free

# Used in ultra-large-scale data read scenarios. Metadata does not need to be cached in advance.
import moxing as mox
# Read the file as text and obtain a string.
file_str = mox.file.read_meta_free('obs://bucket_name/obs_file.txt')
# Read the file in binary mode and obtain bytes.
file_data = mox.file.read_meta_free('obs://bucket_name/aaa.jpg', binary=True)
# Read the file written to MemArts by the write_memarts function and obtain bytes.
file_data = mox.file.read_meta_free('obs://bucket_name/obs_file.txt', memarts_only=True)

File

# Use a file object to read the file as text.
import moxing as mox
with mox.file.File('obs://bucket_name/obs_file.txt', 'r') as f:
    file_str = f.read()
# Use a file object to read the file in binary mode.
with mox.file.File('obs://bucket_name/obs_file.bin', 'rb') as f:
    file_bytes = f.read()

write_memarts

# Write the file to MemArts. The file content needs to be converted into binary. The write result is returned. True indicates success, and False indicates failure.
import moxing as mox 
ret = mox.file.write_memarts('obs://bucket/dir/data.bin', b'xxx', retry=3)
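
Because write_memarts reports failure through its return value rather than an exception, it is easy to ignore a failed write. The following sketch wraps the write and the MemArts-only read shown earlier into one checked round trip. The wrapper name and the in-memory test double are hypothetical. The wrapper assumes only the two call shapes that appear in this section.

```python
def write_and_read_back(mox_file, url: str, payload: bytes, retry: int = 3) -> bytes:
    """Write payload to MemArts, then read it back from MemArts only.

    mox_file is expected to be the moxing.file module or an object with the same calls.
    """
    if not isinstance(payload, (bytes, bytearray)):
        raise TypeError("write_memarts expects binary content; encode strings first")
    if not mox_file.write_memarts(url, bytes(payload), retry=retry):
        raise IOError(f"write_memarts failed for {url}")
    return mox_file.read_meta_free(url, memarts_only=True)


class FakeMoxFile:
    """In-memory test double that mimics the two calls used above (hypothetical)."""

    def __init__(self) -> None:
        self._data = {}

    def write_memarts(self, url, content, retry=3):
        self._data[url] = content
        return True

    def read_meta_free(self, url, binary=False, memarts_only=False):
        return self._data[url]


result = write_and_read_back(FakeMoxFile(), "obs://bucket/dir/data.bin", "xxx".encode())
```

In a real job, pass mox.file as the first argument instead of the test double.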

MemArts Proxy

After installing MoXing version 2.3.8 or higher, run the shell commands below before executing MoXing-related scripts:

# stop is used to clean up any residual proxy client in fast recovery scenarios.
moxing stop_memarts_proxy 
# Start eight proxy clients in the background to ensure optimal performance.
moxing start_memarts_proxy 8

Configure the environment variables USE_MEMARTS=1 and USE_MEMARTS_PROXY=1 to enable MemArts proxy. No additional modifications are required to use this feature, which significantly reduces the memory usage of MemArts processes at the cost of a slight performance decrease.
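
If the training entry point is a Python script, the same steps can be driven from Python. The following sketch uses the two shell commands and the two environment variables shown above. The function names are hypothetical, and the sketch assumes that the start command returns once the proxy clients are running in the background and that a failed stop on a clean node can be ignored.

```python
import os
import subprocess
from typing import List


def proxy_commands(num_clients: int = 8) -> List[List[str]]:
    """Return the stop and start commands in the order they should run."""
    if num_clients < 1:
        raise ValueError("num_clients must be at least 1")
    return [
        ["moxing", "stop_memarts_proxy"],
        ["moxing", "start_memarts_proxy", str(num_clients)],
    ]


def start_proxy(num_clients: int = 8) -> None:
    """Restart the proxy clients, then enable proxy mode for this process."""
    stop_cmd, start_cmd = proxy_commands(num_clients)
    # Assumption: a non-zero exit code from stop only means nothing was running.
    subprocess.run(stop_cmd, check=False)
    subprocess.run(start_cmd, check=True)
    os.environ["USE_MEMARTS"] = "1"
    os.environ["USE_MEMARTS_PROXY"] = "1"
```

Call start_proxy() before importing moxing so that both environment variables are in place when the import runs.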

Other Environment Variable Configurations

The following environment variables control the MoXing APIs used by MemArts. The settings take effect when MoXing is imported, so set these variables before running import moxing.

USE_MEMARTS
This environment variable is used to enable or disable MemArts. By default, it is disabled. Set it to 1 to enable MemArts.

MOX_COPY_PARALLEL_THREADS
This environment variable sets the number of threads for the copy_parallel function. The default value is 16, which means the listing and copying operations in this function run with up to 16 concurrent threads.
MOX_FILE_CHUNK_SIZE
This environment variable sets the chunk size, in bytes, for the read, read_meta_free, copy, copy_parallel, and write_memarts functions. The default value is 1 MB, and there is typically no need to change it. MoXing splits each requested file into chunks of this size and then retrieves the corresponding data segments from MemArts or OBS.
MOX_OBS_CLIENT_LOG
This environment variable controls whether the OBS Python SDK logs are printed. By default, it is set to 1 (enabled). When enabled, OBS-related logs will be printed to /home/ma-user/modelarts/log/obs_log.
MOX_AUTO_READ_META_FREE
This environment variable determines whether the read_meta_free function is used by default for read operations. By default, it is disabled. Set it to 1 to enable it. When enabled, calling the read function will actually use the read_meta_free function for read operations.
MOX_MEMARTS_LARGE_FILE_ACC
This environment variable enables the large file acceleration mode for the copy and copy_parallel functions when downloading data from MemArts. By default, it is disabled. Set it to 1 to enable it. When enabled, the system will determine whether a file is large based on the value of the MOX_FILE_PARTIAL_MAXIMUM_SIZE environment variable (in bytes), which defaults to 5GB. If the file size is greater than or equal to MOX_FILE_PARTIAL_MAXIMUM_SIZE, the system will use the number of concurrent processes specified by the MOX_FILE_LARGE_FILE_TASK_NUM environment variable (default is 8) to download the file in multiple processes. It is recommended to use this feature in proxy service mode.
MOX_OBS_TO_MOUNT_PATH
This environment variable enables compatibility with mounted services such as SFS Turbo. By default, it is disabled. The value is a string of mappings separated by semicolons (;). Each mapping pairs an OBS address with a local address, joined by an equal sign (=). When enabled, you can pass OBS paths to the read, read_meta_free, and exists functions and to file objects, and the data is read from the local mounted directories. For example, if you use SFS Turbo to mount the obs://test/a directory to the local /cache/a directory and the obs://test/b directory to the local /cache/b directory, set MOX_OBS_TO_MOUNT_PATH to obs://test/a=/cache/a;obs://test/b=/cache/b. Subsequent calls to read, read_meta_free, and exists on paths under obs://test/a or obs://test/b then read directly from the SFS Turbo mounted directories.
USE_MEMARTS_PROXY
This environment variable enables or disables the MemArts client proxy feature. The feature is enabled if a proxy client already exists in the environment or if the variable is set to 1. For details about the client proxy feature, see MemArts Proxy.
USE_METADATA_CACHE
This environment variable enables or disables local metadata caching. By default, it is disabled. Set it to 1 to enable it. When enabled, metadata required for the copy, copy_parallel, and read functions will be read from the local cache if available. If the metadata is not available locally, it will be read from OBS and then saved locally to improve metadata read performance.
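
The MOX_OBS_TO_MOUNT_PATH value format described above can be illustrated with a small parser. This is a sketch of the format only and not MoXing's actual parsing or lookup logic. Both function names are hypothetical, and choosing the longest matching prefix is an assumption.

```python
from typing import Dict, Optional


def parse_mount_mappings(value: str) -> Dict[str, str]:
    """Parse 'obs://a=/local/a;obs://b=/local/b' into {obs_prefix: local_prefix}."""
    mappings: Dict[str, str] = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        obs_prefix, sep, local_prefix = item.partition("=")
        if not sep or not obs_prefix.startswith("obs://") or not local_prefix:
            raise ValueError(f"malformed mapping: {item!r}")
        mappings[obs_prefix.rstrip("/")] = local_prefix.rstrip("/")
    return mappings


def to_local_path(obs_path: str, mappings: Dict[str, str]) -> Optional[str]:
    """Return the mounted local path for obs_path, or None if no mapping covers it."""
    # Try longer prefixes first so that nested mappings resolve to the most specific one.
    for obs_prefix in sorted(mappings, key=len, reverse=True):
        if obs_path == obs_prefix or obs_path.startswith(obs_prefix + "/"):
            return mappings[obs_prefix] + obs_path[len(obs_prefix):]
    return None


mappings = parse_mount_mappings("obs://test/a=/cache/a;obs://test/b=/cache/b")
local = to_local_path("obs://test/a/x/y.txt", mappings)
```

With the example value from the description, obs://test/a/x/y.txt resolves to /cache/a/x/y.txt, while a path outside both mapped directories resolves to None.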

Usage Examples

For third-party library functions that you might need to use, the following code replacements can be used:

json.load(file_path) can be replaced with:

json.loads(mox.file.read(file_path, binary=False))

np.load(file_path) can be replaced with:

np.load(io.BytesIO(mox.file.read_meta_free(file_path, binary=True)))

np.fromfile(file_path, dtype=dtype) can be replaced with:

np.frombuffer(mox.file.read(file_path,binary=True), dtype=dtype)

torch.load(file_path) can be replaced with:

with io.BytesIO(mox.file.read(file_path, binary=True)) as f:
    pth = torch.load(f)

ndarray = scipy.sparse.load_npz(file_path) can be replaced with:

with io.BytesIO(mox.file.read(file_path, binary=True)) as f:
    ndarray = scipy.sparse.load_npz(f)

with open(file_path, 'rb') as f:pickle.load(f) can be replaced with:

with io.BytesIO(mox.file.read(file_path, binary=True)) as f:
    data = pickle.load(f)

Image.open(file_path) can be replaced with:

Image.open(io.BytesIO(mox.file.read_meta_free(file_path, binary=True)))

cv2.imread(file_path, imread_type) can be replaced with:

cv2.imdecode(np.frombuffer(mox.file.read_meta_free(file_path,binary=True), np.uint8), imread_type)

If rewriting these API calls is difficult, you can use moxing.file.copy to copy the data to a local path, read it with local APIs, and then delete the local copy. This method usually avoids performance issues. Here is a code example:

_, ext = os.path.splitext(file_path)
with tempfile.NamedTemporaryFile(suffix=ext, dir='/cache', delete=False) as fp:
    moxing.file.copy(file_path, fp.name)
    data = np.load(fp.name)
# delete=False keeps the file after the with block, so remove it explicitly.
os.remove(fp.name)
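
The copy, load, and delete steps can also be packaged as a reusable helper that always cleans up the temporary file, even when loading fails. The following is a minimal sketch. The helper name and its parameters are hypothetical, and shutil.copyfile stands in for moxing.file.copy so that the demonstration runs without MoXing.

```python
import os
import shutil
import tempfile
from typing import Any, Callable


def load_via_local_copy(src: str, copy_fn: Callable[[str, str], Any],
                        loader: Callable[[str], Any], tmp_dir: str = "/cache") -> Any:
    """Copy src to a temporary local file, load it with loader, then delete the copy."""
    _, ext = os.path.splitext(src)
    fd, local_path = tempfile.mkstemp(suffix=ext, dir=tmp_dir)
    os.close(fd)  # copy_fn reopens the file by name
    try:
        copy_fn(src, local_path)
        return loader(local_path)
    finally:
        os.remove(local_path)  # runs on success and on failure


def read_text(path: str) -> str:
    with open(path) as f:
        return f.read()


# Local demonstration with shutil.copyfile standing in for moxing.file.copy.
demo_dir = tempfile.mkdtemp()
src_path = os.path.join(demo_dir, "sample.txt")
with open(src_path, "w") as f:
    f.write("hello")

loaded = load_via_local_copy(src_path, shutil.copyfile, read_text, tmp_dir=demo_dir)
```

With MoXing, the equivalent call would be load_via_local_copy(file_path, moxing.file.copy, np.load). The loader must finish reading before it returns, because the local copy is removed immediately afterwards.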
