Optimizing AI Performance
- Practice: Improving model performance through training optimization
Parameter tuning policy: Adjust parameters such as flash attention, the parallel splitting policy, the micro-batch size, and the recomputation policy.
Through parameter tuning, improve performance by taking full advantage of the video random-access memory (VRAM) and the computing power. A sketch of these tuning knobs is shown below.
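A minimal sketch of the tuning knobs mentioned above, written as a Megatron-style configuration dictionary. The option names (micro_batch_size, use_flash_attn, recompute_granularity, and the parallel sizes) are illustrative assumptions; the exact names depend on your training framework and version.

```python
# Hypothetical tuning configuration; adjust names to match your framework.
tuning_config = {
    "micro_batch_size": 2,                 # raise until VRAM is nearly full
    "use_flash_attn": True,                # fused attention reduces VRAM and kernel launches
    "tensor_model_parallel_size": 8,       # parallel splitting policy across devices
    "pipeline_model_parallel_size": 2,
    "recompute_granularity": "selective",  # recomputation: trade compute for VRAM
}
```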
Performance breakdown
If performance after parameter tuning still does not meet commercial requirements, profile the workload: add profiling code to the training script, collect performance data, and then break performance down along the operator, communication, scheduling, and memory dimensions. The procedure is as follows: first, generate the profiling data directory structure; next, use the ATT tool to compare and analyze the end-to-end time consumption of the NPU against competing products; finally, perform tracing analysis. A minimal profiling wrapper is sketched below.
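A minimal sketch of adding profiling code to a training script, assuming the standard torch.profiler interface on a CUDA device; with torch_npu installed, the analogous torch_npu.profiler interface and the NPU activity would be used instead (assumption). The tiny linear model is only a placeholder workload.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda"  # with torch_npu, use "npu" and the torch_npu.profiler interface
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiling_data"),  # profiling data directory
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(6):                       # a few steps are enough for analysis
        x = torch.randn(64, 1024, device=device)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()                          # advance the profiler schedule each step
```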
Operator analysis
Analyze operators using the summary file generated during profiling, focusing on the FA (flash attention) and MM (matrix multiplication) operators, for example as sketched below.
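A hedged sketch of operator analysis, assuming the profiling summary is exported as a CSV. The file path and the column names ("Op Type", "Task Duration(us)") are illustrative assumptions and may differ in your profiler version.

```python
import pandas as pd

# Placeholder path; point this at the summary file in your profiling data directory.
summary = pd.read_csv("./profiling_data/op_summary.csv")

# Aggregate time per operator type to find the dominant operators.
by_type = (summary.groupby("Op Type")["Task Duration(us)"]
                  .agg(["count", "sum"])
                  .sort_values("sum", ascending=False))

# Focus on flash-attention (FA) and matrix-multiplication (MM) operators.
focus = by_type[by_type.index.str.contains("FlashAttention|MatMul", case=False, regex=True)]
print(focus)
```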
- Practice: route planning acceleration
Ranktable route planning is a communication optimization capability used in distributed parallel training. When NPUs are used, network route affinity planning can be performed for the communication paths between nodes based on the actual switch topology, improving inter-node communication speed. This case describes how to complete a PyTorch NPU distributed training task in ModelArts Lite using ranktable route planning. By default, training tasks are delivered to the Lite resource pool cluster in Volcano job mode. For details, see Best Practices. A minimal distributed initialization sketch is shown below.
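A minimal sketch of initializing PyTorch distributed training on NPUs with the HCCL backend, assuming torch_npu is installed and that RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are injected by the platform. It does not cover the ranktable file itself, which is managed by the platform.

```python
import os

import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  (registers the "npu" device and the HCCL backend)

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Bind each process to one NPU on the node.
torch.npu.set_device(rank % torch.npu.device_count())

# HCCL is the collective communication backend for Ascend NPUs.
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

# From here, wrap the model with DistributedDataParallel as usual.
```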
- Practice: training VRAM optimization
- PyTorch memory pool management policies
- The PyTorch memory pool is managed in blocks, and the block pools are categorized into a small-block pool and a large-block pool. PyTorch requests memory from the device driver in blocks, while user or PyTorch code requests memory from and releases it back to the pool through the creation and destruction of tensors.
- Tensor lifecycles are managed through reference counting, similar to a smart pointer. The Python and C++ layers are closely linked: each Python tensor object is associated with a C++ tensor object, and when the Python tensor object is destroyed, the corresponding C++ tensor object is destructed and its memory is released. A tensor object created in C++ can also be returned to Python as a Python tensor object.
- A C++ tensor object consists of a viewTensor and a storageTensor. The viewTensor holds the tensor's metadata, such as shape, stride, and dataType; the storageTensor holds the memory address and offset. What the user sees as a tensor is the viewTensor, and multiple viewTensors can map to the same storageTensor. This is how multiple tensors share the same memory after PyTorch view operations, as illustrated in the sketch below. When a storageTensor requests memory, the PyTorch block pool allocates an idle block, splits it as needed, and returns the address pointer.
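A minimal sketch of view/storage sharing and reference-counted release, assuming a CUDA device; with torch_npu, the "npu" device and torch.npu memory counters behave analogously (assumption).

```python
import torch

device = "cuda"  # with torch_npu, "npu" behaves analogously
a = torch.empty(1024, 1024, device=device)   # storageTensor obtains a block from the pool
b = a.view(-1)                               # a second viewTensor over the same storage
assert a.data_ptr() == b.data_ptr()          # both views share one memory address

before = torch.cuda.memory_allocated()
del a                                        # storage is still referenced by b: nothing is freed
assert torch.cuda.memory_allocated() == before
del b                                        # last reference is gone: the block returns to the pool
assert torch.cuda.memory_allocated() < before
```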
- Cross-stream memory reuse policies for PyTorch
- If a tensor obtained from the memory pool of one stream needs to be used on another stream, perform the recordStream operation to add the target stream's information to the tensor's block. When the tensor's lifecycle ends and its address is released, an event_record task is delivered to the target stream.
- Each time memory is requested from the pool, an event_query operation is performed. If the recorded event has completed, the work on the other streams has finished and the block can be reclaimed. A usage sketch follows.
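A minimal sketch of cross-stream reuse on CUDA using the public record_stream API; with torch_npu, torch.npu.Stream and the NPU allocator behave analogously (assumption).

```python
import torch

side_stream = torch.cuda.Stream()            # a second stream besides the default stream
x = torch.empty(1 << 20, device="cuda")      # block comes from the current stream's pool

with torch.cuda.stream(side_stream):
    y = x * 2                                # x is consumed on another stream
x.record_stream(side_stream)                 # record the target stream in x's block

del x  # the allocator reclaims the block only after side_stream's recorded event completes

# Synchronize before reading y on the default stream.
torch.cuda.current_stream().wait_stream(side_stream)
```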
- PyTorch memory statistics
- Generally, PyTorch memory statistics cover allocated, active, and reserved memory. Allocated memory is the memory that has been allocated to tensors and not yet released, as seen from the host. (This reflects allocation and release operations on the host side; it does not indicate the actual device status.) Active memory is the memory not yet released on the host plus the memory still occupied by other streams.
- For example, a tensor requests memory on stream A and is used in an allreduce collective communication operation on stream B. Once the allreduce task has been delivered, the host releases the tensor. At that point the allocated memory decreases, but the active memory does not; the active memory decreases only after the allreduce has finished executing, which is confirmed through event_query.
- Suppose PyTorch requests 100 MB of memory from the device and then releases it, and then requests another 20 MB without releasing it. The allocated memory is now 20 MB, while the reserved memory is 100 MB. In a real network, the reserved memory contains a large number of small blocks that cannot be reused as one large block, so there is a large gap between the reserved and allocated memory; this gap is called memory fragmentation. The sketch below reproduces these statistics.
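A minimal sketch reproducing the 100 MB / 20 MB example with the standard CUDA memory counters; with torch_npu installed, the torch.npu.* counterparts report the same statistics for NPUs (assumption).

```python
import torch

t = torch.empty(100 * 1024 * 1024, dtype=torch.uint8, device="cuda")  # ~100 MB
del t                                                                  # released by the host, but cached
u = torch.empty(20 * 1024 * 1024, dtype=torch.uint8, device="cuda")   # ~20 MB, kept alive

print(torch.cuda.memory_allocated())   # ~20 MB held by live tensors
print(torch.cuda.memory_reserved())    # ~100 MB still cached in the block pool
print(torch.cuda.memory_stats()["active_bytes.all.current"])  # active memory tracked by the allocator
```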
- Factors affecting PyTorch memory fragments
- Theoretically, the more memory requests and releases within a single step, the more memory fragmentation there is. The PyTorch memory pool depends only on the training script logic on the host, and that logic is the same in every step, so the memory status stabilizes after the first step.
- Tensors with different lifecycles request and release memory in an interleaved manner; to reduce fragmentation, long-lived tensors should request memory first. Workspace memory can be reused strictly serially, so a customized memory pool policy can be used to reduce the impact of fragmentation. Compared with GPUs, NPUs more frequently convert non-contiguous memory to contiguous memory, which is another common source of memory requests. The sketch below shows one way to inspect and mitigate fragmentation.
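A hedged sketch of inspecting and mitigating fragmentation via the CUDA caching-allocator configuration. The max_split_size_mb option limits how far large blocks are split; whether torch_npu exposes an equivalent environment variable (for example PYTORCH_NPU_ALLOC_CONF) depends on the version and is an assumption here.

```python
import os

# Allocator options must be set before the first device allocation takes place.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

_ = torch.empty(8 * 1024 * 1024, device="cuda")
# Reports allocated vs reserved memory and inactive split blocks, i.e. fragmentation.
print(torch.cuda.memory_summary())
```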
- VRAM optimization policy
The number of parameters in foundation models grows exponentially, far exceeding the physical memory capacity of a single GPU, so VRAM optimization is required when training them. VRAM optimization either adjusts algorithms to reduce VRAM consumption or obtains extra space through trade-offs. Because the physical size of VRAM is fixed, extra space can only be gained by trading time for space or by transferring data to other storage. Trading time usually consumes extra computing power and bandwidth, while transferring data mainly consumes I/O bandwidth and introduces latency, which may reduce throughput. A common time-for-space technique, activation recomputation, is sketched below.
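A minimal sketch of trading time for space through activation checkpointing with PyTorch's torch.utils.checkpoint. The model itself is a placeholder; the point is that activations inside each checkpointed block are recomputed during backward instead of being kept in VRAM.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            [torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()) for _ in range(8)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside each block are not stored; they are recomputed during
            # backward, trading extra compute time for lower VRAM usage.
            x = checkpoint(block, x, use_reentrant=False)
        return x

net = Net().cuda()   # with torch_npu, .npu() behaves analogously (assumption)
loss = net(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
```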
- Performance metrics
Table 1 Performance metrics

| Metric ID | Metric Name | Description |
| --- | --- | --- |
| cpu_usage | CPU Usage | CPU usage of ModelArts |
| mem_usage | Memory Usage | Memory usage of ModelArts |
| gpu_util | GPU Usage | GPU usage of ModelArts |
| gpu_mem_usage | GPU Memory Usage | GPU memory usage of ModelArts |
| npu_util | NPU Usage | NPU usage of ModelArts |
| npu_mem_usage | NPU Memory Usage | NPU memory usage of ModelArts |
| disk_read_rate | Disk Read Rate | Disk read rate of ModelArts |
| disk_write_rate | Disk Write Rate | Disk write rate of ModelArts |
For details about all metrics, see ModelArts Metrics.