Updated on 2024-06-12 GMT+08:00

Distributed Training

ModelArts provides the following capabilities:

  • A wide range of built-in images to meet your requirements
  • Custom development environments set up using built-in images
  • Extensive tutorials, helping you quickly understand distributed training
  • Distributed training debugging in development tools such as PyCharm, VS Code, and JupyterLab

Constraints

  • The development environment refers to the new-version Notebook provided by ModelArts. The old-version Notebook is not included.
  • If you change the flavor of a notebook instance, you can only perform single-node debugging. You cannot perform distributed debugging or submit remote training jobs.
  • Only the PyTorch and MindSpore AI frameworks can be used for multi-node distributed debugging. If you want to use MindSpore, each node must be equipped with eight cards.
  • Replace the OBS paths in the debugging code with your own OBS paths.
  • The debugging code in this document is written in PyTorch. The process is the same for other AI frameworks. You only need to modify some parameters.
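
The parameters that usually differ between a single-node run and a multi-node run can be gathered in one place, as in the minimal sketch below. It is an illustration, not ModelArts code: it assumes the launcher exports the standard PyTorch env:// variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), and the names DistConfig, read_dist_config, and the OBS path obs://your-bucket/your-dataset/ are hypothetical placeholders. The variables actually set by a ModelArts job may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistConfig:
    """Hypothetical container for the parameters a distributed run needs."""
    rank: int          # global index of this process
    world_size: int    # total number of processes across all nodes
    master_addr: str   # address of the rank-0 node
    master_port: int   # port used for the rendezvous on the rank-0 node
    data_url: str      # placeholder OBS path; replace with your own

    @property
    def is_distributed(self) -> bool:
        # More than one process means a distributed run.
        return self.world_size > 1


def read_dist_config(
    env: Optional[Mapping[str, str]] = None,
    data_url: str = "obs://your-bucket/your-dataset/",  # placeholder path
) -> DistConfig:
    """Read the PyTorch env:// variables, falling back to single-process values."""
    env = os.environ if env is None else env
    return DistConfig(
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
        data_url=data_url,
    )


if __name__ == "__main__":
    cfg = read_dist_config()
    mode = "distributed" if cfg.is_distributed else "single-node"
    print(f"{mode}: rank {cfg.rank} of {cfg.world_size}, data at {cfg.data_url}")
```

In a PyTorch script, these values would typically be consumed by torch.distributed.init_process_group with init_method="env://". For another framework, the same structure can be kept and only the parameter names and sources changed.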

Related Chapters