
Error Message "RuntimeError: Cannot re-initialize CUDA in forked subprocess" Displayed in Logs

Symptom

When PyTorch is used to start multiple processes, the following error message is displayed:
RuntimeError: Cannot re-initialize CUDA in forked subprocess

Possible Causes

The multiprocessing start method is incorrect. On Linux, child processes are created with the default fork method, and CUDA cannot be re-initialized in a forked subprocess. The subprocesses must be started with the spawn method instead.

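For illustration, the error can be reproduced with a script like the following (a hypothetical minimal sketch, not part of the training code): CUDA is initialized in the parent process, and a child process created with the default fork start method then tries to use the GPU.

# bad_fork.py: hypothetical minimal example that triggers the error on Linux
import torch
import torch.multiprocessing as mp

def worker(rank):
    # The forked child inherits the parent's CUDA context and fails with
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess".
    x = torch.ones(2, 2).cuda()
    print(rank, x.sum())

if __name__ == "__main__":
    torch.cuda.init()   # CUDA is initialized in the parent process
    # The default start method on Linux is "fork".
    p = mp.Process(target=worker, args=(0,))
    p.start()
    p.join()
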
Solution

Start the subprocesses with the spawn method, as in the following example. For details, see Writing Distributed Applications with PyTorch.
"""run.py:"""
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    """ Distributed function to be implemented later. """
    pass

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    # Use "spawn" instead of the default "fork" so that CUDA can be
    # initialized safely in each child process.
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
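
Alternatively, torch.multiprocessing.spawn starts the child processes with the spawn method and joins them automatically. A minimal sketch that reuses the init_process and run functions above:

if __name__ == "__main__":
    size = 2
    # spawn() calls init_process(rank, size, run) in each child process,
    # always using the "spawn" start method.
    mp.spawn(init_process, args=(size, run), nprocs=size)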

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.