Error Message "RuntimeError: Cannot re-initialize CUDA in forked subprocess" Displayed in Logs
Updated on 2022-12-08 GMT+08:00
Symptom
When PyTorch is used to start multiple processes, the following error message is displayed:
RuntimeError: Cannot re-initialize CUDA in forked subprocess
Possible Causes
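The start method used for multiprocessing is incorrect. The child processes are created with the fork start method after CUDA has been initialized in the parent process, and CUDA cannot be re-initialized in a forked child process. To use CUDA in child processes, the spawn or forkserver start method is required.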
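To confirm this cause, it can help to log the start method and the CUDA state just before the worker processes are created. The following is a minimal sketch, not part of the original code: the helper name fork_would_break_cuda is hypothetical, and the CUDA flag is passed in as a plain Boolean (for example, the value of torch.cuda.is_initialized()) so that the sketch runs without a GPU.

```python
import multiprocessing as mp


def fork_would_break_cuda(start_method, cuda_initialized):
    """Return True if a child started this way would fail when it uses CUDA.

    start_method: a start method name such as "fork", "spawn", or "forkserver".
    cuda_initialized: whether the parent process has already initialized CUDA,
        for example the value of torch.cuda.is_initialized().
    """
    return bool(start_method == "fork" and cuda_initialized)


if __name__ == "__main__":
    # allow_none=True only reads the setting. None means that no start method
    # has been chosen yet, so the platform default applies.
    print("configured start method:", mp.get_start_method(allow_none=True))
    print("fork after CUDA init is unsafe:", fork_would_break_cuda("fork", True))
    print("spawn after CUDA init is unsafe:", fork_would_break_cuda("spawn", True))
```

If the configured start method is None or fork on Linux and the parent process has already touched CUDA, the error described in this section is the likely result.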
Solution
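Start the child processes with the spawn start method instead of fork, as shown in the following example. For details, see Writing Distributed Applications with PyTorch.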
"""run.py:"""
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, size):
    """ Distributed function to be implemented later. """
    pass


def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
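The script above sets the start method globally. set_start_method raises a RuntimeError if the start method has already been fixed, for example when a framework or another module sets it first. A possible alternative, sketched below, is to request a spawn context with get_context and create the processes from that context. The sketch uses the standard multiprocessing module so that it runs without PyTorch; it assumes that torch.multiprocessing can be substituted because it wraps the same API. The worker and launch functions are hypothetical placeholders, not part of the original example.

```python
import multiprocessing as mp  # assumption: torch.multiprocessing offers the same API


def worker(rank, size, results):
    """Placeholder for the per-rank work. It only reports which rank ran."""
    results.put((rank, size))


def launch(size, start_method="spawn"):
    """Start `size` workers from an explicit context and collect their reports."""
    # get_context returns a context object bound to the given start method
    # without changing the global default.
    ctx = mp.get_context(start_method)
    results = ctx.Queue()
    processes = [
        ctx.Process(target=worker, args=(rank, size, results))
        for rank in range(size)
    ]
    for p in processes:
        p.start()
    # Read one report per worker before joining so that no child blocks on the queue.
    collected = sorted(results.get() for _ in processes)
    for p in processes:
        p.join()
    return collected


if __name__ == "__main__":
    # The __main__ guard is required with spawn because each child re-imports this module.
    print(launch(2))
```

With this approach, the start method is scoped to the processes created from the context, which avoids conflicts with code that configures multiprocessing elsewhere.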
Summary and Suggestions
Before creating a training job, use the ModelArts development environment to debug the training code so that as many code migration errors as possible are eliminated in advance.
- Use the online notebook environment for debugging. For details, see Using JupyterLab to Develop a Model.
- Use the local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Using the Local IDE to Develop a Model.