Updated on 2025-08-18 GMT+08:00

Configuring Password-free SSH Mutual Trust Between Instances for a Training Job Created Using a Custom Image

For distributed training with a custom image that uses MPI or Horovod, configure password-free SSH mutual trust between instances so that they can communicate with each other. Without it, the training job fails.

The configuration involves adapting the code and setting training job parameters.

  1. Create a custom image with OpenSSH pre-installed. The training framework should be MPI or Horovod.
  2. Create a boot script file named start_sshd.sh with the following content:
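    # Port that sshd listens on; defaults to 38888 if MY_SSHD_PORT is not set.
    MY_SSHD_PORT=${MY_SSHD_PORT:-"38888"}
    mkdir -p /home/ma-user/etc
    # Generate an RSA host key with an empty passphrase.
    ssh-keygen -f /home/ma-user/etc/ssh_host_rsa_key0 -N '' -t rsa > /dev/null
    # Start sshd on the chosen port with the generated host key.
    /usr/sbin/sshd -p "$MY_SSHD_PORT" -h /home/ma-user/etc/ssh_host_rsa_key0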
  3. Upload the start_sshd.sh boot script to the training code directory in OBS.
  4. Create a training job using the custom image.
    • Code Directory: Select the OBS path where the sshd boot script file is stored.
    • Boot Command: Run the sshd boot script before your own training command.
      bash ${MA_JOB_DIR}/demo-code/start_sshd.sh && your custom command

      In the command, replace your custom command with the command that you want the training job to run.

    • Environment Variable: Add MY_SSHD_PORT = 38888.
    • Password-free SSH Between Nodes: Enable it and set Password-free SSH File Directory. Use the default value in most cases. After the training job is submitted, the SSH key and configuration files (authorized_keys, config, id_rsa, and id_rsa.pub) are automatically generated in the /home/ma-user/.ssh directory of the training container.
  5. After the training job is created, its instances can connect to each other over SSH by domain name and port number for the duration of the training. For example:
    ssh modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1 -p $MY_SSHD_PORT
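
Because sshd on a peer instance may not be listening yet when your own instance finishes its boot script, it can help to wait for the peer before launching MPI or Horovod. The following is a minimal sketch rather than part of the official procedure. The function name wait_for_peer and the SSH_BIN variable are hypothetical, and the sketch assumes that the automatically generated files in /home/ma-user/.ssh already allow non-interactive login and host key acceptance.

```shell
# Hypothetical helper (not provided by ModelArts): poll a peer instance until
# its sshd accepts a password-free connection on MY_SSHD_PORT.
# Usage: wait_for_peer <peer-domain-name> [max_attempts] [delay_seconds]
MY_SSHD_PORT=${MY_SSHD_PORT:-"38888"}
SSH_BIN=${SSH_BIN:-ssh}   # assumption: overridable so the loop can be tried offline

wait_for_peer() {
    host="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # the remote command "true" only checks that login succeeds.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 \
            -p "$MY_SSHD_PORT" "$host" true; then
            echo "peer $host reachable on port $MY_SSHD_PORT"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "peer $host not reachable after $attempts attempts" >&2
    return 1
}
```

If you save the function in the code directory, the boot command could source that file after start_sshd.sh, call wait_for_peer with each peer's domain name, and then run your custom command.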