
Running a Single-Node Single-PU Training Job on ModelArts Standard

Building and Debugging an Image Locally

In this section, a packaged conda environment is used to set up the runtime. Alternatively, you can install the dependencies manually using pip install or conda install.

  • The container image should be smaller than 15 GB. For details, see Constraints on Custom Images of the Training Framework.
  • Build the image based on the official open-source website, for example, the PyTorch official website.
  • Build the image in layers. Each layer must contain no more than 1 GB of data or 100,000 files. Put the layers that change least frequently first; for example, build the OS, CUDA, Python, PyTorch, and other dependency packages in that order. (You can check the layer sizes with the Docker commands shown after this list.)
  • If the training data and code change frequently, do not store them in the container image; otherwise you will have to rebuild the image every time they change.
  • Containers themselves provide the required isolation. You do not need to create additional conda environments inside a container.
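  You can check the overall image size and the size of each layer with standard Docker commands, for example:
  # run on terminal
  docker images ${your_image:tag}       # overall image size
  docker history ${your_image:tag}      # size of each image layer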
  1. Export the conda environment.
    1. Start a container from the local image:
      # run on terminal
      docker run -ti ${your_image:tag}
    2. Obtain pytorch.tar.gz:
      # run on container
      
      # Create a conda environment named pytorch based on the target base environment.
      conda create --name pytorch --clone base
      
      pip install conda-pack
      
      # Pack pytorch env to generate pytorch.tar.gz.
      conda pack -n pytorch -o pytorch.tar.gz
    3. Copy the package from the container to a local path.
      # run on terminal
      docker cp ${your_container_id}:/xxx/xxx/pytorch.tar.gz .
    4. Upload pytorch.tar.gz to OBS and set it to public read. During image creation, the Dockerfile downloads the package with wget, decompresses it, and then deletes it.
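      For example, you can upload the package and set it to public read with obsutil (a sketch; it assumes your obsutil version supports the -acl option and the bucket and folder placeholders match those used in the Dockerfile; alternatively, set the ACL on the OBS console):
      # run on terminal
      ./obsutil cp ./pytorch.tar.gz obs://${bucketname}/${folder_name}/pytorch.tar.gz -acl=public-read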
  2. Create an image.

    Use either the official Ubuntu 18.04 image or an NVIDIA CUDA image as the base image. Both are available on the Docker Hub official website.

    To create the image, do the following: install the required apt packages and drivers, configure the ma-user user, import the conda environment, and configure the notebook dependencies.

    • Creating images with a Dockerfile is recommended. A Dockerfile can be traced and archived, and the resulting image contains no redundant or residual content.
    • To reduce the final image size, delete intermediate files such as TAR packages when building each layer. For details about how to clear the cache, see conda clean.
  3. Refer to the following example.
    Dockerfile example:
    FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04
    
    USER root
    
    # section1: add user ma-user (uid 1000) and user group ma-group (gid 100). If uid 1000 or gid 100 already exists under a different name, the code below removes it first
    RUN default_user=$(getent passwd 1000 | awk -F ':' '{print $1}') || echo "uid: 1000 does not exist" && \
        default_group=$(getent group 100 | awk -F ':' '{print $1}') || echo "gid: 100 does not exist" && \
        if [ ! -z ${default_group} ] && [ ${default_group} != "ma-group" ]; then \
            groupdel -f ${default_group}; \
            groupadd -g 100 ma-group; \
        fi && \
        if [ -z ${default_group} ]; then \
            groupadd -g 100 ma-group; \
        fi && \
        if [ ! -z ${default_user} ] && [ ${default_user} != "ma-user" ]; then \
            userdel -r ${default_user}; \
            useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
            chmod -R 750 /home/ma-user; \
        fi && \
        if [ -z ${default_user} ]; then \
            useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
            chmod -R 750 /home/ma-user; \
        fi && \
        # set bash as default
        rm /bin/sh && ln -s /bin/bash /bin/sh
    
    # section2: config apt source and install tools needed.
    RUN sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
        sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
        apt-get update && \
        apt-get install -y ca-certificates curl ffmpeg git libgl1-mesa-glx libglib2.0-0 libibverbs-dev libjpeg-dev libpng-dev libsm6 libxext6 libxrender-dev ninja-build screen sudo vim wget zip && \
        apt-get clean  && \
        rm -rf /var/lib/apt/lists/*
    
    USER ma-user
    
    # section3: install miniconda and rebuild conda env
    RUN mkdir -p /home/ma-user/work/ && cd /home/ma-user/work/ && \
        wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
        chmod 777 Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
        bash Miniconda3-py37_4.12.0-Linux-x86_64.sh -bfp /home/ma-user/anaconda3 && \
        wget https://${bucketname}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz && \
        mkdir -p /home/ma-user/anaconda3/envs/pytorch && \
        tar -xzf pytorch.tar.gz -C /home/ma-user/anaconda3/envs/pytorch && \
        source /home/ma-user/anaconda3/envs/pytorch/bin/activate && conda-unpack && \
        /home/ma-user/anaconda3/bin/conda init bash && \
        rm -rf /home/ma-user/work/*
    
    ENV PATH=/home/ma-user/anaconda3/envs/pytorch/bin:$PATH
    
    # section4: settings of Jupyter Notebook for pytorch env
    RUN source /home/ma-user/anaconda3/envs/pytorch/bin/activate && \
        pip install ipykernel==6.7.0 --trusted-host https://repo.huaweicloud.com -i https://repo.huaweicloud.com/repository/pypi/simple && \
        ipython kernel install --user --env PATH /home/ma-user/anaconda3/envs/pytorch/bin:$PATH --name=pytorch && \
        rm -rf /home/ma-user/.local/share/jupyter/kernels/pytorch/logo-* && \
        rm -rf ~/.cache/pip/* && \
        echo 'export PATH=$PATH:/home/ma-user/.local/bin' >> /home/ma-user/.bashrc && \
        echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib64' >> /home/ma-user/.bashrc && \
        echo 'conda activate pytorch' >> /home/ma-user/.bashrc
    
    ENV DEFAULT_CONDA_ENV_NAME=pytorch

    Replace https://${bucketname}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz in the Dockerfile with the OBS path of the pytorch.tar.gz package generated in step 1 (the file must be set to public read).

    Go to the Dockerfile directory and run the following commands to create an image:

    # Use the cd command to navigate to the directory that contains the Dockerfile and run the build command.
    # docker build -t ${image_name}:${image_version} .
    # Example:
    docker build -t pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1 .
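
    Optionally, verify that the conda environment in the new image works before pushing it (a sketch using the example image name above; it assumes PyTorch is installed in the pytorch environment):

    # run on terminal
    docker run --rm -ti pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1 \
        /home/ma-user/anaconda3/envs/pytorch/bin/python -c "import torch; print(torch.__version__)"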
  4. Debug an image.

    After debugging, it is recommended that you incorporate the changes into the Dockerfile, rebuild the image through the formal build process, and test it again.

    1. Ensure that the corresponding script, code, and process are running properly on the Linux server.

      If it fails to run, debug it in the container first and then build the container image.

    2. Ensure that the image file is located correctly and that you have the required file permission.

      Before training, verify that the custom dependency package is normal, that the required packages are in the pip list, and that the Python used by the container is the correct one. (If you have multiple Pythons installed in the container image, you need to set the Python path environment variable.)
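      For example, run the following checks inside the container (a sketch; torch is only an example dependency):
      # run on container
      which python                      # confirm the interpreter comes from the expected conda environment
      pip list | grep -i torch          # confirm the custom dependency packages are installed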

    3. Test the training boot script.
      1. Data copy and verification

        Generally, the image does not contain training data and code. You need to copy the required files into the container after starting it. To avoid running out of disk space, store data, code, and intermediate data in the /cache directory. It is recommended that the Linux server have sufficient memory (more than 8 GB) and disk space (more than 100 GB).
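        You can check the available memory and disk space on the server with standard Linux commands, for example:
        # run on terminal
        free -h       # available memory
        df -h         # available disk space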

        The following command copies files between the Linux host and the container:

        docker cp data/ 39c9ceedb1f6:/cache/

        Once you have prepared the data, run the training script and verify that the training starts correctly. Generally, the boot script is as follows:

        cd /cache/code/ 
        python start_train.py

        To troubleshoot the training process, you can access the logs and errors in the container instance and adjust the code and environment variables accordingly.
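        Before starting a GPU training run, you can also confirm that the GPU is visible inside the container (a sketch; it assumes the container was started with GPU access enabled):
        # run on container
        python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"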

      2. Preset script testing

        The run.sh script is typically used to copy data and code from OBS to containers, and to copy output results from containers to OBS.

        You can edit and iterate the script in the container instance if the preset script does not produce the desired result.
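        A minimal run.sh sketch is shown below. The OBS paths and the use of obsutil are assumptions; replace them with the transfer tool and paths used in your environment.

        #!/bin/bash
        # Sketch of a preset boot script: copy inputs from OBS, train, copy outputs back to OBS.
        set -e
        # 1. Copy code and data from OBS to the local cache directory (example paths).
        ./obsutil cp obs://${your_obs_bucket}/demo/ /cache/demo/ -f -r
        # 2. Start training.
        cd /cache/demo
        python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/
        # 3. Copy the training output back to OBS.
        ./obsutil cp /cache/demo/model_best.pth.tar obs://${your_obs_bucket}/output/ -f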

      3. Dedicated pool scenario

        Mounting SFS in dedicated pools allows you to import code and data without worrying about OBS operations.

        You can either mount the SFS directory to the /mnt/sfs_turbo directory of the debugging node, or sync the directory content with the SFS disk.

        To start a container instance during debugging, use the -v parameter to mount a directory from the host machine to the container environment.

        docker run -ti -d -v /mnt/sfs_turbo:/sfs my_deeplearning_image:v1

        The command above mounts the /mnt/sfs_turbo directory of the host machine to the /sfs directory of the container. Any changes in the corresponding directories of the host machine and container are synchronized in real time.

    4. To locate faults, check the logs for the training image, and check the API response for the inference image.

      Run the following command to view all stdout logs output by the container:

      docker logs -f 39c9ceedb1f6

      For an inference image, some logs are stored inside the container, so you need to access the container to view them. Check the logs for errors both when the container starts and when the API is called.
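      To access the container and inspect the logs stored inside it, run for example:
      docker exec -ti 39c9ceedb1f6 bash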

    5. If the owner or user group of some files is inconsistent, run the following command as the root user on the host machine to correct it.
      docker exec -u root:root 39c9ceedb1f6 bash -c "chown -R ma-user:ma-user /cache"
    6. To fix an error found during debugging, modify it directly in the container instance, and then run the docker commit command to save the changes.
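      For example, using the container ID and image name from the earlier examples:
      # run on terminal
      docker commit 39c9ceedb1f6 my_deeplearning_image:v2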

Uploading an Image

Uploading an image through the client means running Docker commands on the machine where the container engine client is installed to push the image to an SWR image repository.

If your container engine client is an ECS or CCE node, you can push an image over two types of networks.

  • If your client and the image repository are in the same region, you can push an image over private networks.
  • If your client and the image repository are in different regions, you can push an image over public networks and the client needs to be bound to an EIP.
  • Each image layer uploaded through the client cannot be larger than 10 GB.
  • Your container engine client version must be 1.11.2 or later.
  1. Access SWR.
    1. Log in to the SWR console.
    2. Click Create Organization in the upper right corner and enter a custom organization name to create an organization. In subsequent commands, replace the example organization name deep-learning with your actual organization name.
    3. In the navigation pane on the left, choose Dashboard and click Generate Login Command in the upper right corner. On the displayed page, click the copy button to copy the login command.
      • The validity period of the generated login command is 24 hours. To obtain a long-term valid login command, see Obtaining a Long-Term Login or Image Push/Pull Command. After you obtain a long-term valid login command, your temporary login commands will still be valid as long as they are in their validity periods.
      • The domain name at the end of the login command is the image repository address. Record the address for later use.
    4. Run the login command on the machine where the container engine is installed.

      The message Login Succeeded will be displayed upon a successful login.

  2. Run the following command on the device where the container engine is installed to label the image:

    docker tag [image_name_1:tag_1] [image_repository_address]/[organization_name]/[image_name_2:tag_2]

    • [image_name_1:tag_1]: Replace it with the actual name and tag of the image to be uploaded.
    • [image_repository_address]: You can query the address on the SWR console, that is, the domain name at the end of the login command in 1.c.
    • [organization_name]: Replace it with the name of the organization created.
    • [image_name_2:tag_2]: Replace it with the desired image name and tag.

    Example:

    docker tag ${image_name}:${image_version} swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}
  3. Upload the image to the image repository.

    docker push [image_repository_address]/[organization_name]/[image_name_2:tag_2]

    Example:

    docker push swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}

    To view the pushed image, go to the SWR console and refresh the My Images page.

Uploading Data and Algorithms to OBS

  1. Prepare data.

    1. Download the animal dataset to the local PC and decompress it.
    2. Use obsutil to upload the dataset to an OBS bucket.
      ./obsutil cp ./dog_cat_1w obs://${your_obs_buck}/demo/ -f -r

      OBS supports multiple file upload modes. If the number of files is less than 100, you can upload them on the OBS console. If the number of files is greater than 100, you are advised to use OBS Browser+ (Windows) or obsutil (Linux). The preceding example uses obsutil. The expected dataset directory structure is illustrated below.
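      main.py (see the next step) loads the dataset with torchvision ImageFolder, which expects train and val subdirectories that each contain one folder per class. The layout below is an illustration only; the class folder names are assumptions based on the cat-and-dog dataset:

      dog_cat_1w/
      ├── train/
      │   ├── cat/    # training images of the "cat" class
      │   └── dog/    # training images of the "dog" class
      └── val/
          ├── cat/    # validation images of the "cat" class
          └── dog/    # validation images of the "dog" class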

  2. Prepare an algorithm.

    Upload the main.py file to the demo folder in the OBS bucket. The main.py file contains the following content:

    import argparse
    import os
    import random
    import shutil
    import time
    import warnings
    from enum import Enum
    import torch
    import torch.nn as nn
    import torch.nn.parallel
    import torch.backends.cudnn as cudnn
    import torch.distributed as dist
    import torch.optim
    from torch.optim.lr_scheduler import StepLR
    import torch.multiprocessing as mp
    import torch.utils.data
    import torch.utils.data.distributed
    import torchvision.transforms as transforms
    import torchvision.datasets as datasets
    import torchvision.models as models
    model_names = sorted(name for name in models.__dict__
                         if name.islower() and not name.startswith("__")
                         and callable(models.__dict__[name]))
    parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
    parser.add_argument('data', metavar='DIR', default='imagenet',
                        help='path to dataset (default: imagenet)')
    parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet18',
                        choices=model_names,
                        help='model architecture: ' +
                             ' | '.join(model_names) +
                             ' (default: resnet18)')
    parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                        help='number of data loading workers (default: 4)')
    parser.add_argument('--epochs', default=90, type=int, metavar='N',
                        help='number of total epochs to run')
    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                        help='manual epoch number (useful on restarts)')
    parser.add_argument('-b', '--batch-size', default=256, type=int,
                        metavar='N',
                        help='mini-batch size (default: 256), this is the total '
                             'batch size of all GPUs on the current node when '
                             'using Data Parallel or Distributed Data Parallel')
    parser.add_argument('--lr', '--learning-rate', default=0.1, type=float,
                        metavar='LR', help='initial learning rate', dest='lr')
    parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                        help='momentum')
    parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                        metavar='W', help='weight decay (default: 1e-4)',
                        dest='weight_decay')
    parser.add_argument('-p', '--print-freq', default=10, type=int,
                        metavar='N', help='print frequency (default: 10)')
    parser.add_argument('--resume', default='', type=str, metavar='PATH',
                        help='path to latest checkpoint (default: none)')
    parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                        help='evaluate model on validation set')
    parser.add_argument('--pretrained', dest='pretrained', action='store_true',
                        help='use pre-trained model')
    parser.add_argument('--world-size', default=-1, type=int,
                        help='number of nodes for distributed training')
    parser.add_argument('--rank', default=-1, type=int,
                        help='node rank for distributed training')
    parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str,
                        help='url used to set up distributed training')
    parser.add_argument('--dist-backend', default='nccl', type=str,
                        help='distributed backend')
    parser.add_argument('--seed', default=None, type=int,
                        help='seed for initializing training. ')
    parser.add_argument('--gpu', default=None, type=int,
                        help='GPU id to use.')
    parser.add_argument('--multiprocessing-distributed', action='store_true',
                        help='Use multi-processing distributed training to launch '
                             'N processes per node, which has N GPUs. This is the '
                             'fastest way to use PyTorch for either single node or '
                             'multi node data parallel training')
    best_acc1 = 0
    
    
    def main():
        args = parser.parse_args()
        if args.seed is not None:
            random.seed(args.seed)
            torch.manual_seed(args.seed)
            cudnn.deterministic = True
            warnings.warn('You have chosen to seed training. '
                          'This will turn on the CUDNN deterministic setting, '
                          'which can slow down your training considerably! '
                          'You may see unexpected behavior when restarting '
                          'from checkpoints.')
        if args.gpu is not None:
            warnings.warn('You have chosen a specific GPU. This will completely '
                          'disable data parallelism.')
        if args.dist_url == "env://" and args.world_size == -1:
            args.world_size = int(os.environ["WORLD_SIZE"])
        args.distributed = args.world_size > 1 or args.multiprocessing_distributed
        ngpus_per_node = torch.cuda.device_count()
        if args.multiprocessing_distributed:
            # Since we have ngpus_per_node processes per node, the total world_size
            # needs to be adjusted accordingly
            args.world_size = ngpus_per_node * args.world_size
            # Use torch.multiprocessing.spawn to launch distributed processes: the
            # main_worker process function
            mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
        else:
            # Simply call main_worker function
            main_worker(args.gpu, ngpus_per_node, args)
    def main_worker(gpu, ngpus_per_node, args):
        global best_acc1
        args.gpu = gpu
        if args.gpu is not None:
            print("Use GPU: {} for training".format(args.gpu))
        if args.distributed:
            if args.dist_url == "env://" and args.rank == -1:
                args.rank = int(os.environ["RANK"])
            if args.multiprocessing_distributed:
                # For multiprocessing distributed training, rank needs to be the
                # global rank among all the processes
                args.rank = args.rank * ngpus_per_node + gpu
            dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                    world_size=args.world_size, rank=args.rank)
        # create model
        if args.pretrained:
            print("=> using pre-trained model '{}'".format(args.arch))
            model = models.__dict__[args.arch](pretrained=True)
        else:
            print("=> creating model '{}'".format(args.arch))
            model = models.__dict__[args.arch]()
        if not torch.cuda.is_available():
            print('using CPU, this will be slow')
        elif args.distributed:
            # For multiprocessing distributed, DistributedDataParallel constructor
            # should always set the single device scope, otherwise,
            # DistributedDataParallel will use all available devices.
            if args.gpu is not None:
                torch.cuda.set_device(args.gpu)
                model.cuda(args.gpu)
                # When using a single GPU per process and per
                # DistributedDataParallel, we need to divide the batch size
                # ourselves based on the total number of GPUs of the current node.
                args.batch_size = int(args.batch_size / ngpus_per_node)
                args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
                model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
            else:
                model.cuda()
                # DistributedDataParallel will divide and allocate batch_size to all
                # available GPUs if device_ids are not set
                model = torch.nn.parallel.DistributedDataParallel(model)
        elif args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model = model.cuda(args.gpu)
        else:
            # DataParallel will divide and allocate batch_size to all available GPUs
            if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
                model.features = torch.nn.DataParallel(model.features)
                model.cuda()
            else:
                model = torch.nn.DataParallel(model).cuda()
        # define loss function (criterion), optimizer, and learning rate scheduler
        criterion = nn.CrossEntropyLoss().cuda(args.gpu)
        optimizer = torch.optim.SGD(model.parameters(), args.lr,
                                    momentum=args.momentum,
                                    weight_decay=args.weight_decay)
        """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
        scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
        # optionally resume from a checkpoint
        if args.resume:
            if os.path.isfile(args.resume):
                print("=> loading checkpoint '{}'".format(args.resume))
                if args.gpu is None:
                    checkpoint = torch.load(args.resume)
                else:
                    # Map model to be loaded to specified single gpu.
                    loc = 'cuda:{}'.format(args.gpu)
                    checkpoint = torch.load(args.resume, map_location=loc)
                args.start_epoch = checkpoint['epoch']
                best_acc1 = checkpoint['best_acc1']
                if args.gpu is not None:
                    # best_acc1 may be from a checkpoint from a different GPU
                    best_acc1 = best_acc1.to(args.gpu)
                model.load_state_dict(checkpoint['state_dict'])
                optimizer.load_state_dict(checkpoint['optimizer'])
                scheduler.load_state_dict(checkpoint['scheduler'])
                print("=> loaded checkpoint '{}' (epoch {})"
                      .format(args.resume, checkpoint['epoch']))
            else:
                print("=> no checkpoint found at '{}'".format(args.resume))
        cudnn.benchmark = True
    
        # Data loading code
        traindir = os.path.join(args.data, 'train')
        valdir = os.path.join(args.data, 'val')
        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
        train_dataset = datasets.ImageFolder(
            traindir,
            transforms.Compose([
                transforms.RandomResizedCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                normalize,
            ]))
        if args.distributed:
            train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
        else:
            train_sampler = None
    
        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
            num_workers=args.workers, pin_memory=True, sampler=train_sampler)
        val_loader = torch.utils.data.DataLoader(
            datasets.ImageFolder(valdir, transforms.Compose([
                transforms.Resize(256),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                normalize,
            ])),
            batch_size=args.batch_size, shuffle=False,
            num_workers=args.workers, pin_memory=True)
        if args.evaluate:
            validate(val_loader, model, criterion, args)
            return
    
        for epoch in range(args.start_epoch, args.epochs):
            if args.distributed:
                train_sampler.set_epoch(epoch)
            # train for one epoch
            train(train_loader, model, criterion, optimizer, epoch, args)
            # evaluate on validation set
            acc1 = validate(val_loader, model, criterion, args)
            scheduler.step()
            # remember best acc@1 and save checkpoint
            is_best = acc1 > best_acc1
            best_acc1 = max(acc1, best_acc1)
            if not args.multiprocessing_distributed or (args.multiprocessing_distributed
                                                        and args.rank % ngpus_per_node == 0):
                save_checkpoint({
                    'epoch': epoch + 1,
                    'arch': args.arch,
                    'state_dict': model.state_dict(),
                    'best_acc1': best_acc1,
                    'optimizer': optimizer.state_dict(),
                    'scheduler': scheduler.state_dict()
                }, is_best)
    def train(train_loader, model, criterion, optimizer, epoch, args):
        batch_time = AverageMeter('Time', ':6.3f')
        data_time = AverageMeter('Data', ':6.3f')
        losses = AverageMeter('Loss', ':.4e')
        top1 = AverageMeter('Acc@1', ':6.2f')
        top5 = AverageMeter('Acc@5', ':6.2f')
        progress = ProgressMeter(
            len(train_loader),
            [batch_time, data_time, losses, top1, top5],
            prefix="Epoch: [{}]".format(epoch))
        # switch to train mode
        model.train()
        end = time.time()
        for i, (images, target) in enumerate(train_loader):
            # measure data loading time
            data_time.update(time.time() - end)
            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)
            if torch.cuda.is_available():
                target = target.cuda(args.gpu, non_blocking=True)
            # compute output
            output = model(images)
            loss = criterion(output, target)
            # measure accuracy and record loss
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), images.size(0))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))
            # compute gradient and do SGD step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()
            if i % args.print_freq == 0:
                progress.display(i)
    def validate(val_loader, model, criterion, args):
        batch_time = AverageMeter('Time', ':6.3f', Summary.NONE)
        losses = AverageMeter('Loss', ':.4e', Summary.NONE)
        top1 = AverageMeter('Acc@1', ':6.2f', Summary.AVERAGE)
        top5 = AverageMeter('Acc@5', ':6.2f', Summary.AVERAGE)
        progress = ProgressMeter(
            len(val_loader),
            [batch_time, losses, top1, top5],
            prefix='Test: ')
        # switch to evaluate mode
        model.eval()
        with torch.no_grad():
            end = time.time()
            for i, (images, target) in enumerate(val_loader):
                if args.gpu is not None:
                    images = images.cuda(args.gpu, non_blocking=True)
                if torch.cuda.is_available():
                    target = target.cuda(args.gpu, non_blocking=True)
                # compute output
                output = model(images)
                loss = criterion(output, target)
                # measure accuracy and record loss
                acc1, acc5 = accuracy(output, target, topk=(1, 5))
                losses.update(loss.item(), images.size(0))
                top1.update(acc1[0], images.size(0))
                top5.update(acc5[0], images.size(0))
                # measure elapsed time
                batch_time.update(time.time() - end)
                end = time.time()
                if i % args.print_freq == 0:
                    progress.display(i)
            progress.display_summary()
        return top1.avg
    def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
        torch.save(state, filename)
        if is_best:
            shutil.copyfile(filename, 'model_best.pth.tar')
    class Summary(Enum):
        NONE = 0
        AVERAGE = 1
        SUM = 2
        COUNT = 3
    
    
    class AverageMeter(object):
        """Computes and stores the average and current value"""
    
        def __init__(self, name, fmt=':f', summary_type=Summary.AVERAGE):
            self.name = name
            self.fmt = fmt
            self.summary_type = summary_type
            self.reset()
    
        def reset(self):
            self.val = 0
            self.avg = 0
            self.sum = 0
            self.count = 0
    
        def update(self, val, n=1):
            self.val = val
            self.sum += val * n
            self.count += n
            self.avg = self.sum / self.count
        def __str__(self):
            fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
            return fmtstr.format(**self.__dict__)
        def summary(self):
            fmtstr = ''
            if self.summary_type is Summary.NONE:
                fmtstr = ''
            elif self.summary_type is Summary.AVERAGE:
                fmtstr = '{name} {avg:.3f}'
            elif self.summary_type is Summary.SUM:
                fmtstr = '{name} {sum:.3f}'
            elif self.summary_type is Summary.COUNT:
                fmtstr = '{name} {count:.3f}'
            else:
                raise ValueError('invalid summary type %r' % self.summary_type)
            return fmtstr.format(**self.__dict__)
    class ProgressMeter(object):
        def __init__(self, num_batches, meters, prefix=""):
            self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
            self.meters = meters
            self.prefix = prefix
        def display(self, batch):
            entries = [self.prefix + self.batch_fmtstr.format(batch)]
            entries += [str(meter) for meter in self.meters]
            print('\t'.join(entries))
        def display_summary(self):
            entries = [" *"]
            entries += [meter.summary() for meter in self.meters]
            print(' '.join(entries))
        def _get_batch_fmtstr(self, num_batches):
            num_digits = len(str(num_batches // 1))
            fmt = '{:' + str(num_digits) + 'd}'
            return '[' + fmt + '/' + fmt.format(num_batches) + ']'
    
    
    def accuracy(output, target, topk=(1,)):
        """Computes the accuracy over the k top predictions for the specified values of k"""
        with torch.no_grad():
            maxk = max(topk)
            batch_size = target.size(0)
            _, pred = output.topk(maxk, 1, True, True)
            pred = pred.t()
            correct = pred.eq(target.view(1, -1).expand_as(pred))
            res = []
            for k in topk:
                correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
                res.append(correct_k.mul_(100.0 / batch_size))
            return res
    if __name__ == '__main__':
        main()

Debugging Code with Notebook

  • Notebook billing is as follows:
    • A running notebook instance will be billed based on used resources. The fees vary depending on your selected resources. For details, see Product Pricing Details. When a notebook instance is not used, stop it.
    • If you select EVS for storage when creating a notebook instance, the EVS disk will be continuously billed. Stop and delete the notebook instance if it is not required.
  • When a notebook instance is created, auto stop is enabled by default. The notebook instance will automatically stop at the specified time.
  • Only running notebook instances can be accessed or stopped.
  • A maximum of 10 notebook instances can be created under one account.

Follow these steps:

  1. Register the image. Log in to the ModelArts console. In the navigation pane on the left, choose Image Management. Click Register. Set SWR Source to the image pushed to SWR: paste the complete SWR address or select a private image from SWR for registration, and add GPU to Type.
  2. Log in to the ModelArts console. In the navigation pane on the left, choose Development Workspace > Notebook.
  3. Click Create Notebook. On the displayed page, configure the parameters.
    1. Configure basic information of the notebook instance, including its name, description, and auto stop status. For details, see Table 1.
      Table 1 Basic parameters

      • Name: Name of the notebook instance, which can contain 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_).
      • Description: Brief description of the notebook instance.
      • Auto Stop: Automatically stops the notebook instance at a specified time. This function is enabled by default. The default value is 1 hour, indicating that the notebook instance automatically stops after running for 1 hour, and its resource billing stops then. The options are 1 hour, 2 hours, 4 hours, 6 hours, and Custom. You can select Custom to specify any integer from 1 to 24 hours.

    2. Select an image and configure resource specifications for the instance.
      • Image: In the Custom Images tab, select the uploaded custom image.
      • Resource Type: Select a created dedicated resource pool based on site requirements.
      • Instance Specifications: Select 1-GPU specifications.
      • Storage: Select EVS.

      To use VS Code to connect to a notebook instance for code debugging, enable Remote SSH and select a key pair. For details, see Connecting to a Notebook Instance Through VS Code.

  4. Click Next.
  5. Confirm the information and click Submit.

    Switch to the notebook instance list. The notebook instance is being created. It will take several minutes before its status changes to Running.

    If the created notebook instance fails to be started, refer to the key points for debugging described in Building and Debugging an Image Locally.

  6. In the notebook instance list, click the instance name. On the instance details page that is displayed, view the instance configuration.
  7. Mounting an OBS parallel file system: On the notebook instance details page, click the Storage tab, then click Mount Storage, and set mounting parameters.
    1. Set a local mounting directory. Enter a folder name in /data/, for example, demo. The system will automatically create the folder in /data/ of the notebook container to mount the OBS file system.
    2. Select the folder for storing the OBS parallel file system and click OK.
  8. View the mounting result on the notebook instance details page.
  9. Debug the code.

    Open your notebook instance, access a terminal, and go to the mounting directory set in Step 7.

    cd /data/demo

    Command for executing training:

    /home/ma-user/anaconda3/envs/pytorch/bin/python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/

    The alarm "RequestsDependencyWarning: urllib3 (1.26.8) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!" does not affect training and can be ignored.

    After debugging in a notebook instance, if the image is modified, you can save the image for subsequent training. For details, see Saving a Notebook Environment Image.

Creating a Single-Node Single-PU Training Job

For training using a dedicated pool, the mounted directory must be the same as that during debugging.

  1. Log in to the ModelArts console and check whether access authorization has been configured for your account. For details, see Configuring Agency Authorization for ModelArts with One Click. If you have been authorized using access keys, clear the authorization and configure agency authorization.
  2. In the navigation pane on the left, choose Model Training > Training Jobs. The training job list is displayed by default. Click Create Training Job.
  3. On the Create Training Job page, configure parameters and click Submit.
    • Algorithm Type: Custom algorithm
    • Boot Mode: Custom image
    • Image: custom image you have uploaded
    • Boot Command:
      cd ${MA_JOB_DIR}/demo && python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/

      demo (customizable) is the last-level directory of the OBS path.

    • Resource Pool: In the Dedicated Resource Pool tab, select a GPU dedicated resource pool.
    • Specifications: Select 1-GPU specifications.
  4. Click Submit. On the information confirmation page, check the parameters, and click OK.
  5. Wait until the training job is created.

    Once you submit the job creation request, the system handles tasks like downloading the container image and code directory, and executing the boot command in the backend. Training jobs take varying amounts of time, from tens of minutes to several hours, depending on the service logic and chosen resources.

Monitoring Resources

You can view the resource usage of compute nodes in the Resource Usages window. Data from at most the last three days can be displayed. When the window is opened, the data is loaded and then refreshed periodically.

Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list box to view its metrics.

Operation 2: Click a metric name listed in Table 2 (for example, cpuUsage, gpuUtil, or npuUtil) to show or hide the usage chart of that metric.

Operation 3: Hover over the graph to view the usage at the specific time.

Table 2 Parameters

  • cpuUsage: CPU usage
  • gpuMemUsage: GPU memory usage
  • gpuUtil: GPU usage
  • memUsage: Memory usage
  • npuMemUsage: NPU memory usage
  • npuUtil: NPU usage