Running a Single-Node Single-PU Training Job on ModelArts Standard
Process
- Preparations
- Purchasing Service Resources (OBS and SWR)
- Assigning Permissions
- Creating a Dedicated Resource Pool (The VPC does not need to be connected.)
- Installing and Configuring the OBS CLI
- (Optional) Configuring Workspaces
- Model training
Building and Debugging an Image Locally
In this section, the runtime environment is set up by packaging a conda environment. You can also install the dependencies manually using pip install or conda install.

- The container image should be smaller than 15 GB. For details, see Constraints on Custom Images of the Training Framework.
- Build an image through the official open-source website, for example, PyTorch.
- Build the container image layer by layer. Each layer should contain no more than 1 GB of data or 100,000 files. Put the layers that change least frequently first: for example, build the OS, CUDA driver, Python, PyTorch, and other dependency packages in that order.
- If the training data and code change frequently, do not store them in the container image; otherwise, you would have to rebuild the image for every change.
- Containers already provide isolation, so you do not need to create extra conda virtual environments inside a container.
- Export the conda environment.
- Start the offline container image:
# run on terminal
docker run -ti ${your_image:tag}
- Obtain pytorch.tar.gz:
# run on container
# Create a conda environment named pytorch based on the target base environment.
conda create --name pytorch --clone base
pip install conda-pack
# Pack pytorch env to generate pytorch.tar.gz.
conda pack -n pytorch -o pytorch.tar.gz
- Upload the package to a local path.
# run on terminal
docker cp ${your_container_id}:/xxx/xxx/pytorch.tar.gz .
- Upload pytorch.tar.gz to OBS and set it to public read. During image creation, the Dockerfile uses wget to download pytorch.tar.gz, decompresses it, and then deletes the package.
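Before building the image, you can optionally verify the packed environment locally. The following is a minimal sketch, assuming pytorch.tar.gz is in the current directory; the target path /tmp/pytorch_env is only an example.
# Optional local check of the packed conda environment (sketch; paths are examples).
mkdir -p /tmp/pytorch_env
tar -xzf pytorch.tar.gz -C /tmp/pytorch_env
source /tmp/pytorch_env/bin/activate
conda-unpack
python -c "import torch; print(torch.__version__)"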
- Create an image.
Use either the official Ubuntu 18.04 image or an NVIDIA image that includes the CUDA driver as the base image. Both are available on the Docker Hub official website.
To create the image, do the following: install the required apt packages and drivers, configure the ma-user user, import the conda environment, and configure the notebook dependencies.
- Creating images with a Dockerfile is recommended. This ensures Dockerfile traceability and archiving, as well as image content without redundancy or residue.
- To reduce the final image size, delete intermediate files such as TAR packages when building each layer. For details about how to clear the cache, see conda clean.
- Refer to the following example.
Dockerfile example:
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04

USER root

# section1: add user ma-user whose uid is 1000 and user group ma-group whose gid is 100. If there already exists 1000:100 but not ma-user:ma-group, below code will remove it
RUN default_user=$(getent passwd 1000 | awk -F ':' '{print $1}') || echo "uid: 1000 does not exist" && \
    default_group=$(getent group 100 | awk -F ':' '{print $1}') || echo "gid: 100 does not exist" && \
    if [ ! -z ${default_group} ] && [ ${default_group} != "ma-group" ]; then \
        groupdel -f ${default_group}; \
        groupadd -g 100 ma-group; \
    fi && \
    if [ -z ${default_group} ]; then \
        groupadd -g 100 ma-group; \
    fi && \
    if [ ! -z ${default_user} ] && [ ${default_user} != "ma-user" ]; then \
        userdel -r ${default_user}; \
        useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
        chmod -R 750 /home/ma-user; \
    fi && \
    if [ -z ${default_user} ]; then \
        useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user; \
        chmod -R 750 /home/ma-user; \
    fi && \
    # set bash as default
    rm /bin/sh && ln -s /bin/bash /bin/sh

# section2: config apt source and install tools needed.
RUN sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y ca-certificates curl ffmpeg git libgl1-mesa-glx libglib2.0-0 libibverbs-dev libjpeg-dev libpng-dev libsm6 libxext6 libxrender-dev ninja-build screen sudo vim wget zip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

USER ma-user

# section3: install miniconda and rebuild conda env
RUN mkdir -p /home/ma-user/work/ && cd /home/ma-user/work/ && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
    chmod 777 Miniconda3-py37_4.12.0-Linux-x86_64.sh && \
    bash Miniconda3-py37_4.12.0-Linux-x86_64.sh -bfp /home/ma-user/anaconda3 && \
    wget https://${bucketname}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz && \
    mkdir -p /home/ma-user/anaconda3/envs/pytorch && \
    tar -xzf pytorch.tar.gz -C /home/ma-user/anaconda3/envs/pytorch && \
    source /home/ma-user/anaconda3/envs/pytorch/bin/activate && conda-unpack && \
    /home/ma-user/anaconda3/bin/conda init bash && \
    rm -rf /home/ma-user/work/*

ENV PATH=/home/ma-user/anaconda3/envs/pytorch/bin:$PATH

# section4: settings of Jupyter Notebook for pytorch env
RUN source /home/ma-user/anaconda3/envs/pytorch/bin/activate && \
    pip install ipykernel==6.7.0 --trusted-host https://repo.huaweicloud.com -i https://repo.huaweicloud.com/repository/pypi/simple && \
    ipython kernel install --user --env PATH /home/ma-user/anaconda3/envs/pytorch/bin:$PATH --name=pytorch && \
    rm -rf /home/ma-user/.local/share/jupyter/kernels/pytorch/logo-* && \
    rm -rf ~/.cache/pip/* && \
    echo 'export PATH=$PATH:/home/ma-user/.local/bin' >> /home/ma-user/.bashrc && \
    echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib64' >> /home/ma-user/.bashrc && \
    echo 'conda activate pytorch' >> /home/ma-user/.bashrc

ENV DEFAULT_CONDA_ENV_NAME=pytorch
Replace https://${bucketname}.obs.cn-north-4.myhuaweicloud.com/${folder_name}/pytorch.tar.gz in the Dockerfile with the OBS path of pytorch.tar.gz uploaded in 1 (the file must be set to public read).
Go to the Dockerfile directory and run the following commands to create an image:
# Use the cd command to navigate to the directory that contains the Dockerfile and run the build command.
docker build -t ${image_name}:${image_version} .
# Example:
docker build -t pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1 .
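After the build completes, you can check the result against the 15 GB size constraint mentioned earlier. A minimal sketch, using the example image name above:
# Check that the built image exists and that its size stays well below the 15 GB limit.
docker images pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1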
- Debug an image.
After debugging, it is recommended that you incorporate the changes into the Dockerfile so that the formal image build includes them, and then test the image again.
- Ensure that the corresponding script, code, and process are running properly on the Linux server.
If they fail to run, debug them in the container first and then build the container image.
- Ensure that the image file is located correctly and that you have the required file permission.
Before training, verify that the custom dependency packages work, that the required packages appear in the pip list, and that the container uses the expected Python interpreter. (If multiple Python versions are installed in the container image, set the Python path environment variable accordingly.)
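One way to run these checks against a running container is shown below. This is only a sketch; the container ID placeholder and the torch package are examples, so adjust them to your image.
# Confirm the Python interpreter and key dependencies inside the container (sketch; IDs are placeholders).
docker exec -ti ${your_container_id} bash -c "which python && python --version"
docker exec -ti ${your_container_id} bash -c "pip list | grep -i torch"
docker exec -ti ${your_container_id} bash -c "python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'"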
- Test the training boot script.
- Data copy and verification
Generally, the image does not contain training data and code. You need to copy the required files to the image after starting the image. To avoid running out of disk space, store data, code, and intermediate data in the /cache directory. It is recommended that the Linux server have sufficient memory (more than 8 GB) and hard disk (more than 100 GB).
The following command enables file interaction between Docker and Linux:
docker cp data/ 39c9ceedb1f6:/cache/
Once you have prepared the data, run the training script and verify that the training starts correctly. Generally, the boot script is as follows:
cd /cache/code/
python start_train.py
To troubleshoot the training process, you can access the logs and errors in the container instance and adjust the code and environment variables accordingly.
- Preset script testing
The run.sh script is typically used to copy data and code from OBS to containers, and to copy output results from containers to OBS.
You can edit and iterate the script in the container instance if the preset script does not produce the desired result.
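For reference, a typical run.sh might look like the sketch below. It assumes obsutil is available in the container; the bucket name, paths, and boot script are placeholders taken from the examples in this section.
#!/bin/bash
# Sketch of a preset run.sh (placeholders only).
set -e
# Copy code and data from OBS into the container.
./obsutil cp obs://${your_obs_bucket}/demo/code/ /cache/code/ -f -r
./obsutil cp obs://${your_obs_bucket}/demo/dog_cat_1w/ /cache/data/ -f -r
# Start training.
cd /cache/code/
python start_train.py
# Copy the output back to OBS.
./obsutil cp /cache/output/ obs://${your_obs_bucket}/demo/output/ -f -r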
- Dedicated pool scenario
Mounting SFS in dedicated pools allows you to import code and data without worrying about OBS operations.
You can either mount the SFS directory to the /mnt/sfs_turbo directory of the debugging node, or sync the directory content with the SFS disk.
To start a container instance during debugging, use the -v parameter to mount a directory from the host machine to the container environment.
docker run -ti -d -v /mnt/sfs_turbo:/sfs my_deeplearning_image:v1
The command above mounts the /mnt/sfs_turbo directory of the host machine to the /sfs directory of the container. Any changes in the corresponding directories of the host machine and container are synchronized in real time.
- To locate faults, check the logs for the training image, and check the API response for the inference image.
Run the following command to view all stdout logs output by the container:
docker logs -f 39c9ceedb1f6
For an inference image, some logs are stored inside the container. You need to access the container to view them. Note: Check whether the logs contain errors, both when the container starts and when the API is called.
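One way to access a running container and inspect such logs is shown below. This is a sketch; the log path is only an example and depends on your image.
# Open a shell in the running container (the ID follows the example above).
docker exec -ti 39c9ceedb1f6 bash
# Inside the container, inspect the service log (example path; adjust to your image).
tail -n 200 /home/ma-user/log/app.log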
- If the owner or user group of some files is inconsistent, run the following command as the root user on the host machine to correct it.
docker exec -u root:root 39c9ceedb1f6 bash -c "chown -R ma-user:ma-user /cache"
- To fix an error found during debugging, modify the files directly in the container instance. You can then run the docker commit command to save the changes as a new image, as shown in the example below.
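A minimal sketch of saving a debugged container as a new image; the container ID and the new tag are examples.
# Save the modified container as a new image version (ID and tag are examples).
docker commit 39c9ceedb1f6 pytorch-1.13-cuda11.3-cudnn8-ubuntu18.04:v1-debug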
Uploading an Image
Uploading an image through the client means running Docker commands on the machine where the container engine client is installed to push the image to an image repository in SWR.
If your container engine client is an ECS or CCE node, you can push an image over two types of networks.
- If your client and the image repository are in the same region, you can push an image over private networks.
- If your client and the image repository are in different regions, you can push an image over public networks and the client needs to be bound to an EIP.

- Each image layer uploaded through the client cannot be larger than 10 GB.
- Your container engine client version must be 1.11.2 or later.
- Access SWR.
- Log in to the SWR console.
- Click Create Organization in the upper right corner and enter an organization name to create an organization. Customize the organization name. Replace the organization name deep-learning in subsequent commands with the actual organization name.
- In the navigation pane on the left, choose Dashboard and click Generate Login Command in the upper right corner. On the displayed page, click the copy icon to copy the login command.
- The validity period of the generated login command is 24 hours. To obtain a long-term valid login command, see Obtaining a Long-Term Login or Image Push/Pull Command. After you obtain a long-term valid login command, your temporary login commands will still be valid as long as they are in their validity periods.
- The domain name at the end of the login command is the image repository address. Record the address for later use.
- Run the login command on the machine where the container engine is installed.
The message Login Succeeded will be displayed upon a successful login.
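For reference, the generated temporary login command usually takes the following form. This is only a sketch with placeholder values; always copy the actual command from the SWR console.
# Structure of the generated login command (placeholders only; copy the real command from the SWR console).
docker login -u cn-north-4@xxxxxxxx -p xxxxxxxxxxxxxxxx swr.cn-north-4.myhuaweicloud.com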
- Run the following command on the device where the container engine is installed to label the image:
docker tag [image_name_1:tag_1] [image_repository_address]/[organization_name]/[image_name_2:tag_2]
- [image_name_1:tag_1]: Replace it with the actual name and tag of the image to be uploaded.
- [image_repository_address]: You can query the address on the SWR console, that is, the domain name at the end of the login command in 1.c.
- [organization_name]: Replace it with the name of the organization created.
- [image_name_2:tag_2]: Replace it with the desired image name and tag.
Example:
docker tag ${image_name}:${image_version} swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}
- Upload the image to the image repository.
docker push [image_repository_address]/[organization_name]/[image_name_2:tag_2]
Example:
docker push swr.cn-north-4.myhuaweicloud.com/${organization_name}/${image_name}:${image_version}
To view the pushed image, go to the SWR console and refresh the My Images page.
Uploading Data and Algorithms to OBS
- A parallel file system has been created on OBS. For details, see Creating a Parallel File System.
- obsutil has been installed and configured. For details, see Installing and Configuring the OBS CLI.
- Prepare data.
- Download the animal dataset to the local PC and decompress it.
- Use obsutil to upload the dataset to an OBS bucket.
./obsutil cp ./dog_cat_1w obs://${your_obs_bucket}/demo/ -f -r
OBS supports multiple upload modes. If there are fewer than 100 files, you can upload them on the OBS console. If there are more than 100 files, you are advised to use OBS Browser+ (Windows) or obsutil (Linux). The preceding example uses obsutil.
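To confirm the upload, you can list the target path with obsutil. A minimal sketch; the bucket name is a placeholder:
# List the uploaded dataset directory to confirm that the copy succeeded (bucket name is a placeholder).
./obsutil ls obs://${your_obs_bucket}/demo/dog_cat_1w/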
- Prepare an algorithm.
Upload the main.py file to the demo folder in the OBS bucket. The main.py file contains the following content:
import argparse import os import random import shutil import time import warnings from enum import Enum import torch import torch.nn as nn import torch.nn.parallel import torch.backends.cudnn as cudnn import torch.distributed as dist import torch.optim from torch.optim.lr_scheduler import StepLR import torch.multiprocessing as mp import torch.utils.data import torch.utils.data.distributed import torchvision.transforms as transforms import torchvision.datasets as datasets import torchvision.models as models model_names = sorted(name for name in models.__dict__ if name.islower() and not name.startswith("__") and callable(models.__dict__[name])) parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') parser.add_argument('data', metavar='DIR', default='imagenet', help='path to dataset (default: imagenet)') parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet18', choices=model_names, help='model architecture: ' + ' | '.join(model_names) + ' (default: resnet18)') parser.add_argument('-j', '--workers', default=4, type=int, metavar='N', help='number of data loading workers (default: 4)') parser.add_argument('--epochs', default=90, type=int, metavar='N', help='number of total epochs to run') parser.add_argument('--start-epoch', default=0, type=int, metavar='N', help='manual epoch number (useful on restarts)') parser.add_argument('-b', '--batch-size', default=256, type=int, metavar='N', help='mini-batch size (default: 256), this is the total ' 'batch size of all GPUs on the current node when ' 'using Data Parallel or Distributed Data Parallel') parser.add_argument('--lr', '--learning-rate', default=0.1, type=float, metavar='LR', help='initial learning rate', dest='lr') parser.add_argument('--momentum', default=0.9, type=float, metavar='M', help='momentum') parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float, metavar='W', help='weight decay (default: 1e-4)', dest='weight_decay') parser.add_argument('-p', '--print-freq', default=10, type=int, metavar='N', help='print frequency (default: 10)') parser.add_argument('--resume', default='', type=str, metavar='PATH', help='path to latest checkpoint (default: none)') parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true', help='evaluate model on validation set') parser.add_argument('--pretrained', dest='pretrained', action='store_true', help='use pre-trained model') parser.add_argument('--world-size', default=-1, type=int, help='number of nodes for distributed training') parser.add_argument('--rank', default=-1, type=int, help='node rank for distributed training') parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str, help='url used to set up distributed training') parser.add_argument('--dist-backend', default='nccl', type=str, help='distributed backend') parser.add_argument('--seed', default=None, type=int, help='seed for initializing training. ') parser.add_argument('--gpu', default=None, type=int, help='GPU id to use.') parser.add_argument('--multiprocessing-distributed', action='store_true', help='Use multi-processing distributed training to launch ' 'N processes per node, which has N GPUs. This is the ' 'fastest way to use PyTorch for either single node or ' 'multi node data parallel training') best_acc1 = 0 def main(): args = parser.parse_args() if args.seed is not None: random.seed(args.seed) torch.manual_seed(args.seed) cudnn.deterministic = True warnings.warn('You have chosen to seed training. 
' 'This will turn on the CUDNN deterministic setting, ' 'which can slow down your training considerably! ' 'You may see unexpected behavior when restarting ' 'from checkpoints.') if args.gpu is not None: warnings.warn('You have chosen a specific GPU. This will completely ' 'disable data parallelism.') if args.dist_url == "env://" and args.world_size == -1: args.world_size = int(os.environ["WORLD_SIZE"]) args.distributed = args.world_size > 1 or args.multiprocessing_distributed ngpus_per_node = torch.cuda.device_count() if args.multiprocessing_distributed: # Since we have ngpus_per_node processes per node, the total world_size # needs to be adjusted accordingly args.world_size = ngpus_per_node * args.world_size # Use torch.multiprocessing.spawn to launch distributed processes: the # main_worker process function mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args)) else: # Simply call main_worker function main_worker(args.gpu, ngpus_per_node, args) def main_worker(gpu, ngpus_per_node, args): global best_acc1 args.gpu = gpu if args.gpu is not None: print("Use GPU: {} for training".format(args.gpu)) if args.distributed: if args.dist_url == "env://" and args.rank == -1: args.rank = int(os.environ["RANK"]) if args.multiprocessing_distributed: # For multiprocessing distributed training, rank needs to be the # global rank among all the processes args.rank = args.rank * ngpus_per_node + gpu dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank) # create model if args.pretrained: print("=> using pre-trained model '{}'".format(args.arch)) model = models.__dict__[args.arch](pretrained=True) else: print("=> creating model '{}'".format(args.arch)) model = models.__dict__[args.arch]() if not torch.cuda.is_available(): print('using CPU, this will be slow') elif args.distributed: # For multiprocessing distributed, DistributedDataParallel constructor # should always set the single device scope, otherwise, # DistributedDataParallel will use all available devices. if args.gpu is not None: torch.cuda.set_device(args.gpu) model.cuda(args.gpu) # When using a single GPU per process and per # DistributedDataParallel, we need to divide the batch size # ourselves based on the total number of GPUs of the current node. 
args.batch_size = int(args.batch_size / ngpus_per_node) args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node) model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) else: model.cuda() # DistributedDataParallel will divide and allocate batch_size to all # available GPUs if device_ids are not set model = torch.nn.parallel.DistributedDataParallel(model) elif args.gpu is not None: torch.cuda.set_device(args.gpu) model = model.cuda(args.gpu) else: # DataParallel will divide and allocate batch_size to all available GPUs if args.arch.startswith('alexnet') or args.arch.startswith('vgg'): model.features = torch.nn.DataParallel(model.features) model.cuda() else: model = torch.nn.DataParallel(model).cuda() # define loss function (criterion), optimizer, and learning rate scheduler criterion = nn.CrossEntropyLoss().cuda(args.gpu) optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum, weight_decay=args.weight_decay) """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" scheduler = StepLR(optimizer, step_size=30, gamma=0.1) # optionally resume from a checkpoint if args.resume: if os.path.isfile(args.resume): print("=> loading checkpoint '{}'".format(args.resume)) if args.gpu is None: checkpoint = torch.load(args.resume) else: # Map model to be loaded to specified single gpu. loc = 'cuda:{}'.format(args.gpu) checkpoint = torch.load(args.resume, map_location=loc) args.start_epoch = checkpoint['epoch'] best_acc1 = checkpoint['best_acc1'] if args.gpu is not None: # best_acc1 may be from a checkpoint from a different GPU best_acc1 = best_acc1.to(args.gpu) model.load_state_dict(checkpoint['state_dict']) optimizer.load_state_dict(checkpoint['optimizer']) scheduler.load_state_dict(checkpoint['scheduler']) print("=> loaded checkpoint '{}' (epoch {})" .format(args.resume, checkpoint['epoch'])) else: print("=> no checkpoint found at '{}'".format(args.resume)) cudnn.benchmark = True # Data loading code traindir = os.path.join(args.data, 'train') valdir = os.path.join(args.data, 'val') normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) train_dataset = datasets.ImageFolder( traindir, transforms.Compose([ transforms.RandomResizedCrop(224), transforms.RandomHorizontalFlip(), transforms.ToTensor(), normalize, ])) if args.distributed: train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) else: train_sampler = None train_loader = torch.utils.data.DataLoader( train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None), num_workers=args.workers, pin_memory=True, sampler=train_sampler) val_loader = torch.utils.data.DataLoader( datasets.ImageFolder(valdir, transforms.Compose([ transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), normalize, ])), batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True) if args.evaluate: validate(val_loader, model, criterion, args) return for epoch in range(args.start_epoch, args.epochs): if args.distributed: train_sampler.set_epoch(epoch) # train for one epoch train(train_loader, model, criterion, optimizer, epoch, args) # evaluate on validation set acc1 = validate(val_loader, model, criterion, args) scheduler.step() # remember best acc@1 and save checkpoint is_best = acc1 > best_acc1 best_acc1 = max(acc1, best_acc1) if not args.multiprocessing_distributed or (args.multiprocessing_distributed and args.rank % ngpus_per_node == 0): save_checkpoint({ 'epoch': epoch + 1, 'arch': 
args.arch, 'state_dict': model.state_dict(), 'best_acc1': best_acc1, 'optimizer': optimizer.state_dict(), 'scheduler': scheduler.state_dict() }, is_best) def train(train_loader, model, criterion, optimizer, epoch, args): batch_time = AverageMeter('Time', ':6.3f') data_time = AverageMeter('Data', ':6.3f') losses = AverageMeter('Loss', ':.4e') top1 = AverageMeter('Acc@1', ':6.2f') top5 = AverageMeter('Acc@5', ':6.2f') progress = ProgressMeter( len(train_loader), [batch_time, data_time, losses, top1, top5], prefix="Epoch: [{}]".format(epoch)) # switch to train mode model.train() end = time.time() for i, (images, target) in enumerate(train_loader): # measure data loading time data_time.update(time.time() - end) if args.gpu is not None: images = images.cuda(args.gpu, non_blocking=True) if torch.cuda.is_available(): target = target.cuda(args.gpu, non_blocking=True) # compute output output = model(images) loss = criterion(output, target) # measure accuracy and record loss acc1, acc5 = accuracy(output, target, topk=(1, 5)) losses.update(loss.item(), images.size(0)) top1.update(acc1[0], images.size(0)) top5.update(acc5[0], images.size(0)) # compute gradient and do SGD step optimizer.zero_grad() loss.backward() optimizer.step() # measure elapsed time batch_time.update(time.time() - end) end = time.time() if i % args.print_freq == 0: progress.display(i) def validate(val_loader, model, criterion, args): batch_time = AverageMeter('Time', ':6.3f', Summary.NONE) losses = AverageMeter('Loss', ':.4e', Summary.NONE) top1 = AverageMeter('Acc@1', ':6.2f', Summary.AVERAGE) top5 = AverageMeter('Acc@5', ':6.2f', Summary.AVERAGE) progress = ProgressMeter( len(val_loader), [batch_time, losses, top1, top5], prefix='Test: ') # switch to evaluate mode model.eval() with torch.no_grad(): end = time.time() for i, (images, target) in enumerate(val_loader): if args.gpu is not None: images = images.cuda(args.gpu, non_blocking=True) if torch.cuda.is_available(): target = target.cuda(args.gpu, non_blocking=True) # compute output output = model(images) loss = criterion(output, target) # measure accuracy and record loss acc1, acc5 = accuracy(output, target, topk=(1, 5)) losses.update(loss.item(), images.size(0)) top1.update(acc1[0], images.size(0)) top5.update(acc5[0], images.size(0)) # measure elapsed time batch_time.update(time.time() - end) end = time.time() if i % args.print_freq == 0: progress.display(i) progress.display_summary() return top1.avg def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): torch.save(state, filename) if is_best: shutil.copyfile(filename, 'model_best.pth.tar') class Summary(Enum): NONE = 0 AVERAGE = 1 SUM = 2 COUNT = 3 class AverageMeter(object): """Computes and stores the average and current value""" def __init__(self, name, fmt=':f', summary_type=Summary.AVERAGE): self.name = name self.fmt = fmt self.summary_type = summary_type self.reset() def reset(self): self.val = 0 self.avg = 0 self.sum = 0 self.count = 0 def update(self, val, n=1): self.val = val self.sum += val * n self.count += n self.avg = self.sum / self.count def __str__(self): fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' return fmtstr.format(**self.__dict__) def summary(self): fmtstr = '' if self.summary_type is Summary.NONE: fmtstr = '' elif self.summary_type is Summary.AVERAGE: fmtstr = '{name} {avg:.3f}' elif self.summary_type is Summary.SUM: fmtstr = '{name} {sum:.3f}' elif self.summary_type is Summary.COUNT: fmtstr = '{name} {count:.3f}' else: raise ValueError('invalid summary type %r' % 
self.summary_type) return fmtstr.format(**self.__dict__) class ProgressMeter(object): def __init__(self, num_batches, meters, prefix=""): self.batch_fmtstr = self._get_batch_fmtstr(num_batches) self.meters = meters self.prefix = prefix def display(self, batch): entries = [self.prefix + self.batch_fmtstr.format(batch)] entries += [str(meter) for meter in self.meters] print('\t'.join(entries)) def display_summary(self): entries = [" *"] entries += [meter.summary() for meter in self.meters] print(' '.join(entries)) def _get_batch_fmtstr(self, num_batches): num_digits = len(str(num_batches // 1)) fmt = '{:' + str(num_digits) + 'd}' return '[' + fmt + '/' + fmt.format(num_batches) + ']' def accuracy(output, target, topk=(1,)): """Computes the accuracy over the k top predictions for the specified values of k""" with torch.no_grad(): maxk = max(topk) batch_size = target.size(0) _, pred = output.topk(maxk, 1, True, True) pred = pred.t() correct = pred.eq(target.view(1, -1).expand_as(pred)) res = [] for k in topk: correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True) res.append(correct_k.mul_(100.0 / batch_size)) return res if __name__ == '__main__': main()
Debugging Code with Notebook
- Notebook billing is as follows:
- A running notebook instance will be billed based on used resources. The fees vary depending on your selected resources. For details, see Product Pricing Details. When a notebook instance is not used, stop it.
- If you select EVS for storage when creating a notebook instance, the EVS disk will be continuously billed. Stop and delete the notebook instance if it is not required.
- When a notebook instance is created, auto stop is enabled by default. The notebook instance will automatically stop at the specified time.
- Only running notebook instances can be accessed or stopped.
- A maximum of 10 notebook instances can be created under one account.
Follow these steps:
- Register the image. Log in to the ModelArts console. In the navigation pane on the left, choose Image Management. Click Register. Set SWR Source to the image pushed to SWR: paste the complete SWR address, or click the selection icon to select a private image from SWR for registration. Add GPU to Type.
- Log in to the ModelArts console. In the navigation pane on the left, choose Development Workspace > Notebook.
- Click Create Notebook. On the displayed page, configure the parameters.
- Configure basic information of the notebook instance, including its name, description, and auto stop status. For details, see Table 1.
Table 1 Basic parameters

Parameter | Description
---|---
Name | Name of a notebook instance, which can contain 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_).
Description | Brief description of a notebook instance.
Auto Stop | Automatically stops the notebook instance at a specified time. This function is enabled by default. The default value is 1 hour, indicating that the notebook instance automatically stops after running for 1 hour and resource billing stops at that point. The options are 1 hour, 2 hours, 4 hours, 6 hours, and Custom. You can select Custom to specify any integer from 1 to 24 hours.
- Select an image and configure resource specifications for the instance.
- Image: In the Custom Images tab, select the uploaded custom image.
- Resource Type: Select a created dedicated resource pool based on site requirements.
- Instance Specifications: Select 1-GPU specifications.
- Storage: Select EVS.
To use VS Code to connect to a notebook instance for code debugging, enable Remote SSH and select a key pair. For details, see Connecting to a Notebook Instance Through VS Code.
- Click Next.
- Confirm the information and click Submit.
Switch to the notebook instance list. The notebook instance is being created. It will take several minutes before its status changes to Running.
If the created notebook instance fails to be started, refer to the key points for debugging described in Building and Debugging an Image Locally.
- In the notebook instance list, click the instance name. On the instance details page that is displayed, view the instance configuration.
- Mounting an OBS parallel file system: On the notebook instance details page, click the Storage tab, then click Mount Storage, and set mounting parameters.
- Set a local mounting directory. Enter a folder name in /data/, for example, demo. The system will automatically create the folder in /data/ of the notebook container to mount the OBS file system.
- Select the folder for storing the OBS parallel file system and click OK.
- View the mounting result on the notebook instance details page.
- Debug the code.
Open your notebook instance, access a terminal, and go to the mounting directory set in Step 7.
cd /data/demo
Command for executing training:
/home/ma-user/anaconda3/envs/pytorch/bin/python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/
The alarm "RequestsDependencyWarning: urllib3 (1.26.8) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!" does not affect training and can be ignored.
After debugging in a notebook instance, if the image is modified, you can save the image for subsequent training. For details, see Saving a Notebook Environment Image.
Creating a Single-Node Single-PU Training Job

For training using a dedicated pool, the mounted directory must be the same as that during debugging.
- Log in to the ModelArts console and check whether access authorization has been configured for your account. For details, see Configuring Agency Authorization for ModelArts with One Click. If you have been authorized using access keys, clear the authorization and configure agency authorization.
- In the navigation pane on the left, choose Model Training > Training Jobs. The training job list is displayed by default. Click Create Training Job.
- On the Create Training Job page, configure parameters and click Submit.
- Algorithm Type: Custom algorithm
- Boot Mode: Custom image
- Image: custom image you have uploaded
- Boot Command:
cd ${MA_JOB_DIR}/demo && python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/
demo (customizable) is the last-level directory of the OBS path.
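If the job cannot find the code or data, a slightly more verbose boot command can confirm what was downloaded into the container. This sketch only wraps the original command with echo and ls; all variables and paths come from the example above.
# Print the job directory and its contents before starting training (troubleshooting sketch).
echo "MA_JOB_DIR=${MA_JOB_DIR}" && ls -l ${MA_JOB_DIR}/demo && cd ${MA_JOB_DIR}/demo && python main.py -a resnet50 -b 128 --epochs 5 dog_cat_1w/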
- Resource Pool: In the Dedicated Resource Pool tab, select a GPU dedicated resource pool.
- Specifications: Select 1-GPU specifications.
- Click Submit. On the information confirmation page, check the parameters, and click OK.
- Wait until the training job is created.
Once you submit the job creation request, the system handles tasks like downloading the container image and code directory, and executing the boot command in the backend. Training jobs take varying amounts of time, from tens of minutes to several hours, depending on the service logic and chosen resources.
Monitoring Resources
You can view the resource usage of compute nodes in the Resource Usages window. Data from at most the last three days can be displayed. While the Resource Usages window is open, the data is loaded and refreshed periodically.
Operation 1: If a training job uses multiple compute nodes, choose a node from the drop-down list box to view its metrics.
Operation 2: Click cpuUsage, memUsage, npuMemUsage, or npuUtil to show or hide the usage chart of that metric.
Operation 3: Hover over the graph to view the usage at the specific time.
Parameter | Description
---|---
cpuUsage | CPU usage
gpuMemUsage | GPU memory usage
gpuUtil | GPU usage
memUsage | Memory usage
npuMemUsage | NPU memory usage
npuUtil | NPU usage