ModelArts
What's New
Function Overview
Service Overview
What Is ModelArts?
Functions
Basic Knowledge
Introduction to the AI Development Lifecycle
Basic Concepts of AI Development
Common Concepts of ModelArts
DevEnviron
Model Training
Model Deployment
Resource Pools
Related Services
How Do I Access ModelArts?
Preparations
Registering a HUAWEI CLOUD Account
Configuring Access Authorization (Global Configuration)
Creating an OBS Bucket
DevEnviron
Introduction to DevEnviron
Application Scenarios
Managing Notebook Instances
Creating a Notebook Instance
Accessing a Notebook Instance
Searching for, Starting, Stopping, or Deleting a Notebook Instance
Selecting Storage in DevEnviron
Changing a Notebook Instance Image
Dynamically Expanding EVS Disk Capacity
Changing the Flavor of a Notebook Instance
Modifying the SSH Configuration for a Notebook Instance
Viewing the Notebook Instances of All IAM Users Under One Tenant Account
JupyterLab
Operation Process in JupyterLab
JupyterLab Overview and Common Operations
Code Parametrization Plug-in
Using ModelArts SDK
Using the Git Plug-in
Uploading and Downloading Data in Notebook
Uploading Files to JupyterLab
Scenarios
Uploading Files from a Local Path to JupyterLab
Upload Scenarios and Entries
Uploading a Local File Less Than 100 MB to JupyterLab
Uploading a Local File with a Size Ranging from 100 MB to 5 GB to JupyterLab
Uploading a Local File Larger Than 5 GB to JupyterLab
Cloning an Open-Source Repository in GitHub
Uploading OBS Files to JupyterLab
Uploading Remote Files to JupyterLab
Downloading a File from JupyterLab to a Local Path
Local IDE
Operation Process in a Local IDE
Local IDE (PyCharm)
Connecting to a Notebook Instance Through PyCharm Toolkit
PyCharm Toolkit
Downloading and Installing PyCharm Toolkit
Connecting to a Notebook Instance Through PyCharm Toolkit
Manually Connecting to a Notebook Instance Through PyCharm
Submitting a Training Job Using PyCharm Toolkit
Submitting a Training Job (New Version)
Stopping a Training Job
Viewing Training Logs
Uploading Data to a Notebook Instance Using PyCharm
Local IDE (VS Code)
Connecting to a Notebook Instance Through VS Code
Installing VS Code
Connecting to a Notebook Instance Through VS Code with One Click
Connecting to a Notebook Instance Through VS Code Toolkit
Manually Connecting to a Notebook Instance Through VS Code
Remotely Debugging in VS Code
Uploading and Downloading Files in VS Code
Local IDE (Accessed Using SSH)
ModelArts CLI Command Reference
ModelArts CLI Overview
(Optional) Installing ma-cli Locally
Autocompletion for ma-cli Commands
ma-cli Authentication
ma-cli Image Building Command
ma-cli Image Building Command
Obtaining an Image Creation Template
Loading an Image Creation Template
Obtaining Registered ModelArts Images
Creating an Image in ModelArts Notebook
Obtaining Image Creation Caches in ModelArts Notebook
Clearing Image Creation Caches in ModelArts Notebook
Registering SWR Images with ModelArts Image Management
Deregistering a Registered Image from ModelArts Image Management
Debugging an SWR Image on an ECS
Using the ma-cli ma-job Command to Submit a ModelArts Training Job
ma-cli ma-job Command Overview
Obtaining ModelArts Training Jobs
Submitting a ModelArts Training Job
Obtaining ModelArts Training Job Logs
Obtaining ModelArts Training Job Events
Obtaining ModelArts AI Engines for Training
Obtaining ModelArts Resource Specifications for Training
Stopping a ModelArts Training Job
Using ma-cli to Copy OBS Data
Model Development
Introduction to Model Development
Preparing Data
Preparing Algorithms
Introduction to Algorithm Preparation
Using a Preset Image (Custom Script)
Overview
Developing a Custom Script
Creating an Algorithm
Using a Custom Image
Searching for an Algorithm
Deleting an Algorithm
Performing a Training
Creating a Training Job
Reviewing Training Job Details
Training Job Logs
Introduction to Training Job Logs
Common Logs
Viewing Training Job Logs
Locating Faults by Analyzing Training Logs
Viewing Training Job Events
Viewing the Resource Usage of a Training Job
Evaluation Results
Viewing Environment Variables of a Training Container
Stopping, Rebuilding, or Searching for a Training Job
CloudShell
Logging In to a Training Container Using Cloud Shell
Releasing Training Job Resources
Training Experiment
Introduction to Experiment
Adding a Training Job to an Experiment
Viewing an Experiment
Deleting an Experiment
Advanced Training Operations
Selecting a Training Mode
Automatic Recovery from a Training Fault
Training Fault Tolerance Check
Resumable Training and Incremental Training
Detecting Training Job Suspension
Permission to Set the Highest Job Priority
Visualized Model Training
Introduction to Training Job Visualization
MindInsight Visualization Jobs
TensorBoard Visualization Jobs
Distributed Training
Distributed Training
Single-Node Multi-Card Training Using DataParallel
Multi-Node Multi-Card Training Using DistributedDataParallel
Distributed Debugging Adaptation and Code Example
Sample Code of Distributed Training
Model Inference
Introduction to Inference
Managing AI Applications
Introduction to AI Application Management
Creating an AI Application
Importing a Meta Model from a Training Job
Importing a Meta Model from OBS
Importing a Meta Model from a Container Image
Viewing Details About an AI Application
Managing AI Applications
Viewing Events of an AI Application
Deploying an AI Application as a Service
Deploying AI Applications as Real-Time Services
Deploying as a Real-Time Service
Viewing Service Details
Testing the Deployed Service
Accessing Real-Time Services
Accessing a Real-Time Service
Authentication Mode
Access Authenticated Using a Token
Access Mode
Accessing a Real-Time Service (Public Network Channel)
Accessing a Real-Time Service (VPC Channel)
Accessing a Real-Time Service (VPC High-Speed Channel)
Maintaining Real-Time Services
Scaling
Overview
Manual Scaling
Auto Scaling
Deploying AI Applications as Batch Services
Deploying as a Batch Service
Viewing the Batch Service Prediction Result
Upgrading a Service
Starting, Stopping, Deleting, or Restarting a Service
Viewing Service Events
Inference Specifications
Model Package Specifications
Introduction to Model Package Specifications
Specifications for Editing a Model Configuration File
Specifications for Writing Model Inference Code
Examples of Custom Scripts
TensorFlow
PyTorch
XGBoost
PySpark
Scikit-learn
ModelArts Monitoring on Cloud Eye
ModelArts Metrics
Setting Alarm Rules
Viewing Monitoring Metrics
Docker Containers with ModelArts
Image Management
Using Custom Images in Notebook Instances
Registering an Image in ModelArts
Saving a Notebook Instance as a Custom Image
Saving a Notebook Environment Image
Using a Custom Image to Create a Notebook Instance
Creating and Using a Custom Image on a Notebook Instance
Application Scenarios and Process
Step 1 Creating a Custom Image
Step 2 Registering a New Image
Step 3 Using a New Image to Create a Development Environment
Using a Custom Image to Train Models (New-Version Training)
Overview
Preparing a Training Image
Specifications for Custom Images for Training Jobs
Migrating an Image to ModelArts Training
Using a Base Image to Create a Training Image
Creating an Algorithm Using a Custom Image
Using a Custom Image to Create a CPU- or GPU-based Training Job
Using a Custom Image to Create AI Applications for Inference Deployment
Custom Image Specifications for Creating AI Applications
Creating a Custom Image and Using It to Create an AI Application
FAQs
How Can I Log In to SWR and Upload Images to It?
How Do I Configure Environment Variables for an Image?
How Do I Use Docker to Start an Image Saved Using a Notebook Instance?
How Do I Configure a Conda Source in a Notebook Development Environment?
Resource Management
Resource Pool
Elastic Cluster
Comprehensive Upgrades to ModelArts Resource Pool Management Functions
Creating a Resource Pool
Viewing Details About a Resource Pool
Resizing a Resource Pool
Migrating the Workspace
Changing Job Types Supported by a Resource Pool
Upgrading a Resource Pool Driver
Deleting a Resource Pool
Abnormal Status of a Dedicated Resource Pool
ModelArts Network
Monitoring Resources
Viewing All ModelArts Monitoring Metrics on the AOM Console
SDK Reference
Before You Start
SDK Overview
Getting Started
(Optional) Installing the ModelArts SDK Locally
Session Authentication
(Optional) Session Authentication
Authentication Using the Username and Password
AK/SK-based Authentication
OBS Management
Overview of OBS Management
Transferring Files (Recommended)
Uploading a File to OBS
Uploading a Folder to OBS
Downloading a File from OBS
Downloading a Folder from OBS
Data Management
Managing Datasets
Querying a Dataset List
Creating a Dataset
Querying Details About a Dataset
Modifying a Dataset
Deleting a Dataset
Managing Dataset Versions
Obtaining a Dataset Version List
Creating a Dataset Version
Querying Details About a Dataset Version
Deleting a Dataset Version
Managing Samples
Querying a Sample List
Querying Details About a Sample
Deleting Samples in a Batch
Managing Dataset Import Tasks
Querying a Dataset Import Task List
Creating a Dataset Import Task
Querying the Status of a Dataset Import Task
Managing Export Tasks
Querying a Dataset Export Task List
Creating a Dataset Export Task
Querying the Status of a Dataset Export Task
Managing Manifest Files
Overview of Manifest Management
Parsing a Manifest File
Creating and Saving a Manifest File
Parsing a Pascal VOC File
Creating and Saving a Pascal VOC File
Managing Labeling Jobs
Creating a Labeling Job
Obtaining the Labeling Job List of a Dataset
Obtaining Details About a Labeling Job
Training Management
Training Jobs
Creating a Training Job
Debugging a Training Job
Using the SDK to Debug a Multi-Node Distributed Training Job
Using the SDK to Debug a Single-Node Training Job
Obtaining Training Jobs
Obtaining the Details About a Training Job
Modifying the Description of a Training Job
Deleting a Training Job
Terminating a Training Job
Obtaining Training Logs
Obtaining the Runtime Metrics of a Training Job
APIs for Resources and Engine Specifications
Obtaining Resource Flavors
Obtaining Engine Types
Model Management
Importing a Model
Obtaining Models
Obtaining Model Objects
Obtaining Details About a Model
Deleting a Model
Service Management
Service Management Overview
Deploying a Real-Time Service
Obtaining Details About a Service
Obtaining Services
Obtaining Service Objects
Updating Service Configurations
Obtaining Service Monitoring Information
Obtaining Service Logs
Deleting a Service
API Reference
Before You Start
Overview
API Calling
Endpoint
Constraints
Basic Concepts
API Overview
Calling APIs
Making an API Request
Authentication
Response
DevEnviron Management
Querying Notebook Instances
Creating a Notebook Instance
Querying Details of a Notebook Instance
Updating a Notebook Instance
Deleting a Notebook Instance
Saving a Running Instance as a Container Image
Obtaining the Available Flavors
Querying Flavors Available for a Notebook Instance
Querying the Available Duration of a Running Notebook Instance
Prolonging a Notebook Instance
Starting a Notebook Instance
Stopping a Notebook Instance
Obtaining the Notebook Instances with OBS Storage Mounted
OBS Storage Mounting
Obtaining Details About a Notebook Instance with OBS Storage Mounted
Unmounting OBS Storage from a Notebook Instance
Querying Supported Images
Registering a Custom Image
Obtaining User Image Groups
Obtaining Details of an Image
Deleting an Image
Training Management
Creating an Algorithm
Querying the Algorithm List
Querying Algorithm Details
Modifying an Algorithm
Deleting an Algorithm
Creating a Training Job
Querying the Details About a Training Job
Modifying the Description of a Training Job
Deleting a Training Job
Terminating a Training Job
Querying the Logs of a Specified Task in a Given Training Job (Preview)
Querying the Logs of a Specified Task in a Training Job (OBS Link)
Querying the Running Metrics of a Specified Task in a Training Job
Querying a Training Job List
Obtaining the General Specifications Supported by a Training Job
Obtaining the Preset AI Frameworks Supported by a Training Job
AI Application Management
Querying the AI Application List
Creating an AI Application
Obtaining Details About an AI Application
Deleting an AI Application
Service Management
Obtaining Service Monitoring
Obtaining Services
Deploying Services
Obtaining Supported Service Deployment Specifications
Obtaining Service Details
Updating Service Configurations
Deleting a Service
Obtaining Dedicated Resource Pools
Obtaining Service Event Logs
Obtaining Service Update Logs
Resource Management
Configuration Management
Querying OS Configuration Parameters
Quota Management
Obtaining OS Quotas
Event Management
Obtaining the Event List
Resource Pool Job Management
Obtaining Jobs in a Dedicated Resource Pool
Obtaining Statistics About Dedicated Resource Pool Jobs
Resource Metrics
Obtaining the Real-Time Resource Usage
Plug-in Template Management
Querying a Plug-in Template
Tag Management
Creating a Resource Pool Tag
Deleting Tags of a Resource Pool
Querying All Tags of a Resource Pool
Network Management
Creating Network Resources
Obtaining Network Resources
Obtaining a Network Resource
Deleting a Network Resource
Updating a Network Resource
Node Management
Obtaining Nodes
Deleting Nodes in Batches
Resource Pool Management
Creating Resource Pools
Obtaining Resource Pools
Obtaining a Resource Pool
Deleting a Resource Pool
Updating a Resource Pool
Monitoring a Resource Pool
Obtaining Resource Pool Statistics
Resource Specifications Management
Obtaining Resource Specifications
Authorization Management
Viewing an Authorization List
Configuring Authorization
Deleting Authorization
Creating a ModelArts Agency
Use Cases
Creating a Development Environment Instance
Using PyTorch to Create a Training Job (New-Version Training)
Managing ModelArts Authorization
Common Parameters
Status Code
Error Codes
Obtaining a Project ID and Name
Obtaining an Account Name and ID
Obtaining a Username and ID
FAQs
General Issues
What Is ModelArts?
What Are the Relationships Between ModelArts and Other Services?
What Are the Differences Between ModelArts and DLS?
How Do I Obtain an Access Key?
How Do I Upload Data to OBS?
What Do I Do If the System Displays a Message Indicating that the AK/SK Pair Is Unavailable?
How Do I Use ModelArts to Train Models Based on Structured Data?
What Are Regions and AZs?
How Do I Check Whether ModelArts and an OBS Bucket Are in the Same Region?
How Do I View All Files Stored in OBS on ModelArts?
Where Are Datasets of ModelArts Stored in a Container?
What Are the Functions of ModelArts Training and Inference?
Can AI-assisted Identification of ModelArts Identify a Specific Label?
Why Is the Job Still Queued When Resources Are Sufficient?
Notebook (New Version)
Constraints
Is sudo Privilege Escalation Supported?
Does ModelArts Support apt-get?
Is the Keras Engine Supported?
Does ModelArts Support the Caffe Engine?
Can I Install MoXing in a Local Environment?
Can Notebook Instances Be Remotely Logged In?
Data Upload or Download
How Do I Upload a File from a Notebook Instance to OBS or Download a File from OBS to a Notebook Instance?
How Do I Upload Local Files to a Notebook Instance?
How Do I Import Large Files to a Notebook Instance?
Where Will the Data Be Uploaded to?
How Do I Download Files from a Notebook Instance to a Local Computer?
How Do I Copy Data from Development Environment Notebook A to Notebook B?
Data Storage
How Do I Rename an OBS File?
Do Files in /cache Still Exist After a Notebook Instance Is Stopped or Restarted? How Do I Avoid a Restart?
How Do I Use the pandas Library to Process Data in OBS Buckets?
Environment Configurations
How Do I Check the CUDA Version Used by a Notebook Instance?
How Do I Enable the Terminal Function in DevEnviron of ModelArts?
How Do I Install External Libraries in a Notebook Instance?
How Do I Obtain the External IP Address of My Local PC?
How Can I Resolve Abnormal Font Display on a ModelArts Notebook Accessed from iOS?
Is There a Proxy for Notebook? How Do I Disable It?
Notebook Instances
What Do I Do If I Cannot Access My Notebook Instance?
What Should I Do When the System Displays an Error Message Indicating That No Space Is Left After I Run the pip install Command?
What Do I Do If "Read timed out" Is Displayed After I Run pip install?
What Do I Do If the Code Can Be Run But Cannot Be Saved, and the Error Message "save error" Is Displayed?
Code Execution
What Do I Do If a Notebook Instance Won't Run My Code?
Why Does the Instance Break Down When dead kernel Is Displayed During Training Code Running?
What Do I Do If cudaCheckError Occurs During Training?
What Should I Do If DevEnviron Prompts Insufficient Space?
Why Does the Notebook Instance Break Down When opencv.imshow Is Used?
Why Can't the Path of a Text File Generated in Windows OS Be Found in a Notebook Instance?
What Do I Do If Files Fail to Be Saved in JupyterLab?
Failures to Access the Development Environment Through VS Code
What Do I Do If the VS Code Window Is Not Displayed?
What Do I Do If a Remote Connection Failed After VS Code Is Opened?
What Do I Do If Error Message "Could not establish connection to xxx" Is Displayed During a Remote Connection?
What Do I Do If the Connection to a Remote Development Environment Remains in "Setting up SSH Host xxx: Downloading VS Code Server locally" State for More Than 10 Minutes?
What Do I Do If the Connection to a Remote Development Environment Remains in the State of "Setting up SSH Host xxx: Downloading VS Code Server locally" for More Than 10 Minutes?
What Do I Do If the Connection to a Remote Development Environment Remains in the State of "ModelArts Remote Connect: Connecting to instance xxx..." for More Than 10 Minutes?
What Do I Do If a Remote Connection Is in the Retry State?
What Do I Do If Error Message "The VS Code Server failed to start" Is Displayed?
What Do I Do If Error Message "Permissions for 'x:/xxx.pem' are too open" Is Displayed?
What Do I Do If Error Message "Bad owner or permissions on C:\Users\Administrator/.ssh/config" or "Connection permission denied (publickey)" Is Displayed?
What Do I Do If Error Message "ssh: connect to host xxx.pem port xxxxx: Connection refused" Is Displayed?
What Do I Do If Error Message "ssh: connect to host ModelArts-xxx port xxx: Connection timed out" Is Displayed?
What Do I Do If Error Message "Load key "C:/Users/xx/test1/xxx.pem": invalid format" Is Displayed?
What Do I Do If Error Message "An SSH installation couldn't be found" or "Could not establish connection to instance xxx: 'ssh' ..." Is Displayed?
What Do I Do If Error Message "no such identity: C:/Users/xx /test.pem: No such file or directory" Is Displayed?
What Do I Do If Error Message "Host key verification failed" or "Port forwarding is disabled" Is Displayed?
What Do I Do If Error Message "Failed to install the VS Code Server" or "tar: Error is not recoverable: exiting now" Is Displayed?
What Do I Do If Error Message "XHR failed" Is Displayed When a Remote Notebook Instance Is Accessed Through VS Code?
What Do I Do for an Automatically Disconnected VS Code Connection If No Operation Is Performed for a Long Time?
What Do I Do If It Takes a Long Time to Set Up a Remote Connection After VS Code Is Automatically Upgraded?
What Do I Do If Error Message "Connection reset" Is Displayed During an SSH Connection?
What Can I Do If a Notebook Instance Is Frequently Disconnected or Stuck After I Use MobaXterm to Connect to the Notebook Instance in SSH Mode?
Others
How Do I Use Multiple Ascend Cards for Debugging in a Notebook Instance?
Why Is the Training Speed Similar When Different Notebook Flavors Are Used?
How Do I Perform Incremental Training When Using MoXing?
How Do I View GPU Usage on the Notebook?
How Can I Obtain GPU Usage Through Code?
Which Real-Time Performance Indicators of an Ascend Chip Can I View?
What Are the Relationships Between Files Stored in JupyterLab, Terminal, and OBS?
How Do I Migrate Data from an Old-Version Notebook Instance to a New-Version One?
How Do I Use the Datasets Created on ModelArts in a Notebook Instance?
pip and Common Commands
What Are Sizes of the /cache Directories for Different Notebook Specifications in DevEnviron?
Training Jobs
Functional Consulting
What Are the Solutions to Underfitting?
What Are the Precautions for Switching Training Jobs from the Old Version to the New Version?
How Do I Obtain a Trained ModelArts Model?
What Is TensorBoard Used for in Model Visualization Jobs?
How Do I Obtain RANK_TABLE_FILE on ModelArts for Distributed Training?
How Do I Obtain the CUDA and cuDNN Versions of a Custom Image?
How Do I Obtain a MoXing Installation File?
In a Multi-Node Training, the TensorFlow PS Node Functioning as a Server Will Be Continuously Suspended. How Does ModelArts Determine Whether the Training Is Complete? Which Node Is a Worker?
How Do I Install MoXing for a Custom Image of a Training Job?
Reading Data During Training
How Do I Configure the Input and Output Data for Training Models on ModelArts?
How Do I Improve Training Efficiency While Reducing Interaction with OBS?
Why Is the Data Read Efficiency Low When a Large Number of Data Files Are Read During Training?
Compiling the Training Code
How Do I Create a Training Job When a Dependency Package Is Referenced by the Model to Be Trained?
What Is the Common File Path for Training Jobs?
How Do I Install a Library That C++ Depends on?
How Do I Check Whether a Folder Copy Is Complete During Job Training?
How Do I Load Some Well Trained Parameters During Job Training?
How Do I Obtain Training Job Parameters from the Boot File of the Training Job?
Why Can't I Use os.system ('cd xxx') to Access the Corresponding Folder During Job Training?
How Do I Invoke a Shell Script in a Training Job to Execute the .sh File?
How Do I Obtain the Dependency File Path to Be Used in Training Code?
What Is the File Path If a File in the model Directory Is Referenced in a Custom Python Package?
Creating a Training Job
What Can I Do If the Message "Object directory size/quantity exceeds the limit" Is Displayed When I Create a Training Job?
What Are Precautions for Setting Training Parameters?
What Are Sizes of the /cache Directories for Different Resource Specifications in the Training Environment?
Is the /cache Directory of a Training Job Secure?
Why Is a Training Job Always Queuing?
Managing Training Job Versions
Does a Training Job Support Scheduled or Periodic Calling?
Viewing Job Details
How Do I Check Resource Usage of a Training Job?
How Do I Access the Background of a Training Job?
Is There Any Conflict When Models of Two Training Jobs Are Saved in the Same Directory of a Container?
Only Three Valid Digits Are Retained in a Training Output Log. Can the Value of loss Be Changed?
Can a Trained Model Be Downloaded or Migrated to Another Account? How Do I Obtain the Download Path?
Service Deployment
Model Management
Importing Models
How Do I Import the .h5 Model of Keras to ModelArts?
How Do I Edit the Installation Package Dependency Parameters in a Model Configuration File When Importing a Model?
How Do I Change the Default Port to Create a Real-Time Service Using a Custom Image?
Does ModelArts Support Multi-Model Import?
Restrictions on the Size of an Image for Importing an AI Application
Service Deployment
Functional Consulting
What Types of Services Can Models Be Deployed as on ModelArts?
What Are the Differences Between Real-Time Services and Batch Services?
What Is the Maximum Size of a Prediction Request Body?
How Do I Select Compute Node Specifications for Deploying a Service?
What Is the CUDA Version for Deploying a Service on GPUs?
Real-Time Services
What Do I Do If a Conflict Occurs in the Python Dependency Package of a Custom Prediction Script When I Deploy a Real-Time Service?
What Is the Format of a Real-Time Service API?
Why Did My Service Deployment Fail with Proper Deployment Timeout Configured?
API/SDK
Can ModelArts APIs or SDKs Be Used to Download Models to a Local PC?
What Installation Environments Do ModelArts SDKs Support?
Does ModelArts Use the OBS API to Access OBS Files over an Intranet or the Internet?
How Do I Obtain a Job Resource Usage Curve After I Submit a Training Job by Calling an API?
How Do I View the Old-Version Dedicated Resource Pool List Using the SDK?
Using PyCharm Toolkit
What Should I Do If an Error Occurs During Toolkit Installation?
What Should I Do If an Error Occurs When I Edit a Credential in PyCharm Toolkit?
Why Can't I Start Training?
What Should I Do If Error "xxx isn't existed in train_version" Occurs When a Training Job Is Submitted?
What Should I Do If Error "Invalid OBS path" Occurs When a Training Job Is Submitted?
What Should I Do If an Error Occurs During Service Deployment?
How Do I View Error Logs of PyCharm Toolkit?
How Do I Use PyCharm Toolkit to Create Multiple Jobs for Simultaneous Training?
What Should I Do If "Error occurs when accessing to OBS" Is Displayed When PyCharm Toolkit Is Used?
Troubleshooting
General Issues
Incorrect OBS Path on ModelArts
DevEnviron
Environment Configuration Faults
Disk Space Used Up
An Error Is Reported When Conda Is Used to Install Keras 2.3.1 in Notebook
Instance Faults
What Do I Do If I Cannot Access My Notebook Instance?
What Should I Do When the System Displays an Error Message Indicating That No Space Is Left After I Run the pip install Command?
What Do I Do If the Code Can Be Run But Cannot Be Saved, and the Error Message "save error" Is Displayed?
ModelArts.6333 Error Occurs
What Can I Do If a Message Is Displayed Indicating that the Token Does Not Exist or Is Lost When I Open a Notebook Instance?
Code Running Failures
Error Occurs When Using a Notebook Instance to Run Code, Indicating That No File Is Found in /tmp
What Do I Do If a Notebook Instance Won't Run My Code?
Why Does the Instance Break Down When dead kernel Is Displayed During Training Code Running?
What Do I Do If cudaCheckError Occurs During Training?
What Do I Do If Insufficient Space Is Displayed in DevEnviron?
Why Does the Notebook Instance Break Down When opencv.imshow Is Used?
Why Can't the Path of a Text File Generated in Windows OS Be Found in a Notebook Instance?
What Do I Do If No Kernel Is Displayed After a Notebook File Is Created?
JupyterLab Plug-in Faults
What Do I Do If the Git Plug-in Password Is Invalid?
Image Saving Failures
What If the Error Message "there are processes in 'D' status, please check process status using'ps -aux' and kill all the 'D' status processes" or "Buildimge,False,Error response from daemon,Cannot pause container xxx" Is Displayed When I Save an Image?
What Do I Do If Error "container size %dG is greater than threshold %dG" Is Displayed When I Save an Image?
What Do I Do If Error "too many layers in your image" Is Displayed When I Save an Image?
What Do I Do If Error "The container size (xG) is greater than the threshold (25G)" Is Reported When I Save an Image?
Other Faults
Failed to Open the checkpoints Folder in Notebook
Failed to Use a Purchased Dedicated Resource Pool to Create New-Version Notebook Instances
Training Jobs
OBS Operation Issues
Failed to Correctly Read Files
Error Message Is Displayed Repeatedly When a TensorFlow-1.8 Job Is Connected to OBS
TensorFlow Stops Writing TensorBoard to OBS When the Size of Written Data Reaches 5 GB
Error "Unable to connect to endpoint" Occurs When a Model Is Saved
What Do I Do If Error Message "No such file or directory" Is Displayed in Training Job Logs?
Error Message "BrokenPipeError: Broken pipe" Displayed When OBS Data Is Copied
Error Message "ValueError: Invalid endpoint: obs.xxxx.com" Displayed in Logs
Error Message "errorMessage:The specified key does not exist" Displayed in Logs
In-Cloud Migration Adaptation Issues
Failed to Import a Module
Error Message "No module named .*" Displayed in Training Job Logs
Failed to Install a Third-Party Package
Failed to Download the Code Directory
Error Message "No such file or directory" Displayed in Training Job Logs
Failed to Find the .so File During Training
Failed to Parse Parameters and Log Error Occurs
Training Output Path Is Used by Another Job
Failed to Find the Boot File When a Training Job Is Created Using a Custom Image
Error Message "RuntimeError: std::exception" Displayed for a PyTorch 1.0 Engine
Error Message "retCode=0x91, [the model stream execute failed]" Displayed in MindSpore Logs
Error Occurred When Pandas Reads Data from an OBS File If MoXing Is Used to Adapt to an OBS Path
Error Message "Please upgrade numpy to >= xxx to use this pandas version" Displayed in Logs
Reinstalled CUDA Version Does Not Match the One in the Target Image
Error ModelArts.2763 Occurred During Training Job Creation
Error Message "AttributeError: module '***' has no attribute '***'" Displayed in Training Job Logs
System Container Exits Unexpectedly
Memory Limit Issues
Downloading Files Timed Out or No Space Left for Reading Data
Insufficient Container Space for Copying Data
Error Message "No space left" Displayed When a TensorFlow Multi-node Job Downloads Data to /cache
Size of the Log File Has Reached the Limit
Error Message "write line error" Displayed in Logs
Error Message "No space left on device" Displayed in Logs
Training Job Failed Due to OOM
Common Issues Related to Insufficient Disk Space and Solutions
Internet Access Issues
Error Message "Network is unreachable" Displayed in Logs
URL Connection Timed Out in a Running Training Job
Permission Issues
What Should I Do If Error "stat:403 reason:Forbidden" Is Displayed in Logs When a Training Job Accesses OBS?
Error Message "Permission denied" Displayed in Logs
GPU Issues
Error Message "No CUDA-capable device is detected" Displayed in Logs
Error Message "RuntimeError: connect() timed out" Displayed in Logs
Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs
Error Message "RuntimeError: Cannot re-initialize CUDA in forked subprocess" Displayed in Logs
No GPU Is Found for a Training Job
Service Code Issues
Error Message "pandas.errors.ParserError: Error tokenizing data. C error: Expected .* fields" Displayed in Logs
Error Message "max_pool2d_with_indices_out_cuda_frame failed with error code 0" Displayed in Logs
Training Job Failed with Error Code 139
Debugging Training Code in the Cloud Environment If a Training Job Failed
Error Message "'(slice(0, 13184, None), slice(None, None, None))' is an invalid key" Displayed in Logs
Error Message "DataFrame.dtypes for data must be int, float or bool" Displayed in Logs
Error Message "CUDNN_STATUS_NOT_SUPPORTED" Displayed in Logs
Error Message "Out of bounds nanosecond timestamp" Displayed in Logs
Error Message "Unexpected keyword argument passed to optimizer" Displayed in Logs
Error Message "no socket interface found" Displayed in Logs
Error Message "Runtimeerror: Dataloader worker (pid 46212) is killed by signal: Killed BP" Displayed in Logs
Error Message "AttributeError: 'NoneType' object has no attribute 'dtype'" Displayed in Logs
Error Message "No module name 'unidecode'" Displayed in Logs
Distributed TensorFlow Cannot Use tf.variable
When MXNet Creates kvstore, the Program Is Blocked and No Error Is Reported
ECC Error Occurs in the Log, Causing Training Job Failure
Training Job Failed Because the Maximum Recursion Depth Is Exceeded
Training Using a Built-in Algorithm Failed Due to a bndbox Error
Training Job Status Is Reviewing Job Initialization
Training Job Process Exits Unexpectedly
Stopped Training Job Process
Training Job Suspended
Data Replication Suspension
Suspension Before Training
Suspension During Training
Suspension in the Last Training Epoch
Training Jobs Created in a Dedicated Resource Pool
No Cloud Storage Name or Mount Path Displayed on the Page for Creating a Training Job
Training Performance Issues
Training Performance Deteriorated
Inference Deployment
AI Application Management
Creating an AI Application Failed
Failed to Build an Image or Import a File When an IAM User Creates an AI Application
Obtaining the Directory Structure in the Target Image When Importing an AI Application Through OBS
Failed to Obtain Certain Logs on the ModelArts Log Query Page
Failed to Download a pip Package When an AI Application Is Created Using OBS
Failed to Use a Custom Image to Create an AI Application
Insufficient Disk Space Is Displayed When a Service Is Deployed After an AI Application Is Imported
Error Occurred When a Created AI Application Is Deployed as a Service
Invalid Runtime Dependency Configured in an Imported Custom Image
Garbled Characters Displayed in an AI Application Name Returned When AI Application Details Are Obtained Through an API
The Model or Image Exceeded the Size Limit for AI Application Import
A Single Model File Exceeded the Size Limit (5 GB) for AI Application Import
Creating an AI Application Failed Due to Image Building Timeout
Service Deployment
Error Occurred When a Custom Image Model Is Deployed as a Real-Time Service
Alarm Status of a Deployed Real-Time Service
Failed to Start a Service
What Do I Do If an Image Fails to Be Pulled When a Service Is Deployed, Started, Upgraded, or Modified?
What Do I Do If an Image Restarts Repeatedly When a Service Is Deployed, Started, Upgraded, or Modified?
What Do I Do If a Container Health Check Fails When a Service Is Deployed, Started, Upgraded, or Modified?
What Do I Do If Resources Are Insufficient When a Service Is Deployed, Started, Upgraded, or Modified?
Error Occurred When a CV2 Model Package Is Used to Deploy a Real-Time Service
Service Is Consistently Being Deployed
A Started Service Is Intermittently in the Alarm State
Failed to Deploy a Service and Error "No Module named XXX" Occurred
Input/Output OBS Path of a Batch Service Is Unavailable or Lacks Sufficient Permissions
Service Prediction
Service Prediction Failed
Error "APIG.XXXX" Occurred in a Prediction Failure
Error ModelArts.4206 Occurred in Real-Time Service Prediction
Error ModelArts.4302 Occurred in Real-Time Service Prediction
Error ModelArts.4503 Occurred in Real-Time Service Prediction
Error MR.0105 Occurred in Real-Time Service Prediction
Method Not Allowed
Request Timed Out
Error Occurred When an API Is Called for Deploying a Model Created Using a Custom Image
MoXing
Error Occurs When MoXing Is Used to Copy Data
How Do I Disable the Warmup Function of MoXing?
PyTorch MoXing Logs Are Repeatedly Generated
Does moxing.tensorflow Contain the Entire TensorFlow? How Do I Perform Local Fine-Tuning on the Generated Checkpoint?
Copying Data Using MoXing Is Slow and the Log Is Repeatedly Printed in a Training Job
Failed to Access a Folder Using MoXing and Read the Folder Size Using get_size
APIs or SDKs
"ERROR: Could not install packages due to an OSError" Occurred During ModelArts SDK Installation
Error Occurred During Service Deployment After the Target Path to a File Downloaded Through a ModelArts SDK Is Set to a File Name
Best Practices
Permissions Management
Basic Concepts
Permission Management Mechanisms
IAM
Agencies and Dependencies
Workspace
Configuration Practices in Typical Scenarios
Assigning Permissions to Individual Users for Using ModelArts
Separately Assigning Permissions to Administrators and Developers
Viewing the Notebook Instances of All IAM Users Under One Tenant Account
Logging In to a Training Container Using Cloud Shell
Prohibiting a User from Using a Public Resource Pool
Model Development (Custom Algorithms in Training Jobs of the New Version)
Using a Custom Algorithm to Build a Handwritten Digit Recognition Model
Model Inference
Creating a Custom Image and Using It to Create an AI Application
End-to-End O&M of Inference Services
Creating an AI Application Using a Custom Engine
Using a Large Model to Create an AI Application and Deploying a Real-Time Service
High-Speed Access to Inference Services Through VPC Peering
Videos
Data Labeling
Introduction to Data Labeling
Manual Labeling
Creating a Labeling Job
Image Labeling
Image Classification
Object Detection
Text Labeling
Text Classification
Named Entity Recognition
Text Triplet
Audio Labeling
Sound Classification
Speech Labeling
Speech Paragraph Labeling
Viewing Labeling Jobs
Viewing My Created Labeling Jobs
Viewing My Participated Labeling Jobs
Data Preparation and Analytics
Introduction to Data Preparation
Getting Started
Creating a Dataset
Dataset Overview
Creating a Dataset
Modifying a Dataset
Importing Data
Introduction to Data Importing
Importing Data from OBS
Introduction to Importing Data from OBS
Importing Data from an OBS Path
Specifications for Importing Data from an OBS Directory
Importing a Manifest File
Specifications for Importing a Manifest File
Importing Data from Local Files
Data Analysis and Preview
Data Filtering
Data Feature Analysis
Labeling Data
Publishing Data
Introduction to Data Publishing
Publishing a Data Version
Managing Data Versions
Exporting Data
Introduction to Exporting Data
Exporting Data to a New Dataset
Exporting Data to OBS
Data Processing
Data Processing Overview
Description of Built-in Operators for Data Processing
Data Validation
Data Cleansing
Data Selection
Data Deduplication
Data Deredundancy
Data Augmentation
Data Augmentation
Data Generation
Data Transfer Between Domains
Tool Guide (Cloud Alliance Scenario)
PyCharm Toolkit
Preparations
Downloading and Installing PyCharm Toolkit
Creating Access Keys (AK and SK)
Using Access Keys for Login
Connecting to a Notebook Instance Through PyCharm Toolkit
PyCharm Toolkit (Latest Version)
Training a Model
Submitting a Training Job (New Version)
Stopping a Training Job
Viewing Training Logs
OBS-based Upload and Download
FAQs
What Should I Do If an Error Occurs During Toolkit Installation?
An Error Occurs When You Edit a Credential in PyCharm Toolkit
Why Can't I Start Training?
What Should I Do If Error "xxx isn't existed in train_version" Occurs When a Training Job Is Submitted?
What Should I Do If an Error Occurs When I Submit a Training Job?
What Should I Do If an Error Occurs During Service Deployment?
How Do I View Error Logs of PyCharm Toolkit?