ModelArts
What's New
Function Overview
Service Overview
What Is ModelArts?
Functions
Basic Knowledge
Introduction to the AI Development Lifecycle
Basic Concepts of AI Development
Common Concepts of ModelArts
DevEnviron
Model Training
Model Deployment
Resource Pools
Related Services
How Do I Access ModelArts?
ModelArts User Guide (Standard)
ModelArts Standard Preparations
Configuring Access Authorization for ModelArts Standard
Configuring Agency Authorization for ModelArts with One Click
Creating an IAM User and Granting ModelArts Permissions
Creating an OBS Bucket for ModelArts to Store Data
ModelArts Standard Resource Management
About ModelArts Standard Resource Pools
Creating a Standard Dedicated Resource Pool
Managing Standard Dedicated Resource Pools
Viewing Details About a Standard Dedicated Resource Pool
Resizing a Standard Dedicated Resource Pool
Upgrading the Standard Dedicated Resource Pool Driver
Modifying the Job Types Supported by a Standard Dedicated Resource Pool
Configuring the Standard Dedicated Resource Pool to Access the Internet
Releasing Standard Dedicated Resource Pools and Deleting the Network
Development Environments
Application Scenarios
Creating a Notebook Instance
Using a Notebook Instance for AI Development Through JupyterLab
Using JupyterLab to Develop and Debug Code Online
Common Functions of JupyterLab
Using Git to Clone the Code Repository in JupyterLab
Creating a Scheduled Job in JupyterLab
Uploading Files to JupyterLab
Uploading Files from a Local Path to JupyterLab
Cloning GitHub Open-Source Repository Files to JupyterLab
Uploading OBS Files to JupyterLab
Uploading Remote Files to JupyterLab
Downloading a File from JupyterLab to a Local PC
Using MindInsight Visualization Jobs in JupyterLab
Using TensorBoard Visualization Jobs in JupyterLab
Using Notebook Instances Remotely Through PyCharm
Connecting to a Notebook Instance Through PyCharm Toolkit
Manually Connecting to a Notebook Instance Through PyCharm
Uploading Data to a Notebook Instance Through PyCharm
Using Notebook Instances Remotely Through VS Code
Connecting to a Notebook Instance Through VS Code
Installing VS Code
Connecting to a Notebook Instance Through VS Code Toolkit
Manually Connecting to a Notebook Instance Through VS Code
Uploading and Downloading Files in VS Code
Using a Notebook Instance Remotely with SSH
Managing Notebook Instances
Searching for a Notebook Instance
Updating a Notebook Instance
Starting, Stopping, or Deleting a Notebook Instance
Saving a Notebook Instance
Dynamically Expanding EVS Disk Capacity
Dynamically Mounting an OBS Parallel File System
ModelArts CLI Command Reference
(Optional) Installing ma-cli Locally
ma-cli Authentication
Using MoXing Commands in a Notebook Instance
MoXing Framework Functions
Using MoXing in Notebook
Mapping Between mox.file and Local APIs and Switchover
Sample Code for Common Operations
Sample Code for Advanced MoXing Usage
Model Training
Model Training Process
Preparing Model Training Code
Starting a Preset Image's Boot File
Developing Code for Training Using a Preset Image
Developing Code for Training Using a Custom Image
Configuring Password-free SSH Mutual Trust Between Nodes for a Training Job Created Using a Custom Image
Preparing a Model Training Image
Creating a Debug Training Job
Using PyCharm Toolkit to Create and Debug a Training Job
Creating an Algorithm
Creating a Training Job
Distributed Model Training
Overview
Creating a Single-Node Multi-Card Distributed Training Job (DataParallel)
Creating a Multiple-Node Multi-Card Distributed Training Job (DistributedDataParallel)
Incremental Model Training
High Model Training Reliability
Training Job Fault Tolerance Check
Training Log Failure Analysis
Training Job Rescheduling
Resumable Training
Enabling Unconditional Auto Restart
Managing Model Training Jobs
Viewing Training Job Details
Viewing the Resource Usage of a Training Job
Viewing the Model Evaluation Result
Viewing Training Job Events
Viewing Training Job Logs
Priority of a Training Job
Using Cloud Shell to Debug a Production Training Job
Rebuilding, Stopping, or Deleting a Training Job
Managing Environment Variables of a Training Container
Inference Deployment
Overview
Creating a Model
Creation Methods
Importing a Meta Model from a Training Job
Importing a Meta Model from OBS
Importing a Meta Model from a Container Image
Model Creation Specifications
Model Package Structure
Specifications for Editing a Model Configuration File
Specifications for Writing a Model Inference Code File
Specifications for Using a Custom Engine to Create a Model
Examples of Custom Scripts
Deploying a Model as Real-Time Inference Jobs
Deploying and Using Real-Time Inference
Deploying a Model as a Real-Time Service
Authentication Methods for Accessing Real-time Services
Accessing a Real-Time Service Through Token-based Authentication
Accessing a Real-Time Service Through App Authentication
Accessing a Real-Time Service Through Different Channels
Accessing a Real-Time Service Through a Public Network
Accessing a Real-Time Service Through a VPC High-Speed Channel
Accessing a Real-Time Service Using Different Protocols
Accessing a Real-Time Service Using Server-Sent Events
Deploying a Model as a Batch Inference Service
Managing ModelArts Models
Viewing ModelArts Model Details
Viewing ModelArts Model Events
Managing ModelArts Model Versions
Managing a Synchronous Real-Time Service
Viewing Details About a Real-Time Service
Viewing Events of a Real-Time Service
Managing the Lifecycle of a Real-Time Service
Modifying a Real-Time Service
Viewing Performance Metrics of a Real-Time Service on Cloud Eye
Configuring Auto Restart upon a Real-Time Service Fault
Managing Batch Inference Jobs
Viewing Details About a Batch Service
Viewing Events of a Batch Service
Managing the Lifecycle of a Batch Service
Modifying a Batch Service
Image Management
Application Scenarios of Custom Images
Creating a Custom Image for a Notebook Instance
Creating a Custom Image
Creating a Custom Image Using the Image Saving Function
Creating a Custom Image for Model Training
Creating a Custom Training Image
Creating a Custom Training Image Using a Preset Image
Migrating Existing Images to ModelArts
Creating a Custom Training Image (PyTorch + Ascend)
Creating a Custom Training Image (MindSpore + Ascend)
Creating a Custom Image for Inference
Creating a Custom Image for a Model
Creating a Custom Image on ECS
Resource Monitoring
Overview
Viewing Monitoring Metrics on the ModelArts Console
Viewing All ModelArts Monitoring Metrics on the AOM Console
Using Grafana to View AOM Monitoring Metrics
Installing and Configuring Grafana
Installing and Configuring Grafana on Windows
Installing and Configuring Grafana on Linux
Installing and Configuring Grafana on a Notebook Instance
Configuring a Grafana Data Source
Configuring a Dashboard to View Metric Data
Best Practices
Permissions Management
Basic Concepts
Permission Management Mechanisms
IAM
Agencies and Dependencies
Workspace
Configuration Practices in Typical Scenarios
Assigning Permissions to Individual Users for Using ModelArts
Separately Assigning Permissions to Administrators and Developers
Viewing the Notebook Instances of All IAM Users Under One Tenant Account
Logging In to a Training Container Using Cloud Shell
Prohibiting a User from Using a Public Resource Pool
Model Development (Custom Algorithms in Training Jobs of the New Version)
Using a Custom Algorithm to Build a Handwritten Digit Recognition Model
Model Inference
Creating a Custom Image and Using It to Create an AI Application
End-to-End O&M of Inference Services
Creating an AI Application Using a Custom Engine
Using a Large Model to Create an AI Application and Deploying a Real-Time Service
High-Speed Access to Inference Services Through VPC Peering
API Reference
Before You Start
API Overview
Calling APIs
Making an API Request
Authentication
Response
Development Environment Management
Creating a Notebook Instance
Querying Notebook Instances
Querying Details of a Notebook Instance
Updating a Notebook Instance
Deleting a Notebook Instance
Saving a Running Instance as a Container Image
Obtaining the Available Flavors
Querying Flavors Available for a Notebook Instance
Querying the Available Duration of a Running Notebook Instance
Prolonging a Notebook Instance
Starting a Notebook Instance
Stopping a Notebook Instance
Obtaining the Notebook Instances with OBS Storage Mounted
OBS Storage Mounting
Obtaining Details About a Notebook Instance with OBS Storage Mounted
Unmounting OBS Storage from a Notebook Instance
Querying Supported Images
Registering a Custom Image
Obtaining User Image Groups
Obtaining Details of an Image
Deleting an Image
Training Management
Creating an Algorithm
Querying the Algorithm List
Querying Algorithm Details
Modifying an Algorithm
Deleting an Algorithm
Creating a Training Job
Querying the Details About a Training Job
Modifying the Description of a Training Job
Deleting a Training Job
Terminating a Training Job
Querying the Logs of a Specified Task in a Given Training Job (Preview)
Querying the Logs of a Specified Task in a Training Job (OBS Link)
Querying the Running Metrics of a Specified Task in a Training Job
Querying a Training Job List
Obtaining the General Specifications Supported by a Training Job
Obtaining the Preset AI Frameworks Supported by a Training Job
Authorization Management
Viewing an Authorization List
Configuring Authorization
Deleting Authorization
Creating a ModelArts Agency
Managing DevEnviron Instances
Querying All Notebook Instances
App Authentication Management
Creating an API
API Query
Querying APIs and Apps
Checking Whether an App Exists
Application Authentication Management
Querying the API Authentication Information of an Application
Resource Pool Job Management
Obtaining Jobs in a Dedicated Resource Pool
Obtaining Statistics About Dedicated Resource Pool Jobs
Configuration Management
Querying OS Configuration Parameters
Obtaining OS Quotas
Plug-in Template Management
Querying a Plug-in Template
Querying Plug-in Templates
Virtual Resource Flavors
Obtaining Virtual Resource Flavor Templates
Obtaining Virtual Resource Flavors
Node Management
Obtaining Nodes in a Resource Pool
Deleting Nodes in Batches
Updating Nodes in Batches
Locking Node Functions in Batches
Unlocking Node Functions in Batches
Binding Nodes to a Logical Subpool in Batches
Network Management
Creating Network Resources
Obtaining Network Resources
Obtaining a Network Resource
Deleting a Network Resource
Updating a Network Resource
Tag Management
Obtaining All Resource Tags of a Resource Pool
Obtaining Resource Tags of a Resource Pool
Modifying Resource Tags of a Resource Pool
Obtaining All Resource Tags of Nodes
Obtaining Resource Tags of Nodes
Modifying Resource Tags of Nodes
Event Management
Obtaining the Event List
Plug-in Management
Creating a Plug-in
Querying Plug-ins
Updating a Plug-in
Deleting a Plug-in
Querying Details of a Plug-in
Resource Specifications Management
Obtaining Resource Specifications
Order Management
Querying Orders
Resource Pool Management
Creating Resource Pools
Obtaining Resource Pools
Obtaining a Resource Pool
Deleting a Resource Pool
Updating a Resource Pool
Monitoring a Resource Pool
Obtaining Resource Pool Statistics
Retrying the Job Type That Fails to Be Enabled
Resource Metrics
Obtaining the Real-Time Resource Usage
Use Cases
Creating a Development Environment Instance
Using PyTorch to Create a Training Job (New-Version Training)
Managing ModelArts Authorization
Appendix
Status Code
Error Codes
Obtaining a Project ID and Name
Obtaining an Account Name and ID
Obtaining a Username and ID
Historical APIs
Data Management (Old Version)
Querying the Dataset List
Creating a Dataset
Querying Details About a Dataset
Modifying a Dataset
Deleting a Dataset
Obtaining Dataset Statistics
Querying the Monitoring Data of a Dataset
Querying the Dataset Version List
Creating a Dataset Labeling Version
Querying Details About a Dataset Version
Deleting a Dataset Labeling Version
Obtaining a Sample List
Adding Samples in Batches
Deleting Samples in Batches
Obtaining Details About a Sample
Obtaining Sample Search Condition
Obtaining a Sample List of a Team Labeling Task by Page
Obtaining Details About a Team Labeling Sample
Querying the Dataset Label List
Creating a Dataset Label
Modifying Labels in Batches
Deleting Labels in Batches
Updating a Label by Label Names
Deleting a Label and the Files that Only Contain the Label
Updating Sample Labels in Batches
Querying the Team Labeling Task List of a Dataset
Creating a Team Labeling Task
Querying Details About a Team Labeling Task
Starting a Team Labeling Task
Updating a Team Labeling Task
Deleting a Team Labeling Task
Creating a Team Labeling Acceptance Task
Querying the Report of a Team Labeling Acceptance Task
Updating Status of a Team Labeling Acceptance Task
Querying Details About Team Labeling Task Statistics
Querying Details About the Progress of a Team Labeling Task Member
Querying the Team Labeling Task List by a Team Member
Submitting Sample Review Comments of an Acceptance Task
Reviewing Team Labeling Results
Updating Labels of Team Labeling Samples in Batches
Querying the Labeling Team List
Creating a Labeling Team
Querying Details About a Labeling Team
Updating a Labeling Team
Deleting a Labeling Team
Sending an Email to a Labeling Team Member
Querying the List of All Labeling Team Members
Querying the List of Labeling Team Members
Creating a Labeling Team Member
Deleting Labeling Team Members in Batches
Querying Details About Labeling Team Members
Updating a Labeling Team Member
Deleting a Labeling Team Member
Querying the Dataset Import Task List
Creating an Import Task
Querying Details About a Dataset Import Task
Querying the Dataset Export Task List
Creating a Dataset Export Task
Querying the Status of a Dataset Export Task
Synchronizing a Dataset
Querying the Status of a Dataset Synchronization Task
DevEnviron (Old Version)
Creating a Development Environment Instance
Obtaining Development Environment Instances
Obtaining Details About a Development Environment Instance
Modifying the Description of a Development Environment Instance
Deleting a Development Environment Instance
Managing a Development Environment Instance
Training Management (Old Version)
Training Jobs
Creating a Training Job
Querying a Training Job List
Querying the Details About a Training Job Version
Deleting a Version of a Training Job
Obtaining Training Job Versions
Creating a Version of a Training Job
Stopping a Training Job
Modifying the Description of a Training Job
Deleting a Training Job
Obtaining the Name of a Training Job Log File
Querying a Built-in Algorithm
Querying Training Job Logs
Training Job Parameter Configuration
Creating a Training Job Configuration
Querying a List of Training Job Configurations
Modifying a Training Job Configuration
Deleting a Training Job Configuration
Querying the Details About a Training Job Configuration
Visualization Jobs
Creating a Visualization Job
Querying a Visualization Job List
Querying the Details About a Visualization Job
Modifying the Description of a Visualization Job
Deleting a Visualization Job
Stopping a Visualization Job
Restarting a Visualization Job
Resource and Engine Specifications
Querying Job Resource Specifications
Querying Job Engine Specifications
Job Statuses
SDK Reference
Before You Start
SDK Overview
Getting Started
(Optional) Installing ModelArts SDK Locally
Session Authentication
(Optional) Session Authentication
Authentication Using the Username and Password
AK/SK-based Authentication
OBS Management
Overview of OBS Management
Transferring Files (Recommended)
Uploading a File to OBS
Uploading a Folder to OBS
Downloading a File from OBS
Downloading a Folder from OBS
Data Management
Managing Datasets
Querying a Dataset List
Creating a Dataset
Querying Details About a Dataset
Modifying a Dataset
Deleting a Dataset
Managing Dataset Versions
Obtaining a Dataset Version List
Creating a Dataset Version
Querying Details About a Dataset Version
Deleting a Dataset Version
Managing Samples
Querying a Sample List
Querying Details About a Sample
Deleting Samples in a Batch
Managing Dataset Import Tasks
Querying a Dataset Import Task List
Creating a Dataset Import Task
Querying the Status of a Dataset Import Task
Managing Export Tasks
Querying a Dataset Export Task List
Creating a Dataset Export Task
Querying the Status of a Dataset Export Task
Managing Manifest Files
Overview of Manifest Management
Parsing a Manifest File
Creating and Saving a Manifest File
Parsing a Pascal VOC File
Creating and Saving a Pascal VOC File
Managing Labeling Jobs
Creating a Labeling Job
Obtaining the Labeling Job List of a Dataset
Obtaining Details About a Labeling Job
Training Management
Training Jobs
Creating a Training Job
Debugging a Training Job
Using the SDK to Debug a Single-Node Training Job
Using the SDK to Debug a Multi-Node Distributed Training Job
Obtaining Training Jobs
Obtaining the Details About a Training Job
Modifying the Description of a Training Job
Deleting a Training Job
Terminating a Training Job
Obtaining Training Logs
Obtaining the Runtime Metrics of a Training Job
APIs for Resources and Engine Specifications
Obtaining Resource Flavors
Obtaining Engine Types
Model Management
Importing a Model
Obtaining Models
Obtaining Model Objects
Obtaining Details About a Model
Deleting a Model
Service Management
Service Management Overview
Deploying a Real-Time Service
Obtaining Details About a Service
Obtaining Services
Obtaining Service Objects
Updating Service Configurations
Obtaining Service Monitoring Information
Obtaining Service Logs
Deleting a Service
FAQs
General Issues
What Is ModelArts?
What Are the Relationships Between ModelArts and Other Services?
What Are the Differences Between ModelArts and DLS?
How Do I Obtain an Access Key?
How Do I Upload Data to OBS?
What Do I Do If the System Displays a Message Indicating that the AK/SK Pair Is Unavailable?
How Do I Use ModelArts to Train Models Based on Structured Data?
What Are Regions and AZs?
How Do I Check Whether ModelArts and an OBS Bucket Are in the Same Region?
How Do I View All Files Stored in OBS on ModelArts?
Where Are Datasets of ModelArts Stored in a Container?
What Are the Functions of ModelArts Training and Inference?
Can AI-assisted Identification of ModelArts Identify a Specific Label?
Why Is the Job Still Queued When Resources Are Sufficient?
Notebook (New Version)
Constraints
Is sudo Privilege Escalation Supported?
Does ModelArts Support apt-get?
Is the Keras Engine Supported?
Does ModelArts Support the Caffe Engine?
Can I Install MoXing in a Local Environment?
Can Notebook Instances Be Remotely Logged In?
Data Upload or Download
How Do I Upload a File from a Notebook Instance to OBS or Download a File from OBS to a Notebook Instance?
How Do I Upload Local Files to a Notebook Instance?
How Do I Import Large Files to a Notebook Instance?
Where Will the Data Be Uploaded to?
How Do I Download Files from a Notebook Instance to a Local Computer?
How Do I Copy Data from Development Environment Notebook A to Notebook B?
Data Storage
How Do I Rename an OBS File?
Do Files in /cache Still Exist After a Notebook Instance Is Stopped or Restarted? How Do I Avoid a Restart?
How Do I Use the pandas Library to Process Data in OBS Buckets?
Environment Configurations
How Do I Check the CUDA Version Used by a Notebook Instance?
How Do I Enable the Terminal Function in DevEnviron of ModelArts?
How Do I Install External Libraries in a Notebook Instance?
How Do I Obtain the External IP Address of My Local PC?
How Can I Resolve Abnormal Font Display on a ModelArts Notebook Accessed from iOS?
Is There a Proxy for Notebook? How Do I Disable It?
Notebook Instances
What Do I Do If I Cannot Access My Notebook Instance?
What Should I Do When the System Displays an Error Message Indicating That No Space Is Left After I Run the pip install Command?
What Do I Do If "Read timed out" Is Displayed After I Run pip install?
What Do I Do If the Code Can Be Run But Cannot Be Saved, and the Error Message "save error" Is Displayed?
Code Execution
What Do I Do If a Notebook Instance Won't Run My Code?
Why Does the Instance Break Down When dead kernel Is Displayed During Training Code Running?
What Do I Do If cudaCheckError Occurs During Training?
What Should I Do If DevEnviron Prompts Insufficient Space?
Why Does the Notebook Instance Break Down When opencv.imshow Is Used?
Why Can't the Path of a Text File Generated in Windows Be Found in a Notebook Instance?
What Do I Do If Files Fail to Be Saved in JupyterLab?
Failures to Access the Development Environment Through VS Code
What Do I Do If the VS Code Window Is Not Displayed?
What Do I Do If a Remote Connection Failed After VS Code Is Opened?
What Do I Do If Error Message "Could not establish connection to xxx" Is Displayed During a Remote Connection?
What Do I Do If the Connection to a Remote Development Environment Remains in "Setting up SSH Host xxx: Downloading VS Code Server locally" State for More Than 10 Minutes?
What Do I Do If the Connection to a Remote Development Environment Remains in the State of "ModelArts Remote Connect: Connecting to instance xxx..." for More Than 10 Minutes?
What Do I Do If a Remote Connection Is in the Retry State?
What Do I Do If Error Message "The VS Code Server failed to start" Is Displayed?
What Do I Do If Error Message "Permissions for 'x:/xxx.pem' are too open" Is Displayed?
What Do I Do If Error Message "Bad owner or permissions on C:\Users\Administrator/.ssh/config" or "Connection permission denied (publickey)" Is Displayed?
What Do I Do If Error Message "ssh: connect to host xxx.pem port xxxxx: Connection refused" Is Displayed?
What Do I Do If Error Message "ssh: connect to host ModelArts-xxx port xxx: Connection timed out" Is Displayed?
What Do I Do If Error Message "Load key "C:/Users/xx/test1/xxx.pem": invalid format" Is Displayed?
What Do I Do If Error Message "An SSH installation couldn't be found" or "Could not establish connection to instance xxx: 'ssh' ..." Is Displayed?
What Do I Do If Error Message "no such identity: C:/Users/xx /test.pem: No such file or directory" Is Displayed?
What Do I Do If Error Message "Host key verification failed" or "Port forwarding is disabled" Is Displayed?
What Do I Do If Error Message "Failed to install the VS Code Server" or "tar: Error is not recoverable: exiting now" Is Displayed?
What Do I Do If Error Message "XHR failed" Is Displayed When a Remote Notebook Instance Is Accessed Through VS Code?
What Do I Do for an Automatically Disconnected VS Code Connection If No Operation Is Performed for a Long Time?
What Do I Do If It Takes a Long Time to Set Up a Remote Connection After VS Code Is Automatically Upgraded?
What Do I Do If Error Message "Connection reset" Is Displayed During an SSH Connection?
What Can I Do If a Notebook Instance Is Frequently Disconnected or Stuck After I Use MobaXterm to Connect to the Notebook Instance in SSH Mode?
Others
How Do I Use Multiple Ascend Cards for Debugging in a Notebook Instance?
Why Is the Training Speed Similar When Different Notebook Flavors Are Used?
How Do I Perform Incremental Training When Using MoXing?
How Do I View GPU Usage on the Notebook?
How Can I Obtain GPU Usage Through Code?
Which Real-Time Performance Indicators of an Ascend Chip Can I View?
What Are the Relationships Between Files Stored in JupyterLab, Terminal, and OBS?
How Do I Migrate Data from an Old-Version Notebook Instance to a New-Version One?
How Do I Use the Datasets Created on ModelArts in a Notebook Instance?
pip and Common Commands
What Are Sizes of the /cache Directories for Different Notebook Specifications in DevEnviron?
Training Jobs
Functional Consulting
What Are the Solutions to Underfitting?
What Are the Precautions for Switching Training Jobs from the Old Version to the New Version?
How Do I Obtain a Trained ModelArts Model?
What Is TensorBoard Used for in Model Visualization Jobs?
How Do I Obtain RANK_TABLE_FILE on ModelArts for Distributed Training?
How Do I Obtain the CUDA and cuDNN Versions of a Custom Image?
How Do I Obtain a MoXing Installation File?
In a Multi-Node Training, the TensorFlow PS Node Functioning as a Server Will Be Continuously Suspended. How Does ModelArts Determine Whether the Training Is Complete? Which Node Is a Worker?
How Do I Install MoXing for a Custom Image of a Training Job?
Reading Data During Training
How Do I Configure the Input and Output Data for Training Models on ModelArts?
How Do I Improve Training Efficiency While Reducing Interaction with OBS?
Why Is the Data Read Efficiency Low When a Large Number of Data Files Are Read During Training?
Compiling the Training Code
How Do I Create a Training Job When a Dependency Package Is Referenced by the Model to Be Trained?
What Is the Common File Path for Training Jobs?
How Do I Install a Library That C++ Depends on?
How Do I Check Whether a Folder Copy Is Complete During Job Training?
How Do I Load Some Well Trained Parameters During Job Training?
How Do I Obtain Training Job Parameters from the Boot File of the Training Job?
Why Can't I Use os.system ('cd xxx') to Access the Corresponding Folder During Job Training?
How Do I Invoke a Shell Script in a Training Job to Execute the .sh File?
How Do I Obtain the Dependency File Path to be Used in Training Code?
What Is the File Path If a File in the model Directory Is Referenced in a Custom Python Package?
Creating a Training Job
What Can I Do If the Message "Object directory size/quantity exceeds the limit" Is Displayed When I Create a Training Job?
What Are Precautions for Setting Training Parameters?
What Are Sizes of the /cache Directories for Different Resource Specifications in the Training Environment?
Is the /cache Directory of a Training Job Secure?
Why Is a Training Job Always Queuing?
Managing Training Job Versions
Does a Training Job Support Scheduled or Periodic Calling?
Viewing Job Details
How Do I Check Resource Usage of a Training Job?
How Do I Access the Background of a Training Job?
Is There Any Conflict When Models of Two Training Jobs Are Saved in the Same Directory of a Container?
Only Three Valid Digits Are Retained in a Training Output Log. Can the Value of loss Be Changed?
Can a Trained Model Be Downloaded or Migrated to Another Account? How Do I Obtain the Download Path?
Service Deployment
Model Management
Importing Models
How Do I Import the .h5 Model of Keras to ModelArts?
How Do I Edit the Installation Package Dependency Parameters in a Model Configuration File When Importing a Model?
How Do I Change the Default Port to Create a Real-Time Service Using a Custom Image?
Does ModelArts Support Multi-Model Import?
Restrictions on the Size of an Image for Importing an AI Application
Service Deployment
Functional Consulting
What Types of Services Can Models Be Deployed as on ModelArts?
What Are the Differences Between Real-Time Services and Batch Services?
What Is the Maximum Size of a Prediction Request Body?
How Do I Select Compute Node Specifications for Deploying a Service?
What Is the CUDA Version for Deploying a Service on GPUs?
Real-Time Services
What Do I Do If a Conflict Occurs in the Python Dependency Package of a Custom Prediction Script When I Deploy a Real-Time Service?
What Is the Format of a Real-Time Service API?
Why Did My Service Deployment Fail with Proper Deployment Timeout Configured?
API/SDK
Can ModelArts APIs or SDKs Be Used to Download Models to a Local PC?
What Installation Environments Do ModelArts SDKs Support?
Does ModelArts Use the OBS API to Access OBS Files over an Intranet or the Internet?
How Do I Obtain a Job Resource Usage Curve After I Submit a Training Job by Calling an API?
How Do I View the Old-Version Dedicated Resource Pool List Using the SDK?
Using PyCharm Toolkit
What Should I Do If an Error Occurs During Toolkit Installation?
What Should I Do If an Error Occurs When I Edit a Credential in PyCharm Toolkit?
Why Can't I Start Training?
What Should I Do If Error "xxx isn't existed in train_version" Occurs When a Training Job Is Submitted?
What Should I Do If Error "Invalid OBS path" Occurs When a Training Job Is Submitted?
What Should I Do If an Error Occurs During Service Deployment?
How Do I View Error Logs of PyCharm Toolkit?
How Do I Use PyCharm ToolKit to Create Multiple Jobs for Simultaneous Training?
What Should I Do If "Error occurs when accessing to OBS" Is Displayed When PyCharm ToolKit Is Used?
Troubleshooting
General Issues
Incorrect OBS Path on ModelArts
DevEnviron
Environment Configuration Faults
Disk Space Used Up
An Error Is Reported When Conda Is Used to Install Keras 2.3.1 in Notebook
Instance Faults
What Do I Do If I Cannot Access My Notebook Instance?
What Should I Do When the System Displays an Error Message Indicating That No Space Is Left After I Run the pip install Command?
What Do I Do If the Code Can Be Run But Cannot Be Saved, and the Error Message "save error" Is Displayed?
ModelArts.6333 Error Occurs
What Can I Do If a Message Is Displayed Indicating that the Token Does Not Exist or Is Lost When I Open a Notebook Instance?
Code Running Failures
Error Occurs When Using a Notebook Instance to Run Code, Indicating That No File Is Found in /tmp
What Do I Do If a Notebook Instance Won't Run My Code?
Why Does the Instance Break Down When dead kernel Is Displayed During Training Code Running?
What Do I Do If cudaCheckError Occurs During Training?
What Do I Do If Insufficient Space Is Displayed in DevEnviron?
Why Does the Notebook Instance Break Down When opencv.imshow Is Used?
Why Can't the Path of a Text File Generated in Windows Be Found in a Notebook Instance?
What Do I Do If No Kernel Is Displayed After a Notebook File Is Created?
JupyterLab Plug-in Faults
What Do I Do If the Git Plug-in Password Is Invalid?
Image Saving Failures
What If the Error Message "there are processes in 'D' status, please check process status using'ps -aux' and kill all the 'D' status processes" or "Buildimge,False,Error response from daemon,Cannot pause container xxx" Is Displayed When I Save an Image?
What Do I Do If Error "container size %dG is greater than threshold %dG" Is Displayed When I Save an Image?
What Do I Do If Error "too many layers in your image" Is Displayed When I Save an Image?
What Do I Do If Error "The container size (xG) is greater than the threshold (25G)" Is Reported When I Save an Image?
Other Faults
Failed to Open the checkpoints Folder in Notebook
Failed to Use a Purchased Dedicated Resource Pool to Create New-Version Notebook Instances
Training Jobs
OBS Operation Issues
Failed to Correctly Read Files
Error Message Is Displayed Repeatedly When a TensorFlow-1.8 Job Is Connected to OBS
TensorFlow Stops Writing TensorBoard to OBS When the Size of Written Data Reaches 5 GB
Error "Unable to connect to endpoint" Occurs When a Model Is Saved
What Do I Do If Error Message "No such file or directory" Is Displayed in Training Job Logs?
Error Message "BrokenPipeError: Broken pipe" Displayed When OBS Data Is Copied
Error Message "ValueError: Invalid endpoint: obs.xxxx.com" Displayed in Logs
Error Message "errorMessage:The specified key does not exist" Displayed in Logs
In-Cloud Migration Adaptation Issues
Failed to Import a Module
Error Message "No module named .*" Displayed in Training Job Logs
Failed to Install a Third-Party Package
Failed to Download the Code Directory
Error Message "No such file or directory" Displayed in Training Job Logs
Failed to Find the .so File During Training
Failed to Parse Parameters and Log Error Occurs
Training Output Path Is Used by Another Job
Failed to Find the Boot File When a Training Job Is Created Using a Custom Image
Error Message "RuntimeError: std:exception" Displayed for a PyTorch 1.0 Engine
Error Message "retCode=0x91, [the model stream execute failed]" Displayed in MindSpore Logs
Error Occurred When Pandas Reads Data from an OBS File If MoXing Is Used to Adapt to an OBS Path
Error Message "Please upgrade numpy to >= xxx to use this pandas version" Displayed in Logs
Reinstalled CUDA Version Does Not Match the One in the Target Image
Error ModelArts.2763 Occurred During Training Job Creation
Error Message "AttributeError: module '***' has no attribute '***'" Displayed in Training Job Logs
System Container Exits Unexpectedly
Memory Limit Issues
Downloading Files Timed Out or No Space Left for Reading Data
Insufficient Container Space for Copying Data
Error Message "No space left" Displayed When a TensorFlow Multi-node Job Downloads Data to /cache
Size of the Log File Has Reached the Limit
Error Message "write line error" Displayed in Logs
Error Message "No space left on device" Displayed in Logs
Training Job Failed Due to OOM
Common Issues Related to Insufficient Disk Space and Solutions
Internet Access Issues
Error Message "Network is unreachable" Displayed in Logs
URL Connection Timed Out in a Running Training Job
Permission Issues
What Should I Do If Error "stat:403 reason:Forbidden" Is Displayed in Logs When a Training Job Accesses OBS?
Error Message "Permission denied" Displayed in Logs
GPU Issues
Error Message "No CUDA-capable device is detected" Displayed in Logs
Error Message "RuntimeError: connect() timed out" Displayed in Logs
Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs
Error Message "RuntimeError: Cannot re-initialize CUDA in forked subprocess" Displayed in Logs
No GPU Is Found for a Training Job
Service Code Issues
Error Message "pandas.errors.ParserError: Error tokenizing data. C error: Expected .* fields" Displayed in Logs
Error Message "max_pool2d_with_indices_out_cuda_frame failed with error code 0" Displayed in Logs
Training Job Failed with Error Code 139
Error Message "'(slice(0, 13184, None), slice(None, None, None))' is an invalid key" Displayed in Logs
Error Message "DataFrame.dtypes for data must be int, float or bool" Displayed in Logs
Error Message "CUDNN_STATUS_NOT_SUPPORTED" Displayed in Logs
Error Message "Out of bounds nanosecond timestamp" Displayed in Logs
Error Message "Unexpected keyword argument passed to optimizer" Displayed in Logs
Error Message "no socket interface found" Displayed in Logs
Error Message "Runtimeerror: Dataloader worker (pid 46212) is killed by signal: Killed BP" Displayed in Logs
Error Message "AttributeError: 'NoneType' object has no attribute 'dtype'" Displayed in Logs
Error Message "No module name 'unidecode'" Displayed in Logs
Distributed TensorFlow Cannot Use tf.variable
When MXNet Creates kvstore, the Program Is Blocked and No Error Is Reported
ECC Error Occurs in the Log, Causing Training Job Failure
Training Job Failed Because the Maximum Recursion Depth Is Exceeded
Training Using a Built-in Algorithm Failed Due to a bndbox Error
Training Job Status Is Reviewing Job Initialization
Training Job Process Exits Unexpectedly
Stopped Training Job Process
Training Job Suspended
Locating Training Job Suspension
Data Replication Suspension
Suspension Before Training
Suspension During Training
Suspension in the Last Training Epoch
Training Jobs Created in a Dedicated Resource Pool
No Cloud Storage Name or Mount Path Displayed on the Page for Creating a Training Job
Training Performance Issues
Training Performance Deteriorated
Inference Deployment
AI Application Management
Creating an AI Application Failed
Failed to Build an Image or Import a File When an IAM User Creates an AI Application
Obtaining the Directory Structure in the Target Image When Importing an AI Application Through OBS
Failed to Obtain Certain Logs on the ModelArts Log Query Page
Failed to Download a pip Package When an AI Application Is Created Using OBS
Failed to Use a Custom Image to Create an AI Application
Insufficient Disk Space Is Displayed When a Service Is Deployed After an AI Application Is Imported
Error Occurred When a Created AI Application Is Deployed as a Service
Invalid Runtime Dependency Configured in an Imported Custom Image
Garbled Characters Displayed in an AI Application Name Returned When AI Application Details Are Obtained Through an API
The Model or Image Exceeded the Size Limit for AI Application Import
A Single Model File Exceeded the Size Limit (5 GB) for AI Application Import
Creating an AI Application Failed Due to Image Building Timeout
Service Deployment
Error Occurred When a Custom Image Model Is Deployed as a Real-Time Service
Alarm Status of a Deployed Real-Time Service
Failed to Start a Service
What Do I Do If an Image Fails to Be Pulled When a Service Is Deployed, Started, Upgraded, or Modified?
What Do I Do If an Image Restarts Repeatedly When a Service Is Deployed, Started, Upgraded, or Modified?
What Do I Do If a Container Health Check Fails When a Service Is Deployed, Started, Upgraded, or Modified?
What Do I Do If Resources Are Insufficient When a Service Is Deployed, Started, Upgraded, or Modified?
Error Occurred When a CV2 Model Package Is Used to Deploy a Real-Time Service
Service Is Consistently Being Deployed
A Started Service Is Intermittently in the Alarm State
Failed to Deploy a Service and Error "No Module named XXX" Occurred
Insufficient Permission to or Unavailable Input/Output OBS Path of a Batch Service
Service Prediction
Service Prediction Failed
Error "APIG.XXXX" Occurred in a Prediction Failure
Error ModelArts.4206 Occurred in Real-Time Service Prediction
Error ModelArts.4302 Occurred in Real-Time Service Prediction
Error ModelArts.4503 Occurred in Real-Time Service Prediction
Error MR.0105 Occurred in Real-Time Service Prediction
Method Not Allowed
Request Timed Out
Error Occurred When an API Is Called for Deploying a Model Created Using a Custom Image
MoXing
Error Occurs When MoXing Is Used to Copy Data
How Do I Disable the Warmup Function of MoXing?
PyTorch MoXing Logs Are Repeatedly Generated
Does moxing.tensorflow Contain the Entire TensorFlow? How Do I Perform Local Fine-Tuning on the Generated Checkpoint?
Copying Data Using MoXing Is Slow and the Log Is Repeatedly Printed in a Training Job
Failed to Access a Folder Using MoXing and Read the Folder Size Using get_size
APIs or SDKs
"ERROR: Could not install packages due to an OSError" Occurred During ModelArts SDK Installation
Error Occurred During Service Deployment After the Target Path to a File Downloaded Through a ModelArts SDK Is Set to a File Name
Videos