Submitting a DLI Job Using a Notebook Instance

Updated on 2025-02-14 GMT+08:00

Notebook is an interactive data analysis and mining module that has been deeply optimized based on the open-source JupyterLab. It provides online development and debugging capabilities for writing and debugging model training code. After connecting DLI to a notebook instance, you can write code and develop jobs using Notebook's web-based interactive development environment, as well as flexibly perform data analysis and exploration. This section describes how to submit a DLI job using a notebook instance.

For how to perform operations on Jupyter Notebook, see Jupyter Notebook Documentation.

Use notebook instances to submit DLI jobs in scenarios involving online development and debugging. You can perform data analysis and exploration seamlessly, without the need to set up a development environment.

Notes

  • This function is currently available only to whitelisted users. To use it, submit a service ticket by choosing Service Tickets > Create Service Ticket in the upper right corner of the management console.
  • Deleting an elastic resource pool on the DLI management console will not delete the associated notebook instances. If you no longer need the notebook instances, log in to the ModelArts management console to delete them.

Procedure

  1. Create an elastic resource pool and create general-purpose queues within it.

    To create a notebook instance on DLI, first create an elastic resource pool and a general-purpose queue within the pool. The queue provides the compute resources required to run DLI jobs. See Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.

  2. Create a VPC and security group.

    The VPC and security group provide the network environment required by the notebook instance in the elastic resource pool. See Step 2: Create a VPC and Security Group.

  3. Create an enhanced datasource connection, which will be used to connect the DLI elastic resource pool to a notebook instance.

    See Step 3: Create an Enhanced Datasource Connection.

  4. Prepare a custom image.

    See Step 4: Register a ModelArts Custom Image.

  5. Create a custom agency, which will be used to access a notebook instance.

    See Step 5: Create a DLI Custom Agency.

  6. Create a notebook instance in the DLI elastic resource pool.

    See Step 6: Create a Notebook Instance in the DLI Elastic Resource Pool.

  7. Configure the notebook instance to access DLI or LakeFormation metadata. For details, see Step 7: Configure the Notebook Instance to Access DLI Metadata.
  8. Write and debug code in JupyterLab.

    On the JupyterLab home page, you can edit and debug code in the Notebook area. See Step 8: Use the Notebook Instance to Write and Debug Code.

Notes and Constraints

  • To submit a DLI job using a notebook instance, you must have a general-purpose queue within an elastic resource pool.
  • Each elastic resource pool is associated with a unique notebook instance.
  • Temporary data generated during the running of notebook jobs is stored in DLI job buckets in a parallel file system.
  • Manage notebook instances on the ModelArts management console. For details, see Managing Notebook Instances.
  • Notebook instances are used for code editing and development, and associated queues are used for job execution.

    To change the queue associated with a notebook instance, perform related operations on the ModelArts management console.

Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It

  1. Create an elastic resource pool.
    1. Log in to the DLI management console. In the navigation pane on the left, choose Resources > Resource Pool.
    2. On the displayed page, click Buy Resource Pool in the upper right corner.
    3. On the displayed page, set the parameters based on Creating an Elastic Resource Pool and Creating Queues Within It.
      • CU range: Reserve over 16 CUs.
      • CIDR Block: Make sure the CIDR block differs from the following ones:

        172.18.0.0/16, 172.16.0.0/16, 10.247.0.0/16

    4. Click Buy.
    5. Click Submit. Wait until the elastic resource pool changes to the Available state.
  2. Create a general-purpose queue within the elastic resource pool.
    1. Locate the elastic resource pool in which you want to create queues and click Add Queue in the Operation column.
    2. On the Add Queue page, configure basic information about the queue. For details about the parameters, see Creating an Elastic Resource Pool and Creating Queues Within It.

      Set Type to For general purpose.

    3. Click Next. On the displayed page, configure a scaling policy for the queue.
    4. Click OK.

Step 2: Create a VPC and Security Group

  • Create a VPC.
    1. Log in to the VPC management console and click Create VPC in the upper right corner of the page.
    2. On the Create VPC page, set the parameters as prompted.

      For details about the parameters, see Creating a VPC.

      Make sure not to set IPv4 CIDR Block to any of the following ones:

      172.18.0.0/16, 172.16.0.0/16, 10.247.0.0/16

  • Create a security group.
    1. On the network console, access the Security Groups page.
    2. Click Create Security Group in the upper right corner.

      On the displayed page, set security group parameters as prompted.

      For details about the parameters, see Creating a Security Group.

    Ensure that the security group allows traffic from the CIDR block of the DLI elastic resource pool over TCP ports 8998 and 30000 to 32767.
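    For example, an inbound rule similar to the following might be added (the source CIDR block 192.168.0.0/18 is a placeholder; use the actual CIDR block of your elastic resource pool):

    Direction: Inbound; Protocol: TCP; Ports: 8998, 30000-32767; Source: 192.168.0.0/18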

Step 3: Create an Enhanced Datasource Connection

  1. Log in to the DLI management console.
  2. In the navigation pane on the left, choose Datasource Connections.
  3. On the displayed Enhanced tab, click Create.

    When creating the enhanced datasource connection, set the parameters based on Table 2.

Step 4: Register a ModelArts Custom Image

Based on the preset MindSpore image provided by ModelArts and the ModelArts CLI, you can load the image creation template and modify a Dockerfile to create an image. Then, register the image.

For details about the ModelArts CLI, see ma-cli image Commands for Building Images.

  • Base image address: swr.{endpoint}/atelier/pyspark_3_1_1:develop-remote-pyspark_3.1.1-py_3.7-cpu-ubuntu_18.04-x86_64-uid1000-20230308194728-68791b4

    Replace endpoint (region name) with the actual one.

    For example, the endpoint of AP-Singapore is ap-southeast-3.myhuaweicloud.com.

    The combined base image address is swr.ap-southeast-3.myhuaweicloud.com/atelier/pyspark_3_1_1:develop-remote-pyspark_3.1.1-py_3.7-cpu-ubuntu_18.04-x86_64-uid1000-20230308194728-68791b4.

  • For how to create and register a custom image on ModelArts, see Creating a Custom Image Using Dockerfile.

Step 5: Create a DLI Custom Agency

Create a DLI custom agency, which will be used to access a notebook instance. For details, see Creating a Custom DLI Agency.

Make sure the agency includes the following permissions: ModelArts FullAccess, DLI FullAccess, OBS Administrator, and IAM permission to pass agencies to cloud services.

If using role/policy-based authorization, grant the iam:agencies:* permission, for example:
{
    "Version": "1.1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:agencies:*"
            ]
        },
        {
            "Effect": "Deny",
            "Action": [
                "iam:agencies:update*",
                "iam:agencies:delete*",
                "iam:agencies:create*"
            ]
        }
    ]
}

Step 6: Create a Notebook Instance in the DLI Elastic Resource Pool

NOTE:

Log in to the ModelArts management console. In the navigation pane on the left, choose System Management > Permission Management. On the displayed page, check if the access authorization for ModelArts is configured. The new agency must include the IAM permission to pass agencies to cloud services. For details about permission policies, see Step 5: Create a DLI Custom Agency.

  1. On the DLI elastic resource pool page, preset DLI resource information required for creating a notebook instance.

    1. Log in to the DLI management console. In the navigation pane on the left, choose Resources > Resource Pool.
    2. On the displayed page, locate the elastic resource pool created in Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.
    3. Click More in the Operation column and select Notebook (New).
    4. In the slide-out panel, click Create Notebook. In the dialog box that appears, set the parameters as prompted.
    5. Click OK. The instance creation page is displayed.

  2. On the displayed page, set notebook instance parameters.

    1. Create a notebook instance.

      For details about the parameters, see Creating a Notebook Instance.

      Set the parameters as follows:

      • Image: Select the image registered in Step 4: Register a ModelArts Custom Image.
      • VPC Access: Enable VPC access.
        NOTE:

        Contact customer support to enable the VPC access function for the notebook instance.

        Select the security group created in Step 2: Create a VPC and Security Group. The security group must allow traffic from the CIDR block of the DLI elastic resource pool over TCP ports 8998 and 30000 to 32767.

        Click Create.

  3. Connect the notebook instance to DLI.

    1. In the notebook instance list, locate the notebook instance and click Open in the Operation column to access the notebook instance page.
    2. On the notebook instance page, click connect in the upper right corner to connect to DLI.
      Figure 2 Connecting to DLI
    3. In the Connect Cluster dialog box, configure job running information.
      Figure 3 Connect Cluster
      Table 1 Connect Cluster

      • Service Type: Name of the service to connect. Example value: DLI.
      • Pool Name: Elastic resource pool of the queue where the notebook job runs. In this example, set this parameter to the elastic resource pool created in Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.
      • Queue Name: Queue where the notebook job runs. In this example, set this parameter to the queue created in Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.
      • Spark Version: Spark version used to run the job. Currently, only Spark 3.3.1 supports submitting DLI jobs using notebook instances.
      • Spark Arguments(--conf): Custom parameters for the DLI job. For common parameters, see Table 2.

      Table 2 Common Spark parameters

      • spark.dli.job.agency.name: Name of the agency for the DLI job. When Flink 1.15, Spark 3.3, or a later version is used to execute jobs, you need to add information about the new agency to the job configuration. In this example, set this parameter to dli_notebook:
        spark.dli.job.agency.name=dli_notebook
      • spark.sql.session.state.builder: Configuration item for accessing metadata. Example configuration for accessing DLI metadata:
        spark.sql.session.state.builder=org.apache.spark.sql.hive.DliLakeHouseBuilder
      • spark.sql.catalog.class: Catalog class used for different data sources and metadata management systems. Example configuration for accessing DLI metadata:
        spark.sql.catalog.class=org.apache.spark.sql.hive.DliLakeHouseCatalog
      • spark.dli.metaAccess.enable: Enables or disables access to DLI metadata. Example configuration:
        spark.dli.metaAccess.enable=true
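      For example, to access DLI metadata, the following might be entered in the Spark Arguments(--conf) field (dli_notebook is the example agency name used above; replace it with your own agency):

      spark.dli.job.agency.name=dli_notebook
      spark.sql.session.state.builder=org.apache.spark.sql.hive.DliLakeHouseBuilder
      spark.sql.catalog.class=org.apache.spark.sql.hive.DliLakeHouseCatalog
      spark.dli.metaAccess.enable=true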

    4. Click connect. When the connect button in the upper right corner changes to the queue name and the dot before the name turns green, the connection is successful. Then, you can execute the notebook job.
      Figure 4 Notebook instance connected
    5. Click connect to test the connection.

Once the notebook instance is initialized, you can perform online data analysis on it. Instance initialization typically takes about 2 minutes.

When you run SQL statements in the notebook instance, a Spark job is started in DLI, and the results are displayed in the instance.
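For example, once the notebook instance is connected to the DLI queue, each SQL statement run in a notebook cell starts a Spark job on that queue. The following is a minimal sketch, assuming the spark session object is preconfigured by the notebook environment; demo_db and demo_table are placeholder names used for illustration only:

# Run SQL from a connected notebook cell. A Spark job is started in DLI for each statement.
spark.sql("SHOW DATABASES").show()

# Create a placeholder database and table, then query it (names are illustrative).
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.demo_table (id INT, name STRING) USING parquet")
spark.sql("SELECT * FROM demo_db.demo_table LIMIT 10").show()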

Step 7: Configure the Notebook Instance to Access DLI Metadata

Before running a job, you need to configure the notebook instance to access DLI or LakeFormation metadata.

Step 8: Use the Notebook Instance to Write and Debug Code

After the notebook instance is connected to the DLI queue, you can edit and debug code in the Notebook area.

You can choose to submit a job using the notebook instance or through the Spark Jobs page of the DLI management console.

(Optional) Configuring the Notebook Instance to Access DLI Metadata

After connecting the notebook instance to DLI, you need to configure access to metadata if you plan to submit DLI jobs using the notebook instance. This section describes how to configure access to DLI metadata.

For how to configure the notebook instance to access LakeFormation metadata, see (Optional) Configuring the Notebook Instance to Access LakeFormation Metadata.

  1. Specify a notebook image.
  2. Create a custom agency to authorize DLI to use DLI metadata and OBS.

    For how to create a custom agency, see Creating a Custom DLI Agency.

    Make sure the custom agency contains the following permissions:

    Table 3 DLI custom agency scenarios

    • Scenario: Allowing DLI to read and write data from and to OBS to transfer logs
      Agency Name: Custom
      Use Case: For DLI Flink jobs, the permissions include downloading OBS objects, obtaining OBS/GaussDB(DWS) data sources (foreign tables), transferring logs, using savepoints, and enabling checkpointing. For DLI Spark jobs, the permissions allow downloading OBS objects and reading/writing OBS foreign tables.
      Permission Policy: Permission Policies for Accessing and Using OBS
    • Scenario: Allowing DLI to access DLI catalogs to retrieve metadata
      Agency Name: Custom
      Use Case: DLI accesses catalogs to retrieve metadata.
      Permission Policy: Permission to Access DLI Catalog Metadata

  3. Confirm access to DLI metadata.
    1. Log in to the ModelArts console and choose Development Workspace > Notebook.
    2. Create a notebook instance. When the instance is Running, click Open in the Operation column.
    3. On the displayed JupyterLab page, choose File > New > Terminal. The Terminal page appears.
      Figure 5 Accessing the Terminal page
    4. Run the following commands to go to the Livy configuration directory and view the Spark configuration file:

      cd /home/ma-user/livy/conf/

      vi spark-defaults.conf

      Ensure that the spark.dli.user.catalogName=dli configuration item exists. This item is used to access DLI metadata.

      It is the default configuration item.

      Figure 6 Viewing the default access configuration for DLI metadata
    5. Use Notebook to edit a job.

(Optional) Configuring the Notebook Instance to Access LakeFormation Metadata

After connecting the notebook instance to DLI, you need to configure access to metadata if you plan to submit DLI jobs using the notebook instance. This section describes how to configure access to LakeFormation metadata.

For how to configure the notebook instance to access DLI metadata, see (Optional) Configuring the Notebook Instance to Access DLI Metadata.

  1. Connect DLI to LakeFormation.
    1. For details, see Connecting DLI to LakeFormation.
  2. Specify a notebook image.
  3. Create a custom agency to authorize DLI to use LakeFormation metadata and OBS.

    For how to create a custom agency, see Creating a Custom DLI Agency.

    Make sure the custom agency contains the following permissions:

    Table 4 DLI custom agency scenarios

    • Scenario: Allowing DLI to read and write data from and to OBS to transfer logs
      Agency Name: Custom
      Use Case: For DLI Flink jobs, the permissions include downloading OBS objects, obtaining OBS/GaussDB(DWS) data sources (foreign tables), transferring logs, using savepoints, and enabling checkpointing. For DLI Spark jobs, the permissions allow downloading OBS objects and reading/writing OBS foreign tables.
      Permission Policy: Permission Policies for Accessing and Using OBS
    • Scenario: Allowing DLI to access LakeFormation catalogs to retrieve metadata
      Agency Name: Custom
      Use Case: DLI accesses LakeFormation catalogs to retrieve metadata.
      Permission Policy: Permission to Access LakeFormation Catalog Metadata

  4. On the notebook instance page, set Spark parameters.
    1. Select the queue of the DLI notebook image, click connect, and set Spark parameters.
      spark.sql.catalogImplementation=hive
      spark.hadoop.hive-ext.dlcatalog.metastore.client.enable=true
      spark.hadoop.hive-ext.dlcatalog.metastore.session.client.class=com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient
      spark.hadoop.lakecat.catalogname.default=lfcatalog // Specify the catalog to access.
      spark.dli.job.agency.name=agencyForLakeformation // The agency must have the necessary permissions on LakeFormation and OBS and must be delegated to DLI.
      spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/lakeformation/*
      spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/lakeformation/*
      spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
      spark.hadoop.hoodie.support.write.lock=org.apache.hudi.lakeformation.LakeCatMetastoreBasedLockProvider

      Table 5 Parameter description

      • spark.sql.catalogImplementation (mandatory). Example value: hive. Type of catalog used to store and manage metadata.
      • spark.hadoop.hive-ext.dlcatalog.metastore.client.enable (mandatory). Example value: true. Mandatory when LakeFormation metadata access is enabled.
      • spark.hadoop.hive-ext.dlcatalog.metastore.session.client.class (mandatory). Example value: com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient. Mandatory when LakeFormation metadata access is enabled.
      • spark.hadoop.lakecat.catalogname.default (optional). Example value: lfcatalog. Name of the LakeFormation data directory to access. The default value is hive.
      • spark.dli.job.agency.name (mandatory). Example value: user-defined agency name. Name of the custom agency used by the DLI job.
      • spark.driver.extraClassPath (mandatory). Example value: /usr/share/extension/dli/spark-jar/lakeformation/*. Loads the LakeFormation dependency package.
      • spark.executor.extraClassPath (mandatory). Example value: /usr/share/extension/dli/spark-jar/lakeformation/*. Loads the LakeFormation dependency package.
      • spark.sql.extensions (optional). Example value: org.apache.spark.sql.hudi.HoodieSparkSessionExtension. Mandatory in Hudi scenarios.
      • spark.hadoop.hoodie.support.write.lock (optional). Example value: org.apache.hudi.lakeformation.LakeCatMetastoreBasedLockProvider. Mandatory in Hudi scenarios.

  5. Disable the default access to DLI metadata and use LakeFormation metadata.
    1. Log in to the ModelArts management console and choose DevEnviron > Notebook.
    2. Create a notebook instance. When the instance is Running, click Open in the Operation column.
    3. On the displayed JupyterLab page, choose File > New > Terminal. The Terminal page appears.
      Figure 7 Accessing the Terminal page
    4. Run the following commands to go to the Livy configuration directory and modify the Spark configuration file to disable the default access to DLI metadata:

      cd /home/ma-user/livy/conf/

      vi spark-defaults.conf

      Use # to comment out spark.dli.user.catalogName=dli to disable the default access to DLI metadata.

      Figure 8 Disabling default access to DLI metadata
    5. Use Notebook to edit a job.

      Run the spark.sql statement to access LakeFormation metadata and Hudi tables.

      Figure 9 Accessing LakeFormation metadata
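      For example, after the default access to DLI metadata is disabled and the LakeFormation Spark parameters above take effect, a notebook cell might access LakeFormation metadata and a Hudi table as sketched below. The spark session object is assumed to be provided by the notebook environment; demo_db and demo_hudi_table are placeholder names used for illustration only:

      # List databases in the LakeFormation catalog configured above.
      spark.sql("SHOW DATABASES").show()

      # Query a Hudi table registered in LakeFormation (placeholder names).
      spark.sql("SELECT * FROM demo_db.demo_hudi_table LIMIT 10").show()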
