Submitting a DLI Job Using a Notebook Instance

Updated on 2025-02-14 GMT+08:00

Notebook is an interactive data analysis and mining module that has been deeply optimized based on the open-source JupyterLab. It provides online development and debugging capabilities for writing and debugging model training code. After connecting DLI to a notebook instance, you can write code and develop jobs using Notebook's web-based interactive development environment, as well as flexibly perform data analysis and exploration. This section describes how to submit a DLI job using a notebook instance.

For how to perform operations on Jupyter Notebook, see Jupyter Notebook Documentation.

Use notebook instances to submit DLI jobs in scenarios involving online development and debugging. You can perform data analysis and exploration seamlessly, without the need to set up a development environment.

Notes

  • This function is currently available only to whitelisted users. To use it, submit a service ticket by choosing Service Tickets > Create Service Ticket in the upper right corner of the management console.
  • Deleting an elastic resource pool on the DLI management console will not delete the associated notebook instances. If you no longer need the notebook instances, log in to the ModelArts management console to delete them.

Procedure

  1. Create an elastic resource pool and create general-purpose queues within it.

    To create a notebook instance on DLI, first create an elastic resource pool and a general-purpose queue within the pool. The queue provides the compute resources required to run DLI jobs. See Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.

  2. Create a VPC and security group.

    The VPC and security group provide the network environment required by the notebook instance in the elastic resource pool. See Step 2: Create a VPC and Security Group.

  3. Create an enhanced datasource connection, which will be used to connect the DLI elastic resource pool to a notebook instance.

    See Step 3: Create an Enhanced Datasource Connection.

  4. Prepare a custom image.

    See Step 4: Register a ModelArts Custom Image.

  5. Create a custom agency, which will be used to access a notebook instance.

    See Step 5: Create a DLI Custom Agency.

  6. Create a notebook instance in the DLI elastic resource pool.

    See Step 6: Create a Notebook Instance in the DLI Elastic Resource Pool.

  7. Configure the notebook instance to access DLI or LakeFormation metadata. For details, see Step 7: Configure the Notebook Instance to Access DLI Metadata.
  8. Write and debug code in JupyterLab.

    On the JupyterLab home page, you can edit and debug code in the Notebook area. See Step 8: Use the Notebook Instance to Write and Debug Code.

Notes and Constraints

  • To submit a DLI job using a notebook instance, you must have a general-purpose queue within an elastic resource pool.
  • Each elastic resource pool is associated with a unique notebook instance.
  • Temporary data generated during the running of notebook jobs is stored in DLI job buckets in a parallel file system.
  • Manage notebook instances on the ModelArts management console. For details, see Managing Notebook Instances.
  • Notebook instances are used for code editing and development, and associated queues are used for job execution.

    To change the queue associated with a notebook instance, perform related operations on the ModelArts management console.

Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It

  1. Create an elastic resource pool.
    1. Log in to the DLI management console. In the navigation pane on the left, choose Resources > Resource Pool.
    2. On the displayed page, click Buy Resource Pool in the upper right corner.
    3. On the displayed page, set the parameters based on Creating an Elastic Resource Pool and Creating Queues Within It.
      • CU range: Reserve over 16 CUs.
      • CIDR Block: Make sure the CIDR block differs from the following ones:

        172.18.0.0/16, 172.16.0.0/16, 10.247.0.0/16

    4. Click Buy.
    5. Click Submit. Wait until the elastic resource pool changes to the Available state.
  2. Create a general-purpose queue within the elastic resource pool.
    1. Locate the elastic resource pool in which you want to create queues and click Add Queue in the Operation column.
    2. On the Add Queue page, configure basic information about the queue. For details about the parameters, see Creating an Elastic Resource Pool and Creating Queues Within It.

      Set Type to For general purpose.

    3. Click Next. On the displayed page, configure a scaling policy for the queue.
    4. Click OK.

Step 2: Create a VPC and Security Group

  • Create a VPC.
    1. Log in to the VPC management console and click Create VPC in the upper right corner of the page.
    2. On the Create VPC page, set the parameters as prompted.

      For details about the parameters, see Creating a VPC.

      Make sure not to set IPv4 CIDR Block to any of the following ones:

      172.18.0.0/16, 172.16.0.0/16, 10.247.0.0/16

  • Create a security group.
    1. On the network console, access the Security Groups page.
    2. Click Create Security Group in the upper right corner.

      On the displayed page, set security group parameters as prompted.

      For details about the parameters, see Creating a Security Group.

    Ensure that the security group allows traffic from the CIDR block of the DLI elastic resource pool over TCP ports 8998 and 30000 to 32767.
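    For example, an inbound rule similar to the following might be added (the source CIDR block 192.168.0.0/18 is a placeholder; use the actual CIDR block of your elastic resource pool):

    Direction: Inbound; Protocol: TCP; Ports: 8998, 30000-32767; Source: 192.168.0.0/18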

Step 3: Create an Enhanced Datasource Connection

  1. Log in to the DLI management console.
  2. In the navigation pane on the left, choose Datasource Connections.
  3. On the displayed Enhanced tab, click Create.

    When creating the enhanced datasource connection, set the parameters based on Table 2.

Step 4: Register a ModelArts Custom Image

Based on the preset MindSpore image provided by ModelArts and the ModelArts CLI, you can load the image creation template and modify a Dockerfile to create an image. Then, register the image.

For details about the ModelArts CLI, see ma-cli image Commands for Building Images.

  • Base image address: swr.{endpoint}/atelier/pyspark_3_1_1:develop-remote-pyspark_3.1.1-py_3.7-cpu-ubuntu_18.04-x86_64-uid1000-20230308194728-68791b4

    Replace endpoint (region name) with the actual one.

    For example, the endpoint of AP-Singapore is ap-southeast-3.myhuaweicloud.com.

    The combined base image address is swr.ap-southeast-3.myhuaweicloud.com/atelier/pyspark_3_1_1:develop-remote-pyspark_3.1.1-py_3.7-cpu-ubuntu_18.04-x86_64-uid1000-20230308194728-68791b4.

  • For how to create and register a custom image on ModelArts, see Creating a Custom Image Using Dockerfile.

Step 5: Create a DLI Custom Agency

Create a DLI custom agency, which will be used to access a notebook instance. For details, see Creating a Custom DLI Agency.

Make sure the agency includes the following permissions: ModelArts FullAccess, DLI FullAccess, OBS Administrator, and IAM permission to pass agencies to cloud services.

If using role/policy-based authorization, grant the iam:agencies:* permission, for example:
{
    "Version": "1.1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:agencies:*"
            ]
        },
        {
            "Effect": "Deny",
            "Action": [
                "iam:agencies:update*",
                "iam:agencies:delete*",
                "iam:agencies:create*"
            ]
        }
    ]
}

Step 6: Create a Notebook Instance in the DLI Elastic Resource Pool

NOTE:

Log in to the ModelArts management console. In the navigation pane on the left, choose System Management > Permission Management. On the displayed page, check if the access authorization for ModelArts is configured. The new agency must include the IAM permission to pass agencies to cloud services. For details about permission policies, see Step 5: Create a DLI Custom Agency.

  1. On the DLI elastic resource pool page, preset DLI resource information required for creating a notebook instance.

    1. Log in to the DLI management console. In the navigation pane on the left, choose Resources > Resource Pool.
    2. On the displayed page, locate the elastic resource pool created in Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.
    3. Click More in the Operation column and select Notebook (New).
    4. In the slide-out panel, click Create Notebook. In the dialog box that appears, set the parameters as prompted.
    5. Click OK. The instance creation page is displayed.

  2. On the displayed page, set notebook instance parameters.

    1. Create a notebook instance.

      For details about the parameters, see Creating a Notebook Instance.

      Set the parameters as follows:

      • Image: Select the image registered in Step 4: Register a ModelArts Custom Image.
      • VPC Access: Enable VPC access.
        NOTE:

        Contact customer support to enable the VPC access function for the notebook instance.

        Select the security group created in Step 2: Create a VPC and Security Group. The security group must allow traffic from the CIDR block of the DLI elastic resource pool over TCP ports 8998 and 30000 to 32767.

        Click Create.

  3. Connect the notebook instance to DLI.

    1. In the notebook instance list, locate the notebook instance and click Open in the Operation column to access the notebook instance page.
    2. On the notebook instance page, click connect in the upper right corner to connect to DLI.
      Figure 2 Connecting to DLI
    3. In the Connect Cluster dialog box, configure job running information.
      Figure 3 Connect Cluster
      Table 1 Connect Cluster

      • Service Type: Name of the service to connect. Example value: DLI.
      • Pool Name: Elastic resource pool of the queue where the notebook job runs. In this example, set this parameter to the elastic resource pool created in Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.
      • Queue Name: Queue where the notebook job runs. In this example, set this parameter to the queue created in Step 1: Create an Elastic Resource Pool and Create General-Purpose Queues Within It.
      • Spark Version: Spark version used to run the job. Currently, only Spark 3.3.1 supports submitting DLI jobs using notebook instances.
      • Spark Arguments(--conf): Custom parameters for the DLI job. For common parameters, see Table 2.

      Table 2 Common Spark parameters

      • spark.dli.job.agency.name: Name of the agency for the DLI job. When Flink 1.15, Spark 3.3, or a later version is used to execute jobs, you need to add information about the new agency to the job configuration. In this example, set this parameter to dli_notebook:
        spark.dli.job.agency.name=dli_notebook
      • spark.sql.session.state.builder: Configuration item for accessing metadata. Example configuration for accessing DLI metadata:
        spark.sql.session.state.builder=org.apache.spark.sql.hive.DliLakeHouseBuilder
      • spark.sql.catalog.class: Catalog class used for different data sources and metadata management systems. Example configuration for accessing DLI metadata:
        spark.sql.catalog.class=org.apache.spark.sql.hive.DliLakeHouseCatalog
      • spark.dli.metaAccess.enable: Enables or disables access to DLI metadata. Example configuration:
        spark.dli.metaAccess.enable=true
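      For example, to access DLI metadata, the following might be entered in the Spark Arguments(--conf) field (dli_notebook is the example agency name used above; replace it with your own agency):

      spark.dli.job.agency.name=dli_notebook
      spark.sql.session.state.builder=org.apache.spark.sql.hive.DliLakeHouseBuilder
      spark.sql.catalog.class=org.apache.spark.sql.hive.DliLakeHouseCatalog
      spark.dli.metaAccess.enable=true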

    4. Click connect. When the connect button in the upper right corner changes to the queue name and the dot before the name turns green, the connection is successful. Then, you can execute the notebook job.
      Figure 4 Notebook instance connected
    5. Click connect to test the connection.

Once the notebook instance is initialized, you can perform online data analysis on it. Instance initialization typically takes about 2 minutes.

When you run SQL statements in the notebook instance, a Spark job is started in DLI, and the results are displayed in the instance.
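For example, once the notebook instance is connected to the DLI queue, each SQL statement run in a notebook cell starts a Spark job on that queue. The following is a minimal sketch, assuming the spark session object is preconfigured by the notebook environment; demo_db and demo_table are placeholder names used for illustration only:

# Run SQL from a connected notebook cell. A Spark job is started in DLI for each statement.
spark.sql("SHOW DATABASES").show()

# Create a placeholder database and table, then query it (names are illustrative).
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.demo_table (id INT, name STRING) USING parquet")
spark.sql("SELECT * FROM demo_db.demo_table LIMIT 10").show()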

Step 7: Configure the Notebook Instance to Access DLI Metadata

Before running a job, you need to configure the notebook instance to access DLI or LakeFormation metadata.

Step 8: Use the Notebook Instance to Write and Debug Code

After the notebook instance is connected to the DLI queue, you can edit and debug code in the Notebook area.

You can choose to submit a job using the notebook instance or through the Spark Jobs page of the DLI management console.

(Optional) Configuring the Notebook Instance to Access DLI Metadata

After connecting the notebook instance to DLI, you need to configure access to metadata if you plan to submit DLI jobs using the notebook instance. This section describes how to configure access to DLI metadata.

For how to configure the notebook instance to access LakeFormation metadata, see (Optional) Configuring the Notebook Instance to Access LakeFormation Metadata.

  1. Specify a notebook image.
  2. Create a custom agency to authorize DLI to use DLI metadata and OBS.

    For how to create a custom agency, see Creating a Custom DLI Agency.

    Make sure the custom agency contains the following permissions:

    Table 3 DLI custom agency scenarios

    • Scenario: Allowing DLI to read and write data from and to OBS to transfer logs
      Agency Name: Custom
      Use Case: For DLI Flink jobs, the permissions include downloading OBS objects, obtaining OBS/GaussDB(DWS) data sources (foreign tables), transferring logs, using savepoints, and enabling checkpointing. For DLI Spark jobs, the permissions allow downloading OBS objects and reading/writing OBS foreign tables.
      Permission Policy: Permission Policies for Accessing and Using OBS
    • Scenario: Allowing DLI to access DLI catalogs to retrieve metadata
      Agency Name: Custom
      Use Case: DLI accesses catalogs to retrieve metadata.
      Permission Policy: Permission to Access DLI Catalog Metadata

  3. Confirm access to DLI metadata.
    1. Log in to the ModelArts console and choose Development Workspace > Notebook.
    2. Create a notebook instance. When the instance is Running, click Open in the Operation column.
    3. On the displayed JupyterLab page, choose File > New > Terminal. The Terminal page appears.
      Figure 5 Accessing the Terminal page
    4. Run the following commands to go to the Livy configuration directory and view the Spark configuration file:

      cd /home/ma-user/livy/conf/

      vi spark-defaults.conf

      Ensure that the spark.dli.user.catalogName=dli configuration item exists. This item is used to access DLI metadata.

      It is the default configuration item.

      Figure 6 Viewing the default access configuration for DLI metadata
    5. Use Notebook to edit a job.

(Optional) Configuring the Notebook Instance to Access LakeFormation Metadata

After connecting the notebook instance to DLI, you need to configure access to metadata if you plan to submit DLI jobs using the notebook instance. This section describes how to configure access to LakeFormation metadata.

For how to configure the notebook instance to access DLI metadata, see (Optional) Configuring the Notebook Instance to Access DLI Metadata.

  1. Connect DLI to LakeFormation.
    1. For details, see Connecting DLI to LakeFormation.
  2. Specify a notebook image.
  3. Create a custom agency to authorize DLI to use LakeFormation metadata and OBS.

    For how to create a custom agency, see Creating a Custom DLI Agency.

    Make sure the custom agency contains the following permissions:

    Table 4 DLI custom agency scenarios

    • Scenario: Allowing DLI to read and write data from and to OBS to transfer logs
      Agency Name: Custom
      Use Case: For DLI Flink jobs, the permissions include downloading OBS objects, obtaining OBS/GaussDB(DWS) data sources (foreign tables), transferring logs, using savepoints, and enabling checkpointing. For DLI Spark jobs, the permissions allow downloading OBS objects and reading/writing OBS foreign tables.
      Permission Policy: Permission Policies for Accessing and Using OBS
    • Scenario: Allowing DLI to access LakeFormation catalogs to retrieve metadata
      Agency Name: Custom
      Use Case: DLI accesses LakeFormation catalogs to retrieve metadata.
      Permission Policy: Permission to Access LakeFormation Catalog Metadata

  4. On the notebook instance page, set Spark parameters.
    1. Select the queue of the DLI notebook image, click connect, and set Spark parameters.
      spark.sql.catalogImplementation=hive
      spark.hadoop.hive-ext.dlcatalog.metastore.client.enable=true
      spark.hadoop.hive-ext.dlcatalog.metastore.session.client.class=com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient
      spark.hadoop.lakecat.catalogname.default=lfcatalog // Specify the catalog to access.
      spark.dli.job.agency.name=agencyForLakeformation // The agency must have the necessary permissions on LakeFormation and OBS and must be delegated to DLI.
      spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/lakeformation/*
      spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/lakeformation/*
      spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
      spark.hadoop.hoodie.support.write.lock=org.apache.hudi.lakeformation.LakeCatMetastoreBasedLockProvider

      Table 5 Parameter description

      • spark.sql.catalogImplementation (mandatory). Example value: hive. Type of catalog used to store and manage metadata.
      • spark.hadoop.hive-ext.dlcatalog.metastore.client.enable (mandatory). Example value: true. Mandatory when LakeFormation metadata access is enabled.
      • spark.hadoop.hive-ext.dlcatalog.metastore.session.client.class (mandatory). Example value: com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient. Mandatory when LakeFormation metadata access is enabled.
      • spark.hadoop.lakecat.catalogname.default (optional). Example value: lfcatalog. Name of the LakeFormation data directory to access. The default value is hive.
      • spark.dli.job.agency.name (mandatory). Example value: user-defined agency name. Name of the custom agency used by the DLI job.
      • spark.driver.extraClassPath (mandatory). Example value: /usr/share/extension/dli/spark-jar/lakeformation/*. Loads the LakeFormation dependency package.
      • spark.executor.extraClassPath (mandatory). Example value: /usr/share/extension/dli/spark-jar/lakeformation/*. Loads the LakeFormation dependency package.
      • spark.sql.extensions (optional). Example value: org.apache.spark.sql.hudi.HoodieSparkSessionExtension. Mandatory in Hudi scenarios.
      • spark.hadoop.hoodie.support.write.lock (optional). Example value: org.apache.hudi.lakeformation.LakeCatMetastoreBasedLockProvider. Mandatory in Hudi scenarios.

  5. Disable the default access to DLI metadata and use LakeFormation metadata.
    1. Log in to the ModelArts management console and choose DevEnviron > Notebook.
    2. Create a notebook instance. When the instance is Running, click Open in the Operation column.
    3. On the displayed JupyterLab page, choose File > New > Terminal. The Terminal page appears.
      Figure 7 Accessing the Terminal page
    4. Run the following commands to go to the Livy configuration directory and modify the Spark configuration file to disable the default access to DLI metadata:

      cd /home/ma-user/livy/conf/

      vi spark-defaults.conf

      Use # to comment out spark.dli.user.catalogName=dli to disable the default access to DLI metadata.

      Figure 8 Disabling default access to DLI metadata
    5. Use Notebook to edit a job.

      Run the spark.sql statement to access LakeFormation metadata and Hudi tables.

      Figure 9 Accessing LakeFormation metadata
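      For example, after the default access to DLI metadata is disabled and the LakeFormation Spark parameters above take effect, a notebook cell might access LakeFormation metadata and a Hudi table as sketched below. The spark session object is assumed to be provided by the notebook environment; demo_db and demo_hudi_table are placeholder names used for illustration only:

      # List databases in the LakeFormation catalog configured above.
      spark.sql("SHOW DATABASES").show()

      # Query a Hudi table registered in LakeFormation (placeholder names).
      spark.sql("SELECT * FROM demo_db.demo_hudi_table LIMIT 10").show()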
