Using DLI to Submit a Spark Jar Job
Scenario
DLI allows you to submit Spark jobs compiled as JAR files, which contain the necessary code and dependency information for executing the job. These files are used for specific data processing tasks such as data query, analysis, and machine learning. Before submitting a Spark Jar job, upload the package to OBS and submit it along with the data and job parameters to run the job.
This example walks through the basic process of submitting a Spark Jar job package on the DLI console. How the JAR package is written depends on your service requirements, so you are advised to refer to the sample code provided by DLI (see Get DLI Sample Code) and customize it for your business scenario.
Procedure
Table 1 describes the procedure for submitting a Spark Jar job using DLI.
Complete the preparations in Preparations before performing the following operations.
Procedure | Description |
---|---|
Step 1: Upload Data to OBS | Prepare a Spark Jar job package and upload it to OBS. |
Step 2: Create an Elastic Resource Pool and Add Queues to the Pool | Create compute resources required for submitting the Spark Jar job. |
Step 3: Use DEW to Manage Access Credentials | In cross-source analysis scenarios, use DEW to manage access credentials of data sources and create an agency that allows DLI to access DEW. |
Step 4: Create a Custom Agency to Allow DLI to Access DEW and Read Credentials | Create an agency to allow DLI to access DEW. |
Step 5: Submit a Spark Job | Create a Spark Jar job to analyze data. |
Preparations
- Register a Huawei ID and enable Huawei Cloud services. Make sure your account is not in arrears or frozen.
- Configure an agency for DLI.
To use DLI, you need to access services such as Object Storage Service (OBS), Virtual Private Cloud (VPC), and Simple Message Notification (SMN). If it is your first time using DLI, you will need to configure an agency to allow access to these dependent services.
- Log in to the DLI management console using your account. In the navigation pane on the left, choose Global Configuration > Service Authorization.
- On the agency settings page, select the agency permissions under Basic Usage, Datasource, and O&M and click Update.
- Check and understand the notes for updating the agency, and click OK. The DLI agency permissions are updated.
Figure 1 Configuring an agency for DLI
- Once configured, you can check the agency dli_management_agency in the agency list on the IAM console.
Step 1: Upload Data to OBS
Before submitting Spark Jar jobs, upload data files to OBS.
Develop a Spark Jar job program by referring to Spark Job Sample Code, compile it, and package it as spark-examples.jar. Then perform the following steps to upload the program:
- Log in to the DLI console.
- In the service list, click Object Storage Service under Storage.
- Create a bucket. In this example, name it dli-test-obs01.
- On the displayed Buckets page, click Create Bucket in the upper right corner.
- On the displayed Create Bucket page, specify Region and enter the Bucket Name. Retain the default values for other parameters or set them as required.
Select a region that matches the location of the DLI console.
- Click Create Now.
- In the bucket list, click the name of the dli-test-obs01 bucket you just created to access its Objects tab.
- Click Upload Object. In the dialog box displayed, drag or add files or folders, for example, spark-examples.jar, to the upload area. Then, click Upload.
In this example, the path after upload is obs://dli-test-obs01/spark-examples.jar.
For more operations on the OBS console, see the Object Storage Service User Guide.
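The upload path used in later steps follows the pattern obs://bucket-name/object-name. The following is a minimal sketch of composing and splitting such a path; the helper names obs_path and split_obs_path are hypothetical and are not part of any DLI or OBS SDK.

```python
from typing import Tuple

OBS_PREFIX = "obs://"


def obs_path(bucket: str, object_key: str) -> str:
    """Compose the obs:// path of an object stored in a bucket."""
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    # Drop leading slashes so the result never contains "//" after the bucket.
    return f"{OBS_PREFIX}{bucket}/{object_key.lstrip('/')}"


def split_obs_path(path: str) -> Tuple[str, str]:
    """Split an obs:// path into (bucket, object_key)."""
    if not path.startswith(OBS_PREFIX):
        raise ValueError(f"not an OBS path: {path!r}")
    bucket, _, object_key = path[len(OBS_PREFIX):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {path!r}")
    return bucket, object_key


# Bucket and package names are the ones used in this example.
package_path = obs_path("dli-test-obs01", "spark-examples.jar")
print(package_path)
```

Keeping the bucket name and object name separate makes it easier to reuse the same bucket name in the OBS credential parameters set in Step 3.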
Step 2: Create an Elastic Resource Pool and Add Queues to the Pool
- Log in to the DLI management console.
- In the navigation pane on the left, choose Resources > Resource Pool.
- On the displayed page, click Buy Resource Pool in the upper right corner.
- On the displayed page, set the parameters.
- In this example, we will buy the resource pool in the CN East-Shanghai2 region. Table 2 describes the parameters.
Table 2 Parameters
Parameter | Description | Example Value |
---|---|---|
Region | Select a region where you want to buy the elastic resource pool. | CN East-Shanghai2 |
Project | Project uniquely preset by the system for each region | Default |
Name | Name of the elastic resource pool | dli_resource_pool |
Specifications | Specifications of the elastic resource pool | Standard |
CU Range | The maximum and minimum CUs allowed for the elastic resource pool | 64-64 |
CIDR Block | CIDR block the elastic resource pool belongs to. If you use an enhanced datasource connection, this CIDR block cannot overlap that of the data source. Once set, this CIDR block cannot be changed. | 172.16.0.0/19 |
Enterprise Project | Select an enterprise project for the elastic resource pool. | default |
- Click Buy.
- Click Submit.
- In the elastic resource pool list, locate the pool you just created and click Add Queue in the Operation column.
- Set the basic parameters listed below.
Table 3 Basic parameters for adding a queue
Parameter | Description | Example Value |
---|---|---|
Name | Name of the queue to add | dli_queue_01 |
Type | Type of the queue. To execute SQL jobs, select For SQL. To execute Flink or Spark jobs, select For general purpose. | _ |
Engine | SQL queue engine. The options include Spark and Trino. | _ |
Enterprise Project | Select an enterprise project. | default |
- Click Next and configure scaling policies for the queue.
Click Create to add a scaling policy with varying priority, period, minimum CUs, and maximum CUs.
Figure 2 shows the scaling policy configured in this example.
Table 4 Scaling policy parameters
Parameter | Description | Example Value |
---|---|---|
Priority | Priority of the scaling policy in the current elastic resource pool. A larger value indicates a higher priority. In this example, only one scaling policy is configured, so its priority is set to 1 by default. | 1 |
Period | The first scaling policy is the default policy, and its Period parameter configuration cannot be deleted or modified. The period for the scaling policy is from 00 to 24. | 00–24 |
Min CU | Minimum number of CUs allowed by the scaling policy | 16 |
Max CU | Maximum number of CUs allowed by the scaling policy | 64 |
- Click OK.
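Because the CIDR block of an elastic resource pool cannot be changed once set and must not overlap that of the data source when an enhanced datasource connection is used, it can help to check for overlap before buying the pool. The following is a minimal sketch using the ipaddress module of the Python standard library. The data source CIDR blocks are hypothetical values chosen for illustration only.

```python
import ipaddress


def cidr_overlaps(pool_cidr: str, datasource_cidr: str) -> bool:
    """Return True if the two CIDR blocks share at least one address."""
    pool = ipaddress.ip_network(pool_cidr)
    datasource = ipaddress.ip_network(datasource_cidr)
    return pool.overlaps(datasource)


# CIDR block of the elastic resource pool in this example.
POOL_CIDR = "172.16.0.0/19"

# Hypothetical data source CIDR blocks, not taken from this guide.
for candidate in ("172.16.16.0/24", "192.168.0.0/16"):
    status = "overlaps" if cidr_overlaps(POOL_CIDR, candidate) else "does not overlap"
    print(f"{candidate} {status} {POOL_CIDR}")
```

The block 172.16.0.0/19 covers 172.16.0.0 through 172.16.31.255, so any data source subnet inside that range would conflict with the pool.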
Step 3: Use DEW to Manage Access Credentials
To write the output data of a Spark Jar job to OBS, an AK/SK pair is required for accessing OBS. To keep the AK/SK secure, you can use Data Encryption Workshop (DEW) and Cloud Secret Management Service (CSMS) to manage it centrally. This avoids leaks of sensitive information and business risks caused by hard-coding credentials or configuring them in plaintext in programs.
- Log in to the DEW management console.
- In the navigation pane on the left, choose Cloud Secret Management Service > Secrets.
- On the displayed page, click Create Secret. Set basic secret information.
Set the AK and SK credential key-value pairs.
- In this example, the key in the first line is the user's access key ID (AK).
- In this example, the key in the second line is the user's secret access key (SK).
Figure 3 Configuring access credentials in DEW
- Set access credential parameters on the DLI Spark Jar job editing page.
spark.hadoop.fs.obs.bucket.USER_BUCKET_NAME.dew.access.key=USER_AK_CSMS_KEY_obstest1
spark.hadoop.fs.obs.bucket.USER_BUCKET_NAME.dew.secret.key=USER_SK_CSMS_KEY_obstest1
spark.hadoop.fs.obs.security.provider=com.dli.provider.UserObsBasicCredentialProvider
spark.hadoop.fs.dew.csms.secretName=obsAkSk
spark.hadoop.fs.dew.endpoint=kmsendpoint
spark.hadoop.fs.dew.csms.version=v3
spark.dli.job.agency.name=agencyname
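The bucket name is part of two of the parameter keys, so it is easy to mistype when the parameters are written by hand. The following is a minimal sketch that generates the parameter lines from the values used in this example. The function names dew_obs_conf and render_conf are hypothetical, and kmsendpoint is kept as the placeholder shown above rather than a real endpoint.

```python
from typing import Dict


def dew_obs_conf(bucket: str, ak_key: str, sk_key: str, secret_name: str,
                 dew_endpoint: str, agency_name: str,
                 csms_version: str = "v3") -> Dict[str, str]:
    """Build the Spark parameters that let a job read OBS credentials from DEW."""
    bucket_prefix = f"spark.hadoop.fs.obs.bucket.{bucket}.dew"
    return {
        # Keys (not values) of the AK and SK entries stored in the CSMS secret.
        f"{bucket_prefix}.access.key": ak_key,
        f"{bucket_prefix}.secret.key": sk_key,
        "spark.hadoop.fs.obs.security.provider":
            "com.dli.provider.UserObsBasicCredentialProvider",
        "spark.hadoop.fs.dew.csms.secretName": secret_name,
        "spark.hadoop.fs.dew.endpoint": dew_endpoint,
        "spark.hadoop.fs.dew.csms.version": csms_version,
        "spark.dli.job.agency.name": agency_name,
    }


def render_conf(conf: Dict[str, str]) -> str:
    """Render the parameters as key=value lines, one per line."""
    return "\n".join(f"{key}={value}" for key, value in conf.items())


conf = dew_obs_conf(
    bucket="dli-test-obs01",              # bucket created in Step 1
    ak_key="USER_AK_CSMS_KEY_obstest1",
    sk_key="USER_SK_CSMS_KEY_obstest1",
    secret_name="obsAkSk",
    dew_endpoint="kmsendpoint",           # placeholder, replace with your endpoint
    agency_name="dli_dew_agency_access",  # agency created in Step 4
)
print(render_conf(conf))
```

The printed lines can be pasted into the Spark parameter field on the job editing page. No AK or SK value appears in the output, only the names of the keys under which they are stored in the secret.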
Step 4: Create a Custom Agency to Allow DLI to Access DEW and Read Credentials
- Log in to the management console.
- In the upper right corner of the page, hover over the username and select Identity and Access Management.
- In the navigation pane of the IAM console, choose Agencies.
- On the displayed page, click Create Agency.
- On the Create Agency page, set the following parameters:
- Agency Name: Enter an agency name, for example, dli_dew_agency_access.
- Agency Type: Select Cloud service.
- Cloud Service: This parameter is available only when you select Cloud service for Agency Type. Select Data Lake Insight (DLI) from the drop-down list.
- Validity Period: Select Unlimited.
- Description: You can enter Agency with OBS OperateAccess permissions. This parameter is optional.
- Click Next.
- Click the agency name. On the displayed page, click the Permissions tab. Click Authorize. On the displayed page, click Create Policy.
- Configure policy information.
- Enter a policy name, for example, dli-dew-agency.
- Select JSON.
- In the Policy Content area, paste a custom policy.
{
    "Version": "1.1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "csms:secretVersion:get",
                "csms:secretVersion:list",
                "kms:dek:decrypt"
            ]
        }
    ]
}
- Enter a policy description as required.
- Click Next.
- On the Select Policy/Role page, select Custom policy from the first drop-down list and select the custom policy you created when configuring the policy information (dli-dew-agency in this example).
- Click Next. On the Select Scope page, set the authorization scope. In this example, select All resources.
For details about authorization operations, see Creating a User Group and Assigning Permissions.
- Click OK.
It takes 15 to 30 minutes for the authorization to take effect.
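A malformed policy or a missing action typically shows up only when the job fails to read the credentials. The following is a minimal sketch that parses the custom policy from this step and reports any of the three actions that are not granted. It is a local sanity check only and does not call IAM. The function name missing_actions is hypothetical.

```python
import json

# Custom policy content used in this example.
POLICY_JSON = """
{
    "Version": "1.1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "csms:secretVersion:get",
                "csms:secretVersion:list",
                "kms:dek:decrypt"
            ]
        }
    ]
}
"""

# Actions the policy in this example grants to the agency.
REQUIRED_ACTIONS = {
    "csms:secretVersion:get",
    "csms:secretVersion:list",
    "kms:dek:decrypt",
}


def missing_actions(policy_text: str) -> set:
    """Return the required actions that no Allow statement grants."""
    policy = json.loads(policy_text)  # raises ValueError on malformed JSON
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") == "Allow":
            granted.update(statement.get("Action", []))
    return REQUIRED_ACTIONS - granted


missing = missing_actions(POLICY_JSON)
print("Policy is complete" if not missing else f"Missing actions: {sorted(missing)}")
```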
Step 5: Submit a Spark Job
- On the DLI management console, choose Job Management > Spark Jobs in the navigation pane on the left. On the displayed page, click Create Job in the upper right corner.
- Set the following Spark job parameters:
- Queue: Select the queue created in Step 2: Create an Elastic Resource Pool and Add Queues to the Pool.
- Spark Version: Select a Spark engine version. In this example, version 3.3.1 is selected.
- Application: Select the package created in Step 1: Upload Data to OBS.
- Agency: Select the agency created in Step 4: Create a Custom Agency to Allow DLI to Access DEW and Read Credentials, which is used to allow DLI to access the credentials stored in DEW.
For other parameters, refer to the description about the Spark job editing page in "Creating a Spark Job" in the Data Lake Insight User Guide.
- Click Execute in the upper right corner of the Spark job editing window, read and agree to the privacy agreement, and click OK to submit the job. A message is displayed, indicating that the job is submitted.
- (Optional) Switch to the Job Management > Spark Jobs page to view the status and logs of the submitted Spark job.
When you click Execute on the DLI management console for the first time, you need to read the privacy agreement. Once you have agreed to it, the privacy agreement is not displayed again for subsequent operations.