
Creating and Submitting a Spark SQL Job

Scenario

DLI can query data stored in OBS. This section describes how to use a Spark SQL job on DLI to query OBS data.

Procedure

You can use DLI to submit a Spark SQL job to query data. The general procedure is as follows:

Step 1: Upload Data to OBS

Step 2: Create a Queue

Step 3: Create a Database

Step 4: Create a Table

Step 5: Query Data

Step 1: Upload Data to OBS

Before you use DLI to query and analyze data, upload data files to OBS.

  1. Go to the DLI console.
  2. In the service list, click Object Storage Service under Storage. The OBS console page is displayed.
  3. Create a bucket. In this example, the bucket name is obs1.
    1. Click Create Bucket in the upper right corner.
    2. On the displayed Create Bucket page, specify Region and enter the Bucket Name. Retain the default values for other parameters or adjust them as needed.

      You must select the same region as the DLI management console.

    3. Click Create Now.
  4. Click obs1 to access its Objects tab page.
  5. Click Upload Object. In the displayed dialog box, drag the desired file or folder, for example, sampledata.csv, to the Upload Object area. Then, click Upload.
    You can create the sample file by copying the following comma-separated content into a text file and saving it as sampledata.csv.
    12,test

    After the file is uploaded successfully, the file path is obs://obs1/sampledata.csv.

    • You are advised to use an OBS tool, such as OBS Browser+ or obsutil, to upload large files because OBS Console restricts the size and number of files that can be uploaded.
      • OBS Browser+ is a graphical tool that provides complete functions for managing your buckets and objects in OBS.
      • obsutil is a command line tool for accessing and managing OBS resources. If you are familiar with command line interfaces (CLIs), obsutil is an ideal tool for batch processing and automated tasks.
    • For more information about OBS operations, see the Object Storage Service Console Operation Guide.
    • For more information about these tools, see the OBS Tool Guide.
    You can upload files to a bucket in the following ways. Then OBS stores these files as objects in the bucket.
    Table 1 Access modes of objects uploaded to OBS

    Access Mode   | Upload Method
    Console       | Uploading an object using OBS Console
    OBS Browser+  | Uploading an object using OBS Browser+
    obsutil       | Uploading an object using obsutil
    SDK           | Uploading an object using SDKs. For details, see the section about object uploading in the developer guide of each language.
    API           | Uploading objects - PUT and Uploading objects - POST

Step 2: Create a Queue

A queue is the basis for using DLI. Before executing a SQL job, you need to create a queue.

  • DLI provides a preconfigured queue named default. If the default queue is used, you will be billed based on the amount of data scanned.
  • You can also create queues as needed. If a self-built queue is used, you will be billed based on the CUHs used.
    1. Log in to the DLI management console.

      If this is your first time logging in to the DLI management console, you need to be authorized to access OBS.

      For this example, you need at least the Tenant Administrator (Global service) permission.

    2. In the left navigation pane of the DLI management console, choose SQL Editor.
    3. On the left pane, select the Queues tab, and click the create button next to Queues to create a queue.
      Figure 1 Creating a queue

      For details about how to create a queue, see Creating a Queue.

      For details about billing, see Data Lake Insight Billing.

Step 3: Create a Database

Before querying data, create a database, for example, db1.

The default database is a built-in database. You cannot create a database named default.

  1. In the left navigation pane of the DLI management console, choose SQL Editor.
  2. In the editing window on the right of the SQL Editor page, enter the following SQL statement and click Execute. Read and agree to the privacy agreement, and click OK.
    create database db1;

    After the database is successfully created, click the refresh button in the middle pane to refresh the database list. The new database db1 is displayed in the list.

    When you execute a query on the DLI management console for the first time, you need to read the privacy agreement. You can perform operations only after you agree to the agreement. For later queries, you will not need to read the privacy agreement again.
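
    As an optional check, you can also list the existing databases from the SQL editor. The following statement is standard Spark SQL; db1 should appear in its output once the creation succeeds.

    show databases;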

Step 4: Create a Table

After database db1 is created, create a table (for example, table1) in db1 that references the sample file obs://obs1/sampledata.csv stored in OBS.

  1. In the SQL editing window of the SQL Editor page, select the default queue and database db1.
  2. Enter the following SQL statement in the job editor window and click Execute:
    create table table1 (id int, name string) using csv options (path 'obs://obs1/sampledata.csv');

    After the table is successfully created, click the Databases tab and then click db1. The created table table1 is displayed in the table list.
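
    As an optional check, you can also confirm the table definition from the SQL editor. The statements below are standard Spark SQL and are only a sketch of this verification step; the first lists the tables in db1 and the second shows the columns of table1. The Databases pane of the console displays the same information.

    show tables in db1;
    describe db1.table1;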

Step 5: Query Data

After performing the preceding steps, you can start querying data.

  1. In the Table tab on the SQL Editor page, double-click the created table table1. The SQL statement is automatically displayed in the SQL job editing window in the right pane. Run the following statement to query up to 1,000 records in table1:
    select * from db1.table1 limit 1000;
  2. Click Execute. The system starts the query.

    After the SQL statement is executed, whether successfully or not, you can view the query result on the View Result tab below the SQL job editing window.
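
    You can run other Spark SQL statements against the table in the same way. The following queries are a minimal sketch based on the single sample record (12,test): the first counts the rows in the table, and the second filters by the id column.

    select count(*) from db1.table1;
    select id, name from db1.table1 where id = 12;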