Help Center/ Data Lake Insight/ Developer Guide/ SQL Jobs/ Using Spark SQL Jobs to Analyze OBS Data
Updated on 2024-09-20 GMT+08:00

Using Spark SQL Jobs to Analyze OBS Data

DLI allows you to use data stored on OBS. You can create OBS tables on DLI to access and process data in your OBS bucket.

This section describes how to create an OBS table on DLI, import data to the table, and insert and query table data.

Prerequisites

  • You have created an OBS bucket. For details, see Object Storage Service User Guide. In this example, the OBS bucket name is dli-test-021.
  • You have created a DLI SQL queue. For details, see Creating a Queue.

    Note: When you create the DLI queue, set Type to For SQL.

Preparations

Creating a Database on DLI

  1. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark and Queue to the created SQL queue.
  2. Enter the following statement in the SQL editing window to create the testdb database. For details about the syntax for creating a DLI database, see Creating a Database.
    create database testdb;

The following operations in this section must be performed for the testdb database.

DataSource and Hive Syntax for Creating an OBS Table on DLI

The main difference between DataSource syntax and Hive syntax lies in the range of table data storage formats supported and the number of partitions supported. For the key differences in creating OBS tables using these two syntax, refer to Table 1.

Table 1 Syntax differences

Syntax

Data Types

Partitioning

Number of Partitions

DataSource

ORC, PARQUET, JSON, CSV, and AVRO

You need to specify the partitioning column in both CREATE TABLE and PARTITIONED BY statements. For details, see Creating a Single-Partition OBS Table Using DataSource Syntax.

A maximum of 7,000 partitions can be created in a single table.

Hive

TEXTFILE, AVRO, ORC, SEQUENCEFILE, RCFILE, and PARQUET

Do not specify the partitioning column in the CREATE TABLE statement. Specify the column name and data type in the PARTITIONED BY statement. For details, see Creating an OBS Table Using Hive Syntax.

A maximum of 100,000 partitions can be created in a single table.

For details about the DataSource syntax, see Creating an OBS Table Using the DataSource Syntax.

For details about the Hive syntax, see Creating an OBS Table Using the Hive Syntax.

Creating an OBS Table Using the DataSource Syntax

The following describes how to create an OBS table for CSV files. The methods of creating OBS tables for other file formats are similar.

  • Create a non-partitioned OBS table.
    • Specify an OBS file and create an OBS table for the CSV data.
      1. Create the test.csv file containing the following content and upload the test.csv file to the root directory of OBS bucket dli-test-021:
        Jordon,88,23
        Kim,87,25
        Henry,76,26
      2. Log in to the DLI management console and choose SQL Editor from the navigation pane on the left. In the SQL editing window, set Engine to spark, Queue to the SQL queue you have created, and Database to testdb. Run the following statement to create an OBS table:
        CREATE TABLE testcsvdatasource (name STRING, score DOUBLE, classNo INT
        ) USING csv OPTIONS (path "obs://dli-test-021/test.csv");

        If you create an OBS table using a specified file, you cannot insert data to the table with DLI. The OBS file content is synchronized with the table data.

      3. Run the following statement to query data in the testcsvdatasource table.
        select * from testcsvdatasource;
        Figure 1 Query results
      4. Open the test.csv file on the local PC, add Aarn,98,20 to the file, and replace the original test.csv file in the OBS bucket.
        Jordon,88,23
        Kim,87,25
        Henry,76,26
        Aarn,98,20
      5. In the DLI SQL Editor, query the testcsvdatasource table for Aarn,98,20. The result is displayed.
        select * from testcsvdatasource;
        Figure 2 Query results
    • Specify an OBS directory and create an OBS table for CSV data.
      • The specified OBS data directory does not contain files you want to import to the table.
        1. Create the file directory data in the root directory of the OBS bucket dli-test-021.
        2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following statement to create OBS table testcsvdata2source in the testdb database on DLI:
          CREATE TABLE testcsvdata2source (name STRING, score DOUBLE, classNo INT) USING csv OPTIONS (path "obs://dli-test-021/data");
        3. Run the following statement to insert table data:
          insert into testcsvdata2source VALUES('Aarn','98','20');
        4. Run the following statement to query data in the testcsvdata2source table:
          select * from testcsvdata2source;
          Figure 3 Query results
        5. Refresh the obs://dli-test-021/data directory of the OBS bucket and query the data. A CSV data file is generated, and the data is added to the file.
          Figure 4 Query results
      • The specified OBS data directory contains files you want to import to the table.
        1. Create file directory data2 in the root directory of the OBS bucket dli-test-021. Create the test.csv file with the following content and upload the file to the obs://dli-test-021/data2 directory:
          Jordon,88,23
          Kim,87,25
          Henry,76,26
        2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following statement to create OBS table testcsvdata3source in the testdb database on DLI:
          CREATE TABLE testcsvdata3source (name STRING, score DOUBLE, classNo INT) USING csv OPTIONS (path "obs://dli-test-021/data2");
        3. Run the following statement to insert table data:
          insert into testcsvdata3source VALUES('Aarn','98','20');
        4. Run the following statement to query data in the testcsvdata3source table:
          select * from testcsvdata3source;
          Figure 5 Query results
        5. Refresh the obs://dli-test-021/data2 directory of the OBS bucket and query the data. A CSV data file is generated, and the data is added to the file.
          Figure 6 Query results
  • Create an OBS partitioned table
    • Create a single-partition OBS table
      1. Create file directory data3 in the root directory of the OBS bucket dli-test-021.
      2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following statement to create OBS table testcsvdata4source using data in the specified OBS directory obs://dli-test-021/data3 and partition the table on the classNo column.
        CREATE TABLE testcsvdata4source (name STRING, score DOUBLE, classNo INT) USING csv OPTIONS (path "obs://dli-test-021/data3") PARTITIONED BY (classNo);
      1. Create the classNo=25 directory in the obs://dli-test-021/data3 directory of the OBS bucket. Create the test.csv file based on the following file content and upload the file to the obs://dli-test-021/data3/classNo=25 directory of the OBS bucket.
        Jordon,88,25
        Kim,87,25
        Henry,76,25
      2. Run the following statement in the SQL editor to add the partition data to OBS table testcsvdata4source:
        ALTER TABLE
          testcsvdata4source
        ADD
          PARTITION (classNo = 25) LOCATION 'obs://dli-test-021/data3/classNo=25';
      3. Run the following statement to query data in the classNo=25 partition of the testcsvdata4source table:
        select * from testcsvdata4source where classNo = 25;
        Figure 7 Query results
      4. Run the following statement to insert the following data to the testcsvdata4source table:
        insert into testcsvdata4source VALUES('Aarn','98','25');
        insert into testcsvdata4source VALUES('Adam','68','24');
      5. Run the following statement to query data in the classNo=25 and classNo=24 partitions of the testcsvdata4source table:

        When a partitioned table is queried using the where condition, the partition must be specified. Otherwise, the query fails and "DLI.0005: There should be at least one partition pruning predicate on partitioned table" is reported.

        select * from testcsvdata4source where classNo = 25;
        Figure 8 Query results
        select * from testcsvdata4source where classNo = 24;
        Figure 9 Query results
      6. In the obs://dli-test-021/data3 directory of the OBS bucket, click the refresh button. Partition files are generated in the directory for storing the newly inserted table data.
        Figure 10 classNo=25 file on OBS
        Figure 11 classNo=24 file on OBS
    • Create an OBS table partitioned on multiple columns.
      1. Create file directory data4 in the root directory of the OBS bucket dli-test-021.
      2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following statement to create OBS table testcsvdata5source using data in the specified OBS directory obs://dli-test-021/data4 and partition the table on classNo and dt columns.
        CREATE TABLE testcsvdata5source (name STRING, score DOUBLE, classNo INT, dt varchar(16)) USING csv OPTIONS (path "obs://dli-test-021/data4") PARTITIONED BY (classNo,dt);
      3. Run the following statements to insert the following data into the testcsvdata5source table:
        insert into testcsvdata5source VALUES('Aarn','98','25','2021-07-27');
        insert into testcsvdata5source VALUES('Adam','68','25','2021-07-28');
      4. Run the following statement to query data in the classNo partition of the testcsvdata5source table:
        select * from testcsvdata5source where classNo = 25;
        Figure 12 Query results
      5. Run the following statement to query data in the dt partition of the testcsvdata5source table:
        select * from testcsvdata5source where dt like '2021-07%';
        Figure 13 Query results
      6. Refresh the obs://dli-test-021/data4 directory of the OBS bucket. The following data files are generated:
        • File directory 1: obs://dli-test-021/data4/xxxxxx/classNo=25/dt=2021-07-27
          Figure 14 Query results
        • File directory 2: obs://dli-test-021/data4/xxxxxx/classNo=25/dt=2021-07-28
          Figure 15 Query results
      7. Create the partition directory classNo=24 in obs://dli-test-021/data4, and then create the subdirectory dt=2021-07-29 in classNo=24. Create the test.csv file using the following file content and upload the file to the obs://dli-test-021/data4/classNo=24/dt=2021-07-29 directory.
        Jordon,88,24,2021-07-29
        Kim,87,24,2021-07-29
        Henry,76,24,2021-07-29
      8. Run the following statement in the SQL editor to add the partition data to OBS table testcsvdata5source:
        ALTER TABLE
          testcsvdata5source
        ADD
          PARTITION (classNo = 24,dt='2021-07-29') LOCATION 'obs://dli-test-021/data4/classNo=24/dt=2021-07-29';
      9. Run the following statement to query data in the classNo partition of the testcsvdata5source table:
        select * from testcsvdata5source where classNo = 24;
        Figure 16 Query results
      10. Run the following statement to query all data in July 2021 in the dt partition:
        select * from testcsvdata5source where dt like '2021-07%';
        Figure 17 Query results

Creating an OBS Table Using Hive Syntax

The following describes how to create an OBS table for TEXTFILE files. The methods of creating OBS tables for other file formats are similar.

  • Create a non-partitioned OBS table.
    1. Create file directory data5 in the root directory of the OBS bucket dli-test-021. Create the test.txt file based on the following file content and upload the file to the obs://dli-test-021/data5 directory:
      Jordon,88,23
      Kim,87,25
      Henry,76,26
    2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following Hive statement to create an OBS table using data in obs://dli-test-021/data5/test.txt and set the row data delimiter to commas (,):
      CREATE TABLE hiveobstable (name STRING, score DOUBLE, classNo INT) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data5' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' indicates that records are separated by commas (,).

    3. Run the following statement to query data in the hiveobstable table:
      select * from hiveobstable;
      Figure 18 Query results
    4. Run the following statements to insert data into the table:
      insert into hiveobstable VALUES('Aarn','98','25');
      insert into hiveobstable VALUES('Adam','68','25');
    5. Run the following statement to query data in the table to verify that the data has been inserted:
      select * from hiveobstable;
      Figure 19 Query results
    6. In the obs://dli-test-021/data5 directory, refresh the page and query the data. Two files are generated containing the newly inserted data.
      Figure 20 Query results

    Create an OBS Table Containing Data of Multiple Formats

    1. Create file directory data6 in the root directory of the OBS bucket dli-test-021. Create the test.txt file based on the following file content and upload the file to the obs://dli-test-021/data6 directory:
      Jordon,88-22,23:21
      Kim,87-22,25:22
      Henry,76-22,26:23
    2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following Hive statement to create an OBS table using data stored in obs://dli-test-021/data6.
      CREATE TABLE hiveobstable2 (name STRING, hobbies ARRAY<string>, address map<string,string>) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data6' 
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '-'
      MAP KEYS TERMINATED BY ':';
      • ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' indicates that records are separated by commas (,).
      • COLLECTION ITEMS TERMINATED BY '-' indicates that the second column hobbies is in array format. Elements are separated by hyphens (-).
      • MAP KEYS TERMINATED BY ':' indicates that the address column is in the key-value format. Key-value pairs are separated by colons (:).
    3. Run the following statement to query data in the hiveobstable2 table:
      select * from hiveobstable2;
      Figure 21 Query results
  • Create a partitioned OBS table.
    1. Create file directory data7 in the root directory of the OBS bucket dli-test-021.
    2. Log in to the DLI management console and click SQL Editor. On the displayed page, set Engine to spark, Queue to the created SQL queue, and Database to testdb. Run the following statement to create an OBS table using data stored in obs://dli-test-021/data7 and partition the table on the classNo column:
      CREATE TABLE IF NOT EXISTS hiveobstable3(name STRING, score DOUBLE) PARTITIONED BY (classNo INT) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

      You can specify the partition key in the PARTITIONED BY statement. Do not specify the partition key in the CREATE TABLE IF NOT EXISTS statement. The following is an incorrect example:

      CREATE TABLE IF NOT EXISTS hiveobstable3(name STRING, score DOUBLE, classNo INT) PARTITIONED BY (classNo) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7';

    3. Run the following statements to insert data into the table:
      insert into hiveobstable3 VALUES('Aarn','98','25');
      insert into hiveobstable3 VALUES('Adam','68','25');
    4. Run the following statement to query data in the table:
      select * from hiveobstable3 where classNo = 25;
      Figure 22 Query results
    5. Refresh the obs://dli-test-021/data7 directory. A new partition directory classno=25 is generated containing the newly inserted table data.
      Figure 23 Query results
    6. Create partition directory classno=24 in the obs://dli-test-021/data7 directory. Create the test.txt file using the following file content and upload the file to the obs://dli-test-021/data7/classno=24 directory:
      Jordon,88,24
      Kim,87,24
      Henry,76,24
    7. Run the following statement in the SQL editor to add the partition data to OBS table hiveobstable3:
      ALTER TABLE
        hiveobstable3
      ADD
        PARTITION (classNo = 24) LOCATION 'obs://dli-test-021/data7/classNo=24';
    8. Run the following statement to query data in the hiveobstable3 table:
      select * from hiveobstable3 where classNo = 24;
      Figure 24 Query results

FAQs

  • Q1: What should I do if the following error is reported when the OBS partition table is queried?
    DLI.0005: There should be at least one partition pruning predicate on partitioned table `xxxx`.`xxxx`.;

    Cause: The partition key is not specified in the query statement of a partitioned table.

    Solution: Ensure that the where condition contains at least one partition key.

  • Q2: What should I do if "DLI.0007: The output path is a file, don't support INSERT...SELECT error" is reported when I use a DataSource statement to insert data in a specified OBS directory into an OBS table and the execution fails?
    The statement is similar to the following:
    CREATE TABLE testcsvdatasource (name string, id int) USING csv OPTIONS (path "obs://dli-test-021/data/test.csv");

    Cause: Data cannot be inserted if a specific file is used in the table creation statement. For example, the OBS file obs://dli-test-021/data/test.csv is used in the preceding example.

    Solution: Replace the OBS file to the file directory. You can insert data using the INSERT statement. The preceding example statement can be modified as follows:
    CREATE TABLE testcsvdatasource (name string, id int) USING csv OPTIONS (path "obs://dli-test-021/data");
  • Q3: What should I do if the syntax of a Hive statement used to create a partitioned OBS table is incorrect? For example, the following statement creates an OBS table partitioned on classNo:
    CREATE TABLE IF NOT EXISTS testtable(name STRING, score DOUBLE, classNo INT) PARTITIONED BY (classNo) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7';

    Cause: Do not specify the partition key in the list following the table name. Specify the partition key in the PARTITIONED BY statement.

    Solution: Specify the partition key in PARTITIONED BY. For example:
    CREATE TABLE IF NOT EXISTS testtable(name STRING, score DOUBLE) PARTITIONED BY (classNo INT) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7';