Updated on 2024-04-29 GMT+08:00

Developing an SQL Script

DataArts Factory allows you to develop, debug, and run SQL scripts online. You can run developed scripts in jobs. For details, see Developing a Pipeline Job.

DataArts Factory supports multiple types of SQL scripts. The SQL syntax varies depending on the data source, so learn the syntax of the corresponding data source before you develop an SQL statement.

Prerequisites

  • The corresponding cloud service has been enabled and a database has been created in it.
  • A data connection that matches the type of the script has been created. For details, see Managing Data Connections. This is not required for Flink SQL scripts.
  • You have locked the script. If you have not, click Lock before developing it. A script that you create or import is locked by you by default. For details, see the lock function.

Procedure

  1. Log in to the DataArts Studio console by following the instructions in Accessing the DataArts Studio Instance Console.
  2. On the DataArts Studio console, locate a workspace and click DataArts Factory.
  3. In the left navigation pane of DataArts Factory, choose Data Development > Develop Script.
  4. In the script directory, double-click a script to access the script development page.
  5. In the upper part of the editor, select script properties. Table 1 describes the script properties. Skip this step when creating a Flink SQL script.
    Table 1 SQL script properties

    | Property | Description |
    | --- | --- |
    | Data Connection | Select a data connection. |
    | Database | Name of the database. |
    | Resource Queue | Select a resource queue for executing a DLI job. Set this parameter when a DLI SQL script is created. Additional details about resource queues follow this table. |

    You can create a resource queue using either of the following methods:
    • Click the icon for creating a queue. The Buy Queue page of DLI is displayed.
    • Go to the DLI console.

    NOTE:
    The default resource queue (named default) provided by DLI is for trial use only. It may be occupied by multiple users at the same time, so you may fail to obtain resources for your operations. If an execution takes a long time or fails, try again during off-peak hours or use a self-built queue to run the job.

    In addition, the default queue does not support the insert, load, or cat commands.

    To set properties for submitting SQL jobs as key/value pairs, click the icon for configuring properties. A maximum of 10 properties can be set. The properties are described as follows:
    • dli.sql.autoBroadcastJoinThreshold: specifies the data volume threshold for using BroadcastJoin. If the data volume of a table is below the threshold, BroadcastJoin is automatically used.
    • dli.sql.shuffle.partitions: specifies the number of partitions during shuffling.
    • dli.sql.cbo.enabled: specifies whether to enable the CBO optimization policy.
    • dli.sql.cbo.joinReorder.enabled: specifies whether join reordering is allowed when CBO optimization is enabled.
    • dli.sql.multiLevelDir.enabled: specifies whether to query the content in subdirectories if there are subdirectories in the specified directory of an OBS table or in the partition directory of an OBS partition table. By default, the content in subdirectories is not queried.
    • dli.sql.dynamicPartitionOverwrite.enabled: specifies whether to overwrite only the partitions involved in the query, leaving all other partitions undeleted.

    NOTE:
    • The key of an environment variable must start with dli.sql. or spark.sql.
    • If the key of the environment variable is dli.sql.shuffle.partitions or dli.sql.autoBroadcastJoinThreshold, its value cannot contain the greater-than (>) or less-than (<) sign.
    • If a parameter with the same name is configured in both a job and a script, the value configured in the job overrides the value configured in the script.

    NOTE:

    When you run a DLI SQL script or test a DLI SQL single-task job in non-scheduling scenarios, the following parameters are enabled by default:

    • spark.sql.adaptive.enabled: Adaptive Query Execution (AQE) is enabled so that Spark can dynamically optimize the query execution plan based on the characteristics of the data being processed and improve the performance by reducing the amount of data to be processed.
    • spark.sql.adaptive.join.enabled: AQE is enabled for join operations. The optimal join algorithm is selected based on the data being processed to improve performance.
    • spark.sql.adaptive.skewedJoin.enabled: AQE is enabled for skewed join operations. Skewed data can be automatically detected and the join algorithm is optimized accordingly to improve performance.
    • spark.sql.mergeSmallFiles.enabled: Merging of small files is enabled. Small files are merged into larger ones, which improves performance and shortens the processing time. In addition, fewer files need to be read from remote storage, and more local files can be used.

    If you do not want to use these features, set the preceding parameters to false.

  6. Enter an SQL statement in the editor. You can enter multiple SQL statements.
    The SQL syntax varies depending on the data source. Before developing an SQL statement, learn about the syntax of the corresponding data source.
    • SQL statements are separated by semicolons (;). If semicolons are used in other places but not used to separate SQL statements, escape them with backslashes (\). For example:
      select 1;
      select * from a where b="dsfa\;";  --example 1\;example 2.
    • RDS SQL does not support the begin ... commit transaction syntax. If necessary, use the start transaction ... commit transaction syntax.
    • The script cannot be larger than 16 MB.
    • The system date returned by an SQL statement may be displayed differently from the date returned by a database tool. The query result is stored in the database in YYYY-MM-DD format, whereas the page displays it in a converted format.
    • When a user submits a Spark SQL script to MRS, the script is submitted by default to the tenant queue bound to the user, that is, the queue corresponding to the user's tenant role. If there are multiple queues, the system selects one based on queue priority. To set a fixed queue for the user to submit scripts, log in to FusionInsight Manager, choose Tenant Resources > Dynamic Resource Plan, and click the Global User Policy tab. For details, see Managing Global User Policies.
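
    The transaction note for RDS SQL above can be illustrated with a minimal sketch. It assumes an RDS for MySQL or PostgreSQL data connection and a hypothetical table named account with id and balance columns; neither is part of this guide. The sketch ends the transaction with a plain commit, which both of these engines accept.

    ```sql
    -- Hypothetical table and values, for illustration only.
    -- Open the transaction with "start transaction" rather than "begin".
    start transaction;
    update account set balance = balance - 100 where id = 1;
    update account set balance = balance + 100 where id = 2;
    -- Make both updates permanent together.
    commit;
    ```

    In standard SQL, issuing rollback instead of commit discards both updates.
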
    To facilitate script development, DataArts Factory provides the following capabilities:
    • The script editor supports the following shortcut keys, which improve the script development efficiency:
      • F8: Run a script.
      • F9: Stop running a script.
      • Ctrl + /: Comment out or uncomment the line or code block where the cursor resides.
      • Ctrl + S: Save a script.
      • Ctrl + Z: Undo an action.
      • Ctrl + F: Search for information.
      • Ctrl + Shift + R: Replace
      • Ctrl + X: Cut (Cut a line when the cursor selects nothing.)
      • Alt + mouse dragging: Select columns to edit a block.
      • Ctrl + mouse click: Select multiple lines to edit or indent them together.
      • Shift + Ctrl + K: Delete the current line.
      • Ctrl + → (or ←): Move the cursor rightwards (or leftwards) by word.
      • Ctrl + Home or Ctrl + End: Navigate to the beginning or end of the current file.
      • Home or End: Navigate to the beginning or end of the current line.
      • Ctrl + Shift + L: Select all occurrences of the current string and add a cursor to each one for batch modification.
      • Ctrl + D: Delete a line.
      • Shift + Ctrl + U: Unlock a script.
      • Ctrl + Alt + K: Select the word where the cursor resides.
      • Ctrl + B: Format
      • Ctrl + Shift + Z: Redo
      • Ctrl + Enter: Execute the selected line or content.
      • Ctrl + Alt + F: Flag
      • Ctrl + Shift + K: Search for the previous one.
      • Ctrl + K: Search for the next one.
      • Ctrl + Backspace: Delete the word to the left of the cursor.
      • Ctrl + Delete: Delete the word to the right of the cursor.
      • Alt + Backspace: Delete all content from the beginning of the line to the cursor.
      • Alt + Delete: Delete all content from the cursor to the end of the line.
      • Alt + Shift + ←: Select all content from the beginning of the line to the cursor.
      • Alt + Shift + →: Select all content from the cursor to the end of the line.
    • System functions (Flink SQL, Spark SQL, ClickHouse SQL, and Presto SQL do not support system functions.)

      To view the functions supported by the current type of data connection, click System Functions on the right of the editor. Double-click a function to insert it into the editor.

    • Data tables can be read to generate SQL statements. (Flink SQL, Spark SQL, ClickHouse SQL, and Presto SQL do not support this function.)

      Click Data Tables on the right of the editor to display all the tables in the current database or schema. You can select tables and columns and click Generate SQL Statement in the lower right corner to generate an SQL statement, which you need to manually format.

    • Script parameters (Currently, only Flink SQL does not support script parameters.)

      You can directly write script parameters in SQL statements. When debugging scripts, you can enter parameter values in the script editor. If the script is referenced by a job, you can set parameter values on the job development page. The parameter values can use EL expressions (see Expression Overview).

      If a parameter in an SQL script involves a variable, the format of the variable must be the same as that set in Configuring Script Variables. If they are different, the variable cannot be identified.

      In the following script example, str1 indicates the parameter name. It can contain only letters, digits, hyphens (-), underscores (_), greater-than signs (>), and less-than signs (<), and can contain a maximum of 16 characters. The parameter name must be unique.

      select ${str1} from data;

      For MRS Spark SQL and MRS Hive SQL scripts, you can set a program parameter either in the SQL statements (for example, set hive.exec.parallel=true;) or in Program Parameter under Node Properties of the job.
      Figure 1 Program Parameter
    • Owner

      Click Basic Info to set the script owner and description.

    • Release

      In enterprise mode, you can go to the release page directly from the script development page. Hover over the corresponding icon and click Release.

    • For MRS API connections, parameters and values can be configured for Spark SQL and Hive SQL scripts. For proxy connections, this function is not supported.

      Click the icon in the upper right corner to set environment variables for scripts. The following are some examples:

      Set environment variables for a Hive SQL script:

      --hiveconf hive.merge.mapfiles=true;
      --hiveconf mapred.job.queue.name=queue1

      Set environment variables for a Spark SQL script:

      --num-executors 1
      --executor-cores 4
      --queue queue2

      In each entry, the first item is the parameter name and the second item is the parameter value.

      After the script is executed, view the execution details on the MRS management plane.
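
    As a sketch of how the script parameter and program parameter notes in this step fit together, the following MRS Hive SQL script sets hive.exec.parallel in the script and filters on a script parameter. The table sales, its columns, and the parameter name biz_date are hypothetical. The example also assumes that the parameter value is substituted into the statement as plain text before execution, which is why the placeholder sits inside quotation marks.

    ```sql
    -- Program parameter set in the script, as in the note above.
    set hive.exec.parallel=true;

    -- biz_date is a hypothetical script parameter. Supply a value such as
    -- 2024-04-29 in the script editor when debugging, or on the job
    -- development page when the script is referenced by a job.
    select region, sum(amount) as total_amount
    from sales
    where dt = '${biz_date}'
    group by region;
    ```

    The parameter name stays within the naming rule described above: it uses only letters and an underscore and is shorter than 16 characters.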

  7. (Optional) In the upper part of the editor, click Format to format SQL statements. When developing a Flink SQL script, skip this step.
  8. In the upper part of the editor, click Execute. If you need to execute some SQL statements separately, select the SQL statements first. After executing the SQL statements, view the execution history and result of the script in the lower part of the editor. When developing a Flink SQL script, skip this step.
    • A maximum of 1,000 execution results can be displayed for an SQL statement, or 10,000 for a DLI SQL statement. To view more execution results, download or dump them by following the instructions in Downloading or Dumping a Script Execution Result.
    • You can perform the following operations on execution results:
      • Double-click or right-click the name of an execution result tab to rename it. The name can contain a maximum of 16 characters.
      • Right-click the name of an execution result tab to close the current tab, all the tabs to the left or right of the current tab, all the other tabs, or all the tabs.
    • If the MRS cluster is a non-security cluster and its command whitelist is not restricted, application name information is added during Hive SQL execution, so you can locate the corresponding task on the Yarn management page of MRS by script name and execution time. If the default engine is tez, set the engine to mr so that the tez engine is not used.
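
    For the last note, a minimal sketch of switching a Hive SQL script from tez to mr is shown here. hive.execution.engine is a standard Hive property. The table web_logs is hypothetical, and the sketch assumes that the cluster permits the set command (a non-security cluster whose command whitelist is not restricted).

    ```sql
    -- Use MapReduce instead of tez for the statements that follow in this script.
    set hive.execution.engine=mr;

    -- Hypothetical query whose Yarn task you want to locate afterwards.
    select count(*) from web_logs;
    ```
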
  9. Above the editor, click the save icon (or press Ctrl + S) to save the script.
    If the script is new and has never been saved, set the parameters listed in Table 2.
    Table 2 Script parameters

    | Parameter | Mandatory | Description |
    | --- | --- | --- |
    | Script Name | Yes | Name of the script. The name can contain a maximum of 128 characters. Only letters, digits, hyphens (-), underscores (_), and periods (.) are allowed. |
    | Owners | No | Owner of the script. By default, the creator of the script is the owner. |
    | Description | No | Descriptive information about the script. |
    | Select Directory | Yes | Directory to which the script belongs. The root directory is selected by default. |

    If you open an unsaved script, you can restore its content from the local cache.

    After the script is saved, a version is automatically generated and displayed in Versions. The version can be rolled back. If you save a script multiple times within a minute, only one version is recorded. If the intermediate data is important, you can click Save new version to save and add a version.

Downloading or Dumping a Script Execution Result

After a script is executed successfully, you can download or dump the execution result. By default, all users can download and dump the execution results of SQL scripts. If you do not want all users to have this permission, configure the permission by referring to Configuring a Data Export Policy.
  • After executing a script, you can click Download on the Result tab page to download a CSV result file to a local path. You can view the download record on the Download Center page.
  • After executing a script, you can click Dump on the Result tab page to dump the result to OBS as a CSV or JSON file. For details, see Table 3.
    • The dump function is supported only if the OBS service is available.
    • Only the execution results of the query statements in SQL scripts can be dumped.
    Table 3 Dump parameters

    | Parameter | Mandatory | Description |
    | --- | --- | --- |
    | Data Format | Yes | Format of the data to be exported. CSV and JSON are supported. |
    | Resource Queue | No | DLI queue where the export operation is to be performed. Set this parameter when a DLI SQL script is created. |
    | Compression Format | No | Compression format, which can be none, bzip2, deflate, or gzip. Set this parameter when a DLI SQL script is created. |
    | Storage Path | Yes | OBS path where the result file is stored. After selecting an OBS path, enter a custom folder name. The system automatically creates the folder to store the result file. You can also set a default OBS path on the Download Center page, which is then filled in automatically for Storage Path in the Dump Result dialog box. |
    | Cover Type | No | Action to take if a folder with the same name as your custom folder already exists in the storage path. Set this parameter when a DLI SQL script is created. Overwrite: The existing folder is overwritten by the custom folder. Report: The system reports an error and suspends the export operation. |
    | Export Column Name | No | Yes: Column names are exported. No: Column names are not exported. |
    | Character Set | No | UTF-8: default character set. GB2312: recommended when the data to be exported contains Chinese characters. GBK: an extension of GB2312 that is compatible with it. |

Downloading or dumping lets you obtain more SQL script execution results than can be viewed online. Table 4 lists the maximum number of results that can be viewed, dumped, and downloaded for each type of SQL script.

Table 4 Maximum number of results that can be viewed, dumped, and downloaded

| SQL Type | Maximum Number of Results That Can Be Viewed Online | Maximum Number of Results That Can Be Downloaded | Maximum Number of Results That Can Be Dumped |
| --- | --- | --- | --- |
| DLI | 10,000 | 1,000 | Unlimited |
| Hive | 1,000 | 1,000 | 10,000 |
| DWS | 1,000 | 1,000 | 10,000 |
| Spark | 1,000 | 1,000 | 10,000 |
| RDS | 1,000 | 1,000 | Not supported |
| Presto | 1,000 | Unlimited (downloaded results are dumped directly to OBS) | Unlimited |
| ClickHouse | 1,000 | 1,000 | 10,000 |
| HetuEngine | 1,000 | 1,000 | 10,000 |
| Impala | 1,000 | 1,000 | 10,000 |