Updated on 2024-04-29 GMT+08:00

Developing a Batch Processing Single-Task SQL Job

This section describes how to develop and configure a batch processing single-task SQL job.

For details about how to develop a batch processing job in single-task mode, see Developing an SQL Script, Configuring Job Parameters, Monitoring Quality, Data Table, Testing and Saving the Job, and Downloading or Dumping Script Execution Results.

Prerequisites

  • A job has been created. For details, see Creating a Job.
  • The job has been locked by you. Otherwise, you must click Lock before you can develop the job. A job that you create or import is locked by you by default. For details, see the lock function.

Developing an SQL Script

  1. Log in to the DataArts Studio console by following the instructions in Accessing the DataArts Studio Instance Console.
  2. On the DataArts Studio console, locate a workspace and click DataArts Factory.
  3. In the left navigation pane of DataArts Factory, choose Development > Develop Job.
  4. In the job directory, double-click the name of a single-task job to access the job development page.
  5. On the right of the SQL editor, click Basic Info to configure basic information, properties, and advanced settings of the job. Table 1 lists the basic information, Table 2 lists the properties, and Table 3 lists the advanced settings.
    Table 1 Basic job information

    Parameter

    Description

    Owner

    The owner configured during job creation is used by default. You can change the value.

    Executor

    This parameter is available when Scheduling Identities is set to Yes.

    User who executes the job. If you specify an executor, the job is executed by that user. If this parameter is left empty, the job is executed by the user who submitted the job for startup.

    Job Agency

    This parameter is available when Scheduling Identities is set to Yes.

    After an agency is configured, the job uses the agency to interact with other services during execution.

    Priority

    The priority configured during job creation is used by default. You can change the value.

    Execution Timeout

    Timeout duration of the job instance. If this parameter is set to 0 or left empty, it does not take effect. If the notification function is enabled for the job and the execution time of the job instance exceeds the preset value, the system sends the specified notification and the job keeps running.

    Exclude Waiting Time from Instance Timeout Duration

    Whether to exclude the wait time from the instance execution timeout duration

    If you select this option, the time to wait before an instance starts running is excluded from the timeout duration. You can modify this setting on the Default Configuration page.

    If you do not select this option, the time to wait before an instance starts running is included in the timeout duration.

    Custom Parameter

    Set the name and value of the parameter.

    Job Tag

    Configure job tags to manage jobs by category.

    Click Add to add a tag to the job. You can also select a tag configured in Managing Job Tags.

    Table 2 Properties of a single-task job

    Property

    Description

    DLI SQL properties

    Database Name

    Select a database.

    Queue Name

    The queue set in the SQL script is selected by default. You can select a different queue.

    You can create a resource queue using either of the following methods:
    • Click the icon next to this parameter. On the displayed Queue Management page of DLI, create a resource queue.
    • Go to the DLI console to create a resource queue.

    Record Dirty Data

    Specify whether to record dirty data.

    • If you select this option, dirty data will be recorded.
    • If you do not select this option, dirty data will not be recorded.

    DLI Environmental Variable

    • The environment variable must start with dli.sql. or spark.sql.
    • If the key of the environment variable is dli.sql.shuffle.partitions or dli.sql.autoBroadcastJoinThreshold, the environment variable cannot contain the greater than (>) or less than (<) sign.
    • If a parameter with the same name is configured in both a job and a script, the parameter value configured in the job will overwrite that configured in the script.
      NOTE:

      When you run a DLI SQL script or test a DLI SQL single-task job in non-scheduling scenarios, the following parameters are enabled by default:

      • spark.sql.adaptive.enabled: Adaptive Query Execution (AQE) is enabled so that Spark can dynamically optimize the query execution plan based on the characteristics of the data being processed and improve the performance by reducing the amount of data to be processed.
      • spark.sql.adaptive.join.enabled: AQE is enabled for join operations. The optimal join algorithm is selected based on the data being processed to improve performance.
      • spark.sql.adaptive.skewedJoin.enabled: AQE is enabled for skewed join operations. Skewed data can be automatically detected and the join algorithm is optimized accordingly to improve performance.
      • spark.sql.mergeSmallFiles.enabled: Merging of small files is enabled. Small files can be merged into large ones, improving performance and shortening the processing time. In addition, fewer files need to be read from remote storage, and more local files can be used.

      If you do not want to use these functions, you can set the values of the preceding parameters to false.
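
    For example, you might add the following key-value pairs to disable AQE and set the number of shuffle partitions for a test run (the values are illustrative only):

      spark.sql.adaptive.enabled = false
      dli.sql.shuffle.partitions = 200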

    DWS SQL properties

    Data Connection

    Select a data connection.

    Database

    Select a database.

    Dirty Data Table

    Name of the dirty data table defined in the SQL script.

    The dirty data attributes cannot be edited. They are automatically identified from the SQL script content.

    Matching Rule

    Enter a Java regular expression used to match the DWS SQL result. For example, if the expression is (?<=\()(-*\d+?)(?=,) and the SQL result is (1,"error message"), then the matched result is "1".

    Failure Matching Value

    If the matched content equals the set value, the node fails to be executed.

    RDS SQL properties

    Data Connection

    Select a data connection.

    Database

    Select a database.

    Spark SQL properties

    MRS Job Name

    MRS job name.

    If the MRS job name is not set and the direct connection mode is used, the node name is used as the MRS job name. In this case, the node name can contain only letters, digits, hyphens (-), and underscores (_).

    Data Connection

    Select a data connection.

    Database

    Select a database.

    Program Parameter

    Set program parameters.

    The following is an example:

    Set Parameter to --queue and Value to default_cr, indicating that a specified queue of the MRS cluster is configured. You can also go to the MRS console, click the name of the MRS cluster and then the Jobs tab, locate the job, click More in the Operation column, and select View Details to view the job details.

    NOTE:

    Configure optimization parameters such as threads, memory, and vCPUs for the job to optimize resource usage and improve job execution performance. This configuration is unavailable if a Spark proxy connection is used.

    Program parameters are supported only by single-operator Spark SQL jobs that use a connection of the MRS API type.

    Hive SQL properties

    MRS Job Name

    MRS job name.

    If the MRS job name is not set and the direct connection mode is used, the node name is used as the MRS job name. In this case, the node name can contain only letters, digits, hyphens (-), and underscores (_).

    Data Connection

    Select a data connection.

    Database

    Select a database.

    Program Parameter

    Set program parameters.

    The following is an example:

    Set Parameter to --hiveconf and Value to mapreduce.job.queuename=default_cr, indicating that a specified queue of the MRS cluster is configured. You can also go to the MRS console, click the name of the MRS cluster and then the Jobs tab, locate the job, click More in the Operation column, and select View Details to view the job details.

    NOTE:

    Configure optimization parameters such as threads, memory, and vCPUs for the job to optimize resource usage and improve job execution performance. This configuration is unavailable if a Hive proxy connection is used.

    Program parameters are supported only by single-operator Hive SQL jobs that use a connection of the MRS API type.

    Table 3 Advanced parameters

    Parameter

    Mandatory

    Description

    Node Status Polling Interval (s)

    Yes

    How often the system checks whether the node execution is complete. The value ranges from 1 to 60 seconds.

    During the node execution, the system checks whether the node execution is complete at the configured interval.

    Max. Node Execution Duration

    Yes

    Execution timeout interval for the node. If retry is configured and the execution is not complete within the timeout interval, the node will be executed again.

    Retry upon Failure

    Yes

    Whether to re-execute a node if it fails to be executed.

    • Yes: The node will be re-executed, and the following parameters must be configured:
      • Retry upon Timeout
      • Maximum Retries
      • Retry Interval (seconds)
    • No: The node will not be re-executed. This is the default value.
      NOTE:

      If retry is configured for a job node and the timeout duration is configured, the system allows you to retry a node when the node execution times out.

      If a node is not re-executed when it fails upon timeout, you can go to the Default Configuration page to modify this policy.

      Retry upon Timeout is displayed only when Retry upon Failure is set to Yes.

    Policy for Handling Subsequent Nodes If the Current Node Fails

    Yes

    Policy for handling subsequent nodes if the current node fails

    • End the current job execution plan: Execution of the current job will stop, and the job instance status will become Failed.
    • Ignore the failure and set the job execution result to success: The failure of the current node will be ignored, and the next node will be executed. The job instance status will become Successful.
  6. Enter one or more SQL statements in the SQL editor.
    • SQL statements are separated by semicolons (;). If semicolons are used in other places but not used to separate SQL statements, escape them with backslashes (\). The following is an example:
      select 1;
      select * from a where b="dsfa\;";  --example 1\;example 2.
    • RDS SQL does not support the begin ... commit transaction syntax. If necessary, use the start transaction ... commit transaction syntax instead (see the example after this list).
    • The script cannot be larger than 16 MB.
    • The system date obtained by using an SQL statement is different from that obtained by using the database tool. The query result is stored in the database in the YYYY-MM-DD format, but the query result displayed on the page is in the converted format.
    • When a user submits a Spark SQL script to MRS, the script is submitted to the tenant queue bound to the user by default. The bound queue is the queue corresponding to the tenant role of the user. If there are multiple queues, the system preferentially selects a queue based on the queue priorities. To set a fixed queue for the user to submit scripts, log in to FusionInsight Manager, choose Tenant Resources > Dynamic Resource Plan, and click the Global User Policy tab. For details, see Managing Global User Policies.
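
    For example, the following is a minimal sketch of the supported transaction syntax for RDS SQL (the table and column names are illustrative only):

      start transaction;
      update t1 set col1 = 1 where id = 100;
      commit;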
    To facilitate script development, DataArts Factory provides the following capabilities:
    • The script editor supports the following shortcut keys, which improve the script development efficiency:
      • F8: Run a script.
      • F9: Stop running a script.
      • Ctrl + /: Comment out or uncomment the line or code block at the cursor.
      • Ctrl + Z: Undo
      • Ctrl + F: Search
      • Ctrl + Shift + R: Replace
      • Ctrl + X: Cut
      • Ctrl + S: Save a script.
      • Alt + mouse dragging: Select columns to edit a block.
      • Ctrl + mouse click: Select multiple lines to edit or indent them together.
      • Shift + Ctrl + K: Delete the current line.
      • Ctrl + → (or Ctrl + ←): Move the cursor rightwards (or leftwards) by word.
      • Ctrl + Home or Ctrl + End: Navigate to the beginning or end of the current file.
      • Home or End: Navigate to the beginning or end of the current line.
      • Ctrl + Shift + L: Add cursors to all occurrences of the selected character string for batch modification.
      • Ctrl + D: Delete a line.
      • Shift + Ctrl + U: Unlock a script.
      • Ctrl + Alt + K: Select the word where the cursor resides.
      • Ctrl + B: Format
      • Ctrl + Shift + Z: Redo
      • Ctrl + Enter: Execute the selected line or content.
      • Ctrl + Alt + F: Flag
      • Ctrl + Shift + K: Find the previous match.
      • Ctrl + K: Find the next match.
      • Ctrl + Backspace: Delete the word to the left of the cursor.
      • Ctrl + Delete: Delete the word to the right of the cursor.
      • Alt + Backspace: Delete all content from the beginning of the line to the cursor.
      • Alt + Delete: Delete all content from the cursor to the end of the line.
      • Alt + Shift + Left: Select all content from the beginning of the line to the cursor.
      • Alt + Shift + Right: Select all content from the cursor to the end of the line.
    • System functions

      To view the functions supported by this type of data connection, click System Functions on the right of the editor. You can double-click a function to add it to the editor.

    • Script parameters

      Enter script parameters in the SQL statement, click Parameter Setup in the right pane of the editor, and then click Update from Script. You can also directly configure parameters and constants for the job script.

      In the following script example, str1 indicates the parameter name. It can contain only letters, digits, hyphens (-), underscores (_), greater-than signs (>), and less-than signs (<), and can contain a maximum of 16 characters. The parameter name must be unique.

      select ${str1} from data;
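
      For example, if str1 is set to city in Parameter Setup, the statement that is actually executed is as follows (the column name city is illustrative only):

      select city from data;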
    • Visualized reading of data tables to generate SQL statements

      Click Data Tables on the right of the editor to display all the tables in the current database or schema. You can select tables and columns and click Generate SQL Statement in the lower right corner to generate an SQL statement, which you need to manually format.

  7. (Optional) In the upper part of the editor, click Format to format SQL statements.
  8. In the upper part of the editor, click Execute. If you need to execute some SQL statements separately, select the SQL statements first. After executing the SQL statements, view the execution history and result of the script in the lower part of the editor.

    You can click View Log to view logs of the job.

  9. Above the editor, click the save button to save the job.

Configuring Job Parameters

Job parameters can be globally used in any node in jobs. The procedure is as follows:

Click Parameter Setup on the right of the editor and set the parameters described in Table 4.

Table 4 Job parameter setup

Module

Description

Variables

Add

Click Add and enter the variable parameter name and parameter value in the text boxes.

  • Parameter

    Only letters, numbers, hyphens, and underscores (_) are allowed.

  • Parameter Value
    • A string-type parameter value is a character string, for example, str1.
    • A numeric-type parameter value is a number or an operation expression.

After the parameter is configured, it is referenced in the format of ${parameter name} in the job.

Edit Parameter Expression

Click the icon next to the parameter value text box. In the displayed dialog box, edit the parameter expression. For more expressions, see Expression Overview.

Modifying a Job

Change the parameter name or value in the corresponding text boxes.

Mask

If the parameter value is a key, click the icon next to it to mask the value for security purposes.

Delete

Click the icon next to the parameter name and value text boxes to delete the job parameter.

Constant Parameter

Add

Click Add and enter the constant parameter name and parameter value in the text boxes.

  • Parameter

    Only letters, numbers, hyphens, and underscores (_) are allowed.

  • Parameter Value
    • A string-type parameter value is a character string, for example, str1.
    • A numeric-type parameter value is a number or an operation expression.

After the parameter is configured, it is referenced in the format of ${parameter name} in the job.

Edit Parameter Expression

Click the icon next to the parameter value text box. In the displayed dialog box, edit the parameter expression. For more expressions, see Expression Overview.

Modifying a Job

Modify the parameter name and parameter value in text boxes and save the modifications.

Delete

Click the icon next to the parameter name and value text boxes to delete the job parameter.

Workspace Environment Variables

View the variables and constants that have been configured in the workspace.

Click the Parameter Preview tab and configure the parameters listed in Table 5.

Table 5 Job parameter preview

Module

Description

Current Time

This parameter is displayed only when Scheduling Type is set to Run once. The default value is the current time.

Event Triggering Time

This parameter is displayed only when Scheduling Type is set to Event-based. The default value is the time when an event is triggered.

Scheduling Period

This parameter is displayed only when Scheduling Type is set to Run periodically. The default value is the scheduling period.

Start Time

This parameter is displayed only when Scheduling Type is set to Run periodically. The value is the configured job execution time.

Start Time

This parameter is displayed only when Scheduling Type is set to Run periodically. The value is the time when the periodic job scheduling starts.

Subsequent Instances

Number of job instances scheduled.

  • The default value is 1 when Scheduling Type is set to Run once.
  • The default value is 1 when Scheduling Type is set to Event-based.
  • When Scheduling Type is set to Run periodically:

    If the number of instances exceeds 10, a maximum of 10 instances are displayed, and the system displays the message "A maximum of 10 instances are supported."

In Parameter Preview, if a job parameter has a syntax error, the system displays a message.

If a parameter depends on the data generated during job execution, such data cannot be simulated and displayed in Parameter Preview.

Monitoring Quality

Single-task Data Migration and real-time jobs cannot be associated with quality jobs.

Two execution modes are available: parallel and serial. Click the Quality Monitoring tab on the right of the canvas to expand the slide-out panel and configure the parameters listed in Table 6.

Table 6 Quality monitoring parameters

Parameter

Description

Execution Mode

Execution mode of quality monitoring. The options are as follows:

  • Parallel: All the upstream operators of the quality job operator are set as primary operators.
  • Serial: Quality jobs are connected in series from top to bottom. The quality job on the top depends on the primary operator.

Quality job

Quality jobs to be associated with the single-task job

  1. Click Add. The Data Quality Monitor slide-out panel is displayed.
  2. Set a node name.
  3. Set Job Type to Quality job.
    NOTE:

    Comparison job is not supported.

  4. Select the quality job to be associated and set other parameters based on the site requirements. If no quality job is available, create a quality job by referring to Creating Quality Jobs.
    NOTE:
    • Click Add to add multiple quality jobs.
    • Click the edit icon to modify an added quality job.
    • Click the delete icon to delete an added quality job.
  5. Ignore Quality Job Alarm

    Yes: Quality job alarms can be ignored.

    No: Quality job alarms cannot be ignored. When an alarm is generated, it will be reported.

  6. Configure advanced settings.
    1. Max. Node Execution Duration: indicates the execution timeout interval for the node. If retry is configured and the execution is not complete within the timeout interval, the node will be executed again.
    2. Retry upon Failure: specifies whether to re-execute a node if it fails to be executed.

      Yes: The node will be re-executed, and the following parameters must be configured:

      Retry upon Timeout

      Maximum Retries

      Retry Interval (seconds)

      No: The node will not be re-executed. This is the default value.
      NOTE:

      If retry is configured for a job node and the timeout duration is configured, the system allows you to retry a node when the node execution times out.

      If a node is not re-executed when it fails upon timeout, you can go to the Default Configuration page to modify this policy.

      Retry upon Timeout is displayed only when Retry upon Failure is set to Yes.

    3. Policy for Handling Subsequent Nodes If the Current Node Fails

      End the current job execution plan: stops running the current job. The job instance status is Failed.

      Go to the next node: ignores the execution failure of the current node. The job instance status is Failure ignored.

      Suspend current job execution plan: suspends running the current job. The job instance status is Waiting.

      Suspend execution plans of the subsequent nodes: stops running subsequent nodes. The job instance status is Failed.

  7. Click OK to complete the quality monitoring configuration.

Data Table

You can view tables of Hive SQL, DLI SQL, RDS SQL, and DWS SQL single-task batch processing jobs. On the Data Tables slide-out panel, you can select a table name to view the column names, field types, and descriptions in the table.

Figure 1 Viewing a data table

Testing and Saving the Job

After configuring the job, perform the following operations:

  1. Click the execute button to execute the job.

    You can view the run logs of the job by clicking View Log.

  2. After the job is executed, click the save button to save the job configuration.

    After the job is saved, a version is automatically generated and displayed in Versions. The version can be rolled back. If you save a job multiple times within a minute, only one version is recorded. If the intermediate data is important, you can click Save new version to save and add a version.

Downloading or Dumping Script Execution Results

After a script is executed successfully, you can download or dump the execution result. By default, all users can download and dump the execution results of SQL scripts. If you do not want all users to have this permission, configure the permission by referring to Configuring a Data Export Policy.
  • After executing a script, you can click Download on the Result tab page to download a CSV result file to a local path. You can view the download record on the Download Center page.
  • After executing a script, you can click Dump on the Result tab page to dump the result file in CSV or JSON format to OBS. For details, see Table 7.
    • The dump function is supported only if the OBS service is available.
    • Only the execution results of the query statements in SQL scripts can be dumped.
    Table 7 Dump parameters

    Parameter

    Mandatory

    Description

    Data Format

    Yes

    Format of the data to be exported. CSV and JSON formats are supported.

    Resource Queue

    No

    DLI queue where the export operation is to be performed. Set this parameter when a DLI SQL script is created.

    Compression Format

    No

    Compression format. Set this parameter when a DLI SQL script is created.

    • none
    • bzip2
    • deflate
    • gzip

    Storage Path

    Yes

    OBS path where the result file is stored. After selecting an OBS path, enter a custom folder name. The system automatically creates the folder to store the result file.

    You can also go to the Download Center page to set the default OBS path, which will be automatically set for Storage Path in the Dump Result dialog box.

    Cover Type

    No

    If a folder with the same name as your custom folder already exists in the storage path, select a cover type. Set this parameter when a DLI SQL script is created.

    • Overwrite: The existing folder will be overwritten by the customized folder.
    • Report: The system reports an error and suspends the export operation.

    Export Column Name

    No

    Yes: Column names will be exported.

    No: Column names will not be exported.

    Character Set

    No

    • UTF-8: default character set
    • GB2312: recommended when the data to be exported contains Chinese character sets
    • GBK: expanded based on and compatible with GB2312
Downloading or dumping allows you to view more SQL script execution results. Table 8 lists the maximum number of results that you can view online, download, and dump for different types of SQL scripts.
Table 8 Maximum number of results that you can view, dump, and download

| SQL Type | Maximum Number of Results That You Can View Online | Maximum Number of Results That You Can Download | Maximum Number of Results That You Can Dump |
| --- | --- | --- | --- |
| DLI | 1,000 | 1,000 | Unlimited |
| Hive | 1,000 | 1,000 | 10,000 |
| DWS | 1,000 | 1,000 | 10,000 |
| Spark | 1,000 | 1,000 | 10,000 |
| RDS | 1,000 | 1,000 | Not supported |