Updated on 2024-10-23 GMT+08:00

Setting Up Scheduling for a Job

This section describes how to set up scheduling for an orchestrated job.

Prerequisites

Constraints

  • Set an appropriate recurrence for periodically scheduled jobs. A maximum of five instances can be concurrently executed in a job. If the start time of a job instance is later than the configured job execution time, the job instances in the subsequent batch are queued, and the job takes longer to execute than expected. For CDM and ETL jobs, the recurrence must be at least 5 minutes. In addition, adjust the recurrence based on the data volume of the job table and the update frequency of the source table.
  • If you use DataArts Studio DataArts Factory to schedule a CDM migration job and configure a scheduled task for the job in DataArts Migration, both configurations take effect. To ensure unified service logic and avoid scheduling conflicts, enable job scheduling in DataArts Factory and do not configure a scheduled task for the job in DataArts Migration.

Setting Up Scheduling for a Job Using the Batch Processing Mode

Three scheduling types are available: Run once, Run periodically, and Event-based. The procedure is as follows:

Click the Scheduling Setup tab on the right of the canvas to expand the configuration page and configure the scheduling parameters listed in Table 1.

Table 1 Job scheduling parameters

Parameter

Description

Scheduling Type

Scheduling type of the job. Available options include:

  • Run once: You need to manually execute the job.
  • Run periodically: The job is executed periodically. For details about the parameters, see Table 2.
    • Manual confirmation: If this option is selected, the job instance can be executed only after manual confirmation. If manual confirmation is not performed, the job instance cannot be executed.
      NOTE:

      In job instance execution scenarios, job instances are in waiting confirmation state on the Monitor Instance page. When you click Execute, the job instances are in waiting execution state.

      When you rerun instances, they are in waiting confirmation state. When you click Execute, the instances are in waiting execution state.

      In PatchData scenarios, PatchData job instances are in waiting confirmation state on the Monitor PatchData page. When you click Execute on the Monitor Instance page, PatchData job instances are in waiting execution state.

      In batch job monitoring scenarios, job instances are in waiting confirmation state on the Batch Jobs page. When you click Execute, the job instances are in waiting execution state.

  • Event-based: The job will be executed when certain external conditions are met. For details about the parameters, see Table 3. For details about cross-workspace event scheduling, see Scheduling Jobs Across Workspaces.

Enable Dry Run

If you select this option, the job will not be executed, and a success message will be returned.

Task Groups

Select a configured task group. For details, see Configuring Task Groups.

The default value is Do not select.

If you select a task group, you can control the maximum number of concurrent nodes in the task group in a fine-grained manner in scenarios where a job contains multiple nodes, a data patching task is ongoing, or a job is rerunning.

Example 1: The maximum number of concurrent tasks in the task group is set to 2, and a job has five nodes. When the job runs, only two nodes are running and the other nodes are waiting.

Example 2: The maximum number of concurrent tasks in the task group is set to 2, and the number of concurrent periods for a PatchData job is set to 5. When the PatchData job runs, two PatchData job instances are running, and the other job instances are waiting to run. The waiting instances can be delivered normally after a period of time.

Example 3: If the same task group is configured for multiple jobs, and the maximum number of concurrent tasks in the task group is set to 2, only two jobs are running and the other jobs are waiting. If the same task group is configured for multiple job nodes, the maximum number of concurrent tasks in the task group is set to 2, and there are five job nodes in total, two nodes are running and the other nodes are waiting.
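The concurrency limit in these examples behaves like a shared slot pool: every node (or job) bound to the task group must acquire a slot before it runs. The following Python sketch of Example 1 is illustrative only (not DataArts Studio code); the node names are hypothetical and a semaphore stands in for the task group.

```python
import threading
import time

# Task group from Example 1: maximum of 2 concurrent tasks shared by 5 nodes.
task_group = threading.Semaphore(2)

def run_node(name: str) -> None:
    with task_group:            # the node waits here until a slot is free
        print(f"{name} running")
        time.sleep(1)           # placeholder for the node's real work
    print(f"{name} finished")

threads = [threading.Thread(target=run_node, args=(f"node_{i}",)) for i in range(1, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()                    # at any moment, at most 2 nodes are running
```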

NOTE:

For a pipeline job, you can configure a task group for each node or for the job. A task group configured for the job takes precedence over one configured for a node.

Table 2 Parameters for jobs that are executed periodically

Parameter

Description

From and to

The period during which a scheduling task takes effect.

You can set it to today or tomorrow by clicking the time box and then Today or Tomorrow.

Recurrence

The frequency at which the scheduling task is executed, which can be:

Set an appropriate value for this parameter. A maximum of five instances can be concurrently executed in a job. If the start time of a job instance is later than the configured job execution time, the job instances in the subsequent batch are queued, and the job takes longer to execute than expected. For CDM and ETL jobs, the recurrence must be at least 5 minutes. In addition, adjust the recurrence based on the data volume of the job table and the update frequency of the source table.

You can modify the scheduling period of a running job.

  • Minutes: The job starts at the top of the hour, and the interval is accurate to the minute. After the scheduling ends at the end time of the current day, the scheduling automatically starts on the next day.
    NOTE:

    If you select Minutes for Recurrence, the configured interval is not carried across hour boundaries; the trigger minutes restart within each hour, so the job is not executed at a fixed frequency across hours (see the sketch after this list). For example:

    • A scheduling policy is configured at 14:20 on June 19, 2024. According to the policy, the scheduling starts at 00:30 and ends at 23:59, at an interval of 30 minutes. The job is actually scheduled at 14:30:00, 15:30:00, 16:30:00, 17:30:00, 18:30:00, and more on June 19, 2024.
    • A scheduling policy is configured at 14:20 on June 19, 2024. According to the policy, the scheduling starts at 00:00 and ends at 23:59, at an interval of 50 minutes. The job is actually scheduled at 14:50:00, 15:00:00, 15:50:00, 16:00:00, 16:50:00, 17:00:00, 17:50:00, and more on June 19, 2024.
  • Hours: You can select Interval Hour, indicating that the job starts at a specified time point and that the interval is accurate to hour. After the scheduling ends at the end time of the current day, the scheduling automatically starts on the next day. You can also select Discrete Hour and specify any hour in a day to schedule the job.
  • Every day: The job starts at a specified time on a day. The scheduling period is one day.
  • Every week: You can select a specified time point of one or more days in a week.
  • Every month: You can select a specified time point of one or more days in a month. In addition, you can select Last day of each month.
NOTE:

DataArts Studio does not support concurrent running of PatchData instances and periodic job instances of underlying services (such as CDM and DLI). To prevent PatchData instances from affecting periodic job instances and avoid exceptions, ensure that they do not run at the same time.
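A way to read the note on minute-level scheduling above: the configured interval restarts from the start minute within every hour instead of accumulating across hours. The following Python sketch is illustrative only (not DataArts Studio code) and reproduces the trigger times of the second example in that note.

```python
from datetime import datetime, timedelta

def minute_offsets(start_minute: int, interval: int) -> list[int]:
    """Minute marks within each hour at which the job fires; the interval
    restarts every hour instead of carrying over across hours."""
    return list(range(start_minute, 60, interval))

def runs_after(configured_at: datetime, start_minute: int, interval: int, count: int):
    """First `count` trigger times after the scheduling policy is configured."""
    offsets = minute_offsets(start_minute, interval)
    runs, hour = [], configured_at.replace(minute=0, second=0, microsecond=0)
    while len(runs) < count:
        for m in offsets:
            fire = hour + timedelta(minutes=m)
            if fire > configured_at and len(runs) < count:
                runs.append(fire)
        hour += timedelta(hours=1)
    return runs

# Second example from the note: configured at 14:20, start 00:00, interval 50 minutes
for run in runs_after(datetime(2024, 6, 19, 14, 20), start_minute=0, interval=50, count=5):
    print(run)   # 14:50, 15:00, 15:50, 16:00, 16:50
```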

Scheduling Calendar

Select a scheduling calendar. The default value is Do not use. For details about how to configure a scheduling calendar, see Configuring a Scheduling Calendar.

  • The job is scheduled on the custom working days defined in the calendar. On non-working days, a dry run occurs. This applies to both periodic job scheduling and PatchData tasks.
  • Changes to the working days of the scheduling calendar do not take effect for the job instances that are being executed, but can take effect immediately for those that have not been generated.
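In other words, the calendar only decides whether a generated instance really runs or dry-runs; the schedule itself is unchanged. A minimal sketch of that decision, with an assumed set of working days (illustrative only):

```python
from datetime import date

def instance_mode(run_date: date, working_days: set) -> str:
    """Working days from the scheduling calendar get a real run;
    all other days get a dry run."""
    return "run" if run_date in working_days else "dry run"

working_days = {date(2024, 6, 19), date(2024, 6, 20)}   # assumed custom calendar
print(instance_mode(date(2024, 6, 19), working_days))   # run
print(instance_mode(date(2024, 6, 22), working_days))   # dry run
```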

OBS Listening

If you enable this function, the system automatically listens to the OBS path for new job files. If you disable this function, the system no longer listens to the OBS path.

Configure the following parameters:

  • OBS File: An EL expression is supported.
  • Listening Interval: Set a value ranging from 1 to 60, in minutes.
  • Timeout: Set a value ranging from 1 to 1440, in minutes.
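Conceptually, OBS listening is a polling loop bounded by the listening interval and the timeout. The rough Python sketch below is illustrative only; `file_exists` is a hypothetical callable standing in for the service's actual check of the OBS path.

```python
import time

def wait_for_obs_file(file_exists, listening_interval_min: int, timeout_min: int) -> bool:
    """Poll for the job file until it appears or the timeout expires."""
    deadline = time.monotonic() + timeout_min * 60
    while time.monotonic() < deadline:
        if file_exists():                        # hypothetical check of the OBS path
            return True                          # file found: the job is triggered
        time.sleep(listening_interval_min * 60)  # wait one listening interval
    return False                                 # timed out: the job is not triggered

# Usage sketch with a stub check that never finds the file
print(wait_for_obs_file(lambda: False, listening_interval_min=1, timeout_min=0))
```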

Dependency job

You can select jobs that are executed periodically in different workspaces as dependency jobs. The current job starts only after the dependency jobs are executed. You can click Parse Dependency to automatically identify job dependencies.

NOTE:

For details about job dependency rules across workspaces, see Job Dependency Rule.

Currently, DataArts Factory supports two job dependency policies: dependency between jobs scheduled by traditional periods and dependency between jobs scheduled by natural periods. You can select either of them. New DataArts Studio instances use natural periods for scheduling.

Figure 1 Dependency between jobs whose scheduling periods are traditional periods
Figure 2 Dependency between jobs whose scheduling periods are natural periods

For details about the conditions for setting dependency jobs and how jobs run after dependency jobs are set, see Dependency Policies for Periodic Scheduling.

Policy for Current job If Dependency job Fails

Policy for processing the current job when one or more instances of its dependency job fail to be executed in its period.

  • Pending

    Waits to execute the current job, which affects the execution of subsequent jobs. You can forcibly set the failed dependency job instance to successful to unblock the current job.

  • Continue

    Continues to execute the current job.

  • Cancel

    Cancels the current job. Its status becomes Canceled.

For example, the recurrence of the current job is 1 hour and that of its dependency job is 5 minutes, so each period of the current job depends on 12 dependency job instances (see the sketch below).
  • If the value of this parameter is set to Cancel, the current job will be canceled as long as one of the 12 instances of its dependency job fails.
  • If the value of this parameter is set to Continue, the current job will be executed after the 12 instances of its dependency job are executed.
    NOTE:

    You can set this parameter for multiple jobs in a batch. For details, see Configuring a Default Item. This parameter takes effect only for new jobs.
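The three policies in the example above can be summarized as a small decision function. The sketch below is illustrative only; status and policy names are simplified.

```python
def decide_current_job(dependency_statuses, policy):
    """Decide what happens to the current job instance when some of its
    dependency job instances fail within its period."""
    if all(status == "success" for status in dependency_statuses):
        return "run"
    if policy == "Cancel":
        return "cancel"   # the current job instance becomes Canceled
    if policy == "Continue":
        return "run"      # the current job still runs once the dependency instances finish
    return "wait"         # Pending: keep waiting, which also blocks downstream jobs

# 1-hour job depending on a 5-minute job: 12 dependency instances, one of them failed
statuses = ["success"] * 11 + ["failed"]
print(decide_current_job(statuses, "Cancel"))    # cancel
print(decide_current_job(statuses, "Continue"))  # run
print(decide_current_job(statuses, "Pending"))   # wait
```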

Run After Dependency job Ends

If a job depends on other jobs, the job is executed only after its dependency job instances are executed within a specified time range. If the dependency job instances are not successfully executed, the current job is in waiting state.

If you select this option, the system checks whether all job instances in the previous cycle have been executed before executing the current job.

Dependency Job

When configuring job dependencies, you can filter dependent jobs based on whether they are being scheduled. This prevents downstream job failures caused by upstream dependent jobs not being scheduled.

  • All jobs
  • Running jobs

Dependency Cycle

  • Same Cycle
  • Previous N Cycle: N ranges from 1 to 30.

Cross-Cycle Dependency

Dependency between job instances

  • Independent on the previous schedule cycle: You can set Concurrency to specify the number of job instances that are concurrently executed. If you set it to 1, a batch is executed only after the previous batch is executed (the execution is successful, canceled, or failed).
  • Self-dependent: The job can be rescheduled only after it is executed in the current schedule cycle. Before that, the job is in Waiting state.
  • Skip waiting instances and run the latest instance: Skipped job instances are canceled and not executed. If the execution of a job instance takes a long time, multiple subsequent job instances may be skipped. If those instances are required by your service logic, skipping them can cause errors; for example, if each instance writes a partitioned table, skipping instances may leave some partitions missing. Exercise caution when selecting this option (see the sketch below).
    NOTE:
    • Skip waiting instances and run the latest instance is only supported for jobs scheduled by minute or hour.
    • If the number of concurrent jobs is small and no instance has been generated, blocked instances will not be skipped.
    • If a job with a shorter period depends on a job with a longer period, some instances may not be skipped and still be executed.
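The three cross-cycle options can be thought of as three ways of handling a newly generated instance while older instances are still running or waiting. A simplified Python sketch follows (illustrative only; real instance states and concurrency handling are more involved).

```python
def handle_new_instance(mode, previous_finished, waiting_instances, concurrency=1):
    """Decide whether a newly generated job instance runs, waits, or causes
    older waiting instances to be skipped."""
    if mode == "independent":
        # instances from different cycles may run side by side, bounded by Concurrency
        return "run" if previous_finished or concurrency > 1 else "wait"
    if mode == "self_dependent":
        # the new instance waits until the previous cycle's instance has finished
        return "run" if previous_finished else "wait"
    if mode == "skip_waiting_run_latest":
        # waiting instances that piled up are canceled; only the newest one runs
        for instance in waiting_instances:
            instance["state"] = "canceled"
        return "run"
    raise ValueError(f"unknown mode: {mode}")

waiting = [{"id": 1, "state": "waiting"}, {"id": 2, "state": "waiting"}]
print(handle_new_instance("skip_waiting_run_latest", previous_finished=False,
                          waiting_instances=waiting))
print(waiting)  # the two older instances are now canceled
```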

Clear Waiting Instances

  • No
  • Yes

    If this parameter is not set, expired waiting job instances will be cleared based on the workspace-level configuration by default. You can set whether to clear waiting job instances based on the site requirements.

Enable Dry Run

If you select this option, the job will not be executed, and a success message will be returned.

Task Groups

Select a configured task group. For details, see Configuring Task Groups.

The default value is Do not select.

If you select a task group, you can control the maximum number of concurrent nodes in the task group in a fine-grained manner in scenarios where a job contains multiple nodes, a data patching task is ongoing, or a job is rerunning.

NOTE:

For a pipeline job, you can configure a task group for each node or for the job. A task group configured for the job takes precedence over one configured for a node.

Table 3 Parameters for event-based jobs

Parameter

Description

Event Type

Type of the event that triggers job running

  • DIS
  • KAFKA

Parameters for DIS event-triggered jobs

DIS Stream

Name of the DIS stream. When a new message is sent to the specified DIS stream, DataArts Factory transfers the new message to the job to trigger the job running.

Concurrent Events

Number of jobs that can be concurrently processed. The maximum number of concurrent events is 128.

Event Detection Interval

Interval at which the system checks the DIS stream for new messages. The unit of the interval can be Seconds or Minutes.

Access Policy

Select the location where data is to be accessed:

  • Access from the last location: The first access starts from the latest position in the stream. Subsequent accesses resume from the position recorded during the previous access.
  • Access from a new location: Each access starts from the latest position in the stream.
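One way to picture the two policies is as a choice of starting offset: either resume from a saved position or always jump to the newest one. The minimal sketch below follows that reading (illustrative only; the service's internal bookkeeping is not exposed).

```python
def start_offset(policy, saved_offset, latest_offset):
    """Pick the position from which to start reading the stream."""
    if policy == "from_last_location":
        # first access: no saved position yet, so start from the latest one;
        # later accesses: resume from the position recorded last time
        return latest_offset if saved_offset is None else saved_offset
    if policy == "from_new_location":
        # always start from the latest position, ignoring any saved one
        return latest_offset
    raise ValueError(f"unknown policy: {policy}")

print(start_offset("from_last_location", saved_offset=None, latest_offset=120))  # 120
print(start_offset("from_last_location", saved_offset=85, latest_offset=120))    # 85
print(start_offset("from_new_location", saved_offset=85, latest_offset=120))     # 120
```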

Failure Policy

Select a policy to be performed after scheduling fails.

  • Suspend
  • Ignore the failure and proceed with the next event

Enable Dry Run

If you select this option, the job will not be executed, and a success message will be returned.

Task Groups

Select a configured task group. For details, see Configuring Task Groups.

The default value is Do not select.

If you select a task group, you can control the maximum number of concurrent nodes in the task group in a fine-grained manner in scenarios where a job contains multiple nodes, a data patching task is ongoing, or a job is rerunning.

NOTE:

For a pipeline job, you can configure a task group for each node or for the job. A task group configured for the job takes precedence over one configured for a node.

Parameters for KAFKA event-triggered jobs

Connection Name

Before selecting a data connection, ensure that a Kafka data connection has been created in the Management Center.

Topic

Topic of the message to be sent to Kafka.

Concurrent Events

Number of jobs that can be concurrently processed. The maximum number of concurrent events is 128.

Event Detection Interval

Interval at which the system checks the Kafka topic for new messages. The unit of the interval can be Seconds or Minutes.

Access Policy

Select the location where data is to be accessed:

  • Access from the last location: The first access starts from the latest position in the topic. Subsequent accesses resume from the position recorded during the previous access.
  • Access from a new location: Each access starts from the latest position in the topic.

Failure Policy

Select a policy to be performed after scheduling fails.

  • Suspend
  • Ignore the failure and proceed with the next event

Enable Dry Run

If you select this option, the job will not be executed, and a success message will be returned.

Task Groups

Select a configured task group. For details, see Configuring Task Groups.

The default value is Do not select.

If you select a task group, you can control the maximum number of concurrent nodes in the task group in a fine-grained manner in scenarios where a job contains multiple nodes, a data patching task is ongoing, or a job is rerunning.

NOTE:

For a pipeline job, you can configure a task group for each node or for the job. A task group configured for the job takes precedence over one configured for a node.


Setting Up Scheduling for Nodes of a Job Using the Real-Time Processing Mode

Three scheduling types are available: Run once, Run periodically, and Event-based. The procedure is as follows:

Select a node. On the node development page, click the Scheduling Parameter Setup tab. On the displayed page, configure the parameters listed in Table 4.

Table 4 Parameters for setting up node scheduling

Parameter

Description

Scheduling Type

Scheduling type of the job. Available options include:

  • Run once: You need to manually run the job.
  • Run periodically: The job runs automatically and periodically.
  • Event-based: The job runs when certain external conditions are met.

Parameters displayed when Scheduling Type is Run periodically

From and to

The period during which a scheduling task takes effect.

Recurrence

The frequency at which the scheduling task is executed, which can be:

  • Minutes
  • Hours
  • Every day
  • Every week
  • Every month

For CDM and ETL jobs, the recurrence must be at least 5 minutes. In addition, the recurrence should be adjusted based on the data volume of the job table and the update frequency of the source table.

You can modify the scheduling period of a running job.

Cross-Cycle Dependency

Dependency between job instances

  • Independent on the previous schedule cycle

    Set Concurrency to specify the number of job instances that are concurrently executed. If you set it to 1, a batch is executed only after the previous batch is executed (the execution is successful, canceled, or failed).

  • Self-dependent: The job can be rescheduled only after it is executed in the current schedule cycle. Before that, the job is in Waiting state.

Parameters displayed when Scheduling Type is Event-based

Event Type

Type of the event that triggers job running

DIS Stream

Name of the DIS stream. When a new message is sent to the specified DIS stream, DataArts Factory transfers the new message to the job to trigger the job running.

This parameter is mandatory only when Event Type is set to DIS.

Connection Name

Before selecting a data connection, ensure that a Kafka data connection has been created in the Management Center. This parameter is mandatory only when Event Type is set to KAFKA.

Topic

Topic of the message to be sent to Kafka. This parameter is mandatory only when Event Type is set to KAFKA.

Consumer Group

A scalable and fault-tolerant group of consumers in Kafka.

Consumers in a group share the same group ID. They collaborate to consume all partitions of the subscribed topics. Each partition in a topic can be consumed by only one consumer in the group.

NOTE:
  1. A consumer group can contain multiple consumers.
  2. The group ID is a string that uniquely identifies a consumer group in a Kafka cluster.
  3. Each partition of each topic subscribed to by a consumer group can be consumed by only one consumer. Consumer groups do not affect each other.

If you select DIS or KAFKA for Event Type, the consumer group ID is automatically displayed. You can also manually change the consumer group ID.
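For context, this is how a consumer group behaves in a plain Kafka client. The sketch below uses the open-source kafka-python library, not a DataArts Studio API; the broker address, topic, and group ID are placeholders.

```python
# pip install kafka-python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "job-trigger-topic",                    # placeholder topic name
    bootstrap_servers="broker.example.com:9092",
    group_id="dataarts-event-trigger",      # consumers sharing this ID split the partitions
    auto_offset_reset="latest",             # start from the newest messages on first access
)

for message in consumer:
    # within the group, each partition is delivered to exactly one consumer
    print(message.partition, message.offset, message.value)
```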

Concurrent Events

Number of jobs that can be concurrently processed. The maximum number of concurrent events is 10.

Event Detection Interval

Interval at which the system checks the DIS stream or Kafka topic for new messages. The unit of the interval can be Seconds or Minutes.

Access Policy

  • Access from the last location
  • Access from a new location

    This parameter is mandatory only when Event Type is set to KAFKA.

Failure Policy

Select a policy to be performed after scheduling fails.

  • Suspend
  • Ignore failure and proceed