Updated on 2024-11-13 GMT+08:00

ALM-45446 Mutation Task of ClickHouse Is Not Complete for a Long Time

This section applies only to MRS 3.3.1 or later.

Alarm Description

The system checks mutation tasks every 5 minutes. This alarm is generated when the system detects that a mutation task has been running for at least slow_mutation_cost_time minutes. This alarm is automatically cleared when the system does not detect any running mutation task or the running time of a mutation task is less than slow_mutation_cost_time minutes.
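The rule above can be sketched as a small shell function. This is only an illustration, not the MRS implementation: the function name alarm_state is hypothetical, all values are assumed to be whole numbers in the same time unit, and the clearing condition is read as "every running mutation task is below the threshold".

```shell
# Hypothetical helper, not part of MRS.
# Usage: alarm_state THRESHOLD [RUNNING_TIME ...]
# THRESHOLD stands for slow_mutation_cost_time; each RUNNING_TIME is how long
# one unfinished mutation task has been running, in the same unit (assumption).
alarm_state() {
  threshold="$1"
  shift
  # Any unfinished mutation at or above the threshold raises the alarm.
  for running_time in "$@"; do
    if [ "$running_time" -ge "$threshold" ]; then
      echo "raised"
      return 0
    fi
  done
  # No running mutation, or all of them are below the threshold.
  echo "cleared"
}
```

For example, alarm_state 30 5 45 prints raised, and alarm_state 30 prints cleared because no mutation task is running.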

Alarm Attributes

Alarm ID    Alarm Severity    Auto Cleared
45446       Minor             Yes

Alarm Parameters

Type                    Parameter      Description
Location Information    Source         Specifies the cluster or system for which the alarm was generated.
                        ServiceName    Specifies the service for which the alarm was generated.
                        RoleName       Specifies the role for which the alarm was generated.
                        HostName       Specifies the host for which the alarm was generated.

Impact on the System

  • Server resources are occupied, and the performance of the ClickHouse service deteriorates.
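  • Data may be inconsistent while the mutation is incomplete, because some data parts have been rewritten and others have not.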

Possible Causes

The data volume is too large, so the mutation task runs slowly or stalls.

Handling Procedure

  1. Log in to FusionInsight Manager, choose O&M > Alarm > Alarms, view the role name and hostname in Location for this alarm, and obtain the IP address of that host.
  2. Log in to the node where the client is installed and run the following commands:

    cd {Client installation path}

    source bigdata_env

    • Security mode (with Kerberos enabled):

      kinit {Component service user}

      clickhouse client --host {IP address of the ClickHouseServer instance for which the alarm is reported} --port 21427 --secure

    • Normal mode (with Kerberos disabled):

      clickhouse client --host {IP address of the ClickHouseServer instance for which the alarm is reported} --user {Username} --password --port 21423

  3. Log in to FusionInsight Manager, choose Cluster > Services > ClickHouse, and click Configurations and then All Configurations. Search for the slow_mutation_cost_time parameter and note its value. Then run the following SQL statement on the ClickHouse client, replacing {Value of slow_mutation_cost_time} with that value, and check whether any result is returned:

    SELECT * FROM system.mutations WHERE is_done = 0 AND create_time < now() - INTERVAL {Value of slow_mutation_cost_time} SECOND

    • If yes, go to 4.
    • If no, go to 7.

  4. Wait for a while and run the statement in 3 again. Check whether the value of parts_to_do in the returned result decreases.

    • If yes, wait until the mutation task is complete.
    • If no, go to 5.

  5. If the value of parts_to_do remains unchanged, stop the mutation task by running the following statement. Then run the statement in 3 again and check whether the mutation task still appears in the returned result:

    KILL MUTATION WHERE database = 'Database name' AND table = 'Table name' AND mutation_id = 'mutation ID'

    • If yes, go to 7.
    • If no, go to 6.

  6. Wait for several minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 7.
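Steps 3 to 5 can be sketched as shell helpers that build the statements and decide the next action. This is a minimal sketch, not an MRS tool: the function names and the sample values (1800, default, events, mutation_3.txt) are hypothetical, the columns are a subset of system.mutations chosen for readability, and the last comment assumes the client accepts --query as the open-source ClickHouse client does.

```shell
# Hypothetical helpers; names and sample values are illustrative only.

# Step 3: build the query for a threshold read from slow_mutation_cost_time.
# The unit (SECOND) follows the statement in step 3.
build_slow_mutation_query() {
  case "$1" in
    ''|*[!0-9]*)
      # Reject anything that is not a plain integer before embedding it in SQL.
      echo "threshold must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf '%s' "SELECT database, table, mutation_id, create_time, parts_to_do, latest_fail_reason FROM system.mutations WHERE is_done = 0 AND create_time < now() - INTERVAL $1 SECOND"
}

# Step 4: compare parts_to_do from two runs of the query.
# $1 = earlier value, $2 = later value.
next_action() {
  if [ "$2" -lt "$1" ]; then
    echo "wait"   # the mutation is still making progress
  else
    echo "kill"   # no decrease: continue with step 5
  fi
}

# Step 5: build the KILL MUTATION statement for one task.
# $1 = database, $2 = table, $3 = mutation_id.
# The values are inserted as-is, so they must not contain single quotes.
build_kill_statement() {
  printf "KILL MUTATION WHERE database = '%s' AND table = '%s' AND mutation_id = '%s'" "$1" "$2" "$3"
}

# Possible use in security mode (assumption: --query is supported):
#   clickhouse client --host <ClickHouseServer IP> --port 21427 --secure \
#     --query "$(build_slow_mutation_query 1800)"
```

Listing database, table, and mutation_id in the query output makes it easier to fill in the KILL MUTATION statement, and latest_fail_reason can indicate why a task has stopped making progress.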

Collect fault information.

  7. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  8. Expand the Service drop-down list, and select ClickHouse for the target cluster.
  9. Expand the Hosts drop-down list. In the Select Host dialog box that is displayed, select the abnormal host, and click OK.
  10. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 1 hour before and 1 hour after the alarm generation time, respectively. Then, click Download.
  11. Contact O&M engineers and provide the collected logs.

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.