ALM-45446 Mutation Task of ClickHouse Is Not Complete for a Long Time
This section is available for MRS 3.3.1 or later versions only.
Alarm Description
The system checks mutation tasks every 5 minutes. This alarm is generated when the system detects a mutation task that has been running for at least slow_mutation_cost_time minutes. This alarm is automatically cleared when the system detects no running mutation task, or when every running mutation task has been running for less than slow_mutation_cost_time minutes.
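The trigger and clearance conditions above reduce to one rule: the alarm stays raised while at least one running mutation task has been running for slow_mutation_cost_time minutes or longer, and it clears otherwise. The following bash function is a minimal sketch of that rule only, not the MRS implementation. The function name alarm_state is hypothetical, and it assumes its inputs are the threshold followed by the running time of each unfinished mutation task, all in whole minutes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the ALM-45446 trigger and clearance rule,
# not the MRS implementation.
# Usage: alarm_state THRESHOLD_MINUTES [RUNNING_MINUTES ...]
# Prints "raise" if any running mutation task has run for at least
# THRESHOLD_MINUTES; otherwise prints "clear", including when no task runs.
alarm_state() {
  local threshold=$1
  shift
  local minutes
  for minutes in "$@"; do
    if (( minutes >= threshold )); then
      echo "raise"
      return 0
    fi
  done
  echo "clear"
}

alarm_state 30 5 12   # prints: clear
alarm_state 30 5 45   # prints: raise
alarm_state 30        # prints: clear (no running task)
```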
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
45446 | Minor | Yes |
Alarm Parameters
Type | Parameter | Description |
---|---|---|
Location Information | Source | Specifies the cluster or system for which the alarm was generated. |
Location Information | ServiceName | Specifies the service for which the alarm was generated. |
Location Information | RoleName | Specifies the role for which the alarm was generated. |
Location Information | HostName | Specifies the host for which the alarm was generated. |
Impact on the System
- Server resources are occupied, and the performance of the ClickHouse service deteriorates.
- Data may be inconsistent, because the mutation has not been applied to all data parts.
Possible Causes
The data volume is too large, so the mutation task runs slowly or stops making progress.
Handling Procedure
- Log in to FusionInsight Manager, choose O&M > Alarm > Alarms, and in Location of the alarm, view the role name and the IP address of the host (HostName).
- Log in to the node where the client is installed and run the following commands:
cd {Client installation path}
source bigdata_env
- Security mode (with Kerberos enabled):
clickhouse client --host {IP address of the ClickHouseServer instance for which the alarm is reported} --port 21427 --secure
- Normal mode (with Kerberos disabled):
clickhouse client --host {IP address of the ClickHouseServer instance for which the alarm is reported} --user {Username} --password --port 21423
- Log in to FusionInsight Manager, choose Cluster > Services > ClickHouse, and click Configurations and then All Configurations. Search for the slow_mutation_cost_time parameter and note its value. Then run the following statement in the ClickHouse client to check whether any result is returned:
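SELECT * FROM system.mutations WHERE is_done = 0 AND create_time < now() - INTERVAL {Value of slow_mutation_cost_time} SECOND
Replace {Value of slow_mutation_cost_time} with the actual parameter value, and make sure the interval unit in the statement matches the unit of the parameter.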
- Wait for a while and run the preceding SELECT statement again. Check whether the value of parts_to_do in the returned result has decreased.
- If yes, wait until the mutation task is complete.
- If no, go to the next step.
- If the value of parts_to_do remains unchanged, stop the mutation task by running the following statement. Replace the placeholders with the database, table, and mutation_id values returned by the SELECT statement.
KILL MUTATION WHERE database = '{Database name}' AND table = '{Table name}' AND mutation_id = '{Mutation ID}'
Then run the SELECT statement again and check whether the mutation task still appears in the returned result.
- Wait for several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, collect fault information.
Collect fault information.
- On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
- Expand the Service drop-down list, and select ClickHouse for the target cluster.
- Expand the Hosts drop-down list. In the Select Host dialog box that is displayed, select the abnormal host, and click OK.
- Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 1 hour before and 1 hour after the alarm generation time, respectively. Then, click Download.
- Contact O&M engineers and provide the collected logs.
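The mutation checks in the handling procedure can also be prepared in a script. The following bash sketch rests on several assumptions: the helper names sql_quote, build_check_sql, build_kill_sql, and next_action are hypothetical; the threshold unit defaults to SECOND to match the INTERVAL clause of the SELECT statement in the procedure (pass MINUTE if slow_mutation_cost_time is expressed in minutes); and the selected columns are standard columns of the ClickHouse system.mutations table. The helpers only print SQL text and a suggested action; they do not connect to ClickHouse.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the ALM-45446 handling procedure.
# They only print SQL text or a suggested action; they do not connect to ClickHouse.

# Quote a value as a ClickHouse string literal: escape backslashes, double single quotes.
sql_quote() {
  local escaped
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/''/g")
  printf "'%s'" "$escaped"
}

# Build the check for unfinished mutations older than THRESHOLD.
# Usage: build_check_sql THRESHOLD [SECOND|MINUTE]  (unit defaults to SECOND)
build_check_sql() {
  local threshold=$1 unit=${2:-SECOND}
  if ! [[ $threshold =~ ^[0-9]+$ ]]; then
    echo "invalid threshold: $threshold" >&2
    return 1
  fi
  if [[ $unit != SECOND && $unit != MINUTE ]]; then
    echo "invalid unit: $unit" >&2
    return 1
  fi
  echo "SELECT database, table, mutation_id, create_time, parts_to_do, latest_fail_reason" \
       "FROM system.mutations WHERE is_done = 0" \
       "AND create_time < now() - INTERVAL ${threshold} ${unit}"
}

# Build the KILL MUTATION statement for one stuck mutation.
# Usage: build_kill_sql DATABASE TABLE MUTATION_ID
build_kill_sql() {
  echo "KILL MUTATION WHERE database = $(sql_quote "$1")" \
       "AND table = $(sql_quote "$2")" \
       "AND mutation_id = $(sql_quote "$3")"
}

# Compare two parts_to_do readings taken some time apart.
# Prints "wait" if the value decreased (the mutation is progressing),
# otherwise "kill" (no progress, so stop the mutation).
next_action() {
  local before=$1 after=$2
  if (( after < before )); then
    echo "wait"
  else
    echo "kill"
  fi
}

# Example with hypothetical values:
build_check_sql 3600
next_action 12 12                                   # prints: kill
build_kill_sql "default" "orders" "mutation_5.txt"
```

To run a generated statement, pass it to clickhouse client with the connection options for your cluster's security mode, for example clickhouse client --host {IP address of the ClickHouseServer instance for which the alarm is reported} --port 21427 --secure --query "$(build_check_sql 3600)". The database, table, and mutation ID values in the example are hypothetical.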
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.