Updated on 2024-11-13 GMT+08:00

ALM-50231 Abnormal Tablets Exist in Doris

Alarm Description

The alarm module checks for abnormal tablets in the Doris cluster every 5 minutes. This alarm is generated when an abnormal tablet is detected.

This alarm is cleared when no abnormal tablet exists in the Doris cluster.

This alarm applies only to MRS 3.5.0 or later.

Alarm Attributes

    • Alarm ID: 50231
    • Alarm Severity: Critical
    • Auto Cleared: Yes

Alarm Parameters

Type: Location Information

    • Source: Specifies the cluster or system for which the alarm was generated.
    • ServiceName: Specifies the service for which the alarm was generated.
    • RoleName: Specifies the role for which the alarm was generated.
    • HostName: Specifies the host for which the alarm was generated.

Impact on the System

Tablet exceptions may cause data query or write failures.

Possible Causes

The Doris data write frequency is too high, causing abnormal compaction operations or tablet migration failures.

Handling Procedure

  1. Log in to FusionInsight Manager, choose O&M > Alarm > Alarms, wait two minutes, and check whether the alarm is automatically cleared (the alarm logic includes the automatic clearance function).

    • If yes, no further action is required.
    • If no, go to step 1 under "Check the abnormal tablet and rectify the fault."

Check the abnormal tablet and rectify the fault.

  1. Select the alarm and check the value of tabletId in Additional Information. If there are too many abnormal tablets for Additional Information to display them all, search for "Abnormal tablets have" in the ${BIGDATA_LOG_HOME}/nodeagent/monitorlog/pluginmonitor.log file on the Master FE node to view the information about all abnormal tablets.
  2. Log in to the node where MySQL is installed and connect to the Doris database.

    If Kerberos authentication (security mode) has been enabled for the cluster, run the following commands to connect to the Doris database:

    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    mysql -u<Database login username> -p<Database login password> -P<Connection port for FE queries> -h<IP address of the Doris FE instance>

    • To obtain the query connection port of the Doris FE instance, you can log in to FusionInsight Manager, choose Cluster > Services > Doris > Configurations, and query the value of query_port of the Doris service.
    • You can log in to FusionInsight Manager and choose Cluster > Services > Doris > Instances to view the service IP address of any Doris FE instance.

  3. Run the following command to view details about the abnormal tablet:

    show tablet tabletId;

    Record the DbName and TableName values of the abnormal tablet. Then copy the command shown in the DetailCmd column of the command output and run it. The command has the following form:

    show proc xxx;

    In the command output, check whether the value of LstFailedTime is NULL and whether the value of VersionCount is greater than the specified threshold (200 by default).

    • If yes, go to 4.
    • If no, go to "Collect fault information."

  4. Run the following command to view the tablet repair and scheduling tasks that are being executed in the system:

    show proc "/cluster_balance";

    Run the command several times and check whether the values of pending_tablets and running_tablets in the command output decrease significantly. What counts as a significant decrease depends on the actual running environment.

    • If yes, go to 5.
    • If no, go to "Collect fault information."

  5. Repair the abnormal table. In the command, replace tableName with the table name recorded in 3.

    admin repair table tableName;

  6. After the abnormal table is restored, wait 2 minutes and check whether the alarm is automatically cleared in the alarm list.

    • If yes, no further action is required.
    • If no, go to "Collect fault information."
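
The diagnostic commands in the steps above can be wrapped in a few shell helpers. The following is a minimal sketch, not an official MRS tool: the function names and the variables FE_HOST, FE_QUERY_PORT, DORIS_USER, and DORIS_PASSWORD are hypothetical and must be set to the values obtained from FusionInsight Manager. It assumes a cluster in security mode and a mysql client on the node.

```shell
#!/usr/bin/env bash
# Sketch only. FE_HOST, FE_QUERY_PORT, DORIS_USER and DORIS_PASSWORD are
# hypothetical variable names, not settings defined by MRS or Doris.

# Print every line of the monitor log that reports abnormal tablets.
# An alternative log path can be passed as the first argument.
find_abnormal_tablet_lines() {
  local log_file="${1:-${BIGDATA_LOG_HOME}/nodeagent/monitorlog/pluginmonitor.log}"
  grep -F "Abnormal tablets have" "$log_file"
}

# Run one SQL statement against the Doris FE through the mysql client.
doris_sql() {
  # Needed when Kerberos authentication (security mode) is enabled.
  export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1
  mysql -u"$DORIS_USER" -p"$DORIS_PASSWORD" -P"$FE_QUERY_PORT" -h"$FE_HOST" -e "$1"
}

# Show only the pending_tablets and running_tablets rows of the scheduler
# status, so that repeated calls can be compared.
tablet_queue_counts() {
  doris_sql 'show proc "/cluster_balance";' | grep -E 'pending_tablets|running_tablets'
}

# Example calls (replace tabletId with a real tablet ID before running):
#   find_abnormal_tablet_lines
#   doris_sql 'show tablet tabletId;'
#   tablet_queue_counts
```

Calling tablet_queue_counts a few times, some minutes apart, gives a rough view of whether the repair queue is shrinking. Passing the password on the command line mirrors the command in the procedure; on shared hosts, an interactive password prompt may be preferable.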

Collect fault information.

  1. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  2. Expand the Service drop-down list, and select Doris for the target cluster.
  3. Click the edit icon in the upper right corner, set Start Date to 1 hour before the alarm generation time and End Date to 1 hour after it, and then click Download.
  4. Contact O&M engineers and provide the collected logs.
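
The log collection window from step 3 can be computed from the alarm generation time. The following is a small sketch that assumes GNU date (the -d option) on a Linux host. The function name log_window and the input format "YYYY-MM-DD HH:MM:SS" are illustrative choices, not part of FusionInsight Manager.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU date, as found on typical Linux hosts.

# Print the Start Date and End Date (1 hour before and after) for an
# alarm generation time given as "YYYY-MM-DD HH:MM:SS" in local time.
log_window() {
  local alarm_time="$1"
  local epoch
  epoch=$(date -d "$alarm_time" +%s) || return 1
  # Working in epoch seconds avoids ambiguous relative-date parsing.
  echo "Start Date: $(date -d "@$((epoch - 3600))" '+%Y-%m-%d %H:%M:%S')"
  echo "End Date: $(date -d "@$((epoch + 3600))" '+%Y-%m-%d %H:%M:%S')"
}

# Example call:
#   log_window "2024-11-13 10:30:00"
```

The two printed timestamps are the values to enter in the Start Date and End Date fields.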

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.