Updated on 2024-11-13 GMT+08:00

ALM-13009 ZooKeeper Znode Capacity Usage Exceeds the Threshold

Alarm Description

The system checks the level-2 ZNode status in the ZooKeeper data directory every hour (every 10 minutes in MRS 3.5.0 and later versions). This alarm is generated when the system detects that the capacity usage exceeds the threshold.

Alarm Attributes

Alarm ID    Alarm Severity    Auto Cleared
13009       Major             Yes

Alarm Parameters

Parameter            Description
Source               Specifies the cluster for which the alarm was generated.
ServiceName          Specifies the service for which the alarm was generated.
ServiceDirectory     Specifies the directory for which the alarm was generated.
RoleName             Specifies the role for which the alarm was generated.
Trigger Condition    Specifies the threshold for triggering the alarm.

Impact on the System

ZooKeeper cannot provide services for external systems, and components that depend on the directory for which the alarm was generated (such as Yarn, Flink, and Spark) become abnormal.

Possible Causes

  • A large volume of data has been written to the ZooKeeper data directory.
  • The threshold is improperly defined.

Handling Procedure

Check whether a large volume of data is written to the alarm directory.

  1. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Alarm > Alarms. Click the drop-down list in the row containing ALM-13009 ZooKeeper ZNode Capacity Usage Exceeds the Threshold, and find the ZNode for which the alarm is generated in the Location area.
  2. Choose Cluster > Services > ZooKeeper. On the page that is displayed, click the Resource tab. In the Used Resources (By Second-Level ZNode) area, click By capacity and check whether a large amount of data is written to the top-level ZNode directory.

    • If yes, record the directory to which a large amount of data is written and go to 3.
    • If no, go to 5.

  3. Check whether data in the directory can be deleted.

    Deleting data from ZooKeeper is a high-risk operation. Exercise caution when performing this operation.

    • If yes, go to 4.
    • If no, go to 5.

  4. Log in to the ZooKeeper client and delete unnecessary data from the directory to which a large amount of data is written.

    1. Go to the ZooKeeper client installation directory, for example, /opt/client, and configure environment variables.

      cd /opt/client

      source bigdata_env

    2. Run the following command to authenticate the user (skip this step for a cluster in normal mode):

      kinit <Component service user>

    3. Run the following command to log in to the client tool:

      zkCli.sh -server <Service IP address of the node where any ZooKeeper instance resides>:<Client port>

    4. Run the following command to delete unnecessary data:

      delete <Path of the ZNode to be deleted>

  5. Log in to FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > ZooKeeper > Configurations > All Configurations. Search for max.data.size, which sets the maximum capacity quota of the ZooKeeper directory, in bytes. Then search for the GC_OPTS configuration item and check the value of Xmx.
  6. Compare max.data.size with Xmx*0.65. The alarm threshold is 80% of the smaller of the two values. To raise the threshold, increase max.data.size or Xmx, whichever one is the limiting value.
  7. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
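
The threshold rule in step 6 can be checked with a quick calculation before changing any configuration. The following is a minimal sketch, not part of the product tooling: the function name znode_threshold and the sample values (max.data.size of 200 MiB, Xmx of 1 GiB) are assumptions chosen for illustration, and integer arithmetic means results are rounded down to whole bytes.

```shell
#!/usr/bin/env bash
# Sketch of the rule in step 6:
#   threshold = min(max.data.size, Xmx * 0.65) * 80%
# All values are in bytes. The function name and inputs are hypothetical.

# Usage: znode_threshold MAX_DATA_SIZE_BYTES XMX_BYTES
znode_threshold() {
  local max_data_size=$1 xmx=$2
  # Xmx * 0.65 using integer arithmetic (rounded down).
  local heap_limit=$(( xmx * 65 / 100 ))
  # Pick the smaller of the two limits.
  local smaller=$(( max_data_size < heap_limit ? max_data_size : heap_limit ))
  # The alarm threshold is 80% of the smaller limit.
  echo $(( smaller * 80 / 100 ))
}

# Assumed sample values: max.data.size = 200 MiB, Xmx = 1 GiB.
max_data_size=$(( 200 * 1024 * 1024 ))
xmx=$(( 1024 * 1024 * 1024 ))
znode_threshold "$max_data_size" "$xmx"   # prints 167772160 (160 MiB)
```

With these sample values, max.data.size is the limiting value, so increasing Xmx alone would not raise the threshold.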

Collect the fault information.

  8. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  9. Expand the Service drop-down list, and select ZooKeeper for the target cluster.
  10. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  11. Contact O&M personnel and provide the collected logs.
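
The Start Date and End Date for log collection can be derived from the alarm generation time. The following is a minimal sketch that assumes GNU date (as found on typical Linux nodes). The function name log_window and the sample alarm time are hypothetical, and the input is treated as a plain wall-clock time, so the output is in the same time zone as the value you pass in.

```shell
#!/usr/bin/env bash
# Sketch: print a log collection window of 10 minutes before and
# 10 minutes after the alarm generation time. Requires GNU date.
# The function name and the sample time are hypothetical.

# Usage: log_window "YYYY-MM-DD HH:MM:SS"
log_window() {
  local alarm_epoch
  # -u keeps the arithmetic independent of the local time zone and DST.
  alarm_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$(( alarm_epoch - 600 ))" '+Start Date: %Y-%m-%d %H:%M:%S'
  date -u -d "@$(( alarm_epoch + 600 ))" '+End Date: %Y-%m-%d %H:%M:%S'
}

log_window "2024-11-13 10:05:00"
# Start Date: 2024-11-13 09:55:00
# End Date: 2024-11-13 10:15:00
```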

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None