ALM-13009 ZooKeeper ZNode Capacity Usage Exceeds the Threshold

Alarm Description

The system checks the capacity usage of each level-2 ZNode in the ZooKeeper data directory every hour. This alarm is generated when the capacity usage of a level-2 ZNode exceeds the threshold.

Alarm Attributes

Alarm ID	Alarm Severity	Alarm Type	Service Type	Auto Cleared
13009	Critical (default threshold: 90%); Major (default threshold: 80%)	Quality of service	ZooKeeper	Yes
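
As a rough illustration of how the two default thresholds relate to a capacity reading, the following shell sketch classifies a usage value. The function name alarm_severity and its two arguments (used bytes and capacity bytes) are hypothetical and not part of FusionInsight Manager. The sketch also assumes that "exceeds" means strictly greater than the threshold.

```shell
# Hypothetical helper: classify ZNode capacity usage against the default
# thresholds from the table above (Major: 80%, Critical: 90%).
# Usage: alarm_severity USED_BYTES CAPACITY_BYTES
alarm_severity() {
  used=$1
  capacity=$2
  # Compare used/capacity with each percentage using integer arithmetic only.
  if [ $((used * 100)) -gt $((capacity * 90)) ]; then
    echo "Critical"
  elif [ $((used * 100)) -gt $((capacity * 80)) ]; then
    echo "Major"
  else
    echo "None"
  fi
}

# Example: 850 bytes used out of 1000 is above 80% but not above 90%.
alarm_severity 850 1000
```

A reading exactly at a threshold (for example, 800 out of 1000) is reported as None under the strict comparison assumed here.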

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster for which the alarm was generated.
	ServiceName	Specifies the service for which the alarm was generated.
	ServiceDirectory	Specifies the directory for which the alarm was generated.
	RoleName	Specifies the role for which the alarm was generated.
Additional Information	Trigger Condition	Specifies the alarm triggering condition.

Impact on the System

ZooKeeper cannot provide services for external systems, and the services of upstream components (such as Yarn, Flink, and Spark) that depend on the alarm directory are abnormal.

Possible Causes

A large volume of data has been written to the ZooKeeper data directory.
The threshold is improperly defined.

Handling Procedure

Check whether a large volume of data is written to the alarm directory.

1. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Alarm > Alarms. Click the drop-down list in the row containing ALM-13009 ZooKeeper ZNode Capacity Usage Exceeds the Threshold, and find the ZNode for which the alarm is generated in the Location area.
2. Choose Cluster > Services > ZooKeeper. On the page that is displayed, click the Resource tab. In the Used Resources (By Second-Level ZNode) area, click By capacity and check whether a large amount of data is written to the top-level ZNode directory.
   - If yes, record the directory to which a large amount of data is written and go to 3.
   - If no, go to 5.
3. Check whether data in the directory can be deleted.
   Deleting data from ZooKeeper is a high-risk operation. Exercise caution when performing this operation.
   - If yes, go to 4.
   - If no, go to 5.
4. Log in to the ZooKeeper client and delete unnecessary data from the directory to which a large amount of data is written.
   a. Go to the ZooKeeper client installation directory, for example, /opt/client, and configure environment variables.
      cd /opt/client
      source bigdata_env
   b. Run the following command to authenticate the user (skip this step for a cluster in normal mode):
      kinit <Component service user>
   c. Run the following command to log in to the client tool:
      zkCli.sh -server <Service IP address of the node where any ZooKeeper instance resides>:<Client port>
   d. Run the following command to delete unnecessary data:
      delete <Path of the ZNode to be deleted>
5. Log in to FusionInsight Manager and choose Cluster > Services > ZooKeeper. On the page that is displayed, click the Configuration tab and then the All Configurations sub-tab, and search for max.data.size. The value of max.data.size is the maximum capacity quota of the ZooKeeper directory, in bytes. Then search for the GC_OPTS configuration item and check the value of Xmx.
6. Compare the values of max.data.size and Xmx*0.65. The threshold is the smaller value multiplied by 80%. To raise the threshold, increase max.data.size or Xmx, whichever yields the smaller value.
7. Check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 8.
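
The comparison of max.data.size and Xmx*0.65 described above can be worked through with a small shell sketch. The function names xmx_to_bytes and znode_alarm_threshold are hypothetical helpers, not FusionInsight or ZooKeeper commands. The sketch assumes that Xmx appears in GC_OPTS in the usual JVM form (for example, -Xmx4G) and uses integer arithmetic, so fractional bytes are truncated.

```shell
# Hypothetical helpers for the threshold comparison described above.

# Convert a JVM-style heap size (for example 4G, 512m, 2048k, or plain bytes)
# to bytes. Variables are shell globals; no input validation is attempted.
xmx_to_bytes() {
  value=$1
  number=${value%[kKmMgG]}   # strip one trailing unit suffix, if present
  case $value in
    *[kK]) echo $((number * 1024)) ;;
    *[mM]) echo $((number * 1024 * 1024)) ;;
    *[gG]) echo $((number * 1024 * 1024 * 1024)) ;;
    *)     echo "$number" ;;
  esac
}

# Print the alarm threshold in bytes:
#   min(max.data.size, Xmx * 0.65) * PERCENT / 100
# Usage: znode_alarm_threshold MAX_DATA_SIZE_BYTES XMX_BYTES [PERCENT]
znode_alarm_threshold() {
  max_data_size=$1
  xmx_cap=$(( $2 * 65 / 100 ))   # Xmx * 0.65, truncated to whole bytes
  percent=${3:-80}               # 80% unless another percentage is given
  if [ "$max_data_size" -lt "$xmx_cap" ]; then
    base=$max_data_size
  else
    base=$xmx_cap
  fi
  echo $(( base * percent / 100 ))
}

# Example values (assumed): max.data.size = 1 GiB and -Xmx4G
znode_alarm_threshold 1073741824 "$(xmx_to_bytes 4G)"
```

With these example values, the 1 GiB quota is smaller than 65% of the 4 GiB heap, so max.data.size determines the threshold and raising Xmx alone would not change it.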

Collect the fault information.

8. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
9. Expand the Service drop-down list, and select ZooKeeper for the target cluster.
10. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
11. Contact O&M engineers and provide the collected logs.
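
The log collection window (10 minutes on either side of the alarm generation time) can be derived with a short shell sketch. The function name log_window is hypothetical, and the alarm time is assumed to be available as epoch seconds. The values still have to be entered as Start Date and End Date on FusionInsight Manager by hand.

```shell
# Hypothetical helper: print the start and end of the log collection window
# (10 minutes before and after the alarm) as epoch seconds.
# Usage: log_window ALARM_EPOCH_SECONDS
log_window() {
  alarm_time=$1
  margin=$((10 * 60))   # 10 minutes in seconds
  echo "$((alarm_time - margin)) $((alarm_time + margin))"
}

# Example with an assumed alarm time of 1700000000 (epoch seconds).
log_window 1700000000
```

On systems with GNU date, each printed value can be converted to a readable timestamp with date -d "@<value>".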