ALM-24001 Flume Agent Exception

Alarm Description

The Flume agent monitoring module monitors the Flume agent status. This alarm is generated when the Flume agent process is faulty (checked every 5 seconds) or the Flume agent fails to start (an alarm is reported immediately).

This alarm is cleared when the Flume agent process recovers, the Flume agent starts successfully, and alarm handling is complete.

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
24001	Major	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster for which the alarm was generated.
ServiceName	Specifies the service for which the alarm was generated.
AgentId	Specifies the ID of the agent for which the alarm was generated.
RoleName	Specifies the role for which the alarm was generated.
HostName	Specifies the host for which the alarm was generated.

Impact on the System

The Flume agent instance for which the alarm is generated cannot provide services properly, and the data transmission tasks of the instance are temporarily interrupted. If the instance transmits data in real time, the data sent during the interruption is lost.

Possible Causes

The JAVA_HOME directory does not exist, or the Java permission is incorrect.
The Flume agent directory permission is incorrect.
The Flume agent fails to start.

Handling Procedure

Check whether the JAVA_HOME directory exists and whether the Java permission is correct.

1. Log in to the host for which the alarm is generated as user root.
2. Obtain the installation directory of the Flume client for which the alarm is generated. (The value of AgentId can be obtained from Location of the alarm.)

ps -ef | grep AgentId | grep -v grep | awk -F 'conf-file ' '{print $2}' | awk -F 'fusioninsight' '{print $1}'
3. Run the su - Flume installation user command to switch to the Flume installation user, and then run the cd Flume client installation directory/fusioninsight-flume-1.9.0/conf/ command to go to the Flume configuration directory.
4. Run the cat ENV_VARS | grep JAVA_HOME command.
5. Check whether the JAVA_HOME directory exists. If neither the command output in 4 nor the output of ll $JAVA_HOME/ is empty, the JAVA_HOME directory exists.
- If yes, go to 7.
- If no, go to 6.
6. Specify a correct JAVA_HOME directory, for example, export JAVA_HOME=${BIGDATA_HOME}/common/runtime0/jdkVersion number.
7. Run the $JAVA_HOME/bin/java -version command to check whether the Flume agent running user has the Java execution permission. If the Java version is displayed in the command output, the Java permission meets the requirement. Otherwise, the Java permission does not meet the requirement.
- If yes, go to 9.
- If no, go to 8.
  
  JAVA_HOME is the environment variable exported during Flume client installation. You can also go to Flume client installation directory/fusioninsight-flume-1.9.0/conf and run the cat ENV_VARS | grep JAVA_HOME command to view the variable value.
8. Run the chmod 750 $JAVA_HOME/bin/java command to grant the Java execution permission to the Flume agent running user.
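
The JAVA_HOME checks above can be combined into one helper. The following is a minimal sketch, not part of the product: the function name check_java_home is hypothetical, and it assumes that ENV_VARS contains a line of the form export JAVA_HOME=/path with a literal path (a value that references ${BIGDATA_HOME} is not expanded). Run it as the Flume agent running user so that the permission test reflects that user.

```shell
#!/bin/sh
# Sketch only: checks the JAVA_HOME value recorded in an ENV_VARS file.
# Assumption: the file holds a line such as "export JAVA_HOME=/path" with a
# literal path; variable references inside the value are not expanded.
check_java_home() {
    env_file="$1"
    # Use the last JAVA_HOME assignment and strip the key and any quotes.
    java_home=$(grep -E '^(export[[:space:]]+)?JAVA_HOME=' "$env_file" 2>/dev/null \
        | tail -n 1 \
        | sed -E 's/^(export[[:space:]]+)?JAVA_HOME=//' \
        | tr -d "\"'")
    if [ -z "$java_home" ] || [ ! -d "$java_home" ]; then
        echo "JAVA_HOME directory not found: ${java_home:-unset}"
        return 1
    fi
    if [ ! -x "$java_home/bin/java" ]; then
        echo "No execute permission on $java_home/bin/java"
        return 2
    fi
    echo "JAVA_HOME looks usable: $java_home"
    return 0
}
```

For example, check_java_home ./ENV_VARS in the Flume configuration directory returns 1 when the directory is missing (specify a correct JAVA_HOME) and 2 when the java binary is not executable (grant the execution permission).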

Check the directory permission of the Flume agent.

9. Log in to the host for which the alarm is generated as user root.
10. Run the following command to go to the conf directory under the Flume agent installation directory:

cd Flume client installation directory/fusioninsight-flume-1.9.0/conf/
11. Run the ls -al * -R command to check whether all files are owned by the user who runs the Flume agent.
- If yes, go to 12.
- If no, run the chown command to change the file owner to the user who runs the Flume agent.
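
Reading the recursive ls output by eye is error-prone in large directories. As an alternative, the sketch below uses find to print only the entries that are not owned by a given user. The function name list_foreign_files is hypothetical, and the directory and user name are supplied by the caller.

```shell
#!/bin/sh
# Sketch only: print every file or directory under conf_dir that is NOT
# owned by agent_user. Empty output means the ownership is consistent.
list_foreign_files() {
    conf_dir="$1"
    agent_user="$2"
    find "$conf_dir" ! -user "$agent_user" -print
}
```

Empty output corresponds to the "yes" branch of the ownership check. Any path that is printed is a candidate for the chown command.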

Check the Flume agent configuration.

12. Run the cat properties.properties | grep spooldir and cat properties.properties | grep TAILDIR commands to check whether the Flume source type is spoolDir or tailDir. If either command returns output, the Flume source type is spoolDir or tailDir.
- If yes, go to 13.
- If no, go to 17.
13. Check whether the data monitoring directory exists.
- If yes, go to 15.
- If no, go to 14.
  
  Run the cat properties.properties | grep spoolDir command to view the spoolDir monitoring directory.
  
  Run the cat properties.properties | grep parentDir command to view the tailDir monitoring directory.
14. Specify a correct data monitoring directory.
15. Check whether the Flume agent running user has the read, write, and execute permissions on the monitoring directory specified in 13.
- If yes, go to 17.
- If no, go to 16.
  
  Go to the monitoring directory as the Flume running user. If files can be created, the Flume running user has the read, write, and execute permissions on the monitoring directory.
16. Run the chmod 777 Flume monitoring directory command to grant the Flume agent running user the read, write, and execute permissions on the monitoring directory specified in 13.
17. Check whether the components connected to the Flume sink are in security mode.
- If yes, go to 18.
- If no, go to 23.
  
  If the sink in the properties.properties configuration file is an HDFS sink or an HBase sink, and the configuration file contains a keytab file, the components connected to the Flume sink are in security mode.
  
  If the sink in the properties.properties configuration file is the Kafka sink and *.security.protocol is set to SASL_PLAINTEXT or SASL_SSL, Kafka connected to the Flume sink is in security mode.
18. Run the ll keytab path command to check whether the keytab authentication path specified by the *.kerberosKeytab parameter in the configuration file exists.
- If yes, go to 20.
- If no, go to 19.
  
  To view the keytab path, run the cat properties.properties | grep keytab command.
19. Change the value of kerberosKeytab in 18 to the custom keytab path and go to 21.
20. Perform 18 as the Flume agent running user to check whether that user has the permission to access the keytab authentication file. If the keytab path is returned, the user has the permission. Otherwise, the user does not have the permission.
- If yes, go to 22.
- If no, go to 21.
21. Run the chmod 755 keytab file command to grant the read permission on the keytab file specified in 19, and restart the Flume process.
22. Check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 23.
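
The monitoring-directory and keytab checks in this part of the procedure can be scripted. The following is a minimal sketch under stated assumptions: the function names prop_values, check_monitor_dirs, and check_keytabs are hypothetical, properties.properties is assumed to use key = value lines, and paths are assumed to contain no spaces. Run it as the Flume agent running user, because user root passes read and write tests regardless of the file mode.

```shell
#!/bin/sh
# Sketch only. Assumes "key = value" lines in properties.properties and
# paths without spaces. Run as the Flume agent running user.

# Print the values of all uncommented keys whose name ends with the pattern.
prop_values() {
    grep -v '^[[:space:]]*#' "$1" \
        | grep -E "$2[[:space:]]*=" \
        | sed -E 's/^[^=]*=[[:space:]]*//; s/[[:space:]]*$//'
}

# Report monitoring directories that are missing or lack rwx permission.
check_monitor_dirs() {
    status=0
    for dir in $(prop_values "$1" '(spoolDir|parentDir)'); do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"; status=1
        elif [ ! -r "$dir" ] || [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
            echo "insufficient permission: $dir"; status=1
        fi
    done
    return "$status"
}

# Report keytab files that are missing or unreadable.
check_keytabs() {
    status=0
    for kt in $(prop_values "$1" 'kerberosKeytab'); do
        if [ ! -r "$kt" ]; then
            echo "missing or unreadable keytab: $kt"; status=1
        fi
    done
    return "$status"
}
```

Both check functions print nothing and return 0 when every configured path passes. A nonzero return value indicates that a directory or keytab path needs to be corrected or that permissions need to be granted.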

Collect fault information.

23. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
24. Expand the Service drop-down list, and select Flume for the target cluster.
25. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 1 hour before and 1 hour after the alarm generation time, respectively. Then, click Download.
26. Contact O&M personnel and provide the collected logs.
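
The Start Date and End Date for log collection are the alarm generation time minus and plus 1 hour. The sketch below computes both values. The function name log_window is hypothetical, the code assumes a date command that supports the -d option (as GNU date does), and it treats the timestamp as UTC only to keep the arithmetic reproducible. Use the time zone shown on FusionInsight Manager when entering the values.

```shell
#!/bin/sh
# Sketch only: print the log collection window around an alarm time.
# Assumes a date command with -d support and treats the timestamp as UTC.
log_window() {
    alarm_time="$1"
    # Convert to epoch seconds, then shift by one hour in each direction.
    epoch=$(date -u -d "$alarm_time" +%s) || return 1
    start=$(date -u -d "@$((epoch - 3600))" '+%Y-%m-%d %H:%M:%S')
    end=$(date -u -d "@$((epoch + 3600))" '+%Y-%m-%d %H:%M:%S')
    echo "Start Date: $start"
    echo "End Date:   $end"
}
```

For example, log_window "2024-05-01 10:30:00" prints a Start Date of 2024-05-01 09:30:00 and an End Date of 2024-05-01 11:30:00.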