ALM-24001 Flume Agent Exception

Description

The Flume agent instance for which the alarm is generated cannot be started. This alarm is generated when the Flume agent process is faulty (the system checks the process every 5 seconds) or when the Flume agent fails to start (the system reports the alarm immediately).

This alarm is cleared when the Flume agent process recovers, the Flume agent starts successfully, and the alarm handling is completed.

Attribute

Alarm ID	Alarm Severity	Auto Clear
24001	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
AgentId	Specifies the ID of the agent for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.

Impact on the System

The Flume agent instance for which the alarm is generated cannot provide services properly, and the data transmission tasks of the instance are temporarily interrupted. If the instance transmits data in real time, that data is lost during the interruption.

Possible Causes

The JAVA_HOME directory does not exist or the Java permission is incorrect.
The Flume agent directory permission is incorrect.
Flume agent fails to start.

Procedure

Check whether the JAVA_HOME directory exists and whether the Java permission is correct.

Log in to the host for which the alarm is generated as user root.
Run the following command to obtain the installation directory of the Flume client for which the alarm is generated. Replace AgentId with the value shown in Location of the alarm.

ps -ef|grep AgentId | grep -v grep | awk -F 'conf-file ' '{print $2}' | awk -F 'fusioninsight' '{print $1}'
Run the su - Flume installation user command to switch to the Flume installation user and run the cd Flume client installation directory/fusioninsight-flume-1.9.0/conf/ command to go to the Flume configuration directory.
Run the cat ENV_VARS | grep JAVA_HOME command.
Check whether the JAVA_HOME directory exists. If the command output in 4 is not empty and the output of ll $JAVA_HOME/ is not empty, the JAVA_HOME directory exists.
- If yes, go to 7.
- If no, go to 6.
Specify a correct JAVA_HOME directory.
Run the $JAVA_HOME/bin/java -version command to check whether the Flume agent running user has the Java execution permission. If the Java version is displayed in the command output, the Java permission meets the requirement. Otherwise, the Java permission does not meet the requirement.
- If yes, go to 9.
- If no, go to 8.
  
  JAVA_HOME is the environment variable exported during Flume client installation. You can also go to Flume client installation directory/fusioninsight-flume-1.9.0/conf and run the cat ENV_VARS | grep JAVA_HOME command to view the variable value.
Run the chmod 750 $JAVA_HOME/bin/java command to grant the Java execution permission to the Flume agent running user.
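
The JAVA_HOME checks above can be sketched as one shell function. This is a minimal sketch, not part of the product: the function name check_java_home is hypothetical, and it only tests that the directory exists and that bin/java is executable by the current user, so run it as the Flume agent running user.

```shell
# check_java_home DIR (hypothetical helper, not shipped with Flume)
# Returns 0 if DIR exists and DIR/bin/java is executable by the current user,
# 1 if the directory is missing, 2 if java cannot be executed.
check_java_home() {
    java_home="$1"
    if [ -z "$java_home" ] || [ ! -d "$java_home" ]; then
        echo "JAVA_HOME directory does not exist: $java_home"
        return 1
    fi
    if [ ! -x "$java_home/bin/java" ]; then
        echo "No execute permission on $java_home/bin/java"
        return 2
    fi
    echo "JAVA_HOME looks usable: $java_home"
    return 0
}
```

For example, after loading the JAVA_HOME value found in ENV_VARS, check_java_home "$JAVA_HOME" returns 1 when the directory needs to be corrected and 2 when the chmod step is needed.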

Check the directory permission of the Flume agent.

Log in to the host for which the alarm is generated as user root.
Run the following command to switch to the Flume agent installation directory:

cd Flume client installation directory/fusioninsight-flume-1.9.0/conf/
Run the ls -al * -R command to check whether the owner of all files is the user who runs the Flume agent.
- If yes, go to 12.
- If no, run the chown command to change the file owner to the user who runs the Flume agent.
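
As an alternative to reading the ls output by eye, the ownership check above can be sketched with find. This is a sketch under assumptions: the function name list_foreign_files is hypothetical, and the second argument is whatever user name or numeric UID runs the Flume agent in your environment.

```shell
# list_foreign_files DIR USER (hypothetical helper)
# Prints every file or directory under DIR that is NOT owned by USER.
# USER may be a user name or a numeric UID. Empty output means that
# all entries are owned by USER.
list_foreign_files() {
    find "$1" ! -user "$2" -print
}
```

If the function prints nothing for the conf directory, the ownership is already correct. Otherwise, each printed path is a candidate for the chown command.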

Check the Flume agent configuration.

Run the cat properties.properties | grep spooldir and cat properties.properties | grep TAILDIR commands to check whether the Flume source type is spoolDir or tailDir. If any command output is displayed, the Flume source type is spoolDir or tailDir.
- If yes, go to 13.
- If no, go to 17.
Check whether the data monitoring directory exists.
- If yes, go to 15.
- If no, go to 14.
  
  Run the cat properties.properties | grep spoolDir command to view the spoolDir monitoring directory.
  
  Run the cat properties.properties | grep parentDir command to view the tailDir monitoring directory.
Specify a correct data monitoring directory.
Check whether the Flume agent user has the read, write, and execute permissions on the monitoring directory specified in 13.
- If yes, go to 17.
- If no, go to 16.
  
  Go to the monitoring directory as the Flume running user. If files can be created, the Flume running user has the read, write, and execute permissions on the monitoring directory.
Run the chmod 777 Flume monitoring directory command to grant the Flume agent running user the read, write, and execute permissions on the monitoring directory specified in 13.
Check whether the components connected to the Flume sink are in security mode.
- If yes, go to 18.
- If no, go to 23.
  
  If the sink in the properties.properties configuration file is the HDFS sink or HBase sink, and the configuration file specifies a keytab file, the components connected to the Flume sink are in security mode.
  
  If the sink in the properties.properties configuration file is the Kafka sink and *.security.protocol is set to SASL_PLAINTEXT or SASL_SSL, Kafka connected to the Flume sink is in security mode.
Run the ll keytab path command to check whether the keytab authentication path specified by the *.kerberosKeytab parameter in the configuration file exists.
- If yes, go to 20.
- If no, go to 19.
  
  To view the keytab path, run the cat properties.properties | grep keytab command.
Change the value of kerberosKeytab in 18 to the custom keytab path and go to 21.
Perform 18 to check whether the Flume agent running user has the permission to access the keytab authentication file. If the keytab path is returned, the user has the permission. Otherwise, the user does not have the permission.
- If yes, go to 22.
- If no, go to 21.
Run the chmod 755 keytab file command to grant the read permission on the keytab file specified in 19, and restart the Flume process.
Check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 23.
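
The monitoring directory and keytab checks in this part of the procedure can be sketched as two shell functions. This is a minimal sketch under assumptions: the function names are hypothetical, the checks apply to the user who runs them (so run them as the Flume agent running user), and the keytab lookup assumes one key = value pair per line with a key ending in kerberosKeytab and a path that contains no spaces.

```shell
# check_monitor_dir DIR (hypothetical helper)
# Returns 0 if DIR exists and the current user can read, write, and enter it,
# 1 if DIR does not exist, 2 if a permission is missing.
check_monitor_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "Monitoring directory does not exist: $dir"
        return 1
    fi
    if [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "Permissions are sufficient: $dir"
        return 0
    fi
    echo "Missing read, write, or execute permission: $dir"
    return 2
}

# check_keytabs FILE (hypothetical helper)
# Reads every uncommented "...kerberosKeytab = PATH" line in FILE and reports
# whether PATH is readable by the current user. Returns 1 if any is not.
check_keytabs() {
    rc=0
    for kt in $(grep -E '^[^#]*kerberosKeytab[[:space:]]*=' "$1" |
                sed -e 's/^[^=]*=[[:space:]]*//' -e 's/[[:space:]]*$//'); do
        if [ -r "$kt" ]; then
            echo "readable: $kt"
        else
            echo "not readable: $kt"
            rc=1
        fi
    done
    return $rc
}
```

A return value of 1 from check_monitor_dir corresponds to correcting the monitoring directory, and 2 corresponds to the chmod step on that directory. A "not readable" line from check_keytabs properties.properties means that the keytab path or its permission still needs to be fixed.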

Collect the fault information.

On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
Expand the Service drop-down list, and select Flume for the target cluster.
Click the icon in the upper right corner, and set Start Date and End Date for log collection to 1 hour before and 1 hour after the alarm generation time, respectively. Then, click Download.
Contact O&M personnel and provide the collected logs.