Updated on 2024-01-17 GMT+08:00

ALM-24005 Data Transmission by Flume Is Abnormal (For MRS 2.x or Earlier)

Description

The alarm module monitors the capacity of Flume channels. This alarm is generated when the length of time that a channel stays full, or the number of times that a source fails to send data to the channel, exceeds the threshold.

To change the threshold, modify the channelfullcount parameter.

This alarm is cleared after the Flume channel space is released.

Attribute

Alarm ID    Alarm Severity    Auto Clear
24005       Major             Yes

Parameters

Parameter        Description
ServiceName      Specifies the service for which the alarm is generated.
HostName         Specifies the host for which the alarm is generated.
ComponentType    Specifies the component type for which the alarm is generated.
ComponentName    Specifies the component name for which the alarm is generated.

Impact on the System

If the usage of the Flume channel continues to grow, the data transmission time increases. When the usage reaches 100%, the Flume agent process is suspended.

Possible Causes

  • The Flume sink is faulty.
  • The network is faulty.

Procedure

  1. Check whether the Flume sink is normal.

    a. Check whether the Flume sink is the HDFS type.
      • If yes, go to 1.b.
      • If no, go to 1.c.
    b. On MRS Manager, check whether the ALM-14000 HDFS Service Unavailable alarm is reported and whether the HDFS service is stopped.
      • If the alarm is reported, clear it according to the handling suggestions of ALM-14000 HDFS Service Unavailable. If the HDFS service is stopped, start it. Then go to 1.g.
      • If neither applies, go to 1.g.
    c. Check whether the Flume sink is the HBase type.
      • If yes, go to 1.d.
      • If no, go to 1.e.
    d. On MRS Manager, check whether the ALM-19000 HBase Service Unavailable alarm is reported and whether the HBase service is stopped.
      • If the alarm is reported, clear it according to the handling suggestions of ALM-19000 HBase Service Unavailable. If the HBase service is stopped, start it. Then go to 1.g.
      • If neither applies, go to 1.g.
    e. Check whether the Flume sink is the Kafka type.
      • If yes, go to 1.f.
      • If no, go to 1.g.
    f. On MRS Manager, check whether the ALM-38000 Kafka Service Unavailable alarm is reported and whether the Kafka service is stopped.
      • If the alarm is reported, clear it according to the handling suggestions of ALM-38000 Kafka Service Unavailable. If the Kafka service is stopped, start it. Then go to 1.g.
      • If neither applies, go to 1.g.
    g. Go to the MRS cluster details page and click Components.
    h. Choose Flume > Instances.
    i. Click the Flume instance of the faulty node and check whether the value of Sink Speed Metrics is 0.
      • If yes, go to 2.a.
      • If no, no further action is required.

  2. Check the status of the network between the Flume sink and the faulty node.

    a. Check whether the Flume sink is the Avro type.
      • If yes, go to 2.b.
      • If no, go to 3.
    b. Log in to the host where the faulty node resides. Run the following command to switch to user root:

      sudo su - root

    c. Run the ping command with the IP address of the Flume sink to check whether the sink can be reached.
      • If yes, go to 3.
      • If no, go to 2.d.
    d. Contact the network administrator to repair the network.
    e. Wait for a while and check whether the alarm is cleared.
      • If yes, no further action is required.
      • If no, go to 3.

  3. Collect fault information.

    a. On MRS Manager, choose System > Export Log.
    b. Contact the O&M engineers and send the collected logs.
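
The branching in steps 1 and 2 above can be summarized in a small shell sketch. It is illustrative only: the function names, the short sink-type labels (hdfs, hbase, kafka), and the PING_CMD override are assumptions made for this example and are not part of MRS or Flume. The alarm names and step labels are taken from the procedure.

```shell
#!/bin/sh
# Sketch only: function names and sink-type labels are hypothetical.

# Print the MRS alarm to check for a given sink type (steps 1.a to 1.f).
# Prints nothing for sink types that have no matching service alarm.
alarm_for_sink_type() {
    sink_type=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$sink_type" in
        hdfs)  echo "ALM-14000 HDFS Service Unavailable" ;;
        hbase) echo "ALM-19000 HBase Service Unavailable" ;;
        kafka) echo "ALM-38000 Kafka Service Unavailable" ;;
        *)     : ;;
    esac
}

# Ping the sink address (step 2.c) and print the step to go to next.
# PING_CMD is a hypothetical override for dry runs; it defaults to ping.
next_step_after_ping() {
    sink_ip="$1"
    if ${PING_CMD:-ping} -c 3 "$sink_ip" >/dev/null 2>&1; then
        echo "3"      # sink reachable: collect fault information
    else
        echo "2.d"    # sink unreachable: contact the network administrator
    fi
}
```

For example, alarm_for_sink_type hdfs prints ALM-14000 HDFS Service Unavailable, and next_step_after_ping followed by the Flume sink IP address prints the label of the step to continue with.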

Related Information

N/A