Updated on 2024-11-29 GMT+08:00

ALM-43001 Spark Service Unavailable

Alarm Description

The system checks the Spark service status every 300 seconds. This alarm is generated when the Spark service is unavailable.

This alarm is cleared when the Spark service recovers.

Alarm Attributes

Alarm ID: 43001
Alarm Severity: Critical
Alarm Type: Error handling
Service Type: Spark
Auto Cleared: Yes

Alarm Parameters

Location Information:

Source: Specifies the cluster for which the alarm is generated.
ServiceName: Specifies the service for which the alarm is generated.
RoleName: Specifies the role for which the alarm is generated.
HostName: Specifies the host for which the alarm is generated.

Impact on the System

The Spark tasks submitted by users fail to be executed.

Possible Causes

  • The KrbServer service is abnormal.
  • The LdapServer service is abnormal.
  • The ZooKeeper service is abnormal.
  • The HDFS service is abnormal.
  • The Yarn service is abnormal.
  • The corresponding Hive service is abnormal.
  • The Spark assembly package is abnormal.
  • The NameNode memory is insufficient.
  • The memory of the Spark process is insufficient.

Handling Procedure

If the alarm is caused by an abnormal Spark assembly package, wait about 10 minutes; the alarm is then cleared automatically.

Check whether any service unavailability alarms have been generated for the services that Spark depends on.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms.
  2. Check whether the following alarms exist in the alarm list:

    • ALM-25500 KrbServer Service Unavailable
    • ALM-25000 LdapServer Service Unavailable
    • ALM-13000 ZooKeeper Service Unavailable
    • ALM-14000 HDFS Service Unavailable
    • ALM-18000 Yarn Service Unavailable
    • ALM-16004 Hive Service Unavailable
    • If yes, go to 3.
    • If no, go to 4.

  3. Handle the alarms by following the instructions provided in the alarm help.

    After those alarms are cleared, wait a few minutes and check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 4.
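The scan in step 2 can also be done from an exported alarm list. A minimal sketch follows; the file path, its contents, and the one-alarm-per-line format are assumptions for illustration, not the actual FusionInsight Manager export format.

```shell
# Sketch: scan an exported alarm list for the dependency alarms named above.
# The file below is sample data for illustration only.
cat > /tmp/alarms.txt <<'EOF'
ALM-14000 HDFS Service Unavailable
ALM-43001 Spark Service Unavailable
EOF

# Any match means a dependency alarm exists (go to 3); no match means go to 4.
grep -oE 'ALM-(25500|25000|13000|14000|18000|16004)' /tmp/alarms.txt | sort -u
```

With the sample data above, the scan reports ALM-14000, so the HDFS alarm would be handled first.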

Check whether the NameNode memory is insufficient.

  4. Check whether the NameNode memory is insufficient.

    • If yes, go to 5.
    • If no, go to 6.

  5. Restart the NameNode to release the memory. Then, check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 6.
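One way to check NameNode memory pressure is to inspect old-generation heap utilization with jstat on the NameNode host. The sketch below parses a sample line standing in for live `jstat -gcutil` output; the 90% threshold is an illustrative assumption, not a product-defined limit.

```shell
# Sketch: flag old-generation heap pressure from jstat -gcutil output.
# On the live NameNode host you would run: jstat -gcutil <NameNode PID> 1000 5
# The sample line below is illustrative, not real output.
sample='  0.00  96.50  43.20  98.70  95.10  90.00   1024   12.3    45    6.7   19.0'

old_gen=$(echo "$sample" | awk '{print $4}')   # O column = old-gen utilization (%)

# Sustained old-gen use above ~90% suggests the NameNode heap is too small.
awk -v o="$old_gen" 'BEGIN { exit !(o > 90) }' && echo "NameNode old gen above 90%"
```

If the flag fires repeatedly across samples, treat the NameNode memory as insufficient and go to 5.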

Check whether the memory of the Spark process is insufficient.

  6. Check whether the memory of the Spark process is insufficient due to memory-related configuration changes.

    • If yes, go to 7.
    • If no, go to 8.

  7. Ensure that the Spark process has sufficient memory, or expand the cluster capacity. Then, check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
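For step 6, it helps to list the memory-related Spark settings that a recent change may have lowered. The sketch below writes a sample configuration file and greps it; the path and values are assumptions for illustration, so substitute your cluster's actual Spark configuration directory.

```shell
# Sketch: list memory-related Spark settings to review for recent changes.
# /tmp/spark-defaults.conf is sample data; use your real config path instead.
CONF=/tmp/spark-defaults.conf
cat > "$CONF" <<'EOF'
spark.driver.memory      2g
spark.executor.memory    4g
spark.yarn.am.memory     1g
EOF

grep -E '^spark\.(driver|executor|yarn\.am)\.memory' "$CONF"
```

Compare the printed values against what the workload needed before the change; if they were reduced, restore them or expand capacity as described in 7.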

Collect fault information.

  8. On FusionInsight Manager, choose O&M > Log > Download.
  9. Expand the Service drop-down list, and select the following services for the target cluster (whether Hive is required is determined by ServiceName in the alarm location information):

    • KrbServer
    • LdapServer
    • ZooKeeper
    • HDFS
    • Yarn
    • Hive

  10. In the upper right corner, set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
  11. Contact O&M engineers and provide the collected logs.
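The 10-minute collection window in step 10 can be computed rather than worked out by hand. The sketch below assumes GNU date and uses an example alarm timestamp; epoch arithmetic avoids the ambiguity of relative-date strings.

```shell
# Sketch: derive the log-collection window (alarm time +/- 10 minutes).
# Requires GNU date; ALARM_TS is an example value, taken as UTC here.
ALARM_TS='2024-11-29 10:05:00'

epoch=$(date -u -d "$ALARM_TS" +%s)
START=$(date -u -d "@$((epoch - 600))" +'%Y-%m-%d %H:%M:%S')
END=$(date -u -d "@$((epoch + 600))" +'%Y-%m-%d %H:%M:%S')

echo "Start Date: $START"   # Start Date: 2024-11-29 09:55:00
echo "End Date:   $END"     # End Date:   2024-11-29 10:15:00
```

Enter the two printed values as Start Date and End Date in the download dialog.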

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.