Updated on 2024-11-29 GMT+08:00

ALM-43001 Spark Service Unavailable

Alarm Description

The system checks the Spark service status every 300 seconds. This alarm is generated when the Spark service is unavailable.

This alarm is cleared when the Spark service recovers.

Alarm Attributes

Alarm ID: 43001
Alarm Severity: Critical
Alarm Type: Error handling
Service Type: Spark
Auto Cleared: Yes

Alarm Parameters

Location Information:

Source: Specifies the cluster for which the alarm is generated.
ServiceName: Specifies the service for which the alarm is generated.
RoleName: Specifies the role for which the alarm is generated.
HostName: Specifies the host for which the alarm is generated.

Impact on the System

The Spark tasks submitted by users fail to be executed.

Possible Causes

  • The KrbServer service is abnormal.
  • The LdapServer service is abnormal.
  • The ZooKeeper service is abnormal.
  • The HDFS service is abnormal.
  • The Yarn service is abnormal.
  • The corresponding Hive service is abnormal.
  • The Spark assembly package is abnormal.
  • The NameNode memory is insufficient.
  • The memory of the Spark process is insufficient.

Handling Procedure

If the alarm is caused by an abnormal Spark assembly package, wait about 10 minutes; the alarm is then cleared automatically.

Check whether any service unavailability alarms have been generated for the services that Spark depends on.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms.
  2. Check whether the following alarms exist in the alarm list:

    • ALM-25500 KrbServer Service Unavailable
    • ALM-25000 LdapServer Service Unavailable
    • ALM-13000 ZooKeeper Service Unavailable
    • ALM-14000 HDFS Service Unavailable
    • ALM-18000 Yarn Service Unavailable
    • ALM-16004 Hive Service Unavailable
    • If yes, go to 3.
    • If no, go to 4.

  3. Handle the alarms by following the instructions provided in the alarm help.

    After those alarms are cleared, wait a few minutes and check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 4.
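The scan in step 2 can also be done from an exported alarm list. A minimal sketch follows; the file path, its contents, and the one-alarm-per-line format are assumptions for illustration, not the actual FusionInsight Manager export format.

```shell
# Sketch: scan an exported alarm list for the dependency alarms named above.
# The file below is sample data for illustration only.
cat > /tmp/alarms.txt <<'EOF'
ALM-14000 HDFS Service Unavailable
ALM-43001 Spark Service Unavailable
EOF

# Any match means a dependency alarm exists (go to 3); no match means go to 4.
grep -oE 'ALM-(25500|25000|13000|14000|18000|16004)' /tmp/alarms.txt | sort -u
```

With the sample data above, the scan reports ALM-14000, so the HDFS alarm would be handled first.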

Check whether the NameNode memory is insufficient.

  4. Check whether the NameNode memory is insufficient.

    • If yes, go to 5.
    • If no, go to 6.

  5. Restart the NameNode to release the memory. Then, check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 6.
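One way to check NameNode memory pressure is to inspect old-generation heap utilization with jstat on the NameNode host. The sketch below parses a sample line standing in for live `jstat -gcutil` output; the 90% threshold is an illustrative assumption, not a product-defined limit.

```shell
# Sketch: flag old-generation heap pressure from jstat -gcutil output.
# On the live NameNode host you would run: jstat -gcutil <NameNode PID> 1000 5
# The sample line below is illustrative, not real output.
sample='  0.00  96.50  43.20  98.70  95.10  90.00   1024   12.3    45    6.7   19.0'

old_gen=$(echo "$sample" | awk '{print $4}')   # O column = old-gen utilization (%)

# Sustained old-gen use above ~90% suggests the NameNode heap is too small.
awk -v o="$old_gen" 'BEGIN { exit !(o > 90) }' && echo "NameNode old gen above 90%"
```

If the flag fires repeatedly across samples, treat the NameNode memory as insufficient and go to 5.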

Check whether the memory of the Spark process is insufficient.

  6. Check whether the memory of the Spark process is insufficient due to memory-related configuration changes.

    • If yes, go to 7.
    • If no, go to 8.

  7. Ensure that the Spark process has sufficient memory, or expand the cluster capacity. Then, check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
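For step 6, it helps to list the memory-related Spark settings that a recent change may have lowered. The sketch below writes a sample configuration file and greps it; the path and values are assumptions for illustration, so substitute your cluster's actual Spark configuration directory.

```shell
# Sketch: list memory-related Spark settings to review for recent changes.
# /tmp/spark-defaults.conf is sample data; use your real config path instead.
CONF=/tmp/spark-defaults.conf
cat > "$CONF" <<'EOF'
spark.driver.memory      2g
spark.executor.memory    4g
spark.yarn.am.memory     1g
EOF

grep -E '^spark\.(driver|executor|yarn\.am)\.memory' "$CONF"
```

Compare the printed values against what the workload needed before the change; if they were reduced, restore them or expand capacity as described in 7.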

Collect fault information.

  8. On FusionInsight Manager, choose O&M > Log > Download.
  9. Expand the Service drop-down list, and select the following services for the target cluster (whether Hive is required is determined by ServiceName in the alarm location information):

    • KrbServer
    • LdapServer
    • ZooKeeper
    • HDFS
    • Yarn
    • Hive

  10. In the upper right corner, set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
  11. Contact O&M engineers and provide the collected logs.
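The 10-minute collection window in step 10 can be computed rather than worked out by hand. The sketch below assumes GNU date and uses an example alarm timestamp; epoch arithmetic avoids the ambiguity of relative-date strings.

```shell
# Sketch: derive the log-collection window (alarm time +/- 10 minutes).
# Requires GNU date; ALARM_TS is an example value, taken as UTC here.
ALARM_TS='2024-11-29 10:05:00'

epoch=$(date -u -d "$ALARM_TS" +%s)
START=$(date -u -d "@$((epoch - 600))" +'%Y-%m-%d %H:%M:%S')
END=$(date -u -d "@$((epoch + 600))" +'%Y-%m-%d %H:%M:%S')

echo "Start Date: $START"   # Start Date: 2024-11-29 09:55:00
echo "End Date:   $END"     # End Date:   2024-11-29 10:15:00
```

Enter the two printed values as Start Date and End Date in the download dialog.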

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.