ALM-43001 Spark Service Unavailable
Alarm Description
The system checks the Spark service status every 300 seconds. This alarm is generated when the Spark service is unavailable.
This alarm is cleared when the Spark service recovers.
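The health check itself runs inside FusionInsight Manager, but the following minimal Python sketch illustrates the same idea: probe a Spark service port on the documented 300-second cycle. The host name is a placeholder and 18080 is the open-source Spark JobHistory default port; actual FusionInsight ports may differ.

```python
# Hypothetical availability probe mirroring the periodic check described
# above. Host and port are placeholders, not FusionInsight specifics.
import socket
import time

SPARK_HOST = "spark-jobhistory.example.com"  # placeholder host
SPARK_PORT = 18080                           # assumed open-source JobHistory port
CHECK_INTERVAL_S = 300                       # matches the documented check period

def spark_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the service port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    while True:
        up = spark_reachable(SPARK_HOST, SPARK_PORT)
        print("Spark service:", "UP" if up else "DOWN (ALM-43001 condition)")
        time.sleep(CHECK_INTERVAL_S)
```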
Alarm Attributes
| Alarm ID | Alarm Severity | Alarm Type | Service Type | Auto Cleared |
|---|---|---|---|---|
| 43001 | Critical | Error handling | Spark | Yes |
Alarm Parameters
| Type | Parameter | Description |
|---|---|---|
| Location Information | Source | Specifies the cluster for which the alarm is generated. |
| | ServiceName | Specifies the service for which the alarm is generated. |
| | RoleName | Specifies the role for which the alarm is generated. |
| | HostName | Specifies the host for which the alarm is generated. |
Impact on the System
The Spark tasks submitted by users fail to be executed.
Possible Causes
- The KrbServer service is abnormal.
- The LdapServer service is abnormal.
- The ZooKeeper service is abnormal.
- The HDFS service is abnormal.
- The Yarn service is abnormal.
- The corresponding Hive service is abnormal.
- The Spark assembly package is abnormal.
- The NameNode memory is insufficient.
- The memory of the Spark process is insufficient.
Handling Procedure
If the alarm is caused by an abnormal Spark assembly package, wait about 10 minutes; the alarm is then automatically cleared.
Check whether any service unavailability alarms have been generated for the services that Spark depends on.

1. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Alarm > Alarms.
2. Check whether any of the following alarms exist in the alarm list:
   - ALM-25500 KrbServer Service Unavailable
   - ALM-25000 LdapServer Service Unavailable
   - ALM-13000 ZooKeeper Service Unavailable
   - ALM-14000 HDFS Service Unavailable
   - ALM-18000 Yarn Service Unavailable
   - ALM-16004 Hive Service Unavailable
3. Handle the alarms by following the instructions provided in the alarm help. After those alarms are cleared, wait a few minutes and check whether this alarm is cleared. (A command-line spot check of the HDFS and Yarn dependencies is sketched after this step.)
   - If yes, no further action is required.
   - If no, go to 4.
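As a supplement to the Manager alarm list, the following Python sketch spot-checks two of the dependencies (HDFS and Yarn) with their standard client CLIs. It assumes the Hadoop client commands are installed on the node and, on a security-mode cluster, that a valid Kerberos ticket exists (kinit); it is not the check that Manager itself performs.

```python
# Hedged spot check of services Spark depends on, using the standard
# open-source Hadoop client CLIs. A zero exit code is treated as healthy.
import subprocess

CHECKS = {
    "HDFS": ["hdfs", "dfsadmin", "-safemode", "get"],  # expect "Safe mode is OFF"
    "Yarn": ["yarn", "node", "-list"],                 # expect healthy NodeManagers
}

def run_check(name: str, cmd: list[str]) -> None:
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        state = "OK" if result.returncode == 0 else f"FAILED (rc={result.returncode})"
    except (OSError, subprocess.TimeoutExpired) as exc:
        state = f"FAILED ({exc})"
    print(f"{name}: {state}")

for service, command in CHECKS.items():
    run_check(service, command)
```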
Check whether the NameNode memory is insufficient.

4. Check whether the NameNode memory is insufficient, for example by reviewing the NameNode heap usage (one way to read it is sketched after step 5).
5. Restart the NameNode to release the memory. Then, check whether this alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 6.
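One way to inspect the NameNode heap before deciding on a restart is the JMX servlet that open-source HDFS exposes on the NameNode web port. In this sketch the host is a placeholder and 9870 is the Hadoop 3.x default HTTP port; FusionInsight port and TLS settings may differ.

```python
# Hedged check of NameNode heap usage via the Hadoop JMX servlet
# (http://<namenode>:<port>/jmx). Host and port are assumptions.
import json
from urllib.request import urlopen

NAMENODE_URL = "http://namenode.example.com:9870/jmx?qry=java.lang:type=Memory"

with urlopen(NAMENODE_URL, timeout=10) as resp:
    beans = json.load(resp)["beans"]

heap = beans[0]["HeapMemoryUsage"]  # the java.lang:type=Memory MXBean
used_pct = 100.0 * heap["used"] / heap["max"]
print(f"NameNode heap: {heap['used']} / {heap['max']} bytes ({used_pct:.1f}% used)")
```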
Check whether the memory of the Spark process is insufficient.

6. Check whether the memory of the Spark process is insufficient due to memory-related configuration changes (a spot check of common memory settings is sketched after step 7).
7. Ensure that the Spark process has sufficient memory, or expand the cluster capacity. Then, check whether this alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 8.
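The following Python sketch spot-checks the standard Spark memory settings (spark.driver.memory and spark.executor.memory) in spark-defaults.conf. The configuration path and the 1024 MB floors are placeholders for illustration, not FusionInsight recommendations.

```python
# Hedged sketch: flag suspiciously small driver/executor heap settings in
# spark-defaults.conf. Path and thresholds are illustrative assumptions.
import re

CONF_PATH = "/opt/client/Spark/spark/conf/spark-defaults.conf"  # placeholder path
MIN_MB = {"spark.driver.memory": 1024, "spark.executor.memory": 1024}
UNITS = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}

def to_mb(value: str) -> float:
    """Convert a Spark size string such as '4g' or '512m' to megabytes.
    Unitless values are treated as megabytes here for simplicity."""
    match = re.fullmatch(r"(\d+)([kmgt]?)b?", value.strip().lower())
    if not match:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)

with open(CONF_PATH) as conf:
    for raw in conf:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # spark-defaults.conf is "key value"
        if len(parts) != 2:
            continue
        key, value = parts
        if key in MIN_MB and to_mb(value) < MIN_MB[key]:
            print(f"{key} = {value} looks low (< {MIN_MB[key]} MB)")
```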
Collect fault information.

8. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
9. Expand the Service drop-down list, and select the following services for the target cluster (the Hive service is determined based on ServiceName in the alarm location information):
   - KrbServer
   - LdapServer
   - ZooKeeper
   - HDFS
   - Yarn
   - Hive
10. In the upper right corner, set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
11. Contact O&M engineers and provide the collected logs.
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.