ALM-12012 NTP Service Is Abnormal

Alarm Description

The system checks every 60 seconds whether the NTP service on a node synchronizes time with the NTP service on the active OMS node. This alarm is generated when synchronization fails in two consecutive checks.

This alarm is also generated when the time difference between the NTP service on a node and the NTP service on the active OMS node is greater than or equal to 20s in two consecutive checks. This alarm is cleared when the time difference is less than 20s.
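The threshold rule above can be illustrated with a minimal sketch. The function name alarm_state and the idea of passing the two most recent time differences as whole seconds are assumptions made for illustration only; this is not how the OMS implements the check.

```shell
# Hypothetical helper: decides the alarm state from the two most recent
# time-difference samples (absolute values, in whole seconds).
THRESHOLD_SECONDS=20   # threshold stated in the alarm description

alarm_state() {
    prev=$1
    curr=$2
    if [ "$prev" -ge "$THRESHOLD_SECONDS" ] && [ "$curr" -ge "$THRESHOLD_SECONDS" ]; then
        echo "raise"      # two consecutive samples at or above the threshold
    elif [ "$curr" -lt "$THRESHOLD_SECONDS" ]; then
        echo "clear"      # latest sample is below the threshold
    else
        echo "pending"    # only the latest sample is at or above the threshold
    fi
}

# Example: alarm_state 25 30
```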

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
12012	Major	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster or system for which the alarm was generated.
ServiceName	Specifies the service for which the alarm was generated.
RoleName	Specifies the role for which the alarm was generated.
HostName	Specifies the host for which the alarm was generated.

Impact on the System

The time on the node is inconsistent with that on other nodes in the cluster. Therefore, some FusionInsight applications on the node may not run properly. If the time difference between the node and other Kerberos service instances keeps increasing, Kerberos authentication on the node may fail, causing service exceptions.

Possible Causes

The NTP service on the current node cannot start properly.
The current node fails to synchronize time with the NTP service on the active OMS node.
The key authenticated by the NTP service on the current node is inconsistent with that on the active OMS node.
The time offset between the node and the NTP service on the active OMS node is large.

Handling Procedure

Check the NTP service mode of the node.

Log in to the active management node as the root user and check the resource status of the active and standby management nodes.

For details about how to log in to a cluster node, see Logging In to an MRS Cluster Node.

Switch to user omm:
```
su - omm
```
Check the resource status of the active and standby management nodes.
```
sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh
```
- If "chrony" is displayed in the ResName column of the command output, go to Step 2.
- If "ntp" is displayed in the ResName column, go to Step 20.
If both "chrony" and "ntp" are displayed in the ResName column of the command output, the NTP service mode is being switched. Wait for 10 minutes and perform Step 1 again. If both "chrony" and "ntp" are still displayed, contact O&M personnel.
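The three-way decision above can be sketched as a small filter over the command output. The function name detect_time_service is hypothetical, and because the exact layout of the status-oms.sh output is not shown here, the sketch only assumes that "chrony" and "ntp" appear as whole words somewhere in the output.

```shell
# Hypothetical helper: reads the status-oms.sh output on stdin and prints
# which time service appears as a resource name.
# Assumption: "chrony" and "ntp" appear as whole words in the output.
detect_time_service() {
    output=$(cat)
    has_chrony=no
    has_ntp=no
    if printf '%s\n' "$output" | grep -qw 'chrony'; then has_chrony=yes; fi
    if printf '%s\n' "$output" | grep -qw 'ntp'; then has_ntp=yes; fi
    case "$has_chrony$has_ntp" in
        yesno)  echo "chrony" ;;      # continue with the chrony checks
        noyes)  echo "ntp" ;;         # continue with the ntp checks
        yesyes) echo "switching" ;;   # mode switch in progress, retry later
        *)      echo "unknown" ;;
    esac
}

# Example (as user omm):
#   sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh | detect_time_service
```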

Check whether the chrony service on the node is started properly.

On FusionInsight Manager, choose O&M > Alarm > Alarms. On the page that is displayed, click in the row containing the alarm, and view the name of the host for which the alarm is generated in Location.
Check whether the chronyd process is running on the node where the alarm is generated. Log in to the node where the alarm is generated as user root and run the following command to check whether the chronyd process information is displayed:
```
ps -ef | grep chronyd | grep -v grep
```
- If yes, go to Step 6.
- If no, go to Step 4.
Start the NTP service.
After 10 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 6.

Check whether the current node can synchronize time properly with the chrony service on the active OMS node.

Check whether the node can synchronize time with the NTP service on the active OMS node based on additional information of the alarm.
- If yes, go to Step 7.
- If no, go to Step 17.
Check whether the synchronization with the chrony service on the active OMS node is faulty.

Log in to the node for which the alarm is generated as user root and run the chronyc sources command.
```
chronyc sources
```
In the command output, if there is an asterisk (*) before the IP address of the chrony service on the active OMS node, the synchronization is normal. The command output is as follows:
```
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^* 10.10.10.162             10  10   377   626    +16us[  +15us] +/-  308us
```
In the command output, if there is no asterisk (*) before the IP address of the chrony service on the active OMS node, and the value of Reach is 0, the synchronization is abnormal.
```
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^? 10.1.1.1                      0  10     0     -     +0ns[   +0ns] +/-    0ns
```
- If the synchronization is faulty, go to Step 8.
- If the synchronization is normal, go to Step 38.
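The manual reading of the chronyc sources output described above can be approximated with a short filter. This is a sketch only: the function name chrony_source_state and its result labels are hypothetical, and it assumes the column order shown in the sample output (state in the first column, address in the second, Reach in the fifth).

```shell
# Hypothetical helper: reads `chronyc sources` output on stdin and reports
# the state of the source whose address is given as the first argument.
chrony_source_state() {
    awk -v ip="$1" '
        $2 == ip {
            found = 1
            state = substr($1, 2, 1)   # second character of the MS column
            reach = $5                 # Reach column
            if (state == "*")      print "synchronized"
            else if (reach == "0") print "unreachable"
            else                   print "not-selected"
        }
        END { if (!found) print "missing" }
    '
}

# Example: chronyc sources | chrony_source_state 10.10.10.162
```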
The chrony synchronization failure is typically caused by the system firewall. If the firewall can be disabled, disable it. If the firewall cannot be disabled, check the firewall configuration policy and ensure that UDP ports 123 and 323 are not blocked. (For details, see the firewall configuration policy of each system.)
Check whether the alarm is cleared 10 minutes later.
- If yes, no further action is required.
- If no, go to Step 10.
Log in to the active OMS node as user root and run the following command to view the authentication code whose key index is 1M:
```
cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys
```

Run the following command to check whether the key is the same as that queried in Step 10:
```
diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys /etc/chrony.keys
```
If the keys are the same, no result is returned after the command is executed. For example:
```
host01:~ # cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys
1 M sdYbq;o^CzEAWo<U=Tw5
host01:~ # diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys /etc/chrony.keys
host01:~ #
```
- If yes, go to Step 12.
- If no, go to Step 38.
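A whole-file diff also reports differences in comments or in other key indexes. As a minimal sketch, the comparison can be limited to the entry with index 1 and type M. The function names key_index_1 and same_key_index_1 are hypothetical, and the sketch assumes the "index type key" layout shown in the example above, with no spaces inside the key.

```shell
# Hypothetical helper: prints the key value stored under index 1 with
# type M in a key file (layout assumed: "1 M <key>").
key_index_1() {
    awk '$1 == "1" && $2 == "M" { print $3; exit }' "$1"
}

# Returns 0 when both files hold the same key for index 1, 1 otherwise.
same_key_index_1() {
    [ "$(key_index_1 "$1")" = "$(key_index_1 "$2")" ]
}

# Example:
#   same_key_index_1 ${BIGDATA_HOME}/om-server/OMS/workspace/conf/chrony.keys \
#       /etc/chrony.keys && echo "keys match"
```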

Run the following command to view the key in ntpKeyFile, and check whether it is the same as the key of the authentication key index 1M queried in Step 10:
```
cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile
```
- If yes, go to Step 13.
- If no, go to Step 15.
Log in to the faulty node as user root and run the cat /etc/chrony.keys command to check whether the key value is the same as that queried in Step 12 (compare it with that of the authentication key index 1M).
```
cat /etc/chrony.keys
```
- If yes, go to Step 38.
- If no, go to Step 14.
Switch to user omm, change the key value of the authentication key index 1M in ${NODE_AGENT_HOME}/chrony.keys to the key value of ntpKeyFile in Step 12, and go to Step 16.
```
su - omm
```
```
vi ${NODE_AGENT_HOME}/chrony.keys
```
Run the following commands as user root or omm to change the NTP key of the active OMS node (change ntp.keys to ntpkeys in Red Hat Enterprise Linux):
```
cd ${BIGDATA_HOME}/om-server/OMS/workspace/conf
```
```
sed -i "`cat chrony.keys | grep -n '1 M'|awk -F ':' '{print $1}'`d" chrony.keys
```
```
echo "1 M `cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile`" >> chrony.keys
```
Check whether the key value of the authentication key index 1M in chrony.keys is the same as that of ntpKeyFile.
- If yes, go to Step 16.
- If no, change the key of the authentication key index 1M in chrony.keys to the key of ntpKeyFile and go to Step 16.
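The delete-and-append sequence above can also be written as one function that keeps a backup of the key file first. This is a sketch under assumptions, not a replacement for the documented commands: the function name set_key_index_1 and the .bak suffix are hypothetical, and the entry layout "1 M <key>" is taken from the example output shown earlier.

```shell
# Hypothetical helper: rewrites the "1 M" entry of a key file in place,
# keeping a .bak copy of the previous content.
set_key_index_1() {
    keys_file=$1
    new_key=$2
    cp -p "$keys_file" "$keys_file.bak" || return 1
    # Keep every line except the existing index-1 entry, then append the new one.
    {
        awk '!($1 == "1" && $2 == "M")' "$keys_file.bak"
        printf '1 M %s\n' "$new_key"
    } > "$keys_file"
}

# Example (run in ${BIGDATA_HOME}/om-server/OMS/workspace/conf):
#   set_key_index_1 chrony.keys \
#       "$(cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile)"
```

Because the new entry is printed with printf '%s', characters such as semicolons or angle brackets in the key are written literally.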
After 5 minutes, restart the chrony service on the active OMS node. After 15 minutes, check whether the alarm is cleared.
```
systemctl restart chronyd
```
- If yes, no further action is required.
- If no, go to Step 38.

Check whether the time deviation between the node and the chrony service on the active OMS node is large.

Check whether the time deviation is large in additional information of the alarm.
- If yes, go to Step 18.
- If no, go to Step 38.
On the Hosts tab page, select the host for which the alarm is generated, and choose More > Stop All Instances to stop all the services on the node.

If the time on the alarm node is later than that on the chrony service of the active OMS node, adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

If the time on the alarm node is earlier than that on the chrony service of the active OMS node, wait until a period equal to the time deviation has elapsed, and then adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

If you do not wait, data loss may occur.
After 10 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 38.
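Besides the additional information of the alarm, the deviation can be read on the node itself. The following sketch assumes that the chronyc tracking command is available on the node and that its output contains a "System time" line in the usual "N seconds fast/slow of NTP time" form. The function name chrony_offset_seconds is hypothetical.

```shell
# Hypothetical helper: reads `chronyc tracking` output on stdin and prints
# the system clock offset in seconds: positive when the local clock is
# fast of NTP time, negative when it is slow.
chrony_offset_seconds() {
    awk -F ':' '
        /^System time/ {
            split($2, parts, " ")   # e.g. "0.000123 seconds fast of NTP time"
            sign = (parts[3] == "slow") ? "-" : ""
            print sign parts[1]
            exit
        }
    '
}

# Example: chronyc tracking | chrony_offset_seconds
```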

Check whether the NTP service on the node is started properly.

On FusionInsight Manager, choose O&M > Alarm > Alarms. On the page that is displayed, click in the row containing the alarm, and view the name of the host for which the alarm is generated in Location.
Check whether the ntpd process is running on the node using the following method. Log in to the node where the alarm is generated as user root. Run the following command to check whether the ntpd process information is displayed:
```
ps -ef | grep ntpd | grep -v grep
```
- If yes, go to Step 24.
- If no, go to Step 22.
Start the NTP service:
```
service ntp start
```
For Red Hat operating systems, run the service ntpd start command.
After 10 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 24.

Check whether the node can synchronize time properly with the NTP service on the active OMS node.

Check whether the node can synchronize time with the NTP service on the active OMS node based on additional information of the alarm.
- If yes, go to Step 25.
- If no, go to Step 35.
Check whether the synchronization with the NTP service on the active OMS node is faulty.

Log in to the alarm node as user root and run the ntpq -np command.
```
ntpq -np
```
If there is an asterisk (*) before the IP address of the NTP service on the active OMS node in the command output, the synchronization is normal. The command output is as follows:
```
remote refid st t when poll reach delay offset jitter 
============================================================================== 
*10.10.10.162 .LOCL. 1 u 1 16 377 0.270 -1.562 0.014
```
If there is no asterisk (*) before the IP address of the NTP service on the active OMS node, as shown in the following command output, and the value of refid is .INIT., the synchronization is abnormal.
```
remote refid st t when poll reach delay offset jitter 
============================================================================== 
10.10.10.162 .INIT. 1 u 1 16 377 0.270 -1.562 0.014
```
- If the synchronization is faulty, go to Step 26.
- If the synchronization is normal, go to Step 38.
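The reading of the ntpq -np output described above can be sketched as a filter. The function name ntp_peer_state and its result labels are hypothetical. The sketch assumes IPv4 addresses and the column order shown in the sample output, where an optional one-character tally code (such as *) is attached to the front of the remote address.

```shell
# Hypothetical helper: reads `ntpq -np` output on stdin and reports the
# state of the peer whose IPv4 address is given as the first argument.
ntp_peer_state() {
    awk -v ip="$1" '
        {
            remote = $1
            tally = ""
            # A tally code (such as "*" or "+") may precede the address.
            if (remote !~ /^[0-9]/) {
                tally = substr(remote, 1, 1)
                remote = substr(remote, 2)
            }
        }
        remote == ip {
            found = 1
            if (tally == "*")        print "synchronized"
            else if ($2 == ".INIT.") print "initializing"
            else                     print "not-selected"
        }
        END { if (!found) print "missing" }
    '
}

# Example: ntpq -np | ntp_peer_state 10.10.10.162
```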
The NTP synchronization failure is typically caused by the system firewall. If the firewall can be disabled, run the iptables -F command to clear all firewall rules. If the firewall cannot be disabled, run the iptables -L command to check the firewall configuration policy and ensure that UDP port 123 is not blocked. (For details, see the firewall configuration policy of each system.)
```
iptables -F
```
After 10 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 28.
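When the firewall cannot be disabled, the rule listing can be searched for entries that block the NTP port. The sketch below is an assumption-laden aid, not a complete firewall audit: the function name blocking_udp_rules is hypothetical, it expects the rule-specification format printed by iptables -S, and it ignores chain default policies and multiport rules.

```shell
# Hypothetical helper: reads `iptables -S` output on stdin and prints any
# rule that drops or rejects UDP traffic to the given destination port.
blocking_udp_rules() {
    awk -v port="$1" '
        / -p udp( |$)/ && / -j (DROP|REJECT)( |$)/ {
            for (i = 1; i < NF; i++)
                if ($i == "--dport" && $(i + 1) == port) { print; break }
        }
    '
}

# Example (as user root): iptables -S | blocking_udp_rules 123
```

An empty result means that no explicit blocking rule for the port was found in the listing.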

Log in to the active OMS node as user root and run the following command to view the authentication key index 1M:
```
cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntpkeys
```

Run the following command to check whether the key is the same as that queried in Step 28:
```
diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntpkeys /etc/ntp/ntpkeys
```
If the keys are the same, no result is returned after the command is executed. For example:
```
host01:~ # cat ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys
1 M sdYbq;o^CzEAWo<U=Tw5
host01:~ # diff ${BIGDATA_HOME}/om-server/OMS/workspace/conf/ntp.keys /etc/ntp.keys
host01:~ #
```
- If yes, go to Step 30.
- If no, go to Step 38.

Run the following command to view the key in ntpKeyFile, and check whether it is the same as the key of the authentication key index 1M queried in Step 28:
```
cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile
```
- If yes, go to Step 31.
- If no, go to Step 33.
Log in to the faulty node as user root. Check whether the key value is the same as that queried in Step 30 (compare it with that of the authentication key index 1M).
```
cat /etc/ntp/ntpkeys
```
- If yes, go to Step 38.
- If no, go to Step 32.
Switch to user omm, change the key value of the authentication key index 1M in ${NODE_AGENT_HOME}/ntp.keys (${NODE_AGENT_HOME}/ntpkeys in Red Hat Enterprise Linux) to the key value of ntpKeyFile in Step 30, and go to Step 34.
```
su - omm
```
Run the following commands as user root or omm to change the NTP key of the active OMS node (change ntp.keys to ntpkeys in Red Hat Enterprise Linux):
```
cd ${BIGDATA_HOME}/om-server/OMS/workspace/conf
```
```
sed -i "`cat ntp.keys | grep -n '1 M'|awk -F ':' '{print $1}'`d" ntp.keys
```
```
echo "1 M `cat ${BIGDATA_HOME}/om-server/om/packaged-distributables/ntpKeyFile`" >>ntp.keys
```
Check whether the key value of the authentication key index 1M in ntp.keys is the same as that of ntpKeyFile.
- If yes, go to Step 34.
- If no, change the key of the authentication key index 1M in ntp.keys to the key of ntpKeyFile and go to Step 34.
After 5 minutes, restart the NTP service on the active OMS node. After 15 minutes, check whether the alarm is cleared.
```
service ntp restart
```
- If yes, no further action is required.
- If no, go to Step 38.

Check whether the time deviation between the node and the NTP service on the active OMS node is large.

Check whether the time deviation is large in additional information of the alarm.
- If yes, go to Step 36.
- If no, go to Step 38.
On the Hosts tab page, select the host for which the alarm is generated, and choose More > Stop All Instances to stop all the services on the node.

If the time on the alarm node is later than that on the NTP service of the active OMS node, adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

If the time on the alarm node is earlier than that on the NTP service of the active OMS node, wait until a period equal to the time deviation has elapsed, and then adjust the time of the alarm node. After adjusting the time, choose More > Start All Instances to start the services on the node.

If you do not wait, data loss may occur.
After 10 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 38.

Collect fault information.

On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
Expand the Service drop-down list, select NodeAgent and OmmServer for the target cluster, and click OK. Expand the Hosts dialog box and select the alarm node and the active OMS node.
Click in the upper right corner, and set Start Date and End Date for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click Download.
Contact O&M personnel and provide the collected logs.
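The Start Date and End Date values for log collection can be worked out from the alarm generation time. The following is a minimal sketch that assumes GNU date is available. The function name log_window and the timestamp format are hypothetical choices for illustration.

```shell
# Hypothetical helper: prints the start and end of a log collection window
# of 30 minutes on each side of the alarm time. Requires GNU date (-d).
log_window() {
    alarm_epoch=$(date -d "$1" +%s) || return 1
    date -d "@$((alarm_epoch - 1800))" '+%Y-%m-%d %H:%M:%S'   # Start Date
    date -d "@$((alarm_epoch + 1800))" '+%Y-%m-%d %H:%M:%S'   # End Date
}

# Example: log_window '2024-03-01 12:10:00'
```

Converting to epoch seconds first avoids ambiguity in how relative offsets are parsed next to a time of day.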