ALM-12045 Network Read Packet Dropped Rate Exceeds the Threshold

Description

The system checks the network read packet dropped rate every 30 seconds and compares it with the threshold (0.5% by default). This alarm is generated when the rate exceeds the threshold for several consecutive checks (5 by default).

To change the threshold, choose O&M > Alarm > Thresholds > Name of the desired cluster > Host > Network Reading > Read Packet Dropped Rate.

When the Trigger Count is 1, this alarm is cleared when the network read packet dropped rate is less than or equal to the threshold. When the Trigger Count is greater than 1, this alarm is cleared when the network read packet dropped rate is less than or equal to 90% of the threshold.
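The clearing rule above can be sketched as a small shell helper. This is only an illustration of the arithmetic, not how FusionInsight Manager implements it; the function name clear_threshold is hypothetical.

```shell
#!/bin/sh
# clear_threshold THRESHOLD_PERCENT TRIGGER_COUNT   (hypothetical helper)
# Prints the rate, in percent, at or below which the alarm is cleared.
clear_threshold() {
    awk -v t="$1" -v n="$2" 'BEGIN {
        # Trigger Count 1: the alarm clears at the threshold itself.
        # Trigger Count greater than 1: it clears at 90% of the threshold.
        if (n > 1) t = t * 0.9
        printf "%.3f\n", t
    }'
}

clear_threshold 0.5 1   # prints 0.500
clear_threshold 0.5 5   # prints 0.450
```

With the default threshold of 0.5% and the default Trigger Count of 5, the rate therefore has to fall to 0.45% or lower before the alarm clears.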

Alarm detection is disabled by default. Before enabling it, follow "Check the system environment" in the Procedure to confirm that alarm sending can be enabled in your environment.

Attribute

Alarm ID	Alarm Severity	Auto Clear
12045	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster or system for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.
NetworkCardName	Specifies the network port for which the alarm is generated.
Trigger Condition	Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.

Impact on the System

The service performance deteriorates or services time out.

Precautions: On SUSE (kernel 3.0 or later) or Red Hat 7.2, the kernel uses a different mechanism for counting read and discarded packets, so this alarm may be generated even when the network is normal. Services are not adversely affected in that case. Follow "Check the system environment" to determine whether the alarm is caused by this problem.

Possible Causes

An OS exception occurs.
The NIC is configured in active/standby bond mode.
The alarm threshold is set improperly.
The cluster network environment is of poor quality.

Procedure

Check the network packet dropped rate.

On the FusionInsight Manager portal, choose O&M > Alarm > Alarms, and click the icon in the row where the alarm is located to view the host name and NIC name in the alarm details.
Log in to the node where the alarm is generated as user omm and run the /sbin/ifconfig NIC name command, where NIC name is the NIC shown in the alarm details, to check whether packet loss occurs on the network.
- The node IP address is that of the host given by HostName in the alarm location information. To query the OM IP and Business IP of the host, click Host on FusionInsight Manager.
- Calculate the packet loss rate as follows: Packet loss rate = (Number of dropped packets / Total number of RX packets) x 100%. If the packet loss rate is greater than the system threshold (0.5% by default), packet loss occurs during packet reading on the network.
- If yes, go to 11.
- If no, go to 3.
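The formula in this step can be checked with a short script. The sketch below is an assumption-laden illustration: it reads the cumulative counters that Linux exposes under /sys/class/net/NIC name/statistics instead of parsing ifconfig output, the function names rx_drop_rate and exceeds_threshold are hypothetical, and eth0 is only a placeholder NIC name. Because the counters are cumulative since boot, the result may differ from the rate that FusionInsight Manager calculates over its 30-second check period.

```shell
#!/bin/sh
# rx_drop_rate DROPPED TOTAL   (hypothetical helper)
# Prints (DROPPED / TOTAL) x 100 with four decimal places.
rx_drop_rate() {
    awk -v d="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print "0.0000"; exit }   # no traffic yet, avoid dividing by zero
        printf "%.4f\n", d / t * 100
    }'
}

# exceeds_threshold RATE THRESHOLD   (hypothetical helper)
# Succeeds (exit status 0) only when RATE is strictly greater than THRESHOLD.
exceeds_threshold() {
    awk -v r="$1" -v th="$2" 'BEGIN { exit !(r > th) }'
}

# Placeholder NIC name; pass the NIC from the alarm details as the first argument.
nic="${1:-eth0}"
stats="/sys/class/net/$nic/statistics"
if [ -r "$stats/rx_packets" ] && [ -r "$stats/rx_dropped" ]; then
    rate=$(rx_drop_rate "$(cat "$stats/rx_dropped")" "$(cat "$stats/rx_packets")")
    if exceeds_threshold "$rate" 0.5; then
        echo "$nic: RX drop rate ${rate}% exceeds 0.5%"
    else
        echo "$nic: RX drop rate ${rate}% does not exceed 0.5%"
    fi
fi
```

For example, 5 dropped packets out of 1000 received gives exactly 0.5%, which does not exceed the default threshold, while 6 out of 1000 gives 0.6%, which does.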

Check the system environment.

Run the cat /etc/*-release command to check the OS type.

If Red Hat is used, go to 5.

# cat /etc/*-release
Red Hat Enterprise Linux Server release 7.2 (Santiago)

If SUSE is used, go to 6.

# cat /etc/*-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

If another OS is used, go to 11.

Run the cat /etc/redhat-release command to check whether the OS version is Red Hat 7.2 (x86) or Red Hat 7.4 (TaiShan).
```
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Santiago)
```
- If yes, the alarm sending function cannot be enabled. Go to 7.
- If no, go to 11.

Run the cat /proc/version command to check whether the SUSE kernel version is 3.0 or later.

# cat /proc/version
Linux version 3.0.101-63-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Tue Jun 23 16:02:31 UTC 2015 (4b89d0c)

- If yes, the alarm sending function cannot be enabled. Go to 7.
- If no, go to 11.
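The OS checks in this section can be combined into one script. The following is a minimal sketch under stated assumptions: the function name os_is_affected is hypothetical, uname -r is used in place of cat /proc/version to obtain the kernel release, and the architecture strings x86_64 and aarch64 (as reported by uname -m) are assumed to correspond to x86 and TaiShan hosts.

```shell
#!/bin/sh
# os_is_affected RELEASE_TEXT KERNEL_RELEASE ARCH   (hypothetical helper)
# Succeeds when the OS is Red Hat 7.2 on x86, Red Hat 7.4 on TaiShan (assumed aarch64),
# or SUSE with kernel 3.0 or later, that is, when alarm sending cannot be enabled.
os_is_affected() {
    release=$1
    kernel=$2
    arch=$3
    case $release in
        *"Red Hat"*)
            case "$arch:$release" in
                x86_64:*"release 7.2"*|aarch64:*"release 7.4"*) return 0 ;;
            esac ;;
        *SUSE*)
            major=${kernel%%.*}                 # "3.0.101-63-default" becomes "3"
            case $major in
                ''|*[!0-9]*) return 1 ;;        # not a plain number, treat as not affected
            esac
            [ "$major" -ge 3 ] && return 0 ;;
    esac
    return 1
}

release=$(cat /etc/*-release 2>/dev/null || true)
if os_is_affected "$release" "$(uname -r)" "$(uname -m)"; then
    echo "Alarm sending cannot be enabled on this OS."
else
    echo "This OS is not one of the affected versions."
fi
```

The script only mirrors the yes/no branches described above; it does not change any setting on FusionInsight Manager.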

Log in to FusionInsight Manager and choose O&M > Alarm > Thresholds.
In the navigation tree of the Thresholds page, choose Name of the desired cluster > Host > Network Reading > Read Packet Dropped Rate. In the area on the right, check whether the Switch is on.
- If yes, the alarm sending function has been enabled. Go to 9.
- If no, the alarm sending function has been disabled. Go to 10.
In the area on the right, turn off Switch to disable checking for Network Read Packet Dropped Rate Exceeds the Threshold. The following figure shows the operation result.
On the Alarm page of FusionInsight Manager, search for the 12045 alarm. If the alarm is not cleared automatically, clear it manually. No further action is required.

The ID of alarm Network Read Packet Dropped Rate Exceeds the Threshold is 12045.

Check whether the NIC is configured in active/standby bond mode.

Log in to the alarm node as user omm. Run the ls -l /proc/net/bonding command to check whether the /proc/net/bonding directory exists on the alarm node.
- If yes, the NIC is configured in bond mode, as shown below. Go to 12.
```
# ls -l /proc/net/bonding/
total 0
-r--r--r-- 1 root root 0 Oct 11 17:35 bond0
```
- If no, the NIC is not configured in active/standby bond mode, as shown below. Go to 14.
```
# ls -l /proc/net/bonding/
ls: cannot access /proc/net/bonding/: No such file or directory
```

Run the cat /proc/net/bonding/bond0 command and check whether the value of Bonding Mode is fault-tolerance.

bond0 is the name of the bond configuration file. In practice, use the file name found in 11.

# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Slave queue ID: 0

- If yes, the NIC is configured in active/standby bond mode. Go to 13.
- If no, the NIC is not configured in active/standby bond mode. Go to 14.

Check whether the NIC specified by the NetworkCardName parameter is the standby NIC.
- If yes, manually clear the alarm on the Alarms page, because an alarm on the standby NIC cannot be cleared automatically. No further action is required.
- If no, go to 14.

  To determine whether a NIC is the standby NIC: in the /proc/net/bonding/bond0 configuration file, check whether the NIC name in the NetworkCardName parameter matches a Slave Interface entry but differs from Currently Active Slave (the currently active NIC). If so, the NIC is a standby NIC.
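The standby check described above can be scripted against the bond file. This is a sketch only: the function name is_standby_nic is hypothetical, eth0 and bond0 are placeholders for the NetworkCardName value and the bond file found earlier, and the parsing assumes the field layout shown in the sample cat /proc/net/bonding/bond0 output.

```shell
#!/bin/sh
# is_standby_nic NIC BOND_FILE   (hypothetical helper)
# Succeeds when NIC appears as a "Slave Interface" in BOND_FILE
# but is not the "Currently Active Slave".
is_standby_nic() {
    awk -v nic="$1" '
        /^Currently Active Slave:/       { active = $4 }   # for example "eth1"
        /^Slave Interface:/ && $3 == nic { member = 1 }    # NIC belongs to this bond
        END { exit !(member && active != nic) }
    ' "$2"
}

# Placeholders: replace eth0 and bond0 with the values from the alarm and from step 11.
bond_file=/proc/net/bonding/bond0
if [ -r "$bond_file" ]; then
    if is_standby_nic eth0 "$bond_file"; then
        echo "eth0 is a standby NIC in $bond_file"
    else
        echo "eth0 is not a standby NIC in $bond_file"
    fi
fi
```

With the sample output shown earlier, where Currently Active Slave is eth1, the check succeeds for eth0 and fails for eth1.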

Check whether the threshold is set properly.

Log in to FusionInsight Manager, choose O&M > Alarm > Thresholds > Name of the desired cluster > Host > Network Reading > Read Packet Dropped Rate and check whether the alarm threshold is set properly. (By default, 0.5% is a proper value. However, users can configure the value as required.)
- If yes, go to 17.
- If no, go to 15.
Based on actual usage, choose O&M > Alarm > Thresholds > Name of the desired cluster > Host > Network Reading > Read Packet Dropped Rate and click Modify in the Operation column to modify the alarm threshold.

For details, see Figure 1.

Figure 1 Setting alarm thresholds
Wait for 5 minutes, and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 17.

Check whether the network is normal.

Contact the system administrator to check whether the network is abnormal.
- If yes, rectify the network fault and go to 18.
- If no, go to 19.
Wait for 5 minutes, and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 19.

Collect fault information.

On the FusionInsight Manager home page of the active cluster, choose O&M > Log > Download.
Select OMS for Service and click OK.
Set Host to the node for which the alarm is generated and the active OMS node.
Click the icon in the upper right corner, and set Start Date and End Date for log collection to 30 minutes before and 30 minutes after the alarm generation time, respectively. Then click Download.
Contact the O&M personnel and send the collected log information.