
Host Health Check

Updated at: Sep 12, 2019 GMT+08:00

Swap Usage

Indicator name: Swap Usage

Indicator description: This indicator is used to check the system swap usage. Swap usage = Used swap size/Total swap size. If swap usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

  1. Check the swap usage on the node.

    Log in to the unhealthy node and run free -m to view the total and used swap size. If the swap usage exceeds the threshold, go to 2.

  2. Expand the system capacity, for example, by adding nodes.
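The ratio in step 1 can be computed from the `free -m` output rather than by hand. The following is a minimal sketch, assuming the usual Linux layout in which the swap row starts with `Swap:` followed by the total and used sizes. The function name `swap_usage` and the sample numbers are illustrative and are not part of MRS Manager.

```python
def swap_usage(free_output: str) -> float:
    """Return used swap / total swap from the text printed by `free -m`."""
    for line in free_output.splitlines():
        fields = line.split()
        # The swap row looks like: "Swap:  <total>  <used>  <free>"
        if fields and fields[0] == "Swap:":
            total, used = int(fields[1]), int(fields[2])
            # A node with no swap configured reports a total of 0.
            return used / total if total else 0.0
    raise ValueError("no 'Swap:' row found in free output")


# Sample output with made-up numbers, for illustration only.
sample = """\
              total        used        free      shared  buff/cache   available
Mem:          15885        4210         512         123       11162       11250
Swap:          2047         512        1535
"""
print(f"swap usage: {swap_usage(sample):.1%}")
```

Compare the returned ratio with the threshold configured for the indicator. On a real node, the text could be captured with `subprocess.run(["free", "-m"], capture_output=True, text=True).stdout`.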

Host File Handle Usage

Indicator name: Host File Handle Usage

Indicator description: This indicator is used to check the usage of file handles in the system. Usage of file handles = Number of used handles/Total number of handles. If host file handle usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

  1. Check the host file handle usage.

    Log in to the unhealthy node and run cat /proc/sys/fs/file-nr. Check the first and third columns in the command output, which indicate the number of used handles and the total number of handles respectively. If the usage exceeds the threshold, go to 2.

  2. Check the system and analyze the host file handle usage.
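The calculation in step 1 can be sketched as follows. This is a hedged example that only parses the three whitespace-separated columns of /proc/sys/fs/file-nr. The function name `file_handle_usage` and the sample values are assumptions made for illustration.

```python
def file_handle_usage(file_nr_text: str) -> float:
    """Return used handles / total handles from /proc/sys/fs/file-nr text."""
    fields = file_nr_text.split()
    if len(fields) < 3:
        raise ValueError("expected three columns in file-nr")
    used = int(fields[0])   # first column: number of used handles
    total = int(fields[2])  # third column: total number of handles
    return used / total


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> str:
    """Read the kernel counters on a Linux node (not called in this sketch)."""
    with open(path) as handle:
        return handle.read()


# Made-up sample content, for illustration only.
sample = "3456\t0\t785040\n"
print(f"file handle usage: {file_handle_usage(sample):.2%}")
```

On the unhealthy node, `file_handle_usage(read_file_nr())` would give the current ratio to compare with the threshold.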

NTP Offset

Indicator name: NTP Offset

Indicator description: This indicator is used to check the NTP time offset. If the NTP time offset exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

  1. Check the NTP time offset.

    Log in to the unhealthy node and run /usr/sbin/ntpq -np to view the information. The offset column indicates the time offset. If the time offset exceeds the threshold, go to 2.

  2. Check whether the clock source configuration is correct. Contact maintenance personnel to handle the problem.
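Reading the offset column in step 1 can be automated. The sketch below assumes the standard `ntpq -np` peer listing, in which the peer currently selected for synchronization is marked with a leading `*` and the offset is the ninth column, in milliseconds. The function name and the sample addresses are illustrative only.

```python
def selected_peer_offset_ms(ntpq_output: str) -> float:
    """Return the offset (ms) of the peer marked '*' in `ntpq -np` output."""
    for line in ntpq_output.splitlines():
        # The selected peer has '*' in the first character of the line.
        if line.startswith("*"):
            fields = line.split()
            # Columns: remote refid st t when poll reach delay offset jitter
            return float(fields[8])
    raise ValueError("no selected peer ('*') found; the node may be unsynchronized")


# Made-up sample output, for illustration only.
sample = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.0.10    LOCAL(0)         6 u   35   64  377    0.214   -0.532   0.061
+192.168.0.11    192.168.0.10     7 u   12   64  377    0.305    1.870   0.112
"""
print(f"offset: {selected_peer_offset_ms(sample)} ms")
```

The sign only shows the direction of the drift, so compare the absolute value with the threshold. If no line starts with `*`, no clock source is currently selected, which is itself a reason to check the clock source configuration.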

Average Load

Indicator name: Average Load

Indicator description: This indicator is used to check the system average load, that is, the average number of processes in the run queue over a specified period. The value is calculated from the load values reported by the uptime command: (1-minute load + 5-minute load + 15-minute load)/(3 x Number of CPUs). If the average load exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

  1. Log in to the unhealthy node and run the uptime command. The last three columns in the command output indicate the load of 1 minute, 5 minutes, and 15 minutes, respectively. Calculate the system average load. If the load exceeds the threshold, go to 2.
  2. Expand the system capacity, for example, by adding nodes.
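The formula from the indicator description can be applied to the `uptime` output directly. This is a minimal sketch, assuming the Linux output format in which the three load values follow the text `load average:` and use a period as the decimal separator. The function name `average_load` and the sample line are illustrative.

```python
import re


def average_load(uptime_output: str, cpu_count: int) -> float:
    """Apply (load1 + load5 + load15) / (3 x number of CPUs) to uptime text."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_output
    )
    if match is None:
        raise ValueError("no load averages found in uptime output")
    loads = [float(value) for value in match.groups()]
    return sum(loads) / (3 * cpu_count)


# Made-up sample line, for illustration only.
sample = " 10:15:01 up 12 days,  3:04,  2 users,  load average: 1.20, 0.80, 0.50"
print(f"system average load: {average_load(sample, cpu_count=4):.3f}")
```

On a real node, `os.cpu_count()` is one way to obtain the number of CPUs. Note that it counts logical CPUs, which is an assumption about how the indicator counts them.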

Process in the D State

Indicator name: Uninterruptible Sleep Process

Indicator description: This indicator is used to check for uninterruptible sleep processes, that is, processes in the D state. Generally, a process in the D state is waiting for I/O, such as disk I/O or network I/O, and remains in this state when an I/O exception occurs. If any process in the D state exists in the system, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12028.
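Before working through the alarm, it can help to see which processes are in the D state. The following is a sketch that is not part of the ALM-12028 procedure. It assumes the text comes from `ps -eo pid,stat,comm`, where the STAT column starts with `D` for uninterruptible sleep. The function name and the sample rows are illustrative.

```python
def uninterruptible_processes(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) pairs whose STAT column starts with 'D'."""
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            found.append((int(fields[0]), fields[2].strip()))
    return found


# Made-up sample output, for illustration only.
sample = """\
  PID STAT COMMAND
    1 Ss   systemd
 2345 D    kworker/u8:2
 3456 Sl   java
 4567 D+   dd
"""
for pid, command in uninterruptible_processes(sample):
    print(pid, command)
```

A process can pass through the D state briefly during normal I/O, so a single snapshot is only a hint. Processes that appear in several consecutive snapshots are the ones worth investigating.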

Hardware Status

Indicator name: Hardware Status

Indicator description: This indicator is used to check the status of system hardware, including CPUs, memory, disks, power supply units (PSUs), and fans. This indicator obtains related hardware information using ipmitool sdr elist. If the hardware status is abnormal, the indicator is unhealthy.

Recovery guidance:

  1. Log in to the unhealthy node. Run ipmitool sdr elist to view the system hardware status. The last column in the command output indicates the hardware status. If the status is included in the following fault description table, the indicator is unhealthy.

    Module          Fault Description

    Processor       IERR
                    Thermal Trip
                    FRB1/BIST failure
                    FRB2/Hang in POST failure
                    FRB3/Processor startup/init failure
                    Configuration Error
                    SM BIOS Uncorrectable CPU-complex Error
                    Disabled
                    Throttled
                    Uncorrectable machine check exception

    Power Supply    Failure detected
                    Predictive failure
                    Power Supply AC lost
                    AC lost or out-of-range
                    AC out-of-range, but present
                    Config Error: Vendor Mismatch
                    Config Error: Revision Mismatch
                    Config Error: Processor Missing
                    Config Error: Power Supply Rating Mismatch
                    Config Error: Voltage Rating Mismatch
                    Config Error

    Power Unit      240VA power down
                    Interlock power down
                    AC lost
                    Soft-power control failure
                    Failure detected
                    Predictive failure

    Memory          Uncorrectable ECC
                    Parity
                    Memory Scrub Failed
                    Memory Device Disabled
                    Correctable ECC logging limit reached
                    Configuration Error
                    Throttled
                    Critical Overtemperature

    Drive Slot      Drive Fault
                    Predictive Failure
                    Parity Check In Progress
                    In Critical Array
                    In Failed Array
                    Rebuild In Progress
                    Rebuild Aborted

    Battery         Low
                    Failed

  2. If this indicator is abnormal, contact maintenance personnel to handle the problem.
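Scanning the `ipmitool sdr elist` output for the listed fault descriptions can be sketched as follows. This is a simplified illustration, not the check that MRS Manager performs. It assumes that columns are separated by `|`, that the last column holds the sensor state, and that multiple states in that column are separated by commas. Only a subset of the fault descriptions is included, and the match ignores which module a sensor belongs to. The sample rows are made up.

```python
# A subset of the fault descriptions from the table, lowercased for matching.
# Entries that contain a comma (for example "AC out-of-range, but present")
# are left out because this sketch splits the state column on commas.
FAULT_STATES = {
    "ierr",
    "thermal trip",
    "failure detected",
    "predictive failure",
    "power supply ac lost",
    "uncorrectable ecc",
    "drive fault",
    "in failed array",
}


def faulty_sensors(sdr_output: str) -> list[tuple[str, str]]:
    """Return (sensor name, fault state) pairs found in the last column."""
    faults = []
    for line in sdr_output.splitlines():
        columns = [column.strip() for column in line.split("|")]
        if len(columns) < 5:
            continue  # not a sensor row
        for state in columns[-1].split(","):
            state = state.strip()
            if state.lower() in FAULT_STATES:
                faults.append((columns[0], state))
    return faults


# Made-up sample rows, for illustration only.
sample = """\
CPU1 Status      | 60h | ok  |  3.1 | Presence detected
PS1 Status       | 70h | cr  | 10.1 | Presence detected, Failure detected
DIMM A1          | 80h | ok  | 32.1 | Presence Detected
Fan1             | 30h | ok  | 29.1 | 5400 RPM
"""
print(faulty_sensors(sample))
```

Any sensor returned by the function corresponds to step 2: report it to maintenance personnel together with the full command output.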

Host Name

Indicator name: Hostname

Indicator description: This indicator is used to check whether a host name is set. If no host name is set, the indicator is unhealthy. If this indicator is abnormal, you are advised to set a host name properly.

Recovery guidance:

  1. Log in to the unhealthy node.
  2. Run the following command to change the host name to ensure that the node host name is consistent with the planned host name:

    hostname Host name

    For example, to change the host name to Bigdata-OM-01, run the hostname Bigdata-OM-01 command.

  3. Modify the host name configuration file.

    Run the vi /etc/HOSTNAME command to edit the file, change the file content to Bigdata-OM-01, save the modification, and exit.

Umask

Indicator name: Umask

Indicator description: This indicator is used to check whether the umask of user omm is correctly set. If the umask is not set to 0077, the indicator is unhealthy.

Recovery guidance:

  1. If this indicator is abnormal, you are advised to set the umask of user omm to 0077. Log in to the unhealthy node, and run su - omm to switch to user omm.
  2. Run vi ${BIGDATA_HOME}/.om_profile, set umask to 0077, save the modification, and exit.
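After editing the profile, the setting can be verified with a short script. The sketch below assumes that the profile sets the mask with a shell-style line such as `umask 0077`. The actual layout of .om_profile is not described here, so treat the pattern as an assumption. The function names are illustrative.

```python
import re


def configured_umask(profile_text: str):
    """Return the last umask value set in the profile text, or None."""
    value = None
    for line in profile_text.splitlines():
        # Commented-out lines do not match because the line must start
        # with optional whitespace followed by the word "umask".
        match = re.match(r"\s*umask\s+([0-7]{3,4})\b", line)
        if match:
            value = match.group(1)  # a later line overrides an earlier one
    return value


def umask_is_0077(profile_text: str) -> bool:
    """Check whether the effective umask setting equals octal 0077."""
    value = configured_umask(profile_text)
    return value is not None and int(value, 8) == 0o077


# Made-up profile content, for illustration only.
sample = "export EXAMPLE_VAR=1\n# umask 0022\numask 0077\n"
print(configured_umask(sample), umask_is_0077(sample))
```

Running `umask` in a new login shell of user omm remains the most direct check, because it shows the value actually in effect.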

OMS HA Status

Indicator name: OMS HA Status

Indicator description: This indicator is used to check whether the status of OMS HA resources is normal. For details about the status of OMS HA resources, run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the status. If any module is abnormal, the indicator is unhealthy.

Recovery guidance:

  1. Log in to the active management node, run su - omm to switch to user omm, and run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the OMS status.
  2. If floatip, okerberos, and oldap are abnormal, see ALM-12002, ALM-12004, and ALM-12005 respectively to resolve the problems.
  3. If other resources are abnormal, you are advised to view the logs of the faulty modules.

    If the controller resource is abnormal, view the /var/log/Bigdata/controller/controller.log log file of the faulty node.

    If the cep resource is abnormal, view the /var/log/Bigdata/omm/oms/cep/cep.log log file of the faulty node.

    If the aos resource is abnormal, view the /var/log/Bigdata/controller/aos/aos.log log file of the faulty node.

    If the feed_watchdog resource is abnormal, view the /var/log/Bigdata/watchdog/watchdog.log log file of the faulty node.

    If the httpd resource is abnormal, view the /var/log/Bigdata/httpd/error_log log file of the faulty node.

    If the fms resource is abnormal, view the /var/log/Bigdata/omm/oms/fms/fms.log log file of the faulty node.

    If the pms resource is abnormal, view the /var/log/Bigdata/omm/oms/pms/pms.log log file of the faulty node.

    If the iam resource is abnormal, view the /var/log/Bigdata/omm/oms/iam/iam.log log file of the faulty node.

    If the gaussDB resource is abnormal, view the /var/log/Bigdata/omm/oms/db/omm_gaussdba.log log file of the faulty node.

    If the ntp resource is abnormal, view the /var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log log file of the faulty node.

    If the tomcat resource is abnormal, view the /var/log/Bigdata/tomcat/catalina.log log file of the faulty node.

  4. If the problem cannot be resolved by viewing logs, contact maintenance personnel and send the collected fault logs.
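Steps 2 and 3 amount to a lookup from the abnormal resource to an alarm or a log file. The following sketch encodes that lookup using the resource names, alarm IDs, and paths listed above. The function name `next_step` and the wording of its return values are illustrative. The sketch does not parse the output of status-oms.sh, because that format is not described here.

```python
# Resources handled through alarms (step 2).
ALARM_FOR_RESOURCE = {
    "floatip": "ALM-12002",
    "okerberos": "ALM-12004",
    "oldap": "ALM-12005",
}

# Resources diagnosed through log files (step 3).
LOG_FOR_RESOURCE = {
    "controller": "/var/log/Bigdata/controller/controller.log",
    "cep": "/var/log/Bigdata/omm/oms/cep/cep.log",
    "aos": "/var/log/Bigdata/controller/aos/aos.log",
    "feed_watchdog": "/var/log/Bigdata/watchdog/watchdog.log",
    "httpd": "/var/log/Bigdata/httpd/error_log",
    "fms": "/var/log/Bigdata/omm/oms/fms/fms.log",
    "pms": "/var/log/Bigdata/omm/oms/pms/pms.log",
    "iam": "/var/log/Bigdata/omm/oms/iam/iam.log",
    "gaussDB": "/var/log/Bigdata/omm/oms/db/omm_gaussdba.log",
    "ntp": "/var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log",
    "tomcat": "/var/log/Bigdata/tomcat/catalina.log",
}


def next_step(resource: str) -> str:
    """Return the suggested action for an abnormal OMS HA resource."""
    if resource in ALARM_FOR_RESOURCE:
        return f"handle {ALARM_FOR_RESOURCE[resource]}"
    if resource in LOG_FOR_RESOURCE:
        return f"view {LOG_FOR_RESOURCE[resource]}"
    return "contact maintenance personnel"


for name in ("oldap", "httpd", "unknown"):
    print(name, "->", next_step(name))
```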

Installation Directory and Data Directory Check

Indicator name: Installation Directory and Data Directory

Indicator description: This indicator first checks the lost+found directory in the root directory of the disk partition where the installation directory (/opt/Bigdata by default) is located. When an exception occurs on a node, related files are stored in the lost+found directory, so this check determines whether files have been lost in such scenarios. If files of user omm exist in the lost+found directory, the check reports an exception. The indicator then checks the installation directory (such as /opt/Bigdata) and the data directory (such as /srv/BigData). If files of non-omm users exist in either directory, the indicator is unhealthy.

Recovery guidance:

  1. Log in to the unhealthy node, and run su - omm to switch to user omm. Check whether files or folders of user omm exist in the lost+found directory.

    If files of user omm exist, restore the files to a correct directory and perform the check again. If files of user omm do not exist, go to 2.

  2. Check whether files or folders of non-omm users exist in the installation directory and data directory. If files exist and are temporary files generated manually, clear them and perform the check again.
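The ownership check in step 2 can be sketched with a directory walk. This is an illustration under stated assumptions, not the check that the indicator runs. The function name `entries_not_owned_by` is hypothetical, and the expected owner is passed in as a numeric user ID.

```python
import os


def entries_not_owned_by(root: str, uid: int) -> list[str]:
    """Return paths under root whose owner differs from the given user ID."""
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symbolic link itself rather than its target.
            if os.lstat(path).st_uid != uid:
                offenders.append(path)
    return sorted(offenders)
```

On a node, the user ID of omm could be obtained with `pwd.getpwnam("omm").pw_uid`, and the function could then be called for the installation directory and the data directory, for example `entries_not_owned_by("/opt/Bigdata", omm_uid)`. Review the returned paths before clearing anything: only temporary files generated manually should be removed.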

CPU Usage

Indicator name: CPU Usage

Indicator description: This indicator is used to check whether the CPU usage exceeds the threshold. If the CPU usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12016.

Memory Usage

Indicator name: Memory Usage

Indicator description: This indicator is used to check whether the memory usage exceeds the threshold. If the memory usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12018.

Host Disk Usage

Indicator name: Host Disk Usage

Indicator description: This indicator is used to check whether the host disk usage exceeds the threshold. If the host disk usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12017.

Host Disk Write Rate

Indicator name: Host Disk Write Speed

Indicator description: This indicator is used to check the host disk write rate. The host disk write rate may vary according to the service scenario. This indicator only reports the measured value. You need to determine whether the value is normal based on the service scenario.

Recovery guidance: Determine whether the disk write rate is normal based on the service scenario.

Host Disk Read Rate

Indicator name: Host Disk Read Speed

Indicator description: This indicator is used to check the host disk read rate. The host disk read rate may vary according to the service scenario. This indicator only reports the measured value. You need to determine whether the value is normal based on the service scenario.

Recovery guidance: Determine whether the disk read rate is normal based on the service scenario.

Host Service Plane Network Status

Indicator name: Host service plane network status

Indicator description: This indicator is used to check the network connectivity of the cluster host service plane. If the host service plane network is disconnected, the indicator is unhealthy.

Recovery guidance: If the network is a single-plane network, check the IP address of the single plane. If the network is a dual-plane network, the recovery procedures are as follows:

  1. Check the network connectivity between the service plane IP addresses of the active and standby management nodes.

    If the network is abnormal, go to 3.

    If the network is in normal state, go to 2.

  2. Check the network connectivity between the IP addresses of the active management node and the faulty node in the cluster.
  3. If the network is abnormal, contact maintenance personnel to resolve the network problem.

Host Status

Indicator name: Host Status

Indicator description: This indicator is used to check whether the host status is normal. If a node is faulty, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to ALM-12006.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists on the host. If an uncleared alarm exists, the host is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.
