Updated on 2022-08-12 GMT+08:00

Routine Maintenance

To keep the system running properly and stably over the long term, system administrators or maintenance engineers need to periodically check the items listed in Table 1 and rectify faults based on the check results. System administrators are advised to record the result of each task and sign off in accordance with enterprise management regulations.

Table 1 Check items

Routine Maintenance Period

Task

Routine Maintenance Content

Every day

Checking the cluster service status

  • Check whether the running status and configuration status of each service are normal and whether the status icons are green.
  • Check whether the running status and configuration status of the role instances of each service are normal and whether the status icons are green.
  • Check whether the active/standby status of role instances of each service can be properly displayed.
  • Check whether the Dashboard results of services and role instances are normal.

Checking the cluster host status

  • Check whether the running status of each host is normal and whether the status icon is green.
  • Check the current disk usage, memory usage, and CPU usage of each host. Check whether the memory usage and CPU usage are trending upward.
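The host check above can be partly scripted. The following is a minimal sketch, not a FusionInsight Manager API: it reads local disk usage with the Python standard library and tests whether a series of usage samples is strictly rising. The function names and the `min_rise` parameter are illustrative assumptions, and the samples are assumed to be collected elsewhere.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_ascending(samples, min_rise=0.0):
    """Return True if each sample exceeds the previous one by more than min_rise.

    samples is a list of usage percentages ordered from oldest to newest.
    min_rise is an assumed tolerance; tune it to ignore normal jitter.
    """
    if len(samples) < 2:
        return False
    return all(b - a > min_rise for a, b in zip(samples, samples[1:]))
```

For example, `is_ascending([40.0, 45.0, 52.0])` flags a steady climb in memory or CPU usage that may be worth investigating even when no threshold has been crossed yet.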

Checking the cluster alarm information

Check whether any alarms were generated on the previous day, including alarms that were automatically cleared.

Checking the cluster audit information

Check whether any Critical or Major operations were performed on the previous day and whether those operations were valid.

Checking the cluster backup

Check whether OMS, LDAP, DBService, and NameNode were automatically backed up on the previous day.

Checking the health check results

Perform a health check on FusionInsight Manager and download the health check report to see whether any exception exists in the current cluster. You are advised to enable the automatic health check, export the latest cluster health check result, and repair unhealthy items based on the result.

Checking the network communication

Check the running status of the cluster network and check whether communication between nodes is delayed.
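One rough way to spot delay between nodes is to time a TCP connection to a port that is known to be open on the peer node. The sketch below assumes this approach is acceptable in your environment; the function name is hypothetical, and the host and port must be supplied by the operator.

```python
import socket
import time


def tcp_connect_delay_ms(host, port, timeout=3.0):
    """Return the TCP connect time to host:port in milliseconds.

    Returns None if the connection fails or times out. This measures
    connection setup only and is not a substitute for dedicated network
    diagnosis tools.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None
```

Running the function from each node against the others and comparing the values over several days gives a simple baseline; a sudden rise or a `None` result points to a link that needs closer inspection.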

Checking the storage status

  • Check whether the total amount of data stored in the cluster has increased suddenly.
  • Check whether the disk usage is approaching the threshold and, if so, find the cause, for example junk data or cold data that needs to be deleted.
  • Check whether service volume is growing and whether the disk partitions need to be expanded.
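The storage checks above can be expressed as two small comparisons. This is a sketch under stated assumptions: the 80% threshold and the 10% day-over-day growth limit are placeholder values rather than product defaults, and the usage figures are assumed to come from your own monitoring data.

```python
def disks_over_threshold(usage_by_mount, threshold_percent=80.0):
    """Return the mount points whose usage is at or above the threshold.

    usage_by_mount maps a mount point to its used percentage.
    threshold_percent is an assumed value; use the threshold of your cluster.
    """
    return sorted(
        mount for mount, pct in usage_by_mount.items() if pct >= threshold_percent
    )


def sudden_increase(previous_total, current_total, max_growth_ratio=0.10):
    """Return True if storage grew by more than max_growth_ratio since the last check."""
    if previous_total <= 0:
        return current_total > 0
    return (current_total - previous_total) / previous_total > max_growth_ratio
```

For instance, `sudden_increase(1000, 1200)` returns `True` because the total grew by 20%, which exceeds the assumed 10% limit.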

Checking logs

  • Check whether any failed or suspended MapReduce or Spark job exists, view the /tmp/logs/${username}/logs/${application id} log file in HDFS, and rectify the fault.
  • Check the Yarn job logs, view the logs recording failed or suspended jobs, and delete the duplicate data.
  • Check the worker logs of Storm.
  • Back up logs to the storage server.
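To locate the log directory of a failed job, the path pattern given above can be filled in programmatically. The sketch below only builds the path and an `hdfs dfs -ls` argument list; it does not run the command. The helper names are hypothetical, and the user name and application ID in the usage note are made-up examples.

```python
def aggregated_log_dir(username, application_id):
    """Build the HDFS log directory for a job from the pattern
    /tmp/logs/${username}/logs/${application id}."""
    return f"/tmp/logs/{username}/logs/{application_id}"


def hdfs_ls_command(username, application_id):
    """Return the argument list for listing that directory with the HDFS client.

    The list can be passed to subprocess.run on a node where the client
    is installed and the user is authenticated.
    """
    return ["hdfs", "dfs", "-ls", aggregated_log_dir(username, application_id)]
```

For example, `aggregated_log_dir("testuser", "application_1660000000000_0001")` yields `/tmp/logs/testuser/logs/application_1660000000000_0001`.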

Every week

Managing users

Check whether any user passwords are about to expire and notify the affected users to change them. Changing the password of a Machine-Machine user requires downloading the keytab file again.
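If password expiry dates are available as a list, the weekly check reduces to a date comparison. The following sketch assumes the expiry dates have already been collected into a dictionary; how they are obtained is outside its scope, and the seven-day warning window is an assumed value.

```python
from datetime import date, timedelta


def users_expiring_soon(expiry_by_user, today, warn_days=7):
    """Return users whose password expires on or before today + warn_days.

    expiry_by_user maps a user name to a datetime.date expiry date.
    Passwords that have already expired are included in the result.
    """
    cutoff = today + timedelta(days=warn_days)
    return sorted(user for user, expiry in expiry_by_user.items() if expiry <= cutoff)
```

Calling the function with `today=date.today()` returns the users to notify in the current week.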

Analyzing alarms

Export the alarms generated in a specified period and analyze them.
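A simple starting point for alarm analysis is counting the exported alarms by one attribute. The sketch below assumes the export is a CSV file with a header row; the column names `severity` and `alarm_name` are hypothetical and must be replaced with the headers of the actual export.

```python
import csv
from collections import Counter


def count_alarms(csv_file, key="severity"):
    """Count exported alarms by the value of one column.

    csv_file is an open text file (or any iterable of lines) with a header row.
    key is the column to group by; "severity" is an assumed column name.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[key] for row in reader)
```

Grouping by alarm name instead of severity, for example `count_alarms(f, key="alarm_name").most_common(5)`, highlights alarms that recur most often in the period.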

Scanning disks

Check the disk health status. You are advised to use professional disk health check tools to perform the check.

Collecting statistics of storage

Check the disk data of cluster nodes in batches and check whether the data is evenly distributed. Identify the disks that hold too much or too little data and check whether those disks are normal.
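One way to pick out disks that hold too much or too little data is to compare each disk with the median across all disks. This is a sketch rather than a prescribed method: the 25% tolerance is an assumed value, and the per-disk figures are expected to come from your own statistics.

```python
from statistics import median


def unbalanced_disks(used_gb_by_disk, tolerance=0.25):
    """Return disks whose used space deviates from the median by more than tolerance.

    used_gb_by_disk maps a disk or mount point to its used space in GB.
    tolerance is a fraction of the median (0.25 means 25%) and is an assumption.
    """
    if not used_gb_by_disk:
        return []
    mid = median(used_gb_by_disk.values())
    if mid == 0:
        return sorted(d for d, used in used_gb_by_disk.items() if used > 0)
    return sorted(
        d for d, used in used_gb_by_disk.items() if abs(used - mid) / mid > tolerance
    )
```

The median is used instead of the mean so that a single nearly empty or nearly full disk does not shift the baseline for all the others.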

Recording changes

Organize and record the operations performed on cluster configuration parameters and files to provide a reference for fault analysis and rectification.

Every month

Analyzing logs

  • Collect and analyze the hardware logs of cluster node servers, such as the BMC system logs.
  • Collect and analyze the OS logs of cluster node servers.
  • Collect and analyze the cluster logs.

Diagnosing the network

Analyze the cluster network health status.

Managing hardware

Check the equipment rooms where the devices are running and clean the devices.