Routine Maintenance

To keep the system running properly and stably over the long term, MRS cluster administrators or maintenance engineers need to periodically check the items listed in Table 1 and rectify faults based on the check results. System administrators are advised to record the result of each task and sign off in accordance with enterprise management regulations.

**Table 1** Check items
Routine Maintenance Period	Task	Routine Maintenance Content
Every day	Checking the cluster service status	Check whether the running status and configuration status of each service are normal and whether the status icons are green. Check whether the running status and configuration status of the role instances of each service are normal and whether the status icons are green. Check whether the active/standby status of the role instances of each service is displayed properly. Check whether the Dashboard results of services and role instances are normal.
	Checking the cluster host status	Check whether the running status of each host is normal and whether the status icon is green. Check the current disk usage, memory usage, and CPU usage of each host. Check whether the memory usage and CPU usage are trending upward.
	Checking the cluster alarm information	Check whether any alarms were generated in the previous day and whether they were automatically cleared.
	Checking the cluster audit information	Check whether any Critical or Major operations were performed in the previous day and whether those operations were valid.
	Checking the cluster backup	Check whether OMS, LDAP, DBService, and NameNode were automatically backed up in the previous day.
	Checking the health check results	Perform a health check on FusionInsight Manager and download the health check report to check whether any exception exists in the current cluster. You are advised to enable the automatic health check, export the latest cluster health check result, and repair unhealthy items based on the result.
	Checking the network communication	Check the running status of the cluster network and check whether there is any delay in network communication between nodes.
	Checking the storage status	Check whether the total amount of data stored in the cluster has increased suddenly. Check whether the disk usage is approaching the threshold and, if so, find the cause, for example, junk data or cold data that needs to be deleted. Check whether the service volume is growing and whether the disk partitions need to be expanded.
	Checking logs	Check whether any failed or suspended MapReduce or Spark jobs exist. If so, view the /tmp/logs/${username}/logs/${application id} log file in HDFS and rectify the fault. Check the Yarn job logs, view the logs recording failed or suspended jobs, and delete the duplicate data. Check the worker logs of Storm. Back up logs to the storage server.
Every week	Managing users	Check whether any user passwords are about to expire and notify the affected users to change their passwords. To change the password of a Machine-Machine user, the keytab file needs to be downloaded again.
	Analyzing alarms	Export the alarms generated in a specified period and analyze them.
	Scanning disks	Check the disk health status. You are advised to use professional disk health check tools to perform the check.
	Collecting storage statistics	Check the disk data of cluster nodes in batches and check whether the data is evenly distributed. Identify the disks that hold far more or far less data than the others and check whether those disks are normal.
	Recording changes	Organize and record the operations performed on cluster configuration parameters and files to provide a reference for fault analysis and rectification.
Every month	Analyzing logs	Collect and analyze the hardware logs of cluster node servers, such as the BMC system logs. Collect and analyze the OS logs of cluster node servers. Collect and analyze the cluster logs.
	Diagnosing the network	Analyze the cluster network health status.
	Managing hardware	Check the equipment rooms where the devices are running and clean the devices.
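
The disk-related checks in Table 1 (disk usage against a threshold, and whether data is evenly stored across disks) can be partly scripted. The following is a minimal Python sketch, not an MRS or FusionInsight Manager tool. The function names, the 80% usage threshold, the 20-percentage-point tolerance, and the mount point list are all hypothetical and should be replaced with values that match your cluster and alarm settings.

```python
import shutil
from statistics import mean


def usage_percent(path: str) -> float:
    """Return the used space, in percent, of the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(usages: dict, threshold: float = 80.0) -> list:
    """Return the disks whose usage is at or above `threshold` percent.

    The 80% default is an assumed value, not an MRS default.
    """
    return sorted(name for name, pct in usages.items() if pct >= threshold)


def uneven_disks(usages: dict, tolerance: float = 20.0) -> list:
    """Return the disks whose usage differs from the mean by more than
    `tolerance` percentage points (an assumed value), that is, disks
    holding noticeably more or less data than the others.
    """
    if not usages:
        return []
    average = mean(usages.values())
    return sorted(
        name for name, pct in usages.items() if abs(pct - average) > tolerance
    )


if __name__ == "__main__":
    # Hypothetical mount points; list the data disks of the node here.
    mounts = ["/"]
    current = {mount: usage_percent(mount) for mount in mounts}
    for mount, pct in current.items():
        print(f"{mount}: {pct:.1f}% used")
    print("At or above threshold:", over_threshold(current))
    print("Unevenly used:", uneven_disks(current))
```

Run on each node, the script only reports candidates for further inspection. Whether a flagged disk actually holds junk data or cold data, or needs its partition expanded, still has to be decided manually.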