Updated on 2024-09-23 GMT+08:00

Cluster O&M

Account Maintenance Suggestions

It is recommended that the administrator conduct routine checks on the accounts. The check covers the following items:

  • Check whether the accounts of the OS, Manager, and each component are necessary and whether temporary accounts have been deleted.
  • Check whether the permissions of the accounts are appropriate. Different administrators should have different permissions.
  • Check and audit the login and operation records of all types of accounts.

Password Maintenance Suggestions

Accessing the portal requires identity authentication. The complexity and validity period of account passwords must meet your security requirements.

Refer to the following suggestions to maintain passwords:

  • Assign dedicated personnel to keep OS passwords.
  • Use passwords that meet defined strength requirements, such as a minimum length and a mix of uppercase and lowercase letters.
  • Encrypt passwords before transferring them, and do not transfer them via email.
  • Encrypt passwords for storage.
  • Remind enterprise users to change passwords during system handover.
  • Change passwords periodically.
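The strength requirements mentioned above (minimum length and mixed letter case) can be checked with a short script. The following is a minimal sketch, not a Manager feature: the function name `password_issues` and the `MIN_LENGTH` value are hypothetical and should be replaced with the values from your own security policy.

```python
import re

# Hypothetical policy value; replace it with your organization's requirement.
MIN_LENGTH = 8


def password_issues(password: str, min_length: int = MIN_LENGTH) -> list[str]:
    """Return a list of policy violations; an empty list means the password passes."""
    issues = []
    if len(password) < min_length:
        issues.append(f"shorter than {min_length} characters")
    if not re.search(r"[a-z]", password):
        issues.append("no lowercase letter")
    if not re.search(r"[A-Z]", password):
        issues.append("no uppercase letter")
    return issues
```

Returning the list of violations, rather than a single true/false value, lets the administrator tell the user exactly which requirement was not met.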

Log Maintenance Suggestions

Operation logs help discover exceptions such as illegal operations and login by unauthorized users. The system records important operations in logs. You can use operation logs to locate problems.

  • Checking Logs Regularly

    Check system logs periodically and handle exceptions such as unauthorized operations or logins in a timely manner.

  • Backing Up Logs Regularly

    Audit logs provided by Manager and clusters record user activity and operation information. You can export audit logs from Manager. If the system accumulates too many audit logs, configure the dump parameters to dump them to a specified server so that the cluster nodes retain sufficient disk space.

  • Maintenance Owner

    Network monitoring engineers and system maintenance engineers
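Regular log checks are easier when failed operations are summarized per user. The sketch below assumes that the exported audit records are available as CSV text with the columns `user`, `operation`, and `result`, and that successful records carry the value `success`. These column names and values are hypothetical and are not taken from the Manager export format; adjust them to match your actual files.

```python
import csv
import io
from collections import Counter


def failed_operations_by_user(csv_text: str) -> Counter:
    """Count records whose 'result' column is not 'success', grouped by user.

    The column names 'user' and 'result' are assumptions for illustration.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    failures = Counter()
    for row in reader:
        if row["result"].strip().lower() != "success":
            failures[row["user"]] += 1
    return failures


# Made-up sample records, used only to show how the function is called.
sample = """user,operation,result
admin,login,success
guest,login,failed
guest,login,failed
ops,modify_config,success
"""

for user, count in failed_operations_by_user(sample).most_common():
    print(user, count)
```

Users with an unusually high failure count are candidates for a closer review of their login and operation records.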

Manager Routine Maintenance

To ensure long-term, stable running of the system, administrators or maintenance engineers need to periodically check the items listed in the following table and rectify any detected faults based on the check results. It is recommended that administrators or engineers record the results of each task and sign off in accordance with enterprise management regulations.

Table 1 Routine maintenance check items

Routine Maintenance Frequency

Role

Check Item

Daily

Check the cluster service status.

  • Check whether the running status and configuration status of each service are normal and whether the status icons are green.
  • Check whether the running status and configuration status of the role instances in each service are normal and whether the status icons are green.
  • Check whether the active/standby status of role instances in each service can be properly displayed.
  • Check whether the dashboard of the services and role instances can be displayed properly.

Check the cluster host status.

  • Check whether the running status of each host is normal and whether the status icon is green.
  • Check the current disk usage, memory usage, and CPU usage of each host. Check whether the current memory usage and CPU usage are increasing.

Check the cluster alarm information.

Check whether alarms were generated for unhandled exceptions on the previous day, including alarms that were automatically cleared.

Check the cluster audit information.

Check whether critical and major operations were performed on the previous day and whether the operations are valid.

Check the cluster backup status.

Check whether OMS, LDAP, DBService, and NameNode were automatically backed up on the previous day.

View the health check result.

Perform a health check on Manager and download the health check report to check whether the current cluster is abnormal. You are advised to enable the automatic health check, export the latest cluster health check result, and repair unhealthy items based on the result.

Check the network communication.

Check the cluster network status and check whether the network communication between nodes is delayed.

Check the storage status.

Check whether the total data storage volume of the cluster increases abruptly.

  • Check whether the disk usage is close to the threshold. If yes, locate the causes. For example, check whether the junk data or cold data left by services needs to be cleared.
  • Check whether disk partitions need to be expanded based on the service growth trend.

Check logs.

  • Check whether there are failed or unresponsive MapReduce and Spark tasks. Check the /tmp/logs/${username}/logs/${application id} log file in HDFS and rectify faults.
  • Check Yarn task logs, view the logs of failed and unresponsive tasks, and delete duplicate data.
  • Check the worker logs of Storm.
  • Back up logs to the storage server.

Weekly

Manage users.

Check whether any user password is about to expire and notify the user to change it. After changing the password of a machine-machine user, you need to download the keytab file again.

Analyze alarms.

Export and analyze alarms generated in a specified period.

Scan disks.

Check the disk health status. You are advised to use a dedicated disk check tool.

Collect statistics on storage.

Check in batches whether data is evenly distributed across the disks of cluster nodes, identify the disks whose data volume has increased significantly or is unusually low, and check whether those disks are normal.

Record changes.

Organize and record the operations performed on cluster configuration parameters and files to provide a reference for fault analysis and handling.

Monthly

Analyze logs.

  • Collect and analyze hardware logs of cluster node servers, such as BMC system logs.
  • Collect and analyze the OS logs of the cluster node servers.
  • Collect and analyze cluster logs.

Diagnose the network.

Analyze the network health status of the cluster.

Manage hardware.

Check the equipment room environment and clean the devices.
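The daily storage check in Table 1 (whether disk usage is close to the threshold) can be scripted on each node. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the function names, the mount point list, and the `USAGE_THRESHOLD_PERCENT` value are hypothetical, and the script inspects only the node it runs on.

```python
import shutil

# Hypothetical threshold (percent); use the threshold configured for your cluster.
USAGE_THRESHOLD_PERCENT = 80.0


def usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total space (0.0 for an empty device)."""
    return 100.0 * used / total if total else 0.0


def check_mounts(mount_points, threshold=USAGE_THRESHOLD_PERCENT):
    """Return (mount_point, percent_used) for each mount at or above the threshold."""
    flagged = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)  # named tuple: total, used, free (bytes)
        percent = usage_percent(usage.total, usage.used)
        if percent >= threshold:
            flagged.append((mount, round(percent, 1)))
    return flagged


if __name__ == "__main__":
    # "/" is only an example; list the data disk mount points of your nodes.
    for mount, percent in check_mounts(["/"]):
        print(f"{mount}: {percent}% used, at or above the threshold")
```

Recording the output each day also makes it easier to see whether usage is trending upward, which supports the decision on whether partitions need to be expanded.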