Updated on 2023-05-15 GMT+08:00

High-Risk Operations

Forbidden Operations

Table 1 lists operations that are forbidden during routine cluster operation and maintenance.

Table 1 Forbidden operations

Item

Risk

Delete ZooKeeper data directories.

ClickHouse, HDFS, Yarn, HBase, and Hive depend on ZooKeeper, which stores their metadata. Deleting these directories disrupts the normal operation of the dependent components.

Frequently switch over the active and standby JDBCServer nodes.

This operation may interrupt services.

Delete Phoenix system tables and data (SYSTEM.CATALOG, SYSTEM.STATS, SYSTEM.SEQUENCE, and SYSTEM.FUNCTION).

This operation will cause service operation failures.

Manually modify data in the Hive metadata database (hivemeta database).

This operation may cause Hive data parse errors. As a result, Hive cannot provide services.

Manually perform INSERT or UPDATE operations on Hive metadata tables.

This operation may cause Hive data parse errors. As a result, Hive cannot provide services.

Change permission on the Hive private file directory hdfs:///tmp/hive-scratch.

This operation may make Hive services unavailable.

Modify broker.id in the Kafka configuration file.

This operation may cause invalid node data.

Modify the host names of nodes.

Instances and upper-layer components on the host cannot provide services properly. The fault cannot be rectified.

Reinstall the OS of a node.

This operation will cause MRS cluster exceptions, leaving MRS clusters in abnormal status.

Use private images.

This operation will cause MRS cluster exceptions, leaving MRS clusters in abnormal status.

The following tables list the high-risk operations during the operation and maintenance of each component.

High-Risk Operations on a Cluster

Table 2 High-risk operations on a cluster

Operation

Risk

Severity

Workaround

Check Item

Modify the file directory or file permissions of user omm without permission.

This operation will lead to MRS service unavailability.

▲▲▲▲▲

Do not perform this operation.

Check whether the MRS cluster service is available.

Bind an EIP.

This operation exposes the Master node hosting MRS Manager of the cluster to the public network, increasing the risk of network attacks from the Internet.

▲▲▲▲▲

Ensure that the bound EIP is a trusted public IP address.

None

Enable security group rules for port 22 of a cluster.

This operation increases the risk that vulnerabilities on port 22 are exploited.

▲▲▲▲▲

Configure a security group rule for port 22 that allows only trusted IP addresses to access the port. Do not configure an inbound rule that allows 0.0.0.0/0 (all addresses) to access the port.

None

Delete a cluster or cluster data.

Data will be lost.

▲▲▲▲▲

Before deleting the data, confirm the necessity of the operation and ensure that the data has been backed up.

None

Scale in a cluster.

Data will be lost.

▲▲▲▲▲

Before scaling in the cluster, confirm the necessity of the operation and ensure that the data has been backed up.

None

Detach or format a data disk.

Data will be lost.

▲▲▲▲▲

Before performing this operation, confirm the necessity of the operation and ensure that the data has been backed up.

None
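The port 22 workaround in Table 2 can be checked mechanically before a rule is applied. The following is a minimal sketch, not an MRS API: the rule format (a dict with "port" and "source" keys) is a hypothetical representation chosen for illustration, and sources are assumed to be written in CIDR form.

```python
import ipaddress


def is_world_open(cidr):
    """Return True if the CIDR block covers every address (prefix length 0)."""
    # A bare address without a prefix is parsed as a single host,
    # so sources are expected in CIDR form such as 0.0.0.0/0.
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def risky_ssh_rules(rules):
    """Return the inbound rules that expose port 22 to all addresses.

    Each rule is a hypothetical dict: {"port": int, "source": "<CIDR>"}.
    """
    return [r for r in rules if r["port"] == 22 and is_world_open(r["source"])]


rules = [
    {"port": 22, "source": "0.0.0.0/0"},        # open to everyone: flagged
    {"port": 22, "source": "203.0.113.10/32"},  # single trusted address
    {"port": 443, "source": "0.0.0.0/0"},       # not port 22: ignored here
]
print(risky_ssh_rules(rules))
```

Only the first rule is reported, because it is the only one that combines port 22 with an unrestricted source.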

Manager High-Risk Operations

Table 3 Manager high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Change the OMS password.

This operation will restart all processes of OMSServer, which adversely affects cluster maintenance and management.

▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check whether there are uncleared alarms and whether the cluster management and maintenance are normal.

Import the certificate.

This operation will restart OMS processes and the entire cluster, which adversely affects cluster management, maintenance, and services.

▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Perform an upgrade.

This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster.

Strictly manage the user who is eligible to assign the cluster management permission to prevent security risks.

▲▲▲

Ensure that no other maintenance and management operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Restore the OMS.

This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster.

▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Change an IP address.

This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster.

▲▲▲

Ensure that no other maintenance and management operations are performed at the same time and that the new IP address is correct.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Change log levels.

If the log level is changed to DEBUG, Manager responds slowly.

▲▲

Before the modification, confirm that the operation is necessary. Afterward, change the log level back to the default promptly.

None

Replace a control node.

This operation will interrupt services deployed on the node. If the node is a management node, the operation will restart all OMS processes, affecting the cluster management and maintenance.

▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Replace a management node.

This operation will interrupt services deployed on the node. As a result, OMS processes will be restarted, affecting the cluster management and maintenance.

▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Restart the upper-layer service at the same time during the restart of a lower-layer service.

This operation will interrupt the upper-layer service, affecting the management, maintenance, and services of the cluster.

▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Modify the OLDAP port.

This operation will restart the LdapServer and Kerberos services and all associated services, affecting service running.

▲▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

None

Delete the supergroup group.

Deleting the supergroup group decreases user rights, affecting service access.

▲▲▲▲▲

Before the change, confirm the rights to be added. Ensure that the required rights have been added before deleting the supergroup rights to which the user is bound, ensuring service continuity.

None

Restart a service.

Services will be interrupted during the restart. If you also choose to restart the upper-layer services, the upper-layer services that depend on this service will be interrupted as well.

▲▲▲

Confirm the necessity of restarting the service before the operation.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Change the default SSH port number.

After the default port (22) is changed, functions such as cluster creation, service/instance adding, host adding, and host reinstallation cannot be used, and results of cluster health check items for node mutual trust, omm/ommdba user password expiration, and others are incorrect.

▲▲▲

Before using any of the affected functions, restore the SSH port to the default value.

None
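Because several Manager functions in Table 3 depend on the default SSH port, it can help to confirm which port a node is actually configured to use. The sketch below is an illustration under stated assumptions: it parses the text of an sshd_config file passed in as a string, handles only the whitespace-separated "Port <number>" form, and falls back to 22 when no Port directive is present. The function name is hypothetical.

```python
def configured_ssh_ports(sshd_config_text):
    """Return the ports set by Port directives, or [22] if none are set."""
    ports = []
    for line in sshd_config_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        # sshd_config keywords are case-insensitive.
        if len(parts) >= 2 and parts[0].lower() == "port":
            ports.append(int(parts[1]))
    return ports or [22]


sample = """
# Port 22
Port 2222
PermitRootLogin no
"""
ports = configured_ssh_ports(sample)
if ports != [22]:
    print("SSH port differs from the default:", ports)
```

With the sample text, the commented-out line is ignored and the non-default port 2222 is reported.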

CDL High-Risk Operations

Table 4 CDL high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Start or stop basic components independently.

This operation adversely affects the basic functions of some services, causing service failures.

▲▲▲

Do not start or stop basic components such as Kafka, DBService, ZooKeeper, Kerberos, and LDAP separately. To start or stop basic components, select associated services.

Check whether the service status is normal.

Restart or stop services.

This operation may interrupt services.

▲▲

Restart or stop services when necessary.

Check whether the service is running properly.

ClickHouse High-Risk Operations

Table 5 ClickHouse high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Delete data directories.

This operation may cause service information loss.

▲▲▲

Do not delete data directories manually.

Check whether data directories are normal.

Remove ClickHouseServer instances.

The ClickHouseServer instance nodes in the same shard must be removed at the same time. Otherwise, the topology information of the logical cluster is incorrect. Before performing this operation, check the database and data table information of each node in the logical cluster and perform scale-in pre-analysis to ensure that data is successfully migrated during the scale-in process and to prevent data loss.

▲▲▲▲▲

Before scale-in, collect information in advance to learn the status of the ClickHouse logical cluster and instance nodes.

Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume.

Add ClickHouseServer instances.

When performing this operation, you must check whether a database or data table with the same name as that on the old node needs to be created on the new node. Otherwise, subsequent data migration, data balancing, scale-in, and decommissioning will fail.

▲▲▲▲▲

Before scale-out, confirm the function and purpose of new ClickHouseServer instances and determine whether to create related databases and data tables.

Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume.

Decommission ClickHouseServer instances.

The ClickHouseServer instance nodes in the same shard must be decommissioned at the same time. Otherwise, the topology information of the logical cluster is incorrect. Before performing this operation, check the database and data table information of each node in the logical cluster and perform decommissioning pre-analysis to ensure that data is successfully migrated during the decommissioning process and to prevent data loss.

▲▲▲▲▲

Before decommissioning, collect information in advance to learn the status of the ClickHouse logical cluster and instance nodes.

Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume.

Recommission ClickHouseServer instances.

When performing this operation, you must select all nodes in the original shard. Otherwise, the topology information of the logical cluster is incorrect.

▲▲▲▲▲

Before recommissioning, confirm which shard the node to be recommissioned belongs to.

Check the ClickHouse logical cluster topology information.

Modify data directory content (file and folder creation).

This operation may cause the ClickHouse instance on the node to become faulty.

▲▲▲

Do not create or modify files or folders in the data directories manually.

Check whether data directories are normal.

Start or stop basic components independently.

This operation adversely affects the basic functions of some services, causing service failures.

▲▲▲

Do not start or stop the basic components ZooKeeper, Kerberos, and LDAP independently. When performing this operation, select the related services.

Check whether the service status is normal.

Restart or stop services.

This operation may interrupt services.

▲▲

Restart or stop services when necessary.

Check whether the service is running properly.
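The same-shard rule in Table 5 (all ClickHouseServer nodes of a shard must be removed, decommissioned, or recommissioned together) can be verified before the operation is submitted. This is a minimal sketch: the topology dictionary, node names, and function name are hypothetical, and the topology is assumed to have been collected beforehand as the workaround recommends.

```python
def incomplete_shards(topology, selected_nodes):
    """Return {shard: missing nodes} for shards that are only partly selected.

    topology maps a shard name to the list of nodes in that shard.
    An empty result means every touched shard is fully selected.
    """
    selected = set(selected_nodes)
    problems = {}
    for shard, nodes in topology.items():
        members = set(nodes)
        chosen = selected & members
        # A shard is a problem only if some, but not all, nodes are chosen.
        if chosen and chosen != members:
            problems[shard] = sorted(members - chosen)
    return problems


topology = {
    "shard1": ["node-1", "node-2"],
    "shard2": ["node-3", "node-4"],
}
print(incomplete_shards(topology, ["node-1"]))            # {'shard1': ['node-2']}
print(incomplete_shards(topology, ["node-3", "node-4"]))  # {}
```

A non-empty result lists the nodes that still need to be added to the selection.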

DBService High-Risk Operations

Table 6 DBService high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Change the DBService password.

The services need to be restarted for the password change to take effect. The services are unavailable during the restart.

▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check whether there are uncleared alarms and whether the cluster management and maintenance are normal.

Restore DBService data.

After the data is restored, the data generated after the data backup and before the data restoration is lost.

After the data is restored, the configuration of the components that depend on DBService may expire and these components need to be restarted.

▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check whether there are uncleared alarms and whether the cluster management and maintenance are normal.

Perform active/standby DBService switchover.

During the DBServer switchover, DBService is unavailable.

▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

None

Change the DBService floating IP address.

The DBService needs to be restarted for the change to take effect. The DBService is unavailable during the restart.

If the floating IP address has been used, the configuration will fail, and the DBService will fail to be started.

▲▲▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Flink High-Risk Operations

Table 7 Flink high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Change log levels.

If the log level is modified to DEBUG, the task running performance is affected.

▲▲

Before the modification, confirm that the operation is necessary. Afterward, change the log level back to the default promptly.

None

Modify file permissions.

Tasks may fail.

▲▲▲

Confirm the necessity of the operation before the modification.

Check whether related service operations are normal.

Flume High-Risk Operations

Table 8 Flume high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify the Flume instance start parameter GC_OPTS.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Change the default value of dfs.replication from 3 to 1.

This operation will have the following impacts:

  1. The storage reliability deteriorates. If the disk becomes faulty, data will be lost.
  2. NameNode fails to be restarted, and the HDFS service is unavailable.

▲▲▲▲

When modifying related configuration items, check the parameter description carefully. Ensure that there are more than two replicas for data storage.

Check whether the default replica number is not 1 and whether the HDFS service is normal.
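The dfs.replication check item above can be sketched as a small script. This is an illustration only: it assumes the configuration is available as Hadoop-style XML text (property elements with name and value children), treats an unset value as the default of 3 stated above, and uses hypothetical function names.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name):
    """Return the value of a Hadoop-style property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def replication_is_safe(xml_text, minimum=3):
    """Return True if dfs.replication is at least `minimum` replicas."""
    value = get_property(xml_text, "dfs.replication")
    # An unset property is treated as the default of 3 replicas.
    replicas = int(value) if value else 3
    return replicas >= minimum


sample = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""
print(replication_is_safe(sample))  # False
```

The sample sets a single replica, so the check fails and the change should be reviewed before it is saved.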

HBase High-Risk Operations

Table 9 HBase high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify encryption configuration.

  • hbase.regionserver.wal.encryption
  • hbase.crypto.keyprovider.parameters.uri
  • hbase.crypto.keyprovider.parameters.encryptedtext

Services cannot start properly.

▲▲▲▲

These configuration items are interdependent. Strictly follow the prompt information when modifying them, and ensure that new values are valid.

Check whether services can be started properly.

Change the value of hbase.regionserver.wal.encryption to false or switch encryption algorithm from AES to SMS4.

This operation may cause start failures and data loss.

▲▲▲▲

After HFile and WAL encryption has been enabled and a table has been created, do not disable encryption or switch the encryption algorithm arbitrarily.

You can switch the encryption algorithm only if no encrypted table (ENCRYPTION=>AES/SMS4) has been created.

None

Modify the HBase instance start parameters GC_OPTS and HBASE_HEAPSIZE.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid and that GC_OPTS does not conflict with HBASE_HEAPSIZE.

Check whether services can be started properly.

Use the OfflineMetaRepair tool.

Services cannot start properly.

▲▲▲▲

This tool can be used only when HBase is offline and cannot be used in data migration scenarios.

Check whether HBase services can be started properly.
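Table 9 asks that GC_OPTS not conflict with HBASE_HEAPSIZE but does not define the conflict. The sketch below therefore rests on two labeled assumptions: that a conflict means the -Xmx value in GC_OPTS differs from the separately configured heap size, and that the heap size is given in MB. It also flags -Xms larger than -Xmx, which prevents a JVM from starting. All function names are hypothetical.

```python
import re

# Multipliers for JVM size suffixes; no suffix means bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_flags(gc_opts):
    """Extract -Xms and -Xmx values, in bytes, from a JVM option string."""
    flags = {}
    for kind, number, unit in re.findall(r"-Xm([sx])(\d+)([kKmMgG]?)", gc_opts):
        flags["xm" + kind] = int(number) * _UNITS[unit.lower()]
    return flags


def heap_issues(gc_opts, heapsize_mb=None):
    """Return a list of problems found in the heap settings."""
    flags = heap_flags(gc_opts)
    xms, xmx = flags.get("xms"), flags.get("xmx")
    issues = []
    if xms is not None and xmx is not None and xms > xmx:
        issues.append("-Xms is larger than -Xmx")
    # Assumption: a mismatch with the separate heap size counts as a conflict.
    if heapsize_mb is not None and xmx is not None and xmx != heapsize_mb * 1024 ** 2:
        issues.append("-Xmx differs from the configured heap size")
    return issues


print(heap_issues("-Xms4G -Xmx2G -XX:+UseG1GC"))
print(heap_issues("-Xms1G -Xmx2G", heapsize_mb=2048))
```

The first call reports the inverted heap bounds; the second returns an empty list because 2G equals 2048 MB.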

HDFS High-Risk Operations

Table 10 HDFS high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Change the HDFS NameNode data storage directory (dfs.namenode.name.dir) or the DataNode data directory (dfs.datanode.data.dir).

Services cannot start properly.

▲▲▲▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Use the -delete parameter when you run the hadoop distcp command.

During DistCP copying, files that do not exist in the source cluster but exist in the destination cluster are deleted from the destination cluster.

▲▲

When using DistCP, determine whether to retain the redundant files in the destination cluster. Exercise caution when using the -delete parameter.

After DistCP copying is complete, check whether the data in the destination cluster is retained or deleted according to the parameter settings.

Modify the HDFS instance start parameters GC_OPTS, HADOOP_HEAPSIZE, and GC_PROFILE.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid and that GC_OPTS does not conflict with HADOOP_HEAPSIZE.

Check whether services can be started properly.

Change the default value of dfs.replication from 3 to 1.

This operation will have the following impacts:

  1. The storage reliability deteriorates. If the disk becomes faulty, data will be lost.
  2. NameNode fails to be restarted, and the HDFS service is unavailable.

▲▲▲▲

When modifying related configuration items, check the parameter description carefully. Ensure that there are more than two replicas for data storage.

Check whether the default replica number is not 1 and whether the HDFS service is normal.

Change the remote procedure call (RPC) channel encryption mode (hadoop.rpc.protection) of each module in Hadoop.

This operation causes service faults and service exceptions.

▲▲▲▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether HDFS and other services that depend on HDFS can properly start and provide services.
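The effect of the -delete parameter described in Table 10 is a set difference: anything present in the destination but absent from the source is removed. The sketch below previews that set from two file listings. It does not call DistCp; the listings are assumed to be relative paths gathered separately, and the function name is hypothetical.

```python
def files_removed_by_delete(source_paths, destination_paths):
    """Return the destination paths that have no counterpart in the source.

    Both arguments are iterables of paths relative to the copy root.
    """
    return sorted(set(destination_paths) - set(source_paths))


source = ["a.csv", "b.csv"]
destination = ["a.csv", "old/report.csv"]
print(files_removed_by_delete(source, destination))  # ['old/report.csv']
```

Reviewing this list before the copy shows which redundant files in the destination cluster would not be retained.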

Hive High-Risk Operations

Table 11 Hive high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify the Hive instance start parameter GC_OPTS.

This operation may cause Hive instance start failures.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Delete all MetaStore instances.

This operation may cause Hive metadata loss. As a result, Hive cannot provide services.

▲▲▲

Do not perform this operation unless you are sure that Hive table information can be discarded.

Check whether services can be started properly.

Delete or modify files corresponding to Hive tables over HDFS interfaces or HBase interfaces.

This operation may cause Hive service data loss or tampering.

▲▲

Do not perform this operation unless you are sure that the data can be discarded or that the operation meets service requirements.

Check whether Hive data is complete.

Delete or modify files corresponding to Hive tables, or change their directory access permissions, over HDFS interfaces or HBase interfaces.

This operation may cause related service scenarios to be unavailable.

▲▲▲

Do not perform this operation.

Check whether related service operations are normal.

Delete or modify hdfs:///apps/templeton/hive-3.1.0.tar.gz over HDFS interfaces.

This operation causes WebHCat to fail to provide services.

▲▲

Do not perform this operation.

Check whether related service operations are normal.

Export table data and overwrite a local directory. For example, export the data of t1 to /opt/dir:

insert overwrite local directory '/opt/dir' select * from t1;

This operation deletes the target directory before writing. An incorrect path may cause software or OS startup failures.

▲▲▲▲▲

Ensure that the target path does not contain any files, or do not use the keyword overwrite in the command.

Check whether files in the target path are lost.

Point different databases, tables, or partitions to the same path, for example, the default warehouse path /user/hive/warehouse.

Creating objects this way may mix up their data. After one database, table, or partition is deleted, the data of the other objects that share the path will be lost.

▲▲▲▲▲

Do not perform this operation.

Check whether files in the target path are lost.
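The workaround for insert overwrite local directory in Table 11 (write only to a path that contains no files) can be enforced with a pre-check on the node where the data will be written. This is a minimal sketch with a hypothetical function name; it only inspects the local file system and does not run any Hive statement.

```python
from pathlib import Path


def safe_for_overwrite(target):
    """Return True if target does not exist or is an empty directory."""
    path = Path(target)
    if not path.exists():
        return True
    # An existing file, or a directory with any entry, is not safe.
    return path.is_dir() and not any(path.iterdir())


target = "/opt/dir"
if safe_for_overwrite(target):
    print("Target is empty or absent; overwrite would not remove existing files.")
else:
    print("Target contains data; choose another path or omit overwrite.")
```

Running the check against a system directory returns False, which is the case the table warns about.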

IoTDB High-Risk Operations

Table 12 IoTDB high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Delete data directories.

This operation may cause service information loss.

▲▲▲

Do not delete data directories manually.

Check whether data directories are normal.

Modify data directory content (file and folder creation).

This operation may cause the IoTDB instance on the node to become faulty.

▲▲▲

Do not create or modify files or folders in the data directories manually.

Check whether data directories are normal.

Start or stop basic components independently.

This operation adversely affects the basic functions of some services, causing service failures.

▲▲▲

Do not start or stop the basic components Kerberos and LDAP independently. When performing this operation, select the related services.

Check whether the service status is normal.

Restart or stop services.

This operation may interrupt services.

▲▲

Restart or stop services when necessary.

Check whether the service is running properly.

Kafka High-Risk Operations

Table 13 Kafka high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Delete a topic.

This operation may delete existing topics and data.

▲▲▲

Use Kerberos authentication to ensure that only authenticated users have operation permissions. Verify that the topic name is correct before deletion.

Check whether topics are processed properly.

Delete data directories.

This operation may cause service information loss.

▲▲▲

Do not delete data directories manually.

Check whether data directories are normal.

Modify data directory content (file and folder creation).

This operation may cause the Broker instance on the node to become faulty.

▲▲▲

Do not create or modify files or folders in the data directories manually.

Check whether data directories are normal.

Modify the disk auto-adaptation function using the disk.adapter.enable parameter.

This operation adjusts the topic data retention period when the disk usage reaches the threshold. Historical data that falls outside the retention period may be deleted.

▲▲▲

If the retention period of some topics must not be adjusted, add these topics to the value of disk.adapter.topic.blacklist.

Observe the data storage period on the Kafka topic monitoring page.

Modify the data directory configuration log.dirs.

Incorrect operation may cause process faults.

▲▲▲

Ensure that the added or modified data directories are empty and that the directory permissions are correct.

Check whether data directories are normal.

Reduce the capacity of the Kafka cluster.

This operation may reduce the number of replicas of some topic data. As a result, some topics cannot be accessed.

▲▲

Back up data before reducing the capacity of the Kafka cluster.

Check whether the nodes that host the partition replicas are active to ensure data security.

Start or stop basic components independently.

This operation adversely affects the basic functions of some services, causing service failures.

▲▲▲

Do not start or stop the basic components ZooKeeper, Kerberos, and LDAP independently. When performing this operation, select the related services.

Check whether the service status is normal.

Restart or stop services.

This operation may interrupt services.

▲▲

Restart or stop services when necessary.

Check whether the service is running properly.

Modify configuration parameters.

This operation requires service restart for configuration to take effect.

▲▲

Modify configuration when necessary.

Check whether the service is running properly.

Delete or modify metadata.

Modifying or deleting Kafka metadata on ZooKeeper may make Kafka topics or the Kafka service unavailable.

▲▲▲

Do not delete or modify Kafka metadata stored on ZooKeeper.

Check whether the Kafka topics or Kafka service is available.

Delete metadata backup files.

After Kafka metadata backup files are modified and used to restore Kafka metadata, Kafka topics or the Kafka service may be unavailable.

▲▲▲

Do not delete Kafka metadata backup files.

Check whether the Kafka topics or Kafka service is available.
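The scale-in risk in Table 13 (fewer replicas for some topic data, and topics that become inaccessible) can be estimated from the current replica assignment. The sketch below is an illustration, not a Kafka API: the assignment dictionary, partition names, broker IDs, and function name are hypothetical, and the assignment is assumed to have been exported beforehand.

```python
def replicas_after_scale_in(assignment, brokers_to_remove):
    """Return {partition: remaining brokers} for partitions that lose a replica.

    assignment maps a partition name to the broker IDs holding its replicas.
    A partition mapped to an empty list would have no replica left.
    """
    removing = set(brokers_to_remove)
    affected = {}
    for partition, replicas in assignment.items():
        remaining = [b for b in replicas if b not in removing]
        if len(remaining) < len(replicas):
            affected[partition] = remaining
    return affected


assignment = {
    "orders-0": [1, 2],
    "orders-1": [2, 3],
    "logs-0": [3],
}
print(replicas_after_scale_in(assignment, [3]))
# {'orders-1': [2], 'logs-0': []}
```

In this example, removing broker 3 leaves orders-1 with a single replica and logs-0 with none, so logs-0 needs attention before the scale-in.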

KrbServer High-Risk Operations

Table 14 KrbServer high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify the KADMIN_PORT parameter of KrbServer.

After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

▲▲▲▲▲

After this parameter is modified, restart the KrbServer service and all its associated services.

None

Modify the kdc_ports parameter of KrbServer.

After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

▲▲▲▲▲

After this parameter is modified, restart the KrbServer service and all its associated services.

None

Modify the KPASSWD_PORT parameter of KrbServer.

After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

▲▲▲▲▲

After this parameter is modified, restart the KrbServer service and all its associated services.

None

Modify the domain name of the Manager system.

After the domain name is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

▲▲▲▲▲

After the domain name is modified, restart the KrbServer service and all its associated services.

None

Configure cross-cluster mutual trust relationships.

This operation will restart the KrbServer service and all associated services, affecting the management and maintenance and services of the cluster.

▲▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

LdapServer High-Risk Operations

Table 15 LdapServer high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify the LDAP_SERVER_PORT parameter of LdapServer.

After this parameter is modified, if the LdapServer service and its associated services are not restarted in a timely manner, the configuration of LdapClient in the cluster is abnormal and the service running is affected.

▲▲▲▲▲

After this parameter is modified, restart the LdapServer service and all its associated services.

None

Restore LdapServer data.

This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster.

▲▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Replace the node where LdapServer is located.

This operation will interrupt services deployed on the node. If the node is a management node, the operation will restart all OMS processes, affecting the cluster management and maintenance.

▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal.

Change the password of LdapServer.

The LdapServer and Kerberos services need to be restarted during the password change, affecting the management, maintenance, and services of the cluster.

▲▲▲▲

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

None

Restart the node where LdapServer is located.

Restarting the node without stopping the LdapServer service may cause LdapServer data damage.

▲▲▲▲▲

Restore LdapServer using LdapServer backup data.

None

Loader High-Risk Operations

Table 16 Loader high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Change the floating IP address of a Loader instance (loader.float.ip).

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether the Loader UI can be connected properly.

Modify the Loader instance start parameter LOADER_GC_OPTS.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Clear table contents when adding data to HBase.

This operation will clear original data in the target table.

▲▲

Ensure that the contents in the target table can be cleared before the operation.

Check whether the contents in the target table can be cleared before the operation.

Spark2x High-Risk Operations

Spark high-risk operations apply to versions earlier than MRS 3.x.

Table 17 Spark2x high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify the configuration item spark.yarn.queue.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Modify the configuration item spark.driver.extraJavaOptions.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Modify the configuration item spark.yarn.cluster.driver.extraJavaOptions.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Modify the configuration item spark.eventLog.dir.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Modify the configuration item SPARK_DAEMON_JAVA_OPTS.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Delete all JobHistory2x instances.

The event logs of historical applications are lost.

▲▲

Reserve at least one JobHistory2x instance.

Check whether historical application information is included in JobHistory2x.

Delete or modify the /user/spark2x/jars/8.1.0.1/spark-archive-2x.zip file in HDFS.

JDBCServer2x fails to start, and service functions become abnormal.

▲▲▲

Delete /user/spark2x/jars/8.1.0.1/spark-archive-2x.zip, and wait 10 to 15 minutes for the .zip package to be restored automatically.

Check whether services can be started properly.
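Several rows above concern JVM option strings (spark.driver.extraJavaOptions, SPARK_DAEMON_JAVA_OPTS, and similar GC_OPTS parameters of other components). One common reason a JVM refuses to start is an initial heap size (-Xms) larger than the maximum heap size (-Xmx). The following minimal sketch checks only that one condition. The function name heap_opts_consistent is hypothetical, and the sketch assumes that when a flag is repeated the last occurrence takes effect. It does not validate any other option.

```python
import re

# Multipliers for the size suffixes accepted by -Xms/-Xmx.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def _heap_bytes(opts: str, flag: str):
    """Return the value of the last -Xms/-Xmx occurrence in bytes, or None if absent."""
    matches = re.findall(rf"(?:^|\s)-{flag}(\d+)([kKmMgG]?)(?=\s|$)", opts)
    if not matches:
        return None
    number, unit = matches[-1]  # assumption: the last occurrence wins
    return int(number) * _UNITS[unit.lower()]


def heap_opts_consistent(opts: str) -> bool:
    """Return False only when both sizes are set and -Xms exceeds -Xmx."""
    xms = _heap_bytes(opts, "Xms")
    xmx = _heap_bytes(opts, "Xmx")
    if xms is None or xmx is None:
        return True  # nothing to compare; JVM defaults apply
    return xms <= xmx


if __name__ == "__main__":
    print(heap_opts_consistent("-Xms1G -Xmx4G -XX:+UseG1GC"))
    print(heap_opts_consistent("-Xms2G -Xmx512M"))
```

Passing this check does not guarantee a successful start. Still verify that services can be started properly after the configuration is saved.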

Storm High-Risk Operations

Table 18 Storm high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Modify the following plug-in related configuration items:

  • storm.scheduler
  • nimbus.authorizer
  • storm.thrift.transport
  • nimbus.blobstore.class
  • nimbus.topology.validator
  • storm.principal.tolocal

Services cannot start properly.

▲▲▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that the class names exist and are valid.

Check whether services can be started properly.

Modify the Storm instance GC_OPTS startup parameters, including:

  • NIMBUS_GC_OPTS
  • SUPERVISOR_GC_OPTS
  • UI_GC_OPTS
  • LOGVIEWER_GC_OPTS

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Modify the user resource pool configuration parameter resource.aware.scheduler.user.pools.

Services cannot run properly.

▲▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that resources allocated to each user are appropriate and valid.

Check whether services can be started and run properly.

Change data directories.

If this operation is not properly performed, services may be abnormal and unavailable.

▲▲▲▲

Do not manually change data directories.

Check whether data directories are normal.

Restart services or instances.

The service will be interrupted for a short period of time, and ongoing operations will be interrupted.

▲▲▲

Restart services or instances when necessary.

Check whether the service is running properly and whether interrupted operations are restored.

Synchronize configurations (by restarting the required service).

The service will be restarted, resulting in temporary service interruption. If Supervisor is restarted, ongoing operations will be interrupted for a short period of time.

▲▲▲

Modify configuration when necessary.

Check whether the service is running properly and whether interrupted operations are restored.

Stop services or instances.

The service will be stopped, and related operations will be interrupted.

▲▲▲

Stop services when necessary.

Check whether the services are properly stopped.

Delete or modify metadata.

If Nimbus metadata is deleted, services are abnormal and ongoing operations are lost.

▲▲▲▲▲

Do not manually delete Nimbus metadata files.

Check whether Nimbus metadata files are normal.

Modify file permissions.

If permissions on the metadata and log directories are incorrectly modified, service exceptions may occur.

▲▲▲▲

Do not manually modify file permissions.

Check whether the permissions on the data and log directories are correct.

Delete topologies.

Topologies in use will be deleted.

▲▲▲▲

Delete topologies when necessary.

Check whether the topologies are successfully deleted.
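For the plug-in configuration items in Table 18 (storm.scheduler, nimbus.authorizer, and so on), the workaround asks that class names exist and are valid. Whether a class exists can only be confirmed against the Storm classpath, but a malformed name can be caught earlier. The following minimal sketch performs that syntactic check only. The helper name looks_like_class_name is hypothetical, and the sketch assumes that plug-in classes are always package-qualified.

```python
import re

# One Java identifier: a letter, underscore, or dollar sign, then letters, digits, _ or $.
_JAVA_IDENT = r"[A-Za-z_$][A-Za-z0-9_$]*"
# Assumption: plug-in classes are package-qualified, so at least one dot is required.
_FQCN = re.compile(rf"{_JAVA_IDENT}(\.{_JAVA_IDENT})+")


def looks_like_class_name(value: str) -> bool:
    """Return True if value is shaped like a fully qualified Java class name.

    This does NOT prove that the class is present on the Storm classpath.
    """
    return _FQCN.fullmatch(value) is not None


if __name__ == "__main__":
    print(looks_like_class_name(
        "org.apache.storm.scheduler.resource.ResourceAwareScheduler"))
    print(looks_like_class_name("org.apache.storm..Scheduler"))
```

A name that passes can still fail at startup if the class is missing, so the Check Item (whether services can be started properly) still applies.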

Yarn High-Risk Operations

Table 19 Yarn high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Delete or change the data directories (yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs).

This operation may cause service information loss.

▲▲▲

Do not delete data directories manually.

Check whether data directories are normal.
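The check item "Check whether data directories are normal" can start with confirming that every configured directory still exists. The following minimal sketch takes the comma-separated value of yarn.nodemanager.local-dirs or yarn.nodemanager.log-dirs and reports entries that are not existing directories. The function name missing_dirs is hypothetical, the script is assumed to run on the NodeManager node itself, and it does not check ownership, permissions, or free space.

```python
import os


def missing_dirs(config_value: str) -> list:
    """Return entries of a comma-separated directory list that are not existing directories."""
    # Split the configured value and drop empty entries and surrounding spaces.
    paths = [p.strip() for p in config_value.split(",") if p.strip()]
    return [p for p in paths if not os.path.isdir(p)]


if __name__ == "__main__":
    # Hypothetical example value; use the value configured in your cluster.
    example = "/srv/data1/nm/localdir,/srv/data2/nm/localdir"
    for path in missing_dirs(example):
        print("missing:", path)
```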

ZooKeeper High-Risk Operations

Table 20 ZooKeeper high-risk operations

Operation

Risk

Severity

Workaround

Check Item

Delete or change ZooKeeper data directories.

This operation may cause service information loss.

▲▲▲

Follow the capacity expansion guide to change the ZooKeeper data directories.

Check whether services and associated components are started properly.

Modify the ZooKeeper instance start parameter GC_OPTS.

Services cannot start properly.

▲▲

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Check whether services can be started properly.

Modify the znode ACL information in ZooKeeper.

If the permissions on a znode are modified, other users may lose access to the znode, and some system functions may become abnormal.

▲▲▲▲

When modifying ACL information, strictly follow the ZooKeeper Configuration Guide and ensure that other components can still use ZooKeeper properly after the modification.

Check that other components that depend on ZooKeeper can properly start and provide services.
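Before changing a znode ACL, it can help to compare the current ACL with the planned one and list every identity that would lose a permission. The following minimal sketch does this for ACL strings in the scheme:id:perms form used by the ZooKeeper command-line client, with entries separated by commas. The function names and the sasl principals in the example are hypothetical. The sketch only compares strings and does not connect to ZooKeeper.

```python
def parse_acl(acl: str) -> dict:
    """Parse 'scheme:id:perms,scheme:id:perms' into {(scheme, id): set_of_perm_letters}."""
    result = {}
    for entry in acl.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # The id may itself contain ':' (for example digest ids), so split from both ends.
        scheme, rest = entry.split(":", 1)
        ident, perms = rest.rsplit(":", 1)
        result[(scheme, ident)] = set(perms)
    return result


def lost_permissions(before: str, after: str) -> dict:
    """Return {(scheme, id): removed_perms} for identities that lose any permission."""
    old, new = parse_acl(before), parse_acl(after)
    lost = {}
    for who, perms in old.items():
        removed = perms - new.get(who, set())
        if removed:
            lost[who] = removed
    return lost


if __name__ == "__main__":
    current = "world:anyone:r,sasl:hbase:cdrwa"  # hypothetical current ACL
    planned = "sasl:hbase:cdrwa"                 # hypothetical new ACL
    for who, perms in lost_permissions(current, planned).items():
        print(who, "would lose", "".join(sorted(perms)))
```

An empty result means no existing identity loses permissions. It does not replace the check that components depending on ZooKeeper can still start and provide services.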