Updated on 2025-01-22 GMT+08:00

Constraints

Before using MRS, ensure that you have read and understood the following restrictions.

Cluster Creation Constraints

Table 1 Constraints on creating an MRS cluster

Constraint

Description

Network requirement

  • MRS clusters must be created in VPC subnets.
  • When you create an MRS cluster, you can select Auto create from Security Group to create a security group or select an existing security group.
  • To prevent unauthorized access, grant access permissions for the security groups used by MRS only where necessary.

Browser

You are advised to use a recommended browser to log in to the MRS management page.
  • Google Chrome: 36.0 or later
  • Microsoft Edge: updated with the Windows operating system

Data storage

  • The cluster nodes store only users' service data. Non-service data can be stored in OBS or on other ECS nodes.
  • Cluster nodes run only MRS cluster programs. Deploy other client applications or user service programs on separate ECS nodes.
  • Plan disk resources for each cluster node according to service demands. For services that require significant data storage, allocate additional EVS disks or expand storage capacity to prevent storage shortages from impacting node performance.
  • The capacity (including storage and computing capabilities) of an MRS cluster can be expanded by adding core or task nodes.

Password requirement

Keep the initial password for logging in to the Master node secure, because MRS does not store it. Use a complex password to prevent malicious attacks.

Technical support

  • If a cluster exception occurs when no incorrect operations have been performed, contact technical support engineers. They will ask you for your password and then perform troubleshooting.
  • MRS clusters are still charged during exceptions. Contact technical support engineers to handle cluster exceptions.

MRS Cluster Running Constraints

Table 2 MRS cluster running constraints

Item

Description

Node management

  • If a Master node in an MRS cluster is shut down and the cluster is still used to execute jobs or modify component configurations, you must start the stopped Master node before stopping other nodes. Otherwise, data may be lost due to an active/standby switchover.
  • If all nodes in an MRS cluster have been stopped, start them in the reverse order of node shutdown.

Resource scheduling

While the MRS cluster is running, you can switch the scheduler between Capacity and Superior, but configuration synchronization is not guaranteed. If necessary, reconfigure synchronization based on the new scheduler.

Storage-compute decoupling

To delete a component or cluster connected to OBS (including storage-compute decoupling and cold-hot data separation scenarios), you must also delete the service data on OBS.

Components

  • Do not enable the MOB feature for the HBase service in the MRS cluster. Using this feature may lead to table data read failure and JVM crash.

    For existing HBase tables, run the following command in hbase shell to check whether the table description contains the keyword MOB. If it does, contact O&M engineers to convert the table to a non-MOB table.

    desc 'Table name'

    For example, if the value of IS_MOB is true in the following command output, the HBase MOB feature is enabled:

    hbase:009:0> desc 't3'
    t3
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'd', MOB_THRESHOLD => '102400', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', IS_MOB => 'true', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
  • Spark and other engines in the MRS cluster must not execute SQL statements that read from and overwrite the same table.

    For example, the following SQL statements overwrite a table with data read from that same table:

    insert overwrite table test select * from test;
    insert overwrite table test select a.* from test a join test2 b on a.name=b.name;

Forbidden and High-Risk Operations on Clusters and Components

To avoid service disruptions due to unstable clusters, do not perform the following actions during the use of MRS clusters and components.

Table 3 Forbidden operations on MRS clusters

Operation

Risk

  • Shutting down, restarting, or deleting MRS cluster nodes on the ECS console, changing or reinstalling their OS, or modifying their specifications.
  • Deleting the existing processes, applications, and files from cluster nodes.
  • Deleting MRS cluster nodes. Clusters with deleted nodes will still be charged.

The cluster will function abnormally, affecting its usability.

After the MRS cluster is created, do not delete or modify the security group that the cluster uses.

A cluster exception may occur.

Deleting ZooKeeper data directories.

Metadata that components such as ClickHouse, HDFS, Yarn, HBase, and Hive store in ZooKeeper is lost, and upper-layer components may experience performance issues or errors.

Switching between active and standby JDBCServer nodes frequently.

Services may be interrupted.

Deleting Phoenix system tables or table data (SYSTEM.CATALOG, SYSTEM.STATS, SYSTEM.SEQUENCE, and SYSTEM.FUNCTION).

Service operations will fail.

Modifying data in the Hive metabase (hivemeta database).

Hive data parse errors may occur. As a result, Hive cannot provide services.

Performing INSERT or UPDATE operations on Hive metadata tables.

Modifying Hive metadata may cause Hive data parsing errors and affect Hive services.

Changing permissions on the Hive private file directory hdfs:///tmp/hive-scratch.

Hive services may be unavailable.

Modifying broker.id in the Kafka configuration file.

Data on the node may become invalid.

Modifying the host names of nodes.

Instances and upper-layer components on the host cannot provide services properly. The fault cannot be rectified.

Using private images.

The cluster will function abnormally, affecting its usability.

Table 4 lists high-risk operations during maintenance of each component in the MRS cluster.

Table 4 High-risk operations on MRS clusters

Operation Type

Operation

Risk

Preventive Measure

Cluster operations

Modify the file directory or file permissions of user omm without authorization.

This operation will lead to MRS service unavailability.

None

Bind an EIP.

This operation exposes the Master node hosting MRS Manager of the cluster to the public network, increasing the risk of network attacks from the Internet.

Ensure that the bound EIP is a trusted public IP address.

Enable security group rules for port 22 of a cluster.

This operation increases the risk of exploiting vulnerability of port 22.

Configure a security group rule for port 22 to allow only trusted IP addresses to access the port. You are advised not to configure an inbound rule that allows 0.0.0.0/0 to access the port.

Delete a cluster or cluster data.

This operation will cause data loss.

Before deleting the data, confirm the necessity of the operation and ensure that the data has been backed up.

Scale in a cluster.

This operation will cause data loss.

Before scaling in the cluster, confirm the necessity of the operation and ensure that the data has been backed up.

Detach or format a data disk.

This operation will cause data loss.

Before performing this operation, confirm the necessity of the operation and ensure that the data has been backed up.

Manager operations

Change the OMS password.

This operation will restart all processes of OMS, which has adverse impact on cluster maintenance and management.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Import the certificate.

This operation will restart OMS processes and the entire cluster, which has adverse impact on cluster maintenance and management and services.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Restore the OMS.

This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Change log levels.

If the log level is changed to DEBUG, Manager responds slowly.

Before the modification, confirm the necessity of the operation and change it back to the default log level in time.

Restart both upper-layer and lower-layer services.

This operation will interrupt the upper-layer service, affecting the management, maintenance, and services of the cluster.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Modify the OLDAP port.

This operation will restart the LdapServer and Kerberos services and all associated services, affecting service running.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Delete the supergroup user group bound to a user.

This operation decreases user rights, affecting service access.

Before the change, confirm which rights need to be added. Ensure that the required rights have been granted before removing the supergroup rights bound to the user, so that service continuity is maintained.

Restart or stop services.

Services will be interrupted during the restart. If you also restart the upper-layer services, the services that depend on this service will be interrupted as well.

Confirm the necessity of restarting the system before the operation.

Change the default SSH port.

Changing the default port (22) leads to incorrect health check results, including the inspection of node mutual trust and omm/ommdba user password expiration.

Before running health checks, restore the SSH port to the default value.

ClickHouse

Delete the ClickHouse data directory.

This operation may cause service information loss.

Do not delete data directories.

Remove ClickHouseServer instances.

This operation leads to incorrect topology information for the logical cluster. To avoid this, all ClickHouseServer instance nodes within the same shard must be scaled in and decommissioned at the same time.

Before performing this operation, check the database and data table information of each node in the logical cluster and perform scale-in pre-analysis to ensure that data can be migrated during scale-in and decommissioning, preventing data loss.

Before scale-in, collect information in advance to learn the status of the ClickHouse logical cluster and instance nodes. Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume.

Add ClickHouseServer instances.

Before performing this operation, verify that a database or table with the matching name exists on the new node; otherwise, subsequent operations such as data migration, balancing, scale-in, and decommissioning will fail.

Before scale-out, confirm the function and purpose of new ClickHouseServer instances and determine whether to create related databases and data tables.

Decommission ClickHouseServer instances.

This operation leads to incorrect topology information for the logical cluster. To avoid this, all ClickHouseServer instance nodes within the same shard must be decommissioned at the same time.

Before performing this operation, check the database and data table information of each node in the logical cluster and perform pre-analysis to ensure that data can be migrated during decommissioning, preventing data loss.

Before decommissioning, collect information in advance to learn the status of the ClickHouse logical cluster and instance nodes.

Recommission ClickHouseServer instances.

When performing this operation, you must select all nodes in the original shard. Otherwise, the topology information of the logical cluster is incorrect.

Before recommissioning, confirm the shard to which each node to be recommissioned belongs.

Modify data directory content (file and folder creation).

This operation may cause ClickHouse instance faults on the node.

Do not create or modify files or folders in the data directories.

Start or stop basic components independently.

This operation has adverse impact on the basic functions of some services. As a result, service failures occur.

Do not start or stop ZooKeeper, Kerberos, and LDAP basic components independently. Select related services when performing this operation.

DBService

Change the DBService password.

The services need to be restarted for the password change to take effect. The services are unavailable during the restart.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Restore DBService data.

After the data is restored, the data generated after the data backup and before the data restoration is lost.

After the data is restored, the configurations of the components that depend on DBService may expire and these components need to be restarted.

Perform the operation only when necessary, and ensure that no other maintenance and management operations are performed at the same time.

Perform active/standby DBService switchover.

During the active/standby switchover, DBService is unavailable.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Flink

Change the Flink log level.

If the log level is modified to DEBUG, the task running performance is affected.

Before the modification, confirm the necessity of the operation and change it back to the default log level in time.

Modify Flink file permissions.

This operation may cause task execution failures.

Before the modification, confirm that the modification is needed.

Flume

Modify the Flume instance start parameter GC_OPTS.

Flume cannot start.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

HBase

Modify encryption configuration.

  • hbase.regionserver.wal.encryption
  • hbase.crypto.keyprovider.parameters.uri
  • hbase.crypto.keyprovider.parameters.encryptedtext

The HBase service fails to be started.

Strictly follow the prompt information when modifying these configuration items, which are interdependent. Ensure that new values are valid.

Disable or switch the encryption algorithm when encryption is enabled.

The encryption function will be disabled by setting hbase.regionserver.wal.encryption to false. The algorithm can be switched to AES or SMS4.

These operations may cause service startup failure and data loss.

When HFile and WAL are encrypted using an encryption algorithm and encrypted tables have been created, do not disable encryption or switch the encryption algorithm.

If no encrypted table (ENCRYPTION=>AES/SMS4) has been created, only the encryption algorithm can be switched.

Modify HBase instance start parameter GC_OPTS and HBASE_HEAPSIZE.

The HBase service fails to be started.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid and that GC_OPTS does not conflict with HBASE_HEAPSIZE.

Use OfflineMetaRepair tool.

The HBase service fails to be started.

This tool can be used only when HBase is offline and cannot be used in data migration scenarios.

HDFS

Change the HDFS NameNode data storage directory dfs.namenode.name.dir or the DataNode data storage directory dfs.datanode.data.dir.

The HDFS service fails to be started.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Use the -delete parameter when you run the hadoop distcp command.

During DistCP copying, files that do not exist in the source cluster but exist in the destination cluster are deleted from the destination cluster.

When using DistCP, determine whether to retain the redundant files in the destination cluster. Exercise caution when using the -delete parameter.

After DistCP copying is complete, check whether the data in the destination cluster is retained or deleted according to the parameter settings.

Modify the HDFS instance start parameter GC_OPTS, HADOOP_HEAPSIZE, and GC_PROFILE.

The HDFS service fails to be started.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid and that GC_OPTS does not conflict with HADOOP_HEAPSIZE.

Change the value of dfs.replication (number of HDFS replicas) from 3 to 1.

  • The storage reliability deteriorates. If the disk becomes faulty, data will be lost.
  • NameNode fails to be restarted, and the HDFS service is unavailable.

When modifying related configuration items, check the parameter description carefully.

Ensure that there are more than two replicas for data storage.

Change the encryption mode (hadoop.rpc.protection) of the remote procedure call (RPC) channel in each module of Hadoop.

The HDFS service is faulty and abnormal.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Hive

Modify the Hive instance start parameter GC_OPTS.

This operation may cause Hive instance start failures.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Delete all MetaStore instances.

Hive metadata is lost, and Hive cannot provide services.

Do not perform this operation unless you are sure that the Hive table information can be discarded.

Delete or modify files corresponding to Hive tables over HDFS interfaces or HBase interfaces.

This operation may cause Hive service data loss or tampering.

Do not perform this operation unless you are sure that the data can be discarded or that the operation meets service requirements.

Delete or modify files corresponding to Hive tables or directory access permission over HDFS interfaces or HBase interfaces.

This operation may make related services unavailable.

Do not perform this operation.

Delete or modify hdfs:///apps/templeton/hive-3.1.0.tar.gz using HDFS interfaces.

This operation prevents WebHCat from providing services properly.

Do not perform this operation.

Export table data to overwrite data in a local directory. For example, export the data of table t1 to /opt/dir:

insert overwrite local directory '/opt/dir' select * from t1;

This operation first deletes the target directory. An incorrect path setting may cause software or OS startup failures.

Ensure that the path to which the data is written contains no files, or do not use the keyword OVERWRITE in the command.

Direct different databases, tables, or partitions to the same path, for example, the default warehouse path /user/hive/warehouse.

This operation may cause data disorder. After one of the databases, tables, or partitions is deleted, the data of the other objects will be lost.

Do not perform this operation.

IoTDB

Delete data directories.

This operation may cause service information loss.

Do not delete data directories.

Modify data directory content (file and folder creation).

This operation may cause IoTDB instance faults on the node.

Do not create or modify files or folders in the data directories.

Start or stop basic components independently.

This operation has adverse impact on the basic functions of some services. As a result, service failures occur.

Do not start or stop basic components such as Kerberos and LDAP independently. Select related services when performing this operation.

Kafka

Delete topics

This operation will delete the existing schema and data.

Use Kerberos authentication to ensure that authenticated users have operation permissions. Ensure that topic names are correct.

Delete the Kafka data directory.

This operation may cause service information loss.

Do not delete data directories.

Modify data directory content (file and folder creation).

This operation may cause faults on the Broker instance of the node.

Do not create or modify files or folders in the data directories.

Modify the disk auto-adaptation function using the disk.adapter.enable parameter.

This operation adjusts the topic data retention period when the disk usage reaches the threshold. Historical data outside the retention period may be deleted.

For topics whose data retention period must not be adjusted, specify them in the disk.adapter.topic.blacklist parameter, and observe the data storage period on the Kafka topic monitoring page.

Modify data directory log.dirs configuration.

If the configuration is incorrect, the process may be faulty.

Ensure that the data directory is empty and that the required permissions have been granted.

Reduce the capacity of the Kafka cluster.

This operation will reduce the number of data replicas of some topics and may cause topic access failures.

Transfer data replicas before capacity reduction.

Start or stop basic components independently.

This operation has adverse impact on the basic functions of some services and may cause service failure.

Do not start or stop ZooKeeper, Kerberos, and LDAP basic components independently. Select related services when performing this operation.

Delete or modify metadata.

Modifying or deleting Kafka metadata on ZooKeeper may cause topic or Kafka service unavailability.

Do not delete or modify Kafka metadata stored on ZooKeeper.

Delete metadata backup files.

After Kafka metadata backup files are modified and used to restore Kafka metadata, Kafka topics or the Kafka service may be unavailable.

Do not delete Kafka metadata backup files.

KrbServer

Modify the KADMIN_PORT parameter of KrbServer.

After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

After this parameter is modified, restart the KrbServer service and all its associated services.

Modify the kdc_ports parameter of KrbServer.

After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

After this parameter is modified, restart the KrbServer service and all its associated services.

Modify the KPASSWD_PORT parameter of KrbServer.

After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

After this parameter is modified, restart the KrbServer service and all its associated services.

Modify the domain name of Manager.

After the domain name is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected.

After the domain name is modified, restart the KrbServer service and all its associated services.

Configure cross-cluster mutual trust relationships.

This operation will restart the KrbServer service and all associated services, affecting the management and maintenance and services of the cluster.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

LdapServer

Modify the LDAP_SERVER_PORT parameter of LdapServer.

After this parameter is modified, if the LdapServer service and its associated services are not restarted in a timely manner, the configuration of LdapClient in the cluster is abnormal and the service running is affected.

After this parameter is modified, restart the LdapServer service and all its associated services.

Restore LdapServer data.

This operation will restart the Manager and the entire cluster, which has adverse impact on cluster maintenance and management and services.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Replace the node where LdapServer is located.

This operation will interrupt services deployed on the node. If the node is a management node, the operation will restart all OMS processes, affecting cluster management and maintenance.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Change the password of LdapServer.

The LdapServer and Kerberos services need to be restarted during the password change, affecting the management, maintenance, and services of the cluster.

Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time.

Restart the node where LdapServer is located.

Restarting the node without stopping the LdapServer service may cause LdapServer data damage.

If LdapServer data is damaged, restore the LdapServer node using the backup data.

Loader

Change the floating IP address of a Loader instance.

This operation causes Loader start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the Loader instance start parameter LOADER_GC_OPTS.

This operation causes Loader start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Clear a table when adding data to HBase.

This operation will clear the original data in the target table.

Ensure that the target table can be cleared before the operation.

Spark

Modify the spark.yarn.queue configuration item.

This operation causes Spark start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the spark.driver.extraJavaOptions configuration item.

This operation causes Spark start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the spark.yarn.cluster.driver.extraJavaOptions configuration item.

This operation causes Spark start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the spark.eventLog.dir configuration item.

This operation causes Spark start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the SPARK_DAEMON_JAVA_OPTS configuration item.

This operation causes Spark start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Delete all JobHistory instances.

The event logs of historical applications are lost.

Retain at least one JobHistory instance and check whether historical application information can be queried in JobHistory.

Delete the spark-archive file from HDFS or modify the file.

JDBCServer fails to be started and service functions are abnormal.

After /user/spark2x/jars/XXX/spark-archive-2x.zip or /user/spark/jars/XXX/spark-archive.zip is deleted, wait for 10 to 15 minutes. The .zip package is automatically restored.

Storm

Modify plug-in related configuration items, including the following:

  • storm.scheduler
  • nimbus.authorizer
  • storm.thrift.transport
  • nimbus.blobstore.class
  • nimbus.topology.validator
  • storm.principal.tolocal

This operation causes Storm start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that the class names exist and are valid.

Modify Storm instance GC_OPTS startup parameters, including:

  • NIMBUS_GC_OPTS
  • SUPERVISOR_GC_OPTS
  • UI_GC_OPTS
  • LOGVIEWER_GC_OPTS

This operation causes Storm start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the user resource pool configuration parameter resource.aware.scheduler.user.pools.

This operation causes Storm execution failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that resources allocated to each user are appropriate and valid.

Modify the Storm data directory.

This operation causes services to become abnormal and unavailable.

Do not manually change data directories.

Delete or modify Storm metadata.

Deleting Nimbus metadata will cause service exceptions and loss of running services.

Do not manually delete Nimbus metadata files.

Modify Storm file permission.

If permissions on the metadata and log directories are incorrectly modified, service exceptions may occur.

Do not modify file permissions.

Delete a Storm topology.

Topologies in use will be deleted.

Delete topologies only when necessary.

Yarn

Delete or modify data directories yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.

This operation causes service information loss.

Do not delete data directories.

ZooKeeper

Delete or change ZooKeeper data directories.

This operation causes service information loss.

Strictly follow the capacity expansion guide to change the ZooKeeper data directories.

Modify the ZooKeeper instance start parameter GC_OPTS.

This operation causes ZooKeeper start failure.

Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.

Modify the znode ACL information in ZooKeeper.

If znode permissions are modified in ZooKeeper, other users may lose access to the znode, and some system functions may become abnormal.

Ensure that the modified ACL information does not affect the normal use of ZooKeeper by other components.