Restarting an MRS Cluster Component

Updated on 2024-09-23 GMT+08:00

To apply configuration changes to a big data component, you must restart it. The common restart mode, however, restarts all services or instances at once, which can cause service interruption.

To keep services available during a restart, you can instead restart services or instances in batches by performing a rolling restart. For instances in active/standby mode, the standby instance is restarted first, followed by the active instance.

A rolling restart takes longer than a common restart and may affect service throughput and performance.

For details about whether services and instances in the current MRS cluster support rolling restart and the rolling restart parameters, see Component Restart Reference Information.
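
As a loose illustration of the ordering described above, the Python sketch below restarts standby instances before active ones and waits for each instance to recover before moving on. It is conceptual only: the Instance type and the restart_instance and wait_until_healthy helpers are hypothetical stand-ins, not an MRS API; in practice MRS Manager performs these steps for you.

```python
# Minimal sketch of the standby-first ordering used in a rolling restart.
# Instance, restart_instance, and wait_until_healthy are hypothetical
# placeholders; MRS Manager performs these steps internally.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Instance:
    name: str
    role: str  # "active" or "standby"

def rolling_restart_pairs(instances: Iterable[Instance],
                          restart_instance: Callable[[Instance], None],
                          wait_until_healthy: Callable[[Instance], None]) -> None:
    """Restart standby instances first, then active ones, one at a time."""
    ordered = sorted(instances, key=lambda i: 0 if i.role == "standby" else 1)
    for inst in ordered:
        restart_instance(inst)    # trigger the restart of this single instance
        wait_until_healthy(inst)  # block until it rejoins before the next one
```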

Restrictions

  • Perform a rolling restart during off-peak hours; otherwise, the restart may fail:
    • If the Kafka service throughput is high (over 100 MB/s) during a rolling restart, the restart will fail.
    • If each HBase RegionServer receives more than 10,000 requests per second on the native interface, heavy load can cause RegionServer restart failures. Check the current request rate before restarting and increase the number of handles to prevent overload.
  • If the cluster has fewer than six Core nodes, services may be affected for a short period of time.
  • Preferentially perform a rolling instance or service restart and select Only restart instances whose configurations have expired.

Prerequisites

  • IAM users have been synchronized in advance. To do so, click Synchronize next to IAM User Sync on the Dashboard tab of the cluster details page.
  • You have logged in to MRS Manager. For how to log in, see Accessing MRS Manager.

Restarting Cluster Components

  1. Access the MRS cluster component management page.

    • Log in to the MRS console and click the cluster name to go to the cluster details page. Click Components.
    • If you are using the Manager of MRS 3.x and later versions, log in to Manager and choose Cluster > Services.
    • If you are using the Manager of MRS 2.x and earlier versions, log in to Manager and click Services.

  2. Click the name of the target component to go to the details page.
  3. On the service details page, expand the More drop-down list and select Restart Service or Service Rolling Restart.
  4. Enter the user password (required when you perform operations on Manager), confirm the operation impact, and click OK to start the restart.

    If you select rolling restart, set the parameters listed in Table 1. (Required parameters may vary by version; set them based on the actual GUI.)

    Figure 1 Performing a rolling restart on Manager
    Table 1 Rolling restart configuration parameters

    • Restart only instances with expired configurations: Whether to restart only the modified instances in a cluster. The name of this parameter may differ in other versions.

    • Enable rack strategy: Whether to enable the concurrent rack rolling restart strategy. This parameter takes effect only for roles that meet the rack rolling restart strategy, that is, the role supports rack awareness and its instances belong to two or more racks. It can be set only when a rolling restart is performed on HDFS or YARN.

    • Data Nodes to Be Batch Restarted: Number of instances restarted in each batch when the batch rolling restart strategy is used. The default value is 1.

      NOTE:
      • This parameter is valid only when the batch rolling restart strategy is used and the instance type is DataNode.
      • This parameter is invalid when the rack strategy is enabled. In that case, the cluster uses the maximum number of instances configured in the rack strategy (20 by default) as the maximum number of instances restarted concurrently within a rack.
      • This parameter can be set only when a rolling restart is performed on certain components, such as HDFS, HBase, YARN, Kafka, Storm, and Flume. The actual value displayed on the GUI prevails.
      • The number of HBase RegionServers restarted concurrently cannot be configured manually. It is adjusted automatically based on the number of RegionServer nodes: with fewer than 30 nodes, one node is restarted per batch; with 30 to 299 nodes, two nodes per batch; with 300 nodes or more, 1% of the nodes (rounded down) per batch. See the sizing sketch after this table.

    • Batch Interval: Interval between two consecutive batches of instances during the rolling restart. The default value is 0. Setting a batch interval improves the stability of the big data component during the rolling restart, so you are advised to set this parameter to a non-default value, for example, 10.

    • Decommissioning Timeout Interval: How long a role instance is given to decommission during a rolling restart. This parameter can be set only when a rolling restart is performed on Hive or Spark. Some roles (such as HiveServer and JDBCServer) stop providing services before the rolling restart; stopped instances cannot accept new client connections, while existing connections complete after a period of time. An appropriate timeout interval helps ensure service continuity.

    • Batch Fault Tolerance Threshold: Number of failed batches tolerated during the rolling restart. The default value is 0, which means the rolling restart task ends as soon as any batch of instances fails to restart.
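
To make the interplay of these parameters concrete, here is a minimal Python sketch of the batch loop they describe. It is illustrative only, not MRS code: restart_batch is a hypothetical helper standing in for whatever Manager does internally, and hbase_batch_size simply encodes the RegionServer adjustment rules quoted in the note above.

```python
import math
import time

def hbase_batch_size(region_server_count: int) -> int:
    """Auto-sized concurrency for HBase RegionServers, per the note above:
    fewer than 30 nodes -> 1 per batch, 30 to 299 -> 2, 300 or more -> 1%
    of the nodes (rounded down)."""
    if region_server_count < 30:
        return 1
    if region_server_count < 300:
        return 2
    return math.floor(region_server_count * 0.01)

def rolling_restart(instances, restart_batch,
                    batch_size=1,          # Data Nodes to Be Batch Restarted (default 1)
                    batch_interval_s=10,   # Batch Interval (non-default value advised)
                    fault_tolerance=0):    # Batch Fault Tolerance Threshold (default 0)
    """Restart instances in batches, pausing between batches and aborting
    once the number of failed batches exceeds the tolerance threshold."""
    failed_batches = 0
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        if not restart_batch(batch):  # hypothetical helper: returns True on success
            failed_batches += 1
            if failed_batches > fault_tolerance:
                raise RuntimeError("Rolling restart aborted: failed batches "
                                   "exceeded the fault tolerance threshold")
        time.sleep(batch_interval_s)  # interval between two consecutive batches
```

With the default threshold of 0, the loop aborts as soon as a single batch fails, matching the behavior described for Batch Fault Tolerance Threshold.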

Component Restart Reference Information

Table 2 provides services and instances that support or do not support rolling restart in the MRS cluster.

Table 2 Services and instances that support or do not support rolling restart

• Alluxio: AlluxioJobMaster, AlluxioMaster (Yes)
• ClickHouse: ClickHouseServer, ClickHouseBalancer (Yes)
• CDL: CDLConnector, CDLService (Yes)
• Flink: FlinkResource, FlinkServer (No)
• Flume: Flume, MonitorServer (Yes)
• Guardian: TokenServer (Yes)
• HBase: HMaster, RegionServer, ThriftServer, RESTServer (Yes)
• HetuEngine: HSBroker, HSConsole, HSFabric, QAS (Yes)
• HDFS: NameNode, Zkfc, JournalNode, HttpFS, DataNode (Yes)
• Hive: MetaStore, WebHCat, HiveServer (Yes)
• Hue: Hue (No)
• Impala: Impalad, StateStore, Catalog (No)
• IoTDB: IoTDBServer (Yes)
• Kafka: Broker (Yes); KafkaUI (No)
• Kudu: KuduTserver, KuduMaster (Yes)
• Loader: Sqoop (No)
• MapReduce: JobHistoryServer (Yes)
• Oozie: oozie (No)
• Presto: Coordinator, Worker (Yes)
• Ranger: RangerAdmin, UserSync, TagSync (Yes)
• Spark: JobHistory, JDBCServer, SparkResource (Yes)
• Storm: Nimbus, UI, Supervisor, Logviewer (Yes)
• Tez: TezUI (No)
• Yarn: ResourceManager, NodeManager (Yes)
• ZooKeeper: Quorumpeer (Yes)

Table 3 lists the restart and startup durations of each service for reference.

Table 3 Restart duration for reference

• IoTDB: restart duration 3 min (IoTDBServer: 3 min).
• CDL: restart duration 2 min (CDLConnector: 1 min; CDLService: 1 min).
• ClickHouse: restart duration 4 min (ClickHouseServer: 2 min; ClickHouseBalancer: 2 min).
• HDFS: restart duration 10 min + x (NameNode: 4 min + x; DataNode: 2 min; JournalNode: 2 min; Zkfc: 2 min). x indicates the NameNode metadata loading duration: loading 10,000,000 files takes about 2 minutes, so x is about 10 minutes for 50 million files. The startup duration also fluctuates with the reporting of DataNode data blocks.
• Yarn: restart duration 5 min + x (ResourceManager: 3 min + x; NodeManager: 2 min). x indicates the time required to restore ResourceManager reserved tasks: restoring 10,000 reserved tasks takes about 1 minute.
• MapReduce: restart duration 2 min + x (JobHistoryServer: 2 min + x). x indicates the time to scan historical tasks: scanning 100,000 tasks takes about 2.5 minutes.
• ZooKeeper: restart duration 2 min + x (quorumpeer: 2 min + x). x indicates the znode loading duration: loading 1,000,000 znodes takes about 1 minute.
• Hive: restart duration 3.5 min (HiveServer: 3 min; MetaStore: 1 min 30 s; WebHCat: 1 min; Hive service: 3 min).
• Spark2x: restart duration 5 min (JobHistory2x: 5 min; SparkResource2x: 5 min; JDBCServer2x: 5 min).
• Flink: restart duration 4 min (FlinkResource: 1 min; FlinkServer: 3 min).
• Kafka: restart duration 2 min + x (Broker: 1 min + x; Kafka UI: 5 min). x indicates the data restoration duration: starting 20,000 partitions on a single instance takes about 2 minutes.
• Storm: restart duration 6 min (Nimbus: 3 min; UI: 1 min; Supervisor: 1 min; Logviewer: 1 min).
• Flume: restart duration 3 min (Flume: 2 min; MonitorServer: 1 min).
• Doris: restart duration 2 min (FE: 1 min; BE: 1 min; DBroker: 1 min).
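
As a worked example of the x formulas above, the sketch below turns the stated rates into rough duration estimates (for example, HDFS adds about 2 minutes of NameNode metadata loading per 10 million files on top of its 10-minute base). The function only illustrates the arithmetic from Table 3; the service-name strings used here are assumptions for the example.

```python
def estimated_restart_minutes(service: str, scale: float = 0) -> float:
    """Rough total restart duration in minutes, from the Table 3 formulas.
    `scale` is the load figure from the remarks column (files, tasks,
    znodes, or partitions, depending on the service)."""
    if service == "HDFS":       # 10 min + x; x ~ 2 min per 10,000,000 files
        return 10 + 2 * (scale / 10_000_000)
    if service == "Yarn":       # 5 min + x; x ~ 1 min per 10,000 reserved tasks
        return 5 + scale / 10_000
    if service == "MapReduce":  # 2 min + x; x ~ 2.5 min per 100,000 tasks
        return 2 + 2.5 * (scale / 100_000)
    if service == "ZooKeeper":  # 2 min + x; x ~ 1 min per 1,000,000 znodes
        return 2 + scale / 1_000_000
    if service == "Kafka":      # 2 min + x; x ~ 2 min per 20,000 partitions
        return 2 + 2 * (scale / 20_000)
    raise ValueError(f"no duration formula listed for {service}")

# Example from the table: 50 million HDFS files -> 10 + 10 = 20 minutes.
print(estimated_restart_minutes("HDFS", 50_000_000))
```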
