Automatic Scaling of Task Nodes in an MRS Cluster

In big data application scenarios, especially real-time data analysis and processing, the number of cluster nodes needs to be adjusted dynamically as the data volume changes so that the cluster always has the resources it requires. The auto scaling function of MRS automatically scales the task nodes of a cluster to match cluster loads. If the data volume changes periodically, you can configure auto scaling so that the number of task nodes is adjusted automatically at a fixed time, before the data volume changes.

Auto scaling rules: You can increase or decrease task nodes based on real-time cluster loads. Auto scaling will be triggered with a certain delay when the data volume changes.
Resource plans: Set the task node quantity based on the time range. If the data volume changes periodically, you can create resource plans to resize the cluster before the data volume changes, thereby avoiding delays in increasing or decreasing resources.

You can configure auto scaling rules, resource plans, or both to trigger auto scaling. Configuring both improves the scalability of cluster nodes so that the cluster can cope with occasional unexpected peaks in data volume.

In some service scenarios, resources need to be reallocated or service logic needs to be modified after cluster scale-out or scale-in. If you manually scale out or scale in a cluster, you can log in to cluster nodes to reallocate resources or modify service logic. If you use auto scaling, MRS enables you to customize automation scripts for resource reallocation and service logic modification. Automation scripts can be executed before and after auto scaling and automatically adapt to service load changes, all of which eliminates manual operations. In addition, automation scripts can be fully customized and executed at various moments, meeting your personalized requirements and improving auto scaling flexibility.

Auto scaling rules:
- You can set a maximum of five rules for scaling out a cluster and a maximum of five rules for scaling in a cluster.
- The system evaluates scale-out rules first and then scale-in rules, in the sequence in which you configured them. Important policies take precedence over other policies, which prevents repeated triggering when a scale-out or scale-in does not achieve the expected effect.
- Comparison factors include greater than, greater than or equal to, less than, and less than or equal to.
- Cluster scale-out or scale-in can be triggered only after the configured metric threshold has been reached for 5n consecutive minutes (the default value of n is 1).
- After each scale-out or scale-in, a cooling duration applies. It must be greater than 0 and is 10 minutes by default.
- In each cluster scale-out or scale-in, at least one node and at most 100 nodes can be added or removed.
- The number of task nodes in a cluster is limited to the default number of nodes configured by users or the node quantity range in the resource plan that takes effect in the current time period. The node quantity range in the resource plan that takes effect in the current time period has a higher priority.
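
The rule constraints above can be sketched in a few lines of Python. This is only an illustration, not the MRS implementation: the `Rule` class, the function names, and the assumption that the metric is sampled once per 5-minute period are all hypothetical.

```python
from dataclasses import dataclass
import operator

# The four comparison factors listed above.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class Rule:
    """Hypothetical model of one auto scaling rule."""
    name: str
    comparator: str        # one of the keys of COMPARATORS
    threshold: float
    periods: int = 1       # n: the threshold must hold for 5n consecutive minutes
    step: int = 1          # nodes added or removed per action (1 to 100)
    cooldown_min: int = 10 # cooling duration in minutes, must be greater than 0


def should_trigger(rule, samples, minutes_since_last_scaling):
    """Return True if the rule fires.

    samples: metric values, one per 5-minute period (assumption), newest last.
    minutes_since_last_scaling: None if the cluster has never been scaled.
    """
    if (minutes_since_last_scaling is not None
            and minutes_since_last_scaling < rule.cooldown_min):
        return False  # still inside the cooling duration
    if len(samples) < rule.periods:
        return False  # not enough history yet
    compare = COMPARATORS[rule.comparator]
    return all(compare(v, rule.threshold) for v in samples[-rule.periods:])


def next_node_count(current, rule, scale_out, node_min, node_max):
    """Apply the rule's step, then keep the result inside the node range."""
    step = max(1, min(rule.step, 100))
    target = current + step if scale_out else current - step
    return max(node_min, min(target, node_max))
```

For example, a rule that adds three nodes when available memory stays below 20% for two periods would fire on the samples `[25.0, 15.0, 10.0]`, but a cluster of 4 nodes limited to the range 1 to 5 would grow only to 5.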
Resource plans (setting the number of Task nodes by time range):
- You can specify a Task node range (minimum number to maximum number) in a time range. If the number of Task nodes is beyond the Task node range in a resource plan, the system triggers cluster scale-out or scale-in.
- You can set a maximum of five resource plans for a cluster.
- A resource plan cycle is by day. The start time and end time can be set to any time point between 00:00 and 23:59. The start time must be at least 30 minutes earlier than the end time. Time ranges configured for different resource plans cannot overlap.
- After a resource plan triggers cluster scale-out or scale-in, there is a 10-minute cooling duration. Auto scaling will not be triggered again within the cooling duration.
- When a resource plan is enabled, the number of Task nodes in the cluster is limited to the default node range you configured during all time periods outside the time range of the resource plan.
Automation scripts:
- You can set an automation script so that it can automatically run on cluster nodes when auto scaling is triggered.
- You can set a maximum of 10 automation scripts for a cluster.
- You can specify an automation script to be executed on one or more types of nodes.
- Automation scripts can be executed before or after scale-out or scale-in.
- Before using automation scripts, upload them to a cluster VM or OBS file system in the same region as the cluster. The automation scripts uploaded to the cluster VM can be executed only on the existing nodes. If you want to make the automation scripts run on the new nodes, upload them to the OBS file system.

Node Auto Scaling Metrics

Node group dimension policy

When adding a rule, you can refer to Table 1 to configure the corresponding metrics.

**Table 1** Auto scaling metrics
Cluster Type	Metric	Value Type	Description
Streaming cluster	StormSlotAvailable	Integer	Number of available Storm slots. Value range: 0 to 2147483646
	StormSlotAvailablePercentage	Percentage	Percentage of available Storm slots, that is, the proportion of available slots to total slots. Value range: 0 to 100
	StormSlotUsed	Integer	Number of used Storm slots. Value range: 0 to 2147483646
	StormSlotUsedPercentage	Percentage	Percentage of used Storm slots, that is, the proportion of used slots to total slots. Value range: 0 to 100
	StormSupervisorMemAverageUsage	Integer	Average memory usage of the Supervisor process of Storm. Value range: 0 to 2147483646
	StormSupervisorMemAverageUsagePercentage	Percentage	Average percentage of the memory used by the Supervisor process of Storm to the total memory of the system. Value range: 0 to 100
	StormSupervisorCPUAverageUsagePercentage	Percentage	Average percentage of the CPUs used by the Supervisor process of Storm to the total CPUs. Value range: 0 to 6000
Analysis cluster	YARNAppPending	Integer	Number of pending tasks on YARN. Value range: 0 to 2147483646
	YARNAppPendingRatio	Ratio	Ratio of pending tasks on YARN, that is, the ratio of pending tasks to running tasks on YARN. Value range: 0 to 2147483646
	YARNAppRunning	Integer	Number of running tasks on YARN. Value range: 0 to 2147483646
	YARNContainerAllocated	Integer	Number of containers allocated to YARN. Value range: 0 to 2147483646
	YARNContainerPending	Integer	Number of pending containers on YARN. Value range: 0 to 2147483646
	YARNContainerPendingRatio	Ratio	Ratio of pending containers on YARN, that is, the ratio of pending containers to running containers on YARN. Value range: 0 to 2147483646
	YARNCPUAllocated	Integer	Number of virtual CPUs (vCPUs) allocated to YARN. Value range: 0 to 2147483646
	YARNCPUAvailable	Integer	Number of available vCPUs on YARN. Value range: 0 to 2147483646
	YARNCPUAvailablePercentage	Percentage	Percentage of available vCPUs on YARN, that is, the proportion of available vCPUs to total vCPUs. Value range: 0 to 100
	YARNCPUPending	Integer	Number of pending vCPUs on YARN. Value range: 0 to 2147483646
	YARNMemoryAllocated	Integer	Memory allocated to YARN. The unit is MB. Value range: 0 to 2147483646
	YARNMemoryAvailable	Integer	Available memory on YARN. The unit is MB. Value range: 0 to 2147483646
	YARNMemoryAvailablePercentage	Percentage	Percentage of available memory on YARN, that is, the proportion of available memory to total memory on YARN. Value range: 0 to 100
	YARNMemoryPending	Integer	Pending memory on YARN. Value range: 0 to 2147483646

When the value type is percentage or ratio in Table 1, the value can be accurate to two decimal places. A percentage metric value is written as a decimal number without the percent sign (%). For example, 16.80 represents 16.80%.
Hybrid clusters support all metrics of analysis and streaming clusters.
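
As a rough illustration of the value rules for Table 1, the following Python sketch checks a threshold before it is used in a rule. The function name `validate_threshold` and its parameters are hypothetical; the upper bound is passed in by the caller because it differs from metric to metric in the table.

```python
from decimal import Decimal, InvalidOperation


def validate_threshold(text, value_type, upper):
    """Parse a metric threshold and check it against the table's rules.

    text:       the value as entered, for example "16.80" for 16.80%
    value_type: "integer", "percentage", or "ratio"
    upper:      upper bound of the metric's value range (lower bound is 0)
    """
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}")
    if not value.is_finite():
        raise ValueError(f"not a finite number: {text!r}")
    if value_type == "integer":
        if value != value.to_integral_value():
            raise ValueError("integer metrics take whole numbers")
    elif value_type in ("percentage", "ratio"):
        # At most two digits after the decimal point.
        if value.as_tuple().exponent < -2:
            raise ValueError("at most two decimal places are allowed")
    else:
        raise ValueError(f"unknown value type: {value_type!r}")
    if not 0 <= value <= upper:
        raise ValueError(f"value must be between 0 and {upper}")
    return value
```

With this sketch, `validate_threshold("16.80", "percentage", 100)` returns `Decimal("16.80")`, which stands for 16.80%, while `"16.805"` and `"120"` are rejected for the same metric.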

Resource pool policy

When adding a rule, you can refer to Table 2 to configure the corresponding metrics.

Auto scaling policies can be configured for a cluster by resource pool in MRS 3.1.5 or later.

**Table 2** Rule configuration description
Cluster Type	Metric	Value Type	Description
Analysis/Custom cluster	ResourcePoolMemoryAvailable	Integer	Available memory on YARN in the resource pool. The unit is MB. Value range: 0 to 2147483646
	ResourcePoolMemoryAvailablePercentage	Percentage	Percentage of available memory on YARN in the resource pool, that is, the proportion of available memory to total memory on YARN. Value range: 0 to 100
	ResourcePoolCPUAvailable	Integer	Number of available vCPUs on YARN in the resource pool. Value range: 0 to 2147483646
	ResourcePoolCPUAvailablePercentage	Percentage	Percentage of available vCPUs on YARN in the resource pool, that is, the proportion of available vCPUs to total vCPUs. Value range: 0 to 100

When adding a resource plan, you can set parameters by referring to Table 3.

**Table 3** Configuration items of a resource plan
Configuration Item	Description
Effective On	The effective date of a resource plan. Daily is selected by default. You can also select one or multiple days from Monday to Sunday.
Time Range	Start time and End time of a resource plan are accurate to minutes, with the value ranging from 00:00 to 23:59. For example, if a resource plan starts at 8:00 and ends at 10:00, set this parameter to 8:00-10:00. The end time must be at least 30 minutes later than the start time.
Node Range	The number of nodes in a resource plan ranges from 0 to 500. In the time range specified in the resource plan, if the number of Task nodes is less than the specified minimum number of nodes, it is increased to the minimum value of the node range in a single operation. If the number of Task nodes is greater than the maximum number of nodes specified in the resource plan, the auto scaling function reduces the number of Task nodes to the maximum value of the node range in a single operation. The minimum number of nodes must be less than or equal to the maximum number of nodes.

When a resource plan is enabled, the Default Range value on the auto scaling page forcibly takes effect outside the time range specified in the resource plan. For example, if Default Range is set to 1-2, and a resource plan has Time Range set to 08:00-10:00 and Node Range set to 4-5, the number of Task nodes in the other periods of a day (00:00-08:00 and 10:00-23:59) is forcibly limited to the default node range (1 to 2). If the number of nodes is greater than 2, auto scale-in is triggered; if the number of nodes is less than 1, auto scale-out is triggered.
When a resource plan is not enabled, the Default Range takes effect in all time ranges. If the number of nodes is not within the default node range, the number of Task nodes is automatically increased or decreased to the default node range.
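
The interaction between Default Range and a resource plan can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MRS implementation: the function names are hypothetical, the Effective On days are ignored, and the end time of a plan is treated as exclusive.

```python
from datetime import time


def effective_range(now, default_range, plans):
    """Return the (min, max) node range that applies at the time `now`.

    plans: list of (start, end, (min_nodes, max_nodes)) tuples.
    A plan's range wins while the plan is in effect; otherwise the
    default range applies.
    """
    for start, end, node_range in plans:
        if start <= now < end:  # end treated as exclusive (assumption)
            return node_range
    return default_range


def target_node_count(current, node_range):
    """Move the node count to the nearest bound if it is out of range."""
    low, high = node_range
    return max(low, min(current, high))


# Default Range 1-2, one plan from 08:00 to 10:00 with Node Range 4-5.
default_range = (1, 2)
plans = [(time(8, 0), time(10, 0), (4, 5))]

in_plan = effective_range(time(9, 0), default_range, plans)
out_of_plan = effective_range(time(12, 0), default_range, plans)
```

Here `in_plan` is `(4, 5)`, so a cluster with 2 Task nodes at 09:00 would be scaled out to 4. `out_of_plan` is `(1, 2)`, so a cluster with 3 Task nodes at 12:00 would be scaled in to 2.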
Time ranges of resource plans cannot overlap. Two time ranges overlap if both resource plans are in effect at the same time point. For example, if resource plan 1 takes effect from 08:00 to 10:00 and resource plan 2 takes effect from 09:00 to 11:00, the time range from 09:00 to 10:00 is overlapped.
The time range of a resource plan must be on the same day. For example, if you want to configure a resource plan from 23:00 to 01:00 (the next day), configure two resource plans whose time ranges are 23:00-00:00 and 00:00-01:00, respectively.
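
The time range constraints on resource plans (at most five plans, at least 30 minutes per plan, no overlap) can be checked with a short Python sketch. The function names are hypothetical, the Effective On days are ignored, and plans that merely touch (one ends when the next starts) are assumed not to overlap. How an end time of 00:00 is interpreted is not modeled here.

```python
from datetime import time
from itertools import combinations

MAX_PLANS = 5
MIN_DURATION_MIN = 30


def to_minutes(t):
    """Minutes since 00:00 for a datetime.time value."""
    return t.hour * 60 + t.minute


def validate_plans(plans):
    """Return a list of problems found in `plans`.

    plans: list of (start, end) tuples of datetime.time values.
    An empty list means the plans satisfy the three constraints.
    """
    problems = []
    if len(plans) > MAX_PLANS:
        problems.append(f"more than {MAX_PLANS} resource plans")
    for index, (start, end) in enumerate(plans, start=1):
        if to_minutes(end) - to_minutes(start) < MIN_DURATION_MIN:
            problems.append(
                f"plan {index}: end time must be at least "
                f"{MIN_DURATION_MIN} minutes later than start time"
            )
    for (i, a), (j, b) in combinations(enumerate(plans, start=1), 2):
        # Two half-open ranges overlap if each starts before the other ends.
        if (to_minutes(a[0]) < to_minutes(b[1])
                and to_minutes(b[0]) < to_minutes(a[1])):
            problems.append(f"plan {i} overlaps plan {j}")
    return problems
```

For the overlap example above, `validate_plans([(time(8, 0), time(10, 0)), (time(9, 0), time(11, 0))])` reports that plan 1 overlaps plan 2, whereas 08:00-10:00 followed by 10:00-11:00 passes.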

Automation scripts

When adding an automation script, you can set related parameters by referring to Table 4.

**Table 4** Configuration items of an automation script
Configuration Item	Description
Name	Automation script name. The value can contain only digits, letters, spaces, hyphens (-), and underscores (_) and must not start with a space. The value can contain 1 to 64 characters. NOTE: A name must be unique in the same cluster. You can set the same name for different clusters.
Script Path	Script path. The value can be an OBS file system path or a local VM path. An OBS file system path must start with obs:// and end with .sh, for example, obs://mrs-samples/xxx.sh. A local VM path must start with a slash (/) and end with .sh. For example, the path of the example script for installing Zeppelin is /opt/bootstrap/zepelin/zepelin_install.sh.
Execution Node	Select a type of the node where an automation script is executed. NOTE: If you select Master nodes, you can choose whether to run the script only on the active Master nodes by enabling or disabling the Active Master switch. If you enable it, the script runs only on the active Master nodes. If you disable it, the script runs on all Master nodes. This switch is disabled by default.
Parameter	Automation script parameter. The following predefined variables can be imported to obtain auto scaling information: ${mrs_scale_node_num}: Number of auto scaling nodes. The value is always positive. ${mrs_scale_type}: Scale-out/in type. The value can be scale_out or scale_in. ${mrs_scale_node_hostnames}: Host names of the auto scaling nodes. Use commas (,) to separate multiple host names. ${mrs_scale_node_ips}: IP addresses of the auto scaling nodes. Use commas (,) to separate multiple IP addresses. ${mrs_scale_rule_name}: Name of the triggered auto scaling rule. For a resource plan, this parameter is set to resource_plan.
Executed	Time for executing an automation script. The following four options are supported: Before scale-out, After scale-out, Before scale-in, and After scale-in. NOTE: Assume that the execution nodes include Task nodes. The automation script executed before scale-out cannot run on the Task nodes to be added. The automation script executed after scale-out can run on the added Task nodes. The automation script executed before scale-in can run on Task nodes to be deleted. The automation script executed after scale-in cannot run on the deleted Task nodes.
Action upon Failure	Whether to continue to execute subsequent scripts and scale-out/in after the script fails to be executed. NOTE: You are advised to set this parameter to Continue in the commissioning phase so that the cluster can continue the scale-out/in operation regardless of whether the script is executed successfully. If the script fails to be executed, view the log in /var/log/Bootstrap on the cluster VM. The scale-in operation cannot be rolled back. Therefore, Action upon Failure can only be set to Continue for scripts executed after scale-in.
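
The Name and Script Path rules in Table 4 can be expressed as simple checks, and the comma-separated values of the predefined variables are easy to split. The Python sketch below is illustrative only: the function names are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# 1 to 64 characters: digits, letters, spaces, hyphens, underscores;
# the first character must not be a space. ASCII letters are an assumption.
NAME_RE = re.compile(r"[A-Za-z0-9_-][A-Za-z0-9 _-]{0,63}")


def is_valid_name(name):
    """Check an automation script name against the Name rules."""
    return NAME_RE.fullmatch(name) is not None


def is_valid_script_path(path):
    """Check a script path: obs://... or /..., ending with .sh."""
    if not path.endswith(".sh"):
        return False
    return path.startswith("obs://") or path.startswith("/")


def split_list(value):
    """Split a comma-separated value such as the expansion of
    ${mrs_scale_node_hostnames} or ${mrs_scale_node_ips}."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

For example, `is_valid_script_path("obs://mrs-samples/xxx.sh")` and `is_valid_script_path("/opt/bootstrap/zepelin/zepelin_install.sh")` both return `True`, while a name that starts with a space or is longer than 64 characters fails `is_valid_name`.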