Function Overview
-
-
Big data is a huge challenge for the Internet era, as the data that needs to be handled has been growing explosively in both volume and diversity. Conventional data processing technologies, such as single-node storage and relational databases, are unable to handle the massive volumes and varieties of data of the Internet era. To rise to this challenge, the Apache Software Foundation (ASF) launched Hadoop, an open source distributed computing platform that can take full advantage of the computing and storage capabilities of clusters to process massive amounts of data. However, big data solutions built on Hadoop can be expensive and inflexible, take too long to deploy, and be both challenging and inefficient to maintain.
To solve these issues, HUAWEI CLOUD provides MapReduce Service (MRS). MRS allows you to quickly build and operate full-stack, cloud-native big data platforms on the cloud. With MRS, you can deploy a Hadoop cluster in just a few clicks, and control clusters and run big data components, such as Hadoop, Spark, HBase, Kafka, and Storm, with ease.
MRS is fully compatible with open source APIs. It combines data lakes, data warehouses, business intelligence (BI), and artificial intelligence (AI) into a full-stack platform that can be further customized based on service requirements. MRS helps you extract new value from massive data and discover opportunities to grow your business.
MRS is available in all regions except AP-Bangkok, AP-Singapore, and LA-Santiago.
-
-
Cluster Management
-
MRS allows you to efficiently manage Hadoop systems on HUAWEI CLOUD.
You can purchase the MRS clusters best suited to your services. After selecting the cluster type, version, and node specifications, you can deploy an enterprise-grade big data platform, tune parameters, and manage clusters with just a few clicks.
-
Purchasing a Cluster
-
The MRS console provides Quick Config and Custom Config for you to create your clusters with ease.
-
-
Scaling Out a Cluster
-
You can expand MRS storage and compute capabilities by adding Core and Task nodes without modifying the system architecture. Core nodes both process and store data, whereas Task nodes process data but do not store it.
You do not need to update the client after scaling out clusters.
-
-
Scaling In a Cluster
-
You can reduce the number of Core or Task nodes to scale in a cluster. The number of Master nodes cannot be changed.
To scale in a cluster, reduce the number of Core or Task nodes on the MRS console. Then, MRS automatically selects the nodes and completes the scale-in.
-
-
Unsubscribing from a Node in a Yearly/Monthly Cluster
-
You can unsubscribe from a node in a yearly/monthly cluster based on service requirements.
Constraints
- To ensure data reliability, MRS does not support node unsubscription if the number of analysis Core nodes in the cluster is less than or equal to the number of HDFS copies. To query the number of HDFS copies, choose Components > HDFS > Service Configuration, switch Basic to All, and check the dfs.replication value.
- MRS does not support unsubscription from nodes where the ZooKeeper service is deployed.
-
-
Autoscaling Task Nodes
-
In big data applications, especially real-time data analysis and processing, the number of cluster nodes needs to be dynamically increased or decreased with the data volume. MRS provides auto scaling to enable elastic cluster scale-out or scale-in based on workloads. In addition, if the data volume changes on a predictable schedule from day to day, you can use an MRS resource plan (by setting the Task node quantity based on time) to resize the cluster in advance of expected changes.
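As a concrete (but hedged) illustration of a resource plan, the sketch below sets the Task node quantity for a predictable daily peak. The field names are assumptions modeled loosely on the MRS auto scaling API, so verify the exact schema in the API reference.

```python
# Illustrative only: these field names are assumptions, not the
# authoritative MRS auto scaling schema.
auto_scaling_policy = {
    "auto_scaling_enable": True,
    "min_capacity": 2,    # fewest Task nodes at any time
    "max_capacity": 10,   # most Task nodes the cluster may scale out to
    "resources_plans": [
        {
            # Pre-scale ahead of a predictable daily peak.
            "period_type": "daily",
            "start_time": "09:00",
            "end_time": "18:00",
            "min_capacity": 5,
            "max_capacity": 10,
        }
    ],
}
```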
-
-
Upgrading Master Node Specifications
-
If Master nodes cannot meet service requirements, you can upgrade their specifications.
-
-
Creating a Custom Cluster
-
MRS provides fixed process deployment templates for analysis, streaming, and hybrid clusters, in which the management and control roles cannot be customized. If you want to customize the cluster deployment, set Cluster Type to Custom when creating a cluster.
You can create a custom cluster to implement:
- Deployment of the management role and control role on different Master nodes
- Co-deployment of the management role and control role on the same Master nodes
- Independent deployment of ZooKeeper for reliability purposes
- Separate deployment of components to prevent resource contention
-
-
-
File Management
-
On the Files tab page of the MRS console, you can create and delete folders, and import, export, and delete files for an analysis cluster, but you cannot create files.
- Data import: MRS supports importing data from OBS to HDFS only. Larger files take longer to upload, so import data when the data volume is small.
- Data export: After data is processed and analyzed, you can store the data in HDFS or export the data to OBS.
-
-
Job Management
-
MRS jobs are used to process and analyze user data. On the Jobs tab page of the MRS console, you can create and manage jobs and view information about all jobs created.
-
Submitting a Flink Job
-
Flink is a distributed big data processing engine that performs stateful computations over finite and infinite data streams. You can create and submit Flink jobs to execute programs packaged in JAR format to process streaming data.
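MRS Flink jobs execute programs packaged as JARs; purely to illustrate the stream-processing model, here is a minimal PyFlink sketch (the input data and job name are placeholders):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A finite stream of (word, count) pairs; an infinite source such as a
# Kafka topic would plug in the same way.
stream = env.from_collection([("a", 1), ("b", 1), ("a", 1)])

# Stateful computation: keep a running count per key.
stream.key_by(lambda pair: pair[0]) \
      .reduce(lambda a, b: (a[0], a[1] + b[1])) \
      .print()

env.execute("word_count_sketch")
```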
-
-
Submitting a MapReduce Job
-
MapReduce is a distributed data processing framework that rapidly processes massive amounts of data in parallel. When you have a lot of data to process, you can create and submit MapReduce jobs to execute programs packaged in JAR format and take advantage of parallel processing.
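MRS MapReduce jobs execute JAR programs; as a hedged illustration of the same model, the Hadoop Streaming style word count below reads from stdin and writes tab-separated key-value pairs to stdout, relying on the framework to sort mapper output by key between the two phases.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit one "word<TAB>1" pair per word.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce phase: input arrives sorted by key, so lines for the same
    # word are consecutive and can be summed per group.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(n) for _, n in group)}")

if __name__ == "__main__":
    # Invoked as "wordcount.py map" or "wordcount.py reduce".
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```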
-
-
Submitting a Hive Job
-
Hive is an open source data warehouse based on Hadoop. You can submit SQL statements or scripts in HiveSql jobs for data query and analysis. If sensitive information is involved, use a script instead of SQL statements.
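A HiveQL query can also be submitted programmatically; the hedged sketch below uses the PyHive client, with placeholder host, user, and table names:

```python
from pyhive import hive

# Placeholders: point these at your cluster's HiveServer2 instance.
conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                       username="hive_user", database="default")
cur = conn.cursor()
cur.execute("SELECT page, COUNT(*) AS hits FROM access_log GROUP BY page")
for row in cur.fetchall():
    print(row)
```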
-
-
Submitting a Spark Job
-
Spark is a distributed in-memory computing framework. You can create and submit JAR and Python programs in Spark jobs to compute and process user data.
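A minimal PySpark word count gives a sense of the kind of program a Spark job executes; the HDFS paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count").getOrCreate()

# Read text lines, split into words, and count occurrences in parallel.
lines = spark.read.text("hdfs:///tmp/input")  # one string column: "value"
counts = (lines.rdd.flatMap(lambda row: row.value.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///tmp/output")
spark.stop()
```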
-
-
Submitting a SparkSQL Job
-
Spark is a distributed in-memory computing framework. You can submit SQL statements or scripts in SparkSQL jobs for data query and analysis. Use a script instead of SQL statements if sensitive information is involved.
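The sketch below shows the kind of statement a SparkSQL job runs, using the PySpark API with Hive support enabled; the table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("sql_job")
         .enableHiveSupport()   # query tables registered in the Hive metastore
         .getOrCreate())

result = spark.sql(
    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product"
)
result.show()
spark.stop()
```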
-
-
-
O&M Management
-
MRS provides a wide range of methods for you to manage cluster resources.
-
Logging In to a Cluster
-
When creating a cluster, you can set a private key or password for logging in to the ECSs.
MRS supports two node access modes:
- Remote login (VNC)
- Login using a private key or password (SSH)
Remote login is recommended only for emergency O&M. In other scenarios, it is better to log in to an ECS using SSH.
-
-
Determining Active and Standby Management Nodes of MRS Manager
-
You can log in to other nodes in the cluster from the Master node.
Switchovers occur automatically between the Master nodes working in active/standby mode. Therefore, Master 1 may not always be the active management node.
-
-
Manager
-
Manager provides a unified enterprise-class cluster management platform to help you complete installation, configuration, and management of big data components. Manager provides the following functions:
- Cluster monitoring: helps you to monitor the health status of hosts and services.
- Intuitive metric monitoring and customization: helps you to conveniently track key system details.
- Service configurations: allows you to easily configure service attributes to meet performance requirements.
- O&M: allows you to manage services, role instances, and clusters and start or stop services and clusters.
-
-
EIP-based Access to a Cluster
-
You can bind an EIP to a cluster to enable access to the GUIs of the open-source components in the MRS cluster. This method is simple and easy to use.
-
-
Configuring Message Notification
-
After configuring Simple Message Notification (SMN), you can receive MRS cluster health status, updates, and component alarms through SMS or emails in real time.
-
-
Alarm Management
-
MRS provides real-time monitoring of big data clusters and alarms and events of different severity levels. You can also customize monitoring and alarm thresholds for key metrics. When monitoring data reaches the configured threshold, the system triggers an alarm so that you can handle it in a timely manner.
-
-
Rolling Service Restart
-
After modifying configurations of a big data component, you need to restart the services to make new configurations take effect. However, restarting multiple services or instances at the same time will interrupt services.
A rolling restart can minimize or prevent service interruptions by sequentially restarting services or instances in batches. For example, for instances in active/standby mode, the standby instance and active instance are restarted one after the other. A rolling restart takes longer than a normal restart but helps ensure service continuity.
-
-
Rolling Patch Installation
-
After a patch is installed or uninstalled for one or more services in a cluster, perform a rolling restart (sequentially restart services or instances in batches) to minimize or prevent service interruptions.
-
-
Bootstrap Actions
-
Bootstrap actions allow you to write a script and have it run automatically before or after cluster components are started. You can use bootstrap actions to install third-party software or modify the cluster environment on a given node.
If you specify bootstrap actions for cluster scale-out, the script is automatically executed on the newly added nodes. If auto scaling is enabled in a cluster, you can configure bootstrap actions in addition to a resource plan, and the script is then automatically executed on the nodes involved in scaling.
-
-
O&M Authorization
-
Before HUAWEI CLOUD technical support personnel can troubleshoot your cluster, you need to grant them O&M permissions.
-
-
-
Operations Management
-
The MRS console provides intuitive cluster lifecycle management, including billing management, cluster management, tag management, and audit management.
-
Billing
-
MRS supports two billing modes: pay-per-use and yearly/monthly. A price calculator is provided on the MRS console to help you estimate the price of an MRS cluster.
When you purchase an MRS cluster, the price is calculated and displayed on the MRS console automatically for your reference.
-
-
Unsubscribing from or Terminating a Cluster
-
The way to handle a cluster that is no longer required varies depending on the billing mode of the cluster:
- If the cluster is billed on a yearly/monthly basis, unsubscribe from it. Once a cluster is unsubscribed from, resources and data will be deleted and cannot be restored. Therefore, back up your data before you unsubscribe from a cluster.
- If the cluster is billed on a pay-per-use basis, terminate it.
-
-
Enterprise Project Management
-
MRS provides enterprise project management to help you manage resources, personnel, permissions, and enterprise finances on the cloud.
Enterprise project management is different from a standard management console. The standard management console focuses on the control and configuration of individual cloud products, whereas enterprise project management focuses on enterprise-wide control of cloud resources, such as personnel, permissions, and finances.
-
-
Tag Management
-
A tag identifies a cluster. Adding tags to clusters can help you identify and manage your cluster resources.
By associating Tag Management Service (TMS) with MRS, you can tag cloud resources so they can be located more quickly and managed more easily, and you can view, modify, and delete the tags of big data clusters and other cloud resources in one place.
You can add tags when creating an MRS cluster or after the cluster is created through the Tags tab page of the cluster on the MRS console.
You can add up to 10 tags to each cluster.
-
-
Audit Management
-
All operations on MRS Manager are logged for tracing and fault locating purposes.
-
-
-
User Permission Management
-
MRS interworks with the Identity and Access Management (IAM) service to implement permission-based access control. You can also synchronize an IAM user to MRS Manager and bind the MRS user role to the IAM user so that the IAM user can perform the authorized operations on MRS Manager.
MRS also supports fine-grained access control for Object Storage Service (OBS) buckets by using OBS permission mapping.
-
IAM Permission Management
-
With IAM, you can use your HUAWEI CLOUD account to create IAM users for your employees and assign permissions on an individual basis to control their access to specific resources. For example, if you want to restrict some software developers in your enterprise from performing high-risk operations, such as terminating MRS clusters, you can create IAM users for the software developers and grant them permissions to use MRS cluster resources but not to delete them.
-
-
IAM User Synchronization
-
MRS supports IAM user synchronization, which allows IAM users to be synchronized to MRS Manager, where they receive basic permissions. You can also bind an MRS Manager role to a synchronized IAM user so that the user can access MRS Manager and manage clusters. Note that the password such an IAM user uses to log in to MRS Manager must be reset by user admin of MRS Manager.
-
-
OBS Permission Mapping
-
You can configure OBS access permissions to control access to specific directories in OBS buckets.
For example, to enable a user group to access only the log files in a specified OBS bucket:
1. Configure an agency for the MRS cluster to enable the ECS to access OBS by using an automatically obtained temporary AK/SK.
2. On the IAM console, create a policy that allows access to only the log files in the specified OBS bucket (a sketch of such a policy follows these steps), and create an agency bound to the policy.
3. Bind the agency created for the cluster with the user group. The user group will then have permissions to access only the logs in the specified OBS bucket.
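A hedged illustration of such a fine-grained policy, written as a Python dict: the action and resource strings follow the general shape of HUAWEI CLOUD IAM policies, but the bucket name and log path are placeholders, so verify the exact syntax against the IAM documentation.

```python
# Illustrative policy: read-only access to log files in one OBS bucket.
# Bucket and path are placeholders; the syntax is an assumption to verify.
policy = {
    "Version": "1.1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["obs:object:GetObject"],
            "Resource": ["obs:*:*:object:mrs-demo-bucket/logs/*"],
        }
    ],
}
```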
-
-
-
Decoupled Storage and Compute
-
MRS uses decoupled storage and compute, so you can store data in OBS and use an MRS cluster for data computing only.
-
Configuring a Cluster with Decoupled Storage and Compute
-
MRS allows you to store data in OBS and use a dedicated MRS cluster for data processing.
You can use the IAM service to configure an agency for the MRS cluster so that the ECS can access OBS by using an automatically obtained temporary AK/SK. This prevents the AK/SK information from being exposed in the configuration file.
You can also add the AK/SK information to the configuration file and access the OBS file system using obs://. In this way, you do not need to manually add the AK/SK and endpoint each time you run a task.
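The hedged PySpark sketch below reads data straight from OBS. It assumes the Hadoop-OBS (OBSA) connector is available on the cluster; the endpoint and bucket are placeholders, and with an agency configured the AK/SK lines can be omitted because temporary credentials are obtained automatically.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("obs_read")
         .config("spark.hadoop.fs.obs.endpoint",
                 "obs.example-region.myhuaweicloud.com")  # placeholder
         # Only needed without an agency; avoid hard-coding real credentials.
         .config("spark.hadoop.fs.obs.access.key", "YOUR_AK")
         .config("spark.hadoop.fs.obs.secret.key", "YOUR_SK")
         .getOrCreate())

df = spark.read.csv("obs://mrs-demo-bucket/input/data.csv", header=True)
print(df.count())
spark.stop()
```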
-
-
Storing Metadata Externally
-
You can store MRS component metadata externally. For example, you can store Hive metadata in an external relational database, such as an RDS instance, by creating a data connection between Hive and the database.
-
-
-
Flink
-
Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink is one of the most popular open source stream processing engines in the industry.
Flink processes data in a pipelined, parallel manner, with millisecond-level latency and high reliability. It is ideal for latency-sensitive data processing.
-
-
Flume
-
Flume is a distributed, reliable, and highly available (HA) system for efficiently collecting, aggregating, and moving large amounts of log data. Flume supports customized data senders in the log system for data collection, and it can also process data and write it to a variety of customizable data receivers.
-
-
HBase
-
HBase is an open source, column-oriented, distributed storage system suitable for storing massive amounts of unstructured or semi-structured data. It features high reliability, high performance, and flexible scalability, and supports real-time data read/write.
The tables stored in HBase have the following features:
- A table can scale to hundreds of millions of rows and millions of columns.
- Column-oriented storage, retrieval, and permission control are provided.
- Null columns in tables do not occupy any storage space.
The HBase component of MRS supports compute and storage decoupling. Data can be stored in low-cost cloud storage services, such as Object Storage Service (OBS), and backed up across AZs. MRS supports secondary indexing on HBase, which allows indexes to be added for column values to filter data by column through native HBase APIs.
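The hedged HappyBase sketch below illustrates column-oriented writes and real-time reads through HBase's Thrift interface; it assumes a running HBase Thrift server, and the host and table names are placeholders:

```python
import happybase

# Placeholder host; requires the HBase Thrift server to be running.
conn = happybase.Connection(host="hbase-thrift.example.com", port=9090)
table = conn.table("web_pages")  # assumes a table with column family "cf"

# Column-oriented write: only the cells you set occupy storage.
table.put(b"row-20240101", {b"cf:title": b"hello", b"cf:views": b"42"})

# Real-time read of one row, then a scan filtered to a single column.
print(table.row(b"row-20240101"))
for key, data in table.scan(columns=[b"cf:title"]):
    print(key, data)
```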
-
-
HDFS
-
Hadoop Distributed File System (HDFS) enables reliable, distributed read/write of massive amounts of data. HDFS is useful for write-once-read-many applications and sequential writes (appending-writes). The files in HDFS can have only one writer but they can have multiple readers at any time.
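The write-once-read-many model can be sketched with the hdfscli WebHDFS client; the NameNode URL, user, and paths below are placeholders:

```python
from hdfs import InsecureClient

# Placeholder WebHDFS endpoint and user.
client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

# Single writer: create the file once, then only append to it.
client.write("/data/events.log", data="event-1\n",
             encoding="utf-8", overwrite=True)
client.write("/data/events.log", data="event-2\n",
             encoding="utf-8", append=True)

# Many readers: any number of clients can read the file concurrently.
with client.read("/data/events.log", encoding="utf-8") as reader:
    print(reader.read())
```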
-
-
Hive
-
Hive is a data warehouse infrastructure built on Hadoop. It provides extract, transform, and load (ETL) tools and lets you store, query, and analyze big data stored in Hadoop. Hive defines HiveQL, a simple SQL-like query language, so anyone familiar with SQL can query data. Hive computations can run on MapReduce, Spark, or Tez.
-
-
Hue
-
Hue is a group of web applications that interact with all MRS big data components. It helps you browse HDFS, perform Hive queries, and start MapReduce jobs.
Hue provides a file browser and a query editor:
- File browser: allows you to browse and manage different HDFS directories on the GUI.
- Query editor: allows you to easily create, manage, and execute SQL statements to query data stored on HDFS, HBase, and Hive, and export the results to an Excel file.
-
-
Impala
-
Impala provides fast, interactive SQL queries directly on your data stored in HDFS, HBase, or Object Storage Service (OBS). In addition to using the same storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala UI in Hue) as Apache Hive. It provides a unified query platform for real-time or batch processing.
As a supplement to big data query tools, Impala does not replace the batch processing frameworks built on MapReduce, such as Hive. Hive and other MapReduce-based frameworks are optimal for processing long-running batch jobs.
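An interactive Impala query can be sketched with the impyla client; the impalad host and table are placeholders:

```python
from impala.dbapi import connect

# Placeholder coordinator; 21050 is the usual impalad HiveServer2 port.
conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()
cur.execute("SELECT host, COUNT(*) FROM access_log GROUP BY host LIMIT 10")
for row in cur.fetchall():
    print(row)
```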
-
-
Kafka
-
Kafka is a distributed, partitioned, and replicated publish-subscribe messaging system. It provides features similar to Java Message Service (JMS), but with a unique design. Kafka is distributed and features high throughput, real-time processing, multi-client support, and message persistence. It applies to online and offline message consumption, such as regular message collection, website activity tracking, aggregation of statistical system operation data (monitoring data), and log collection, as well as mass data collection for Internet services.
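A minimal publish-subscribe sketch with the kafka-python client; the broker address and topic name are placeholders:

```python
from kafka import KafkaConsumer, KafkaProducer

broker = "broker.example.com:9092"  # placeholder

# Publish: messages are persisted by the brokers once acknowledged.
producer = KafkaProducer(bootstrap_servers=broker)
producer.send("site-activity", b'{"page": "/home", "user": "u1"}')
producer.flush()

# Subscribe: replay the topic from the beginning, give up after 5 s idle.
consumer = KafkaConsumer("site-activity", bootstrap_servers=broker,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```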
-
-
KafkaManager
-
KafkaManager is a tool for monitoring Apache Kafka metrics and GUI-based management of Kafka clusters.
-
-
Kudu
-
Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operations.
-
-
Loader
-
Based on the open source Sqoop, Loader is used to exchange data and files between MRS and relational databases or file systems. With Loader, you can import data from relational databases or file servers to HDFS and HBase of MRS, or export data from HDFS and HBase to relational databases or file servers.
-
-
MapReduce
-
MapReduce is the core of Hadoop. It is a software architecture proposed by Google for parallel computing of datasets larger than 1 TB. The name alludes to the map and reduce functions used in functional programming, although some features are from vector programming.
With MapReduce, a map function transforms a group of key-value pairs into a new group of key-value pairs, and a reduce function merges all intermediate values that share the same key.
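These semantics can be shown in a few lines of plain Python: the map step emits key-value pairs, a sort-and-group step stands in for the shuffle, and the reduce step merges each key's values (here taking a maximum). The readings are made up for illustration.

```python
from itertools import groupby

# map: turn each raw record into a (key, value) pair
readings = ["paris,21", "tokyo,28", "paris,25"]
mapped = [(city, int(t)) for city, t in (r.split(",") for r in readings)]

# shuffle: bring together all values that share a key
grouped = groupby(sorted(mapped), key=lambda kv: kv[0])

# reduce: merge each key's values into one result (here, the maximum)
print({city: max(t for _, t in group) for city, group in grouped})
# {'paris': 25, 'tokyo': 28}
```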
-
-
OpenTSDB
-
OpenTSDB is a distributed, scalable time series database built on top of HBase. It is designed to collect the monitoring information of a large-scale cluster and enable data queries in seconds.
OpenTSDB consists of a Time Series Daemon (TSD) and a set of command line utilities. Interaction with OpenTSDB is primarily implemented by running one or more TSDs. Each TSD is independent. There is no master server or shared state, so you can run as many TSDs as required to handle any load you throw at them. Each TSD uses HBase to store and retrieve time series data. The data schema is highly optimized for fast aggregations of similar time series to minimize the required storage space.
TSD users never need to directly access the underlying storage. You can communicate with the TSD through an HTTP API. All communications happen on the same port (the TSD determines the protocol of the client by looking at the first few bytes it receives).
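Writing and then querying a data point through the TSD HTTP API can be sketched with the requests library; the TSD host, metric name, and tags are placeholders:

```python
import time
import requests

tsd = "http://tsd.example.com:4242"  # placeholder TSD address

# Write one data point for metric sys.cpu.user, tagged with its host.
requests.post(f"{tsd}/api/put", json={
    "metric": "sys.cpu.user",
    "timestamp": int(time.time()),
    "value": 42.5,
    "tags": {"host": "web01"},
})

# Query the last hour, summed across all hosts with this metric.
resp = requests.get(f"{tsd}/api/query",
                    params={"start": "1h-ago", "m": "sum:sys.cpu.user"})
print(resp.json())
```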
-
-
Presto
-
Presto is an open source SQL query engine for running interactive analytic queries against data sources of all sizes. It is used to analyze massive volumes of structured and semi-structured data, to aggregate vast amounts of multi-dimensional data, and for ETL and ad-hoc queries.
Presto allows you to query data where it lives, including HDFS, Hive, HBase, Cassandra, relational databases or even proprietary data stores. A Presto query can combine different data sources to perform data analysis across the data sources.
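A cross-source query can be sketched with the presto-python-client; the coordinator host, catalogs, and tables are placeholders and assume the corresponding connectors are configured:

```python
import prestodb

conn = prestodb.dbapi.connect(host="presto.example.com", port=8080,
                              user="analyst", catalog="hive",
                              schema="default")
cur = conn.cursor()
# One query joining tables from two different connectors (Hive and MySQL).
cur.execute("""
    SELECT o.order_id, c.name
    FROM hive.default.orders AS o
    JOIN mysql.shop.customers AS c ON o.customer_id = c.customer_id
    LIMIT 10
""")
print(cur.fetchall())
```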
-
-
Ranger
-
Apache Ranger offers a centralized security management framework and supports unified authorization and auditing. It allows fine-grained access control over Hadoop and related components, such as HDFS, Hive, HBase, Kafka, and Storm. You can use the web UI provided by Ranger to configure policies to control user access to these components.
-
-
Spark
-
Spark is an open source parallel data processing framework. It helps you to easily develop unified big data applications to perform cooperative processing, stream processing, and interactive data analysis.
The Spark framework features fast computing, fast writes, and interactive queries, and it has clear performance advantages over Hadoop. Spark uses in-memory computing to avoid the I/O bottlenecks that arise when multiple tasks in a MapReduce workflow process the same dataset. Spark uses the Scala programming language, which enables distributed datasets to be processed in the same way as local data.
In addition to interactive data analysis, Spark supports interactive data mining. Its in-memory computing facilitates iterative algorithms, which suits data mining because mining applies iterative computing to the same data. Spark can run in YARN clusters where Hadoop 2.0 is installed.
Spark provides the Resilient Distributed Dataset (RDD), which ensures high performance and prevents high disk I/O while retaining various features like MapReduce fault tolerance, data locality, and scalability.
-
-
Storm
-
Apache Storm is a distributed, reliable, and fault-tolerant real-time stream data processing system. In Storm, a directed graph called topology is created to specify real-time application logic. The topology is submitted to a cluster. Then the master node in the cluster distributes codes and assigns tasks to worker nodes.
A topology contains two types of components: spouts and bolts. A spout emits data streams as tuples, which are immutable arrays of values that map to fixed fields. A bolt transforms those tuples, performing computing and filtering operations, and can send data on to other bolts.
-
-
Tez
-
Tez is an open source computing framework from Apache that supports Directed Acyclic Graph (DAG) jobs. It can convert multiple dependent jobs into a single job, greatly improving performance. When projects such as Hive and Pig use Tez instead of MapReduce as the backbone for data processing, response time is significantly reduced. Tez is built on YARN and can run MapReduce jobs without any modifications.
MRS uses Tez as the default execution engine of Hive, which provides higher execution efficiency than the MapReduce computing engine.
-
-
YARN
-
The Apache open source community introduced the unified resource management framework YARN to enable sharing of Hadoop clusters, improve their scalability and reliability, and eliminate the JobTracker performance bottleneck of the early MapReduce framework.
The fundamental idea of YARN is to split up the JobTracker's two major functionalities, resource management and job scheduling/monitoring, into separate daemons. It uses a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application can be a single job or a directed acyclic graph (DAG) of jobs.
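The split between the global ResourceManager and per-application masters is visible through the ResourceManager's REST API, sketched here with the requests library; the RM address is a placeholder:

```python
import requests

rm = "http://resourcemanager.example.com:8088"  # placeholder RM address

# Cluster-wide resource metrics reported by the global ResourceManager.
metrics = requests.get(f"{rm}/ws/v1/cluster/metrics").json()
print(metrics["clusterMetrics"]["totalMB"])

# Each application below has its own ApplicationMaster.
apps = requests.get(f"{rm}/ws/v1/cluster/apps",
                    params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```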
-