Function Overview_MapReduce Service-Huawei Cloud

Function Overview

MapReduce Service
- Big data is a major challenge of the Internet era: the data that needs to be handled has been growing explosively in both volume and diversity. Conventional data processing technologies, such as single-node storage and relational databases, cannot handle data at this scale and variety. To rise to this challenge, the Apache Software Foundation (ASF) launched Hadoop, an open source distributed computing platform that takes full advantage of the computing and storage capabilities of clusters to process massive amounts of data. However, big data solutions built on Hadoop can be expensive, inflexible, and slow to deploy, and their maintenance can be both challenging and inefficient.
  
  To solve these issues, HUAWEI CLOUD provides MapReduce Service (MRS). MRS allows you to quickly build and operate full-stack, cloud-native big data platforms. With MRS, you can deploy Hadoop clusters in a few clicks, then control the clusters and run big data components such as Hadoop, Spark, HBase, Kafka, and Storm with ease.
  
  MRS is fully compatible with open source APIs. It combines data lakes, data warehouses, business intelligence (BI), and artificial intelligence (AI) into a full-stack platform that can be further customized based on service requirements. MRS helps you extract new value from massive data and discover opportunities to grow your business.
  
  Available in all regions.
  
  MRS Product Architecture
  
  List of MRS Component Versions
Cluster Management
- MRS allows you to efficiently manage Hadoop systems on HUAWEI CLOUD.
  
  You can purchase the MRS clusters best suited to your services. After selecting the cluster type, version, and node specifications, you can easily deploy an enterprise-grade big data platform, tune parameters, and manage clusters by simply clicking your mouse.
- Buying a Cluster
  - The MRS console provides Quick Config and Custom Config for you to create your clusters with ease.
    
    Quickly Buying an MRS Cluster
    
    Manually Buying an MRS Cluster
- Scaling Out a Cluster
  - You can expand MRS storage and compute capabilities by adding Core and Task nodes without modifying the system architecture. Task nodes process data but do not store it. Core nodes process and store data.
    
    You do not need to update the client after scaling out clusters.
    
    Scaling Out a Cluster
- Scaling In a Cluster
  - To scale in a cluster, reduce the number of Core or Task nodes on the MRS console. MRS then automatically selects the nodes to remove and completes the scale-in. The number of Master nodes cannot be changed.
    
    Scaling In a Cluster
- Unsubscribing from a Node in a Yearly/Monthly Cluster
  - You can unsubscribe from a node in a yearly/monthly cluster based on service requirements.
    
    Constraints
    
    To ensure data reliability, MRS does not support node unsubscription if the number of analysis Core nodes in the cluster is less than or equal to the number of HDFS copies. To query the number of HDFS copies, choose Components > HDFS > Service Configuration, switch Basic to All, and check the dfs.replication value.
    
    MRS does not support unsubscription from nodes where the ZooKeeper service is deployed.
    Unsubscribing from a Node in a Yearly/Monthly Cluster
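The first constraint above amounts to a simple comparison, which can be sketched as a pre-check. This is only an illustration and not an MRS API: the function name `can_unsubscribe_core_node` and its parameters are hypothetical, and the replica count is assumed to be the `dfs.replication` value you read from the HDFS service configuration.

```python
def can_unsubscribe_core_node(core_node_count: int, dfs_replication: int) -> bool:
    """Return True if the cluster has more analysis Core nodes than HDFS copies.

    Hypothetical helper: mirrors the documented rule that node unsubscription
    is not supported when the number of analysis Core nodes is less than or
    equal to the number of HDFS copies (the dfs.replication value).
    """
    if core_node_count < 0 or dfs_replication < 1:
        raise ValueError("node count must be >= 0 and replication must be >= 1")
    return core_node_count > dfs_replication


if __name__ == "__main__":
    # Compare a cluster at the replica count with one just above it.
    print(can_unsubscribe_core_node(3, 3))
    print(can_unsubscribe_core_node(4, 3))
```

The check covers only the HDFS copy constraint; nodes that host the ZooKeeper service are excluded separately.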
- Autoscaling Task Nodes
  - In big data applications, especially real-time data analysis and processing, the number of cluster nodes needs to be dynamically increased or decreased with the data volume. MRS provides auto scaling to enable elastic cluster scale-out or scale-in based on workloads. In addition, if the data volume changes on a predictable schedule from day to day, you can use an MRS resource plan (by setting the Task node quantity based on time) to resize the cluster in advance of expected changes.
    
    Configuring an Auto Scaling Rule
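The idea of a time-based resource plan can be sketched as follows. This is a minimal illustration under stated assumptions, not how MRS implements auto scaling: the `PlanWindow` class, the `clamp_task_nodes` function, and the notion of a default node range outside any window are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class PlanWindow:
    """One hypothetical resource-plan entry: a daily time window and a node range."""
    start: time
    end: time
    min_nodes: int
    max_nodes: int


def clamp_task_nodes(current: int, now: time, plan: list[PlanWindow],
                     default_min: int, default_max: int) -> int:
    """Clamp the Task node count to the range of the window that contains `now`.

    Falls back to the default range when no window matches.
    """
    low, high = default_min, default_max
    for window in plan:
        if window.start <= now < window.end:
            low, high = window.min_nodes, window.max_nodes
            break
    return max(low, min(current, high))


if __name__ == "__main__":
    # Assumed plan: keep 5 to 10 Task nodes during the daytime peak.
    plan = [PlanWindow(time(8, 0), time(20, 0), 5, 10)]
    print(clamp_task_nodes(2, time(9, 30), plan, default_min=0, default_max=3))
```

In this sketch the plan raises the node count ahead of the expected daytime load, while workload-based rules would adjust the count only within the active range.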
- Upgrading Master Node Specifications
  - If Master nodes cannot meet service requirements, you can upgrade their specifications.
    
    Upgrading Master Node Specifications
File Management
- On the Files tab page of the MRS console, you can create and delete folders, and import, export, and delete files for an analysis cluster, but you cannot create files.
  
  Data import: MRS lets you import data from OBS to HDFS only. Larger files take longer to upload, so use this method when the data volume is small.
  
  Data export: After data is processed and analyzed, you can store the data in HDFS or export the data to OBS.
  Managing Data Files
Job Management
- MRS jobs are used to process and analyze user data. On the Jobs tab page of the MRS console, you can create and manage jobs and view information about all jobs created.
  
  Introduction to MRS Jobs
- Submitting a Flink Job
  - Flink is a distributed big data processing engine that performs stateful computations over finite and infinite data streams. You can create and submit Flink jobs to execute programs packaged in JAR format to process streaming data.
    
    Submitting a Flink Job
- Submitting a MapReduce Job
  - MapReduce is a distributed data processing framework that rapidly processes massive amounts of data in parallel. When you have a lot of data to process, you can create and submit MapReduce jobs to execute programs packaged in JAR format and take advantage of parallel processing.
    
    Submitting a MapReduce Job
- Submitting a Hive Job
  - Hive is an open source data warehouse based on Hadoop. You can submit SQL statements or scripts in HiveSQL jobs for data query and analysis. If sensitive information is involved, use a script instead of SQL statements.
    
    Submitting a Hive Job
- Submitting a Spark Job
  - Spark is a distributed in-memory computing framework. You can create and submit JAR and Python programs in Spark jobs to compute and process user data.
    
    Submitting a Spark Job
- Submitting a SparkSQL Job
  - Spark is a distributed in-memory computing framework. You can submit SQL statements or scripts in SparkSQL jobs for data query and analysis. Use a script instead of SQL statements if sensitive information is contained.
    
    Submitting a SparkSQL Job
O&M Management
- MRS provides a wide range of methods for you to manage cluster resources.
  
  Cluster O&M
- Logging In to a Cluster
  - When creating a cluster, you can set a private key or password for logging in to the ECSs.
    MRS supports two node access modes:
    
    Remote login (VNC)
    
    Login using a private key or password (SSH)
    
    Remote login is recommended only for emergency O&M. In other scenarios, it is better to log in to an ECS using SSH.
    Cluster Node Overview
- Determining Active and Standby Management Nodes of MRS Manager
  - You can log in to other nodes in the cluster from the Master node.
    
    Switchovers occur automatically between the Master nodes working in active/standby mode. Therefore, Master 1 may not always be the active management node.
    
    Determining Active and Standby Management Nodes of MRS Manager
- Manager
  - Manager provides a unified enterprise-class cluster management platform to help you complete installation, configuration, and management of big data components. Manager provides the following functions:
    
    Cluster monitoring: helps you to monitor the health status of hosts and services.
    
    Intuitive metric monitoring and customization: helps you to conveniently track key system details.
    
    Service configurations: allows you to easily configure service attributes to meet performance requirements.
    
    O&M: allows you to manage services, role instances, and clusters and start or stop services and clusters.
    MRS Manager Introduction
- EIP-based Access to a Cluster
  - You can bind an EIP to a cluster to enable access to the GUIs of the open-source components in the MRS cluster. This method is simple and easy to use.
    
    EIP-based Access
- Configuring Message Notification
  - After configuring Simple Message Notification (SMN), you can receive MRS cluster health status, updates, and component alarms through SMS or emails in real time.
    
    Configuring Message Notification
- Alarm Management
  - MRS provides real-time monitoring of big data clusters and alarms and events of different severity levels. You can also customize monitoring and alarm thresholds for key metrics. When monitoring data reaches the configured threshold, the system triggers an alarm so that you can handle it in a timely manner.
    
    Viewing the Alarm List
- Rolling Service Restart
  - After modifying configurations of a big data component, you need to restart the services to make new configurations take effect. However, restarting multiple services or instances at the same time will interrupt services.
    
    A rolling restart can minimize or prevent service interruptions by sequentially restarting services or instances in batches. For example, for instances in active/standby mode, the standby instance is restarted first and then the active instance. A rolling restart takes longer than a normal restart but helps ensure service continuity.
    
    Perform Rolling Service Restart
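The batching behavior described above can be sketched in a few lines. This is a simplified illustration, not the Manager implementation: the `rolling_restart` function, the `"name"` and `"role"` keys, and the `restart` callback are hypothetical, and a real system would also wait for each batch to become healthy before continuing.

```python
from typing import Callable, Iterable


def rolling_restart(instances: Iterable[dict], batch_size: int,
                    restart: Callable[[str], None]) -> list[list[str]]:
    """Restart instances in sequential batches, standby instances first.

    Each instance is a dict with hypothetical keys "name" and "role"
    ("standby" or "active"). Returns the batches in the order restarted.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Stable sort: standby instances are ordered before active ones.
    ordered = sorted(instances, key=lambda inst: inst["role"] != "standby")
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = [inst["name"] for inst in ordered[i:i + batch_size]]
        for name in batch:
            restart(name)  # placeholder for the actual restart call
        batches.append(batch)
    return batches


if __name__ == "__main__":
    nodes = [{"name": "nn1", "role": "active"}, {"name": "nn2", "role": "standby"}]
    print(rolling_restart(nodes, batch_size=1, restart=lambda name: None))
```

A smaller `batch_size` keeps more instances serving at any moment at the cost of a longer total restart time, which is the trade-off noted above.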
- Rolling Patch Installation
  - After a patch is installed or uninstalled for one or more services in a cluster, perform a rolling restart (sequentially restart services or instances in batches) to minimize or prevent service interruptions.
    
    Rolling Patch Installation
- Bootstrap Actions
  - Bootstrap actions let you write a script and have it run automatically before or after the cluster components are restarted. You can use bootstrap actions to install third-party software or modify the cluster environment on a given node.
    
    If you specify bootstrap actions for cluster scale-out, the script will be automatically executed for the newly added nodes. If auto scaling is enabled in a cluster, you can configure bootstrap actions in addition to a resource plan. Then the script will be automatically executed on the nodes scaled.
    
    Introduction to Bootstrap Actions
- O&M Authorization
  - If HUAWEI CLOUD technical support personnel need to troubleshoot your cluster, you must grant them O&M permissions.
    
    O&M Authorization
Operations Management
- The MRS console provides intuitive cluster lifecycle management, including billing management, cluster management, tag management, and audit management.
  
  Cluster Lifecycle Management
- Billing
  - MRS supports two billing modes: pay-per-use and yearly/monthly. A price calculator is provided on the MRS console to help you estimate the price of an MRS cluster.
    
    When you purchase an MRS cluster, the price is calculated and displayed on the MRS console automatically for your reference.
    
    Billing
- Unsubscribing from or Deleting a Cluster
  - For a yearly/monthly cluster, if the cluster is no longer required after job execution is complete, you can unsubscribe from it. After you unsubscribe, the cluster's resources and data are deleted and cannot be retrieved, so make sure that your data is backed up before you submit the unsubscription. For a pay-per-use cluster that is no longer required after job execution is complete, you can delete the cluster.
    
    Deleting a Cluster
- Enterprise Project Management
  - MRS provides enterprise project management to help you manage resources, personnel, permissions, and enterprise finances on the cloud.
    
    Enterprise project management is different from a standard management console. The standard management consoles are focused on control and configuration of individual cloud products. Enterprise project management is focused on control of enterprise resources, such as personnel, permissions, finances, and resources on the cloud.
    
    Enterprise Project Management
- Tag Management
  - A tag identifies a cluster. Adding tags to clusters can help you identify and manage your cluster resources.
    
    By associating Tag Management Service (TMS) with MRS, you can tag cloud resources so they can be located more quickly and managed more easily. You can view, modify, and delete big data clusters and other cloud resources.
    
    You can add tags when creating an MRS cluster or after the cluster is created through the Tags tab page of the cluster on the MRS console.
    
    You can add up to 10 tags to each cluster.
    
    Tag Management
- Audit Management
  - All operations on MRS Manager are logged for tracing and fault locating purposes.
    
    Audit Management
User Permission Management
- MRS interworks with the Identity and Access Management (IAM) service to implement permission-based access control. You can also synchronize an IAM user to MRS Manager and bind the MRS user role to the IAM user so that the IAM user can perform the authorized operations on MRS Manager.
  
  MRS also supports fine-grained access control for Object Storage Service (OBS) buckets by using OBS permission mapping.
  
  MRS Permissions Management
- IAM Permission Management
  - With IAM, you can use your HUAWEI CLOUD account to create IAM users for your employees and assign permissions on an individual basis to control their access to specific resources. For example, if you want to restrict some software developers in your enterprise from performing high-risk operations, such as deleting MRS clusters, you can create IAM users for the software developers and grant them permissions to use MRS cluster resources but not to delete them.
    
    Creating a User and Granting Permissions
- IAM User Synchronization
  - MRS supports IAM user synchronization, which allows IAM users to be synchronized to MRS Manager. The synchronized IAM users have basic permissions on MRS Manager. You can also bind an MRS Manager user role to a synchronized IAM user so that the IAM user can access MRS Manager and manage clusters. Note that the password that the IAM user uses to log in to MRS Manager must be reset by the admin user of MRS Manager.
    
    IAM User Synchronization
- OBS Permission Mapping
  - You can configure OBS access permissions to control access to specific directories in OBS buckets.
    
    For example, to enable a user group to access only the log files in a specified OBS bucket:
    
    1. Configure an agency for the MRS cluster to enable the ECS to access OBS by using an automatically obtained temporary AK/SK.
    
    2. On the IAM console, create a policy to allow access to only the log files in a specified OBS bucket, and create an agency to bind with the policy.
    
    3. Bind the agency created for the cluster with the user group. The user group will then have permissions to access only the logs in the specified OBS bucket.
    
    Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS
Decoupled Storage and Compute
- MRS uses decoupled storage and compute, so you can store data in OBS and use an MRS cluster for data computing only.
- Configuring a Cluster with Decoupled Storage and Compute
  - MRS allows you to store data in OBS and use a dedicated MRS cluster for data processing.
    
    You can use the IAM service to configure an agency for the MRS cluster so that the ECS can access OBS by using an automatically obtained temporary AK/SK. This prevents the AK/SK information from being exposed in the configuration file.
    
    You can also add the AK/SK information to the configuration file and access the OBS file system using obs://. In this way, you do not need to manually add the AK/SK and endpoint each time you run a task.
    
    Configuring a Cluster with Decoupled Storage and Compute (Agency)
    
    Configuring a Cluster with Decoupled Storage and Compute (AK/SK)
- Storing Metadata Externally
  - You can store MRS component metadata externally. For example, you can store Hive metadata in an external relational database (RDS) by creating a data connection between Hive and the RDS.
    
    Managing Data Connections
- Enabling Access to OBS by Mapping HDFS Addresses to OBS Addresses
  - By mapping HDFS addresses to OBS addresses, you can access data migrated from HDFS to OBS without changing the service logic.
    
    If you migrate metadata from HDFS to OBS, you can also configure the HDFS-OBS address mapping to implement data access to both OBS and HDFS.
    
    The HDFS-OBS mapping function does not support access through the REST API of WebHdfsFileSystem (a FileSystem for HDFS over the web).
    
    Accessing OBS by Mapping HDFS Addresses
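Conceptually, address mapping is a prefix rewrite applied before a path is resolved. The sketch below illustrates that idea only and does not reflect how MRS configures or implements the feature: the `map_hdfs_to_obs` function, the `hdfs://hacluster/...` prefix, and the `example-bucket` bucket name are all hypothetical.

```python
def map_hdfs_to_obs(path: str, mappings: dict[str, str]) -> str:
    """Rewrite an HDFS path to its OBS address using the longest matching prefix.

    `mappings` is a hypothetical table such as
    {"hdfs://hacluster/warehouse": "obs://example-bucket/warehouse"}.
    Paths with no matching prefix are returned unchanged, so data that is
    still stored in HDFS remains reachable.
    """
    for prefix in sorted(mappings, key=len, reverse=True):
        base = prefix.rstrip("/")
        # Match the directory itself or anything below it, not sibling names.
        if path.rstrip("/") == base or path.startswith(base + "/"):
            return mappings[prefix].rstrip("/") + path[len(base):]
    return path


if __name__ == "__main__":
    table = {"hdfs://hacluster/warehouse": "obs://example-bucket/warehouse"}
    print(map_hdfs_to_obs("hdfs://hacluster/warehouse/t1/part-0", table))
    print(map_hdfs_to_obs("hdfs://hacluster/tmp/job.log", table))
```

Because unmapped paths pass through untouched, the calling code keeps using HDFS addresses throughout, which is what allows the service logic to stay unchanged.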
CDL
- Change Data Loader (CDL) is a real-time data integration service based on Kafka Connect. The CDL service captures data change events from various OLTP databases and pushes them to Kafka. Then, Sink Connector pushes the events to the big data ecosystem.
  
  CDL
  
  Using CDL from Scratch
ClickHouse
- ClickHouse is an open-source columnar database oriented to online analytical processing (OLAP). It is independent of the Hadoop big data system and features a high compression ratio and fast query performance. ClickHouse supports SQL queries and performs especially well on aggregation analysis and queries over large, wide tables, where its query speed can be an order of magnitude faster than that of other analytical databases.
  
  ClickHouse
  
  Using ClickHouse from Scratch
Flink
- Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink is one of the most popular open source stream processing engines in the industry.
  
  Flink processes data in a pipelined, parallel manner, with millisecond latency and high reliability. It is ideal for latency-sensitive data processing.
  
  Flink
  
  Using Flink from Scratch
Flume
- Flume is a distributed, reliable, and highly available (HA) system for efficiently collecting, aggregating, and moving large amounts of log data. Flume supports customization of data senders in the log system for data collection. In addition, Flume can easily process data and write it to various customizable data recipients.
  
  Flume
  
  Installing the Flume Client
HBase
- HBase is an open source, column-oriented, distributed storage system suitable for storing massive amounts of unstructured or semi-structured data. It features high reliability, high performance, and flexible scalability, and supports real-time data read/write.
  
  The tables stored in HBase have the following features:
  
  A table can scale to hundreds of millions of rows and millions of columns.
  
  Column-oriented storage, retrieval, and permission control are provided.
  
  Null columns in tables do not occupy any storage space.
  
  The HBase component of MRS supports compute and storage decoupling. Data can be stored in low-cost cloud storage services, such as Object Storage Service (OBS), and backed up across AZs. MRS supports secondary indexing on HBase, which allows indexes to be added for column values to filter data by column through native HBase APIs.
  HBase
  
  Using HBase from Scratch
HetuEngine
- HetuEngine is a high-performance, interactive SQL analysis and data virtualization engine developed by Huawei. It seamlessly integrates with the big data ecosystem to implement interactive query of massive amounts of data within seconds, and supports cross-source and cross-domain unified data access to enable one-stop SQL convergence analysis in the data lake, between lakes, and between lakehouses.
  
  HetuEngine Product Overview
  
  Using HetuEngine from Scratch
HDFS
- Hadoop Distributed File System (HDFS) enables reliable, distributed read/write of massive amounts of data. HDFS suits write-once-read-many applications and sequential (append-only) writes. A file in HDFS can have only one writer at a time, but it can have multiple readers.
  
  HDFS
Hive
- Hive is a data warehouse infrastructure built on Hadoop. Using extract, transform, and load (ETL) tools, it lets you store, query, and analyze Hadoop big data. Hive has defined HiveQL, a simple SQL-like query language, for SQL veterans to query data. Hive data computing uses MapReduce, Spark, and Tez.
  
  Hive
  
  Using Hive from Scratch
Hue
- Hue is a group of web applications that interact with MRS big data components. It helps you browse HDFS, perform Hive queries, and start MapReduce jobs. Hue supports applications that interact with all MRS big data components.
  
  Hue provides a file browser and a query editor:
  
  File browser: allows you to browse and manage different HDFS directories on the GUI.
  
  Query editor: allows you to easily create, manage, and execute SQL statements to query data stored on HDFS, HBase, and Hive, and export the results to an Excel file.
  Hue
  
  Accessing the Hue Web UI
IoTDB
- Database for Internet of Things (IoTDB) is a software system that collects, stores, manages, and analyzes IoT time series data. Apache IoTDB uses a lightweight architecture and features high performance and rich functions.
  
  IoTDB sorts time series and stores indexes and chunks, greatly improving the query performance of time series data. IoTDB uses the Raft protocol to ensure data consistency. In time series scenarios, IoTDB pre-computes and stores data to improve analysis performance. Based on the characteristics of time series data, IoTDB provides powerful data encoding and compression capabilities. In addition, its replica mechanism ensures data security. IoTDB is deeply integrated with Apache Hadoop and Flink to meet the requirements of massive data storage, high-speed data reading, and complex data analysis in the industrial IoT field.
  
  IoTDB Basic Principles
  
  Using IoTDB from Scratch
Impala
- Impala provides fast, interactive SQL queries directly on your data stored in HDFS, HBase, or Object Storage Service (OBS). In addition to using the same storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala UI in Hue) as Apache Hive. It provides a unified query platform for real-time or batch processing.
  
  As a supplement to big data query tools, Impala does not replace the batch processing frameworks built on MapReduce, such as Hive. Hive and other MapReduce-based frameworks are optimal for processing long-running batch jobs.
  
  Impala
  
  Using Impala from Scratch
Kafka
- Kafka is a distributed, partitioned, and replicated publish-subscribe messaging system. It provides features similar to Java Message Service (JMS), but with a unique design. Kafka offers high throughput, distributed real-time processing, multi-client support, and message persistence. It applies to online and offline message consumption, such as regular message collection, website activity tracking, aggregation of statistical system operation data (monitoring data), and log collection, as well as mass data collection for Internet services.
  
  Kafka
  
  Managing Kafka Topics
Kudu
- Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operations.
  
  Kudu
  
  Using Kudu from Scratch
Loader
- Based on the open source Sqoop, Loader is used to exchange data and files between MRS and relational databases or file systems. With Loader, you can import data from relational databases or file servers to HDFS and HBase of MRS, or export data from HDFS and HBase to relational databases or file servers.
  
  Loader
  
  Loader Introduction
MapReduce
- MapReduce is the core of Hadoop. It is a software architecture proposed by Google for parallel computing of datasets larger than 1 TB. The name alludes to the map and reduce functions used in functional programming, although some features are from vector programming.
  
  With MapReduce, a map function is created to map a group of key-value pairs to a new group of key-value pairs, and a reduce function is created to merge all values in the mapped key-value pairs that share the same key.
  
  MapReduce
  
  Using Hadoop from Scratch
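The map and reduce roles can be illustrated with a single-process word count. This is a teaching sketch in plain Python, not Hadoop API code: the function names are chosen here for illustration, and the `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line_no: int, line: str) -> Iterator[tuple[str, int]]:
    """Map one (line number, text) pair to a new group of (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group mapped values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Merge all values that share the same key into a single count."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for i, line in enumerate(lines) for pair in map_words(i, line))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce calls run in parallel on different nodes; the sketch keeps only the data flow.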
Presto
- Presto is an open source SQL query engine for running interactive analytic queries against data sources of all sizes. It is used to analyze massive volumes of structured and semi-structured data, to aggregate vast amounts of multi-dimensional data, and for ETL and ad-hoc queries.
  
  Presto allows you to query data where it lives, including HDFS, Hive, HBase, Cassandra, relational databases or even proprietary data stores. A Presto query can combine different data sources to perform data analysis across the data sources.
  
  Presto
  
  Accessing the Presto Web UI
Ranger
- Apache Ranger offers a centralized security management framework and supports unified authorization and auditing. It allows fine-grained access control over Hadoop and related components, such as HDFS, Hive, HBase, Kafka, and Storm. You can use the web UI provided by Ranger to configure policies to control user access to these components.
  
  Ranger
  
  Accessing the Ranger Web UI and Synchronizing Unix users to the Ranger Web UI
Spark
- Spark is an open source parallel data processing framework. It helps you to easily develop unified big data applications to perform cooperative processing, stream processing, and interactive data analysis.
  
  The Spark framework features fast computing, fast writes, and interactive queries. Spark has clear performance advantages over Hadoop MapReduce: it uses in-memory computing to avoid the I/O bottlenecks that arise when multiple tasks in a MapReduce workflow process the same dataset. Spark uses the Scala programming language, which enables distributed datasets to be processed in the same way as local data.
  
  In addition to interactive data analysis, Spark supports interactive data mining. Spark utilizes in-memory computing to facilitate implementation of iterative algorithms, while data mining is implemented by applying iterative computing on the same data. Spark can run in Yarn clusters where Hadoop 2.0 is installed.
  
  Spark provides the Resilient Distributed Dataset (RDD), which ensures high performance and prevents high disk I/O while retaining various features like MapReduce fault tolerance, data locality, and scalability.
  
  Spark
  
  Using Spark from Scratch
Tez
- Tez is an open source computing framework from Apache that supports Directed Acyclic Graph (DAG) jobs. It can convert multiple dependent jobs into one job to greatly improve the performance. If projects like Hive and Pig use Tez instead of MapReduce as the backbone to process data, response time will be significantly reduced. Tez is built on Yarn and can run MapReduce jobs without any modifications.
  
  MRS uses Tez as the default execution engine of Hive, which provides higher execution efficiency than the MapReduce computing engine.
  
  Tez
YARN
- The Apache open source community introduced the unified resource management framework YARN to enable sharing of Hadoop clusters, improve scalability and reliability, and eliminate the performance bottleneck of JobTracker in the early MapReduce framework.
  
  The fundamental idea of YARN is to split up the JobTracker's two major functionalities, resource management and job scheduling/monitoring, into separate daemons. It uses a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application can be a single job or a directed acyclic graph (DAG) of jobs.
  
  YARN
