
Best Practices for Beginners

After an MRS cluster is deployed, you can try some practices provided by MRS to meet your service requirements.

Table 1 Best practices

Data analytics

Using Spark2x to Analyze IoV Drivers' Driving Behavior

This practice describes how to use the Spark2x component to analyze IoV drivers' driving behavior. By collecting statistics on violations, such as sudden acceleration or deceleration, coasting, speeding, and fatigue driving, within a specified period, you will get familiar with the basic functions of MRS.
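A minimal Spark SQL sketch of this kind of statistic, assuming a hypothetical driver_behavior table with driver_id, event_type, and event_time columns (the actual schema comes from the practice's sample data):

    -- Count violations per driver and type over a given period (hypothetical schema).
    SELECT driver_id,
           event_type,          -- e.g. sudden_acceleration, speeding, fatigue_driving
           COUNT(*) AS violations
    FROM driver_behavior
    WHERE event_time BETWEEN '2022-01-01' AND '2022-01-31'
    GROUP BY driver_id, event_type
    ORDER BY violations DESC;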

Using Hive to Load HDFS Data and Analyze Book Scores

This practice describes how to use Hive to import and analyze raw data and how to build elastic, affordable offline big data analytics. The raw data in this practice is reader comments from the backend of a book website. After the data is imported into a Hive table, you can run SQL statements to query the most popular best-selling books.
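A minimal HQL sketch of that flow, with hypothetical table, column, and path names standing in for the practice's sample data:

    -- Create a table over the raw comment data and load it from HDFS.
    CREATE TABLE book_score (book_name STRING, score INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/tmp/book_score.csv' INTO TABLE book_score;

    -- Rank books by average score to find the most popular best-sellers.
    SELECT book_name, AVG(score) AS avg_score
    FROM book_score
    GROUP BY book_name
    ORDER BY avg_score DESC
    LIMIT 10;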

Using Hive to Load OBS Data and Analyze Enterprise Employee Information

This practice describes how to use Hive to import and analyze raw data stored in OBS and how to build elastic, affordable big data analytics on decoupled storage and compute resources. You will develop a Hive data analysis application and, after connecting to Hive through the client, run HQL statements to access Hive data stored in OBS, for example, to manage and query enterprise employee information.
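Access to OBS-resident data typically comes down to pointing a table's LOCATION at an OBS path; a minimal sketch with placeholder bucket, table, and column names:

    -- Map a Hive table onto data already stored in OBS (bucket and path are placeholders).
    CREATE EXTERNAL TABLE employee_info (id INT, name STRING, department STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'obs://demo-bucket/employee_info/';

    -- Query the OBS-backed table like any other Hive table.
    SELECT department, COUNT(*) AS headcount
    FROM employee_info
    GROUP BY department;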

Using Flink Jobs to Process OBS Data

This practice describes how to use the built-in Flink WordCount program of an MRS cluster to analyze source data stored in the OBS file system and count the occurrences of specified words.

MRS supports decoupled storage and compute in scenarios where a large storage capacity is required and compute resources need to be scaled on demand. This allows you to store your data in OBS and use an MRS cluster only for data computing.
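The open-source Flink WordCount example accepts --input and --output arguments, so a submission against OBS roughly looks as follows (the JAR path and OBS paths are placeholders that depend on your client installation and bucket layout):

    # Submit the built-in WordCount example, reading from and writing to OBS.
    flink run /opt/client/Flink/flink/examples/batch/WordCount.jar \
        --input obs://demo-bucket/input/word.txt \
        --output obs://demo-bucket/output/result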

Data migration

Data Migration Solution

This practice describes how to migrate HDFS, HBase, and Hive data to an MRS cluster in different scenarios.

It walks you through preparing for the migration, exporting metadata, copying data, and restoring data.

Migrating Data from Hadoop to MRS

In this practice, CDM is used to migrate data (dozens of terabytes or less) from Hadoop clusters to MRS.

Migrating Data from HBase to MRS

In this practice, CDM is used to migrate data (dozens of terabytes or less) from HBase clusters to MRS. HBase stores its data, including HFiles and WALs, in HDFS; the HDFS path is specified by the hbase.rootdir configuration item. By default, data is stored in the /hbase directory on MRS.

Some built-in HBase mechanisms and tool commands can also be used to migrate data, for example, snapshot export, the Export/Import utilities, and CopyTable.
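As a sketch of the snapshot route, the open-source HBase commands look like this (table name, snapshot name, and cluster address are placeholders):

    # In the HBase shell on the source cluster, snapshot the table.
    hbase shell
    > snapshot 'member_table', 'member_snapshot'
    > exit

    # Export the snapshot to the destination cluster's HDFS.
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot member_snapshot \
        -copy-to hdfs://dest-cluster:8020/hbase \
        -mappers 16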

Migrating Data from Hive to MRS

In this practice, CDM is used to migrate data (dozens of terabytes or less) from Hive clusters to MRS.

Hive data migration consists of two parts:

  • Hive metadata, which is stored in a relational database such as MySQL. By default, the metadata of an MRS Hive cluster is stored in MRS DBService (a Huawei GaussDB database). You can also use RDS for MySQL as an external metadata database; a metadata export sketch follows this list.
  • Hive service data, which is stored in HDFS or OBS.
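If the source metastore is MySQL, a common way to capture the metadata part is a plain mysqldump; a sketch, with the host, user, and database name as placeholders:

    # Dump the Hive metastore database from the source MySQL instance.
    mysqldump -h mysql-host -u hiveuser -p --databases hivemeta > hivemeta_backup.sql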

Migrating Data from MySQL to an MRS Hive Partitioned Table

This practice demonstrates how to use CDM to import MySQL data into a Hive partitioned table in an MRS cluster.

Hive supports SQL, which helps you extract, transform, and load (ETL) large-scale data sets. Because queries on large data sets can take a long time, in many scenarios you can create Hive partitions to reduce the amount of data scanned by each query, which significantly improves query performance.
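A sketch of the partition-pruning idea with hypothetical names: queries that filter on the partition column scan only the matching partitions instead of the whole table.

    -- Partition the table by date so each day's data lands in its own directory.
    CREATE TABLE trade_detail (trade_id STRING, amount DOUBLE)
      PARTITIONED BY (trade_date STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- This query reads only the 2024-01-01 partition, not the full table.
    SELECT COUNT(*) FROM trade_detail WHERE trade_date = '2024-01-01';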

Migrating Data from MRS HDFS to OBS

This practice demonstrates how to migrate file data from MRS HDFS to OBS using CDM.
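CDM drives this migration from its console rather than from the command line. For orientation only, the copy is conceptually similar to a DistCp run from HDFS to an OBS path; a sketch, assuming the cluster is already configured with OBS access (paths and bucket are placeholders):

    # Copy a directory tree from MRS HDFS to OBS.
    hadoop distcp /user/hive/warehouse/demo obs://demo-bucket/warehouse/demo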

System Interconnection

Using DBeaver to Access Phoenix

This practice describes how to use DBeaver to access Phoenix.

The local DBeaver can connect to the HBase component in the MRS cluster through the Phoenix JAR package. Once connected, you can create an HBase table and insert data into it using DBeaver.
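Once DBeaver is connected, the statements you run are ordinary Phoenix SQL; a minimal sketch with placeholder names (note that Phoenix writes rows with UPSERT rather than INSERT):

    -- Create an HBase table through Phoenix and write a row into it.
    CREATE TABLE IF NOT EXISTS user_info (
        id   INTEGER PRIMARY KEY,
        name VARCHAR,
        age  INTEGER
    );
    UPSERT INTO user_info VALUES (1, 'Alice', 30);
    SELECT * FROM user_info;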

Using DBeaver to Access HetuEngine

This practice describes how to use DBeaver to access HetuEngine.

The local DBeaver can connect to the HetuEngine component in the MRS cluster through the JDBC JAR package. Once connected, you can use DBeaver to view information about the data sources connected to HetuEngine.
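After the connection is established, standard HetuEngine SQL run from DBeaver's SQL editor is enough to inspect the attached data sources, for example:

    -- List the catalogs (data sources) registered with HetuEngine.
    SHOW CATALOGS;
    -- List the schemas available in one catalog (the catalog name is a placeholder).
    SHOW SCHEMAS FROM hive;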

Interconnecting Hive with External Self-Built Relational Databases

This practice describes how to use Hive to connect to open-source MySQL and PostgreSQL databases.

If an external metadata database is deployed for a cluster that already contains Hive data, the original metadata tables are not automatically synchronized. Therefore, before installing Hive, determine whether to store metadata in an external database or in DBService. If you choose an external database, deploy it when installing Hive or while the cluster still has no Hive data. After Hive is installed, the metadata storage location cannot be changed; otherwise, the original metadata will be lost.
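In open-source Hive, the external metastore is selected through the standard JDO connection properties in hive-site.xml; a sketch with placeholder values (on MRS, the equivalent settings are exposed through the cluster management console):

    <!-- Point the Hive metastore at an external MySQL database (values are placeholders). -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://mysql-host:3306/hivemeta</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>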

Interconnecting Hive with CSS

This practice describes how to use Hive to interconnect with CSS Elasticsearch.

In this practice, you will use the Elasticsearch-Hadoop plug-in to exchange data between Hive and Elasticsearch in Cloud Search Service (CSS), so that Elasticsearch index data can be mapped to Hive tables.
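The mapping itself is declared with the Elasticsearch-Hadoop storage handler; a minimal sketch with placeholder index, node address, and column names:

    -- Map a Hive external table onto an Elasticsearch index via the ES-Hadoop plug-in.
    -- 'es.nodes' is the CSS Elasticsearch endpoint and 'es.resource' the target index
    -- (both placeholders).
    CREATE EXTERNAL TABLE es_books (id STRING, title STRING, score INT)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES ('es.nodes' = 'css-node-ip:9200', 'es.resource' = 'book_index');
    SELECT * FROM es_books LIMIT 10;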