Running a MapReduce Job
MapReduce is a programming model for large-scale data processing. It divides a complex data processing task into two main phases: "map" and "reduce". The map phase splits a large amount of data into small chunks for parallel processing; each node independently executes the user-defined map function and produces intermediate key/value pairs. The reduce phase groups all values with the same key, applies the reduce function to aggregate and summarize them, and produces a final set of key/value pairs. This distributed computing framework efficiently processes PB-scale data and overcomes the performance limitations of traditional single-node processing through fault tolerance and automatic task scheduling. It is widely used in scenarios such as search engine indexing, log analysis, and data statistics.
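As a conceptual analogy only (this is not how an MRS job executes), the map/shuffle/reduce flow of a word count can be sketched with standard Unix tools:

```shell
# Conceptual analogy only: word count expressed as map -> shuffle -> reduce
# with standard Unix tools (this is not how an MRS job actually executes).
printf 'hello world\nhello mapreduce\n' > /tmp/mr_demo.txt

# "Map": split each line into words, emitting one key per output line.
# "Shuffle": sort brings identical keys next to each other.
# "Reduce": uniq -c sums the occurrences of each key.
tr -s ' ' '\n' < /tmp/mr_demo.txt | sort | uniq -c
# output (whitespace may vary): 2 hello / 1 mapreduce / 1 world
```

In a real MapReduce job, each of these stages runs in parallel across cluster nodes rather than in a single pipeline.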
MRS allows you to submit and run your own programs and obtain the results. This section describes how to submit a MapReduce job in an MRS cluster.
You can create a job online and submit it for running on the MRS console, or submit a job in CLI mode on the MRS cluster client.
Prerequisites
- You have uploaded the program packages and data files required by jobs to OBS or HDFS.
- If the job program needs to read and analyze data in the OBS file system, you need to configure storage-compute decoupling for the MRS cluster. For details, see Configuring Storage-Compute Decoupling for an MRS Cluster.
Notes and Constraints
- When the policy of the user group to which an IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, or vice versa, wait for five minutes after user synchronization for the System Security Services Daemon (SSSD) cache of the cluster node to refresh. Submit a job on the MRS console after the new policy takes effect. Otherwise, the job submission may fail.
- If the IAM username contains spaces (for example, admin 01), you cannot create jobs on the MRS console.
Video Tutorial
This tutorial demonstrates how to submit and view a MapReduce job on the cluster management page of the MRS console.
The UI may vary depending on the version. This tutorial is for reference only.
Submitting a Job
You can create and run jobs online using the management console or submit jobs by running commands on the cluster client.
- Prepare the application and data.
This section uses the Hadoop word count application as an example. You can obtain the sample program from the MRS cluster client (Client installation directory/HDFS/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-XXX.jar) and upload the program to a specified directory in HDFS or OBS. For details, see Uploading Application Data to an MRS Cluster.
To run the application, you need to specify the following parameters:
- Program class name: Specified by your program. In this application, the class name is wordcount.
- Input file path: Path of the data file to be analyzed. The file must be uploaded to the HDFS or OBS file system in advance.
For example, upload data file data1.txt. The file content is as follows:
This is a test job. MRS supports multiple job submission modes.
- Output file path: Path of the result file after the application counts words. Set this parameter to a directory that does not exist. The directory will be automatically generated after you run the application.
- Log in to the MRS console.
- On the Active Clusters page, select a running cluster and click its name to switch to the cluster details page.
- On the Dashboard page, click Synchronize on the right side of IAM User Sync to synchronize IAM users.
Perform this step only when Kerberos authentication is enabled for the cluster.
After IAM user synchronization, wait for five minutes before submitting a job. For details about IAM user synchronization, see Synchronizing IAM Users to MRS.
- Click Job Management. On the displayed job list page, click Create.
- In Type, select MapReduce. Configure other job information.
Figure 1 Creating a MapReduce job
Table 1 Job parameters
- Name: Job name. It can contain 1 to 128 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed.
  Example: mapreduce_job
- Program Path: Path of the program package to be executed. You can enter the path or click HDFS or OBS to select a file.
  - The path can contain a maximum of 1,023 characters. It cannot contain the special characters ;|&>,<'$ and cannot be left blank or consist only of spaces.
  - An OBS program path starts with obs://. An HDFS program path starts with hdfs://hacluster, for example, hdfs://hacluster/user/XXX.jar.
  - The MapReduce job execution program must end with .jar.
  Example: obs://mrs-demotest/program/hadoop-mapreduce-examples-XXX.jar
- Runtime Parameters: (Optional) Key parameters for program execution. Use spaces to separate multiple parameters.
  The format for the example program in this section is: Program class name Data input path Data output path
  - Program class name: Specified by your program. MRS only transfers the parameters.
  - Data input path: Click HDFS or OBS to select a path, or enter a correct path manually.
  - Data output path: Output path of the data processing result. Enter a directory that does not exist. This parameter can contain a maximum of 150,000 characters. It cannot contain the special characters ;|&><'$, but can be left blank.
  CAUTION: When entering a parameter that contains sensitive information (for example, a login password), add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. When you view job information on the MRS console, the sensitive information is displayed as *. For example: username=testuser @password=User password
  Example: wordcount obs://mrs-demotest/input/data1.txt obs://mrs-demotest/output/demo1
- Service Parameter: (Optional) Service parameters for the job.
  Changes made here apply only to the current job. To apply changes permanently to the entire cluster, see Modifying the Configuration Parameters of an MRS Cluster Component.
  For example, if storage-compute decoupling is not configured for the MRS cluster and jobs need to access OBS using an AK/SK pair, you can add the following service parameters:
  - fs.obs.access.key: access key ID for accessing OBS.
  - fs.obs.secret.key: secret access key corresponding to the access key ID.
  Example: leave blank
- Command Reference: The command submitted to the background for execution when the job is submitted.
  Example: N/A
- Confirm job configuration information and click OK.
- After the job is submitted, you can view the job running status and execution result in the job list. After the job status changes to Completed, you can view the analysis result of related programs.
In this example, you can view the data statistics in the specified OBS output directory.
Figure 2 Viewing the job execution result
During job execution, you can click View Log or choose More > View Details to view program execution details. If the job execution is abnormal or fails, you can locate the fault based on the error information.
A created job cannot be modified. If you need to execute the job again, you can click Clone to quickly copy the created job and adjust required parameters.
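As a local sanity check (not an MRS command), you can approximate the result that the wordcount program produces for data1.txt with standard tools. Like the Hadoop sample program, this tokenizes on whitespace, so "job." (with the period) and "job" are counted as different words:

```shell
# Local sanity check: approximate the wordcount result for data1.txt.
# Tokenization is on whitespace, matching the Hadoop sample program, so
# punctuation stays attached to words ("job." and "job" count separately).
printf 'This is a test job. MRS supports multiple job submission modes.\n' > /tmp/data1.txt
tr -s ' ' '\n' < /tmp/data1.txt | sort | uniq -c
```

Every token in the sample sentence appears once, so each line of the output shows a count of 1.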
To submit a job on the cluster client, perform the following steps:
- Prepare the application and data as described in step 1 of the console-based procedure above: upload the word count sample program (hadoop-mapreduce-examples-XXX.jar from the client installation directory) and the data file to be analyzed (for example, data1.txt) to HDFS or OBS.
- If Kerberos authentication has been enabled for the current cluster, create a service user with job submission permissions on FusionInsight Manager in advance. For details, see Creating an MRS Cluster User.
In this example, create human-machine user testuser, and associate the user with user group supergroup and role System_administrator.
- Install an MRS cluster client.
For details, see Installing an MRS Cluster Client.
By default, the MRS cluster comes with a preinstalled client that can be used to submit jobs directly. In MRS 3.x or later, the default client installation path on the Master node is /opt/Bigdata/client. In versions earlier than MRS 3.x, it is /opt/client.
- Log in to the node where the client is located as the MRS cluster client installation user.
For details, see Logging In to an MRS Cluster Node.
- Run the following command to go to the client installation directory:
cd /opt/Bigdata/client
Run the following command to load the environment variables:
source bigdata_env
If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled, skip the kinit command.
kinit testuser
- Run a command to submit the word count job.
You can run the yarn or hadoop jar command to submit a MapReduce job.
yarn commands are primarily used for resource management and job scheduling. These commands allow you to manage resources in a YARN cluster and control the lifecycle of jobs. You can run these commands to submit MapReduce and Spark jobs and query and monitor jobs.
hadoop jar commands are mainly used to submit Java-based MapReduce jobs. These commands enable you to start and run MapReduce jobs, use the main class within the JAR file as the entry point of the jobs, and pass parameters such as the input and output paths to the jobs.
To use the hadoop jar command to submit a sample program job, the command format is as follows:
hadoop jar Application wordcount Input file path Output file path
For example, run the following command to use the sample program to count the words in the /tmp/data/data1.txt file in HDFS and write the result to the /tmp/output/demo directory in HDFS:
hadoop jar HDFS/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /tmp/data/data1.txt /tmp/output/demo
If storage-compute decoupling is not configured for the MRS cluster, jobs need to access OBS using an AK/SK pair. An example command is as follows:
hadoop jar HDFS/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount -Dfs.obs.access.key=Access key ID for accessing OBS -Dfs.obs.secret.key=Secret access key corresponding to the access key ID for accessing OBS "obs://mrs-demotest/input/data1.txt" "obs://mrs-demotest/output/demo"
- Commands carrying authentication passwords pose security risks. Disable historical command recording before running such commands to prevent information leakage.
- To obtain the AK and SK, log in to the OBS console and choose My Credentials > Access Keys from the username drop-down list in the upper right corner of the page.
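The note above about disabling historical command recording can be done as follows in a Bash session (a sketch; `set +o history` and `set -o history` are standard Bash options):

```shell
# Turn off Bash history before typing a command that contains an AK/SK
# or a password, so the secret is not written to ~/.bash_history.
set +o history

# ... run the sensitive command here, for example the hadoop jar command
# with -Dfs.obs.access.key=... -Dfs.obs.secret.key=... ...

# Re-enable history once the sensitive command has finished.
set -o history
```

This only prevents the secret from being recorded in the shell history file; rotate the AK/SK if you suspect it has been exposed elsewhere.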
- After the job is successfully submitted and executed, run the following command to view the data statistics result in the specified HDFS output directory:
hdfs dfs -ls /tmp/output
Figure 3 Viewing the job execution result
- Log in to FusionInsight Manager as user testuser, choose Cluster > Services > Yarn, and click the hyperlink on the right of ResourceManager Web UI to access the YARN web UI. Click the application ID of the job to view the job running information and related logs.
Figure 4 Viewing MapReduce job details
Helpful Links
- You can view logs of each job created on the MRS console. For details, see Viewing MRS Job Details and Logs.
- If Kerberos authentication is enabled for a cluster but IAM user synchronization has not been performed, an error is reported when you submit a job. For details about how to handle the error, see What Can I Do If the System Displays a Message Indicating that the Current User Does Not Exist on Manager When I Submit a Job?
- After a job is submitted, you can view the logs of a specified YARN task. For details, see How Do I View Logs of a Specified YARN Task?
- For more MRS application development sample programs, see MRS Developer Guide.