Updated on 2024-08-16 GMT+08:00

MapReduce Java APIs

Common MapReduce APIs

Common classes in MapReduce are as follows:

  • org.apache.hadoop.mapreduce.Job: API for users to submit MapReduce jobs. It is used to set job parameters, submit jobs, control job execution, and query job status.
  • org.apache.hadoop.mapred.JobConf: the configuration class for MapReduce jobs and the main interface through which users describe a job to Hadoop.

Table 1 Common APIs of org.apache.hadoop.mapreduce.Job

Method

Description

Job(Configuration conf, String jobName), Job(Configuration conf)

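Creates a MapReduce client for configuring job attributes and submitting a job. In Hadoop 2.x and later, these constructors are deprecated in favor of the static factory method Job.getInstance(Configuration conf, String jobName).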

setMapperClass(Class<? extends Mapper> cls)

A core API used to specify the Mapper class of a MapReduce job. No custom Mapper class is set by default; in that case the base Mapper class is used, which passes each input record through unchanged. You can also configure mapreduce.job.map.class in mapred-site.xml.

setReducerClass(Class<? extends Reducer> cls)

A core API used to specify the Reducer class of a MapReduce job. No custom Reducer class is set by default; in that case the base Reducer class is used, which writes each input value to the output unchanged. You can also configure mapreduce.job.reduce.class in mapred-site.xml.

setCombinerClass(Class<? extends Reducer> cls)

Specifies the Combiner class of a MapReduce job. No Combiner is used by default. You can also configure mapreduce.job.combine.class in mapred-site.xml. A Combiner can be used only when the input key and value types of the reduce task are the same as its output key and value types.

setInputFormatClass(Class<? extends InputFormat> cls)

A core API used to specify the InputFormat class of a MapReduce job. The default InputFormat class is TextInputFormat. You can also configure mapreduce.job.inputformat.class in mapred-site.xml. The InputFormat class determines how input data in a given format is read and how it is divided into input splits.

setJarByClass(Class<?> cls)

A core API used to locate the job JAR file from a class. Hadoop finds the local JAR file that contains the specified class and uploads that JAR file to HDFS.

setJar(String jar)

Specifies the local path of the job JAR file directly. The JAR file is uploaded to HDFS. Use either setJar(String jar) or setJarByClass(Class<?> cls), not both. You can also configure mapreduce.job.jar in mapred-site.xml.

setOutputFormatClass(Class<? extends OutputFormat> theClass)

A core API used to specify the OutputFormat class of a MapReduce job, which defines the data format of the output. The default OutputFormat class is TextOutputFormat, which writes each key and value as text. You can also configure mapreduce.job.outputformat.class in mapred-site.xml. In most cases, the OutputFormat class does not need to be specified.

setOutputKeyClass(Class<?> theClass)

A core API used to specify the output key type of a MapReduce job. You can also configure mapreduce.job.output.key.class in mapred-site.xml.

setOutputValueClass(Class<?> theClass)

A core API used to specify the output value type of a MapReduce job. You can also configure mapreduce.job.output.value.class in mapred-site.xml.

setPartitionerClass(Class<? extends Partitioner> theClass)

Specifies the Partitioner class of a MapReduce job. You can also configure mapreduce.job.partitioner.class in mapred-site.xml. The Partitioner determines which Reduce task receives each Map output record. HashPartitioner is used by default and distributes the key-value pairs of a Map task roughly evenly across the Reduce tasks based on the hash of the key. If records must be routed by other rules, for example in HBase applications where different key-value pairs belong to different regions, you must specify a custom Partitioner class.

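setSortComparatorClass(Class<? extends RawComparator> cls)

Specifies the comparator that controls how Map output keys are sorted before they are passed to the Reducer. By default, the comparator registered for the key type is used. You can also configure mapreduce.job.output.key.comparator.class in mapred-site.xml. This method does not control compression: Map output is not compressed by default, and compression is enabled through mapreduce.map.output.compress and mapreduce.map.output.compress.codec in mapred-site.xml. Compressing intermediate data reduces network load when Map tasks output a large amount of data.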

setPriority(JobPriority priority)

Specifies the priority of a MapReduce job. Five priorities can be set: VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW. The default priority is NORMAL. You can also configure mapreduce.job.priority in mapred-site.xml.
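
The default partitioning described for setPartitionerClass in Table 1 can be illustrated with a minimal plain-Java sketch. It does not use any Hadoop classes: the class name HashPartitionSketch, the helper partitionFor, and the sample keys are hypothetical and exist only for illustration. The arithmetic follows the approach used by HashPartitioner: clear the sign bit of the key's hash code, then take the remainder by the number of Reduce tasks.

```java
public class HashPartitionSketch {

    // Hypothetical helper: returns the index of the Reduce task for a key.
    // Clearing the sign bit keeps the result non-negative even when
    // hashCode() is negative.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed value for the example
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String key : keys) {
            // Equal keys always map to the same Reduce task.
            System.out.println(key + " -> reduce task " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the result depends only on the key's hash code and the number of Reduce tasks, every record with the same key reaches the same Reduce task. A custom Partitioner is needed when records must be grouped by some other rule.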

Table 2 Common APIs of org.apache.hadoop.mapred.JobConf

Method

Description

setNumMapTasks(int n)

A core API used to specify the number of Map tasks in a MapReduce job. You can also configure mapreduce.job.maps in mapred-site.xml.

NOTE:

The InputFormat class controls the number of Map tasks. Ensure that the InputFormat class allows the number of Map tasks to be set on the client.

setNumReduceTasks(int n)

A core API used to specify the number of Reduce tasks in a MapReduce job. Only one Reduce task is started by default. You can also configure mapreduce.job.reduces in mapred-site.xml. Unlike the number of Map tasks, the number of Reduce tasks is set directly by the user. As a rule of thumb, it is often set to about one-fourth the number of Map tasks.

setQueueName(String queueName)

Specifies the queue to which a MapReduce job is submitted. If no queue is specified, the job is submitted to the queue named default. You can also configure mapreduce.job.queuename in mapred-site.xml.
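
The methods in Table 1 and Table 2 are typically combined in a driver class. The following is a minimal word-count style sketch, not a reference implementation. It assumes the Hadoop client libraries are on the classpath and that the input and output paths are passed as the first two command-line arguments. The class names WordCountDriver, TokenMapper, and SumReducer, the job name, and the chosen task count are hypothetical. The sketch uses Job.getInstance rather than the Job constructors because the constructors are deprecated in recent Hadoop releases.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {

    // Hypothetical Mapper: emits (word, 1) for every token in a line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Hypothetical Reducer: sums the counts for each word. Its input and
    // output types match, so it can also serve as the Combiner.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Table 2 settings on JobConf (values are assumptions for the example).
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(2);
        conf.setQueueName("default");

        // Table 1 settings on Job.
        Job job = Job.getInstance(conf, "word count sketch");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job, wait for it to finish, and report success or failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The input and output format calls only restate the defaults and could be omitted. They are included to show where a different format would be plugged in.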