Access Hive with HCatalog
Function
This sample uses HCatalog to analyze Hive table data with a MapReduce job: it reads the int values in the first column of the input table, performs a count(distinct XX)-style aggregation, and writes the results to the output table.
Example Code
The sample code is in HCatalogExample.java under hive-examples/hcatalog-example. Its function modules are as follows:
- Implement the Mapper class: use HCatRecord to read the int value in the first column of each record and emit it with a count of 1.
  public static class Map extends Mapper<LongWritable, HCatRecord, IntWritable, IntWritable> {
      int age;

      @Override
      protected void map(LongWritable key, HCatRecord value,
              Mapper<LongWritable, HCatRecord, IntWritable, IntWritable>.Context context)
              throws IOException, InterruptedException {
          // Emit (age, 1) only when the first column actually holds an int;
          // records without an int first column are skipped.
          if (value.get(0) instanceof Integer) {
              age = (Integer) value.get(0);
              context.write(new IntWritable(age), new IntWritable(1));
          }
      }
  }
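To see the map step in isolation, its per-record logic can be sketched in plain Java, without a Hadoop runtime. The `mapRecord` helper and the sample column values below are hypothetical stand-ins for `value.get(0)`; this is a minimal sketch, not part of the sample program:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapLogicSketch {
    // Simulates one map() call: emit (value, 1) when the first column is an int,
    // or null (no emission) otherwise.
    static int[] mapRecord(Object firstColumn) {
        if (firstColumn instanceof Integer) {
            return new int[] {(Integer) firstColumn, 1};
        }
        return null; // non-int rows are skipped
    }

    public static void main(String[] args) {
        // A hypothetical first column, including one non-int value.
        List<Object> column = Arrays.<Object>asList(25, 30, 25, "n/a", 30);
        List<String> emitted = new ArrayList<>();
        for (Object v : column) {
            int[] kv = mapRecord(v);
            if (kv != null) {
                emitted.add(kv[0] + ":" + kv[1]);
            }
        }
        // Each entry is one key:value pair handed to the shuffle phase.
        System.out.println(emitted);
    }
}
```

Running the sketch shows that every valid row contributes one (value, 1) pair, while the non-int row contributes nothing.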
- Implement the Reducer class: for each key emitted by the map phase, count its occurrences, and use HCatRecord to write the (value, count) result.
  public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, HCatRecord> {
      @Override
      protected void reduce(IntWritable key, Iterable<IntWritable> values,
              Reducer<IntWritable, IntWritable, IntWritable, HCatRecord>.Context context)
              throws IOException, InterruptedException {
          // Count how many times this key was emitted by the map phase.
          int sum = 0;
          for (IntWritable ignored : values) {
              sum++;
          }
          // Write (key, count) as a two-column HCatRecord.
          HCatRecord record = new DefaultHCatRecord(2);
          record.set(0, key.get());
          record.set(1, sum);
          context.write(null, record);
      }
  }
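The shuffle-plus-reduce behavior can likewise be sketched in memory: the framework groups the map output by key, and the reducer counts the values in each group. The `countByKey` helper below is a hypothetical simulation, not Hadoop API:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceLogicSketch {
    // Groups the keys emitted by the map phase and counts occurrences per key,
    // mirroring what shuffle + reduce produce for this job.
    static Map<Integer, Integer> countByKey(List<Integer> mapOutputKeys) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int key : mapOutputKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Keys emitted by the map phase for a sample input column.
        System.out.println(countByKey(Arrays.asList(25, 30, 25, 30, 40)));
        // → {25=2, 30=2, 40=1}
    }
}
```

Each map entry corresponds to one two-column HCatRecord written by the real reducer.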
- Define the MapReduce job: specify the input/output formats, the Mapper/Reducer classes, and the key-value types of each stage.
  // Define the MapReduce job.
  Job job = new Job(conf, "GroupByDemo");

  // Input: read from the Hive table through HCatalog.
  HCatInputFormat.setInput(job, dbName, inputTableName);
  job.setInputFormatClass(HCatInputFormat.class);

  job.setJarByClass(HCatalogExample.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setMapOutputKeyClass(IntWritable.class);
  job.setMapOutputValueClass(IntWritable.class);
  job.setOutputKeyClass(WritableComparable.class);
  job.setOutputValueClass(DefaultHCatRecord.class);

  // Output: write to the Hive table through HCatalog, reusing the table's schema.
  String outputTableName = otherArgs[1];
  OutputJobInfo outputjobInfo = OutputJobInfo.create(dbName, outputTableName, null);
  HCatOutputFormat.setOutput(job, outputjobInfo);
  HCatSchema schema = outputjobInfo.getOutputSchema();
  HCatOutputFormat.setSchema(job, schema);
  job.setOutputFormatClass(HCatOutputFormat.class);
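Putting the two phases together, the job's end-to-end effect on a sample first column can be sketched in plain Java, with the HCatalog table I/O replaced by in-memory collections. The class and sample data are hypothetical; this only illustrates what rows the output table ends up containing:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupByDemoSketch {
    // Simulates the whole job: filter the first column to int values (map),
    // then count occurrences per distinct value (reduce).
    static Map<Integer, Integer> run(List<Object> firstColumn) {
        Map<Integer, Integer> out = new LinkedHashMap<>();
        for (Object v : firstColumn) {
            if (v instanceof Integer) {                      // map-side filter
                out.merge((Integer) v, 1, Integer::sum);     // reduce-side count
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each resulting entry corresponds to one row written to the output table.
        System.out.println(run(Arrays.<Object>asList(25, 30, 25, "bad", 30, 40)));
        // → {25=2, 30=2, 40=1}
    }
}
```

In the real job this aggregation is distributed across map and reduce tasks, but the output rows are the same.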