foreachPartition接口使用

场景说明

用户可以在Spark应用程序中使用HBaseContext的方式去操作HBase，将要插入的数据的rowKey构造成rdd，然后通过HBaseContext的mapPartition接口将rdd并发写入HBase表中。

数据规划

在客户端执行：hbase shell命令进入HBase命令行。
使用下面的命令创建HBase表：
create 'table2','cf1'

开发思路

将要导入的数据构造成RDD。
以HBaseContext的方式操作HBase，通过HBaseContext的foreachPatition接口将数据并发写入HBase中。

打包项目

通过IDEA自带的Maven工具，打包项目，生成jar包。具体操作请参考在Linux环境中编包并运行Spark程序。
将打包生成的jar包上传到Spark客户端所在服务器的任意目录（例如“$SPARK_HOME” ）下。

若运行“Spark on HBase”样例程序，需要在Spark客户端的“spark-defaults.conf”配置文件中将配置项“spark.yarn.security.credentials.hbase.enabled”设置为“true”（该参数值默认为“false”，改为“true”后对已有业务没有影响。如果要卸载HBase服务，卸载前请将此参数值改回“false”），将配置项“spark.inputFormat.cache.enabled”设置为“false”。

提交命令

假设用例代码打包后的jar包名为spark-hbaseContext-test-1.0.jar，并将jar包放在客户端“$SPARK_HOME”目录下，以下命令均在“$SPARK_HOME”目录执行，Java接口对应的类名前有Java字样，请参考具体样例代码进行书写。

yarn-client模式：
java/scala版本（类名等请与实际代码保持一致，此处仅为示例）

bin/spark-submit --master yarn --deploy-mode client --class com.huawei.bigdata.spark.examples.hbasecontext.JavaHBaseForEachPartitionExample SparkOnHbaseJavaExample-1.0.jar table2 cf1

python版本（文件名等请与实际保持一致，此处仅为示例）

bin/spark-submit --master yarn --deploy-mode client --jars SparkOnHbaseJavaExample-1.0.jar HBaseForEachPartitionExample.py table2 cf1
yarn-cluster模式：
java/scala版本（类名等请与实际代码保持一致，此处仅为示例）

bin/spark-submit --master yarn --deploy-mode cluster --class com.huawei.bigdata.spark.examples.hbasecontext.JavaHBaseForEachPartitionExample SparkOnHbaseJavaExample-1.0.jar table2 cf1

python版本（文件名等请与实际保持一致，此处仅为示例）

bin/spark-submit --master yarn --deploy-mode cluster --jars SparkOnHbaseJavaExample-1.0.jar HBaseForEachPartitionExample.py table2 cf1

Java样例代码

下面代码片段仅为演示，具体代码参见SparkOnHbaseJavaExample中JavaHBaseForEachPartitionExample文件：

public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.out.println("JavaHBaseForEachPartitionExample {tableName} {columnFamily}");
      return;
    }
    final String tableName = args[0];
    final String columnFamily = args[1];
    SparkConf sparkConf = new SparkConf().setAppName("JavaHBaseBulkGetExample " + tableName);
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    try {
      List<byte[]> list = new ArrayList<byte[]>(5);
      list.add(Bytes.toBytes("1"));
      list.add(Bytes.toBytes("2"));
      list.add(Bytes.toBytes("3"));
      list.add(Bytes.toBytes("4"));
      list.add(Bytes.toBytes("5"));
      JavaRDD<byte[]> rdd = jsc.parallelize(list);
      Configuration conf = HBaseConfiguration.create();
      JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);
      hbaseContext.foreachPartition(rdd,
              new VoidFunction<Tuple2<Iterator<byte[]>, Connection>>() {
          public void call(Tuple2<Iterator<byte[]>, Connection> t)
                  throws Exception {
            Connection con = t._2();
            Iterator<byte[]> it = t._1();
            BufferedMutator buf = con.getBufferedMutator(TableName.valueOf(tableName));
            while (it.hasNext()) {
              byte[] b = it.next();
              Put put = new Put(b);
              put.add(Bytes.toBytes(columnFamily), Bytes.toBytes("cid"), b);
              buf.mutate(put);
            }
            mutator.flush();
            mutator.close();

          }
        });
    } finally {
      jsc.stop();
    }
  }

Scala样例代码

下面代码片段仅为演示，具体代码参见SparkOnHbaseScalaExample中HBaseForEachPartitionExample文件：

def main(args: Array[String]) {
    if (args.length < 2) {
      println("HBaseForeachPartitionExample {tableName} {columnFamily} are missing an arguments")
      return
    }
    val tableName = args(0)
    val columnFamily = args(1)
    val sparkConf = new SparkConf().setAppName("HBaseForeachPartitionExample " +
      tableName + " " + columnFamily)
    val sc = new SparkContext(sparkConf)
    try {
      //[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])]
      val rdd = sc.parallelize(Array(
        (Bytes.toBytes("1"),
          Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
        (Bytes.toBytes("2"),
          Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
        (Bytes.toBytes("3"),
          Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
        (Bytes.toBytes("4"),
          Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
        (Bytes.toBytes("5"),
          Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
      ))
      val conf = HBaseConfiguration.create()
      val hbaseContext = new HBaseContext(sc, conf)
      rdd.hbaseForeachPartition(hbaseContext,
        (it, connection) => {
          val m = connection.getBufferedMutator(TableName.valueOf(tableName))
          it.foreach(r => {
            val put = new Put(r._1)
            r._2.foreach((putValue) =>
              put.addColumn(putValue._1, putValue._2, putValue._3))
            m.mutate(put)
          })
          m.flush()
          m.close()
        })
    } finally {
      sc.stop()
    }
  }

Python样例代码

下面代码片段仅为演示，具体代码参见SparkOnHbasePythonExample中HBaseForEachPartitionExample文件：

# -*- coding:utf-8 -*-
"""
【说明】
由于pyspark不提供Hbase相关api,本样例使用Python调用Java的方式实现
"""
from py4j.java_gateway import java_import
from pyspark.sql import SparkSession
# 创建SparkSession
spark = SparkSession\
        .builder\
        .appName("JavaHBaseForEachPartitionExample")\
        .getOrCreate()
# 向sc._jvm中导入要运行的类
java_import(spark._jvm, 'com.huawei.bigdata.spark.examples.hbasecontext.JavaHBaseForEachPartitionExample')
# 创建类实例并调用方法，传递sc._jsc参数
spark._jvm.JavaHBaseForEachPartitionExample().execute(spark._jsc, sys.argv)
# 停止SparkSession
spark.stop()

父主题： Spark读取HBase表样例程序

上一篇：BulkLoad接口使用

下一篇：分布式Scan HBase表