Updated on 2023-08-31 GMT+08:00

Overview

Scenarios

Data is written to HBase in real time for point query services and is synchronized to CarbonData tables in batches at a specified interval for analytical query services.

Data Preparation

Before running the sample project, set the spark.yarn.security.credentials.hbase.enabled configuration item to true in the spark-defaults.conf configuration file on the Spark client. (The default value is false. Changing the value to true does not affect existing services.) If you later want to uninstall the HBase service, change the value back to false first.
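As a quick check before submitting the job, the setting above can be verified with a small script. The following is a minimal sketch, assuming the file uses the usual spark-defaults.conf layout of one key and value per line separated by whitespace or "="; the function names and the sample text are hypothetical and not part of the sample project.

```python
import re

HBASE_CRED_KEY = "spark.yarn.security.credentials.hbase.enabled"


def read_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict (assumed layout)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split the key from the value on the first run of whitespace or '='.
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props


def hbase_credentials_enabled(text):
    """Return True only if the item is explicitly set to true (default: false)."""
    value = read_spark_defaults(text).get(HBASE_CRED_KEY, "false")
    return value.lower() == "true"


# Hypothetical excerpt of a client-side spark-defaults.conf.
sample = """
# example content only
spark.master yarn
spark.yarn.security.credentials.hbase.enabled true
"""

print(hbase_credentials_enabled(sample))
```

In practice the text would be read from the spark-defaults.conf file of your Spark client; its location depends on the installation and is not assumed here.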

  1. Create an HBase table and populate it with records that have key, modify_time, and valid columns. The key of each record is unique in the table, modify_time is the time the record was last modified, and valid indicates whether the record is valid. In this example, 1 means the record is valid and 0 means it is invalid.

    For example, go to HBase Shell and run the following commands:

    create 'hbase_table','key','info'

    put 'hbase_table','1','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','1','info:valid','1'

    put 'hbase_table','2','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','2','info:valid','1'

    put 'hbase_table','3','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','3','info:valid','0'

    put 'hbase_table','4','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','4','info:valid','1'

    In the preceding commands, set modify_time to any time earlier than the current time.

    put 'hbase_table','5','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','5','info:valid','1'

    put 'hbase_table','6','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','6','info:valid','1'

    put 'hbase_table','7','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','7','info:valid','0'

    put 'hbase_table','8','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','8','info:valid','1'

    put 'hbase_table','4','info:valid','0'

    put 'hbase_table','4','info:modify_time','2021-03-03 15:20:39'

    In the preceding commands, set modify_time to a time within 30 minutes after the sample project is started, that is, the first synchronization period. (30 minutes is the default synchronization interval of the sample project and can be modified.)

    put 'hbase_table','9','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','9','info:valid','1'

    put 'hbase_table','10','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','10','info:valid','1'

    put 'hbase_table','11','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','11','info:valid','0'

    put 'hbase_table','12','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','12','info:valid','1'

    In the preceding commands, set modify_time to a time from 30 to 60 minutes after the sample project is started, that is, the second synchronization period.
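Because the modify_time values in step 1 must fall within specific windows relative to the project start time, it can be convenient to generate the put commands instead of typing them by hand. The following is a minimal sketch; the helper make_puts, the start time, and the 10-minute and 40-minute offsets are hypothetical choices for illustration, and the generated lines are meant to be pasted into HBase Shell.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_puts(table, rows, modify_time):
    """Build HBase Shell put commands for (key, valid) pairs at one timestamp."""
    stamp = modify_time.strftime(TIME_FORMAT)
    lines = []
    for key, valid in rows:
        lines.append(f"put '{table}','{key}','info:modify_time','{stamp}'")
        lines.append(f"put '{table}','{key}','info:valid','{valid}'")
    return lines


# Hypothetical start time of the sample project.
start = datetime(2021, 3, 3, 15, 10, 39)

# Records for the first synchronization period (0 to 30 minutes after start).
first_period = make_puts(
    "hbase_table",
    [("5", "1"), ("6", "1"), ("7", "0"), ("8", "1")],
    start + timedelta(minutes=10),
)

# Records for the second synchronization period (30 to 60 minutes after start).
second_period = make_puts(
    "hbase_table",
    [("9", "1"), ("10", "1"), ("11", "0"), ("12", "1")],
    start + timedelta(minutes=40),
)

for line in first_period + second_period:
    print(line)
```

Replace the start time with the actual time at which you launch the sample project, and adjust the offsets if you change the synchronization interval.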

  2. Run the following command in SparkSQL to create a Hive external table that maps to the HBase table:

    create table external_hbase_table(key string ,modify_time STRING, valid STRING)

    using org.apache.spark.sql.hbase.HBaseSource

    options(hbaseTableName "hbase_table", keyCols "key", colsMapping "modify_time=info.modify_time,valid=info.valid");

  3. Run the following command in SparkSQL to create a CarbonData table:

    create table carbon01(key string,modify_time STRING, valid STRING) stored as carbondata;

  4. Initialize the CarbonData table by loading all valid records (valid='1') currently in the HBase table:

    insert into table carbon01 select * from external_hbase_table where valid='1';

  5. Run the following spark-submit command:

    spark-submit --master yarn --deploy-mode client --class com.huawei.bigdata.spark.examples.HBaseExternalHivetoCarbon /opt/example/HBaseExternalHivetoCarbon-1.0.jar
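The test data above suggests what each synchronization has to handle: new valid records, new invalid records, and existing records that become invalid (key 4). The source does not describe the internal logic of the sample project, so the following is only a sketch of one plausible interpretation, modeled with plain Python dictionaries rather than Spark: for each period, records modified inside the window are removed from the target, and only those with valid set to 1 are written back. The function sync_window and the window boundaries are hypothetical.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def sync_window(hbase_rows, carbon_rows, window_start, window_end):
    """Return the target table after syncing records modified in the window.

    Assumed rule (not confirmed by the source): a record modified inside
    [window_start, window_end) replaces any existing copy if valid == '1',
    and is removed from the target if valid == '0'.
    """
    result = dict(carbon_rows)
    for key, row in hbase_rows.items():
        modified = datetime.strptime(row["modify_time"], TIME_FORMAT)
        if not (window_start <= modified < window_end):
            continue
        result.pop(key, None)  # drop the stale copy, if any
        if row["valid"] == "1":
            result[key] = dict(row)
    return result


old = "2019-11-22 23:28:39"
new = "2021-03-03 15:20:39"

# State of the target after the initial load in step 4 (valid records only).
carbon = {k: {"modify_time": old, "valid": "1"} for k in ("1", "2", "4")}

# State of the HBase table during the first synchronization period.
hbase = {
    "1": {"modify_time": old, "valid": "1"},
    "2": {"modify_time": old, "valid": "1"},
    "3": {"modify_time": old, "valid": "0"},
    "4": {"modify_time": new, "valid": "0"},  # invalidated after the initial load
    "5": {"modify_time": new, "valid": "1"},
    "6": {"modify_time": new, "valid": "1"},
    "7": {"modify_time": new, "valid": "0"},
    "8": {"modify_time": new, "valid": "1"},
}

start = datetime(2021, 3, 3, 15, 10, 39)  # hypothetical project start time
synced = sync_window(hbase, carbon, start, start + timedelta(minutes=30))
print(sorted(synced, key=int))
```

Under this assumed rule, key 4 disappears from the target after the first period because it was marked invalid, while keys 5, 6, and 8 are added. To compare against the real job, query carbon01 in SparkSQL after each synchronization interval.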