Updated on 2022-09-14 GMT+08:00

Overview

Scenarios

Data is written to HBase in real time for point query services and is synchronized to CarbonData tables in batches at a specified interval for analytical query services.

Configuration Operations Before Running

In security mode, the sample code reads the user.keytab and krb5.conf files, which are the authentication files for security mode. To obtain them, download the authentication credentials of the development user from FusionInsight Manager. The sample code uses the user sparkuser; change it to the development user you have prepared.
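
Because the sample code fails at startup when either authentication file is missing, a quick pre-check can save a packaging round trip. The following is a minimal sketch, not part of the sample project: the function name check_auth_files is hypothetical, and the directory passed to it is whatever location you chose for the two files.

```shell
#!/bin/bash
# Hypothetical helper (not part of the sample project): confirm that both
# authentication files exist and are readable in the given directory.
check_auth_files() {
    local dir="$1"        # directory that should hold user.keytab and krb5.conf
    local f missing=0
    for f in user.keytab krb5.conf; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing or unreadable: $dir/$f" >&2
            missing=1
        fi
    done
    # Return 0 only when both files were found.
    return "$missing"
}
```

For example, check_auth_files /opt returns a non-zero status and names the missing file if either user.keytab or krb5.conf is not readable under /opt.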

Packaging the Project

  1. Upload the user.keytab and krb5.conf files to the server where the client is located.
  2. Use the Maven tool provided by IDEA to package the project and generate the JAR file. For details, see Compiling and Running the Application.
    • Before compiling and packaging, change the paths of the user.keytab and krb5.conf files in the sample code to their actual paths on the client server, for example, /opt/user.keytab and /opt/krb5.conf.
    • Before running the sample program, set spark.yarn.security.credentials.hbase.enabled to true in the spark-defaults.conf file of the Spark client. The default value is false; changing it to true does not affect existing services. Before uninstalling the HBase service, change the value back to false.
  3. Upload the generated JAR package to any directory (for example, /opt/) on the server where the Spark client is located.
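
The spark.yarn.security.credentials.hbase.enabled setting mentioned in the packaging notes can be edited by hand, or scripted. The following is a rough sketch under stated assumptions: set_spark_conf is a hypothetical helper name, the location of spark-defaults.conf depends on your client installation and is passed in as an argument, and the in-place edit relies on GNU sed.

```shell
#!/bin/bash
# Hypothetical helper: set a property in a Spark properties file,
# replacing an existing entry or appending a new one. Requires GNU sed.
set_spark_conf() {
    local file="$1" key="$2" value="$3"
    # Escape dots so the key is matched literally in the regular expressions.
    local pattern="${key//./\\.}"
    if grep -q "^[[:space:]]*${pattern}[[:space:]=]" "$file"; then
        # The key is already present: rewrite that line.
        sed -i "s|^[[:space:]]*${pattern}[[:space:]=].*|${key} ${value}|" "$file"
    else
        # Make sure the file ends with a newline before appending.
        if [ -n "$(tail -c1 "$file")" ]; then
            echo >> "$file"
        fi
        echo "${key} ${value}" >> "$file"
    fi
}
```

A call such as set_spark_conf /path/to/spark-defaults.conf spark.yarn.security.credentials.hbase.enabled true enables the setting, where /path/to is a placeholder for the configuration directory of your Spark client. Running the same call with false restores the default before HBase is uninstalled.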

Data Preparation

  1. Create an HBase table and populate it with data that has key, modify_time, and valid columns. The key of each record is unique in the table, modify_time is the time the record was last modified, and valid indicates whether the record is valid. In this example, 1 indicates that the record is valid, and 0 indicates that it is invalid.

    For example, go to HBase Shell and run the following commands:

    create 'hbase_table','key','info'

    put 'hbase_table','1','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','1','info:valid','1'

    put 'hbase_table','2','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','2','info:valid','1'

    put 'hbase_table','3','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','3','info:valid','0'

    put 'hbase_table','4','info:modify_time','2019-11-22 23:28:39'

    put 'hbase_table','4','info:valid','1'

    The modify_time values in the preceding commands can be set to any time earlier than the current time.

    put 'hbase_table','5','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','5','info:valid','1'

    put 'hbase_table','6','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','6','info:valid','1'

    put 'hbase_table','7','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','7','info:valid','0'

    put 'hbase_table','8','info:modify_time','2021-03-03 15:20:39'

    put 'hbase_table','8','info:valid','1'

    put 'hbase_table','4','info:valid','0'

    put 'hbase_table','4','info:modify_time','2021-03-03 15:20:39'

    The modify_time values in the preceding commands can be set to any time within 30 minutes after the sample program starts. (30 minutes is the default synchronization interval of the sample program and can be modified.)

    put 'hbase_table','9','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','9','info:valid','1'

    put 'hbase_table','10','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','10','info:valid','1'

    put 'hbase_table','11','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','11','info:valid','0'

    put 'hbase_table','12','info:modify_time','2021-03-03 15:32:39'

    put 'hbase_table','12','info:valid','1'

    The modify_time values in the preceding commands can be set to any time between 30 and 60 minutes after the sample program starts, that is, within the second synchronization period.

  2. Run the following command in SparkSQL to create a Hive external table that maps to the HBase table:

    create table external_hbase_table(key string, modify_time STRING, valid STRING)

    using org.apache.spark.sql.hbase.HBaseSource

    options(hbaseTableName "hbase_table", keyCols "key", colsMapping "modify_time=info.modify_time,valid=info.valid");

  3. Run the following command to create a CarbonData table in SparkSQL:

    create table carbon01(key string,modify_time STRING, valid STRING) stored as carbondata;

  4. Initialize the CarbonData table by loading all valid data currently in the HBase table:

    insert into table carbon01 select * from external_hbase_table where valid='1';

  5. Run the following spark-submit command:

    spark-submit --master yarn --deploy-mode client --keytab /opt/FIclient/user.keytab --principal sparkuser --class com.huawei.bigdata.spark.examples.HBaseExternalHivetoCarbon /opt/example/HBaseExternalHivetoCarbon-1.0.jar
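
The modify_time values in step 1 are defined relative to the moment the sample program starts, so fixed timestamps such as 2021-03-03 15:20:39 must be adjusted for each run. As one possible way to do this, the sketch below prints put commands whose modify_time is offset by a number of minutes from a reference time. The function name gen_puts and its argument order are hypothetical, the table and column names are the ones used in step 1, and the script relies on GNU date.

```shell
#!/bin/bash
# Hypothetical helper: print HBase shell "put" commands for the given row keys,
# with modify_time set to <start> plus <offset_min> minutes. Requires GNU date.
# Usage: gen_puts "<start: YYYY-MM-DD HH:MM:SS>" <offset_min> <valid> <key>...
gen_puts() {
    local start="$1" offset_min="$2" valid="$3"
    shift 3
    local start_epoch ts key
    # Convert the reference time to epoch seconds, then add the offset.
    start_epoch=$(date -d "$start" +%s) || return 1
    ts=$(date -d "@$((start_epoch + offset_min * 60))" '+%Y-%m-%d %H:%M:%S')
    for key in "$@"; do
        echo "put 'hbase_table','$key','info:modify_time','$ts'"
        echo "put 'hbase_table','$key','info:valid','$valid'"
    done
}
```

For example, if the program was started at 2021-03-03 15:00:00, then gen_puts "2021-03-03 15:00:00" 20 1 5 6 8 prints valid records for keys 5, 6, and 8 that fall in the first synchronization period, and an offset of 45 places records in the second period. A negative offset produces timestamps earlier than the reference time. The printed commands can then be run in HBase Shell.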