Development Plan

Scenarios

Data is written to HBase in real time for point query services and is synchronized to CarbonData tables in batches at a specified interval for analytical query services.

Configuration Operations Before Running

In security mode, the sample code needs to read the user.keytab and krb5.conf authentication files. Download the authentication credentials of the principal user from FusionInsight Manager to obtain them. The sample code uses the user sparkuser; change it to the development user you have prepared.

Packaging the Project

Upload the user.keytab and krb5.conf files to the server where the client is located.

Use the Maven tool provided by IDEA to package the project and generate the JAR file. For details, see Writing and Running the Spark Program in the Linux Environment.
- Before compilation and packaging, change the paths of the user.keytab and krb5.conf files in the sample code to the actual paths on the client server, for example, /opt/user.keytab and /opt/krb5.conf.
- Before running the sample project, set the spark.yarn.security.credentials.hbase.enabled configuration item to true in the spark-defaults.conf configuration file of the Spark client. (The default value is false. Changing the value to true does not affect existing services.) If you later want to uninstall the HBase service, change the value back to false first.

Upload the generated JAR package to any directory (for example, /opt/) on the server where the Spark client is located.

Data Preparation

Create an HBase table and construct data with key, modify_time, and valid columns. The key of each record is unique in the table, modify_time is the time the record was last modified, and valid indicates whether the record is valid. In this example, 1 means valid and 0 means invalid.
For example, open the HBase shell and run the following commands:

create 'hbase_table','key','info'

put 'hbase_table','1','info:modify_time','2019-11-22 23:28:39'

put 'hbase_table','1','info:valid','1'

put 'hbase_table','2','info:modify_time','2019-11-22 23:28:39'

put 'hbase_table','2','info:valid','1'

put 'hbase_table','3','info:modify_time','2019-11-22 23:28:39'

put 'hbase_table','3','info:valid','0'

put 'hbase_table','4','info:modify_time','2019-11-22 23:28:39'

put 'hbase_table','4','info:valid','1'

The modify_time values in the preceding commands can be set to any time earlier than the current time.

put 'hbase_table','5','info:modify_time','2021-03-03 15:20:39'

put 'hbase_table','5','info:valid','1'

put 'hbase_table','6','info:modify_time','2021-03-03 15:20:39'

put 'hbase_table','6','info:valid','1'

put 'hbase_table','7','info:modify_time','2021-03-03 15:20:39'

put 'hbase_table','7','info:valid','0'

put 'hbase_table','8','info:modify_time','2021-03-03 15:20:39'

put 'hbase_table','8','info:valid','1'

put 'hbase_table','4','info:valid','0'

put 'hbase_table','4','info:modify_time','2021-03-03 15:20:39'

The modify_time values in the preceding commands can be set to a time within 30 minutes after the sample project starts. (30 minutes is the default synchronization interval of the sample project and can be changed.)

put 'hbase_table','9','info:modify_time','2021-03-03 15:32:39'

put 'hbase_table','9','info:valid','1'

put 'hbase_table','10','info:modify_time','2021-03-03 15:32:39'

put 'hbase_table','10','info:valid','1'

put 'hbase_table','11','info:modify_time','2021-03-03 15:32:39'

put 'hbase_table','11','info:valid','0'

put 'hbase_table','12','info:modify_time','2021-03-03 15:32:39'

put 'hbase_table','12','info:valid','1'

The modify_time values in the preceding commands can be set to a time 30 to 60 minutes after the sample project starts, that is, within the second synchronization period.
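
The three batches of put commands above differ only in where modify_time falls relative to the start time of the sample project. As a rough aid for choosing values, the following Java sketch computes one modify_time per batch and prints matching put commands. The class name, method names, the start time, and the choice of the midpoint of each period are illustrative assumptions and are not part of the sample project; only the table name, the column names, and the 30-minute default interval come from this section.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for preparing test data; not part of the sample project.
class ModifyTimeHelper {
    // Timestamp layout used by the put commands in this section.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns a modify_time for the given batch:
    //   period 0 -> half an interval before the project start (existing data),
    //   period n -> midpoint of the n-th synchronization period after the start.
    static String modifyTimeFor(LocalDateTime start, Duration interval, int period) {
        if (period < 0) {
            throw new IllegalArgumentException("period must be >= 0");
        }
        Duration half = interval.dividedBy(2);
        LocalDateTime time = (period == 0)
                ? start.minus(half)
                : start.plus(interval.multipliedBy(period - 1)).plus(half);
        return time.format(FORMAT);
    }

    // Builds an HBase shell put command for one column in the info family.
    static String putCommand(String key, String column, String value) {
        return "put 'hbase_table','" + key + "','info:" + column + "','" + value + "'";
    }

    public static void main(String[] args) {
        // Assumed start time of the sample project; replace with the real one.
        LocalDateTime start = LocalDateTime.of(2021, 3, 3, 15, 0, 0);
        Duration interval = Duration.ofMinutes(30); // default interval in this section

        // One example key per batch: existing data, first period, second period.
        String[] keys = {"1", "5", "9"};
        for (int period = 0; period < keys.length; period++) {
            String time = modifyTimeFor(start, interval, period);
            System.out.println(putCommand(keys[period], "modify_time", time));
            System.out.println(putCommand(keys[period], "valid", "1"));
        }
    }
}
```

With the assumed start time of 2021-03-03 15:00:00, the three batches get 14:45:00, 15:15:00, and 15:45:00 on the same day. Paste the printed commands into the HBase shell, adjusting keys and valid flags as needed.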

Run the following command to create a Hive external table for HBase in SparkSQL:

create table external_hbase_table(key string, modify_time STRING, valid STRING)
using org.apache.spark.sql.hbase.HBaseSource
options(hbaseTableName "hbase_table", keyCols "key", colsMapping "modify_time=info.modify_time,valid=info.valid");

Run the following command to create a CarbonData table in SparkSQL:

create table carbon01(key string, modify_time STRING, valid STRING) stored as carbondata;

Initialize the CarbonData table by loading all valid data that is currently in the HBase table:

insert into table carbon01 select * from external_hbase_table where valid='1';

Run the following spark-submit command:

spark-submit --master yarn --deploy-mode client --keytab /opt/FIclient/user.keytab --principal sparkuser --class com.huawei.bigdata.spark.examples.HBaseExternalHivetoCarbon /opt/example/HBaseExternalHivetoCarbon-1.0.jar
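
Once submitted, the job is expected to synchronize changes from HBase to CarbonData once per interval. This section does not show the sample project's actual logic, so the following Java sketch is only one plausible shape for a single cycle: rows whose modify_time falls in the current window are read from external_hbase_table, valid rows are loaded into carbon01, and invalid rows are removed from it. The class name, method names, window boundaries (start exclusive, end inclusive), and the decision to delete invalid keys are all assumptions. The sketch only builds SQL strings; in a real job each string would be passed to SparkSession.sql(...).

```java
// Hypothetical sketch of one synchronization cycle; not the sample project's code.
class SyncCycleSketch {
    // Restricts a query to rows modified in (windowStart, windowEnd].
    // Plain string comparison is enough because 'yyyy-MM-dd HH:mm:ss'
    // timestamps sort in chronological order.
    static String windowFilter(String windowStart, String windowEnd) {
        return "modify_time > '" + windowStart + "'"
                + " and modify_time <= '" + windowEnd + "'";
    }

    // Loads rows that were modified in the window and are still valid.
    static String insertValid(String windowStart, String windowEnd) {
        return "insert into table carbon01 "
                + "select * from external_hbase_table where valid='1' and "
                + windowFilter(windowStart, windowEnd);
    }

    // Removes rows that were marked invalid in the window (assumed behavior).
    static String deleteInvalid(String windowStart, String windowEnd) {
        return "delete from carbon01 where key in ("
                + "select key from external_hbase_table where valid='0' and "
                + windowFilter(windowStart, windowEnd) + ")";
    }

    public static void main(String[] args) {
        // Assumed first synchronization window after the project start.
        String windowStart = "2021-03-03 15:00:00";
        String windowEnd = "2021-03-03 15:30:00";

        // Delete first, then insert, so carbon01 holds only valid rows afterwards.
        System.out.println(deleteInvalid(windowStart, windowEnd) + ";");
        System.out.println(insertValid(windowStart, windowEnd) + ";");
    }
}
```

With the data prepared above, such a cycle would remove key 4 (set to valid 0 in the first window) and load keys 5, 6, and 8. This simplified version does not handle a valid row that already exists in carbon01 and is modified again; a real job would need to update or replace such rows instead of inserting duplicates.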
