Updated on 2024-11-29 GMT+08:00

Migrating HBase Data Using HBase2ES

Scenario

This section describes how to use TableScanMR concurrency or HBase direct scanning to read data from HBase and import it into an Elasticsearch cluster that is running properly.

Prerequisites

  • The cluster is running properly.
  • HBase data is ready.
  • The HBase client has been installed, and the client node has been connected to the Elasticsearch cluster.
  • The cluster client has been installed in a directory, for example, /opt/client.

Procedure

Modify configuration files

  1. Replace .xml configuration files.

    Switch to the installation directory of the HBase client, and copy the core-site.xml, hbase-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml files in the conf directory of the HBase client to the /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es/conf directory.

    The XML files are the tool configuration files required for accessing HBase/MapReduce.
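The copy in this step can be scripted. The following is a minimal sketch, not part of the tool package: copy_hbase_conf is a hypothetical helper name, and the source path in the example call (/opt/client/HBase/hbase/conf) is an assumption about the client layout that you should adjust to your environment.

```shell
#!/bin/sh
# Copy the five client XML files into the hbase2es conf directory.
# Usage: copy_hbase_conf <hbase_client_conf_dir> <hbase2es_conf_dir>
# (copy_hbase_conf is a hypothetical helper, not shipped with the tool.)
copy_hbase_conf() {
    src=$1
    dst=$2
    for f in core-site.xml hbase-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
        # Stop at the first missing file so a partial copy is noticed.
        if [ ! -f "$src/$f" ]; then
            echo "missing: $src/$f" >&2
            return 1
        fi
        cp "$src/$f" "$dst/" || return 1
    done
}

# Example call (the source path is an assumption; adjust it to your client):
# copy_hbase_conf /opt/client/HBase/hbase/conf \
#     /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es/conf
```

The function returns a non-zero status when any of the five files is missing, which makes it safe to chain with the later steps.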

  2. On Manager, choose Cluster > Name of the desired cluster > Cluster Properties to check whether the authentication mode of the cluster is the security mode.

    • If yes, go to 3.
    • If no, go to 8.

  3. Create a user.

    1. On Manager, choose System > Permission > User > Create.
    2. Enter a username, for example, test. In user- and role-based authentication mode, add user test to the elasticsearch and supergroup user groups, set the primary group to supergroup, and bind the Manager_administrator role to the user to obtain related permissions. In Ranger-based authentication mode, add the Elasticsearch access permission policy for user test to Ranger. For details, see Adding a Ranger Access Permission Policy for Elasticsearch.
    3. Click OK.

  4. Choose System > Permission > User. Locate the newly created user and choose More > Download Authentication Credential. Then select the cluster information, and click OK to download the file.
  5. Decompress the downloaded file, and upload the resulting user.keytab and krb5.conf files to the /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es/conf directory.
  6. Log in to the node where the HBase client is located as user root.
  7. Run the following commands to modify the principal parameter in the jaas.conf file:

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es/conf

    vi jaas.conf

    In the following example, test is the username, which is the same as the value of principal in the espara.properties file.

    Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es/conf/user.keytab"
    principal="test@<System domain name>"
    useTicketCache=false
    storeKey=true
    debug=true;
    };

    You can log in to Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain, which is the current system domain name.
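Because the username in jaas.conf must match principal in espara.properties, a quick consistency check can catch a mismatch before the import runs. The following is a sketch under the assumption that jaas.conf uses the principal="user@domain" layout shown above; the function names are hypothetical.

```shell
#!/bin/sh
# Print the user part of the principal configured in a jaas.conf file,
# for example principal="test@DOMAIN" prints: test
jaas_principal_user() {
    sed -n 's/^[[:space:]]*principal="\([^@"]*\).*/\1/p' "$1" | head -n 1
}

# Print the value of the principal key in espara.properties.
props_principal() {
    sed -n 's/^principal=\([^[:space:]]*\).*$/\1/p' "$1" | head -n 1
}

# Return 0 only when both files name the same, non-empty user.
# Usage: check_principal <jaas.conf> <espara.properties>
check_principal() {
    jaas_user=$(jaas_principal_user "$1")
    props_user=$(props_principal "$2")
    [ -n "$jaas_user" ] && [ "$jaas_user" = "$props_user" ]
}

# Example call from the conf directory:
# check_principal jaas.conf espara.properties || echo "principal mismatch"
```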

  8. Run the following command to modify the espara.properties configuration file:

    vi espara.properties

    For details about parameter configuration, see Table 1. The following is a configuration example:

    #para for both index and search
    esServerHost=ip1:port1,ip2:port2,ip3:port3
    index=es_index
    
    isSecurity=true
    
    addindex=0
    numberofshards=6
    numberofreplicas=2
    needsource=false
    
    #para for hbase
    principal=test
    HBase_table=hbase_table
    familyName=info
    
    splitFamily=@
    splitColum=_
    splitField=&
    
    #1:string 2:long 3:binary 4:byte 5:double 6:float 7:int
    #example 20&1_30&1_40&1@name&1_age&7_addr&1&|
    qualifier=name&1_age&7_address&1&|
    
    outputDir=/tmp/out1
    
    batch_size=20000
    hbasescannum=20000
    #hbase rowkey encode type, 0:none 1:Base64
    encodeType=0
    threadNum=1
    
    #Log for Test, 0:on 1:off
    isTestLogOn=1
    #json file max length for MR, unit:M
    max_content_length=1024

    Table 1 espara.properties parameter description

    Parameter

    Default Value

    Description

    esServerHost

    ip1:port1,ip2:port2,ip3:port3

    Specifies the Elasticsearch instances to which data from HBase is imported. Multiple instances can be configured for load balancing. Each entry is in the IP address:Port number format, and multiple entries are separated by commas (,). The IP address is the service plane IP address of any node in the Elasticsearch cluster, and the port number is the external HTTP/HTTPS port number of that node.

    To view the IP address, log in to Manager and choose Cluster > Name of the desired cluster > Services > Elasticsearch > Instance. The IP address of the instance running in the cluster is displayed.

    To view the port number, log in to Manager and choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations > All Configurations. Enter SERVER_PORT in the search box to view the corresponding port number. The port number ranges from 24100 to 24149.

    index

    es_index

    Specifies the Elasticsearch index name. Multiple indexes can be configured simultaneously, separated by comma (,). You can establish an index based on actual requirements, or use the default index. You can log in to any Elasticsearch node in the cluster and run the curl command to view the index created in Elasticsearch. For details about how to use the curl command to view indexes, see Running curl Commands in Linux.

    isSecurity

    true

    Specifies whether the Elasticsearch cluster is in security mode. true indicates that the cluster is in security mode, and false indicates that the cluster is in normal mode.

    addindex

    0

    If the default mapping is used, set the parameter to a value greater than 0. In actual scenarios where users use their own mapping, set this parameter to 0.

    numberofshards

    6

    If addindex is greater than 0, set this parameter to the number of primary shards.

    numberofreplicas

    2

    If addindex is greater than 0, set this parameter to the number of replica shards.

    needsource

    false

    Specifies whether _source is enabled. Set this parameter when addindex is greater than 0.

    principal

    test

    Specifies the authentication username. When creating the user, add the elasticsearch and supergroup user groups for the user.

    HBase_table

    hbase_table

    Specifies the HBase table name. Multiple tables can be configured simultaneously, separated by comma (,).

    familyName

    info

    Specifies the family name. Multiple family names can be configured simultaneously, separated by comma (,).

    splitFamily

    @

    Specifies the column family separator, which can be customized. The default value is @.

    splitColum

    _

    Specifies the column name separator, which can be customized. The default value is _.

    splitField

    &

    Specifies the field and field type separator, which can be customized. The default value is &.

    qualifier

    name&1_age&7_addr&1&|

    Specifies the column names. You can configure column names for multiple column families at the same time. By default, multiple column families are separated by @, multiple column names are separated by underscore (_), and fields and field types are separated by ampersand (&). Set this parameter based on the actual configuration of the splitFamily, splitColum, and splitField.

    The number of column families must be the same as the number of column families configured in familyName.

    If the data type is string, a separator can be configured. Take addr&1&| as an example: if the addr column in HBase holds the string "city1|city2", the value is split at "|" and indexed in Elasticsearch as the array ["city1","city2"].

    Data of the string, long, binary, byte, double, float, and int types is supported. In this version, 1 indicates string, 2 indicates long, 3 indicates binary, 4 indicates byte, 5 indicates double, 6 indicates float, and 7 indicates int.

    The format is Field name&Field type&Separator. For example, 20&1_30&1_40&1@name&1_age&7_addr&1&| indicates that qualifier contains the column names of two column families, 20&1_30&1_40&1 and name&1_age&7_addr&1&|, which correspond to the column families info and info1 defined in familyName, respectively. The info column family contains columns 20, 30, and 40, and info1 contains columns name, age, and addr. Columns 20, 30, 40, name, and addr are of the string type, and age is of the int type.

    NOTE:

    The separators defined by splitFamily, splitColum, and splitField must differ from one another and must not appear in the column family names, column names, or fields of the HBase table.

    outputDir

    /tmp/out1

    Specifies the HDFS path. In TableScanMR mode, if data writing in HTTP mode fails, data is written into HDFS. If there are multiple tables, multiple directories must be configured, separated by comma (,). Otherwise, the data will be overwritten.

    batch_size

    20000

    Specifies the number of records written to Elasticsearch in each batch.

    hbasescannum

    20000

    Specifies the number of records cached per HBase scan. If this parameter is set to 0, the default value 100000 is used. Ensure that the value of this parameter is less than or equal to the value of hbase.client.scanner.caching.max of HBase, whose default value is 2147483647.

    encodeType

    0

    Specifies the HBase rowkey encoding type. 0 indicates that no encoding is used and 1 indicates that Base64 encoding is used.

    threadNum

    1

    Specifies the size of the thread pool. If the thread pool size and the number of concurrent Elasticsearch writes are too large, HTTP read requests time out. In TableScanMR mode, set this parameter to 1.

    isTestLogOn

    1

    Specifies whether to record error logs during HBase scanning, for debugging purposes. 0 indicates that logs are recorded, and 1 indicates that logs are not recorded.

    max_content_length

    1024

    Specifies the size of the file generated in TableScanMR mode. Unit: MB
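The qualifier format described in Table 1 can be hard to read at a glance. The following sketch expands a qualifier value into one line per column, assuming the default separators (@ for column families, _ for columns, & for fields); type_name and explain_qualifier are hypothetical helpers for illustration only, not part of the tool.

```shell
#!/bin/sh
# Map the numeric type codes listed in the espara.properties comment to names.
type_name() {
    case $1 in
        1) echo string ;;
        2) echo long ;;
        3) echo binary ;;
        4) echo byte ;;
        5) echo double ;;
        6) echo float ;;
        7) echo int ;;
        *) echo unknown ;;
    esac
}

# Print one line per column: family position, column name, type, and the
# optional value separator. Assumes the default separators @, _, and &.
explain_qualifier() {
    fam=0
    printf '%s\n' "$1" | tr '@' '\n' | while IFS= read -r family; do
        fam=$((fam + 1))
        printf '%s\n' "$family" | tr '_' '\n' | while IFS='&' read -r name code sep; do
            if [ -n "$sep" ]; then
                printf 'family %s: %s %s (values split on "%s")\n' \
                    "$fam" "$name" "$(type_name "$code")" "$sep"
            else
                printf 'family %s: %s %s\n' "$fam" "$name" "$(type_name "$code")"
            fi
        done
    done
}

# Example call:
# explain_qualifier '20&1_30&1_40&1@name&1_age&7_addr&1&|'
```

The family positions in the output follow the order of the column families in familyName, so position 1 is the first family listed there.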

    • When configuring multiple tables, ensure that all tables use the same familyName and qualifier values.
    • In HBase direct scan mode, the number and order of the entries in index and HBase_table must correspond to each other.
    • In TableScanMR mode, the number and order of the entries in index, HBase_table, and outputDir must correspond to each other.
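The entry-count rules above can be checked before starting an import. The following sketch only compares the number of comma-separated entries (it cannot verify that the order is correct); prop, count_items, and check_counts are hypothetical helper names, and the second argument "mr" is an assumed convention for selecting the TableScanMR rule.

```shell
#!/bin/sh
# Print the value of KEY from a properties file (first match only).
# Usage: prop <key> <file>
prop() {
    sed -n "s/^$1=\(.*\)\$/\1/p" "$2" | head -n 1
}

# Count comma-separated items in a string; an empty string counts as 0.
count_items() {
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

# Usage: check_counts <espara.properties> [mr]
# Pass "mr" as the second argument to also check outputDir (TableScanMR mode).
check_counts() {
    n_index=$(count_items "$(prop index "$1")")
    n_table=$(count_items "$(prop HBase_table "$1")")
    n_out=$(count_items "$(prop outputDir "$1")")
    if [ "$n_index" != "$n_table" ]; then
        echo "index has $n_index entries but HBase_table has $n_table" >&2
        return 1
    fi
    if [ "$2" = "mr" ] && [ "$n_out" != "$n_table" ]; then
        echo "outputDir has $n_out entries but HBase_table has $n_table" >&2
        return 1
    fi
}

# Example call from the conf directory:
# check_counts espara.properties mr
```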

Scan and import HBase data into Elasticsearch

  1. Scan and import data in the tool package directory.

    • Mode 1: TableScanMR concurrency

      cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es

      java -cp ./conf/:./../lib/* com.*.fusioninsight.es.tool.hbase2es.MRTool

    • Mode 2: HBase direct scan

      cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es

      java -cp ./conf/:./../lib/* com.*.fusioninsight.es.tool.hbase2es.HBaseTool

  2. Run the curl command to view the index and check whether the data is imported.

    • If the security mode is used, run the following command:

      curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/my_store1/_search"

    • If the normal mode is used, run the following command:

      curl -XGET "http://IP:port/my_store1/_search"

    For details about how to use the curl command, see Running curl Commands in Linux.
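The check command in this step can be derived from the values already set in espara.properties. The following sketch prints the curl command for the first entry of an esServerHost-style host list; verify_cmd is a hypothetical helper, and the curl options are copied from the two commands above rather than chosen independently.

```shell
#!/bin/sh
# Print a curl command that searches INDEX on the first host of HOSTS.
# HOSTS uses the esServerHost format: ip1:port1,ip2:port2,...
# Usage: verify_cmd <hosts> <index> <true|false for isSecurity>
verify_cmd() {
    hosts=$1
    index=$2
    secure=$3
    first=${hosts%%,*}    # keep only the first ip:port pair
    if [ "$secure" = "true" ]; then
        printf 'curl -XGET --tlsv1.2 --negotiate -k -u : "https://%s/%s/_search"\n' \
            "$first" "$index"
    else
        printf 'curl -XGET "http://%s/%s/_search"\n' "$first" "$index"
    fi
}

# Example call (prints the command; it does not contact the cluster):
# verify_cmd "ip1:port1,ip2:port2" es_index true
```

Printing the command instead of running it keeps the helper free of side effects, so the output can be reviewed before it is executed.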

  3. If some indexes fail to be imported, a json directory is generated in the sbin directory of the data import tool package. Run the input.sh script in the sbin directory to import the failed data again.

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es

    ./sbin/input.sh

    You can view the log files of the import process in the Elasticsearch/tools/elasticsearch-data2es/hbase2es/logs directory of the cluster client.
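The re-import can be guarded so that input.sh runs only when failed data exists. The following sketch assumes that the presence of the sbin/json directory is a sufficient signal, as described in this step; retry_failed_import is a hypothetical helper name.

```shell
#!/bin/sh
# Run sbin/input.sh only when a previous import left an sbin/json directory.
# Usage: retry_failed_import <hbase2es_tool_dir>
retry_failed_import() {
    tool_dir=$1
    if [ -d "$tool_dir/sbin/json" ]; then
        # Run from the tool directory, as in the manual command above.
        (cd "$tool_dir" && ./sbin/input.sh)
    else
        echo "no sbin/json directory; nothing to re-import"
    fi
}

# Example call:
# retry_failed_import /opt/client/Elasticsearch/tools/elasticsearch-data2es/hbase2es
```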