Updated on 2024-05-06 GMT+08:00

Interconnecting Hive with CSS

Scenario

Use the Elasticsearch-Hadoop plug-in to exchange data between Hive and Elasticsearch of Cloud Search Service (CSS) so that Elasticsearch index data can be mapped to Hive tables.

This section applies to MRS 3.x or later.

Prerequisites

The Hive service of MRS and the Elasticsearch service of CSS have been installed, and the two clusters can communicate with each other.

Procedure

  1. On the Clusters page of the CSS console, locate the row containing the target cluster and click Access Kibana in the Operation column. In the navigation pane of Kibana, click Dev Tools. On the console page that is displayed, run the following command to create the index ddj_study_card_ratio_v12:

    PUT ddj_study_card_ratio_v12
    {  
    	"mappings" : {
          "properties" : {
            "uniq_id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        }
      }

    If the following information is displayed, the index is created:

    {
      "acknowledged" : true,
      "shards_acknowledged" : true,
      "index" : "ddj_study_card_ratio_v12"
    }

  2. Run the following command to insert data into the ddj_study_card_ratio_v12 index:

    POST /ddj_study_card_ratio_v12/_doc/_bulk
    {"index":{}}
    {"id":"1", "uniq_id":"23323"}

    If errors is false in the command output, the data is imported.

  3. Download the corresponding JAR file from Past Releases based on the Elasticsearch version in CSS.

    For example, the JAR file corresponding to Elasticsearch 7.6.2 is elasticsearch-hadoop-7.6.2.jar.

    • The JAR file and Elasticsearch of CSS must have the same version. This section uses an Elasticsearch 7.6.2 cluster with security mode enabled as an example.
    • If there are any additional custom modules, pack them into a separate JAR file.

  4. Upload the JAR file in 3 to the /opt/Bigdata/third_lib/Hive directory on all HiveServer nodes and run the following command to modify the permission:

    chown omm:wheel -R /opt/Bigdata/third_lib/Hive

  5. Log in to FusionInsight Manager and choose Cluster > Services > Hive . On the page that is displayed, click the Instance tab. On this tab page, select all HiveServer instances, and choose More > Restart Instance.
  6. Download commons-httpclient-3.1.jar from Maven central warehouse and upload this JAR file and the JAR file in 3 to any node where the HDFS and Hive clients are installed in the cluster.
  7. Log in to the node to which the JAR files in 6 are uploaded as the client installation user.
  8. Run the following command to authenticate the user:

    cd Client installation directory

    source bigdata_env

    kinit Component service user (Skip this step for clusters with Kerberos authentication disabled.)

  9. Run the following command to create a directory for storing JAR files in HDFS:

    hdfs dfs -mkdir HDFS path for storing JAR files

  10. Run the following command to upload the JAR files in 6 to HDFS:

    hdfs dfs -put JAR file storage path HDFS path for storing JAR files

  11. Run the following command to enable Hive to load a specified JAR file when executing a command line task:

    beeline

    add jar HDFS path for storing JAR files; (Execute this command once for each JAR file.)

  12. Run the following command to create an Elasticsearch external table:

    CREATE EXTERNAL TABLE `ddj_study_card_ratio_v12_test`(                              
       `uniq_id` string)                                                                         
     ROW FORMAT SERDE                                                                       
       'org.elasticsearch.hadoop.hive.EsSerDe'                                              
     STORED BY                                                                              
       'org.elasticsearch.hadoop.hive.EsStorageHandler'                                     
     WITH SERDEPROPERTIES (                                                                 
       'field.delim'='',                                                                   
       'serialization.format'='')                                                               
     TBLPROPERTIES (                                                                        
       'bucketing_version'='2',                                                             
       'es.index.auto.create'='false',                                                      
       'es.mapping.date.rich'='false',                                                      
       'es.net.http.auth.pass'='Pzh6537projectx',                                           
       'es.net.http.auth.user'='elastic',                                                   
       'es.nodes'='vpcep-e0b33065-75b7-4193-8395-dbd00d10bc39.cn-east-3.huaweicloud.com',   
       'es.nodes.wan.only'='true',                                                          
       'es.port'='9200',                                                                                 
       'es.read.metadata'='true',                                                           
       'es.resource'='ddj_study_card_ratio_v12',                                            
       'es.set.netty.runtime.available.processors'='false',                                 
       'es.write.operation'='index',                                                        
       'last_modified_by'='root',                                                           
       'last_modified_time'='1655264909',                                                   
       'transient_lastDdlTime'='1655264909');

    Key parameters are described as follows:

    • es.net.http.auth.pass and es.net.http.auth.user: indicate the password and username of the user created in Kibana who has the permission to perform operations on indexes created in 1.
    • es.nodes: IP address to be connected. You can log in to the CSS management console and view the IP address of the cluster in the Internal Access Addresses column of the cluster list.
    • es.port: port for external access to the Elasticsearch cluster. The default value is 9200.
    • es.resource: name of the index created in 1.

    For details about parameter configurations, visit https://www.elastic.co/guide/en/elasticsearch/hadoop/6.1/hive.html.

  13. Run the following command to view the Elasticsearch external table created in 12:

    select * from ddj_study_card_ratio_v12_test;

    If no error information is displayed and the query is successful, Hive is interconnected with CSS. The command output is as follows: