Help Center/ Cloud Search Service/ Best Practices/ Cluster Access/ Using ES-Hadoop to Read and Write Data in Elasticsearch Through Hive
Updated on 2023-03-29 GMT+08:00

Using ES-Hadoop to Read and Write Data in Elasticsearch Through Hive

The Elasticsearch-Hadoop (ES-Hadoop) connector combines the massive data storage and in-depth processing capabilities of Hadoop with the real-time search and analysis capabilities of Elasticsearch. It allows you to quickly get to know big data and work better in the Hadoop ecosystem.

This section uses the ES-Hadoop of MRS as an example to describe how to connect to a CSS cluster. You can configure any other applications that need to use the Elasticsearch cluster. Ensure the network connection between the client and the Elasticsearch cluster is normal.

Prerequisites

  • The CSS cluster is available.
  • The client can communicate with the CSS cluster.
  • The CSS and MRS clusters are in the same region, AZ, VPC, and subnet.
    Figure 1 CSS cluster information

Procedure

  1. Obtain the private network address of the cluster. It is used to access the cluster.
    1. In the navigation pane on the left, choose Clusters.
    2. In the cluster list, select a cluster, and obtain and record its Private Network Address. Format: <host>:<port> or <host>:<port>,<host>:<port>

      If the cluster has only one node, the IP address and port number of only one node are displayed, for example, 10.62.179.32:9200. If the cluster has multiple nodes, the IP addresses and port numbers of all nodes are displayed, for example, 10.62.179.32:9200,10.62.179.33:9200.

  2. Log in to an MRS cluster node. For details, see .
  3. Run the cURL command on an MRS cluster node to check the network connectivity. Ensure every node in the MRS cluster can connect to the CSS cluster.
    • Cluster in non-security mode
      curl -X GET http://<host>:<port>
    • Cluster in security mode + HTTP
      curl -X GET http://<host>:<port> -u <user>:<password>
    • Cluster in security mode + HTTPS
      curl -X GET https://<host>:<port> -u <user>:<password> -ik
    Table 1 Variables

    Variable

    Description

    <host>

    IP address of each node in the cluster. If the cluster contains multiple nodes, there will be multiple IP addresses. You can use any of them.

    <port>

    Port number for accessing a cluster node. Generally, the port number is 9200.

    <user>

    Username for accessing the cluster.

    <password>

    Password of the user.

  4. Download the ES-Hadoop lib package and decompress it to obtain the elasticsearch-hadoop-x.x.x.jar file. The version must be the same as the CSS cluster version. For example, if the CSS cluster version is 7.6.2, you are advised to download elasticsearch-hadoop-7.6.2.zip.
  5. Download the httpclient dependency package commons-httpclient:commons-httpclient-3.1.jar. In the package name, 3.1 indicates the version number. Select the package of the version you need.
  6. Install the MRS client. If the MRS client has been installed, skip this step. For details, see .
  7. Log in to the MRS client. Upload the JAR dependency packages of ES-Hadoop and httpclient to the MRS client.
  8. Create an HDFS directory on the MRS client. Upload the ES-Hadoop lib package and the httpclient dependency package to the directory.
    hadoop fs -mkdir /tmp/hadoop-es
    hadoop fs -put elasticsearch-hadoop-x.x.x.jar /tmp/hadoop-es
    hadoop fs -put commons-httpclient-3.1.jar /tmp/hadoop-es
  9. Log in to the Hive client from the MRS client. For details, see .
  10. On the Hive client, add the ES-Hadoop lib package and the httpclient dependency package. This command is valid only for the current session.

    Enter beeline or hive to go to the execution page and run the following commands:

    add jar hdfs:///tmp/hadoop-es/commons-httpclient-3.1.jar;
    add jar hdfs:///tmp/hadoop-es/elasticsearch-hadoop-x.x.x.jar;
  11. On the Hive client, create a Hive foreign table.
    • Cluster in non-security mode
      CREATE EXTERNAL table IF NOT EXISTS student( 
         id BIGINT,
         name STRING,
         addr STRING 
      )  
      STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
      TBLPROPERTIES(  
          'es.nodes' = 'xxx.xxx.xxx.xxx:9200',
          'es.port' = '9200',
          'es.net.ssl' = 'false', 
          'es.nodes.wan.only' = 'false', 
          'es.nodes.discovery'='false',
          'es.input.use.sliced.partitions'='false',
          'es.resource' = 'student/_doc'
      );
    • Cluster in security mode + HTTP
      CREATE EXTERNAL table IF NOT EXISTS student( 
         id BIGINT,
         name STRING,
         addr STRING 
      )  
      STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
      TBLPROPERTIES(  
          'es.nodes' = 'xxx.xxx.xxx.xxx:9200',
          'es.port' = '9200',
          'es.net.ssl' = 'false', 
          'es.nodes.wan.only' = 'false', 
          'es.nodes.discovery'='false',
          'es.input.use.sliced.partitions'='false',
          'es.nodes.client.only'='true',
          'es.resource' = 'student/_doc',
          'es.net.http.auth.user' = 'username', 
          'es.net.http.auth.pass' = 'password'
      );
    • Cluster in security mode + HTTPS
      1. Obtain the security certificate CloudSearchService.cer.
        1. Log in to the CSS management console.
        2. In the navigation pane, choose Clusters. The cluster list is displayed.
        3. Click the name of a cluster to go to the cluster details page.
        4. On the Configuration page, click Download Certificate next to HTTPS Access.
      2. Convert the security certificate CloudSearchService.cer. Upload the downloaded security certificate to the client and use keytool to convert the .cer certificate into a .jks certificate that can be read by Java.
        • In Linux, run the following command to convert the certificate:
          keytool -import -alias newname -keystore ./truststore.jks -file ./CloudSearchService.cer 
        • In Windows, run the following command to convert the certificate:
          keytool -import -alias newname -keystore .\truststore.jks -file .\CloudSearchService.cer

        In the preceding command, newname indicates the user-defined certificate name.

        After this command is executed, you will be prompted to set the certificate password and confirm the password. Securely store the password. It will be used for accessing the cluster.

      3. Put the .jks file to the same path of each node in the MRS cluster, for example, /tmp. You can run the scp command to transfer the file. Ensure user omm has the permission to read the file. You can run the following command to set the permission:
        chown -R omm truststore.jks
      4. Create a Hive foreign table.
        CREATE EXTERNAL table IF NOT EXISTS student( 
           id BIGINT,
           name STRING,
           addr STRING 
        )  
        STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
        TBLPROPERTIES(  
            'es.nodes' = 'https://xxx.xxx.xxx.xxx:9200',
            'es.port' = '9200',
            'es.net.ssl' = 'true', 
            'es.net.ssl.truststore.location' = 'cerFilePath',
            'es.net.ssl.truststore.pass' = 'cerPassword',
            'es.nodes.wan.only' = 'false', 
            'es.nodes.discovery'='false',
            'es.nodes.client.only'='true',
            'es.input.use.sliced.partitions'='false',
            'es.resource' = 'student/_doc',
            'es.net.http.auth.user' = 'username', 
            'es.net.http.auth.pass' = 'password'
        );
    Table 2 ES-Hadoop parameters

    Parameter

    Default Value

    Description

    es.nodes

    localhost

    Address for accessing the CSS cluster. You can view private network address in the cluster list.

    es.port

    9200

    Port number for accessing a cluster. Generally, the port number is 9200.

    es.nodes.wan.only

    false

    Whether to perform node sniffing.

    es.nodes.discovery

    true

    Whether to disable node discovery.

    es.input.use.sliced.partitions

    true

    Whether to use slices. Its value can be:

    • true
    • false
    NOTE:

    If this parameter is set to true, the index prefetch time may be significantly prolonged, and may even be much longer than the data query time. You are advised to set this parameter to false to improve query efficiency.

    es.resource

    NA

    Specifies the index and type to be read and written.

    es.net.http.auth.user

    NA

    Username for accessing the cluster. Set this parameter only if the security mode is enabled.

    es.net.http.auth.pass

    NA

    Password of the user. Set this parameter only if the security mode is enabled.

    es.net.ssl

    false

    Whether to enable SSL. If SSL is enabled, you need to configure the security certificate information.

    es.net.ssl.truststore.location

    NA

    Path of the .jks certificate file, for example, file:///tmp/truststore.jks.

    es.nodes.client.only

    false

    Check whether the IP address of an independent Client node is configured for es.nodes (that is, whether the Client node is enabled during Elasticsearch cluster creation). If yes, change the value to true, or an error will be reported, indicating that the data node cannot be found.

    es.net.ssl.truststore.pass

    NA

    Password of the .jks certificate file.

    For details about ES-Hadoop configuration items, see the official configuration description.

  12. On the Hive client, insert data.
    INSERT INTO TABLE student VALUES (1, "Lucy", "address1"), (2, "Lily", "address2");
  13. On the Hive client, run a query.
    select * from student;

    The query result is as follows:

    +-------------+---------------+---------------+
    | student.id  | student.name  | student.addr  |
    +-------------+---------------+---------------+
    | 1           | Lucy          | address1      |
    | 2           | Lily          | address2      |
    +-------------+---------------+---------------+
    2 rows selected (0.116 seconds)
  14. Log in to the CSS console and choose Clusters. Locate the target cluster and click Access Kibana in the Operation column.
  15. On the Dev Tools page of Kibana, run a query and view the result.
    GET /student/_search
    Figure 2 Kibana query result