
Interconnecting ClickHouse with HDFS (MRS 3.3.0-LTS or later)

This section applies to MRS 3.3.0-LTS or later.

Scenario

This section describes how to read and write files after connecting ClickHouse in security mode to HDFS in security mode. When ClickHouse in normal mode is connected to HDFS in normal mode, the same functions as those provided in the open-source community are available. ClickHouse and HDFS deployed in clusters of different security modes cannot be interconnected.

Prerequisites

  • The ClickHouse client has been installed in a directory, for example, /opt/client.
  • A user who has permissions on ClickHouse tables and the permission to access HDFS, for example, clickhouseuser, has been created on FusionInsight Manager.
  • A corresponding directory exists in HDFS. The HDFS engine of ClickHouse works only with files; it does not create or delete directories.
  • When ClickHouse accesses HDFS across clusters, a user who has the permission to access HDFS, for example, hdfsuser, has been created on FusionInsight Manager of the cluster where HDFS is deployed.
  • You have obtained the HDFS cluster domain name by logging in to FusionInsight Manager and choosing System > Permission > Domain and Mutual Trust.
  • ClickHouse cannot connect to encrypted HDFS directories.

Interconnecting ClickHouse with HDFS in a Cluster

  1. Log in to FusionInsight Manager, choose Cluster > Services > HDFS, click Configurations then All Configurations, search for hadoop.rpc.protection, change its value to Authentication or Integrity, save the settings, and restart the HDFS service.
  2. Choose System > Permission > User, select clickhouseuser, and choose More > Download Authentication Credential.

    For the first authentication, change the initial password before downloading the authentication credential file. Otherwise, the security authentication will fail.

  3. Decompress the downloaded authentication credential package and change the name of user.keytab to clickhouse_to_hdfs.keytab.
  4. Log in to FusionInsight Manager, choose Cluster > Services > ClickHouse, and click Configurations then All Configurations. Click ClickHouseServer(Role) and select Engine. Click Upload File next to hdfs.hadoop_kerberos_keytab_file and upload the authentication credential file obtained in 3. Set hdfs.hadoop_kerberos_principal to a value in the format of Username@Domain name, for example, clickhouseuser@HADOOP.COM.

    If you are connecting ClickHouse to HDFS in HA mode, perform the following steps:
    1. Log in to FusionInsight Manager, choose Cluster > Services > HDFS > Instances, click any instance name, and click hdfs-site.xml in the Configuration File area of the Dashboard tab.
    2. Obtain the values of the following parameters from hdfs-site.xml. For example, if two NameServices, hacluster and ns1, are configured for HDFS, obtain the values of the following parameters:

      dfs.nameservices, dfs.ha.namenodes.*, dfs.namenode.rpc-address.*.*, and dfs.client.failover.proxy.provider.* (set its value to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider)

      • The value of dfs.namenode.rpc-address.*.* can be obtained from hdfs-site.xml. If the value contains a host name, change it to the service IP address of that host.
      • When configuring clickhouse-to-hdfs-customize parameters, write the NameService name in lowercase, regardless of its original case. For example, if a NameService of the HDFS cluster is NS1, write it as ns1 in clickhouse-to-hdfs-customize.

      The following is an example:

      <property> 
      <name>dfs.nameservices</name> 
      <value>hacluster,ns1</value> 
      </property> 
      <property> 
      <name>dfs.namenode.rpc-address.hacluster.13</name> 
      <value>192.168.0.1:25000</value> 
      </property> 
      <property> 
      <name>dfs.namenode.rpc-address.hacluster.14</name> 
      <value>192.168.0.2:25000</value> 
      </property> 
      <property> 
      <name>dfs.ha.namenodes.hacluster</name> 
      <value>13,14</value> 
      </property>
      <property> 
      <name>dfs.namenode.rpc-address.ns1.16</name> 
      <value>192.168.0.3:25000</value> 
      </property> 
      <property> 
      <name>dfs.namenode.rpc-address.ns1.17</name> 
      <value>192.168.0.4:25000</value> 
      </property> 
      <property> 
      <name>dfs.ha.namenodes.ns1</name> 
      <value>16,17</value> 
      </property> 
      <property> 
      <name>dfs.client.failover.proxy.provider.ns1</name> 
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> 
      </property>
      <property> 
      <name>dfs.client.failover.proxy.provider.hacluster</name> 
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> 
      </property>
    3. On FusionInsight Manager, choose Cluster > Services > ClickHouse > Configurations > All Configurations > ClickHouseServer(Role) > Engine, and add the parameter values obtained in 4.b to the custom configuration clickhouse-to-hdfs-customize, as sketched below.
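
      The following is a sketch of the resulting clickhouse-to-hdfs-customize entries, assuming the example hdfs-site.xml values above and that each parameter is added as a separate name/value pair. Replace the NameService names, NameNode instance IDs, and IP addresses with the values from your own cluster:

      dfs.nameservices = hacluster,ns1
      dfs.ha.namenodes.hacluster = 13,14
      dfs.namenode.rpc-address.hacluster.13 = 192.168.0.1:25000
      dfs.namenode.rpc-address.hacluster.14 = 192.168.0.2:25000
      dfs.client.failover.proxy.provider.hacluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      dfs.ha.namenodes.ns1 = 16,17
      dfs.namenode.rpc-address.ns1.16 = 192.168.0.3:25000
      dfs.namenode.rpc-address.ns1.17 = 192.168.0.4:25000
      dfs.client.failover.proxy.provider.ns1 = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider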

  5. Save the configuration and restart ClickHouse.
  6. Log in to the node where the client is installed as the client installation user.
  7. Run the following command to go to the client installation directory:

    cd /opt/client

  8. Run the following command to configure environment variables:

    source bigdata_env

  9. Run the following command to authenticate the current user. (Skip this step for a cluster with Kerberos authentication disabled.)

    kinit clickhouseuser

  10. Run the client command of ClickHouse to log in to the ClickHouse client.

    clickhouse client --host Service IP address of the ClickHouseServer instance --secure --port 9440

    • To obtain the service IP address of the ClickHouseServer instance, perform the following steps:

      Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP addresses of the ClickHouseServer instance.

  11. Connect ClickHouse to HDFS.

    • Run the following command to connect to the HDFS component in HA mode:

      CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{nameservice}/tmp/secure_ck.txt', 'TSV')

    • Run the following command to connect to the HDFS component in non-HA mode:

      CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')

      • To obtain the value of nameservice, perform the following steps:

        On FusionInsight Manager, choose Cluster > Services > HDFS > NameService Management and obtain the value of NameService. The nameservice used in the table creation statement must be in lowercase, regardless of its case in the HDFS cluster. For example, if the value in the HDFS cluster is NS1, use ns1 in the table creation statement.

      • To obtain the value of namenode_ip, perform the following steps:

        Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP address of the active NameNode.

      • To obtain the value of dfs.namenode.rpc.port, perform the following steps:

        Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Configurations tab then the All Configurations sub-tab. On this sub-tab page, search for dfs.namenode.rpc.port to obtain its value.

      • HDFS file path to be accessed:

        If multiple files need to be accessed, add an asterisk (*) to the end of the folder, for example, hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*. For a filled-in sketch, see the example after this list.

        ClickHouse cannot connect to encrypted HDFS directories.

      • Write data. For details, see Process of Writing ClickHouse Data to HDFS.
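
    For example, the following is a minimal sketch that combines the HA URI format with a wildcard path, assuming a lowercase NameService ns1 and TSV files already present under the hypothetical HDFS directory /tmp/ckdata:

      CREATE TABLE default.hdfs_wildcard_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://ns1/tmp/ckdata/*', 'TSV')

      SELECT * FROM default.hdfs_wildcard_table LIMIT 10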

Interconnecting ClickHouse with HDFS Across Clusters

  1. Log in to FusionInsight Manager of the HDFS cluster, choose Cluster > Services > HDFS, click Configurations then All Configurations, search for hadoop.rpc.protection, change its value to Authentication or Integrity, save the settings, and restart the HDFS service.
  2. Log in to FusionInsight Manager of the ClickHouse cluster and choose System > Permission > Domain and Mutual Trust. Configure mutual trust (or one-way trust) with the HDFS cluster. For one-way trust, configure the trust relationship with the HDFS cluster on the ClickHouse cluster only.
  3. Log in to FusionInsight Manager of the HDFS cluster and choose System > Permission > User. On the page that is displayed, select hdfsuser, click More, and select Download Authentication Credential.

    For the first authentication, change the initial password before downloading the authentication credential file. Otherwise, the security authentication will fail.

  4. Decompress the downloaded authentication credential package and change the name of user.keytab to clickhouse_to_hdfs.keytab.
  5. Log in to FusionInsight Manager of the ClickHouse cluster, choose Cluster > Services > ClickHouse, and click Configurations then All Configurations. Click ClickHouseServer(Role) and select Engine. Click Upload File next to hdfs.hadoop_kerberos_keytab_file and upload the authentication credential file obtained in 3. Set hdfs.hadoop_kerberos_principal to a value in the format of Username@Domain name, for example, hdfsuser@HDFS_HADOOP.COM.

    If you are connecting ClickHouse to HDFS in HA mode, perform the following steps:
    1. Log in to FusionInsight Manager of the HDFS cluster, choose Cluster > Services > HDFS > Instances, click any instance name, and click hdfs-site.xml in the Configuration File area of the Dashboard tab.
    2. Obtain the values of the following parameters from hdfs-site.xml. For example, if two NameServices, hacluster and ns1, are configured for HDFS, obtain the values of the following parameters:

      dfs.nameservices, dfs.ha.namenodes.*, dfs.namenode.rpc-address.*.*, and dfs.client.failover.proxy.provider.* (set its value to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider)

      • The value of dfs.namenode.rpc-address.*.* can be obtained from hdfs-site.xml. If the value contains a host name, change it to the service IP address of that host.
      • When configuring clickhouse-to-hdfs-customize parameters, write the NameService name in lowercase, regardless of its original case. For example, if a NameService of the HDFS cluster is NS1, write it as ns1 in clickhouse-to-hdfs-customize.

      The following is an example:

      <property> 
      <name>dfs.nameservices</name> 
      <value>hacluster,ns1</value> 
      </property> 
      <property> 
      <name>dfs.namenode.rpc-address.hacluster.13</name> 
      <value>192.168.0.1:25000</value> 
      </property> 
      <property> 
      <name>dfs.namenode.rpc-address.hacluster.14</name> 
      <value>192.168.0.2:25000</value> 
      </property> 
      <property> 
      <name>dfs.ha.namenodes.hacluster</name> 
      <value>13,14</value> 
      </property>
      <property> 
      <name>dfs.namenode.rpc-address.ns1.16</name> 
      <value>192.168.0.3:25000</value> 
      </property> 
      <property> 
      <name>dfs.namenode.rpc-address.ns1.17</name> 
      <value>192.168.0.4:25000</value> 
      </property> 
      <property> 
      <name>dfs.ha.namenodes.ns1</name> 
      <value>16,17</value> 
      </property> 
      <property> 
      <name>dfs.client.failover.proxy.provider.ns1</name> 
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> 
      </property>
      <property> 
      <name>dfs.client.failover.proxy.provider.hacluster</name> 
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> 
      </property>
    3. On FusionInsight Manager of the ClickHouse cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations > ClickHouseServer(Role) > Engine, and add the parameter values obtained in 5.b to the custom configuration clickhouse-to-hdfs-customize.

  6. Save the configuration and restart ClickHouse.
  7. Log in to the node where the client is installed as the client installation user.
  8. Run the following command to go to the client installation directory:

    cd /opt/client

  9. Run the following command to configure environment variables:

    source bigdata_env

  10. Run the following command to authenticate the current user. (Skip this step for a cluster with Kerberos authentication disabled.)

    kinit clickhouseuser

  11. Run the client command of ClickHouse to log in to the ClickHouse client.

    clickhouse client --host Service IP address of the ClickHouseServer instance --secure --port 9440

    To obtain the service IP address of the ClickHouseServer instance, perform the following steps:

    Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP addresses of the ClickHouseServer instance.

  12. Connect to the HDFS component.

    • Run the following command to connect to the HDFS component in HA mode:

      CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{nameservice}/tmp/secure_ck.txt', 'TSV')

    • Run the following command to connect to the HDFS component in non-HA mode (a filled-in sketch follows this list):

      CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')

      • To obtain the value of nameservice, perform the following steps:

        On FusionInsight Manager, choose Cluster > Services > HDFS > NameService Management and obtain the value of NameService. The nameservice used in the table creation statement must be in lowercase, regardless of its case in the HDFS cluster. For example, if the value in the HDFS cluster is NS1, use ns1 in the table creation statement.

      • To obtain the value of namenode_ip, perform the following steps:

        Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP address of the active NameNode.

      • To obtain the value of dfs.namenode.rpc.port, perform the following steps:

        Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Configurations tab then the All Configurations sub-tab. On this sub-tab page, search for dfs.namenode.rpc.port to obtain its value.

      • HDFS file path to be accessed:

        If multiple files need to be accessed, add an asterisk (*) to the end of the folder, for example, hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*.

      • Write data. For details, see Process of Writing ClickHouse Data to HDFS.
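
    For example, the following is a minimal sketch of the non-HA statement with the placeholders filled in, assuming a hypothetical active NameNode service IP address of 192.168.0.5 and a dfs.namenode.rpc.port value of 25000:

      CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://192.168.0.5:25000/tmp/secure_ck.txt', 'TSV')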

Process of Writing ClickHouse Data to HDFS

To write ClickHouse data to HDFS, perform the following steps. The following example writes the secure_ck.txt file to the /tmp directory of HDFS.

  1. Create an HDFS table.

    CREATE TABLE hdfs_engine_table (name String, value UInt32) ENGINE=HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')

  2. Write an HDFS data file.

    INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2), ('three', 3)

  3. Query the HDFS data file.

    SELECT * FROM hdfs_engine_table LIMIT 2

    ┌─name─┬─value─┐
    │ one  │     1 │
    │ two  │     2 │
    └──────┴───────┘
  • When you use ClickHouse to write data to an HDFS engine table, the target data file is created if it does not exist.
  • ClickHouse does not support deleting, modifying, or appending data in HDFS engine tables.
  • After an HDFS engine table is deleted from ClickHouse, the data files in HDFS remain unchanged, as illustrated in the sketch below.
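
The following is a minimal sketch illustrating the last note, reusing the table and placeholders from the example above. After the engine table is dropped and re-created, the previously written data is still readable from HDFS:

    DROP TABLE hdfs_engine_table

    CREATE TABLE hdfs_engine_table (name String, value UInt32) ENGINE=HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')

    SELECT * FROM hdfs_engine_table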