
Interconnecting ClickHouse with HDFS

This topic applies only to MRS 3.2.0 or later.

Scenario

This section describes how to connect ClickHouse to HDFS to read and write files.

Prerequisites

  • The ClickHouse client has been installed in a directory, for example, /opt/client.
  • A user who has permissions on ClickHouse tables and the permission to access HDFS, for example, clickhouseuser, has been created on FusionInsight Manager.
  • The target directory already exists in HDFS. The ClickHouse HDFS engine operates only on files; it does not create or delete directories (see the sketch after this list).
  • Only a ClickHouse cluster deployed on x86 nodes can connect to HDFS; a cluster deployed on Arm nodes cannot.
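
For example, you can create and verify the directory with the HDFS client before creating the ClickHouse table. The following is a minimal sketch; it assumes the client environment has been configured and the user authenticated as described in Procedure, and the directory name /tmp/ck_data is hypothetical:

    # Create the directory that the ClickHouse HDFS engine will read from and write to.
    hdfs dfs -mkdir -p /tmp/ck_data

    # Confirm that the directory now exists.
    hdfs dfs -ls /tmp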

Procedure

  1. Log in to the node where the client is installed as the client installation user.
  2. Run the following command to go to the client installation directory:

    cd /opt/client

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. Run the following command to authenticate the current user. (Change the password as prompted upon the first authentication. Skip this step if Kerberos authentication is disabled for the cluster.)

    kinit clickhouseuser

  5. Run the following command to log in to the ClickHouse client:

    clickhouse client --host {Service IP address of the ClickHouseServer instance} --secure --port 9440

  6. Run the following statement to create an HDFS engine table, which connects ClickHouse to HDFS:

    CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')

    • To obtain the service IP address of the ClickHouseServer instance, perform the following steps:

      Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. On the page that is displayed, click the Instances tab and obtain the service IP address of the ClickHouseServer instance.

    • To obtain the value of namenode_ip, perform the following steps:

      Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Instances tab and obtain the service IP address of the active NameNode.

    • To obtain the value of dfs.namenode.rpc.port, perform the following steps:

      Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Configurations tab and then the All Configurations sub-tab. Search for dfs.namenode.rpc.port to obtain its value.

    • HDFS file path to be accessed:

      If multiple files need to be accessed, add an asterisk (*) to the end of the folder path, for example, hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*.
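
After the table is created, you read and write the HDFS file through ordinary SQL statements against the table. The following is a minimal sketch run from the client shell; the IP address 192.168.42.90 is hypothetical, and it assumes the table created in 6 exists and that /tmp/secure_ck.txt does not yet exist in HDFS (an INSERT fails if the target file already exists):

    # Write two rows; ClickHouse creates /tmp/secure_ck.txt in HDFS in TSV format.
    clickhouse client --host 192.168.42.90 --secure --port 9440 \
      --query "INSERT INTO default.hdfs_engine_table VALUES ('one', 1), ('two', 2)"

    # Read the file contents back through the same table.
    clickhouse client --host 192.168.42.90 --secure --port 9440 \
      --query "SELECT * FROM default.hdfs_engine_table"

Note that a table whose path ends with an asterisk reads all matching files in the folder but is read-only; INSERT statements against such a table fail.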