Updated on 2025-08-22 GMT+08:00

Interconnecting ClickHouse with HDFS (MRS 3.2.0-LTS)

Scenarios

This section describes how to connect ClickHouse to HDFS to read and write files.

Notes and Constraints

This section applies only to MRS 3.2.0-LTS.

Prerequisites

  • The ClickHouse client has been installed in a directory, for example, /opt/client.
  • A user who has permissions on ClickHouse tables and permission to access HDFS, for example, clickhouseuser, has been created on FusionInsight Manager.
  • A corresponding directory exists in HDFS. The ClickHouse HDFS engine operates only on files; it does not create or delete directories, so create the directory in advance (see the sketch after this list).
  • Only ClickHouse clusters deployed on x86 nodes can connect to HDFS; ClickHouse clusters deployed on Arm nodes cannot.
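
The target directory can be created with standard HDFS shell commands. The following is a minimal sketch, assuming a hypothetical directory /tmp/clickhouse_data and a user with HDFS administrator rights:

    # Create the target directory in HDFS (the path is a hypothetical example).
    hdfs dfs -mkdir -p /tmp/clickhouse_data

    # Grant ownership to the ClickHouse user so it can read and write files there.
    hdfs dfs -chown clickhouseuser /tmp/clickhouse_data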

Procedure

  1. Log in to the node where the client is installed as the client installation user.
  2. Run the following command to go to the client installation directory:

    cd /opt/client

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. Run the following command to authenticate the current user. (Change the password upon first authentication. Skip this step for a cluster with Kerberos authentication disabled.)

    kinit clickhouseuser

  5. Run the following command to log in to the ClickHouse client:

    clickhouse client --host Service IP address of the ClickHouseServer instance --secure --port 9440

    To obtain the service IP address of the ClickHouseServer instance, perform the following operations:

    On FusionInsight Manager, choose Cluster > Services > ClickHouse. On the displayed page, click the Instances tab and obtain the service IP addresses of the ClickHouseServer instances.
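
    For example, assuming 192.168.42.90 is the service IP address of a ClickHouseServer instance (a hypothetical value), the login command would be:

    # Replace the IP address with the value obtained from FusionInsight Manager.
    clickhouse client --host 192.168.42.90 --secure --port 9440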

  6. Run the following statement to create an HDFS engine table, which connects ClickHouse to HDFS:

    CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')

    Table 1 Parameter description

    • namenode_ip: IP address of the node where the NameNode instance is deployed. On FusionInsight Manager, choose Cluster > Services > HDFS. On the displayed page, click the Instances tab and obtain the service IP address of the active NameNode.
    • dfs.namenode.rpc.port: RPC port used by the NameNode to process all client requests. On FusionInsight Manager, choose Cluster > Services > HDFS. On the displayed page, click Configurations and then All Configurations, and search for dfs.namenode.rpc.port to obtain its value.
    • hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/: HDFS file path to be accessed. If multiple files need to be accessed, add an asterisk (*) to the end of the folder, for example, hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*.
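
    After the table is created, data can be written to and read from HDFS through standard SQL. The following is a minimal sketch run in the ClickHouse client session from 5, assuming /tmp/secure_ck.txt does not exist yet (the HDFS engine creates the file on the first insert, and paths that contain wildcards are read-only):

    -- Write rows; ClickHouse serializes them as TSV into /tmp/secure_ck.txt.
    INSERT INTO default.hdfs_engine_table VALUES ('one', 1), ('two', 2);

    -- Read the rows back; the TSV file is parsed on each query.
    SELECT * FROM default.hdfs_engine_table;

    Note that inserting again into the same existing file may fail, depending on the overwrite settings of the ClickHouse version in use.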