Configuring Interconnection Between ClickHouse and HDFS
Scenario
This section describes how to read and write files after ClickHouse in security mode is connected to HDFS in security mode. When ClickHouse is connected to HDFS in normal mode, the same functions as those provided by the open-source community are available. ClickHouse and HDFS deployed in clusters of different modes cannot be interconnected.
Prerequisites
- The ClickHouse client has been installed in a directory, for example, /opt/client.
- A user who has permissions on ClickHouse tables and permission to access HDFS, for example, clickhouseuser, has been created on FusionInsight Manager.
- A corresponding directory exists in HDFS. The HDFS engine of ClickHouse works only with files; it does not create or delete directories.
- When ClickHouse accesses HDFS across clusters, a user who has permission to access HDFS, for example, hdfsuser, has been created on FusionInsight Manager in the cluster where HDFS is deployed.
- You have obtained the HDFS cluster domain name by logging in to FusionInsight Manager and choosing System > Permission > Domain and Mutual Trust.
- ClickHouse cannot connect to encrypted HDFS directories.
- Only the ClickHouse cluster deployed on x86 nodes can connect to HDFS. The ClickHouse cluster deployed on Arm nodes cannot connect to HDFS.
Interconnecting ClickHouse with HDFS in a Cluster
- Log in to FusionInsight Manager, choose Cluster > Services > HDFS, click Configurations and then All Configurations, search for hadoop.rpc.protection and change its value to Authentication or Integrity, save the settings, and restart the HDFS service.
- Choose System > Permission > User, select clickhouseuser, and choose More > Download Authentication Credential.
For the first authentication, change the initial password before downloading the authentication credential file. Otherwise, the security authentication will fail.
- Decompress the downloaded authentication credential package and change the name of user.keytab to clickhouse_to_hdfs.keytab.
- Log in to FusionInsight Manager, choose Cluster > Services > ClickHouse, and click Configurations and then All Configurations. Click ClickHouseServer(Role) and select Engine. Click Upload File next to hdfs.hadoop_kerberos_keytab_file and upload the clickhouse_to_hdfs.keytab file prepared in the previous step. Set hdfs.hadoop_kerberos_principal to a value in the format Username@Domain name, for example, clickhouseuser@HADOOP.COM.
- Save the configuration and restart ClickHouse.
- Log in to the node where the client is installed as the client installation user.
- Run the following command to go to the client installation directory:
cd /opt/client
- Run the following command to configure environment variables:
source bigdata_env
- Run the following command to authenticate the current user. (Skip this step for a cluster with Kerberos authentication disabled.)
kinit clickhouseuser
- Run the client command of ClickHouse to log in to the ClickHouse client.
clickhouse client --host Service IP address of the ClickHouseServer instance --secure --port 9440
- Run the following command to connect ClickHouse to HDFS:
CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')
- To obtain the service IP address of the ClickHouseServer instance, perform the following steps:
Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP addresses of the ClickHouseServer instance.
- To obtain the value of namenode_ip, perform the following steps:
Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP addresses of the active NameNode.
- To obtain the value of dfs.namenode.rpc.port, perform the following steps:
Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Configurations tab then the All Configurations sub-tab. On this sub-tab page, search for dfs.namenode.rpc.port to obtain its value.
- HDFS file path to be accessed:
If multiple files need to be accessed, add an asterisk (*) to the end of the folder path, for example, hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*.
ClickHouse cannot connect to encrypted HDFS directories.
- Write data. For details, see Process of Writing ClickHouse Data to HDFS. A minimal end-to-end sketch follows this list.
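The following sketch puts the steps above together. It assumes the placeholders have been replaced with real values obtained as described above, that the target file /tmp/secure_ck.txt is initially empty, and that the table name is simply the example name used in this section:

CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32)
ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV');
-- Write two rows to the HDFS file, then read them back.
INSERT INTO default.hdfs_engine_table VALUES ('one', 1), ('two', 2);
SELECT * FROM default.hdfs_engine_table;

As described in Process of Writing ClickHouse Data to HDFS, a write to a target that already contains data fails.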
Interconnecting ClickHouse with HDFS Across Clusters
- Log in to FusionInsight Manager of the HDFS cluster, choose Cluster > Services > HDFS, click Configurations and then All Configurations, search for hadoop.rpc.protection and change its value to Authentication or Integrity, save the settings, and restart the HDFS service.
- Log in to FusionInsight Manager of the ClickHouse cluster and choose System > Permission > Domain and Mutual Trust. Configure mutual trust or one-way trust with the HDFS cluster. For one-way trust, configure the trust relationship with the HDFS cluster only on the ClickHouse cluster.
- Log in to FusionInsight Manager of the HDFS cluster and choose System > Permission > User. On the page that is displayed, select hdfsuser, click More, and select Download Authentication Credential.
For the first authentication, change the initial password before downloading the authentication credential file. Otherwise, the security authentication will fail.
- Decompress the downloaded authentication credential package and change the name of user.keytab to clickhouse_to_hdfs.keytab.
- Log in to FusionInsight Manager of the ClickHouse cluster, choose Cluster > Services > ClickHouse, and click Configurations and then All Configurations. Click ClickHouseServer(Role) and select Engine. Click Upload File next to hdfs.hadoop_kerberos_keytab_file and upload the clickhouse_to_hdfs.keytab file prepared in the previous step. Set hdfs.hadoop_kerberos_principal to a value in the format Username@Domain name, for example, hdfsuser@HDFS_HADOOP.COM.
- Save the configuration and restart ClickHouse.
- Log in to the node where the client is installed as the client installation user.
- Run the following command to go to the client installation directory:
cd /opt/client
- Run the following command to configure environment variables:
source bigdata_env
- Run the following command to authenticate the current user. (Skip this step for a cluster with Kerberos authentication disabled.)
kinit clickhouseuser
- Run the client command of ClickHouse to log in to the ClickHouse client.
clickhouse client --host Service IP address of the ClickHouseServer instance --secure --port 9440
- Run the following command to connect ClickHouse to HDFS:
CREATE TABLE default.hdfs_engine_table (`name` String, `value` UInt32) ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/secure_ck.txt', 'TSV')
- To obtain the service IP address of the ClickHouseServer instance, perform the following steps:
Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP addresses of the ClickHouseServer instance.
- To obtain the value of namenode_ip, perform the following steps:
Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Instance tab. On this tab page, obtain the service IP addresses of the active NameNode.
- To obtain the value of dfs.namenode.rpc.port, perform the following steps:
Log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the page that is displayed, click the Configurations tab then the All Configurations sub-tab. On this sub-tab page, search for dfs.namenode.rpc.port to obtain its value.
- HDFS file path to be accessed:
If multiple files need to be accessed, add an asterisk (*) to the end of the folder path, for example, hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*. A sketch of such a wildcard mapping follows this list.
- Write data. For details, see Process of Writing ClickHouse Data to HDFS.
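The cross-cluster case uses the same table syntax, so the sketch below illustrates only the wildcard mapping mentioned above. The table name default.hdfs_dir_table is illustrative, and the path and port are placeholders to replace with values obtained from the HDFS cluster:

-- Map one table to every file under /tmp in the remote HDFS cluster (wildcard paths are read-only).
CREATE TABLE default.hdfs_dir_table (`name` String, `value` UInt32)
ENGINE = HDFS('hdfs://{namenode_ip}:{dfs.namenode.rpc.port}/tmp/*', 'TSV');
SELECT count() FROM default.hdfs_dir_table;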
Process of Writing ClickHouse Data to HDFS
When ClickHouse data is written to HDFS, for example, to a Hive table stored in HDFS, the write succeeds if the Hive table is empty and fails if the Hive table already contains data. If the write fails, perform the following steps (a hedged sketch of the sequence follows the list):
- Back up the Hive table mapped to the ClickHouse table. For example, if the ClickHouse table is ck_tab_a and the corresponding Hive table is hive_tab_a, back up hive_tab_a to hive_tab_a_bak.
- Delete the Hive table hive_tab_a.
- Insert data to the ClickHouse table ck_tab_a on the ClickHouse client.
- On the Hive client, insert the data from hive_tab_a_bak into the Hive table hive_tab_a.
- Delete the backup Hive table hive_tab_a_bak.
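The sequence above might look as follows. This is a sketch only: it assumes ck_tab_a uses the (name String, value UInt32) schema from the earlier example and that hive_tab_a can be recreated from its original DDL; adjust the Hive statements to the actual table definition:

-- On the Hive client: back up the mapped table, then remove it so the write target is empty.
CREATE TABLE hive_tab_a_bak AS SELECT * FROM hive_tab_a;
DROP TABLE hive_tab_a;
-- On the ClickHouse client: write the new rows through the HDFS engine table.
INSERT INTO ck_tab_a VALUES ('three', 3);
-- On the Hive client: restore the backed-up rows (recreate hive_tab_a first if the DROP removed its definition), then drop the backup.
INSERT INTO TABLE hive_tab_a SELECT * FROM hive_tab_a_bak;
DROP TABLE hive_tab_a_bak;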