CarbonData Data Migration

Scenario

To migrate CarbonData data quickly from one cluster to another, use the CarbonData backup and restoration commands. Because the data does not have to be imported again in the target cluster, this method shortens the migration time.

Prerequisites

The Spark2x client has been installed in a directory (for example, /opt/client) in both clusters. In this procedure, cluster A is the source cluster and cluster B is the target cluster.

Procedure

Log in to the node where the client is installed in cluster A as the client installation user.
Run the following commands to configure environment variables:
source /opt/client/bigdata_env

source /opt/client/Spark2x/component_env
If the cluster is in security mode, run the following command to authenticate the user. In normal mode, skip user authentication.
kinit carbondatauser

carbondatauser indicates the user that owns the original data, that is, a user with read and write permissions on the tables.

You must add the user to the hadoop (primary group) and hive groups, and associate it with the System_administrator role.
Run the following command to connect to the database and check the location for storing table data on HDFS:
spark-beeline

desc formatted <name of the table containing the original data>;

Location in the displayed information indicates the directory where the data file resides.
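
When many tables are migrated, the Location lookup can be scripted. The following is a minimal sketch, not part of the product: extract_location is a hypothetical helper, and it assumes that the desc formatted output uses the pipe-delimited table layout (| col_name | data_type | comment |) with a row whose first column is Location.

```shell
# Hypothetical helper: read `desc formatted` output on stdin and print
# the value of the Location row.
# Assumption: rows look like "| Location | hdfs://... | |".
extract_location() {
  awk -F'|' '
    {
      key = $2; val = $3
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim surrounding blanks
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      sub(/:$/, "", key)                 # tolerate a "Location:" label
      if (key == "Location") { print val; exit }
    }'
}
```

If spark-beeline accepts the -e option of beeline in your environment (an assumption to verify), the output of the desc formatted statement can be piped straight into extract_location.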
Log in to the node where the client is installed in cluster B as the client installation user and configure the environment variables:
source /opt/client/bigdata_env

source /opt/client/Spark2x/component_env
If the cluster is in security mode, run the following command to authenticate the user. In normal mode, skip user authentication.
kinit carbondatauser2

carbondatauser2 indicates the user that uploads data.

You must add the user to the hadoop (primary group) and hive groups, and associate it with the System_administrator role.
Run the spark-beeline command to connect to the database.
Does the database that maps to the original data exist?
- If yes, go to 9 (copying the original data to cluster B).
- If no, run the create database <database name> command to create a database with the same name as the one that holds the original data, and then go to 9.
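
The existence check can be automated against the output of show databases. The sketch below is illustrative only: db_exists is a hypothetical helper, and it assumes the output is the pipe-delimited table printed by beeline, with one database name per row.

```shell
# Hypothetical helper: read `show databases` output on stdin and
# succeed (exit status 0) if the named database is listed.
# Assumption: each row looks like "| db1 |"; the match ignores case.
db_exists() {
  awk -F'|' -v want="$1" '
    {
      name = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding blanks
      if (name != "" && tolower(name) == tolower(want)) found = 1
    }
    END { exit (found ? 0 : 1) }'
}
```

A nonzero exit status means the database is missing and has to be created before the data is copied.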
Copy the original data from the HDFS directory in cluster A to that in cluster B.
When uploading data in cluster B, ensure that the upload directory contains directories with the same names as the database and table in the original directory, and that the upload user has write permission on the upload directory. After the data is uploaded, the user has read and write permissions on the data.

For example, if the original data is stored in /user/carbondatauser/warehouse/db1/tb1, the data can be stored in /user/carbondatauser2/warehouse/db1/tb1 in the new cluster.
1. Run the following command to download the original data to the /opt/backup directory on the client node of cluster A:
  hdfs dfs -get /user/carbondatauser/warehouse/db1/tb1 /opt/backup
2. Run the following command to copy the original data from cluster A to the /opt/backup directory on the client node of cluster B (the -r option copies the directory recursively):
  scp -r /opt/backup root@<IP address of the client node of cluster B>:/opt/backup
3. Run the following command to upload the data copied to cluster B to HDFS:
  hdfs dfs -put /opt/backup /user/carbondatauser2/warehouse/db1/tb1
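
The directory naming rule in this step (keep the database and table directory names under the new user's warehouse) can be expressed as a small helper. This is a sketch under assumptions: target_table_dir is a hypothetical name, the last two path components are taken to be the database and the table, and the function only manipulates path strings without touching HDFS.

```shell
# Hypothetical helper: derive the table directory in cluster B from the
# table directory in cluster A.
#   $1: source table directory (ends in <database>/<table>)
#   $2: warehouse root of the upload user in cluster B
target_table_dir() {
  src="${1%/}"             # drop a trailing slash, if any
  dst_root="${2%/}"
  tb="${src##*/}"          # last component: table name
  parent="${src%/*}"
  db="${parent##*/}"       # second-to-last component: database name
  printf '%s/%s/%s\n' "$dst_root" "$db" "$tb"
}
```

Called with a source directory that ends in db1/tb1 and /user/carbondatauser2/warehouse as the second argument, it prints /user/carbondatauser2/warehouse/db1/tb1, which can then be used as the destination of the upload.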
In the client environment of cluster B, run the following command to generate the metadata in Hive for the table that corresponds to the original data:
REFRESH TABLE $dbName.$tbName;

$dbName indicates the database name, and $tbName indicates the table name.
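
If several tables were copied, the REFRESH TABLE statements can be generated from the list of uploaded table directories instead of being typed one by one. The helper below is a hypothetical sketch: refresh_statements is not a product command, and it assumes that each input line is a table directory whose last two components are the database and table names.

```shell
# Hypothetical helper: read table directories on stdin (one per line)
# and print one REFRESH TABLE statement per directory.
refresh_statements() {
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue        # skip empty lines
    dir="${dir%/}"                   # drop a trailing slash, if any
    tb="${dir##*/}"                  # table name
    parent="${dir%/*}"
    db="${parent##*/}"               # database name
    printf 'REFRESH TABLE %s.%s;\n' "$db" "$tb"
  done
}
```

The printed statements can be reviewed and then run in spark-beeline in cluster B.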
If the original table has an index table, repeat 9 and 10 (the data copy and REFRESH TABLE steps) for the index table to migrate its directory from cluster A to cluster B.
Run the following command to register an index table for the CarbonData table (skip this step if no index table is created for the original table):
REGISTER INDEX TABLE $tableName ON $maintable;

$tableName indicates the index table name, and $maintable indicates the name of the main table on which the index was created.