Restoring Hive Service Data

Scenarios

Hive data restoration is required in the following scenarios: when data is unexpectedly modified or deleted and requires retrieval; when major Hive operations (such as upgrades or significant adjustments) cause exceptions in system data or fail to achieve the expected result; when all modules fail and become unavailable; and when data is migrated to a new cluster.

Hive service data restoration tasks can be created on FusionInsight Manager. The system supports manual data restoration only.

MRS clusters support multiple data path types for restoring Hive service data.

RemoteHDFS: indicates that data is restored from the HDFS directory of the standby cluster.
NFS: indicates that data is restored from the NAS using the NFS protocol.
CIFS: indicates that data is restored from the NAS using the CIFS protocol.
SFTP: indicates that data is restored from the server using the SFTP protocol.
OBS: indicates that data is restored from OBS.

Hive backup and restoration cannot identify the service and structure relationships of objects such as Hive tables, indexes, and views. When executing backup and restoration tasks, you need to manage unified restoration points based on service scenarios to ensure proper service running.

To restore data when the service is running properly, it is recommended that you manually back up the latest management data before performing data restoration. Otherwise, the Hive data that is generated after the data backup and before the data restoration will be lost.

Notes and Constraints

Data restoration can be performed only when the system version is consistent with the version used during data backup.
MRS 3.5.0 and later support data restoration from OBS.

Impact on the System

During data restoration, user authentication stops and users cannot create new connections.
After the data is restored, the data generated after the data backup and before the data restoration is lost.
After the data is restored, you need to start the upper-layer applications of Hive.

Prerequisites

If you need to restore data from a remote HDFS, a standby cluster has been created and the data has been backed up. For details, see Backing Up Hive Service Data. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see Configuring Mutual Trust Between MRS Clusters. If the active cluster is deployed in normal mode, mutual trust is not required.

Cross-cluster replication has been configured for the active and standby clusters. For details, see Enabling MRS Inter-Cluster Replication.
Time is consistent between the active and standby clusters, with the NTP services on both clusters configured to use the same time source.
The database for storing restored data tables, the HDFS save path of data tables, and the list of users who can access restored data are planned.
The Hive backup file save path is correct.
The Hive upper-layer applications are stopped.

Restoring Hive Service Data

Log in to MRS Manager.

For details about how to log in to MRS Manager, see Accessing MRS Manager.
Choose O&M > Backup and Restoration > Backup Management.
In the row containing the specified backup task, choose More > View History in the Operation column to display the task's historical execution records.

In the displayed window, locate the desired successful execution record and click View in the Backup Path column to display the task's backup path information and obtain the following details:
- Backup Object: indicates the backup data source.
- Backup Path: indicates the full path where the backup files are stored.
  Locate the correct path, and manually copy the full path of the backup files from the Backup Path column.
On FusionInsight Manager, choose O&M > Backup and Restoration > Restoration Management.
Click Create.
Set Task Name to the name of the restoration task.
From Recovery Object, select the cluster to operate on.
In the Restoration Configuration area, select Hive.

Select a backup directory type for Path Type of Hive.

**Table 1** Path for data restoration
Type	Parameter	Description
RemoteHDFS	Source NameService Name	NameService name of the backup data cluster. You can set it to the NameService name (haclusterX, haclusterX1, haclusterX2, haclusterX3, or haclusterX4) of the built-in remote cluster. You can also set it to the NameService name of a configured remote cluster.
	IP Mode	IP version of the target IP address. The system automatically determines the IP version, such as IPv4 or IPv6, based on the cluster network type.
	Source NameNode IP Address	Service plane IP address of the active or standby NameNode in the standby cluster.
	Source Path	Full path of the HDFS directory storing backup data in the standby cluster. Path format: Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time.tar.gz
	Queue Name	Name of the YARN queue used to execute the restoration task. The name must be identical to that of a queue currently running properly in the cluster.
	Restoration Point List	Click Refresh and select a Hive backup file set that has been backed up in the standby cluster.
	Target NameService Name	Name of the target NameService for the backup directory. The default value is hacluster.
	Maximum Number of Maps	Maximum number of maps in a MapReduce task. The default value is 20.
	Maximum Map Bandwidth (MB/s)	Maximum bandwidth of a map. The default value is 100.
NFS	IP Mode	IP version of the target IP address. The system automatically determines the IP version, such as IPv4 or IPv6, based on the cluster network type.
	Server IP Address	IP address of the NAS server.
	Source Path	Full path of the NAS server directory storing backup files. Path format: Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time.tar.gz
	Queue Name	Name of the YARN queue used to execute the restoration task. The name must be identical to that of a queue currently running properly in the cluster.
	Restoration Point List	Click Refresh and select a Hive backup file set that has been backed up to the NAS server.
	Target NameService Name	Name of the target NameService for the backup directory. The default value is hacluster.
	Maximum Number of Maps	Maximum number of maps in a MapReduce task. The default value is 20.
	Maximum Map Bandwidth (MB/s)	Maximum bandwidth of a map. The default value is 100.
CIFS	IP Mode	IP version of the target IP address. The system automatically determines the IP version, such as IPv4 or IPv6, based on the cluster network type.
	Server IP Address	IP address of the NAS server.
	Port	Port number used by the CIFS protocol to connect to the NAS server. The default value is 445.
	Username	Username configured during CIFS protocol setup.
	Password	Password configured during CIFS protocol setup.
	Source Path	Full path of the NAS server directory storing backup files. Path format: Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time.tar.gz
	Queue Name	Name of the YARN queue used to execute the restoration task. The name must be identical to that of a queue currently running properly in the cluster.
	Restoration Point List	Click Refresh and select a Hive backup file set that has been backed up to the NAS server.
	Target NameService Name	Name of the target NameService for the backup directory. The default value is hacluster.
	Maximum Number of Maps	Maximum number of maps in a MapReduce task. The default value is 20.
	Maximum Map Bandwidth (MB/s)	Maximum bandwidth of a map. The default value is 100.
SFTP	IP Mode	IP version of the target IP address. The system automatically determines the IP version, such as IPv4 or IPv6, based on the cluster network type.
	Server IP Address	IP address of the server where the backup data is stored.
	Port	Port number used by the SFTP protocol to connect to the backup server. The default value is 22.
	Username	Username used to connect to the server over SFTP.
	Password	Password used to connect to the server over SFTP.
	Source Path	Full path of the backup server directory storing backup files. Path format: Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time.tar.gz
	Queue Name	Name of the YARN queue used to execute the restoration task. The name must be identical to that of a queue currently running properly in the cluster.
	Restoration Point List	Click Refresh and select a Hive backup file set that has been backed up to the backup server.
	Target NameService Name	Name of the target NameService for the backup directory. The default value is hacluster.
	Maximum Number of Maps	Maximum number of maps in a MapReduce task. The default value is 20.
	Maximum Map Bandwidth (MB/s)	Maximum bandwidth of a map. The default value is 100.
OBS (available in MRS 3.5.0 and later)	Source Path	Full path of the OBS directory storing backup files. Path format: Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time.tar.gz
	Queue Name	Name of the YARN queue used to execute the restoration task. The name must be identical to that of a queue currently running properly in the cluster.
	Restoration Point List	Click Refresh and select an OBS directory that has already been backed up.
	Target NameService Name	Name of the target NameService for the backup directory. The default value is hacluster.
	Maximum Number of Maps	Maximum number of maps in a MapReduce task. The default value is 20.
	Maximum Map Bandwidth (MB/s)	Maximum bandwidth of a map. The default value is 100.
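
Every path type in Table 1 uses the same Source Path format, so it can help to check a copied path before pasting it into the task. The following Python sketch splits a path into the components named in that format. It is an illustration only, not part of FusionInsight Manager. It assumes that the data source and the two time fields contain no underscores (the task name may contain them), and the helper name parse_source_path and the sample path used with it are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class BackupLocation:
    backup_path: str    # "Backup path"
    task_name: str      # "Backup task name"
    data_source: str    # "Data source"
    task_created: str   # "Task creation time", kept as an opaque string
    version: str        # "Version"
    task_executed: str  # "Task execution time", kept as an opaque string


def parse_source_path(source_path: str) -> BackupLocation:
    """Split a Source Path into the fields named in the Table 1 path format.

    Assumption: the last two underscore-separated fields of each segment
    contain no underscores themselves, so splitting from the right is safe.
    """
    suffix = ".tar.gz"
    path = PurePosixPath(source_path)
    if not path.name.endswith(suffix):
        raise ValueError("expected a .tar.gz backup file: " + source_path)

    # Directory segment: <task name>_<data source>_<task creation time>
    dir_parts = path.parent.name.rsplit("_", 2)
    # File segment: <version>_<data source>_<task execution time>.tar.gz
    file_parts = path.name[: -len(suffix)].rsplit("_", 2)
    if len(dir_parts) != 3 or len(file_parts) != 3:
        raise ValueError("path does not match the documented format: " + source_path)

    task_name, dir_source, created = dir_parts
    version, file_source, executed = file_parts
    if dir_source != file_source:
        raise ValueError("data source differs between directory and file name")

    return BackupLocation(
        backup_path=str(path.parent.parent),
        task_name=task_name,
        data_source=dir_source,
        task_created=created,
        version=version,
        task_executed=executed,
    )
```

A path that raises ValueError here was probably truncated when it was copied from the Backup Path column. The time fields are not interpreted because their exact format is not stated in this section.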

In the Data Configuration area, set Backup Data to one or more backup data sources to restore, based on service requirements. In the Target Database and Target Path columns, specify the database and the file save path to which the backup data is restored.

Configuration restrictions:
- Data can be restored to the original database, but data tables must be stored in a new path that is different from the backup path.
- To restore Hive index tables, select the Hive data tables that correspond to the Hive index tables to be restored.
- If a new restoration directory is selected to avoid affecting the current data, you must manually grant HDFS permissions so that users who have permissions on the backup tables can access this directory.
- Data can be restored to other databases. In this case, you must manually grant HDFS permissions so that users who have permissions on the backup tables can access the HDFS directory that corresponds to the database.
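
The manual HDFS permission grant mentioned in the restrictions above can be prepared in advance for the planned user list. The following Python sketch only builds the command lines; it does not run them. It assumes that HDFS ACLs are enabled on the cluster and that ACLs are the mechanism you want to use. Depending on the cluster's permission model, a different method may be more appropriate. The function name build_acl_commands and the example directory and users are hypothetical.

```python
import re


def build_acl_commands(target_dir, users, permission="r-x"):
    """Return one `hdfs dfs -setfacl` argument list per user.

    target_dir: absolute HDFS directory that holds the restored tables.
    users: names of the users who had permissions on the backup tables.
    permission: three-character mode such as "r-x" or "rwx".
    """
    if not target_dir.startswith("/"):
        raise ValueError("target_dir must be an absolute HDFS path")
    if not re.fullmatch(r"[r-][w-][x-]", permission):
        raise ValueError("permission must look like 'r-x' or 'rwx'")

    commands = []
    for user in users:
        # -R applies the entry recursively; -m modifies the ACL without
        # removing entries that already exist on the directory.
        commands.append(
            ["hdfs", "dfs", "-setfacl", "-R", "-m",
             f"user:{user}:{permission}", target_dir]
        )
    return commands
```

Each returned list can be joined with spaces for review, or passed to subprocess.run on a node where the HDFS client is installed and an account with sufficient rights is authenticated.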
Set Force recovery to true to forcibly restore all backup data when a table with the same name already exists. Any data added to the table after the backup will be lost during restoration. If you set the parameter to false, the restoration task is not executed if a data table with the same name exists.
Click Verify to check whether the restoration task is configured correctly.
- If the queue name is incorrect, the verification fails.
- If the specified directory to be restored does not exist, the verification fails.
- If the forcible overwrite conditions are not met, the verification fails.
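
Before clicking Verify, the same conditions can be checked against the planned configuration. The following Python sketch is a local checklist and not the verification logic used by FusionInsight Manager. The class RestoreTask, the function preflight, and all sample values are hypothetical. The third check is one reading of the forcible overwrite condition, based on the description of Force recovery: a table with the same name exists while Force recovery is false.

```python
from dataclasses import dataclass, field


@dataclass
class RestoreTask:
    queue_name: str
    source_path: str
    tables: list = field(default_factory=list)  # tables to be restored
    force_recovery: bool = False


def preflight(task, running_queues, known_source_paths, existing_tables):
    """Return a list of problems; an empty list means no problem was found.

    running_queues: names of YARN queues currently running in the cluster.
    known_source_paths: backup paths confirmed to exist.
    existing_tables: table names already present in the target database.
    """
    problems = []
    if task.queue_name not in running_queues:
        problems.append("queue name is not a running queue")
    if task.source_path not in known_source_paths:
        problems.append("directory to be restored does not exist")
    # Assumed overwrite rule: same-name tables require Force recovery = true.
    clashes = sorted(set(task.tables) & set(existing_tables))
    if clashes and not task.force_recovery:
        problems.append(
            "tables already exist and Force recovery is false: " + ", ".join(clashes)
        )
    return problems
```

The inputs are collected by hand, for example from the queue list and from the copied backup path, so the result is only as reliable as those inputs. Verify on FusionInsight Manager remains the authoritative check.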
Click OK.
In the restoration task list, locate the row containing the created task, and click Start in the Operation column to execute the restoration task.
- After the restoration is successful, the progress bar turns green.
- After the restoration is successful, the restoration task cannot be executed again.
- If the restoration task fails during the first execution, rectify the fault and click Retry to execute the task again.