
Introduction

Overview

FusionInsight Manager provides backup and restoration capabilities for user data and system data in a cluster. The backup function is provided on a per-component basis. The system supports backup of Manager data, component metadata (DBService, HDFS NameNode, HBase, Kafka, and Yarn), and service data (HBase, HDFS, and Hive).

The backup function supports data backup to the local disk, local HDFS, remote HDFS, NAS (NFS/CIFS), SFTP server, and OBS. For details, see section Backing Up Data.

For components that support multiple services, multiple instances of the same service can be backed up and restored; the backup and restoration operations are the same as those for a single instance.

MRS 3.1.0 and later versions support backing up data to OBS.

The backup and recovery tasks are performed in the following scenarios:

  • Routine backup is performed to ensure the data security of the system and components.
  • When the system is faulty, the data backup can be used to recover the system.
  • When the active cluster is completely faulty, a mirror cluster identical to the active cluster needs to be created, and the backup data can be used to restore it.
Table 1 Backing up Manager configuration data based on service requirements

Backup type: OMS
Backup content: Database data (excluding alarm data) and configuration data in the cluster management system (backed up by default).
Backup directory types: LocalDir, LocalHDFS, RemoteHDFS, NFS, CIFS, SFTP, OBS
Table 2 Backing up component metadata or other data based on service requirements

Backup type: DBService
Backup content: Metadata of the components managed by DBService (including Loader, Hive, Spark, Oozie, and Hue). After the multi-instance function is enabled, the metadata of multiple Hive and Spark service instances is backed up.
Backup directory types: LocalDir, LocalHDFS, RemoteHDFS, NFS, CIFS, SFTP, OBS

Backup type: Kafka
Backup content: Kafka metadata.
Backup directory types: LocalDir, LocalHDFS, RemoteHDFS, NFS, CIFS, OBS

Backup type: NameNode
Backup content: HDFS metadata. For clusters with multiple NameServices enabled, backup and recovery are supported for each NameService, and the operations are consistent with those of the default hacluster instance.

Backup type: Yarn
Backup content: Information about the Yarn service resource pool.

Backup type: HBase
Backup content: The tableinfo files and data files of HBase.

Backup directory types (NameNode, Yarn, and HBase): LocalDir, RemoteHDFS, NFS, CIFS, SFTP, OBS
Table 3 Backing up service data of specific components based on service requirements

Backup type: HBase
Backup content: Table-level user data. For clusters with multiple services enabled, multiple HBase service instances can be backed up and restored; the operations are the same as those for a single HBase service instance.

Backup type: HDFS
Backup content: The directories or files that correspond to user services.
NOTE: Encrypted directories cannot be backed up or restored.

Backup type: Hive
Backup content: Table-level user data. For clusters with multiple services enabled, multiple Hive service instances can be backed up and restored; the operations are the same as those for a single Hive service instance.

Backup directory types (HBase, HDFS, and Hive): RemoteHDFS, NFS, CIFS, SFTP
Note that some components do not provide the data backup and restoration functions:

  • Kafka supports data replicas; multiple replicas can be specified when a topic is created.
  • MapReduce and Yarn data is stored in HDFS; therefore, MapReduce and Yarn rely on HDFS to provide the backup and restoration functions.
  • Backup and restoration of service data stored in ZooKeeper are performed by the upper-layer components that own the data.

Principles

Task

Before backup or recovery, you need to create a backup or recovery task and set task parameters, such as the task name, backup data source, and type of the backup file save path. Data backup and recovery are then performed by executing the backup and recovery tasks. When Manager is used to recover the data of HDFS, HBase, Hive, or NameNode, the cluster cannot be accessed.

Each backup task can back up data from different data sources and generates an independent backup file for each data source. All the backup files generated in one backup task form a backup file set, which can be used in recovery tasks. Backup data can be stored on Linux local disks, in the HDFS of the local cluster, or in the HDFS of the standby cluster. Backup tasks provide full and incremental backup policies: HBase, HDFS, and Hive backup tasks support the incremental backup policy, while OMS, DBService, and NameNode backup tasks support only the full backup policy.

Task execution rules:

  • If a task is being executed, it cannot be executed again, and no other task can be started at the same time.
  • The interval at which a periodic task is automatically executed must be greater than 120s; otherwise, the task is postponed and executed in the next period. Manual tasks can be executed at any interval.
  • When a periodic task is due to be automatically executed, the current time cannot be more than 120s later than the task start time; otherwise, the task is postponed and executed in the next period.
  • When a periodic task is locked, it cannot be automatically executed and needs to be manually unlocked.
  • Before an OMS, LdapServer, Kafka, or NameNode backup task starts, ensure that the LocalBackup partition on the active management node has more than 20 GB of available space; otherwise, the backup task cannot be started.

When planning backup and recovery tasks, select the data to be backed up or recovered strictly based on the service logic, data store structure, and database or table association. By default, the system creates the periodic backup tasks default-oms and default-cluster ID (where cluster ID is the ID of the cluster) at an interval of one hour, to fully back up OMS data and the metadata of DBService and NameNode to the local disk.

Snapshot

The system adopts snapshot technology to quickly back up data. Snapshots include HBase snapshots and HDFS snapshots; example commands for creating each type manually are shown after the following list.

  • HBase snapshot

    An HBase snapshot is a backup file of HBase tables at a specified time point. This backup file does not copy service data or affect the RegionServer. The HBase snapshot copies table metadata, including the table descriptor, region info, and HFile reference information. The metadata can be used to restore the table data to its state at the time the snapshot was created.

  • HDFS snapshot

    An HDFS snapshot is a read-only backup copy of the HDFS file system at a specified time point. The snapshot is used in data backup, misoperation protection, and disaster recovery scenarios.

    The snapshot function can be enabled for any HDFS directory to create the related snapshot file. Before creating a snapshot for a directory, the system automatically enables the snapshot function for the directory. Creating a snapshot does not affect any HDFS operation. A maximum of 65536 snapshots can be created for each HDFS directory.

    After a snapshot is created for an HDFS directory, the directory cannot be deleted or modified until the snapshot is deleted. In addition, snapshots cannot be created for the upper-layer directories or subdirectories of that directory.
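
For reference, the following is a minimal sketch of how such snapshots can be created manually with the standard HBase and HDFS command-line tools. The table name, directory, and snapshot names are hypothetical examples; the backup tasks described in this document create and manage snapshots automatically, so these commands are not required for normal backup operations.

  # HBase snapshot: commands entered in the hbase shell (table and snapshot names are hypothetical)
  hbase shell
  snapshot 'user_table', 'user_table_snap_20220222'
  list_snapshots

  # HDFS snapshot: allow snapshots on a (hypothetical) directory, then create one
  hdfs dfsadmin -allowSnapshot /user/servicedata
  hdfs dfs -createSnapshot /user/servicedata s20220222
  # the snapshot content is then readable under /user/servicedata/.snapshot/s20220222
  hdfs lsSnapshottableDir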

DistCp

Distributed copy (DistCp) is a tool used to replicate large amounts of data within a cluster's HDFS or between the HDFS of different clusters. In an HBase, HDFS, or Hive backup or recovery task, if the data is backed up to the HDFS of the standby cluster, the system invokes DistCp to perform the operation. The same version of the MRS system must be installed on the active and standby clusters.

DistCp uses MapReduce to implement data distribution, error handling, recovery, and reporting. DistCp assigns different map tasks to the source files and directories in the specified list, and each map task copies the portion of data corresponding to the files assigned to it.

To use DistCp to replicate data between the HDFS of two clusters, configure the cross-cluster mutual trust relationship and the cross-cluster replication function for both clusters (the mutual trust relationship does not need to be configured for clusters managed by the same FusionInsight Manager). When backing up cluster data to the HDFS of another cluster, the Yarn component must be installed; otherwise, the backup fails.
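
As an illustration only, a cross-cluster copy with DistCp is submitted as a MapReduce job similar to the following; the NameNode addresses and paths are hypothetical, and in backup tasks Manager builds the DistCp invocation automatically.

  # Copy a (hypothetical) HDFS directory from the active cluster to the standby cluster;
  # -update copies only files that differ from the target, -p preserves file attributes such as permissions
  hadoop distcp -update -p hdfs://active-nn:8020/user/hive/warehouse hdfs://standby-nn:8020/backup/hive/warehouse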

Local rapid recovery

After DistCp is used to back up the HBase, HDFS, and Hive data of the local cluster to the HDFS of the standby cluster, the HDFS of the local cluster retains the backup data snapshots. Users can create local rapid recovery tasks to recover data by using the snapshot files in the HDFS of the local cluster.

NAS

Network Attached Storage (NAS) is a dedicated data storage server which includes the storage device and embedded system software. It provides the cross-platform file sharing function. By using NFS (supporting NFSv3 and NFSv4) and CIFS (supporting SMBv2 and SMBv3) protocols, users can connect the FusionInsight service plane with the NAS server to back up or restore data to or from the NAS.

  • Before data is backed up to the NAS, the system automatically mounts the NAS shared address to a local partition. After the backup is complete, the system unmounts the NAS shared partition (an illustrative mount example is shown after this list).
  • To prevent backup and restoration failures, do not access the local mount point of the NAS shared address during data backup and restoration, for example, /srv/BigData/LocalBackup/nas.
  • When service data is backed up to the NAS, DistCp is used.
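
For reference, the temporary mount performed by the system is conceptually similar to the following manual commands; the NAS server address and share path are hypothetical, and backup tasks mount and unmount the share automatically.

  # Mount a (hypothetical) NFSv3 share to the local backup directory, then unmount it after use
  mount -t nfs -o vers=3 192.168.1.10:/backup_share /srv/BigData/LocalBackup/nas
  umount /srv/BigData/LocalBackup/nas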

Specifications

Table 4 Backup and recovery feature specifications

Maximum number of backup or recovery tasks in a cluster: 100
Number of concurrent running tasks: 1
Maximum number of waiting tasks: 199
Maximum size of backup files on a Linux local disk: 600 GB

If the service data of upper-layer components is stored in ZooKeeper, ensure that the number of znodes in a single backup or restoration task is not too large; otherwise, the task will fail and ZooKeeper service performance will be affected. To check the number of znodes in a single backup or restoration task, do as follows (a consolidated command sequence is shown after the procedure):

  • Ensure that the number of znodes in a single backup or restoration task is less than the upper limit of OS file handles.
    1. To check the upper limit at the system level, run the cat /proc/sys/fs/file-max command.
    2. To check the upper limit at the user level, run the ulimit -n command.
  • If the number of znodes in the parent directory exceeds the upper limit, back up and restore data in its sub-directories in batches. To check the number of znodes using ZooKeeper client scripts, perform as follows:
    1. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > ZooKeeper > Instance and view the management IP address of each ZooKeeper role.
    2. Log in to the node where the client resides and run the following command:

      zkCli.sh -server ip:port, where ip can be any of the management IP addresses queried in the previous step, and the default port number is 2181.

    3. If the following information is displayed, the login to the ZooKeeper server is successful:
      WatchedEvent state:SyncConnected type:None path:null
      [zk: ip:port(CONNECTED) 0]
    4. Run the getusage command to check the number of znodes in the directory to be backed up. For example:

      getusage /hbase/region. In the command output, Node count indicates the number of znodes stored in the region directory.
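
The checks described above can be combined into a short command sequence such as the following; the ZooKeeper IP address is a placeholder, and /hbase/region is the example directory used above.

  # OS-level and user-level upper limits of file handles
  cat /proc/sys/fs/file-max
  ulimit -n

  # Connect to ZooKeeper (any ZooKeeper management IP address; the default port is 2181)
  zkCli.sh -server 192.168.1.11:2181

  # After "WatchedEvent state:SyncConnected" is displayed, check the znode count of the
  # directory to be backed up; Node count in the output is the number of znodes
  getusage /hbase/region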

Table 5 Specifications of the default task

Backup period: 1 hour (for all default tasks)
Maximum number of copies: 168 (historical records of seven days) for OMS, HBase, Kafka, and DBService; 24 (historical records of one day) for NameNode
Maximum size of a backup file: 10 MB (OMS), 10 MB (HBase), 512 MB (Kafka), 100 MB (DBService), 20 GB (NameNode)
Maximum size of disk space used: 1.64 GB (OMS), 1.64 GB (HBase), 84 GB (Kafka), 16.41 GB (DBService), 480 GB (NameNode)
Save path of backup data: Data path/LocalBackup/ on the active and standby management nodes

  • The administrator must regularly transfer the backup data of the default task to an external cluster based on the enterprise's O&M requirements.
  • The administrator can create a DistCp backup task to store data of OMS, DBService, and NameNode to an external cluster.
  • The running duration of a cluster data backup task can be estimated as the volume of data to be backed up divided by the network bandwidth between the cluster and the backup device. In actual scenarios, you are advised to multiply the calculated duration by 1.5 as a reference value (see the illustrative estimate after this list).
  • Performing a data backup task affects the maximum I/O performance of the cluster. Therefore, it is recommended that the backup task run time be staggered from the cluster peak hours.
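
As an illustrative estimate with hypothetical figures: backing up 2 TB of data over a 1 Gbit/s link (about 125 MB/s) takes roughly 2,097,152 MB ÷ 125 MB/s ≈ 16,800 seconds, or about 4.7 hours; multiplying by 1.5 gives a planning value of about 7 hours.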