What Are the Application Scenarios of the bulkload and put Data-loading Modes?

Question

Both the bulkload and put data-loading modes can be used to load data to HBase. Though the bulkload mode loads data faster than the put mode, the bulkload mode has its own disadvantages. The following describes the application scenarios of these two data-loading modes.

Answer

The bulkload mode starts MapReduce tasks to generate HFiles and then registers the HFiles with HBase. If used incorrectly, the bulkload mode consumes extra cluster memory and CPU resources because of the MapReduce tasks it starts. A large number of generated HFiles may also trigger compaction frequently, which sharply reduces query speed.

If used incorrectly, the put mode may load data slowly. If the memory allocated to RegionServer is insufficient, the RegionServer process may exit.

The application scenarios of the bulkload and put modes are as follows:

bulkload:
- A large amount of data is loaded to HBase in a one-off manner.
- Reliability requirements are low and WAL files do not need to be generated.
- Loading and query speeds would be low if the put mode were used.
- The size of each HFile generated after data loading is close to the size of an HDFS block.
put:
- The size of the data loaded to one Region at a time is smaller than half the size of an HDFS block.
- Data is loaded to HBase in real time.
- The query speed must not decrease sharply during data loading.
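
The scenarios above can be condensed into a rough decision helper. The following Python sketch is illustrative only and is not part of HBase: the names `LoadProfile` and `suggest_mode` are hypothetical, the 128 MB block size is an assumed value that should be replaced with the cluster's actual `dfs.blocksize`, and treating every load of at least half a block per Region as a bulkload candidate is a simplification of the guidance above.

```python
from dataclasses import dataclass

# Assumption: 128 MB HDFS block size; use your cluster's dfs.blocksize instead.
HDFS_BLOCK_BYTES = 128 * 1024 * 1024


@dataclass
class LoadProfile:
    """Hypothetical description of one data-loading job."""
    bytes_per_region: int  # data loaded to one Region in a single load
    real_time: bool        # True if data must be loaded as it arrives
    needs_wal: bool        # True if the load must be protected by WAL files


def suggest_mode(profile: LoadProfile, block_bytes: int = HDFS_BLOCK_BYTES) -> str:
    """Return "put" or "bulkload" based on the scenarios listed above."""
    # Real-time loads and loads that need WAL protection fit the put mode.
    if profile.real_time or profile.needs_wal:
        return "put"
    # Less than half an HDFS block per Region per load also fits the put mode.
    if profile.bytes_per_region < block_bytes / 2:
        return "put"
    # Otherwise the HFiles would approach the block size: bulkload candidate.
    return "bulkload"


if __name__ == "__main__":
    nightly_batch = LoadProfile(120 * 1024 * 1024, real_time=False, needs_wal=False)
    sensor_stream = LoadProfile(2 * 1024 * 1024, real_time=True, needs_wal=True)
    print(suggest_mode(nightly_batch))
    print(suggest_mode(sensor_stream))
```

A helper like this only captures the sizing and reliability criteria. Whether the cluster has spare memory and CPU for the MapReduce tasks still needs to be checked separately before choosing the bulkload mode.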