Updated on 2024-08-16 GMT+08:00

Application Scenarios of HBase BulkLoad and Put

Both the BulkLoad and Put methods can be used to load data into HBase. BulkLoad loads data faster than Put, but it has drawbacks of its own. The following describes the application scenarios of the two data loading methods.

BulkLoad starts MapReduce tasks to generate HFiles and then registers the HFiles with HBase. If BulkLoad is used incorrectly, the MapReduce tasks consume additional cluster memory and CPU resources, and a large number of small generated HFiles may trigger compaction frequently, which significantly slows down queries.
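As a rough illustration of the small-HFile concern, the following sketch estimates the average HFile size of a planned BulkLoad job and compares it with the HDFS block size. The function names, the 128 MB block size, and the "half a block" threshold are assumptions made for this example and are not part of HBase; check the actual dfs.blocksize of your cluster and the number of HFiles your job writes.

```python
# Assumed HDFS block size (128 MB); replace with your cluster's dfs.blocksize.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def average_hfile_size(total_bytes, hfile_count):
    """Return the average size in bytes of the HFiles a job would write."""
    if hfile_count <= 0:
        raise ValueError("hfile_count must be positive")
    return total_bytes / hfile_count


def produces_small_hfiles(total_bytes, hfile_count,
                          block_size=HDFS_BLOCK_SIZE, threshold=0.5):
    """Return True if the average HFile is smaller than threshold * block_size.

    The 0.5 threshold is a hypothetical cut-off chosen for illustration.
    """
    return average_hfile_size(total_bytes, hfile_count) < threshold * block_size
```

For example, under these assumptions, 1 GB of data written as 1,000 HFiles is flagged as producing small files, whereas the same data written as 8 HFiles is not.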

If the Put method is used incorrectly, data loading may be slow. If the memory allocated to RegionServer is insufficient, the RegionServer process may exit because of memory overflow.

The application scenarios of the BulkLoad and Put methods are as follows:

  • BulkLoad:
    • Large amounts of data need to be loaded to HBase in a one-off manner.
    • Reliability requirements for the data loading are not high, and WAL files do not need to be generated.
    • The data volume is so large that loading it with the Put method would make both data loading and queries slow.
    • The size of each HFile generated by the data loading is close to the HDFS block size.
  • Put:
    • The amount of data loaded to a Region at a time is less than half the HDFS block size.
    • Data needs to be loaded to HBase in real time.
    • The query speed must not decrease dramatically during data loading.
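
The scenarios above can be condensed into a simple rule of thumb. The following is a minimal sketch, not an official selection tool: the function name, its parameters, and the 128 MB block size are assumptions made for this example, and treating a WAL requirement as a reason to choose Put is a simplification of the reliability scenario listed for BulkLoad.

```python
# Assumed HDFS block size (128 MB); replace with your cluster's dfs.blocksize.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def choose_load_method(bytes_per_region, real_time=False, wal_required=False,
                       block_size=HDFS_BLOCK_SIZE):
    """Suggest "Put" or "BulkLoad" for one load, following the listed scenarios."""
    # Real-time loading, or a need for WAL-backed reliability, points to Put.
    if real_time or wal_required:
        return "Put"
    # Less than half an HDFS block per Region at a time also points to Put.
    if bytes_per_region < block_size / 2:
        return "Put"
    # Otherwise this is a large one-off load, which points to BulkLoad.
    return "BulkLoad"
```

For instance, with these assumptions a one-off load of 1 GB per Region is directed to BulkLoad, while a 10 MB load per Region or any real-time load is directed to Put.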