Updated on 2024-10-09 GMT+08:00

Compacting CarbonData Table Segments

Scenario

Frequent data loading results in a large number of fragmented CarbonData files in the storage directory. During each data load, data is sorted and indexed, so an index is generated for every load. As the number of loads grows, so does the number of indexes. Because each index covers only one load, index performance degrades. CarbonData provides a compaction function to address this. During compaction, the data in the selected segments is combined and sorted, and multiple segments are merged into one large segment.

Prerequisites

Multiple data loadings have been performed.

Operation Description

There are three types of compaction: Minor, Major, and Custom.

  • Minor compaction:

    In minor compaction, you can specify the number of loads to be merged. If carbon.enable.auto.load.merge is set to true, minor compaction is triggered for every data load. If any segments are available to be merged, compaction runs in parallel with the data load.

    There are two levels in minor compaction:

    • Level 1: Merging of the segments which are not yet compacted
    • Level 2: Merging of the compacted segments again to form a larger segment
  • Major compaction:

    In major compaction, multiple segments are merged into one large segment. You can specify a compaction size so that segments whose total size is below that size are merged. Major compaction is usually performed during off-peak hours.

  • Custom compaction:

    In Custom compaction, you can specify the IDs of multiple segments to merge them into a large segment. The IDs of all the specified segments must exist and be valid. Otherwise, the compaction fails. Custom compaction is usually done during the off-peak time.
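The two levels of minor compaction can be illustrated with a small simulation. This is a simplified model, not CarbonData's actual implementation: it assumes that level 1 merges all uncompacted segments once a first threshold is reached and that level 2 merges all level 1 segments once a second threshold is reached. The function name and the tuple representation of a segment are hypothetical.

```python
def simulate_minor_compaction(num_loads, level1_threshold, level2_threshold):
    """Simplified model of two-level minor compaction.

    Returns the segments that remain after `num_loads` data loads as a list
    of (level, loads_contained) tuples. Level 0 means "not yet compacted".
    Both thresholds are assumed to be positive integers.
    """
    segments = []
    for _ in range(num_loads):
        segments.append((0, 1))  # each data load creates one new segment

        # Level 1: merge the uncompacted segments once enough have accumulated.
        uncompacted = [s for s in segments if s[0] == 0]
        if len(uncompacted) >= level1_threshold:
            merged_loads = sum(n for _, n in uncompacted)
            segments = [s for s in segments if s[0] != 0]
            segments.append((1, merged_loads))

        # Level 2: merge the level 1 segments once enough have accumulated.
        level1 = [s for s in segments if s[0] == 1]
        if len(level1) >= level2_threshold:
            merged_loads = sum(n for _, n in level1)
            segments = [s for s in segments if s[0] != 1]
            segments.append((2, merged_loads))
    return segments


# With thresholds 2 and 3, six loads end up in one level 2 segment.
print(simulate_minor_compaction(6, 2, 3))  # [(2, 6)]
```

In this model, a fifth load leaves two level 1 segments and one uncompacted segment; the sixth load completes the third level 1 segment, which in turn triggers the level 2 merge.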

For details, see ALTER TABLE COMPACTION.
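As a rough illustration of how the three compaction types map to statements, the following sketch builds the SQL text for each type. The statement form follows the syntax published in the Apache CarbonData documentation; confirm it against the ALTER TABLE COMPACTION reference for your version. The helper name compaction_sql and the table name productdb.sales are hypothetical.

```python
def compaction_sql(table, compaction_type, segment_ids=None):
    """Build an ALTER TABLE ... COMPACT statement (illustrative helper).

    `segment_ids` is required for custom compaction and rejected otherwise.
    """
    kind = compaction_type.upper()
    if kind not in ("MINOR", "MAJOR", "CUSTOM"):
        raise ValueError(f"unsupported compaction type: {compaction_type}")

    sql = f"ALTER TABLE {table} COMPACT '{kind}'"
    if kind == "CUSTOM":
        if not segment_ids:
            raise ValueError("custom compaction needs at least one segment ID")
        ids = ",".join(str(s) for s in segment_ids)
        sql += f" WHERE SEGMENT.ID IN ({ids})"
    elif segment_ids:
        raise ValueError("segment IDs apply only to custom compaction")
    return sql


# Hypothetical table name, used for illustration only.
print(compaction_sql("productdb.sales", "minor"))
print(compaction_sql("productdb.sales", "major"))
print(compaction_sql("productdb.sales", "custom", [2, 3, 4]))
```

The helper only assembles text. Whether the listed segment IDs exist and are valid is still checked by CarbonData when the statement runs.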

Table 1 Compaction parameters

| Parameter | Default Value | Application Type | Description |
| --- | --- | --- | --- |
| carbon.enable.auto.load.merge | false | Minor | Whether to enable compaction along with data loading. true: Compaction is automatically triggered when data is loaded. false: Compaction is not triggered when data is loaded. |
| carbon.compaction.level.threshold | 4,3 | Minor | Number of segments to be merged at each level of minor compaction. For example, if this parameter is set to 2,3, minor compaction is triggered for every two segments, which are merged into a single level 1 compacted segment. When the number of level 1 compacted segments reaches 3, compaction is triggered again to merge them into a single level 2 segment. The appropriate compaction policy depends on the actual data size and available resources. The value ranges from 0 to 100. |
| carbon.major.compaction.size | 1024 MB | Major | Size threshold for major compaction. Segments whose total size is below this threshold are merged. For example, if this parameter is set to 1024 MB and five segments of 300 MB, 400 MB, 500 MB, 200 MB, and 100 MB are candidates for major compaction, only segments whose total size is less than the threshold are compacted. In this example, only the 300 MB, 400 MB, 200 MB, and 100 MB segments (1000 MB in total) are compacted. |
| carbon.numberof.preserve.segments | 0 | Minor/Major | Number of latest segments to exclude from compaction. For example, if this parameter is set to 2, the latest two segments are always excluded from compaction. By default, no segments are preserved. |
| carbon.allowed.compaction.days | 0 | Minor/Major | Restricts compaction to segments loaded within the specified number of recent days. For example, if this parameter is set to 2, only segments loaded in the past 2 days are merged. Segments loaded more than 2 days ago are not merged. This configuration is disabled by default. |
| carbon.number.of.cores.while.compacting | 2 | Minor/Major | Number of cores used for compaction. Using more cores generally improves compaction performance. If CPU resources are sufficient, you can increase the value of this parameter. |
| carbon.merge.index.in.segment | true | SEGMENT_INDEX | If this parameter is set to true, all the CarbonData index (.carbonindex) files in a segment are merged into a single index (.carbonindexmerge) file. This improves the performance of the first query. |
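The interaction of carbon.numberof.preserve.segments, carbon.allowed.compaction.days, and carbon.major.compaction.size can be sketched as a filter followed by a size check. This is a minimal model under stated assumptions, not the CarbonData selection logic: it assumes segments are considered in load order and that a segment is skipped if adding it would reach or exceed the size threshold, which reproduces the example in Table 1. The function name and the segment tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def select_for_major_compaction(segments, now, major_size_mb=1024,
                                preserve=0, allowed_days=0):
    """Pick segments for major compaction (simplified, assumed model).

    `segments` is a list of (segment_id, size_mb, load_time) in load order.
    Returns (selected_ids, total_size_mb).
    """
    candidates = list(segments)

    # Exclude the latest `preserve` segments (0 keeps every segment).
    if preserve > 0:
        candidates = candidates[:-preserve]

    # Keep only segments loaded within the last `allowed_days` days
    # (0 means the restriction is disabled).
    if allowed_days > 0:
        cutoff = now - timedelta(days=allowed_days)
        candidates = [s for s in candidates if s[2] >= cutoff]

    # Assumed greedy rule: add a segment only while the running total
    # stays below the major compaction size.
    selected, total = [], 0
    for seg_id, size_mb, _ in candidates:
        if total + size_mb < major_size_mb:
            selected.append(seg_id)
            total += size_mb
    return selected, total


now = datetime(2024, 10, 9)
recent = now - timedelta(hours=1)
sizes = [300, 400, 500, 200, 100]
segments = [(str(i), size, recent) for i, size in enumerate(sizes)]

print(select_for_major_compaction(segments, now))  # (['0', '1', '3', '4'], 1000)
```

With preserve=2 in the same model, the 200 MB and 100 MB segments are excluded first, so only the 300 MB and 400 MB segments are selected.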

Reference

You are advised not to perform minor compaction on historical data. For details, see How to Avoid Minor Compaction for Historical Data?.