Tuning a Bucket Index Table
Common parameters for bucket indexes:
- Spark:
  hoodie.index.type=BUCKET
  hoodie.bucket.index.num.buckets=5
- Flink:
  index.type=BUCKET
  hoodie.bucket.index.num.buckets=5
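As an illustration, the parameters above can be set as table properties when the table is created. The table name, schema, and storage path below are hypothetical; the property keys are the ones listed above:

```sql
-- Spark SQL: hypothetical table created with a bucket index (5 buckets)
CREATE TABLE hudi_orders (
  id BIGINT,
  name STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'id',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '5'
);

-- Flink SQL: the same table registered with a bucket index
CREATE TABLE hudi_orders (
  id BIGINT,
  name STRING,
  ts BIGINT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_orders',
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '5'
);
```

Note that the bucket count is fixed at table creation and cannot be changed afterward, which is why the sizing guidance later in this page matters.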
Determining Whether to Use Partitioned or Non-Partitioned Tables
There are two types of tables based on usage scenarios: fact tables and dimension tables.
- Fact tables typically hold a larger amount of data, primarily new records, with a small proportion of updates. Most updated data falls within a recent time range (year, month, or day). When a downstream system reads the table for ETL calculations, it typically prunes by a time range (e.g., the last day, month, or year). Such tables can usually be partitioned by the creation time of the data to ensure optimal read/write performance.
- Dimension tables generally have a smaller amount of data, primarily updated data, with fewer new entries. The table data size is relatively stable, and full reads are usually required for join-like ETL calculations. Therefore, using non-partitioned tables usually results in better performance.
- The partition key of a partitioned table must not be updated; otherwise, duplicate data will be created.
Exceptional scenarios: super large dimension tables and very small fact tables
Special cases such as dimension tables with continuous large amounts of new data (table data size over 200 GB or daily growth exceeding 60 MB) or very small fact tables (table data size less than 10 GB and will not exceed 10 GB even after 3 to 5 years of growth) need to be handled on a case-by-case basis:
- Dimension tables with continuous large amounts of new data
- Method 1: Reserve bucket numbers. If using a non-partitioned table, increase the bucket count in advance by estimating the data increment over a long period. The disadvantage is that, because the bucket count is fixed, individual files will keep growing as data accumulates.
- Method 2: Large-granularity partitions (recommended). If using a partitioned table, estimate the data growth to choose a partition granularity, for example, yearly partitions. This method is somewhat more cumbersome but avoids having to re-import the table after several years.
- Method 3: Data aging. Analyze the business logic to see if the large dimension table can reduce data scale by aging out and cleaning up invalid dimension data.
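A minimal sketch of Method 2, assuming a Spark SQL environment; the table name, columns, partition column, and bucket count are all illustrative:

```sql
-- Hypothetical large dimension table using yearly partitions (Method 2),
-- so each partition stays bounded as data grows over the years.
CREATE TABLE dim_customer (
  id BIGINT,
  name STRING,
  update_ts BIGINT,
  year STRING
) USING hudi
PARTITIONED BY (year)
TBLPROPERTIES (
  primaryKey = 'id',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '4'
);
```

With this layout, the bucket count only needs to suit a single year's data volume rather than the table's lifetime total.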
Determining the Number of Buckets in a Table
Setting the number of buckets for Hudi tables is crucial for performance and requires special attention.
The following are key points to confirm before creating a table:
- Non-partitioned tables
- Total number of records in a single table = select count(1) from tablename (provided during ingestion).
- Size of a single record = average 1 KB (suggest using select * from tablename limit 100, calculate the size of 100 records, and divide by 100 for the average size).
- Size of single table data (GB) = Total number of records x Size of a single record/1024/1024.
- Number of buckets for a non-partitioned table = MAX(single table data size (GB)/2 GB x 2, rounded up to the nearest even number, 4).
- Partitioned tables
- Total number of records in the partition with the largest data volume in the last month = Consult the product line before ingestion.
- Size of a single record = average 1 KB (suggest using select * from tablename limit 100, calculate the size of 100 records, and divide by 100 for the average size).
- Size of single partition data (GB) = Total number of records in the largest data volume partition in the last month x Size of a single record/1024/1024.
- Number of buckets for a partitioned table = MAX(single partition data size (GB)/2 GB, rounded up to the nearest even number, 1).
- Use the total data size of the table, not the compressed file size.
- It is best to set an even number of buckets, with a minimum of 4 for non-partitioned tables and a minimum of 1 for partitioned tables.
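The sizing rules above can be sketched in Python. The function names and the example record counts are illustrative; the arithmetic follows the formulas as stated (1 KB average record size, 2 GB target per bucket, rounding up to an even number, with the stated minimums):

```python
import math

def buckets_non_partitioned(total_records, avg_record_kb=1.0):
    """Non-partitioned table: MAX(table size (GB)/2 GB x 2,
    rounded up to an even number, 4)."""
    size_gb = total_records * avg_record_kb / 1024 / 1024
    n = math.ceil(size_gb / 2 * 2)
    if n % 2:          # round up to the nearest even number
        n += 1
    return max(n, 4)

def buckets_partitioned(max_partition_records, avg_record_kb=1.0):
    """Partitioned table: MAX(largest partition size (GB)/2 GB,
    rounded up to an even number, 1)."""
    size_gb = max_partition_records * avg_record_kb / 1024 / 1024
    n = math.ceil(size_gb / 2)
    if n % 2:
        n += 1
    return max(n, 1)

# 100 million 1 KB records ~= 95.4 GB -> 96 buckets
print(buckets_non_partitioned(100_000_000))   # 96
# Largest monthly partition of 20 million records ~= 19.1 GB -> 10 buckets
print(buckets_partitioned(20_000_000))        # 10
```

Small tables fall back to the minimums: a 1-GB non-partitioned table still gets 4 buckets.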