Hudi Table Partition Design Specifications

Rules

Partition keys cannot be updated:

Hudi enforces primary key uniqueness, but for partitioned tables it can usually guarantee uniqueness only within a partition. If a record's partition key value changes, the new version is written to a different partition while the old version stays where it was, so the table ends up with multiple rows that share one primary key. For date-partitioned tables, use the data creation time as the partition field. Never use the data update time as the partition field.

When the Hudi index type is set to a global index, Hudi supports data updates across partitions. However, global indexes generally perform poorly and are not recommended.
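The rule above can be illustrated with a toy model. The sketch below is not Hudi code: the dictionary-based table and the functions `upsert_partition_scoped`, `upsert_global`, and `copies_of` are hypothetical names invented for illustration. It only assumes that a partition-scoped index looks up a key inside the target partition, while a global index looks it up across all partitions.

```python
from typing import Dict, Tuple

# Toy table: (partition, primary_key) -> row. Hypothetical model, not Hudi's storage layout.
Table = Dict[Tuple[str, str], dict]


def upsert_partition_scoped(table: Table, partition: str, key: str, row: dict) -> None:
    """Uniqueness is enforced only inside the target partition."""
    table[(partition, key)] = row


def upsert_global(table: Table, partition: str, key: str, row: dict) -> None:
    """Uniqueness is enforced across partitions, at the cost of scanning every slot."""
    stale = [slot for slot in table if slot[1] == key and slot[0] != partition]
    for slot in stale:
        del table[slot]
    table[(partition, key)] = row


def copies_of(table: Table, key: str) -> int:
    """Count how many rows in the table carry the given primary key."""
    return sum(1 for (_, k) in table if k == key)


# Partitioning by update time: the partition value changes on every update.
scoped: Table = {}
upsert_partition_scoped(scoped, "2024-01-01", "order-1", {"status": "new"})
upsert_partition_scoped(scoped, "2024-01-02", "order-1", {"status": "paid"})
print(copies_of(scoped, "order-1"))  # 2: duplicate primary key

# Partitioning by creation time: the partition value never changes.
stable: Table = {}
upsert_partition_scoped(stable, "2024-01-01", "order-1", {"status": "new"})
upsert_partition_scoped(stable, "2024-01-01", "order-1", {"status": "paid"})
print(copies_of(stable, "order-1"))  # 1
```

In this model `upsert_global` avoids the duplicate, but only by checking every partition on each write, which is the kind of cost that makes global indexes slow on large tables.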

Recommendations

Fact tables should use date partitions, while dimension tables should be non-partitioned or use coarser-grained date partitions.
Whether to partition a table depends on its total data volume, incremental data volume, and access pattern. Fact tables and dimension tables differ as follows:
- Fact tables: large total data volume and large increment; reads are mostly sliced by date and cover a specific time period.
- Dimension tables: relatively small total data volume and small increment; writes are mostly updates; reads cover the whole table or are filtered by the corresponding business ID.
Given these characteristics, daily partitions on a dimension table produce too many files, and because the whole table is read, too many file read tasks. Coarser-grained date partitions, such as yearly partitions, effectively reduce the number of partitions and files. Dimension tables with small increments can also be non-partitioned. If a dimension table's total data volume or increment is large, consider partitioning by a business ID: most data processing logic filters large dimension tables on business conditions to improve performance, so such tables must be optimized for the specific business scenario and cannot be optimized by date partitioning alone.
Fact tables are read in time slices, such as the last year, last month, or last day, so fact tables should prioritize date partitions.
Partition by a date field, and choose the partition granularity based on the data update range. The granularity should be neither too coarse nor too fine.
Partitions can be yearly, monthly, or daily. The goal when choosing a granularity is to reduce the number of file buckets written at the same time, especially when data updates follow a regular pattern over a certain time range. For example, if most updates fall within the last month, partition by month; if most updates fall within the last day, partition by day.
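As a minimal sketch of this guidance, the helper below derives a partition value from a record's creation time at a chosen granularity. The function names, the format strings, and the numeric thresholds in `suggest_granularity` are illustrative assumptions, not Hudi settings. In particular, the fallback to yearly partitions for longer update windows is an assumption that extends the month and day examples above.

```python
from datetime import datetime

# Hypothetical partition value formats for each granularity.
_FORMATS = {"yearly": "%Y", "monthly": "%Y-%m", "daily": "%Y-%m-%d"}


def partition_value(created_at: datetime, granularity: str) -> str:
    """Build the partition value from the creation time, never the update time."""
    if granularity not in _FORMATS:
        raise ValueError(f"unknown granularity: {granularity!r}")
    return created_at.strftime(_FORMATS[granularity])


def suggest_granularity(update_window_days: int) -> str:
    """Map the window holding most updates to a granularity (assumed thresholds)."""
    if update_window_days <= 1:
        return "daily"
    if update_window_days <= 31:
        return "monthly"
    return "yearly"  # assumption: wider windows fall back to yearly partitions


created = datetime(2024, 3, 15, 10, 30)
print(partition_value(created, suggest_granularity(30)))  # 2024-03
print(partition_value(created, suggest_granularity(1)))   # 2024-03-15
```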

With a bucket index, writes are distributed by hashing the primary key, so data is written evenly to each bucket in a partition. Because the data volume of each partition fluctuates, the number of buckets per partition is usually sized for the largest partition. The finer the partition granularity, the more surplus buckets there are. For example:

With daily partitions, an average daily data increment of 3 GB, a maximum daily increment of 8 GB, and a target of 2 GB per bucket, the table is created with bucket count = 8 GB / 2 GB = 4. If data is updated every day and the updates are mostly spread over the last month, writes touch every bucket in that month's partitions, which is 4 x 30 = 120 buckets. With monthly partitions, bucket count = 3 GB x 30 / 2 GB = 45, so the number of buckets written drops to 45. With limited compute resources, the fewer buckets written, the higher the performance.
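The arithmetic in this example can be checked with a short script. This is only a sketch: `buckets_per_partition` is a hypothetical helper, the 2 GB bucket size is the divisor used in the example, and rounding up with `math.ceil` is an assumption for volumes that do not divide evenly.

```python
import math


def buckets_per_partition(partition_gb: float, bucket_gb: float = 2.0) -> int:
    """Number of buckets needed to hold one partition, rounded up."""
    return math.ceil(partition_gb / bucket_gb)


DAYS_PER_MONTH = 30  # the example treats a month as 30 days

# Daily partitions are sized for the largest day (8 GB).
daily_buckets = buckets_per_partition(8)
buckets_touched_daily = daily_buckets * DAYS_PER_MONTH

# Monthly partitions are sized from the average daily increment (3 GB).
buckets_touched_monthly = buckets_per_partition(3 * DAYS_PER_MONTH)

print(daily_buckets)            # 4
print(buckets_touched_daily)    # 120
print(buckets_touched_monthly)  # 45
```

Note that the daily figure is sized from the peak day while the monthly figure uses the average, because daily peaks largely even out over a month.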