Updated on 2024-08-30 GMT+08:00

Hudi Table Partition Design Specifications

Rules

The partition key cannot be updated:

Hudi enforces primary key uniqueness. In a partitioned table, however, the primary key is unique only within a partition. If the partition key value of a record changes, the updated record is written to a different partition, and multiple rows with the same primary key may exist. When partitioning data by date, use the data creation time as the partition field. Do not use the data update time.

When the Hudi index type is set to a global index, Hudi supports cross-partition data updates. However, global indexes perform poorly and are not recommended.
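The effect of a changing partition key can be illustrated with a toy model. The following is a minimal sketch, not Hudi code: it assumes a non-global index that enforces uniqueness per (partition, primary key) pair, and all names (`upsert`, `rows_with_key`, the record fields) are hypothetical.

```python
# Toy model of a partition-scoped (non-global) index.
# Uniqueness is enforced per (partition, primary key), not per primary key alone.
# All names are hypothetical and do not correspond to Hudi APIs.

def upsert(table, record, partition_field):
    """Insert or overwrite the record within its own partition only."""
    partition = record[partition_field]
    table[(partition, record["id"])] = record

def rows_with_key(table, key):
    """Return every row, across all partitions, that has the given primary key."""
    return [row for (_, k), row in table.items() if k == key]

created = {"id": 1, "create_date": "2024-08-01", "update_date": "2024-08-01", "status": "new"}
updated = {"id": 1, "create_date": "2024-08-01", "update_date": "2024-08-15", "status": "paid"}

# Partitioned by update time: the update lands in a different partition,
# so the old row is never overwritten.
by_update = {}
upsert(by_update, created, "update_date")
upsert(by_update, updated, "update_date")
print(len(rows_with_key(by_update, 1)))  # 2

# Partitioned by creation time: the partition value is stable,
# so the update overwrites the existing row.
by_create = {}
upsert(by_create, created, "create_date")
upsert(by_create, updated, "create_date")
print(len(rows_with_key(by_create, 1)))  # 1
```

Because the creation time never changes, every update to a record resolves to the same partition, which is why it is the safer partition field.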

Suggestions

  • Use date partitioning for fact tables. Use no partitioning or coarse-grained date partitioning for dimension tables.

    Whether to partition a table depends on its total data volume, its incremental data volume, and how it is used. Fact tables and dimension tables typically have the following characteristics:

    • Fact table: The total data volume is large and the increment is large. Data is read by date, usually for a specific period.
    • Dimension table: The total data volume is small and the increment is small. Most writes are updates to existing data. Data is read as a full table or filtered by service ID.

    Given these characteristics, partitioning a dimension table by day produces too many files, and because the full table is read, it also produces a large number of file reading tasks. A coarse-grained date partition, such as a year partition, effectively reduces the number of partitions and files. For dimension tables with small increments, you can also use non-partitioned tables. If a dimension table has a large total data volume or a large incremental data volume, you can partition it by service ID. Most data processing logic filters large dimension tables by service conditions to improve processing performance, so such tables must be optimized based on the service scenario and cannot be optimized through date partitioning alone.

    Fact tables are read by time range. The number of files read for the last year, month, or day is relatively stable and controllable. Therefore, date partitioning is preferred for fact tables.

  • Partition by a date field, and choose the partition granularity based on the data update scope. The granularity should be neither too coarse nor too fine.

    The partitioning granularity can be year, month, or day. The goal is to reduce the number of file buckets that are written to concurrently, especially when a large volume of data is updated and the updates fall within a regular time range. For example, if most updates affect data from the last month, partition by month. If most updates affect data from the last day, partition by day.

    When the bucket index is used, data is written to the buckets in each partition by hashing the primary key. The data volume varies between partitions, so the number of buckets per partition is calculated from the maximum data volume of a partition. As a result, the finer the partition granularity, the more redundant buckets there are. For example:

    Suppose day-level partitions are used, the average daily data volume is 3 GB, and the maximum daily data volume is 8 GB. At 2 GB of data per bucket, the table is created with 8 GB/2 GB = 4 buckets per partition. If daily updates make up a large share of the data and are mainly scattered across the last month, data is written to the buckets of the whole month, that is, 4 x 30 = 120 buckets. If monthly partitions are used instead, the number of buckets per partition is 3 GB x 30/2 GB = 45, so the number of buckets written to drops to 45. With limited computing resources, the fewer buckets are written to, the higher the performance.
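The arithmetic in the example above can be reproduced with a short script. This is a minimal sketch under stated assumptions: the 2 GB per-bucket size is taken from the example, a month is treated as 30 days, rounding up with `math.ceil` is an assumption for volumes that do not divide evenly, and the function name `buckets_per_partition` is hypothetical.

```python
import math

# Per-bucket data size used in the example above (assumed target, in GB).
TARGET_BUCKET_SIZE_GB = 2
DAYS_PER_MONTH = 30

def buckets_per_partition(partition_gb, bucket_gb=TARGET_BUCKET_SIZE_GB):
    """Buckets needed to hold one partition, rounded up (rounding is an assumption)."""
    return math.ceil(partition_gb / bucket_gb)

# Day-level partitions: size each partition for the peak daily volume (8 GB).
daily_buckets = buckets_per_partition(8)
# Updates are scattered across the last month, so every daily partition is touched.
buckets_written_daily = daily_buckets * DAYS_PER_MONTH

# Month-level partitions: one partition holds 30 days at the 3 GB daily average.
buckets_written_monthly = buckets_per_partition(3 * DAYS_PER_MONTH)

print(daily_buckets)            # 4
print(buckets_written_daily)    # 120
print(buckets_written_monthly)  # 45
```

The day-level layout is sized for the peak day in every partition, while the month-level layout is sized once for the whole month, which is where the difference between 120 and 45 buckets comes from.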