Tuning a Bucket Index Table
Common parameters for bucket indexes:
- Spark
hoodie.index.type=BUCKET
hoodie.bucket.index.num.buckets=5
- Flink
index.type=BUCKET
hoodie.bucket.index.num.buckets=5
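As a minimal sketch of how the Spark parameters above might be applied, assuming Spark SQL DDL with hypothetical table, column, primary key, and precombine field names (the bucket count of 5 simply mirrors the example value above):

CREATE TABLE IF NOT EXISTS demo_hudi (
  id STRING,
  amount DOUBLE,
  update_time TIMESTAMP
) USING hudi
TBLPROPERTIES (
  'type' = 'mor',
  'primaryKey' = 'id',
  'preCombineField' = 'update_time',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '5'
);

In Flink SQL, the corresponding options are typically placed in the WITH clause of the table definition.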
Determining Whether to Use Partitioned or Non-Partitioned Tables
There are two types of tables based on usage scenarios: fact tables and dimension tables.
- Fact tables typically contain a large amount of data, mostly newly inserted records, with a small proportion of updates, and most of the updated data falls within a recent time range (year, month, or day). When a downstream system reads the table for ETL calculations, it usually prunes the data by a time range (for example, the last day, month, or year). Such tables can usually be partitioned by the creation time of the data to ensure optimal read/write performance (see the sketch after this list).
- Dimension tables generally have a smaller amount of data, primarily updated data, with fewer new entries. The table data size is relatively stable, and full reads are usually required for join-like ETL calculations. Therefore, using non-partitioned tables usually results in better performance.
- The partition key of a partitioned table must not be updated; otherwise, duplicate data will be created.
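For example, a fact table could be partitioned by its creation date so that time-range queries prune partitions, while a dimension table would simply omit the PARTITIONED BY clause. The sketch below assumes Spark SQL DDL; all table, column, and property values are hypothetical, and the bucket count should be calculated as described in the following sections:

CREATE TABLE IF NOT EXISTS fact_orders (
  order_id STRING,
  order_amount DOUBLE,
  create_time TIMESTAMP,
  update_time TIMESTAMP,
  dt STRING
) USING hudi
PARTITIONED BY (dt)  -- dt is derived from create_time; the partition key is never updated
TBLPROPERTIES (
  'type' = 'mor',
  'primaryKey' = 'order_id',
  'preCombineField' = 'update_time',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '4'
);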
Exceptional scenarios: very large dimension tables and very small fact tables
Special cases such as dimension tables that continuously receive large amounts of new data (table data size over 200 GB or daily growth exceeding 60 MB) or very small fact tables (table data size under 10 GB that will not exceed 10 GB even after 3 to 5 years of growth) need to be handled case by case:
- Dimension tables that continuously receive large amounts of new data
Method 1: Reserve buckets. If a non-partitioned table is used, estimate the data growth over a long period and set a correspondingly larger number of buckets in advance. The disadvantage is that files will keep growing as data accumulates.
Method 2: Use coarse-grained partitions (recommended). If a partitioned table is used, determine the partition granularity based on the projected data growth, for example, yearly partitions. This method is somewhat more cumbersome but avoids having to re-import the table after several years (see the sketch after this list).
Method 3: Age out data. Analyze the business logic to determine whether the large dimension table can be shrunk by aging out and cleaning up invalid dimension data.
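A sketch of Method 2, again assuming Spark SQL DDL with hypothetical table, column, and property values: the large dimension table is partitioned by a coarse yearly column instead of being left non-partitioned, whereas Method 1 would keep the table non-partitioned and reserve a larger bucket count.

CREATE TABLE IF NOT EXISTS dim_customer (
  customer_id STRING,
  customer_name STRING,
  update_time TIMESTAMP,
  create_year STRING
) USING hudi
PARTITIONED BY (create_year)  -- coarse-grained yearly partitions (Method 2)
TBLPROPERTIES (
  'type' = 'mor',
  'primaryKey' = 'customer_id',
  'preCombineField' = 'update_time',
  'hoodie.index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '4'
);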
Determining the Number of Buckets in a Table
Setting the number of buckets for Hudi tables is crucial for performance and requires special attention.
The following are key points to confirm before creating a table:
- Non-partitioned tables
- Total number of records in the table = select count(1) from tablename (provided before ingestion).
- Size of a single record = about 1 KB on average (suggestion: run select * from tablename limit 100, copy the query results into a text file, check the total size of the 100 records, and divide by 100 to get the average size).
- Table data size (GB) = Total number of records x Size of a single record (KB)/1024/1024.
- Number of buckets for a non-partitioned table = MAX(Table data size (GB)/2 GB x 2, rounded up; 4). See the calculation sketch after this list.
- Partitioned tables
- Total number of records in the partition with the largest data volume in the last month = consult the product line before ingestion.
- Size of a single record = about 1 KB on average (suggestion: run select * from tablename limit 100, copy the query results into a text file, check the total size of the 100 records, and divide by 100 to get the average size).
- Partition data size (GB) = Total number of records in that largest partition x Size of a single record (KB)/1024/1024.
- Number of buckets for a partitioned table = MAX(Partition data size (GB)/2 GB, rounded up; 1). See the calculation sketch after this list.
- Use the total data size of the table, not the compressed file size.
- It is best to set an even number of buckets, with a minimum of 4 for non-partitioned tables and a minimum of 1 for partitioned tables.
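As a worked illustration, both estimates can be computed in Spark SQL. The sketch below assumes an average record size of about 1 KB and hypothetical table and partition names; "rounded up" is taken as CEIL, GREATEST enforces the minimums, and the non-partitioned formula is read as rounding up before multiplying by 2 so that the result stays even:

-- Non-partitioned table: MAX(table data size (GB)/2 GB, rounded up, x 2; 4).
-- COUNT(1) records x ~1 KB / 1024 / 1024 approximates the table size in GB.
SELECT GREATEST(CAST(CEIL(COUNT(1) * 1.0 / 1024 / 1024 / 2) AS INT) * 2, 4) AS num_buckets
FROM tablename;

-- Partitioned table: MAX(largest recent partition size (GB)/2 GB, rounded up; 1).
SELECT GREATEST(CAST(CEIL(COUNT(1) * 1.0 / 1024 / 1024 / 2) AS INT), 1) AS num_buckets
FROM tablename
WHERE dt = '<partition-with-largest-volume-in-last-month>';  -- hypothetical partition filter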
Confirming the Table Creation SQL Statement
DataArts supports operating Hudi tables through both Spark JDBC and Spark APIs.
- With Spark JDBC, public resources are used and no separate Spark job needs to be started. However, the resources and configuration parameters used to execute the SQL statements cannot be specified. Therefore, Spark JDBC is recommended for creating tables or querying small amounts of data.
- SQL statements executed through the Spark APIs start separate Spark jobs, which takes extra time. However, you can configure program parameters to specify the resources required by resource-intensive jobs such as batch imports. Running such jobs through the APIs with dedicated resources prevents them from occupying JDBC resources for long periods and blocking other tasks.

If DataArts uses Spark APIs to operate Hudi tables, you must add the --conf spark.support.hudi=true parameter and run jobs through scheduling.
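For instance, a batch import into the hypothetical fact_orders table from the earlier sketch is the kind of statement better suited to the Spark API path; the source table name below is likewise hypothetical, and the job would be submitted with --conf spark.support.hudi=true as noted above:

-- Bulk load submitted as a Spark job (not over Spark JDBC), with resources sized for the import.
INSERT INTO fact_orders
SELECT order_id, order_amount, create_time, update_time, dt
FROM ods_orders_staging;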