
Doris Bucketing Rules

Data is divided into different buckets based on the hash values of bucketing columns.

  • If the table uses the Partition method, the DISTRIBUTED ... statement describes how data is divided within each partition. If the table does not use the Partition method, that statement describes how data of the whole table is divided (see the first example after this list).
  • You can specify multiple columns as bucketing columns. In the AGGREGATE KEY and UNIQUE KEY models, bucketing columns must be Key columns. In the DUPLICATE model, bucketing columns can be Key or Value columns. Bucketing columns may or may not be the same as the partitioning columns.
  • The choice of bucketing columns is a trade-off between query throughput and query concurrency.
    • If you specify multiple bucketing columns, the data is distributed more evenly. However, if a query does not contain equality conditions on all bucketing columns, the system scans every bucket; the added scan parallelism increases the throughput of a single query and decreases its latency, but consumes more resources per query. This method is suitable for high-throughput, low-concurrency query scenarios.
    • If you specify only one or a few bucketing columns, a point query may scan only one bucket. When many point queries run concurrently, they are likely to hit different buckets, so their I/O operations do not interfere with each other (especially when the buckets reside on different disks). This approach is suitable for high-concurrency point query scenarios (see the comparison sketch after this list).
  • Auto Bucket is not recommended. Instead, determine the numbers of partitions and buckets based on the data volume to improve data import and query performance (see the sizing sketch after this list). Auto Bucket tends to create an excessive number of tablets and a large number of small files.
  • Theoretically, there is no upper limit on the number of buckets. However, keep the size of each bucket between 300 MB and 3 GB for optimal performance.
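The following is a minimal sketch of a partitioned Doris table that shows where the DISTRIBUTED statement sits in the DDL; the database, table, and column names are hypothetical. Because the table uses the AGGREGATE KEY model, the bucketing column (user_id) must be a Key column, and the DISTRIBUTED clause divides the data of each partition into 16 buckets.

```sql
CREATE TABLE example_db.user_visits (
    visit_date DATE        NOT NULL,
    user_id    BIGINT      NOT NULL,
    city       VARCHAR(32) NOT NULL,
    visit_cnt  BIGINT SUM  DEFAULT "0"
)
AGGREGATE KEY (visit_date, user_id, city)
PARTITION BY RANGE (visit_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
-- The bucketing column is a Key column, as the AGGREGATE KEY model requires;
-- each partition is hashed into 16 buckets.
DISTRIBUTED BY HASH (user_id) BUCKETS 16
PROPERTIES (
    "replication_num" = "3"
);
```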
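The comparison sketch below contrasts two alternative DISTRIBUTED clauses for a table such as user_visits above. These are clause fragments rather than complete statements, and the column names are hypothetical; which clause is preferable depends on the dominant query pattern.

```sql
-- Single bucketing column: a point query such as
--   SELECT * FROM user_visits WHERE user_id = 1001;
-- touches only one bucket, so many such queries can run concurrently
-- without competing for the same I/O. Suits high-concurrency point queries.
DISTRIBUTED BY HASH (user_id) BUCKETS 16

-- Multiple bucketing columns: data is spread more evenly across buckets,
-- but a query lacking equality conditions on BOTH user_id and city must
-- scan every bucket, trading concurrency for per-query scan parallelism.
-- Suits high-throughput, low-concurrency queries.
DISTRIBUTED BY HASH (user_id, city) BUCKETS 16
```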
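The sizing sketch below fixes the bucket count from an assumed data volume rather than relying on Auto Bucket. The figures are illustrative assumptions: if a monthly partition is expected to hold about 32 GB of data and the target bucket size is about 2 GB, then 32 / 2 = 16 buckets keeps each bucket within the recommended 300 MB to 3 GB range.

```sql
-- Assumption: ~32 GB per monthly partition, ~2 GB per bucket => 16 buckets.
CREATE TABLE example_db.page_events (
    event_date DATE   NOT NULL,
    page_id    BIGINT NOT NULL,
    pv         BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY (event_date, page_id)
PARTITION BY RANGE (event_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01")
)
DISTRIBUTED BY HASH (page_id) BUCKETS 16  -- fixed count instead of BUCKETS AUTO
PROPERTIES (
    "replication_num" = "3"
);
```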