
Doris Bucketing Rules

Data is divided into different buckets based on the hash values of bucketing columns.

  • If the table uses the Partition method, the DISTRIBUTED ... statement describes how data is divided within each partition. If the table does not use the Partition method, that statement describes how data of the whole table is divided (see the first example after this list).
  • You can specify multiple columns as bucketing columns. In the AGGREGATE KEY and UNIQUE KEY models, bucketing columns must be Key columns. In the DUPLICATE model, bucketing columns can be Key or Value columns. Bucketing columns may or may not be the same as the partitioning columns.
  • The choice of bucketing columns is a trade-off between query throughput and query concurrency.
    • If you specify multiple bucketing columns, the data is distributed more evenly. However, if a query does not contain equality conditions on all bucketing columns, the system scans every bucket; the added scan parallelism increases the throughput of a single query and decreases its latency, but consumes more resources per query. This method is suitable for high-throughput, low-concurrency query scenarios.
    • If you specify only one or a few bucketing columns, a point query may scan only one bucket. When many point queries run concurrently, they are likely to hit different buckets, so their I/O operations do not interfere with each other (especially when the buckets reside on different disks). This approach is suitable for high-concurrency point query scenarios (see the comparison sketch after this list).
  • Auto Bucket is not recommended. Instead, determine the numbers of partitions and buckets based on the data volume to improve data import and query performance (see the sizing sketch after this list). Auto Bucket tends to create an excessive number of tablets and a large number of small files.
  • Theoretically, there is no upper limit on the number of buckets. However, keep the size of each bucket between 300 MB and 3 GB for optimal performance.
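The following is a minimal sketch of a partitioned Doris table that shows where the DISTRIBUTED statement sits in the DDL; the database, table, and column names are hypothetical. Because the table uses the AGGREGATE KEY model, the bucketing column (user_id) must be a Key column, and the DISTRIBUTED clause divides the data of each partition into 16 buckets.

```sql
CREATE TABLE example_db.user_visits (
    visit_date DATE        NOT NULL,
    user_id    BIGINT      NOT NULL,
    city       VARCHAR(32) NOT NULL,
    visit_cnt  BIGINT SUM  DEFAULT "0"
)
AGGREGATE KEY (visit_date, user_id, city)
PARTITION BY RANGE (visit_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
-- The bucketing column is a Key column, as the AGGREGATE KEY model requires;
-- each partition is hashed into 16 buckets.
DISTRIBUTED BY HASH (user_id) BUCKETS 16
PROPERTIES (
    "replication_num" = "3"
);
```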
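The comparison sketch below contrasts two alternative DISTRIBUTED clauses for a table such as user_visits above. These are clause fragments rather than complete statements, and the column names are hypothetical; which clause is preferable depends on the dominant query pattern.

```sql
-- Single bucketing column: a point query such as
--   SELECT * FROM user_visits WHERE user_id = 1001;
-- touches only one bucket, so many such queries can run concurrently
-- without competing for the same I/O. Suits high-concurrency point queries.
DISTRIBUTED BY HASH (user_id) BUCKETS 16

-- Multiple bucketing columns: data is spread more evenly across buckets,
-- but a query lacking equality conditions on BOTH user_id and city must
-- scan every bucket, trading concurrency for per-query scan parallelism.
-- Suits high-throughput, low-concurrency queries.
DISTRIBUTED BY HASH (user_id, city) BUCKETS 16
```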
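The sizing sketch below fixes the bucket count from an assumed data volume rather than relying on Auto Bucket. The figures are illustrative assumptions: if a monthly partition is expected to hold about 32 GB of data and the target bucket size is about 2 GB, then 32 / 2 = 16 buckets keeps each bucket within the recommended 300 MB to 3 GB range.

```sql
-- Assumption: ~32 GB per monthly partition, ~2 GB per bucket => 16 buckets.
CREATE TABLE example_db.page_events (
    event_date DATE   NOT NULL,
    page_id    BIGINT NOT NULL,
    pv         BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY (event_date, page_id)
PARTITION BY RANGE (event_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01")
)
DISTRIBUTED BY HASH (page_id) BUCKETS 16  -- fixed count instead of BUCKETS AUTO
PROPERTIES (
    "replication_num" = "3"
);
```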