Updated on 2025-07-25 GMT+08:00

Rules for the Number and Data Volume for Partitions and Buckets

Recommendations on the Number and Data Volume for Partitions and Buckets

  • The total number of tablets in a table equals the number of partitions multiplied by the number of buckets per partition.
  • Leaving capacity expansion aside, the recommended number of tablets in a table is slightly more than the number of disks in the entire cluster.
  • There is no theoretical upper or lower limit on the data volume of a single tablet, but 1 GB to 10 GB is recommended. Tablets that are too small put pressure on data aggregation and metadata management, while tablets that are too large complicate data migration and replica completion, and increase the cost of failed Schema Change or Rollup operations (these operations are performed at the tablet level).
  • If the ideal data volume and the ideal number of tablets cannot both be achieved, prioritize the ideal data volume.
  • Upon table creation, the same number of buckets is specified for each partition. However, when adding partitions dynamically (ADD PARTITION), you can specify a different number of buckets for the new partition. This feature helps you cope with data shrinkage or growth.
  • Once the number of buckets for a partition is specified, it cannot be changed. Therefore, when determining the number of buckets, take future cluster expansion into account. For example, if there are only 3 hosts, each with 1 disk, and the number of buckets is set to 3 or less, then no amount of newly added machines can increase concurrency.
  • Assume that there are 10 BEs and each BE has one disk. Recommended tablet counts for different total table sizes:

    500 MB: 4 to 8 tablets

    5 GB: 8 to 16 tablets

    50 GB: 32 tablets

    500 GB: divide the table into partitions of about 50 GB each, with 16 to 32 tablets per partition

    5 TB: divide the table into partitions of about 50 GB each, with 16 to 32 tablets per partition
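The 500 GB case above can be sketched as hypothetical Doris DDL (the table, column, and partition names are made up for illustration): monthly partitions of roughly 50 GB each, with 32 buckets per partition.

```sql
-- Illustrative sketch only: a ~500 GB table split into monthly
-- partitions of roughly 50 GB each, 32 buckets per partition.
CREATE TABLE example_log (
    log_date DATE NOT NULL,
    user_id  BIGINT,
    message  VARCHAR(1024)
)
DUPLICATE KEY(log_date, user_id)
PARTITION BY RANGE(log_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
DISTRIBUTED BY HASH(user_id) BUCKETS 32
PROPERTIES ("replication_num" = "3");
```

With 2 partitions of 32 buckets each, the table has 2 x 32 = 64 tablets, and each tablet in a full ~50 GB partition holds roughly 50 GB / 32 ≈ 1.6 GB, within the recommended 1 GB to 10 GB range.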

Settings and Usage Scenarios of Random Distribution

  • If the OLAP table has no columns of the REPLACE aggregation type, set the bucketing mode of the table to RANDOM. This avoids severe data skew: when data is loaded into a partition of the table, each batch in a single load task is written to a randomly selected tablet.
  • When the bucketing mode is RANDOM, there are no bucketing columns, so a query cannot be pruned to a few buckets; all buckets in the hit partitions are scanned. This setting is therefore suitable only for aggregate analysis of the table as a whole, not for high-concurrency point queries.
  • If the data distribution of the OLAP table is Random Distribution, you can set load_to_single_tablet to true when importing data. Then, for large imports, each task writes data to only one tablet of the corresponding partition. This improves both the concurrency and throughput of data import, reduces the write amplification caused by loading and subsequent compaction, and thus helps keep the cluster stable.
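A minimal sketch of these settings, assuming a table with no REPLACE columns (all names are illustrative):

```sql
-- Illustrative sketch: random bucketing avoids skew because each
-- load batch is written to a randomly selected tablet.
CREATE TABLE example_metrics (
    ts     DATETIME,
    metric VARCHAR(64),
    value  DOUBLE
)
DUPLICATE KEY(ts, metric)
DISTRIBUTED BY RANDOM BUCKETS 16
PROPERTIES ("replication_num" = "3");
```

For such a table, a load job can then enable single-tablet writes, for example by passing the load_to_single_tablet property (set to true) with Stream Load.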

Composite Partitioning vs Single Partitioning

  • Compound partitioning
    • The first layer of data partitioning is called Partition. You can specify a dimension column as the partitioning column (currently only columns of integer and time types are supported) and specify the value range of each partition.
    • The second layer is called Distribution, which means bucketing. You can perform HASH distribution on data by specifying the number of buckets and one or more dimension columns as the bucketing columns, or perform random distribution on data by setting the mode to Random Distribution.
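The two layers can be seen in a hypothetical DDL sketch (all names are illustrative): PARTITION BY RANGE defines the Partition layer, and DISTRIBUTED BY HASH defines the Distribution layer.

```sql
-- Layer 1 (Partition): RANGE on a time column.
-- Layer 2 (Distribution): HASH bucketing on a dimension column.
CREATE TABLE example_events (
    event_day DATE NOT NULL,
    site_id   INT,
    pv        BIGINT
)
DUPLICATE KEY(event_day, site_id)
PARTITION BY RANGE(event_day) (
    PARTITION p20240101 VALUES LESS THAN ("2024-01-02")
)
DISTRIBUTED BY HASH(site_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

-- A partition added later may specify its own bucket count:
ALTER TABLE example_events
ADD PARTITION p20240102 VALUES LESS THAN ("2024-01-03")
DISTRIBUTED BY HASH(site_id) BUCKETS 32;
```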

Compound partitioning is recommended for the following scenarios:

  • Scenarios with time dimensions or similar dimensions with ordered values, which can be used as partitioning columns. The partitioning granularity can be evaluated based on data import frequency, data volume, etc.
  • Scenarios with a need to delete historical data: if, for example, you only need to keep the data of the last N days, you can use compound partitioning and delete historical partitions. You can also send a DELETE statement within a specified partition to remove historical data.
  • Scenarios with a need to avoid data skew: you can specify the number of buckets individually for each partition. For example, if you partition the data by day and the daily data volume varies greatly, you can customize the number of buckets for each partition. For the bucketing column, it is advised to select a column (or columns) with many distinct values.
  • Single partitioning

    You can also choose single partitioning, in which case the table has no partition layer and the data is only HASH-distributed into buckets.
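A single-partition table, by contrast, omits the PARTITION BY clause entirely; a minimal sketch (names are illustrative):

```sql
-- Single partitioning: the whole table is one partition and the
-- data is only HASH-distributed into buckets.
CREATE TABLE example_dim (
    id   BIGINT,
    name VARCHAR(128)
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "3");
```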