Updated on 2024-05-11 GMT+08:00

Table Creation Rules

This section describes the rules and suggestions for creating a Doris table.

Doris Table Creation Rules

  • When creating a Doris table and specifying bucket buckets, ensure that the data size of each bucket ranges from 100 MB to 3 GB and the maximum number of buckets in a single partition does not exceed 5000.
  • If the number of data records in a table exceeds 500 million, you must set a bucket policy.
  • Do not set too many bucket columns in a table. Generally, you only need to set one or two columns. In addition, you need to ensure even data distribution and balanced query throughput.
    • Data is evenly distributed to prevent data skew in some buckets from affecting data balancing and query efficiency.
    • The query throughput uses the bucket tailoring optimization of query SQL statements to avoid full bucket scanning and improve query performance.
    • Bucket column selection: Columns with even data and commonly used as query conditions are preferentially used as bucket columns.

      You can use the following methods to analyze whether data skew occurs:

      SELECT a, b, COUNT(*) FROM tab GROUP BY a,b;

      After the command is executed, check whether the difference between the number of data records in each group is small. If the difference exceeds 2/3 or 1/2, select another bucket field.

  • Do not use dynamic partitions for less than 20 million data records. Dynamic partitioning automatically creates partitions, but users cannot pay attention to small tables. As a result, a large number of unused partitions are created to distinguish buckets.
  • When creating a table, ensure that there are three to five sort keys. If there are too many sort keys, data writing will be slow and data import performance will be affected.
  • If Auto Bucket is not used, buckets need to be divided based on the existing data volume to improve the import and query performance. Auto Bucket causes a large number of tablets. As a result, a large number of small files exist.
  • The number of copies for creating a table must be at least 2. The default value is 3. Do not use a single backup.

Doris Table Creation Suggestions

  • The number of materialized views in a single table cannot exceed six. It is not recommended that materialized views be nested. You are not advised to use materialized views to perform ETL tasks such as heavy aggregation and join calculation during data writing.
  • If there is a large amount of historical partition data but the historical data is small, the data is unbalanced, or the data query probability is low, you can create historical partitions (such as yearly and monthly partitions) and store all historical data in the corresponding partitions.

    The method of creating a history partition is FROM ("2000-01-01") TO ("2022-01-01") INTERVAL 1 YEAR.

  • If the data volume is less than 10 million to 200 million, you do not need to set partitions (the Doris has a default partition). Instead, you can directly use the bucket policy.
  • If more than 30% data skew occurs in the bucket field, do not use the hash bucketing policy. Instead, use the random bucketing policy. The related commands are as follows:

    Create table ... DISTRIBUTED BY RANDOM BUCKETS 10 ...

  • During table creation, the first field must be the column that is most frequently queried. By default, the prefix index quick query capability is provided. The column that is most frequently queried and has a high cardinality is selected as the prefix index, by default, the first 36 bytes of a row are used as the prefix index of the row. (A column of the varchar type can match only 20 bytes, and the prefix index will be truncated if less than 36 bytes are matched.)
  • For more than 100 million data records, if fuzzy match or equivalent/in conditions are used, you can use inverted indexes (supported since Doris 2.x) or Bloomfilter. For orthogonal queries with low cardinality columns, bitmap indexes are recommended. (The cardinality of bitmap indexes ranges from 10000 to 100000.)
  • When creating a table, you need to plan the number of fields to be used in the future. You can reserve dozens of fields of the integer or character type. If fields are insufficient in the future, you need to add fields temporarily at a high cost.