Doris Table Creation Rules
This topic describes the rules and suggestions for creating Doris tables.
Table Creation Rules
- [Mandatory] Set partition and bucket policies for tables that have over 500 million data records.
- [Mandatory] Limit bucket columns to one or two. Balance data distribution and query throughput to avoid data skew, which can impact overall balance and efficiency. Optimize query performance by tailoring bucket columns to evenly distributed data that is frequently used in query conditions. You can use the following methods to analyze whether data skew occurs:
SELECT a, b, COUNT(*) FROM tab GROUP BY a,b;
Once the command is executed, verify if the variation in the number of data records among the groups is minimal. If the difference surpasses 2/3 or 1/2, select another bucket field.
- [Mandatory] Avoid dynamic partitions for datasets smaller than 20 million records to prevent excessive bucket creation.
- [Mandatory] Set at least 2 replicas when you create a table. The default replication factor is 3. Do not use a single backup.
- [Mandatory] Do not use auto partition when creating dynamic partitions for Doris tables. Dynamic partitions are recommended.
- [Mandatory] Do not create a table that does not have an aggregate function column as an AGGREGATE table.
- [Mandatory] Ensure each bucket's data size is between 100 MB and 3 GB when defining buckets for table creation. A single partition's maximum data size should not exceed 5000 GB. The total number of partitions in a single table should not exceed 10,000, and the number of tables in the entire cluster should not exceed 5,000.
- For substantial historical data with small, unbalanced sizes or low query likelihood, do as follows:
- Create dedicated historical partitions (yearly or monthly) to store all relevant data.
- Historical partition creation example: FROM ("2000-01-01") TO ("2022-01-01") INTERVAL 1 YEAR.
- [Optional] For data records ranging from 10 million to 200 million, bypass partitioning in favor of direct bucket policy application. Doris defaults to a standard partition in the absence of user-defined partitions.
- [Optional] If bucket fields experience over 30% data skew, switch from hash to random bucketing policies. Create table ... DISTRIBUTED BY RANDOM BUCKETS 10 ...
- [Optional] Prioritize the most frequently queried column as the first field when creating a table to leverage the prefix index's quick query feature. For high-cardinality columns outside the bucket, choose those with extensive query lengths. Note that the prefix index is limited to 36 bits; overly long columns may not utilize this feature.
- [Optional] Utilize inverted indexes or Bloom filters for fuzzy matches or equivalent/in conditions on datasets with hundreds of millions of records. Bitmap indexes are best suited for orthogonal queries with low-cardinality columns.
- [Optional] Use no more than six materialized views in a single table. Do not nest materialized views or use them in ETL tasks such as heavy aggregations and joins during data writing.
- [Optional] Manually create buckets to improve data import and query performance. Do not use the Auto Bucket policy because it causes superfluous tablets and a large number of small files, affecting the overall performance.
- [Recommendation] For 2.1 and later versions, if a large amount of data is frequently imported to the database, MOR is recommended. If the query performance is preferentially considered and a long time for importing data to the database is acceptable, MOW is recommended. MOW is not recommended for 2.0 and earlier versions.
- [Optional] When creating Doris tables, create no more than 1,000 tables in a single database. If there are too many tables, create them in multiple databases.
- [Optional] When creating a table, make sure to have three to five sort keys. Having too many sort keys can impede data writing and importing.
- [Optional] When creating a primary key table, ensure that the primary key column is unique. Do not set all columns as primary key columns. Set a value column for the primary key table. Do not use primary key tables in data deduplication scenarios.
- [Optional] Plan the number of fields to be used when creating a table. You can reserve dozens of integer or character fields. If fields are insufficient, you need to add fields temporarily at a high cost.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot