
Data Import

This topic provides suggestions for importing data into Doris.

Doris Data Import Suggestions

  • Do not perform update, delete, or truncate operations frequently; leave an interval of several minutes between such operations. A delete operation must specify a partition condition or a primary key column condition (see the delete example after this list).
  • Avoid statements such as INSERT INTO tbl1 VALUES("1"),("a"); for frequent imports of small amounts of data. Instead, use Stream Load, Broker Load, Spark Load, or Flink Connector.
  • When Flink writes data to Doris in real time, set the checkpoint interval based on the data volume of each batch; if each batch is too small, a large number of small files are generated. The recommended interval is 60s (see the Flink SQL sketch after this list).
  • Do not use insert values as the main data write mode. Stream Load, Broker Load, or Spark Load is recommended for batch data import (see the Broker Load sketch after this list).
  • If there are downstream dependencies or queries on data imported with INSERT INTO ... WITH LABEL XXX SELECT, check whether the imported data is visible before using it.

    Run the show load where label='xxx' SQL command to check the status of the INSERT task. The imported data is visible only when the status is VISIBLE (see the example after this list).

  • Stream Load is suitable for importing less than 10 GB of data at a time, and Broker Load is suitable for less than 100 GB. For larger data volumes, use Spark Load.
  • Do not use Doris Routine Load to import data. Instead, use Flink to read Kafka data and then write it to Doris, which limits the amount of data in a single batch and avoids a large number of small files. If Routine Load is already in use, set max_tolerable_backend_down_num to 1 on the FE to improve reliability before you switch to the new import method (see the command after this list).
  • Import data in batches at a low frequency. Keep the average import interval for a single table above 30s; importing 1,000 to 100,000 rows per batch at an interval of about 60s is recommended.
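
The following delete sketch satisfies the condition requirement above. The table example_tbl, partition p202408, and key column user_id are illustrative names:

    -- Restrict the delete to one partition and filter on a key column.
    DELETE FROM example_tbl PARTITION p202408
    WHERE user_id = 10001;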
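
For real-time writes from Kafka through Flink, the following Flink SQL sketch applies the 60s checkpoint recommendation. The topic, addresses, credentials, and columns are placeholders:

    -- Data is flushed to Doris in batches paced by the checkpoint interval.
    SET 'execution.checkpointing.interval' = '60s';

    -- Kafka source (placeholder topic and broker address).
    CREATE TABLE kafka_source (
      id INT,
      name STRING
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'src_topic',
      'properties.bootstrap.servers' = 'kafka_host:9092',
      'properties.group.id' = 'doris_loader',
      'scan.startup.mode' = 'latest-offset',
      'format' = 'json'
    );

    -- Doris sink through the Flink Doris Connector (placeholder FE address and credentials).
    CREATE TABLE doris_sink (
      id INT,
      name STRING
    ) WITH (
      'connector' = 'doris',
      'fenodes' = 'fe_host:8030',
      'table.identifier' = 'example_db.tbl1',
      'username' = 'user',
      'password' = 'passwd',
      'sink.label-prefix' = 'flink_doris'
    );

    INSERT INTO doris_sink SELECT id, name FROM kafka_source;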
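
A minimal Broker Load sketch for batch import is shown below. The label, HDFS path, broker name, and credentials are placeholders:

    LOAD LABEL example_db.label_tbl1_batch1
    (
        DATA INFILE("hdfs://hdfs_host:8020/user/doris/data/file.csv")
        INTO TABLE tbl1
        COLUMNS TERMINATED BY ","
    )
    WITH BROKER "broker_name"
    (
        "username" = "hdfs_user",
        "password" = "hdfs_passwd"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );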
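
The following sketch checks data visibility after a labeled INSERT. The label and table names are illustrative:

    INSERT INTO tbl1 WITH LABEL label_tbl1_20240830
    SELECT * FROM tbl1_staging;

    -- The imported data is queryable only after the status becomes VISIBLE.
    show load where label='label_tbl1_20240830';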
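
If Routine Load is already in use, the FE configuration mentioned above can be set as follows before you switch the import method:

    ADMIN SET FRONTEND CONFIG ("max_tolerable_backend_down_num" = "1");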