Updated on 2024-05-11 GMT+08:00

Data Import

This section describes the specifications for importing Doris data.

Doris Data Import Suggestions

  • Do not frequently perform update, delete, or truncate operations. You are advised to run these operations at intervals of several minutes at most. A delete operation must specify a partition or primary key column condition.
  • Do not import data with statements such as INSERT INTO tbl1 VALUES("1"),("a");. Even when only a small amount of data needs to be written, use Stream Load, Broker Load, Spark Load, or the Flink Connector provided by Doris.
  • When Flink writes data to Doris in real time, set the checkpoint interval with the data volume of each batch in mind. If each batch is too small, a large number of small files will be generated. The recommended interval is 60s.
  • You are advised not to use INSERT VALUES as the main data write mode. Stream Load, Broker Load, or Spark Load is recommended for batch data import.
  • When data is imported in INSERT INTO ... WITH LABEL xxx ... SELECT mode, check whether the imported data is visible before any downstream job or query depends on it.

    Run the SQL command show load where label='xxx'; to check whether the status of the INSERT task is VISIBLE. The imported data is visible only when the status is VISIBLE.

  • Stream Load is suitable for importing data volumes of less than 10 GB, and Broker Load for volumes of less than 100 GB. For larger volumes, use Spark Load.
  • Do not use Doris Routine Load to import data. You are advised to use Flink to consume Kafka data and then write it to Doris. This makes it easier to control the amount of data imported in a single batch and prevents a large number of small files from being generated. If Routine Load is already in use, set max_tolerable_backend_down_num to 1 on the FE before rectification to improve data import reliability.
  • You are advised to import data in batches at a low frequency. The average import interval for a single table must be greater than 30s; 60s is recommended, with 1,000 to 100,000 rows imported per batch.
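The batching guidance above (flush no more often than roughly every 60s, with 1,000 to 100,000 rows per batch) can be sketched as a small buffering layer placed in front of whatever load method you use. The sketch below is illustrative only: the `sink` callback stands in for an actual Stream Load or Flink Connector write and is not a Doris API, and the thresholds simply mirror the recommendations in this section.

```python
import time


class MicroBatcher:
    """Buffers rows and flushes them in batches, so the downstream
    Doris load (Stream Load, Broker Load, ...) receives a few large
    batches instead of many tiny ones."""

    def __init__(self, sink, max_rows=100_000, min_interval_s=60,
                 clock=time.monotonic):
        self.sink = sink                      # callback performing the actual load
        self.max_rows = max_rows              # recommended upper bound per batch
        self.min_interval_s = min_interval_s  # recommended flush interval
        self.clock = clock                    # injectable clock for testing
        self.buffer = []
        self.last_flush = clock()

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows:
            self.flush()  # size-triggered flush
        elif self.clock() - self.last_flush >= self.min_interval_s:
            self.flush()  # time-triggered flush

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()
```

In production, `sink` would wrap the actual write call; the `clock` parameter exists only so the flush logic can be tested without waiting for real time to pass.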