Updated on 2024-10-09 GMT+08:00

Overview of the Doris Data Model

Basic Concepts

In Doris, data is logically described in the form of tables. A table consists of rows and columns. A row is a row of user data, and a column is used to describe different fields in a row of data. Columns are classified into keys and values. From the service perspective, Key and Value can correspond to dimension columns and metric columns, respectively.

Doris data models are classified into the following types:

  • Aggregate
  • Unique
  • Duplicate

AGGREGATE KEY

The statement for creating an Aggregate model table is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`user_id` LARGEINT NOT NULL COMMENT "User ID",

`date` DATE NOT NULL COMMENT "Data import date and time",

`city` VARCHAR(20) COMMENT "City of the user",

`age` SMALLINT COMMENT "User age",

`sex` TINYINT COMMENT "User Gender",

`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT " Last access time of the user",

`cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",

`max_dwell_time` INT MAX DEFAULT "0" COMMENT "Dwell time",

`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwell time"

)

AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)

DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

When data is imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their AggregationType specify. Currently, AggregationType has the following four aggregation modes:

  • SUM: Accumulate the values in multiple rows.
  • REPLACE: The newly imported value replaces the previous value.
  • MAX: Keep the maximum value.
  • MIN: Keep the minimum value.

Columns in a table are classified into Key (dimension column) and Value (metric column) based on whether AggregationType is set. For example, if AggregationType is not set, such as user_id, date, and age, is called Key, and if AggregationType is set, is called Value.

UNIQUE KEY model

  • Read-on-Read Combination

    These tables do not need to be aggregated. You only need to ensure that the primary keys (user_id and username) are unique. In addition, the read-time combination implementation of the Unique model can be replaced by the REPLACE mode of the Aggregate model. The following is an example:

    CREATE TABLE IF NOT EXISTS example_db.example_tbl

    (

    `user_id` LARGEINT NOT NULL COMMENT "User ID",

    `username` VARCHAR(50) NOT NULL COMMENT "Username",

    `city` VARCHAR(20) COMMENT "City of the user",

    `age` SMALLINT COMMENT "User age",

    `sex` TINYINT COMMENT "User Gender",

    `phone` LARGEINT COMMENT "User phone number",

    `address` VARCHAR(500) COMMENT "User address",

    `register_time` DATETIME COMMENT "User registration time"

    )

    UNIQUE KEY(`user_id`, `username`)

    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

    PROPERTIES (

    "replication_allocation" = "tag.location.default: 1"

    );

  • Merge-on-Write

    The Unique model is different from the Aggregate model. The query performance of the Unique model is closer to that of the Duplicate model. Compared with the Aggregate model, the Unique model has great advantages in query performance in scenarios where primary key constraints are required. The Unique model is applicable to aggregation query and query scenarios where a large amount of data needs to be filtered using indexes.

    When creating a table, you can specify the following property to enable the Unique model:

    "enable_unique_key_merge_on_write" = "true"

    For example:

    CREATE TABLE IF NOT EXISTS example_db.example_tbl

    (

    `user_id` LARGEINT,

    `username` VARCHAR(50) NOT NULL,

    `city` VARCHAR(20),

    `age` SMALLINT,

    `sex` TINYINT,

    `phone` LARGEINT,

    `address` VARCHAR(500),

    `register_time` DATETIME

    )

    UNIQUE KEY(`user_id`, `username`)

    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

    PROPERTIES (

    "replication_allocation" = "tag.location.default: 1",

    "enable_unique_key_merge_on_write" = "true"

    );

    If the merge-on-write option is enabled for a unique table, the overwritten and updated data is marked and deleted during data import, and new data is written to a new file. During query, all data marked for deletion is filtered at the file level. The read data is the latest data, eliminating the data aggregation process in read-time combination. In addition, multiple predicates can be pushed down, therefore, the performance can be greatly improved in the aggregation query scenario.

Duplicate model

If data does not have a primary key or aggregation requirement, you can use the Duplicate data model to create a table. Duplicate model data is stored based on the data in the imported file and is not aggregated. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY specified in the table creation statement is only used to specify that the underlying data is sorted by the specified column.

The statement for creating a duplicate model table is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`timestamp` DATETIME NOT NULL COMMENT " Log time",

`type` INT NOT NULL COMMENT " Log type",

`error_code` INT COMMENT "Error code",

`error_msg` VARCHAR(1024) COMMENT "Error details",

`op_id` BIGINT COMMENT "Owner ID",

`op_time` DATETIME COMMENT "Processing time"

)

DUPLICATE KEY(`timestamp`, `type`, `error_code`)

DISTRIBUTED BY HASH(`type`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

Key column

For the Duplicate, Aggregate, and Unique models, the Key column is specified during table creation. The differences are as follows:

  • Duplicate model: The Key column of a table is only a sequence and does not uniquely identify the table.
  • Aggregate and Unique models: For tables of the two aggregation types, the Key column is a real Key column that considers both sorting and unique identifier columns.

Suggestions on Data Model Selection

The data model is determined when the table is created and cannot be modified. Therefore, it is important to select a proper data model.

  • The Aggregate model can greatly reduce the amount of data to be scanned and the calculation workload during aggregation query through pre-aggregation. This model is applicable to report query scenarios with fixed modes. However, this model is not applicable to count(*) query. In addition, because the aggregation function in Value columns is fixed, semantic correctness needs to be considered when aggregation queries using other functions are performed.
  • The Unique model ensures that the primary key is unique in scenarios where a unique primary key constraint is required. However, the query advantages brought by pre-aggregation such as ROLLUP cannot be used.
    • For users who have high performance requirements for aggregate query, merge-on-write is recommended.
    • The Unique model supports only the update of an entire row. If you need to update both the unique primary key constraint and some columns (for example, in the scenario where multiple source tables are imported to one Doris table), you can use the Aggregate model and set the aggregation type of non-primary key columns to REPLACE_IF_NOT_NULL.
  • Duplicate is suitable for ad-hoc query in any dimension. Although the pre-aggregation feature cannot be used, it is not restricted by the aggregation model. The advantages of the column-store model can be brought into full play (only related columns are read, and all key columns do not need to be read).