Doris Data Model

Basic Concepts

In Doris, data is logically described in the form of tables. A table consists of rows and columns. A row is a line of user data, and a column describes different fields in a row of data. Columns are classified into two categories: Key and Value. From a business perspective, Key and Value correspond to dimension columns and indicator columns, respectively.

Data models in Doris fall into three types:

Aggregate
Unique
Duplicate

Aggregate Model

The following examples show you what aggregation model is and how to use it correctly:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`user_id` LARGEINT NOT NULL COMMENT "User ID",

`date` DATE NOT NULL COMMENT "Data import date and time",

`city` VARCHAR(20) COMMENT "City",

`age` SMALLINT COMMENT "Age",

`sex` TINYINT COMMENT "Gender",

`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last access time",

`cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",

`max_dwell_time` INT MAX DEFAULT "0" COMMENT "Dwell time",

`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwell time"

)

AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)

DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

When data is imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their AggregationType specify. Currently, there are several aggregate methods:

SUM: Accumulate the values in multiple rows.
REPLACE: Replace the previous value with the newly imported value.
MAX: Keep the maximum value.
MIN: Keep the minimum value.

The columns in the table are divided into Key (dimension) columns and Value (indicator columns) based on whether they are set with an AggregationType. Key columns are not set with an AggregationType, such as user_id, date, and age, while Value columns are.

Unique Model

Merge on Read
If your data does not need to be aggregated, you use this implementation. You only need to ensure the uniqueness of the primary keys (user_id and username). The Merge On Read implementation of the unique model is equivalent to the REPLACE aggregation type in the aggregate model. The following is an example:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`user_id` LARGEINT NOT NULL COMMENT "User ID",

`username` VARCHAR(50) NOT NULL COMMENT "Username",

`city` VARCHAR(20) COMMENT "City",

`age` SMALLINT COMMENT "Age",

`sex` TINYINT COMMENT "Gender",

`phone` LARGEINT COMMENT "User phone number",

`address` VARCHAR(500) COMMENT "User address",

`register_time` DATETIME COMMENT "User registration time"

)

UNIQUE KEY(`user_id`, `username`)

DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

Merge on Write
The Merge On Write implementation of the unique model is completely different from that of the aggregate model. It can deliver better performance (almost like that of the duplicate model) in aggregation queries with primary key limitations. This implementation is suitable for aggregation queries and those using indexes to filter out large-scale data.

When creating a table, you can enable the unique model by adding the following property:

"enable_unique_key_merge_on_write" = "true"

For example:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`user_id` LARGEINT,

`username` VARCHAR(50) NOT NULL,

`city` VARCHAR(20),

`age` SMALLINT,

`sex` TINYINT,

`phone` LARGEINT,

`address` VARCHAR(500),

`register_time` DATETIME

)

UNIQUE KEY(`user_id`, `username`)

DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1",

"enable_unique_key_merge_on_write" = "true"

);

On a unique table with the Merge on Write option enabled, during the import stage, the data that is to be overwritten and updated will be marked for deletion, and new data will be written in. During a query, all data marked for deletion will be filtered out at the file level, and only the latest data would be read. This eliminates the data aggregation cost while reading, and supports many types of predicate pushdown now. Performance is improved in many scenarios, especially in aggregation queries.

Duplicate Model

In some multidimensional analysis scenarios, there is no need for primary keys or data aggregation. This is when we use the Duplicate model. The Duplicate Model stores the data as they are and executes no aggregation. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY specified in the table creation statement is only used to specify based on which columns the data are sorted.

The statement for creating a table in duplicate model is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`timestamp` DATETIME NOT NULL COMMENT " Log time",

`type` INT NOT NULL COMMENT " Log type",

`error_code` INT COMMENT "Error code",

`error_msg` VARCHAR(1024) COMMENT "Error details",

`op_id` BIGINT COMMENT "Owner ID",

`op_time` DATETIME COMMENT "Processing time"

)

DUPLICATE KEY(`timestamp`, `type`, `error_code`)

DISTRIBUTED BY HASH(`type`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

Key Columns

For the duplicate, aggregate, and unique models, the Key column is specified during table creation. The differences are as follows:

Duplicate model: The Key columns can be regarded as just "sorting columns", but not unique identifiers.
Aggregate and unique models: For tables of the two aggregation types, the Key columns are both "sorting columns" and "unique identifier columns".

Suggestions on Data Model Selection

The data model is established when the table is created and cannot be modified. Therefore, it is important to select a proper data model.

The Aggregate Model can greatly reduce the amount of data scanned and query computation by pre-aggregation. Thus, it is very suitable for report query scenarios with fixed patterns, but is unfriendly to count (*) queries. Since the aggregation method on the Value column is fixed, semantic correctness should be considered in other types of aggregation queries.
The unique model guarantees the uniqueness of primary key for scenarios requiring a unique primary key. The downside is that it cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries.
- Users who have high-performance requirements for aggregate queries are recommended to use the Merge on Write implementation.
- The unique model only supports entire-row updates. If you require primary key uniqueness as well as partial updates of certain columns (such as loading multiple source tables into one Doris table), you can consider using the aggregate model, while setting the aggregate type of the non-primary key columns to REPLACE_IF_NOT_NULL.
Duplicate is suitable for ad-hoc queries in any dimensions. Although it may not be able to take advantage of the pre-aggregation feature, it is not limited by what constraints the aggregate model and can give full play to the advantage of columnar storage (reading only the relevant columns, but not all Key columns).