Doris Data Model

Basic Concepts

In Doris, data is logically described in the form of tables. A table consists of rows and columns. A row refers to a row of data, and columns describe different fields in a row of data. Columns can be classified into key columns and value columns, that is, dimension columns and indicator columns.

Doris data models are classified into the following types:

Aggregate
Unique
Duplicate

For more information about Doris data models, see Doris Data Model. For details about how to create a table after connecting to Doris on the MySQL client, see Getting Started with Doris.

AGGREGATE KEY

The following is an example of the statement for creating an Aggregate model table:

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User ID",
`date` DATE NOT NULL COMMENT "Data import date and time",
`city` VARCHAR(20) COMMENT "City where the user locates",
`age` SMALLINT COMMENT "User age",
`gender` TINYINT COMMENT "User gender",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit date of the user",
`cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum dwell time",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwell time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `gender`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

When data is imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their AggregationType specify. Currently, the following four aggregation functions are supported:
- SUM: Sum up the values in multiple rows.
- REPLACE: Replace the previous value with the newly imported value.
- MAX: Keep the maximum value.
- MIN: Keep the minimum value.
Columns in a table are classified into Key (dimension column) and Value (indicator volume) based on whether they are set with an AggregationType. Keys are those for which AggregationType is not set, such as user_id, date, and age. Values are those for which AggregationType is set.

UNIQUE KEY model

Read-on-Read Combination

These tables do not need to be aggregated. You only need to ensure that the primary keys (user_id and username) are unique. In addition, the read-time combination implementation of the Unique model can be replaced by the REPLACE mode of the Aggregate model. The following is an example:

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User ID",
`username` VARCHAR(50) NOT NULL COMMENT "Username",
`city` VARCHAR(20) COMMENT "City where the user locates",
`age` SMALLINT COMMENT "User age",
`gender` TINYINT COMMENT "User gender",
`phone` LARGEINT COMMENT "User phone number",
`address` VARCHAR(500) COMMENT "User address",
`register_time` DATETIME COMMENT "User registration time"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

Merge-on-Write
The Merge On Write implementation of the Unique model is completely different from that of the Aggregate model. It can deliver better performance (almost like that of the Duplicate model) in aggregation queries with primary key limitations. This implementation is suitable for aggregation queries and those using indexes to filter out large-scale data.

When creating a table, you can specify the following property to enable the Unique model:
```
"enable_unique_key_merge_on_write" = "true"
```
For example:
```
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT,
`username` VARCHAR(50) NOT NULL,
`city` VARCHAR(20),
`age` SMALLINT,
`gender` TINYINT,
`phone` LARGEINT,
`address` VARCHAR(500),
`register_time` DATETIME
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"enable_unique_key_merge_on_write" = "true"
);
```
If the Merge-on-Write option is enabled for a Unique table, the overwritten and updated data is marked and deleted during data import, and new data is written to a new file. During a query, all data marked for deletion will be filtered out at the file level, and only the latest data would be read. This eliminates the data aggregation cost while reading, and supports many types of predicate pushdown now. Performance is improved in many scenarios, especially in aggregation queries.

Duplicate model

If data does not have a primary key or aggregation requirement, you can use the Duplicate data model to create a table. Duplicate model data is stored based on the data in the imported file and is not aggregated. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY specified in the table creation statement is only used to specify that the underlying data is sorted by the specified column.

The statement for creating a duplicate model table is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
timestamp DATETIME NOT NULL COMMENT "Log time",
`type` INT NOT NULL COMMENT "Log type",
`error_code` INT COMMENT "Error code",
`error_msg` VARCHAR(1024) COMMENT "Error message",
`op_id` BIGINT COMMENT "Operator ID",
`op_time` DATETIME COMMENT "Operation time"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

Key column

For the Duplicate, Aggregate, and Unique models, the Key column is specified during table creation. The differences are as follows:

Duplicate model: The Key column of a table is only a sequence and does not uniquely identify the table.
Aggregate and Unique models: For tables of the two aggregation types, the Key column is a real Key column that considers both sorting and unique identifier columns.

Suggestions on Data Model Selection

The data model is determined when the table is created and cannot be modified. Therefore, it is important to select a proper data model.

The Aggregate model can greatly reduce the amount of data to be scanned and the calculation workload during aggregation query through pre-aggregation. This model is applicable to report query scenarios with fixed modes. However, this model is not applicable to count(*) query. Since the aggregation method on the Value column is fixed, semantic correctness should be considered in other types of aggregation queries.
The Unique model ensures that the primary key is unique in scenarios where a unique primary key constraint is required. However, the query advantages brought by pre-aggregation such as ROLLUP cannot be used.
- For users who have high-performance requirements for aggregate query, merge-on-write is recommended.
- The Unique model supports only the update of an entire row. If you need to update both the unique primary key constraint and some columns (for example, in the scenario where multiple source tables are imported to one Doris table), you can use the Aggregate model and set the aggregation type of non-primary key columns to REPLACE_IF_NOT_NULL.
Duplicate is suitable for ad-hoc query in any dimension. Although the pre-aggregation feature cannot be used, it is not restricted by the aggregation model. The advantages of the column-store model can be brought into full play (only related columns are read, and all key columns do not need to be read).

Parent topic: Using Doris

Previous topic: Using Doris

Next topic: Managing Doris User Permissions