Doris Data Model
In Doris, data is logically described in the form of tables. A table is a collection of homogeneous data with the same schema and consists of rows and columns. A row is one record of user data, and a column describes one field of that record. Columns can use different data types (such as integers, strings, and Boolean values) as required.
In OLAP scenarios, columns can be divided into two categories, Key and Value, which correspond to dimension columns and indicator (metric) columns, respectively.
Doris data models are classified into the following types:
Aggregate Model
This section illustrates what an Aggregate model is and how to use it correctly with practical examples.
- Example 1: Aggregating data during import
Assume that the business has the following data table schema:
Table 1 Data description

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| city | VARCHAR(20) | - | City where the user is located |
| age | SMALLINT | - | User age |
| sex | TINYINT | - | User gender |
| last_visit_date | DATETIME | REPLACE | Last visit date |
| cost | BIGINT | SUM | Total consumption |
| max_dwell_time | INT | MAX | Maximum residence time |
| min_dwell_time | INT | MIN | Minimum residence time |
The corresponding CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS demo.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date and time",
    `city` VARCHAR(20) COMMENT "City where the user is located",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit date of the user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum residence time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum residence time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
This is a typical fact table of user information and visit behaviors. In a star schema, user information and visit behaviors are usually stored in dimension tables and fact tables, respectively. Here, for easier explanation of the Doris data models, the two types of information are stored in one single table.
The columns in the table are divided into Key (dimension) columns and Value (indicator) columns based on whether an AggregationType is set. Columns without an AggregationType, such as user_id, date, age, and sex, are Key columns; columns with one are Value columns.
When data is imported, rows with the same contents in the Key columns are aggregated into one row, and their values in the Value columns are aggregated as their AggregationType specifies. AggregationType has the following modes:
- SUM: Sum up the values in multiple rows.
- REPLACE: Replace the previous value with the newly imported value.
- MAX: Keep the maximum value.
- MIN: Keep the minimum value.
Table 2 User information table

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Assume that this is a table recording user behaviors when they are visiting a certain commodity page. The first row of data, for example, is explained as follows:
Table 3 Parameters

| Value | Description |
|---|---|
| 10000 | User ID, which uniquely identifies a user. |
| 2017-10-01 | Time when the data is imported to the database, accurate to date. |
| A | City where the user is located. |
| 20 | User age. |
| 0 | User gender (0 indicates male; 1 indicates female). |
| 2017-10-01 06:00:00 | Time when the user visits the page, accurate to second. |
| 20 | Consumption generated by the current visit. |
| 10 | Time spent on the page during the current visit. |
| 10 | Time spent on the page during the current visit (redundancy). |
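For illustration, this batch could be loaded with a plain INSERT statement such as the sketch below (in practice, bulk imports typically use mechanisms such as Stream Load; the values simply mirror Table 2):

-- Load the seven raw rows of Table 2 as one import batch.
INSERT INTO demo.example_tbl VALUES
(10000, '2017-10-01', 'A', 20, 0, '2017-10-01 06:00:00', 20, 10, 10),
(10000, '2017-10-01', 'A', 20, 0, '2017-10-01 07:00:00', 15, 2, 2),
(10001, '2017-10-01', 'A', 30, 1, '2017-10-01 17:05:45', 2, 22, 22),
(10002, '2017-10-02', 'B', 20, 1, '2017-10-02 12:59:12', 200, 5, 5),
(10003, '2017-10-02', 'C', 32, 0, '2017-10-02 11:20:00', 30, 11, 11),
(10004, '2017-10-01', 'D', 35, 0, '2017-10-01 10:00:15', 100, 3, 3),
(10004, '2017-10-03', 'D', 35, 0, '2017-10-03 10:20:22', 11, 6, 6);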
After this batch of data is imported into Doris correctly, it will be stored in Doris as follows:
Table 4 Stored data

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As shown in the table, the data of user 10000 has been aggregated into one row, while the data of other users remains unchanged. The aggregated data of user 10000 is explained as follows.

The first five columns remain unchanged, so the explanation starts with the sixth column, last_visit_date:
- 2017-10-01 07:00:00: The last_visit_date column is aggregated by REPLACE, so 2017-10-01 07:00:00 has replaced 2017-10-01 06:00:00.
When REPLACE aggregates data from the same import batch, the order of replacement is not guaranteed. In this case, the value eventually saved in Doris could therefore be 2017-10-01 06:00:00. For data from different import batches, however, data from the new batch always replaces data from the old batch.
- 35: The cost column is aggregated by SUM, so the updated value 35 is the result of 20 plus 15.
- 10: The max_dwell_time column is aggregated by MAX, so 10 is saved as it is the maximum between 10 and 2.
- 2: The min_dwell_time column is aggregated by MIN, so 2 is saved as it is the minimum between 10 and 2.
After aggregation, Doris only stores the aggregated data. The detailed raw data is not retained.
- Example 2: Keeping detailed data
Here is a modified version of the table schema in Example 1:
Table 5 Data description

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| timestamp | DATETIME | - | Date and time when the data is imported, accurate to seconds |
| city | VARCHAR(20) | - | City where the user is located |
| age | SMALLINT | - | User age |
| sex | TINYINT | - | User gender |
| last_visit_date | DATETIME | REPLACE | Last visit date |
| cost | BIGINT | SUM | Total consumption |
| max_dwell_time | INT | MAX | Maximum residence time |
| min_dwell_time | INT | MIN | Minimum residence time |
A new column timestamp (accurate to seconds) has been added to record the date and time when the data is imported.
In addition, AGGREGATE KEY is set to AGGREGATE KEY(user_id, date, timestamp, city, age, sex).
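The section does not show the corresponding DDL; a sketch derived from the Example 1 statement (the table name demo.example_tbl2 is an assumption) could look as follows:

CREATE TABLE IF NOT EXISTS demo.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date",
    `timestamp` DATETIME NOT NULL COMMENT "Data import date and time, accurate to seconds",
    `city` VARCHAR(20) COMMENT "City where the user is located",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit date of the user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum residence time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum residence time"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);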
Suppose that the imported data is as follows:
Table 6 User information table

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | A | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | A | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After this batch of data is imported into Doris correctly, it will be stored in Doris as follows:
Table 7 Stored data

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | A | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | A | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

The stored data is exactly the same as the imported data; no aggregation has occurred, because the added timestamp column makes the Key columns of every row unique.
- Example 3: Aggregating imported data with existing data
Based on the table in Example 1, suppose that you have the following data stored in Doris:
Table 8 User information table

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Now you need to import a new batch of data:
Table 9 New data

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | E | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
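As a sketch, the new batch could again be imported with a standalone INSERT (each statement forms its own import batch here):

-- New import batch: the first row shares its Key columns with an existing row,
-- the second row introduces a new user (10005).
INSERT INTO demo.example_tbl VALUES
(10004, '2017-10-03', 'D', 35, 0, '2017-10-03 11:22:00', 44, 19, 19),
(10005, '2017-10-03', 'E', 29, 1, '2017-10-03 18:11:02', 3, 1, 1);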
After this batch of data is imported into Doris correctly, the data stored in Doris will be updated as follows:
Table 10 Updated data

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | E | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
In this table, the existing data and the newly imported data of User 10004 have been aggregated. Meanwhile, the new data of User 10005 has been added.
In Doris, data aggregation happens in the following three stages:
- The ETL stage of each import batch. At this stage, the imported batch is aggregated internally.
- The data compaction stage of the underlying BE nodes. At this stage, BE nodes will aggregate data from different imported batches.
- The data query stage. The data involved in the query will be aggregated accordingly.
At different stages, data is aggregated to varying degrees. For example, when a batch of data has just been imported, it may not yet be aggregated with the existing data. But users can only query aggregated results: the degree of internal aggregation is transparent to users, and they should never assume that what they see is unaggregated or partially aggregated data.
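A small sketch makes this behavior visible (the user 10006 rows are hypothetical and reuse the Example 1 table): the two INSERT statements below form two separate import batches, yet a query returns one fully aggregated row even before BE compaction has merged them.

-- Two import batches with identical Key column values.
INSERT INTO demo.example_tbl VALUES
(10006, '2017-10-04', 'F', 40, 1, '2017-10-04 08:00:00', 10, 5, 5);
INSERT INTO demo.example_tbl VALUES
(10006, '2017-10-04', 'F', 40, 1, '2017-10-04 09:00:00', 7, 9, 9);

-- The query stage aggregates across batches; expected single row:
-- last_visit_date = 2017-10-04 09:00:00, cost = 17, max_dwell_time = 9, min_dwell_time = 5.
SELECT * FROM demo.example_tbl WHERE user_id = 10006;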
Unique Model
In some multi-dimensional analysis scenarios, users require a uniqueness constraint on the primary key. The Unique model is introduced to meet this requirement. It is a special case of the Aggregate model and a simplified way of expressing a table schema. Assume that the business has the following data table schema:

For Unique model tables, you are not advised to enable the merge-on-write attribute; the merge-on-read attribute is used by default.

| ColumnName | Type | IsKey | Comment |
|---|---|---|---|
| user_id | LARGEINT | Yes | User ID |
| username | VARCHAR(50) | Yes | Username |
| city | VARCHAR(20) | No | City where the user is located |
| age | SMALLINT | No | User age |
| sex | TINYINT | No | User gender |
| phone | LARGEINT | No | Phone number of the user |
| address | VARCHAR(500) | No | User address |
| register_time | DATETIME | No | User registration time |
This is a table that contains the basic information of a user. There is no aggregation requirement for such data. The only concern is to ensure the uniqueness of the primary key (user_id+username). The CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `username` VARCHAR(50) NOT NULL COMMENT "Username",
    `city` VARCHAR(20) COMMENT "City where the user is located",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `phone` LARGEINT COMMENT "User phone number",
    `address` VARCHAR(500) COMMENT "User address",
    `register_time` DATETIME COMMENT "User registration time"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
This table schema is the same as the following table schema using the Aggregate model:
| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| username | VARCHAR(50) | - | Username |
| city | VARCHAR(20) | REPLACE | City where the user is located |
| age | SMALLINT | REPLACE | User age |
| sex | TINYINT | REPLACE | User gender |
| phone | LARGEINT | REPLACE | Phone number of the user |
| address | VARCHAR(500) | REPLACE | User address |
| register_time | DATETIME | REPLACE | User registration time |
The CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `username` VARCHAR(50) NOT NULL COMMENT "Username",
    `city` VARCHAR(20) REPLACE COMMENT "City where the user is located",
    `age` SMALLINT REPLACE COMMENT "User age",
    `sex` TINYINT REPLACE COMMENT "User gender",
    `phone` LARGEINT REPLACE COMMENT "User phone number",
    `address` VARCHAR(500) REPLACE COMMENT "User address",
    `register_time` DATETIME REPLACE COMMENT "User registration time"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
The Unique model is equivalent to the REPLACE aggregation function of the Aggregate model; the internal implementation and data storage are exactly the same.
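A minimal sketch of the replace semantics, using the Unique model table defined above (the values are hypothetical):

INSERT INTO example_db.example_tbl (user_id, username, city, age)
VALUES (1, 'tom', 'Beijing', 20);

-- A later import batch with the same key (user_id, username) overwrites the old record.
INSERT INTO example_db.example_tbl (user_id, username, city, age)
VALUES (1, 'tom', 'Shanghai', 21);

-- Expected: exactly one row, with city = 'Shanghai' and age = 21.
SELECT user_id, username, city, age FROM example_db.example_tbl WHERE user_id = 1;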
Duplicate Model
In some scenarios, the data has neither a primary key nor aggregation requirements. The Duplicate model can be used in such cases. Assume that the business has the following log table schema:

| ColumnName | Type | SortKey | Comment |
|---|---|---|---|
| timestamp | DATETIME | Yes | Log time |
| type | INT | Yes | Log type |
| error_code | INT | Yes | Error code |
| error_msg | VARCHAR(1024) | No | Error details |
| op_id | BIGINT | No | Operator ID |
| op_time | DATETIME | No | Operation time |
The CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `timestamp` DATETIME NOT NULL COMMENT "Log time",
    `type` INT NOT NULL COMMENT "Log type",
    `error_code` INT COMMENT "Error code",
    `error_msg` VARCHAR(1024) COMMENT "Error details",
    `op_id` BIGINT COMMENT "Operator ID",
    `op_time` DATETIME COMMENT "Operation time"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
Different from the Aggregate and Unique models, the Duplicate model stores data as it is and performs no aggregation. Even two identical rows are both retained. The DUPLICATE KEY in the CREATE TABLE statement only specifies which columns the data is sorted by; a more accurate name would be "Sorted Columns", but DUPLICATE KEY was chosen to indicate the data model used. It is recommended to choose the first two to four columns as the DUPLICATE KEY.
The Duplicate model is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints. For more usage scenarios, see the part "Limitations of the Aggregate Model".
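A short sketch of this behavior, using the log table defined above (the values are hypothetical): importing the same row twice keeps both copies.

-- Insert an identical log row twice.
INSERT INTO example_db.example_tbl VALUES
('2024-01-01 10:00:00', 1, 404, 'page not found', 100, '2024-01-01 10:00:01'),
('2024-01-01 10:00:00', 1, 404, 'page not found', 100, '2024-01-01 10:00:01');

-- Expected: 2, because the Duplicate model never merges identical rows.
SELECT COUNT(*) FROM example_db.example_tbl WHERE error_code = 404;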
Limitations of the Aggregate Model
- Aggregate and Unique models

The following describes the limitations of the Aggregate model (they also apply to the Unique model).

The Aggregate model only presents aggregated data externally. Any data that has not yet been aggregated internally (for example, data from two different import batches) must still be presented in a consistent, aggregated form.
Suppose that you have the following table schema:
Table 14 Data

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total consumption |
Assume that there are two batches of data that have been imported into the storage engine as follows:
Table 15 batch1

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |

Table 16 batch2

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
Data about user 10001 in these two import batches has not yet been aggregated. However, users can only query the aggregated data shown in the following table.
Table 17 Data

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
Create a data table.
CREATE TABLE IF NOT EXISTS example_db.example_tb2
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date and time",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
Insert the data of Table 15 and Table 16.
INSERT INTO example_db.example_tb2 (user_id, date, cost) VALUES
('10001', '2017-11-20', '50'),
('10002', '2017-11-21', '39'),
('10001', '2017-11-20', '1'),
('10001', '2017-11-21', '5'),
('10003', '2017-11-22', '22');
Doris adds an aggregation operator to the query engine to ensure this consistency.

In addition, when running aggregate queries on the aggregated (Value) columns with a function that differs from the column's AggregationType, pay attention to the semantics. For example, execute the following query against the table above:
mysql> SELECT MIN(cost) FROM example_db.example_tb2;
+-------------+
| min(`cost`) |
+-------------+
|           5 |
+-------------+
1 row in set (0.02 sec)
The result is 5, not 1.
In addition, this consistency guarantee could considerably reduce efficiency in some queries.
Take the basic count(*) query as an example:
SELECT COUNT(*) FROM table;
In other databases, such queries return results quickly, because in their implementations the result can be obtained by counting rows and saving the statistic at import time, or by scanning only a single column at query time, either way with very little overhead. In Doris's Aggregate model, however, the overhead of such queries is large. Take the following two batches of data as an example:
Table 18 Data (batch1)

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Table 19 Data (batch2)

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
The final aggregation result is as follows:
Table 20 Data

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
Query the result.
mysql> SELECT COUNT(date) FROM example_db.example_tb2;
+---------------+
| count(`date`) |
+---------------+
|             4 |
+---------------+
1 row in set (0.01 sec)
The correct result of select count(*) from table; should be 4. But if the model only scans the user_id column and aggregates at query time, the result will be 3 (user IDs 10001, 10002, and 10003). And if it does not aggregate at all, the result will be 5 (a total of five rows in the two batches). Both results are wrong.

To get the correct result, Doris must read both the user_id and date columns and aggregate them when querying. That is, in a count(*) query, Doris has to scan all AGGREGATE KEY columns (here, user_id and date) and aggregate them to obtain the semantically correct result. If there are many aggregated columns, count(*) queries can involve scanning large amounts of data.

Therefore, if you need to perform frequent count(*) queries, simulate count(*) by adding a column whose value is always 1 and whose aggregation function is SUM. The table schema in the previous example is modified as follows:
Table 21 Data description

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total consumption |
| count | BIGINT | SUM | Used for count queries |
Add a count column whose value is always 1. Then the result of select count(*) from table; is equivalent to that of select sum(count) from table;, and the latter is much more efficient. However, this method has a shortcoming: it requires that users never import rows with the same values in the AGGREGATE KEY columns. Otherwise, select sum(count) from table; can only express the number of rows originally imported, not the semantics of select count(*) from table;.
Another method is to add a count column whose value is always 1 but whose aggregation function is REPLACE. Then select sum(count) from table; and select count(*) from table; produce the same results, and this method imposes no restrictions on importing duplicate rows.
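A sketch of the SUM-based workaround described above (the table name example_db.example_tb3 is an assumption):

CREATE TABLE IF NOT EXISTS example_db.example_tb3
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
    `count` BIGINT SUM DEFAULT "1" COMMENT "Used for count queries; always import 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1;

-- Equivalent to SELECT COUNT(*) FROM ... as long as no two imported rows
-- ever share the same AGGREGATE KEY values.
SELECT SUM(`count`) FROM example_db.example_tb3;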
- Duplicate model
The Duplicate model does not have this limitation, because it does not involve aggregation semantics. For any column, the Duplicate model returns semantically correct results in count(*) queries.
Data Model Selection
The Doris data model is classified into three types: AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY. Data in all three models is sorted by KEY.
- AGGREGATE KEY models
The AGGREGATE KEY model aggregates data in advance, greatly reducing the amount of data to scan and compute. It is therefore suitable for fixed-pattern reporting queries, but not for count(*) queries. In addition, because the aggregation function of each Value column is fixed, semantic correctness needs to be considered when running aggregate queries that use other functions.

When the AGGREGATE KEY values are the same, old and new records are aggregated. The aggregation functions currently supported are SUM, MIN, MAX, and REPLACE.
CREATE TABLE site_visit
(
    siteid INT,
    city SMALLINT,
    username VARCHAR(32),
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;
- UNIQUE KEY models
The UNIQUE KEY model applies to scenarios that require a unique primary key constraint. When the UNIQUE KEY values are the same, the new record overwrites the old one, which enforces the uniqueness of the primary key. This model suits analytical workloads with update requirements. The UNIQUE KEY model implements the same REPLACE aggregation as the AGGREGATE KEY model and is essentially identical to it. However, pre-aggregated high-performance queries through features such as ROLLUP are not available, because REPLACE, rather than SUM, is used.
CREATE TABLE sales_order
(
    orderid BIGINT,
    status TINYINT,
    username VARCHAR(32),
    amount BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
- DUPLICATE KEY models
The DUPLICATE KEY model does not merge identical rows and is suitable for ad-hoc queries in any dimension. Although it cannot take advantage of pre-aggregation, it is not subject to the limitations of the AGGREGATE KEY model, and it supports column pruning and vectorized execution.
CREATE TABLE session_data
(
    visitorid SMALLINT,
    sessionid BIGINT,
    visittime DATETIME,
    city CHAR(20),
    province CHAR(20),
    ip VARCHAR(32),
    browser CHAR(20),
    url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;