Doris Data Model
In Doris, data is logically described in the form of tables. A table is a collection of homogeneous data with the same schema and consists of rows and columns. A row is one record of user data, and a column describes one field of that record. Columns can use different data types (such as integers, strings, and Boolean values) as required.
In OLAP scenarios, columns can be divided into two categories, Key and Value, which correspond to dimension columns and indicator (metric) columns, respectively.
Doris data models are classified into the following types:
Aggregate Model
This section illustrates what an Aggregate model is and how to use it correctly with practical examples.
- Example 1: Aggregating data during import
Assume that the business has the following data table schema:
Table 1 Data description

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| city | VARCHAR(20) | - | City where the user is located |
| age | SMALLINT | - | User age |
| sex | TINYINT | - | User gender |
| last_visit_date | DATETIME | REPLACE | Last visit date |
| cost | BIGINT | SUM | Total consumption |
| max_dwell_time | INT | MAX | Maximum residence time |
| min_dwell_time | INT | MIN | Minimum residence time |
The corresponding CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS demo.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date and time",
    `city` VARCHAR(20) COMMENT "City where the user is located",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit date of the user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum residence time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum residence time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
This is a typical fact table of user information and visit behaviors. In a star schema, user information and visit behaviors are usually stored in dimension tables and fact tables, respectively. Here, for easier explanation of the Doris data models, the two types of information are stored in one single table.
The columns in the table are divided into Key (dimension) columns and Value (indicator) columns based on whether an AggregationType is set. Columns without an AggregationType, such as user_id, date, age, and sex, are Key columns; columns with one are Value columns.
When data is imported, rows with the same contents in the Key columns are aggregated into one row, and their values in the Value columns are aggregated as their AggregationType specifies. AggregationType has the following modes:
- SUM: Sum up the values in multiple rows.
- REPLACE: Replace the previous value with the newly imported value.
- MAX: Keep the maximum value.
- MIN: Keep the minimum value.
Table 2 User information table

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Assume that this is a table recording user behaviors when they are visiting a certain commodity page. The first row of data, for example, is explained as follows:
Table 3 Parameters

| Value | Description |
|---|---|
| 10000 | User ID, which uniquely identifies a user. |
| 2017-10-01 | Time when the data is imported to the database, accurate to date. |
| A | City where the user is located. |
| 20 | User age. |
| 0 | User gender (0 indicates male; 1 indicates female). |
| 2017-10-01 06:00:00 | Time when the user visits the page, accurate to second. |
| 20 | Consumption generated by the current visit. |
| 10 | Time spent on the page during the current visit. |
| 10 | Time spent on the page during the current visit (redundancy). |
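For illustration, this batch could be loaded with a plain INSERT statement such as the sketch below (in practice, bulk imports typically use mechanisms such as Stream Load; the values simply mirror Table 2):

-- Load the seven raw rows of Table 2 as one import batch.
INSERT INTO demo.example_tbl VALUES
(10000, '2017-10-01', 'A', 20, 0, '2017-10-01 06:00:00', 20, 10, 10),
(10000, '2017-10-01', 'A', 20, 0, '2017-10-01 07:00:00', 15, 2, 2),
(10001, '2017-10-01', 'A', 30, 1, '2017-10-01 17:05:45', 2, 22, 22),
(10002, '2017-10-02', 'B', 20, 1, '2017-10-02 12:59:12', 200, 5, 5),
(10003, '2017-10-02', 'C', 32, 0, '2017-10-02 11:20:00', 30, 11, 11),
(10004, '2017-10-01', 'D', 35, 0, '2017-10-01 10:00:15', 100, 3, 3),
(10004, '2017-10-03', 'D', 35, 0, '2017-10-03 10:20:22', 11, 6, 6);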
After this batch of data is imported into Doris correctly, it will be stored in Doris as follows:
Table 4 Stored data

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As shown in the table, the data of user 10000 has been aggregated into one row, while the data of other users remains unchanged. The aggregated data of user 10000 is explained as follows.

The first five columns remain unchanged, so the explanation starts with the sixth column, last_visit_date:
- 2017-10-01 07:00:00: The last_visit_date column is aggregated by REPLACE, so 2017-10-01 07:00:00 has replaced 2017-10-01 06:00:00.
When REPLACE aggregates data from the same import batch, the order of replacement is not guaranteed. In this case, the value eventually saved in Doris could therefore be 2017-10-01 06:00:00. For data from different import batches, however, data from the new batch always replaces data from the old batch.
- 35: The cost column is aggregated by SUM, so the updated value 35 is the result of 20 plus 15.
- 10: The max_dwell_time column is aggregated by MAX, so 10 is saved as it is the maximum between 10 and 2.
- 2: The min_dwell_time column is aggregated by MIN, so 2 is saved as it is the minimum between 10 and 2.
After aggregation, Doris only stores the aggregated data. The detailed raw data is not retained.
- Example 2: Keeping detailed data
Here is a modified version of the table schema in Example 1:
Table 5 Data description

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| timestamp | DATETIME | - | Date and time when the data is imported, accurate to seconds |
| city | VARCHAR(20) | - | City where the user is located |
| age | SMALLINT | - | User age |
| sex | TINYINT | - | User gender |
| last_visit_date | DATETIME | REPLACE | Last visit date |
| cost | BIGINT | SUM | Total consumption |
| max_dwell_time | INT | MAX | Maximum residence time |
| min_dwell_time | INT | MIN | Minimum residence time |
A new column timestamp (accurate to seconds) has been added to record the date and time when the data is imported.
In addition, AGGREGATE KEY is set to AGGREGATE KEY(user_id, date, timestamp, city, age, sex).
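The section does not show the corresponding DDL; a sketch derived from the Example 1 statement (the table name demo.example_tbl2 is an assumption) could look as follows:

CREATE TABLE IF NOT EXISTS demo.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date",
    `timestamp` DATETIME NOT NULL COMMENT "Data import date and time, accurate to seconds",
    `city` VARCHAR(20) COMMENT "City where the user is located",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit date of the user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum residence time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum residence time"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);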
Suppose that the imported data is as follows:
Table 6 User information table

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | A | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | A | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After this batch of data is imported into Doris correctly, it will be stored in Doris as follows:
Table 7 Stored data

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | A | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | A | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

The stored data is exactly the same as the imported data; no aggregation has occurred, because the added timestamp column makes the Key columns of every row unique.
- Example 3: Aggregating imported data with existing data
Based on the table in Example 1, suppose that you have the following data stored in Doris:
Table 8 User information table

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Now you need to import a new batch of data:
Table 9 New data

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | E | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
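As a sketch, the new batch could again be imported with a standalone INSERT (each statement forms its own import batch here):

-- New import batch: the first row shares its Key columns with an existing row,
-- the second row introduces a new user (10005).
INSERT INTO demo.example_tbl VALUES
(10004, '2017-10-03', 'D', 35, 0, '2017-10-03 11:22:00', 44, 19, 19),
(10005, '2017-10-03', 'E', 29, 1, '2017-10-03 18:11:02', 3, 1, 1);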
After this batch of data is imported into Doris correctly, the data stored in Doris will be updated as follows:
Table 10 Updated data

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | A | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | A | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | B | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | C | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | D | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | D | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | E | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
In this table, the existing data and the newly imported data of User 10004 have been aggregated. Meanwhile, the new data of User 10005 has been added.
In Doris, data aggregation happens in the following three stages:
- The ETL stage of each import batch. At this stage, the imported batch is aggregated internally.
- The data compaction stage of the underlying BE nodes. At this stage, BE nodes will aggregate data from different imported batches.
- The data query stage. The data involved in the query will be aggregated accordingly.
At different stages, data is aggregated to varying degrees. For example, when a batch of data has just been imported, it may not yet be aggregated with the existing data. But users can only query aggregated results: the degree of internal aggregation is transparent to users, and they should never assume that what they see is unaggregated or partially aggregated data.
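A small sketch makes this behavior visible (the user 10006 rows are hypothetical and reuse the Example 1 table): the two INSERT statements below form two separate import batches, yet a query returns one fully aggregated row even before BE compaction has merged them.

-- Two import batches with identical Key column values.
INSERT INTO demo.example_tbl VALUES
(10006, '2017-10-04', 'F', 40, 1, '2017-10-04 08:00:00', 10, 5, 5);
INSERT INTO demo.example_tbl VALUES
(10006, '2017-10-04', 'F', 40, 1, '2017-10-04 09:00:00', 7, 9, 9);

-- The query stage aggregates across batches; expected single row:
-- last_visit_date = 2017-10-04 09:00:00, cost = 17, max_dwell_time = 9, min_dwell_time = 5.
SELECT * FROM demo.example_tbl WHERE user_id = 10006;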
Unique Model
In some multi-dimensional analysis scenarios, users require a uniqueness constraint on the primary key. The Unique model is introduced to meet this requirement. It is a special case of the Aggregate model and a simplified way of expressing a table schema. Assume that the business has the following data table schema:

For Unique model tables, you are not advised to enable the merge-on-write attribute; the merge-on-read attribute is used by default.

| ColumnName | Type | IsKey | Comment |
|---|---|---|---|
| user_id | LARGEINT | Yes | User ID |
| username | VARCHAR(50) | Yes | Username |
| city | VARCHAR(20) | No | City where the user is located |
| age | SMALLINT | No | User age |
| sex | TINYINT | No | User gender |
| phone | LARGEINT | No | Phone number of the user |
| address | VARCHAR(500) | No | User address |
| register_time | DATETIME | No | User registration time |
This is a table that contains the basic information of a user. There is no aggregation requirement for such data. The only concern is to ensure the uniqueness of the primary key (user_id+username). The CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `username` VARCHAR(50) NOT NULL COMMENT "Username",
    `city` VARCHAR(20) COMMENT "City where the user is located",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `phone` LARGEINT COMMENT "User phone number",
    `address` VARCHAR(500) COMMENT "User address",
    `register_time` DATETIME COMMENT "User registration time"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
This table schema is the same as the following table schema using the Aggregate model:
| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| username | VARCHAR(50) | - | Username |
| city | VARCHAR(20) | REPLACE | City where the user is located |
| age | SMALLINT | REPLACE | User age |
| sex | TINYINT | REPLACE | User gender |
| phone | LARGEINT | REPLACE | Phone number of the user |
| address | VARCHAR(500) | REPLACE | User address |
| register_time | DATETIME | REPLACE | User registration time |
The CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `username` VARCHAR(50) NOT NULL COMMENT "Username",
    `city` VARCHAR(20) REPLACE COMMENT "City where the user is located",
    `age` SMALLINT REPLACE COMMENT "User age",
    `sex` TINYINT REPLACE COMMENT "User gender",
    `phone` LARGEINT REPLACE COMMENT "User phone number",
    `address` VARCHAR(500) REPLACE COMMENT "User address",
    `register_time` DATETIME REPLACE COMMENT "User registration time"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
The Unique model is equivalent to the REPLACE aggregation function of the Aggregate model; the internal implementation and data storage are exactly the same.
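A minimal sketch of the replace semantics, using the Unique model table defined above (the values are hypothetical):

INSERT INTO example_db.example_tbl (user_id, username, city, age)
VALUES (1, 'tom', 'Beijing', 20);

-- A later import batch with the same key (user_id, username) overwrites the old record.
INSERT INTO example_db.example_tbl (user_id, username, city, age)
VALUES (1, 'tom', 'Shanghai', 21);

-- Expected: exactly one row, with city = 'Shanghai' and age = 21.
SELECT user_id, username, city, age FROM example_db.example_tbl WHERE user_id = 1;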
Duplicate Model
In some scenarios, the data has neither a primary key nor aggregation requirements. The Duplicate model can be used in such cases. Assume that the business has the following log table schema:

| ColumnName | Type | SortKey | Comment |
|---|---|---|---|
| timestamp | DATETIME | Yes | Log time |
| type | INT | Yes | Log type |
| error_code | INT | Yes | Error code |
| error_msg | VARCHAR(1024) | No | Error details |
| op_id | BIGINT | No | Operator ID |
| op_time | DATETIME | No | Operation time |
The CREATE TABLE statement would be as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `timestamp` DATETIME NOT NULL COMMENT "Log time",
    `type` INT NOT NULL COMMENT "Log type",
    `error_code` INT COMMENT "Error code",
    `error_msg` VARCHAR(1024) COMMENT "Error details",
    `op_id` BIGINT COMMENT "Operator ID",
    `op_time` DATETIME COMMENT "Operation time"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
Different from the Aggregate and Unique models, the Duplicate model stores data as it is and performs no aggregation. Even two identical rows are both retained. The DUPLICATE KEY in the CREATE TABLE statement only specifies which columns the data is sorted by; a more accurate name would be "Sorted Columns", but DUPLICATE KEY was chosen to indicate the data model used. It is recommended to choose the first two to four columns as the DUPLICATE KEY.
The Duplicate model is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints. For more usage scenarios, see the part "Limitations of the Aggregate Model".
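A short sketch of this behavior, using the log table defined above (the values are hypothetical): importing the same row twice keeps both copies.

-- Insert an identical log row twice.
INSERT INTO example_db.example_tbl VALUES
('2024-01-01 10:00:00', 1, 404, 'page not found', 100, '2024-01-01 10:00:01'),
('2024-01-01 10:00:00', 1, 404, 'page not found', 100, '2024-01-01 10:00:01');

-- Expected: 2, because the Duplicate model never merges identical rows.
SELECT COUNT(*) FROM example_db.example_tbl WHERE error_code = 404;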
Limitations of the Aggregate Model
- Aggregate and Unique models

The following describes the limitations of the Aggregate model (they also apply to the Unique model).

The Aggregate model only presents aggregated data externally. Any data that has not yet been aggregated internally (for example, data from two different import batches) must still be presented in a consistent, aggregated form.
Suppose that you have the following table schema:
Table 14 Data

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total consumption |
Assume that there are two batches of data that have been imported into the storage engine as follows:
Table 15 batch1

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |

Table 16 batch2

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
Data about user 10001 in these two import batches has not yet been aggregated. However, users can only query the aggregated data shown in the following table.
Table 17 Data

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
Create a data table.
CREATE TABLE IF NOT EXISTS example_db.example_tb2
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date and time",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
Insert the data of Table 15 and Table 16.
INSERT INTO example_db.example_tb2 (user_id, date, cost) VALUES
('10001', '2017-11-20', '50'),
('10002', '2017-11-21', '39'),
('10001', '2017-11-20', '1'),
('10001', '2017-11-21', '5'),
('10003', '2017-11-22', '22');
Doris adds an aggregation operator to the query engine to ensure this consistency.

In addition, when running aggregate queries on the aggregated (Value) columns with a function that differs from the column's AggregationType, pay attention to the semantics. For example, execute the following query against the table above:
mysql> SELECT MIN(cost) FROM example_db.example_tb2;
+-------------+
| min(`cost`) |
+-------------+
|           5 |
+-------------+
1 row in set (0.02 sec)
The result is 5, not 1.
In addition, this consistency guarantee could considerably reduce efficiency in some queries.
Take the basic count(*) query as an example:
SELECT COUNT(*) FROM table;
In other databases, such queries return results quickly, because in their implementations the result can be obtained by counting rows and saving the statistic at import time, or by scanning only a single column at query time, either way with very little overhead. In Doris's Aggregate model, however, the overhead of such queries is large. Take the following two batches of data as an example:
Table 18 Data (batch1)

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Table 19 Data (batch2)

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
The final aggregation result is as follows:
Table 20 Data

| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
Query the result.
mysql> SELECT COUNT(date) FROM example_db.example_tb2;
+---------------+
| count(`date`) |
+---------------+
|             4 |
+---------------+
1 row in set (0.01 sec)
The correct result of select count(*) from table; should be 4. But if the model only scans the user_id column and aggregates at query time, the result will be 3 (user IDs 10001, 10002, and 10003). And if it does not aggregate at all, the result will be 5 (a total of five rows in the two batches). Both results are wrong.

To get the correct result, Doris must read both the user_id and date columns and aggregate them when querying. That is, in a count(*) query, Doris has to scan all AGGREGATE KEY columns (here, user_id and date) and aggregate them to obtain the semantically correct result. If there are many aggregated columns, count(*) queries can involve scanning large amounts of data.

Therefore, if you need to perform frequent count(*) queries, simulate count(*) by adding a column whose value is always 1 and whose aggregation function is SUM. The table schema in the previous example is modified as follows:
Table 21 Data description

| ColumnName | Type | AggregationType | Comment |
|---|---|---|---|
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total consumption |
| count | BIGINT | SUM | Used for count queries |
Add a count column whose value is always 1. Then the result of select count(*) from table; is equivalent to that of select sum(count) from table;, and the latter is much more efficient. However, this method has a shortcoming: it requires that users never import rows with the same values in the AGGREGATE KEY columns. Otherwise, select sum(count) from table; can only express the number of rows originally imported, not the semantics of select count(*) from table;.
Another method is to add a count column whose value is always 1 but whose aggregation function is REPLACE. Then select sum(count) from table; and select count(*) from table; produce the same results, and this method imposes no restrictions on importing duplicate rows.
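A sketch of the SUM-based workaround described above (the table name example_db.example_tb3 is an assumption):

CREATE TABLE IF NOT EXISTS example_db.example_tb3
(
    `user_id` LARGEINT NOT NULL COMMENT "User ID",
    `date` DATE NOT NULL COMMENT "Data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",
    `count` BIGINT SUM DEFAULT "1" COMMENT "Used for count queries; always import 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1;

-- Equivalent to SELECT COUNT(*) FROM ... as long as no two imported rows
-- ever share the same AGGREGATE KEY values.
SELECT SUM(`count`) FROM example_db.example_tb3;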
- Duplicate model
The Duplicate model does not have this limitation, because it does not involve aggregation semantics. For any column, the Duplicate model returns semantically correct results in count(*) queries.
Data Model Selection
The Doris data model is classified into three types: AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY. Data in all three models is sorted by KEY.
- AGGREGATE KEY models
The AGGREGATE KEY model aggregates data in advance, greatly reducing the amount of data to scan and compute. It is therefore suitable for fixed-pattern reporting queries, but not for count(*) queries. In addition, because the aggregation function of each Value column is fixed, semantic correctness needs to be considered when running aggregate queries that use other functions.

When the AGGREGATE KEY values are the same, old and new records are aggregated. The aggregation functions currently supported are SUM, MIN, MAX, and REPLACE.
CREATE TABLE site_visit
(
    siteid INT,
    city SMALLINT,
    username VARCHAR(32),
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;
- UNIQUE KEY models
The UNIQUE KEY model applies to scenarios that require a unique primary key constraint. When the UNIQUE KEY values are the same, the new record overwrites the old one, which enforces the uniqueness of the primary key. This model suits analytical workloads with update requirements. The UNIQUE KEY model implements the same REPLACE aggregation as the AGGREGATE KEY model and is essentially identical to it. However, pre-aggregated high-performance queries through features such as ROLLUP are not available, because REPLACE, rather than SUM, is used.
CREATE TABLE sales_order
(
    orderid BIGINT,
    status TINYINT,
    username VARCHAR(32),
    amount BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
- DUPLICATE KEY models
The DUPLICATE KEY model does not merge identical rows and is suitable for ad-hoc queries in any dimension. Although it cannot take advantage of pre-aggregation, it is not subject to the limitations of the AGGREGATE KEY model, and it supports column pruning and vectorized execution.
CREATE TABLE session_data
(
    visitorid SMALLINT,
    sessionid BIGINT,
    visittime DATETIME,
    city CHAR(20),
    province CHAR(20),
    ip VARCHAR(32),
    browser CHAR(20),
    url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;