Compute
Elastic Cloud Server
Huawei Cloud Flexus
Bare Metal Server
Auto Scaling
Image Management Service
Dedicated Host
FunctionGraph
Cloud Phone Host
Huawei Cloud EulerOS
Networking
Virtual Private Cloud
Elastic IP
Elastic Load Balance
NAT Gateway
Direct Connect
Virtual Private Network
VPC Endpoint
Cloud Connect
Enterprise Router
Enterprise Switch
Global Accelerator
Management & Governance
Cloud Eye
Identity and Access Management
Cloud Trace Service
Resource Formation Service
Tag Management Service
Log Tank Service
Config
OneAccess
Resource Access Manager
Simple Message Notification
Application Performance Management
Application Operations Management
Organizations
Optimization Advisor
IAM Identity Center
Cloud Operations Center
Resource Governance Center
Migration
Server Migration Service
Object Storage Migration Service
Cloud Data Migration
Migration Center
Cloud Ecosystem
KooGallery
Partner Center
User Support
My Account
Billing Center
Cost Center
Resource Center
Enterprise Management
Service Tickets
HUAWEI CLOUD (International) FAQs
ICP Filing
Support Plans
My Credentials
Customer Operation Capabilities
Partner Support Plans
Professional Services
Analytics
MapReduce Service
Data Lake Insight
CloudTable Service
Cloud Search Service
Data Lake Visualization
Data Ingestion Service
GaussDB(DWS)
DataArts Studio
Data Lake Factory
DataArts Lake Formation
IoT
IoT Device Access
Others
Product Pricing Details
System Permissions
Console Quick Start
Common FAQs
Instructions for Associating with a HUAWEI CLOUD Partner
Message Center
Security & Compliance
Security Technologies and Applications
Web Application Firewall
Host Security Service
Cloud Firewall
SecMaster
Anti-DDoS Service
Data Encryption Workshop
Database Security Service
Cloud Bastion Host
Data Security Center
Cloud Certificate Manager
Edge Security
Managed Threat Detection
Blockchain
Blockchain Service
Web3 Node Engine Service
Media Services
Media Processing Center
Video On Demand
Live
SparkRTC
MetaStudio
Storage
Object Storage Service
Elastic Volume Service
Cloud Backup and Recovery
Storage Disaster Recovery Service
Scalable File Service Turbo
Scalable File Service
Volume Backup Service
Cloud Server Backup Service
Data Express Service
Dedicated Distributed Storage Service
Containers
Cloud Container Engine
SoftWare Repository for Container
Application Service Mesh
Ubiquitous Cloud Native Service
Cloud Container Instance
Databases
Relational Database Service
Document Database Service
Data Admin Service
Data Replication Service
GeminiDB
GaussDB
Distributed Database Middleware
Database and Application Migration UGO
TaurusDB
Middleware
Distributed Cache Service
API Gateway
Distributed Message Service for Kafka
Distributed Message Service for RabbitMQ
Distributed Message Service for RocketMQ
Cloud Service Engine
Multi-Site High Availability Service
EventGrid
Dedicated Cloud
Dedicated Computing Cluster
Business Applications
Workspace
ROMA Connect
Message & SMS
Domain Name Service
Edge Data Center Management
Meeting
AI
Face Recognition Service
Graph Engine Service
Content Moderation
Image Recognition
Optical Character Recognition
ModelArts
ImageSearch
Conversational Bot Service
Speech Interaction Service
Huawei HiLens
Video Intelligent Analysis Service
Developer Tools
SDK Developer Guide
API Request Signing Guide
Terraform
Koo Command Line Interface
Content Delivery & Edge Computing
Content Delivery Network
Intelligent EdgeFabric
CloudPond
Intelligent EdgeCloud
Solutions
SAP Cloud
High Performance Computing
Developer Services
ServiceStage
CodeArts
CodeArts PerfTest
CodeArts Req
CodeArts Pipeline
CodeArts Build
CodeArts Deploy
CodeArts Artifact
CodeArts TestPlan
CodeArts Check
CodeArts Repo
Cloud Application Engine
MacroVerse aPaaS
KooMessage
KooPhone
KooDrive

Doris Data Model

Updated on 2024-11-29 GMT+08:00

Basic Concepts

In Doris, data is logically described in the form of tables. A table consists of rows and columns. A row is a line of user data, and a column describes different fields in a row of data. Columns are classified into two categories: Key and Value. From a business perspective, Key and Value correspond to dimension columns and indicator columns, respectively.

Data models in Doris fall into three types:

  • Aggregate
  • Unique
  • Duplicate

Aggregate Model

The following examples show you what aggregation model is and how to use it correctly:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`user_id` LARGEINT NOT NULL COMMENT "User ID",

`date` DATE NOT NULL COMMENT "Data import date and time",

`city` VARCHAR(20) COMMENT "City",

`age` SMALLINT COMMENT "Age",

`sex` TINYINT COMMENT "Gender",

`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last access time",

`cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption",

`max_dwell_time` INT MAX DEFAULT "0" COMMENT "Dwell time",

`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwell time"

)

AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)

DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

NOTE:

When data is imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their AggregationType specify. Currently, there are several aggregate methods:

  • SUM: Accumulate the values in multiple rows.
  • REPLACE: Replace the previous value with the newly imported value.
  • MAX: Keep the maximum value.
  • MIN: Keep the minimum value.

The columns in the table are divided into Key (dimension) columns and Value (indicator columns) based on whether they are set with an AggregationType. Key columns are not set with an AggregationType, such as user_id, date, and age, while Value columns are.

Unique Model

  • Merge on Read

    If your data does not need to be aggregated, you use this implementation. You only need to ensure the uniqueness of the primary keys (user_id and username). The Merge On Read implementation of the unique model is equivalent to the REPLACE aggregation type in the aggregate model. The following is an example:

    CREATE TABLE IF NOT EXISTS example_db.example_tbl

    (

    `user_id` LARGEINT NOT NULL COMMENT "User ID",

    `username` VARCHAR(50) NOT NULL COMMENT "Username",

    `city` VARCHAR(20) COMMENT "City",

    `age` SMALLINT COMMENT "Age",

    `sex` TINYINT COMMENT "Gender",

    `phone` LARGEINT COMMENT "User phone number",

    `address` VARCHAR(500) COMMENT "User address",

    `register_time` DATETIME COMMENT "User registration time"

    )

    UNIQUE KEY(`user_id`, `username`)

    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

    PROPERTIES (

    "replication_allocation" = "tag.location.default: 1"

    );

  • Merge on Write

    The Merge On Write implementation of the unique model is completely different from that of the aggregate model. It can deliver better performance (almost like that of the duplicate model) in aggregation queries with primary key limitations. This implementation is suitable for aggregation queries and those using indexes to filter out large-scale data.

    When creating a table, you can enable the unique model by adding the following property:

    "enable_unique_key_merge_on_write" = "true"

    For example:

    CREATE TABLE IF NOT EXISTS example_db.example_tbl

    (

    `user_id` LARGEINT,

    `username` VARCHAR(50) NOT NULL,

    `city` VARCHAR(20),

    `age` SMALLINT,

    `sex` TINYINT,

    `phone` LARGEINT,

    `address` VARCHAR(500),

    `register_time` DATETIME

    )

    UNIQUE KEY(`user_id`, `username`)

    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1

    PROPERTIES (

    "replication_allocation" = "tag.location.default: 1",

    "enable_unique_key_merge_on_write" = "true"

    );

    NOTE:

    On a unique table with the Merge on Write option enabled, during the import stage, the data that is to be overwritten and updated will be marked for deletion, and new data will be written in. During a query, all data marked for deletion will be filtered out at the file level, and only the latest data would be read. This eliminates the data aggregation cost while reading, and supports many types of predicate pushdown now. Performance is improved in many scenarios, especially in aggregation queries.

Duplicate Model

In some multidimensional analysis scenarios, there is no need for primary keys or data aggregation. This is when we use the Duplicate model. The Duplicate Model stores the data as they are and executes no aggregation. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY specified in the table creation statement is only used to specify based on which columns the data are sorted.

The statement for creating a table in duplicate model is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl

(

`timestamp` DATETIME NOT NULL COMMENT " Log time",

`type` INT NOT NULL COMMENT " Log type",

`error_code` INT COMMENT "Error code",

`error_msg` VARCHAR(1024) COMMENT "Error details",

`op_id` BIGINT COMMENT "Owner ID",

`op_time` DATETIME COMMENT "Processing time"

)

DUPLICATE KEY(`timestamp`, `type`, `error_code`)

DISTRIBUTED BY HASH(`type`) BUCKETS 1

PROPERTIES (

"replication_allocation" = "tag.location.default: 1"

);

Key Columns

For the duplicate, aggregate, and unique models, the Key column is specified during table creation. The differences are as follows:

  • Duplicate model: The Key columns can be regarded as just "sorting columns", but not unique identifiers.
  • Aggregate and unique models: For tables of the two aggregation types, the Key columns are both "sorting columns" and "unique identifier columns".

Suggestions on Data Model Selection

The data model is established when the table is created and cannot be modified. Therefore, it is important to select a proper data model.

  • The Aggregate Model can greatly reduce the amount of data scanned and query computation by pre-aggregation. Thus, it is very suitable for report query scenarios with fixed patterns, but is unfriendly to count (*) queries. Since the aggregation method on the Value column is fixed, semantic correctness should be considered in other types of aggregation queries.
  • The unique model guarantees the uniqueness of primary key for scenarios requiring a unique primary key. The downside is that it cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries.
    • Users who have high-performance requirements for aggregate queries are recommended to use the Merge on Write implementation.
    • The unique model only supports entire-row updates. If you require primary key uniqueness as well as partial updates of certain columns (such as loading multiple source tables into one Doris table), you can consider using the aggregate model, while setting the aggregate type of the non-primary key columns to REPLACE_IF_NOT_NULL.
  • Duplicate is suitable for ad-hoc queries in any dimensions. Although it may not be able to take advantage of the pre-aggregation feature, it is not limited by what constraints the aggregate model and can give full play to the advantage of columnar storage (reading only the relevant columns, but not all Key columns).

We use cookies to improve our site and your experience. By continuing to browse our site you accept our cookie policy. Find out more

Feedback

Feedback

Feedback

0/500

Selected Content

Submit selected content with the feedback