Updated on 2024-11-29 GMT+08:00

Basic Principles

Introduction to Doris

Doris is a high-performance, real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It can return query results of mass data in sub-seconds and can support high-concurrency point queries and high-throughput complex analysis. All this makes Apache Doris an ideal tool for report analysis, ad-hoc query, unified data warehouse, and data lake query acceleration. On Doris, users can build various applications, such as user behavior analysis, AB test platform, log retrieval analysis, user portrait analysis, and order analysis.

Doris Architecture

The following figure shows the overall architecture of Doris. The frontend and backend nodes can be expanded horizontally and infinitely.

Figure 1 Doris architecture
Table 1 Description

Parameter

Description

MySQL Tools

Doris is highly compatible with MySQL syntax and is accessible by various client tools. It also supports standard SQL statements and interconnection with BI tools.

FE

Frontend nodes process user access requests, plan query parsing, and manage metadata and nodes.

BE

Stores data, executes query plans, and balances load among copies.

Leader

Leader is a role elected from the Follower group.

Follower

A metadata log needs to be written successfully in most Follower nodes.

Doris uses the MPP model. Inter-node and intra-node parallel execution is used, which is applicable to distributed join of multiple large tables.

Supports vectorized query engines, AQE ( Adaptive Query Execution ) technology, CBO-RBO optimization policies, and hot data cache query.

Basic Concepts of Doris

In Doris, data is logically described in the form of tables.

  • Row&Column

    A table consists of rows and columns.

    • Row: a row of user data.
    • Column: describes different fields in a row of data.

    Columns can be classified into two types: key and value. From the service perspective, Key and Value can correspond to dimension columns and metric columns, respectively. From the perspective of the aggregation model, rows with the same Key column are aggregated into one row. The aggregation mode of the Value column is specified when the table is created.

  • Tablet&Partition

    In the Doris storage engine, user data is horizontally divided into several data shards (tablets, also called data buckets). Each tablet contains several rows of data. The data between the individual tablets does not intersect and is physically stored independently.

    Multiple tables logically belong to different partitions. A table belongs to only one partition, but a partition contains multiple tables. Since the tablets are physically stored independently, the partitions can be seen as physically independent, too. Tablet is the smallest physical storage unit for data operations such as movement and replication.

    Multiple partitions form a table. A partition can be regarded as the smallest logical management unit. Data can be imported or deleted only for one partition.

  • Data Models

    Doris data models are classified into three types: Aggregate, Unique, and Duplicate.

    • Aggregate models

      When data is imported, rows with the same Key column are aggregated into one row, and the Value column is aggregated based on the configured AggregationType. Currently, AggregationType has the following four aggregation modes:

      • SUM: Accumulate the values in multiple rows.
      • REPLACE: The newly imported value replaces the previous value.
      • MAX: Keep the maximum value.
      • MIN: Keep the minimum value.
    • Unique model

      In some multi-dimensional analysis scenarios, users are highly concerned about how to create uniqueness constraints for the Primary Key. Therefore, the Unique data model is introduced.

      • Read-on-read combination

        The read-on-read combination of the Unique model can be replaced by the Replace mode of the Aggregate model. The internal implementation mode and data storage mode are the same.

      • Merge-on-write

        Different from the Aggregate model, the query performance of the Unique model is closer to that of the Duplicate model. Compared with the Aggregate model, the Unique model has great advantages in query performance in scenarios where primary key constraints are required, especially in aggregation queries and queries that need to filter a large amount of data using indexes.

        In a Unique table where merge-on-write is enabled, overwritten and updated data is marked and deleted during data import, and new data is written to a new file. During query, all data marked for deletion is filtered out at the file level, and the read data is the latest data. This eliminates the data aggregation process in read-time combination and supports pushdown of multiple predicates in many cases. Therefore, the performance can be greatly improved in many scenarios, especially in the case of aggregation query.

    • Duplicate model

      In some multi-dimensional analysis scenarios, primary keys and data aggregation are not required. Duplicate data models can be introduced to meet such requirements.

      Different from the AGGREGATE KEY and UNIQUE KEY models, the DUPLICATE KEY model stores the data as they are and executes no aggregation. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY in the CREATE TABLE statement is only used to specify based on which columns the data are sorted.

    • Suggestions on data model selection

      The data model is determined when the table is created and cannot be modified. Therefore, it is important to select a proper data model.

      • The AGGREGATE KEY model aggregates data in advance, greatly reducing data scanning and calculation workload. Therefore, it is suitable for reporting query business, which has fixed schema. However, this model is not user-friendly for count(*) queries. In addition, because the aggregation function in Value columns is fixed, semantic correctness needs to be considered when aggregation queries using other functions are performed.
      • The Unique model ensures that the primary key is unique in scenarios where a unique primary key constraint is required. However, the query advantages brought by pre-aggregation such as ROLLUP cannot be used.
        • For users who have high performance requirements for aggregation query, you are advised to use the write-on-write combination added in version 1.2.
        • The Unique model supports only the update of an entire row. If you need to update both the unique primary key constraint and some columns (for example, importing multiple source tables to one Doris table), you can use the Aggregate model and set the aggregation type of non-primary key columns to REPLACE_IF_NOT_NULL.
        • Duplicate is suitable for ad-hoc query in any dimension. Although the pre-aggregation feature cannot be used, it is not restricted by the aggregation model and can make full use of the advantages of the column-store model (only related columns are read, and all key columns do not need to be read).