Doris Basic Principles

Doris Overview

Doris is a high-performance, real-time analytical database based on an MPP (massively parallel processing) architecture. It returns query results over massive data sets with sub-second latency and supports both high-concurrency point queries and high-throughput complex analysis. This makes Doris well suited to report analysis, ad hoc queries, unified data warehousing, and federated query acceleration for data lakes. Users can build applications on Doris such as user behavior analysis, A/B testing platforms, log retrieval and analysis, user profile analysis, and order analysis.

Doris, formerly known as Palo, was originally created to support an advertising reporting business. The Apache Doris community has gathered more than 300 contributors from hundreds of companies in different industries, with close to 100 active contributors per month. In June 2022, Apache Doris graduated from the Apache Incubator as a Top-Level Project. Doris has a wide user base in China and around the world and runs in the production environments of more than 500 enterprises. Of the top 50 Chinese Internet companies by market capitalization (or valuation), more than 80% are long-term users of Doris. It is also widely used in traditional industries such as finance, energy, and manufacturing.

Doris Architecture

Doris uses the MPP architecture to execute queries. A query runs in parallel both across nodes and within each node. Distributed shuffle joins of multiple large tables are supported, so multi-table join queries perform well even in complex query scenarios.
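The idea behind a distributed shuffle join can be illustrated with a small, single-process Python sketch. It is not Doris code: the function names, the sample tables, and the three-node layout are assumptions made for illustration only. Both inputs are redistributed by a hash of the join key so that matching rows land on the same node, and each node then joins only its own share.

```python
from collections import defaultdict


def shuffle(rows, key, num_nodes):
    """Route each row to a node chosen by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def local_hash_join(left, right, key):
    """Join the rows that landed on one node using an in-memory hash table."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def shuffle_join(left, right, key, num_nodes=3):
    """Shuffle both inputs by the join key, then join node by node.

    In a real MPP engine the per-node joins run in parallel on different
    machines; here they run one after another in a single loop.
    """
    left_parts = shuffle(left, key, num_nodes)
    right_parts = shuffle(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return result


# Hypothetical sample data.
orders = [
    {"user_id": 1, "amount": 30},
    {"user_id": 2, "amount": 15},
    {"user_id": 1, "amount": 5},
]
users = [
    {"user_id": 1, "name": "ann"},
    {"user_id": 2, "name": "bob"},
    {"user_id": 3, "name": "cy"},
]

joined = shuffle_join(orders, users, "user_id")
for row in joined:
    print(row)
```

Because rows with the same key always hash to the same node, no node needs to see the whole of either table, which is what lets this kind of join scale to large inputs.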

Figure 1 Doris architecture

The overall architecture of the Doris engine is very simple. There are two types of processes:

Frontend nodes handle user access requests, parse and plan queries, and manage metadata and nodes.
Backend nodes are responsible for both storing data and executing query plans.

Both types of processes can be scaled out horizontally. Nodes in a single cluster can be flexibly scaled, and the storage capacity can be increased to dozens of petabytes. In addition, the two types of processes use a consistency protocol to ensure high service availability and data reliability. This highly integrated architecture reduces O&M costs.

Advantages

High performance: Doris has an efficient columnar storage engine that reduces the amount of data scanned and achieves a high compression ratio. Doris also uses various indexing techniques to speed up data reading and filtering. With partition and bucket pruning, Doris can support highly concurrent online services, and a single node can handle up to thousands of QPS. In addition, Doris uses a vectorized execution engine to take full advantage of the parallel computing power of modern CPUs. Doris supports materialized views to accelerate queries through pre-aggregation, and its query optimizer optimizes queries based on rules and cost.
Ease of use: CloudTable Doris supports standard ANSI SQL syntax, including single-table aggregation, sorting, filtering, multi-table joins, subqueries, and advanced constructs such as window functions and GROUPING SETS. It is also compatible with the MySQL protocol, which allows users to access Doris through various BI tools.
Simple architecture: Doris has only two types of processes: Frontend (FE) and Backend (BE). FE nodes handle user request access, query parsing and planning, metadata storage, and cluster management. BE nodes store data and execute query plans. Doris functions as a complete distributed database management system, so users can run a Doris cluster without installing any third-party management or control components. Both FE and BE nodes support horizontal scaling. A cluster can be scaled out to hundreds of nodes and can store more than 10 petabytes of data.
Stability and reliability: Data can be stored in multiple copies, and Doris clusters are self-healing. The distributed management framework automatically manages the distribution, repair, and balancing of data copies. When a data copy is damaged, the system automatically detects the damage and repairs it.
Rich ecosystem: Doris provides a variety of data ingestion methods. It supports fast loading of data from the local host, Hadoop, Flink, Spark, Kafka, SeaTunnel, and other systems, and can directly access data in MySQL, PostgreSQL, Oracle, S3, Hive, Iceberg, Elasticsearch, and other systems without copying the data. Data stored in Doris can also be read by Spark and Flink and passed to downstream data applications for display and analysis.
Flexible billing: For long-term stable services, you can purchase compute and cache resources in yearly/monthly mode. For temporary and ever-changing services, you can purchase compute and cache resources in pay-per-use mode. By default, storage resources are billed based on the actual data volume.
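The partition and bucket pruning mentioned under High performance can be illustrated with a minimal Python sketch. It is a simplified model, not Doris internals: the monthly range partitioning, the modulo bucketing function, the bucket count, and all names below are assumptions chosen for illustration. Rows are grouped by (partition, bucket), and a point query on both columns reads only the one group that can contain matching rows.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count per partition


def partition_of(day: str) -> str:
    """Range-partition by month, e.g. '2024-02-10' -> '2024-02'."""
    return day[:7]


def bucket_of(user_id: int) -> int:
    """Toy bucketing function (assumption: a real engine uses its own hash)."""
    return user_id % NUM_BUCKETS


def load(rows):
    """Group rows into storage units keyed by (partition, bucket)."""
    units = defaultdict(list)
    for row in rows:
        units[(partition_of(row["day"]), bucket_of(row["user_id"]))].append(row)
    return units


def prune(units, day, user_id):
    """Keep only the storage units that can contain the requested row."""
    wanted = (partition_of(day), bucket_of(user_id))
    return [key for key in units if key == wanted]


def point_query(units, day, user_id):
    """Scan the surviving units and return (matching rows, units scanned)."""
    scanned = prune(units, day, user_id)
    rows = [
        r
        for key in scanned
        for r in units[key]
        if r["day"] == day and r["user_id"] == user_id
    ]
    return rows, scanned


# Hypothetical data: 3 monthly partitions x 8 users.
data = [
    {"day": f"2024-0{m}-10", "user_id": u, "clicks": m * u}
    for m in (1, 2, 3)
    for u in range(1, 9)
]
units = load(data)
rows, scanned = point_query(units, "2024-02-10", 5)
print(f"scanned {len(scanned)} of {len(units)} storage units")
print(rows)
```

Since each point query touches a small, fixed slice of the data regardless of table size, many such queries can be served concurrently, which is the effect the pruning description refers to.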
