Updated on 2023-03-30 GMT+08:00

Distributed Data Storage

GaussDB(DWS) horizontally partitions tables, distributing tuples in a table to multiple nodes. This allows you to filter out unnecessary data when querying to quickly locate data and significantly improve database performance.

Horizontal partitioning distributes data in a table to multiple nodes based on a specific data distribution policy.GaussDB(DWS) supports the data distribution policies described in Table 1. When executing the CREATE TABLE statement, you can configure the DISTRIBUTE BY parameter to enable data distribution on a specific table.

Table 1 Distribution policies

Policy

Description

Scenario

Advantage & Disadvantage

Replication

Full data in a table is stored on each DN in the cluster.

Small tables and dimension tables

  • The advantage of replication is that each DN has full data of the table. During the join operation, data does not need to be redistributed, reducing network overheads and reducing plan segments (each plan segment starts a corresponding thread).
  • The disadvantage of replication is that each DN retains the complete data of the table, resulting in data redundancy. Generally, replication is only used for small dimension tables.

Hash

Table data is distributed on all DNs in the cluster.

Fact tables containing a large amount of data

  • The I/O resources of each node can be used during data read/write, greatly improving the read/write speed of a table.
  • Generally, a large table (containing over 1 million records) is defined as a hash table.

Polling (Round-robin)

Supported by 8.1.2 and later versions

Each row in the table is sent to each DN in turn. Data can be evenly distributed on each DN.

Fact tables that contain a large amount of data and cannot find a proper distribution key in hash mode

  • Round-robin can avoid data skew, improving the space utilization of the cluster.
  • Round-robin does not support local DN optimization like a hash table does, and the query performance of Round-robin is usually lower than that of a hash table.
  • If a proper distribution key can be found for a large table, use the hash distribution mode with better performance. Otherwise, define the table as a round-robin table.