Updated on 2024-08-20 GMT+08:00

Best Practices of Table Design

Using Partitioned Tables

Partitioning refers to splitting what is logically one large table into smaller physical pieces based on specific schemes. The table based on the logic is called a partitioned table, and a physical piece is called a partition. Data is stored on these physical partitions, instead of the logical partitioned table. A partitioned table has the following advantages over an ordinary table:

  1. High query performance: You can specify partitions when querying partitioned tables, improving query efficiency.
  2. High availability: If a certain partition in a partitioned table is faulty, data in the other partitions is still available.
  3. Easy maintenance: To fix a partitioned table having a faulty partition, you only need to fix the partition.
GaussDB supports range partitioned tables, list partitioned tables, and hash partitioned tables.
  • Range partitioned table: Data in different ranges is mapped to different partitions. The range is determined by the partition key specified during the partitioned table creation. The partition key is usually a date. For example, sales data is partitioned by month.
  • List partitioned table: Key values contained in the data are stored in different partitions, and the data is mapped to each partition in sequence. The key values contained in the partitions are specified when the partitioned table is created.
  • Hash partitioned table: Data is mapped to each partition based on the internal hash algorithm. The number of partitions is specified when the partitioned table is created.

Selecting a Distribution Mode

In replication mode, full data in a table is copied to each DN in the cluster. This mode is used for tables containing a small volume of data. Full data in a table stored on each DN avoids data redistribution during the join operation. This reduces network costs and plan segment (each having a thread), but generates much redundant data. Generally, this mode is only used for small dimension tables.

In hash mode, hash values are generated for one or more columns. You can obtain the storage location of a tuple based on the mapping between DNs and the hash values. In a hash table, I/O resources on each node can be used during data read/write, which improves the read/write speed of a table. Generally, a table containing a large amount data is defined as a hash table.

Range distribution and list distribution are user-defined distribution policies. Values in a distribution key are within a certain range or fall into a specific value range of the corresponding target DN. The two distribution modes facilitate flexible data management, which, however, requires users equipped with certain data abstraction capability.

Table 1 Distribution policies and application scenarios

Policy

Description

Application Scenario

Hash

Table data is distributed on all DNs in the cluster.

Fact tables containing a large amount of data.

Replication

Full data in a table is stored on every DN in the cluster.

Small tables and dimension tables.

Range

Table data is mapped to specified columns based on the range and distributed to the corresponding DNs.

Users need to customize distribution rules.

List

Table data is mapped to specified columns based on specific values and distributed to corresponding DNs.

Users need to customize distribution rules.

As shown in Figure 1, T1 is a replication table and T2 is a hash table.

Figure 1 Replication tables and hash tables
  • When you insert, modify, or delete data in a replication table, if you use the shippable or immutable function to encapsulate components that cannot be pushed down, data on different DNs in the replication table may be inconsistent.
  • If statements with unstable results, such as window functions, rownum, and limit clauses and user-defined functions, are used to insert data into or modify data in a replication table, data on different nodes may be different.

Table Compression Level

When creating a table, you can customize the compression level and compression ratio of fields. Compression affects not only data loading but also data query. The COMPRESSION parameter specifies the table compression level.

Parameter description:

COMPRESSION specifies the compression level of table data. It determines the compression ratio and time. Generally, the higher the level of compression, the higher the ratio, the longer the time; and the lower the level of compression, the lower the ratio, the shorter the time. The actual compression ratio depends on the distribution mode of table data loaded.

Value range:

  • Valid values for row-store tables are YES and NO, and the default is NO.

You can select different compression levels based on Table 2 in different scenarios.

Table 2 Application scenarios of compression levels

Compression Level

Application Scenario

Storage Model

YES

Enabling table compression: You are advised not to enable this function because the compression ratio of row-store tables is low.

Row store

NO

Disabling table compression.

Row store

Selecting Distribution Keys

Selecting a distribution key for a hash table is essential. Details are as follows:

  1. Ensure that the column values are discrete so that data can be evenly distributed to each DN. You can select the primary key of the table as the distribution key. For example, for a person information table, choose the ID card number column as the distribution key.
  2. With the above principles met, you can select join conditions as distribution keys so that join tasks can be pushed down to DNs, reducing the amount of data transferred between the DNs.

For a hash table, an improper distribution key may cause data skew or poor I/O performance on certain DNs. Therefore, you need to check the table to ensure that data is evenly distributed on each DN. You can run the following SQL statements to check data skew:

1
2
3
4
5
select 
xc_node_id, count(1) 
from tablename 
group by xc_node_id 
order by xc_node_id desc;

Example:

CREATE TABLE t1(c1 int) distribute by hash(c1);
INSERT INTO t1 values(generate_series(1,100));
select xc_node_id, count(1) from t1 group by xc_node_id order by xc_node_id desc;
DROP TABLE t1;

xc_node_id corresponds to a DN. Generally, over 5% difference between the amount of data on different DNs is regarded as data skew. If the difference is over 10%, choose another distribution key.

Multiple distribution keys can be selected in GaussDB to evenly distribute data.

You can select the distribution key of the range or list distributed table as required. In addition to selecting a proper distribution key, pay attention to the impact of distribution rules on data distribution.

Selecting a Data Type

Use the following principles to select efficient data types:

  1. Select data types that facilitate data calculation.

    Generally, the calculation of integers (including common comparison calculations, for example, =, >, <, >=, <=, and !=, as well as GROUP BY) is more efficient than that of strings and floating point numbers.

  2. Select data types with a short length.

    Data types with short length reduce both the data file size and the memory used for computing, improving the I/O and computing performance. For example, use SMALLINT instead of INT, and INT instead of BIGINT.

  3. Use the same data type for a join.

    You are advised to use the same data type for a join. To join columns with different data types, the database needs to convert them to the same type, which leads to additional performance overheads.

Checking a Node Where a Table Resides

When creating a table, you can specify how the table is distributed or replicated among nodes. For details, see DISTRIBUTEBY. For details about distribution modes, see Selecting a Distribution Mode.

When creating a table, you can also set Node Group to specify a group to which the table belongs. For details, see TO{GROUPgroupname|....

You can also view the instance where the table is located.

  1. Query the schema to which the table belongs.
    select t1.nspname,t2.relname from pg_namespace t1,pg_class t2 where t1.oid = t2.relnamespace and t2.relname = 'table1';

    In the preceding command, nspname indicates the name of a schema, relname indicates the name of a table, an index, or a view, oid indicates the row identifier, relnamespace is the OID of the namespace that contains the relationship, and table1 indicates a table name.

  2. Check relname and nodeoids of the table.
    select t1.relname,t2.nodeoids from pg_class t1, pgxc_class t2, pg_namespace t3  where t1.relfilenode =  t2.pcrelid and t1.relnamespace=t3.oid and t1.relname = 'table1' and t3.nspname ='schema1';

    In the preceding command, nodeoids indicates the OID list of the nodes where the table is distributed, relfilenode indicates the name of the file related to the table on the disk, pcrelid indicates the OID of the table, and schema1 indicates the schema of the table queried in step 1.

  3. Query the instance where the table is located based on the queried node where the table is distributed.
    select * from pgxc_node where oid in (nodeoids1, nodeoids2, nodeoids3);

    In the preceding command, nodeoids1, nodeoids2, nodeoids3 indicates the three nodeoids queried in step 2. Use the actual nodeoids and separate them with commas (,).