Data Skew Causes Slow SQL Statement Execution and Operations Fail on Large Tables

Symptom

SQL statement execution is slow and SQL statements cannot be executed on large tables.

Possible Causes

The distribution modes supported by DWS are hash, replication, and round-robin (supported by 8.1.2 clusters and later versions). If the created table is distributed in Hash mode and the distribution key is not specified, the first column of the table is selected as the distribution key. In this case, skew may occur. Table skew has the following negative impacts.

The SQL execution performance is poor because data is distributed only on some DNs. When the SQL statement is executed, only some DNs are involved in computing, and the advantage of distributed computing is not leveraged.
The usage of resources, especially disks, will be skewed. That is, the usage of some disks may be close to the upper limit, but the usage of other disks is low.
The CPU usage of some nodes may be excessively high.

Cause Analysis

Log in to the DWS console. On the Clusters page, locate the target cluster. In the Operation column of the target cluster, click Monitoring Panel. Choose Monitoring > Node Monitoring. Click the Disks tab to view the disk usage.

Check the usage of each data disk. It is found that the usage is uneven among data disks. Generally, the difference between the highest and the lowest disk usage is small. If the difference exceeds 5%, data skew may occur.

Connect to the database and check the job operating status in the waiting view. It is found that the job waits for being processed by one or some DNs.

     SELECT wait_status, count(*) as cnt FROM pgxc_thread_wait_status WHERE wait_status not like '%cmd%' AND wait_status not like '%none%' and wait_status not like '%quit%' group by 1 order by 2 desc;

The explain performance of the slow statement shows that the scan time and number of scan rows in the base table of each DN are unbalanced.
1

explain performance select avg(ss_wholesale_cost) from store_sales;
- Time of scanning a base table: The fastest DN takes 5 ms, and the slowest DN takes 1173 ms.
- Data distribution: Some DNs have 22,831,616 rows and other DNs have no row, resulting in data skew.

You can detect data skew by using the skew check interface.

     SELECT table_skewness('store_sales');

Click to enlarge

     SELECT table_distribution('public','store_sales');

Click to enlarge

The resource monitoring result shows that the CPU usage and I/O of some nodes are significantly higher than those of other nodes.

Handling Procedure

How to find the skewed table

If the number of tables in the database is less than 10,000, use the PGXC_GET_TABLE_SKEWNESS view to query data skew of all tables in the database.
1

SELECT * FROM pgxc_get_table_skewness ORDER BY totalsize DESC;

If the number of tables in the database is greater than 10,000, it may take a long time (hours) to query the entire database and calculate skew columns in the view. You are advised to perform the following operations by referring to the definition of the PGXC_GET_TABLE_SKEWNESS view:

In 8.1.2 and earlier cluster versions, the function is used to optimize calculation by customizing output and reducing output columns. For example:

     SELECT schemaname,tablename,max(dnsize) AS maxsize, min(dnsize) AS minsize 
FROM pg_catalog.pg_class c 
INNER JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace 
INNER JOIN pg_catalog.table_distribution() s ON s.schemaname = n.nspname AND s.tablename = c.relname 
INNER JOIN pg_catalog.pgxc_class x ON c.oid = x.pcrelid AND x.pclocatortype = 'H' 
GROUP BY schemaname,tablename;
 
 
  

For clusters of 8.1.3 and later versions, the function can be used to check data skew of all tables in the database. The gs_table_distribution() function is better than the table_distribution() function when you query all tables in the database. In a large cluster with a large amount of data, use the gs_table_distribution() function.

     SELECT schemaname,tablename,max(dnsize) AS maxsize, min(dnsize) AS minsize 
FROM pg_catalog.pg_class c 
INNER JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace 
INNER JOIN pg_catalog.gs_table_distribution() s ON s.schemaname = n.nspname AND s.tablename = c.relname 
INNER JOIN pg_catalog.pgxc_class x ON c.oid = x.pcrelid AND x.pclocatortype = 'H' 
GROUP BY schemaname,tablename;
 
 
  

Run the following statement to query large tables:

     SELECT schemaname||'.'||tablename as table, sum(dnsize) as size FROM gs_table_distribution() group by 1 order by 2 desc limit 10;

Run the following statement to query the table skew rate:

     WITH skew AS
(
        SELECT
                schemaname,
                tablename,
                pg_catalog.sum(dnsize) AS totalsize,
                pg_catalog.avg(dnsize) AS avgsize,
                pg_catalog.max(dnsize) AS maxsize,
                pg_catalog.min(dnsize) AS minsize,
                (pg_catalog.max(dnsize) - pg_catalog.min(dnsize)) AS skewsize,
                pg_catalog.stddev(dnsize) AS skewstddev
        FROM pg_catalog.pg_class c
        INNER JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
        INNER JOIN pg_catalog.gs_table_distribution() s ON s.schemaname = n.nspname AND s.tablename = c.relname
        INNER JOIN pg_catalog.pgxc_class x ON c.oid = x.pcrelid AND x.pclocatortype IN('H', 'N')
        GROUP BY schemaname,tablename
)
SELECT
        schemaname,
        tablename,
        totalsize,
        avgsize::numeric(1000),
        (maxsize/totalsize)::numeric(4,3)  AS maxratio,
        (minsize/totalsize)::numeric(4,3)  AS minratio,
        skewsize,
        (skewsize/avgsize)::numeric(4,3)  AS skewratio,
        skewstddev::numeric(1000)
FROM skew
WHERE totalsize > 0;
 
 
  

Methods of selecting a distribution key for a table

If the distinct value of the column is large and no obvious data skew occurs, you can define multiple columns as a distribution key.
View the size of distinct.
1

SELECT count(distinct column1) FROM table;
Check whether there are data skews:
1

SELECT count(*) cnt, column1 FROM table group by column1 order by cnt limit 100;
Select the columns where JOIN or GROUP BY statement is frequently used to reduce the use of STREAM.
It is advised against using these distribution key selection methods:
1. The default value of the distribution key (the first column) is used.
2. The distribution key is generated through the auto-increment of sequences.
3. The distribution key is generated using a random number. This method is recommended only when any column or any combination of two columns is skewed.

Parent topic: Cluster Performance

Previous topic: SQL Execution Is Slow with Low Performance and Sometimes Does Not End After a Long Period of Time

Next topic: Table Size Does not Change After VACUUM FULL Is Executed on the Table

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.