Updated on 2024-12-18 GMT+08:00

Updating Statistics

In a database, statistics are the source data the planner uses to generate execution plans. If statistics are unavailable or outdated, the execution plan may deteriorate severely, leading to poor performance.

Scenario

The ANALYZE statement collects statistics on database table contents. These statistics will be stored in the PG_STATISTIC system catalog. Then, the query optimizer uses the statistics to work out the most efficient execution plan.

After executing batch INSERT or DELETE operations, you are advised to run the ANALYZE statement on the table or the entire database to update statistics. By default, 30,000 rows are sampled; that is, the default value of the GUC parameter default_statistics_target is 100. If the table holds more than 1,600,000 rows, you are advised to set default_statistics_target to -2, which samples 2% of the rows.
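For example, a minimal sketch of this workflow (the table name and row count are illustrative):

SET default_statistics_target = -2;   -- switch to 2% percentage sampling
ANALYZE big_sales;                    -- big_sales was just batch-loaded with several million rows
RESET default_statistics_target;      -- restore the default (100, i.e. 30,000 sampled rows)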

For intermediate tables generated during the batch execution of scripts or stored procedures, you also need to run the ANALYZE statement.
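As a sketch (table and column names are illustrative), an intermediate table built by a batch script can be analyzed before it is reused, so that later steps get realistic cost estimates:

CREATE TEMPORARY TABLE tmp_cust_totals AS
SELECT customer_id, sum(amount) AS total_amount FROM orders GROUP BY customer_id;
ANALYZE tmp_cust_totals;  -- collect statistics before subsequent joins use this table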

If a table has multiple inter-related columns and queries filter or group on these columns, collect statistics on them together so that the query optimizer can accurately estimate row counts and generate an effective execution plan.

Generating Statistics

  • Update statistics on a single table.
    ANALYZE tablename;
    
  • Update statistics on the entire database.
    ANALYZE;
    
  • Collect statistics on multiple columns.
    • Collect statistics on the column_1 and column_2 columns of the tablename table.
      ANALYZE tablename ((column_1, column_2));
    • Declare statistics on the column_1 and column_2 columns of the tablename table.
      ALTER TABLE tablename ADD STATISTICS ((column_1, column_2));
    • Collect statistics on each single column, as well as on the declared multi-column combinations.
      ANALYZE tablename;
    • Delete the statistics on column_1 and column_2 of the tablename table, together with their declaration.
      ALTER TABLE tablename DELETE STATISTICS ((column_1, column_2));
      
  • After statistics are declared for multiple columns by running ALTER TABLE tablename ADD STATISTICS, the system collects statistics on these columns the next time ANALYZE is run on the table or on the entire database.
  • Use EXPLAIN to view the execution plan of each SQL statement. If rows=10 (the default value, which usually means the table has never been analyzed) appears in the Seq Scan output for a table, run ANALYZE on that table, as in the example below.
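    For instance (the table name is illustrative and the plan output is abridged):

    EXPLAIN SELECT * FROM tablename WHERE column_1 = 1;
    -- Seq Scan on tablename  (cost=0.00..0.00 rows=10 width=8)
    -- rows=10 is the optimizer default, so the table has probably never been analyzed
    ANALYZE tablename;
    EXPLAIN SELECT * FROM tablename WHERE column_1 = 1;  -- rows now reflects the actual data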

Improving the Quality of Statistics

ANALYZE samples data from a table using a random sampling algorithm and calculates table data features from the samples. The sample size is controlled by the default_statistics_target parameter, whose value ranges from -100 to 10000 with a default of 100.

  • If the value of default_statistics_target is greater than 0, the number of samples is 300 x default_statistics_target. This means a larger value of default_statistics_target indicates a larger number of samples, larger memory space occupied by samples, and longer time required for calculating statistics.
  • If the value of default_statistics_target is smaller than 0, the number of samples is |default_statistics_target|/100 x the total number of rows in the table, so a smaller (more negative) value means a larger sample. In this case, the sampled data is written to disk instead of being held in memory, but calculation can still take a long time because the sample is large.

    When default_statistics_target is negative, the sample size is a fixed percentage of the table, so this mode is also known as percentage sampling.
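    To make the arithmetic concrete (the row count is illustrative): with the default default_statistics_target = 100, ANALYZE samples 300 x 100 = 30,000 rows; with default_statistics_target = -2 on a 5,000,000-row table, it samples 2/100 x 5,000,000 = 100,000 rows.

    SET default_statistics_target = -2;  -- percentage sampling: 2% of the table
    ANALYZE bigtable;                    -- about 100,000 of 5,000,000 rows sampled, written to disk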

Automatic Statistics Collection

When the autoanalyze parameter is enabled, the optimizer automatically collects statistics when a table has none or when its data changes exceed a threshold. This ensures that the optimizer has the information it needs to make accurate decisions.

In a cost-based optimizer (CBO) model, statistics largely determine the quality of the generated query plan, so it is crucial to keep them timely and accurate.

  • Table-level statistics are stored in the relpages and reltuples columns of pg_class.
  • Column-level statistics are stored in the PG_STATISTIC system catalog and are accessible through the pg_stats view. They cover the percentage of NULL values, the percentage of distinct values, high-frequency values (MCVs), and histograms.
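For example, both levels can be inspected directly (the table name is illustrative):

SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'tablename';
SELECT attname, null_frac, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'tablename';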

Collection condition: if the data volume changes substantially (default threshold: 10%), indicating a shift in data characteristics, the system collects statistics again.

Overall policy: the system uses dynamic sampling to collect statistics promptly and polling sampling to keep them persistent. For services that require second-level query response, manual sampling is recommended.
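A quick way to check the switches involved (verify the defaults in your environment):

SHOW autoanalyze;                      -- dynamic sampling switch
SHOW autovacuum;                       -- background polling switch
SHOW autovacuum_analyze_scale_factor;  -- change ratio that triggers re-collection (typically 0.1)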

Basic Rules

Table 1 Typical sampling methods

  • Manual sampling
    Description: After a job makes significant changes to table data, manually run the ANALYZE command.
    Feature:
      • In normal mode, statistics are stored in system catalogs and shared globally. A level-4 lock is applied, preventing concurrent operations on a table.
      • In light mode, statistics are stored in memory and shared globally. A level-1 lock is applied, allowing concurrent operations on a table.
      • In force mode, forcible sampling is possible even when statistics are locked, in addition to the normal-mode functionality.
    Syntax: ANALYZE tablename; ANALYZE (light|force) tablename;
    Constraint: N/A

  • Polling sampling
    Description: A background thread samples according to a threshold (asynchronous polling triggering), maintaining statistics by polling.
    Feature: Only the normal mode is supported. Statistics are stored in system catalogs and shared. A level-4 lock is applied, preventing concurrent operations on a table.
    Related GUC parameters: autovacuum, autovacuum_mode, autovacuum_analyze_threshold, autovacuum_analyze_scale_factor

  • Dynamic sampling
    Description: Triggered according to a threshold during query parsing (real-time triggering upon query), maintaining statistics in real time. When sampling is triggered, query parsing can take several dozen seconds.
    Feature:
      • In normal mode, statistics are stored in system catalogs and shared globally. A level-4 lock is applied, preventing concurrent operations on a table.
      • In light mode, statistics are stored in memory and shared globally. A level-1 lock is applied, allowing concurrent operations on a table.
    Related GUC parameters: autoanalyze, autoanalyze_mode
    Constraint: In light mode, persistence relies on polling sampling.

  • Forcible sampling
    Description: Uses SQL hints to forcibly gather statistics for each query.
    Feature: Used in data-feature-sensitive scenarios to keep query statistics real-time and up to date. Usage: select /*+ lightanalyze(t1 1) */ * from t1; (1: forcible sampling; 0: sampling disabled)
    Constraint: The SQL statement needs to be modified.

  • Collecting partition statistics
    Description: Collects statistics incrementally by partition and merges them globally.
    Feature: Used for ultra-large partitioned tables to ensure accurate query cost estimation after partition pruning.
    Constraint: Takes up more storage space but provides greater accuracy.

  • Collecting statistics on multiple columns
    Description: Gathers statistics on several columns together.
    Feature: Used when queries filter on multiple columns simultaneously, to ensure accurate query cost estimation.
    Constraint: You need to select target columns manually, and temporary tables are used.

  • Collecting expression statistics
    Description: Collects statistics on a column transformed by an expression or function.
    Feature: Used in batch expression-filtering scenarios to ensure accurate query cost estimation.
    Constraint: Manual identification is required.

  • Collecting expression index statistics
    Description: Automatically collects statistics for created expression indexes.
    Feature: Used in point-query expression-filtering scenarios to ensure accurate query cost estimation.
    Constraint: Manual identification is required.

  • Freezing statistics
    Description: Freezes table-level statistics to prevent changes.
    Feature:
      • Used where data features are extremely stable, to prevent both sampling and query plan changes.
      • Used where data features are highly variable, to force sampling for each query.
    Parameter: table-level attribute analyze_mode
    Constraint: N/A

  • Modifying statistics
    Description: Directly modifies statistics after manual calculation.
    Feature: Used to keep the sampling ratio low with manual calibration (see the sketch after this table). Usage:
      select approx_count_distinct(col_name) from table_name;
      alter table table_name alter column col_name set (n_distinct = xxx);
    Constraint: N/A

  • Copying partition information
    Description: Copies statistics from old partitions to new ones.
    Feature: Used for partitioned tables whose data features change little, to reduce statistics collection overhead.
    Constraint: N/A

  • Statistical information inference
    Description: Automatically calculates more accurate statistics based on existing data.
    Feature: Controlled by the GUC parameter enable_extrapolation_stats.
    Constraint: N/A

  • Backing up and restoring statistics
    Description: Backs up statistics as SQL statements using the EXPLAIN (STAT ON) command.
    Feature: Used for scenario reproduction or statistics restoration.
    Constraint: Statistics are exported as SQL statements.
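As referenced in the Modifying statistics row above, a sketch of manual calibration (table and column names are illustrative; the ALTER COLUMN form assumes the standard column-level attribute syntax):

SELECT approx_count_distinct(col_name) FROM table_name;  -- cheaply estimate the real distinct count
ALTER TABLE table_name ALTER COLUMN col_name SET (n_distinct = 100000);  -- pin the estimate (assumed syntax)
ANALYZE table_name;  -- re-collect so the fixed n_distinct takes effect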

Scenarios and Strategies

The table below outlines typical data processing scenarios and the corresponding strategies for collecting statistics.

Table 2 Statistics collection strategies

  • Incremental stream processing
    Description: Data changes arrive as a continuous incremental stream, leaving no suitable window for ANALYZE.
    Strategy: Enable dynamic sampling to collect statistics automatically and share them globally.

  • Online batch processing (data lake)
    Description: Data processing and querying occur concurrently, requiring stable queries.
    Strategy: Enable dynamic sampling, or complete data processing and ANALYZE within one transaction (see the sketch after this table):
      begin;
      truncate table or partition;
      copy/merge/insert overwrite
      ANALYZE (light) tablename;
      end;

  • Partition parallel processing
    Description: Data is processed concurrently in different partitions of the same table.
    Strategy: Enable dynamic sampling or manual light sampling, and collect statistics on the same table concurrently.

  • Flat-wide table scenario
    Description: Wide tables with over 100 columns.
    Strategy:
      1. Enable automatic predicate management for dynamic sampling.
      2. Collect statistics only on the first N columns.
      3. Set column-level participation in sampling based on common query predicates.

  • Large table scenario
    Description: Large data volume whose changes do not reach the trigger threshold, yet statistics vary.
    Strategy: Lower the threshold for triggering dynamic sampling.

  • Feature-sensitive scenario
    Description: Changeable data features cause unstable query plans, requiring forcible collection.
    Strategy:
      1. Lower the threshold for triggering dynamic sampling.
      2. Use the HINT mode in SQL statements for light dynamic sampling.
      3. Clear and freeze statistics, re-collecting them for each query without sharing.

  • High-concurrency scenario
    Description: Concurrent queries (over 10) on the same table trigger dynamic sampling and consume resources.
    Strategy:
      1. Disable concurrent sampling; other queries use the existing, possibly outdated statistics.
      2. Generate the latest statistics before querying (under development).

  • Streaming performance sensitivity
    Description: Stream processing that requires second-level query response or has high resource usage.
    Strategy: Disable dynamic sampling at the table or SQL level and rely on background polling sampling.

  • Batch performance sensitivity
    Description: Batch processing that requires second-level query response or has high resource usage.
    Strategy: Manually collect statistics during processing.
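As referenced in the Online batch processing row above, a concrete version of the transactional pattern (table names are illustrative):

BEGIN;
TRUNCATE stage_orders;                              -- rewrite the data set
INSERT INTO stage_orders SELECT * FROM src_orders;  -- reload it
ANALYZE (light) stage_orders;                       -- light mode holds only a level-1 lock
END;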

Resource Consumption

Table 3 Resource consumption

  • CPU
    • Predicate column management: Automatically manage predicate columns and collect statistics only on queried columns; non-predicate columns can be masked manually.
    • Ultra-long column statistics: For data types that can be truncated, only the first 1,024 characters are counted.
  • I/O
    • Sampling I/O: 30,000 samples are collected by default. I/O depends on the number of columns, partitions, and small CUs, not on table size.
  • Memory
    • Buffer usage: At most one slot in the cstore buffer is occupied.
    • Memory zero copy: Statistics are calculated directly from buffer samples without organizing them into tuples.
    • Memory adaptation: Configure the system to use temporary tables for sampling when memory is insufficient; the analyze_stats_mode parameter prevents queries from triggering temporary table creation.
    • Memory size: The maintenance_work_mem parameter caps memory usage during ANALYZE; exceeding the limit causes data to be written to disk or the sample to shrink (see the sketch after this table).
  • Lock
    • Level-4 lock (normal mode): Applied in distributed mode; conflicts with DDL, VACUUM, ANALYZE, and REINDEX, but not with INSERT, DELETE, or UPDATE.
    • Level-1 lock (light mode): Only a local level-1 lock is applied; it conflicts only with DDL statements.
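As referenced in the Memory size row above, a sketch of the memory-related knobs (the analyze_stats_mode value is an assumption based on the StatsMode values listed under O&M Monitoring):

SET maintenance_work_mem = '1GB';   -- cap ANALYZE memory; beyond this, samples spill to disk or shrink
SET analyze_stats_mode = 'memory';  -- assumption: keep samples in memory only
ANALYZE tablename;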

Accuracy and Reliability

Table 4 Accuracy and reliability

  • Accuracy
    • Sampling size: Configurable via the default_statistics_target parameter to adapt to table size.
    • Sampling randomness:
      • Optimize reservoir and range sampling with the analyze_sample_mode parameter.
      • Enhance the randomness of random number generation with the random_function_version parameter.
    • Global sharing: Statistics can be shared across sessions and nodes.
    • Modification count broadcast: A background thread checks and broadcasts the global modification count in polling mode. The job thread can also broadcast the modification count directly when the tuple_change_sync_threshold parameter is set. Cross-CN modification and query have minimal impact because the modification count is broadcast and synchronized asynchronously.
    • Adjusting the CU sampling ratio: Increase the CU sampling ratio with the cstore_cu_sample_ratio parameter if the CU filling rate is low.
    • Stabilizing distinct values: Use the n_distinct attribute to stabilize distinct-value estimates after random sampling without increasing the sampling ratio.
    • Statistical information calculation: Use the enable_extrapolation_stats parameter to derive more accurate statistics from old statistics when estimates are distorted.
  • Reliability
    • CN fault: Dynamic sampling is unaffected by faults on other CNs; statistics are simply not synchronized to them, and query quality on the current CN remains unaffected.
    • CN restoration: After a CN recovers, dynamic sampling and global synchronization are performed forcibly during queries.
    • DN fault: Dynamic sampling within a logical cluster is unaffected by faults in other logical clusters.

O&M Monitoring

GaussDB(DWS) offers a comprehensive view of the ANALYZE running mode and different execution stages by adding comments after the ANALYZE command. This information is primarily presented through the following views:

  • query column in the pgxc_stat_activity view
  • wait_status column in the pgxc_thread_wait_status view

The format of the comment appended to the ANALYZE command is --Action-RunMode-StatsMode-SyncMode.

  • Values and meanings of Action:
     {"begin", "finished", "lock FirstCN", "estimate rows", "statistics", "sample rows", "calc stats"};

    begin: the process starts; finished: the process ends; lock FirstCN: acquires a lock on the first CN; estimate rows: estimates the number of rows (first phase); statistics: executes ANALYZE (second phase); sample rows: collects samples (second phase); calc stats: calculates statistics (second phase).

  • Values and meanings of RunMode:
     {"manual", "backend", "normal runtime", "light runtime", "light runtime inxact", "light estimate rows", "light manual"};

    manual: the manual mode; backend: the background polling mode; normal runtime: normal dynamic sampling; light runtime: light dynamic sampling; light runtime inxact: light dynamic sampling within a transaction; light estimate rows: the light estimation function only; light manual: the manual light mode.

  • Values and meanings of StatsMode:
     {"dynamic", "memory", "smptbl"};

    dynamic: adaptively places samples in memory or in a temporary table; memory: uses only in-memory samples; smptbl: uses only temporary-table samples.

  • Values and meanings of SyncMode:
     {"sync", "nosync"};

    sync: Statistics are synchronized to all CNs; nosync: Statistics are not synchronized.

Example:

SELECT coorname,datid,datname,pid,usename,application_name,query_id,query 
FROM pgxc_stat_activity WHERE query like '%analyze%' and query not like '%application_name%';
     coorname | datid | datname | pid | usename | application_name | query_id | query
--------------+-------+----------+-----------------+-----------+------------------+-------------------+-----------------------------------------
coordinator1 | 15676 | postgres | 139919333779200 | test | gsql | 73183493944770822 | analyze t_1;
coordinator2 | 15676 | postgres | 140217336461056 | test | coordinator1 | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync
coordinator3 | 15676 | postgres | 139944245847808 | test | coordinator1 | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync
(3 rows)

Viewing Statistics

  • Check the dynamically sampled memory statistics.
    • Retrieve table-level memory statistics.
      SELECT * FROM pv_runtime_relstats; 
    • Retrieve column-level memory statistics.
      SELECT * FROM pv_runtime_attstats; 
  • Check the system catalog statistics.
    • Check the table-level system catalog statistics.
      SELECT relname, relpages, reltuples FROM pg_class;
    • Check the column-level system catalog statistics.
      SELECT * FROM pg_stats; 
  • Check the latest time when statistics are collected.

    Dynamic sampling stores statistics in memory without modifying the timestamp of the system catalog.

    SELECT * FROM pg_object;