Updated on 2024-12-18 GMT+08:00

Updating Statistics

In a database, statistics are the source data the planner uses to generate execution plans. If statistics are unavailable or outdated, the execution plan may deteriorate severely, leading to poor performance.

Scenario

The ANALYZE statement collects statistics on database table contents. These statistics will be stored in the PG_STATISTIC system catalog. Then, the query optimizer uses the statistics to work out the most efficient execution plan.

After executing batch INSERT or DELETE operations, you are advised to run the ANALYZE statement on the table or the entire database to update statistics. By default, 30,000 rows are sampled; that is, the default value of the GUC parameter default_statistics_target is 100. If the table holds more than 1,600,000 rows, you are advised to set default_statistics_target to -2, which samples 2% of the rows.
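For example, a minimal sketch of this workflow (the table name and row count are illustrative):

SET default_statistics_target = -2;   -- switch to 2% percentage sampling
ANALYZE big_sales;                    -- big_sales was just batch-loaded with several million rows
RESET default_statistics_target;      -- restore the default (100, i.e. 30,000 sampled rows)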

For intermediate tables generated during the batch execution of scripts or stored procedures, you also need to run the ANALYZE statement.
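As a sketch (table and column names are illustrative), an intermediate table built by a batch script can be analyzed before it is reused, so that later steps get realistic cost estimates:

CREATE TEMPORARY TABLE tmp_cust_totals AS
SELECT customer_id, sum(amount) AS total_amount FROM orders GROUP BY customer_id;
ANALYZE tmp_cust_totals;  -- collect statistics before subsequent joins use this table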

If a table has multiple inter-related columns and queries filter or group on these columns, collect statistics on them together so that the query optimizer can accurately estimate row counts and generate an effective execution plan.

Generating Statistics

  • Update statistics on a single table.
    ANALYZE tablename;
    
  • Update statistics on the entire database.
    ANALYZE;
    
  • Collect statistics on multiple columns.
    • Collect statistics on the column_1 and column_2 columns of the tablename table.
      ANALYZE tablename ((column_1, column_2));
    • Declare statistics on the column_1 and column_2 columns of the tablename table.
      ALTER TABLE tablename ADD STATISTICS ((column_1, column_2));
    • Collect statistics on each single column, as well as on the declared multi-column combinations.
      ANALYZE tablename;
    • Delete the statistics on column_1 and column_2 of the tablename table, together with their declaration.
      ALTER TABLE tablename DELETE STATISTICS ((column_1, column_2));
      
  • After statistics are declared for multiple columns by running ALTER TABLE tablename ADD STATISTICS, the system collects statistics on these columns the next time ANALYZE is run on the table or on the entire database.
  • Use EXPLAIN to view the execution plan of each SQL statement. If rows=10 (the default value, which usually means the table has never been analyzed) appears in the Seq Scan output for a table, run ANALYZE on that table, as in the example below.
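    For instance (the table name is illustrative and the plan output is abridged):

    EXPLAIN SELECT * FROM tablename WHERE column_1 = 1;
    -- Seq Scan on tablename  (cost=0.00..0.00 rows=10 width=8)
    -- rows=10 is the optimizer default, so the table has probably never been analyzed
    ANALYZE tablename;
    EXPLAIN SELECT * FROM tablename WHERE column_1 = 1;  -- rows now reflects the actual data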

Improving the Quality of Statistics

ANALYZE samples data from a table using a random sampling algorithm and calculates table data features from the samples. The sample size is controlled by the default_statistics_target parameter, whose value ranges from -100 to 10000 with a default of 100.

  • If the value of default_statistics_target is greater than 0, the number of samples is 300 x default_statistics_target. This means a larger value of default_statistics_target indicates a larger number of samples, larger memory space occupied by samples, and longer time required for calculating statistics.
  • If the value of default_statistics_target is smaller than 0, the number of samples is |default_statistics_target|/100 x the total number of rows in the table, so a smaller (more negative) value means a larger sample. In this case, the sampled data is written to disk instead of being held in memory, but calculation can still take a long time because the sample is large.

    When default_statistics_target is negative, the sample size is a fixed percentage of the table, so this mode is also known as percentage sampling.
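    To make the arithmetic concrete (the row count is illustrative): with the default default_statistics_target = 100, ANALYZE samples 300 x 100 = 30,000 rows; with default_statistics_target = -2 on a 5,000,000-row table, it samples 2/100 x 5,000,000 = 100,000 rows.

    SET default_statistics_target = -2;  -- percentage sampling: 2% of the table
    ANALYZE bigtable;                    -- about 100,000 of 5,000,000 rows sampled, written to disk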

Automatic Statistics Collection

When the autoanalyze parameter is enabled, the optimizer automatically collects statistics when a table has none or when its data changes exceed a threshold. This ensures that the optimizer has the information it needs to make accurate decisions.

In a cost-based optimizer (CBO) model, statistics largely determine the quality of the generated query plan, so it is crucial to keep them timely and accurate.

  • Table-level statistics are stored in the relpages and reltuples columns of pg_class.
  • Column-level statistics are stored in the PG_STATISTIC system catalog and are accessible through the pg_stats view. They cover the percentage of NULL values, the percentage of distinct values, high-frequency values (MCVs), and histograms.
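For example, both levels can be inspected directly (the table name is illustrative):

SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'tablename';
SELECT attname, null_frac, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'tablename';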

Collection condition: if the data volume changes substantially (default threshold: 10%), indicating a shift in data characteristics, the system collects statistics again.

Overall policy: the system uses dynamic sampling to collect statistics promptly and polling sampling to keep them persistent. For services that require second-level query response, manual sampling is recommended.
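A quick way to check the switches involved (verify the defaults in your environment):

SHOW autoanalyze;                      -- dynamic sampling switch
SHOW autovacuum;                       -- background polling switch
SHOW autovacuum_analyze_scale_factor;  -- change ratio that triggers re-collection (typically 0.1)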

Basic Rules

Table 1 Typical sampling methods

  • Manual sampling
    Description: After a job makes significant changes to table data, manually run the ANALYZE command.
    Feature:
      • In normal mode, statistics are stored in system catalogs and shared globally. A level-4 lock is applied, preventing concurrent operations on a table.
      • In light mode, statistics are stored in memory and shared globally. A level-1 lock is applied, allowing concurrent operations on a table.
      • In force mode, forcible sampling is possible even when statistics are locked, in addition to the normal-mode functionality.
    Syntax: ANALYZE tablename; ANALYZE (light|force) tablename;
    Constraint: N/A

  • Polling sampling
    Description: A background thread samples according to a threshold (asynchronous polling triggering), maintaining statistics by polling.
    Feature: Only the normal mode is supported. Statistics are stored in system catalogs and shared. A level-4 lock is applied, preventing concurrent operations on a table.
    Related GUC parameters: autovacuum, autovacuum_mode, autovacuum_analyze_threshold, autovacuum_analyze_scale_factor

  • Dynamic sampling
    Description: Triggered according to a threshold during query parsing (real-time triggering upon query), maintaining statistics in real time. When sampling is triggered, query parsing can take several dozen seconds.
    Feature:
      • In normal mode, statistics are stored in system catalogs and shared globally. A level-4 lock is applied, preventing concurrent operations on a table.
      • In light mode, statistics are stored in memory and shared globally. A level-1 lock is applied, allowing concurrent operations on a table.
    Related GUC parameters: autoanalyze, autoanalyze_mode
    Constraint: In light mode, persistence relies on polling sampling.

  • Forcible sampling
    Description: Uses SQL hints to forcibly gather statistics for each query.
    Feature: Used in data-feature-sensitive scenarios to keep query statistics real-time and up to date. Usage: select /*+ lightanalyze(t1 1) */ * from t1; (1: forcible sampling; 0: sampling disabled)
    Constraint: The SQL statement needs to be modified.

  • Collecting partition statistics
    Description: Collects statistics incrementally by partition and merges them globally.
    Feature: Used for ultra-large partitioned tables to ensure accurate query cost estimation after partition pruning.
    Constraint: Takes up more storage space but provides greater accuracy.

  • Collecting statistics on multiple columns
    Description: Gathers statistics on several columns together.
    Feature: Used when queries filter on multiple columns simultaneously, to ensure accurate query cost estimation.
    Constraint: You need to select target columns manually, and temporary tables are used.

  • Collecting expression statistics
    Description: Collects statistics on a column transformed by an expression or function.
    Feature: Used in batch expression-filtering scenarios to ensure accurate query cost estimation.
    Constraint: Manual identification is required.

  • Collecting expression index statistics
    Description: Automatically collects statistics for created expression indexes.
    Feature: Used in point-query expression-filtering scenarios to ensure accurate query cost estimation.
    Constraint: Manual identification is required.

  • Freezing statistics
    Description: Freezes table-level statistics to prevent changes.
    Feature:
      • Used where data features are extremely stable, to prevent both sampling and query plan changes.
      • Used where data features are highly variable, to force sampling for each query.
    Parameter: table-level attribute analyze_mode
    Constraint: N/A

  • Modifying statistics
    Description: Directly modifies statistics after manual calculation.
    Feature: Used to keep the sampling ratio low with manual calibration (see the sketch after this table). Usage:
      select approx_count_distinct(col_name) from table_name;
      alter table table_name alter column col_name set (n_distinct = xxx);
    Constraint: N/A

  • Copying partition information
    Description: Copies statistics from old partitions to new ones.
    Feature: Used for partitioned tables whose data features change little, to reduce statistics collection overhead.
    Constraint: N/A

  • Statistical information inference
    Description: Automatically calculates more accurate statistics based on existing data.
    Feature: Controlled by the GUC parameter enable_extrapolation_stats.
    Constraint: N/A

  • Backing up and restoring statistics
    Description: Backs up statistics as SQL statements using the EXPLAIN (STAT ON) command.
    Feature: Used for scenario reproduction or statistics restoration.
    Constraint: Statistics are exported as SQL statements.
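As referenced in the Modifying statistics row above, a sketch of manual calibration (table and column names are illustrative; the ALTER COLUMN form assumes the standard column-level attribute syntax):

SELECT approx_count_distinct(col_name) FROM table_name;  -- cheaply estimate the real distinct count
ALTER TABLE table_name ALTER COLUMN col_name SET (n_distinct = 100000);  -- pin the estimate (assumed syntax)
ANALYZE table_name;  -- re-collect so the fixed n_distinct takes effect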

Scenarios and Strategies

The table below outlines typical data processing scenarios and the corresponding strategies for collecting statistics.

Table 2 Statistics collection strategies

  • Incremental stream processing
    Description: Data changes arrive as a continuous incremental stream, leaving no suitable window for ANALYZE.
    Strategy: Enable dynamic sampling to collect statistics automatically and share them globally.

  • Online batch processing (data lake)
    Description: Data processing and querying occur concurrently, requiring stable queries.
    Strategy: Enable dynamic sampling, or complete data processing and ANALYZE within one transaction (see the sketch after this table):
      begin;
      truncate table or partition;
      copy/merge/insert overwrite
      ANALYZE (light) tablename;
      end;

  • Partition parallel processing
    Description: Data is processed concurrently in different partitions of the same table.
    Strategy: Enable dynamic sampling or manual light sampling, and collect statistics on the same table concurrently.

  • Flat-wide table scenario
    Description: Wide tables with over 100 columns.
    Strategy:
      1. Enable automatic predicate management for dynamic sampling.
      2. Collect statistics only on the first N columns.
      3. Set column-level participation in sampling based on common query predicates.

  • Large table scenario
    Description: Large data volume whose changes do not reach the trigger threshold, yet statistics vary.
    Strategy: Lower the threshold for triggering dynamic sampling.

  • Feature-sensitive scenario
    Description: Changeable data features cause unstable query plans, requiring forcible collection.
    Strategy:
      1. Lower the threshold for triggering dynamic sampling.
      2. Use the HINT mode in SQL statements for light dynamic sampling.
      3. Clear and freeze statistics, re-collecting them for each query without sharing.

  • High-concurrency scenario
    Description: Concurrent queries (over 10) on the same table trigger dynamic sampling and consume resources.
    Strategy:
      1. Disable concurrent sampling; other queries use the existing, possibly outdated statistics.
      2. Generate the latest statistics before querying (under development).

  • Streaming performance sensitivity
    Description: Stream processing that requires second-level query response or has high resource usage.
    Strategy: Disable dynamic sampling at the table or SQL level and rely on background polling sampling.

  • Batch performance sensitivity
    Description: Batch processing that requires second-level query response or has high resource usage.
    Strategy: Manually collect statistics during processing.
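As referenced in the Online batch processing row above, a concrete version of the transactional pattern (table names are illustrative):

BEGIN;
TRUNCATE stage_orders;                              -- rewrite the data set
INSERT INTO stage_orders SELECT * FROM src_orders;  -- reload it
ANALYZE (light) stage_orders;                       -- light mode holds only a level-1 lock
END;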

Resource Consumption

Table 3 Resource consumption

  • CPU
    • Predicate column management: Automatically manage predicate columns and collect statistics only on queried columns; non-predicate columns can be masked manually.
    • Ultra-long column statistics: For data types that can be truncated, only the first 1,024 characters are counted.
  • I/O
    • Sampling I/O: 30,000 samples are collected by default. I/O depends on the number of columns, partitions, and small CUs, not on table size.
  • Memory
    • Buffer usage: At most one slot in the cstore buffer is occupied.
    • Memory zero copy: Statistics are calculated directly from buffer samples without organizing them into tuples.
    • Memory adaptation: Configure the system to use temporary tables for sampling when memory is insufficient; the analyze_stats_mode parameter prevents queries from triggering temporary table creation.
    • Memory size: The maintenance_work_mem parameter caps memory usage during ANALYZE; exceeding the limit causes data to be written to disk or the sample to shrink (see the sketch after this table).
  • Lock
    • Level-4 lock (normal mode): Applied in distributed mode; conflicts with DDL, VACUUM, ANALYZE, and REINDEX, but not with INSERT, DELETE, or UPDATE.
    • Level-1 lock (light mode): Only a local level-1 lock is applied; it conflicts only with DDL statements.
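As referenced in the Memory size row above, a sketch of the memory-related knobs (the analyze_stats_mode value is an assumption based on the StatsMode values listed under O&M Monitoring):

SET maintenance_work_mem = '1GB';   -- cap ANALYZE memory; beyond this, samples spill to disk or shrink
SET analyze_stats_mode = 'memory';  -- assumption: keep samples in memory only
ANALYZE tablename;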

Accuracy and Reliability

Table 4 Accuracy and reliability

  • Accuracy
    • Sampling size: Configurable via the default_statistics_target parameter to adapt to table size.
    • Sampling randomness:
      • Optimize reservoir and range sampling with the analyze_sample_mode parameter.
      • Enhance the randomness of random number generation with the random_function_version parameter.
    • Global sharing: Statistics can be shared across sessions and nodes.
    • Modification count broadcast: A background thread checks and broadcasts the global modification count in polling mode. The job thread can also broadcast the modification count directly when the tuple_change_sync_threshold parameter is set. Cross-CN modification and query have minimal impact because the modification count is broadcast and synchronized asynchronously.
    • Adjusting the CU sampling ratio: Increase the CU sampling ratio with the cstore_cu_sample_ratio parameter if the CU filling rate is low.
    • Stabilizing distinct values: Use the n_distinct attribute to stabilize distinct-value estimates after random sampling without increasing the sampling ratio.
    • Statistical information calculation: Use the enable_extrapolation_stats parameter to derive more accurate statistics from old statistics when estimates are distorted.
  • Reliability
    • CN fault: Dynamic sampling is unaffected by faults on other CNs; statistics are simply not synchronized to them, and query quality on the current CN remains unaffected.
    • CN restoration: After a CN recovers, dynamic sampling and global synchronization are performed forcibly during queries.
    • DN fault: Dynamic sampling within a logical cluster is unaffected by faults in other logical clusters.

O&M Monitoring

GaussDB(DWS) offers a comprehensive view of the ANALYZE running mode and different execution stages by adding comments after the ANALYZE command. This information is primarily presented through the following views:

  • query column in the pgxc_stat_activity view
  • wait_status column in the pgxc_thread_wait_status view

The format of the comment appended to the ANALYZE command is --Action-RunMode-StatsMode-SyncMode.

  • Values and meanings of Action:
     {"begin", "finished", "lock FirstCN", "estimate rows", "statistics", "sample rows", "calc stats"};

    begin: the process starts; finished: the process ends; lock FirstCN: acquires a lock on the first CN; estimate rows: estimates the number of rows (first phase); statistics: executes ANALYZE (second phase); sample rows: collects samples (second phase); calc stats: calculates statistics (second phase).

  • Values and meanings of RunMode:
     {"manual", "backend", "normal runtime", "light runtime", "light runtime inxact", "light estimate rows", "light manual"};

    manual: the manual mode; backend: the background polling mode; normal runtime: normal dynamic sampling; light runtime: light dynamic sampling; light runtime inxact: light dynamic sampling within a transaction; light estimate rows: the light estimation function only; light manual: the manual light mode.

  • Values and meanings of StatsMode:
     {"dynamic", "memory", "smptbl"};

    dynamic: adaptively places samples in memory or in a temporary table; memory: uses only in-memory samples; smptbl: uses only temporary-table samples.

  • Values and meanings of SyncMode:
     {"sync", "nosync"};

    sync: Statistics are synchronized to all CNs; nosync: Statistics are not synchronized.

Example:

SELECT coorname,datid,datname,pid,usename,application_name,query_id,query 
FROM pgxc_stat_activity WHERE query like '%analyze%' and query not like '%application_name%';
     coorname | datid | datname | pid | usename | application_name | query_id | query
--------------+-------+----------+-----------------+-----------+------------------+-------------------+-----------------------------------------
coordinator1 | 15676 | postgres | 139919333779200 | test | gsql | 73183493944770822 | analyze t_1;
coordinator2 | 15676 | postgres | 140217336461056 | test | coordinator1 | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync
coordinator3 | 15676 | postgres | 139944245847808 | test | coordinator1 | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync
(3 rows)

Viewing Statistics

  • Check the dynamically sampled memory statistics.
    • Retrieve table-level memory statistics.
      SELECT * FROM pv_runtime_relstats; 
    • Retrieve column-level memory statistics.
      SELECT * FROM pv_runtime_attstats; 
  • Check the system catalog statistics.
    • Check the table-level system catalog statistics.
      SELECT relname, relpages, reltuples FROM pg_class;
    • Check the column-level system catalog statistics.
      SELECT * FROM pg_stats; 
  • Check the latest time when statistics are collected.

    Dynamic sampling stores statistics in memory without modifying the timestamp of the system catalog.

    SELECT * FROM pg_object;