Updating Statistics
In a database, statistics indicate the source data of a plan generated by a planner. If statistics are unavailable or out of date, the execution plan may seriously deteriorate, leading to low performance.
Scenario
The ANALYZE statement collects statistics on database table contents. These statistics will be stored in the PG_STATISTIC system catalog. Then, the query optimizer uses the statistics to work out the most efficient execution plan.
After executing batch INSERT and DELETE operations, you are advised to run the ANALYZE statement on the table or the entire database to update statistics. By default, 30,000 rows of statistics are sampled. That is, the default value of the GUC parameter default_statistics_target is 100. If the total number of rows in the table exceeds 1,600,000, you are advised to set default_statistics_target to -2, indicating that 2% of the statistics are collected.
For an intermediate table generated during the execution of scripts or stored procedures in batch, you also need to run the ANALYZE statement.
If there are multiple inter-related columns in a table and the conditions or grouping operations based on these columns are involved in the query, collect statistics about these columns so that the query optimizer can accurately estimate the number of rows and generate an effective execution plan.
Generating Statistics
- Update statistics on a single table.
1
ANALYZE tablename;
- Update the statistics of the entire database.
1
ANALYZE;
- Collect statistics from multiple columns.
- Collect statistics on the column_1 and column_2 columns of the tablename table.
1
ANALYZE tablename ((column_1, column_2));
- --Add declarations for the column_1 and column_2 columns of the tablename table.
1
ALTER TABLE tablename ADD STATISTICS ((column_1, column_2));
- Collect the statistics of a single column and statistics of multiple declared columns.
1
ANALYZE tablename;
- Delete the statistics of column_1 and column_2 in the tablename table or their declarations.
1
ALTER TABLE tablename DELETE STATISTICS ((column_1, column_2));
- Collect statistics on the column_1 and column_2 columns of the tablename table.
- After the statistics are declared for multiple columns by running the ALTER TABLE Tablename ADD STATISTICS statement, the system collects the statistics about these columns next time ANALYZE is performed on the table or the entire database. To collect the statistics, run the ANALYZE statement.
- Use EXPLAIN to show the execution plan of each SQL statement. If rows=10 (the default value, probably indicating the table has not been analyzed) is displayed in the SEQ SCAN output of a table, run the ANALYZE statement for this table.
Improving the Quality of Statistics
ANALYZE samples data from a table based on the random sampling algorithm and calculates table data features based on the samples. The number of samples can be specified by the default_statistics_target parameter. The value of default_statistics_target ranges from -100 to 10000 and the default value is 100.
- If the value of default_statistics_target is greater than 0, the number of samples is 300 x default_statistics_target. This means a larger value of default_statistics_target indicates a larger number of samples, larger memory space occupied by samples, and longer time required for calculating statistics.
- If the value of default_statistics_target is smaller than 0, the number of samples is default_statistics_target/100 x Total number of rows in the table. A smaller value of default_statistics_target indicates a larger number of samples. If the value of default_statistics_target is smaller than 0, the sampled data is written to the disk. In this case, the samples do not occupy memory. However, the calculation still takes a long time because the sample size is too large.
When default_statistics_target is negative, the number of samples is calculated as default_statistics_target divided by 100, multiplied by the total number of rows in the table. This sampling mode is also known as percentage sampling.
Automatic Statistics Collection
When the autoanalyze parameter is turned on, the optimizer will automatically collect statistics if it finds that there are no statistics in the table or if the data changes exceed a certain threshold. This ensures that the optimizer has the information it needs to make precise decisions.
In a cost-based optimizer (CBO) model, statistics play a crucial role in determining whether a query plan is generated. Therefore, it is crucial to have timely and effective statistics.
- Table-level statistics are stored in relpages and reltuples of pg_class.
- Column-level statistics, stored in pg_statistics and accessible through the pg_statistics view, provide information on the percentage of NULL values, percentage of distinct values, high-frequency MCV values, and histograms.
Collection condition: If there is a substantial change in data volume (default threshold is 10%), indicating a shift in data characteristics, the system will initiate the collection of statistics again.
Overall policy: The system enables dynamic sampling to collect statistics promptly and polling sampling to ensure persistent statistics. To ensure fast query performance with response times in seconds, it is recommended to use manual sampling.
Basic Rules
Function |
Description |
Feature |
Constraint |
---|---|---|---|
Auto sampling |
After making significant changes to the data in a job, you need to manually run the ANALYZE command. |
Syntax: ANALYZE tablename; ANALYZE (light|force) tablename; |
N/A |
Polling sampling |
Background thread operates according to a threshold. Polling maintenance statistics |
Only the normal mode is supported. Statistics are stored in system catalogs and shared. A level-4 lock is applied, preventing concurrent operations on a table. Related GUC parameters:
|
Asynchronous polling triggering |
Dynamic sampling |
Depending on the threshold, the query parsing process can take several dozen seconds. Real-time maintenance statistics |
Related GUC parameters:
|
Real-time triggering upon query In lightweight scenarios, persistence relies on polling sampling. |
Forcible sampling |
Uses SQL hints to forcefully gather statistics for each query. |
Used in data feature-sensitive scenarios to ensure real-time and up-to-date query statistics. Usage: select /*+ lightanalyze (t1 1) */ from t1; (1: forcible sampling; 0: sampling disabled) |
The SQL statement needs to be modified. |
Collecting partition statistics |
Collects incremental information by partition and combines it globally. |
Used in ultra-large partitioned tables to ensure accurate query cost estimation after partition pruning. |
This method takes up more storage space but provides greater accuracy. |
Collecting statistics from multiple columns |
Gather statistics from multiple columns. |
Used to filter multiple columns simultaneously to ensure accurate query cost estimation. |
You need to select target columns manually and use temporary tables. |
Collecting expression statistics |
Collects statistics on a column based on expression functions. |
Used in batch expression filtering scenarios to ensure accurate query cost estimation. |
Manual identification is required. |
Collecting expression index statistics |
Automatically collects statistics for created expression indexes. |
Used in the point query expression filtering scenario to ensure accurate query cost estimation. |
Manual identification is required. |
Freezing statistics |
Freezes table-level statistics to prevent changes. |
Used in scenarios where data features are extremely stable to prevent sampling and query plan changes. Used in scenarios where data features are highly variable to ensure sampling for each query. Parameter: table-level attribute analyze_mode |
N/A |
Modifying statistics |
Directly modifies statistics after manual calculation. |
Used to maintain a low sampling ratio with manual calibration. Usage: select approx_count_distinct(col_name) from table_name; alter table set (n_distinct=xxx) |
N/A |
Copying partition information |
Copies statistics from old partitions to new ones. |
Used for partitioned tables with minimal data feature changes to reduce statistics collection overhead. |
N/A |
Statistical information inference |
Automatically calculates more accurate statistics based on existing data. |
Controlled by the GUC parameter enable_extrapolation_stats. |
N/A |
Backing up and restoring statistics |
Backs up statistics to an SQL statement using the EXPLAIN (STAT ON) command. |
Used for scenario reproduction or statistics restoration. |
Statistics are exported as SQL statements. |
Scenarios and Strategies
The table below outlines typical data processing scenarios and the corresponding strategies for collecting statistics.
Scenario |
Description |
Strategy |
---|---|---|
Incremental stream processing |
Incremental data flow changes with no reasonable time for ANALYZE. |
Enable dynamic sampling to automatically collect and share statistics globally. |
Online batch processing (Data lake) |
Data processing and querying occur concurrently, requiring stable queries. |
Enable dynamic sampling or complete data processing and ANALYZE within a transaction. begin; truncate table or partition; copy/merge/insert overwrite ANALYZE (light) tablename; end; |
Partition parallel processing |
Concurrent data processing in different partitions |
Enable dynamic or manual light sampling and collect statistics concurrently for the same table. |
Flat-wide table scenario |
Wide table with over 100 columns |
1. Enable automatic predicate management for dynamic sampling. 2. Collect statistics only on the first N columns. 3. Set column-level participation in sampling based on common query predicates. |
Large table scenario |
Large data volume with changes not reaching the threshold Variable statistics |
Lower the threshold for triggering dynamic sampling. |
Feature-sensitive scenario |
Changeable data features causing unstable query plans, requiring forcible collection. |
1. Lower the threshold for triggering dynamic sampling. 2. Use the HINT mode in SQL statements for light dynamic sampling. 3. Clear and freeze statistics, re-collecting them for each query without sharing. |
High-concurrency scenario |
Concurrent queries (over 10) are performed on the same table, triggering dynamic sampling and resource usage. |
1. Disable concurrency, and other queries use outdated statistics. 2. Generate the latest statistics before querying (under development). |
Streaming performance sensitivity |
Stream processing with queries responded in seconds or high resource usage |
Disable dynamic sampling at the table or SQL level and use background polling sampling. |
Batch performance sensitivity |
Batch processing with queries responded in seconds or high resource usage |
Manually collect statistics during processing. |
Resource Consumption
Category |
Sub-Category |
Description |
---|---|---|
CPU |
Predicate column management |
Automatically manage predicates and collect statistics only on queried columns. Manually mask non-predicate columns. |
Ultra-long column statistics |
Data type that can be truncated, counting only the first 1,024 characters. |
|
I/O |
30,000 samples are collected by default. |
Related to the number of columns, partitions, and small CUs, not table size. |
Memory |
Buffer usage |
At most one slot in the cstore buffer is occupied. |
Memory zero copy |
Directly calculate statistics from buffer samples without organizing into tuples. |
|
Memory adaptation |
Configure the system to use temporary tables for sampling when memory is insufficient. Prevent temporary table creation triggered by queries using the analyze_stats_mode parameter. |
|
Memory size |
Control maximum memory usage during ANALYZE with the maintenance_work_mem parameter. Exceeding memory limits results in data being written to disks or reduced samples. |
|
Lock |
Level-4 lock |
(Normal mode) Applied in distributed mode, conflicting with DDL, VACUUM, ANALYZE, and REINDEX but not with addition, deletion, or modification. |
Level-1 lock |
(Light mode) Only local level-1 lock is supported, conflicting only with DDL statements. |
Accuracy and Reliability
Accuracy/Reliability |
Item |
Description |
---|---|---|
Accuracy |
Sampling size |
Configurable to adapt to table size with the default_statistics_target parameter. |
Sampling randomness |
|
|
Global sharing |
Statistics can be shared across sessions and nodes. |
|
Modifying count broadcast |
Background thread checks and broadcasts the global modification count in polling mode. The job thread can also directly broadcast the modification count by specifying the tuple_change_sync_threshold parameter. Cross-CN modification and query have minimal impact. The modification count is broadcast and synchronized in asynchronous mode. |
|
Adjusting the CU sampling ratio |
Increase CU sampling ratio if the CU filling rate is low, using the cstore_cu_sample_ratio parameter. |
|
Stabilizing distinct values |
Use the n_distinct parameter to stabilize distinct values after random sampling without increasing the sampling ratio. |
|
Statistical information calculation |
Use the enable_extrapolation_stats parameter to calculate more accurate statistics based on old statistics during distortion estimation. |
|
Reliability |
CN fault |
Dynamic sampling is unaffected by other CN faults, and statistics are not synchronized. Query quality on the current CN remains unaffected. |
CN restoration |
Forcibly perform dynamic sampling and global synchronization during queries after CN recovery. |
|
DN fault |
Dynamic sampling of the logical cluster is unaffected by faults in other logical clusters. |
O&M Monitoring
GaussDB(DWS) offers a comprehensive view of the ANALYZE running mode and different execution stages by adding comments after the ANALYZE command. This information is primarily presented through the following views:
- query column in the pgxc_stat_activity view
- wait_status column in the pgxc_thread_wait_status view
The format of the ANALYZE command is --Action-RunMode-StatsMode-SyncMode.
- Values and meanings of Action:
{"begin", "finished", "lock FirstCN", "estimate rows", "statistics", "sample rows", "calc stats"};
begin: indicates the start of the process; finished: indicates the end of the process; lock FirstCN: applies a lock from the FirstCN; estimate rows: estimates the number of rows in the first phase; statistics: executes ANALYZE in the second phase; sample rows: collects samples in the second phase; calc stats: calculates statistics in the second phase.
- Values and meanings of RunMode:
{"manual", "backend", "normal runtime", "light runtime", "light runtime inxact", "light estimate rows", "light manual"};
manual: indicates the manual mode; backend: indicates the background polling mode; normal runtime: indicates the normal dynamic sampling; light runtime: indicates the light dynamic samplin; light runtime inxact: indicates the light dynamic sampling in a transaction; light estimate rows indicates the light estimation function only; light manual: indicates the manual light mode.
- Values and meanings of StatsMode:
{"dynamic", "memory", "smptbl"};
dynamic: indicates adaptive selection of memory or temporary table placement samples; memory: uses only internal storage samples; smptbl: uses only temporary table placement samples.
- Values and meanings of SyncMode:
{"sync", "nosync"};
sync: Statistics are synchronized to all CNs; nosync: Statistics are not synchronized.
Example:
SELECT coorname,datid,datname,pid,usename,application_name,query_id,query FROM pgxc_stat_activity WHERE query like '%analyze%' and query not like '%application_name%'; coorname | datid | datname | pid | usename | application_name | query_id | query --------------+-------+----------+-----------------+-----------+------------------+-------------------+----------------------------------------- coordinator1 | 15676 | postgres | 139919333779200 | test | gsql | 73183493944770822 | analyze t_1; coordinator2 | 15676 | postgres | 140217336461056 | test | coordinator1 | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync coordinator3 | 15676 | postgres | 139944245847808 | test | coordinator1 | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync (3 rows)
Viewing Statistics
- Check the dynamically sampled memory statistics.
- Retrieve table-level memory statistics.
SELECT * FROM pv_runtime_relstats;
- Retrieve column-level memory statistics.
SELECT * FROM pv_runtime_attstats;
- Retrieve table-level memory statistics.
- Check the system catalog statistics.
- Check the table-level system catalog statistics.
select relname, relpages, reltuples from pg_class;
- Check the column-level system catalog statistics.
SELECT * FROM pg_stats;
- Check the table-level system catalog statistics.
- Check the latest time when statistics are collected.
Dynamic sampling stores statistics in memory without modifying the timestamp of the system catalog.
SELECT * FROM pg_object;
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.