更新时间:2025-05-29 GMT+08:00
分享

长事务导致oldestxmin卡住

问题描述

由于Astore的VACUUM只能清理已提交且事务号小于oldestxmin的死亡元组,如果存在一个长事务迟迟未提交,导致oldestxmin无法推进,AUTOVACUUM将无法回收死亡元组,从而引起Astore空间膨胀。

问题定位

针对该问题,可以使用gs_txid_oldestxmin()视图查询oldestxmin,使用txid_current()视图查询当前最新的xid,通过两者的差值判断oldestxmin是否卡住。

select txid_current() as xid, gs_txid_oldestxmin() as oldestxmin, xid - oldestxmin as diff;

示例:

gaussdb=# select txid_current() as xid, gs_txid_oldestxmin() as oldestxmin, xid - oldestxmin as diff;
  xid   | oldestxmin | diff
--------+------------+------
 220790 |     220789 |    1
(1 row)

一般情况下,二者的差值在10000以内是正常的。如果差值过大,达到几十甚至几百万,则系统存在长事务。可以通过pg_stat_activity获取卡住的事务信息:

select 'elp:'|| now() - xact_start diff, query, datname, pid, usename, xact_start, enqueue, state, query_id, query from pg_stat_activity where query not like 'WLM%' order by diff desc;
select * from pg_running_xacts;

示例:

gaussdb=# select 'elp:'|| now() - xact_start diff, query, datname, pid, usename, xact_start, enqueue, state, query_id, query from pg_stat_activity where query not like 'WLM%' and state='active' order by diff desc;
     diff     |                                                                                                    query
                                                          | datname  |       pid       |  usename  |          xact_start           | enqueue | state  |     query_
id     |                                                                                                    query

--------------+---------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------+----------+-----------------+-----------+-------------------------------+---------+--------+-----------
-------+----------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------
 elp:00:00:00 | select 'elp:'|| now() - xact_start diff, query, datname, pid, usename, xact_start, enqueue, state, query_id, query from pg_stat_activity where que
ry not like 'WLM%' and state='active' order by diff desc; | postgres | 140123736897280 | postgres | 2025-02-09 13:50:21.628494-05 |         | active | 3096224744
018863 | select 'elp:'|| now() - xact_start diff, query, datname, pid, usename, xact_start, enqueue, state, query_id, query from pg_stat_activity where query not
like 'WLM%' and state='active' order by diff desc;
     0 |
 elp:         |
                                                          | postgres | 140124852582144 | postgres |                               |         | active |
     0 |
 elp:         |
                                                          | postgres | 140124747724544 | postgres |                               |         | active |
     0 |
 elp:         |
                                                          | postgres | 140124232349440 | postgres |                               |         | active |
     0 |
(6 rows)

gaussdb=# select * from pg_running_xacts where pid=140125682005760;
 handle |  gxid  | state |  node  | xmin | vacuum | timeline | prepare_xid |       pid       | next_xid | dbid
--------+--------+-------+--------+------+--------+----------+-------------+-----------------+----------+-------
     -1 | 220796 | 0     | sgnode |    0 | f      |       11 |           0 | 140125682005760 |        0 | 12912
(1 row)

除此之外,逻辑复制槽也可能阻塞oldestxmin的推进,可以通过以下查询查看复制槽信息:

select * from pg_get_replication_slots();

示例:

gaussdb=# select * from pg_get_replication_slots();
 slot_name | plugin | slot_type | datoid | active | xmin | catalog_xmin | restart_lsn | dummy_standby | confirmed_flush | confirmed_csn | dictionary_csn_min | slo
t_dictionary_type
-----------+--------+-----------+--------+--------+------+--------------+-------------+---------------+-----------------+---------------+--------------------+----
------------------
 dn_6003   |        | physical  |      0 | t      |      |              | 0/15E52890  | f             |                 |               |                    |
 dn_6002   |        | physical  |      0 | t      |      |              | 0/15E52890  | f             |                 |               |                    |
(2 rows)

问题规避

在获取到阻塞oldestxmin推进的活跃线程后,将对应的线程终止即可。

相关文档