
Troubleshooting High Memory Usage

Description

The memory usage of an RDS for PostgreSQL instance includes the usage of both shared memory and local memory.

  • Shared memory: It is mainly used for the data buffer and WAL buffer to improve read and write performance. It also stores global information, such as process and lock information.

    The shared_buffers parameter determines how much shared memory is requested at startup. For an RDS for PostgreSQL instance, its initial value is set to 25% of the physical memory, and a value between 25% and 40% is recommended. Increasing it beyond 40% of the physical memory brings little additional benefit, because PostgreSQL runs on top of a file system that has its own cache: the same data would be buffered twice, which wastes memory.

  • Local memory: Backend processes need local memory to temporarily store data that does not need to be shared globally. Local memory is controlled by the following parameters (an example of checking their current values follows this list):
    • temp_buffers specifies the maximum amount of memory used for temporary buffers within each database session. These session-local buffers are used only to access temporary tables.
    • work_mem specifies the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. Note that this limit applies per operation, not per SQL statement: a complex query may run several sort or hash operations concurrently, and each of them can use up to work_mem.
    • maintenance_work_mem specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM and CREATE INDEX.
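
For example, the following query shows the values currently in effect for these parameters:

  select name, setting, unit from pg_settings where name in ('shared_buffers', 'temp_buffers', 'work_mem', 'maintenance_work_mem');

On PostgreSQL 14 and later, the local memory actually allocated by the current session can also be inspected through the pg_backend_memory_contexts view, for example:

  select name, total_bytes, used_bytes from pg_backend_memory_contexts order by total_bytes desc limit 10;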

Impact

A production DB instance requires memory redundancy. In normal cases, the memory usage should stay below 70%. If the memory usage remains above this limit, you are advised to upgrade the memory specifications. High memory usage may trigger an alarm and cause the following problems:

  • Data is frequently swapped between memory and disks, which consumes a large number of CPU resources. As a result, the database performance deteriorates and data reads and writes are affected.
  • In severe cases, an out-of-memory (OOM) problem may occur. If it does, the database service process restarts, existing database connections are interrupted, and new connections cannot be established. The HA process then restarts the DB instance. During this period, the instance replays the WAL logs generated between the last checkpoint and the time of the OOM to ensure transaction consistency.

Cause Analysis

High memory usage is generally caused by an increase in connections, active connections, slow SQL queries, TPS, or persistent connections. If the memory usage increases sharply or is unexpectedly high, analyze the causes as follows:

Figure 1 Cause analysis

Troubleshooting

  • Connections/Active Connections
    • On the Cloud Eye console, check whether the memory usage, connection usage, number of database connections, and number of active connections increase or decrease simultaneously in the target period.
    • Run the following SQL statement to check the maximum number of connections allowed for the instance:
      show max_connections;
    • Run the following SQL statement to check the number of active connections to the instance:
      select count(1) from pg_stat_activity where state <> 'idle';
    • Run the following SQL statement to check the number of idle connections:
      select count(1) from pg_stat_activity where state = 'idle';
  • Slow SQL Statements
    • On the Cloud Eye console, check whether the memory usage, number of SQL statements executed for more than 1s, number of SQL statements executed for more than 3s, and number of SQL statements executed for more than 5s increase or decrease simultaneously in the target period.
    • Run the following SQL statement to view the top three slow SQL statements (for RDS for PostgreSQL 10 and later versions) and check whether the SQL statements in the returned query field use JOIN or ORDER BY clauses:

      select (extract(epoch from now() - query_start)) query_time, datname, usename, client_addr, wait_event, state, query from pg_stat_activity where state not like 'idle%' and query_start is not null and backend_type = 'client backend' and pid <> pg_backend_pid() order by 1 desc limit 3;

    • Query the pg_stat_statements view to obtain statistics and find the SQL statements that consume the most shared memory. For details, see the queries under Persistent Connections and the sketch after this list.
  • TPS

    On the Cloud Eye console, check whether the memory usage and TPS increase or decrease simultaneously in the target period.

  • Persistent Connections
    • Run the SQL statement shown below to view the three longest-lived connections (for RDS for PostgreSQL 10 and later versions). In the query output, the conn_time field indicates the connection lifetime, and the query field indicates the SQL statement executed by the process.
      select (extract(epoch from now()-backend_start)) conn_time, datname, pid, usename, client_addr, wait_event_type, wait_event, state, query from pg_stat_activity where backend_type = 'client backend' order by conn_time desc nulls last limit 3;
    • Persistent connections cache certain information, such as query results, transaction information, and lock information, in the database. If many persistent connections are maintained for a long period of time, the cached information increases accordingly, occupying more memory. To further locate the fault, query the pg_stat_statements view based on the value of the query field obtained in the last step and check how much shared memory the SQL statement has used.
      select userid::regrole, dbid, shared_blks_hit, shared_blks_dirtied from pg_stat_statements where query = 'query';
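
If the pg_stat_statements extension is enabled, the following sketch lists the three statements that have touched the most shared buffers, which is one way to find the top shared memory consumers (left() is used only to truncate the statement text for readability):

  select userid::regrole, dbid, left(query, 60) as query, shared_blks_hit + shared_blks_dirtied as shared_blks_touched from pg_stat_statements order by shared_blks_touched desc limit 3;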

Solution

  • Too Many Connections or Active Connections

    If there are too many idle connections, run the SQL statement shown below to release them, or configure a connection timeout on the client so that idle connections are closed automatically. Using a connection pool also reduces the overhead of establishing new database connections. If there are too many active connections, reduce the number of concurrent requests or upgrade the memory specifications. (A variant that releases only connections idle for longer than a threshold appears after this list.)

    select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle';
  • Too Many Slow SQL Statements

    Locate the SQL statements that consume a large amount of memory and optimize them, or upgrade the memory specifications.

  • High TPS

    Reduce the number of transactions or upgrade the memory specifications.

  • Too Many Persistent Connections/Long Connection Lifetime

    Periodically release persistent connections, because the information they cache can accumulate over time and use up memory.
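
As a variant of the statement in the first item, the following sketch releases only connections that have been idle for longer than a chosen threshold (10 minutes here is an arbitrary example):

  select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle' and state_change < now() - interval '10 minutes' and pid <> pg_backend_pid();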

FAQ

Q: Why does the memory usage increase when pg_dump is used to export data? How do I avoid this problem?

A: When pg_dump is used to export data, a process accesses all objects in the target database, such as tables and indexes, to obtain their structure. If there are too many tables or indexes, or the accessed objects are very large, large RelCache (relation cache) and CatCache (system catalog cache) structures may build up that cannot be released. As a result, the memory usage increases and an OOM problem may even occur.
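
Because these caches grow with the number of relations the dump touches, a rough way to anticipate the problem is to count the relations in the database before dumping; a very large count suggests the caches will grow accordingly. A minimal sketch:

  select relkind, count(*) from pg_class group by relkind order by 2 desc;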

Suggestions for executing a pg_dump task:

  1. Do not perform DDL operations while the dump is running.
  2. Monitor the slow SQL metric. If a lock conflict occurs, terminate the conflicting process.
  3. Execute the pg_dump task during off-peak hours.
  4. Decrease the values of shared_buffers and work_mem to between a quarter and a half of their current values, or lower. After the task is complete, restore the original values. (See the sketch after this list.)
  5. Upgrade the memory specifications.
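
For step 4, note that in PostgreSQL shared_buffers takes effect only after a restart, so it has to be changed through the instance's parameter settings, while work_mem and maintenance_work_mem can also be lowered just for the sessions that run heavy statements. A minimal sketch (the values are placeholders; follow the 1/2 to 1/4 guidance above):

  set work_mem = '4MB'; set maintenance_work_mem = '64MB';

For the pg_dump session itself, the same settings can be passed through libpq's PGOPTIONS environment variable, for example PGOPTIONS="-c work_mem=4MB".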