Updated on 2024-04-19 GMT+08:00

Troubleshooting High Storage Space Usage

Description

The storage space of a production DB instance must be redundant. If the storage space usage is too high, handle the problem in a timely manner to prevent the instance from being damaged due to full storage.

You need to pay attention to the following key metrics:

  • Storage space usage: rds039_disk_util
  • Total storage space: rds047_disk_total_size
  • Used storage space: rds048_disk_used_size
  • Transaction logs (WAL logs) usage: rds040_transaction_logs_usage
  • Oldest replication slot lag (WAL logs accumulated due to replication slot problems): rds045_oldest_replication_slot_lag

Cause Analysis

In an RDS for PostgreSQL instance, data files (such as tables and indexes), WAL logs, and temporary files may occupy the most storage space. If the storage space usage increases unexpectedly, analyze the causes as follows:

Figure 1 Cause analysis

Troubleshooting and Solution

SQL statements for querying the storage usage of databases, tables, or WAL logs occupy a large amount of disk I/O. Therefore, run such SQL statements during off-peak hours.

  • Check whether the WAL log size is normal. If no, rectify the fault.
    • Check the WAL log size.

      View the rds040_transaction_logs_usage metric or run the following SQL statement to check the WAL log size. If there are many WAL logs, perform the following steps to locate the fault.

      select round(sum(size)/1024/1024/1024,2) "GB" from pg_ls_waldir();

      The pg_ls_waldir() function is available only in RDS for PostgreSQL 12 and later versions.

      User root is required to execute the pg_ls_waldir function.

    • Check the parameter for WAL log retention.
      • For RDS for PostgreSQL 12 or earlier versions, check the value of wal_keep_segments (unit: MB). For later versions, check the value of wal_keep_size (unit: MB).
      • The value of the parameter for WAL log retention should be greater than 4 GB but less than 10% of the total storage space. Otherwise, the primary instance may clear the WAL logs required by the standby instance, causing exceptions on the standby instance.
    • Check the replication slot status and the size of logs that are not cleared.

      Replication slots block WAL reclamation. If inactive or unnecessary replication slots are found, delete them as required.

      Run the following SQL statement to query the status of a replication slot and uncleared WAL logs:

      select slot_name, active,
      pg_size_pretty(pg_wal_lsn_diff(b, a.restart_lsn)) as slot_latency
      from pg_replication_slots as a, pg_current_wal_lsn() as b;

      Run the following SQL statement to delete a slot:

      select pg_drop_replication_slot('slot_name');
    • Check how busy write services are.

      View the rds044_transaction_logs_generations metric to determine how busy write services are. This metric indicates the average size of transaction logs (WAL logs) generated per second.

      If the value of this metric is large, there are a large number of write services. In this case, the database kernel reserves more WAL logs for reclamation, and the storage usage of WAL logs increases. You are advised to scale up storage space to ensure storage redundancy.

  • Check whether the size of data files is normal. If no, rectify the fault.
    • Query the top 10 databases with the highest storage usage.
      select datname, pg_database_size(oid)/1024/1024 as dbsize_mb from pg_database order by dbsize_mb desc limit 10;
    • View the top 10 objects (tables/indexes) with the highest storage usage.

      You can use the relpages field of pg_class to estimate the size of a table or index. The SQL statement is as follows:

      select relname, relpages*8/1024 as tablesize_mb from pg_class order by tablesize_mb desc limit 10;

      To obtain the exact size of a table or index, use any of the following functions:

      Table 1 Function description

      Name

      Return Type

      Description

      pg_relation_size(relation regclass, fork text)

      bigint

      Storage space used by the specified fork ('main', 'fsm', 'vm', or 'init') of the specified table or index

      pg_relation_size(relation regclass)

      bigint

      Shorthand for pg_relation_size(..., 'main')

      pg_table_size(regclass)

      bigint

      Storage space used by the specified table, excluding indexes (but including TOAST, free space map, and visibility map)

      pg_total_relation_size(regclass)

      bigint

      Total storage space used by the specified table, including all indexes and TOAST data

    • Check whether there is any table bloat.

      Once the table that occupies a large amount of storage space is determined, you can use the pgstattuple plugin to analyze whether the table is bloated. The plugin can be installed by running the following statements:

      create control_extension('create', 'pgstattuple');
      select * from pgstattuple('table_name');

      Some kernel versions do not support the pgstattuple plugin. For details, see Supported Plugins.

      For details about how to use this plugin, see https://www.postgresql.org/docs/15/pgstattuple.html.

    • Clear table data.
      • If any table bloat is found, you can vacuum the table in the maintenance time window.

        VACUUM FULL locks the table. Ensure that no DML operation is being performed during the operation.

        vacuum full table_name;
      • If any unnecessary table or data is found, you can use the truncate table or drop table statement to delete unnecessary data.
        truncate table table_name;
      • If you cannot delete all data in a table by running DROP or TRUNCATE but want to release storage space by deleting some of table data, note that the delete operation does not release storage space (deleted data is marked as invisible) due to the Multi-Version Concurrency Control (MVCC) mechanism of RDS for PostgreSQL. In this case, use VACUUM FULL to release the storage space.
      • If only a small amount of data needs to be retained, you can create a new table to transfer the data to the table. The procedure is as follows:
        1. Store information such as the structure and indexes of the original table.
        2. Create a new table.
        3. Insert data into the new table.
        4. Check whether the data in the new table meets the expectation. If yes, go to the next step. If no, check whether the previous operations are successful.
        5. Delete the original table.
        6. Rename the new table and create indexes.

      VACUUM FULL rebuilds the table and its indexes. During this period, WAL logs are generated. Sufficient storage space needs to be reserved. (Assume that the size of the rebuilt table is 1 GB and the size of indexes is 0.5 GB. You are advised to reserve at least 2.5 GB of storage space.)

      For details about vacuum, see https://www.postgresql.org/docs/current/routine-vacuuming.html.

  • Check whether the size of temporary files is normal. If no, rectify the fault.

    If high storage usage is not caused by data files or WAL logs, temporary files may occupy a large amount of storage space. Run the following SQL statement to check the size of temporary files:

    select round(sum(size)/1024/1024/1024,2) "GB" from pg_ls_tmpdir();
    • The pg_ls_waldir() function is available only in RDS for PostgreSQL 12 and later versions.
    • User root is required to execute the pg_ls_waldir function.
    • When there are a large number of temporary files, the SQL statement execution is slow.

    Generally, temporary files are released after complex SQL statements are executed. However, if an OOM exception occurs, temporary files may fail to be released. To reduce the generated temporary files, analyze and optimize slow SQL statements. Or you can reboot the instance in the maintenance time window to delete all temporary files.