Updated on 2024-12-13 GMT+08:00

Historical Hudi Data Deletion

This topic applies only to MRS 3.3.0-LTS and later versions.

Scenario

Delete historical data from Hudi tables to free up storage space and reduce storage costs.

Running delete/drop partition

Use the delete or drop partition command to delete historical data. For details, see Hudi SQL Syntax Reference.

Advantages: The operation is simple, and both COW and MOR tables are supported.

Disadvantages: Concurrency is poor. If the delete or drop partition command runs while data is being written to the Hudi table in real time, the real-time data import job may fail.
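
As a minimal sketch of the two commands above, the statements below use Hudi's Spark SQL delete and drop partition syntax. The table name mytable, the column primaryKey, and the partition column dt are hypothetical and only illustrate the shape of the statements. Check the Hudi SQL Syntax Reference for the exact syntax supported by your cluster version.

```sql
-- Hypothetical table and columns, for illustration only.

-- Row-level delete: removes the records that match the predicate.
delete from mytable where primaryKey < 100;

-- Partition-level delete: assumes mytable is partitioned by a column named dt.
alter table mytable drop partition (dt = '2023-01-01');
```

Because both statements can conflict with an active real-time write job, it is safer to run them when no job is writing to the table.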

Running call clean_data

  • Function

    The call clean_data command deletes historical data from MOR tables.

    Advantages: The deletion can run concurrently with the data import task without affecting real-time data import.

    Disadvantages: Only MOR tables are supported. Deletion is lazy: the data is physically removed only when compaction runs.

  • Syntax

    call clean_data(table => 'table_name', sql => 'delete statement')

  • Parameter description

    Table 1 Parameter description

    Parameter           Description
    table_name          Name of the table whose data is to be deleted. The value can be in the database.tablename format.
    delete statement    SQL statement of the select type that identifies the data to be deleted.

  • Example

    Delete all data whose primaryKey is smaller than 100 from the mytable table:

    call clean_data(table => 'mytable', sql => 'select * from mytable where primaryKey < 100')

    Clear the residual files left by the clean_data command. If clean_data fails, it leaves temporary files behind. Run the following command to clear them:

    call clean_data(table => 'mytable', sql => 'delete cleanData')
  • Response

    The execution result is displayed on the client.
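
The sketch below puts the pieces above together: it passes the table in the database.tablename format and then triggers compaction, because clean_data deletion is lazy and depends on compaction. The database name mydb is hypothetical. The compaction statements use Hudi's Spark SQL compaction commands and are an assumption about how compaction is triggered in your environment. In many deployments the write job schedules or runs compaction itself, in which case the last two statements are unnecessary.

```sql
-- mydb is a hypothetical database name; mytable must be an MOR table.
-- Select the records to delete with a select-type statement.
call clean_data(table => 'mydb.mytable', sql => 'select * from mydb.mytable where primaryKey < 100');

-- Assumption: compaction is triggered manually here.
-- The deleted records are physically removed only after compaction completes.
schedule compaction on mydb.mytable;
run compaction on mydb.mytable;
```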