Help Center/ MapReduce Service/ Component Operation Guide (LTS)/ Using Hive/ Enterprise-Class Enhancements of Hive/ Detecting Statements That Overwrite a Table with Its Own Data
Updated on 2024-12-13 GMT+08:00

Detecting Statements That Overwrite a Table with Its Own Data

Scenario

You can intercept SQL statements that read the data written by themselves in Hive. You can add the rule dynamic_0004 on the SQL Inspector page to intercept SQL statements that overwrite a table, partition, or directory with data from the same table, partition, or directory.

This topic is available for MRS 3.5.0 or later versions.

The following are some typical large Hive SQL statements you need to detect:

  • Non-partitioned table

    INSERT OVERWRITE TABLE tbl_a SELECT * FROM tbl_a;

  • Static partition table

    INSERT OVERWRITE TABLE tbl_b PARTITION (ds='20240101') SELECT * FROM tbl_b WHERE ds='20240101';

  • Dynamic partitioning

    INSERT OVERWRITE TABLE tbl_c PARTITION(pday) SELECT id,name,pday FROM tbl_c WHERE id > 100;

  • Writing data to an HDFS table

    INSERT OVERWRITE DIRECTORY 'hdfs://hacluster/user/hive/warehouse/tbl_d' SELECT * FROM tbl_d;

Procedure

  1. Log in to FusionInsight Manager as a user with management right.
  2. Choose Cluster > SQL Inspector, click Add Rule, enter the password of the user, and click OK. The Add Rule page is displayed.
  3. On the displayed page, set the following parameters and click OK:

    • Name: Enter a rule name, for example, test.
    • ID: Select dynamic_0004.
    • Tenant: Click Add and select the tenant name for which the inspection rule will be used, for example, default.
    • Services and Actions: Click Add. In the Add Service and Action dialog box, set the following parameters and click OK:
      • Service: Select Hive.
      • Enable the Hint or Intercept action based on service requirements.
    Figure 1 Creating a Hive SQL inspection rule

  4. Log in to the node where the Hive client is installed and switch to the client installation directory.

    cd /opt/client

    Configure environment variables.

    source bigdata_env

    Authenticate the user (skip this step if Kerberos authentication is disabled for the cluster).

    kinit Component service user who has the Hive operation permission

  5. Log in to the Hive client.

    beeline

  6. Create a table and insert data into the table.

    drop table tbl_a;

    create table tbl_a(id int, name string);

    insert into table tbl_a values(123,'sjk'),(234,'shen'),(111,'aaa');

  7. (Optional) Enable strict interception. For details, see Setting Strict Interception for Hive Dynamic Partitioned Tables. The nonstrict mode is used by default for dynamic partitioning.

    set hive-ext.dynamic.partition.intercept.mode=strict;

  8. Check whether the static_0001 rule has been applied.

    insert overwrite table tbl_a select * from tbl_a;

    If the configured action is Hint, the system reports an alarm when detecting a SQL statement that triggers the inspection rule. The SQL statement will continue to run, and the following information will be displayed:

    WARN  : DYNAMIC_0004 Self-read and self-overwrite operations are not allowed.

    If the configured action is Intercept, the SQL statement will be intercepted when the system detects a SQL statement that triggers the inspection rule. The following information is displayed:

    Error: Error while compiling statement: FAILED: RuleException DYNAMIC_0004 Self-read and self-overwrite operations are not allowed. (state=42000,code=40000)

Setting Strict Interception for Hive Dynamic Partitioned Tables

The system cannot obtain the partition information in the compilation phase of dynamic partitioning, and therefore cannot detect SQL statements that overwrite data with the same data. You can choose a non-strict or strict interception policy when SQL statements involve dynamic partitioning.

  • Non-strict interception allows the system to check whether the queried table is the one where the submitted SQL statement writes data to. The detection is completed during SQL compilation. So, non-strict interception is fast, but it also works for quires of data in different partitions.
  • Strict interception allows the system to check whether the queried partition is the one where the submitted SQL statement writes data to. The detection is completed during the MoveTask phase of a SQL task. Strict interception is more accurate, but it works only after most logics of the SQL task are executed, causing resource waste.

Non-strict interception is used by default for dynamic partitioning. To enable strict interception, set hive-ext.lakeformation.transform.role (in Table 1) to strict. You are advised to set this parameter for a single task, for example, for a SQL statement like set hive-ext.dynamic.partition.intercept.mode=strict;.

Table 1 Parameters

Parameter

Default Value

Description

hive-ext.dynamic.partition.intercept.mode

nonstrict

Interception mode for SQL statements that involve Hive dynamic partitioning. The options are as follows:

  • nonstrict: If the queried table is the one where a SQL statement inserts data to, the SQL statement is intercepted.
  • strict: If the queried partition is the one where a SQL statement inserts data to, the SQL statement is intercepted.