Updated on 2024-12-13 GMT+08:00

Configuring Spark Dynamic Masking

  • This section applies only to MRS 3.3.1-LTS or later versions.
  • Dynamic data masking cannot be enabled for jobs submitted on the console.

Scenario

After Spark dynamic masking is enabled, data in masked columns can still be used in computations, but the values are concealed when calculation results are output. Masking policies in the cluster are dynamically transferred along lineage relationships, maximizing data utility while protecting privacy.
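
For example, once a masking policy is configured for a column, that column can still be referenced in computations, while its selected values come back concealed. A minimal sketch (names are illustrative; it assumes a table t with a masking policy already applied to column b):

  select count(b) from t;   -- the masked column can still be used in computation
  select b from t;          -- the returned values are concealed by the masking policy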

Constraints

  • Data masking is not applicable to Hudi tables.
  • Masking for non-SQL methods is not supported.
  • Masking for direct HDFS read/write operations is not supported.
  • Masking for complex data types like arrays, maps, and structs is not supported.
  • Spark jobs can be submitted only in spark-beeline (JDBC connection) mode.
  • If a masking policy transferred along lineage conflicts with an existing policy on the target table, the existing policy on the target table is overwritten and displayed as Custom: ***.
  • Currently, the int, char, varchar, date, decimal, float, bigint, timestamp, tinyint, smallint, double, string, and binary data types support data masking. After a masking policy is configured for a column of the int, date, decimal, float, bigint, timestamp, tinyint, smallint, or double type, the spark-beeline query result may differ from the masked format the policy specifies, although the original values remain concealed. To make query results match the policy expectations, use the Nullify masking policy.
  • For data types that the masking policy does not support, and for output columns involved in masking policy transfer, the Nullify policy is applied by default. (A brief Nullify illustration follows this list.)
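
As a hedged illustration of the Nullify behavior (assuming the sparktest table created later in this section, with a Nullify masking policy configured for its int column a):

  select a from sparktest;  -- each value is returned as NULL instead of the original integer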

Procedure

  1. Modify the JDBCServer instance configuration. Log in to FusionInsight Manager, choose Cluster > Services > Spark, click Configurations, click All Configurations, and choose JDBCServer(Role).
    • If you plan to use Ranger authentication, add the following custom parameters in the custom area:

      Parameter                                     Value
      spark.dynamic.masked.enabled                  true
      spark.ranger.plugin.authorization.enable      true

      Modify the following parameters:

      Parameter                                     Value
      spark.ranger.plugin.masking.enable            true
      spark.sql.authorization.enabled               true

    • If you plan to use Hive metadata authentication instead of Ranger authentication, add the following custom parameters in the custom area:

      Parameter                                     Value
      spark.ranger.plugin.use.hive.acl.enable       true
      spark.dynamic.masked.enabled                  true
      spark.ranger.plugin.authorization.enable      false

      Modify the following parameter:

      Parameter                                     Value
      spark.ranger.plugin.masking.enable            true

    1. If you plan to use Hive metadata authentication instead of Ranger authentication and Hive policy initialization is not complete in Ranger, perform the following operations in sequence:
      • Enable the Ranger authentication function of Hive and restart Hive and Spark.
      • Enable the Ranger authentication function of Spark and restart Spark.
      • Disable the Ranger authentication function of Hive and restart Hive.
      • Disable the Ranger authentication function of Spark and restart Spark.
    2. Log in to the Ranger web UI. If the Hive component exists under Hadoop SQL, the Hive policy has been initialized; otherwise, it has not.
    3. If the HetuEngine component is installed in the cluster and the masking policies of the Ranger and HetuEngine spaces need to be automatically updated when the Spark dynamic masking policy is transferred, set spark.dynamic.masked.hetu.policy.sync.update.enable to true. You also need to change the Ranger user type of the built-in user Spark2x to admin.
  2. Save the configuration and restart the Spark service.
  3. Log in to the Spark client node and run the following commands:

    cd Client installation directory

    source bigdata_env

    source Spark/component_env

    For clusters with Kerberos authentication enabled, additionally run the following command:

    kinit test (Enter the password for authentication and change the password upon your first login.)

  4. Run Spark beeline commands to submit a task and create a Spark table.

    spark-beeline

    create table sparktest(a int, b string);

    insert into sparktest values (1,"test01"), (2,"test02");
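
    Before any masking policy is configured, a query returns the original values. Sample output (the exact formatting depends on your beeline version):

    select * from sparktest;
    +----+---------+
    | a  |    b    |
    +----+---------+
    | 1  | test01  |
    | 2  | test02  |
    +----+---------+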

  5. Configure a masking policy for the sparktest table and check whether the masking takes effect. For details, see Adding a Ranger Access Permission Policy for Spark2x.

    select * from sparktest;
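
    The exact result depends on the masking policy you configure. As an example, assuming a Nullify policy on column b, the output would resemble the following:

    +----+-------+
    | a  |   b   |
    +----+-------+
    | 1  | NULL  |
    | 2  | NULL  |
    +----+-------+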

  6. Verify the transfer of the data masking policy.

    create table sparktest02 as select * from sparktest;

    select * from sparktest02;
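
    If the policy has been transferred, sparktest02 returns the same masked result as sparktest. Continuing the Nullify example above:

    +----+-------+
    | a  |   b   |
    +----+-------+
    | 1  | NULL  |
    | 2  | NULL  |
    +----+-------+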

    If the query result of sparktest02 is masked in the same way as that of sparktest, the dynamic masking configuration has taken effect. Access the Ranger masking policy management page to view the masking policy automatically generated for the sparktest02 table.