Discovering Sensitive Data

After creating a sensitive data identification rule group, you can create a sensitive data discovery task to discover sensitive data and synchronize it to Data Map.

After running a sensitive data discovery task, you must choose Sensitive Data Distribution in the left navigation pane, click the Manual Recovery tab, and ensure that the identification rule of the task is valid, so that the rule can take effect for dynamic masking tasks.

Prerequisites

Sensitive data identification rule groups have been created. For details, see Creating Identification Rule Groups.
A DWS connection, a DLI connection, and an MRS Hive connection have been created in Management Center. For details, see Creating a DataArts Studio Data Connection.
Before discovering DLI sensitive data, you must prepare a general-purpose DLI queue.
To enable automatic synchronization of identified sensitive data to the Data Map component, the sensitive data discovery task must be created, run, or scheduled by DAYU Administrator, Tenant Administrator, or data security administrator.
To enable the synchronization of sensitive data classifications to the Data Map component, ensure that the following prerequisites are met:
- You have collected the metadata of the data table in DataArts Catalog. For details, see Metadata Collection Task.
- Real-time metadata synchronization has been enabled for the data connections in Management Center. For details, see Creating a DataArts Studio Data Connection.

Constraints

Sensitive data discovery is only available for standard warehouses of GaussDB(DWS), Data Lake Insight (DLI), and MRS Hive.
Only DLI and GaussDB(DWS) sensitive data discovery tasks can discover sensitive data in data tables that match specified wildcard rules or in all data tables. Resource specifications can be configured only for DLI sensitive data discovery tasks. (If you request more resources than are available, the tasks may fail.)
Only GaussDB(DWS) sensitive data discovery tasks support resumable scans and task progress display in logs.
If a sensitive data identification rule is of the content identification type (that is, a built-in rule or a custom rule of the content identification type), a field is considered sensitive and is assigned a security level and classification only when the proportion of records in which the field matches the rule, relative to the total number of records in the data table, exceeds a specified threshold (80% by default).
During sensitive data identification, if a field matches multiple identification rules in an identification rule group, the highest security level of the identification rules is used as the security level of the field, and multiple field classifications are allowed.
After a sensitive data discovery task is executed, the security levels and classifications are generated for the discovered sensitive fields. By default, security levels of data tables are not generated. Security levels of data tables are generated only if you select Update the security level. The security level of a data table is the highest security level of sensitive fields.
Currently, sensitive data can be synchronized only to Data Map. Sensitive data cannot be synchronized to DataArts Catalog, and sensitive data security levels and classifications cannot be added or edited in DataArts Catalog.
Only the DAYU Administrator, Tenant Administrator, or data security administrator has the permission to enable automatic synchronization of sensitive data to Data Map or manually synchronize sensitive data to Data Map.
- Automatic synchronization: If Manually synchronize the recognition result is not selected during the creation of a sensitive data discovery task, sensitive data is automatically synchronized to Data Map.
- Manual synchronization: If you select Manually synchronize the recognition result when creating a sensitive data discovery task, you need to choose Sensitive Data Distribution and click the Manual Recovery tab to synchronize sensitive data to Data Map.
When creating a sensitive data discovery task as a common user other than the DAYU Administrator, Tenant Administrator, or data security administrator, you must select Manually synchronize the recognition result so that the task can be successfully created. In addition, if you run or schedule a task for which Manually synchronize the recognition result is not selected as a common user, the task cannot be executed.
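
The threshold, security level, and classification constraints above can be illustrated with a minimal Python sketch. It is not the service's implementation: the RuleMatch type, the function names, and the convention that a larger number means a more sensitive level are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleMatch:
    """Result of one identification rule on one field (hypothetical type)."""
    rule_name: str
    classification: str
    security_level: int   # assumption: larger number = more sensitive
    matched_records: int  # records in which the field matched the rule


def classify_field(matches, total_records, threshold=0.8):
    """Return (security_level, classifications) or None if not sensitive.

    A rule counts only when its match proportion exceeds the threshold
    (80% by default). If several rules count, the highest security level
    is used and all classifications are kept.
    """
    if total_records <= 0:
        return None
    hits = [m for m in matches
            if m.matched_records / total_records > threshold]
    if not hits:
        return None
    level = max(m.security_level for m in hits)
    classifications = sorted({m.classification for m in hits})
    return level, classifications


def table_security_level(field_levels):
    """Table level = highest level among its sensitive fields, if any."""
    levels = [lvl for lvl in field_levels if lvl is not None]
    return max(levels) if levels else None


if __name__ == "__main__":
    matches = [
        RuleMatch("phone", "Contact", 2, 900),
        RuleMatch("id_card", "Identity", 3, 850),
        RuleMatch("email", "Contact", 1, 100),
    ]
    print(classify_field(matches, total_records=1000))
    print(table_security_level([3, None, 1]))
```

In this sketch the email rule is ignored because only 10% of the records match it, while the two remaining rules both contribute a classification and the higher of their levels becomes the field level.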

Creating a Sensitive Data Discovery Task

On the DataArts Studio console, locate a workspace and click DataArts Security.
Choose Sensitive Data Discovery from the left navigation bar.

Figure 1 Sensitive Data Discovery page

Click Create. In the Create Sensitive Task slide-out panel, set parameters based on Table 1.

Figure 2 Setting parameters for the sensitive data discovery task

The following table lists the parameters for a sensitive data discovery task.

**Table 1** Parameters
Parameter	Description
Basic Settings
*Task	Name of the task. To facilitate task management, you are advised to include the data table to be identified and the rule group to be used in the task name.
Task Description	A description of the task to be created.
*Data Source	Select a created data source from the drop-down list.
*Data Connection	Select a data connection from the drop-down list. If no data connection is available, create one by referring to Creating a DataArts Studio Data Connection.
*Database	Databases to be scanned. Click Configure following the Database box to select databases. Click Clear to delete the selected databases.
*Data Table	For DLI and GaussDB(DWS) sensitive data discovery tasks, you need to select one of the following table selection modes: Manual: Select the tables in which you want to discover sensitive data. You can perform fuzzy matching in the search box in the table filtering window. If you want to select all tables, you need to select them page by page. This mode is recommended if you want to discover sensitive data in a small number of tables. Wildcard: Enter matching rules to match target tables based on wildcards. You can enter a maximum of 100 matching rules for a task and separate them by line breaks. Each line is regarded as a rule. A rule can contain only letters, digits, underscores (_), and wildcards (*). For example, the test_* rule matches tables whose names start with test_. You can also check whether the matching rules meet expectations in the test window. This mode is recommended if there are a large number of rules and tables. All: You do not need to enter rules or filter tables. All tables will be scanned. Select this mode if you want to scan all tables in the selected databases. For MRS Hive sensitive data discovery tasks, only the Manual mode is available. You can perform fuzzy matching in the search box in the table filtering window. If you want to select all tables, you need to select them page by page.
Sampling	This parameter is available when Data Source is DWS. The maximum value allowed is 10,000.
*Computing Queue	This parameter is mandatory if Data Source is set to DLI. Select a general-purpose DLI queue for executing DLI jobs.
Rule Settings
*Recognize Rule Group	Select a rule group from the drop-down list. If no rule groups are created, create one by referring to Creating Identification Rule Groups. When you select a group, details about the identification rules in the group are displayed. You can configure thresholds for preset rules and custom rules that contain content matching. When the proportion of the number of records that match the identification rule of a field to the total number of records in the data table exceeds the threshold (80% by default), the field is considered sensitive. If different rule groups contain the same rule, the threshold for the rule must be the same.
Manually synchronize the recognition result	Only the DAYU Administrator, Tenant Administrator, or data security administrator has the permission to enable automatic synchronization of sensitive data to Data Map or manually synchronize sensitive data to Data Map. Automatic synchronization: If Manually synchronize the recognition result is not selected during the creation of a sensitive data discovery task, sensitive data is automatically synchronized to Data Map. Manual synchronization: If you select Manually synchronize the recognition result when creating a sensitive data discovery task, you need to choose Sensitive Data Distribution and click the Manual Recovery tab to synchronize sensitive data to Data Map. When creating a sensitive data discovery task as a common user other than the DAYU Administrator, Tenant Administrator, or data security administrator, you must select Manually synchronize the recognition result so that the task can be successfully created. In addition, if you run or schedule a task for which Manually synchronize the recognition result is not selected as a common user, the task cannot be executed.
Schedule Properties
Once	The sensitive data discovery task runs only once.
On Schedule	The sensitive data discovery task runs based on the configured scheduling period. Date: Period during which the task takes effect. Cycle: Frequency at which the task is executed. The options are: minutes: Select the scheduling start time and end time, and set the interval in minutes. hours: Select the scheduling start time and end time, and set the interval in hours. Day: Set the time at which the task is scheduled every day. Week: Select a day in a week and set the specific time to start scheduling. Month: Select a day in a month and set the specific time to start scheduling. For example, you can set Cycle to Week, Time to 15:52, and Time Range to Tuesday. In this case, the task is executed at 15:52 every Tuesday within the configured date range. Start now: If you select this option, the task is scheduled immediately.
Configure Resources
Specifications	If DLI Spark resources are sufficient, you can configure Spark task resources to accelerate the execution of the sensitive data discovery task. The system provides three resource flavors. The default flavor is A. You can choose a flavor that meets your requirements. NOTE: If you request more resources than are available, the task may fail. A (8 vCPUs, 32 GB memory; executor memory: 4 GB; number of executors: 6; number of executor CPUs: 1; number of driver CPUs: 2; driver memory: 7 GB). B (16 vCPUs, 64 GB memory; executor memory: 8 GB; number of executors: 7; number of executor CPUs: 2; number of driver CPUs: 2; driver memory: 7 GB). C (32 vCPUs, 128 GB memory; executor memory: 8 GB; number of executors: 14; number of executor CPUs: 2; number of driver CPUs: 4; driver memory: 15 GB). NOTE: The degree of parallelism of Spark resources is jointly determined by the number of executors and the number of executor CPU cores. The maximum number of tasks that can be executed concurrently equals the number of executors multiplied by the number of executor CPU cores. Plan compute resource specifications based on the DLI queue resources. Spark tasks are executed jointly by multiple roles, such as the driver and executors. Therefore, the number of executors multiplied by the number of executor CPU cores must be less than the number of compute CUs of the queue. Otherwise, other roles may fail to start Spark tasks. Calculation formulas for Spark job parameters: CUs = Driver Cores + (Executors x Executor Cores). Memory = Driver Memory + (Executors x Executor Memory).
Executor Memory	Memory of each Executor. It is recommended that the ratio of Executor CPU cores to Executor memory be 1:4. The value ranges from 0 to 16 GB or from 0 to 16,384 MB. If more resources than available ones are requested, the task may fail.
Executor Cores	Number of CPU cores of each Executor applied for by jobs, which determines the capability of each Executor to execute tasks concurrently. Enter a value from 0 to 4. If more resources than available ones are requested, the task may fail.
Executors	Number of Executors applied for by a job. Enter a value from 0 to 100. If more resources than available ones are requested, the task may fail.
Driver Cores	Number of CPU cores of the driver. Enter a value from 0 to 4. If more resources than available ones are requested, the task may fail.
Driver Memory	Driver memory size. It is recommended that the ratio of the number of driver CPU cores to the driver memory be 1:4. The value ranges from 0 to 16 GB or from 0 to 16,384 MB. If more resources than available ones are requested, the task may fail.
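
The calculation formulas in the Specifications row can be checked against the three flavors with a short Python sketch. The SparkSpec type and the fits_queue helper are hypothetical names introduced here. The check that the total CUs must not exceed the queue CUs is an added assumption, whereas the rule that executors multiplied by executor cores must be less than the queue CUs comes from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkSpec:
    """Spark resource settings for one discovery task (hypothetical type)."""
    driver_cores: int
    driver_memory_gb: int
    executors: int
    executor_cores: int
    executor_memory_gb: int

    @property
    def cus(self):
        # CUs = Driver Cores + Executors x Executor Cores
        return self.driver_cores + self.executors * self.executor_cores

    @property
    def memory_gb(self):
        # Memory = Driver Memory + (Executors x Executor Memory)
        return self.driver_memory_gb + self.executors * self.executor_memory_gb

    @property
    def max_parallel_tasks(self):
        # Executors x Executor Cores
        return self.executors * self.executor_cores


# Values copied from the flavor descriptions in Table 1.
FLAVORS = {
    "A": SparkSpec(2, 7, 6, 1, 4),
    "B": SparkSpec(2, 7, 7, 2, 8),
    "C": SparkSpec(4, 15, 14, 2, 8),
}


def fits_queue(spec, queue_cus):
    """Rough pre-check before submitting a task to a DLI queue.

    Assumption: total CUs must not exceed the queue size. The second
    condition is the rule stated in the Specifications row.
    """
    return spec.cus <= queue_cus and spec.max_parallel_tasks < queue_cus


if __name__ == "__main__":
    for name, spec in FLAVORS.items():
        print(name, spec.cus, spec.memory_gb, spec.max_parallel_tasks)
```

With these values, flavors A, B, and C work out to 8, 16, and 32 CUs and 31, 63, and 127 GB of memory, which is consistent with the 8 vCPUs/32 GB, 16 vCPUs/64 GB, and 32 vCPUs/128 GB listed for them.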

Click OK. The sensitive data discovery task is created.

If no execution result is displayed after the sensitive data discovery task is successfully executed, and no matched information is found in the run log, no sensitive data was discovered.
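
To preview which tables a set of Wildcard rules would select before you create a task, the Data Table rule constraints from Table 1 can be approximated offline. The following Python sketch is only an approximation of the console's test window. It assumes that the wildcard character is an asterisk (*) that matches any sequence of characters and that matching is case-sensitive. The function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

MAX_RULES = 100  # maximum number of matching rules per task (from Table 1)
# Assumption: a rule may contain letters, digits, underscores, and "*".
RULE_PATTERN = re.compile(r"[A-Za-z0-9_*]+")


def parse_rules(text):
    """Split rule text on line breaks and validate each rule."""
    rules = [line.strip() for line in text.splitlines() if line.strip()]
    if len(rules) > MAX_RULES:
        raise ValueError(f"at most {MAX_RULES} rules are allowed")
    for rule in rules:
        if not RULE_PATTERN.fullmatch(rule):
            raise ValueError(f"invalid characters in rule: {rule!r}")
    return rules


def match_tables(rules, table_names):
    """Return the tables that match at least one rule, in input order."""
    return [name for name in table_names
            if any(fnmatchcase(name, rule) for rule in rules)]


if __name__ == "__main__":
    rules = parse_rules("test_*\nods_*_log\n")
    tables = ["test_users", "prod_users", "ods_web_log", "test"]
    print(match_tables(rules, tables))
```

Because the allowed rule characters exclude ? and square brackets, the only special character that fnmatchcase interprets in a valid rule is the asterisk.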

Related Operations

Running or scheduling a task: On the Sensitive Data Discovery page, locate a task and click Run in the Operation column or click More in the Operation column and select Start.
You can determine whether a task is scheduled once or repeatedly based on the scheduling period.

If you run or schedule a task for which Manually synchronize the recognition result is not selected as a common user other than the DAYU Administrator, Tenant Administrator, or data security administrator, the task fails to be executed. Only the DAYU Administrator, Tenant Administrator, or data security administrator can run or schedule tasks for which Manually synchronize the recognition result is not selected.
Editing a task: On the Sensitive Data Discovery page, locate a task and click Edit in the Operation column.
A task in the Running state cannot be edited.
Deleting tasks: On the Sensitive Data Discovery page, locate a task, click More in the Operation column, and select Delete. To delete multiple tasks at a time, select the tasks and click Delete above the task list.
A task in the Running state cannot be deleted.
- Deleting a sensitive data discovery task will delete the discovery result. Exercise caution when performing this operation.
- The deletion operation cannot be undone. Exercise caution when performing this operation.
Viewing running instance logs: On the Sensitive Data Discovery page, locate a task and click the expand icon to show its instances. Click Operation and select View Log.
If a task fails to be executed, you can locate the failure cause based on logs, rectify the fault, and try the task again. If the fault persists, contact technical support.
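
The permission rule for creating, running, and scheduling tasks described in this section can be summarized in a small Python sketch. It only restates the documented rule. The function names and the way roles are represented as strings are assumptions made for illustration and do not reflect any service API.

```python
# Roles that may use automatic synchronization to Data Map (from this section).
PRIVILEGED_ROLES = frozenset({
    "DAYU Administrator",
    "Tenant Administrator",
    "data security administrator",
})


def sync_mode(manual_sync_selected):
    """Synchronization mode implied by the task setting."""
    return "manual" if manual_sync_selected else "automatic"


def can_create_or_run(user_roles, manual_sync_selected):
    """Whether a user can create, run, or schedule a discovery task.

    Tasks that synchronize automatically require a privileged role.
    A common user must select "Manually synchronize the recognition result".
    """
    if manual_sync_selected:
        return True
    return bool(PRIVILEGED_ROLES & set(user_roles))


if __name__ == "__main__":
    print(can_create_or_run(["developer"], manual_sync_selected=False))
    print(can_create_or_run(["developer"], manual_sync_selected=True))
    print(can_create_or_run(["Tenant Administrator"], manual_sync_selected=False))
```

Note that even when a common user creates a task with manual synchronization selected, the later synchronization to Data Map on the Manual Recovery tab still requires one of the privileged roles.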