Creating a Connection Between DataArts Studio and an MRS Hive Data Lake
This section describes how to create an MRS Hive connection between DataArts Studio and the data lake base.
Prerequisites
- You have created a data lake to connect, for example, a database or cloud service supported by DataArts Studio.
- Before creating a DWS data connection, ensure that you have created a cluster in DWS and have the permissions required to view Key Management Service (KMS) keys.
- Before creating an MRS connection such as an MRS HBase or MRS Hive connection, ensure that you have purchased an MRS cluster whose Kerberos encryption type is aes256-sha1,aes128-sha1, and that the cluster contains required components.
- The data lake to connect communicates with the DataArts Studio instance properly.
- If the data lake is an on-premises database, a public network or a dedicated connection is required. Ensure that the host where the data source is located can access the public network and the port has been enabled in the firewall rule.
- If the data lake is a cloud service (such as DWS and MRS), the following requirements must be met for network interconnection:
- If the CDM cluster in the DataArts Studio instance and the cloud service are in different regions, a public network or a dedicated connection is required.
- If the CDM cluster in the DataArts Studio instance and the cloud service are in the same region, VPC, subnet, and security group, they can communicate with each other by default. If they are in the same VPC but in different subnets or security groups, you must configure routing rules and security group rules. For details about how to configure routing rules, see Configuring Routing Rules. For details about how to configure security group rules, see Configuring Security Group Rules.
- The cloud service instance and the DataArts Studio workspace belong to the same enterprise project. If they do not, you can modify the enterprise project of the workspace.
- If the enterprise mode is used, pay attention to the following points:
In enterprise mode, the development environment and production environment need to be distinguished. Therefore, you need to prepare two sets of data lake services for the production environment and development environment to isolate the development environment from the production environment.
- If two clusters are used for clustered data sources, such as MRS, GaussDB(DWS), RDS, MySQL, Oracle, DIS, and ECS, you can create data connections in Management Center to distinguish data lake services in the development environment from those in the production environment. The data lake is automatically switched during development and production. Therefore, you need to prepare two sets of data lake services. The versions, specifications, components, regions, VPCs, subnets, and related configurations of the two sets of data lake services must be the same. For details on how to create data connections, see Creating a DataArts Studio Data Connection.
- For serverless services (such as DLI), DataArts Studio configures the mapping between data lake services in the production environment and development environment through environment isolation in the management center. The corresponding data lake is automatically switched during the development and production processes. Therefore, you need to prepare two sets of queues and database resources in the serverless data lake service and distinguish them by name suffix. For details, see Configuring Environment Isolation for a DataArts Studio Workspace in Enterprise Mode.
- For GaussDB(DWS), MRS Hive, and MRS Spark, if you select the same cluster when creating a data connection, you must configure database mapping on the Configure Data Source Resource Mapping page to isolate the development and production environments. For details, see DB configuration.
- Offline processing migration jobs are not supported in enterprise mode.
For example, if your data lake service is an MRS cluster, you need to prepare two MRS clusters with the same version, specifications, components, region, VPC, and subnet. If some configurations of an MRS cluster are modified, you also need to synchronize the modifications to the other MRS cluster.
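The parity requirement above (same version, specifications, components, region, VPC, and subnet for the development and production clusters) can be checked before you create the data connections. The following is a minimal sketch, not a DataArts Studio or MRS API: the `ClusterProfile` class, its field names, and the sample values are hypothetical, and you would fill them in from your own cluster details.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical record of the attributes that must match between clusters."""
    version: str
    specifications: str
    components: frozenset
    region: str
    vpc: str
    subnet: str


def parity_differences(dev, prod):
    """Return {field_name: (dev_value, prod_value)} for every mismatching field."""
    differences = {}
    for field in fields(ClusterProfile):
        dev_value = getattr(dev, field.name)
        prod_value = getattr(prod, field.name)
        if dev_value != prod_value:
            differences[field.name] = (dev_value, prod_value)
    return differences


if __name__ == "__main__":
    # Sample values are placeholders, not real resource identifiers.
    dev = ClusterProfile("3.1.0", "spec-a", frozenset({"Hive", "HBase"}),
                         "region-1", "vpc-1", "subnet-1")
    prod = ClusterProfile("3.1.0", "spec-a", frozenset({"Hive"}),
                          "region-1", "vpc-1", "subnet-1")
    for name, (dev_value, prod_value) in parity_differences(dev, prod).items():
        print(f"{name}: development={dev_value!r} production={prod_value!r}")
```

An empty result means the two profiles agree on every listed attribute. Rerun the check after changing the configuration of either cluster, because modifications must be synchronized to the other one.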
Creating a Data Connection
- Log in to the DataArts Studio console by following the instructions in Accessing the DataArts Studio Instance Console.
- On the DataArts Studio console, locate a workspace and click Management Center.
- On the displayed Manage Data Connections page, click Create Data Connection.
Figure 1 Creating a data connection
- On the displayed page, select MRS Hive for Data Connection Type and set the other parameters as described in Table 1.
Figure 2 MRS Hive connection parameters
Table 1 MRS Hive connection parameters
Parameter
Mandatory
Description
Data Connection Type
Yes
MRS Hive is selected by default and cannot be changed.
Name
Yes
Name of the data connection to create. Data connection names can contain a maximum of 100 characters. They can contain only letters, digits, underscores (_), and hyphens (-).
Tag
No
Attribute of the data connection to create. Tags make management easier.
NOTE: The tag name can contain only letters, digits, and underscores (_). It cannot start with an underscore (_) or exceed 100 characters.
Applicable Modules
Yes
Select the modules for which this connection is available.
NOTE:
- When the data migration job feature is enabled, you can select the DataArts Migration module. You can then select this data connection when creating a data migration job in DataArts Factory.
- You can use offline processing migration jobs only after applying for trustlist membership. To use this feature, contact customer service or technical support.
Basic and Network Connectivity Configuration
Connection Type
Yes
Connection type. Proxy connection is recommended.
- Proxy connection: An agent (CDM cluster) is used to access MRS clusters. This method supports all versions of MRS clusters.
- MRS API connection: MRS APIs are used to access MRS clusters. This method supports only MRS clusters of version 2.X or later.
When you select MRS API connection, pay attention to the following restrictions:
- The MRS API connection is available only for DataArts Factory.
- In DataArts Factory, you cannot view or manage the databases, data tables, and fields of the connection in a visualized manner. If the connected MRS cluster is of version 3.2.1 or later, you can view, but not manage, the databases, data tables, and fields of the connection in a visualized manner.
- When the SQL editor of DataArts Factory is used to run SQL statements, the execution results can be displayed only in logs.
NOTE: Select Proxy connection for Connection Type so that the DataArts Architecture, DataArts Quality, DataArts Catalog, and DataArts DataService components can use the MRS connection.
Manual
Yes
This parameter is mandatory when Connection Type is set to Proxy connection.
Select the connection mode. If you do not need to access MRS clusters in other projects or enterprise projects, select Cluster Name Mode.
- Cluster Name Mode: Select an existing cluster. You can connect only to an MRS cluster in the same project and enterprise project.
- Connection String Mode: Set Manager IP and enable communication between this connection's agent (CDM cluster) and an MRS cluster in another project or enterprise project so that you can access that MRS cluster.
Manager IP
Yes
This parameter is mandatory when Connection String Mode is selected for Manual.
Set this parameter to the floating IP address of MRS Manager. Only MRS clusters are supported. A Hadoop cluster can be connected only after it is managed by MRS.
NOTE: DataArts Studio supports only MRS clusters whose Kerberos encryption type is aes256-sha1,aes128-sha1. MRS clusters whose Kerberos encryption type is aes256-sha2,aes128-sha2 are not supported.
You can click Select next to the text box and select an MRS cluster in the same project and enterprise project. If you want to access an MRS cluster in another project or enterprise project, obtain and enter the floating IP address of MRS Manager and ensure that the connection's agent (CDM cluster) can communicate with the tenant-plane MRS cluster. To obtain the floating IP address of MRS Manager, log in to the active master node of the MRS cluster and run the ifconfig command. In the command output, the IP address of eth0:wsom is the floating IP address of MRS Manager. For details about how to log in to the master node of the MRS cluster, see Logging In to an ECS.
Enter one or more IP addresses in sequence, depending on the scenario, and separate multiple addresses with commas (,), for example, 127.0.0.1 or 127.0.0.1,127.0.0.2,127.0.0.3.
- If you enter one IP address, enter the management-plane floating IP address of the MRS cluster.
- If you enter three IP addresses, enter the IP address of the active node on the MRS cluster service plane, the IP address of the standby node on the MRS cluster service plane, and the floating IP address of the MRS cluster management plane.
MRS Cluster Name
Yes
This parameter is mandatory when MRS API connection is selected for Connection Type or Cluster Name Mode is selected for Manual.
The name of the MRS cluster. Select the MRS cluster that Hive belongs to. Only MRS clusters are supported. A Hadoop cluster can be selected only after it is managed by MRS. All MRS clusters with the same project ID and enterprise project are displayed.
NOTE: DataArts Studio supports only MRS clusters whose Kerberos encryption type is aes256-sha1,aes128-sha1. MRS clusters whose Kerberos encryption type is aes256-sha2,aes128-sha2 are not supported.
If the connection fails after you select a cluster, check whether the MRS cluster can communicate with the CDM instance that functions as the agent. They can communicate with each other in the following scenarios:
- If the CDM cluster in the DataArts Studio instance and the MRS cluster are in different regions, a public network or a dedicated connection is required. If the Internet is used for communication, ensure that an EIP has been bound to the CDM cluster, that the MRS cluster can access the Internet, and that the port has been enabled in the firewall rule.
- If the CDM cluster in the DataArts Studio instance and the MRS cluster are in the same region, VPC, subnet, and security group, they can communicate with each other by default. If they are in the same VPC but in different subnets or security groups, you must configure routing rules and security group rules. For details about how to configure routing rules, see Configuring Routing Rules. For details about how to configure security group rules, see Configuring Security Group Rules.
- The MRS cluster and the DataArts Studio workspace belong to the same enterprise project. If they do not, you can modify the enterprise project of the workspace.
NOTE: If an agent is connected to multiple MRS clusters and one of the MRS clusters is deleted or abnormal, connections to the other MRS clusters will be affected. Therefore, you are advised to connect an agent to only one MRS cluster.
KMS Key
No
This parameter is mandatory when Connection Type is set to Proxy connection.
KMS key used to encrypt and decrypt data source authentication information. Select a default or custom key.
NOTE: When you use KMS for encryption through DataArts Studio or KPS for the first time, the default key dlf/default or kps/default is automatically generated. For more information about default keys, see What Is a Default Master Key?.
Agent
Yes
This parameter is mandatory when Connection Type is set to Proxy connection.
MRS is not a fully managed service and cannot be directly connected to DataArts Studio. A CDM cluster can provide an agent for DataArts Studio to communicate with non-fully-managed services. Therefore, you need to select a CDM cluster when creating an MRS data connection. If no CDM cluster is available, create one first.
As a network proxy, the CDM cluster must be able to communicate with the MRS cluster. To ensure network connectivity, the CDM cluster must be in the same region and AZ and use the same VPC and subnet as the MRS cluster. The security group rule must also allow the CDM cluster to communicate with the MRS cluster.
NOTE:
- If a CDM cluster functions as the agent for a data connection in Management Center, the cluster cannot connect to multiple MRS security clusters. You are advised to plan multiple agents and map each one to a single MRS security cluster.
- If a CDM cluster functions as the agent for a data connection in Management Center, the cluster supports a maximum of 200 concurrent active threads. If multiple data connections share an agent, a maximum of 200 SQL, Shell, and Python scripts submitted through the connections can run concurrently. Excess tasks will be queued. You are advised to plan multiple agents based on the workload.
Data Source Authentication and Other Function Configuration
Authentication Method
Yes
This parameter is mandatory when Connection String Mode is selected for Manual.
It specifies the authentication method used for accessing the MRS cluster. The following options are available:
- SIMPLE: for non-security mode
- KERBEROS: for security mode
Username
Yes
Human-machine user of the MRS cluster. This parameter is mandatory when Connection Type is set to Proxy connection. If a new MRS user is used for connection, you need to log in to Manager and change the initial password.
To create a data connection for an MRS security cluster, do not use user admin. The admin user is the default management page user and cannot be used as the authentication user of the security cluster. You can create an MRS user whose password never expires by referring to Creating a Kerberos Authentication User for an MRS Security Cluster. When creating an MRS data connection, set Username and Password to the new MRS username and password.
NOTE:
- For clusters of MRS 3.1.0 or later, the user must have at least the permissions of the Manager_viewer role to create data connections in Management Center. To perform database, table, and data operations on components, the user must also have the user group permissions of those components.
- For clusters earlier than MRS 3.1.0, the user must have permissions of the Manager_administrator or System_administrator role to create data connections in Management Center.
- A user with only the Manager_tenant or Manager_auditor permission cannot create connections.
- You are advised to set a user password that never expires to prevent connection failures and service loss caused by password expiration.
Password
Yes
The password for accessing the MRS cluster. This parameter is mandatory when Connection Type is set to Proxy connection.
Enable ldap
No
This parameter is available when Connection Type is set to Proxy connection.
If LDAP authentication is enabled for an external LDAP server connected to MRS Hive, the LDAP username and password are required for authenticating the connection to MRS Hive. In this case, this option must be enabled. Otherwise, the connection will fail.
ldapUsername
Yes
This parameter is mandatory when Enable ldap is enabled.
Enter the username configured when LDAP authentication was enabled for MRS Hive.
ldapPassword
Yes
This parameter is mandatory when Enable ldap is enabled.
Enter the password configured when LDAP authentication was enabled for MRS Hive.
OBS storage support
No
This parameter is displayed when DataArts Migration is selected for Applicable Modules.
The server must support OBS storage. When creating a Hive table, you can store the table in OBS.
Use Agency
No
This parameter is displayed when DataArts Migration is selected for Applicable Modules.
If you enable the agency function, you can create a data connection without having a permanent AK/SK and execute CDM jobs using the scheduling identity configured in DataArts Factory.
Public agency
No
This parameter is displayed when DataArts Migration is selected for Applicable Modules and Use Agency is enabled.
The agency is only used to check whether the connection agency function is normal. CDM jobs will be executed using the scheduling identity configured in DataArts Factory.
AK
N/A
This parameter is displayed when DataArts Migration is selected for Applicable Modules and OBS storage support is enabled.
AK and SK are used to log in to the OBS server.
You need to create an access key for the current account and obtain an AK/SK pair.
To obtain an access key, perform the following steps:
- Log in to the management console, move the cursor to the username in the upper right corner, and select My Credentials from the drop-down list.
- On the My Credentials page, choose Access Keys, and click Create Access Key. See Figure 3.
- Click OK and save the access key file as prompted. The access key file will be saved to your browser's configured download location. Open the credentials.csv file to view Access Key Id and Secret Access Key.
NOTE:
- Only two access keys can be added for each user.
- To ensure access key security, the access key is automatically downloaded only when it is generated for the first time and cannot be obtained from the management console later. Store it securely.
SK
N/A
- Click Test to test connectivity of the data connection. If the test fails, the data connection fails to be created.
- After the test is successful, click OK to create the data connection.
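Several format rules from Table 1 can be checked locally before you fill in the form. The following is a minimal sketch under stated assumptions, not part of DataArts Studio: the function names are hypothetical, "letters" is interpreted as ASCII letters, a name or tag must contain at least one character, and a Manager IP value with a count other than one or three addresses is treated as invalid because Table 1 describes only those two cases.

```python
import ipaddress
import re

# Name: at most 100 characters; letters, digits, underscores, and hyphens only.
NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,100}")
# Tag: at most 100 characters; letters, digits, and underscores; no leading underscore.
TAG_RE = re.compile(r"(?!_)[A-Za-z0-9_]{1,100}")
# The only Kerberos encryption type the page lists as supported.
SUPPORTED_KERBEROS_ENCRYPTION = {"aes256-sha1", "aes128-sha1"}


def is_valid_connection_name(name):
    """Check the Name rule from Table 1 (ASCII letters assumed)."""
    return NAME_RE.fullmatch(name) is not None


def is_valid_tag(tag):
    """Check the Tag rule from Table 1 (ASCII letters assumed)."""
    return TAG_RE.fullmatch(tag) is not None


def is_supported_kerberos_encryption(value):
    """Compare a comma-separated encryption type string with the supported set."""
    return {part.strip() for part in value.split(",")} == SUPPORTED_KERBEROS_ENCRYPTION


def parse_manager_ips(value):
    """Split a Manager IP value and validate each address.

    Accepts one address or three addresses; raises ValueError otherwise
    (rejecting other counts is an assumption of this sketch).
    """
    parts = [part.strip() for part in value.split(",")]
    if len(parts) not in (1, 3):
        raise ValueError("expected one or three comma-separated IP addresses")
    # ipaddress.ip_address raises ValueError for malformed addresses.
    return [str(ipaddress.ip_address(part)) for part in parts]


if __name__ == "__main__":
    print(is_valid_connection_name("hive_dev-01"))
    print(is_valid_tag("_team"))
    print(parse_manager_ips("127.0.0.1,127.0.0.2,127.0.0.3"))
```

These checks cover only the syntax of the values. Whether the connection works still depends on the result of the Test step in the console.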
Reference
- Why is no MRS Hive cluster displayed on the Create Data Connection page?
Possible causes are as follows:
- Hive/HBase components were not selected during MRS cluster creation.
- The enterprise project selected during MRS cluster creation is different from that in the workspace.
- The network between the CDM cluster and the MRS cluster was disconnected when the MRS data connection was created.
The CDM cluster functions as a network agent, so the MRS data connection that you create must be able to communicate with the CDM cluster.
- Why does a Hive data connection fail to obtain information about databases or tables?
The possible cause is that the CDM cluster is stopped or a concurrency conflict occurs. You can switch to another agent to temporarily avoid this issue.
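When troubleshooting the network-related cause listed in this section, it can help to restate the connectivity rules from this page as a small decision function. The following is a minimal sketch, not a DataArts Studio feature: the `Placement` class, the function name, and the returned strings are hypothetical, and the case of the same region but different VPCs is reported as not covered because this page does not describe it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical description of where a cluster is deployed."""
    region: str
    vpc: str
    subnet: str
    security_group: str


def connectivity_requirement(cdm, mrs):
    """Map the placement of the CDM and MRS clusters to the rule on this page."""
    if cdm.region != mrs.region:
        return "public network or dedicated connection required"
    if cdm.vpc != mrs.vpc:
        # Same region but different VPCs is not described on this page.
        return "not covered by this page"
    if cdm.subnet == mrs.subnet and cdm.security_group == mrs.security_group:
        return "reachable by default"
    return "configure routing rules and security group rules"


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    cdm = Placement("region-1", "vpc-1", "subnet-1", "sg-1")
    mrs = Placement("region-1", "vpc-1", "subnet-2", "sg-1")
    print(connectivity_requirement(cdm, mrs))
```

The function only classifies the deployment. It does not test the network, and the enterprise project of the MRS cluster and the workspace must still match.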