Data Sources
Before using DataArts Studio, you need to select cloud services or databases as the data lake foundation, which provides storage and compute capabilities. DataArts Studio provides one-stop data development, governance, and services based on the data lake foundation.
Data Sources Supported by DataArts Studio
DataArts Studio can interconnect with cloud services such as DWS, DLI, and MRS Hive as well as traditional databases such as MySQL and Oracle. For details, see Table 1.
To connect to these data sources, go to the DataArts Studio console and choose Management Center to create a data connection.
Data connections in Management Center are independent of the data links in DataArts Migration. To use the data connections in DataArts Migration, create corresponding data links in DataArts Migration first.
- Data connections in Management Center are used to connect to the data lake foundation, on which DataArts Studio builds its one-stop data development, governance, and services.
- Data links in DataArts Migration can be used only in DataArts Migration to integrate source datasets into the destination data lake foundation. For details about the data sources supported by DataArts Migration, see Data Sources Supported by DataArts Migration.
Data Source Type | Management Center | DataArts Architecture | DataArts Factory | DataArts Catalog[1] | DataArts Quality[2] | DataArts DataService |
---|---|---|---|---|---|---|
DWS | Supported | Supported | Supported | Supported | Supported | Supported |
DLI | Supported | Supported | Supported | Supported | Supported | Supported |
MRS HBase | Supported | Not supported | Not supported | Supported | Not supported | Not supported |
MRS Hive | Supported | Supported | Supported | Supported | Supported | Not supported |
MRS Kafka | Supported | Not supported | Supported | Not supported | Not supported | Not supported |
MRS Ranger | Supported | Not supported | Not supported | Not supported | Not supported | Not supported |
MySQL | Supported | Not supported | Not supported | Not supported | Supported | Supported |
MRS Spark[3] | Supported | Supported | Supported | Not supported | Supported | Not supported |
RDS for MySQL | Supported | Not supported | Supported | Supported | Supported | Supported |
RDS for PostgreSQL | Supported | Supported | Supported | Supported | Supported | Not supported |
Host Connection | Supported | Not supported | Supported | Not supported | Not supported | Not supported |
MRS Presto | Supported | Not supported | Supported | Not supported | Not supported | Not supported |
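Annotation
[1] DataArts Catalog: In addition to the data sources marked as supported in the table, metadata collection supports the following data sources: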
- Relational databases, such as MySQL and PostgreSQL databases (You can use RDS connections to collect the metadata of these databases.)
- Cloud Search Service (CSS)
- Graph Engine Service (GES)
- Object Storage Service (OBS)
- MRS Hudi. Hudi is a data format whose metadata is stored in Hive and whose operations are performed using Spark. If you enable synchronization of the Hive table configuration for Hudi tables, you can collect the metadata of Hudi tables by collecting the MRS Hive metadata.
[2] The quality jobs and comparison jobs of DataArts Quality are not supported by MRS clusters with decoupled storage and compute.
[3] MRS Spark: MRS Spark connections can be used to integrate data into the DataArts Architecture and DataArts Quality modules. MRS Hudi is a data format. The metadata is stored in Hive, and operations are performed using Spark. DataArts Catalog uses MRS Hive to collect Hudi metadata, and DataArts Architecture and DataArts Quality use MRS Spark to govern Hudi data sources. (Business metric monitoring of DataArts Quality does not support Hudi data sources.)
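When planning which data connections to create, it can help to treat the support matrix in Table 1 as a lookup structure. The following is a minimal sketch, not an official DataArts Studio API: the names `MODULES`, `SUPPORT`, `is_supported`, and `sources_for` are hypothetical, the values are transcribed by hand from Table 1, the MRS data sources use the short "MRS ..." names, and the restrictions in notes [1] to [3] are not modeled.

```python
# Hypothetical helper: encodes Table 1 as a lookup. Not part of any DataArts Studio SDK.

# Column order follows Table 1.
MODULES = (
    "Management Center",
    "DataArts Architecture",
    "DataArts Factory",
    "DataArts Catalog",
    "DataArts Quality",
    "DataArts DataService",
)

# One row per data source type; "1" = Supported, "0" = Not supported.
# Transcribed from Table 1; footnotes are intentionally not modeled.
_ROWS = {
    "DWS":                "111111",
    "DLI":                "111111",
    "MRS HBase":          "100100",
    "MRS Hive":           "111110",
    "MRS Kafka":          "101000",
    "MRS Ranger":         "100000",
    "MySQL":              "100011",
    "MRS Spark":          "111010",
    "RDS for MySQL":      "101111",
    "RDS for PostgreSQL": "111110",
    "Host Connection":    "101000",
    "MRS Presto":         "101000",
}

# Map each data source to the set of modules that support it.
SUPPORT = {
    source: {module for module, flag in zip(MODULES, flags) if flag == "1"}
    for source, flags in _ROWS.items()
}


def is_supported(source: str, module: str) -> bool:
    """Return True if Table 1 marks `module` as supporting `source`."""
    if module not in MODULES:
        raise KeyError(f"unknown module: {module}")
    return module in SUPPORT[source]


def sources_for(module: str) -> list[str]:
    """List the data sources that `module` supports, in table order."""
    if module not in MODULES:
        raise KeyError(f"unknown module: {module}")
    return [source for source, modules in SUPPORT.items() if module in modules]


if __name__ == "__main__":
    print(is_supported("MRS Kafka", "DataArts Factory"))
    print(sources_for("DataArts DataService"))
```

A lookup like this makes it easy to check, for example, that an MRS Hive connection can back DataArts Architecture but not DataArts DataService before the connection is created in Management Center.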
Overview
Data Source Type | Description |
---|---|
DWS | DWS employs a shared-nothing architecture and a massively parallel processing (MPP) engine. It is compatible with ANSI SQL 99, SQL 2003, and the PostgreSQL and Oracle database ecosystems, providing competitive solutions for analyzing petabytes of data in various industries. |
DLI | DLI is a serverless big data compute and analysis service that is fully compatible with the Apache Spark and Apache Flink ecosystems. With the multi-model engines supported by DLI, enterprises can use SQL statements or programs to complete batch processing, stream processing, in-memory computing, and machine learning on heterogeneous data sources. |
MRS HBase | HBase handles data storage. It is an open-source, column-oriented, distributed storage system that is suitable for storing massive amounts of unstructured or semi-structured data. It features high reliability, high performance, and flexible scalability, and supports real-time data read/write. MRS HBase stores massive amounts of data and supports data queries in milliseconds. MRS HBase can load and update logistics data in milliseconds, and query and analyze petabytes of time series data in seconds. |
MRS Hive | Hive is a data warehouse framework built on Hadoop for storing, querying, and analyzing large-scale data. Hive defines a simple SQL-like query language called HiveQL, which allows users familiar with SQL to query data. MRS Hive can be used to analyze terabytes or petabytes of data and to quickly migrate on-premises Hadoop big data platforms (such as CDH and HDP) to the cloud without service interruption or service code modification. |
MRS Kafka | MRS provides dedicated MRS Kafka clusters. Kafka is an open-source, distributed, partitioned, and replicated commit log service. It offers publish-subscribe messaging with features similar to Java Message Service (JMS), but with a different design. It features message persistence, high throughput, distributed processing, multi-client support, and real-time processing. It applies to both online and offline message consumption, such as regular message collection, website activity tracking, aggregation of statistical system operation data (monitoring data), and log collection. These scenarios involve large-scale data collection for Internet services. |
MRS Ranger | Ranger offers a centralized security management framework and supports unified authorization and auditing. It manages fine-grained access control over Hadoop and related components, such as HDFS, Hive, HBase, Kafka, and Storm. You can use the web UI provided by Ranger to configure policies that control users' access to these components. |
MRS Hudi | Hudi is a data lake table format that provides the ability to update and delete data as well as consume new data on HDFS. It supports multiple compute engines and provides insert, update, and delete (IUD) interfaces and streaming primitives, including upsert and incremental pull, over datasets on HDFS. Hudi metadata is stored in Hive, and operations are performed using Spark. |
MySQL | MySQL is one of the most popular open-source databases. It features excellent performance and a mature, stable architecture, supports popular applications, including a wide range of web applications, and adapts to multiple fields and industries. It is cost-effective and preferred by small- and medium-sized enterprises. |
MRS Spark | Spark is an open-source parallel data processing framework. It helps users develop unified big data applications and perform cooperative processing, stream processing, and interactive analysis on data. Spark provides a framework featuring fast computing, writing, and interactive querying, and has significant performance advantages over Hadoop. Spark provides Spark SQL, a SQL-like language for processing structured data. |
RDS | RDS is an online, out-of-the-box relational database service that is based on the cloud computing platform. It is stable, reliable, scalable, and easy to manage. Currently, DataArts Studio supports only MySQL and PostgreSQL databases in RDS. |
Host Connection | You can connect to a specified host during data development and execute shell or Python scripts on the host through script development and job development. If the host connection information changes, you only need to edit it on the Host Connections page; you do not need to update each script or job individually. |
MRS Presto | Presto is an open-source SQL query engine for running interactive analytic queries against data sources of all sizes. It applies to massive structured and semi-structured data analysis, massive multi-dimensional data aggregation and reporting, ETL, ad hoc queries, and other scenarios. Presto allows querying data where it lives, including HDFS, Hive, HBase, Cassandra, relational databases, and even proprietary data stores. A single Presto query can combine different data sources to perform analysis across them. |