What Is Data Lake Insight

DLI Introduction

Data Lake Insight (DLI) is a serverless data processing and analysis service that is fully compatible with the Apache Spark, HetuEngine, and Apache Flink ecosystems. Because it is serverless, you do not need to provision or manage any servers.

DLI supports multiple query methods, including standard SQL, Spark SQL, and Flink SQL, and is compatible with mainstream data formats. You can use standard SQL or Spark and Flink applications to query data in these formats without first running ETL. DLI also supports SQL statements and Spark applications over heterogeneous data sources, including RDS, GaussDB(DWS), CSS, OBS, custom databases on ECSs, and offline databases.

DLI Compute Engines

DLI offers multiple compute engines, including Spark, HetuEngine, and Flink, each suited for different data processing scenarios.

Spark is better suited for batch processing and complex analysis of large-scale data, while Flink excels in real-time stream processing. HetuEngine is a high-performance interactive SQL analytics and data virtualization engine.

Spark
Features
Spark is a unified analytics engine designed for large-scale data processing, focusing on query, compute, and analysis.

DLI builds on open-source Spark with extensive performance optimization and service-oriented enhancements. It remains compatible with the Apache Spark ecosystem and APIs while boosting performance by 2.5 times, enabling exabyte-scale data queries and analyses within hours.

DLI's Spark engine supports batch processing and interactive analysis of large-scale data and provides high-performance distributed computing capabilities.
Use cases
- Suitable for scenarios requiring large-scale batch processing and complex data analysis.
- Ideal for deep mining and analysis of historical data, such as querying data in a data warehouse and generating reports.

Flink
Features
- Flink is a distributed computing engine that can be used for both batch processing and stream processing.
- DLI enhances the features and security of open-source Flink and offers the Stream SQL feature needed for data processing.
- It supports real-time stream processing and can handle large-scale real-time data streams, with support for event time processing and state management.
Use cases
- Suitable for scenarios requiring real-time data stream processing, such as real-time monitoring systems and real-time recommender systems.
- Ideal for fast analysis and response to real-time data, such as financial transaction monitoring and IoT device data processing.
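
To make "event time processing and state management" concrete, the following is a minimal conceptual sketch in plain Python. It does not use the Flink API or any DLI interface. The function name, the sample device IDs, and the 60-second window size are all assumptions chosen for illustration. It assigns each event to a tumbling window by the timestamp carried in the event itself, and keeps one counter per key and window as state.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (key, window_start), using each event's own timestamp."""
    # State: one counter per (key, window_start). A stream engine would keep
    # this in managed, fault-tolerant state; a local dict stands in for it here.
    state = defaultdict(int)
    for key, event_time in events:
        # Event-time assignment: the window depends only on the timestamp in
        # the event, so out-of-order arrival does not change the result.
        window_start = (event_time // window_seconds) * window_seconds
        state[(key, window_start)] += 1
    return dict(state)


# Hypothetical IoT readings as (device_id, event_time_in_seconds),
# deliberately listed out of order.
events = [
    ("sensor-a", 3),
    ("sensor-b", 61),
    ("sensor-a", 59),
    ("sensor-a", 62),
    ("sensor-b", 5),
]
counts = tumbling_window_counts(events, window_seconds=60)
```

A real streaming job would also need to decide when a window is complete and how long to wait for late events. This sketch leaves that out and simply processes a finite list.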

HetuEngine
Features
HetuEngine is a high-performance interactive SQL analysis and data virtualization engine, seamlessly integrating with the big data ecosystem to achieve second-level interactive queries on massive datasets.

HetuEngine + LakeFormation can quickly handle query requests on large-scale datasets, swiftly and efficiently extracting information from big data, significantly simplifying data management and analysis processes, and enhancing indexing and query performance in a big data environment.

Figure 1 DLI supporting HetuEngine + LakeFormation
- Responses to terabytes of data within mere seconds:
  Through automatic resource optimization and load balancing, HetuEngine can respond to queries over TB-scale data within seconds, significantly improving query efficiency.
- Serverless resources ready to use:
  The serverless service model requires no attention to underlying configurations, software updates, or failure issues, making resources easy to maintain and expand.
- Multiple resource types to meet various service requirements:
  Shared resource pool: pay-per-use billing, providing more cost-effective compute resources.
  
  Dedicated resource pool: offers a dedicated resource pool to meet high-performance resource requirements.
- Data ecosystem enhancement:
  HetuEngine and LakeFormation can connect to mainstream BI tools, such as Yonghong BI, FineBI, and DBeaver, to enhance the application capabilities in the data analysis field.
- 5x real-time data processing performance:
  HetuEngine + LakeFormation supports Apache Hudi's copy-on-write (COW) and merge-on-read (MOR) tables. Point query performance is 5 times that of open-source Trino, enabling faster responses to query requests and providing real-time data access.
Use cases
- Suitable for interactive query and analysis of data held in large-scale storage.

Core Functions

For details about DLI functions, see DLI Functions.

**Table 1** DLI core functions
Function	Description
DLI is a data processing and analytics service built on the serverless architecture.	DLI is a serverless big data query and analytics service. With DLI, you only pay for the actual compute resources used, with no need to maintain or manage cloud servers. Pay-per-use compute resources: You pay only for what you use (scanned data volume/CUH packages). When no jobs are running, you will not be billed. Auto scaling: DLI ensures you always have enough capacity on hand to deal with any traffic spikes.
DLI supports multiple compute engines.	DLI is fully compatible with ecosystems like Apache Spark, HetuEngine, and Apache Flink, and supports standard SQL, Spark SQL, and Flink SQL. It is compatible with mainstream data formats such as CSV, JSON, Parquet, and ORC. Spark is a unified analytics engine designed for large-scale data processing, focusing on query, compute, and analysis. DLI has undergone extensive performance optimization and service-oriented enhancements over the open-source Spark, maintaining compatibility with the Apache Spark ecosystem and APIs while boosting performance by 2.5 times, enabling exabyte-scale data queries and analyses within hours. Flink is a distributed compute engine that can be used for batch processing, which involves handling static datasets and historical datasets. It can also be used for stream processing, enabling the real-time processing of live data streams and the immediate generation of data results. DLI has enhanced features and security based on the open-source Flink and offers the Stream SQL feature needed for data processing. HetuEngine is an open source SQL query engine that allows for interactive query and analysis. It excels in quickly and efficiently processing large-scale data queries and analyses with low latency. For more information about HetuEngine, see Advantages.
DLI supports multiple connection methods.	DLI provides multiple connection methods to meet diverse user needs and scenarios. Connection methods: web-based console, APIs, SDKs, client tools, submitting DLI jobs using DataArts Studio, and connecting to BI tools for visual analysis. For more DLI connection methods, see Development Tools Supported by DLI.
DLI can connect to multiple data sources for cross-source data analysis.	Spark datasource connection: Data sources such as CloudTable, GaussDB(DWS), RDS, and CSS can be accessed through DLI. For details, see Enhanced Datasource Connections. Flink supports cross-source connectivity with various cloud services, forming a rich streaming ecosystem. DLI's streaming ecosystem is divided into a cloud service ecosystem and an open-source ecosystem. Cloud service ecosystem: DLI can connect to other cloud services in Flink SQL, so you can use SQL directly to read data from and write data to services such as DIS, OBS, CloudTable, MRS, RDS, SMN, and DCS. Open-source ecosystem: By establishing network connections with other VPCs through enhanced datasource connections, you can access all data sources and sinks supported by Flink and Spark, such as Kafka, HBase, and Elasticsearch, from the tenant-authorized DLI queues. For details, see Flink Jobs.
Three basic job types supported by DLI	SQL jobs allow you to query data using standard SQL statements. For details, see Spark SQL Syntax Reference. Flink jobs provide Flink SQL online analysis capabilities: they support operations such as window aggregations and joins, so you can express service logic in SQL and implement services quickly. For details, see Flink OpenSource SQL Syntax Reference. Spark jobs provide fully managed Spark computing features: You can submit computing tasks through interactive sessions or batch processing, and perform data analysis on fully managed Spark queues. For details, see Creating a Queue.
DLI supports decoupled storage and compute.	After storing data in OBS, you can connect DLI to OBS for data analysis. Under the decoupled storage and compute architecture, storage resources and compute resources can be requested and billed separately, reducing costs and improving resource utilization. You can choose single-AZ or multi-AZ storage when creating an OBS bucket for storing redundant data on the DLI console. The differences between the two storage policies are as follows: Multi-AZ storage means data will be redundantly stored across multiple AZs, offering higher reliability. Buckets with multi-AZ storage will store data in multiple different AZs within the same region. If one AZ becomes unavailable, data can still be accessed normally from other AZs, making it suitable for data storage scenarios requiring high reliability. You are advised to use this policy. Single-AZ storage means data is only stored in a single AZ, but it is more cost-effective compared to multi-AZ storage.
DLI manages and schedules resources in a unified manner using elastic resource pools.	The backend of elastic resource pools adopts a CCE cluster architecture, supporting heterogeneous resources, so you can manage and schedule resources in a unified manner.
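
The pay-per-use model described in Table 1 can be illustrated with a small arithmetic sketch. It assumes that usage is metered in CU-hours (CUH), calculated as the number of compute units multiplied by the run time in hours. The function name and the unit price of 0.5 are hypothetical and do not reflect DLI's actual price list. Real billing rules, such as rounding, minimum charges, and CUH packages, are not modeled here.

```python
def estimate_cuh_cost(compute_units, run_seconds, price_per_cuh):
    """Estimate the cost of one job metered in CU-hours (CUH).

    Assumption: CUH = compute units x run time in hours. The price is a
    placeholder supplied by the caller, not a real DLI rate.
    """
    if compute_units < 0 or run_seconds < 0 or price_per_cuh < 0:
        raise ValueError("inputs must be non-negative")
    cuh = compute_units * run_seconds / 3600
    return cuh * price_per_cuh


# A job that uses 16 CUs for 30 minutes at a hypothetical price of 0.5 per CUH.
cost = estimate_cuh_cost(compute_units=16, run_seconds=1800, price_per_cuh=0.5)

# With no job running (zero run time), nothing is charged under this model.
idle_cost = estimate_cuh_cost(compute_units=16, run_seconds=0, price_per_cuh=0.5)
```

Under these assumptions the first job consumes 8 CUH, so `cost` is 4.0 and `idle_cost` is 0.0, which matches the statement that you are not billed when no jobs are running.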

DLI Product Architecture

The DLI product architecture is as follows:

Figure 2 DLI serverless architecture

The core modules are described as follows:

**Table 2** DLI core modules
Module	Description
Ecosystem tools	DLI leverages its robust serverless architecture and multimodal engine support to fulfill the diverse needs of various industries, driving their digital transformation and fostering innovation.
Compute engine	Spark: supports batch processing and interactive analysis of large-scale data and provides high-performance distributed computing capabilities. Flink: supports real-time stream processing, capable of handling large-scale real-time data streams, with support for event time processing and state management. HetuEngine: supports interactive data analysis, swiftly handles complex SQL queries, and facilitates connections and queries across various data sources. For more information about HetuEngine, see Advantages.
Unified resource management	Resource decoupling: DLI adopts a decoupled compute and storage architecture, separating compute resources from storage resources. This allows for flexible adjustment of the ratio between compute and storage resources based on actual needs, enhancing resource utilization and reducing costs. Elastic scaling: DLI compute resources run on containerized Kubernetes and scale elastically. Resources can be automatically adjusted based on job demands. Multi-tenant support: Compute resources can be isolated by tenant to ensure independence among different tenants. Each tenant can independently manage their own compute resources, enabling fine-grained resource management and facilitating inter-departmental data sharing and permissions management within enterprises. Pay-per-use compute resources: You only pay for the compute resources you actually use, with no need to pre-purchase or manage servers, enhancing usage efficiency.
Unified metadata management	Multi-source metadata integration: DLI supports centralized management of metadata from various data sources, including cloud-based data sources (such as OBS, RDS, GaussDB(DWS), and CSS) and on-premises data sources (such as self-built databases and Redis). You can manage and analyze metadata across different data sources without the need to migrate data to a unified data lake. Metadata synchronization: DLI provides metadata management to ensure the timeliness and consistency of metadata. Metadata query and management: DLI offers standard SQL APIs, enabling you to query and manage metadata using SQL statements. You can add, delete, modify, and query metadata to facilitate data governance and analysis. Data security and permission management: Permissions on data catalogs, databases, and tables can be managed. You can assign different permissions to various tenants and user groups to ensure data security and compliance.
Storage service	OBS and databases are used to store structured or unstructured data for data analysis, providing persistent data storage services.
Data source connection	Cloud data sources can be connected. For example, OBS can be used to store and manage unstructured data, Relational Database Service (RDS) can be used to store and manage structured data, and GaussDB(DWS) can be used to efficiently query and analyze data. On-premises data sources, such as self-built data stores (MySQL, PostgreSQL, and HDFS), can also be connected.
Data applications	DLI can connect to mainstream BI tools in the industry to flexibly meet data presentation needs.

Accessing DLI

You can access DLI through the web-based management console or HTTPS-based APIs, or connect to the DLI server through a JDBC client.

For more DLI connection methods, see Development Tools Supported by DLI.

Using the management console
You can submit SQL, Spark, or Flink jobs on the DLI management console.

Log in to the management console. Choose EI Enterprise Intelligence > Data Lake Insight from the service list.

Using APIs
If you need to integrate DLI into a third-party system for secondary development, you can call DLI APIs to use the service.

For details, see Data Lake Insight API Reference.
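
As a rough illustration of calling an HTTPS-based API from a third-party system, the sketch below builds (but does not send) a request that submits a SQL statement. Everything specific in it is an assumption to be checked against the Data Lake Insight API Reference: the URI path `/v1.0/{project_id}/jobs/submit-job`, the `X-Auth-Token` header, and the `sql` and `queue_name` body fields. The endpoint host, project ID, token, and queue name are placeholders, and the helper function name is hypothetical.

```python
import json
import urllib.request


def build_submit_sql_request(endpoint, project_id, token, sql, queue_name):
    """Build an HTTPS POST request that submits one SQL statement.

    The path, header name, and body fields are assumptions for illustration;
    confirm them in the API reference before use.
    """
    url = f"https://{endpoint}/v1.0/{project_id}/jobs/submit-job"
    body = json.dumps({"sql": sql, "queue_name": queue_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
    )


# Placeholder values only; none of these identify a real endpoint or account.
request = build_submit_sql_request(
    endpoint="dli.example-region.example.com",
    project_id="my-project-id",
    token="<token>",
    sql="SELECT 1",
    queue_name="default",
)
# Sending is left out on purpose; urllib.request.urlopen(request) would do it.
```

Keeping request construction separate from sending makes the call easy to inspect and test before it is pointed at a real endpoint.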
JDBC
You can use JDBC to connect to the server for data query. For details, see Obtaining the Server Connection Address.
DataArts Studio
DataArts Studio is a one-stop data operations platform that provides intelligent data lifecycle management. It supports intelligent construction of industrial knowledge libraries and incorporates data foundations such as big data storage, computing, and analysis engines. With DataArts Studio, your company can easily construct end-to-end intelligent data systems. These systems can help eliminate data silos, unify data standards, accelerate data monetization, and promote digital transformation.

Create a data connection on the DataArts Studio management console to access DLI for data analysis.