Updated on 2023-12-20 GMT+08:00

Overview

Application Scenario

As big data technologies mature, organizations are gaining a deeper understanding of the value of data, and big data is now used across many industries. According to a report, over 39.6% of enterprises worldwide have applied big data to their businesses and benefited from it, more than 89.6% have set up or plan to set up big data analysis departments, and over 60% are increasing their investment in big data. The ability to leverage big data will be crucial to every industry's future success.

In big data scenarios, data is a new asset and intelligence is a new source of productivity. Enterprises urgently need digital transformation to improve productivity and maximize the value of their data. Before migrating services to the cloud, traditional enterprises deploy services and store data in multiple clusters in on-premises Internet data centers (IDCs), where each server provides both compute and storage capabilities. This leads to the key problems listed in Table 1, which hinder an enterprise's digital transformation.

Table 1 Key concerns faced by traditional enterprises in big data scenarios

| No. | Key Concern | Description |
|-----|-------------|-------------|
| 1 | Hard to share data among multiple clusters | Enterprise data is stored in multiple clusters, resulting in the following problems: (1) There is no global view. Data in one cluster cannot be used in another unless it is copied. (2) Copying data is the only way to share data across clusters, and it takes a long time. (3) Copies of public data sets are stored in multiple clusters, which makes data redundant. |
| 2 | Resource waste due to coupled compute and storage resources | Compute and storage resources must be expanded proportionally even if their demands are inconsistent, which causes a waste of resources. |
| 3 | Low utilization and high cost due to three copies of data | The Hadoop Distributed File System (HDFS) stores data in three copies. The disk space utilization is only 33%, and the utilization of a single disk is lower than 70%. |
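The resource waste described in concern 2 can be illustrated with a small calculation. The sketch below assumes a hypothetical storage-heavy workload and a hypothetical server shape (none of these numbers come from this document) and shows how the larger of the two demands dictates the size of a coupled cluster.

```python
import math


def coupled_nodes(cpu_needed, tb_needed, cpu_per_node, tb_per_node):
    """Servers required when every server supplies both compute and storage."""
    by_compute = math.ceil(cpu_needed / cpu_per_node)
    by_storage = math.ceil(tb_needed / tb_per_node)
    # The larger requirement dictates the cluster size.
    return max(by_compute, by_storage)


# Hypothetical workload and server shape (illustrative values only).
cpu_needed, tb_needed = 200, 3000      # vCPUs, TB of raw disk
cpu_per_node, tb_per_node = 32, 48     # capacity of one server

nodes = coupled_nodes(cpu_needed, tb_needed, cpu_per_node, tb_per_node)
idle_cpu = nodes * cpu_per_node - cpu_needed
idle_ratio = idle_cpu / (nodes * cpu_per_node)

print(f"servers needed: {nodes}")
print(f"idle vCPUs: {idle_cpu} ({idle_ratio:.0%} of provisioned compute)")
```

With these made-up numbers, the storage demand forces 63 servers even though 7 would cover the compute demand, so most of the provisioned vCPUs sit idle.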

Solution Architecture

To address the problems listed in Table 1, Huawei Cloud provides a solution that decouples storage from compute and uses Object Storage Service (OBS) as the unified data lake storage.

Figure 1 OBS-based big data solution with decoupled storage and compute

Relying on the large capacity and high bandwidth of OBS and on shared access over multiple protocols (HDFS, POSIX, and OBS API), this solution allows Hadoop compute engines (such as Hive and Spark) to work compatibly with each other on the same data.

Solution Advantages

Compared with traditional solutions, this solution has the advantages described in Table 2.

Table 2 Advantages

| No. | Advantage | Description |
|-----|-----------|-------------|
| 1 | Converged, efficient, and collaborative analysis | (1) Data can be shared among multiple clusters through unified permission control. (2) No data copy is required. (3) Integration of big data and AI reduces the operation time. |
| 2 | High resource utilization thanks to decoupled storage and compute | Compute and storage resources can be scaled separately, which improves resource utilization. |
| 3 | High utilization and low cost with EC storage | OBS supports erasure coding (EC), a widely used fault tolerance technology for distributed storage. EC greatly increases disk space utilization and requires much less storage space than the three-copy mechanism. |
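The difference between three copies and EC can be made concrete with the standard capacity arithmetic: replication needs `copies` times the logical data size, while EC with `k` data blocks and `m` parity blocks needs `(k + m) / k` times. The sketch below is illustrative only. The 12+3 layout is a hypothetical example, not a documented OBS setting.

```python
def replication_raw_tb(logical_tb, copies=3):
    """Raw capacity consumed when every block is stored `copies` times."""
    return logical_tb * copies


def ec_raw_tb(logical_tb, data_blocks, parity_blocks):
    """Raw capacity consumed by erasure coding with k data + m parity blocks."""
    return logical_tb * (data_blocks + parity_blocks) / data_blocks


def utilization(logical_tb, raw_tb):
    """Fraction of raw disk space that holds user data."""
    return logical_tb / raw_tb


logical_tb = 100          # hypothetical amount of user data
K, M = 12, 3              # hypothetical EC layout, not an OBS-documented value

rep_raw = replication_raw_tb(logical_tb)
ec_raw = ec_raw_tb(logical_tb, K, M)

print(f"3 copies: {rep_raw} TB raw, utilization {utilization(logical_tb, rep_raw):.0%}")
print(f"EC {K}+{M}: {ec_raw} TB raw, utilization {utilization(logical_tb, ec_raw):.0%}")
```

Under these assumptions, three copies reach the 33% utilization quoted in Table 1, while the example EC layout reaches 80%.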

In addition, OBS provides the OBSFileSystem plug-in (OBSA-HDFS), which connects seamlessly to the upper-layer big data platform without requiring modifications to the platform.

OBSFileSystem provides HDFS-related APIs so that big data compute engines (such as Hive and Spark) can use OBS as the underlying storage.
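As a rough illustration of how a compute engine is pointed at OBS, the sketch below renders the kind of Hadoop `core-site.xml` entries that register the `obs://` scheme. The property names and class names are assumptions based on commonly documented OBSA-HDFS settings and should be checked against the plug-in version in use. The endpoint and credentials are placeholders, and `render_core_site` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# Assumed OBSA-HDFS settings; verify names against the plug-in documentation.
OBS_PROPERTIES = {
    "fs.obs.impl": "org.apache.hadoop.fs.obs.OBSFileSystem",
    "fs.AbstractFileSystem.obs.impl": "org.apache.hadoop.fs.obs.OBS",
    "fs.obs.endpoint": "obs.example-region.example.com",  # placeholder endpoint
    "fs.obs.access.key": "YOUR_ACCESS_KEY",                # placeholder credential
    "fs.obs.secret.key": "YOUR_SECRET_KEY",                # placeholder credential
}


def render_core_site(props):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


core_site_xml = render_core_site(OBS_PROPERTIES)
print(core_site_xml)
```

Once such properties are in place, engines such as Hive and Spark would address data with `obs://` paths in the same way they use `hdfs://` paths. Real credentials should come from a secure source rather than being written into the file in plain text.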

Figure 2 OBSFileSystem in the solution with decoupled storage and compute

OBS offers object storage buckets (object semantics) and parallel file systems (POSIX). In big data scenarios, parallel file systems are recommended. Parallel file systems support POSIX and are encapsulated through OBSFileSystem. Compared with object semantics, parallel file systems have additional APIs (including Rename, Append, hflush, and hsync). These APIs supplement HDFS semantics and provide better performance for big data computing.

Based on the preceding advantages, compared with traditional big data solutions, the Huawei Cloud big data solution with decoupled storage and compute requires significantly fewer compute resources, storage resources, and servers for the same service scale. This greatly increases resource utilization and reduces the total cost of ownership (TCO).

Application Scope

This practice explains how to connect different big data platforms and components to OBS in the big data solution with decoupled storage and compute, and how to migrate data from HDFS to OBS.