Catalog Introduction

Iceberg Catalog is the top-level component of Iceberg. It manages the metadata of all Iceberg tables and the operations on that metadata. The Catalog governs table structure and metadata and provides the interfaces for creating, querying, and modifying tables, so it is the entry point through which users work with Iceberg tables. Through the Catalog, users can find the location of the current metadata file for each table, which makes it essential for both reading from and writing to Iceberg tables.

The current DataArts Fabric SQL version supports using Hadoop Catalog as the Catalog component for Iceberg tables.

Hadoop Catalog

Hadoop Catalog does not depend on any external system and works with any file system. It records the metadata file paths of tables in a specific directory.

As Hadoop supports decoupled storage and compute, the underlying data files may reside on HDFS or an object storage system like OBS.

To locate a table in Hadoop Catalog, simply specify the table path, because all metadata for the table is stored in files under that path.
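As a rough illustration of path-based lookup, the sketch below resolves the current metadata file from a table path. It assumes the directory layout used by open-source Apache Iceberg Hadoop tables (a `metadata` directory that holds `v<N>.metadata.json` files and a `version-hint.text` file with the latest version number). Whether DataArts Fabric SQL uses exactly this layout is an assumption, and the function name `current_metadata_file` is hypothetical.

```python
from pathlib import Path


def current_metadata_file(table_path: str) -> Path:
    """Return the newest metadata file under <table_path>/metadata.

    Assumes the open-source Iceberg Hadoop table layout:
    metadata/version-hint.text and metadata/v<N>.metadata.json.
    """
    metadata_dir = Path(table_path) / "metadata"
    hint = metadata_dir / "version-hint.text"

    if hint.exists():
        # The hint file stores the latest version number as plain text.
        version = int(hint.read_text().strip())
    else:
        # No hint file: fall back to the highest version found on disk.
        versions = [
            int(p.name[1:].split(".")[0])
            for p in metadata_dir.glob("v*.metadata.json")
        ]
        if not versions:
            raise FileNotFoundError(f"no metadata files in {metadata_dir}")
        version = max(versions)

    return metadata_dir / f"v{version}.metadata.json"
```

The only input is the table path. No external service is consulted, which matches the description of Hadoop Catalog above.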

LakeFormation Catalog

LakeFormation Catalog relies on the LakeFormation metadata service to track the latest snapshot of each table.

It uses an optimistic concurrency control (OCC) mechanism to maintain data integrity during concurrent writes across multiple tenants.

All Iceberg tables currently created on DataArts Fabric SQL are LakeFormation Catalog tables.

The following figure outlines the concurrency handling process for LakeFormation Catalog commits.


1. Read the current table snapshot information from LakeFormation, including the latest metadata file path and snapshot ID.
2. Write new data based on the current snapshot.
3. Load the latest snapshot.
4. Detect data conflicts. If a conflict occurs, the statement execution fails. Otherwise, attempt to commit the transaction.
5. Write metadata, including manifest files, manifest list, and metadata file.
6. Submit the latest metadata file path and snapshot ID to LakeFormation. If the submission fails, retry from step 3.
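The commit steps above can be sketched as an optimistic retry loop. This is a minimal in-memory illustration, not the LakeFormation API: `MetadataService`, `compare_and_swap`, `commit`, the `before_submit` hook, and the conflict rule (a conflict occurs when a file this writer removes has already been removed by a concurrent commit) are all hypothetical and chosen only to make the control flow visible.

```python
from dataclasses import dataclass


class CommitConflict(Exception):
    """Raised when a data conflict is detected or retries run out."""


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    metadata_path: str
    files: frozenset


class MetadataService:
    """Hypothetical in-memory stand-in for the metadata service."""

    def __init__(self):
        self._current = Snapshot(0, "metadata/v0.metadata.json", frozenset())

    def load(self) -> Snapshot:
        return self._current

    def compare_and_swap(self, expected_id: int, new: Snapshot) -> bool:
        # Accept the new snapshot only if nobody committed in between.
        if self._current.snapshot_id != expected_id:
            return False
        self._current = new
        return True


def commit(service, added, deleted=frozenset(), max_retries=3, before_submit=None):
    # Step 1: read the current snapshot.
    base = service.load()
    if not deleted <= base.files:
        raise ValueError("can only delete files visible in the base snapshot")
    # Step 2: write new data files based on `base` (omitted in this sketch).

    for _ in range(max_retries + 1):
        # Step 3: load the latest snapshot.
        latest = service.load()
        # Step 4: detect conflicts (assumed rule: a file we remove is already gone).
        if not deleted <= latest.files:
            raise CommitConflict("a concurrent commit removed the same files")
        # Step 5: write metadata (represented here by a new Snapshot object).
        new_id = latest.snapshot_id + 1
        new = Snapshot(
            new_id,
            f"metadata/v{new_id}.metadata.json",
            (latest.files - deleted) | added,
        )
        if before_submit is not None:
            before_submit()  # hook used only to simulate a concurrent writer
        # Step 6: submit; on failure, loop back to step 3.
        if service.compare_and_swap(latest.snapshot_id, new):
            return new
    raise CommitConflict("retries exhausted")
```

In this sketch a failed submission costs only a reload and a metadata rewrite. The data files written in step 2 are reused across retries, which is why the retry starts at step 3 rather than step 1.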
