Introduction to Iceberg

Iceberg is an open table format for large-scale data analytics that defines how a table's metadata and data files are organized. It sits between compute engines and storage systems and is designed to manage very large tabular datasets in a scalable, reliable, high-performance, and easy-to-use way, meeting the demands of modern distributed data processing.

This feature is supported only in version 25.3.0 or later.

Highlights

Scalability: Easily scales to accommodate extensive data tables and large datasets.
Query performance: Enhances data retrieval efficiency through its metadata management and query optimization capabilities.
Data version management: Offers robust data version control, helping users track and revert data states.
Ease of use: Provides straightforward APIs and CLI tools for creating, managing, and querying data tables.
Flexible partitioning strategies: Supports adaptable partitioning schemes tailored to diverse dataset characteristics.
Multi-version data support: Enables effective version handling and historical tracing of data entries.
Diverse data format compatibility: Accommodates various data formats like Parquet, ORC, and Avro.

Basic Concepts

Table: Fundamental entity representing a structured collection of data, including metadata details, storage locations, and partitioning strategies.
Partition: Divides table content into subsets based on predefined criteria such as temporal intervals, geographic regions, or product categories.
Metadata: Descriptive information about a table's structure, partitioning strategy, and change history, persisted in durable storage such as HDFS or S3.
Snapshot: A static view of the table's contents at a specific point in time, covering both the data and its associated metadata.
Manifest: An inventory detailing constituent data files within a table, recording attributes like paths, sizes, and partition specifics.
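
The relationships among these concepts can be sketched as a small in-memory model. This is an illustrative simplification, not Iceberg's actual classes or on-disk layout: the class names (`DataFile`, `Manifest`, `Snapshot`, `Table`), their fields, and the example location are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    """One data file, as a manifest would describe it (hypothetical fields)."""
    path: str
    partition: Dict[str, str]
    record_count: int


@dataclass
class Manifest:
    """An inventory of the data files that belong to a table."""
    data_files: List[DataFile]


@dataclass
class Snapshot:
    """A static view of the table: the manifests visible at commit time."""
    snapshot_id: int
    manifests: List[Manifest]

    def data_files(self) -> List[DataFile]:
        return [f for m in self.manifests for f in m.data_files]


@dataclass
class Table:
    """A table tracks its location, its snapshots, and the current one."""
    location: str
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def commit(self, manifests: List[Manifest]) -> Snapshot:
        # Every commit produces a new snapshot; older snapshots are kept.
        snapshot = Snapshot(len(self.snapshots) + 1, manifests)
        self.snapshots.append(snapshot)
        self.current_snapshot_id = snapshot.snapshot_id
        return snapshot

    def snapshot(self, snapshot_id: Optional[int] = None) -> Snapshot:
        # With no argument, return the current snapshot; otherwise an older one.
        wanted = self.current_snapshot_id if snapshot_id is None else snapshot_id
        for s in self.snapshots:
            if s.snapshot_id == wanted:
                return s
        raise KeyError(f"unknown snapshot: {wanted}")


# Hypothetical table location and file paths, for illustration only.
table = Table("s3://example-bucket/warehouse/orders")
m1 = Manifest([DataFile("data/day=01/a.parquet", {"day": "01"}, 100)])
table.commit([m1])
m2 = Manifest([DataFile("data/day=02/b.parquet", {"day": "02"}, 50)])
table.commit([m1, m2])

current_paths = [f.path for f in table.snapshot().data_files()]
first_paths = [f.path for f in table.snapshot(1).data_files()]
```

Because each commit adds a snapshot rather than replacing one, reading `table.snapshot(1)` still returns the table as it was after the first commit, which is the basis for tracking and reverting data states.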

File Organizational Structure

As illustrated below, Iceberg organizes a table into two layers: a metadata layer and a data layer.

Metadata layer:
- Metadata files are in JSON format. They store the metadata of the current table version and information about all snapshots.
- Manifest list files, also known as snapshot files, are in Avro format. One snapshot file is generated for each commit. Each row stores the path of a manifest file, the partition range of the data files it tracks, and information such as the number of data files added or deleted. Queries use this information to skip irrelevant manifest files, which speeds up execution.
- Manifest files are in Avro format. Each one lists multiple data files, one row per file, describing the file's status, path, partition information, column-level statistics (such as max/min values and null counts), file size, and row count. Column-level statistics help filter out unnecessary files when scanning table data.
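
The two levels of filtering in the metadata layer can be sketched as follows. This is a minimal in-memory model under stated assumptions, not Iceberg's Avro schema or reader API: the names `ManifestEntry`, `DataFileEntry`, and `plan_scan`, the single-column statistics, and the sample file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataFileEntry:
    """One row of a manifest file (simplified to one column's statistics)."""
    path: str
    min_value: int  # minimum of the filtered column within this file
    max_value: int  # maximum of the filtered column within this file


@dataclass
class ManifestEntry:
    """One row of a manifest list: a manifest and its partition range."""
    path: str
    partition_range: Tuple[str, str]  # lowest and highest partition value
    data_files: List[DataFileEntry]


def overlaps(lo, hi, query_lo, query_hi) -> bool:
    """Return True if the closed ranges [lo, hi] and [query_lo, query_hi] intersect."""
    return lo <= query_hi and query_lo <= hi


def plan_scan(manifest_list, day_range, amount_range) -> List[str]:
    """Select the data files that may contain matching rows."""
    selected = []
    for manifest in manifest_list:
        # Level 1: skip whole manifests using the partition range.
        if not overlaps(*manifest.partition_range, *day_range):
            continue
        for data_file in manifest.data_files:
            # Level 2: skip data files using column-level min/max statistics.
            if overlaps(data_file.min_value, data_file.max_value, *amount_range):
                selected.append(data_file.path)
    return selected


# Hypothetical metadata for a table partitioned by day.
manifest_list = [
    ManifestEntry("m1.avro", ("2024-01-01", "2024-01-31"), [
        DataFileEntry("a.parquet", 0, 99),
        DataFileEntry("b.parquet", 100, 500),
    ]),
    ManifestEntry("m2.avro", ("2024-02-01", "2024-02-29"), [
        DataFileEntry("c.parquet", 0, 1000),
    ]),
]

# Query: days 2024-01-10 through 2024-01-20, amount between 200 and 300.
files = plan_scan(manifest_list, ("2024-01-10", "2024-01-20"), (200, 300))
```

In this example only `b.parquet` is selected: `m2.avro` is skipped by its partition range without being opened, and `a.parquet` is skipped because its maximum value is below the queried range.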