ESQL Iceberg Data Source Plugin

x-pack/plugin/esql-datasource-iceberg/README.md

This plugin provides Apache Iceberg table catalog support for ESQL external data sources.

Overview

The Iceberg plugin enables ESQL to query Apache Iceberg tables stored in S3. Iceberg is an open table format for large analytic datasets that provides ACID transactions, schema evolution, and efficient metadata management.

Features

  • Iceberg Table Catalog - Read Iceberg table metadata and schema
  • Schema Discovery - Automatically resolve schema from Iceberg metadata
  • Partition Pruning - Skip data files based on partition predicates
  • Predicate Pushdown - Push filter expressions to Iceberg for efficient scanning
  • Arrow Vectorized Reading - High-performance columnar data reading via Apache Arrow
  • S3 Integration - Native S3 file I/O for cloud-native deployments

Usage

Once installed, the plugin lets ESQL query an Iceberg table by the S3 location of the table, that is, the directory that contains its metadata/ folder:

```sql
FROM "s3://my-bucket/warehouse/db/sales_table"
| WHERE sale_date >= "2024-01-01" AND region == "EMEA"
| STATS total = SUM(amount) BY product
```

The plugin automatically detects Iceberg tables by looking for the metadata/ directory structure.

Iceberg Table Structure

s3://bucket/warehouse/db/table/
├── data/
│   ├── part-00000.parquet
│   ├── part-00001.parquet
│   └── ...
└── metadata/
    ├── v1.metadata.json
    ├── v2.metadata.json
    ├── snap-*.avro
    └── version-hint.text
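
The layout above matches Iceberg's file-system (Hadoop-style) table convention, in which version-hint.text stores the current version number N and the current metadata file is vN.metadata.json. The following is a minimal Python sketch of how a reader could detect such a table and pick its current metadata file. It is not the plugin's code: the function names are hypothetical, and an in-memory mapping of object keys to contents stands in for an S3 listing.

```python
import re
from typing import Mapping, Optional

# Matches metadata file names such as "v2.metadata.json".
METADATA_FILE = re.compile(r"^v(\d+)\.metadata\.json$")


def looks_like_iceberg_table(keys: Mapping[str, bytes], table_root: str) -> bool:
    """Return True if <table_root>/metadata/ contains a vN.metadata.json file."""
    prefix = table_root.rstrip("/") + "/metadata/"
    return any(
        key.startswith(prefix) and METADATA_FILE.match(key[len(prefix):])
        for key in keys
    )


def current_metadata_key(keys: Mapping[str, bytes], table_root: str) -> Optional[str]:
    """Prefer the version named in version-hint.text; otherwise use the highest N."""
    prefix = table_root.rstrip("/") + "/metadata/"
    hint = keys.get(prefix + "version-hint.text")
    if hint is not None:
        candidate = f"{prefix}v{int(hint.decode().strip())}.metadata.json"
        if candidate in keys:
            return candidate
    # Fallback (an assumption of this sketch): scan for the largest version number.
    versions = [
        int(m.group(1))
        for key in keys
        if key.startswith(prefix) and (m := METADATA_FILE.match(key[len(prefix):]))
    ]
    return f"{prefix}v{max(versions)}.metadata.json" if versions else None
```

The hint file is only a hint, so the sketch falls back to scanning when the file is missing or points at a version that does not exist.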

Dependencies

This plugin bundles significant dependencies for Iceberg, Arrow, and AWS support:

Iceberg Core

| Dependency | Version | Purpose |
| --- | --- | --- |
| iceberg-core | 1.x | Iceberg table operations |
| iceberg-aws | 1.x | S3FileIO implementation |
| iceberg-parquet | 1.x | Parquet file support |
| iceberg-arrow | 1.x | Arrow vectorized reading |

Apache Arrow

| Dependency | Version | Purpose |
| --- | --- | --- |
| arrow-vector | 18.x | Arrow vector types |
| arrow-memory-core | 18.x | Arrow memory management |
| arrow-memory-unsafe | 18.x | Off-heap memory allocation |

Apache Parquet & Hadoop

| Dependency | Version | Purpose |
| --- | --- | --- |
| parquet-hadoop-bundle | 1.16.0 | Parquet file reading |
| hadoop-client-api | 3.4.1 | Hadoop Configuration |
| hadoop-client-runtime | 3.4.1 | Hadoop runtime |

AWS SDK

| Dependency | Version | Purpose |
| --- | --- | --- |
| software.amazon.awssdk:s3 | 2.x | S3 client |
| software.amazon.awssdk:sts | 2.x | STS for role assumption |
| software.amazon.awssdk:kms | 2.x | KMS for encryption |

Architecture

┌─────────────────────────────────────────┐
│        IcebergDataSourcePlugin           │
│  implements DataSourcePlugin             │
└─────────────────┬───────────────────────┘
                  │
                  │ provides
                  ▼
┌─────────────────────────────────────────┐
│         IcebergTableCatalog              │
│  implements TableCatalog                 │
│                                          │
│  - metadata(tablePath, config)           │
│  - planScan(tablePath, config, preds)    │
│  - catalogType() → "iceberg"             │
│  - canHandle(path)                       │
└─────────────────┬───────────────────────┘
                  │
                  │ uses
                  ▼
┌─────────────────────────────────────────┐
│        IcebergCatalogAdapter             │
│                                          │
│  Adapts Iceberg's StaticTableOperations  │
│  to work with S3 metadata locations      │
└─────────────────┬───────────────────────┘
                  │
                  │ uses
                  ▼
┌─────────────────────────────────────────┐
│          S3FileIOFactory                 │
│                                          │
│  Creates S3FileIO instances for          │
│  Iceberg table operations                │
└─────────────────────────────────────────┘
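
The diagram lists canHandle(path) and catalogType() on the catalog, which suggests that a caller asks each registered catalog whether it recognizes a path. The Python sketch below illustrates that dispatch pattern only. The real classes are Java, and everything here (ToyIcebergCatalog, select_catalog, the set of known tables) is a hypothetical stand-in, not the plugin's API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Protocol, Sequence


class TableCatalog(Protocol):
    """The two methods from the diagram that matter for dispatch."""

    def catalog_type(self) -> str: ...

    def can_handle(self, path: str) -> bool: ...


@dataclass
class ToyIcebergCatalog:
    """Stand-in for IcebergTableCatalog; a real check would inspect S3."""

    known_tables: FrozenSet[str]

    def catalog_type(self) -> str:
        return "iceberg"

    def can_handle(self, path: str) -> bool:
        # Simplification: a table is "known" if its root path is in the set.
        return path.startswith("s3://") and path.rstrip("/") in self.known_tables


def select_catalog(catalogs: Sequence[TableCatalog], path: str) -> Optional[TableCatalog]:
    """Return the first catalog that claims the path, or None if none does."""
    return next((c for c in catalogs if c.can_handle(path)), None)
```

For example, `select_catalog([ToyIcebergCatalog(frozenset({"s3://bucket/warehouse/db/table"}))], "s3://bucket/warehouse/db/table/")` returns the toy catalog, while a path to a standalone Parquet file returns `None`.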

Supported Iceberg Features

| Feature | Status |
| --- | --- |
| Schema discovery | Supported |
| Column projection | Supported |
| Partition pruning | Supported |
| Predicate pushdown | Supported |
| Time travel | Not yet supported |
| Schema evolution | Read-only |
| Hidden partitioning | Supported |
| Row-level deletes | Not yet supported |

Supported Data Types

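| Iceberg Type | ESQL Type |
| --- | --- |
| boolean | BOOLEAN |
| int | INTEGER |
| long | LONG |
| float | DOUBLE |
| double | DOUBLE |
| decimal | DOUBLE |
| date | DATE |
| time | TIME |
| timestamp | DATETIME |
| timestamptz | DATETIME |
| string | KEYWORD |
| uuid | KEYWORD |
| fixed | KEYWORD |
| binary | KEYWORD (base64) |
| list | Not yet supported |
| map | Not yet supported |
| struct | Not yet supported |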
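
The mapping above can be expressed as a lookup table. The Python sketch below shows one way to apply it to a schema and report the columns whose types are not yet supported. The helper names (base_type, map_schema) and the choice to report skipped columns rather than fail are assumptions of this sketch, not the plugin's behavior.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Copied from the table above; "binary" values are base64-encoded KEYWORDs.
ICEBERG_TO_ESQL: Dict[str, str] = {
    "boolean": "BOOLEAN",
    "int": "INTEGER",
    "long": "LONG",
    "float": "DOUBLE",
    "double": "DOUBLE",
    "decimal": "DOUBLE",
    "date": "DATE",
    "time": "TIME",
    "timestamp": "DATETIME",
    "timestamptz": "DATETIME",
    "string": "KEYWORD",
    "uuid": "KEYWORD",
    "fixed": "KEYWORD",
    "binary": "KEYWORD",
}


def base_type(iceberg_type: str) -> str:
    """Strip type parameters: 'decimal(10, 2)' -> 'decimal', 'list<int>' -> 'list'."""
    return re.split(r"[\(\[<]", iceberg_type.strip().lower(), maxsplit=1)[0]


def map_schema(fields: Sequence[Tuple[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Return (column -> ESQL type, names of columns with unsupported types)."""
    mapped: Dict[str, str] = {}
    skipped: List[str] = []
    for name, iceberg_type in fields:
        esql_type = ICEBERG_TO_ESQL.get(base_type(iceberg_type))
        if esql_type is None:
            skipped.append(name)  # list, map, struct, or an unknown type
        else:
            mapped[name] = esql_type
    return mapped, skipped
```

Note that mapping decimal to DOUBLE is lossy for high-precision values, because a double cannot represent every decimal exactly.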

Predicate Pushdown

The plugin supports pushing filter predicates to Iceberg for partition pruning and data skipping:

```sql
-- Partition pruning: only scans partitions matching the predicate
FROM "s3://bucket/table"
| WHERE sale_date >= "2024-01-01"

-- Data skipping: uses column statistics to skip row groups
FROM "s3://bucket/table"
| WHERE amount > 1000
```

Supported predicates:

  • Equality: ==, !=
  • Comparison: <, <=, >, >=
  • NULL checks: IS NULL, IS NOT NULL
  • IN lists: field IN (value1, value2, ...)
  • Boolean AND/OR combinations
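
To illustrate why pushed-down comparisons allow files to be skipped, here is a minimal Python sketch of min/max-based data skipping. It is a simplified model, not the plugin's or Iceberg's implementation: DataFile, may_match, and plan_files are hypothetical names, predicates are combined with AND only, and null handling is ignored.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple

Predicate = Tuple[str, str, Any]  # (column, operator, literal)


@dataclass
class DataFile:
    path: str
    bounds: Dict[str, Tuple[Any, Any]]  # column -> (min, max) over the file


def may_match(bounds: Tuple[Any, Any], op: str, value: Any) -> bool:
    """Return False only when the bounds prove that no row can satisfy the predicate."""
    lo, hi = bounds
    if op == "==":
        return lo <= value <= hi
    if op == "!=":
        return not (lo == hi == value)  # every row equals the literal
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    raise ValueError(f"unsupported operator: {op}")


def plan_files(files: Sequence[DataFile], predicates: Sequence[Predicate]) -> List[DataFile]:
    """Keep a file unless some predicate rules it out; missing stats never prune."""
    return [
        f
        for f in files
        if all(
            col not in f.bounds or may_match(f.bounds[col], op, value)
            for col, op, value in predicates
        )
    ]
```

With one file whose sale_date range ends at "2023-12-31" and another whose range starts at "2024-01-01", the predicate `("sale_date", ">=", "2024-01-01")` keeps only the second file. ISO-8601 date strings compare correctly as plain strings, which is why the sketch can use them directly.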

Configuration

S3 Configuration

S3 access is configured via environment variables or Elasticsearch settings:

```bash
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=us-east-1
```

Iceberg-specific Settings

| Setting | Default | Description |
| --- | --- | --- |
| esql.iceberg.s3.endpoint | (AWS default) | Custom S3 endpoint (for MinIO, etc.) |
| esql.iceberg.s3.path_style_access | false | Use path-style S3 access |
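
As a rough illustration of how these settings and the environment variables above could be combined, the Python sketch below builds a property map using the key names that Iceberg's S3FileIO and AWS client factories read (s3.endpoint, s3.path-style-access, s3.access-key-id, s3.secret-access-key, client.region). Whether the plugin performs this exact translation is an assumption. The function name and the example endpoint are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Environment variable -> Iceberg property name.
ENV_TO_PROPERTY = (
    ("AWS_ACCESS_KEY_ID", "s3.access-key-id"),
    ("AWS_SECRET_ACCESS_KEY", "s3.secret-access-key"),
    ("AWS_REGION", "client.region"),
)


def s3_file_io_properties(
    settings: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Translate plugin settings and AWS environment variables into FileIO properties."""
    env = os.environ if env is None else env
    props: Dict[str, str] = {}
    for env_name, prop in ENV_TO_PROPERTY:
        if env.get(env_name):
            props[prop] = env[env_name]
    # No endpoint property means the AWS default endpoint is used.
    endpoint = settings.get("esql.iceberg.s3.endpoint")
    if endpoint:
        props["s3.endpoint"] = endpoint
    path_style = settings.get("esql.iceberg.s3.path_style_access", "false")
    props["s3.path-style-access"] = str(path_style).lower()
    return props
```

For an S3-compatible store such as MinIO, a call like `s3_file_io_properties({"esql.iceberg.s3.endpoint": "http://localhost:9000", "esql.iceberg.s3.path_style_access": "true"})` would set a custom endpoint and path-style addressing, which such stores commonly require.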

Building

```bash
./gradlew :x-pack:plugin:esql-datasource-iceberg:build
```

Testing

```bash
# Unit tests
./gradlew :x-pack:plugin:esql-datasource-iceberg:test

# Integration tests (requires S3 fixture)
./gradlew :x-pack:plugin:esql-datasource-iceberg:qa:javaRestTest
```

Test Fixtures

The qa/ directory contains test fixtures for integration testing:

qa/src/javaRestTest/resources/iceberg-fixtures/
├── employees/           # Sample Iceberg table
│   ├── data/
│   │   └── data.parquet
│   └── metadata/
│       ├── v1.metadata.json
│       └── ...
└── standalone/
    └── employees.parquet  # Standalone Parquet file

Security Considerations

  • Use IAM roles for S3 access when running on AWS
  • Enable S3 bucket encryption for data at rest
  • Use VPC endpoints for private S3 access
  • Consider using AWS Lake Formation for fine-grained access control

Installation

The plugin is bundled with Elasticsearch and enabled by default when the ESQL feature is available.

License

Elastic License 2.0