docs/integrations/destinations/gcs-data-lake.md
This page guides you through setting up the GCS Data Lake destination connector.
This connector is Airbyte's official support for the Iceberg protocol on Google Cloud Storage. It writes the Iceberg table format to GCS using a supported Iceberg catalog.
The GCS Data Lake connector requires two things: a GCS bucket to store your data, and a supported Iceberg catalog (BigLake or Polaris).
Follow these steps to set up your GCS storage and Iceberg catalog permissions.
1. In the Google Cloud Console, navigate to IAM & Admin > Service Accounts.
2. Click CREATE SERVICE ACCOUNT.
3. Give it a name (for example: airbyte-gcs-data-lake).
4. Grant the following roles:
5. Click CREATE KEY and choose the JSON format.
6. Download the JSON key file.
7. In Airbyte, paste the entire contents of this JSON file into the Service Account JSON field.
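Before pasting the key into Airbyte, you can sanity-check that the downloaded file is a valid service account key. This is just a sketch using placeholder values; the field names follow the standard GCP service account key format:

```python
import json

# A minimal stand-in for a downloaded key file (values are placeholders).
key_json = """{
  "type": "service_account",
  "project_id": "my-gcp-project",
  "private_key_id": "abc123",
  "client_email": "airbyte-gcs-data-lake@my-gcp-project.iam.gserviceaccount.com"
}"""

key = json.loads(key_json)

# The connector can derive the optional GCP Project ID setting from this
# field when you leave it blank in Airbyte.
assert key["type"] == "service_account"
print(key["project_id"])  # my-gcp-project
```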
The rest of the setup process differs depending on the catalog you're using.
The BigLake catalog is Google Cloud's managed Iceberg catalog service. To use it, you must first create a BigLake catalog in your GCP project. The service account you created earlier should have the necessary permissions to access this catalog.
To authenticate with Apache Polaris, follow these steps:
Set up your Polaris catalog and create a principal with the necessary permissions. Refer to the Apache Polaris documentation for detailed setup instructions.
When creating a principal in Polaris, you'll receive OAuth credentials (Client ID and Client Secret). Keep these credentials secure.
Grant the required privileges to your principal's catalog role. You can either:
Option A: grant the broad CATALOG_MANAGE_CONTENT privilege (recommended for simplicity).

Option B: grant specific granular privileges:

- TABLE_LIST - List tables in a namespace
- TABLE_CREATE - Create new tables
- TABLE_DROP - Delete tables
- TABLE_READ_PROPERTIES - Read table metadata
- TABLE_WRITE_PROPERTIES - Update table metadata
- TABLE_WRITE_DATA - Write data to tables
- NAMESPACE_LIST - List namespaces
- NAMESPACE_CREATE - Create new namespaces
- NAMESPACE_READ_PROPERTIES - Read namespace metadata

Ensure that your Polaris catalog has been configured with the appropriate storage credentials to access your GCS bucket.
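To illustrate what the Client ID and Client Secret from your principal are used for: Polaris implements the Iceberg REST catalog spec, which exposes an OAuth2 client-credentials token endpoint. Airbyte performs this exchange for you; the sketch below only builds the request, with placeholder credentials, and assumes the token path and scope used in the Polaris quickstart:

```python
from urllib.parse import urlencode

# Placeholder values; substitute the credentials issued for your principal.
POLARIS_URI = "http://localhost:8181/api/catalog"
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"

# Token endpoint from the Iceberg REST catalog spec, as served by Polaris.
token_url = f"{POLARIS_URI}/v1/oauth/tokens"
form_body = urlencode({
    "grant_type": "client_credentials",
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    "scope": "PRINCIPAL_ROLE:ALL",  # scope used in the Polaris quickstart
})
print(token_url)
```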
In Airbyte, configure the following fields:
| Field | Required | Description |
|---|---|---|
| GCS Bucket Name | Yes | The name of your GCS bucket (for example: my-data-lake) |
| Service Account JSON | Yes | The complete JSON content from your service account key file |
| GCP Project ID | No | The GCP project ID. If not specified, Airbyte extracts it from the service account JSON |
| GCP Location | Yes | The GCP location/region (for example: us, us-central1, eu) |
| Warehouse Location | Yes | Root path for Iceberg data in GCS (for example: gs://my-bucket/warehouse) |
| Catalog Type | Yes | Select the type of Iceberg catalog to use: BigLake or Polaris |
| Main Branch Name | No | Iceberg branch name (default: main) |
| Default Namespace | No | Default namespace for tables (for example: default, airbyte_data) |
When Catalog Type is set to BigLake, configure these additional fields:
| Field | Required | Description |
|---|---|---|
| BigLake Catalog Name | Yes | Name of your BigLake catalog (from the setup step) |
When Catalog Type is set to Polaris, configure these additional fields:
| Field | Required | Description |
|---|---|---|
| Polaris Server URI | Yes | The base URL of your Polaris server (for example: http://localhost:8181/api/catalog) |
| Catalog Name | Yes | The name of the catalog in Polaris (for example: quickstart_catalog) |
| Client ID | Yes | The OAuth Client ID for authenticating with the Polaris server |
| Client Secret | Yes | The OAuth Client Secret for authenticating with the Polaris server |
| Sync mode | Supported? |
|---|---|
| Full Refresh - Overwrite | Yes |
| Full Refresh - Append | Yes |
| Full Refresh - Overwrite + Deduped | Yes |
| Incremental Sync - Append | Yes |
| Incremental Sync - Append + Deduped | Yes |
In each stream, Airbyte maps top-level fields to Iceberg fields. Airbyte maps nested fields (objects, arrays, and unions) to string columns and writes them as serialized JSON.
This is the full mapping between Airbyte types and Iceberg types.
| Airbyte type | Iceberg type |
|---|---|
| Boolean | Boolean |
| Date | Date |
| Integer | Long |
| Number | Double |
| String | String |
| Time with timezone* | Time |
| Time without timezone | Time |
| Timestamp with timezone* | Timestamp with timezone |
| Timestamp without timezone | Timestamp without timezone |
| Object | String (JSON-serialized value) |
| Array | String (JSON-serialized value) |
| Union | String (JSON-serialized value) |
*Airbyte converts the time with timezone and timestamp with timezone types to Coordinated Universal Time (UTC) before writing to the Iceberg file.
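The mapping above can be illustrated with a short sketch (not connector code): nested values land in string columns as serialized JSON, and timezone-aware timestamps are normalized to UTC before being written.

```python
import json
from datetime import datetime, timezone, timedelta

# An Airbyte record with a nested object and a timestamp-with-timezone field.
record = {
    "id": 1,                                        # Integer -> Iceberg Long
    "amount": 9.99,                                 # Number  -> Iceberg Double
    "address": {"city": "Berlin", "zip": "10115"},  # Object  -> String (JSON)
    "updated_at": datetime(2024, 5, 1, 12, 0,
                           tzinfo=timezone(timedelta(hours=2))),
}

# Nested fields are serialized to JSON strings...
address_column = json.dumps(record["address"])

# ...and timestamps with timezone are converted to UTC.
updated_at_utc = record["updated_at"].astimezone(timezone.utc)

print(address_column)               # {"city": "Berlin", "zip": "10115"}
print(updated_at_utc.isoformat())   # 2024-05-01T10:00:00+00:00
```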
This connector never rewrites existing Iceberg data files. This means Airbyte can only handle specific source schema changes:
You have the following options to manage schema evolution:
This connector uses a merge-on-read strategy to support deduplication.
The GCS Data Lake connector assumes that one of two things is true:
If these conditions aren't met, you may see inaccurate data in Iceberg, with older records taking precedence over newer ones. If this happens, use an append or overwrite sync mode instead.
Some API sources have streams that don't meet these conditions. Stripe and Monday are known examples, and there are likely others.
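The deduplication assumption can be shown with a toy reducer (a sketch of the semantics, not the connector's implementation): for each primary key, the record with the highest cursor wins, which only yields correct results when cursor values actually reflect recency.

```python
# Toy illustration of dedup semantics: latest cursor wins per primary key.
records = [
    {"id": "a", "cursor": 1, "status": "pending"},
    {"id": "a", "cursor": 3, "status": "paid"},
    {"id": "a", "cursor": 2, "status": "failed"},  # emitted out of order
]

deduped = {}
for rec in records:
    # Keep the record with the highest cursor for each primary key. If a
    # source emits duplicates whose cursors don't reflect recency, the
    # "winner" here could be stale -- hence the conditions above.
    if rec["id"] not in deduped or rec["cursor"] > deduped[rec["id"]]["cursor"]:
        deduped[rec["id"]] = rec

print(deduped["a"]["status"])  # paid
```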
Iceberg supports Git-like semantics over your data. This connector leverages those semantics to provide resilient syncs.
During each sync, the connector writes incoming data to an airbyte_staging branch and replaces the main branch with airbyte_staging at the end of the sync. Since most query engines target the main branch, consumers can continue querying your existing data until the end of a truncate sync, at which point the main branch is atomically swapped to the new version. The main branch is replaced rather than fast-forwarded; fast-forwarding is intentionally avoided to better handle potential compaction issues.
Important warning: any changes made to the main branch outside of Airbyte's operations after a sync begins will be lost during this process.
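The branch swap above can be modeled as repointing a branch reference (a toy sketch; real Iceberg branches are named references to table snapshots):

```python
# Toy model of the branch-pointer swap at the end of a sync.
branches = {"main": "snapshot-41"}

# During the sync, new data is committed to a staging branch while readers
# keep resolving "main" to the old snapshot.
branches["airbyte_staging"] = "snapshot-42"

# At the end of the sync, "main" is repointed in a single commit, so readers
# switch from the old version to the new one atomically.
branches["main"] = branches.pop("airbyte_staging")

print(branches)  # {'main': 'snapshot-42'}
```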
:::caution
Do not run compaction during a truncate refresh sync to prevent data loss. During a truncate refresh sync, the system deletes all files that don't belong to the latest generation. This includes:

If compaction runs simultaneously with the sync, it would delete files from the current generation, causing data loss.
:::
This destination supports namespaces.
| Version | Date | Pull Request | Subject |
|---|---|---|---|
| 1.0.7 | 2026-02-04 | 72855 | Upgrade CDK to 0.2.8 |
| 1.0.6 | 2026-01-23 | 72300 | Upgrade CDK to 0.2.0 |
| 1.0.5 | 2026-01-14 | 71760 | Restore integration tests in CI. Workaround DI error. |
| 1.0.4 | 2026-01-12 | 71227 | Add speed mode support with PROTOBUF serialization |
| 1.0.3 | 2026-01-12 | 71258 | Migrate to TableSchemaMapper from deprecated ColumnNameMapper pattern |
| 1.0.2 | 2025-11-13 | 69317 | Connector generally available |
| 1.0.1 | 2025-11-13 | 69212 | Initial release of GCS Data Lake destination with BigLake and Polaris catalog support |