{{< public-preview />}}
Iceberg sinks provide exactly-once delivery of updates from Materialize into Apache Iceberg[^1] tables hosted on Amazon S3 Tables[^2]. As data changes in Materialize, the corresponding Iceberg tables are automatically kept up to date. You can sink data from a materialized view, a source, or a table.
This guide walks you through the steps required to set up Iceberg sinks in Materialize Cloud.
In AWS, set up permissions to allow Materialize to write data files to the object storage backing your Iceberg catalog. This guide uses an IAM policy and an IAM role to grant the required permissions. We strongly recommend using role assumption-based authentication to manage access.
Create an IAM policy that allows full access to your S3 Tables API. Replace `<S3 table bucket ARN>` with the ARN of your S3 table bucket:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3tables:*",
      "Resource": [
        "<S3 table bucket ARN>",
        "<S3 table bucket ARN>/table/*"
      ]
    }
  ]
}
```
Create an IAM role that Materialize can assume. For the **Trusted entity type**, specify **Custom trust policy** with the following:

- **Principal**: The example uses the Materialize Cloud IAM principal. For self-managed deployments and the Emulator, the principal will differ.
- **ExternalId**: `"PENDING"` is a placeholder and will be updated after creating the AWS connection in Materialize.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "PENDING"
        }
      }
    }
  ]
}
```
For permissions, attach the IAM policy created earlier to grant access to the S3 Tables.
Once you have created the IAM role, copy the role ARN from the AWS console. You'll use the ARN in the next step.
In Materialize, create an AWS connection to authenticate with the object storage. Use `CREATE CONNECTION ... TO AWS` to create an AWS connection, replacing:

- `<IAM role ARN>` with your IAM role ARN from step 1.
- `<region>` with your AWS region (e.g., `us-east-1`).

```sql
CREATE CONNECTION aws_connection TO AWS (
    ASSUME ROLE ARN = '<IAM role ARN>',
    REGION = '<region>'
);
```

For more details on AWS connection options, see `CREATE CONNECTION`.
Fetch the `external_id` for your connection, replacing `<IAM role ARN>` with your IAM role ARN:

```sql
SELECT external_id
FROM mz_internal.mz_aws_connections awsc
JOIN mz_connections c ON awsc.id = c.id
WHERE c.name = 'aws_connection'
  AND awsc.assume_role_arn = '<IAM role ARN>';
```

You will use the `external_id` to update the IAM role in the next step.
Once you have the `external_id`, update the trust policy for the IAM role created in step 1. Replace `"PENDING"` with your external ID value. Your IAM trust policy should look like the following (but with your external ID value):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::664411391173:role/MaterializeConnection"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "mz_1234abcd-5678-efgh-9012-ijklmnopqrst_u123"
        }
      }
    }
  ]
}
```
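After updating the trust policy, you can ask Materialize to confirm that role assumption works before moving on. A minimal sketch, assuming Materialize's `VALIDATE CONNECTION` command and the `aws_connection` name used above:

```sql
-- Asks Materialize to verify the connection end to end;
-- succeeds only if Materialize can assume the IAM role.
VALIDATE CONNECTION aws_connection;
```

If validation fails, double-check that the external ID in the trust policy matches the `external_id` you fetched from `mz_internal.mz_aws_connections`.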
In Materialize, create an Iceberg catalog connection for the Iceberg sink to use. To create it, use `CREATE CONNECTION ... TO ICEBERG CATALOG`, replacing:

- `<region>` with your AWS region (e.g., `us-east-1`).
- `<S3 table bucket ARN>` with your AWS S3 table bucket ARN.

The command uses the AWS connection you created earlier.

```sql
CREATE CONNECTION iceberg_catalog_connection TO ICEBERG CATALOG (
    CATALOG TYPE = 's3tablesrest',
    URL = 'https://s3tables.<region>.amazonaws.com/iceberg',
    WAREHOUSE = '<S3 table bucket ARN>',
    AWS CONNECTION = aws_connection
);
```
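A misconfigured catalog connection otherwise surfaces only when the sink is created. As a quick check, a hedged sketch assuming `VALIDATE CONNECTION` also applies to Iceberg catalog connections:

```sql
-- Verifies that Materialize can reach the S3 Tables Iceberg REST endpoint
-- with the given warehouse and AWS credentials.
VALIDATE CONNECTION iceberg_catalog_connection;
```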
In Materialize, you can sink from a materialized view, table, or source. Use `CREATE SINK` to create an Iceberg sink, replacing:

- `<sink_name>` with a name for your sink.
- `<sink_cluster>` with the name of your sink cluster.
- `<my_materialize_object>` with the name of your materialized view, table, or source.
- `<my_s3_table_bucket_namespace>` with your S3 table bucket namespace.
- `<my_iceberg_table>` with the name of your Iceberg table. If the Iceberg table does not exist, Materialize creates the table. For details, see the `CREATE SINK` reference page.
- `<key>` with the column(s) that uniquely identify rows.
- `<commit_interval>` with your commit interval (e.g., `60s`). The commit interval specifies how frequently Materialize commits snapshots to Iceberg. The minimum commit interval is `1s`. See Commit interval tradeoffs below.

```sql
CREATE SINK <sink_name>
  IN CLUSTER <sink_cluster>
  FROM <my_materialize_object>
  INTO ICEBERG CATALOG CONNECTION iceberg_catalog_connection (
    NAMESPACE = '<my_s3_table_bucket_namespace>',
    TABLE = '<my_iceberg_table>'
  )
  USING AWS CONNECTION aws_connection
  KEY (<key>)
  MODE UPSERT
  WITH (COMMIT INTERVAL = '<commit_interval>');
```
For the full list of syntax options, see the CREATE SINK reference.
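Once the sink is created, you may want to confirm it is healthy. A hedged sketch querying the Materialize system catalog, assuming the `mz_internal.mz_sink_statuses` relation and a sink named `my_iceberg_sink` (substitute your own sink name):

```sql
-- Shows the current status of the sink and any error message.
SELECT name, status, error
FROM mz_internal.mz_sink_statuses
WHERE name = 'my_iceberg_sink';
```

A status of `running` indicates the sink is committing snapshots; a `stalled` or `failed` status includes an error message describing the problem.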
The COMMIT INTERVAL setting controls how frequently Materialize commits
snapshots to your Iceberg table, making the data available to downstream query
engines. This setting involves tradeoffs:
| Shorter intervals (e.g., < `60s`) | Longer intervals (e.g., `5m`) |
|---|---|
| Lower latency - data visible sooner in downstream systems | Higher latency - data takes longer to appear |
| More small files - can degrade query performance over time | Fewer, larger files - better query performance |
| More frequent snapshot commits - higher catalog overhead | Less catalog overhead |
| Lower throughput efficiency | Higher throughput efficiency |
Recommendations:

- Use a commit interval of at least `60s`.
- For most production workloads, an interval of `5m` to `15m` is a good starting point.

{{< note >}}
Outside of development environments, commit intervals should be at least `60s`. Short commit intervals increase catalog overhead and produce many small files, which degrade query performance over time. They also increase load on the Iceberg metadata layer, which can result in a degraded catalog and unresponsive queries.
{{< /note >}}
{{< include-from-yaml data="examples/create_sink_iceberg" name="exactly-once-delivery" >}}
{{% include-headless "/headless/iceberg-sinks/type-mapping" %}}
{{% include-headless "/headless/iceberg-sinks/limitations-list" %}}
{{% include-headless "/headless/iceberg-sinks/troubleshooting" %}}
[^1]: Apache Iceberg is an open table format for large-scale analytics datasets.
[^2]: Amazon S3 Tables is an AWS feature that provides fully managed Apache Iceberg tables as a native S3 storage type.