This page contains the setup guide and reference information for the S3 source connector.
</HideInUI>

:::warning
Using cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For detailed information on egress costs, please consult the AWS S3 pricing guide.
:::
If you are syncing from a private bucket, you need to authenticate the connection. This can be done either by using an IAM User (with AWS Access Key ID and Secret Access Key) or an IAM Role (with Role ARN). Begin by creating a policy with the necessary permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::{your-bucket-name}/*",
        "arn:aws:s3:::{your-bucket-name}"
      ]
    }
  ]
}
```
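As a quick sanity check before attaching the policy, you can verify that it grants both required permissions. The snippet below is an illustrative, stdlib-only sketch (the `REQUIRED_ACTIONS` set and the inlined `policy` document are examples, not connector code):

```python
import json

# The two permissions the connector needs: object reads and bucket listing.
REQUIRED_ACTIONS = {"s3:GetObject", "s3:ListBucket"}

# An example policy document; substitute your own bucket name.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket/*",
        "arn:aws:s3:::my-bucket"
      ]
    }
  ]
}
""")

# Collect every action granted by an Allow statement.
granted = {
    action
    for stmt in policy["Statement"]
    if stmt["Effect"] == "Allow"
    for action in stmt["Action"]
}

missing = REQUIRED_ACTIONS - granted
print(sorted(missing))  # an empty list means both permissions are present
```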
:::note
At this time, object-level permissions alone are not sufficient to successfully authenticate the connection. Please ensure you include the bucket-level permissions as provided in the example above.
:::
:::caution
Your Secret Access Key will only be visible once upon creation. Be sure to copy and store it securely for future use.
:::
For more information on managing your access keys, please refer to the official AWS documentation.
:::note
S3 authentication using an IAM role member is not supported on the OSS platform.
:::
<!-- /env:oss -->

<!-- env:cloud -->

:::note
S3 authentication using an IAM role member must be enabled by a member of the Airbyte team. If you'd like to use this feature, please contact the Sales team for more information.
:::
1. In the IAM dashboard, click Roles, then Create role.
2. Choose the AWS account trusted entity type.
3. Set up a trust relationship for the role. This allows the Airbyte instance's AWS account to assume this role. You will also need to specify an external ID: a secret key that both the trusting service (Airbyte) and the trusted role (the role you're creating) know, used to prevent the "confused deputy" problem. The external ID should be your Airbyte workspace ID, which can be found in the URL of your workspace page. Edit the trust relationship policy to include the external ID:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::094410056844:user/delegated_access_user"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "{your-airbyte-workspace-id}"
        }
      }
    }
  ]
}
```
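If you generate the trust policy programmatically, filling in the external ID amounts to simple templating. A minimal sketch, assuming a made-up workspace ID (the principal ARN is the one from the example above):

```python
import json
from string import Template

# Trust policy template; only the external ID is substituted.
TRUST_POLICY = Template("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::094410056844:user/delegated_access_user"},
      "Action": "sts:AssumeRole",
      "Condition": {"StringEquals": {"sts:ExternalId": "$workspace_id"}}
    }
  ]
}
""")

workspace_id = "11111111-2222-3333-4444-555555555555"  # hypothetical workspace ID
policy = json.loads(TRUST_POLICY.substitute(workspace_id=workspace_id))

# Confirm the rendered document carries the external ID condition.
external_id = policy["Statement"][0]["Condition"]["StringEquals"]["sts:ExternalId"]
print(external_id)
```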
- Use `**` as the pattern to replicate every file in the bucket. For more precise pattern matching options, refer to the Globs section below.
- Leaving the schema as `{}` will automatically infer the schema from the file(s) you are replicating. For details on providing a custom schema, refer to the User Schema section.
- Setting the schema to `{"data": "object"}` will nest all downstream data in a "data" field. This is a good option if the schema of your records changes frequently.

All other fields are optional and can be left empty. Refer to the S3 Provider Settings section below for more information on each field.
Choose a delivery method for your data.
</FieldAnchor>

If enabled, sends subdirectory folder structure along with source file names to the destination. Otherwise, files will be synced by their names only. This option is ignored when file-based replication is not enabled.
The S3 source connector supports the following sync modes:
| Feature | Supported? |
|---|---|
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |
| Replicate Incremental Deletes | No |
| Replicate Multiple Files (pattern matching) | Yes |
| Replicate Multiple Streams (distinct tables) | Yes |
| Namespaces | No |
There are no predefined streams. The streams are based on the contents of your bucket.
| Compression | Supported? |
|---|---|
| Gzip | Yes |
| Zip | Yes |
| Bzip2 | Yes |
| Lzma | No |
| Xz | No |
| Snappy | No |
Please let us know any specific compressions you'd like to see support for next!
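Decompression of the supported codecs happens transparently when files are read. As a rough, stdlib-only illustration of that step (not the connector's actual code), the same CSV payload round-trips through two of the supported codecs:

```python
import bz2
import csv
import gzip
import io

rows = [["id", "name"], ["1", "alice"]]
raw = "\n".join(",".join(r) for r in rows).encode("utf8")

# Compress the same CSV payload with two of the supported codecs...
compressed = {"gzip": gzip.compress(raw), "bz2": bz2.compress(raw)}
decompressors = {"gzip": gzip.decompress, "bz2": bz2.decompress}

# ...and confirm each round-trips to identical records.
for codec, blob in compressed.items():
    text = decompressors[codec](blob).decode("utf8")
    parsed = list(csv.reader(io.StringIO(text)))
    assert parsed == rows
print("ok")
```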
(tl;dr -> path pattern syntax using wcmatch.glob. GLOBSTAR and SPLIT flags are enabled.)
This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. For example, `**` would indicate every file in the bucket. You must provide a path pattern, and you can also provide many patterns split with `|` for more complex directory layouts.
Each path pattern is a reference from the root of the bucket, so don't include the bucket name in the pattern(s).
Some example patterns:
- `**` : match everything.
- `**/*.csv` : match all files with a specific extension.
- `myFolder/**/*.csv` : match all csv files anywhere under `myFolder`.
- `*/**` : match everything at least one folder deep.
- `*/*/*/**` : match everything at least three folders deep.
- `**/file.*|**/file` : match every file called "file" with any extension (or no extension).
- `x/*/y/*` : match all files that sit in folder x -> any folder -> folder y.
- `**/prefix*.csv` : match all csv files with a specific prefix.
- `**/prefix*.parquet` : match all parquet files with a specific prefix.

Let's look at a specific example, matching the following bucket layout:
```
myBucket
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv
```
We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:
- `**/part*.csv` : the simplest approach, matching all csv files with the "part" prefix anywhere in the bucket.
- `some_table_files/*.csv|more_table_files/*.csv` : pick up relevant files only from those exact folders.
- `*table_files/*.csv` : this also works, but could cause problems in the future if new, unexpected folders start being created.
- Note that `extras/**/*.csv` would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
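The connector itself relies on `wcmatch.glob` for matching; as a stdlib-only approximation of the `**`, `*`, and `|` semantics used in these examples (a simplified sketch, not the real matcher), the bucket layout above can be filtered like this:

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small subset of wcmatch-style globs to a regex:
    '**/' matches any number of whole path segments, '**' matches anything,
    '*' matches within a single segment, and '|' separates alternatives."""
    alternatives = []
    for alt in pattern.split("|"):
        regex, i = "", 0
        while i < len(alt):
            if alt.startswith("**/", i):
                regex += r"(?:[^/]+/)*"
                i += 3
            elif alt.startswith("**", i):
                regex += r".*"
                i += 2
            elif alt[i] == "*":
                regex += r"[^/]*"
                i += 1
            else:
                regex += re.escape(alt[i])
                i += 1
        alternatives.append(regex)
    return re.compile(r"^(?:" + "|".join(alternatives) + r")$")

files = [
    "some_table_files/part1.csv",
    "some_table_files/part2.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
]
# "**/part*.csv" picks up the three "part" files but not "another_part1.csv".
matched = [f for f in files if glob_to_regex("**/part*.csv").match(f)]
print(matched)
```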
To perform incremental syncs, Airbyte syncs files from oldest to newest. Each file that's synced (up to 10,000 files) will be added as an entry in a "history" section of the connection's state message.
Once history is full, we drop the oldest entries from history and only read files that were last modified between the date of the newest file in history and "Days to Sync if History is Full" days prior.
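The described behavior can be sketched in a few lines. This is a simplified illustration, not the connector's implementation; the 3-entry cap and the dates are invented for the example (the real history holds up to 10,000 files):

```python
from datetime import datetime, timedelta

HISTORY_CAP = 3   # the real connector keeps up to 10,000 entries
DAYS_TO_SYNC = 3  # "Days to Sync if History is Full" setting

# Files as (last_modified, name), e.g. from a bucket listing.
files = [
    (datetime(2024, 1, 1), "a.csv"),
    (datetime(2024, 1, 5), "b.csv"),
    (datetime(2024, 1, 8), "c.csv"),
    (datetime(2024, 1, 9), "d.csv"),
]

# Sync oldest-to-newest, remembering only the most recent HISTORY_CAP files.
history = sorted(files)[-HISTORY_CAP:]

# Once history is full, only files modified after this cutoff are read.
newest = max(ts for ts, _ in history)
cutoff = newest - timedelta(days=DAYS_TO_SYNC)
to_read = [name for ts, name in files if ts >= cutoff]
print(to_read)
```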
Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
:::note
Without providing a schema for a CSV file, all columns will be inferred as strings.
:::
For example, you may only care about a specific subset of columns (any columns not listed in the provided schema will be collected into the `_ab_additional_properties` map), or you may want to enforce datatypes that inference would get wrong. Or any other reason! The schema must be provided as valid JSON as a map of `{"column": "datatype"}`, where each datatype is one of the supported primitive types (e.g. `string`, `integer`, `number`, `boolean`, `array`, `object`).
For example:

- `{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}`
- `{"username": "string", "friends": "array", "information": "object"}`

:::note
The S3 source connector used to infer schemas from all the available files and then merge them to create a superset schema. Starting from version 2.0.0, schema inference is based on the first file found only; the first file considered is the oldest one written to the prefix.
:::
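Because a user-provided schema is just a JSON map, it can be parsed and applied with ordinary tooling. A hedged, stdlib-only sketch of checking one record against such a map (the type mapping below is a simplification, not the connector's code):

```python
import json

# Simplified mapping from schema datatypes to Python types.
PY_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

user_schema = json.loads(
    '{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}'
)
record = {"id": 7, "location": "oslo", "longitude": 10.75, "latitude": 59.91}

# Columns whose values do not match the declared datatype.
mismatches = [
    col
    for col, datatype in user_schema.items()
    if not isinstance(record.get(col), PY_TYPES[datatype])
]
print(mismatches)  # empty: the record conforms to the declared schema
```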
The Start Date field accepts the format `YYYY-MM-DDTHH:mm:ssZ`. Leaving this field blank will replicate data from all files that have not been excluded by the Path Pattern and Path Prefix.

Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported, so please ensure that this process happens consistently over time.
- Header Definition: `User Provided` assumes the CSV does not have a header row and uses the headers you provide, while `Autogenerated` assumes the CSV does not have a header row and the CDK will generate headers of the form `f{i}`, where `i` is the index starting from 0. Otherwise, the default behavior is to use the header row from the CSV file. If you want to autogenerate or provide column names for a CSV that does have headers, set a value for the "Skip rows before header" option to ignore the header row.
- Delimiter: the character separating values, e.g. `\t` for tab-separated files. By default, this value is set to `,`.
- Encoding: the character encoding of the file. By default, `utf8`.
- Escape Character: a character used to escape reserved characters, e.g. `\`. For example, given the following data:

```
Product,Description,Price
Jeans,"Navy Blue, Bootcut, 34\"",49.99
```
The backslash (\) is used directly before the second double quote (") to indicate that it is not the closing quote for the field, but rather a literal double quote character that should be included in the value (in this example, denoting the size of the jeans in inches: 34" ).
Leaving this field blank (default option) will disallow escaping.
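The effect of that escape character can be reproduced with Python's stdlib `csv` reader (an illustration of the parsing behavior, not the connector's implementation):

```python
import csv
import io

data = 'Product,Description,Price\nJeans,"Navy Blue, Bootcut, 34\\"",49.99\n'

# escapechar tells the parser that a backslash-escaped quote is a literal
# quote inside the field, not the closing quote of the field.
reader = csv.reader(
    io.StringIO(data), quotechar='"', escapechar="\\", doublequote=False
)
rows = list(reader)
print(rows[1])
```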
- Quote Character: used to wrap values that contain reserved characters. By default, this is `"`.

Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. At the moment, partitioned parquet datasets are unsupported. The following settings are available:
The Avro parser uses the Fastavro library. The following settings are available:
There are currently no options for JSONL parsing.
<FieldAnchor field="streams.0.format[unstructured],streams.1.format[unstructured],streams.2.format[unstructured]">

:::warning
The Document File Type Format is currently an experimental feature and not subject to SLAs. Use at your own risk.
:::
The Document File Type Format is a special format that allows you to extract text from Markdown, TXT, PDF, Word and PowerPoint documents. If selected, the connector will extract text from the documents and output it as a single field named content. The document_key field will hold a unique identifier for the processed file which can be used as a primary key. The content of the document will contain markdown formatting converted from the original file format. Each file matching the defined glob pattern needs to be a Markdown (`.md`), PDF (`.pdf`), Word (`.docx`) or PowerPoint (`.pptx`) file.
One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.
This connector utilizes the open source Unstructured library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the Unstructured docs and you can learn about other Unstructured tools and services at www.unstructured.io.
</FieldAnchor>

| Version | Date | Pull Request | Subject |
|---|---|---|---|
| 4.15.2 | 2025-11-11 | 69268 | Update dependencies |
| 4.15.1 | 2025-11-04 | 68844 | Update dependencies |
| 4.15.0 | 2025-10-29 | 68640 | Update dependencies |
| 4.14.6 | 2025-10-21 | 67222 | Update dependencies |
| 4.14.5 | 2025-10-08 | 67494 | Fix utf_8_sig encoding for zip files |
| 4.14.4 | 2025-09-30 | 60547 | Update dependencies |
| 4.14.3 | 2025-09-15 | 66023 | Update to CDK v7 |
| 4.14.2 | 2025-05-22 | 60863 | chore(source-s3): bump base image to 4.0.1 |
| 4.14.1 | 2025-05-10 | 58988 | Update dependencies |
| 4.14.0 | 2025-05-06 | 59685 | Promoting release candidate 4.14.0-rc.1 to a main version. |
| 4.14.0-rc.1 | 2025-05-05 | 57498 | Adapt file-transfer records to latest protocol, requires platform >= 1.7.0, destination-s3 >= 1.8.0 |
| 4.13.5 | 2025-04-19 | 57994 | Update dependencies |
| 4.13.4 | 2025-04-05 | 57485 | Update dependencies |
| 4.13.3 | 2025-03-29 | 56791 | Update dependencies |
| 4.13.2 | 2025-03-22 | 52953 | Update dependencies |
| 4.13.1 | 2025-03-13 | 55694 | Fix bug where csv column name is 'type' |
| 4.13.0 | 2025-03-12 | 55202 | Bump base image to 4.0.0 and also CDK |
| 4.12.2 | 2025-02-14 | 53684 | Added pendulum to the dependencies |
| 4.12.1 | 2025-01-25 | 52509 | Update dependencies |
| 4.12.0 | 2025-01-20 | 52030 | Promoting release candidate 4.12.0-rc.1 to a main version. |
| 4.12.0-rc.1 | 2025-01-15 | 51474 | Bump cdk to have preserve subdirectories (default) in copy raw files functionality |
| 4.11.4 | 2025-01-11 | 51370 | Update dependencies |
| 4.11.3 | 2025-01-04 | 50932 | Update dependencies |
| 4.11.2 | 2024-12-28 | 50739 | Update dependencies |
| 4.11.1 | 2024-12-21 | 49042 | Update dependencies |
| 4.11.0 | 2024-12-17 | 49824 | Increase file size limit to 1.5GB |
| 4.10.2 | 2024-11-25 | 48613 | Starting with this version, the Docker image is now rootless. Please note that this and future versions will not be compatible with Airbyte versions earlier than 0.64 |
| 4.10.1 | 2024-11-12 | 48346 | Implement file-transfer capabilities |
| 4.9.2 | 2024-11-04 | 48259 | Update dependencies |
| 4.9.1 | 2024-10-29 | 47038 | Update dependencies |
| 4.9.0 | 2024-10-17 | 46973 | Promote release candidate. |
| 4.9.0-rc.1 | 2024-10-14 | 46298 | Migrate to CDK v5 |
| 4.8.5 | 2024-10-12 | 46511 | Update dependencies |
| 4.8.4 | 2024-09-28 | 46131 | Update dependencies |
| 4.8.3 | 2024-09-21 | 45757 | Update dependencies |
| 4.8.2 | 2024-09-14 | 45504 | Update dependencies |
| 4.8.1 | 2024-09-07 | 45257 | Update dependencies |
| 4.8.0 | 2024-09-03 | 44908 | Migrate to CDK v3 |
| 4.7.8 | 2024-08-31 | 45009 | Update dependencies |
| 4.7.7 | 2024-08-24 | 44732 | Update dependencies |
| 4.7.6 | 2024-08-19 | 44380 | Update dependencies |
| 4.7.5 | 2024-08-12 | 43868 | Update dependencies |
| 4.7.4 | 2024-08-10 | 43667 | Update dependencies |
| 4.7.3 | 2024-08-03 | 43083 | Update dependencies |
| 4.7.2 | 2024-07-27 | 42814 | Update dependencies |
| 4.7.1 | 2024-07-20 | 42205 | Update dependencies |
| 4.7.0 | 2024-07-16 | 41934 | Update to 3.5.1 CDK |
| 4.6.3 | 2024-07-13 | 41934 | Update dependencies |
| 4.6.2 | 2024-07-10 | 41503 | Update dependencies |
| 4.6.1 | 2024-07-09 | 40067 | Update dependencies |
| 4.6.0 | 2024-06-26 | 39573 | Improve performance: update to Airbyte CDK 2.0.0 |
| 4.5.17 | 2024-06-06 | 39214 | [autopull] Upgrade base image to v1.2.2 |
| 4.5.16 | 2024-05-29 | 38674 | Avoid error on empty stream when running discover |
| 4.5.15 | 2024-05-20 | 38252 | Replace AirbyteLogger with logging.Logger |
| 4.5.14 | 2024-05-09 | 38090 | Bump python-cdk version to include CSV field length fix |
| 4.5.13 | 2024-05-03 | 37776 | Update airbyte-cdk to fix the discovery command issue |
| 4.5.12 | 2024-04-11 | 37001 | Update airbyte-cdk to flush print buffer for every message |
| 4.5.11 | 2024-03-14 | 36160 | Bump python-cdk version to include CSV tab delimiter fix |
| 4.5.10 | 2024-03-11 | 35955 | Pin transformers transitive dependency |
| 4.5.9 | 2024-03-06 | 35857 | Bump poetry.lock to upgrade transitive dependency |
| 4.5.8 | 2024-03-04 | 35808 | Use cached AWS client |
| 4.5.7 | 2024-02-23 | 34895 | Run incremental syncs with concurrency |
| 4.5.6 | 2024-02-21 | 35246 | Fixes bug that occurred when creating CSV streams with tab delimiter. |
| 4.5.5 | 2024-02-18 | 35392 | Add support filtering by start date |
| 4.5.4 | 2024-02-15 | 35055 | Temporarily revert concurrency |
| 4.5.3 | 2024-02-12 | 35164 | Manage dependencies with Poetry. |
| 4.5.2 | 2024-02-06 | 34930 | Bump CDK version to fix issue when SyncMode is missing from catalog |
| 4.5.1 | 2024-02-02 | 31701 | Add region support |
| 4.5.0 | 2024-02-01 | 34591 | Run full refresh syncs concurrently |
| 4.4.1 | 2024-01-30 | 34665 | Pin moto & CDK version |
| 4.4.0 | 2024-01-12 | 33818 | Add IAM Role Authentication |
| 4.3.1 | 2024-01-04 | 33937 | Prepare for airbyte-lib |
| 4.3.0 | 2023-12-14 | 33411 | Bump CDK version to auto-set primary key for document file streams and support raw txt files |
| 4.2.4 | 2023-12-06 | 33187 | Bump CDK version to hide source-defined primary key |
| 4.2.3 | 2023-11-16 | 32608 | Improve document file type parser |
| 4.2.2 | 2023-11-20 | 32677 | Only read files with ".zip" extension as zipped files |
| 4.2.1 | 2023-11-13 | 32357 | Improve spec schema |
| 4.2.0 | 2023-11-02 | 32109 | Fix docs; add HTTPS validation for S3 endpoint; fix coverage |
| 4.1.4 | 2023-10-30 | 31904 | Update CDK |
| 4.1.3 | 2023-10-25 | 31654 | Reduce image size |
| 4.1.2 | 2023-10-23 | 31383 | Add handling NoSuchBucket error |
| 4.1.1 | 2023-10-19 | 31601 | Base image migration: remove Dockerfile and use the python-connector-base image |
| 4.1.0 | 2023-10-17 | 31340 | Add reading files inside zip archive |
| 4.0.5 | 2023-10-16 | 31209 | Add experimental Markdown/PDF/Docx file format |
| 4.0.4 | 2023-09-18 | 30476 | Remove streams.*.file_type from source-s3 configuration |
| 4.0.3 | 2023-09-13 | 30387 | Bump Airbyte-CDK version to improve messages for record parse errors |
| 4.0.2 | 2023-09-07 | 28639 | Always show S3 Key fields |
| 4.0.1 | 2023-09-06 | 30217 | Migrate inference error to config errors and avoid Sentry alerts |
| 4.0.0 | 2023-09-05 | 29757 | New version using file-based CDK |
| 3.1.11 | 2023-08-30 | 29986 | Add config error for conversion error |
| 3.1.10 | 2023-08-29 | 29943 | Add config error for arrow invalid error |
| 3.1.9 | 2023-08-23 | 29753 | Feature parity update for V4 release |
| 3.1.8 | 2023-08-17 | 29520 | Update legacy state and error handling |
| 3.1.7 | 2023-08-17 | 29505 | v4 StreamReader and Cursor fixes |
| 3.1.6 | 2023-08-16 | 29480 | Update Pyarrow to version 12.0.1 |
| 3.1.5 | 2023-08-15 | 29418 | Avoid duplicate syncs when migrating from v3 to v4 |
| 3.1.4 | 2023-08-15 | 29382 | Handle legacy path prefix & path pattern |
| 3.1.3 | 2023-08-05 | 29028 | Update v3 & v4 connector to handle either state message |
| 3.1.2 | 2023-07-29 | 28786 | Add a codepath for using the file-based CDK |
| 3.1.1 | 2023-07-26 | 28730 | Add human-readable error message and improve validation for encoding field when it is empty |
| 3.1.0 | 2023-06-26 | 27725 | License Update: Elv2 |
| 3.0.3 | 2023-06-23 | 27651 | Handle Bucket Access Errors |
| 3.0.2 | 2023-06-22 | 27611 | Fix start date |
| 3.0.1 | 2023-06-22 | 27604 | Add logging for file reading |
| 3.0.0 | 2023-05-02 | 25127 | Remove ab_additional column; Use platform-handled schema evolution |
| 2.2.0 | 2023-05-10 | 25937 | Add support for Parquet Dataset |
| 2.1.4 | 2023-05-01 | 25361 | Parse nested avro schemas |
| 2.1.3 | 2023-05-01 | 25706 | Remove minimum block size for CSV check |
| 2.1.2 | 2023-04-18 | 25067 | Handle block size related errors; fix config validator |
| 2.1.1 | 2023-04-18 | 25010 | Refactor filter logic |
| 2.1.0 | 2023-04-10 | 25010 | Add start_date field to filter files based on LastModified option |
| 2.0.4 | 2023-03-23 | 24429 | Call check with a little block size to save time and memory. |
| 2.0.3 | 2023-03-17 | 24178 | Support legacy datetime format for the period of migration, fix time-zone conversion. |
| 2.0.2 | 2023-03-16 | 24157 | Return empty schema if discover finds no files; Do not infer extra data types when user defined schema is applied. |
| 2.0.1 | 2023-03-06 | 23195 | Fix datetime format string |
| 2.0.0 | 2023-03-14 | 23189 | Infer schema based on one file instead of all the files |
| 1.0.2 | 2023-03-02 | 23669 | Made Advanced Reader Options and Advanced Options truly optional for CSV format |
| 1.0.1 | 2023-02-27 | 23502 | Fix error handling |
| 1.0.0 | 2023-02-17 | 23198 | Fix Avro schema discovery |
| 0.1.32 | 2023-02-07 | 22500 | Speed up discovery |
| 0.1.31 | 2023-02-08 | 22550 | Validate CSV read options and convert options |
| 0.1.30 | 2023-01-25 | 21587 | Make sure spec works as expected in UI |
| 0.1.29 | 2023-01-19 | 21604 | Handle OSError: skip unreachable keys and keep working on accessible ones. Warn a customer |
| 0.1.28 | 2023-01-10 | 21210 | Update block size for json file format |
| 0.1.27 | 2022-12-08 | 20262 | Check config settings for CSV file format |
| 0.1.26 | 2022-11-08 | 19006 | Add virtual-hosted-style option |
| 0.1.24 | 2022-10-28 | 18602 | Wrap errors into AirbyteTracedException pointing to a problem file |
| 0.1.23 | 2022-10-10 | 17800 | Deleted use_ssl and verify_ssl_cert flags and hardcoded to True |
| 0.1.23 | 2022-10-10 | 17991 | Fix pyarrow to JSON schema type conversion for arrays |
| 0.1.22 | 2022-09-28 | 17304 | Migrate to per-stream state |
| 0.1.21 | 2022-09-20 | 16921 | Upgrade pyarrow |
| 0.1.20 | 2022-09-12 | 16607 | Fix for reading jsonl files containing nested structures |
| 0.1.19 | 2022-09-13 | 16631 | Adjust column type to a broadest one when merging two or more json schemas |
| 0.1.18 | 2022-08-01 | 14213 | Add support for jsonl format files. |
| 0.1.17 | 2022-07-21 | 14911 | "decimal" type added for parquet |
| 0.1.16 | 2022-07-13 | 14669 | Fixed bug where extra columns appeared to be non-present in master schema |
| 0.1.15 | 2022-05-31 | 12568 | Fixed possible case of files being missed during incremental syncs |
| 0.1.14 | 2022-05-23 | 11967 | Increase unit test coverage up to 90% |
| 0.1.13 | 2022-05-11 | 12730 | Fixed empty options issue |
| 0.1.12 | 2022-05-11 | 12602 | Added support for Avro file format |
| 0.1.11 | 2022-04-30 | 12500 | Improve input configuration copy |
| 0.1.10 | 2022-01-28 | 8252 | Refactoring of files' metadata |
| 0.1.9 | 2022-01-06 | 9163 | Workaround for web UI: backslash-t converts to tab for the format.delimiter field. |
| 0.1.7 | 2021-11-08 | 7499 | Remove base-python dependencies |
| 0.1.6 | 2021-10-15 | 6615 & 7058 | Memory and performance optimisation. Advanced options for CSV parsing. |
| 0.1.5 | 2021-09-24 | 6398 | Support custom non-Amazon S3 services |
| 0.1.4 | 2021-08-13 | 5305 | Support of Parquet format |
| 0.1.3 | 2021-08-04 | 5197 | Fixed bug where sync could hang indefinitely on schema inference |
| 0.1.2 | 2021-08-02 | 5135 | Fixed bug in spec so it displays in UI correctly |
| 0.1.1 | 2021-07-30 | 4990 | Fixed documentation url in source definition |
| 0.1.0 | 2021-07-30 | 4990 | Created S3 source connector |