# Cloud storage
Polars can read and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.
To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:
=== ":fontawesome-brands-python: Python"
```shell
$ pip install fsspec s3fs adlfs gcsfs
```
=== ":fontawesome-brands-rust: Rust"
```shell
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full
```
Polars supports reading Parquet, CSV, IPC and NDJSON files from cloud storage:
{{code_block('user-guide/io/cloud-storage','read_parquet',['read_parquet','read_csv','read_ipc'])}}
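For instance, an eager read from S3 might look like the following sketch (bucket and file names are placeholders; the equivalent `gs://` and `az://` URLs work the same way):

```python
import polars as pl

# Hypothetical bucket and object names.
df = pl.read_parquet("s3://my-bucket/my-file.parquet")

# The other readers follow the same pattern:
# pl.read_csv("s3://my-bucket/my-file.csv")
# pl.read_ipc("gs://my-bucket/my-file.arrow")
# pl.read_ndjson("az://my-container/my-file.ndjson")
```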
Using `pl.scan_*` functions to read from cloud storage can benefit from predicate and projection
pushdown, where the query optimizer applies them before the file is downloaded. This can
significantly reduce the amount of data that needs to be downloaded. The query evaluation is
triggered by calling `collect`.
{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}
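A minimal sketch of such a query (bucket and column names are placeholders); the filter and the column selection are pushed into the scan, so less data should need to be downloaded:

```python
import polars as pl

# Hypothetical dataset: the predicate and the projection are applied at scan
# time, so only the "id" and "value" columns (and matching rows) are fetched.
lf = pl.scan_parquet("s3://my-bucket/my-dataset.parquet")

df = (
    lf.filter(pl.col("value") > 100)  # predicate pushdown
    .select("id", "value")  # projection pushdown
    .collect()  # evaluation is triggered here
)
```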
Polars is able to automatically load default credential configurations for some cloud providers. For cases when this does not happen, it is possible to manually configure the credentials for Polars to use for authentication. This can be done in a few ways:
Using `storage_options`: credentials can be passed as configuration keys in a dict with the `storage_options` parameter:
{{code_block('user-guide/io/cloud-storage','scan_parquet_storage_options_aws',['scan_parquet'])}}
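For example, static AWS credentials might be passed like this (the values are placeholders, and the key names are assumed to follow the object_store configuration names):

```python
import polars as pl

# Placeholder credentials; never hard-code real secrets.
storage_options = {
    "aws_access_key_id": "<access-key-id>",
    "aws_secret_access_key": "<secret-access-key>",
    "aws_region": "us-east-1",
}

df = pl.scan_parquet(
    "s3://my-bucket/my-file.parquet",  # hypothetical path
    storage_options=storage_options,
).collect()
```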
Using one of the available `CredentialProvider*` utility classes: there may be a utility class
`pl.CredentialProvider*` that provides the required authentication functionality. For example,
`pl.CredentialProviderAWS` supports selecting AWS profiles, as well as
assuming an IAM role:
{{code_block('user-guide/io/cloud-storage','credential_provider_class',['scan_parquet', 'CredentialProviderAWS'])}}
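As a sketch, scanning with a named profile and an assumed role might look like this, assuming `profile_name` and `assume_role` keyword arguments (the profile, account and role names are placeholders):

```python
import polars as pl

lf = pl.scan_parquet(
    "s3://my-bucket/my-file.parquet",  # hypothetical path
    credential_provider=pl.CredentialProviderAWS(
        # Hypothetical profile and role; both are optional.
        profile_name="my-profile",
        assume_role={
            "RoleArn": "arn:aws:iam::123456789012:role/my-role",
            "RoleSessionName": "polars-session",
        },
    ),
)

df = lf.collect()
```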
Using a custom `credential_provider` function: for environments that need custom authentication logic, a Python function can be passed and Polars will call it whenever it needs credentials, as sketched below:
{{code_block('user-guide/io/cloud-storage','credential_provider_custom_func',['scan_parquet'])}}
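A rough sketch, assuming the provider returns a tuple of credential keys plus an optional expiry timestamp (or `None` if the credentials do not expire), and using boto3 purely for illustration:

```python
import boto3
import polars as pl


def get_credentials():
    # Fetch credentials however the environment requires; boto3 and the
    # profile name here are purely illustrative.
    session = boto3.Session(profile_name="my-profile")
    creds = session.get_credentials().get_frozen_credentials()

    credential_keys = {
        "aws_access_key_id": creds.access_key,
        "aws_secret_access_key": creds.secret_key,
    }
    if creds.token is not None:  # only present for temporary credentials
        credential_keys["aws_session_token"] = creds.token

    # Credential keys plus an optional expiry (Unix timestamp) or None.
    return credential_keys, None


df = pl.scan_parquet(
    "s3://my-bucket/my-file.parquet",  # hypothetical path
    credential_provider=get_credentials,
).collect()
```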
The same approach can be used for Azure:
{{code_block('user-guide/io/cloud-storage','credential_provider_custom_func_azure',['scan_parquet', 'CredentialProviderAzure'])}}
A credential provider class can also be set as a global default:
{{code_block('user-guide/io/cloud-storage','credential_provider_class_global_default',['scan_parquet', 'CredentialProviderAWS'])}}
Retry behavior for cloud requests can also be configured through `storage_options`:
{{code_block('user-guide/io/cloud-storage','storage_options_retry_configuration',['scan_parquet'])}}
We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets such as Hive-partitioned data.
We first create a PyArrow dataset and then create a LazyFrame from the dataset.
{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',['scan_pyarrow_dataset'])}}
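A minimal sketch, assuming a hypothetical Hive-partitioned Parquet dataset on S3 that PyArrow can access with default credentials:

```python
import pyarrow.dataset as ds

import polars as pl

# Hypothetical Hive-partitioned dataset, e.g. .../year=2024/month=01/*.parquet
dset = ds.dataset(
    "s3://my-bucket/my-dataset/",
    format="parquet",
    partitioning="hive",
)

df = (
    pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("year") == 2024)  # partition columns can be filtered on
    .collect()
)
```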
DataFrames can also be written to cloud storage by passing a cloud URL:
{{code_block('user-guide/io/cloud-storage','write_parquet',['write_parquet'])}}
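For example (placeholder bucket and file names; credentials are resolved the same way as for reads):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [4.0, 5.0, 6.0]})

# Writes directly to the cloud URL.
df.write_parquet("s3://my-bucket/my-file.parquet")
```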
Note that DataFrames can also be written to any Python file object that supports writes. This can
be helpful for performing operations that are not yet natively supported, e.g. writing a compressed
CSV directly to cloud storage:
{{code_block('user-guide/io/cloud-storage','write_file_object',['write_csv'])}}
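A sketch of the gzip-compressed CSV case, assuming `s3fs` is installed and the destination path (a placeholder) is writable:

```python
import gzip

import polars as pl
import s3fs

df = pl.DataFrame({"id": [1, 2, 3], "value": [4.0, 5.0, 6.0]})

fs = s3fs.S3FileSystem()
destination = "s3://my-bucket/my-file.csv.gz"  # hypothetical path

# Open the remote object as a writable file object, wrap it with gzip,
# and let Polars write the CSV into the compressed stream.
with fs.open(destination, "wb") as cloud_file:
    with gzip.open(cloud_file, "wb") as gz:
        df.write_csv(gz)
```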