docs/2.developers/7.templates/39.yaml-snippets/10.data-sources-examples.md
YAML configuration files can specify the data sources that are read and indexed in the RAG.
Because the data sources are usually used in a `DocumentStore`, the resulting tables must contain a `data` column of type `bytes`.
Data sources are usually defined as a list of connectors in a parameter named `$sources`. Mind the `$`: other components in the YAML file reference this parameter by that name.
```yaml
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true
  - !pw.io.csv.read
    path: csv_files
    with_metadata: false
```
For each connector, you need to specify all the necessary parameters. The Pathway connectors documentation lists all available connectors, explains how they work, and describes their parameters.
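To illustrate how `$sources` is picked up by another component, here is a minimal sketch of a configuration that passes it to a `DocumentStore`. The retriever choice (`TantivyBM25Factory`) and the variable names `$retriever_factory` and `$document_store` are assumptions made for this example; parsers, splitters, and the rest of a real pipeline are left out.

```yaml
# Sketch only: component choices below are illustrative assumptions.
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

# Assumed retriever; another retriever factory could be used instead.
$retriever_factory: !pw.stdlib.indexing.TantivyBM25Factory

# The DocumentStore receives the list of connectors through `docs`.
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  retriever_factory: $retriever_factory
```

Because the store reads the `data` column of every table in the list, each connector in `$sources` should be configured to produce binary content.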
::openable-list
#title
File System
#description
Read data from your file system.
#content
You can use the File System connector to read data from your file system.
Although the File System connector supports several basic formats, such as plaintext, CSV, and JSON, the `DocumentStore` requires the data in `binary` format.
In this case, the table consists of a single column `data`, where each cell contains the entire contents of one file.
```yaml
$sources:
  - !pw.io.fs.read
    path: data            # Path to the data directory
    format: binary        # Format of the data to be read
    with_metadata: true   # Include metadata in the data
```
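As a variation, the sketch below restricts the connector to PDF files and reads them once instead of watching the directory. It assumes that `path` accepts a glob pattern and that `mode: static` is supported by `pw.io.fs.read`; neither option appears in the example above, so check them against the connector's API reference.

```yaml
$sources:
  - !pw.io.fs.read
    path: "data/*.pdf"    # Assumed: glob pattern selecting only PDF files
    format: binary        # Still required by the DocumentStore
    with_metadata: true
    mode: static          # Assumed: read the files once, without watching for changes
```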
::

::openable-list
#title
SharePoint
#description
Read your data directly from SharePoint.
#content
The Pathway SharePoint connector is available only with one of the following licenses: Pathway Scale or Pathway Enterprise.
The connector returns a table with a single column `data` containing each file in binary format.
```yaml
$sources:
  - !pw.xpacks.connectors.sharepoint.read
    url: $SHAREPOINT_URL                 # URL of the SharePoint site
    tenant: $SHAREPOINT_TENANT           # Tenant ID for SharePoint
    client_id: $SHAREPOINT_CLIENT_ID     # Client ID for authentication
    cert_path: sharepointcert.pem        # Path to the certificate file
    thumbprint: $SHAREPOINT_THUMBPRINT   # Thumbprint of the certificate
    root_path: $SHAREPOINT_ROOT          # Root path in SharePoint
    with_metadata: true                  # Include metadata in the data
    refresh_interval: 30                 # Interval to refresh data (in seconds)
```
::
::openable-list
#title
Google Drive
#description
Connect to your documents on Google Drive using the Pathway Google Drive connector.
#content
To use the Pathway Google Drive connector, you need a Google Cloud project and a service user. The Google Drive connector documentation explains how to set these up.
The connector returns a table with a single column `data` containing each file in binary format.
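```yaml
$sources:
  - !pw.io.gdrive.read
    object_id: $DRIVE_ID                                 # ID of the Google Drive folder or file
    service_user_credentials_file: gdrive_indexer.json   # Service user credentials file
    file_name_pattern:                                   # Only read files matching these patterns
      - "*.pdf"
      - "*.pptx"
    object_size_limit: null                              # No limit on file size
    with_metadata: true                                  # Include metadata in the data
    refresh_interval: 30                                 # Interval to refresh data (in seconds)
```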
::
::openable-list
#title
S3
#description
Connect to your data stored on S3.
#content
To use the Pathway S3 connector, you need to configure the connection to S3 using `AwsS3Settings`.
For the RAG, you need to set `format` to `binary`. In this case, the connector returns a table with a single column `data` containing each file in binary format.
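```yaml
$sources:
  - !pw.io.s3.read
    path: $path                                # Path to the objects inside the bucket
    format: "binary"                           # Format of the data to be read
    aws_s3_settings: !pw.io.s3.AwsS3Settings   # Connection settings for S3
      bucket_name: $bucket
      region: "eu-west-3"
      access_key: $s3_access_key
      secret_access_key: $s3_secret_access_key
```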
::
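Since `$sources` is a list, several of the connectors above can be combined in one configuration. The following sketch reuses only parameters shown earlier on this page; the particular combination (a local folder plus a Google Drive folder limited to PDF files) is an arbitrary choice for illustration.

```yaml
$sources:
  # Local folder, read as raw bytes
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true
  # Google Drive folder, restricted to PDF files
  - !pw.io.gdrive.read
    object_id: $DRIVE_ID
    service_user_credentials_file: gdrive_indexer.json
    file_name_pattern:
      - "*.pdf"
    with_metadata: true
    refresh_interval: 30
```

Both entries use the same `with_metadata` value here so that the resulting tables have the same columns.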