Data Sources YAML Examples

The YAML configuration files can be used to specify the data sources from which the data will be read to be indexed in the RAG. Because the data sources are usually used in a DocumentStore, the resulting tables must contain a data column of type bytes. Usually, the data sources are defined in a parameter $sources (mind the $, this parameter will be used in the YAML by other components) as a list of connectors.

yaml

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true
  - !pw.io.csv.read
    path: csv_files
    with_metadata: false

For each connector you need to specify all the necessary parameters. You can find all the connectors and learn about how they work and their associated parameters here.

::openable-list #title

File System

#description Read data from your file system. #content You can use the File System connector to read data from your file system. While the File System connector allows different basic formats, such as plaintext, CSV, and JSON, the Document Store requires the data to be in a binary format. In this case, the table will consist of a single column data with each cell containing the contents of the whole file.

yaml

$sources:
  - !pw.io.fs.read
    path: data                   # Path to the data directory
    format: binary               # Format of the data to be read
    with_metadata: true          # Include metadata in the data

:: ::openable-list #title

SharePoint

#description Read your data directly from SharePoint. #content The Pathway SharePoint connector is available when using one of the following licenses only: Pathway Scale, Pathway Enterprise. The connector will return a table with a single column data containing each file in a binary format.

yaml

$sources:
  - !pw.xpacks.connectors.sharepoint.read 
    url: $SHAREPOINT_URL               # URL of the SharePoint site
    tenant: $SHAREPOINT_TENANT         # Tenant ID for SharePoint
    client_id: $SHAREPOINT_CLIENT_ID   # Client ID for authentication
    cert_path: sharepointcert.pem      # Path to the certificate file
    thumbprint: $SHAREPOINT_THUMBPRINT # Thumbprint of the certificate
    root_path: $SHAREPOINT_ROOT        # Root path in SharePoint
    with_metadata: true                # Include metadata in the data
    refresh_interval: 30               # Interval to refresh data (in seconds)

::openable-list #title

Google Drive

#description Connect to your documents on Google Drive using the Pathway Google Drive Connector

#content To use the Pathway Google Drive connector, you need a Google Cloud project and a service user: you can learn more about how to set this up here.

The connector will return a table with a single column data containing each file in a binary format.

yaml

$sources:
  - !pw.io.gdrive.read
    object_id: $DRIVE_ID
    service_user_credentials_file: gdrive_indexer.json
    file_name_pattern:
      - "*.pdf"
      - "*.pptx"
    object_size_limit: null
    with_metadata: true
    refresh_interval: 30

::openable-list #title

S3

#description Connect to your data stored on S3. #content To use the Pathway S3 connector, you need to configure the connection to S3 using AwsS3Settings. For the RAG, you need to configure the format to binary. In this case, the connector will return a table with a single column data containing each file in a binary format.

yaml

$sources:
  - !pw.io.s3.read
    path: $path
    format: "binary"
    aws_s3_setting: !pw.io.s3.AwsS3Settings
      bucket_name: $bucket
      region: "eu-west-3"
      access_key: $s3_access_key
      secret_access_key: $s3_secret_access_key

Data Sources Examples

Data Sources YAML Examples

File System

SharePoint

Google Drive

S3