Google Drive connector

The google_drive connector provides utilities for reading files from Google Drive using a service account.

```python
from cocoindex.connectors import google_drive
```

:::note[Dependencies]
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[google_drive]
```

:::

As source

The connector provides two ways to read from Google Drive:

  • GoogleDriveSource — high-level source class with async iteration
  • list_files() — lower-level function returning a sync iterator

Both require a Google service account with access to the target Drive folders.

Setting up a service account

  1. Create a service account in the Google Cloud Console
  2. Download the JSON credential file
  3. Share the target Drive folders with the service account's email address
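
Before sharing folders in step 3, it can help to confirm that the downloaded file really is a service account key and to look up the email address to share with. The following is a minimal sketch using only the standard library. It assumes the usual key layout with `type` and `client_email` fields; the helper name `service_account_email` is illustrative and not part of CocoIndex.

```python
import json
from pathlib import Path


def service_account_email(credential_path: str) -> str:
    """Return the client_email stored in a service account JSON key file.

    Hypothetical helper (not part of CocoIndex). Assumes the standard key
    layout, where "type" is "service_account" and "client_email" is set.
    """
    data = json.loads(Path(credential_path).read_text(encoding="utf-8"))
    if data.get("type") != "service_account":
        raise ValueError(f"{credential_path} is not a service account key")
    email = data.get("client_email")
    if not email:
        raise ValueError(f"{credential_path} has no client_email field")
    return email
```

Each target folder would then be shared with the returned address.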

:::note[Google Workspace CLI]
gws is an optional, unofficial Google Workspace CLI. It is actively developed and subject to change, but can be useful for exploring or validating Drive API access before configuring CocoIndex's service-account flow. For example:

```bash
gws auth setup
gws auth login
gws drive files list
```

In headless or agent workflows, gws can also read credentials from GOOGLE_WORKSPACE_CLI_CREDENTIALS_FILE. CocoIndex still expects the service account JSON path in service_account_credential_path; use the gws credentials setting for gws commands themselves.
:::

GoogleDriveSource

The primary source class for iterating over Google Drive files.

```python
class GoogleDriveSource(
    *,
    service_account_credential_path: str,
    root_folder_ids: Sequence[str],
    mime_types: Sequence[str] | None = None,
)
```

Parameters:

  • service_account_credential_path — Path to the service account JSON credential file.
  • root_folder_ids — List of Google Drive folder IDs to scan. Subfolders are traversed recursively.
  • mime_types — Optional list of MIME types to include. If None, all file types are included.
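
Folder IDs for root_folder_ids are usually copied out of a folder's browser URL. As a convenience, the sketch below pulls the ID out of such a URL. It assumes the common URL shape `https://drive.google.com/drive/folders/<FOLDER_ID>`; the helper `folder_id_from_url` is hypothetical and not part of the connector.

```python
from urllib.parse import urlparse


def folder_id_from_url(url: str) -> str:
    """Extract the folder ID from a Google Drive folder URL.

    Hypothetical helper (not part of CocoIndex). Assumes the path contains
    a "folders" segment immediately followed by the folder ID; any query
    string (for example ?usp=sharing) is ignored.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    try:
        return parts[parts.index("folders") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no folder ID found in {url!r}") from None
```

The result can be passed directly as an element of root_folder_ids.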

Iterating files

GoogleDriveSource provides async iteration via files(), yielding DriveFile objects (implementing the FileLike base class):

```python
source = google_drive.GoogleDriveSource(
    service_account_credential_path="./credentials.json",
    root_folder_ids=["1abc...xyz"],
)

async for file in source.files():
    text = await file.read_text()
    ...
```

Keyed iteration with items()

items() yields (str, DriveFile) pairs, where the key is the file's name path. This is useful with mount_each():

```python
async for key, file in source.items():
    content = await file.read()
```

Filtering by MIME type

Use mime_types to restrict which files are returned:

```python
source = google_drive.GoogleDriveSource(
    service_account_credential_path="./credentials.json",
    root_folder_ids=["1abc...xyz"],
    mime_types=["application/pdf", "text/plain"],
)
```

Google Workspace files (Docs, Sheets, Slides) are automatically exported:

| Google Workspace type | Exported as |
| --- | --- |
| Google Docs | Plain text |
| Google Sheets | CSV |
| Google Slides | Plain text |
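
Downstream code often needs to know what kind of content to expect after export. The sketch below encodes the export table as a lookup. The Workspace MIME types are the ones the Drive API uses; the export MIME types (`text/plain`, `text/csv`) are an assumption about what "Plain text" and "CSV" mean here, and the names `ASSUMED_EXPORT_MIME` and `expected_content_type` are illustrative rather than part of the connector.

```python
# Google Workspace MIME types as used by the Drive API.
GOOGLE_DOC = "application/vnd.google-apps.document"
GOOGLE_SHEET = "application/vnd.google-apps.spreadsheet"
GOOGLE_SLIDES = "application/vnd.google-apps.presentation"

# Assumption: "Plain text" and "CSV" correspond to these MIME types.
# The connector's internal mapping may differ.
ASSUMED_EXPORT_MIME = {
    GOOGLE_DOC: "text/plain",
    GOOGLE_SHEET: "text/csv",
    GOOGLE_SLIDES: "text/plain",
}


def expected_content_type(source_mime: str) -> str:
    """Return the MIME type to expect when reading a file's content.

    Workspace files map to their assumed export format; every other
    MIME type is returned unchanged.
    """
    return ASSUMED_EXPORT_MIME.get(source_mime, source_mime)
```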

list_files

A lower-level sync iterator for listing files:

```python
def list_files(spec: GoogleDriveSourceSpec) -> Iterator[DriveFile]
```

Parameters:

  • spec — A GoogleDriveSourceSpec with the same fields as GoogleDriveSource constructor parameters.

Returns: A sync iterator of DriveFile objects.
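
Because list_files() returns a sync iterator, consuming it directly inside a coroutine would presumably block the event loop while the listing runs. One possible workaround, sketched here with only the standard library, is to drain the iterator in a worker thread. The helper `collect_in_thread` is hypothetical and not part of CocoIndex, and it materializes the whole listing in memory.

```python
import asyncio
from collections.abc import Callable, Iterator
from typing import TypeVar

T = TypeVar("T")


async def collect_in_thread(make_iterator: Callable[[], Iterator[T]]) -> list[T]:
    """Drain a blocking iterator in a worker thread and return its items.

    The iterator is created and consumed inside the thread, so the event
    loop stays free while the listing runs.
    """
    return await asyncio.to_thread(lambda: list(make_iterator()))


# Usage with the connector (assumes `spec` is a GoogleDriveSourceSpec):
#     files = await collect_in_thread(lambda: google_drive.list_files(spec))
```

For async code, GoogleDriveSource.files() remains the more direct option; this pattern only matters when list_files() is the entry point already in use.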

DriveFile

DriveFile implements FileLike with Google Drive-specific behavior:

  • file_path — A DriveFilePath where resolve() returns the Google Drive file ID.
  • read() / read_text() — Downloads file content via the Google Drive API. Partial reads (size parameter) are not supported.
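
Since partial reads are not supported, code that only needs the start of a file (for example, to sniff a header) has to download the full content and slice it locally. A minimal sketch of that pattern follows. It assumes read() returns the whole file as bytes; `AsyncReadable` and `read_prefix` are illustrative names, not part of the connector.

```python
from typing import Protocol


class AsyncReadable(Protocol):
    """Any object with an awaitable read() that returns the full content."""

    async def read(self) -> bytes: ...


async def read_prefix(file: AsyncReadable, size: int) -> bytes:
    """Return the first `size` bytes of a file.

    Hypothetical helper: the whole file is still downloaded, and only the
    slice is returned, so this saves memory downstream but not bandwidth.
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    content = await file.read()
    return content[:size]
```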

Example

```python
import cocoindex as coco
from cocoindex.connectors import google_drive
from cocoindex.resources.file import FileLike

@coco.fn(memo=True)
async def process_file(file: FileLike) -> None:
    text = await file.read_text()
    # ... process the file content ...

@coco.fn
async def app_main(credential_path: str, folder_ids: list[str]) -> None:
    source = google_drive.GoogleDriveSource(
        service_account_credential_path=credential_path,
        root_folder_ids=folder_ids,
    )

    with coco.component_subpath("file"):
        async for key, file in source.items():
            await coco.mount(
                coco.component_subpath(key),
                process_file,
                file,
            )

app = coco.App(
    "GoogleDriveIngestion",
    app_main,
    credential_path="./credentials.json",
    folder_ids=["1abc...xyz"],
)
```