Back to Claude Scientific Skills

Direct Parquet Access Guide for IDC

scientific-skills/imaging-data-commons/references/parquet_access_guide.md

2.38.07.8 KB
Original Source

Direct Parquet Access Guide for IDC

Tested with: idc-index-data 23.10.1, DuckDB 1.x

All idc-index metadata tables are published as Parquet files to a public GCS bucket with unrestricted CORS access. This enables metadata queries with DuckDB or pandas without installing idc-index — useful for quick exploration or environments where pip install is unavailable.

Limitation: download helpers (download_from_selection()), viewer URLs (get_viewer_URL()), and citation generation require the idc-index client and are not available from raw Parquet files.

When to Use This Guide

Load this guide when you need to:

  • Query IDC metadata without installing idc-index
  • Run ad-hoc DuckDB queries against the latest index files
  • Access volume_geometry_index or rtstruct_index for geometry validation or RT structure queries

For full API access (downloads, viewer, citations), use idc-index as documented in the main SKILL.md.

URL Pattern

https://storage.googleapis.com/idc-index-data-artifacts/current/release_artifacts/{filename}.parquet

current/ always resolves to the latest data release. To pin to a specific version, replace current with the data version number (e.g., 23.10.1).

Available Files

FileApproximate SizeDescription
idc_index.parquet~70 MBPrimary index (all DICOM series metadata)
volume_geometry_index.parquet~5 MB3D geometry validation for CT/MR/PT series
rtstruct_index.parquet~2 MBRT Structure Set ROI metadata
seg_index.parquet~6 MBDICOM Segmentation cross-references
sm_index.parquet~2 MBSlide microscopy series metadata
contrast_index.parquet~1 MBContrast agent metadata
ann_index.parquet~0.2 MBMicroscopy annotation series metadata
ann_group_index.parquet~0.5 MBAnnotation group metadata
collections_index.parquetCollection-level metadata
analysis_results_index.parquetDerived dataset metadata
clinical_index.parquet~0.2 MBClinical data column dictionary
prior_versions_index.parquetSeries from previous IDC releases

Note: the main index file is named idc_index.parquet, not index.parquet. Reference it with an alias in SQL queries (e.g., FROM read_parquet(...) AS index).

Prerequisites

bash
pip install duckdb
# or: uv add duckdb

DuckDB reads Parquet directly from HTTPS URLs using HTTP range requests — no GCS client library or authentication required.

Basic Queries

python
import duckdb

BASE = "https://storage.googleapis.com/idc-index-data-artifacts/current/release_artifacts"

# Discover modalities and series counts
duckdb.sql(f"""
    SELECT Modality, COUNT(*) as series_count, ROUND(SUM(series_size_MB)/1000, 1) as size_GB
    FROM read_parquet('{BASE}/idc_index.parquet')
    GROUP BY Modality
    ORDER BY series_count DESC
""").df()

# Collections with CT data, ordered by size
duckdb.sql(f"""
    SELECT collection_id,
           COUNT(DISTINCT PatientID) as patients,
           COUNT(*) as series,
           ROUND(SUM(series_size_MB)/1000, 1) as size_GB
    FROM read_parquet('{BASE}/idc_index.parquet')
    WHERE Modality = 'CT'
    GROUP BY collection_id
    ORDER BY size_GB DESC
    LIMIT 10
""").df()

Volume Geometry Validation

volume_geometry_index covers single-frame CT, MR, and PT series. Each row has boolean checks for orientation, spacing, dimensions, and slice positions, plus a composite regularly_spaced_3d_volume flag.

python
import duckdb

BASE = "https://storage.googleapis.com/idc-index-data-artifacts/current/release_artifacts"

# CT series that form a valid 3D volume (can be loaded without resampling)
duckdb.sql(f"""
    SELECT i.collection_id, i.SeriesInstanceUID, i.BodyPartExamined,
           v.obliquity_degrees, v.regularly_spaced_3d_volume
    FROM read_parquet('{BASE}/idc_index.parquet') i
    JOIN read_parquet('{BASE}/volume_geometry_index.parquet') v
        ON i.SeriesInstanceUID = v.SeriesInstanceUID
    WHERE i.Modality = 'CT'
      AND v.regularly_spaced_3d_volume = TRUE
    LIMIT 10
""").df()

# Fraction of 3D-valid series per collection and modality
duckdb.sql(f"""
    SELECT i.collection_id, i.Modality,
           COUNT(*) as total,
           SUM(CASE WHEN v.regularly_spaced_3d_volume THEN 1 ELSE 0 END) as valid_3d,
           ROUND(100.0 * SUM(CASE WHEN v.regularly_spaced_3d_volume THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_valid
    FROM read_parquet('{BASE}/idc_index.parquet') i
    JOIN read_parquet('{BASE}/volume_geometry_index.parquet') v
        ON i.SeriesInstanceUID = v.SeriesInstanceUID
    WHERE i.Modality IN ('CT', 'MR', 'PT')
    GROUP BY i.collection_id, i.Modality
    ORDER BY total DESC
    LIMIT 10
""").df()

Key columns in volume_geometry_index:

ColumnTypeDescription
SeriesInstanceUIDSTRINGJoin key
single_orientationBOOLEANAll instances share the same ImageOrientationPatient
orthogonal_orientationBOOLEANOrientation direction cosines are orthogonal
unique_slice_positionsBOOLEANNo duplicate or overlapping slices
consistent_pixel_spacingBOOLEANAll instances share the same PixelSpacing
consistent_image_dimensionsBOOLEANAll instances share the same Rows and Columns
uniform_slice_spacingBOOLEANSpacing between consecutive slices is constant
obliquity_degreesFLOATAngle between slice normal and nearest cardinal axis (0 = pure axial/sagittal/coronal)
regularly_spaced_3d_volumeBOOLEANComposite: TRUE if all checks pass

RT Structure Sets

rtstruct_index has one row per RTSTRUCT series with aggregated ROI metadata.

python
import duckdb

BASE = "https://storage.googleapis.com/idc-index-data-artifacts/current/release_artifacts"

# RTSTRUCT series with ROI details
duckdb.sql(f"""
    SELECT i.collection_id, i.SeriesInstanceUID,
           r.total_rois, r.ROINames, r.RTROIInterpretedTypes,
           r.referenced_SeriesInstanceUID
    FROM read_parquet('{BASE}/idc_index.parquet') i
    JOIN read_parquet('{BASE}/rtstruct_index.parquet') r
        ON i.SeriesInstanceUID = r.SeriesInstanceUID
    WHERE i.Modality = 'RTSTRUCT'
    LIMIT 5
""").df()

# Collections with the most RTSTRUCT series
duckdb.sql(f"""
    SELECT i.collection_id,
           COUNT(*) as rtstruct_series,
           ROUND(AVG(r.total_rois), 1) as avg_rois_per_struct
    FROM read_parquet('{BASE}/idc_index.parquet') i
    JOIN read_parquet('{BASE}/rtstruct_index.parquet') r
        ON i.SeriesInstanceUID = r.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY rtstruct_series DESC
    LIMIT 10
""").df()

Key columns in rtstruct_index:

ColumnTypeDescription
SeriesInstanceUIDSTRINGJoin key (the RTSTRUCT series)
total_roisINTEGERNumber of ROIs in the structure set
ROINamesSTRING (array)Distinct ROI names (e.g., ["GTV", "Heart", "PTV"])
ROIGenerationAlgorithmsSTRING (array)Distinct generation algorithms (e.g., ["AUTOMATIC", "MANUAL"])
RTROIInterpretedTypesSTRING (array)Distinct ROI types (e.g., ["GTV", "ORGAN", "PTV"])
referenced_SeriesInstanceUIDSTRINGSeriesInstanceUID of the referenced source image series

Pinning to a Specific Version

python
import duckdb

# Use a specific data release instead of 'current'
VERSION = "23.10.1"
BASE = f"https://storage.googleapis.com/idc-index-data-artifacts/{VERSION}/release_artifacts"

duckdb.sql(f"SELECT COUNT(*) FROM read_parquet('{BASE}/idc_index.parquet')").df()

Resources