Back to Claude Scientific Skills

SQL Query Patterns for IDC

scientific-skills/imaging-data-commons/references/sql_patterns.md

2.38.09.8 KB
Original Source

SQL Query Patterns for IDC

Tested with: idc-index 0.11.14 (IDC data version v23)

Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.

When to Use This Guide

Load this guide when you need quick-reference SQL patterns for:

  • Discovering available filter values (modalities, body parts, manufacturers)
  • Finding annotations and segmentations across collections
  • Querying slide microscopy and annotation data
  • Estimating download sizes before download
  • Linking imaging data to clinical data
  • Filtering by 3D volume geometry validity (volume_geometry_index)
  • Finding RT Structure Set series and ROI metadata (rtstruct_index)

For table schemas, DataFrame access, and join column references, see references/index_tables_guide.md.

Prerequisites

bash
pip install --upgrade idc-index
python
from idc_index import IDCClient
client = IDCClient()

Discover Available Filter Values

python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")

# What body parts for a specific modality?
client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
""")

# What manufacturers for MR?
client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
""")

Find Annotations and Segmentations

Note: Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.

python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
""")

# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")

# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
""")

# Find analysis results for a specific source collection
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
""")

# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")

# Get segmentation statistics by algorithm
client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
""")

# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
""")

# Find TotalSegmentator results with source image context
client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
""")

# Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations
# ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AlgorithmName IS NOT NULL
    LIMIT 10
""")
# See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more

Query Slide Microscopy and Annotation Data

Use sm_index for slide microscopy metadata and ann_index/ann_group_index for annotations on slides (DICOM ANN objects). Filter annotation groups by AnnotationGroupLabel to find annotations by name.

python
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")

# Example: find annotation groups by label within a collection
client.sql_query("""
    SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
    FROM ann_group_index g
    JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'your_collection_id'
      AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
""")

See references/digital_pathology_guide.md for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples.

Estimate Download Size

python
# Size for specific criteria
client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
python
client.fetch_index("clinical_index")

# Find collections with clinical data and their tables
client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
""")

See references/clinical_data_guide.md for complete patterns including value mapping and patient cohort selection.

Troubleshooting

Issue: Query returns error "table not found"

  • Cause: Index not fetched before query
  • Solution: Call client.fetch_index("table_name") before using tables other than the primary index

Issue: LIKE pattern not matching expected results

  • Cause: Case sensitivity or whitespace
  • Solution: Use LOWER(column) for case-insensitive matching, TRIM() for whitespace

Issue: JOIN returns fewer rows than expected

  • Cause: NULL values in join columns or no matching records
  • Solution: Use LEFT JOIN to include rows without matches, check for NULLs with IS NOT NULL

Volume Geometry Validation

volume_geometry_index covers single-frame CT, MR, and PT series. Fetch it before querying.

python
client.fetch_index("volume_geometry_index")

# Series that form a regularly-spaced 3D volume (no resampling needed)
client.sql_query("""
    SELECT i.collection_id, i.SeriesInstanceUID, i.BodyPartExamined,
           v.obliquity_degrees
    FROM index i
    JOIN volume_geometry_index v ON i.SeriesInstanceUID = v.SeriesInstanceUID
    WHERE i.Modality = 'CT'
      AND v.regularly_spaced_3d_volume = TRUE
    LIMIT 10
""")

# Fraction of 3D-valid CT per collection
client.sql_query("""
    SELECT i.collection_id,
           COUNT(*) as total_ct,
           SUM(CASE WHEN v.regularly_spaced_3d_volume THEN 1 ELSE 0 END) as valid_3d,
           ROUND(100.0 * SUM(CASE WHEN v.regularly_spaced_3d_volume THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_valid
    FROM index i
    JOIN volume_geometry_index v ON i.SeriesInstanceUID = v.SeriesInstanceUID
    WHERE i.Modality = 'CT'
    GROUP BY i.collection_id
    ORDER BY total_ct DESC
    LIMIT 10
""")

Key columns: regularly_spaced_3d_volume (composite flag), obliquity_degrees (0 = pure axial/sagittal/coronal), plus individual boolean checks: single_orientation, orthogonal_orientation, unique_slice_positions, consistent_pixel_spacing, consistent_image_dimensions, uniform_slice_spacing.

RT Structure Sets

rtstruct_index has one row per RTSTRUCT series. Array columns (ROINames, ROIGenerationAlgorithms, RTROIInterpretedTypes) are stored as strings.

python
client.fetch_index("rtstruct_index")

# RTSTRUCT series with ROI counts and names
client.sql_query("""
    SELECT i.collection_id, i.SeriesInstanceUID,
           r.total_rois, r.ROINames, r.RTROIInterpretedTypes,
           r.referenced_SeriesInstanceUID
    FROM index i
    JOIN rtstruct_index r ON i.SeriesInstanceUID = r.SeriesInstanceUID
    LIMIT 10
""")

# Collections with the most RTSTRUCT series
client.sql_query("""
    SELECT i.collection_id,
           COUNT(*) as rtstruct_series,
           ROUND(AVG(r.total_rois), 1) as avg_rois
    FROM index i
    JOIN rtstruct_index r ON i.SeriesInstanceUID = r.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY rtstruct_series DESC
    LIMIT 10
""")

# Find source CT series for a given RTSTRUCT
client.sql_query("""
    SELECT r.SeriesInstanceUID as rtstruct_uid,
           r.total_rois, r.ROINames,
           src.SeriesInstanceUID as source_ct_uid,
           src.collection_id, src.BodyPartExamined
    FROM rtstruct_index r
    JOIN index src ON r.referenced_SeriesInstanceUID = src.SeriesInstanceUID
    LIMIT 10
""")

Resources

  • references/index_tables_guide.md for table schemas, DataFrame access, and join column references
  • references/clinical_data_guide.md for clinical data patterns and value mapping
  • references/digital_pathology_guide.md for pathology-specific queries
  • references/bigquery_guide.md for advanced queries requiring full DICOM metadata
  • references/parquet_access_guide.md for direct Parquet queries without installing idc-index