scientific-skills/imaging-data-commons/references/sql_patterns.md
Tested with: idc-index 0.11.14 (IDC data version v23)
Quick reference for common SQL query patterns when working with IDC data. For detailed examples with context, see the "Core Capabilities" section in the main SKILL.md.
Load this guide when you need quick-reference SQL patterns for:
For table schemas, DataFrame access, and join column references, see references/index_tables_guide.md.
pip install --upgrade idc-index
from idc_index import IDCClient
client = IDCClient()
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")
# What body parts for a specific modality?
client.sql_query("""
SELECT DISTINCT BodyPartExamined, COUNT(*) as n
FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
GROUP BY BodyPartExamined ORDER BY n DESC
""")
# What manufacturers for MR?
client.sql_query("""
SELECT DISTINCT Manufacturer, COUNT(*) as n
FROM index WHERE Modality = 'MR'
GROUP BY Manufacturer ORDER BY n DESC
""")
Note: Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
SELECT collection_id, Modality, COUNT(*) as series_count
FROM index
WHERE Modality IN ('SEG', 'RTSTRUCT')
GROUP BY collection_id, Modality
ORDER BY series_count DESC
""")
# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
FROM index
WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")
# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
FROM analysis_results_index
""")
# Find analysis results for a specific source collection
client.sql_query("""
SELECT analysis_result_id, analysis_result_title
FROM analysis_results_index
WHERE Collections LIKE '%tcga_luad%'
""")
# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")
# Get segmentation statistics by algorithm
client.sql_query("""
SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
FROM seg_index
WHERE AlgorithmName IS NOT NULL
GROUP BY AlgorithmName, AlgorithmType
ORDER BY seg_count DESC
LIMIT 10
""")
# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
SELECT
s.SeriesInstanceUID as seg_series,
s.AlgorithmName,
s.total_segments,
s.segmented_SeriesInstanceUID as source_series
FROM seg_index s
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
LIMIT 10
""")
# Find TotalSegmentator results with source image context
client.sql_query("""
SELECT
seg_info.collection_id,
COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
SUM(s.total_segments) as total_segments
FROM seg_index s
JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
GROUP BY seg_info.collection_id
ORDER BY seg_count DESC
""")
# Use ann_index and ann_group_index for Microscopy Bulk Simple Annotations
# ann_group_index has AnnotationGroupLabel, GraphicType, NumberOfAnnotations, AlgorithmName
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
client.sql_query("""
SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations, i.collection_id
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.SeriesInstanceUID = i.SeriesInstanceUID
WHERE g.AlgorithmName IS NOT NULL
LIMIT 10
""")
# See references/digital_pathology_guide.md for AnnotationGroupLabel filtering, SM+ANN joins, and more
Use sm_index for slide microscopy metadata and ann_index/ann_group_index for annotations on slides (DICOM ANN objects). Filter annotation groups by AnnotationGroupLabel to find annotations by name.
client.fetch_index("sm_index")
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
# Example: find annotation groups by label within a collection
client.sql_query("""
SELECT g.AnnotationGroupLabel, g.GraphicType, g.NumberOfAnnotations
FROM ann_group_index g
JOIN index i ON g.SeriesInstanceUID = i.SeriesInstanceUID
WHERE i.collection_id = 'your_collection_id'
AND LOWER(g.AnnotationGroupLabel) LIKE '%keyword%'
""")
See references/digital_pathology_guide.md for SM queries, ANN filtering patterns, SM+ANN cross-references, and join examples.
# Size for specific criteria
client.sql_query("""
SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
FROM index
WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
client.fetch_index("clinical_index")
# Find collections with clinical data and their tables
client.sql_query("""
SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
FROM clinical_index
GROUP BY collection_id, table_name
ORDER BY collection_id
""")
See references/clinical_data_guide.md for complete patterns including value mapping and patient cohort selection.
Issue: Query returns error "table not found"
client.fetch_index("table_name") before using tables other than the primary indexIssue: LIKE pattern not matching expected results
LOWER(column) for case-insensitive matching, TRIM() for whitespaceIssue: JOIN returns fewer rows than expected
LEFT JOIN to include rows without matches, check for NULLs with IS NOT NULLvolume_geometry_index covers single-frame CT, MR, and PT series. Fetch it before querying.
client.fetch_index("volume_geometry_index")
# Series that form a regularly-spaced 3D volume (no resampling needed)
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID, i.BodyPartExamined,
v.obliquity_degrees
FROM index i
JOIN volume_geometry_index v ON i.SeriesInstanceUID = v.SeriesInstanceUID
WHERE i.Modality = 'CT'
AND v.regularly_spaced_3d_volume = TRUE
LIMIT 10
""")
# Fraction of 3D-valid CT per collection
client.sql_query("""
SELECT i.collection_id,
COUNT(*) as total_ct,
SUM(CASE WHEN v.regularly_spaced_3d_volume THEN 1 ELSE 0 END) as valid_3d,
ROUND(100.0 * SUM(CASE WHEN v.regularly_spaced_3d_volume THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_valid
FROM index i
JOIN volume_geometry_index v ON i.SeriesInstanceUID = v.SeriesInstanceUID
WHERE i.Modality = 'CT'
GROUP BY i.collection_id
ORDER BY total_ct DESC
LIMIT 10
""")
Key columns: regularly_spaced_3d_volume (composite flag), obliquity_degrees (0 = pure axial/sagittal/coronal), plus individual boolean checks: single_orientation, orthogonal_orientation, unique_slice_positions, consistent_pixel_spacing, consistent_image_dimensions, uniform_slice_spacing.
rtstruct_index has one row per RTSTRUCT series. Array columns (ROINames, ROIGenerationAlgorithms, RTROIInterpretedTypes) are stored as strings.
client.fetch_index("rtstruct_index")
# RTSTRUCT series with ROI counts and names
client.sql_query("""
SELECT i.collection_id, i.SeriesInstanceUID,
r.total_rois, r.ROINames, r.RTROIInterpretedTypes,
r.referenced_SeriesInstanceUID
FROM index i
JOIN rtstruct_index r ON i.SeriesInstanceUID = r.SeriesInstanceUID
LIMIT 10
""")
# Collections with the most RTSTRUCT series
client.sql_query("""
SELECT i.collection_id,
COUNT(*) as rtstruct_series,
ROUND(AVG(r.total_rois), 1) as avg_rois
FROM index i
JOIN rtstruct_index r ON i.SeriesInstanceUID = r.SeriesInstanceUID
GROUP BY i.collection_id
ORDER BY rtstruct_series DESC
LIMIT 10
""")
# Find source CT series for a given RTSTRUCT
client.sql_query("""
SELECT r.SeriesInstanceUID as rtstruct_uid,
r.total_rois, r.ROINames,
src.SeriesInstanceUID as source_ct_uid,
src.collection_id, src.BodyPartExamined
FROM rtstruct_index r
JOIN index src ON r.referenced_SeriesInstanceUID = src.SeriesInstanceUID
LIMIT 10
""")
references/index_tables_guide.md for table schemas, DataFrame access, and join column referencesreferences/clinical_data_guide.md for clinical data patterns and value mappingreferences/digital_pathology_guide.md for pathology-specific queriesreferences/bigquery_guide.md for advanced queries requiring full DICOM metadatareferences/parquet_access_guide.md for direct Parquet queries without installing idc-index