scientific-skills/imaging-data-commons/references/clinical_data_guide.md
Tested with: idc-index 0.11.7 (IDC data version v23)
Clinical data (demographics, diagnoses, therapies, lab tests, staging) accompanies many IDC imaging collections. This guide covers how to discover, access, and integrate clinical data with imaging data using idc-index.
Use this guide when you need to:
For basic clinical data access, see the "Clinical Data Access" section in the main SKILL.md. This guide provides detailed workflows and advanced patterns.
pip install --upgrade idc-index
No BigQuery credentials required - clinical data is packaged with idc-index.
Clinical data refers to non-imaging information that accompanies medical images:
Clinical data in IDC comes from collection-specific spreadsheets provided by data submitters. IDC parses these into queryable tables accessible via idc-index.
Important characteristics:
dicom_patient_id links to imaging
The clinical_index serves as a dictionary/catalog of all available clinical data:
| Column | Purpose | Use For |
|---|---|---|
| collection_id | Collection identifier | Filtering by collection |
| table_name | Full BigQuery table reference | BigQuery queries (if needed) |
| short_table_name | Short name | get_clinical_table() method |
| column | Column name in table | Selecting data columns |
| column_label | Human-readable description | Searching for concepts |
| values | Observed attribute values for the column | Interpreting coded values |
values Column
The values column contains an array of observed attribute values for the column named in the column field. Each entry has an option_code (the observed value) and an option_description (its description, or None when no description is available).
For ACRIN collections, value descriptions come from the provided data dictionaries. For other collections, they are derived from inspection of the actual data values.
Note: For columns with >20 unique values, the values array is left empty ([]) for simplicity.
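A values entry can therefore hold described codes, codes whose description is None, or nothing at all. As a minimal sketch of how these cases might be told apart before decoding a column, the helper below inspects one entry. The name classify_values and its return labels are illustrative assumptions, not part of idc-index, and the sample entries only mimic the structure shown in this guide.

```python
def classify_values(values):
    """Classify one clinical_index `values` entry (hypothetical helper).

    Returns one of three illustrative labels:
      "not_listed" - empty array (e.g. the column has >20 unique values)
      "described"  - at least one entry carries an option_description
      "codes_only" - entries exist but every option_description is None
    """
    # list() also accepts a NumPy array of dicts, which is how the
    # values cell is shown in this guide's example output.
    entries = list(values) if values is not None else []
    if len(entries) == 0:
        return "not_listed"
    if any(entry.get("option_description") is not None for entry in entries):
        return "described"
    return "codes_only"


# Sample entries shaped like the ones printed in this guide (not fetched from IDC)
staging = [
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
]
chemo = [{"option_code": "Cisplatin, Mitomycin-C", "option_description": None}]

for name, entry in [("staging", staging), ("chemo", chemo), ("empty", [])]:
    print(name, "->", classify_values(entry))
```

A "not_listed" result suggests loading the clinical table and inspecting the column directly, while "codes_only" means the stored values are already the most descriptive form available.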
from idc_index import IDCClient
client = IDCClient()
client.fetch_index('clinical_index')
# View available columns
print(client.clinical_index.columns.tolist())
# List all collections with clinical data
collections_with_clinical = client.clinical_index["collection_id"].unique().tolist()
print(f"{len(collections_with_clinical)} collections have clinical data")
# Find clinical attributes for a specific collection
nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
nlst_columns[['short_table_name', 'column', 'column_label', 'values']]
# Search by keyword in column_label (case-insensitive)
stage_attrs = client.clinical_index[
    client.clinical_index["column_label"].str.contains("stage", case=False, na=False)
]
stage_attrs[["collection_id", "short_table_name", "column", "column_label"]]
# Load table using short_table_name
nlst_canc_df = client.get_clinical_table("nlst_canc")
# Examine structure
print(f"Rows: {len(nlst_canc_df)}, Columns: {len(nlst_canc_df.columns)}")
nlst_canc_df.head()
Many clinical attributes use coded values. The values column in clinical_index contains an array of observed values with their descriptions (when available).
# Get the clinical_index rows for NLST
nlst_clinical_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']
# Get observed values for a specific column
# Filter to the row for 'clinical_stag' and extract the values array
clinical_stag_values = nlst_clinical_columns[
nlst_clinical_columns['column']=='clinical_stag'
]['values'].values[0]
# View the observed values and their descriptions
print(clinical_stag_values)
# Output: array([{'option_code': '.M', 'option_description': 'Missing'},
# {'option_code': '110', 'option_description': 'Stage IA'},
# {'option_code': '120', 'option_description': 'Stage IB'}, ...])
# Create mapping dictionary from codes to descriptions
mapping_dict = {item['option_code']: item['option_description'] for item in clinical_stag_values}
# Apply to DataFrame - convert column to string first for consistent matching
nlst_canc_df['clinical_stag_meaning'] = nlst_canc_df['clinical_stag'].astype(str).map(mapping_dict)
The dicom_patient_id column links clinical data to imaging. It matches the PatientID column in the imaging index.
# Pandas merge approach
import pandas as pd
# Get NLST CT imaging data
nlst_imaging = client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')]
# Join with clinical data
merged = pd.merge(
nlst_imaging[['PatientID', 'StudyInstanceUID']].drop_duplicates(),
nlst_canc_df[['dicom_patient_id', 'clinical_stag', 'clinical_stag_meaning']],
left_on='PatientID',
right_on='dicom_patient_id',
how='inner'
)
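The merge above silently multiplies imaging rows if a clinical table happens to hold more than one row per patient. The sketch below shows one way to guard against that with the validate argument of pd.merge. It uses small synthetic DataFrames as stand-ins for client.index and a clinical table, and the helper name join_one_row_per_patient is a hypothetical one chosen for illustration.

```python
import pandas as pd

# Synthetic stand-ins for client.index and a clinical table (not real IDC data)
imaging = pd.DataFrame({
    "PatientID": ["p1", "p1", "p2"],
    "StudyInstanceUID": ["1.1", "1.2", "2.1"],
})
clinical = pd.DataFrame({
    "dicom_patient_id": ["p1", "p2", "p2"],  # p2 has two clinical rows
    "clinical_stag": ["110", "400", "120"],
})


def join_one_row_per_patient(imaging_df, clinical_df):
    """Inner join that raises if the clinical side repeats a patient ID."""
    return pd.merge(
        imaging_df,
        clinical_df,
        left_on="PatientID",
        right_on="dicom_patient_id",
        how="inner",
        validate="many_to_one",  # many imaging rows, at most one clinical row
    )


try:
    join_one_row_per_patient(imaging, clinical)
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("Duplicate clinical rows detected:", duplicate_detected)

# Keeping the first row per patient is an arbitrary choice made for this sketch;
# pick a rule that fits the analysis (e.g. earliest diagnosis, highest stage).
deduplicated = clinical.drop_duplicates(subset="dicom_patient_id", keep="first")
merged = join_one_row_per_patient(imaging, deduplicated)
print(merged)
```

Failing loudly on duplicates is usually preferable to discovering inflated study counts later in an analysis.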
# SQL join approach
# Clinical tables loaded via get_clinical_table() are not automatically
# registered in DuckDB. Register the DataFrame manually before joining.
nlst_canc_df = client.get_clinical_table("nlst_canc")
client._duckdb_conn.register("nlst_canc", nlst_canc_df)
query = """
SELECT
index.PatientID,
index.StudyInstanceUID,
index.Modality,
nlst_canc.clinical_stag
FROM index
JOIN nlst_canc ON index.PatientID = nlst_canc.dicom_patient_id
WHERE index.collection_id = 'nlst' AND index.Modality = 'CT'
"""
results = client.sql_query(query)
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
client.fetch_index('clinical_index')
# Load clinical table
nlst_canc = client.get_clinical_table("nlst_canc")
# Select Stage IV patients (code '400'); compare as strings for consistent matching
stage_iv_patients = nlst_canc[nlst_canc['clinical_stag'].astype(str) == '400']['dicom_patient_id']
# Get CT imaging studies for these patients
stage_iv_studies = pd.merge(
client.index[(client.index['collection_id']=='nlst') & (client.index['Modality']=='CT')],
stage_iv_patients,
left_on='PatientID',
right_on='dicom_patient_id',
how='inner'
)['StudyInstanceUID'].drop_duplicates()
print(f"Found {len(stage_iv_studies)} CT studies for Stage IV patients")
# Find collections with chemotherapy information
chemo_collections = client.clinical_index[
client.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)
]["collection_id"].unique()
print(f"Collections with chemotherapy data: {list(chemo_collections)}")
# Find what values have been observed for a specific attribute
chemotherapy_rows = client.clinical_index[
(client.clinical_index["collection_id"] == "hcc_tace_seg") &
(client.clinical_index["column"] == "chemotherapy")
]
# Get the observed values array
values_list = chemotherapy_rows["values"].tolist()
print(values_list)
# Output: [[{'option_code': 'Cisplastin', 'option_description': None},
# {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, ...]]
# Get studies for a sample Stage IV patient
sample_patient = stage_iv_patients.iloc[0]
studies = client.index[client.index['PatientID'] == sample_patient]['StudyInstanceUID'].unique()
# Generate viewer URL
if len(studies) > 0:
viewer_url = client.get_viewer_URL(studyInstanceUID=studies[0])
print(viewer_url)
Some collections (like c4kc_kits) have identical column and column_label. Others (like ACRIN collections) have cryptic column names but descriptive labels.
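The values array contains observed attribute values. Each entry has an option_code (the observed value) and an option_description (its description, or None when no description is available).
Every clinical table includes dicom_patient_id, which matches the PatientID column in the imaging index. This is the key for joining clinical and imaging data.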
Cause: Using wrong table name or table doesn't exist for collection
Solution: Query clinical_index first to find available tables:
client.clinical_index[client.clinical_index['collection_id']=='your_collection']['short_table_name'].unique()
Cause: The values array is left empty when a column has >20 unique values
Solution: Load the clinical table and examine unique values directly:
clinical_df = client.get_clinical_table("table_name")
clinical_df['column_name'].unique()
Cause: Some values may be missing from the dictionary (e.g., empty strings, special codes like .M for missing)
Solution: Handle unmapped values gracefully:
df['meaning'] = df['code'].astype(str).map(mapping_dict).fillna('Unknown/Missing')
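When a collection provides no descriptions at all (option_description is None, as in the hcc_tace_seg output shown earlier), filling every unmapped value with a generic label discards the raw value. A minimal sketch of an alternative is given below: it falls back to the code itself when no description exists. The helper names build_code_mapping and decode are hypothetical and not part of idc-index.

```python
def build_code_mapping(values):
    """Map option_code -> option_description, falling back to the code itself."""
    mapping = {}
    for entry in values:
        code = str(entry["option_code"])
        description = entry.get("option_description")
        # Keep the raw code when the collection supplies no description
        mapping[code] = description if description is not None else code
    return mapping


def decode(code, mapping, default="Unknown/Missing"):
    """Look up a code as a string; return `default` for codes never observed."""
    return mapping.get(str(code), default)


# Entries shaped like the examples in this guide (not fetched from IDC)
staging_mapping = build_code_mapping([
    {"option_code": ".M", "option_description": "Missing"},
    {"option_code": "110", "option_description": "Stage IA"},
])
chemo_mapping = build_code_mapping([
    {"option_code": "Cisplatin, Mitomycin-C", "option_description": None},
])

print(decode(110, staging_mapping))
print(decode("Cisplatin, Mitomycin-C", chemo_mapping))
print(decode("999", staging_mapping))
```

With a DataFrame, the same lookup could be applied as df['code'].map(lambda c: decode(c, mapping)).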
Cause: Clinical data may include patients without images, or vice versa
Solution: Verify patient overlap before joining:
imaging_patients = set(client.index[client.index['collection_id']=='nlst']['PatientID'].unique())
clinical_patients = set(clinical_df['dicom_patient_id'].unique())
overlap = imaging_patients & clinical_patients
print(f"Patients with both imaging and clinical data: {len(overlap)}")
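The set intersection above reports only the overlap. To also see how many patients fall on just one side, an outer merge with indicator=True is one option. The sketch below uses synthetic patient IDs rather than real IDC data, and the variable names are chosen for illustration.

```python
import pandas as pd

# Synthetic, already de-duplicated patient lists (not real IDC data).
# With real tables, apply drop_duplicates() to each ID column first.
imaging_ids = pd.DataFrame({"PatientID": ["p1", "p2", "p3"]})
clinical_ids = pd.DataFrame({"dicom_patient_id": ["p2", "p3", "p4", "p5"]})

coverage = pd.merge(
    imaging_ids,
    clinical_ids,
    left_on="PatientID",
    right_on="dicom_patient_id",
    how="outer",
    indicator=True,  # adds a _merge column: left_only, right_only, or both
)

counts = coverage["_merge"].value_counts().to_dict()
print(f"Imaging only:  {counts['left_only']}")
print(f"Clinical only: {counts['right_only']}")
print(f"Both:          {counts['both']}")
```

A large "clinical only" or "imaging only" count is worth investigating before interpreting results from an inner join.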
IDC Documentation:
Related Guides:
bigquery_guide.md - Advanced clinical queries via BigQuery
IDC Tutorials: