scientific-skills/imaging-data-commons/references/cloud_storage_guide.md
IDC maintains all DICOM files in public cloud storage buckets mirrored between Google Cloud Storage (GCS) and AWS S3. This guide covers bucket organization, file structure, access methods, and versioning.
For most use cases, idc-index is simpler and recommended: it uses s5cmd internally to download from these same S3 buckets and handles the UUID lookups automatically. Use direct bucket access when you need raw file access, custom parallelization, or are building cloud-native pipelines.
IDC organizes data across multiple buckets based on licensing and content type. All buckets are mirrored between AWS and GCS with identical content and file paths.
| Purpose | AWS S3 Bucket | GCS Bucket | License | Content |
|---|---|---|---|---|
| Primary data | idc-open-data | idc-open-data | No commercial restriction | >90% of IDC data |
| Head scans | idc-open-data-two | idc-open-idc1 | No commercial restriction | Collections potentially containing head imaging |
| Commercial-restricted | idc-open-data-cr | idc-open-cr | Commercial use restricted (CC BY-NC) | ~4% of data |
Notes:
- AWS buckets are in region us-east-1
- public-datasets-idc is an older bucket name (now superseded by idc-open-data)
- Use idc-index to get license information; do not rely on the bucket name
- Commercial-restricted data is kept in idc-open-data-cr / idc-open-cr to prevent accidental commercial use
- Head scans are kept separately (idc-open-data-two / idc-open-idc1) for potential future policy compliance

Files are organized by CRDC UUIDs, not DICOM UIDs. This enables versioning while maintaining consistent paths across cloud providers.
<bucket>/
└── <crdc_series_uuid>/
    ├── <crdc_instance_uuid_1>.dcm
    ├── <crdc_instance_uuid_2>.dcm
    └── ...
Example path:
s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm
- 7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9 = series UUID (folder)
- 0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm = instance UUID (file)

| Identifier Type | Format | Changes When | Use For |
|---|---|---|---|
| DICOM UID (e.g., SeriesInstanceUID) | Numeric (e.g., 1.3.6.1.4...) | Never (included in DICOM metadata) | Clinical identification, DICOMweb queries |
| CRDC UUID (e.g., crdc_series_uuid) | UUID (e.g., e127d258-37c2-...) | Content changes | File paths, versioning, reproducibility |
Key insight: A single DICOM SeriesInstanceUID may have multiple CRDC series UUIDs across IDC versions if the series content changed (instances added/removed, metadata corrected). The CRDC UUID uniquely identifies a specific version of the data.
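As a minimal sketch of the path layout described above, the helper below splits an object URL into its bucket, series UUID, and instance UUID. The names `IDCObjectPath` and `parse_idc_object_url` are hypothetical (they are not part of idc-index), and the sketch assumes the two-level `<series_uuid>/<instance_uuid>.dcm` layout shown in this guide.

```python
import uuid
from typing import NamedTuple
from urllib.parse import urlparse


class IDCObjectPath(NamedTuple):
    """Hypothetical container for the parts of an IDC object URL."""
    scheme: str         # "s3" or "gs"
    bucket: str         # e.g. "idc-open-data"
    series_uuid: str    # CRDC series UUID (the folder)
    instance_uuid: str  # CRDC instance UUID (file name without ".dcm")


def parse_idc_object_url(url: str) -> IDCObjectPath:
    """Split <scheme>://<bucket>/<series_uuid>/<instance_uuid>.dcm into parts."""
    parsed = urlparse(url)
    if parsed.scheme not in ("s3", "gs"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 2 or not parts[1].endswith(".dcm"):
        raise ValueError(f"unexpected object path: {parsed.path!r}")
    series_uuid, filename = parts
    instance_uuid = filename[: -len(".dcm")]
    # uuid.UUID raises ValueError if a component is not a well-formed UUID
    uuid.UUID(series_uuid)
    uuid.UUID(instance_uuid)
    return IDCObjectPath(parsed.scheme, parsed.netloc, series_uuid, instance_uuid)
```

Calling it on the example path above yields the bucket `idc-open-data` plus the two UUIDs as separate fields, which is convenient when recording CRDC UUIDs for reproducibility.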
Use idc-index to get file URLs from DICOM identifiers:
from idc_index import IDCClient
client = IDCClient()
# Get all file URLs for a series
series_uid = "1.3.6.1.4.1.14519.5.2.1.6450.9002.217441095430480124587725641302"
urls = client.get_series_file_URLs(seriesInstanceUID=series_uid)
for url in urls[:3]:
    print(url)
# Returns S3 URLs like: s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm
Or query the index directly for URL columns:
# Get series-level URL (points to folder)
result = client.sql_query("""
SELECT SeriesInstanceUID, series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 3
""")
print(result[['SeriesInstanceUID', 'series_aws_url']])
Available URL column in index:
- series_aws_url: S3 URL to series folder (e.g., s3://idc-open-data/uuid/*)

GCS URLs follow the same path structure. For idc-open-data, replace s3:// with gs:// (e.g., gs://idc-open-data/uuid/*); the other two buckets also need the GCS bucket name from the bucket table above. When using idc-index download methods, GCS access is handled internally.
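The bucket table lists different GCS names for two of the three buckets, so a plain scheme swap is only enough for idc-open-data. The sketch below converts an S3 URL to its GCS counterpart using that table. The function name `aws_url_to_gcs_url` and the dictionary are hypothetical helpers, not part of idc-index.

```python
# AWS bucket -> GCS bucket, copied from the bucket table in this guide
AWS_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}


def aws_url_to_gcs_url(aws_url: str) -> str:
    """Rewrite s3://<aws_bucket>/<key> as gs://<gcs_bucket>/<key>."""
    prefix = "s3://"
    if not aws_url.startswith(prefix):
        raise ValueError(f"not an S3 URL: {aws_url!r}")
    # Split off the bucket name; the key (path) is identical on both clouds
    bucket, sep, key = aws_url[len(prefix):].partition("/")
    if bucket not in AWS_TO_GCS_BUCKET:
        raise ValueError(f"unknown IDC bucket: {bucket!r}")
    return f"gs://{AWS_TO_GCS_BUCKET[bucket]}{sep}{key}"
```

Raising on an unknown bucket, rather than passing the name through unchanged, avoids silently producing a GCS URL that points nowhere.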
All IDC buckets support free egress (no download fees) through partnerships with AWS Open Data and Google Public Data programs. No authentication required.
Using AWS CLI (no account required):
# List bucket contents
aws s3 ls --no-sign-request s3://idc-open-data/
# List files in a series folder
aws s3 ls --no-sign-request s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/
# Download a single file
aws s3 cp --no-sign-request \
s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm \
./local_file.dcm
# Download entire series folder
aws s3 cp --no-sign-request --recursive \
s3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ \
./series_folder/
Using s5cmd (faster for bulk downloads):
# Install s5cmd
# macOS: brew install s5cmd
# Linux: download from https://github.com/peak/s5cmd/releases
# Download specific series
s5cmd --no-sign-request cp 's3://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/*' ./local_folder/
# Download from manifest file
s5cmd --no-sign-request run manifest.txt
s5cmd manifest format: The s5cmd run command expects one s5cmd command per line, not just URLs:
cp s3://idc-open-data/uuid1/instance1.dcm ./local_folder/
cp s3://idc-open-data/uuid1/instance2.dcm ./local_folder/
cp s3://idc-open-data/uuid2/instance3.dcm ./local_folder/
IDC Portal exports manifests in this format. When creating manifests programmatically, use idc-index download methods (which handle this internally) rather than constructing manifests manually.
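If you receive a manifest from elsewhere and are unsure whether it is in this format, a quick check can catch the most common mistake (bare URLs instead of commands). The sketch below is a hypothetical helper, not part of s5cmd or idc-index, and it only recognizes `cp` lines like those in the example above, even though `s5cmd run` accepts other commands as well.

```python
from typing import List


def find_non_cp_lines(manifest_text: str) -> List[int]:
    """Return 1-based line numbers that are not `cp <source> <destination>` commands."""
    bad_lines = []
    for lineno, line in enumerate(manifest_text.splitlines(), start=1):
        tokens = line.split()
        if not tokens:
            continue  # blank lines are ignored
        # A usable line starts with "cp" and has at least a source and a destination
        if tokens[0] != "cp" or len(tokens) < 3:
            bad_lines.append(lineno)
    return bad_lines
```

An empty result means every non-blank line looks like a `cp` command; any returned line numbers point at entries that would need to be rewritten before passing the file to `s5cmd run`.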
Using gsutil:
# List bucket contents
gsutil ls gs://idc-open-data/
# Download a series folder
gsutil -m cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/
Using gcloud storage (newer CLI):
gcloud storage cp -r gs://idc-open-data/7a6b2389-53c6-4c5b-b07f-6d1ed4a3eed9/ ./local_folder/
import s3fs
import gcsfs
from idc_index import IDCClient
# First, get a file URL from idc-index
client = IDCClient()
result = client.sql_query("""
SELECT series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 1
""")
# series_aws_url is like: s3://idc-open-data/<uuid>/*
series_url = result['series_aws_url'].iloc[0]
series_path = series_url.replace('s3://', '').rstrip('/*') # e.g., "idc-open-data/<uuid>"
# AWS S3 access (anonymous, no credentials needed)
s3 = s3fs.S3FileSystem(anon=True)
files = s3.ls(series_path)
with s3.open(files[0], 'rb') as f:
    data = f.read()
# GCS access (same path structure; only idc-open-data shares its bucket name with AWS)
gcs = gcsfs.GCSFileSystem(token='anon')
files = gcs.ls(series_path)
with gcs.open(files[0], 'rb') as f:
    data = f.read()
IDC releases new data versions every 2-4 months. The versioning system ensures reproducibility by preserving all historical data.
Version change scenarios:
| Change Type | DICOM UID | CRDC UUID | Effect |
|---|---|---|---|
| New series added | New | New | New folder in bucket |
| Instance added to series | Same | New series UUID | New folder, instances may be duplicated |
| Metadata corrected | Same or new | New | New folder with updated files |
| Series removed | N/A | N/A | Old folder remains, not in current index |
Data removal caveat: In rare circumstances (e.g., data owner request, PHI incident), data may be removed from IDC entirely, including from all historical versions.
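The version-change table can be read as a comparison of two snapshots that map each SeriesInstanceUID to its crdc_series_uuid. The sketch below classifies series under that reading. The function `classify_series_changes` and its labels are hypothetical, and the inputs are assumed to be plain dictionaries you have built yourself (for example from the current and prior index tables).

```python
from typing import Dict


def classify_series_changes(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Label each SeriesInstanceUID as 'added', 'revised', 'unchanged', or 'removed'.

    Both arguments map SeriesInstanceUID -> crdc_series_uuid for one IDC version.
    """
    changes = {}
    for series_uid, crdc_uuid in current.items():
        if series_uid not in previous:
            changes[series_uid] = "added"      # new DICOM UID, new CRDC UUID
        elif previous[series_uid] != crdc_uuid:
            changes[series_uid] = "revised"    # same DICOM UID, new CRDC UUID
        else:
            changes[series_uid] = "unchanged"
    for series_uid in previous:
        if series_uid not in current:
            changes[series_uid] = "removed"    # no longer in the current index
    return changes
```

A metadata correction that also changes the DICOM UID would show up here as one "removed" and one "added" entry, since the two snapshots share no key for that series.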
BigQuery versioned datasets (metadata only, not file storage):
For querying version-specific metadata, BigQuery provides versioned tables. See bigquery_guide.md for details.
- bigquery-public-data.idc_current: alias to latest version
- bigquery-public-data.idc_v23: specific version (replace 23 with desired version)

The simplest way to ensure reproducibility is to save the crdc_series_uuid values of the data you use at analysis time:
from idc_index import IDCClient
import json
client = IDCClient()
# Select data for your analysis
selection = client.sql_query("""
SELECT crdc_series_uuid
FROM index
WHERE collection_id = 'tcga_luad'
AND Modality = 'CT'
LIMIT 10
""")
series_uuids = list(selection['crdc_series_uuid'])
# Download the data. The values are CRDC UUIDs, not DICOM UIDs, so filter by
# crdc_series_uuid (assumes an idc-index release that supports this argument)
client.download_from_selection(crdc_series_uuid=series_uuids, downloadDir="./data")
# Save a manifest for reproducibility
manifest = {
    "crdc_series_uuids": series_uuids,
    "download_date": "2024-01-15",
    "idc_version": client.get_idc_version(),
    "description": "CT scans for lung cancer analysis"
}
with open("analysis_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
# Later, reproduce the exact dataset:
with open("analysis_manifest.json") as f:
    manifest = json.load(f)
client.download_from_selection(
    crdc_series_uuid=manifest["crdc_series_uuids"],
    downloadDir="./reproduced_data"
)
Since crdc_series_uuid identifies an immutable version of each series, saving these UUIDs lets you retrieve the exact same files later, except in the rare case where data has been removed from IDC entirely.
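When revisiting an old manifest, it can help to see which saved UUIDs are still part of the current release and which have been superseded. The sketch below does that comparison on plain lists. The function `split_saved_uuids` is a hypothetical helper, and it assumes you have already fetched the current crdc_series_uuid values yourself (for example with a query against the index table).

```python
from typing import Iterable, List, Tuple


def split_saved_uuids(saved_uuids: Iterable[str], current_uuids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split saved CRDC series UUIDs into (still current, not in current index)."""
    current = set(current_uuids)
    still_current = []
    not_current = []
    for series_uuid in saved_uuids:
        if series_uuid in current:
            still_current.append(series_uuid)
        else:
            not_current.append(series_uuid)
    return still_current, not_current
```

UUIDs in the second list are candidates for lookup in prior_versions_index, since the series they belong to has been revised or dropped from the current index.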
| Access Method | Buckets Included | Coverage | Versions |
|---|---|---|---|
| Direct bucket access | All 3 buckets | 100% | All historical |
| idc-index download | All 3 buckets | 100% | Current + prior_versions_index |
| IDC Portal | All 3 buckets | 100% | Current only |
| DICOMweb public proxy | All 3 buckets | 100% | Current only |
| Google Healthcare DICOM | idc-open-data only | ~96% | Current only |
Important: The Google Healthcare API DICOM store only replicates data from idc-open-data. Data in idc-open-data-two and idc-open-data-cr (approximately 4% of total) is not available via the Google Healthcare DICOMweb endpoint.
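A small check based on the note above can tell you, before issuing DICOMweb requests, whether a series is expected to be in the Google Healthcare store. This is a sketch under the assumption that coverage is determined solely by the AWS bucket name in series_aws_url, as the table states. The function name `in_google_healthcare_store` is hypothetical.

```python
# Buckets replicated to the Google Healthcare API DICOM store, per this guide
HEALTHCARE_API_BUCKETS = {"idc-open-data"}


def in_google_healthcare_store(series_aws_url: str) -> bool:
    """Return True if the series' bucket is replicated to the Healthcare API store."""
    # "s3://idc-open-data/<uuid>/*" -> "idc-open-data"
    without_scheme = series_aws_url.split("://", 1)[-1]
    bucket = without_scheme.split("/", 1)[0]
    return bucket in HEALTHCARE_API_BUCKETS
```

Series for which this returns False would need to be fetched through one of the other access methods in the table.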
- Use idc-index for discovery: Query metadata first, then access buckets with known UUIDs
- Save series_aws_url or crdc_series_uuid values for reproducibility
- Check license_short_name before commercial use; CC-NC data requires non-commercial use
- The index table has current data; use prior_versions_index only for exact reproducibility
- Use the --no-sign-request flag with AWS CLI, or anon=True with Python libraries
- Use idc-index for the current series_aws_url, or check prior_versions_index for historical paths
- Use prior_versions_index to find the exact version you need; compare crdc_series_uuid values
- The Google Healthcare API DICOM store only covers the idc-open-data bucket (~96% of data)
Related Guides:
- dicomweb_guide.md - DICOMweb API access (alternative to direct bucket access)
- bigquery_guide.md - Advanced metadata queries including versioned datasets