skills/cellxgene-census/references/census_schema.md
The CZ CELLxGENE Census is a versioned collection of single-cell and spatial transcriptomics data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
Current reference point:
cellxgene-census==1.17.* | 2025-11-08 | 2.4.0 | 7.0.0 | cellxgene-census 1.17.x

The Census is organized as a SOMACollection with these main components:
Summary information including:
Organism-specific SOMAExperiment objects:
Spatial organism-specific SOMAExperiment objects for supported releases. Spatial and non-spatial data share core metadata requirements, while spatial observations also include spatial columns such as array_col, array_row, and in_tissue.
Each organism experiment contains:
Cell-level annotations stored as a SOMADataFrame. Access via:
census["census_data"]["homo_sapiens"].obs
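As a minimal sketch of reading this dataframe, the helper below pulls a few columns into pandas. It assumes `census` is an open handle such as the one returned by `cellxgene_census.open_soma()`; the helper name `read_obs_columns` and the default column selection are illustrative and not part of the Census API.

```python
def read_obs_columns(
    census,
    organism="homo_sapiens",
    value_filter=None,
    column_names=("soma_joinid", "cell_type", "tissue_general"),
):
    """Read selected obs columns for one organism into a pandas DataFrame.

    `census` is assumed to be an open Census handle (hypothetical helper).
    """
    obs = census["census_data"][organism].obs
    # read() yields Arrow tables; concat() merges them into a single table
    # and to_pandas() converts that table to a DataFrame.
    reader = obs.read(value_filter=value_filter, column_names=list(column_names))
    return reader.concat().to_pandas()
```

Requesting only the columns that are needed keeps the read small, which matters because the obs dataframe holds one row per cell.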
RNA measurement data including:
raw: Raw count data

Spatial data is stored separately from the single-cell Census data:
census["census_spatial_sequencing"]["homo_sapiens"]
Each spatial organism experiment contains:
obs: Spatial observation metadata, including core Census metadata and spatial fields such as array_col, array_row, and in_tissue
ms["RNA"]: RNA measurement matrices and feature metadata
spatial[scene_id].obsl["loc"]: point-cloud positions for each scene, with x, y, and soma_joinid

Use axis_query(...).to_spatialdata(X_name="raw") when exporting a spatial slice to spatialdata.
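A minimal sketch of that export call, assuming `census` is an open Census handle and that the installed `tiledbsoma` and `spatialdata` packages support `to_spatialdata` for the selected release. The helper name `export_spatial_slice` is illustrative.

```python
def export_spatial_slice(census, obs_value_filter, organism="homo_sapiens"):
    """Export spatial observations matching a filter (hypothetical helper)."""
    import tiledbsoma  # lazy import: only needed when the helper is called

    exp = census["census_spatial_sequencing"][organism]
    query = exp.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter=obs_value_filter),
    )
    # X_name="raw" selects the raw count layer.
    return query.to_spatialdata(X_name="raw")
```

Filtering to a small slice first (for example a single dataset_id) is likely the safer choice, since spatial exports can be large.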
Identity & Dataset:
soma_joinid: Unique integer identifier for joins
dataset_id: Source dataset identifier
is_primary_data: Boolean flag (True = unique cell, False = duplicate across datasets)

Cell Type:
cell_type: Human-readable cell type name
cell_type_ontology_term_id: Standardized ontology term (e.g., "CL:0000236")

Tissue:
tissue: Specific tissue name
tissue_general: Broader tissue category (useful for grouping)
tissue_ontology_term_id: Standardized ontology term
tissue_general_ontology_term_id: Standardized ontology term for the broader tissue category

Assay:
assay: Sequencing technology used
assay_ontology_term_id: Standardized ontology term

Disease:
disease: Disease status or condition
disease_ontology_term_id: Standardized ontology term

Donor:
donor_id: Unique donor identifier
sex: Biological sex (male, female, unknown)
self_reported_ethnicity: Ethnicity information
development_stage: Life stage (adult, child, embryonic, etc.)
development_stage_ontology_term_id: Standardized ontology term

Organism:
organism: Scientific name (for example, Homo sapiens or Mus musculus)
organism_ontology_term_id: Standardized ontology term

Technical:
suspension_type: Sample preparation type (cell, nucleus, na)

Access via:
census["census_data"]["homo_sapiens"].ms["RNA"].var
Available Fields:
soma_joinid: Unique integer identifier for joins
feature_id: Ensembl gene ID (e.g., "ENSG00000161798")
feature_name: Gene symbol (e.g., "FOXP2")
feature_type: Feature type from the source schema
feature_length: Gene length in base pairs
nnz: Non-zero count summary
n_measured_obs: Number of measured observations for the feature

Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.

==: Equal to
!=: Not equal to
<, >, <=, >=: Numeric comparisons
in: Membership test (e.g., feature_id in ['ENSG00000161798', 'ENSG00000188229'])
and, &: Logical AND
or, |: Logical OR

Single condition:
value_filter="cell_type == 'B cell'"
Multiple conditions with AND:
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
Using IN for multiple values:
value_filter="tissue in ['lung', 'liver', 'kidney']"
Complex condition:
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
Filtering genes:
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
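Filter strings like the ones above are easy to mistype when assembled by hand. The sketch below builds them from keyword arguments; `build_value_filter` is a hypothetical helper, not a Census or TileDB-SOMA function, and it only covers equality and membership tests joined by `and`.

```python
def build_value_filter(**conditions):
    """Join column conditions into one value_filter string using `and`.

    Scalars (str, bool, numbers) become `==` tests; lists and tuples
    become `in` tests. Hypothetical helper for illustration only.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            # repr() quotes strings and leaves booleans and numbers bare.
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


print(build_value_filter(cell_type="B cell", is_primary_data=True))
print(build_value_filter(tissue=["lung", "liver", "kidney"]))
```

Conditions that need `or`, `!=`, or parentheses still have to be written out as plain strings.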
In current LTS releases, disease and disease_ontology_term_id may contain multiple values delimited by ||. Exact equality filters such as disease == 'COVID-19' can miss cells whose disease field contains multiple labels. For comprehensive disease queries, first inspect available values with get_obs() or summary_cell_counts, then choose filters that match the selected release's encoding.
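One way to handle multi-valued disease fields is to read the disease column without an exact-match filter and split it client-side. The sketch below assumes the delimiter is `||`, possibly surrounded by spaces; the helper names are illustrative.

```python
def split_disease_labels(value, delimiter="||"):
    """Split a possibly multi-valued disease string into clean labels."""
    return [label.strip() for label in value.split(delimiter) if label.strip()]


def has_disease(value, target):
    """Return True if `target` is one of the labels in `value`."""
    return target in split_disease_labels(value)


print(split_disease_labels("COVID-19 || pneumonia"))
print(has_disease("normal", "COVID-19"))
```

With an obs DataFrame already loaded in pandas, `obs_df["disease"].map(lambda v: has_disease(v, "COVID-19"))` would produce a boolean mask for rows carrying that label, alone or combined with others.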
The Census includes all data from CZ CELLxGENE Discover meeting:
Cells may appear across multiple datasets. Use is_primary_data == True to filter for unique cells in most analyses.
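A sketch of applying this flag when fetching an expression slice with `cellxgene_census.get_anndata`. The helper name `fetch_unique_cells` and its arguments are illustrative, and `census` is assumed to be an open Census handle.

```python
def fetch_unique_cells(census, cell_type, tissue_general, genes):
    """Fetch an AnnData slice restricted to primary (non-duplicated) cells."""
    import cellxgene_census  # lazy import: only needed when the helper is called

    obs_filter = (
        f"cell_type == {cell_type!r} "
        f"and tissue_general == {tissue_general!r} "
        "and is_primary_data == True"
    )
    var_filter = f"feature_name in {list(genes)!r}"
    return cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter=obs_filter,
        var_value_filter=var_filter,
    )
```

Dropping the `is_primary_data` clause is reasonable only when the analysis is scoped to a single dataset, where duplicates across datasets are not a concern.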
The Census includes:
Census releases are identified by build date (e.g., "2025-11-08") or by the aliases "stable" and "latest". Always specify an LTS build date for reproducible analysis:
census = cellxgene_census.open_soma(census_version="2025-11-08")
stable resolves to the current LTS release. latest resolves to the newest weekly release, which provides fast access to newly ingested datasets but is retained for a shorter period than LTS releases.
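To guard against accidentally running a pipeline on a moving alias, a small check can reject anything that is not a dated build. This is a sketch with hypothetical helper names; it only validates the date format and does not confirm that the build exists or is an LTS release.

```python
from datetime import datetime


def is_pinned_build(census_version):
    """Return True for a dated build such as "2025-11-08", False for aliases."""
    try:
        datetime.strptime(census_version, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def open_pinned(census_version):
    """Open the Census only when given a dated build (hypothetical helper)."""
    if not is_pinned_build(census_version):
        raise ValueError(f"expected a build date, got {census_version!r}")
    import cellxgene_census  # lazy import: only needed once validation passes

    return cellxgene_census.open_soma(census_version=census_version)
```

The returned handle can be used as a context manager (`with open_pinned("2025-11-08") as census: ...`) so that it is closed when the block exits.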
Access which genes were measured in each dataset:
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
This sparse boolean matrix helps understand:
Core TileDB-SOMA objects used: