scientific-skills/cellxgene-census/references/census_schema.md
The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
The Census is organized as a SOMACollection with two main components:
- census_info: Summary information about the release.
- census_data: Organism-specific SOMAExperiment objects, keyed by organism name (e.g., "homo_sapiens").
Each organism experiment contains:
obs: Cell-level annotations stored as a SOMADataFrame. Access via:
census["census_data"]["homo_sapiens"].obs
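As a minimal sketch of how the obs SOMADataFrame might be read into memory, the helper below selects a few columns and optionally applies a filter. The function name `read_obs_table` is hypothetical, and `census` is assumed to be an open handle such as the one returned by `cellxgene_census.open_soma()`; the `read(...).concat().to_pandas()` chain follows the usual TileDB-SOMA reading pattern.

```python
from typing import Optional, Sequence


def read_obs_table(census, organism: str, column_names: Sequence[str],
                   value_filter: Optional[str] = None):
    """Read selected obs columns for one organism into a pandas DataFrame.

    `census` is assumed to be an open Census handle (hypothetical helper,
    not part of the cellxgene_census package).
    """
    obs = census["census_data"][organism].obs
    # read() yields Arrow tables in chunks; concat() joins them into one
    # table, and to_pandas() converts that table to a DataFrame.
    reader = obs.read(column_names=list(column_names), value_filter=value_filter)
    return reader.concat().to_pandas()


# Usage (needs the cellxgene-census package and network access):
#   import cellxgene_census
#   with cellxgene_census.open_soma(census_version="2023-07-25") as census:
#       df = read_obs_table(census, "homo_sapiens",
#                           ["cell_type", "tissue_general"],
#                           "tissue_general == 'lung'")
```

Requesting only the columns you need keeps the transfer small, since obs holds one row per cell.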
ms["RNA"]: RNA measurement data, including the X layers:
- raw: Raw count data
- normalized: (if available) Normalized counts

Cell metadata fields (obs), grouped by category:

Identity & Dataset:
- soma_joinid: Unique integer identifier for joins
- dataset_id: Source dataset identifier
- is_primary_data: Boolean flag (True = unique cell, False = duplicate across datasets)

Cell Type:
- cell_type: Human-readable cell type name
- cell_type_ontology_term_id: Standardized ontology term (e.g., "CL:0000236")

Tissue:
- tissue: Specific tissue name
- tissue_general: Broader tissue category (useful for grouping)
- tissue_ontology_term_id: Standardized ontology term

Assay:
- assay: Sequencing technology used
- assay_ontology_term_id: Standardized ontology term

Disease:
- disease: Disease status or condition
- disease_ontology_term_id: Standardized ontology term

Donor:
- donor_id: Unique donor identifier
- sex: Biological sex (male, female, unknown)
- self_reported_ethnicity: Ethnicity information
- development_stage: Life stage (adult, child, embryonic, etc.)
- development_stage_ontology_term_id: Standardized ontology term

Organism:
- organism: Scientific name (Homo sapiens, Mus musculus)
- organism_ontology_term_id: Standardized ontology term

Technical:
- suspension_type: Sample preparation type (cell, nucleus, na)

Gene metadata (var) is stored with the RNA measurement. Access via:
census["census_data"]["homo_sapiens"].ms["RNA"].var
Available Fields:
- soma_joinid: Unique integer identifier for joins
- feature_id: Ensembl gene ID (e.g., "ENSG00000161798")
- feature_name: Gene symbol (e.g., "FOXP2")
- feature_length: Gene length in base pairs

Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.
- ==: Equal to
- !=: Not equal to
- <, >, <=, >=: Numeric comparisons
- in: Membership test (e.g., feature_id in ['ENSG00000161798', 'ENSG00000188229'])
- and, &: Logical AND
- or, |: Logical OR

Single condition:
value_filter="cell_type == 'B cell'"
Multiple conditions with AND:
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
Using IN for multiple values:
value_filter="tissue in ['lung', 'liver', 'kidney']"
Complex condition:
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
Filtering genes:
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
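Filter strings like the ones above are easy to mistype when assembled by hand. The sketch below builds them from a plain dictionary; `build_value_filter` is a hypothetical helper, not part of cellxgene_census or TileDB-SOMA, and it only covers `==`, `in`, and `and`. It also assumes string values contain no single quotes.

```python
from typing import Mapping, Sequence, Union

Scalar = Union[str, bool, int, float]


def _literal(value: Scalar) -> str:
    """Render a Python value as a literal inside a filter expression."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "True" if value else "False"
    if isinstance(value, (int, float)):
        return repr(value)
    if "'" in value:
        raise ValueError(f"single quotes are not handled by this sketch: {value!r}")
    return f"'{value}'"


def build_value_filter(conditions: Mapping[str, Union[Scalar, Sequence[Scalar]]]) -> str:
    """Join column conditions with 'and'; lists and tuples become 'in' tests."""
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            items = ", ".join(_literal(v) for v in value)
            parts.append(f"{column} in [{items}]")
        else:
            parts.append(f"{column} == {_literal(value)}")
    return " and ".join(parts)


print(build_value_filter({"cell_type": "B cell", "is_primary_data": True}))
# cell_type == 'B cell' and is_primary_data == True
print(build_value_filter({"feature_name": ["CD4", "CD8A", "CD19"]}))
# feature_name in ['CD4', 'CD8A', 'CD19']
```

The same helper works for both cell filters and gene filters, since both use the same expression syntax over different column names.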
The Census includes all data from CZ CELLxGENE Discover meeting:
Cells may appear across multiple datasets. Use is_primary_data == True to filter for unique cells in most analyses.
The Census includes:
Census releases are identified by a build date (e.g., "2023-07-25") or by an alias such as "stable". An alias can point to a newer release over time, so pin a dated version for reproducible analysis:
census = cellxgene_census.open_soma(census_version="2023-07-25")
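A slightly fuller sketch of version pinning is shown below: the version string lives in one constant, and the handle is opened in a `with` block so it is closed when the work is done. The function `list_organisms` and its `open_soma` parameter are hypothetical; the parameter exists only so the function can be exercised with a stand-in opener instead of a network connection.

```python
from typing import Callable, Optional

CENSUS_VERSION = "2023-07-25"  # dated release taken from the example above


def list_organisms(census_version: str = CENSUS_VERSION,
                   open_soma: Optional[Callable] = None) -> list:
    """Open a pinned Census release and return the keys under census_data."""
    if open_soma is None:
        # Imported lazily so a stand-in opener can be passed in instead.
        import cellxgene_census
        open_soma = cellxgene_census.open_soma
    # The context manager closes the Census handle on exit.
    with open_soma(census_version=census_version) as census:
        return sorted(census["census_data"].keys())


# Usage (needs the cellxgene-census package and network access):
#   print(list_organisms())
```

Keeping the version in a single constant makes it easy to record alongside results and to update deliberately.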
Access which genes were measured in each dataset:
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
This sparse boolean matrix helps understand:
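To make the idea concrete, here is a toy sketch in plain Python that treats the presence matrix as a list of nonzero (row, column) coordinates. It assumes rows correspond to datasets and columns to genes, which is not stated in the text above, and the coordinates are invented for illustration rather than taken from the Census. Both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def genes_by_dataset(nonzero: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Group (dataset_row, feature_col) coordinates into a per-dataset gene set."""
    measured: Dict[int, Set[int]] = defaultdict(set)
    for dataset_row, feature_col in nonzero:
        measured[dataset_row].add(feature_col)
    return dict(measured)


def shared_genes(measured: Dict[int, Set[int]]) -> Set[int]:
    """Genes present in every dataset, i.e. comparable across all of them."""
    sets = list(measured.values())
    return set.intersection(*sets) if sets else set()


# Toy coordinates, not real Census data: 3 datasets x 4 genes.
toy_nonzero = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
measured = genes_by_dataset(toy_nonzero)
print({d: len(g) for d, g in measured.items()})  # {0: 3, 1: 3, 2: 2}
print(sorted(shared_genes(measured)))            # [1, 2]
```

With real data the same two questions (how many genes each dataset measured, and which genes all datasets share) would be answered on the sparse matrix directly rather than on Python sets.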
Core TileDB-SOMA objects used: