scEmbed trains Region2Vec models on single-cell ATAC-seq datasets to generate cell embeddings for clustering and analysis. It provides an unsupervised machine learning framework for representing and analyzing scATAC-seq data in low-dimensional space.
Use scEmbed when working with:
Input data must be in AnnData format with .var attributes containing chr, start, and end values for peaks.
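A quick check of the peak metadata before training can catch a mislabeled column early. The following is a minimal sketch: `missing_peak_columns` is a hypothetical helper, not part of geniml, and it only inspects column names, not their contents.

```python
REQUIRED_PEAK_COLUMNS = ("chr", "start", "end")


def missing_peak_columns(var_columns):
    """Return the required peak columns that are absent from `var_columns`.

    `var_columns` is any iterable of column names, e.g. `adata.var.columns`.
    """
    present = set(var_columns)
    return [name for name in REQUIRED_PEAK_COLUMNS if name not in present]


# Plain lists stand in for `adata.var.columns` here
print(missing_peak_columns(["chr", "start", "end", "n_cells"]))  # []
print(missing_peak_columns(["chrom", "start"]))  # ['chr', 'end']
```

With a loaded AnnData object, the call would be `missing_peak_columns(adata.var.columns)`; an empty list means the required columns are present.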
Starting from raw data (barcodes.txt, peaks.bed, matrix.mtx):
import scanpy as sc
import pandas as pd
import scipy.io
import anndata
# Load data
barcodes = pd.read_csv('barcodes.txt', header=None, names=['barcode'])
peaks = pd.read_csv('peaks.bed', sep='\t', header=None,
                    names=['chr', 'start', 'end'])
matrix = scipy.io.mmread('matrix.mtx').tocsr()
# Create AnnData
adata = anndata.AnnData(X=matrix.T, obs=barcodes, var=peaks)
adata.write('scatac_data.h5ad')
Convert genomic regions into tokens using gtars utilities. This creates a parquet file with tokenized cells for faster training:
from geniml.io import tokenize_cells
tokenize_cells(
    adata='scatac_data.h5ad',
    universe_file='universe.bed',
    output='tokenized_cells.parquet'
)
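To make the idea of tokenization concrete, the sketch below maps each peak of a cell to the indices of the universe regions it overlaps. It is a conceptual illustration only, not the gtars implementation: `tokenize_peaks` is a hypothetical name, coordinates are assumed to be half-open as in BED files, and the nested loop is far slower than a real interval index.

```python
def tokenize_peaks(peaks, universe):
    """Map each peak to the indices of the universe regions it overlaps.

    Both arguments are lists of (chrom, start, end) tuples using
    half-open coordinates, as in BED files.
    """
    tokens = []
    for chrom, start, end in peaks:
        for idx, (u_chrom, u_start, u_end) in enumerate(universe):
            # Two half-open intervals overlap when each starts before the other ends
            if chrom == u_chrom and start < u_end and u_start < end:
                tokens.append(idx)
    return tokens


# Toy universe and one cell's peaks (illustrative values)
universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
cell_peaks = [("chr1", 50, 120), ("chr2", 10, 20), ("chr3", 0, 10)]
print(tokenize_peaks(cell_peaks, universe))  # [0, 1, 2]
```

The peak on `chr3` has no counterpart in the universe and is dropped, which is why the choice of universe file affects what the model can see.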
Benefits of pre-tokenization:
Train the scEmbed model using tokenized data:
from geniml.scembed import ScEmbed
from geniml.region2vec import Region2VecDataset
# Load tokenized dataset
dataset = Region2VecDataset('tokenized_cells.parquet')
# Initialize and train model
model = ScEmbed(
    embedding_dim=100,
    window_size=5,
    negative_samples=5
)
model.train(
    dataset=dataset,
    epochs=100,
    batch_size=256,
    learning_rate=0.025
)
# Save model
model.save('scembed_model/')
Use the trained model to generate embeddings for cells:
from geniml.scembed import ScEmbed
# Load trained model
model = ScEmbed.from_pretrained('scembed_model/')
# Generate embeddings for AnnData object
embeddings = model.encode(adata)
# Add to AnnData for downstream analysis
adata.obsm['scembed_X'] = embeddings
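One way to think about how a cell embedding arises from region embeddings is mean pooling: average the vectors of all regions that are accessible in the cell. The sketch below assumes this pooling strategy and uses toy values; `mean_pool` is a hypothetical helper and does not reproduce the internals of `model.encode`.

```python
def mean_pool(region_vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not region_vectors:
        raise ValueError("a cell needs at least one tokenized region")
    n = len(region_vectors)
    return [sum(values) / n for values in zip(*region_vectors)]


# Toy 2-dimensional region embeddings, keyed by token id
region_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
cell_tokens = [0, 2]  # regions accessible in one cell
cell_embedding = mean_pool([region_embeddings[t] for t in cell_tokens])
print(cell_embedding)  # [1.0, 0.5]
```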
Integrate with scanpy for clustering and visualization:
import scanpy as sc
# Use scEmbed embeddings for neighborhood graph
sc.pp.neighbors(adata, use_rep='scembed_X')
# Cluster cells
sc.tl.leiden(adata, resolution=0.5)
# Compute UMAP for visualization
sc.tl.umap(adata)
# Plot results
sc.pl.umap(adata, color='leiden')
| Parameter | Description | Typical Range |
|---|---|---|
| embedding_dim | Dimension of cell embeddings | 50 - 200 |
| window_size | Context window for training | 3 - 10 |
| negative_samples | Number of negative samples | 5 - 20 |
| epochs | Training epochs | 50 - 200 |
| batch_size | Training batch size | 128 - 512 |
| learning_rate | Initial learning rate | 0.01 - 0.05 |
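To illustrate what `window_size` controls, the sketch below enumerates (center, context) pairs the way a word2vec-style skip-gram model would. That training scheme is an assumption here, and `context_pairs` is a hypothetical helper rather than a geniml function.

```python
def context_pairs(tokens, window_size):
    """Return (center, context) pairs within `window_size` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = [10, 11, 12, 13]  # token ids for one cell (toy values)
print(context_pairs(tokens, window_size=1))
# [(10, 11), (11, 10), (11, 12), (12, 11), (12, 13), (13, 12)]
```

A larger window produces more pairs per cell, which increases training cost; under the same assumption, `negative_samples` would set how many random non-context tokens are contrasted against each pair.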
Pre-trained scEmbed models are available on Hugging Face for common reference datasets. Load them using:
from geniml.scembed import ScEmbed
# Load pre-trained model
model = ScEmbed.from_pretrained('databio/scembed-pbmc-10k')
# Generate embeddings
embeddings = model.encode(adata)
Adjust embedding_dim and training epochs based on dataset size.

The 10x Genomics PBMC 10k dataset (10,000 peripheral blood mononuclear cells) serves as a standard benchmark.
After clustering, annotate cell types using k-nearest neighbors (KNN) with reference datasets:
from geniml.scembed import annotate_celltypes
# Annotate using reference
annotations = annotate_celltypes(
    query_adata=adata,
    reference_adata=reference,
    embedding_key='scembed_X',
    k=10
)
adata.obs['cell_type'] = annotations
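The KNN idea behind this kind of annotation can be sketched with the standard library alone: each query cell takes the majority label of its k nearest reference cells in embedding space. `knn_annotate` is a hypothetical helper with toy data; Euclidean distance and simple majority voting are assumptions, not a description of what `annotate_celltypes` does internally.

```python
from collections import Counter
import math


def knn_annotate(query, reference, labels, k):
    """Label each query vector by majority vote among its k nearest
    reference vectors (Euclidean distance)."""
    predictions = []
    for q in query:
        # Rank reference indices by distance to the query vector
        ranked = sorted(range(len(reference)),
                        key=lambda i: math.dist(q, reference[i]))
        votes = Counter(labels[i] for i in ranked[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy 2-dimensional embeddings and labels (illustrative values)
reference = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
labels = ["T cell", "T cell", "B cell", "B cell"]
query = [[0.05, 0.1], [0.95, 0.9]]
print(knn_annotate(query, reference, labels, k=3))  # ['T cell', 'B cell']
```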
scEmbed produces cell embeddings (stored in adata.obsm).