scientific-skills/geniml/references/utilities.md
## BBClient

BBClient provides efficient caching of BED files from remote sources, so repeated accesses hit the local cache instead of re-downloading, and cached files can be shared with R workflows.

Use BBClient when you need repeated access to remote BED files or want to reuse the same cache from both Python and R.
```python
from geniml.bbclient import BBClient

# Initialize client with a local cache directory
client = BBClient(cache_folder='~/.bedcache')

# Fetch and cache a BED file
bed_file = client.load_bed(bed_id='GSM123456')

# Access the cached file
regions = client.get_regions('GSM123456')
```
From R, via reticulate:

```r
library(reticulate)

geniml <- import("geniml.bbclient")

# Initialize client
client <- geniml$BBClient(cache_folder='~/.bedcache')

# Load BED file
bed_file <- client$load_bed(bed_id='GSM123456')
```
## BEDshift

BEDshift provides tools for randomizing BED files while preserving chosen properties of the original genomic context, which is essential for generating null distributions for statistical testing.

Use BEDshift when you need randomized versions of a region set as a null model for enrichment or overlap statistics.
```python
from geniml.bedshift import bedshift

# Randomize a BED file, preserving the chromosome distribution
randomized = bedshift(
    input_bed='peaks.bed',
    genome='hg38',
    preserve_chrom=True,
    n_iterations=100
)
```
Equivalent CLI:

```shell
geniml bedshift \
  --input peaks.bed \
  --genome hg38 \
  --preserve-chrom \
  --iterations 100 \
  --output randomized_peaks.bed
```
Preservation options:

- Preserve chromosome distribution: `bedshift(input_bed, genome, preserve_chrom=True)` keeps regions on the same chromosomes as the original.
- Preserve distance distribution: `bedshift(input_bed, genome, preserve_distance=True)` maintains inter-region distances.
- Preserve region sizes: `bedshift(input_bed, genome, preserve_size=True)` keeps the original region lengths.
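As a rough illustration of what `preserve_chrom` and `preserve_size` mean together, the sketch below re-places each region at a random position on its own chromosome while keeping its length. The chromosome sizes are made-up toy values; real BEDshift additionally handles genome files, iterations, and distance preservation:

```python
import random

def shuffle_regions(regions, chrom_sizes, seed=0):
    """Randomize regions, preserving chromosome and length (toy sketch)."""
    rng = random.Random(seed)
    shuffled = []
    for chrom, start, end in regions:
        length = end - start
        # pick a new start so the region still fits on its chromosome
        new_start = rng.randrange(0, chrom_sizes[chrom] - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled

# Toy chromosome sizes; real runs read these from the genome (e.g. hg38)
chrom_sizes = {"chr1": 1_000_000, "chr2": 800_000}
regions = [("chr1", 100, 600), ("chr2", 5_000, 5_500)]
null_set = shuffle_regions(regions, chrom_sizes)
```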
## Evaluation

Geniml provides evaluation utilities for assessing embedding quality and model performance.

Use the evaluation tools when you need quantitative metrics for trained embeddings or for downstream predictions such as cell-type annotation.
```python
from geniml.evaluation import evaluate_embeddings

# Evaluate Region2Vec embeddings
metrics = evaluate_embeddings(
    embeddings_file='region2vec_model/embeddings.npy',
    labels_file='metadata.csv',
    metrics=['silhouette', 'davies_bouldin', 'calinski_harabasz']
)

print(f"Silhouette score: {metrics['silhouette']:.3f}")
print(f"Davies-Bouldin index: {metrics['davies_bouldin']:.3f}")
```
- Silhouette score: cluster cohesion and separation (-1 to 1, higher is better)
- Davies-Bouldin index: average similarity between clusters (≥ 0, lower is better)
- Calinski-Harabasz score: ratio of between- to within-cluster dispersion (higher is better)
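These three metrics can also be computed directly with scikit-learn (assuming it is installed); the toy data below stands in for real embeddings and cluster labels:

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Two well-separated toy clusters standing in for region-set embeddings
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)

sil = silhouette_score(X, labels)        # -1..1, higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # higher is better
```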
```python
from geniml.evaluation import evaluate_annotation

# Evaluate cell-type predictions
results = evaluate_annotation(
    predicted=adata.obs['predicted_celltype'],
    true=adata.obs['true_celltype'],
    metrics=['accuracy', 'f1', 'confusion_matrix']
)

print(f"Accuracy: {results['accuracy']:.1%}")
print(f"F1 score: {results['f1']:.3f}")
```
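The same classification metrics are available from scikit-learn if you want them outside of geniml; the label vectors below are made-up examples:

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical true and predicted cell-type labels
true = ["T_cell", "B_cell", "T_cell", "NK", "B_cell", "T_cell"]
pred = ["T_cell", "B_cell", "NK",     "NK", "T_cell", "T_cell"]

acc = accuracy_score(true, pred)
f1 = f1_score(true, pred, average="macro")  # unweighted mean of per-class F1
cm = confusion_matrix(true, pred, labels=["B_cell", "NK", "T_cell"])
```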
## Tokenization

Tokenization converts genomic regions into discrete tokens drawn from a reference universe, enabling word2vec-style training.

Tokenization is a required preprocessing step for training region embeddings such as Region2Vec.
Strict overlap-based tokenization:

```python
from geniml.tokenization import hard_tokenization

hard_tokenization(
    src_folder='bed_files/',
    dst_folder='tokenized/',
    universe_file='universe.bed',
    p_value_threshold=1e-9
)
```
Parameters:

- `p_value_threshold`: significance level for overlap (typically 1e-9 or 1e-6)

Probabilistic tokenization allowing partial matches:
```python
from geniml.tokenization import soft_tokenization

soft_tokenization(
    src_folder='bed_files/',
    dst_folder='tokenized/',
    universe_file='universe.bed',
    overlap_threshold=0.5
)
```
Parameters:

- `overlap_threshold`: minimum overlap fraction (0-1)

Map regions to universe tokens with custom parameters:
```python
from geniml.tokenization import universe_tokenization

universe_tokenization(
    bed_file='peaks.bed',
    universe_file='universe.bed',
    output_file='tokens.txt',
    method='hard',
    threshold=1e-9
)
```
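At its core, overlap-based tokenization assigns each region to the universe interval it overlaps best. This simplified sketch uses plain tuples instead of geniml's BED handling and a fractional-overlap score instead of the p-value test:

```python
def best_token(region, universe):
    """Return (index, overlap fraction) of the universe interval that
    overlaps `region` most, or (None, 0.0) if nothing overlaps."""
    chrom, start, end = region
    best, best_frac = None, 0.0
    for i, (uc, us, ue) in enumerate(universe):
        if uc != chrom:
            continue
        overlap = max(0, min(end, ue) - max(start, us))
        frac = overlap / (end - start)
        if frac > best_frac:
            best, best_frac = i, frac
    return best, best_frac

universe = [("chr1", 0, 500), ("chr1", 500, 1000), ("chr2", 0, 400)]
token, frac = best_token(("chr1", 450, 650), universe)
```

With a threshold on the overlap score, the same logic distinguishes hard tokenization (keep only confident assignments) from soft tokenization (allow partial matches).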
Check how well regions tokenize:
```python
from geniml.tokenization import check_coverage

coverage = check_coverage(
    bed_file='peaks.bed',
    universe_file='universe.bed',
    threshold=1e-9
)

print(f"Tokenization coverage: {coverage:.1%}")
```
Aim for >80% coverage for reliable training.
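Coverage is simply the fraction of input regions that map to some universe token. A sketch under a simplified fractional-overlap criterion (geniml's actual check uses a p-value threshold):

```python
def tokenization_coverage(regions, universe, min_frac=0.5):
    """Fraction of regions whose overlap with some universe interval
    reaches min_frac (toy criterion, not geniml's p-value test)."""
    def covered(region):
        chrom, start, end = region
        for uc, us, ue in universe:
            if uc != chrom:
                continue
            overlap = max(0, min(end, ue) - max(start, us))
            if overlap / (end - start) >= min_frac:
                return True
        return False
    return sum(map(covered, regions)) / len(regions)

universe = [("chr1", 0, 1000)]
regions = [("chr1", 100, 200),   # fully inside: covered
           ("chr1", 900, 1100),  # half inside: covered at min_frac=0.5
           ("chr2", 0, 100),     # wrong chromosome: not covered
           ("chr1", 0, 500)]     # fully inside: covered
cov = tokenization_coverage(regions, universe)
```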
## Text2BedNN search

Text2BedNN creates neural network-based search backends for querying genomic regions using natural language or metadata.

Use Text2BedNN when you want to retrieve region sets by free-text description or metadata filters rather than by coordinate overlap.
Step 1: Prepare embeddings by training a BEDspace or Region2Vec model with metadata.
Step 2: Build search index
```python
from geniml.search import build_search_index

build_search_index(
    embeddings_file='bedspace_model/embeddings.npy',
    metadata_file='metadata.csv',
    output_dir='search_backend/'
)
```
Step 3: Query the index
```python
from geniml.search import SearchBackend

backend = SearchBackend.load('search_backend/')

# Natural language query
results = backend.query(
    text="T cell regulatory regions",
    top_k=10
)

# Metadata query
results = backend.query(
    metadata={'cell_type': 'T_cell', 'tissue': 'blood'},
    top_k=10
)
```
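Under the hood, querying such an index comes down to nearest-neighbor search in embedding space: the query (text or metadata) is embedded, then compared against the stored vectors. A minimal cosine-similarity version with NumPy, using toy vectors:

```python
import numpy as np

def top_k_search(query_vec, embeddings, k=3):
    """Return indices and scores of the k stored vectors most similar
    to the query (cosine similarity, best first)."""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                    # cosine similarity to every vector
    order = np.argsort(-scores)[:k]   # best-scoring indices first
    return order, scores[order]

embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx, scores = top_k_search(np.array([1.0, 0.0]), embeddings, k=2)
```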
## I/O and model utilities

```python
from geniml.io import read_bed, write_bed, load_universe

# Read a BED file
regions = read_bed('peaks.bed')

# Write a BED file
write_bed(regions, 'output.bed')

# Load a universe
universe = load_universe('universe.bed')
```
```python
from geniml.models import save_model, load_model

# Save a trained model
save_model(model, 'my_model/')

# Load it back
model = load_model('my_model/')
```
## Pipeline workflow

```python
# Module paths for build_universe and region2vec are assumed here;
# see their sections in this reference set for the exact imports.
from geniml.universe import build_universe
from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# 1. Build a universe from coverage tracks
universe = build_universe(coverage_folder='coverage/', method='cc', cutoff=5)

# 2. Tokenize BED files against the universe
hard_tokenization(src_folder='beds/', dst_folder='tokens/',
                  universe_file='universe.bed', p_value_threshold=1e-9)

# 3. Train embeddings
region2vec(token_folder='tokens/', save_dir='model/', num_shufflings=1000)

# 4. Evaluate the result
metrics = evaluate_embeddings(embeddings_file='model/embeddings.npy',
                              labels_file='metadata.csv')
```
This modular design allows flexible composition of geniml tools for diverse genomic ML workflows.