scientific-skills/geniml/references/bedspace.md
BEDspace applies the StarSpace model to genomic data, enabling simultaneous training of numerical embeddings for both region sets and their metadata labels in a shared low-dimensional space. This allows for rich queries across regions and metadata.
Use BEDspace when working with:
BEDspace consists of four sequential operations: preprocess, train, distances, and search.
Format genomic intervals and metadata for StarSpace training:
geniml bedspace preprocess \
--input /path/to/regions/ \
--metadata labels.csv \
--universe universe.bed \
--labels "cell_type,tissue" \
--output preprocessed.txt
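Because preprocessing depends on the metadata `file_name` column lining up with the BED files in the input directory, a quick pre-flight check can save a failed run. The following is a minimal sketch using only the Python standard library; the helper name `missing_bed_files` is hypothetical and not part of geniml.

```python
import csv
from pathlib import Path


def missing_bed_files(metadata_csv, bed_dir):
    """Return file_name entries from the metadata CSV that have no file in bed_dir.

    Hypothetical helper for illustration; it is not part of the geniml API.
    """
    # Compare on bare filenames only, since file_name should not include a path.
    present = {p.name for p in Path(bed_dir).iterdir() if p.is_file()}
    with open(metadata_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        if "file_name" not in (reader.fieldnames or []):
            raise ValueError("metadata CSV has no file_name column")
        return [row["file_name"] for row in reader if row["file_name"] not in present]
```

An empty list from `missing_bed_files("labels.csv", "/path/to/regions/")` would mean every metadata row has a matching file.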
Required files:
file_name column matching BED filenames, plus metadata columns
The preprocessing step adds __label__ prefixes to metadata and converts regions to StarSpace-compatible format.
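To make the target format concrete, here is a minimal sketch of how one training line could be assembled: StarSpace reads space-separated tokens per line and treats tokens starting with `__label__` as labels. The region token spelling (`chr_start_end`) and the helper name `to_starspace_line` are assumptions for illustration; the exact tokens geniml writes may differ.

```python
def to_starspace_line(region_tokens, labels):
    """Join region tokens and __label__-prefixed labels into one training line.

    Hypothetical helper; the token spelling used by geniml is assumed, not confirmed.
    """
    # Labels must be single tokens, so replace internal spaces with underscores.
    label_tokens = ["__label__" + str(label).replace(" ", "_") for label in labels]
    return " ".join(list(region_tokens) + label_tokens)


line = to_starspace_line(
    ["chr1_1000_2000", "chr2_500_900"],  # assumed region token format
    ["T_cell", "blood"],
)
print(line)
```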
Execute StarSpace model on preprocessed data:
geniml bedspace train \
--path-to-starspace /path/to/starspace \
--input preprocessed.txt \
--output model/ \
--dim 100 \
--epochs 50 \
--lr 0.05
Key training parameters:
--dim: Embedding dimension (typical: 50-200)
--epochs: Training epochs (typical: 20-100)
--lr: Learning rate (typical: 0.01-0.1)
Compute distance metrics between region sets and metadata labels:
geniml bedspace distances \
--input model/ \
--metadata labels.csv \
--universe universe.bed \
--output distances.pkl
This step creates a distance matrix needed for similarity searches.
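As a rough picture of what such a matrix holds, the sketch below computes pairwise distances between region-set embeddings and label embeddings. Cosine distance is an assumption here, as are the function names and the dictionary layout; the metric and storage format geniml actually uses are not stated above.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def distance_table(region_vecs, label_vecs):
    """Map (region_set_name, label) pairs to their embedding distance."""
    return {
        (region, label): cosine_distance(rvec, lvec)
        for region, rvec in region_vecs.items()
        for label, lvec in label_vecs.items()
    }
```

With this layout, a smaller value means the region set and the label sit closer together in the shared embedding space.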
Retrieve similar items across three scenarios:
Region-to-Label (r2l): Query region set → retrieve similar metadata labels
geniml bedspace search -t r2l -d distances.pkl -q query_regions.bed -n 10
Label-to-Region (l2r): Query metadata label → retrieve similar region sets
geniml bedspace search -t l2r -d distances.pkl -q "T_cell" -n 10
Region-to-Region (r2r): Query region set → retrieve similar region sets
geniml bedspace search -t r2r -d distances.pkl -q query_regions.bed -n 10
The -n parameter controls the number of results returned.
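Conceptually, each search mode ranks candidates by their distance to the query and keeps the closest `n`. A minimal sketch of that ranking step follows, assuming distances are held in a dictionary keyed by `(query, candidate)` pairs; that layout and the name `top_n` are illustrative and do not describe the format of `distances.pkl`.

```python
import heapq


def top_n(distances, query, n=10):
    """Return the n candidates closest to query as (candidate, distance) pairs."""
    # Keep only entries for this query, ordered as (distance, candidate) for ranking.
    candidates = [(dist, cand) for (q, cand), dist in distances.items() if q == query]
    return [(cand, dist) for dist, cand in heapq.nsmallest(n, candidates)]


example = {
    ("T_cell", "a.bed"): 0.3,
    ("T_cell", "b.bed"): 0.1,
    ("B_cell", "a.bed"): 0.05,
}
closest = top_n(example, "T_cell", n=1)
```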
from geniml.bedspace import BEDSpaceModel
# Load trained model
model = BEDSpaceModel.load('model/')
# Query similar items
results = model.search(
    query="T_cell",
    search_type="l2r",
    top_k=10
)
file_name column that exactly matches BED filenames (without path)
Search results return items ranked by similarity in the joint embedding space:
BEDspace requires StarSpace to be installed separately. Download from: https://github.com/facebookresearch/StarSpace