skills/scikit-bio/SKILL.md
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
This skill should be used when the user:
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
Key operations:
Common patterns:
import skbio
# Read sequences from file
seq = skbio.DNA.read('input.fasta')
# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()
# Find motifs
motif_positions = seq.find_with_regex('ATG[ACGT]{3}')
# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()
Important notes:
DNA, RNA, Protein classes for grammared sequences with validationSequence class for generic sequences without alphabet restrictionsPerform pairwise and multiple sequence alignments using the pair_align engine (introduced in scikit-bio 0.7.0), a versatile and efficient dynamic-programming aligner.
Key capabilities:
pair_align_nucl (BLASTN-like) and pair_align_prot (BLASTP-like)PairAlignPath results carry CIGAR strings and convert to aligned sequencesTabularMSACommon patterns:
from skbio import DNA, Protein
from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, TabularMSA
# Nucleotide alignment with BLASTN-like defaults
seq1, seq2 = DNA('ACTACCAGATTACTTACGGATCAGG'), DNA('CGAAACTACTAGATTACGGATCTTA')
aln = pair_align_nucl(seq1, seq2)
aln.score # alignment score (float)
path = aln.paths[0] # PairAlignPath (repr shows CIGAR)
aligned_seqs = path.to_aligned((seq1, seq2)) # list of gapped strings
# Build a TabularMSA from the alignment path + original sequences
msa = TabularMSA.from_path_seqs(path, (seq1, seq2))
# Customize the algorithm via pair_align (default mode='global')
aln = pair_align(seq1, seq2, mode='local') # Smith-Waterman
aln = pair_align(seq1, seq2, sub_score=(2, -3), gap_cost=(5, 2)) # affine gaps
aln = pair_align(seq1, seq2, sub_score='NUC.4.4', gap_cost=3) # substitution matrix, linear gap
# Protein alignment (BLASTP-like, BLOSUM62)
aln = pair_align_prot(Protein('HEAGAWGHEE'), Protein('PAWHEAE'))
# Read a multiple alignment from file and summarize
msa = TabularMSA.read('alignment.fasta', constructor=DNA)
consensus = msa.consensus()
Important notes:
pair_align replaces the removed SSW wrapper (local_pairwise_align_ssw, StripedSmithWaterman) and the deprecated pure-Python aligners (global_pairwise_align, local_pairwise_align_nucleotide, etc.)PairAlignResult that also unpacks as score, paths, matrices (use keep_matrices=True to retain the DP matrix)sub_score accepts a (match, mismatch) tuple or a matrix name (e.g., 'NUC.4.4', 'BLOSUM62'); gap_cost accepts a single number (linear) or (open, extend) tuple (affine)PairAlignPath.from_cigar('1I8M2D5M2I'); score an existing alignment with align_score(...) and build a distance matrix from an MSA with align_dists(...)Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
Key capabilities:
nni)cophenet, Robinson-Foulds via compare_rfd)Common patterns:
from skbio import TreeNode
from skbio.tree import nj, upgma, gme, bme, rf_dists
# Read tree from file
tree = TreeNode.read('tree.nwk')
# Construct tree from distance matrix
tree = nj(distance_matrix)
# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lca(['taxon1', 'taxon2'])
# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_dm = tree.cophenet() # patristic distance matrix among tips
# Compare two trees (Robinson-Foulds)
rf_distance = tree.compare_rfd(other_tree)
# Pairwise RF distances among many trees -> DistanceMatrix
rf_dm = rf_dists([tree, other_tree, third_tree])
Important notes:
nj() for neighbor joining (classic phylogenetic method)upgma() for UPGMA/WPGMA (assumes molecular clock)nni()cophenet() (formerly tip_tip_distances) returns the patristic distance matrix; compare_rfd() is the Robinson-Foulds method (compare_wrfd/compare_cophenet for weighted/cophenetic variants)lca() is the lowest common ancestor; lowest_common_ancestor remains as an aliasCalculate alpha and beta diversity metrics for microbial ecology and community analysis.
Key capabilities:
sobs, observed_features, chao1, ace), Shannon, Simpson, Hill numbers (hill), Faith's PD (faith_pd), generalized PD (phydiv), Pielou's evennessCommon patterns:
from skbio.diversity import alpha_diversity, beta_diversity
# Alpha diversity (phylogenetic metrics take taxa= for tip-name mapping)
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
tree=tree, taxa=feature_ids)
# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
ids=sample_ids, tree=tree, taxa=feature_ids)
# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())
Important notes:
taxa= (renamed from otu_ids in 0.6.0; the old name is a deprecated alias); observed_otus is now observed_features (or sobs)counts_matrix may be any table-like input (NumPy array, pandas/polars DataFrame, BIOM Table, or AnnData) via the dispatch systempartial_beta_diversity() for specific sample pairs, or block_beta_diversity() for large block-decomposed calculationspandas.Series, beta diversity returns a DistanceMatrixReduce high-dimensional biological data to visualizable lower-dimensional spaces.
Key capabilities:
Common patterns:
from skbio.stats.ordination import pcoa, cca
import skbio
# PCoA from distance matrix (limit dimensions for large matrices)
pcoa_results = pcoa(distance_matrix, dimensions=3)
pc1 = pcoa_results.samples['PC1']
pc2 = pcoa_results.samples['PC2']
# Built-in scatter plot colored by a metadata column
fig = pcoa_results.plot(sample_metadata, column='bodysite')
# CCA with environmental variables
cca_results = cca(species_matrix, environmental_matrix)
# Save/load ordination results
pcoa_results.write('ordination.txt')
results = skbio.OrdinationResults.read('ordination.txt')
Important notes:
dimensions as an int (count) or a float in (0, 1] (fraction of cumulative variance to retain)OrdinationResults exposes pandas-based attributes: samples, features, eigvals, proportion_explained, biplot_scores, sample_constraintsOrdinationResults.plot() produces a matplotlib figure; results also integrate with seaborn/plotlyPerform hypothesis tests specific to ecological and biological data.
Key capabilities:
ancom, dirmult_ttest, and dirmult_lme (longitudinal mixed-effects) in skbio.stats.compositionCommon patterns:
from skbio.stats.distance import permanova, anosim, mantel
# Test if groups differ significantly
permanova_results = permanova(distance_matrix, grouping, permutations=999)
print(f"p-value: {permanova_results['p-value']}")
# ANOSIM test
anosim_results = anosim(distance_matrix, grouping, permutations=999)
# Mantel test between two distance matrices
mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
# Differential abundance on a feature table (raw counts recommended)
from skbio.stats.composition import dirmult_ttest
da = dirmult_ttest(counts_table, grouping, treatment='caseA', reference='control')
Important notes:
Read and write 19+ biological file formats with automatic format detection.
Supported formats:
Common patterns:
import skbio
# Read with automatic format detection
seq = skbio.DNA.read('file.fasta', format='fasta')
tree = skbio.TreeNode.read('tree.nwk')
# Write to file
seq.write('output.fasta', format='fasta')
# Generator for large files (memory efficient)
for seq in skbio.io.read('large.fasta', format='fasta', constructor=skbio.DNA):
process(seq)
# Convert formats
seqs = list(skbio.io.read('input.fastq', format='fastq', constructor=skbio.DNA))
skbio.io.write(seqs, format='fasta', into='output.fasta')
Important notes:
into parameter specifiedverify=FalseCreate and manipulate distance/dissimilarity matrices with statistical methods.
Key capabilities:
DistanceMatrix, hollow diagonal) or general pairwise (PairwiseMatrix) dataCommon patterns:
from skbio import DistanceMatrix
import numpy as np
# Create from array
data = np.array([[0, 1, 2], [1, 0, 3], [2, 3, 0]])
dm = DistanceMatrix(data, ids=['A', 'B', 'C'])
# Access distances
dist_ab = dm['A', 'B']
row_a = dm['A']
# Read from file
dm = DistanceMatrix.read('distances.txt')
# Use in downstream analyses
pcoa_results = pcoa(dm)
permanova_results = permanova(dm, grouping)
Important notes:
DistanceMatrix enforces symmetry and a zero (hollow) diagonal; it is a subclass of SymmetricMatrixPairwiseMatrix (renamed from DissimilarityMatrix, which is kept as a deprecated alias) allows general/asymmetric valuesWork with feature tables (OTU/ASV tables) common in microbiome research.
Key capabilities:
Table classtable_like input — BIOM Table, pandas/polars DataFrame, NumPy array, or AnnData — without explicit conversionphylomix, mixup, aitchison_mixup, compos_cutmix)Common patterns:
from skbio import Table
from skbio.diversity import beta_diversity
# Read BIOM table
table = Table.read('table.biom')
# Access data
sample_ids = table.ids(axis='sample')
feature_ids = table.ids(axis='observation')
counts = table.matrix_data
# Filter
filtered = table.filter(sample_ids_to_keep, axis='sample')
# Pass table-like objects directly to scikit-bio drivers (dispatch system)
import pandas as pd
df = pd.read_table('data.tsv', index_col=0) # samples x features
bdiv = beta_diversity('braycurtis', df) # no manual conversion needed
Important notes:
Work with protein language model embeddings for downstream analysis.
Key capabilities:
Common patterns:
from skbio.embedding import ProteinEmbedding, ProteinVector
# Create embedding from array
embedding = ProteinEmbedding(embedding_array, sequence_ids)
# Convert to distance matrix for analysis
dm = embedding.to_distances(metric='euclidean')
# PCoA visualization of embedding space
pcoa_results = embedding.to_ordination(metric='euclidean', method='pcoa')
# Export for machine learning
array = embedding.to_array()
df = embedding.to_dataframe()
Important notes:
uv pip install scikit-bio
Requires Python 3.10+ and NumPy 2.0+. Pre-compiled wheels are published for each release since 0.7.0, so most platforms install without a compiler. Conda users can instead run conda install -c conda-forge scikit-bio.
partial_beta_diversity()For detailed API information, parameter specifications, and advanced usage examples, refer to references/api_reference.md which contains comprehensive documentation on: