Back to Claude Scientific Skills

BioServices: Identifier Mapping Guide

scientific-skills/bioservices/references/identifier_mapping.md

2.38.017.1 KB
Original Source

BioServices: Identifier Mapping Guide

This document provides comprehensive information about converting identifiers between different biological databases using BioServices.

Table of Contents

  1. Overview
  2. UniProt Mapping Service
  3. UniChem Compound Mapping
  4. KEGG Identifier Conversions
  5. Common Mapping Patterns
  6. Troubleshooting

Overview

Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:

  1. UniProt Mapping: Comprehensive protein/gene ID conversion
  2. UniChem: Chemical compound ID mapping
  3. KEGG: Built-in cross-references in entries
  4. PICR: Protein identifier cross-reference service

UniProt Mapping Service

The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.

Basic Usage

python
from bioservices import UniProt

u = UniProt()

# Map single ID
result = u.mapping(
    fr="UniProtKB_AC-ID",    # Source database
    to="KEGG",                # Target database
    query="P43403"            # Identifier to convert
)

print(result)
# Output: {'P43403': ['hsa:7535']}

Batch Mapping

python
# Map multiple IDs (comma-separated)
ids = ["P43403", "P04637", "P53779"]
result = u.mapping(
    fr="UniProtKB_AC-ID",
    to="KEGG",
    query=",".join(ids)
)

for uniprot_id, kegg_ids in result.items():
    print(f"{uniprot_id}{kegg_ids}")

Supported Database Pairs

UniProt supports mapping between 100+ database pairs. Key ones include:

Protein/Gene Databases

Source FormatCodeTarget FormatCode
UniProtKB AC/IDUniProtKB_AC-IDKEGGKEGG
UniProtKB AC/IDUniProtKB_AC-IDEnsemblEnsembl
UniProtKB AC/IDUniProtKB_AC-IDEnsembl ProteinEnsembl_Protein
UniProtKB AC/IDUniProtKB_AC-IDEnsembl TranscriptEnsembl_Transcript
UniProtKB AC/IDUniProtKB_AC-IDRefSeq ProteinRefSeq_Protein
UniProtKB AC/IDUniProtKB_AC-IDRefSeq NucleotideRefSeq_Nucleotide
UniProtKB AC/IDUniProtKB_AC-IDGeneID (Entrez)GeneID
UniProtKB AC/IDUniProtKB_AC-IDHGNCHGNC
UniProtKB AC/IDUniProtKB_AC-IDMGIMGI
KEGGKEGGUniProtKBUniProtKB
EnsemblEnsemblUniProtKBUniProtKB
GeneIDGeneIDUniProtKBUniProtKB

Structural Databases

SourceCodeTargetCode
UniProtKB AC/IDUniProtKB_AC-IDPDBPDB
UniProtKB AC/IDUniProtKB_AC-IDPfamPfam
UniProtKB AC/IDUniProtKB_AC-IDInterProInterPro
PDBPDBUniProtKBUniProtKB

Expression & Proteomics

SourceCodeTargetCode
UniProtKB AC/IDUniProtKB_AC-IDPRIDEPRIDE
UniProtKB AC/IDUniProtKB_AC-IDProteomicsDBProteomicsDB
UniProtKB AC/IDUniProtKB_AC-IDPaxDbPaxDb

Organism-Specific

SourceCodeTargetCode
UniProtKB AC/IDUniProtKB_AC-IDFlyBaseFlyBase
UniProtKB AC/IDUniProtKB_AC-IDWormBaseWormBase
UniProtKB AC/IDUniProtKB_AC-IDSGDSGD
UniProtKB AC/IDUniProtKB_AC-IDZFINZFIN

Other Useful Mappings

SourceCodeTargetCode
UniProtKB AC/IDUniProtKB_AC-IDGOGO
UniProtKB AC/IDUniProtKB_AC-IDReactomeReactome
UniProtKB AC/IDUniProtKB_AC-IDSTRINGSTRING
UniProtKB AC/IDUniProtKB_AC-IDBioGRIDBioGRID
UniProtKB AC/IDUniProtKB_AC-IDOMAOMA

Complete List of Database Codes

To get the complete, up-to-date list:

python
from bioservices import UniProt

u = UniProt()

# This information is in the UniProt REST API documentation
# Common patterns:
# - Source databases typically end in source database name
# - UniProtKB uses "UniProtKB_AC-ID" or "UniProtKB"
# - Most other databases use their standard abbreviation

Common Database Codes Reference

Gene/Protein Identifiers:

  • UniProtKB_AC-ID: UniProt accession/ID
  • UniProtKB: UniProt accession
  • KEGG: KEGG gene IDs (e.g., hsa:7535)
  • GeneID: NCBI Gene (Entrez) IDs
  • Ensembl: Ensembl gene IDs
  • Ensembl_Protein: Ensembl protein IDs
  • Ensembl_Transcript: Ensembl transcript IDs
  • RefSeq_Protein: RefSeq protein IDs (NP_)
  • RefSeq_Nucleotide: RefSeq nucleotide IDs (NM_)

Gene Nomenclature:

  • HGNC: Human Gene Nomenclature Committee
  • MGI: Mouse Genome Informatics
  • RGD: Rat Genome Database
  • SGD: Saccharomyces Genome Database
  • FlyBase: Drosophila database
  • WormBase: C. elegans database
  • ZFIN: Zebrafish database

Structure:

  • PDB: Protein Data Bank
  • Pfam: Protein families
  • InterPro: Protein domains
  • SUPFAM: Superfamily
  • PROSITE: Protein motifs

Pathways & Networks:

  • Reactome: Reactome pathways
  • BioCyc: BioCyc pathways
  • PathwayCommons: Pathway Commons
  • STRING: Protein-protein networks
  • BioGRID: Interaction database

Mapping Examples

UniProt → KEGG

python
from bioservices import UniProt

u = UniProt()

# Single mapping
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
print(result)  # {'P43403': ['hsa:7535']}

KEGG → UniProt

python
# Reverse mapping
result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
print(result)  # {'hsa:7535': ['P43403']}

UniProt → Ensembl

python
# To Ensembl gene IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
print(result)  # {'P43403': ['ENSG00000115085']}

# To Ensembl protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
print(result)  # {'P43403': ['ENSP00000381359']}

UniProt → PDB

python
# Find 3D structures
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
print(result)  # {'P04637': ['1A1U', '1AIE', '1C26', ...]}

UniProt → RefSeq

python
# Get RefSeq protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
print(result)  # {'P43403': ['NP_001070.2']}

Gene Name → UniProt (via search, then mapping)

python
# First search for gene
search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")
lines = search_result.strip().split("\n")
if len(lines) > 1:
    uniprot_id = lines[1].split("\t")[0]

    # Then map to other databases
    kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    print(kegg_id)

UniChem Compound Mapping

UniChem specializes in mapping chemical compound identifiers across databases.

Source Database IDs

Source IDDatabase
1ChEMBL
2DrugBank
3PDB
4IUPHAR/BPS Guide to Pharmacology
5PubChem
6KEGG
7ChEBI
8NIH Clinical Collection
14FDA/SRS
22PubChem

Basic Usage

python
from bioservices import UniChem

u = UniChem()

# Get ChEMBL ID from KEGG compound ID
chembl_id = u.get_compound_id_from_kegg("C11222")
print(chembl_id)  # CHEMBL278315

All Compound IDs

python
# Get all identifiers for a compound
# src_compound_id: compound ID, src_id: source database ID
all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1)  # 1 = ChEMBL

for mapping in all_ids:
    src_name = mapping['src_name']
    src_compound_id = mapping['src_compound_id']
    print(f"{src_name}: {src_compound_id}")

Specific Database Conversion

python
# Convert between specific databases
# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
print(result)

Common Compound Mappings

KEGG → ChEMBL

python
u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C00031")  # D-Glucose
print(f"ChEMBL: {chembl_id}")

ChEMBL → PubChem

python
result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
if result:
    pubchem_id = result[0]['src_compound_id']
    print(f"PubChem: {pubchem_id}")

ChEBI → DrugBank

python
result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
if result:
    drugbank_id = result[0]['src_compound_id']
    print(f"DrugBank: {drugbank_id}")

KEGG Identifier Conversions

KEGG entries contain cross-references that can be extracted by parsing.

Extract Database Links from KEGG Entry

python
from bioservices import KEGG

k = KEGG()

# Get compound entry
entry = k.get("cpd:C11222")

# Parse for specific database
chebi_id = None
uniprot_ids = []

for line in entry.split("\n"):
    if "ChEBI:" in line:
        # Extract ChEBI ID
        parts = line.split("ChEBI:")
        if len(parts) > 1:
            chebi_id = parts[1].strip().split()[0]

# For genes/proteins
gene_entry = k.get("hsa:7535")
for line in gene_entry.split("\n"):
    if line.startswith("            "):  # Database links section
        if "UniProt:" in line:
            parts = line.split("UniProt:")
            if len(parts) > 1:
                uniprot_id = parts[1].strip()
                uniprot_ids.append(uniprot_id)

KEGG Gene ID Components

KEGG gene IDs have format organism:gene_id:

python
kegg_id = "hsa:7535"
organism, gene_id = kegg_id.split(":")

print(f"Organism: {organism}")  # hsa (human)
print(f"Gene ID: {gene_id}")    # 7535

KEGG Pathway to Genes

python
k = KEGG()

# Get pathway entry
pathway = k.get("path:hsa04660")

# Parse for gene list
genes = []
in_gene_section = False

for line in pathway.split("\n"):
    if line.startswith("GENE"):
        in_gene_section = True

    if in_gene_section:
        if line.startswith(" " * 12):  # Gene line
            parts = line.strip().split()
            if parts:
                gene_id = parts[0]
                genes.append(f"hsa:{gene_id}")
        elif not line.startswith(" "):
            break

print(f"Found {len(genes)} genes")

Common Mapping Patterns

Pattern 1: Gene Symbol → Multiple Database IDs

python
from bioservices import UniProt

def gene_symbol_to_ids(gene_symbol, organism="9606"):
    """Convert gene symbol to multiple database IDs."""
    u = UniProt()

    # Search for gene
    query = f"gene:{gene_symbol} AND organism:{organism}"
    result = u.search(query, frmt="tab", columns="id")

    lines = result.strip().split("\n")
    if len(lines) < 2:
        return None

    uniprot_id = lines[1].split("\t")[0]

    # Map to multiple databases
    ids = {
        'uniprot': uniprot_id,
        'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
        'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
        'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
        'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
    }

    return ids

# Usage
ids = gene_symbol_to_ids("ZAP70")
print(ids)

Pattern 2: Compound Name → All Database IDs

python
from bioservices import KEGG, UniChem, ChEBI

def compound_name_to_ids(compound_name):
    """Search compound and get all database IDs."""
    k = KEGG()

    # Search KEGG
    results = k.find("compound", compound_name)
    if not results:
        return None

    # Extract KEGG ID
    kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")

    # Get KEGG entry for ChEBI
    entry = k.get(f"cpd:{kegg_id}")
    chebi_id = None
    for line in entry.split("\n"):
        if "ChEBI:" in line:
            parts = line.split("ChEBI:")
            if len(parts) > 1:
                chebi_id = parts[1].strip().split()[0]
                break

    # Get ChEMBL from UniChem
    u = UniChem()
    try:
        chembl_id = u.get_compound_id_from_kegg(kegg_id)
    except:
        chembl_id = None

    return {
        'kegg': kegg_id,
        'chebi': chebi_id,
        'chembl': chembl_id
    }

# Usage
ids = compound_name_to_ids("Geldanamycin")
print(ids)

Pattern 3: Batch ID Conversion with Error Handling

python
from bioservices import UniProt

def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
    """Safely map IDs with error handling and chunking."""
    u = UniProt()
    all_results = {}

    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        query = ",".join(chunk)

        try:
            results = u.mapping(fr=from_db, to=to_db, query=query)
            all_results.update(results)
            print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")

        except Exception as e:
            print(f"✗ Error at chunk {i}: {e}")

            # Try individual IDs in failed chunk
            for single_id in chunk:
                try:
                    result = u.mapping(fr=from_db, to=to_db, query=single_id)
                    all_results.update(result)
                except:
                    all_results[single_id] = None

    return all_results

# Usage
uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")

Pattern 4: Multi-Hop Mapping

Sometimes you need to map through intermediate databases:

python
from bioservices import UniProt

def multi_hop_mapping(gene_symbol, organism="9606"):
    """Gene symbol → UniProt → KEGG → Pathways."""
    u = UniProt()
    k = KEGG()

    # Step 1: Gene symbol → UniProt
    query = f"gene:{gene_symbol} AND organism:{organism}"
    result = u.search(query, frmt="tab", columns="id")

    lines = result.strip().split("\n")
    if len(lines) < 2:
        return None

    uniprot_id = lines[1].split("\t")[0]

    # Step 2: UniProt → KEGG
    kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    if not kegg_mapping or uniprot_id not in kegg_mapping:
        return None

    kegg_id = kegg_mapping[uniprot_id][0]

    # Step 3: KEGG → Pathways
    organism_code, gene_id = kegg_id.split(":")
    pathways = k.get_pathway_by_gene(gene_id, organism_code)

    return {
        'gene': gene_symbol,
        'uniprot': uniprot_id,
        'kegg': kegg_id,
        'pathways': pathways
    }

# Usage
result = multi_hop_mapping("TP53")
print(result)

Troubleshooting

Issue 1: No Mapping Found

Symptom: Mapping returns empty or None

Solutions:

  1. Verify source ID exists in source database
  2. Check database code spelling
  3. Try reverse mapping
  4. Some IDs may not have mappings in all databases
python
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")

if not result or 'P43403' not in result:
    print("No mapping found. Try:")
    print("1. Verify ID exists: u.search('P43403')")
    print("2. Check if protein has KEGG annotation")

Issue 2: Too Many IDs in Batch

Symptom: Batch mapping fails or times out

Solution: Split into smaller chunks

python
def chunked_mapping(ids, from_db, to_db, chunk_size=50):
    all_results = {}

    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
        all_results.update(result)

    return all_results

Issue 3: Multiple Target IDs

Symptom: One source ID maps to multiple target IDs

Solution: Handle as list

python
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}

pdb_ids = result['P04637']
print(f"Found {len(pdb_ids)} PDB structures")

for pdb_id in pdb_ids:
    print(f"  {pdb_id}")

Issue 4: Organism Ambiguity

Symptom: Gene symbol maps to multiple organisms

Solution: Always specify organism in searches

python
# Bad: Ambiguous
result = u.search("gene:TP53")  # Many organisms have TP53

# Good: Specific
result = u.search("gene:TP53 AND organism:9606")  # Human only

Issue 5: Deprecated IDs

Symptom: Old database IDs don't map

Solution: Update to current IDs first

python
# Check if ID is current
entry = u.retrieve("P43403", frmt="txt")

# Look for secondary accessions
for line in entry.split("\n"):
    if line.startswith("AC"):
        print(line)  # Shows primary and secondary accessions

Best Practices

  1. Always validate inputs before batch processing
  2. Handle None/empty results gracefully
  3. Use chunking for large ID lists (50-100 per chunk)
  4. Cache results for repeated queries
  5. Specify organism when possible to avoid ambiguity
  6. Log failures in batch processing for later retry
  7. Add delays between large batches to respect API limits
python
import time

def polite_batch_mapping(ids, from_db, to_db):
    """Batch mapping with rate limiting."""
    results = {}

    for i in range(0, len(ids), 50):
        chunk = ids[i:i+50]
        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
        results.update(result)

        time.sleep(0.5)  # Be nice to the API

    return results

For complete working examples, see:

  • scripts/batch_id_converter.py: Command-line batch conversion tool
  • workflow_patterns.md: Integration into larger workflows