# UniProt Reader
This package provides a reader for UniProt Swiss-Prot format files, allowing you to load protein data into LlamaIndex for further processing and analysis.
## Installation

```bash
pip install llama-index-readers-uniprot
```
## Usage

```python
from llama_index.readers.uniprot import UniProtReader

# Initialize the reader
reader = UniProtReader()

# Load data from a UniProt file
documents = reader.load_data("path/to/uniprot_sprot.dat")

# Access the documents
for doc in documents:
    print(f"Protein ID: {doc.metadata['id']}")
```
### Lazy Loading

Since UniProt files are large (several GB), it's recommended to use lazy loading to process records one at a time without loading the entire database into memory:
```python
from llama_index.readers.uniprot import UniProtReader

# Initialize the reader
reader = UniProtReader()

# Load data lazily from a UniProt file
for doc in reader.lazy_load_data("path/to/uniprot_sprot.dat"):
    print(f"Protein ID: {doc.metadata['id']}")
    print("---")
```
### Building a Vector Index in Batches

For a complete pipeline, combine lazy loading with batched index updates. Protein IDs already present in the docstore are skipped, so the script can resume where a previous run left off (with a freshly created index the skip set is empty; reload a persisted index to benefit from it):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.uniprot import UniProtReader

reader = UniProtReader(max_records=10000)

text_splitter = SentenceSplitter(chunk_size=2048)
index = VectorStoreIndex([], transformations=[text_splitter], show_progress=True)

# Load existing protein IDs from the index
existing_protein_ids = {
    node.metadata.get("id")
    for node in index.storage_context.docstore.docs.values()
    if node.metadata.get("id")
}

documents_gen = reader.lazy_load_data("path/to/uniprot_sprot.dat")

# Process documents in batches
batch_size = 10
current_batch = []
for doc in documents_gen:
    protein_id = doc.metadata.get("id")
    if protein_id in existing_protein_ids:
        print(f"Skipping document {protein_id} - already indexed")
        continue

    current_batch.append(doc)
    if len(current_batch) >= batch_size:
        index.refresh_ref_docs(documents=current_batch)
        current_batch = []

# Process any remaining documents
if current_batch:
    index.refresh_ref_docs(documents=current_batch)

# Persist the index to disk
persist_dir = "path/to/persist/directory"
index.storage_context.persist(persist_dir=persist_dir)
```
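Once persisted, the index can be reloaded and queried later without re-parsing the UniProt file. A minimal sketch using LlamaIndex's standard storage APIs (the query text is just illustrative):

```python
from llama_index.core import StorageContext, load_index_from_storage

# Reload the previously persisted index
storage_context = StorageContext.from_defaults(
    persist_dir="path/to/persist/directory"
)
index = load_index_from_storage(storage_context)

# Query the indexed protein data
query_engine = index.as_query_engine()
response = query_engine.query("Which organism does this protein come from?")
print(response)
```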
### Selecting Fields

You can specify which fields to include in the output:
```python
from llama_index.readers.uniprot import UniProtReader

# Only include specific fields
reader = UniProtReader(include_fields={"id", "description", "sequence"})
documents = reader.load_data("path/to/uniprot_sprot.dat")
```
Available fields:

- `id`: Protein identifier
- `accession`: Accession numbers
- `description`: Protein description
- `gene_names`: Gene names
- `organism`: Organism name
- `comments`: Comments and annotations
- `keywords`: Keywords
- `sequence_length`: Length of the protein sequence
- `sequence_mw`: Molecular weight of the protein
- `taxonomy`: Taxonomic classification
- `taxonomy_id`: Taxonomic database identifiers
- `citations`: Literature citations
- `cross_references`: Cross-references to other databases
- `features`: Protein features

By default, all fields are included.
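Restricting fields shrinks the text that gets embedded. As a rough sketch (field names follow the example above; lengths will vary by record), you can compare one record parsed with all fields against a reduced set:

```python
from llama_index.readers.uniprot import UniProtReader

full_reader = UniProtReader()
slim_reader = UniProtReader(include_fields={"id", "description", "sequence"})

# Parse just the first record with each configuration
full_doc = next(iter(full_reader.lazy_load_data("path/to/uniprot_sprot.dat")))
slim_doc = next(iter(slim_reader.lazy_load_data("path/to/uniprot_sprot.dat")))

# Fewer fields means shorter document text to embed
print(len(full_doc.text), len(slim_doc.text))
```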
### Limiting Records

You can limit the number of records to parse using the `max_records` parameter:
```python
from llama_index.readers.uniprot import UniProtReader

# Parse only the first 1000 records
reader = UniProtReader(max_records=1000)
documents = reader.load_data("path/to/uniprot_sprot.dat")

# Works with lazy loading too
for doc in reader.lazy_load_data("path/to/uniprot_sprot.dat"):
    print(f"Protein ID: {doc.metadata['id']}")
```
## Contributing

We welcome contributions! Please see our contributing guidelines for details.
## License

This project is licensed under the MIT License - see the LICENSE file for details.