examples/llm/README.md
| Example | Description |
|---|---|
| `g_retriever.py` | Example helper functions for using the G-Retriever GNN+LLM module in PyG. Includes an example repo for Neo4j integration, with an associated blog post demonstrating 2x accuracy gains over LLMs on real medical data. For a complete end-to-end pipeline (KG creation, subgraph retrieval, GNN+LLM fine-tuning, testing, LLM-judge evaluation), see `txt2kg_rag.py`. For a native PyG implementation without external graph databases, see gretriever-stark-prime. |
| `molecule_gpt.py` | Example for MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction. Supports the MoleculeGPT and InstructMol datasets. |
| `glem.py` | Example for GLEM, a GNN+LLM co-training model that uses a variational Expectation-Maximization (EM) framework on node classification tasks to achieve SOTA results. |
| `git_mol.py` | Example for GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text. |
| `protein_mpnn.py` | Example for "Robust deep learning-based protein sequence design using ProteinMPNN". |
| `txt2kg_rag.py` | Full end-to-end RAG pipeline using TXT2KG with vector and graph RAG and a GNN to achieve state-of-the-art results. Uses the TechQA dataset but can be extended to any RAG dataset with a corpus of documents and an associated set of Q+A pairs to be split into train/eval/test. See the Stanford GNN+LLM talk for more details. Note that TechQA requires only a single document to answer each question, so it can be viewed as a toy example. To see significant accuracy boosts from GNN+LLM TXT2KG-based RAG, use data that requires multiple text chunks to answer a question. When a single document can answer the question, basic RAG should be sufficient. |
| `txt2qa.py` | Synthetic multi-hop QA generation pipeline from text documents. Supports vLLM (local GPU) and NVIDIA NIM (API) backends. |

vLLM (local GPU):

```bash
python3 examples/llm/txt2qa.py \
  --config examples/llm/txt2qa_config/text_config_vllm.yaml
```

NVIDIA NIM (API):

```bash
export NVIDIA_API_KEY="your-key-here"
python3 examples/llm/txt2qa.py \
  --config examples/llm/txt2qa_config/text_config_nim.yaml
```

Output is written to `{output_dir}/all_qa_pairs_batch_0.jsonl`.
Adjust the selected YAML file in `txt2qa_config` to choose
the input directory, output directory, model backend, and generation settings.
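
To inspect the generated file, a minimal Python sketch along the following lines may help. It assumes only that each non-empty line of the `.jsonl` file is a standalone JSON object; it makes no assumption about field names, since the record schema is not described here. The helper name `load_qa_pairs` and the default path are hypothetical and used purely for illustration.

```python
import json
import sys
from pathlib import Path


def load_qa_pairs(jsonl_path):
    """Read a JSONL file into a list of parsed records, skipping blank lines.

    Hypothetical helper: assumes one JSON value per non-empty line.
    """
    records = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Hypothetical default path: replace "output" with the output
    # directory set in your YAML config.
    default_path = "output/all_qa_pairs_batch_0.jsonl"
    path = Path(sys.argv[1] if len(sys.argv) > 1 else default_path)
    if not path.exists():
        print(f"No file found at {path}")
    else:
        pairs = load_qa_pairs(path)
        print(f"Loaded {len(pairs)} records from {path}")
        if pairs and isinstance(pairs[0], dict):
            # Print the field names rather than assuming a schema.
            print("Fields in first record:", sorted(pairs[0].keys()))
```

Printing the keys of the first record is a quick way to learn the schema before writing any downstream code against it.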

TXT2QA requires both PyG and vLLM. You can start from either base container and install the other package:

```bash
# Inside the vLLM container, install PyG:
pip install "torch_geometric[full,rag]"

# Or, for a development install from a local clone:
git clone https://github.com/pyg-team/pytorch_geometric.git
cd pytorch_geometric && pip install ".[full,rag]"
```

```bash
# In an environment that already has PyG, install the vLLM pre-built wheel (CUDA 13.0 example):
pip install "https://github.com/vllm-project/vllm/releases/download/v0.15.0/vllm-0.15.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu130
```

For other CUDA versions or installation methods, see the vLLM installation docs.

Note: The vLLM wheel may pull in a `flash-attn` build that is incompatible with the existing environment. If you hit import errors related to flash-attention, run:

```bash
pip uninstall flash-attn flash_attn -y
```

This resolves the issue: vLLM will fall back to its built-in attention backends.
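
Before or after changing the environment, it can be useful to see which of these packages Python can locate. The sketch below uses the standard-library `importlib.util.find_spec`; the helper name `is_installed` is hypothetical. It only checks whether a module can be found, not whether importing it succeeds, so a broken `flash_attn` build would still be reported as found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Import names assumed for the packages discussed above.
    for name in ("torch_geometric", "vllm", "flash_attn"):
        status = "found" if is_installed(name) else "not found"
        print(f"{name}: {status}")
```

If `flash_attn` is reported as found and vLLM fails at import time with a flash-attention error, the uninstall command above is the suggested remedy.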