Examples for Co-training LLMs and GNNs

examples/llm/README.md

| Example | Description |
| --- | --- |
| g_retriever.py | Example helper functions for using the G-retriever GNN+LLM module in PyG. Includes an example repo for Neo4j integration with an associated blog post demonstrating 2x accuracy gains over LLMs on real medical data. For a complete end-to-end pipeline (KG Creation, Subgraph Retrieval, GNN+LLM Finetuning, Testing, LLM Judge Eval), see txt2kg_rag.py. For a native PyG implementation without external graph databases, see gretriever-stark-prime. |
| molecule_gpt.py | Example for MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction. Supports the MoleculeGPT and InstructMol datasets. |
| glem.py | Example for GLEM, a GNN+LLM co-training model that uses a variational Expectation-Maximization (EM) framework on node classification tasks to achieve SOTA results. |
| git_mol.py | Example for GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text. |
| protein_mpnn.py | Example for Robust deep learning-based protein sequence design using ProteinMPNN. |
| txt2kg_rag.py | Full end-to-end RAG pipeline using TXT2KG together with vector and graph RAG and a GNN to achieve state-of-the-art results. Uses the TechQA dataset but can be extended to any RAG dataset with a corpus of documents and an associated set of Q+A pairs to be split into train/eval/test. See Stanford GNN+LLM Talk for more details. Note that TechQA requires only a single document to answer each question, so it can be viewed as a toy example. To see significant accuracy boosts from GNN+LLM TXT2KG-based RAG, use data that requires multiple text chunks to answer a question. When a single document is enough to answer a question, basic RAG should be sufficient. |
| txt2qa.py | Synthetic multi-hop QA generation pipeline from text documents. Supports vLLM (local GPU) and NVIDIA NIM (API) backends. |

TXT2QA Quick Start

Running

vLLM (local GPU):

bash
python3 examples/llm/txt2qa.py \
  --config examples/llm/txt2qa_config/text_config_vllm.yaml

NVIDIA NIM (API):

bash
export NVIDIA_API_KEY="your-key-here"
python3 examples/llm/txt2qa.py \
  --config examples/llm/txt2qa_config/text_config_nim.yaml
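
The two invocations differ only in the config file and in the API key that the NIM backend expects. A minimal Python sketch of a launcher helper that captures this, assuming it is run from the repository root. `build_command` and the `CONFIGS` mapping are hypothetical names introduced for illustration and are not part of PyG; only the script and config paths come from the commands above.

```python
import os
import sys

# Config paths copied from the commands above; the mapping itself is illustrative.
CONFIGS = {
    "vllm": "examples/llm/txt2qa_config/text_config_vllm.yaml",
    "nim": "examples/llm/txt2qa_config/text_config_nim.yaml",
}


def build_command(backend, env=None):
    """Return the argv list for running txt2qa.py with the given backend."""
    env = os.environ if env is None else env
    if backend not in CONFIGS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(CONFIGS)}"
        )
    # The NIM commands above export NVIDIA_API_KEY first, so fail early without it.
    if backend == "nim" and not env.get("NVIDIA_API_KEY"):
        raise RuntimeError("NVIDIA_API_KEY must be set for the NIM backend")
    # sys.executable stands in for `python3` so the current interpreter is reused.
    return [sys.executable, "examples/llm/txt2qa.py", "--config", CONFIGS[backend]]
```

The returned list can be passed to `subprocess.run(build_command("nim"), check=True)`; building the argument list separately keeps the key check testable without launching a job.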

Output is written to {output_dir}/all_qa_pairs_batch_0.jsonl. Adjust the selected YAML file in txt2qa_config to choose the input directory, output directory, model backend, and generation settings.
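
Since the output is a JSON Lines file, it can be inspected with a few lines of Python. The sketch below is a generic reader, not code from the example itself. `load_jsonl` is a hypothetical helper name, and because the fields of each record are not listed here, it makes no assumption about them.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list, one Python object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a truncated write is easy to spot.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err
    return records
```

Calling `load_jsonl` on the `all_qa_pairs_batch_0.jsonl` file in your output directory and printing the length of the result, plus the keys of the first record, is a quick way to confirm that generation produced what you expect.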

Building Containers for TXT2QA

TXT2QA requires both PyG and vLLM. You can start from either base container:

Option A: Starting from NGC vLLM Container

bash
# Inside the vLLM container, install PyG (quoted so the shell does not expand the brackets):
pip install "torch_geometric[full,rag]"
# Or install from source using a local clone:
git clone https://github.com/pyg-team/pytorch_geometric.git
cd pytorch_geometric && pip install ".[full,rag]"

Option B: Starting from NGC PyG Container

bash
# Install vLLM pre-built wheel (CUDA 13.0 example):
pip install "https://github.com/vllm-project/vllm/releases/download/v0.15.0/vllm-0.15.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu130

For other CUDA versions or installation methods, see the vLLM installation docs.
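
Whichever option you choose, it can help to confirm that both packages ended up in the same environment. A minimal sketch using the standard library's `importlib.metadata`. The helper names are hypothetical, and the distribution names `torch-geometric` and `vllm` are assumed to match what pip installed.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report(distributions=("torch-geometric", "vllm")):
    """Map each distribution name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in distributions}
```

Printing `report()` inside the container shows at a glance whether either requirement is missing. This checks installation metadata only; it does not import the packages.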

Note: The vLLM wheel may pull in a flash-attn build that is incompatible with the existing environment. If you hit import errors related to flash-attention, run:

bash
pip uninstall flash-attn flash_attn -y

This resolves the issue: vLLM falls back to its built-in attention backends.
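
To see whether an import problem is present before or after the uninstall, you can attempt the import and capture the error instead of letting it crash a long job. A minimal sketch, where `try_import` is a hypothetical helper. It catches `Exception` broadly on the assumption that a binary mismatch may surface as something other than `ImportError`.

```python
import importlib


def try_import(module_name):
    """Import a module and return (True, None), or (False, error description)."""
    try:
        importlib.import_module(module_name)
    except Exception as err:  # binary mismatches may raise more than ImportError
        return False, f"{type(err).__name__}: {err}"
    return True, None
```

For example, `try_import("vllm")` returns a `(False, message)` pair when the import fails, and the message indicates whether flash-attention is involved.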