examples/llm/README.md
| Example | Description |
|---|---|
| g_retriever.py | Example helper functions for using the G-Retriever GNN+LLM module in PyG. Includes an example repo for Neo4j integration with an associated blog post demonstrating 2x accuracy gains over LLMs on real medical data. For a complete end-to-end pipeline (KG Creation, Subgraph Retrieval, GNN+LLM Finetuning, Testing, LLM Judge Eval), see txt2kg_rag.py. For a native PyG implementation without external graph databases, see gretriever-stark-prime. |
| molecule_gpt.py | Example for MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction. Supports the MoleculeGPT and InstructMol datasets. |
| glem.py | Example for GLEM, a GNN+LLM co-training model using a variational Expectation-Maximization (EM) framework, achieving SOTA results on node classification tasks. |
| git_mol.py | Example for GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text. |
| protein_mpnn.py | Example for Robust deep learning-based protein sequence design using ProteinMPNN. |
| txt2kg_rag.py | Full end-to-end RAG pipeline using TXT2KG and Vector and Graph RAG with a GNN to achieve state-of-the-art results. Uses the TechQA dataset but can be extended to handle any RAG dataset with a corpus of documents and an associated set of Q+A pairs to be split for train/eval/test. See the Stanford GNN+LLM Talk for more details. Note that TechQA requires only a single document to answer each question, so it can be viewed as a toy example. To see significant accuracy boosts from GNN+LLM TXT2KG-based RAG, use data that requires multiple text chunks to answer a question; in cases where a single document suffices, basic RAG should be adequate. |