
Create Evaluation Datasets

<div
  style={{
    position: 'relative',
    paddingBottom: '56.25%', // 16:9 aspect ratio
    height: 0,
    overflow: 'hidden',
    maxWidth: '100%',
    marginBottom: '20px'
  }}
>
  <iframe
    src="https://www.loom.com/embed/c84b219b7c454476bb276abaaf4df777?sid=023a4650-a0cc-46c0-9613-111fc26b020a"
    frameborder="0"
    webkitallowfullscreen
    mozallowfullscreen
    allowfullscreen
    style={{
      position: 'absolute',
      top: 0,
      left: 0,
      width: '100%',
      height: '100%',
    }}
  />
</div>

Building Datasets for RAG Evaluation

This hands-on video demonstrates dataset creation using a practical RAG (Retrieval-Augmented Generation) example that compares OpenAI and Google Gemini models. You'll learn how evaluation datasets serve as the foundation for systematic LLM testing: they are collections of example inputs your application will encounter, paired with expected outputs, much like validation sets in traditional machine learning.
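As a minimal sketch of the programmatic route, the Opik Python client exposes `get_or_create_dataset()` and `insert()`. The dataset name and item fields below are illustrative placeholders, not values from the video; adapt the item structure to whatever inputs and expected outputs your application needs.

```python
# Sketch of programmatic dataset creation with the Opik Python client.
# Dataset name and item fields are illustrative placeholders.

def build_dataset(name: str, items: list[dict]):
    # Imported lazily so the rest of the sketch can run without opik installed.
    from opik import Opik

    client = Opik()  # picks up API key / workspace from the environment
    # Returns the existing dataset if one with this name already exists,
    # so re-running the script is safe.
    dataset = client.get_or_create_dataset(name=name)
    # Opik deduplicates on insert, so repeated items are not stored twice.
    dataset.insert(items)
    return dataset

# Example items; fields are free-form, so use whatever structure your
# evaluation needs (question/context/answer, etc.).
items = [
    {
        "input": "What does get_or_create_dataset do?",
        "expected_output": "Returns an existing dataset by name or creates it.",
    },
    {
        "input": "How are duplicate items handled?",
        "expected_output": "Duplicate items are skipped automatically.",
    },
]

# Usage (requires a configured Opik deployment or API key):
#   build_dataset("rag-eval-questions", items)
```

Because `get_or_create_dataset()` is idempotent and inserts are deduplicated, this script can live in your repo and be re-run whenever you add new items.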

Key Highlights

  • Practical RAG Setup: Complete example showing OpenAI vs Google Gemini model comparison with vector store integration using LangChain and Chroma
  • Dual Creation Methods: Create datasets via UI (select traces and "add to dataset") or programmatically using the Opik client with get_or_create_dataset()
  • Flexible Data Structure: Define custom fields in dataset items based on your specific use case - make inputs and outputs as verbose as your application needs
  • Automatic Deduplication: Opik automatically prevents duplicate entries in datasets, ensuring data quality and consistency
  • Multiple Dataset Strategy: Create focused datasets for different aspects, such as common questions, edge cases, failure modes, and specific capabilities like reasoning or summarization
  • Trace-to-Dataset Conversion: Leverage existing traces by filtering high-performing interactions and converting them directly into evaluation datasets
  • Validation Set Approach: Datasets function like traditional ML validation sets, providing representative examples for systematic performance assessment
  • Scalable Architecture: Use class-based setup to handle different model providers with consistent interfaces while maintaining traceability with @track decorators
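The class-based, multi-provider setup from the last highlight can be sketched as below. The class names and echo-style answers are placeholders (a real implementation would call the OpenAI and Gemini SDKs plus a Chroma retriever); the `ImportError` fallback simply lets the sketch run where opik is not installed.

```python
# Sketch of a provider-agnostic interface with Opik tracing.
# Class names and the placeholder answers are hypothetical.
from abc import ABC, abstractmethod

try:
    from opik import track  # traces each decorated call when opik is configured
except ImportError:
    def track(fn):  # no-op fallback so the sketch runs without opik installed
        return fn

class RAGModel(ABC):
    """Shared interface so evaluation code treats all providers the same way."""

    @abstractmethod
    def answer(self, question: str, context: str) -> str:
        ...

class OpenAIRAG(RAGModel):
    @track
    def answer(self, question: str, context: str) -> str:
        # Placeholder: call the OpenAI chat API with the retrieved context here.
        return f"[openai] {question}"

class GeminiRAG(RAGModel):
    @track
    def answer(self, question: str, context: str) -> str:
        # Placeholder: call the Gemini API with the retrieved context here.
        return f"[gemini] {question}"

# Both backends can now be evaluated against the same dataset:
models = [OpenAIRAG(), GeminiRAG()]
```

Keeping the `@track` decorator on each `answer` method means every evaluation run also produces traces, which you can later filter and convert back into new dataset items.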