Evaluate your LLM Application

<div
  style={{
    position: 'relative',
    paddingBottom: '56.25%', // 16:9 aspect ratio
    height: 0,
    overflow: 'hidden',
    maxWidth: '100%',
    marginBottom: '20px'
  }}
>
  <iframe
    src="https://www.loom.com/embed/fdcb38aca1dc4566b7bee20f7a22ded4?sid=de77c73f-9da3-4d90-bcbd-36440b8bd38f"
    frameborder="0"
    webkitallowfullscreen
    mozallowfullscreen
    allowfullscreen
    style={{
      position: 'absolute',
      top: 0,
      left: 0,
      width: '100%',
      height: '100%',
    }}
  />
</div>

Bringing It All Together: Complete LLM Evaluation

This comprehensive video demonstrates the complete evaluation workflow in Opik, where datasets and metrics come together to systematically assess LLM performance. You'll see a practical comparison between GPT-4 and Gemini models on a RAG application, learn about prompt versioning and experiment management, and discover how to make data-driven decisions for production deployment. This is where all the previous concepts unite into actionable insights.

Key Highlights

  • End-to-End Evaluation Workflow: Run complete evaluations that process datasets, apply models, and score outputs with your chosen metrics in a single systematic pipeline (see the first sketch after this list)
  • Prompt Management & Versioning: Use Opik's Prompt class to create versioned prompts with commit history, ensuring reproducibility and saving time and money (see the second sketch after this list)
  • Multi-Model Benchmarking: Compare different models (GPT-4 vs Gemini) side-by-side using evaluation tasks and systematic scoring across identical datasets
  • Smart Experiment Organization: Name experiments strategically (e.g., by model name) for easy identification and comparison, rather than relying on randomly generated names
  • Live Experiment Monitoring: Track evaluation progress in real-time through the Opik UI, viewing dataset processing and results as they're generated
  • Side-by-Side Comparison: Use the compare feature to evaluate multiple experiments simultaneously, making model selection decisions based on quantitative metrics
  • Template Generation: Leverage the "Create New Experiment" button to automatically generate evaluation scripts with selected metrics for reuse in Python. Each metric in the modal includes a documentation link for quick reference
  • Trace-Level Inspection: Dive deep into individual responses by opening traces from experiment results to understand model behavior and decision paths
  • Data-Driven Production Decisions: Choose the best-performing prompts and models based on concrete metrics rather than subjective assessment, building confidence for deployment
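
As a rough illustration of the end-to-end workflow, benchmarking, and experiment naming described above, the sketch below uses the Opik Python SDK's `evaluate()` entry point with built-in metrics. The dataset name, the `query_rag_app` helper, and the metric choices are illustrative placeholders under the assumption that your dataset items contain a `question` field; this is not the exact code from the video.

```python
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination


def query_rag_app(question: str, model: str) -> tuple[str, str]:
    """Placeholder for your own RAG pipeline; swap in real retrieval and LLM calls."""
    context = "retrieved context goes here"
    answer = f"[{model}] answer grounded in the retrieved context"
    return answer, context


def make_task(model_name: str):
    """Build an evaluation task that answers each dataset item with the given model."""
    def task(item: dict) -> dict:
        answer, context = query_rag_app(item["question"], model=model_name)
        # The keys returned here are what the scoring metrics read (input/output/context).
        return {"input": item["question"], "output": answer, "context": [context]}
    return task


client = Opik()
# Assumes the dataset already contains items with a "question" field.
dataset = client.get_or_create_dataset(name="rag-eval-dataset")  # illustrative name

# Benchmark both models on the identical dataset; naming each experiment after the
# model makes the runs easy to find and compare in the Opik UI.
for model in ["gpt-4", "gemini-1.5-pro"]:
    evaluate(
        dataset=dataset,
        task=make_task(model),
        scoring_metrics=[AnswerRelevance(), Hallucination()],
        experiment_name=model,
    )
```

Running the loop produces one experiment per model against the same dataset, which you can then open side by side with the compare feature mentioned above.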
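
For the prompt versioning point, here is a minimal sketch assuming the SDK's `Prompt(name=..., prompt=...)` constructor and `format()` helper; the prompt name and template text are made up for illustration. Re-registering the same name with changed text is what records a new commit in the prompt library.

```python
import opik

# Creating a Prompt with the same name but different text records a new version in
# Opik's prompt library, so experiments can reference the exact commit they used.
prompt = opik.Prompt(
    name="rag-answer-prompt",  # illustrative name
    prompt=(
        "Answer the question using only the provided context.\n\n"
        "Context: {{context}}\n"
        "Question: {{question}}"
    ),
)

print(prompt.commit)  # short commit id identifying this prompt version
rendered = prompt.format(
    context="Opik stores datasets, experiments, and traces.",
    question="Where are evaluation results stored?",
)
```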