import Tabs from "@theme/Tabs"
import TabItem from "@theme/TabItem"
This guide introduces the core concepts of feedback and assessment in MLflow's evaluation framework. Understanding these concepts is essential for effectively measuring and improving the quality of your LLM applications and AI agents.
Feedback in MLflow represents the result of any quality assessment performed on your LLM application or AI agent outputs. It provides a standardized way to capture evaluations, whether they come from automated systems, LLM judges, or human reviewers.
Feedback serves as the bridge between running your application and understanding its quality, enabling you to systematically track performance across different dimensions like correctness, relevance, safety, and adherence to guidelines.
The Feedback object (also referred to as an Assessment in some contexts) is the fundamental building block of MLflow's evaluation system. It serves as a standardized container for the result of any quality check, providing a common language for assessment across different evaluation methods.
<Tabs>
<TabItem value="structure" label="Feedback Structure" default>

Every Feedback object contains three core components:

**Name**: A string identifying the specific quality aspect being assessed

Examples: `"correctness"`, `"relevance_to_query"`, `"is_safe"`, `"guideline_adherence_politeness"`

**Value**: The actual result of the assessment, which can be:

- Numeric scores (e.g., `0.0` to `1.0`, `1` to `5`)
- Boolean values (`True`/`False`)
- Categorical labels (e.g., `"PASS"`, `"FAIL"`, `"EXCELLENT"`)
- Structured data (e.g., `{"score": 0.8, "confidence": 0.9}`)

**Rationale**: A string explaining why the assessment resulted in the given value

This explanation is crucial for transparency, debugging, and understanding evaluation behavior, especially for LLM-based assessments.

</TabItem>
</Tabs>
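The three components map naturally onto a small record type. The sketch below is a plain-Python stand-in for illustration only: `FeedbackRecord` and `FeedbackValue` are hypothetical names, not MLflow's own classes, and the fields are assumed to mirror the name, value, and rationale described above.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Value kinds described above: numeric, boolean, categorical, or structured.
FeedbackValue = Union[float, int, bool, str, dict]


@dataclass(frozen=True)
class FeedbackRecord:
    """Hypothetical stand-in for a feedback object (not the MLflow class)."""

    name: str                        # quality aspect, e.g. "correctness"
    value: FeedbackValue             # result of the assessment
    rationale: Optional[str] = None  # why the value was assigned


examples = [
    FeedbackRecord("correctness", 0.85, "Three of four key facts are correct."),
    FeedbackRecord("is_safe", True, "No harmful content found."),
    FeedbackRecord("relevance_to_query", "PASS"),
    FeedbackRecord("quality", {"score": 0.8, "confidence": 0.9}),
]

for fb in examples:
    print(f"{fb.name}: {fb.value!r} ({type(fb.value).__name__})")
```

Making the rationale optional here is a design choice of the sketch; rule-based checks may not always produce an explanation, while LLM-based assessments usually should.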
**LLM-based Evaluations**: Automated assessments using language models as judges
- Fast and scalable
- Can evaluate complex, subjective criteria
- Provide detailed reasoning in rationale
**Programmatic Checks**: Rule-based or algorithmic evaluations
- Deterministic and consistent
- Fast execution
- Good for objective, measurable criteria
**Human Reviews**: Manual assessments from human evaluators
- Highest quality for subjective evaluations
- Slower and more expensive
- Essential for establishing ground truth
Feedback from all of these sources is treated equally in MLflow and can be combined to provide a comprehensive quality assessment.
**Execution + Assessment**: Each trace captures how your application processed a request, while feedback captures how well it performed
**Multi-dimensional Quality**: A single trace can have multiple feedback objects assessing different quality dimensions
**Historical Analysis**: Attached feedback enables tracking quality trends over time and across different application versions
**Debugging Context**: When quality issues arise, you can examine both the execution trace and the assessment rationale
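As a rough illustration of how several assessments can hang off one trace, the following sketch keeps feedback in an in-memory dictionary keyed by trace ID. The store, the `attach_feedback` helper, and the `"trace-001"` ID are all hypothetical; MLflow persists this association in its own backend.

```python
from collections import defaultdict

# Hypothetical in-memory store: trace ID -> list of feedback dicts.
feedback_by_trace = defaultdict(list)


def attach_feedback(trace_id, name, value, rationale=None):
    """Record one assessment against a trace (illustrative only)."""
    feedback_by_trace[trace_id].append(
        {"name": name, "value": value, "rationale": rationale}
    )


# One trace, several quality dimensions.
attach_feedback("trace-001", "factual_accuracy", 0.85, "One fact is wrong.")
attach_feedback("trace-001", "is_safe", True)
attach_feedback("trace-001", "helpfulness", 4, "Clear but lacks examples.")

dimensions = [fb["name"] for fb in feedback_by_trace["trace-001"]]
print(dimensions)
```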
Feedback can evaluate various aspects of your LLM application or AI agent's performance:
<Tabs>
<TabItem value="correctness" label="Correctness & Accuracy" default>

**Factual Accuracy**: Whether the generated content contains correct information
**Answer Completeness**: How thoroughly the response addresses the user's question
**Logical Consistency**: Whether the reasoning and conclusions are sound
Example feedback:
```json
{
  "name": "factual_accuracy",
  "value": 0.85,
  "rationale": "The response correctly identifies 3 out of 4 key facts about MLflow, but incorrectly states the founding year."
}
```

</TabItem>
</Tabs>
**Context Utilization**: Whether retrieved documents or provided context were used effectively
**Topic Adherence**: Staying on-topic and avoiding irrelevant tangents
Example feedback:
```json
{
  "name": "relevance_to_query",
  "value": "HIGH",
  "rationale": "Response directly answers the user's question about MLflow features and provides relevant examples."
}
```
**Guideline Adherence**: Following specific organizational or ethical guidelines
**Bias Detection**: Identifying unfair bias or discrimination in responses
Example feedback:
```json
{
  "name": "is_safe",
  "value": true,
  "rationale": "Content contains no harmful, toxic, or inappropriate material."
}
```
**Tone Appropriateness**: Whether the tone matches the intended context
**Helpfulness**: How useful the response is to the user
Example feedback:
```json
{
  "name": "helpfulness",
  "value": 4,
  "rationale": "Response provides clear, actionable information but could include more specific examples."
}
```
Feedback moves through your evaluation process as follows:
<Tabs>
<TabItem value="generation" label="Generation" default>

**During Application Execution**: Traces are created as your LLM application or AI agent processes requests
**Post-Execution Evaluation**: Feedback is generated by evaluating the trace data (inputs, outputs, intermediate steps)
**Multiple Evaluators**: Different evaluation methods can assess the same trace, creating multiple feedback objects
**Batch or Real-time**: Feedback can be generated immediately or in batch processes

</TabItem>
</Tabs>
**Persistent Storage**: Feedback is stored alongside trace data in MLflow's backend
**Metadata Preservation**: All context about the evaluation method and timing is maintained
**Version Tracking**: Changes to feedback or re-evaluations are tracked over time
**Trend Analysis**: Historical feedback enables tracking quality changes over time
**Comparative Analysis**: Compare feedback across different model versions, prompts, or configurations
**Reporting**: Generate quality reports and dashboards from aggregated feedback data
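Comparative analysis often comes down to grouping feedback values by some piece of metadata and aggregating them. The sketch below assumes each feedback row carries an `app_version` field, which is a hypothetical name chosen for this example, and computes the mean score per version with the standard library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical feedback rows; "app_version" is an assumed metadata field.
rows = [
    {"app_version": "v1", "name": "relevance", "value": 0.70},
    {"app_version": "v1", "name": "relevance", "value": 0.80},
    {"app_version": "v2", "name": "relevance", "value": 0.90},
    {"app_version": "v2", "name": "relevance", "value": 0.86},
]


def mean_score_by_version(rows, name):
    """Average the numeric feedback called `name` for each app version."""
    grouped = defaultdict(list)
    for row in rows:
        if row["name"] == name:
            grouped[row["app_version"]].append(row["value"])
    return {version: mean(values) for version, values in grouped.items()}


summary = mean_score_by_version(rows, "relevance")
for version, score in sorted(summary.items()):
    print(f"{version}: {score:.2f}")
```

The same grouping works for prompts or model configurations by swapping the key that rows are grouped on.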
MLflow supports different types of feedback to accommodate various evaluation needs:
**Numeric Scores**: Continuous values representing quality on a scale
`{"name": "relevance", "value": 0.87}`
**Boolean Values**: Binary assessments for pass/fail criteria
`true` or `false`
`{"name": "contains_pii", "value": false}`
**Labels**: Discrete categories representing quality levels
`{"name": "overall_quality", "value": "GOOD"}`
**Classification**: Specific category assignments
`{"name": "response_type", "value": "INFORMATIONAL"}`
**Complex Objects**: Rich data structures containing multiple assessment aspects
```json
{
  "name": "comprehensive_quality",
  "value": {
    "overall_score": 0.85,
    "accuracy": 0.9,
    "fluency": 0.8,
    "confidence": 0.75
  }
}
```
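When feedback values of different kinds end up in the same pipeline, it can help to classify them before aggregating. The helper below is a hypothetical sketch, not part of MLflow; it maps a Python value to one of the kinds listed above and treats both labels and classifications as `"categorical"`.

```python
def value_kind(value):
    """Classify a feedback value into one of the kinds described above."""
    # bool must be tested before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    if isinstance(value, dict):
        return "structured"
    raise TypeError(f"Unsupported feedback value: {value!r}")


samples = [0.87, False, "GOOD", {"overall_score": 0.85, "accuracy": 0.9}]
kinds = [value_kind(v) for v in samples]
print(kinds)
```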
Different approaches for generating feedback:
<Tabs>
<TabItem value="llm-judges" label="LLM Judges" default>

**Automated LLM Evaluation**: Using language models to assess quality
**Advantages**:
- Scale to large volumes of data
- Evaluate subjective criteria
- Provide detailed reasoning
- Consistent evaluation criteria

**Use Cases**:
- Content quality assessment
- Relevance evaluation
- Style and tone analysis
- Complex reasoning evaluation

**Example**: An LLM judge evaluating response helpfulness with detailed rationale explaining specific strengths and weaknesses.

</TabItem>
</Tabs>
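A judge of this kind can be sketched as a function that prompts a model and parses its reply into a feedback-shaped result. Everything here is an assumption made for illustration: the prompt wording, the JSON reply format, and the `call_llm` parameter, which is replaced by an offline stub (`stub_llm`) so the example runs without a model. It does not reflect MLflow's built-in judges.

```python
import json


def judge_helpfulness(question, answer, call_llm):
    """Ask a model to grade an answer; return a feedback-shaped dict."""
    prompt = (
        "Rate the helpfulness of the answer from 1 to 5 and explain why.\n"
        'Reply as JSON: {"score": <int>, "rationale": "<text>"}\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    reply = json.loads(call_llm(prompt))
    return {
        "name": "helpfulness",
        "value": int(reply["score"]),
        "rationale": reply["rationale"],
    }


def stub_llm(prompt):
    # Stand-in for a real model call so the example runs offline.
    return json.dumps({"score": 4, "rationale": "Clear, but lacks examples."})


feedback = judge_helpfulness("What is MLflow?", "An ML platform.", stub_llm)
print(feedback)
```

Passing the model call in as a parameter keeps the judge easy to test; a production judge would also need to handle malformed replies.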
**Programmatic Checks**: Rule-based or algorithmic evaluations
**Advantages**:
- Deterministic and consistent
- Fast execution
- Objective measurement
- Easy to understand and debug
**Use Cases**:
- Format validation
- Length constraints
- Keyword presence/absence
- Quantitative metrics
**Example**: Checking if a response contains required elements or meets length requirements.
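A check along these lines might look like the following sketch. The function name, the `format_check` feedback name, and the default length limit are hypothetical choices for the example; the check verifies keyword presence and a maximum length, then reports the outcome as a boolean value with a rationale.

```python
def check_response_format(response, required_keywords, max_chars=500):
    """Rule-based check: required keywords present and length within limit."""
    lowered = response.lower()
    missing = [kw for kw in required_keywords if kw.lower() not in lowered]
    too_long = len(response) > max_chars

    reasons = []
    if missing:
        reasons.append(f"missing keywords: {', '.join(missing)}")
    if too_long:
        reasons.append(f"length {len(response)} exceeds {max_chars} characters")

    passed = not reasons
    rationale = "All checks passed." if passed else "; ".join(reasons)
    return {"name": "format_check", "value": passed, "rationale": rationale}


ok = check_response_format("MLflow tracks experiments and traces.", ["mlflow", "traces"])
bad = check_response_format("Short answer.", ["mlflow"], max_chars=5)
print(ok["value"], "|", bad["rationale"])
```

Because the check is deterministic, the same response always yields the same feedback, which makes regressions easy to spot.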
**Human Reviews**: Manual assessments from human evaluators
**Advantages**:
- Highest quality for subjective criteria
- Nuanced understanding
- Ground truth establishment
- Complex context evaluation
**Use Cases**:
- Quality benchmarking
- Edge case evaluation
- Sensitive content review
- Final quality validation
**Example**: Human reviewers assessing response quality using standardized rubrics and providing detailed feedback.
Feedback integrates seamlessly with MLflow's ecosystem:
**Direct Association**: Feedback objects are linked to specific traces, providing context about what was evaluated
**Execution Context**: Access to complete application execution data when performing evaluations
**Multi-Step Evaluation**: Ability to evaluate individual spans within a trace or the overall trace result
**Scorer Functions**: Automated functions that generate feedback based on trace data
**Judge Functions**: LLM-based evaluators that provide intelligent assessment
**Custom Metrics**: Ability to define domain-specific evaluation criteria
**Quality Dashboards**: Visualize feedback trends and patterns over time
**Performance Tracking**: Monitor how changes to your application affect quality metrics
**Alerting**: Set up notifications when quality metrics fall below thresholds
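The threshold idea behind alerting can be sketched in a few lines. This is a minimal, hypothetical example rather than an MLflow feature: it averages recent scores per metric and returns the metrics whose mean falls below a configured minimum, leaving the actual notification mechanism out of scope.

```python
from statistics import mean


def metrics_below_threshold(scores, thresholds):
    """Return {metric: mean score} for metrics whose mean is under its minimum."""
    alerts = {}
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        # Skip metrics with no recent scores rather than alerting on them.
        if values and mean(values) < minimum:
            alerts[name] = mean(values)
    return alerts


# Hypothetical recent scores and minimum acceptable means.
scores = {"relevance": [0.9, 0.8], "factual_accuracy": [0.5, 0.7]}
thresholds = {"relevance": 0.8, "factual_accuracy": 0.75}

alerts = metrics_below_threshold(scores, thresholds)
for name, score in alerts.items():
    print(f"ALERT: {name} mean {score:.2f} is below {thresholds[name]}")
```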
**Clear Names**: Use descriptive, consistent names for feedback dimensions
**Appropriate Scales**: Choose value types and ranges that match your evaluation needs
**Meaningful Rationale**: Provide clear explanations that help with debugging and improvement
**Multiple Dimensions**: Assess various aspects of quality, not just a single metric
**Balanced Approach**: Combine automated and human evaluation methods
**Regular Review**: Periodically review and update evaluation criteria
**Baseline Establishment**: Set quality baselines for comparison
**Trend Monitoring**: Track quality changes over time and across versions
**Root Cause Analysis**: Use feedback and trace data together to understand quality issues
To begin using feedback in your evaluation workflow:
**LLM Evaluation Guide**: Learn how to evaluate your LLM applications and AI agents
**Custom Metrics**: Create domain-specific evaluation functions
**Trace Analysis**: Explore how to query and analyze trace data with feedback
**Quality Monitoring**: Set up ongoing quality assessment
Feedback concepts form the foundation for systematic quality assessment in MLflow. By understanding how feedback objects work and integrate with traces, you can build comprehensive evaluation strategies that improve your LLM applications and AI agents over time.