backend/docs/summarization.md
DeerFlow includes automatic conversation summarization to handle long conversations that approach model token limits. When enabled, the system automatically condenses older messages while preserving recent context.
The summarization feature uses LangChain's `SummarizationMiddleware` to monitor conversation history and trigger summarization based on configurable thresholds. When activated, it condenses older messages into a concise summary while keeping the most recent messages intact.
Summarization is configured in config.yaml under the summarization key:
```yaml
summarization:
  enabled: true
  model_name: null  # Use default model or specify a lightweight model

  # Trigger conditions (OR logic - any condition triggers summarization)
  trigger:
    - type: tokens
      value: 4000
    # Additional triggers (optional)
    # - type: messages
    #   value: 50
    # - type: fraction
    #   value: 0.8  # 80% of model's max input tokens

  # Context retention policy
  keep:
    type: messages
    value: 20

  # Token trimming for summarization call
  trim_tokens_to_summarize: 4000

  # Custom summary prompt (optional)
  summary_prompt: null
```
| Option | Default | Description |
| --- | --- | --- |
| `enabled` | `false` | Enables conversation summarization |
| `model_name` | `null` (uses default model) | Model used for summaries; a lightweight model such as `gpt-4o-mini` or equivalent is recommended |
| `trigger` | — | A `ContextSize` or list of `ContextSize` objects |

ContextSize Types:
Token-based trigger: Activates when token count reaches the specified value
```yaml
trigger:
  type: tokens
  value: 4000
```
Message-based trigger: Activates when message count reaches the specified value
```yaml
trigger:
  type: messages
  value: 50
```
Fraction-based trigger: Activates when token usage reaches a percentage of the model's maximum input tokens
```yaml
trigger:
  type: fraction
  value: 0.8  # 80% of max input tokens
```
Multiple Triggers:
```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```
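The OR semantics over multiple triggers can be sketched as a simple check: summarization fires as soon as any configured condition is met. The helper below is illustrative, not the middleware's real API:

```python
def should_summarize(token_count, message_count, max_input_tokens, triggers):
    """Return True if ANY configured trigger fires (OR logic)."""
    for t in triggers:
        if t["type"] == "tokens" and token_count >= t["value"]:
            return True
        if t["type"] == "messages" and message_count >= t["value"]:
            return True
        if t["type"] == "fraction" and token_count >= t["value"] * max_input_tokens:
            return True
    return False

triggers = [{"type": "tokens", "value": 4000}, {"type": "messages", "value": 50}]
should_summarize(3000, 10, 128_000, triggers)  # False: neither threshold reached
should_summarize(3000, 50, 128_000, triggers)  # True: message trigger fires
```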
`keep` is a `ContextSize` object (default: `{type: messages, value: 20}`) that controls how much recent context is retained after summarization.

Examples:
```yaml
# Keep most recent 20 messages
keep:
  type: messages
  value: 20
```

```yaml
# Keep most recent 3000 tokens
keep:
  type: tokens
  value: 3000
```

```yaml
# Keep most recent 30% of model's max input tokens
keep:
  type: fraction
  value: 0.3
```
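All three retention policies reduce to "find the suffix of the history to retain". A minimal sketch (the function name and argument shapes are illustrative assumptions):

```python
def messages_to_keep(messages, token_counts, keep, max_input_tokens):
    """Return the most recent messages satisfying the keep policy.

    messages:     list of messages, oldest first
    token_counts: per-message token counts, same order
    keep:         {"type": ..., "value": ...}
    """
    if keep["type"] == "messages":
        return messages[-int(keep["value"]):]
    # tokens / fraction: walk backwards until the token budget is spent
    budget = keep["value"] if keep["type"] == "tokens" else keep["value"] * max_input_tokens
    kept, total = [], 0
    for msg, n in zip(reversed(messages), reversed(token_counts)):
        if total + n > budget:
            break
        kept.append(msg)
        total += n
    return list(reversed(kept))

msgs = ["m1", "m2", "m3", "m4"]
messages_to_keep(msgs, [100, 100, 100, 100], {"type": "messages", "value": 2}, 128_000)
# keeps the last two messages
```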
`trim_tokens_to_summarize` (default: `4000`) caps the number of tokens sent to the summarization model. Set it to `null` to skip trimming (not recommended for very long conversations).

`summary_prompt` (default: `null`, which uses LangChain's default prompt) lets you supply a custom summarization prompt.

Default Prompt Behavior: The default LangChain prompt instructs the model to:

- Summarize the conversation history that falls before the retention boundary (the `keep` threshold)
- Leave intact the recent messages within the retention window (the `keep` threshold)

Token counts used for triggers and trimming are computed by the middleware's `token_counter` function.

The middleware intelligently preserves message context:
- Recent messages are retained according to the `keep` configuration
- The summarized portion of the history is replaced with a single message of the form:

```
Here is a summary of the conversation to date:

[Generated summary text]
```
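The reassembly step can be sketched as follows (an illustrative helper; the actual message construction is internal to LangChain's `SummarizationMiddleware`):

```python
def rebuild_history(summary_text, kept_messages):
    """Replace the summarized prefix with a single summary message."""
    summary_message = (
        "Here is a summary of the conversation to date:\n\n" + summary_text
    )
    return [summary_message, *kept_messages]

history = rebuild_history(
    "User asked about X; agent proposed Y.",
    ["recent question", "recent answer"],
)
len(history)  # 3: one summary message plus the retained messages
```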
- Token-based triggers: Recommended for most use cases
- Message-based triggers: Useful for controlling conversation length
- Fraction-based triggers: Ideal when using multiple models
Retention policy (`keep`):

- Message-based retention: Best for most scenarios
- Token-based retention: Use when precise control is needed
- Fraction-based retention: For multi-model setups
Model selection:

- Recommended: Use a lightweight, cost-effective model for summaries, such as `gpt-4o-mini`, `claude-haiku`, or equivalent
- Default: If `model_name` is `null`, the default model is used
Balance triggers: Combine token and message triggers for robust handling
```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```
Conservative retention: Keep more messages initially, adjust based on performance
```yaml
keep:
  type: messages
  value: 25  # Start higher, reduce if needed
```
Trim strategically: Limit tokens sent to summarization model
```yaml
trim_tokens_to_summarize: 4000  # Prevents expensive summarization calls
```
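Trimming amounts to capping the summarizer's input at the most recent `trim_tokens_to_summarize` tokens. A minimal sketch (the helper name is an assumption, not the middleware's internal function):

```python
def trim_for_summary(messages, token_counts, trim_tokens):
    """Drop the oldest messages until the total fits under trim_tokens.

    With trim_tokens=None, the full history is sent to the summarizer.
    """
    if trim_tokens is None:
        return messages
    total = sum(token_counts)
    start = 0
    while start < len(messages) and total > trim_tokens:
        total -= token_counts[start]
        start += 1
    return messages[start:]

trim_for_summary(["a", "b", "c"], [3000, 2000, 1000], 4000)  # drops "a"
```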
Monitor and iterate: Track summary quality and adjust configuration
Problem: Summaries losing important context
Solutions:
- Increase the `keep` value to preserve more messages
- Customize `summary_prompt` to emphasize key information

Problem: Summarization calls taking too long
Solutions:
- Use a lightweight summarization model (e.g. `gpt-4o-mini`)
- Reduce `trim_tokens_to_summarize` to send less context

Problem: Still hitting token limits despite summarization
Solutions:
- Reduce the `keep` value to preserve fewer messages

Implementation details:

- Configuration: `packages/harness/deerflow/config/summarization_config.py`
- Agent integration: `packages/harness/deerflow/agents/lead_agent/agent.py`
- Middleware: `langchain.agents.middleware.SummarizationMiddleware`

Middleware ordering: Summarization runs after ThreadData and Sandbox initialization but before Title and Clarification.

Example configurations:
```yaml
summarization:
  enabled: true
  trigger:
    type: tokens
    value: 4000
  keep:
    type: messages
    value: 20
```
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini  # Lightweight model for cost efficiency
  trigger:
    - type: tokens
      value: 6000
    - type: messages
      value: 75
  keep:
    type: messages
    value: 25
  trim_tokens_to_summarize: 5000
```
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini
  trigger:
    type: fraction
    value: 0.7  # 70% of model's max input
  keep:
    type: fraction
    value: 0.3  # Keep 30% of max input
  trim_tokens_to_summarize: 4000
```
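Fraction-based values only become concrete numbers once the model's context window is known. A quick check, assuming a hypothetical 128,000-token max input (substitute your model's actual limit):

```python
# Hypothetical context window for illustration only.
max_input_tokens = 128_000

# fraction triggers/keeps are relative to max_input_tokens
trigger_at = round(0.7 * max_input_tokens)   # summarization fires at this many tokens
keep_budget = round(0.3 * max_input_tokens)  # most recent tokens retained afterwards

print(trigger_at, keep_budget)  # 89600 38400
```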
```yaml
summarization:
  enabled: true
  model_name: gpt-4  # Use full model for high-quality summaries
  trigger:
    type: tokens
    value: 8000
  keep:
    type: messages
    value: 40  # Keep more context
  trim_tokens_to_summarize: null  # No trimming
```