docs/components/rerankers/models/sentence_transformer.mdx
The Sentence Transformer reranker provides local reranking using HuggingFace cross-encoder models, making it well suited for privacy-focused deployments where data must stay on-premises.
Any HuggingFace cross-encoder model can be used. Popular choices include:
- `cross-encoder/ms-marco-MiniLM-L-6-v2`: Default, good balance of speed and accuracy
- `cross-encoder/ms-marco-TinyBERT-L-2-v2`: Fastest, smaller model size
- `cross-encoder/ms-marco-electra-base`: Higher accuracy, larger model
- `cross-encoder/stsb-distilroberta-base`: Good for semantic similarity tasks

Install the dependency first:

```bash
pip install sentence-transformers
```
```python
from mem0 import Memory

config = {
    "vector_store": {
        "provider": "chroma",
        "config": {
            "collection_name": "my_memories",
            "path": "./chroma_db"
        }
    },
    "llm": {
        "provider": "openai",
        "config": {
            "model": "gpt-4o-mini"
        }
    },
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cpu",  # or "cuda" for GPU
            "batch_size": 32,
            "show_progress_bar": False,
            "top_k": 5
        }
    }
}

memory = Memory.from_config(config)
```
For better performance, use GPU acceleration:
```python
config = {
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cuda",  # use GPU
            "batch_size": 64   # larger batches suit GPUs with more memory
        }
    }
}
```
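If the same config needs to run on machines with and without a GPU, you can pick the device at runtime instead of hardcoding it. A minimal sketch (`pick_device` is a hypothetical helper, not part of mem0; it assumes `torch`, which `sentence-transformers` installs, and falls back to CPU when torch or CUDA is absent):

```python
import importlib.util


def pick_device():
    """Return "cuda" when torch reports an available GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


config = {
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": pick_device(),
        }
    }
}
```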
```python
from mem0 import Memory

# Initialize memory with local reranker
config = {
    "vector_store": {"provider": "chroma"},
    "llm": {"provider": "openai", "config": {"model": "gpt-4o-mini"}},
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cpu"
        }
    }
}
memory = Memory.from_config(config)

# Add memories
messages = [
    {"role": "user", "content": "I love reading science fiction novels"},
    {"role": "user", "content": "My favorite author is Isaac Asimov"},
    {"role": "user", "content": "I also enjoy watching sci-fi movies"}
]
memory.add(messages, user_id="charlie")

# Search with local reranking
results = memory.search("What books does the user like?", filters={"user_id": "charlie"})

for result in results["results"]:
    print(f"Memory: {result['memory']}")
    print(f"Vector Score: {result['score']:.3f}")
    print(f"Rerank Score: {result['rerank_score']:.3f}")
    print()
```
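Conceptually, the reranker re-orders the vector-search hits by their cross-encoder score before returning them. A toy sketch of that post-processing step (`rerank_results` is a hypothetical helper for illustration, not mem0's internal API):

```python
def rerank_results(results, top_k=None):
    """Sort hits by cross-encoder score (descending), optionally keep top_k."""
    ranked = sorted(results, key=lambda r: r["rerank_score"], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Note how the rerank score can disagree with the raw vector score:
hits = [
    {"memory": "I also enjoy watching sci-fi movies", "score": 0.74, "rerank_score": 0.55},
    {"memory": "I love reading science fiction novels", "score": 0.71, "rerank_score": 0.94},
]
top = rerank_results(hits, top_k=1)
# → the novels memory wins despite its lower vector score
```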
You can use any HuggingFace cross-encoder model:
```python
# Using a different model
config = {
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/stsb-distilroberta-base",
            "device": "cpu"
        }
    }
}
```
| Parameter | Description | Type | Default |
|---|---|---|---|
| `model` | HuggingFace cross-encoder model name | `str` | `"cross-encoder/ms-marco-MiniLM-L-6-v2"` |
| `device` | Device to run the model on (`"cpu"`, `"cuda"`, etc.) | `str` | `None` |
| `batch_size` | Batch size for scoring documents | `int` | `32` |
| `show_progress_bar` | Show a progress bar during scoring | `bool` | `False` |
| `top_k` | Maximum number of documents to return | `int` | `None` |
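`batch_size` determines how many (query, document) pairs are scored per forward pass: larger batches improve throughput (especially on GPU) at the cost of memory. A toy sketch of the chunking behaviour (illustrative only, not the library's internals):

```python
def batches(items, batch_size=32):
    """Yield successive chunks of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# 70 candidate documents with batch_size=32 → 3 forward passes (32 + 32 + 6)
chunks = list(batches(list(range(70)), batch_size=32))
```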