Cosine Strategy

The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.

How It Works

The Cosine Strategy:

Breaks down page content into meaningful chunks
Converts text into vector representations
Calculates similarity between chunks
Clusters similar content together
Ranks and filters content based on relevance

Basic Usage

python

from crawl4ai import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",    # Target content type
    word_count_threshold=10,             # Minimum words per cluster
    sim_threshold=0.3                    # Similarity threshold
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/reviews",
        extraction_strategy=strategy
    )
    
    content = result.extracted_content

Configuration Options

Core Parameters

python

CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,       # Keywords/topic for content filtering
    word_count_threshold: int = 10,    # Minimum words per cluster
    sim_threshold: float = 0.3,        # Similarity threshold (0.0 to 1.0)
    
    # Clustering Parameters
    max_dist: float = 0.2,            # Maximum distance for clustering
    linkage_method: str = 'ward',      # Clustering linkage method
    top_k: int = 3,                   # Number of top categories to extract
    
    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
    
    verbose: bool = False             # Enable logging
)

Parameter Details

1. semantic_filter

Sets the target topic or content type
Use keywords relevant to your desired content
Example: "technical specifications", "user reviews", "pricing information"

2. sim_threshold

Controls how similar content must be to be grouped together
Higher values (e.g., 0.8) mean stricter matching
Lower values (e.g., 0.3) allow more variation

python

# Strict matching
strategy = CosineStrategy(sim_threshold=0.8)

# Loose matching
strategy = CosineStrategy(sim_threshold=0.3)

3. word_count_threshold

Filters out short content blocks
Helps eliminate noise and irrelevant content

python

# Only consider substantial paragraphs
strategy = CosineStrategy(word_count_threshold=50)

4. top_k

Number of top content clusters to return
Higher values return more diverse content

python

# Get top 5 most relevant content clusters
strategy = CosineStrategy(top_k=5)

Use Cases

1. Article Content Extraction

python

strategy = CosineStrategy(
    semantic_filter="main article content",
    word_count_threshold=100,  # Longer blocks for articles
    top_k=1                   # Usually want single main content
)

result = await crawler.arun(
    url="https://example.com/blog/post",
    extraction_strategy=strategy
)

2. Product Review Analysis

python

strategy = CosineStrategy(
    semantic_filter="customer reviews and ratings",
    word_count_threshold=20,   # Reviews can be shorter
    top_k=10,                 # Get multiple reviews
    sim_threshold=0.4         # Allow variety in review content
)

3. Technical Documentation

python

strategy = CosineStrategy(
    semantic_filter="technical specifications documentation",
    word_count_threshold=30,
    sim_threshold=0.6,        # Stricter matching for technical content
    max_dist=0.3             # Allow related technical sections
)

Advanced Features

Custom Clustering

python

strategy = CosineStrategy(
    linkage_method='complete',  # Alternative clustering method
    max_dist=0.4,              # Larger clusters
    model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'  # Multilingual support
)

Content Filtering Pipeline

python

strategy = CosineStrategy(
    semantic_filter="pricing plans features",
    word_count_threshold=15,
    sim_threshold=0.5,
    top_k=3
)

async def extract_pricing_features(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=strategy
        )
        
        if result.success:
            content = json.loads(result.extracted_content)
            return {
                'pricing_features': content,
                'clusters': len(content),
                'similarity_scores': [item['score'] for item in content]
            }

Best Practices

1. Adjust Thresholds Iteratively

Start with default values
Adjust based on results
Monitor clustering quality

2. Choose Appropriate Word Count Thresholds

Higher for articles (100+)
Lower for reviews/comments (20+)
Medium for product descriptions (50+)

3. Optimize Performance

python

strategy = CosineStrategy(
    word_count_threshold=10,  # Filter early
    top_k=5,                 # Limit results
    verbose=True             # Monitor performance
)

4. Handle Different Content Types

python

# For mixed content pages
strategy = CosineStrategy(
    semantic_filter="product features",
    sim_threshold=0.4,      # More flexible matching
    max_dist=0.3,          # Larger clusters
    top_k=3                # Multiple relevant sections
)

Error Handling

python

try:
    result = await crawler.arun(
        url="https://example.com",
        extraction_strategy=strategy
    )
    
    if result.success:
        content = json.loads(result.extracted_content)
        if not content:
            print("No relevant content found")
    else:
        print(f"Extraction failed: {result.error_message}")
        
except Exception as e:
    print(f"Error during extraction: {str(e)}")

The Cosine Strategy is particularly effective when:

Content structure is inconsistent
You need semantic understanding
You want to find similar content blocks
Structure-based extraction (CSS/XPath) isn't reliable

It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.