🚀 Crawl4AI v0.3.72 Release Announcement

Welcome to Crawl4AI v0.3.72! This notebook highlights the latest features and shows each one running live. Follow along to see every feature in action!

What’s New?

  • ✨ Fit Markdown: Extracts only the main content from articles and blogs
  • 🛡️ Magic Mode: Comprehensive anti-bot detection bypass
  • 🌐 Multi-browser support: Switch between Chromium, Firefox, WebKit
  • 🔍 Knowledge Graph Extraction: Generate structured graphs of entities & relationships from any URL
  • 🤖 Crawl4AI GPT Assistant: Chat directly with our AI assistant for help, code generation, and faster learning (available here)

📥 Setup

To start, we'll install Crawl4AI along with Playwright and nest_asyncio to ensure compatibility with Colab’s asynchronous environment.

python
# Install Crawl4AI and dependencies
!pip install crawl4ai
!playwright install
!pip install nest_asyncio
python
# Import nest_asyncio and apply it to allow asyncio in Colab
import nest_asyncio
nest_asyncio.apply()

print('Setup complete!')
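
The demos below call `await` at the top level, which works in Colab and Jupyter because the notebook already runs an event loop. A plain Python script has no running loop, so the usual alternative is `asyncio.run`. The sketch below shows that pattern; `demo` is a hypothetical stand-in for any of the demo coroutines in this notebook, not a Crawl4AI function.

```python
import asyncio

async def demo() -> str:
    # Hypothetical stand-in for one of the crawl demos in this notebook
    await asyncio.sleep(0)
    return "done"

def run_demo() -> str:
    # asyncio.run() creates an event loop, runs the coroutine, and closes the loop
    return asyncio.run(demo())

if __name__ == "__main__":
    print(run_demo())
```

In a script, replace each top-level `await some_demo()` with `asyncio.run(some_demo())`.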

✨ Feature 1: Fit Markdown

Fit Markdown extracts only the main content from articles and blog pages, removing sidebars, ads, and other distractions.

python
import asyncio
from crawl4ai import AsyncWebCrawler

async def fit_markdown_demo():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://janineintheworld.com/places-to-visit-in-central-mexico")
        print(result.fit_markdown)  # Shows main content in Markdown format

# Run the demo
await fit_markdown_demo()
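
If the filtered output happens to be empty for a given page, it can be useful to fall back to the full Markdown. The helper below is a minimal sketch of that idea; `best_markdown` is a hypothetical name, and it assumes only that the result object exposes the `fit_markdown` and `markdown` attributes used in this notebook. Stand-in objects replace real crawl results so the example runs offline.

```python
from types import SimpleNamespace

def best_markdown(result) -> str:
    """Return fit_markdown when it has content, otherwise the full markdown."""
    fit = getattr(result, "fit_markdown", None)
    if fit and fit.strip():
        return fit
    return getattr(result, "markdown", None) or ""

# Stand-in objects instead of real crawl results (illustrative only)
with_fit = SimpleNamespace(fit_markdown="# Article", markdown="# Nav\n# Article")
without_fit = SimpleNamespace(fit_markdown="", markdown="# Nav\n# Article")

print(best_markdown(with_fit))
print(best_markdown(without_fit))
```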

🛡️ Feature 2: Magic Mode

Magic Mode bypasses anti-bot detection to make crawling more reliable on protected websites.

python
async def magic_mode_demo():
    async with AsyncWebCrawler() as crawler:  # Default crawler; the bypass is enabled per call below
        result = await crawler.arun(
            url="https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/",
            magic=True  # Enables magic mode
        )
        print(result.markdown)  # Shows the full content in Markdown format

# Run the demo
await magic_mode_demo()
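
One possible usage pattern is to try a plain crawl first and switch on Magic Mode only when that fails. The sketch below illustrates this under two assumptions: the result object has a boolean `success` attribute, and `arun` accepts `magic=True` as shown above. `crawl_with_fallback` and `FakeCrawler` are hypothetical names; the stub stands in for a real crawler so the example runs without a browser.

```python
import asyncio
from types import SimpleNamespace

async def crawl_with_fallback(crawler, url: str):
    """Try a plain crawl first; retry with magic=True only if it fails."""
    result = await crawler.arun(url=url)
    if getattr(result, "success", False):
        return result
    return await crawler.arun(url=url, magic=True)

class FakeCrawler:
    """Offline stub (not part of Crawl4AI): succeeds only when magic=True."""
    def __init__(self):
        self.calls = []

    async def arun(self, url: str, **kwargs):
        self.calls.append(kwargs)
        ok = bool(kwargs.get("magic"))
        return SimpleNamespace(success=ok, markdown="content" if ok else "")

def run_fallback_demo():
    crawler = FakeCrawler()
    result = asyncio.run(crawl_with_fallback(crawler, "https://example.com"))
    return result, crawler.calls
```

With a real crawler, the same helper would be awaited inside the `async with AsyncWebCrawler() as crawler:` block.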

🌐 Feature 3: Multi-Browser Support

Crawl4AI now supports Chromium, Firefox, and WebKit. Here’s how to specify Firefox for a crawl.

python
async def multi_browser_demo():
    async with AsyncWebCrawler(browser_type="firefox") as crawler:  # Using Firefox instead of default Chromium
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)  # Shows content extracted using Firefox

# Run the demo
await multi_browser_demo()
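
To see whether the three engines render a page differently, the same URL can be crawled once per browser. The following is a rough sketch, assuming the `browser_type` values `"chromium"`, `"firefox"`, and `"webkit"` match the browsers named above. `compare_browsers` is a hypothetical helper that takes the crawler class as an argument, and `FakeCrawler` is an offline stub so the example runs without installed browsers.

```python
import asyncio
from types import SimpleNamespace

BROWSERS = ("chromium", "firefox", "webkit")

async def compare_browsers(make_crawler, url: str) -> dict:
    """Crawl the same URL once per browser and record the Markdown length."""
    lengths = {}
    for browser in BROWSERS:
        async with make_crawler(browser_type=browser) as crawler:
            result = await crawler.arun(url=url)
            lengths[browser] = len(result.markdown or "")
    return lengths

class FakeCrawler:
    """Offline stub (not part of Crawl4AI) so the sketch runs without browsers."""
    def __init__(self, browser_type: str = "chromium"):
        self.browser_type = browser_type

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def arun(self, url: str, **kwargs):
        return SimpleNamespace(markdown=f"{self.browser_type}:{url}")

def run_compare_demo() -> dict:
    return asyncio.run(compare_browsers(FakeCrawler, "https://example.com"))
```

In the notebook, passing `AsyncWebCrawler` in place of `FakeCrawler` and awaiting `compare_browsers` directly would run the real comparison.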

✨ Put them all together

Let's combine all the features to extract the main content from a blog post, bypass anti-bot detection, and generate a knowledge graph from the extracted content.

python
from crawl4ai import LLMExtractionStrategy
from pydantic import BaseModel
import json
import os

# Define classes for the knowledge graph structure
class Landmark(BaseModel):
    name: str
    description: str
    activities: list[str]  # E.g., visiting, sightseeing, relaxing

class City(BaseModel):
    name: str
    description: str
    landmarks: list[Landmark]
    cultural_highlights: list[str]  # E.g., food, music, traditional crafts

class TravelKnowledgeGraph(BaseModel):
    cities: list[City]  # Central Mexican cities to visit

async def combined_demo():
    # Define the knowledge graph extraction strategy
    strategy = LLMExtractionStrategy(
        # provider="ollama/nemotron",
        provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
        api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
        schema=TravelKnowledgeGraph.schema(),
        instruction=(
            "Extract cities, landmarks, and cultural highlights for places to visit in Central Mexico. "
            "For each city, list main landmarks with descriptions and activities, as well as cultural highlights."
        )
    )

    # Set up the AsyncWebCrawler with multi-browser support, Magic Mode, and Fit Markdown
    async with AsyncWebCrawler(browser_type="firefox") as crawler:
        result = await crawler.arun(
            url="https://janineintheworld.com/places-to-visit-in-central-mexico",
            extraction_strategy=strategy,
            bypass_cache=True,
            magic=True
        )
        
        # Display main article content in Fit Markdown format
        print("Extracted Main Content:\n", result.fit_markdown)
        
        # Display extracted knowledge graph of cities, landmarks, and cultural highlights
        if result.extracted_content:
            travel_graph = json.loads(result.extracted_content)
            print("\nExtracted Knowledge Graph:\n", json.dumps(travel_graph, indent=2))

# Run the combined demo
await combined_demo()
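
Once the knowledge graph is available as JSON, it can be walked like any nested dictionary. The sketch below maps each city to its landmark names. `summarize_graph` is a hypothetical helper, the sample data is made up to match the `TravelKnowledgeGraph` shape, and the handling of a list of graphs is an assumption about how the extracted content may be structured (for example, one object per processed chunk).

```python
import json

def summarize_graph(extracted_content: str) -> dict:
    """Map each city name to the names of its landmarks."""
    data = json.loads(extracted_content)
    # Assumption: the output may be a single graph object or a list of them
    graphs = data if isinstance(data, list) else [data]
    summary = {}
    for graph in graphs:
        if not isinstance(graph, dict):
            continue  # Skip anything that is not a graph object
        for city in graph.get("cities", []):
            names = [lm["name"] for lm in city.get("landmarks", [])]
            summary.setdefault(city["name"], []).extend(names)
    return summary

# Made-up sample shaped like TravelKnowledgeGraph, not real crawl output
sample = json.dumps([
    {"cities": [{
        "name": "City A",
        "description": "placeholder",
        "landmarks": [
            {"name": "Landmark 1", "description": "placeholder", "activities": ["sightseeing"]},
            {"name": "Landmark 2", "description": "placeholder", "activities": ["relaxing"]},
        ],
        "cultural_highlights": ["food"],
    }]}
])

print(summarize_graph(sample))
```

In the combined demo, `result.extracted_content` would be passed in place of `sample`.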


🤖 Crawl4AI GPT Assistant

Chat with the Crawl4AI GPT Assistant for code generation, support, and learning Crawl4AI faster. Try it out here!