docs/releases_review/Crawl4AI_v0.3.72_Release_Announcement.ipynb
Welcome to the new release of Crawl4AI v0.3.72! This notebook highlights the latest features and demonstrates how they work in real-time. Follow along to see each feature in action!
To start, we'll install Crawl4AI along with Playwright and nest_asyncio to ensure compatibility with Colab's asynchronous environment.
# Install Crawl4AI and dependencies
!pip install crawl4ai
!playwright install
!pip install nest_asyncio
# Import nest_asyncio and apply it to allow asyncio in Colab
import nest_asyncio
nest_asyncio.apply()
print('Setup complete!')
Fit Markdown extracts only the main content from articles and blog pages, removing sidebars, ads, and other distractions.
import asyncio
from crawl4ai import AsyncWebCrawler

async def fit_markdown_demo():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://janineintheworld.com/places-to-visit-in-central-mexico")
        print(result.fit_markdown)  # Shows only the main article content in Markdown format

# Run the demo
await fit_markdown_demo()
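A note on the bare `await` above: it works in notebook cells because an event loop is already running (with nest_asyncio applied). In a plain Python script there is no running loop, so you would wrap the coroutine in `asyncio.run` instead. A minimal stdlib-only sketch, with a placeholder coroutine standing in for the crawl:

```python
import asyncio

async def fetch_demo():
    # Placeholder coroutine standing in for a crawler.arun(...) call
    await asyncio.sleep(0)
    return "# Main article content"

# Outside a notebook, use asyncio.run instead of a bare `await`
markdown = asyncio.run(fetch_demo())
print(markdown)
```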
Magic Mode bypasses anti-bot detection to make crawling more reliable on protected websites.
async def magic_mode_demo():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/",
            magic=True  # Enables anti-bot detection bypass
        )
        print(result.markdown)  # Shows the full content in Markdown format

# Run the demo
await magic_mode_demo()
Crawl4AI now supports Chromium, Firefox, and WebKit. Here's how to specify Firefox for a crawl.
async def multi_browser_demo():
    async with AsyncWebCrawler(browser_type="firefox") as crawler:  # Use Firefox instead of the default Chromium
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)  # Shows content extracted using Firefox

# Run the demo
await multi_browser_demo()
Let's combine all the features to extract the main content from a blog post, bypass anti-bot detection, and generate a knowledge graph from the extracted content.
from crawl4ai import LLMExtractionStrategy
from pydantic import BaseModel
import json, os

# Define classes for the knowledge graph structure
class Landmark(BaseModel):
    name: str
    description: str
    activities: list[str]  # E.g., visiting, sightseeing, relaxing

class City(BaseModel):
    name: str
    description: str
    landmarks: list[Landmark]
    cultural_highlights: list[str]  # E.g., food, music, traditional crafts

class TravelKnowledgeGraph(BaseModel):
    cities: list[City]  # Central Mexican cities to visit
async def combined_demo():
    # Define the knowledge graph extraction strategy
    strategy = LLMExtractionStrategy(
        # provider="ollama/nemotron",
        provider="openai/gpt-4o-mini",  # Or any other provider, including Ollama and open-source models
        api_token=os.getenv("OPENAI_API_KEY"),  # For Ollama, just pass "no-token"
        schema=TravelKnowledgeGraph.schema(),
        instruction=(
            "Extract cities, landmarks, and cultural highlights for places to visit in Central Mexico. "
            "For each city, list main landmarks with descriptions and activities, as well as cultural highlights."
        )
    )
    # Set up the AsyncWebCrawler with multi-browser support, Magic Mode, and Fit Markdown
    async with AsyncWebCrawler(browser_type="firefox") as crawler:
        result = await crawler.arun(
            url="https://janineintheworld.com/places-to-visit-in-central-mexico",
            extraction_strategy=strategy,
            bypass_cache=True,
            magic=True
        )
        # Display main article content in Fit Markdown format
        print("Extracted Main Content:\n", result.fit_markdown)
        # Display extracted knowledge graph of cities, landmarks, and cultural highlights
        if result.extracted_content:
            travel_graph = json.loads(result.extracted_content)
            print("\nExtracted Knowledge Graph:\n", json.dumps(travel_graph, indent=2))

# Run the combined demo
await combined_demo()
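The `extracted_content` field arrives as a JSON string shaped by the schema defined above. Here's a stdlib-only sketch of walking that structure, using a small hypothetical payload in place of a real LLM response:

```python
import json

# Hypothetical payload shaped like TravelKnowledgeGraph (illustration only)
extracted_content = json.dumps({
    "cities": [
        {
            "name": "Guanajuato",
            "description": "Colorful colonial hill city.",
            "landmarks": [
                {
                    "name": "Callejon del Beso",
                    "description": "A famously narrow alley.",
                    "activities": ["sightseeing", "photography"]
                }
            ],
            "cultural_highlights": ["Cervantino festival", "street music"]
        }
    ]
})

travel_graph = json.loads(extracted_content)
for city in travel_graph["cities"]:
    print(f"{city['name']}: {len(city['landmarks'])} landmark(s)")
    for landmark in city["landmarks"]:
        print(f"  - {landmark['name']}: {', '.join(landmark['activities'])}")
```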
Chat with the Crawl4AI GPT Assistant for code generation, support, and a faster way to learn Crawl4AI. Try it out here!