# Crawl4AI CLI Guide
The Crawl4AI CLI (`crwl`) is installed automatically with the library and provides a simple interface to Crawl4AI:

```bash
# Basic crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See usage examples
crwl --example
```
If you clone the repository, the following command returns the page content as JSON, extracted according to a JSON-CSS schema:

```bash
crwl "https://www.infoq.com/ai-ml-data-eng/" -e docs/examples/cli/extract_css.yml -s docs/examples/cli/css_schema.json -o json
```
## Browser Configuration

Browser settings can be configured via a YAML file or command-line parameters:

```yaml
# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true
```

```bash
# Using a config file
crwl https://example.com -B browser.yml

# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"
```
## Crawler Configuration

Control crawling behavior:

```yaml
# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true
```

```bash
# Using a config file
crwl https://example.com -C crawler.yml

# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"
```
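The `-b` and `-c` shorthand packs settings into a comma-separated `key=value` string. To make that mini-syntax concrete, here is a minimal sketch of parsing such a string into typed values (a hypothetical helper for illustration, not the CLI's actual parser):

```python
def parse_params(raw: str) -> dict:
    """Parse a "key=value,key=value" string, coercing booleans and numbers.

    A sketch of the -b/-c shorthand; the real CLI may coerce differently.
    """
    out = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            out[key] = value.lower() == "true"
        else:
            try:
                out[key] = int(value)
            except ValueError:
                try:
                    out[key] = float(value)
                except ValueError:
                    out[key] = value  # leave as string (e.g. a CSS selector)
    return out
```

For example, `parse_params("scan_full_page=true,page_timeout=30000")` yields a dict with a real boolean and integer rather than strings.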
## Extraction Configuration

Two types of extraction are supported: CSS-based and LLM-based.

CSS-based extraction:

```yaml
# extract_css.yml
type: "json-css"
params:
  verbose: true
```

```json
// css_schema.json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "a.read-more",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
```
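To see what the schema means in practice, here is a toy interpreter for this kind of JSON-CSS schema: `baseSelector` picks the repeating elements, and each field reads text or an attribute from a matching child. It uses stdlib `ElementTree` with a crude selector translation and only works on well-formed markup; Crawl4AI's real extractor uses a full CSS engine, so this is purely illustrative:

```python
import xml.etree.ElementTree as ET

def css_to_xpath(sel: str) -> str:
    # Toy conversion: handles ".class" and "tag.class" only.
    if sel.startswith("."):
        return f".//*[@class='{sel[1:]}']"
    if "." in sel:
        tag, cls = sel.split(".", 1)
        return f".//{tag}[@class='{cls}']"
    return f".//{sel}"

def extract(html: str, schema: dict) -> list:
    """Apply a JSON-CSS schema to well-formed HTML (sketch only)."""
    root = ET.fromstring(html)
    results = []
    for base in root.findall(css_to_xpath(schema["baseSelector"])):
        item = {}
        for field in schema["fields"]:
            el = base.find(css_to_xpath(field["selector"]))
            if el is None:
                continue
            if field["type"] == "text":
                item[field["name"]] = (el.text or "").strip()
            elif field["type"] == "attribute":
                item[field["name"]] = el.get(field["attribute"])
        results.append(item)
    return results
```

Running it over a page with one `.article` element returns a list with one dict containing `title` and `link`, mirroring what the CLI emits with `-o json`.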
LLM-based extraction:

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4"
instruction: "Extract all articles with their titles and links"
api_token: "your-token"
params:
  temperature: 0.3
  max_tokens: 1000
```

```json
// llm_schema.json
{
  "title": "Article",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The title of the article"
    },
    "link": {
      "type": "string",
      "description": "URL to the full article"
    }
  }
}
```
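The JSON Schema above describes the shape the LLM is asked to return. A minimal, hand-rolled check that an extracted item matches such an object-of-strings schema can be useful when post-processing results (a sketch; a real validator such as the `jsonschema` package covers far more of the specification):

```python
def validate(instance, schema) -> bool:
    """Check an item against a simple object schema with string properties.

    Only handles {"type": "object"} with {"type": "string"} properties;
    a deliberate simplification for illustration.
    """
    if schema.get("type") != "object":
        return False
    if not isinstance(instance, dict):
        return False
    for name, sub in schema.get("properties", {}).items():
        if name in instance and sub.get("type") == "string" and not isinstance(instance[name], str):
            return False
    return True
```

For instance, `{"title": "A", "link": "https://…"}` passes, while `{"title": 42}` fails the string check.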
## Q&A

Ask questions about crawled content:

```bash
# Simple question
crwl https://example.com -q "What is the main topic discussed?"

# View content, then ask questions
crwl https://example.com -o markdown  # See content first
crwl https://example.com -q "Summarize the key points"
crwl https://example.com -q "What are the conclusions?"

# Combined with advanced crawling
crwl https://example.com \
    -B browser.yml \
    -c "css_selector=article,scan_full_page=true" \
    -q "What are the pros and cons mentioned?"
```
First-time setup: the first time you use `-q`, you will be prompted for an LLM provider and API token, and your choices are saved in `~/.crawl4ai/global.yml`. For local providers such as Ollama you do not need to provide an API token.

## Structured Data Extraction

Extract structured data using CSS selectors:
```bash
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json
```
Or using LLM-based extraction:

```bash
crwl https://example.com \
    -e extract_llm.yml \
    -s llm_schema.json \
    -o json
```
## Content Filtering

Filter content for relevance:

```yaml
# filter_bm25.yml
type: "bm25"
query: "target content"
threshold: 1.0
```

```yaml
# filter_pruning.yml
type: "pruning"
query: "focus topic"
threshold: 0.48
```

```bash
crwl https://example.com -f filter_bm25.yml -o markdown-fit
```
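The `bm25` filter ranks content chunks against the query and keeps those scoring above `threshold`. To illustrate the idea, here is a self-contained sketch of BM25 scoring (Crawl4AI's internal implementation may differ in tokenization and parameter defaults):

```python
import math

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with Okapi BM25.

    Whitespace tokenization only; a sketch of the filter's ranking idea.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    terms = query.lower().split()
    # document frequency of each query term
    df = {t: sum(1 for d in tokenized if t in d) for t in terms}
    scores = []
    for d in tokenized:
        s = 0.0
        for t in terms:
            f = d.count(t)
            if f == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Filtering then amounts to keeping only the chunks whose score meets the configured `threshold`; chunks sharing no terms with the query score zero and are dropped.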
## Output Formats

- `all` - Full crawl result including metadata
- `json` - Extracted structured data (when using extraction)
- `markdown` / `md` - Raw markdown output
- `markdown-fit` / `md-fit` - Filtered markdown for better readability

## Complete Examples

```bash
crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -o json
```
```bash
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json \
    -v
```

```bash
crwl https://example.com \
    -B browser.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -f filter_bm25.yml \
    -o json
```

```bash
# First crawl and view
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main points?"
crwl https://example.com -q "Summarize the conclusions"
```
## Best Practices

Configuration Management:

- Keep provider credentials and global defaults in `~/.crawl4ai/global.yml`

Performance Optimization:

- Use `--bypass-cache` for fresh content
- Enable `scan_full_page` for infinite scroll pages
- Increase `delay_before_return_html` for dynamic content

Content Extraction:

- Prefer CSS-based extraction for pages with stable structure; reserve LLM-based extraction for unstructured content

Q&A Workflow:

- Review the page with `-o markdown` first, then ask targeted questions with `-q`

The Crawl4AI CLI provides a quick way to crawl pages, configure browser and crawler behavior, extract structured data, filter content, and ask questions about crawled pages.