# Extracting JSON (No LLM)
One of Crawl4AI's most powerful features is extracting structured JSON from websites without relying on large language models. Crawl4AI offers several strategies for LLM-free extraction:

- `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` for schema-based extraction with CSS or XPath selectors
- `RegexExtractionStrategy` for fast pattern matching

These approaches let you extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.
## 1. Intro to Schema-Based Extraction

Why avoid an LLM for basic extractions? For pages with a consistent, repeated structure, CSS/XPath selectors are faster, cheaper, and deterministic: they do exactly what you specify, with none of the cost, latency, or variability of model inference.
Below, we'll explore how to craft these schemas and use them with `JsonCssExtractionStrategy` (or `JsonXPathExtractionStrategy` if you prefer XPath). We'll also highlight advanced features like nested fields and base element attributes.
A schema defines:

1. A **base selector** that matches each repeated element on the page (e.g., a product card or crypto row).
2. **Fields** specifying which CSS/XPath selector to use for each piece of data you want to capture (text, attribute, HTML, regex, etc.).
3. **Nested or list types** for repeated or hierarchical structures.
For example, if you have a list of products, each one might have a name, price, reviews, and "related products." This approach is faster and more reliable than an LLM for consistent, structured pages.
## 2. Simple Example: Crypto Prices

Let's begin with a simple schema-based extraction using the JsonCssExtractionStrategy. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we don't call any LLM:
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_crypto_prices():
    # 1. Define a simple extraction schema
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "div.crypto-row",    # Repeated elements
        "fields": [
            {
                "name": "coin_name",
                "selector": "h2.coin-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.coin-price",
                "type": "text"
            }
        ]
    }

    # 2. Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # 3. Set up your crawler config (if needed)
    config = CrawlerRunConfig(
        # e.g., pass js_code or wait_for if the page is dynamic
        # wait_for="css:.crypto-row:nth-child(20)"
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # 4. Run the crawl and extraction
        result = await crawler.arun(
            url="https://example.com/crypto-prices",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # 5. Parse the extracted JSON
        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin entries")
        print(json.dumps(data[0], indent=2) if data else "No data found")

asyncio.run(extract_crypto_prices())
```
Highlights:
- `baseSelector`: Tells us where each "item" (crypto row) is.
- `fields`: Two fields (`coin_name`, `price`) using simple CSS selectors.
- Each field defines a `type` (e.g., `text`, `attribute`, `html`, `regex`, etc.).
- Optional field keys include `transform`, `default`, `attribute`, `pattern`, and `source` (for sibling data — see [Extracting Sibling Data](#sibling-data)).

No LLM is needed, and the performance is near-instant for hundreds or thousands of items.
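As a quick illustration of those optional keys, here is a hedged sketch of two field definitions. The selectors are hypothetical, and the `transform` value follows the lower/upper/strip options mentioned later in this document:

```python
# Illustrative field definitions (selectors are hypothetical).
# "attribute" pulls an element attribute instead of its text;
# "transform" post-processes the extracted value; "default" is
# returned when the selector matches nothing.
fields = [
    {
        "name": "detail_url",
        "selector": "a.coin-link",
        "type": "attribute",
        "attribute": "href",
        "default": None
    },
    {
        "name": "symbol",
        "selector": "span.coin-symbol",
        "type": "text",
        "transform": "uppercase",   # normalize casing
        "default": ""
    }
]
```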
### XPath Example with `raw://` HTML

Below is a short example demonstrating XPath extraction plus the `raw://` scheme. We'll pass a dummy HTML directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonXPathExtractionStrategy

async def extract_crypto_prices_xpath():
    # 1. Minimal dummy HTML with some repeating rows
    dummy_html = """
    <html>
      <body>
        <div class='crypto-row'>
          <h2 class='coin-name'>Bitcoin</h2>
          <span class='coin-price'>$28,000</span>
        </div>
        <div class='crypto-row'>
          <h2 class='coin-name'>Ethereum</h2>
          <span class='coin-price'>$1,800</span>
        </div>
      </body>
    </html>
    """

    # 2. Define the JSON schema (XPath version)
    schema = {
        "name": "Crypto Prices via XPath",
        "baseSelector": "//div[@class='crypto-row']",
        "fields": [
            {
                "name": "coin_name",
                "selector": ".//h2[@class='coin-name']",
                "type": "text"
            },
            {
                "name": "price",
                "selector": ".//span[@class='coin-price']",
                "type": "text"
            }
        ]
    }

    # 3. Place the strategy in the CrawlerRunConfig
    config = CrawlerRunConfig(
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True)
    )

    # 4. Use the raw:// scheme to pass dummy_html directly
    raw_url = f"raw://{dummy_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=raw_url,
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin rows")
        if data:
            print("First item:", data[0])

asyncio.run(extract_crypto_prices_xpath())
```
Key Points:
- `JsonXPathExtractionStrategy` is used instead of `JsonCssExtractionStrategy`.
- `baseSelector` and each field's `selector` use XPath instead of CSS.
- `raw://` lets us pass `dummy_html` with no real network request—handy for local testing.
- The entire extraction strategy is defined inside `CrawlerRunConfig`.

That's how you keep the config self-contained, illustrate XPath usage, and demonstrate the raw scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.
## 3. Advanced Schema & Nested Structures

Real sites often have nested or repeated data—like categories containing products, which themselves have a list of reviews or features. For that, we can define nested or list (and even nested_list) fields.
We have a sample e-commerce HTML file on GitHub:
https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/examples/sample_ecommerce.html
This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure without an LLM.
```python
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes
    # from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",    # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",    # single sub-object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}
```
Key Takeaways:
type: "nested" means a single sub-object (like details).type: "list" means multiple items that are simple dictionaries or single text fields.type: "nested_list" means repeated complex objects (like products or reviews)."baseFields". For instance, "data_cat_id" might be data-cat-id="elect123".transform if we want to lower/upper case, strip whitespace, or even run a custom function.import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/examples/sample_ecommerce.html",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
```
If all goes well, you get a structured JSON array with each "category," containing an array of products. Each product includes details, features, reviews, etc. All of that without an LLM.
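For illustration only, the shape of that output would be roughly as follows (the values are invented to match the schema; `elect123` reuses the example from the takeaways above):

```json
[
  {
    "data_cat_id": "elect123",
    "category_name": "Electronics",
    "products": [
      {
        "name": "Sample Laptop",
        "price": "$999.99",
        "details": {"brand": "BrandX", "model": "X-2000"},
        "features": [{"feature": "16GB RAM"}],
        "reviews": [
          {"reviewer": "Jane", "rating": "4.5", "comment": "Great value."}
        ],
        "related_products": [
          {"name": "Laptop Sleeve", "price": "$29.99"}
        ]
      }
    ]
  }
]
```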
## 4. RegexExtractionStrategy

Crawl4AI now offers a powerful zero-LLM extraction strategy: `RegexExtractionStrategy`. This strategy provides lightning-fast extraction of common data types like emails, phone numbers, URLs, dates, and more using pre-compiled regular expressions.
The easiest way to start is by using the built-in pattern catalog:
```python
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_with_regex():
    # Create a strategy using built-in patterns for URLs and currencies
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:5]:  # Show first 5 matches
                print(f"{item['label']}: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_regex())
```
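Because every match carries its pattern name in `label`, post-processing the flat result list is trivial. A small sketch, pure standard library, using the `data` list from the example above:

```python
from collections import defaultdict

# Group extracted values by the pattern that produced them,
# e.g. {"url": [...], "currency": [...]}
by_label = defaultdict(list)
for item in data:
    by_label[item["label"]].append(item["value"])

print(sorted(by_label.keys()))
```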
`RegexExtractionStrategy` provides these common patterns as `IntFlag` attributes for easy combining:
```python
# Use individual patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)

# Combine multiple patterns
strategy = RegexExtractionStrategy(
    pattern=(
        RegexExtractionStrategy.Email |
        RegexExtractionStrategy.PhoneUS |
        RegexExtractionStrategy.Url
    )
)

# Use all available patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.All)
```
Available patterns include:

- `Email` - Email addresses
- `PhoneIntl` - International phone numbers
- `PhoneUS` - US-format phone numbers
- `Url` - HTTP/HTTPS URLs
- `IPv4` - IPv4 addresses
- `IPv6` - IPv6 addresses
- `Uuid` - UUIDs
- `Currency` - Currency values (USD, EUR, etc.)
- `Percentage` - Percentage values
- `Number` - Numeric values
- `DateIso` - ISO format dates
- `DateUS` - US format dates
- `Time24h` - 24-hour format times
- `PostalUS` - US postal codes
- `PostalUK` - UK postal codes
- `HexColor` - HTML hex color codes
- `TwitterHandle` - Twitter handles
- `Hashtag` - Hashtags
- `MacAddr` - MAC addresses
- `Iban` - International bank account numbers
- `CreditCard` - Credit card numbers

For more targeted extraction, you can provide custom patterns:
```python
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_prices():
    # Define a custom pattern for US Dollar prices
    price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}

    # Create strategy with custom pattern
    strategy = RegexExtractionStrategy(custom=price_pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"Found price: {item['value']}")

asyncio.run(extract_prices())
```
For complex or site-specific patterns, you can use an LLM once to generate an optimized pattern, then save and reuse it without further LLM calls:
```python
import json
import asyncio
from pathlib import Path
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy,
    LLMConfig
)

async def extract_with_generated_pattern():
    cache_dir = Path("./pattern_cache")
    cache_dir.mkdir(exist_ok=True)
    pattern_file = cache_dir / "price_pattern.json"

    # 1. Generate or load pattern
    if pattern_file.exists():
        pattern = json.load(pattern_file.open())
        print(f"Using cached pattern: {pattern}")
    else:
        print("Generating pattern via LLM...")

        # Configure LLM
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY",
        )

        # Get sample HTML for context
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/products")
            html = result.markdown.fit_html

        # Generate pattern (one-time LLM usage)
        pattern = RegexExtractionStrategy.generate_pattern(
            label="price",
            html=html,
            query="Product prices in USD format",
            llm_config=llm_config,
        )

        # Cache pattern for future use
        json.dump(pattern, pattern_file.open("w"), indent=2)

    # 2. Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:10]:
                print(f"Extracted: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_generated_pattern())
```
This workflow lets you pay the LLM cost once, cache the resulting pattern to disk, and run every subsequent extraction with pure regex, no further LLM calls needed.
`RegexExtractionStrategy` returns results in a consistent format:

```json
[
  {
    "url": "https://example.com",
    "label": "email",
    "value": "support@example.com",
    "span": [145, 163]
  },
  {
    "url": "https://example.com",
    "label": "url",
    "value": "https://support.example.com",
    "span": [210, 235]
  }
]
```
Each match includes:

- `url`: The source URL
- `label`: The pattern name that matched (e.g., `"email"`, `"phone_us"`)
- `value`: The extracted text
- `span`: The start and end positions in the source content

## 5. Why "No LLM" Is Often Better

When might you consider an LLM? Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema or regex approach first for repeated or consistent data patterns.
## 6. Base Element Attributes

It's easy to extract attributes (like `href`, `src`, or `data-xxx`) from your base or nested elements using:
```json
{
  "name": "href",
  "type": "attribute",
  "attribute": "href",
  "default": null
}
```
You can define them in `baseFields` (extracted from the main container element) or in each field's sub-lists. This is especially helpful if you need an item's link or ID stored in the parent `<div>`.
## 7. Putting It All Together: Larger Example

Consider a blog site. We have a schema that extracts the URL from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
```python
schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}
```
Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.
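For completeness, here's a minimal sketch of wiring that schema into a crawl, following the same pattern as the earlier examples (the blog URL is a placeholder):

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

async def extract_blog_posts():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema)  # schema from above
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=config)
        if result.success:
            posts = json.loads(result.extracted_content)
            for post in posts:
                print(post["post_url"], "-", post["title"])

asyncio.run(extract_blog_posts())
```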
## 8. Extracting Sibling Data with `source` {#sibling-data}

Some websites split a single logical item across sibling elements rather than nesting everything inside one container. A classic example is Hacker News, where each submission spans two adjacent `<tr>` rows:
<tr class="athing submission"> <!-- rank, title, url -->
<td><span class="rank">1.</span></td>
<td><span class="titleline"><a href="https://example.com">Example Title</a></span></td>
</tr>
<tr> <!-- score, author, comments (sibling!) -->
<td class="subtext">
<span class="score">100 points</span>
<a class="hnuser">johndoe</a>
</td>
</tr>
Normally, field selectors only search descendants of the base element — siblings are unreachable. The `source` field key solves this by navigating to a sibling element before running the selector.
"source": "+ <selector>"
+ tr — next sibling <tr>+ div.details — next sibling <div> with class details+ .subtext — next sibling with class subtextschema = {
"name": "HN Submissions",
"baseSelector": "tr.athing.submission",
"fields": [
{"name": "rank", "selector": "span.rank", "type": "text"},
{"name": "title", "selector": "span.titleline a", "type": "text"},
{"name": "url", "selector": "span.titleline a", "type": "attribute", "attribute": "href"},
{"name": "score", "selector": "span.score", "type": "text", "source": "+ tr"},
{"name": "author", "selector": "a.hnuser", "type": "text", "source": "+ tr"},
],
}
strategy = JsonCssExtractionStrategy(schema)
The `score` and `author` fields first navigate to the next sibling `<tr>`, then run their selectors inside that element. Fields without `source` work as before — searching descendants of the base element.

`source` works with all field types (`text`, `attribute`, `nested`, `list`, etc.) and with both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. If the sibling isn't found, the field returns its `default` value.
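Run against the two sample rows above, the extraction would yield something like this (values taken directly from that snippet):

```json
[
  {
    "rank": "1.",
    "title": "Example Title",
    "url": "https://example.com",
    "score": "100 points",
    "author": "johndoe"
  }
]
```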
## 9. Tips & Best Practices

- If the page loads content dynamically, pass `js_code` or `wait_for` in `CrawlerRunConfig`.
- Run with `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
- Use `baseFields` to capture attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
- For simple, well-defined patterns like emails or URLs, `RegexExtractionStrategy` is often the fastest approach.

## 10. Schema Generation Utility

While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to automatically generate extraction schemas using an LLM. This is particularly useful when you're facing a new site structure and want a quick, reliable starting point.
The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4 or the open-source Ollama for schema generation:
```python
from crawl4ai import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
from crawl4ai import LLMConfig

# Sample HTML with product information
html = """
<div class="product-card">
    <h2 class="title">Gaming Laptop</h2>
    <div class="price">$999.99</div>
    <div class="specs">
        <ul>
            <li>16GB RAM</li>
            <li>1TB SSD</li>
        </ul>
    </div>
</div>
"""

# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
    html,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")
)

# Option 2: Using Ollama (open source, no token needed)
xpath_schema = JsonXPathExtractionStrategy.generate_schema(
    html,
    schema_type="xpath",
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the generated schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(css_schema)
```
By default, `generate_schema` validates the generated schema against the HTML to ensure that it actually extracts the data you expect. If the schema doesn't produce results, it automatically refines the selectors before returning.

You can control this with the `validate` parameter:
```python
# Default: validated (recommended)
schema = JsonCssExtractionStrategy.generate_schema(
    url="https://news.ycombinator.com",
    query="Extract each story: title, url, score, author",
)

# Skip validation if you want raw LLM output
schema = JsonCssExtractionStrategy.generate_schema(
    url="https://news.ycombinator.com",
    query="Extract each story: title, url, score, author",
    validate=False,
)
```
The generator also understands sibling layouts — for sites like Hacker News where data is split across sibling elements, it will automatically use the `source` field to reach sibling data.

`generate_schema` may make multiple LLM calls internally (field inference, schema generation, validation retries). To track the total token consumption across all of these calls, pass a `TokenUsage` accumulator:
```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.models import TokenUsage

usage = TokenUsage()
schema = JsonCssExtractionStrategy.generate_schema(
    url="https://news.ycombinator.com",
    query="Extract each story: title, url, score, author",
    usage=usage,
)

print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")
```
The `usage` parameter is optional — omitting it changes nothing (fully backward-compatible). You can also reuse the same accumulator across multiple calls to get a grand total:
```python
usage = TokenUsage()
schema1 = JsonCssExtractionStrategy.generate_schema(url=url1, query=q1, usage=usage)
schema2 = JsonCssExtractionStrategy.generate_schema(url=url2, query=q2, usage=usage)
print(f"Grand total: {usage.total_tokens} tokens")
```
Both `generate_schema` (sync) and `agenerate_schema` (async) support the `usage` parameter.
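A minimal sketch of the async variant, assuming `agenerate_schema` takes the same arguments as its sync counterpart (as the line above implies):

```python
import asyncio
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.models import TokenUsage

async def generate():
    usage = TokenUsage()
    # Same arguments as generate_schema, but awaitable
    schema = await JsonCssExtractionStrategy.agenerate_schema(
        url="https://news.ycombinator.com",
        query="Extract each story: title, url, score, author",
        usage=usage,
    )
    print(f"Total tokens: {usage.total_tokens}")
    return schema

schema = asyncio.run(generate())
```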
Supported providers include:

- OpenAI GPT-4 (`openai/gpt-4o`): requires an API token, e.g. via the `OPENAI_API_KEY` environment variable
- Ollama (`ollama/llama3.3`): open source, no token needed
### Multi-Sample Schema Generation

When scraping multiple pages with varying DOM structures (e.g., product pages where table rows appear in different positions), single-sample schema generation may produce fragile selectors like `tr:nth-child(6)` that break on other pages.

**The Problem:**

```
Page A: Manufacturer is in row 6 → selector: tr:nth-child(6) td a
Page B: Manufacturer is in row 5 → selector FAILS
Page C: Manufacturer is in row 7 → selector FAILS
```

**The Solution:** Provide multiple HTML samples so the LLM identifies stable patterns that work across all pages.
````python
from crawl4ai import JsonCssExtractionStrategy, LLMConfig

# Collect HTML samples from different pages
html_sample_1 = """
<table class="specs">
  <tr><td>Brand</td><td>Apple</td></tr>
  <tr><td>Manufacturer</td><td><a href="/m/apple">Apple Inc</a></td></tr>
</table>
"""

html_sample_2 = """
<table class="specs">
  <tr><td>Manufacturer</td><td><a href="/m/samsung">Samsung</a></td></tr>
  <tr><td>Brand</td><td>Galaxy</td></tr>
</table>
"""

html_sample_3 = """
<table class="specs">
  <tr><td>Model</td><td>Pixel 8</td></tr>
  <tr><td>Brand</td><td>Google</td></tr>
  <tr><td>Manufacturer</td><td><a href="/m/google">Google LLC</a></td></tr>
</table>
"""

# Combine samples with labels
combined_html = """
## HTML Sample 1 (Product A):
```html
""" + html_sample_1 + """
```

## HTML Sample 2 (Product B):
```html
""" + html_sample_2 + """
```

## HTML Sample 3 (Product C):
```html
""" + html_sample_3 + """
```
"""

query = """
IMPORTANT: I'm providing 3 HTML samples from different product pages.
The manufacturer field appears in different row positions across pages.

Generate selectors using stable attributes like href patterns
(e.g., a[href*='/m/']) instead of fragile positional selectors
like nth-child().

Extract: manufacturer name and link.
"""

schema = JsonCssExtractionStrategy.generate_schema(
    html=combined_html,
    query=query,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-token")
)

print(schema)
````
**Key Points for Multi-Sample Queries:**
1. **Format samples clearly** - Use markdown headers and code blocks to separate samples
2. **State the number of samples** - "I'm providing 3 HTML samples..."
3. **Explain the variation** - "...the manufacturer field appears in different row positions"
4. **Request stable selectors** - "Use href patterns, data attributes, or class names instead of nth-child"
**Stable vs Fragile Selectors:**
| Fragile (single sample) | Stable (multi-sample) |
|------------------------|----------------------|
| `tr:nth-child(6) td a` | `a[href*="/m/"]` |
| `div:nth-child(3) .price` | `.price, [data-price]` |
| `ul li:first-child` | `li[data-featured="true"]` |
This approach lets you generate schemas once that work reliably across hundreds of similar pages with varying structures.
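Once generated, the schema is an ordinary dict, so you can persist it and fan it out over many URLs without any further LLM involvement. A sketch under those assumptions (the file name and URLs are placeholders; `arun_many` is Crawl4AI's batch-crawl helper):

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

# Persist the generated schema so future runs skip the LLM entirely
with open("product_schema.json", "w") as f:
    json.dump(schema, f, indent=2)

async def crawl_many():
    with open("product_schema.json") as f:
        saved_schema = json.load(f)
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(saved_schema)
    )
    urls = [f"https://example.com/products/{i}" for i in range(1, 4)]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=config)
        for r in results:
            if r.success:
                print(r.url, json.loads(r.extracted_content))

asyncio.run(crawl_many())
```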
---
## 11. Conclusion
With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:
- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or pattern-based extraction.
- Scale to thousands of pages quickly and reliably.
**Choosing the Right Strategy**:
- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phones, URLs, dates, etc.
- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns
- If you need both: first extract structured data with JSON strategies, then use regex on specific fields
**Remember**: For repeated, structured data, you don't need to pay for or wait on an LLM. Well-crafted schemas and regex patterns get you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.
**Last Updated**: 2025-05-02
---
That's it for **Extracting JSON (No LLM)**! You've seen how schema-based approaches (either CSS or XPath) and regex patterns can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!