
# FetchNode Timeout Configuration

## Overview

The `FetchNode` in ScrapeGraphAI supports configurable timeouts for all blocking operations, preventing indefinite hangs when fetching web content or parsing files. This feature lets you set execution time limits for:

- HTTP requests (when `use_soup=True`)
- PDF file parsing
- `ChromiumLoader` operations

## Configuration

### Default Behavior

By default, `FetchNode` uses a 30-second timeout for all blocking operations when a `node_config` is provided:

```python
from scrapegraphai.nodes import FetchNode

# Default 30-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={}
)
```

### Custom Timeout

You can specify a custom timeout value (in seconds) via the `timeout` parameter:

```python
# Custom 10-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": 10}
)
```

### Disabling Timeout

To disable the timeout and allow operations to run indefinitely, set `timeout` to `None`:

```python
# No timeout - operations will wait indefinitely
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": None}
)
```

### No Configuration

If you don't provide a `node_config` at all, the timeout defaults to `None` (no timeout):

```python
# No timeout (backward compatible)
node = FetchNode(
    input="url",
    output=["doc"],
    node_config=None
)
```

## Use Cases

### HTTP Requests

When `use_soup=True`, the timeout applies to `requests.get()` calls:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": 15  # HTTP request will time out after 15 seconds
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```

If the timeout is `None`, no `timeout` parameter is passed to `requests.get()`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": None  # No timeout for HTTP requests
    }
)
```
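
This conditional amounts to the following pattern. The snippet is a minimal sketch for illustration, not the library's actual code; `fetch_html` is a hypothetical helper:

```python
from typing import Optional

import requests


def fetch_html(url: str, timeout: Optional[float]) -> str:
    # Pass a timeout kwarg only when one is configured; otherwise
    # requests.get() blocks until the server responds or the
    # connection fails on its own.
    if timeout is not None:
        response = requests.get(url, timeout=timeout)
    else:
        response = requests.get(url)
    response.raise_for_status()
    return response.text
```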

### PDF Parsing

The timeout applies to PDF file parsing operations using `PyPDFLoader`:

```python
node = FetchNode(
    input="pdf",
    output=["doc"],
    node_config={
        "timeout": 60  # PDF parsing will time out after 60 seconds
    }
)

state = {"pdf": "/path/to/large_document.pdf"}
try:
    result = node.execute(state)
except TimeoutError as e:
    print(f"PDF parsing took too long: {e}")
```

If parsing exceeds the timeout, a `TimeoutError` is raised with a descriptive message:

```
TimeoutError: PDF parsing exceeded timeout of 60 seconds
```

### ChromiumLoader

The timeout is automatically propagated to `ChromiumLoader` via `loader_kwargs`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # ChromiumLoader will use a 30-second timeout
        "headless": True
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```

If you need different timeout behavior for `ChromiumLoader` specifically, you can override it in `loader_kwargs`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # General timeout for other operations
        "loader_kwargs": {
            "timeout": 60  # ChromiumLoader gets a 60-second timeout
        }
    }
)
```
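
The precedence shown above amounts to a `setdefault`-style merge. A minimal sketch of the idea (not the node's exact code), assuming `node_config` and the node-level `timeout` have already been resolved:

```python
from typing import Optional


def merge_loader_kwargs(node_config: dict, timeout: Optional[float]) -> dict:
    # Explicit loader_kwargs win; the node-level timeout only fills the gap.
    loader_kwargs = dict(node_config.get("loader_kwargs", {}))
    if timeout is not None:
        loader_kwargs.setdefault("timeout", timeout)
    return loader_kwargs
```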

## Graph Examples

### SmartScraperGraph

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    },
    "timeout": 20  # 20-second timeout for fetch operations
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://news.example.com",
    config=graph_config
)

result = smart_scraper.run()
```

### Custom Graph with FetchNode

```python
from scrapegraphai.nodes import FetchNode

# Create a fetch node with a timeout for use in a custom graph
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 15,
        "headless": True
    }
)

# Add to graph...
```
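
To wire the node into a runnable graph, one option is ScrapeGraphAI's `BaseGraph`. The sketch below assumes a `BaseGraph(nodes, edges, entry_point)` constructor and an `execute()` that returns the final state plus execution info; check your installed version before relying on it:

```python
from scrapegraphai.graphs import BaseGraph

# Single-node graph: the fetch node is both the entry point and the only node
graph = BaseGraph(
    nodes=[fetch_node],
    edges=[],
    entry_point=fetch_node
)

final_state, execution_info = graph.execute({"url": "https://example.com"})
print(final_state["doc"])
```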

## Best Practices

1. Choose appropriate timeouts: consider the expected response time of your target sources.
   - Fast APIs: 5-10 seconds
   - Regular websites: 15-30 seconds
   - Large PDFs or slow sites: 60+ seconds

2. Handle `TimeoutError`: always wrap `execute()` calls in `try`/`except` when using timeouts:

   ```python
   try:
       result = node.execute(state)
   except TimeoutError as e:
       logger.error(f"Operation timed out: {e}")
       # Handle the timeout gracefully
   ```

3. Use different timeouts for different operations: set higher timeouts for PDF parsing and lower ones for HTTP requests:

   ```python
   # For PDFs
   pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})

   # For web pages
   web_node = FetchNode("url", ["doc"], {"timeout": 15})
   ```

4. Monitor timeout occurrences: log timeout errors to identify problematic sources:

   ```python
   import logging

   logger = logging.getLogger(__name__)

   try:
       result = node.execute(state)
   except TimeoutError as e:
       logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
   ```

## Implementation Details

The timeout feature is implemented using:

- HTTP requests: the `timeout` parameter of `requests.get(url, timeout=X)`
- PDF parsing: `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=X)`
- ChromiumLoader: propagated via the `loader_kwargs` dictionary

When `timeout=None`, no timeout constraints are applied and operations run until completion.
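
For reference, the thread-pool pattern used for PDF parsing looks roughly like this. This is a minimal sketch, not the node's actual code; `parse_pdf` is a hypothetical stand-in for the `PyPDFLoader` call:

```python
import concurrent.futures
from typing import Optional


def parse_pdf(path: str) -> list:
    """Hypothetical stand-in for the PyPDFLoader-based parse."""
    ...


def parse_with_timeout(path: str, timeout: Optional[float]) -> list:
    # Run the blocking parse in a worker thread so the caller can
    # enforce a deadline via future.result().
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(parse_pdf, path)
    try:
        # result(timeout=None) waits indefinitely, matching the
        # documented timeout=None behavior.
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError as exc:
        raise TimeoutError(
            f"PDF parsing exceeded timeout of {timeout} seconds"
        ) from exc
    finally:
        # The worker thread itself cannot be killed; we just stop
        # waiting for it.
        executor.shutdown(wait=False)
```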

## Troubleshooting

### Timeout is too short

If you're seeing frequent timeout errors, increase the `timeout` value:

```python
node_config = {"timeout": 60}  # Increase from 30 to 60 seconds
```

### Need different timeouts for different operations

Use separate `FetchNode` instances with different configurations:

```python
fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
```

### ChromiumLoader timeout not working

Ensure you're not overriding the timeout in `loader_kwargs`:

```python
# ❌ Wrong - an explicit loader_kwargs timeout overrides the node timeout
node_config = {
    "timeout": 30,
    "loader_kwargs": {"timeout": 10}  # This takes precedence
}

# ✅ Correct - let the node timeout propagate
node_config = {
    "timeout": 30  # ChromiumLoader will use 30 seconds
}
```

## See Also