# Timeout Configuration
The FetchNode in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature allows you to control execution time limits for:

- HTTP requests (when `use_soup=True`)
- PDF file parsing with PyPDFLoader
- ChromiumLoader operations

## Default Timeout

By default, FetchNode uses a 30-second timeout for all blocking operations when a `node_config` is provided:
```python
from scrapegraphai.nodes import FetchNode

# Default 30-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={}
)
```
## Custom Timeout

You can specify a custom timeout value (in seconds) via the `timeout` parameter:
```python
# Custom 10-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": 10}
)
```
## Disabling the Timeout

To disable the timeout and allow operations to run indefinitely, set `timeout` to `None`:
```python
# No timeout - operations will wait indefinitely
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": None}
)
```
## Backward Compatibility

If you don't provide any `node_config`, the timeout defaults to `None` (no timeout):
```python
# No timeout (backward compatible)
node = FetchNode(
    input="url",
    output=["doc"],
    node_config=None
)
```
## HTTP Requests (`use_soup=True`)

When `use_soup=True`, the timeout applies to `requests.get()` calls:
```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": 15  # HTTP request will time out after 15 seconds
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```
If the timeout is `None`, no `timeout` parameter is passed to `requests.get()`:
```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": None  # No timeout for HTTP requests
    }
)
```
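Behind this behavior is a simple pattern: build the request keyword arguments conditionally, so that `None` means the parameter is omitted entirely. Here is a minimal illustrative sketch (not the library's actual code; `fetch_html` is a hypothetical helper):

```python
import requests

def fetch_html(url: str, timeout: float | None = None) -> str:
    # Hypothetical helper: pass `timeout` to requests.get() only
    # when it is set, so timeout=None omits the parameter entirely.
    kwargs = {}
    if timeout is not None:
        kwargs["timeout"] = timeout
    response = requests.get(url, **kwargs)
    response.raise_for_status()
    return response.text
```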
## PDF Parsing

The timeout also applies to PDF file parsing operations using PyPDFLoader:
```python
node = FetchNode(
    input="pdf",
    output=["doc"],
    node_config={
        "timeout": 60  # PDF parsing will time out after 60 seconds
    }
)

state = {"pdf": "/path/to/large_document.pdf"}

try:
    result = node.execute(state)
except TimeoutError as e:
    print(f"PDF parsing took too long: {e}")
```
If parsing exceeds the timeout, a `TimeoutError` is raised with a descriptive message:

```text
TimeoutError: PDF parsing exceeded timeout of 60 seconds
```
## ChromiumLoader

The timeout is automatically propagated to ChromiumLoader via `loader_kwargs`:
```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # ChromiumLoader will use a 30-second timeout
        "headless": True
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```
If you need different timeout behavior for ChromiumLoader specifically, you can override it in `loader_kwargs`:
```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # General timeout for other operations
        "loader_kwargs": {
            "timeout": 60  # ChromiumLoader gets a 60-second timeout
        }
    }
)
```
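Conceptually, the propagation and override behavior boils down to merging dictionaries, with explicit `loader_kwargs` entries winning. The following is a minimal sketch of that idea (illustrative only; `build_loader_kwargs` is a hypothetical name, not a ScrapeGraphAI function):

```python
def build_loader_kwargs(node_timeout, user_loader_kwargs=None):
    # Start from the node-level timeout, then merge in the user's
    # loader_kwargs so an explicit "timeout" there takes precedence.
    kwargs = {"timeout": node_timeout}
    kwargs.update(user_loader_kwargs or {})
    return kwargs

print(build_loader_kwargs(30))                   # {'timeout': 30}
print(build_loader_kwargs(30, {"timeout": 60}))  # {'timeout': 60}
```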
## Graph-Level Configuration

You can also set the timeout in the configuration of a graph such as SmartScraperGraph, where it applies to the fetch operations:

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    },
    "timeout": 20  # 20-second timeout for fetch operations
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://news.example.com",
    config=graph_config
)

result = smart_scraper.run()
```
When building a custom graph, configure the timeout on the FetchNode directly:

```python
from scrapegraphai.nodes import FetchNode
from langgraph.graph import StateGraph

# Create a custom graph with timeout
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 15,
        "headless": True
    }
)

# Add to graph...
```
## Best Practices

1. **Choose appropriate timeouts**: consider the expected response time of your target websites.

2. **Handle `TimeoutError`**: always wrap your code in try/except when using timeouts:

   ```python
   try:
       result = node.execute(state)
   except TimeoutError as e:
       logger.error(f"Operation timed out: {e}")
       # Handle timeout gracefully
   ```

3. **Use different timeouts for different content types**:

   ```python
   # For PDFs
   pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})

   # For web pages
   web_node = FetchNode("url", ["doc"], {"timeout": 15})
   ```

4. **Log timeout events** so you can tune the values over time:

   ```python
   import logging

   logger = logging.getLogger(__name__)

   try:
       result = node.execute(state)
   except TimeoutError as e:
       logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
   ```
## Implementation Details

The timeout feature is implemented using:

- the `requests.get(url, timeout=X)` parameter for HTTP requests
- `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=X)` for blocking operations such as PDF parsing
- the `loader_kwargs` dictionary for ChromiumLoader

When `timeout=None`, no timeout constraints are applied, allowing operations to run until completion.
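To illustrate the thread-pool pattern named above, here is a minimal sketch (not the library's exact code; `run_with_timeout` is a hypothetical helper) of bounding a blocking call with `future.result(timeout=X)`:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

def run_with_timeout(func, timeout, *args, **kwargs):
    # Run a blocking callable in a worker thread and bound the wait.
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except FutureTimeoutError:
        raise TimeoutError(f"Operation exceeded timeout of {timeout} seconds")
    finally:
        # wait=False so we don't block on a worker that is still running.
        # Note: the thread itself keeps running; the timeout bounds how
        # long the caller waits, not the work itself.
        executor.shutdown(wait=False)
```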
## Troubleshooting

### Frequent timeout errors

If you're seeing frequent timeout errors, increase the timeout value:

```python
node_config = {"timeout": 60}  # Increase from 30 to 60 seconds
```
### Different sources need different limits

Use separate FetchNode instances with different configurations:

```python
fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
```
### Timeout not taking effect for ChromiumLoader

Ensure you're not overriding the timeout in `loader_kwargs`:

```python
# ❌ Wrong - explicit loader_kwargs timeout overrides node timeout
node_config = {
    "timeout": 30,
    "loader_kwargs": {"timeout": 10}  # This takes precedence
}

# ✅ Correct - let node timeout propagate
node_config = {
    "timeout": 30  # ChromiumLoader will use 30 seconds
}
```