sep/sep-016.rst
=======  =============================
SEP      16
Title    Leg Spider
Author   Insophia Team
Created  2010-06-03
Status   Superseded by :doc:`sep-018`
=======  =============================
This SEP introduces a new kind of spider called ``LegSpider``, which provides
modular functionality that can be plugged into different spiders.
The purpose of Leg Spiders is to define an architecture for building spiders
based on smaller, well-tested components (aka Legs) that can be combined to
achieve the desired functionality. These reusable components will benefit all
Scrapy users by building a repository of well-tested components (legs) that
can be shared among different spiders and projects. Some of them will come
bundled with Scrapy.
Legs themselves can also be combined with sub-legs, in a hierarchical
fashion. Legs are also spiders themselves, hence the name "Leg Spider".
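For instance, several link extraction legs could be grouped behind a single
leg (a hypothetical sketch; ``RegexHtmlLinkExtractor`` and
``Rss2LinkExtractor`` are the legs defined in the examples below):

.. code-block:: python

    class LinkExtractionLeg(LegSpider):
        # sub-legs of this leg
        legs = [RegexHtmlLinkExtractor(), Rss2LinkExtractor()]


    class MySpider(LegSpider):
        # MySpider transitively uses both link extractors
        legs = [LinkExtractionLeg()]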
LegSpider API
=============

A ``LegSpider`` is a ``BaseSpider`` subclass that adds the following
attributes and methods:

- ``legs``: the list of legs composing this spider
- ``process_response(response)``: returns the items and requests extracted
  from a given response
- ``process_request(request)``: processes (and can replace) each request
  before it's returned from the spider
- ``process_item(item)``: processes (and can replace) each item before it's
  returned from the spider
- ``set_spider(spider)``: sets a reference to the top-level spider
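A minimal leg overriding these hooks could look as follows (a sketch based on
the proof-of-concept implementation at the end of this document; the class
name is hypothetical):

.. code-block:: python

    class NoopLeg(LegSpider):
        def process_response(self, response):
            # return new requests/items extracted from this response
            return []

        def process_request(self, request):
            # return the request, possibly modified or replaced
            return request

        def process_item(self, item):
            # return the item, possibly modified or replaced
            return item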
When a ``LegSpider`` receives a response, the response is processed with its
own ``process_response`` method and also with the ``process_response`` method
of all its "sub leg spiders". Finally, the output of all of them is combined
to produce the final aggregated output.

Each item/request in the aggregated output of ``process_response`` is
processed with either ``process_item`` or ``process_request`` before being
returned from the spider. Similar to ``process_response``, each item/request
is processed with the ``process_{request,item}`` methods of all the leg
spiders composing the spider, and also with those of the spider itself.
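To make this flow concrete, here's a sketch with two hypothetical legs
(``LinksLeg``, ``ItemsLeg`` and ``MyItem`` are illustration-only names):

.. code-block:: python

    class LinksLeg(LegSpider):
        def process_response(self, response):
            # contribute a request to the aggregated output
            return [Request("http://example.com/next")]


    class ItemsLeg(LegSpider):
        def process_response(self, response):
            # contribute an item to the aggregated output
            return [MyItem(url=response.url)]


    class MySpider(LegSpider):
        legs = [LinksLeg(), ItemsLeg()]

When ``MySpider`` processes a response, the request from ``LinksLeg`` and the
item from ``ItemsLeg`` are aggregated; the request is then passed through
every ``process_request`` method and the item through every ``process_item``
method before leaving the spider.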
LegSpider examples
==================

A typical application of Leg Spiders is to build link extractors. For
example:
.. code-block:: python

    class RegexHtmlLinkExtractor(LegSpider):
        def process_response(self, response):
            if isinstance(response, HtmlResponse):
                allowed_regexes = self.spider.url_regexes_to_follow
                # extract urls to follow using allowed_regexes
                return [Request(x) for x in urls_to_follow]


    class MySpider(LegSpider):
        legs = [RegexHtmlLinkExtractor()]
        url_regexes_to_follow = ["/product.php?.*"]

        def process_response(self, response):
            # parse response and extract items
            return items
Here's a Leg Spider that can be used for following links from RSS2 feeds:

.. code-block:: python

    class Rss2LinkExtractor(LegSpider):
        def process_response(self, response):
            if response.headers.get("Content-type") == "application/rss+xml":
                xs = XmlXPathSelector(response)
                urls = xs.select("//item/link/text()").extract()
                return [Request(x) for x in urls]
Another example could be to build a callback dispatcher based on rules:

.. code-block:: python

    import re


    class CallbackRules(LegSpider):
        def __init__(self, *a, **kw):
            super(CallbackRules, self).__init__(*a, **kw)
            self._rules = {}
            for regex, method_name in self.spider.callback_rules.items():
                r = re.compile(regex)
                m = getattr(self.spider, method_name, None)
                if m:
                    self._rules[r] = m

        def process_response(self, response):
            for regex, method in self._rules.items():
                m = regex.search(response.url)
                if m:
                    return method(response)
            return []


    class MySpider(LegSpider):
        legs = [CallbackRules()]
        callback_rules = {
            "/product.php.*": "parse_product",
            "/category.php.*": "parse_category",
        }

        def parse_product(self, response):
            # parse response and populate item
            return item
Another example could be for building URL canonicalizers:

.. code-block:: python

    class CanonicalizeUrl(LegSpider):
        def process_request(self, request):
            curl = canonicalize_url(
                request.url, rules=self.spider.canonicalization_rules
            )
            return request.replace(url=curl)


    class MySpider(LegSpider):
        legs = [CanonicalizeUrl()]
        canonicalization_rules = ["sort-query-args", "normalize-percent-encoding", ...]

        # ...
Another example could be for setting a unique identifier on items, based on
certain fields:

.. code-block:: python

    class ItemIdSetter(LegSpider):
        def process_item(self, item):
            id_field = self.spider.id_field
            id_fields_to_hash = self.spider.id_fields_to_hash
            item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
            return item


    class MySpider(LegSpider):
        legs = [ItemIdSetter()]
        id_field = "guid"
        id_fields_to_hash = ["supplier_name", "supplier_id"]

        def process_response(self, response):
            # extract item from response
            return item
Here's an example that combines functionality from multiple Leg Spiders:

.. code-block:: python

    class MySpider(LegSpider):
        legs = [
            RegexHtmlLinkExtractor(),
            CallbackRules(),
            CanonicalizeUrl(),
            ItemIdSetter(),
        ]
        url_regexes_to_follow = ["/product.php?.*"]
        callback_rules = {
            "/product.php.*": "parse_product",
            "/category.php.*": "parse_category",
        }
        canonicalization_rules = ["sort-query-args", "normalize-percent-encoding", ...]
        id_field = "guid"
        id_fields_to_hash = ["supplier_name", "supplier_id"]

        def parse_product(self, response):
            # parse response and extract product item
            return item

        def parse_category(self, response):
            # parse response and extract category item
            return item
Leg Spiders vs Spider middlewares
=================================

A common question is when to use Leg Spiders and when to use spider
middlewares. Leg Spiders are meant to implement spider-specific
functionality, like link extraction, which follows custom rules per spider.
Spider middlewares, on the other hand, are meant to implement global
functionality, applied to all spiders.
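For contrast, here's a sketch of the URL canonicalization example implemented
globally, as a spider middleware (the middleware class name is hypothetical;
``canonicalize_url`` comes from ``scrapy.utils.url``):

.. code-block:: python

    from scrapy.http import Request
    from scrapy.utils.url import canonicalize_url


    class CanonicalizeUrlMiddleware(object):
        """Spider middleware that canonicalizes request URLs in the
        output of *every* spider, not just one."""

        def process_spider_output(self, response, result, spider):
            for r in result:
                if isinstance(r, Request):
                    r = r.replace(url=canonicalize_url(r.url))
                yield r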
Leg Spiders are not a silver bullet for implementing all kinds of spiders, so
it's important to keep their scope and limitations in mind.
LegSpider proof-of-concept implementation
=========================================

Here's a proof-of-concept implementation of ``LegSpider``:
.. code-block:: python

    from scrapy.http import Request
    from scrapy.item import BaseItem
    from scrapy.spider import BaseSpider
    from scrapy.utils.spider import iterate_spider_output


    class LegSpider(BaseSpider):
        """A spider made of legs"""

        legs = []

        def __init__(self, *args, **kwargs):
            super(LegSpider, self).__init__(*args, **kwargs)
            # the spider itself acts as the first leg
            self._legs = [self] + self.legs[:]
            for l in self._legs:
                l.set_spider(self)

        def parse(self, response):
            res = self._process_response(response)
            for r in res:
                if isinstance(r, BaseItem):
                    yield self._process_item(r)
                else:
                    yield self._process_request(r)

        def process_response(self, response):
            return []

        def process_request(self, request):
            return request

        def process_item(self, item):
            return item

        def set_spider(self, spider):
            self.spider = spider

        def _process_response(self, response):
            # aggregate the output of process_response() across all legs
            res = []
            for l in self._legs:
                res.extend(iterate_spider_output(l.process_response(response)))
            return res

        def _process_request(self, request):
            # pass the request through every leg's process_request()
            for l in self._legs:
                request = l.process_request(request)
            return request

        def _process_item(self, item):
            # pass the item through every leg's process_item()
            for l in self._legs:
                item = l.process_item(item)
            return item