=======  ====================================
SEP      13
Title    Middlewares Refactoring
Author   Pablo Hoffman
Created  2009-11-14
Status   Document in progress (being written)
=======  ====================================
This SEP proposes a refactoring of Scrapy middlewares to remove some inconsistencies and limitations.
Current flaws and inconsistencies
=================================

Even though the core works pretty well, it has some subtle inconsistencies
that don't manifest in the common uses, but arise (and are quite annoying)
when you try to fully exploit all Scrapy features. The currently identified
flaws and inconsistencies are:

1. Spider middleware has a ``process_spider_exception`` method which catches
   exceptions coming out of the spider, but it doesn't have an analogous
   method for catching exceptions coming into the spider (for example, from
   other spider middlewares). This complicates supporting middlewares that
   extend other middlewares.

2. Downloader middleware has a ``process_exception`` method which catches
   exceptions coming out of the downloader, but it doesn't have an analogous
   method for catching exceptions coming into the downloader (for example,
   from other downloader middlewares). This complicates supporting
   middlewares that extend other middlewares.

3. Scheduler middleware has an ``enqueue_request`` method but doesn't have
   ``enqueue_request_exception``, ``dequeue_request`` or
   ``dequeue_request_exception`` methods.

These flaws will be corrected by the changes proposed in this SEP.
Overview of changes proposed
============================

Most of the inconsistencies come from the fact that middlewares don't follow
the typical `deferred
<https://twistedmatrix.com/projects/core/documentation/howto/defer.html>`_
callback/errback chaining logic. Twisted logic is fine and quite intuitive,
and it also fits middlewares very well. However, due to some bad design
choices, the integration between middleware calls and deferreds is far from
optimal. So the changes to middlewares mainly involve building deferred
chains with the middleware methods and adding the missing methods to each
callback/errback chain. The proposed API for each middleware is described
below.
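The chaining logic referred to above can be illustrated with a small
stand-in for Twisted's deferred semantics. This is a sketch for illustration
only — ``MiniDeferred`` is not Twisted's ``Deferred``; it merely mimics the
callback/errback propagation rules that the proposed middleware chains would
rely on:

```python
# Sketch (illustration only): a minimal stand-in for Twisted's Deferred,
# showing the callback/errback propagation rules that middleware chains
# would follow under this proposal.

class MiniDeferred:
    def __init__(self):
        self._pairs = []  # (callback, errback) pairs, one per middleware

    def add_callbacks(self, callback, errback=None):
        self._pairs.append((callback, errback))

    def fire(self, value):
        result = value
        for callback, errback in self._pairs:
            try:
                if isinstance(result, Exception):
                    # error path: only the errback of this pair runs
                    if errback is not None:
                        result = errback(result)  # may recover the chain
                else:
                    # success path: only the callback of this pair runs
                    if callback is not None:
                        result = callback(result)
            except Exception as exc:
                result = exc  # switch (or stay) on the error path
        return result


# Two hypothetical middleware hooks, to show recovery after a failure:
def failing_input(response):
    raise ValueError("bad response")

def recovering_exception_hook(exception):
    return "recovered response"
```

Firing the chain with a plain value runs the callbacks; an exception raised
anywhere skips to the next errback — which is exactly the slot the missing
``process_*_exception`` methods would occupy.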
See |scrapy core v2| - a diagram draft for the process architecture.
To be discussed:

- should we support returning deferreds (i.e. ``maybeDeferred``) in
  middleware methods?

Spider middleware changes
=========================

Current API
-----------

- ``process_spider_input(response, spider)``
- ``process_spider_output(response, result, spider)``
- ``process_spider_exception(response, exception, spider=spider)``

Changes proposed
----------------

1. rename method: ``process_spider_exception`` to
   ``process_spider_output_exception``
2. add method: ``process_spider_input_exception``

New API
-------

- SpiderInput deferred

  - ``process_spider_input(response, spider)``
  - ``process_spider_input_exception(response, exception, spider=spider)``

- SpiderOutput deferred

  - ``process_spider_output(response, result, spider)``
  - ``process_spider_output_exception(response, exception, spider=spider)``
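A spider middleware written against the proposed API might look like the
following sketch. The hook names follow this SEP's proposal; they are
illustrative, not an existing Scrapy interface:

```python
# Sketch of a spider middleware under the *proposed* API. The hook names
# follow this SEP; they are illustrative, not an existing Scrapy interface.

class ExampleSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # SpiderInput callback: runs on the way into the spider.
        return response

    def process_spider_input_exception(self, response, exception, spider):
        # SpiderInput errback (new in this proposal): catches exceptions
        # raised before the response reached the spider, e.g. by another
        # spider middleware's process_spider_input.
        return None

    def process_spider_output(self, response, result, spider):
        # SpiderOutput callback: runs on requests/items leaving the spider.
        for request_or_item in result:
            yield request_or_item

    def process_spider_output_exception(self, response, exception, spider):
        # SpiderOutput errback (renamed from process_spider_exception).
        return None
```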
process_spider_output(response, result, spider)process_spider_output_exception(response, exception, spider=spider)process_request(request, spider)process_response(request, response, spider)process_exception(request, exception, spider)process_exception to process_response_exceptionprocess_request_exceptionProcessRequest deferred
process_request(request, spider)process_request_exception(request, exception, response)ProcessResponse deferred
process_response(request, spider, response)process_response_exception(request, exception, response)enqueue_request(spider, request)
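The downloader-side analogue, again as a sketch using the signatures listed
above (illustrative only, not a released Scrapy API):

```python
# Sketch of a downloader middleware under the *proposed* API, using the
# signatures listed above. Illustrative only.

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # ProcessRequest callback: returning None passes the request on.
        return None

    def process_request_exception(self, request, exception, response):
        # ProcessRequest errback (new in this proposal): catches exceptions
        # raised before the request reached the downloader, e.g. by another
        # downloader middleware's process_request.
        return None

    def process_response(self, request, spider, response):
        # ProcessResponse callback: must return a response (or a request).
        return response

    def process_response_exception(self, request, exception, response):
        # ProcessResponse errback (renamed from process_exception).
        return None
```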
Scheduler middleware changes
============================

Current API
-----------

- ``enqueue_request(spider, request)``
- ``open_spider(spider)``
- ``close_spider(spider)``

Changes proposed
----------------

1. add methods: ``dequeue_request``, ``enqueue_request_exception``,
   ``dequeue_request_exception``
2. remove methods: ``open_spider``, ``close_spider``. They should be
   replaced by the ``spider_opened`` and ``spider_closed`` signals, but
   they weren't before because of a chicken-egg problem when opening
   spiders (because of the scheduler auto-open feature).

New API
-------

- EnqueueRequest deferred

  - ``enqueue_request(request, spider)``, which can raise ``IgnoreRequest``
  - ``enqueue_request_exception(request, exception, spider)``

  Output and errors:

  - The Request that gets returned by the last ``enqueue_request()`` is the
    one that gets scheduled.
  - If no request is returned but a Failure, the Request errback is called
    with that failure.
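The "last returned request wins" and failure-to-errback rules can be
sketched as follows. This is a simplified illustration: the per-middleware
``enqueue_request_exception`` hooks are omitted, and all names are
hypothetical:

```python
# Simplified sketch of the proposed EnqueueRequest chain semantics.
# The per-middleware enqueue_request_exception hooks are omitted here;
# names are hypothetical.

class IgnoreRequest(Exception):
    """Raised by a scheduler middleware to drop a request."""

def run_enqueue_chain(request, enqueue_hooks, errback):
    result = request
    for hook in enqueue_hooks:
        try:
            # Each hook may return a (possibly different) request object;
            # the value returned by the *last* hook is what gets scheduled.
            result = hook(result)
        except Exception as exc:
            # No request is scheduled; the request's errback gets the failure.
            errback(exc)
            return None
    return result

# Hypothetical hooks:
def canonicalize(request):
    return request.rstrip("/")

def drop_all(request):
    raise IgnoreRequest("filtered")
```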
- DequeueRequest deferred

  - ``dequeue_request(request, spider)``
  - ``dequeue_request_exception(exception, spider)``

Open issues (to resolve)
========================

- how to avoid massive ``IgnoreRequest`` exceptions from propagating, which
  slows down the crawler
- if requests change, how do we keep a reference to the original one? do we
  need to?

  - opt 1: don't allow changing the original Request object - discarded
  - opt 2: keep a reference to the original request (how it's done now)
  - opt 3: split SpiderRequest from DownloaderRequest
  - opt 5: keep a reference only to the original deferred and forget about
    the original request
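Opt 2 above (the current approach) can be sketched as a small wrapper that
keeps the original request alongside whatever the middlewares substitute.
All names here are hypothetical, for illustration only:

```python
# Sketch of "opt 2": keep a reference to the original request while
# middlewares replace the one in flight. Names are hypothetical.

class RequestSlot:
    def __init__(self, request):
        self.original = request  # as produced by the spider; never replaced
        self.current = request   # middlewares may swap this

    def replace(self, new_request):
        # Middlewares substitute the in-flight request; the original is
        # kept so e.g. its callbacks/errbacks can still be found afterwards.
        self.current = new_request
        return self.current
```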
- scheduler auto-open chicken-egg problem
- call ``Request.errback`` if both the scheduler middleware (schmw) and the
  downloader middleware (dlmw) fail?
.. |scrapy core v2| image:: scrapy_core_v2.jpg