=======  ====================================
SEP      9
Title    Singleton removal
Author   Pablo Hoffman
Created  2009-11-14
Status   Document in progress (being written)
=======  ====================================

This SEP proposes a refactoring of Scrapy to get rid of singletons, which
will result in a cleaner API and will allow us to implement the library API
proposed in :doc:`sep-004`.

Scrapy 0.7 has the following singletons:

- ``scrapy.core.engine.scrapyengine``
- ``scrapy.core.manager.scrapymanager``
- ``scrapy.extension.extensions``
- ``scrapy.spider.spiders``
- ``scrapy.stats.stats``
- ``scrapy.log``
- ``scrapy.xlib.pydispatcher``

The proposed architecture is to have one "root" object called ``Crawler``
(which will replace the current Execution Manager) and make all current
singletons members of that object, as explained below:

- ``crawler``: ``scrapy.crawler.Crawler`` instance (replaces the current
  ``scrapy.core.manager.ExecutionManager``), instantiated with a ``Settings``
  object

- ``crawler.settings``: ``scrapy.conf.Settings`` instance (passed in the
  ``__init__`` method)

- ``crawler.extensions``: ``scrapy.extension.ExtensionManager`` instance

- ``crawler.engine``: ``scrapy.core.engine.ExecutionEngine`` instance

  - ``crawler.engine.scheduler``

    - ``crawler.engine.scheduler.middleware`` - to access the scheduler
      middleware

  - ``crawler.engine.downloader``

    - ``crawler.engine.downloader.middleware`` - to access the downloader
      middleware

  - ``crawler.engine.scraper``

    - ``crawler.engine.scraper.spidermw`` - to access the spider middleware

- ``crawler.spiders``: ``SpiderManager`` instance (concrete class given by the
  ``SPIDER_MANAGER_CLASS`` setting)

- ``crawler.stats``: ``StatsCollector`` instance (concrete class given by the
  ``STATS_CLASS`` setting)

- ``crawler.log``: Logger class with methods replacing the current
  ``scrapy.log`` functions. Logging would be started (if enabled) on
  ``Crawler`` instantiation, so no log-starting functions are required.

  - ``crawler.log.msg``

- ``crawler.signals``: signal handling

  - ``crawler.signals.send()`` - same as ``pydispatch.dispatcher.send()``
  - ``crawler.signals.connect()`` - same as ``pydispatch.dispatcher.connect()``
  - ``crawler.signals.disconnect()`` - same as
    ``pydispatch.dispatcher.disconnect()``

All components (extensions, middlewares, etc.) will receive this ``Crawler``
object in their ``__init__`` methods, and this will be the only mechanism for
accessing any other components (as opposed to importing each singleton from its
respective module). This will also serve to stabilize the core API, something
which we haven't documented so far (partly because of this).
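
For illustration, a minimal sketch of how the proposed ``Crawler`` root object
might wire these members together. The constructor signatures, the use of
``load_object`` to resolve the setting-given classes, and the ``settings[...]``
access syntax are assumptions, not settled API:

.. code-block:: python

   from scrapy.core.engine import ExecutionEngine
   from scrapy.extension import ExtensionManager
   from scrapy.utils.misc import load_object


   class Crawler(object):
       """Root object replacing the current ExecutionManager (sketch)."""

       def __init__(self, settings):
           self.settings = settings
           # concrete classes are given by settings, as described above
           self.spiders = load_object(settings["SPIDER_MANAGER_CLASS"])()
           self.stats = load_object(settings["STATS_CLASS"])()
           # every component receives the Crawler object itself
           self.extensions = ExtensionManager(self)
           self.engine = ExecutionEngine(self)
           # log and signals members omitted in this sketch
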
So, for a typical middleware ``__init__`` method, instead of this:

.. code-block:: python

   from scrapy.core.exceptions import NotConfigured
   from scrapy.conf import settings


   class SomeMiddleware(object):
       def __init__(self):
           if not settings.getbool("SOMEMIDDLEWARE_ENABLED"):
               raise NotConfigured

We'd write this:

.. code-block:: python

   from scrapy.core.exceptions import NotConfigured


   class SomeMiddleware(object):
       def __init__(self, crawler):
           if not crawler.settings.getbool("SOMEMIDDLEWARE_ENABLED"):
               raise NotConfigured
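
The same pattern would extend to the other ``Crawler`` members. For example, an
extension that currently connects signal handlers through ``pydispatch`` could
use the proposed ``crawler.signals`` member instead. A sketch: the
``engine_started`` signal and the ``scrapy.core.signals`` location come from
current Scrapy; the ``connect()`` and ``log.msg()`` calls follow the member
list above, and the rest is assumed:

.. code-block:: python

   from scrapy.core import signals


   class SomeExtension(object):
       def __init__(self, crawler):
           self.crawler = crawler
           # replaces pydispatch.dispatcher.connect(...)
           crawler.signals.connect(self.engine_started,
                                   signal=signals.engine_started)

       def engine_started(self):
           # replaces the scrapy.log.msg() singleton function
           self.crawler.log.msg("engine started")
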
When running from the command line (the only mechanism supported so far), the
``scrapy.command.cmdline`` module will do the following (sketched below):

1. instantiate a ``Settings`` object and populate it with the values in
   ``SCRAPY_SETTINGS_MODULE``, and per-command overrides
2. instantiate a ``Crawler`` object with the ``Settings`` object (the
   ``Crawler`` instantiates all its components based on the given settings)
3. call ``Crawler.crawl()`` with the URLs or domains passed in the command
   line
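
A rough sketch of that sequence; ``populate_from_module`` is a hypothetical
helper, and only ``Settings``, ``Crawler`` and ``Crawler.crawl()`` come from
this proposal:

.. code-block:: python

   import os

   from scrapy.conf import Settings
   from scrapy.crawler import Crawler


   def execute(args):
       # 1. build the Settings object from SCRAPY_SETTINGS_MODULE
       #    (per-command overrides omitted; populate_from_module is
       #    hypothetical)
       settings = Settings()
       settings.populate_from_module(os.environ["SCRAPY_SETTINGS_MODULE"])
       # 2. the Crawler instantiates all its components from these settings
       crawler = Crawler(settings)
       # 3. crawl the URLs or domains given on the command line
       crawler.crawl(args)
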
When using Scrapy with the library API, the programmer will do the following
(sketched below):

1. instantiate a ``Settings`` object (which only has the default settings, by
   default) and override the desired settings
2. instantiate a ``Crawler`` object with the ``Settings`` object
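
For example (a sketch; the item-assignment override syntax is an assumption,
and ``crawl()`` usage follows the command-line section above):

.. code-block:: python

   from scrapy.conf import Settings
   from scrapy.crawler import Crawler

   # 1. start from the default settings and override what we need
   settings = Settings()
   settings["SOMEMIDDLEWARE_ENABLED"] = True  # assumed override syntax

   # 2. the Crawler builds engine, extensions, spiders, stats, log and signals
   crawler = Crawler(settings)
   crawler.crawl(["http://example.com"])
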
Open issues to resolve:

- Should we pass the ``Settings`` object to ``ScrapyCommand.add_options()``?
- How should spiders access settings? Two options (sketched below):

  - Option 1. Pass the ``Crawler`` object to spider ``__init__`` methods too
  - Option 2. Pass the ``Settings`` object to spider ``__init__`` methods,
    which would then be accessed through ``self.settings``, like logging,
    which is accessed through ``self.log``
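
Hedged sketches of a spider under each option; the class names and the
``SOMESPIDER_DEBUG`` setting are illustrative only:

.. code-block:: python

   from scrapy.spider import BaseSpider


   class OptionOneSpider(BaseSpider):
       # Option 1: the spider receives the whole Crawler object
       def __init__(self, crawler):
           self.crawler = crawler
           self.settings = crawler.settings


   class OptionTwoSpider(BaseSpider):
       # Option 2: the spider receives only the Settings object
       def __init__(self, settings):
           self.settings = settings

       def parse(self, response):
           # settings accessed through self.settings, like logging
           # through self.log
           if self.settings.getbool("SOMESPIDER_DEBUG"):  # hypothetical
               self.log("debug mode enabled")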