Saving and Loading - spaCy

Basics {id="basics",hidden="true"}

Serializing the pipeline {id="pipeline"}

When serializing the pipeline, keep in mind that this will only save out the binary data for the individual components to allow spaCy to restore them – not the entire objects. This is a good thing, because it makes serialization safe. But it also means that you have to take care of storing the config, which contains the pipeline configuration and all the relevant settings.

Saving the meta and config

The nlp.meta attribute is a JSON-serializable dictionary and contains all pipeline meta information like the author and license information. The nlp.config attribute is a dictionary containing the training configuration, pipeline component factories and other settings. It is saved out with a pipeline as the config.cfg.

python

config = nlp.config
bytes_data = nlp.to_bytes()

python

lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp = lang_cls.from_config(config)
nlp.from_bytes(bytes_data)

This is also how spaCy does it under the hood when loading a pipeline: it loads the config.cfg containing the language and pipeline information, initializes the language class, creates and adds the pipeline components based on the config and then loads in the binary data. You can read more about this process here.
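The two snippets above can be combined into one round trip. This is a minimal sketch, assuming a blank English pipeline with a `sentencizer` added so that no trained package has to be downloaded; the variable name `restored` is only illustrative.

```python
import spacy

# Build a small pipeline without downloading a trained package.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Save the two pieces separately: the config and the binary data.
config = nlp.config
bytes_data = nlp.to_bytes()

# Restore: recreate the pipeline from the config, then load the binary data.
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
restored = lang_cls.from_config(config)
restored.from_bytes(bytes_data)

print(restored.pipe_names)
```

Without the config, `from_bytes` would have no components to load the data into, which is why both pieces need to be kept together.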

Serializing Doc objects efficiently {id="docs",version="2.2"}

If you're working with lots of data, you'll probably need to pass analyses between machines, either to use something like Dask or Spark, or even just to save out work to disk. Often it's sufficient to use the Doc.to_array functionality for this, and just serialize the numpy arrays – but other times you want a more general way to save and restore Doc objects.
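As a rough sketch of the array-based approach, the following exports two token attributes with `Doc.to_array` and stores the resulting numpy array in an in-memory buffer. The choice of attributes and the use of `io.BytesIO` instead of a file are assumptions made for illustration.

```python
import io

import numpy
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world 42")

# One row per token, one column per requested attribute (unsigned integers).
array = doc.to_array(["LOWER", "IS_ALPHA"])

# Serialize the plain numpy array, here into an in-memory buffer.
buffer = io.BytesIO()
numpy.save(buffer, array)
buffer.seek(0)
restored = numpy.load(buffer)

# LOWER holds hash values, which the shared vocab maps back to strings.
first_lower = nlp.vocab.strings[int(restored[0, 0])]
print(restored.shape, first_lower)
```

Note that the array only contains the attributes you asked for, and the hashes are only meaningful together with a vocab that knows the strings.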

The DocBin class makes it easy to serialize and deserialize a collection of Doc objects together, and is much more efficient than calling Doc.to_bytes on each individual Doc object. You can also control what data gets saved, and you can merge several DocBin objects ("pallets") together for easy map/reduce-style processing.

python

import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))

If store_user_data is set to True, the Doc.user_data will be serialized as well, which includes the values of extension attributes (if they're serializable with msgpack).

Including the Doc.user_data and extension attributes will only serialize the values of the attributes. To restore the values and access them via the doc._. property, you need to register the global attribute on the Doc again.

python

docs = list(doc_bin.get_docs(nlp.vocab))
Doc.set_extension("my_custom_attr", default=None)
print([doc._.my_custom_attr for doc in docs])
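The map/reduce-style merging mentioned earlier can be sketched as follows. The helper `make_shard` is hypothetical and stands in for work done by separate processes; both `DocBin` objects use the default attributes, since merging expects the attributes of the two objects to match.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

def make_shard(texts):
    # Hypothetical helper: in practice each shard would come from a worker.
    doc_bin = DocBin()
    for doc in nlp.pipe(texts):
        doc_bin.add(doc)
    return doc_bin

shard_a = make_shard(["Some text", "Lots of texts..."])
shard_b = make_shard(["More text from another worker"])

# "Reduce" step: extend the first DocBin with the contents of the second.
shard_a.merge(shard_b)

docs = list(shard_a.get_docs(nlp.vocab))
print(len(shard_a), [doc.text for doc in docs])
```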


Using Pickle {id="pickle"}

Example

python

import pickle

doc = nlp("This is a text.")
data = pickle.dumps(doc)

When pickling spaCy's objects like the Doc or the EntityRecognizer, keep in mind that they all require the shared Vocab (which includes the string to hash mappings, label schemes and optional vectors). This means that their pickled representations can become very large, especially if you have word vectors loaded, because it won't only include the object itself, but also the entire shared vocab it depends on.

If you need to pickle multiple objects, try to pickle them together instead of separately. For instance, instead of pickling all pipeline components, pickle the entire pipeline once. And instead of pickling several Doc objects separately, pickle a list of Doc objects. Since they all share a reference to the same Vocab object, it will only be included once.

python

doc1 = nlp("Hello world")
doc2 = nlp("This is a test")

doc1_data = pickle.dumps(doc1)
doc2_data = pickle.dumps(doc2)
print(len(doc1_data) + len(doc2_data))  # 6636116 😞

doc_data = pickle.dumps([doc1, doc2])
print(len(doc_data))  # 3319761 😃

Pickling Token and Span objects isn't supported. They're only views of the Doc and can't exist on their own. Pickling them would always mean pulling in the parent document and its vocabulary, which has practically no advantage over pickling the parent Doc.

diff

- data = pickle.dumps(doc[10:20])
+ data = pickle.dumps(doc)

If you really only need a span – for example, a particular sentence – you can use Span.as_doc to make a copy of it and convert it to a Doc object. However, note that this will not let you recover contextual information from outside the span.

diff

+ span_doc = doc[10:20].as_doc()
data = pickle.dumps(span_doc)
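The difference between the two diffs above can be checked with a short script. This is a sketch that assumes a blank English pipeline and an arbitrary example sentence; it relies on spaCy raising `NotImplementedError` when a `Span` is pickled.

```python
import pickle

import spacy

nlp = spacy.blank("en")
doc = nlp("Pickle the whole doc. Or copy one span out of it.")
span = doc[5:12]

# Pickling the view itself is not supported.
span_pickle_failed = False
try:
    pickle.dumps(span)
except NotImplementedError:
    span_pickle_failed = True

# Copying the span into its own Doc gives an object that can be pickled.
span_doc = span.as_doc()
restored = pickle.loads(pickle.dumps(span_doc))
print(span_pickle_failed, restored.text)
```

The restored object only knows about the tokens inside the span, so anything that depends on the rest of the original document is no longer available.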


Implementing serialization methods {id="serialization-methods"}

When you call nlp.to_disk, nlp.from_disk or load a pipeline package, spaCy will iterate over the components in the pipeline, check if they expose a to_disk or from_disk method and if so, call it with the path to the pipeline directory plus the string name of the component. For example, if you're calling nlp.to_disk("/path"), the data for the named entity recognizer will be saved in /path/ner.

If you're using custom pipeline components that depend on external data – for example, model weights or terminology lists – you can take advantage of spaCy's built-in component serialization by making your custom component expose its own to_disk and from_disk or to_bytes and from_bytes methods. When an nlp object with the component in its pipeline is saved or loaded, the component will then be able to serialize and deserialize itself.
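The bytes-based variant can be sketched in a few lines. The component name `term_list`, the class `TermList` and its `terms` attribute are hypothetical and only serve as an illustration; the data is encoded as JSON here, but any format that produces a bytestring would do.

```python
import json

import spacy
from spacy.language import Language

@Language.factory("term_list")  # hypothetical component name
class TermList:
    def __init__(self, nlp, name):
        self.name = name
        self.terms = []

    def __call__(self, doc):
        return doc

    def to_bytes(self, exclude=tuple()):
        # Called by nlp.to_bytes for this component.
        return json.dumps(self.terms).encode("utf8")

    def from_bytes(self, bytes_data, exclude=tuple()):
        # Called by nlp.from_bytes with the component's own bytestring.
        self.terms = json.loads(bytes_data.decode("utf8"))
        return self

nlp = spacy.blank("en")
nlp.add_pipe("term_list").terms.extend(["snek", "hiss"])
bytes_data = nlp.to_bytes()

# The receiving pipeline needs the same component before loading the data.
fresh = spacy.blank("en")
fresh.add_pipe("term_list")
fresh.from_bytes(bytes_data)
print(fresh.get_pipe("term_list").terms)
```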

For more details on how to work with pipeline components that depend on data resources and manage data loading and initialization at training and runtime, see the usage guide on initializing and serializing component data.


The following example shows a custom component that keeps arbitrary JSON-serializable data, allows the user to add to that data and saves and loads the data to and from a JSON file.

Real-world example

To see custom serialization methods in action, check out the new EntityRuler component and its source. Patterns added to the component will be saved to a .jsonl file if the pipeline is serialized to disk, and to a bytestring if the pipeline is serialized to bytes. This allows saving out a pipeline with a rule-based entity recognizer and including all rules with the component data.

python

import json
from spacy import Language
from spacy.util import ensure_path

@Language.factory("my_component")
class CustomComponent:
    def __init__(self, nlp: Language, name: str = "my_component"):
        self.name = name
        self.data = []

    def __call__(self, doc):
        # Do something to the doc here
        return doc

    def add(self, data):
        # Add something to the component's data
        self.data.append(data)

    def to_disk(self, path, exclude=tuple()):
        # This will receive the directory path + /my_component
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        data_path = path / "data.json"
        with data_path.open("w", encoding="utf8") as f:
            f.write(json.dumps(self.data))

    def from_disk(self, path, exclude=tuple()):
        # This will receive the directory path + /my_component
        path = ensure_path(path)  # also accept plain strings
        data_path = path / "data.json"
        with data_path.open("r", encoding="utf8") as f:
            self.data = json.load(f)
        return self

After adding the component to the pipeline and adding some data to it, we can serialize the nlp object to a directory, which will call the custom component's to_disk method.

python

nlp = spacy.load("en_core_web_sm")
my_component = nlp.add_pipe("my_component")
my_component.add({"hello": "world"})
nlp.to_disk("/path/to/pipeline")

The contents of the directory would then look like this. CustomComponent.to_disk converted the data to a JSON string and saved it to a file data.json in its subdirectory:

yaml

└── /path/to/pipeline
    ├── my_component     # data serialized by "my_component"
    │   └── data.json
    ├── ner              # data for "ner" component
    ├── parser           # data for "parser" component
    ├── tagger           # data for "tagger" component
    ├── vocab            # pipeline vocabulary
    ├── meta.json        # pipeline meta.json
    ├── config.cfg       # pipeline config
    └── tokenizer        # tokenization rules

When you load the data back in, spaCy will call the custom component's from_disk method with the given file path, and the component can then load the contents of data.json, convert them to a Python object and restore the component state. The same works for other types of data, of course – for instance, you could add a wrapper for a model trained with a different library like TensorFlow or PyTorch and make spaCy load its weights automatically when you load the pipeline package.
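To see the load step in action, here is a self-contained round trip through a temporary directory. It is a sketch under a few assumptions: the component is registered under the hypothetical name `json_store`, a blank English pipeline is used instead of a trained one, and the decorated class is defined in the same process that loads the pipeline.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.language import Language
from spacy.util import ensure_path

@Language.factory("json_store")  # hypothetical component name
class JsonStore:
    def __init__(self, nlp, name):
        self.name = name
        self.data = []

    def __call__(self, doc):
        return doc

    def add(self, item):
        self.data.append(item)

    def to_disk(self, path, exclude=tuple()):
        # Receives the pipeline directory + /json_store
        path = ensure_path(path)
        path.mkdir(parents=True, exist_ok=True)
        (path / "data.json").write_text(json.dumps(self.data), encoding="utf8")

    def from_disk(self, path, exclude=tuple()):
        path = ensure_path(path)
        self.data = json.loads((path / "data.json").read_text(encoding="utf8"))
        return self

nlp = spacy.blank("en")
nlp.add_pipe("json_store").add({"hello": "world"})

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    file_exists = (Path(tmp_dir) / "json_store" / "data.json").exists()
    # spacy.load rebuilds the pipeline from config.cfg and then calls from_disk.
    reloaded = spacy.load(tmp_dir)

print(file_exists, reloaded.get_pipe("json_store").data)
```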

When you load back a pipeline with custom components, make sure that the components are available and that the @Language.component or @Language.factory decorators are executed before your pipeline is loaded back. Otherwise, spaCy won't know how to resolve the string name of a component factory like "my_component" back to a function. For more details, see the documentation on adding factories or use entry points to make your extension package expose your custom components to spaCy automatically.


Using entry points {id="entry-points",version="2.1"}

Entry points let you expose parts of a Python package you write to other Python packages. This lets one application easily customize the behavior of another, by exposing an entry point in its setup.py. For a quick and fun intro to entry points in Python, check out this excellent blog post. spaCy can load custom functions from several different entry points to add pipeline component factories, language classes and other settings. To make spaCy use your entry points, your package needs to expose them and it needs to be installed in the same environment – that's it.

Entry point	Description
`spacy_factories`	Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package.
`spacy_languages`	Group of entry points for custom `Language` subclasses, keyed by language shortcut.
`spacy_lookups`	Group of entry points for custom `Lookups`, including lemmatizer data. Used by spaCy's `spacy-lookups-data` package.
`spacy_displacy_colors`	Group of entry points of custom label colors for the displaCy visualizer. The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types.
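To check what is exposed in your current environment, you can inspect the entry point groups from the table with the standard library. This sketch uses `importlib.metadata` and a hypothetical helper `list_entry_points`; it only lists the registered names and targets and is not how spaCy itself loads them.

```python
from importlib.metadata import entry_points

GROUPS = (
    "spacy_factories",
    "spacy_languages",
    "spacy_lookups",
    "spacy_displacy_colors",
)

def list_entry_points(group):
    """Return {name: "module:object"} for one entry point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a select() method.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}

for group in GROUPS:
    print(group, list_entry_points(group))
```

An empty dictionary for a group simply means that no installed package exposes entry points for it.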

Loading probability tables into existing models

You can load a probability table from spacy-lookups-data into an existing spaCy model like en_core_web_sm.

python

# Requirements: pip install spacy-lookups-data
import spacy
from spacy.lookups import load_lookups
nlp = spacy.load("en_core_web_sm")
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))

When training a model from scratch you can also specify probability tables in the config.cfg.

ini

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_prob"]

Custom components via entry points {id="entry-points-components"}

When you load a pipeline, spaCy will generally use its config.cfg to set up the language class and construct the pipeline. The pipeline is specified as a list of strings, e.g. pipeline = ["tagger", "parser", "ner"]. For each of those strings, spaCy will call nlp.add_pipe and look up the name in all factories defined by the decorators @Language.component and @Language.factory. This means that you have to import your custom components before loading the pipeline.

Using entry points, pipeline packages and extension packages can define their own "spacy_factories", which will be loaded automatically in the background when the Language class is initialized. So if a user has your package installed, they'll be able to use your components – even if they don't import them!

To stick with the theme of this entry points blog post, consider the following custom spaCy pipeline component that prints a snake when it's called:

Package directory structure

yaml

├── snek.py   # the extension code
└── setup.py  # setup file for pip installation

python

from spacy.language import Language

snek = """
    --..,_                     _,.--.
       `'.'.                .'`__ o  `;__. {text}
          '.'.            .'.'`  '---'`  `
            '.`'--....--'`.'
              `'--....--'`
"""

@Language.component("snek")
def snek_component(doc):
    print(snek.format(text=doc.text))
    return doc

Since it's a very complex and sophisticated module, you want to split it off into its own package so you can version it and upload it to PyPI. You also want your custom package to be able to define pipeline = ["snek"] in its config.cfg. For that, you need to be able to tell spaCy where to find the component "snek". If you don't do this, spaCy will raise an error when you try to load the pipeline because there's no built-in "snek" component. To add an entry to the factories, you can now expose it in your setup.py via the entry_points dictionary:

Entry point syntax

Python entry points for a group are formatted as a list of strings, with each string following the syntax of name = module:object. In this example, the created entry point is named snek and points to the function snek_component in the module snek, i.e. snek.py.

python

from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:snek_component"]
    }
)
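To make the name = module:object syntax concrete, the following sketch splits such a string by hand and imports the target. The function `resolve_entry_point` is hypothetical and written for illustration only, and the example points at `json.dumps` from the standard library so that it runs without the snek package installed.

```python
import importlib

def resolve_entry_point(spec):
    """Split "name = module:object" and import the referenced object."""
    name, _, target = spec.partition("=")
    module_name, _, attr = target.strip().partition(":")
    module = importlib.import_module(module_name)
    return name.strip(), getattr(module, attr)

# Same shape as "snek = snek:snek_component", but with a stdlib target.
name, obj = resolve_entry_point("dumps = json:dumps")
print(name, obj([1, 2, 3]))
```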

The same package can expose multiple entry points, by the way. To make them available to spaCy, all you need to do is install the package in your environment:

bash

$ python -m pip install .

spaCy is now able to create the pipeline component "snek" – even though you never imported snek_component. When you save the nlp.config to disk, it includes an entry for your "snek" component and any pipeline you train with this config will include the component and know how to load it – if your snek package is installed.

config.cfg (excerpt)

diff

[nlp]
lang = "en"
+ pipeline = ["snek"]

[components]

+ [components.snek]
+ factory = "snek"

>>> from spacy.lang.en import English
>>> nlp = English()
>>> nlp.add_pipe("snek")  # this now works! 🐍🎉
>>> doc = nlp("I am snek")
    --..,_                     _,.--.
       `'.'.                .'`__ o  `;__. I am snek
          '.'.            .'.'`  '---'`  `
            '.`'--....--'`.'
              `'--....--'`

Instead of making your snek component a simple stateless component, you could also make it a factory that takes settings. Your users can then pass in an optional config when they add your component to the pipeline and customize its appearance – for example, the snek_style.

config.cfg (excerpt)

diff

[components.snek]
factory = "snek"
+ snek_style = "basic"

python

SNEKS = {"basic": snek, "cute": cute_snek}  # collection of sneks

@Language.factory("snek", default_config={"snek_style": "basic"})
class SnekFactory:
    def __init__(self, nlp: Language, name: str, snek_style: str):
        self.nlp = nlp
        self.snek_style = snek_style
        self.snek = SNEKS[self.snek_style]

    def __call__(self, doc):
        print(self.snek.format(text=doc.text))
        return doc

diff

entry_points={
-   "spacy_factories": ["snek = snek:snek_component"]
+   "spacy_factories": ["snek = snek:SnekFactory"]
}
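The effect of the setting can be tried without packaging anything. The sketch below registers the factory directly in the script under the hypothetical name `snek_demo` and uses two short one-line sneks as stand-ins for the ASCII art, then overrides the default style when adding the component.

```python
import spacy
from spacy.language import Language

# Stand-ins for the ASCII art; {text} is filled in with the doc text.
SNEKS = {"basic": "~~~~:>  {text}", "cute": "~~~~:3  {text}"}

@Language.factory("snek_demo", default_config={"snek_style": "basic"})
class SnekDemo:
    def __init__(self, nlp: Language, name: str, snek_style: str):
        self.snek_style = snek_style
        self.snek = SNEKS[snek_style]

    def __call__(self, doc):
        print(self.snek.format(text=doc.text))
        return doc

nlp = spacy.blank("en")
# The config passed here overrides the default_config of the factory.
snek = nlp.add_pipe("snek_demo", config={"snek_style": "cute"})
doc = nlp("I am snek")

# The setting also ends up in the config that is saved with the pipeline.
print(nlp.config["components"]["snek_demo"])
```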

The factory can also implement other pipeline component methods like to_disk and from_disk for serialization, or even update to make the component trainable. If a component exposes a from_disk method and is included in a pipeline, spaCy will call it on load. This lets you ship custom data with your pipeline package. When you save out a pipeline using nlp.to_disk and the component exposes a to_disk method, it will be called with the disk path.

python

from spacy.util import ensure_path

def to_disk(self, path, exclude=tuple()):
    path = ensure_path(path)
    if not path.exists():
        path.mkdir()
    snek_path = path / "snek.txt"
    with snek_path.open("w", encoding="utf8") as snek_file:
        snek_file.write(self.snek)

def from_disk(self, path, exclude=tuple()):
    snek_path = path / "snek.txt"
    with snek_path.open("r", encoding="utf8") as snek_file:
        self.snek = snek_file.read()
    return self

The above example will serialize the current snake in a snek.txt in the data directory. When a pipeline using the snek component is loaded, it will open the snek.txt and make it available to the component.

Custom language classes via entry points {id="entry-points-languages"}

To stay with the theme of the previous example and this blog post on entry points, let's imagine you wanted to implement your own SnekLanguage class for your custom pipeline – but you don't necessarily want to modify spaCy's code to add a language. In your package, you could then implement the following custom language subclass:

python

from spacy.language import Language

class SnekDefaults(Language.Defaults):
    stop_words = set(["sss", "hiss"])

class SnekLanguage(Language):
    lang = "snk"
    Defaults = SnekDefaults

Alongside the spacy_factories, there's also an entry point option for spacy_languages, which maps language codes to language-specific Language subclasses:

diff

from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": ["snek = snek:SnekFactory"],
+       "spacy_languages": ["snk = snek:SnekLanguage"]
    }
)

In spaCy, you can then load the custom snk language and it will be resolved to SnekLanguage via the custom entry point. This is especially relevant for pipeline packages you train, which could then specify lang = snk in their config.cfg without spaCy raising an error because the language is not available in the core library.
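For quick experiments inside a single script, a language class can also be registered with the `@spacy.registry.languages` decorator instead of an entry point. The sketch below takes that route, which is an assumption made to keep the example runnable without installing a package; the resolution of the `snk` code to the class works the same way afterwards.

```python
import spacy
from spacy.language import Language

class SnekDefaults(Language.Defaults):
    stop_words = {"sss", "hiss"}

# Registering in-process instead of via the spacy_languages entry point.
@spacy.registry.languages("snk")
class SnekLanguage(Language):
    lang = "snk"
    Defaults = SnekDefaults

# The language code is now resolved to the custom subclass.
nlp = spacy.blank("snk")
doc = nlp("sss hello hiss")
print(type(nlp).__name__, [token.is_stop for token in doc])
```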

Custom displaCy colors via entry points {id="entry-points-displacy",version="2.2"}

If you're training a named entity recognition model for a custom domain, you may end up training different labels that don't have pre-defined colors in the displaCy visualizer. The spacy_displacy_colors entry point lets you define a dictionary of entity labels mapped to their color values. It's added to the pre-defined colors and can also overwrite existing values.

Domain-specific NER labels

Good examples of pipelines with domain-specific label schemes are scispaCy and Blackstone.

python

displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"}

Given the above colors, the entry point can be defined as follows. Entry points need to have a name, so we use the key colors. However, the name doesn't matter and whatever is defined in the entry point group will be used.

diff

from setuptools import setup

setup(
    name="snek",
    entry_points={
+       "spacy_displacy_colors": ["colors = snek:displacy_colors"]
    }
)

After installing the package, the custom colors will be used when visualizing text with displaCy. Whenever the label SNEK is assigned, it will be displayed in #3dff74.

<Standalone height={100}> <div style={{lineHeight: 2.5, fontFamily: "-apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'", fontSize: 18}}>🌱🌿 <mark style={{ background: '#3dff74', padding: '0.45em 0.6em', margin: '0 0.25em', lineHeight: 1, borderRadius: '0.35em'}}>🐍 <span style={{ fontSize: '0.8em', fontWeight: 'bold', lineHeight: 1, borderRadius: '0.35em', marginLeft: '0.5rem'}}>SNEK</span></mark> ____ 🌳🌲 ____ <mark style={{ background: '#cfc5ff', padding: '0.45em 0.6em', margin: '0 0.25em', lineHeight: 1, borderRadius: '0.35em'}}>👨‍🌾 <span style={{ fontSize: '0.8em', fontWeight: 'bold', lineHeight: 1, borderRadius: '0.35em', marginLeft: '0.5rem'}}>HUMAN</span></mark> 🏘️</div> </Standalone>
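Before packaging the colors as an entry point, you can preview them by passing the same dictionary through the `colors` option of the entity visualizer. This sketch assumes a blank English pipeline and sets the entities by hand, since no trained model predicts SNEK or HUMAN here.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"}

nlp = spacy.blank("en")
doc = nlp("The snek greets the farmer")
# Assign the entities manually for the purpose of the preview.
doc.ents = [Span(doc, 1, 2, label="SNEK"), Span(doc, 4, 5, label="HUMAN")]

# Outside of Jupyter, render returns the markup as a string.
html = displacy.render(doc, style="ent", options={"colors": displacy_colors})
print("#3dff74" in html, "#cfc5ff" in html)
```

The entry point has the advantage that users of your package get the colors without passing any options themselves.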

Saving, loading and distributing trained pipelines {id="models"}

After training your pipeline, you'll usually want to save its state, and load it back later. You can do this with the Language.to_disk method:

python

nlp.to_disk("./en_example_pipeline")

The directory will be created if it doesn't exist, and the whole pipeline data, meta and configuration will be written out. To make the pipeline more convenient to deploy, we recommend wrapping it as a Python package.
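A quick way to see what ends up on disk is to save a small pipeline to a temporary location and list the directory. This is a sketch that assumes a blank English pipeline with a `sentencizer`; a trained pipeline will contain more component directories.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    output_dir = Path(tmp_dir) / "en_example_pipeline"
    # The target directory doesn't exist yet and is created by to_disk.
    nlp.to_disk(output_dir)
    written = sorted(path.name for path in output_dir.iterdir())

print(written)
```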

When you save a pipeline in spaCy v3.0+, two files will be exported: a config.cfg based on nlp.config and a meta.json based on nlp.meta.

config: Configuration used to create the current nlp object, its pipeline components and models, as well as training settings and hyperparameters. Can include references to registered functions like pipeline components or model architectures. Given a config, spaCy is able to reconstruct the whole tree of objects and the nlp object. An exported config can also be used to train a pipeline with the same settings.
meta: Meta information about the pipeline and the Python package, such as the author information, license, version, data sources and label scheme. This is mostly used for documentation purposes and for packaging pipelines. It has no impact on the functionality of the nlp object.

<Project id="pipelines/tagger_parser_ud">

The easiest way to get started with an end-to-end workflow is to clone a project template and run it – for example, this template that lets you train a part-of-speech tagger and dependency parser on a Universal Dependencies treebank and generates an installable Python package.

</Project>

Generating a pipeline package {id="models-generating"}

Pipeline packages are typically not suitable for the public pypi.python.org directory, which is not designed for binary data and files over 50 MB. However, if your company is running an internal installation of PyPI, publishing your pipeline packages on there can be a convenient way to share them with your team.


spaCy comes with a handy CLI command that will create all required files, and walk you through generating the meta data. You can also create the meta.json manually and place it in the data directory, or supply a path to it using the --meta flag. For more info on this, see the package docs.

meta.json (example)

json

{
  "name": "example_pipeline",
  "lang": "en",
  "version": "1.0.0",
  "spacy_version": ">=3.0.0,<4.0.0",
  "description": "Example pipeline for spaCy",
  "author": "You",
  "email": "you@example.com",
  "license": "CC BY-SA 3.0"
}

bash

$ python -m spacy package ./en_example_pipeline ./packages

This command will create a pipeline package directory and will run python -m build in that directory to create a binary .whl file or .tar.gz archive of your package that can be installed using pip install. Installing the binary wheel is usually more efficient.

yaml

└── /
    ├── MANIFEST.in                           # to include meta.json
    ├── meta.json                             # pipeline meta data
    ├── setup.py                              # setup file for pip installation
    ├── en_example_pipeline                   # pipeline directory
    │    ├── __init__.py                      # init for pip installation
    │    └── en_example_pipeline-1.0.0        # pipeline data
    │        ├── config.cfg                   # pipeline config
    │        ├── meta.json                    # pipeline meta
    │        └── ...                          # directories with component data
    └── dist
        └── en_example_pipeline-1.0.0.tar.gz  # installable package

You can also find templates for all files in the cli/package.py source. If you're creating the package manually, keep in mind that the directories need to be named according to the naming conventions of lang_name and lang_name-version.

Including custom functions and components {id="models-custom"}

If your pipeline includes custom components, model architectures or other code, those functions need to be registered before your pipeline is loaded. Otherwise, spaCy won't know how to create the objects referenced in the config. If you're loading your own pipeline in Python, you can make custom components available just by importing the code that defines them before calling spacy.load. This is also how the --code argument to CLI commands works.

With the spacy package command, you can provide one or more paths to Python files containing custom registered functions using the --code argument.

__init__.py (excerpt)

python

from . import functions

def load(**overrides):
    ...

bash

$ python -m spacy package ./en_example_pipeline ./packages --code functions.py

The Python files will be copied over into the root of the package, and the package's __init__.py will import them as modules. This ensures that functions are registered when the pipeline is imported, e.g. when you call spacy.load. A simple import is all that's needed to make registered functions available.

Make sure to include all Python files that are referenced in your custom code, including modules imported by others. If your custom code depends on external packages, make sure they're listed in the list of "requirements" in your meta.json. For the majority of use cases, registered functions should provide you with all customizations you need, from custom components to custom model architectures and lifecycle hooks. However, if you do want to customize the setup in more detail, you can edit the package's __init__.py and the package's load function that's called by spacy.load.

While it's no problem to edit the package code or meta information, avoid making edits to the config.cfg after training, as this can easily lead to data incompatibility. For instance, changing an architecture or hyperparameter can mean that the trained weights are now incompatible. If you want to make adjustments, you can do so before training. Otherwise, you should always trust spaCy to export the current state of its nlp objects via nlp.config.


Loading a custom pipeline package {id="loading"}

To load a pipeline from a data directory, you can use spacy.load() with the local path. This will look for a config.cfg in the directory and use the lang and pipeline settings to initialize a Language class with a processing pipeline and load in the model data.

python

nlp = spacy.load("/path/to/pipeline")

If you want to load only the binary data, you'll have to create a Language class and call from_disk instead.

python

nlp = spacy.blank("en").from_disk("/path/to/data")
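The difference between the two loading styles can be sketched with a small saved pipeline. The example assumes a blank English pipeline with a `sentencizer` written to a temporary directory; the variable names are only illustrative.

```python
import tempfile

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)

    # spacy.load reads config.cfg and rebuilds the pipeline before loading data.
    from_config = spacy.load(tmp_dir)

    # from_disk only loads data into what the existing object already has,
    # so components missing from the blank pipeline are not added.
    data_only = spacy.blank("en").from_disk(tmp_dir)

print(from_config.pipe_names, data_only.pipe_names)
```

If you go the from_disk route, you need to add the components yourself before loading, so that there is something for the binary data to be loaded into.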