
# Language


Usually you'll load this once per process as `nlp` and pass the instance around your application. The `Language` class is created when you call `spacy.load` and contains the shared vocabulary and language data, optional binary weights, e.g. provided by a trained pipeline, and the processing pipeline containing components like the tagger or parser that are called on a document in order. You can also add your own processing pipeline components that take a `Doc` object, modify it and return it.
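
A minimal usage sketch (assuming the trained pipeline package `en_core_web_sm` is installed; any pipeline package works the same way):

```python
import spacy

# Load the pipeline once per process and reuse the nlp object
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])
```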

## Language.__init__ {id="init",tag="method"}

Initialize a `Language` object. Note that the `meta` is only used for meta information in `Language.meta` and not to configure the `nlp` object or to override the config. To initialize from a config, use `Language.from_config` instead.

Example

```python
# Construction from subclass
from spacy.lang.en import English
nlp = English()

# Construction from scratch
from spacy.vocab import Vocab
from spacy.language import Language
nlp = Language(Vocab())
```

| Name | Description |
| ---- | ----------- |
| `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ |
| _keyword-only_ | |
| `max_length` | Maximum number of characters allowed in a single text. Defaults to `10 ** 6`. ~~int~~ |
| `meta` | Meta data overrides. ~~Dict[str, Any]~~ |
| `create_tokenizer` | Optional function that receives the `nlp` object and returns a tokenizer. ~~Callable[[Language], Callable[[str], Doc]]~~ |
| `batch_size` | Default batch size for `pipe` and `evaluate`. Defaults to `1000`. ~~int~~ |

## Language.from_config {id="from_config",tag="classmethod",version="3"}

Create a `Language` object from a loaded config. Will set up the tokenizer and language data, and add pipeline components based on the pipeline and component definitions specified in the config. If no config is provided, the default config of the given language is used. This is also how spaCy loads a model under the hood based on its `config.cfg`.

Example

```python
from thinc.api import Config
from spacy.language import Language

config = Config().from_disk("./config.cfg")
nlp = Language.from_config(config)
```

| Name | Description |
| ---- | ----------- |
| `config` | The loaded config. ~~Union[Dict[str, Any], Config]~~ |
| _keyword-only_ | |
| `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ |
| `disable` | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling `nlp.enable_pipe`. Is merged with the config entry `nlp.disabled`. ~~Union[str, Iterable[str]]~~ |
| `enable` <Tag variant="new">3.4</Tag> | Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using `nlp.enable_pipe`. ~~Union[str, Iterable[str]]~~ |
| `exclude` | Name(s) of pipeline component(s) to exclude. Excluded components won't be loaded. ~~Union[str, Iterable[str]]~~ |
| `meta` | Meta data overrides. ~~Dict[str, Any]~~ |
| `auto_fill` | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to `True`. ~~bool~~ |
| `validate` | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The initialized object. ~~Language~~ |

## Language.component {id="component",tag="classmethod",version="3"}

Register a custom pipeline component under a given name. This allows initializing the component by name using `Language.add_pipe` and referring to it in config files. This classmethod and decorator is intended for simple stateless functions that take a `Doc` and return it. For more complex stateful components that allow settings and need access to the shared `nlp` object, use the `Language.factory` decorator. For more details and examples, see the usage documentation.

Example

```python
from spacy.language import Language

# Usage as a decorator
@Language.component("my_component")
def my_component(doc):
    # Do something to the doc
    return doc

# Usage as a function
Language.component("my_component2", func=my_component)
```

| Name | Description |
| ---- | ----------- |
| `name` | The name of the component factory. ~~str~~ |
| _keyword-only_ | |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for pipe analysis. ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for pipe analysis. ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for pipe analysis. ~~bool~~ |
| `func` | Optional function if not used as a decorator. ~~Optional[Callable[[Doc], Doc]]~~ |

## Language.factory {id="factory",tag="classmethod"}

Register a custom pipeline component factory under a given name. This allows initializing the component by name using `Language.add_pipe` and referring to it in config files. The registered factory function needs to take at least two named arguments which spaCy fills in automatically: `nlp` for the current `nlp` object and `name` for the component instance name. This can be useful to distinguish multiple instances of the same component and allows trainable components to add custom losses using the component instance name. The `default_config` defines the default values of the remaining factory arguments. It's merged into the `nlp.config`. For more details and examples, see the usage documentation.

Example

```python
from spacy.language import Language

# Usage as a decorator
@Language.factory(
    "my_component",
    default_config={"some_setting": True},
)
def create_my_component(nlp, name, some_setting):
    return MyComponent(some_setting)

# Usage as function
Language.factory(
    "my_component",
    default_config={"some_setting": True},
    func=create_my_component
)
```

| Name | Description |
| ---- | ----------- |
| `name` | The name of the component factory. ~~str~~ |
| _keyword-only_ | |
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for pipe analysis. ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for pipe analysis. ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for pipe analysis. ~~bool~~ |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `func` | Optional function if not used as a decorator. ~~Optional[Callable[[...], Callable[[Doc], Doc]]]~~ |
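
To illustrate `default_score_weights`, here's a hedged sketch of a factory for a hypothetical trainable component that reports a single custom score; the component name, setting and score key are illustrative, not part of spaCy:

```python
from spacy.language import Language

@Language.factory(
    "my_classifier",
    default_config={"threshold": 0.5},
    # Report "my_score" during training; weight 1.0 means it fully
    # determines this component's contribution to the final score
    default_score_weights={"my_score": 1.0},
)
def create_my_classifier(nlp, name, threshold):
    return MyClassifier(threshold)  # MyClassifier defined elsewhere
```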

## Language.__call__ {id="call",tag="method"}

Apply the pipeline to some text. The text can span multiple sentences, and can contain arbitrary whitespace. Alignment into the original string is preserved.

Instead of text, a `Doc` can be passed as input, in which case tokenization is skipped, but the rest of the pipeline is run.

Example

```python
doc = nlp("An example sentence. Another sentence.")
assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
```

| Name | Description |
| ---- | ----------- |
| `text` | The text to be processed, or a `Doc`. ~~Union[str, Doc]~~ |
| _keyword-only_ | |
| `disable` | Names of pipeline components to disable. ~~List[str]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| **RETURNS** | A container for accessing the annotations. ~~Doc~~ |
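
A short sketch of the `Doc` input path described above, constructing a `Doc` manually so tokenization is skipped (assumes an `nlp` object is already loaded):

```python
from spacy.tokens import Doc

# Pre-tokenized input: build the Doc yourself
doc = Doc(nlp.vocab, words=["An", "example", "sentence", "."])
doc = nlp(doc)  # the tokenizer is skipped, remaining components still run
```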

## Language.pipe {id="pipe",tag="method"}

Process texts as a stream, and yield `Doc` objects in order. This is usually more efficient than processing texts one-by-one.

Instead of text, a `Doc` object can be passed as input. In this case tokenization is skipped but the rest of the pipeline is run.

Example

```python
texts = ["One document.", "...", "Lots of documents"]
for doc in nlp.pipe(texts, batch_size=50):
    assert doc.has_annotation("DEP")
```

| Name | Description |
| ---- | ----------- |
| `texts` | A sequence of strings (or `Doc` objects). ~~Iterable[Union[str, Doc]]~~ |
| _keyword-only_ | |
| `as_tuples` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ |
| `batch_size` | The number of texts to buffer. ~~Optional[int]~~ |
| `disable` | Names of pipeline components to disable. ~~List[str]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| `n_process` | Number of processors to use. Defaults to `1`. ~~int~~ |
| **YIELDS** | Documents in the order of the original text. ~~Doc~~ |
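
A minimal sketch of the `as_tuples` mode, pairing each text with an arbitrary context object that is passed through unchanged (the ID dicts here are illustrative):

```python
data = [("One document.", {"id": 1}), ("Another one.", {"id": 2})]
for doc, context in nlp.pipe(data, as_tuples=True):
    print(context["id"], doc[0].text)
```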

## Language.set_error_handler {id="set_error_handler",tag="method",version="3"}

Define a callback that will be invoked when an error is thrown during processing of one or more documents. Specifically, this function will call `set_error_handler` on all the pipeline components that define that function. The error handler will be invoked with the original component's name, the component itself, the list of documents that was being processed, and the original error.

Example

```python
def warn_error(proc_name, proc, docs, e):
    print(f"An error occurred when applying component {proc_name}.")

nlp.set_error_handler(warn_error)
```

| Name | Description |
| ---- | ----------- |
| `error_handler` | A function that performs custom error handling. ~~Callable[[str, Callable[[Doc], Doc], List[Doc], Exception], NoReturn]~~ |

## Language.initialize {id="initialize",tag="method",version="3"}

Initialize the pipeline for training and return an `Optimizer`. Under the hood, it uses the settings defined in the `[initialize]` config block to set up the vocabulary, load in vectors and tok2vec weights and pass optional arguments to the `initialize` methods implemented by pipeline components or the tokenizer. This method is typically called automatically when you run `spacy train`. See the usage guide on the config lifecycle and initialization for details.

`get_examples` should be a function that returns an iterable of `Example` objects. The data examples can either be the full training data or a representative sample. They are used to initialize the models of trainable pipeline components and are passed to each component's `initialize` method, if available. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data.

If no `get_examples` function is provided when calling `nlp.initialize`, the pipeline components will be initialized with generic data. In this case, it is crucial that the output dimension of each component has already been defined either in the config, or by calling `pipe.add_label` for each possible output label (e.g. for the tagger or textcat).
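
For example, a hedged sketch of initializing without `get_examples` by declaring the labels up front (the component and label names are illustrative):

```python
# Without example data, output dimensions must be known in advance
textcat = nlp.add_pipe("textcat")
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.initialize()
```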

<Infobox variant="warning" title="Changed in v3.0" id="begin_training">

This method was previously called `begin_training`. It now also takes a function that is called with no arguments and returns a sequence of `Example` objects instead of tuples of `Doc` and `GoldParse` objects.

</Infobox>

Example

```python
get_examples = lambda: examples
optimizer = nlp.initialize(get_examples)
```

| Name | Description |
| ---- | ----------- |
| `get_examples` | Optional function that returns gold-standard annotations in the form of `Example` objects. ~~Optional[Callable[[], Iterable[Example]]]~~ |
| _keyword-only_ | |
| `sgd` | An optimizer. Will be created via `create_optimizer` if not set. ~~Optional[Optimizer]~~ |
| **RETURNS** | The optimizer. ~~Optimizer~~ |

## Language.resume_training {id="resume_training",tag="method,experimental",version="3"}

Continue training a trained pipeline. Create and return an optimizer, and initialize "rehearsal" for any pipeline component that has a `rehearse` method. Rehearsal is used to prevent models from "forgetting" their initialized "knowledge". To perform rehearsal, collect samples of text you want the models to retain performance on, and call `nlp.rehearse` with a batch of `Example` objects.

Example

```python
optimizer = nlp.resume_training()
nlp.rehearse(examples, sgd=optimizer)
```

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `sgd` | An optimizer. Will be created via `create_optimizer` if not set. ~~Optional[Optimizer]~~ |
| **RETURNS** | The optimizer. ~~Optimizer~~ |

## Language.update {id="update",tag="method"}

Update the models in the pipeline.

<Infobox variant="warning" title="Changed in v3.0">

The `Language.update` method now takes a batch of `Example` objects instead of the raw texts and annotations or `Doc` and `GoldParse` objects. An `Example` streamlines how data is passed around. It stores two `Doc` objects: one for holding the gold-standard reference data, and one for holding the predictions of the pipeline.

For most use cases, you shouldn't have to write your own training scripts anymore. Instead, you can use `spacy train` with a config file and custom registered functions if needed. See the training documentation for details.

</Infobox>

Example

```python
from spacy.training import Example

for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    nlp.update([example], sgd=optimizer)
```

| Name | Description |
| ---- | ----------- |
| `examples` | A batch of `Example` objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via `create_optimizer` if not set. ~~Optional[Optimizer]~~ |
| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| **RETURNS** | The updated losses dictionary. ~~Dict[str, float]~~ |
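
A short sketch of accumulating losses across updates via the `losses` argument (assumes `train_data` and `optimizer` as in the example above):

```python
from spacy.training import Example

losses = {}
for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    # Pass the same dict each time; losses are added to it, keyed by component
    nlp.update([example], sgd=optimizer, losses=losses)
print(losses)
```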

## Language.rehearse {id="rehearse",tag="method,experimental",version="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the current model to make predictions similar to an initial model, to try to address the "catastrophic forgetting" problem. This feature is experimental.

Example

```python
optimizer = nlp.resume_training()
losses = nlp.rehearse(examples, sgd=optimizer)
```

| Name | Description |
| ---- | ----------- |
| `examples` | A batch of `Example` objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via `create_optimizer` if not set. ~~Optional[Optimizer]~~ |
| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated losses dictionary. ~~Dict[str, float]~~ |

## Language.evaluate {id="evaluate",tag="method"}

Evaluate a pipeline's components.

<Infobox variant="warning" title="Changed in v3.0">

The `Language.evaluate` method now takes a batch of `Example` objects instead of tuples of `Doc` and `GoldParse` objects.

</Infobox>

Example

```python
scores = nlp.evaluate(examples)
print(scores)
```

| Name | Description |
| ---- | ----------- |
| `examples` | A batch of `Example` objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `batch_size` | The batch size to use. ~~Optional[int]~~ |
| `scorer` | Optional `Scorer` to use. If not passed in, a new one will be created. ~~Optional[Scorer]~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| `scorer_cfg` | Optional dictionary of keyword arguments for the `Scorer`. Defaults to `None`. ~~Optional[Dict[str, Any]]~~ |
| `per_component` <Tag variant="new">3.6</Tag> | Whether to return the scores keyed by component name. Defaults to `False`. ~~bool~~ |
| **RETURNS** | A dictionary of evaluation scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
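
A minimal sketch of the `per_component` option (v3.6+), which nests the scores under each component's name instead of returning one flat dict:

```python
scores = nlp.evaluate(examples, per_component=True)
for component_name, component_scores in scores.items():
    print(component_name, component_scores)
```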

## Language.use_params {id="use_params",tag="contextmanager, method"}

Replace weights of models in the pipeline with those provided in the `params` dictionary. Can be used as a context manager, in which case, models go back to their original weights after the block.

Example

```python
with nlp.use_params(optimizer.averages):
    nlp.to_disk("/tmp/checkpoint")
```

| Name | Description |
| ---- | ----------- |
| `params` | A dictionary of parameters keyed by model ID. ~~dict~~ |

## Language.add_pipe {id="add_pipe",tag="method",version="2"}

Add a component to the processing pipeline. Expects a name that maps to a component factory registered using `@Language.component` or `@Language.factory`. Components should be callables that take a `Doc` object, modify it and return it. Only one of `before`, `after`, `first` or `last` can be set. Default behavior is `last=True`.

<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, the `Language.add_pipe` method doesn't take callables anymore and instead expects the name of a component factory registered using `@Language.component` or `@Language.factory`. It now takes care of creating the component, adds it to the pipeline and returns it.

</Infobox>

Example

```python
@Language.component("component")
def component_func(doc):
    # modify Doc and return it
    return doc

nlp.add_pipe("component", before="ner")
component = nlp.add_pipe("component", name="custom_name", last=True)

# Add component from source pipeline
source_nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("ner", source=source_nlp)
```

| Name | Description |
| ---- | ----------- |
| `factory_name` | Name of the registered component factory. ~~str~~ |
| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
| _keyword-only_ | |
| `before` | Component name or index to insert component directly before. ~~Optional[Union[str, int]]~~ |
| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ |
| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ |
| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Dict[str, Any]~~ |
| `source` <Tag variant="new">3</Tag> | Optional source pipeline to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. ~~Optional[Language]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |
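
A hedged sketch of overriding a factory default via `config` at add time; which keys exist depends on the factory, so treat the `threshold` setting shown for `textcat` as an assumption used to illustrate the pattern:

```python
# Settings passed here are merged with the factory's default_config
textcat = nlp.add_pipe("textcat", config={"threshold": 0.7})
```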

## Language.create_pipe {id="create_pipe",tag="method",version="2"}

Create a pipeline component from a factory.

<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, the `Language.add_pipe` method also takes the string name of the factory, creates the component, adds it to the pipeline and returns it. The `Language.create_pipe` method is now mostly used internally. To create a component and add it to the pipeline, you should always use `Language.add_pipe`.

</Infobox>

Example

```python
parser = nlp.create_pipe("parser")
```

| Name | Description |
| ---- | ----------- |
| `factory_name` | Name of the registered component factory. ~~str~~ |
| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ |
| _keyword-only_ | |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Dict[str, Any]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |

## Language.has_factory {id="has_factory",tag="classmethod",version="3"}

Check whether a factory name is registered on the `Language` class or subclass. Will check for language-specific factories registered on the subclass, as well as general-purpose factories registered on the `Language` base class, available to all subclasses.

Example

```python
from spacy.language import Language
from spacy.lang.en import English

@English.component("component")
def component(doc):
    return doc

assert English.has_factory("component")
assert not Language.has_factory("component")
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the pipeline factory to check. ~~str~~ |
| **RETURNS** | Whether a factory of that name is registered on the class. ~~bool~~ |

## Language.has_pipe {id="has_pipe",tag="method",version="2"}

Check whether a component is present in the pipeline. Equivalent to `name in nlp.pipe_names`.

Example

```python
@Language.component("component")
def component(doc):
    return doc

nlp.add_pipe("component", name="my_component")
assert "my_component" in nlp.pipe_names
assert nlp.has_pipe("my_component")
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the pipeline component to check. ~~str~~ |
| **RETURNS** | Whether a component of that name exists in the pipeline. ~~bool~~ |

## Language.get_pipe {id="get_pipe",tag="method",version="2"}

Get a pipeline component for a given component name.

Example

```python
parser = nlp.get_pipe("parser")
custom_component = nlp.get_pipe("custom_component")
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the pipeline component to get. ~~str~~ |
| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ |

## Language.replace_pipe {id="replace_pipe",tag="method",version="2"}

Replace a component in the pipeline and return the new component.

<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, the `Language.replace_pipe` method doesn't take callables anymore and instead expects the name of a component factory registered using `@Language.component` or `@Language.factory`.

</Infobox>

Example

```python
new_parser = nlp.replace_pipe("parser", "my_custom_parser")
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the component to replace. ~~str~~ |
| `component` | The factory name of the component to insert. ~~str~~ |
| _keyword-only_ | |
| `config` <Tag variant="new">3</Tag> | Optional config parameters to use for the new component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ |
| `validate` <Tag variant="new">3</Tag> | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The new pipeline component. ~~Callable[[Doc], Doc]~~ |

## Language.rename_pipe {id="rename_pipe",tag="method",version="2"}

Rename a component in the pipeline. Useful to create custom names for pre-defined and pre-loaded components. To change the default name of a component added to the pipeline, you can also use the `name` argument on `add_pipe`.

Example

```python
nlp.rename_pipe("parser", "spacy_parser")
```

| Name | Description |
| ---- | ----------- |
| `old_name` | Name of the component to rename. ~~str~~ |
| `new_name` | New name of the component. ~~str~~ |

## Language.remove_pipe {id="remove_pipe",tag="method",version="2"}

Remove a component from the pipeline. Returns the removed component name and component function.

Example

```python
name, component = nlp.remove_pipe("parser")
assert name == "parser"
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the component to remove. ~~str~~ |
| **RETURNS** | A `(name, component)` tuple of the removed component. ~~Tuple[str, Callable[[Doc], Doc]]~~ |

## Language.disable_pipe {id="disable_pipe",tag="method",version="3"}

Temporarily disable a pipeline component so it's not run as part of the pipeline. Disabled components are listed in `nlp.disabled` and included in `nlp.components`, but not in `nlp.pipeline`, so they're not run when you process a `Doc` with the `nlp` object. If the component is already disabled, this method does nothing.

Example

```python
nlp.add_pipe("ner")
nlp.add_pipe("textcat")
assert nlp.pipe_names == ["ner", "textcat"]
nlp.disable_pipe("ner")
assert nlp.pipe_names == ["textcat"]
assert nlp.component_names == ["ner", "textcat"]
assert nlp.disabled == ["ner"]
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the component to disable. ~~str~~ |

## Language.enable_pipe {id="enable_pipe",tag="method",version="3"}

Enable a previously disabled component (e.g. via `Language.disable_pipe`) so it's run as part of the pipeline, `nlp.pipeline`. If the component is already enabled, this method does nothing.

Example

```python
nlp.disable_pipe("ner")
assert "ner" in nlp.disabled
assert "ner" not in nlp.pipe_names
nlp.enable_pipe("ner")
assert "ner" not in nlp.disabled
assert "ner" in nlp.pipe_names
```

| Name | Description |
| ---- | ----------- |
| `name` | Name of the component to enable. ~~str~~ |

## Language.select_pipes {id="select_pipes",tag="contextmanager, method",version="3"}

Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a `DisabledPipes` object is returned, that has a `.restore()` method you can use to undo your changes. You can specify either `disable` (as a list or string), or `enable`. In the latter case, all components not in the `enable` list will be disabled. Under the hood, this method calls into `disable_pipe` and `enable_pipe`.

Example

```python
with nlp.select_pipes(disable=["tagger", "parser"]):
    nlp.initialize()

with nlp.select_pipes(enable="ner"):
    nlp.initialize()

disabled = nlp.select_pipes(disable=["tagger", "parser"])
nlp.initialize()
disabled.restore()
```

<Infobox title="Changed in v3.0" variant="warning" id="disable_pipes">

As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`:

```diff
- nlp.disable_pipes(["tagger", "parser"])
+ nlp.select_pipes(disable=["tagger", "parser"])
```

</Infobox>

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `disable` | Name(s) of pipeline component(s) to disable. ~~Optional[Union[str, Iterable[str]]]~~ |
| `enable` | Name(s) of pipeline component(s) that will not be disabled. ~~Optional[Union[str, Iterable[str]]]~~ |
| **RETURNS** | The disabled pipes that can be restored by calling the object's `.restore()` method. ~~DisabledPipes~~ |

## Language.get_factory_meta {id="get_factory_meta",tag="classmethod",version="3"}

Get the factory meta information for a given pipeline component name. Expects the name of the component factory. The factory meta is an instance of the `FactoryMeta` dataclass and contains the information about the component and its defaults provided by the `@Language.component` or `@Language.factory` decorator.

Example

```python
factory_meta = Language.get_factory_meta("ner")
assert factory_meta.factory == "ner"
print(factory_meta.default_config)
```

| Name | Description |
| ---- | ----------- |
| `name` | The factory name. ~~str~~ |
| **RETURNS** | The factory meta. ~~FactoryMeta~~ |

## Language.get_pipe_meta {id="get_pipe_meta",tag="method",version="3"}

Get the factory meta information for a given pipeline component name. Expects the name of the component instance in the pipeline. The factory meta is an instance of the `FactoryMeta` dataclass and contains the information about the component and its defaults provided by the `@Language.component` or `@Language.factory` decorator.

Example

```python
nlp.add_pipe("ner", name="entity_recognizer")
factory_meta = nlp.get_pipe_meta("entity_recognizer")
assert factory_meta.factory == "ner"
print(factory_meta.default_config)
```

| Name | Description |
| ---- | ----------- |
| `name` | The pipeline component name. ~~str~~ |
| **RETURNS** | The factory meta. ~~FactoryMeta~~ |

## Language.analyze_pipes {id="analyze_pipes",tag="method",version="3"}

Analyze the current pipeline components and show a summary of the attributes they assign and require, and the scores they set. The data is based on the information provided in the `@Language.component` and `@Language.factory` decorator. If requirements aren't met, e.g. if a component specifies a required property that is not set by a previous component, a warning is shown.

<Infobox variant="warning" title="Important note">

The pipeline analysis is static and does not actually run the components. This means that it relies on the information provided by the components themselves. If a custom component declares that it assigns an attribute but it doesn't, the pipeline analysis won't catch that.

</Infobox>

Example

```python
nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes()
```

<Accordion title="Example output" spaced>

```json
{
  "summary": {
    "tagger": {
      "assigns": ["token.tag"],
      "requires": [],
      "scores": ["tag_acc", "pos_acc", "lemma_acc"],
      "retokenizes": false
    },
    "entity_linker": {
      "assigns": ["token.ent_kb_id"],
      "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
      "scores": [],
      "retokenizes": false
    }
  },
  "problems": {
    "tagger": [],
    "entity_linker": [
      "doc.ents",
      "doc.sents",
      "token.ent_iob",
      "token.ent_type"
    ]
  },
  "attrs": {
    "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
    "doc.ents": { "assigns": [], "requires": ["entity_linker"] },
    "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] },
    "doc.sents": { "assigns": [], "requires": ["entity_linker"] },
    "token.tag": { "assigns": ["tagger"], "requires": [] },
    "token.ent_type": { "assigns": [], "requires": ["entity_linker"] }
  }
}
```

```
### Pretty
============================= Pipeline Overview =============================

#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False

1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False
                                      doc.sents        nel_micro_r
                                      token.ent_iob    nel_micro_p
                                      token.ent_type


================================ Problems (4) ================================
⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type
```

</Accordion>

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `keys` | The values to display in the table. Corresponds to attributes of the `FactoryMeta`. Defaults to `["assigns", "requires", "scores", "retokenizes"]`. ~~List[str]~~ |
| `pretty` | Pretty-print the results as a table. Defaults to `False`. ~~bool~~ |
| **RETURNS** | Dictionary containing the pipe analysis, keyed by `"summary"` (component meta by pipe), `"problems"` (attribute names by pipe) and `"attrs"` (pipes that assign and require an attribute, keyed by attribute). ~~Optional[Dict[str, Any]]~~ |

## Language.replace_listeners {id="replace_listeners",tag="method",version="3"}

Find listener layers (connecting to a shared token-to-vector embedding component) of a given pipeline component model and replace them with a standalone copy of the token-to-vector layer. The listener layer allows other components to connect to a shared token-to-vector embedding component like `Tok2Vec` or `Transformer`. Replacing listeners can be useful when training a pipeline with components sourced from an existing pipeline: if multiple components (e.g. tagger, parser, NER) listen to the same token-to-vector component, but some of them are frozen and not updated, their performance may degrade significantly as the token-to-vector component is updated with new data. To prevent this, listeners can be replaced with a standalone token-to-vector layer that is owned by the component and doesn't change if the component isn't updated.

This method is typically not called directly and only executed under the hood when loading a config with sourced components that define `replace_listeners`.

Example

```python
nlp = spacy.load("en_core_web_sm")
nlp.replace_listeners("tok2vec", "tagger", ["model.tok2vec"])
```

```ini
### config.cfg (excerpt)
[training]
frozen_components = ["tagger"]

[components]

[components.tagger]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]
```

| Name | Description |
| ---- | ----------- |
| `tok2vec_name` | Name of the token-to-vector component, typically `"tok2vec"` or `"transformer"`. ~~str~~ |
| `pipe_name` | Name of pipeline component to replace listeners for. ~~str~~ |
| `listeners` | The paths to the listeners, relative to the component config, e.g. `["model.tok2vec"]`. Typically, implementations will only connect to one tok2vec component, `model.tok2vec`, but in theory, custom models can use multiple listeners. The value here can either be an empty list to not replace any listeners, or a complete list of the paths to all listener layers used by the model that should be replaced. ~~Iterable[str]~~ |

## Language.memory_zone {id="memory_zone",tag="contextmanager",version="3.8"}

Begin a block where all resources allocated during the block will be freed at the end of it. If a resource is created within the memory zone block, accessing it outside the block is invalid and the behavior of such access is undefined. Memory zones should not be nested. The memory zone is helpful for services that need to process large volumes of text with a defined memory budget.

Example

```python
from collections import Counter

# Count token frequencies without retaining the Docs beyond the block
counts = Counter()
with nlp.memory_zone():
    for doc in nlp.pipe(texts):
        for token in doc:
            counts[token.text] += 1
```

| Name | Description |
| ---- | ----------- |
| `mem` | Optional `cymem.Pool` object to own allocations (created if not provided). This argument is not required for ordinary usage. Defaults to `None`. ~~Optional[cymem.Pool]~~ |
| **RETURNS** | The memory pool that owns the allocations. This object is not required for ordinary usage. ~~Iterator[cymem.Pool]~~ |

## Language.meta {id="meta",tag="property"}

Meta data for the `Language` class, including name, version, data sources, license, author information and more. If a trained pipeline is loaded, this contains meta data of the pipeline. The `Language.meta` is also what's serialized as the `meta.json` when you save an `nlp` object to disk. See the meta data format for more details.

<Infobox variant="warning" title="Changed in v3.0">

As of v3.0, the meta only contains meta information about the pipeline and isn't used to construct the language class and pipeline components. This information is expressed in the `config.cfg`.

</Infobox>

Example

```python
print(nlp.meta)
```

| Name | Description |
| ---- | ----------- |
| **RETURNS** | The meta data. ~~Dict[str, Any]~~ |

## Language.config {id="config",tag="property",version="3"}

Export a trainable `config.cfg` for the current `nlp` object. Includes the current pipeline, all configs used to create the currently active pipeline components, as well as the default training config that can be used with `spacy train`. `Language.config` returns a Thinc `Config` object, which is a subclass of the built-in `dict`. It supports the additional methods `to_disk` (serialize the config to a file) and `to_str` (output the config as a string).

Example

```python
nlp.config.to_disk("./config.cfg")
print(nlp.config.to_str())
```

| Name | Description |
| ---- | ----------- |
| **RETURNS** | The config. ~~Config~~ |

## Language.to_disk {id="to_disk",tag="method",version="2"}

Save the current state to a directory. Under the hood, this method delegates to the `to_disk` methods of the individual pipeline components, if available. This means that if a trained pipeline is loaded, all components and their weights will be saved to disk.

Example

```python
nlp.to_disk("/path/to/pipeline")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | Names of pipeline components or serialization fields to exclude. ~~Iterable[str]~~ |

## Language.from_disk {id="from_disk",tag="method",version="2"}

Loads state from a directory, including all data that was saved with the `Language` object. Modifies the object in place and returns it.

<Infobox variant="warning" title="Important note">

Keep in mind that this method only loads the serialized state and doesn't set up the `nlp` object. This means that it requires the correct language class to be initialized and all pipeline components to be added to the pipeline. If you want to load a serialized pipeline from a directory, you should use `spacy.load`, which will set everything up for you.

</Infobox>

Example

```python
from spacy.language import Language
nlp = Language().from_disk("/path/to/pipeline")

# Using language-specific subclass
from spacy.lang.en import English
nlp = English().from_disk("/path/to/pipeline")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | Names of pipeline components or serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `Language` object. ~~Language~~ |

## Language.to_bytes {id="to_bytes",tag="method"}

Serialize the current state to a binary string.

Example

```python
nlp_bytes = nlp.to_bytes()
```

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `exclude` | Names of pipeline components or serialization fields to exclude. ~~iterable~~ |
| **RETURNS** | The serialized form of the `Language` object. ~~bytes~~ |

## Language.from_bytes {id="from_bytes",tag="method"}

Load state from a binary string. Note that this method is commonly used via the subclasses like `English` or `German` to make language-specific functionality like the lexical attribute getters available to the loaded object.

Note that if you want to serialize and reload a whole pipeline, using this alone won't work, you also need to handle the config. See "Serializing the pipeline" for details.

Example

```python
from spacy.lang.en import English
nlp_bytes = nlp.to_bytes()
nlp2 = English()
nlp2.from_bytes(nlp_bytes)
```

| Name | Description |
| ---- | ----------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | Names of pipeline components or serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `Language` object. ~~Language~~ |
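
As the note above says, bytes alone don't carry the pipeline setup. A hedged sketch of one way to round-trip a whole pipeline, pairing the config with the bytes data:

```python
import spacy

config = nlp.config          # captures language, components and settings
nlp_bytes = nlp.to_bytes()   # captures binary data and weights

# Rebuild the pipeline skeleton from the config, then restore the data
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp2 = lang_cls.from_config(config)
nlp2.from_bytes(nlp_bytes)
```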

## Attributes {id="attributes"}

| Name | Description |
| ---- | ----------- |
| `vocab` | A container for the lexical types. ~~Vocab~~ |
| `tokenizer` | The tokenizer. ~~Tokenizer~~ |
| `make_doc` | Callable that takes a string and returns a `Doc`. ~~Callable[[str], Doc]~~ |
| `pipeline` | List of `(name, component)` tuples describing the current processing pipeline, in order. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ |
| `pipe_names` | List of pipeline component names, in order. ~~List[str]~~ |
| `pipe_labels` | List of labels set by the pipeline components, if available, keyed by component name. ~~Dict[str, List[str]]~~ |
| `pipe_factories` | Dictionary of pipeline component names, mapped to their factory names. ~~Dict[str, str]~~ |
| `factories` | All available factory functions, keyed by name. ~~Dict[str, Callable[[...], Callable[[Doc], Doc]]]~~ |
| `factory_names` <Tag variant="new">3</Tag> | List of all available factory names. ~~List[str]~~ |
| `components` <Tag variant="new">3</Tag> | List of all available `(name, component)` tuples, including components that are currently disabled. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ |
| `component_names` <Tag variant="new">3</Tag> | List of all available component names, including components that are currently disabled. ~~List[str]~~ |
| `disabled` <Tag variant="new">3</Tag> | Names of components that are currently disabled and don't run as part of the pipeline. ~~List[str]~~ |
| `path` | Path to the pipeline data directory, if a pipeline is loaded from a path or package. Otherwise `None`. ~~Optional[Path]~~ |

## Class attributes {id="class-attributes"}

| Name | Description |
| ---- | ----------- |
| `Defaults` | Settings, data and factory methods for creating the `nlp` object and processing pipeline. ~~Defaults~~ |
| `lang` | Two-letter ISO 639-1 or three-letter ISO 639-3 language codes, such as `'en'` and `'eng'` for English. ~~str~~ |
| `default_config` | Base config to use for `Language.config`. Defaults to `default_config.cfg`. ~~Config~~ |

## Defaults {id="defaults"}

The following attributes can be set on the `Language.Defaults` class to customize the default language data:

Example

```python
from spacy.language import Language
from spacy.lang.tokenizer_exceptions import URL_MATCH
from thinc.api import Config

DEFAULT_CONFIG = """
[nlp.tokenizer]
@tokenizers = "MyCustomTokenizer.v1"
"""

class Defaults(Language.Defaults):
    stop_words = set()
    tokenizer_exceptions = {}
    prefixes = tuple()
    suffixes = tuple()
    infixes = tuple()
    token_match = None
    url_match = URL_MATCH
    lex_attr_getters = {}
    syntax_iterators = {}
    writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
    config = Config().from_str(DEFAULT_CONFIG)
```

| Name | Description |
| ---- | ----------- |
| `stop_words` | List of stop words, used for `Token.is_stop`. Example: `stop_words.py` ~~Set[str]~~ |
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes. Example: `de/tokenizer_exceptions.py` ~~Dict[str, List[dict]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer. Example: `punctuation.py` ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules. Example: `fr/tokenizer_exceptions.py` ~~Optional[Callable]~~ |
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match. Example: `tokenizer_exceptions.py` ~~Optional[Callable]~~ |
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`. Example: `lex_attrs.py` ~~Dict[int, Callable[[str], Any]]~~ |
| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for noun chunks. Example: `syntax_iterators.py` ~~Dict[str, Callable[[Union[Doc, Span]], Iterator[Span]]]~~ |
| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to `{"direction": "ltr", "has_case": True, "has_letters": True}`. Example: `zh/__init__.py` ~~Dict[str, Any]~~ |
| `config` | Default config added to `nlp.config`. This can include references to custom tokenizers or lemmatizers. Example: `zh/__init__.py` ~~Config~~ |

## Serialization fields {id="serialization-fields"}

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the `exclude` argument.

Example

```python
data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
nlp.from_disk("/pipeline", exclude=["ner"])
```

| Name | Description |
| ---- | ----------- |
| `vocab` | The shared `Vocab`. |
| `tokenizer` | Tokenization rules and exceptions. |
| `meta` | The meta data, available as `Language.meta`. |
| `...` | String names of pipeline components, e.g. `"ner"`. |

## FactoryMeta {id="factorymeta",version="3",tag="dataclass"}

The `FactoryMeta` contains the information about the component and its defaults provided by the `@Language.component` or `@Language.factory` decorator. It's created whenever a component is defined and stored on the `Language` class for each component instance and factory instance.

| Name | Description |
| ---- | ----------- |
| `factory` | The name of the registered component factory. ~~str~~ |
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for pipe analysis. ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for pipe analysis. ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for pipe analysis. ~~bool~~ |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `scores` | All scores set by the component if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for pipe analysis. ~~Iterable[str]~~ |