Top-level Functions

website/docs/api/top-level.mdx

spaCy {id="spacy",hidden="true"}

spacy.load {id="spacy.load",tag="function"}

Load a pipeline using the name of an installed package, a string path or a Path-like object. spaCy will try to resolve the load argument in this order. If a pipeline is loaded from a string name, spaCy will assume it's a Python package, import it and call its own load() method. If a pipeline is loaded from a path, spaCy will assume it's a data directory, load its config.cfg and use the language and pipeline information to construct the Language class. The data will be loaded in via Language.from_disk. Loading a pipeline from a package will also import any custom code, if present, whereas loading from a directory does not; in that case, you need to import your custom code manually.

<Infobox variant="warning" title="Changed in v3.0">

As of v3.0, the disable keyword argument specifies components to load but disable, instead of components to not load at all. Those components can now be specified separately using the new exclude keyword argument.

</Infobox>

Example

```python
nlp = spacy.load("en_core_web_sm")           # package
nlp = spacy.load("/path/to/pipeline")        # string path
nlp = spacy.load(Path("/path/to/pipeline"))  # pathlib Path

nlp = spacy.load("en_core_web_sm", exclude=["parser", "tagger"])
```

| Name | Description |
| --- | --- |
| `name` | Pipeline to load, i.e. package name or path. `Union[str, Path]` |
| _keyword-only_ | |
| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. `Union[Vocab, bool]` |
| `disable` | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling `nlp.enable_pipe`. Is merged with the config entry `nlp.disabled`. `Union[str, Iterable[str]]` |
| `enable` <Tag variant="new">3.4</Tag> | Name(s) of pipeline component(s) to enable. All other pipes will be disabled. `Union[str, Iterable[str]]` |
| `exclude` <Tag variant="new">3</Tag> | Name(s) of pipeline component(s) to exclude. Excluded components won't be loaded. `Union[str, Iterable[str]]` |
| `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. `Union[Dict[str, Any], Config]` |
| **RETURNS** | A `Language` object with the loaded pipeline. `Language` |

Essentially, spacy.load() is a convenience wrapper that reads the pipeline's config.cfg, uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.

```python
cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name, config={...})   # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
```

spacy.blank {id="spacy.blank",tag="function",version="2"}

Create a blank pipeline of a given language class. This function is the twin of spacy.load().

Example

```python
nlp_en = spacy.blank("en")   # equivalent to English()
nlp_de = spacy.blank("de")   # equivalent to German()
```

| Name | Description |
| --- | --- |
| `name` | Two-letter ISO 639-1 or three-letter ISO 639-3 language code of the language class to load, such as `'en'` or `'eng'`. `str` |
| _keyword-only_ | |
| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. `Union[Vocab, bool]` |
| `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. `Union[Dict[str, Any], Config]` |
| `meta` | Optional meta overrides for `nlp.meta`. `Dict[str, Any]` |
| **RETURNS** | An empty `Language` object of the appropriate subclass. `Language` |
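A blank pipeline has no trained components, but you can still add rule-based ones and process text. A minimal sketch using the built-in sentencizer component:

```python
import spacy

# Create a blank English pipeline and add a rule-based sentence segmenter
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])
# ['This is a sentence.', 'This is another one.']
```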

spacy.info {id="spacy.info",tag="function"}

The same as the info command. Pretty-print information about your installation, installed pipelines and local setup from within spaCy.

Example

```python
spacy.info()
spacy.info("en_core_web_sm")
markdown = spacy.info(markdown=True, silent=True)
```

| Name | Description |
| --- | --- |
| `model` | Optional pipeline, i.e. a package name or path. `Optional[str]` |
| _keyword-only_ | |
| `markdown` | Print information as Markdown. `bool` |
| `silent` | Don't print anything, just return. `bool` |

spacy.explain {id="spacy.explain",tag="function"}

Get a description for a given POS tag, dependency label or entity type. For a list of available terms, see glossary.py.

Example

```python
spacy.explain("NORP")
# Nationalities or religious or political groups

doc = nlp("Hello world")
for word in doc:
    print(word.text, word.tag_, spacy.explain(word.tag_))
# Hello UH interjection
# world NN noun, singular or mass
```

| Name | Description |
| --- | --- |
| `term` | Term to explain. `str` |
| **RETURNS** | The explanation, or `None` if not found in the glossary. `Optional[str]` |

spacy.prefer_gpu {id="spacy.prefer_gpu",tag="function",version="2.0.14"}

Allocate data and perform operations on GPU, if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.

<Infobox variant="warning" title="Jupyter notebook usage">

In a Jupyter notebook, run prefer_gpu() in the same cell as spacy.load() to ensure that the model is loaded on the correct device. See more details.

</Infobox>

Example

```python
import spacy

activated = spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
```

| Name | Description |
| --- | --- |
| `gpu_id` | Device index to select. Defaults to `0`. `int` |
| **RETURNS** | Whether the GPU was activated. `bool` |

spacy.require_gpu {id="spacy.require_gpu",tag="function",version="2.0.14"}

Allocate data and perform operations on GPU. Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.

<Infobox variant="warning" title="Jupyter notebook usage">

In a Jupyter notebook, run require_gpu() in the same cell as spacy.load() to ensure that the model is loaded on the correct device. See more details.

</Infobox>

Example

```python
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
```

| Name | Description |
| --- | --- |
| `gpu_id` | Device index to select. Defaults to `0`. `int` |
| **RETURNS** | `True` `bool` |

spacy.require_cpu {id="spacy.require_cpu",tag="function",version="3.0.0"}

Allocate data and perform operations on CPU. If data has already been allocated on GPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.

<Infobox variant="warning" title="Jupyter notebook usage">

In a Jupyter notebook, run require_cpu() in the same cell as spacy.load() to ensure that the model is loaded on the correct device. See more details.

</Infobox>

Example

```python
import spacy

spacy.require_cpu()
nlp = spacy.load("en_core_web_sm")
```

| Name | Description |
| --- | --- |
| **RETURNS** | `True` `bool` |

displaCy {id="displacy",source="spacy/displacy"}

As of v2.0, spaCy comes with a built-in visualization suite. For more info and examples, see the usage guide on visualizing spaCy.

displacy.serve {id="displacy.serve",tag="method",version="2"}

Serve a dependency parse tree or named entity visualization to view it in your browser. Will run a simple web server.

Example

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc1 = nlp("This is a sentence.")
doc2 = nlp("This is another sentence.")
displacy.serve([doc1, doc2], style="dep")
```

| Name | Description |
| --- | --- |
| `docs` | Document(s) or span(s) to visualize. `Union[Iterable[Union[Doc, Span]], Doc, Span]` |
| `style` <Tag variant="new">3.3</Tag> | Visualization style, `"dep"`, `"ent"` or `"span"`. Defaults to `"dep"`. `str` |
| `page` | Render markup as full HTML page. Defaults to `True`. `bool` |
| `minify` | Minify HTML markup. Defaults to `False`. `bool` |
| `options` | Visualizer-specific options, e.g. colors. `Dict[str, Any]` |
| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. See here for formats and examples. Defaults to `False`. `bool` |
| `port` | Port to serve visualization. Defaults to `5000`. `int` |
| `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. `str` |
| `auto_select_port` <Tag variant="new">3.5</Tag> | If `True`, automatically switch to a different port if the specified port is already in use. Defaults to `False`. `bool` |

displacy.render {id="displacy.render",tag="method",version="2"}

Render a dependency parse tree or named entity visualization.

Example

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
html = displacy.render(doc, style="dep")
```

| Name | Description |
| --- | --- |
| `docs` | Document(s) or span(s) to visualize. `Union[Iterable[Union[Doc, Span, dict]], Doc, Span, dict]` |
| `style` | Visualization style, `"dep"`, `"ent"` or `"span"` <Tag variant="new">3.3</Tag>. Defaults to `"dep"`. `str` |
| `page` | Render markup as full HTML page. Defaults to `False`. `bool` |
| `minify` | Minify HTML markup. Defaults to `False`. `bool` |
| `options` | Visualizer-specific options, e.g. colors. `Dict[str, Any]` |
| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. See here for formats and examples. Defaults to `False`. `bool` |
| `jupyter` | Explicitly enable or disable "Jupyter mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). `Optional[bool]` |
| **RETURNS** | The rendered HTML markup. `str` |

displacy.parse_deps {id="displacy.parse_deps",tag="method",version="2"}

Generate dependency parse in {'words': [], 'arcs': []} format. For use with the manual=True argument in displacy.render.

Example

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
deps_parse = displacy.parse_deps(doc)
html = displacy.render(deps_parse, style="dep", manual=True)
```

| Name | Description |
| --- | --- |
| `orig_doc` | Doc or span to parse dependencies from. `Union[Doc, Span]` |
| `options` | Dependency parse specific visualisation options. `Dict[str, Any]` |
| **RETURNS** | Generated dependency parse keyed by `words` and `arcs`. `dict` |

displacy.parse_ents {id="displacy.parse_ents",tag="method",version="2"}

Generate named entities in [{start: i, end: i, label: 'label'}] format. For use with the manual=True argument in displacy.render.

Example

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("But Google is starting from behind.")
ents_parse = displacy.parse_ents(doc)
html = displacy.render(ents_parse, style="ent", manual=True)
```

| Name | Description |
| --- | --- |
| `doc` | Doc to parse entities from. `Doc` |
| `options` | NER-specific visualisation options. `Dict[str, Any]` |
| **RETURNS** | Generated entities keyed by `text` (original text) and `ents`. `dict` |

displacy.parse_spans {id="displacy.parse_spans",tag="method",version="2"}

Generate spans in [{start_token: i, end_token: i, label: 'label'}] format. For use with the manual=True argument in displacy.render.

Example

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("But Google is starting from behind.")
doc.spans["orgs"] = [doc[1:2]]
spans_parse = displacy.parse_spans(doc, options={"spans_key": "orgs"})
html = displacy.render(spans_parse, style="span", manual=True)
```

| Name | Description |
| --- | --- |
| `doc` | Doc to parse spans from. `Doc` |
| `options` | Span-specific visualisation options. `Dict[str, Any]` |
| **RETURNS** | Generated spans keyed by `text` (original text), `spans` and `tokens`. `dict` |

Visualizer data structures {id="displacy_structures"}

You can use displaCy's data format to manually render data. This can be useful if you want to visualize output from other libraries. You can find examples of displaCy's different data formats below.

DEP example data structure

```json
{
  "words": [
    { "text": "This", "tag": "DT" },
    { "text": "is", "tag": "VBZ" },
    { "text": "a", "tag": "DT" },
    { "text": "sentence", "tag": "NN" }
  ],
  "arcs": [
    { "start": 0, "end": 1, "label": "nsubj", "dir": "left" },
    { "start": 2, "end": 3, "label": "det", "dir": "left" },
    { "start": 1, "end": 3, "label": "attr", "dir": "right" }
  ]
}
```
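The DEP structure above can be passed straight to displacy.render with manual=True, with no pipeline loaded at all:

```python
from spacy import displacy

# Manually specified dependency parse in displaCy's "dep" format
dep_data = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}
# Returns SVG markup as a string when run outside a notebook
svg = displacy.render(dep_data, style="dep", manual=True)
```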

Dependency Visualizer data structure {id="structure-dep"}

| Dictionary Key | Description |
| --- | --- |
| `words` | List of dictionaries describing a word token (see structure below). `List[Dict[str, Any]]` |
| `arcs` | List of dictionaries describing the relations between words (see structure below). `List[Dict[str, Any]]` |
| _Optional_ | |
| `title` | Title of the visualization. `Optional[str]` |
| `settings` | Dependency Visualizer options (see here). `Dict[str, Any]` |

<Accordion title="Words data structure">

| Dictionary Key | Description |
| --- | --- |
| `text` | Text content of the word. `str` |
| `tag` | Fine-grained part-of-speech. `str` |
| `lemma` | Base form of the word. `Optional[str]` |

</Accordion>

<Accordion title="Arcs data structure">

| Dictionary Key | Description |
| --- | --- |
| `start` | The index of the starting token. `int` |
| `end` | The index of the ending token. `int` |
| `label` | The type of dependency relation. `str` |
| `dir` | Direction of the relation (`left`, `right`). `str` |

</Accordion>

ENT example data structure

```json
{
  "text": "But Google is starting from behind.",
  "ents": [{ "start": 4, "end": 10, "label": "ORG" }]
}
```

Named Entity Recognition data structure {id="structure-ent"}

| Dictionary Key | Description |
| --- | --- |
| `text` | String representation of the document text. `str` |
| `ents` | List of dictionaries describing entities (see structure below). `List[Dict[str, Any]]` |
| _Optional_ | |
| `title` | Title of the visualization. `Optional[str]` |
| `settings` | Entity Visualizer options (see here). `Dict[str, Any]` |

<Accordion title="Ents data structure">

| Dictionary Key | Description |
| --- | --- |
| `start` | The index of the first character of the entity. `int` |
| `end` | The index of the last character of the entity (not inclusive). `int` |
| `label` | Label attached to the entity. `str` |
| _Optional_ | |
| `kb_id` | KnowledgeBase ID. `str` |
| `kb_url` | KnowledgeBase URL. `str` |

</Accordion>

SPAN example data structure

```json
{
  "text": "Welcome to the Bank of China.",
  "spans": [
    { "start_token": 3, "end_token": 6, "label": "ORG" },
    { "start_token": 5, "end_token": 6, "label": "GPE" }
  ],
  "tokens": ["Welcome", "to", "the", "Bank", "of", "China", "."]
}
```
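As with the other formats, this structure renders directly with manual=True, as long as the token-level span indices line up with the tokens list:

```python
from spacy import displacy

# Manually specified spans in displaCy's "span" format
span_data = {
    "text": "Welcome to the Bank of China.",
    "spans": [
        {"start_token": 3, "end_token": 6, "label": "ORG"},
        {"start_token": 5, "end_token": 6, "label": "GPE"},
    ],
    "tokens": ["Welcome", "to", "the", "Bank", "of", "China", "."],
}
html = displacy.render(span_data, style="span", manual=True)
```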

Span Classification data structure {id="structure-span"}

| Dictionary Key | Description |
| --- | --- |
| `text` | String representation of the document text. `str` |
| `spans` | List of dictionaries describing spans (see structure below). `List[Dict[str, Any]]` |
| `tokens` | List of word tokens. `List[str]` |
| _Optional_ | |
| `title` | Title of the visualization. `Optional[str]` |
| `settings` | Span Visualizer options (see here). `Dict[str, Any]` |

<Accordion title="Spans data structure">

| Dictionary Key | Description |
| --- | --- |
| `start_token` | The index of the first token of the span in `tokens`. `int` |
| `end_token` | The index of the last token of the span in `tokens`. `int` |
| `label` | Label attached to the span. `str` |
| _Optional_ | |
| `kb_id` | KnowledgeBase ID. `str` |
| `kb_url` | KnowledgeBase URL. `str` |

</Accordion>

Visualizer options {id="displacy_options"}

The options argument lets you specify additional settings for each visualizer. If a setting is not present in the options, the default value will be used.

Dependency Visualizer options {id="options-dep"}

Example

```python
options = {"compact": True, "color": "blue"}
displacy.serve(doc, style="dep", options=options)
```

| Name | Description |
| --- | --- |
| `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. `bool` |
| `add_lemma` | Print the lemmas in a separate row below the token texts. Defaults to `False`. `bool` |
| `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it avoids long arcs that only attach punctuation. Defaults to `True`. `bool` |
| `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. `bool` |
| `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. `bool` |
| `color` | Text color. Can be provided in any CSS-legal format as a string, e.g. `"#00ff00"`, `"rgb(0, 255, 0)"`, `"hsl(120, 100%, 50%)"` and `"green"` all correspond to the color green (without transparency). Defaults to `"#000000"`. `str` |
| `bg` | Background color, in any CSS-legal format as above. Defaults to `"#ffffff"`. `str` |
| `font` | Font name or font family for all text. Defaults to `"Arial"`. `str` |
| `offset_x` | Spacing on left side of the SVG in px. Defaults to `50`. `int` |
| `arrow_stroke` | Width of arrow path in px. Defaults to `2`. `int` |
| `arrow_width` | Width of arrow head in px. Defaults to `10` in regular mode and `8` in compact mode. `int` |
| `arrow_spacing` | Spacing between arrows in px to avoid overlaps. Defaults to `20` in regular mode and `12` in compact mode. `int` |
| `word_spacing` | Vertical spacing between words and arcs in px. Defaults to `45`. `int` |
| `distance` | Distance between words in px. Defaults to `175` in regular mode and `150` in compact mode. `int` |

Named Entity Visualizer options {id="displacy_options-ent"}

Example

```python
options = {"ents": ["PERSON", "ORG", "PRODUCT"],
           "colors": {"ORG": "yellow"}}
displacy.serve(doc, style="ent", options=options)
```

| Name | Description |
| --- | --- |
| `ents` | Entity types to highlight or `None` for all types (default). `Optional[List[str]]` |
| `colors` | Color overrides. Entity types should be mapped to color names or values. `Dict[str, str]` |
| `template` | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See templates.py for examples. `Optional[str]` |
| `kb_url_template` <Tag variant="new">3.2.1</Tag> | Optional template to construct the KB url for the entity to link to. Expects a Python f-string format with a single field to fill in. `Optional[str]` |

Span Visualizer options {id="displacy_options-span"}

Example

```python
options = {"spans_key": "sc"}
displacy.serve(doc, style="span", options=options)
```

| Name | Description |
| --- | --- |
| `spans_key` | Which spans key to render spans from. Default is `"sc"`. `str` |
| `templates` | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. `Optional[Dict[str, str]]` |
| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a Python f-string format with a single field to fill in. `Optional[str]` |
| `colors` | Color overrides. Entity types should be mapped to color names or values. `Dict[str, str]` |

By default, displaCy comes with colors for all entity types used by spaCy's trained pipelines, for both the entity and span visualizers. If you're using custom entity types, you can use the colors setting to add your own colors for them. Your application or pipeline package can also expose a spacy_displacy_colors entry point to add custom labels and their colors automatically.

By default, displaCy links to # for entities without a kb_id set on their span. If you wish to link an entity to its URL, consider using the kb_url_template option described above. For example, if the kb_id on a span is Q95, a Wikidata identifier, this option can be set to https://www.wikidata.org/wiki/{}. Clicking the entity in the rendered HTML will then take you to its Wikidata page, in this case https://www.wikidata.org/wiki/Q95.
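A minimal sketch of the manual-mode equivalent: when rendering pre-parsed data, you can supply kb_id and kb_url directly on each entity dict instead of a template (Q95 below is the Wikidata entry for Google):

```python
from spacy import displacy

ent_data = {
    "text": "But Google is starting from behind.",
    "ents": [
        {
            "start": 4,
            "end": 10,
            "label": "ORG",
            "kb_id": "Q95",
            "kb_url": "https://www.wikidata.org/wiki/Q95",
        }
    ],
}
html = displacy.render(ent_data, style="ent", manual=True)
```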

registry {id="registry",source="spacy/util.py",version="3"}

spaCy's function registry extends Thinc's registry and allows you to map strings to functions. You can register functions to create architectures, optimizers, schedules and more, and then refer to them and set their arguments in your config file. Python type hints are used to validate the inputs. See the Thinc docs for details on the registry methods and our helper library catalogue for some background on the concept of function registries. spaCy also uses the function registry for language subclasses, model architectures, lookups and pipeline component factories.

Example

```python
from typing import Iterator

import spacy

@spacy.registry.schedules("waltzing.v1")
def waltzing() -> Iterator[float]:
    i = 0
    while True:
        yield i % 3 + 1
        i += 1
```

| Registry name | Description |
| --- | --- |
| `architectures` | Registry for functions that create model architectures. Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `augmenters` | Registry for functions that create data augmentation callbacks for corpora and other training data iterators. |
| `batchers` | Registry for training and evaluation data batchers. |
| `callbacks` | Registry for custom callbacks to modify the `nlp` object before training. |
| `displacy_colors` | Registry for custom color schemes for the displaCy NER visualizer. Automatically reads from entry points. |
| `factories` | Registry for functions that create pipeline components. Added automatically when you use the `@spacy.component` decorator and also reads from entry points. |
| `initializers` | Registry for functions that create initializers. |
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from entry points. |
| `layers` | Registry for functions that create layers. |
| `loggers` | Registry for functions that log training results. |
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
| `losses` | Registry for functions that create losses. |
| `misc` | Registry for miscellaneous functions that return data assets, knowledge bases or anything else you may need. |
| `optimizers` | Registry for functions that create optimizers. |
| `readers` | Registry for file and data readers, including training and evaluation data readers like `Corpus`. |
| `schedules` | Registry for functions that create schedules. |
| `scorers` | Registry for functions that create scoring methods for use with the `Scorer`. Scoring methods are called with `Iterable[Example]` and arbitrary `**kwargs` and return scores as `Dict[str, Any]`. |
| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a `Tokenizer` or a custom callable. |
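Registered functions can be resolved again by their string name, which is what the config system does under the hood. A small sketch using the misc registry (the name "answer_asset.v1" is made up for illustration):

```python
import spacy

@spacy.registry.misc("answer_asset.v1")  # hypothetical name
def create_asset():
    return {"answer": 42}

# Resolve the function by its string name, as the config system would
factory = spacy.registry.misc.get("answer_asset.v1")
print(factory())  # {'answer': 42}
```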

spacy-transformers registry {id="registry-transformers"}

The following registries are added by the spacy-transformers package. See the Transformer API reference and usage docs for details.

Example

```python
import spacy_transformers

@spacy_transformers.registry.annotation_setters("my_annotation_setter.v1")
def configure_custom_annotation_setter():
    def annotation_setter(docs, trf_data) -> None:
        # Set annotations on the docs
        ...

    return annotation_setter
```

| Registry name | Description |
| --- | --- |
| `span_getters` | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. |
| `annotation_setters` | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a `FullTransformerBatch` and can set additional annotations on the `Doc`. |

Loggers {id="loggers",source="spacy/training/loggers.py",version="3"}

A logger records the training results. When a logger is created, two functions are returned: one for logging the information for each training step, and a second function that is called to finalize the logging when the training is finished. To log each training step, a dictionary is passed on from spacy train, including information such as the training loss and the accuracy scores on the development set.

The built-in, default logger is the ConsoleLogger, which prints results to the console in tabular format and saves them to a jsonl file. The spacy-loggers package, included as a dependency of spaCy, enables other loggers, such as one that sends results to a Weights & Biases dashboard.

Instead of using one of the built-in loggers, you can implement your own.
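A minimal sketch of such a custom logger (the registry name my_console_logger.v1 and the fields read from the info dict are illustrative, not part of spaCy's API):

```python
import sys
from typing import IO, Any, Dict, Optional

import spacy

@spacy.registry.loggers("my_console_logger.v1")  # hypothetical name
def configure_my_logger():
    def setup_logger(nlp, stdout: IO = sys.stdout, stderr: IO = sys.stderr):
        # Called once per training step; info may be None for steps
        # that don't log anything
        def log_step(info: Optional[Dict[str, Any]]) -> None:
            if info is not None:
                stdout.write(f"step {info['step']}: losses {info['losses']}\n")

        # Called once when training is finished
        def finalize() -> None:
            stdout.write("training finished\n")

        return log_step, finalize

    return setup_logger
```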

spacy.ConsoleLogger.v2 {tag="registered function"}

Example config

```ini
[training.logger]
@loggers = "spacy.ConsoleLogger.v2"
progress_bar = true
console_output = true
output_file = "training_log.jsonl"
```

Writes the results of a training step to the console in a tabular format and saves them to a jsonl file.

<Accordion title="Example console output" spaced>
```bash
$ python -m spacy train config.cfg
ℹ Using CPU
ℹ Loading config and nlp from: config.cfg
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Start training
ℹ Training. Initial learn rate: 0.0
ℹ Saving results to training_log.jsonl

E     #        LOSS TOK2VEC   LOSS TAGGER   TAG_ACC   SCORE
---   ------   ------------   -----------   -------   ------
  0        0           0.00         86.20      0.22     0.00
  0      200           3.08      18968.78     34.00     0.34
  0      400          31.81      22539.06     33.64     0.34
  0      600          92.13      22794.91     43.80     0.44
  0      800         183.62      21541.39     56.05     0.56
  0     1000         352.49      25461.82     65.15     0.65
  0     1200         422.87      23708.82     71.84     0.72
  0     1400         601.92      24994.79     76.57     0.77
  0     1600         662.57      22268.02     80.20     0.80
  0     1800        1101.50      28413.77     82.56     0.83
  0     2000        1253.43      28736.36     85.00     0.85
  0     2200        1411.02      28237.53     87.42     0.87
  0     2400        1605.35      28439.95     88.70     0.89
```
Note that the cumulative loss keeps increasing within one epoch, but should start decreasing across epochs.

</Accordion>
| Name | Description |
| --- | --- |
| `progress_bar` | Whether the logger should print a progress bar tracking the steps till the next evaluation pass (default: `False`). `bool` |
| `console_output` | Whether the logger should print the logs in the console (default: `True`). `bool` |
| `output_file` | The file to save the training logs to (default: `None`). `Optional[Union[str, Path]]` |

spacy.ConsoleLogger.v3 {id="ConsoleLogger",tag="registered function"}

Example config

```ini
[training.logger]
@loggers = "spacy.ConsoleLogger.v3"
progress_bar = "eval"
console_output = true
output_file = "training_log.jsonl"
```

Writes the results of a training step to the console in a tabular format and optionally saves them to a jsonl file.

| Name | Description |
| --- | --- |
| `progress_bar` | Type of progress bar to show in the console: `"train"`, `"eval"` or `None`. The bar tracks the number of steps until `training.max_steps` and `training.eval_frequency` are reached respectively (default: `None`). `Optional[str]` |
| `console_output` | Whether the logger should print the logs in the console (default: `True`). `bool` |
| `output_file` | The file to save the training logs to (default: `None`). `Optional[Union[str, Path]]` |

Readers {id="readers"}

File readers {id="file-readers",source="github.com/explosion/srsly",version="3"}

The following file readers are provided by our serialization library srsly. All registered functions take one argument path, pointing to the file path to load.

Example config

```ini
[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/en_orth_variants.json"
```

| Name | Description |
| --- | --- |
| `srsly.read_json.v1` | Read data from a JSON file. |
| `srsly.read_jsonl.v1` | Read data from a JSONL (newline-delimited JSON) file. |
| `srsly.read_yaml.v1` | Read data from a YAML file. |
| `srsly.read_msgpack.v1` | Read data from a binary MessagePack file. |
<Infobox title="Important note" variant="warning">

Since the file readers expect a local path, you should only use them in config blocks that are not executed at runtime – for example, in [training] and [corpora] (to load data or resources like data augmentation tables) or in [initialize] (to pass data to pipeline components).

</Infobox>

spacy.read_labels.v1 {id="read_labels",tag="registered function"}

Read a JSON-formatted labels file generated with init labels. Typically used in the [initialize] block of the training config to speed up the model initialization process and provide pre-generated label sets.

Example config

```ini
[initialize.components]

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
```

| Name | Description |
| --- | --- |
| `path` | The path to the labels file generated with `init labels`. `Path` |
| `require` | Whether to require the file to exist. If set to `False` and the labels file doesn't exist, the loader will return `None` and the `initialize` method will extract the labels from the data. Defaults to `False`. `bool` |
| **CREATES** | The list of labels. `List[str]` |

Corpus readers {id="corpus-readers",source="spacy/training/corpus.py",version="3"}

Corpus readers are registered functions that load data and return a function that takes the current nlp object and yields Example objects that can be used for training and pretraining. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.

spacy.Corpus.v1 {id="corpus",tag="registered function"}

The Corpus reader manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. Also see the Corpus class.

Example config

```ini
[paths]
train = "corpus/train.spacy"

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
```

| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary `.spacy` format. `Union[str, Path]` |
| `gold_preproc` | Whether to set up the `Example` object with gold-standard sentences and tokens for the predictions. See `Corpus` for details. `bool` |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. `int` |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. `int` |
| `augmenter` | Apply simple data augmentation, where tokens are replaced with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. `Optional[Callable]` |
| **CREATES** | The corpus reader. `Corpus` |
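A sketch of the round trip: write a DocBin to a .spacy file, then read it back through the Corpus reader (the temporary file path below is just for illustration):

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
doc = nlp("This is a sentence.")

# Serialize one Doc to spaCy's binary .spacy format
path = Path(tempfile.mkdtemp()) / "train.spacy"
DocBin(docs=[doc]).to_disk(path)

# The Corpus reader yields Example objects for training
corpus = Corpus(path)
examples = list(corpus(nlp))
print(len(examples))  # 1
```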

spacy.JsonlCorpus.v1 {id="jsonlcorpus",tag="registered function"}

Create Example objects from a JSONL (newline-delimited JSON) file of texts keyed by "text". Can be used to read the raw text corpus for language model pretraining from a JSONL file. Also see the JsonlCorpus class.

Example config

```ini
[paths]
pretrain = "corpus/raw_text.jsonl"

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.pretrain}
min_length = 0
max_length = 0
limit = 0
```

| Name | Description |
| --- | --- |
| `path` | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. `Union[str, Path]` |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. `int` |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. `int` |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. `int` |
| **CREATES** | The corpus reader. `JsonlCorpus` |

Batchers {id="batchers",source="spacy/training/batchers.py",version="3"}

A data batcher implements a batching strategy that essentially turns a stream of items into a stream of batches, with each batch consisting of one item or a list of items. During training, the models update their weights after processing one batch at a time. Typical batching strategies include presenting the training data as a stream of batches with similar sizes, or with increasing batch sizes. See the Thinc documentation on schedules for a few standard examples.

Instead of using one of the built-in batchers listed here, you can also implement your own, which may or may not use a custom schedule.
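A minimal sketch of a custom batcher registered in the batchers registry (the name fixed_size_batches.v1 is made up; real batchers usually also accept size schedules and length functions):

```python
from typing import Any, Callable, Iterable, Iterator, List

import spacy

@spacy.registry.batchers("fixed_size_batches.v1")  # hypothetical name
def configure_fixed_size_batches(
    size: int,
) -> Callable[[Iterable[Any]], Iterator[List[Any]]]:
    def batch(items: Iterable[Any]) -> Iterator[List[Any]]:
        current: List[Any] = []
        for item in items:
            current.append(item)
            if len(current) == size:
                yield current
                current = []
        if current:  # emit the final, possibly smaller batch
            yield current

    return batch
```

With size = 3, ten items come out as batches of 3, 3, 3 and 1.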

spacy.batch_by_words.v1 {id="batch_by_words",tag="registered function"}

Create minibatches of roughly a given number of words. If any examples are longer than the specified batch length, they will appear in a batch by themselves, or be discarded if discard_oversize is set to True. The argument docs can be a list of strings, Doc objects or Example objects.

Example config

```ini
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 100
tolerance = 0.2
discard_oversize = false
get_length = null
```

| Name | Description |
| --- | --- |
| `seqs` | The sequences to minibatch. `Iterable[Any]` |
| `size` | The target number of words per batch. Can also be a block referencing a schedule, e.g. `compounding`. `Union[int, Sequence[int]]` |
| `tolerance` | What percentage of the size to allow batches to exceed. `float` |
| `discard_oversize` | Whether to discard sequences that by themselves exceed the tolerated size. `bool` |
| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. `Optional[Callable[[Any], int]]` |
| **CREATES** | The batcher that takes an iterable of items and returns batches. `Callable[[Iterable[Any]], Iterable[List[Any]]]` |

spacy.batch_by_sequence.v1 {id="batch_by_sequence",tag="registered function"}

Example config

ini
[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
size = 32
get_length = null

Create a batcher that creates batches of the specified size.

NameDescription
sizeThe target number of items per batch. Can also be a block referencing a schedule, e.g. compounding. Union[int, Sequence[int]]
get_lengthOptional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]]
CREATESThe batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]]

spacy.batch_by_padded.v1 {id="batch_by_padded",tag="registered function"}

Example config

ini
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
size = 100
buffer = 256
discard_oversize = false
get_length = null

Minibatch a sequence by the size of padded batches that would result, with sequences binned by length within a window. The padded size is defined as the maximum length of sequences within the batch multiplied by the number of sequences in the batch.

NameDescription
sizeThe largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. compounding. Union[int, Sequence[int]]
bufferThe number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. int
discard_oversizeWhether to discard sequences that are by themselves longer than the largest padded batch size. bool
get_lengthOptional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]]
CREATESThe batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]]
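The quantity this batcher controls is easy to state in code. The sketch below computes the padded size as defined above (longest sequence in the batch times the number of sequences); it illustrates the definition only, not spaCy's batching logic:

```python
def padded_size(batch, get_length=len):
    """Padded size = length of the longest sequence in the batch
    multiplied by the number of sequences, i.e. the footprint after
    padding every sequence to a common length."""
    return max(get_length(seq) for seq in batch) * len(batch)

batch = [[1, 2, 3, 4, 5], [1, 2], [1, 2, 3]]
print(padded_size(batch))  # 5 * 3 = 15
```

Binning sequences of similar length into the same batch keeps this number close to the actual token count, which is why the batcher sorts within a buffer window first.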

Augmenters {id="augmenters",source="spacy/training/augment.py",version="3"}

Data augmentation is the process of applying small modifications to the training data. It can be especially useful for punctuation and case replacement – for example, if your corpus only uses smart quotes and you want to include variations using regular quotes, or to make the model less sensitive to capitalization by including a mix of capitalized and lowercase examples. See the usage guide for details and examples.

spacy.orth_variants.v1 {id="orth_variants",tag="registered function"}

Example config

ini
[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
level = 0.1
lower = 0.5

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/en_orth_variants.json"

Create a data augmentation callback that uses orth-variant replacement. The callback can be added to a corpus or other data iterator during training. It's especially useful for punctuation and case replacement, to help the model generalize beyond corpora that only use smart quotes, or only use regular quotes.

NameDescription
levelThe percentage of texts that will be augmented. float
lowerThe percentage of texts that will be lowercased. float
orth_variantsA dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See en_orth_variants.json for an example. Dict[str, Dict[List[Union[str, List[str]]]]]
CREATESA function that takes the current nlp object and an Example and yields augmented Example objects. Callable[[Language, Example], Iterator[Example]]

spacy.lower_case.v1 {id="lower_case",tag="registered function"}

Example config

ini
[corpora.train.augmenter]
@augmenters = "spacy.lower_case.v1"
level = 0.3

Create a data augmentation callback that lowercases documents. The callback can be added to a corpus or other data iterator during training. It's especially useful for making the model less sensitive to capitalization.

NameDescription
levelThe percentage of texts that will be augmented. float
CREATESA function that takes the current nlp object and an Example and yields augmented Example objects. Callable[[Language, Example], Iterator[Example]]

Callbacks {id="callbacks",source="spacy/training/callbacks.py",version="3"}

The config supports callbacks at several points in the lifecycle that can be used to modify the nlp object.

spacy.copy_from_base_model.v1 {id="copy_from_base_model",tag="registered function"}

Example config

ini
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_md"
vocab = "en_core_sci_md"

Copy the tokenizer and/or vocab from the specified models. It's similar to the v2 base model option and useful in combination with sourced components when fine-tuning an existing pipeline. The vocab includes the lookups and the vectors from the specified model. Intended for use in [initialize.before_init].

NameDescription
tokenizerThe pipeline to copy the tokenizer from. Defaults to None. Optional[str]
vocabThe pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to None. Optional[str]
CREATESA function that takes the current nlp object and modifies its tokenizer and vocab. Callable[[Language], None]

spacy.models_with_nvtx_range.v1 {id="models_with_nvtx_range",tag="registered function"}

Example config

ini
[nlp]
after_pipeline_creation = {"@callbacks":"spacy.models_with_nvtx_range.v1"}

Recursively wrap the models in each pipe using NVTX range markers. These markers aid in GPU profiling by attributing specific operations to a Model's forward or backprop passes.

NameDescription
forward_colorColor identifier for forward passes. Defaults to -1. int
backprop_colorColor identifier for backpropagation passes. Defaults to -1. int
CREATESA function that takes the current nlp and wraps forward/backprop passes in NVTX ranges. Callable[[Language], Language]

spacy.models_and_pipes_with_nvtx_range.v1 {id="models_and_pipes_with_nvtx_range",tag="registered function",version="3.4"}

Example config

ini
[nlp]
after_pipeline_creation = {"@callbacks":"spacy.models_and_pipes_with_nvtx_range.v1"}

Recursively wrap both the models and methods of each pipe using NVTX range markers. By default, the following methods are wrapped: pipe, predict, set_annotations, update, rehearse, get_loss, initialize, begin_update, finish_update.

NameDescription
forward_colorColor identifier for model forward passes. Defaults to -1. int
backprop_colorColor identifier for model backpropagation passes. Defaults to -1. int
additional_pipe_functionsAdditional pipeline methods to wrap. Keys are pipeline names and values are lists of method identifiers. Defaults to None. Optional[Dict[str, List[str]]]
CREATESA function that takes the current nlp and wraps pipe models and methods in NVTX ranges. Callable[[Language], Language]

Training data and alignment {id="gold",source="spacy/training"}

training.offsets_to_biluo_tags {id="offsets_to_biluo_tags",tag="function"}

Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In, Last, Unit, Out). Returns a list of strings, describing the tags. Each tag string will be in the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don't align with the tokenization in the Doc object. The training algorithm will view these as missing values. O denotes a non-entity token. B denotes the beginning of a multi-token entity, I the inside of an entity of three or more tokens, and L the end of an entity of two or more tokens. U denotes a single-token entity.

<Infobox title="Changed in v3.0" variant="warning" id="biluo_tags_from_offsets">

This method was previously available as spacy.gold.biluo_tags_from_offsets.

</Infobox>

Example

python
from spacy.training import offsets_to_biluo_tags

doc = nlp("I like London.")
entities = [(7, 13, "LOC")]
tags = offsets_to_biluo_tags(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]
NameDescription
docThe document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. Doc
entitiesA sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. List[Tuple[int, int, Union[str, int]]]
missingThe label used for missing values, e.g. if tokenization doesn't align with the entity offsets. Defaults to "O". str
RETURNSA list of strings, describing the BILUO tags. List[str]

training.biluo_tags_to_offsets {id="biluo_tags_to_offsets",tag="function"}

Encode per-token tags following the BILUO scheme into entity offsets.

<Infobox title="Changed in v3.0" variant="warning" id="offsets_from_biluo_tags">

This method was previously available as spacy.gold.offsets_from_biluo_tags.

</Infobox>

Example

python
from spacy.training import biluo_tags_to_offsets

doc = nlp("I like London.")
tags = ["O", "O", "U-LOC", "O"]
entities = biluo_tags_to_offsets(doc, tags)
assert entities == [(7, 13, "LOC")]
NameDescription
docThe document that the BILUO tags refer to. Doc
tagsA sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". List[str]
RETURNSA sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. List[Tuple[int, int, str]]

training.biluo_tags_to_spans {id="biluo_tags_to_spans",tag="function",version="2.1"}

Encode per-token tags following the BILUO scheme into Span objects. This can be used to create entity spans from token-based tags, e.g. to overwrite the doc.ents.

<Infobox title="Changed in v3.0" variant="warning" id="spans_from_biluo_tags">

This method was previously available as spacy.gold.spans_from_biluo_tags.

</Infobox>

Example

python
from spacy.training import biluo_tags_to_spans

doc = nlp("I like London.")
tags = ["O", "O", "U-LOC", "O"]
doc.ents = biluo_tags_to_spans(doc, tags)
NameDescription
docThe document that the BILUO tags refer to. Doc
tagsA sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". List[str]
RETURNSA sequence of Span objects with added entity labels. List[Span]

training.biluo_to_iob {id="biluo_to_iob",tag="function"}

Convert a sequence of BILUO tags to IOB tags. This is useful if you want to use the BILUO tags with a model that only supports IOB tags.

Example

python
from spacy.training import biluo_to_iob

tags = ["O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
iob_tags = biluo_to_iob(tags)
assert iob_tags == ["O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
NameDescription
tagsA sequence of BILUO tags. Iterable[str]
RETURNSA list of IOB tags. List[str]
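The mapping itself is simple: "L" (last) becomes "I" and "U" (unit) becomes "B", while "O" and the labels pass through unchanged. The sketch below illustrates that rule in plain Python; it is not spaCy's implementation, which you should prefer in practice:

```python
def to_iob(tags):
    """Map BILUO tags to IOB: L -> I, U -> B, everything else kept."""
    mapping = {"L": "I", "U": "B"}
    out = []
    for tag in tags:
        if "-" in tag:
            action, label = tag.split("-", 1)
            tag = mapping.get(action, action) + "-" + label
        out.append(tag)
    return out

print(to_iob(["O", "U-LOC", "B-ORG", "I-ORG", "L-ORG"]))
# ['O', 'B-LOC', 'B-ORG', 'I-ORG', 'I-ORG']
```

Note that the reverse direction (IOB to BILUO) needs lookahead to decide whether a token is the last of its entity, which is why iob_to_biluo is not a simple per-tag mapping.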

training.iob_to_biluo {id="iob_to_biluo",tag="function"}

Convert a sequence of IOB tags to BILUO tags. This is useful if you want to use the IOB tags with a model that only supports BILUO tags.

<Infobox title="Changed in v3.0" variant="warning" id="iob_to_biluo">

This method was previously available as spacy.gold.iob_to_biluo.

</Infobox>

Example

python
from spacy.training import iob_to_biluo

tags = ["O", "O", "B-LOC", "I-LOC", "O"]
biluo_tags = iob_to_biluo(tags)
assert biluo_tags == ["O", "O", "B-LOC", "L-LOC", "O"]
NameDescription
tagsA sequence of IOB tags. Iterable[str]
RETURNSA list of BILUO tags. List[str]

Utility functions {id="util",source="spacy/util.py"}

spaCy comes with a small collection of utility functions located in spacy/util.py. Because utility functions are mostly intended for internal use within spaCy, their behavior may change with future releases. The functions documented on this page should be safe to use and we'll try to ensure backwards compatibility. However, we recommend having additional tests in place if your application depends on any of spaCy's utilities.

util.get_lang_class {id="util.get_lang_class",tag="function"}

Import and load a Language class. Allows lazy-loading language data and importing languages using the two-letter language code. To add a language code for a custom language class, you can register it using the @registry.languages decorator.

Example

python
for lang_id in ["en", "de"]:
    lang_class = util.get_lang_class(lang_id)
    lang = lang_class()
NameDescription
langTwo-letter language code, e.g. "en". str
RETURNSThe respective subclass. Language

util.lang_class_is_loaded {id="util.lang_class_is_loaded",tag="function",version="2.1"}

Check whether a Language subclass is already loaded. Language subclasses are loaded lazily to avoid expensive setup code associated with the language data.

Example

python
lang_cls = util.get_lang_class("en")
assert util.lang_class_is_loaded("en") is True
assert util.lang_class_is_loaded("de") is False
NameDescription
nameTwo-letter language code, e.g. "en". str
RETURNSWhether the class has been loaded. bool

util.load_model {id="util.load_model",tag="function",version="2"}

Load a pipeline from a package or data path. If called with a string name, spaCy will assume the pipeline is a Python package and import and call its load() method. If called with a path, spaCy will assume it's a data directory, read the language and pipeline settings from the config.cfg and create a Language object. The model data will then be loaded in via Language.from_disk.

Example

python
nlp = util.load_model("en_core_web_sm")
nlp = util.load_model("en_core_web_sm", exclude=["ner"])
nlp = util.load_model("/path/to/data")
NameDescription
namePackage name or path. str
keyword-only
vocabOptional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool]
disableName(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling nlp.enable_pipe. Union[str, Iterable[str]]
enable <Tag variant="new">3.4</Tag>Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe. Union[str, Iterable[str]]
excludeName(s) of pipeline component(s) to exclude. Excluded components won't be loaded. Union[str, Iterable[str]]
config <Tag variant="new">3</Tag>Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline". Union[Dict[str, Any], Config]
RETURNSLanguage class with the loaded pipeline. Language

util.load_model_from_init_py {id="util.load_model_from_init_py",tag="function",version="2"}

A helper function to use in the load() method of a pipeline package's __init__.py.

Example

python
from spacy.util import load_model_from_init_py

def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)
NameDescription
init_filePath to package's __init__.py, i.e. __file__. Union[str, Path]
keyword-only
vocab <Tag variant="new">3</Tag>Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool]
disableName(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling nlp.enable_pipe. Union[str, Iterable[str]]
enable <Tag variant="new">3.4</Tag>Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe. Union[str, Iterable[str]]
exclude <Tag variant="new">3</Tag>Name(s) of pipeline component(s) to exclude. Excluded components won't be loaded. Union[str, Iterable[str]]
config <Tag variant="new">3</Tag>Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline". Union[Dict[str, Any], Config]
RETURNSLanguage class with the loaded pipeline. Language

util.load_config {id="util.load_config",tag="function",version="3"}

Load a pipeline's config.cfg from a file path. The config typically includes details about the components and how they're created, as well as all training settings and hyperparameters.

Example

python
config = util.load_config("/path/to/config.cfg")
print(config.to_str())
NameDescription
pathPath to the pipeline's config.cfg. Union[str, Path]
overridesOptional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. "nlp.pipeline". Dict[str, Any]
interpolateWhether to interpolate the config and replace variables like ${paths.train} with their values. Defaults to False. bool
RETURNSThe pipeline's config. Config

util.load_meta {id="util.load_meta",tag="function",version="3"}

Get a pipeline's meta.json from a file path and validate its contents. The meta typically includes details about author, licensing, data sources and version.

Example

python
meta = util.load_meta("/path/to/meta.json")
NameDescription
pathPath to the pipeline's meta.json. Union[str, Path]
RETURNSThe pipeline's meta data. Dict[str, Any]

util.get_installed_models {id="util.get_installed_models",tag="function",version="3"}

List all pipeline packages installed in the current environment. This will include any spaCy pipeline that was packaged with spacy package. Under the hood, pipeline packages expose a Python entry point that spaCy can check, without having to load the nlp object.

Example

python
names = util.get_installed_models()
NameDescription
RETURNSThe string names of the pipelines installed in the current environment. List[str]

util.is_package {id="util.is_package",tag="function"}

Check if string maps to a package installed via pip. Mainly used to validate pipeline packages.

Example

python
util.is_package("en_core_web_sm") # True
util.is_package("xyz") # False
NameDescription
nameName of package. str
RETURNSTrue if installed package, False if not. bool

util.get_package_path {id="util.get_package_path",tag="function",version="2"}

Get path to an installed package. Mainly used to resolve the location of pipeline packages. Currently imports the package to find its path.

Example

python
util.get_package_path("en_core_web_sm")
# /usr/lib/python3.6/site-packages/en_core_web_sm
NameDescription
package_nameName of installed package. str
RETURNSPath to pipeline package directory. Path

util.is_in_jupyter {id="util.is_in_jupyter",tag="function",version="2"}

Check if user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displacy visualizer.

Example

python
html = "<h1>Hello world!</h1>"
if util.is_in_jupyter():
    from IPython.display import display, HTML
    display(HTML(html))
NameDescription
RETURNSTrue if in Jupyter, False if not. bool

util.compile_prefix_regex {id="util.compile_prefix_regex",tag="function"}

Compile a sequence of prefix rules into a regex object.

Example

python
prefixes = ("§", "%", "=", r"\+")
prefix_regex = util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search
NameDescription
entriesThe prefix rules, e.g. lang.punctuation.TOKENIZER_PREFIXES. Iterable[Union[str, Pattern]]
RETURNSThe regex object to be used for Tokenizer.prefix_search. Pattern

util.compile_suffix_regex {id="util.compile_suffix_regex",tag="function"}

Compile a sequence of suffix rules into a regex object.

Example

python
suffixes = ("'s", "'S", r"(?<=[0-9])\+")
suffix_regex = util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
NameDescription
entriesThe suffix rules, e.g. lang.punctuation.TOKENIZER_SUFFIXES. Iterable[Union[str, Pattern]]
RETURNSThe regex object to be used for Tokenizer.suffix_search. Pattern

util.compile_infix_regex {id="util.compile_infix_regex",tag="function"}

Compile a sequence of infix rules into a regex object.

Example

python
infixes = ("…", "-", "—", r"(?<=[0-9])[+\-\*^](?=[0-9-])")
infix_regex = util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
NameDescription
entriesThe infix rules, e.g. lang.punctuation.TOKENIZER_INFIXES. Iterable[Union[str, Pattern]]
RETURNSThe regex object to be used for Tokenizer.infix_finditer. Pattern

util.minibatch {id="util.minibatch",tag="function",version="2"}

Iterate over batches of items. size may be an iterator, so that batch-size can vary on each step.

Example

python
batches = minibatch(train_data)
for batch in batches:
    nlp.update(batch)
NameDescription
itemsThe items to batch up. Iterable[Any]
sizeThe batch size(s). Union[int, Sequence[int]]
YIELDSThe batches.

util.filter_spans {id="util.filter_spans",tag="function",version="2.1.4"}

Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.

Example

python
doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
NameDescription
spansThe spans to filter. Iterable[Span]
RETURNSThe filtered spans. List[Span]
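The filtering rule (longest span wins, earlier span breaks ties) can be sketched over plain (start, end) tuples. This is an illustration of the selection logic only, using hypothetical tuple spans rather than spaCy Span objects:

```python
def filter_overlaps(spans):
    """Sort spans longest-first (ties broken by start position), then
    keep a span only if none of its positions are already claimed."""
    seen = set()
    result = []
    for start, end in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if not any(i in seen for i in range(start, end)):
            result.append((start, end))
            seen.update(range(start, end))
    return sorted(result, key=lambda s: s[0])

# Mirrors the example above: the longest span doc[0:4] is kept,
# both copies of doc[0:2] are dropped as overlaps/duplicates.
print(filter_overlaps([(0, 2), (0, 2), (0, 4)]))  # [(0, 4)]
```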

util.get_words_and_spaces {id="get_words_and_spaces",tag="function",version="3"}

Given a list of words and a text, reconstruct the original tokens and return a list of words and spaces that can be used to create a Doc. This can help recover destructive tokenization that didn't preserve any whitespace information.

Example

python
orig_words = ["Hey", ",", "what", "'s", "up", "?"]
orig_text = "Hey, what's up?"
words, spaces = get_words_and_spaces(orig_words, orig_text)
# ['Hey', ',', 'what', "'s", 'up', '?']
# [False, True, False, True, False, False]
NameDescription
wordsThe list of words. Iterable[str]
textThe original text. str
RETURNSA list of words and a list of boolean values indicating whether the word at this position is followed by a space. Tuple[List[str], List[bool]]
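The recovery idea can be sketched in a few lines: walk the original text, locate each word in order, and check whether the character after it is a space. This simplified version assumes the words occur in the text in order and only handles single ASCII spaces; spaCy's implementation additionally validates mismatches between the words and the text:

```python
def words_and_spaces(words, text):
    """For each word, record whether it is immediately followed by a
    space in the original text."""
    spaces = []
    idx = 0
    for word in words:
        idx = text.index(word, idx) + len(word)
        spaces.append(idx < len(text) and text[idx] == " ")
    return list(words), spaces

words, spaces = words_and_spaces(["Hey", ",", "what", "'s", "up", "?"], "Hey, what's up?")
print(spaces)  # [False, True, False, True, False, False]
```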