What's New in v3.5 - spaCy

New features {id="features",hidden="true"}

spaCy v3.5 introduces three new CLI commands, apply, benchmark and find-threshold, adds fuzzy matching, provides improvements to our entity linking functionality, and includes a range of language updates and bug fixes.

New CLI commands {id="cli"}

apply CLI

The apply CLI can be used to apply a pipeline to one or more .txt, .jsonl or .spacy input files, saving the annotated docs in a single .spacy file.

bash

$ spacy apply en_core_web_sm my_texts/ output.spacy

benchmark CLI

The benchmark CLI has been added to extend the existing evaluate functionality with a wider range of profiling subcommands.

The benchmark accuracy CLI is introduced as an alias for evaluate. The new benchmark speed CLI performs warmup rounds before measuring the speed in words per second on batches of randomly shuffled documents from the provided data.

bash

$ spacy benchmark speed my_pipeline data.spacy

The output is the mean performance using batches (nlp.pipe) with a 95% confidence interval, e.g., profiling en_core_web_sm on CPU:

none

Outliers: 2.0%, extreme outliers: 0.0%
Mean: 18904.1 words/s (95% CI: -256.9 +244.1)
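To make the shape of that output concrete, here is a minimal sketch of how a mean words-per-second figure with a 95% confidence interval could be derived from per-batch timings. It is not the CLI's implementation: `process_batch` is a hypothetical stand-in for running `nlp.pipe` on a batch, and the percentile bootstrap, batch size and warmup count are assumptions chosen for illustration.

```python
import random
import statistics
import time


def process_batch(batch):
    # Hypothetical stand-in for list(nlp.pipe(batch)): just splits on whitespace.
    return [text.split() for text in batch]


def words_per_second(batches, warmup_rounds=3):
    # Warm up first so one-time setup costs don't skew the measurement.
    for _ in range(warmup_rounds):
        for batch in batches:
            process_batch(batch)
    rates = []
    for batch in batches:
        n_words = sum(len(text.split()) for text in batch)
        start = time.perf_counter()
        process_batch(batch)
        elapsed = time.perf_counter() - start
        rates.append(n_words / max(elapsed, 1e-9))
    return rates


def bootstrap_ci(samples, n_resamples=1000, seed=0):
    # Percentile bootstrap for the mean; returns (mean, low, high).
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(samples), low, high


# Toy data: shuffled documents split into batches of 20.
docs = ["This is a short example sentence."] * 200
random.Random(0).shuffle(docs)
batches = [docs[i : i + 20] for i in range(0, len(docs), 20)]

mean, low, high = bootstrap_ci(words_per_second(batches))
print(f"Mean: {mean:.1f} words/s (95% CI: {low - mean:+.1f} {high - mean:+.1f})")
```

The absolute numbers from a toy stand-in like this are meaningless; the point is only that the interval is reported as offsets around the mean.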

find-threshold CLI

The find-threshold CLI runs a series of trials across threshold values from 0.0 to 1.0 and identifies the best threshold for the provided score metric.

The following command runs 20 trials for the spancat component in my_pipeline, recording the spans_sc_f score for each value of the threshold [components.spancat.threshold] from 0.0 to 1.0:

bash

$ spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20

The find-threshold CLI can be used with textcat_multilabel, spancat and custom components with thresholds that are applied while predicting or scoring.
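The idea behind the search can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the CLI's code: the candidate thresholds are taken to be evenly spaced, a score counts as a positive prediction when it is greater than or equal to the threshold, and `scores`, `gold` and both helper functions are hypothetical.

```python
def f_score(gold, predicted):
    # F1 over boolean labels.
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def find_best_threshold(scores, gold, n_trials=20):
    # Evenly spaced candidate thresholds from 0.0 to 1.0 inclusive.
    thresholds = [i / (n_trials - 1) for i in range(n_trials)]
    results = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        results.append((threshold, f_score(gold, predicted)))
    # max() keeps the first best entry, so ties go to the lowest threshold.
    return max(results, key=lambda item: item[1]), results


# Toy model scores and gold labels, made up for this example.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
gold = [True, True, True, False, True, False]

(best_threshold, best_f), trials = find_best_threshold(scores, gold)
print(f"best threshold: {best_threshold:.2f} (F = {best_f:.3f})")
```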

Fuzzy matching {id="fuzzy"}

New FUZZY operators support fuzzy matching with the Matcher. By default, the FUZZY operator allows a Levenshtein edit distance of at least 2, rising to 30% of the pattern string length for longer patterns. FUZZY1..FUZZY9 can be used to specify the exact number of allowed edits.

python

# Match lowercase with fuzzy matching (allows up to 3 edits)
pattern = [{"LOWER": {"FUZZY": "definitely"}}]

# Match custom attribute values with fuzzy matching (allows up to 3 edits)
pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]

# Match with exact Levenshtein edit distance limits (allows up to 4 edits)
pattern = [{"_": {"country": {"FUZZY4": "Kyrgyzstan"}}}]

Note that FUZZY uses Levenshtein edit distance rather than Damerau-Levenshtein edit distance, so a transposition like "teh" for "the" counts as two edits, one insertion and one deletion.
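The edit counting can be checked with a small, dependency-free sketch. The `levenshtein` function below is a textbook dynamic-programming implementation, not spaCy's own, and `default_max_edits` encodes one assumed reading of the default rule (at least 2 edits, or 30% of the pattern length if that is larger), which agrees with the "up to 3 edits" comments in the patterns above.

```python
def levenshtein(a, b):
    # Classic edit distance: insertions, deletions and substitutions each
    # cost 1; transpositions get no special treatment.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def default_max_edits(pattern_text):
    # Assumed default budget: at least 2, or 30% of the pattern length.
    return max(2, round(0.3 * len(pattern_text)))


def fuzzy_equal(input_text, pattern_text, max_edits=None):
    # Hypothetical helper: True if the input is within the edit budget.
    if max_edits is None:
        max_edits = default_max_edits(pattern_text)
    return levenshtein(input_text, pattern_text) <= max_edits


print(levenshtein("teh", "the"))        # the transposition costs 2 edits
print(default_max_edits("definitely"))  # budget for a 10-character pattern
print(fuzzy_equal("definately", "definitely"))
```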

If you'd prefer an alternative fuzzy matching algorithm, you can pass your own custom method to the Matcher, or set it as a config option for an entity ruler or span ruler.

FUZZY and REGEX with lists {id="fuzzy-regex-lists"}

The FUZZY and REGEX operators are also now supported for lists with IN and NOT_IN:

python

pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
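For an end-to-end view, the following sketch runs a plain FUZZY pattern and a FUZZY pattern with an IN list over a short text. It assumes spaCy v3.5 or later is installed; the rule names and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough: the patterns only use token text.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# FUZZY: tolerate small misspellings of a single word.
matcher.add("DEFINITELY", [[{"LOWER": {"FUZZY": "definitely"}}]])
# FUZZY with IN: tolerate misspellings of any word in the list.
matcher.add(
    "PRAISE",
    [[{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]],
)

doc = nlp("This is definately an awsome library")
matched = {}
for match_id, start, end in matcher(doc):
    # Map the rule name back from its hash to the matched text.
    matched[nlp.vocab.strings[match_id]] = doc[start:end].text
print(matched)
```

Both misspellings are one edit away from their targets, so they fall well within the default edit budget.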

Entity linking generalization {id="el"}

The knowledge base used for entity linking is now easier to customize and has a new default implementation, InMemoryLookupKB.

Additional features and improvements {id="additional-features-and-improvements"}

- Language updates:
  - Extended support for Slovenian
  - Fixed lookup fallback for French and Catalan lemmatizers
  - Switch Russian and Ukrainian lemmatizers to pymorphy3
  - Support for editorial punctuation in Ancient Greek
  - Update to Russian tokenizer exceptions
  - Small fix for Dutch stop words
- Allow up to typer v0.7.x, mypy 0.990 and typing_extensions v4.4.x.
- New spacy.ConsoleLogger.v3 with expanded progress tracking.
- Improved scoring behavior for textcat with spacy.textcat_scorer.v2 and spacy.textcat_multilabel_scorer.v2.
- Updates so that downstream components can train properly on a frozen tok2vec or transformer layer.
- Allow interpolation of variables in directory names in projects.
- Support for local file system remotes for projects.
- Improve UX around displacy.serve when the default port is in use.
- Optional before_update callback that is invoked at the start of each training step.
- Improve performance of SpanGroup and fix typing issues for SpanGroup and Span objects.
- Patch a security vulnerability in extracting tar files.
- Add equality definition for Vectors.
- Ensure Vocab.to_disk respects the exclude setting for lookups and vectors.
- Correctly handle missing annotations in the edit tree lemmatizer.

Trained pipeline updates {id="pipelines"}

- The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and morphologizer components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require spacy-transformers v1.2, which uses the exact alignment from tokenizers for fast tokenizers instead of the heuristic alignment from spacy-alignments. For all trained pipelines except ja_core_news_trf, the alignments between spaCy tokens and transformer tokens may be slightly different. See the v1.2.0 release notes for more details about the spacy-transformers changes.

Notes about upgrading from v3.4 {id="upgrading"}

Validation of textcat values {id="textcat-validation"}

An error is now raised when unsupported values are given as input to train a textcat or textcat_multilabel model. Ensure that all values are 0.0 or 1.0, as explained in the docs.
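If your training data may contain soft scores, a pre-flight check along these lines can surface them before training does. This is only a sketch: `check_cats` and `binarize` are hypothetical helpers rather than spaCy functions, and whether rounding at 0.5 is appropriate depends on how your annotations were produced.

```python
def check_cats(cats):
    # Raise if any category value is something other than 0.0 or 1.0.
    bad = {label: value for label, value in cats.items() if value not in (0.0, 1.0)}
    if bad:
        raise ValueError(f"Unsupported textcat values: {bad}")
    return cats


def binarize(cats, cutoff=0.5):
    # Hypothetical clean-up step: map soft scores onto 0.0 / 1.0.
    return {label: 1.0 if value >= cutoff else 0.0 for label, value in cats.items()}


# Toy annotation with soft scores, as might come from an older export.
raw_cats = {"POSITIVE": 0.9, "NEGATIVE": 0.1}
clean_cats = check_cats(binarize(raw_cats))
print(clean_cats)
```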

Using the default knowledge base

As KnowledgeBase is now an abstract class, you should call the constructor of the new InMemoryLookupKB instead when you want to use spaCy's default KB implementation:

diff

- kb = KnowledgeBase()
+ kb = InMemoryLookupKB()

If you've written a custom KB that inherits from KnowledgeBase, you'll need to implement its abstract methods, or alternatively inherit from InMemoryLookupKB instead.
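As a minimal sketch of the default implementation in use, assuming spaCy v3.5 or later is installed, the snippet below builds a tiny in-memory KB. The entity IDs, frequencies, vectors and prior probabilities are toy values chosen for illustration.

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")
# entity_vector_length must match the length of the vectors added below.
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Toy entries: IDs, frequencies and vectors are made up.
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0.0, 3.0, 5.0])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1.0, 9.0, -3.0])
kb.add_alias(
    alias="Douglas",
    entities=["Q1004791", "Q42"],
    probabilities=[0.6, 0.3],
)

candidate_ids = sorted(c.entity_ for c in kb.get_alias_candidates("Douglas"))
print(kb.get_size_entities(), kb.get_size_aliases(), candidate_ids)
```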

Updated scorers for tokenization and textcat {id="scores"}

We fixed a bug that inflated the token_acc scores in v3.0-v3.4. The reported token_acc will drop from v3.4 to v3.5, but if token_p/r/f stay the same, your tokenization performance has not changed from v3.4.

For new textcat or textcat_multilabel configs, the new default v2 scorers:

- ignore threshold for textcat, so the reported cats_p/r/f may increase slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the final textcat or textcat_multilabel component in the pipeline by default
- allow custom scorers to be used to score multiple textcat and textcat_multilabel components with Scorer.score_cats by restricting the evaluation to the component's provided labels

Pipeline package version compatibility {id="version-compat"}

Using legacy implementations

In spaCy v3, you'll still be able to load and reference legacy implementations via spacy-legacy, even if the components or architectures change and newer versions are available in the core library.

When you're loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn't necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.

If you're using one of the trained pipelines we provide, you should run spacy download to update to the latest version. To see an overview of all installed packages and their compatibility, you can run spacy validate.

If you've trained your own custom pipeline and you've confirmed that it's still working as expected, you can update the spaCy version requirements in the meta.json:

diff

- "spacy_version": ">=3.4.0,<3.5.0",
+ "spacy_version": ">=3.4.0,<3.6.0",
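If you maintain several packages, the same edit can be scripted. The sketch below is an assumption-laden illustration: `widen_spacy_version` is a hypothetical helper, and the demo writes a throwaway meta.json with only two fields, whereas a real package's file has many more.

```python
import json
import tempfile
from pathlib import Path


def widen_spacy_version(meta_path, new_range):
    # Rewrite only the "spacy_version" field and leave everything else as is.
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["spacy_version"] = new_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demo on a temporary file so nothing on disk is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = Path(tmp_dir) / "meta.json"
    original = {"name": "my_pipeline", "spacy_version": ">=3.4.0,<3.5.0"}
    meta_path.write_text(json.dumps(original), encoding="utf8")
    updated = widen_spacy_version(meta_path, ">=3.4.0,<3.6.0")
    print(updated["spacy_version"])
```

Only widen the range after you have confirmed the pipeline still behaves as expected, since the field exists to flag exactly that risk.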

Updating v3.4 configs

To update a config from spaCy v3.4 with the new v3.5 settings, run init fill-config:

cli

$ python -m spacy init fill-config config-v3.4.cfg config-v3.5.cfg

In many cases (spacy train, spacy.load), the new defaults will be filled in automatically, but you'll need to fill in the new settings to run debug config and debug data.