What's New in v3.3 - Spacy

New features {id="features",hidden="true"}

spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.

Speed improvements {id="speed"}

v3.3 includes a slew of speed improvements:

Speed up parser and NER by using constant-time head lookups.
Support unnormalized softmax probabilities in spacy.Tagger.v2 to speed up inference for tagger, morphologizer, senter and trainable lemmatizer.
Speed up parser projectivization functions.
Replace Ragged with faster AlignmentArray in Example for training.
Improve Matcher speed.
Improve serialization speed for empty Doc.spans.

For longer texts, the trained pipeline speeds improve 15% or more in prediction. We benchmarked en_core_web_md (same components as in v3.2) and de_core_news_md (with the new trainable lemmatizer) across a range of text sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:

Intel Xeon W-2265

Model	Avg. Words/Doc	v3.2 Words/Sec	v3.3 Words/Sec	Diff
`en_core_web_md`	100	17292	17441	0.86%
(=same components)	1000	15408	16024	4.00%
	10000	12798	15346	19.91%
`de_core_news_md`	100	20221	19321	-4.45%
(+v3.3 trainable lemmatizer)	1000	17480	17345	-0.77%
	10000	14513	17036	17.38%

Apple M1

Model	Avg. Words/Doc	v3.2 Words/Sec	v3.3 Words/Sec	Diff
`en_core_web_md`	100	18272	18408	0.74%
(=same components)	1000	18794	19248	2.42%
	10000	15144	17513	15.64%
`de_core_news_md`	100	19227	19591	1.89%
(+v3.3 trainable lemmatizer)	1000	20047	20628	2.90%
	10000	15921	18546	16.49%

Trainable lemmatizer {id="trainable-lemmatizer"}

The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!

displaCy support for overlapping spans and arcs {id="displacy"}

displaCy now supports overlapping spans with a new span style and multiple arcs with different labels between the same tokens for dep visualizations.

Overlapping spans can be visualized for any spans key in doc.spans:

python

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
displacy.serve(doc, style="span", options={"spans_key": "custom"})

<Standalone height={100}> <div style={{ lineHeight: 2.5, direction: 'ltr', fontFamily: "-apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'", fontSize: 18 }}>Welcome to the BankORG of ChinaGPE.</div> </Standalone>

Additional features and improvements

Config comparisons with spacy debug diff-config.
Span suggester debugging with SpanCategorizer.set_candidates.
Big endian support with thinc-bigendian-ops and updates to make floret, murmurhash, Thinc and spaCy endian neutral.
Initial support for Lower Sorbian and Upper Sorbian.
Language updates for English, French, Italian, Japanese, Korean, Norwegian, Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
New noun chunks for Finnish.

Trained pipelines {id="pipelines"}

New trained pipelines {id="new-pipelines"}

v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use the new trainable lemmatizer and floret vectors. Due to the use Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

Package	Language	UPOS	Parser LAS	NER F
`fi_core_news_sm`	Finnish	92.5	71.9	75.9
`fi_core_news_md`	Finnish	95.9	78.6	80.6
`fi_core_news_lg`	Finnish	96.2	79.4	82.4
`ko_core_news_sm`	Korean	86.1	65.6	71.3
`ko_core_news_md`	Korean	94.7	80.9	83.1
`ko_core_news_lg`	Korean	94.7	81.3	85.3
`sv_core_news_sm`	Swedish	95.0	75.9	74.7
`sv_core_news_md`	Swedish	96.3	78.5	79.3
`sv_core_news_lg`	Swedish	96.3	79.1	81.1

Pipeline updates {id="pipeline-updates"}

The following languages switch from lookup or rule-based lemmatizers to the new trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian, Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy improves for all of these pipelines, but be aware that the types of errors may look quite different from the lookup-based lemmatizers. If you'd prefer to continue using the previous lemmatizer, you can switch from the trainable lemmatizer to a non-trainable lemmatizer.

Model	v3.2 Lemma Acc	v3.3 Lemma Acc
`da_core_news_md`	84.9	94.8
`de_core_news_md`	73.4	97.7
`el_core_news_md`	56.5	88.9
`fi_core_news_md`	-	86.2
`it_core_news_md`	86.6	97.2
`ko_core_news_md`	-	90.0
`lt_core_news_md`	71.1	84.8
`nb_core_news_md`	76.7	97.1
`nl_core_news_md`	81.5	94.0
`pl_core_news_md`	87.1	93.7
`pt_core_news_md`	76.7	96.9
`ro_core_news_md`	81.8	95.5
`sv_core_news_md`	-	95.5

</figure>

In addition, the vectors in the English pipelines are deduplicated to improve the pruned vectors in the md models and reduce the lg model size.

Notes about upgrading from v3.2 {id="upgrading"}

Span comparisons

Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order.

Whitespace annotation

During training, annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens in order to allow custom whitespace annotation.

Doc.from_docs

Doc.from_docs now includes Doc.tensor by default and supports excludes with an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.

Docs including Doc.tensor may be quite a bit larger in RAM, so to exclude Doc.tensor as in v3.2:

diff

-merged_doc = Doc.from_docs(docs)
+merged_doc = Doc.from_docs(docs, exclude=["tensor"])

Using trained pipelines with floret vectors

If you're running a new trained pipeline for Finnish, Korean or Swedish on new texts and working with Doc objects, you shouldn't notice any difference with floret vectors vs. default vectors.

If you use vectors for similarity comparisons, there are a few differences, mainly because a floret pipeline doesn't include any kind of frequency-based word list similar to the list of in-vocabulary vector keys with default vectors.

If your workflow iterates over the vector keys, you should use an external word list instead:

diff

- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ lexemes = [nlp.vocab[word] for word in external_word_list]

Vectors.most_similar is not supported because there's no fixed list of vectors to compare your vectors to.

Pipeline package version compatibility {id="version-compat"}

Using legacy implementations

In spaCy v3, you'll still be able to load and reference legacy implementations via spacy-legacy, even if the components or architectures change and newer versions are available in the core library.

When you're loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn't necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.

If you're using one of the trained pipelines we provide, you should run spacy download to update to the latest version. To see an overview of all installed packages and their compatibility, you can run spacy validate.

If you've trained your own custom pipeline and you've confirmed that it's still working as expected, you can update the spaCy version requirements in the meta.json:

diff

- "spacy_version": ">=3.2.0,<3.3.0",
+ "spacy_version": ">=3.2.0,<3.4.0",

Updating v3.2 configs

To update a config from spaCy v3.2 with the new v3.3 settings, run init fill-config:

bash

$ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg

In many cases (spacy train, spacy.load), the new defaults will be filled in automatically, but you'll need to fill in the new settings to run debug config and debug data.

To see the speed improvements for the Tagger architecture, edit your config to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.