website/docs/usage/v3-5.mdx
spaCy v3.5 introduces three new CLI commands, apply, benchmark and
find-threshold, adds fuzzy matching, provides improvements to our entity
linking functionality, and includes a range of language updates and bug fixes.
The apply CLI can be used to apply a pipeline to one or more
.txt, .jsonl or .spacy input files, saving the annotated docs in a single
.spacy file.
$ spacy apply en_core_web_sm my_texts/ output.spacy
The benchmark CLI has been added to extend the existing
evaluate functionality with a wider range of profiling subcommands.
The benchmark accuracy CLI is introduced as an alias for evaluate. The new
benchmark speed CLI performs warmup rounds before measuring the speed in words
per second on batches of randomly shuffled documents from the provided data.
$ spacy benchmark speed my_pipeline data.spacy
The output is the mean performance using batches (nlp.pipe) with a 95%
confidence interval, e.g., profiling en_core_web_sm on CPU:
Outliers: 2.0%, extreme outliers: 0.0%
Mean: 18904.1 words/s (95% CI: -256.9 +244.1)
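To make the shape of this report concrete, here is a minimal sketch of how a mean words-per-second figure with an asymmetric 95% interval could be derived from per-batch measurements. It is not presented as spaCy's implementation: the percentile bootstrap, the helper name bootstrap_mean_ci and the sample values are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(samples, iterations=2000, seed=0):
    """Return (mean, lower_offset, upper_offset) from a 95% percentile bootstrap."""
    rng = random.Random(seed)
    mean = statistics.fmean(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(iterations)
    )
    lower = means[int(0.025 * iterations)]
    upper = means[int(0.975 * iterations) - 1]
    # Offsets relative to the mean, in the "-x +y" style of the report above.
    return mean, lower - mean, upper - mean


# Hypothetical words-per-second measurements, one per timed batch.
wps = [18650.0, 19120.5, 18890.2, 18740.8, 19010.3, 18975.1, 18820.6, 19055.9]
mean, lower_offset, upper_offset = bootstrap_mean_ci(wps)
print(f"Mean: {mean:.1f} words/s (95% CI: {lower_offset:.1f} +{upper_offset:.1f})")
```

The two offsets need not be equal in size, which is why the interval is reported as a separate lower and upper value rather than a single plus-or-minus figure.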
The find-threshold CLI runs a series of trials
across threshold values from 0.0 to 1.0 and identifies the best threshold
for the provided score metric.
The following command runs 20 trials for the spancat component in
my_pipeline, recording the spans_sc_f score for each value of the threshold
[components.spancat.threshold] from 0.0 to 1.0:
$ spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20
The find-threshold CLI can be used with textcat_multilabel, spancat and
custom components with thresholds that are applied while predicting or scoring.
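The idea behind a threshold search can be sketched in a few lines of plain Python. The example below assumes evenly spaced trial thresholds and a simple binary F-score. The function names, the data and the tie-breaking rule are hypothetical and only illustrate the principle, not the CLI's internals.

```python
def f_score(gold, scores, threshold):
    """F1 for binary labels when predicting positive at score >= threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def find_best_threshold(gold, scores, n_trials=11):
    """Try n_trials evenly spaced thresholds in [0.0, 1.0]; return (threshold, score)."""
    thresholds = [i / (n_trials - 1) for i in range(n_trials)]
    results = [(t, f_score(gold, scores, t)) for t in thresholds]
    # max() keeps the first (lowest) threshold among tied scores.
    return max(results, key=lambda pair: pair[1])


# Hypothetical gold labels and model scores for eight candidates.
gold = [True, True, True, False, True, False, False, False]
scores = [0.95, 0.80, 0.65, 0.60, 0.45, 0.30, 0.20, 0.05]
best_threshold, best_f = find_best_threshold(gold, scores)
print(best_threshold, round(best_f, 3))
```

More trials give a finer grid at the cost of one evaluation pass per trial.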
New FUZZY operators support fuzzy matching
with the Matcher. By default, the FUZZY operator allows a Levenshtein edit
distance of at least 2 and up to 30% of the pattern string length. FUZZY1..FUZZY9 can
be used to specify the exact number of allowed edits.
# Match lowercase with fuzzy matching (allows up to 3 edits)
pattern = [{"LOWER": {"FUZZY": "definitely"}}]
# Match custom attribute values with fuzzy matching (allows up to 3 edits)
pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]
# Match with exact Levenshtein edit distance limits (allows up to 4 edits)
pattern = [{"_": {"country": {"FUZZY4": "Kyrgyzstan"}}}]
Note that FUZZY uses Levenshtein edit distance rather than Damerau-Levenshtein
edit distance, so a transposition like teh for the counts as two edits, one
insertion and one deletion.
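The following self-contained sketch illustrates both points: a plain Levenshtein distance that has no transposition operation, and the default edit limit described above. The helper names are hypothetical, and the limit formula is an assumption based on the "at least 2 and up to 30%" description rather than a copy of spaCy's code.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insertions, deletions and substitutions only."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def default_max_edits(pattern_text):
    """Assumed default limit: at least 2, up to 30% of the pattern length."""
    return max(2, round(0.3 * len(pattern_text)))


print(levenshtein("teh", "the"))        # a transposition costs two edits here
print(default_max_edits("definitely"))  # 10-character pattern
print(default_max_edits("the"))         # short patterns still get the minimum
```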
If you'd prefer an alternate fuzzy matching algorithm, you can provide your own
custom method to the Matcher or as a config option for an entity ruler and
span ruler.
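As an illustration of what such a custom method might look like, the sketch below implements a transposition-aware comparison using the optimal string alignment variant of Damerau-Levenshtein distance. The three-argument signature (input text, pattern text, and an integer limit where a negative value means "use the default") is an assumption about what the Matcher expects, and both function names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def transposition_aware_compare(input_text, pattern_text, fuzzy=-1):
    """Return True if input_text is within the allowed number of edits."""
    if fuzzy >= 0:
        max_edits = fuzzy
    else:
        # Assumed default: at least 2, up to 30% of the pattern length.
        max_edits = max(2, round(0.3 * len(pattern_text)))
    return osa_distance(input_text, pattern_text) <= max_edits


print(transposition_aware_compare("teh", "the", 1))
```

A function of this shape could then be handed to the Matcher when it is constructed. To the best of our knowledge the relevant keyword argument is fuzzy_compare, but check the Matcher API documentation for the exact parameter name and the corresponding ruler config setting.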
The FUZZY and REGEX operators are also now supported for lists with IN and
NOT_IN:
pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
The knowledge base used for entity linking is now easier to customize and has a
new default implementation InMemoryLookupKB.
- Language updates: Russian and Ukrainian lemmatizers switched to pymorphy3.
- Support for typer v0.7.x, mypy 0.990 and typing_extensions v4.4.x.
- New spacy.ConsoleLogger.v3 with expanded progress tracking.
- Improved scoring behavior for textcat with spacy.textcat_scorer.v2 and
  spacy.textcat_multilabel_scorer.v2.
- Updates so that downstream components can train properly on a frozen tok2vec
  or transformer layer.
- Improved UX around displacy.serve when the default port is in use.
- Optional before_update callback that is invoked at the start of each
  training step.
- Improved performance of SpanGroup and fixed typing issues for SpanGroup and
  Span objects.
- Added an equality definition for Vectors.
- Vocab.to_disk now respects the exclude setting for lookups and vectors.
- The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and
  morphologizer components to improve tagging of non-whitespace vs. whitespace
  tokens.
- The transformer pipelines require spacy-transformers v1.2, which uses the
  exact alignment from tokenizers for fast tokenizers instead of the heuristic
  alignment from spacy-alignments. For all trained pipelines except
  ja_core_news_trf, the alignments between spaCy tokens and transformer tokens
  may be slightly different. More details about the spacy-transformers changes
  are in the v1.2.0 release notes.
An error is now raised when unsupported values are given as input to train a
textcat or textcat_multilabel model. Ensure that values are 0.0 or 1.0,
as explained in the docs.
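Before retraining, it can help to scan existing training data for category values that would now trigger this error. The following is a minimal sketch that assumes the data is available as (text, annotations) pairs with a "cats" dictionary. The helper name and the sample data are hypothetical.

```python
def find_invalid_cats(examples):
    """Yield (index, label, value) for every category value that is not 0.0 or 1.0."""
    for index, (_text, annotations) in enumerate(examples):
        for label, value in annotations.get("cats", {}).items():
            if value not in (0.0, 1.0):
                yield index, label, value


# Hypothetical training data in (text, annotations) form.
train_data = [
    ("Great service", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("It was fine", {"cats": {"POSITIVE": 0.6, "NEGATIVE": 0.4}}),
]
problems = list(find_invalid_cats(train_data))
print(problems)
```

Any entries reported this way would need to be mapped to 0.0 or 1.0 before training.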
As KnowledgeBase is now an abstract class, you should call the constructor of
the new InMemoryLookupKB instead when you want to use spaCy's default KB
implementation:
- kb = KnowledgeBase()
+ kb = InMemoryLookupKB()
If you've written a custom KB that inherits from KnowledgeBase, you'll need to
implement its abstract methods, or alternatively inherit from InMemoryLookupKB
instead.
We fixed a bug that inflated the token_acc scores in v3.0-v3.4. The reported
token_acc will drop from v3.4 to v3.5, but if token_p/r/f stay the same,
your tokenization performance has not changed from v3.4.
For new textcat or textcat_multilabel configs, the new default v2 scorers:
- ignore threshold for textcat, so the reported cats_p/r/f may increase
  slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the final textcat or textcat_multilabel
  component in the pipeline by default
- allow custom scorers to be used to score multiple textcat and
  textcat_multilabel components with Scorer.score_cats by restricting the
  evaluation to the component's provided labels
Using legacy implementations
In spaCy v3, you'll still be able to load and reference legacy implementations via
spacy-legacy, even if the components or architectures change and newer versions are available in the core library.
When you're loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn't necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you're using one of the trained pipelines we provide, you should
run spacy download to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
spacy validate.
If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
meta.json:
- "spacy_version": ">=3.4.0,<3.5.0",
+ "spacy_version": ">=3.4.0,<3.6.0",
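If you maintain several pipeline packages, the same edit can be scripted. The sketch below uses only the standard library and assumes meta.json is a flat JSON object with a spacy_version key. The helper name widen_spacy_version is hypothetical, and the demo runs against a temporary file rather than a real package.

```python
import json
import tempfile
from pathlib import Path


def widen_spacy_version(meta_path, new_range):
    """Rewrite the spacy_version field of a meta.json in place; return the old value."""
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf8"))
    old_range = meta.get("spacy_version")
    meta["spacy_version"] = new_range
    path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return old_range


# Demo against a throwaway file with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "meta.json"
    meta_file.write_text(
        json.dumps({"name": "my_pipeline", "spacy_version": ">=3.4.0,<3.5.0"}),
        encoding="utf8",
    )
    previous = widen_spacy_version(meta_file, ">=3.4.0,<3.6.0")
    updated = json.loads(meta_file.read_text(encoding="utf8"))

print(previous, "->", updated["spacy_version"])
```

Only widen the range after confirming that the pipeline still behaves as expected under the new version.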
To update a config from spaCy v3.4 with the new v3.5 settings, run
init fill-config:
$ python -m spacy init fill-config config-v3.4.cfg config-v3.5.cfg
In many cases (spacy train,
spacy.load), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
debug config and debug data.