website/docs/usage/v3-6.mdx
spaCy v3.6 adds the new SpanFinder component to the core
spaCy library and new trained pipelines for Slovenian.
The SpanFinder component identifies potentially
overlapping, unlabeled spans by identifying span start and end tokens. It is
intended for use in combination with a component like
SpanCategorizer that may further filter or label the
spans. See our
Spancat blog post for a more
detailed introduction to the span finder.
To train a pipeline with span_finder + spancat, remember to add
span_finder (and its tok2vec or transformer if required) to
[training.annotating_components] so that the spancat component can be
trained directly from its predictions:
[nlp]
pipeline = ["tok2vec","span_finder","spancat"]
[training]
annotating_components = ["tok2vec","span_finder"]
In practice it can be helpful to initially train the span_finder separately
before sourcing it (along with
its tok2vec) into the spancat pipeline for further training. Otherwise the
memory usage can spike for spancat in the first few training steps if the
span_finder makes a large number of predictions.
spancat_singlelabel in spacy debug data CLI.doc.spans rendering to spacy evaluate CLI displaCy output.spacy evaluate --per-component, Language.evaluate(per_component=True) and
Scorer.score(per_component=True). This is useful when the pipeline contains
more than one of the same component like textcat that may have overlapping
scores keys.PhraseMatcher and SpanGroup.v3.6 introduces new pipelines for Slovenian, which use the trainable lemmatizer and floret vectors.
| Package | UPOS | Parser LAS | NER F |
|---|---|---|---|
sl_core_news_sm | 96.9 | 82.1 | 62.9 |
sl_core_news_md | 97.6 | 84.3 | 73.5 |
sl_core_news_lg | 97.7 | 84.3 | 79.0 |
sl_core_news_trf | 99.0 | 91.7 | 90.0 |
The English pipelines have been updated to improve handling of contractions with various apostrophes and to lemmatize "get" as a passive auxiliary.
The Danish pipeline da_core_news_trf has been updated to use
vesteinn/DanskBERT with
performance improvements across the board.
When initializing a SpanGroup, there is a new check to verify that all added
spans refer to the current doc. Without this check, it was possible to run into
string store or other errors.
One place this may crop up is when creating Example objects for training with
custom spans:
doc = Doc(nlp.vocab, words=tokens) # predicted doc
example = Example.from_dict(doc, {"ner": iob_tags})
# use the reference doc when creating reference spans
- span = Span(doc, 0, 5, "ORG")
+ span = Span(example.reference, 0, 5, "ORG")
example.reference.spans[spans_key] = [span]
Using legacy implementations
In spaCy v3, you'll still be able to load and reference legacy implementations via
spacy-legacy, even if the components or architectures change and newer versions are available in the core library.
When you're loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn't necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you're using one of the trained pipelines we provide, you should
run spacy download to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
spacy validate.
If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
meta.json:
- "spacy_version": ">=3.5.0,<3.6.0",
+ "spacy_version": ">=3.5.0,<3.7.0",
To update a config from spaCy v3.5 with the new v3.6 settings, run
init fill-config:
$ python -m spacy init fill-config config-v3.5.cfg config-v3.6.cfg
In many cases (spacy train,
spacy.load), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
debug config and debug data.