spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems.
spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy right up to current state-of-the-art. You can also use a CPU-optimized pipeline, which is less accurate but much cheaper to run.
<Benchmarks />

**Evaluation details**
- OntoNotes 5.0: spaCy's English models are trained on this corpus, as it's several times larger than other English treebanks. However, most systems do not report accuracies on it.
- Penn Treebank: The "classic" parsing evaluation for research. However, it's quite far removed from actual usage: it uses sentences with gold-standard segmentation and tokenization, from a pretty specific type of text (articles from a single newspaper, 1984-1989).
| Dependency Parsing System | UAS | LAS |
|---|---|---|
| spaCy RoBERTa (2020) | 95.1 | 93.7 |
| Mrini et al. (2019) | 97.4 | 96.3 |
| Zhou and Zhao (2019) | 97.2 | 95.7 |
Dependency parsing accuracy on the Penn Treebank. See NLP-progress for more results. Project template: `benchmarks/parsing_penn_treebank`.
We compare the speed of different NLP libraries, measured in words per second (WPS); higher is better. The evaluation was performed on 10,000 Reddit comments.
| Library | Pipeline | WPS CPU <Help>words per second on CPU, higher is better</Help> | WPS GPU <Help>words per second on GPU, higher is better</Help> |
|---|---|---|---|
| spaCy | en_core_web_lg | 10,014 | 14,954 |
| spaCy | en_core_web_trf | 684 | 3,768 |
| Stanza | en_ewt | 878 | 2,180 |
| Flair | pos(-fast) & ner(-fast) | 323 | 1,184 |
| UDPipe | english-ewt-ud-2.5 | 1,101 | n/a |
End-to-end processing speed on raw unannotated text. Project template: `benchmarks/speed`.