# Corpus

website/docs/api/corpus.mdx


This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers. Also see the usage guide on data utilities for more details and examples.
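For example, a minimal `.spacy` file can be created with `DocBin` and read back through `Corpus` (a sketch using a temporary file; the text and path are illustrative):

```python
import os
import tempfile

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")

# Serialize a Doc into the binary .spacy format
doc_bin = DocBin()
doc_bin.add(nlp.make_doc("I like spaCy"))
path = os.path.join(tempfile.mkdtemp(), "train.spacy")
doc_bin.to_disk(path)

# Read the annotations back as Example objects
corpus = Corpus(path)
examples = list(corpus(nlp))
```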

## Config and implementation {id="config"}

spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.

```ini
### Example config
[paths]
train = "corpus/train.spacy"

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
```
| Name | Description | Type |
| --- | --- | --- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary `.spacy` format. | `Path` |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See `Corpus` for details. | `bool` |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. | `int` |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. | `int` |
| `augmenter` | Apply simple data augmentation, replacing tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. | `Optional[Callable]` |
```python
%%GITHUB_SPACY/spacy/training/corpus.py
```
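A custom replacement registered in the `@readers` registry might look like this (a minimal sketch; the reader name `"my_reader.v1"` and the inline sample data are illustrative, standing in for data you would load from a path):

```python
from typing import Callable, Iterable

import spacy
from spacy.language import Language
from spacy.training import Example

# Illustrative name: register a custom data reader in the @readers registry
@spacy.registry.readers("my_reader.v1")
def create_reader(limit: int = 0) -> Callable[[Language], Iterable[Example]]:
    # A fixed sample keeps this self-contained; real readers would load from disk
    samples = [("I like spaCy", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})]

    def reader(nlp: Language) -> Iterable[Example]:
        data = samples[:limit] if limit else samples
        for text, annotations in data:
            yield Example.from_dict(nlp.make_doc(text), annotations)

    return reader
```

The config would then point `[corpora.train]` at `@readers = "my_reader.v1"` instead of `"spacy.Corpus.v1"`.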

## Corpus.__init__ {id="init",tag="method"}

Create a Corpus for iterating Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew.

Example

```python
from spacy.training import Corpus

# With a single file
corpus = Corpus("./data/train.spacy")

# With a directory
corpus = Corpus("./data", limit=10)
```
| Name | Description | Type |
| --- | --- | --- |
| `path` | The directory or filename to read from. | `Union[str, Path]` |
| _keyword-only_ | | |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. | `bool` |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. | `int` |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. | `int` |
| `augmenter` | Optional data augmentation callback. | `Callable[[Language, Example], Iterable[Example]]` |
| `shuffle` | Whether to shuffle the examples. Defaults to `False`. | `bool` |

## Corpus.__call__ {id="call",tag="method"}

Yield examples from the data.

Example

```python
from spacy.training import Corpus
import spacy

corpus = Corpus("./train.spacy")
nlp = spacy.blank("en")
train_data = corpus(nlp)
```
| Name | Description | Type |
| --- | --- | --- |
| `nlp` | The current `nlp` object. | `Language` |
| **YIELDS** | The examples. | `Example` |

## JsonlCorpus {id="jsonlcorpus",tag="class"}

Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from a JSONL file.

Tip: Writing JSONL

Our utility library srsly provides a handy write_jsonl helper that takes a file path and list of dictionaries and writes out JSONL-formatted data.

```python
import srsly

data = [{"text": "Some text"}, {"text": "More..."}]
srsly.write_jsonl("/path/to/text.jsonl", data)
```

```json
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
```
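If srsly is not available, the same JSONL format can be produced with the standard library alone (a minimal sketch; the temporary path is illustrative):

```python
import json
import os
import tempfile

data = [{"text": "Some text"}, {"text": "More..."}]
path = os.path.join(tempfile.mkdtemp(), "text.jsonl")

# JSONL is one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for record in data:
        f.write(json.dumps(record) + "\n")
```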

## JsonlCorpus.__init__ {id="jsonlcorpus-init",tag="method"}

Initialize the reader.

Example

```python
from spacy.training import JsonlCorpus

corpus = JsonlCorpus("./data/texts.jsonl")
```

```ini
### Example config
[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = "corpus/raw_text.jsonl"
min_length = 0
max_length = 0
limit = 0
```
| Name | Description | Type |
| --- | --- | --- |
| `path` | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. | `Union[str, Path]` |
| _keyword-only_ | | |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. | `int` |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. | `int` |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. | `int` |

## JsonlCorpus.__call__ {id="jsonlcorpus-call",tag="method"}

Yield examples from the data.

Example

```python
from spacy.training import JsonlCorpus
import spacy

corpus = JsonlCorpus("./texts.jsonl")
nlp = spacy.blank("en")
data = corpus(nlp)
```
| Name | Description | Type |
| --- | --- | --- |
| `nlp` | The current `nlp` object. | `Language` |
| **YIELDS** | The examples. | `Example` |

## PlainTextCorpus {id="plaintextcorpus",tag="class",version="3.5.1"}

Iterate over documents from a plain text file. Can be used to read the raw text corpus for language model pretraining. The expected file format is:

- UTF-8 encoding
- One document per line
- Blank lines are ignored

```text
Can I ask where you work now and what you do, and if you enjoy it?
They may just pull out of the Seattle market completely, at least until they have autonomous vehicles.
My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in.
```
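Conceptually, the format rules above amount to the following stdlib-only sketch (assumed behavior for illustration, not the library's implementation):

```python
raw = (
    "First document.\n"
    "\n"
    "Second document.\n"
)

# One document per non-blank line; blank lines are ignored
docs = [line.strip() for line in raw.splitlines() if line.strip()]
```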

## PlainTextCorpus.__init__ {id="plaintextcorpus-init",tag="method"}

Initialize the reader.

Example

```python
from spacy.training import PlainTextCorpus

corpus = PlainTextCorpus("./data/docs.txt")
```

```ini
### Example config
[corpora.pretrain]
@readers = "spacy.PlainTextCorpus.v1"
path = "corpus/raw_text.txt"
min_length = 0
max_length = 0
```
| Name | Description | Type |
| --- | --- | --- |
| `path` | The directory or filename to read from. Expects newline-delimited documents in UTF-8 format. | `Union[str, Path]` |
| _keyword-only_ | | |
| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. | `int` |
| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. | `int` |

## PlainTextCorpus.__call__ {id="plaintextcorpus-call",tag="method"}

Yield examples from the data.

Example

```python
from spacy.training import PlainTextCorpus
import spacy

corpus = PlainTextCorpus("./docs.txt")
nlp = spacy.blank("en")
data = corpus(nlp)
```
| Name | Description | Type |
| --- | --- | --- |
| `nlp` | The current `nlp` object. | `Language` |
| **YIELDS** | The examples. | `Example` |