website/docs/api/example.mdx
An `Example` holds the information for one training instance. It stores two
`Doc` objects: one for holding the gold-standard reference data, and one for
holding the predictions of the pipeline. An
`Alignment` object stores the alignment between
these two documents, as they can differ in tokenization.
Construct an `Example` object from the predicted document and the reference
document. If `alignment` is `None`, it will be initialized from the words in
both documents.
Example

```python
from spacy.tokens import Doc
from spacy.training import Example

pred_words = ["Apply", "some", "sunscreen"]
pred_spaces = [True, True, False]
gold_words = ["Apply", "some", "sun", "screen"]
gold_spaces = [True, True, False, False]
gold_tags = ["VERB", "DET", "NOUN", "NOUN"]
predicted = Doc(nlp.vocab, words=pred_words, spaces=pred_spaces)
reference = Doc(nlp.vocab, words=gold_words, spaces=gold_spaces, tags=gold_tags)
example = Example(predicted, reference)
```
| Name | Description |
| --- | --- |
| `predicted` | The document containing (partial) predictions. Cannot be `None`. |
| `reference` | The document containing gold-standard annotations. Cannot be `None`. |
| _keyword-only_ | |
| `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. |
Construct an `Example` object from the predicted document and the reference
annotations provided as a dictionary. For more details on the required format,
see the training format documentation.
Example

```python
from spacy.tokens import Doc
from spacy.training import Example

predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
```
| Name | Description |
| --- | --- |
| `predicted` | The document containing (partial) predictions. Cannot be `None`. |
| `example_dict` | The gold-standard annotations as a dictionary. Cannot be `None`. |
| **RETURNS** | The newly constructed object. |
The text of the predicted document in this `Example`.

Example

```python
raw_text = example.text
```

| Name | Description |
| --- | --- |
| **RETURNS** | The text of the predicted document. |
The `Doc` holding the predictions. Occasionally also referred to as `example.x`.

Example

```python
docs = [eg.predicted for eg in examples]
predictions, _ = model.begin_update(docs)
set_annotations(docs, predictions)
```

| Name | Description |
| --- | --- |
| **RETURNS** | The document containing (partial) predictions. |
The `Doc` holding the gold-standard annotations. Occasionally also referred to
as `example.y`.

Example

```python
for i, eg in enumerate(examples):
    for j, label in enumerate(all_labels):
        gold_labels[i][j] = eg.reference.cats.get(label, 0.0)
```

| Name | Description |
| --- | --- |
| **RETURNS** | The document containing gold-standard annotations. |
The `Alignment` object mapping the tokens of
the predicted document to those of the reference document.

Example

```python
tokens_x = ["Apply", "some", "sunscreen"]
x = Doc(vocab, words=tokens_x)
tokens_y = ["Apply", "some", "sun", "screen"]
example = Example.from_dict(x, {"words": tokens_y})
alignment = example.alignment
assert list(alignment.y2x.data) == [[0], [1], [2], [2]]
```

| Name | Description |
| --- | --- |
| **RETURNS** | The `Alignment` object between the tokens of the predicted and reference documents. |
Get the aligned view of a certain token attribute, denoted by its int ID or string name.

Example

```python
predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
```

| Name | Description |
| --- | --- |
| `field` | Attribute ID or string name. |
| `as_string` | Whether or not to return the list of values as strings. Defaults to `False`. |
| **RETURNS** | List of integer values, or string values if `as_string` is `True`. |
Get the aligned view of the dependency parse. If `projectivize` is set to
`True`, non-projective dependency trees are made projective through the
Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).

Example

```python
doc = nlp("He pretty quickly walks away")
example = Example.from_dict(doc, {"heads": [3, 2, 3, 0, 2]})
proj_heads, proj_labels = example.get_aligned_parse(projectivize=True)
assert proj_heads == [3, 2, 3, 0, 3]
```

| Name | Description |
| --- | --- |
| `projectivize` | Whether or not to projectivize the dependency trees. Defaults to `True`. |
| **RETURNS** | A list of aligned head indices and a list of aligned dependency labels. |
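Projectivity itself is easy to check with a short standalone sketch (plain Python, not part of spaCy; `is_projective` is our illustrative helper): a tree is non-projective when two of its arcs cross, i.e. their endpoints interleave.

```python
def is_projective(heads):
    """heads[i] is the index of token i's head, as in the example above.
    A dependency tree is projective if no two arcs cross."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h != i]
    for s1, e1 in arcs:
        for s2, e2 in arcs:
            if s1 < s2 < e1 < e2:  # endpoints interleave: the arcs cross
                return False
    return True

# The tree from the example above is non-projective: the arc from
# "quickly" (2) to "away" (4) crosses the arc between tokens 0 and 3.
# The projectivized head list no longer has a crossing.
print(is_projective([3, 2, 3, 0, 2]))  # False
print(is_projective([3, 2, 3, 0, 3]))  # True
```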
Get the aligned view of the NER BILUO tags.

Example

```python
words = ["Mrs", "Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
ner_tags = example.get_aligned_ner()
assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]
```

| Name | Description |
| --- | --- |
| **RETURNS** | List of BILUO values, denoting whether tokens are part of an NER annotation or not. |
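The BILUO scheme behind these tags can be reproduced with a small standalone sketch (pure Python, no spaCy required; `biluo_tags` is our illustrative helper, not part of the API): a single-token entity is tagged `U`, a multi-token entity gets `B` … `L` with `I` in between, and all other tokens are `O`.

```python
def biluo_tags(token_spans, entities):
    """token_spans: list of (start_char, end_char) per token.
    entities: list of (start_char, end_char, label).
    Returns one BILUO tag per token."""
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        covered = [i for i, (s, e) in enumerate(token_spans)
                   if s >= ent_start and e <= ent_end]
        if len(covered) == 1:
            tags[covered[0]] = f"U-{label}"
        elif covered:
            tags[covered[0]] = f"B-{label}"
            tags[covered[-1]] = f"L-{label}"
            for i in covered[1:-1]:
                tags[i] = f"I-{label}"
    return tags

# Mirror the tokenization from the example above: "New York" is one token.
tokens = ["Mrs", "Smith", "flew", "to", "New York"]
spans, pos = [], 0
for tok in tokens:
    spans.append((pos, pos + len(tok)))
    pos += len(tok) + 1  # +1 for the space between tokens
print(biluo_tags(spans, [(0, 9, "PERSON"), (18, 26, "LOC")]))
# ['B-PERSON', 'L-PERSON', 'O', 'O', 'U-LOC']
```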
Get the aligned view of any set of `Span` objects defined over
`Example.reference`. The resulting span indices will
align to the tokenization in `Example.predicted`.

Example

```python
words = ["Mr and Mrs Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 16, "PERSON")]
tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
ents_ref = example.reference.ents
assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4)]
ents_y2x = example.get_aligned_spans_y2x(ents_ref)
assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
```

| Name | Description |
| --- | --- |
| `y_spans` | `Span` objects aligned to the tokenization of `reference`. |
| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. |
| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. |
Get the aligned view of any set of `Span` objects defined over
`Example.predicted`. The resulting span indices will
align to the tokenization in `Example.reference`. This
method is particularly useful to assess the accuracy of predicted entities
against the original gold-standard annotation.

Example

```python
nlp.add_pipe("my_ner")
doc = nlp("Mr and Mrs Smith flew to New York")
tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
example = Example.from_dict(doc, {"words": tokens_ref})
ents_pred = example.predicted.ents
# Assume the NER model has found "Mr and Mrs Smith" as a named entity
assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
ents_x2y = example.get_aligned_spans_x2y(ents_pred)
assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
```

| Name | Description |
| --- | --- |
| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. |
| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. |
| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. |
Return a dictionary representation of the
reference annotation contained in this `Example`.

Example

```python
eg_dict = example.to_dict()
```

| Name | Description |
| --- | --- |
| **RETURNS** | Dictionary representation of the reference annotation. |
Split one `Example` into multiple `Example` objects, one for each sentence.

Example

```python
doc = nlp("I went yesterday had lots of fun")
tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sents_ref = [True, False, False, True, False, False, False]
example = Example.from_dict(doc, {"words": tokens_ref, "sent_starts": sents_ref})
split_examples = example.split_sents()
assert split_examples[0].text == "I went yesterday "
assert split_examples[1].text == "had lots of fun"
```

| Name | Description |
| --- | --- |
| **RETURNS** | List of `Example` objects, one for each original sentence. |
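The splitting logic can be mimicked on plain token lists (a pure-Python sketch, not the spaCy implementation; `split_by_sent_starts` is our illustrative helper): a new sentence begins wherever the `sent_starts` flag is `True`.

```python
def split_by_sent_starts(words, sent_starts):
    """Split a token list into sentences; sent_starts[i] is True when
    token i begins a new sentence."""
    sents = []
    for word, is_start in zip(words, sent_starts):
        if is_start or not sents:
            sents.append([])
        sents[-1].append(word)
    return sents

words = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sent_starts = [True, False, False, True, False, False, False]
print(split_by_sent_starts(words, sent_starts))
# [['I', 'went', 'yesterday'], ['had', 'lots', 'of', 'fun']]
```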
Calculate alignment tables between two tokenizations.

Alignment attributes are managed using `AlignmentArray`, which is a simplified
version of Thinc's `Ragged` type that
only supports the `data` and `length` attributes.

| Name | Description |
| --- | --- |
| `x2y` | The `AlignmentArray` object holding the alignment from `x` to `y`. |
| `y2x` | The `AlignmentArray` object holding the alignment from `y` to `x`. |
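To make the ragged layout concrete, here is a minimal pure-Python sketch of the idea (a toy `Ragged` class of our own, not the actual `AlignmentArray`): the flat `data` list holds all aligned indices concatenated, and `lengths` records how many of them belong to each token.

```python
class Ragged:
    """Toy ragged array: data is flat, lengths gives each row's size."""
    def __init__(self, rows):
        self.data = [i for row in rows for i in row]
        self.lengths = [len(row) for row in rows]

    def __getitem__(self, i):
        start = sum(self.lengths[:i])
        return self.data[start:start + self.lengths[i]]

# y2x for ["I", "'m"] aligned to ["I", "'", "m"]: the second token
# "'m" aligns to two tokens on the other side.
y2x = Ragged([[0], [1, 2]])
print(y2x.data, y2x.lengths)  # [0, 1, 2] [1, 2]
print(y2x[1])                 # [1, 2]
```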
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

Example

```python
from spacy.training import Alignment

bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
a2b = alignment.x2y
assert list(a2b.data) == [0, 1, 1, 2]
```

If `a2b.data[1] == a2b.data[2] == 1`, that means that `A[1]` (`"'"`) and `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).

| Name | Description |
| --- | --- |
| `A` | String values of candidate tokens to align. |
| `B` | String values of reference tokens to align. |
| **RETURNS** | An `Alignment` object describing the alignment. |
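The character-offset idea behind such an alignment can be sketched independently of spaCy (the `align_by_chars` helper is ours, not the library's algorithm): each token is mapped to the character span it covers in the concatenated string, and two tokens align when their spans overlap.

```python
def align_by_chars(a_tokens, b_tokens):
    """Align two tokenizations that concatenate to the same string.
    Returns a2b: for each token in a_tokens, the indices of the
    overlapping tokens in b_tokens, based on character spans."""
    if "".join(a_tokens) != "".join(b_tokens):
        raise ValueError("tokenizations must add up to the same string")

    def spans(tokens):
        out, pos = [], 0
        for tok in tokens:
            out.append((pos, pos + len(tok)))
            pos += len(tok)
        return out

    b_spans = spans(b_tokens)
    return [
        [j for j, (b_start, b_end) in enumerate(b_spans)
         if b_start < a_end and b_end > a_start]
        for a_start, a_end in spans(a_tokens)
    ]

bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
print(align_by_chars(bert_tokens, spacy_tokens))  # [[0], [1], [1], [2]]
```

Note that `["I", "'m"]` vs. `["I", "am"]` raises a `ValueError` here, mirroring the same-string assumption described above.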