Scorer

The Scorer computes evaluation scores. It is typically created by Language.evaluate. In addition, the Scorer provides a number of methods for evaluating Token and Doc attributes.

Scorer.__init__ {id="init",tag="method"}

Create a new Scorer.

Example

python
import spacy
from spacy.scorer import Scorer

# Default scoring pipeline
scorer = Scorer()

# Provided scoring pipeline
nlp = spacy.load("en_core_web_sm")
scorer = Scorer(nlp)
| Name | Description |
| --- | --- |
| nlp | The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline is constructed using the default_lang and default_pipeline settings. Optional[Language] |
| default_lang | The language to use for a default pipeline if nlp is not provided. Defaults to xx. str |
| default_pipeline | The pipeline components to use for a default pipeline if nlp is not provided. Defaults to ("senter", "tagger", "morphologizer", "parser", "ner", "textcat"). Iterable[str] |
| keyword-only | |
| **kwargs | Any additional settings to pass on to the individual scoring methods. Any |

Scorer.score {id="score",tag="method"}

Calculate the scores for a list of Example objects using the scoring methods provided by the components in the pipeline.

The returned Dict contains the scores provided by the individual pipeline components. For the scoring methods provided by the Scorer and used by the core pipeline components, the individual score names start with the Token or Doc attribute being scored:

  • token_acc, token_p, token_r, token_f
  • sents_p, sents_r, sents_f
  • tag_acc
  • pos_acc
  • morph_acc, morph_micro_p, morph_micro_r, morph_micro_f, morph_per_feat
  • lemma_acc
  • dep_uas, dep_las, dep_las_per_type
  • ents_p, ents_r, ents_f, ents_per_type
  • spans_sc_p, spans_sc_r, spans_sc_f
  • cats_score (depends on config, description provided in cats_score_desc), cats_micro_p, cats_micro_r, cats_micro_f, cats_macro_p, cats_macro_r, cats_macro_f, cats_macro_auc, cats_f_per_type, cats_auc_per_type

Example

python
scorer = Scorer()
scores = scorer.score(examples)
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| keyword-only | |
| per_component <Tag variant="new">3.6</Tag> | Whether to return the scores keyed by component name. Defaults to False. bool |
| RETURNS | A dictionary of scores. Dict[str, Union[float, Dict[str, float]]] |

Scorer.score_tokenization {id="score_tokenization",tag="staticmethod",version="3"}

Scores the tokenization:

  • token_acc: number of correct tokens / number of predicted tokens
  • token_p, token_r, token_f: precision, recall and F-score for token character spans

Docs with has_unknown_spaces are skipped during scoring.

Example

python
scores = Scorer.score_tokenization(examples)

| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| RETURNS | A dictionary containing the scores token_acc, token_p, token_r, token_f. Dict[str, float] |
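
To make the span-based tokenization metrics concrete, here is a minimal plain-Python sketch that does not use spaCy. It assumes tokens are given as (start, end) character offsets and compares the two sets of offsets, which is the idea behind token_p, token_r and token_f. The helper name `span_prf` and the sample offsets are hypothetical, and this is not spaCy's implementation.

```python
def span_prf(pred_spans, gold_spans):
    """Precision, recall and F-score over two collections of (start, end) spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # spans with identical character offsets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Text: "New York is big"
gold = [(0, 8), (9, 11), (12, 15)]          # "New York", "is", "big"
pred = [(0, 3), (4, 8), (9, 11), (12, 15)]  # "New", "York", "is", "big"
p, r, f = span_prf(pred, gold)
print(p, r, f)
```

Splitting "New York" into two tokens costs both precision (two predicted spans have no gold match) and recall (one gold span is never predicted).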

Scorer.score_token_attr {id="score_token_attr",tag="staticmethod",version="3"}

Scores a single token attribute. Tokens with missing values in the reference doc are skipped during scoring.

Example

python
scores = Scorer.score_token_attr(examples, "pos")
print(scores["pos_acc"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| attr | The attribute to score. str |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. Callable[[Token, str], Any] |
| missing_values | Attribute values to treat as missing annotation in the reference annotation. Defaults to {0, None, ""}. Set[Any] |
| RETURNS | A dictionary containing the score {attr}_acc. Dict[str, float] |
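
The effect of missing_values can be illustrated with a small plain-Python sketch. It assumes predicted and gold tokens are already aligned one-to-one and represented as simple lists of values; the function name `attr_accuracy` and the sample tags are hypothetical, and spaCy's own scoring additionally handles token alignment.

```python
MISSING_VALUES = frozenset([0, None, ""])  # mirrors the documented default


def attr_accuracy(pred_values, gold_values, missing_values=MISSING_VALUES):
    """Accuracy over aligned tokens, skipping tokens whose gold value is missing."""
    correct = total = 0
    for pred, gold in zip(pred_values, gold_values):
        if gold in missing_values:
            continue  # no reference annotation, so the token is not scored
        total += 1
        if pred == gold:
            correct += 1
    return correct / total if total else None


gold_pos = ["DET", "NOUN", "", "ADJ"]     # third token has no gold annotation
pred_pos = ["DET", "VERB", "AUX", "ADJ"]
acc = attr_accuracy(pred_pos, gold_pos)
print(acc)
```

Only three tokens are scored here, so the unannotated token neither helps nor hurts the result.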

Scorer.score_token_attr_per_feat {id="score_token_attr_per_feat",tag="staticmethod",version="3"}

Scores a single token attribute per feature, where the attribute is in the Universal Dependencies FEATS format. Tokens with missing values in the reference doc are skipped during scoring.

Example

python
scores = Scorer.score_token_attr_per_feat(examples, "morph")
print(scores["morph_per_feat"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| attr | The attribute to score. str |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. Callable[[Token, str], Any] |
| missing_values | Attribute values to treat as missing annotation in the reference annotation. Defaults to {0, None, ""}. Set[Any] |
| RETURNS | A dictionary containing the micro PRF scores under the key {attr}_micro_p/r/f and the per-feature PRF scores under {attr}_per_feat. Dict[str, Dict[str, float]] |
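
As a rough illustration of per-feature scoring, the sketch below splits FEATS strings such as "Case=Nom|Number=Sing" into individual features and counts true positives, false positives and false negatives per feature name. It is plain Python under the assumption of one-to-one aligned tokens; `parse_feats` and `per_feat_counts` are hypothetical helpers, not part of spaCy.

```python
from collections import defaultdict


def parse_feats(feats):
    """Split a FEATS string such as 'Case=Nom|Number=Sing' into {feature: value}."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def per_feat_counts(pred_feats, gold_feats):
    """Count tp/fp/fn per feature name over aligned tokens."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, gold in zip(pred_feats, gold_feats):
        if not gold:
            continue  # missing reference annotation: token is skipped
        pred_d, gold_d = parse_feats(pred), parse_feats(gold)
        for feat in set(pred_d) | set(gold_d):
            if feat in pred_d and pred_d[feat] == gold_d.get(feat):
                counts[feat]["tp"] += 1
            else:
                if feat in pred_d:
                    counts[feat]["fp"] += 1  # predicted value is wrong or spurious
                if feat in gold_d:
                    counts[feat]["fn"] += 1  # gold value was not recovered
    return dict(counts)


gold = ["Case=Nom|Number=Sing", "Tense=Past"]
pred = ["Case=Acc|Number=Sing", "Tense=Past"]
counts = per_feat_counts(pred, gold)
print(counts)
```

Precision, recall and F-score per feature then follow from each feature's counts, and pooling the counts of all features gives the micro-averaged scores.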

Scorer.score_spans {id="score_spans",tag="staticmethod",version="3"}

Returns PRF scores for labeled or unlabeled spans.

Example

python
scores = Scorer.score_spans(examples, "ents")
print(scores["ents_f"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| attr | The attribute to score. str |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(doc, attr) should return the Span objects for an individual Doc. Callable[[Doc, str], Iterable[Span]] |
| has_annotation | Defaults to None. If provided, has_annotation(doc) should return whether a Doc has annotation for this attr. Docs without annotation are skipped for scoring purposes. Optional[Callable[[Doc], bool]] |
| labeled | Defaults to True. If set to False, two spans will be considered equal if their start and end match, irrespective of their label. bool |
| allow_overlap | Defaults to False. Whether or not to allow overlapping spans. If set to False, the alignment will automatically resolve conflicts. bool |
| RETURNS | A dictionary containing the PRF scores under the keys {attr}_p, {attr}_r, {attr}_f and the per-type PRF scores under {attr}_per_type. Dict[str, Union[float, Dict[str, float]]] |
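
The difference between labeled and unlabeled matching can be sketched in plain Python. Spans are represented here as (label, start, end) tuples, which is an assumption made for illustration; `span_scores` is a hypothetical helper and does not reproduce spaCy's alignment or per-type scoring.

```python
def span_scores(pred_spans, gold_spans, labeled=True):
    """PRF over spans given as (label, start, end) tuples."""

    def key(span):
        label, start, end = span
        # Unlabeled matching compares boundaries only.
        return (label, start, end) if labeled else (start, end)

    pred = {key(span) for span in pred_spans}
    gold = {key(span) for span in gold_spans}
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


gold = [("PERSON", 0, 2), ("ORG", 5, 6)]
pred = [("PERSON", 0, 2), ("GPE", 5, 6)]  # second span has the wrong label
labeled_scores = span_scores(pred, gold, labeled=True)
unlabeled_scores = span_scores(pred, gold, labeled=False)
print(labeled_scores, unlabeled_scores)
```

With labeled=False the mislabeled span still counts as a match because its boundaries agree with the gold span.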

Scorer.score_deps {id="score_deps",tag="staticmethod",version="3"}

Calculate the UAS, LAS, and LAS per type scores for dependency parses. Tokens with missing values for the attr (typically dep) are skipped during scoring.

Example

python
def dep_getter(token, attr):
    dep = getattr(token, attr)
    dep = token.vocab.strings.as_string(dep).lower()
    return dep

scores = Scorer.score_deps(
    examples,
    "dep",
    getter=dep_getter,
    ignore_labels=("p", "punct")
)
print(scores["dep_uas"], scores["dep_las"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| attr | The attribute to score. str |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. Callable[[Token, str], Any] |
| head_attr | The attribute containing the head token. str |
| head_getter | Defaults to getattr. If provided, head_getter(token, attr) should return the head for an individual Token. Callable[[Token, str], Token] |
| ignore_labels | Labels to ignore while scoring (e.g. "punct"). Iterable[str] |
| missing_values | Attribute values to treat as missing annotation in the reference annotation. Defaults to {0, None, ""}. Set[Any] |
| RETURNS | A dictionary containing the scores: {attr}_uas, {attr}_las, and {attr}_las_per_type. Dict[str, Union[float, Dict[str, float]]] |
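
For readers unfamiliar with the two attachment scores, the following plain-Python sketch shows how UAS and LAS relate. It assumes aligned tokens represented as (head_index, dep_label) pairs; the function name `uas_las` and the sample parse are hypothetical, and the sketch reports simple ratios rather than reproducing spaCy's internal bookkeeping.

```python
def uas_las(pred, gold, ignore_labels=()):
    """pred and gold are lists of (head_index, dep_label), one entry per token."""
    ignore = {label.lower() for label in ignore_labels}
    scored = uas_hits = las_hits = 0
    for (pred_head, pred_dep), (gold_head, gold_dep) in zip(pred, gold):
        if not gold_dep:
            continue  # missing gold annotation: token is skipped
        if gold_dep.lower() in ignore:
            continue  # e.g. punctuation is excluded from scoring
        scored += 1
        if pred_head == gold_head:
            uas_hits += 1  # correct head, label not considered
            if pred_dep.lower() == gold_dep.lower():
                las_hits += 1  # correct head and correct label
    if not scored:
        return None, None
    return uas_hits / scored, las_hits / scored


# "She reads books ." with token indices 0..3; the root points to itself.
gold = [(1, "nsubj"), (1, "ROOT"), (1, "dobj"), (1, "punct")]
pred = [(1, "nsubj"), (1, "ROOT"), (1, "nmod"), (2, "punct")]
uas, las = uas_las(pred, gold, ignore_labels=("p", "punct"))
print(uas, las)
```

The wrongly attached punctuation token does not lower UAS because "punct" is in ignore_labels, while the mislabeled object lowers LAS only.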

Scorer.score_cats {id="score_cats",tag="staticmethod",version="3"}

Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict containing scores for each label like Doc.cats. The returned dictionary contains the following scores:

  • {attr}_micro_p, {attr}_micro_r and {attr}_micro_f: each instance across each label is weighted equally
  • {attr}_macro_p, {attr}_macro_r and {attr}_macro_f: the average values across evaluations per label
  • {attr}_f_per_type and {attr}_auc_per_type: each contains a dictionary of scores, keyed by label
  • A final {attr}_score and corresponding {attr}_score_desc (text description)

The reported {attr}_score depends on the classification properties:

  • binary exclusive with positive label: {attr}_score is set to the F-score of the positive label
  • 3+ exclusive classes, macro-averaged F-score: {attr}_score = {attr}_macro_f
  • multilabel, macro-averaged AUC: {attr}_score = {attr}_macro_auc

Example

python
labels = ["LABEL_A", "LABEL_B", "LABEL_C"]
scores = Scorer.score_cats(
    examples,
    "cats",
    labels=labels
)
print(scores["cats_macro_auc"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| attr | The attribute to score. str |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(doc, attr) should return the cats for an individual Doc. Callable[[Doc, str], Dict[str, float]] |
| labels | The set of possible labels. Defaults to []. Iterable[str] |
| multi_label | Whether the attribute allows multiple labels. Defaults to True. When set to False (exclusive labels), missing gold labels are interpreted as 0.0 and the threshold is set to 0.0. bool |
| positive_label | The positive label for a binary task with exclusive classes. Defaults to None. Optional[str] |
| threshold | Cutoff to consider a prediction "positive". Defaults to 0.5 for multi-label, and 0.0 (i.e. whatever's highest scoring) otherwise. float |
| RETURNS | A dictionary containing the scores, with inapplicable scores as None. Dict[str, Optional[float]] |
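
The distinction between micro and macro averaging described above can be shown with a short plain-Python sketch. The per-label counts are invented purely for illustration, and `prf` and `micro_macro_f` are hypothetical helpers rather than spaCy functions.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro_f(counts):
    """counts maps each label to a (tp, fp, fn) tuple."""
    # Micro: pool the counts of all labels, then compute a single score.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro_f = prf(*totals)[2]
    # Macro: compute one score per label, then take the unweighted mean.
    per_label_f = [prf(*c)[2] for c in counts.values()]
    macro_f = sum(per_label_f) / len(per_label_f)
    return micro_f, macro_f


# Illustrative counts only: a frequent, well-predicted label and a rare, poorly predicted one.
counts = {"LABEL_A": (8, 2, 2), "LABEL_B": (1, 1, 3)}
micro_f, macro_f = micro_macro_f(counts)
print(micro_f, macro_f)
```

The micro average is dominated by the frequent label, while the macro average gives the rare label equal weight and is therefore lower in this example.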

Scorer.score_links {id="score_links",tag="staticmethod",version="3"}

Returns PRF for predicted links on the entity level. To disentangle the performance of the NEL from the NER, this method only evaluates NEL links for entities that overlap between the gold reference and the predictions.

Example

python
scores = Scorer.score_links(
    examples,
    negative_labels=["NIL", ""]
)
print(scores["nel_micro_f"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| keyword-only | |
| negative_labels | The string values that refer to no annotation (e.g. "NIL"). Iterable[str] |
| RETURNS | A dictionary containing the scores. Dict[str, Optional[float]] |
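
The idea of scoring links only on entities shared by gold and prediction can be sketched as follows. This is one plausible way to count, written in plain Python under several assumptions: entities are keyed by (start, end) offsets, "overlap" is taken to mean identical offsets, and the KB IDs are made up. `link_counts` is a hypothetical helper and may differ from spaCy's exact counting rules.

```python
def link_counts(pred_links, gold_links, negative_labels=("NIL", "")):
    """pred_links and gold_links map (start, end) entity offsets to a KB ID."""
    negative = set(negative_labels)
    tp = fp = fn = 0
    # Only entities present in both gold and prediction are evaluated,
    # so NER errors do not leak into the linking score.
    for span in set(pred_links) & set(gold_links):
        pred_kb, gold_kb = pred_links[span], gold_links[span]
        pred_neg, gold_neg = pred_kb in negative, gold_kb in negative
        if pred_neg and gold_neg:
            continue  # both sides say "no link": nothing to count
        if pred_kb == gold_kb:
            tp += 1
        else:
            if not pred_neg:
                fp += 1  # predicted a link that is wrong or unwanted
            if not gold_neg:
                fn += 1  # the gold link was not recovered
    return tp, fp, fn


gold = {(0, 2): "Q42", (5, 6): "Q64", (9, 10): "NIL"}
pred = {(0, 2): "Q42", (5, 6): "Q90", (12, 13): "Q1"}
tp, fp, fn = link_counts(pred, gold)
print(tp, fp, fn)
```

The entities at (9, 10) and (12, 13) appear on one side only and are therefore left out of the linking counts.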

get_ner_prf {id="get_ner_prf",version="3"}

Compute micro-PRF and per-entity PRF scores.

| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |

score_coref_clusters {id="score_coref_clusters",tag="experimental"}

Returns LEA (Moosavi and Strube, 2016) PRF scores for coreference clusters.

<Infobox title="Important note" variant="warning">

Note this scoring function is not yet included in spaCy core - for details, see the CoreferenceResolver docs.

</Infobox>

Example

python
scores = score_coref_clusters(
    examples,
    span_cluster_prefix="coref_clusters",
)
print(scores["coref_f"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| keyword-only | |
| span_cluster_prefix | The prefix used for spans representing coreference clusters. str |
| RETURNS | A dictionary containing the scores. Dict[str, Optional[float]] |
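
For orientation, here is a simplified plain-Python sketch of the LEA metric. It weights each cluster by its size and measures the fraction of its coreference links that are recovered. The sketch assumes every cluster has at least two mentions (singleton handling is omitted) and represents mentions as plain strings; the function names are hypothetical and this is not the scorer's actual implementation.

```python
def link_count(n):
    """Number of coreference links among n mentions."""
    return n * (n - 1) // 2


def lea_side(keys, responses):
    """Size-weighted fraction of links in `keys` that are found in `responses`."""
    numerator = denominator = 0.0
    for key in keys:
        resolved = sum(link_count(len(key & resp)) for resp in responses)
        numerator += len(key) * resolved / link_count(len(key))
        denominator += len(key)
    return numerator / denominator if denominator else 0.0


def lea(pred_clusters, gold_clusters):
    """LEA precision, recall and F-score for clusters given as sets of mentions."""
    recall = lea_side(gold_clusters, pred_clusters)
    precision = lea_side(pred_clusters, gold_clusters)
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = [{"a", "b", "c"}, {"d", "e"}]
pred = [{"a", "b"}, {"c", "d", "e"}]
p, r, f = lea(pred, gold)
print(p, r, f)
```

Recall evaluates gold clusters against the predictions, and precision swaps the two roles.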

score_span_predictions {id="score_span_predictions",tag="experimental"}

Returns accuracy for reconstructions of spans from single tokens. Only exactly correct predictions are counted as correct; there is no partial credit for near answers. Used by the SpanResolver.

<Infobox title="Important note" variant="warning">

Note this scoring function is not yet included in spaCy core - for details, see the SpanResolver docs.

</Infobox>

Example

python
scores = score_span_predictions(
    examples,
    output_prefix="coref_clusters",
)
print(scores["span_coref_clusters_accuracy"])
| Name | Description |
| --- | --- |
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example] |
| keyword-only | |
| output_prefix | The prefix used for spans representing the final predicted spans. str |
| RETURNS | A dictionary containing the scores. Dict[str, Optional[float]] |
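
The exact-match rule can be illustrated with a minimal plain-Python sketch. It assumes predicted and gold spans are paired up and given as (start, end) token offsets; `exact_span_accuracy` is a hypothetical helper, not the function used by the SpanResolver.

```python
def exact_span_accuracy(pred_spans, gold_spans):
    """Fraction of paired spans whose predicted boundaries match the gold span exactly."""
    if not gold_spans:
        return None
    correct = sum(pred == gold for pred, gold in zip(pred_spans, gold_spans))
    return correct / len(gold_spans)


gold = [(0, 3), (5, 6), (8, 10)]
pred = [(0, 3), (5, 7), (8, 10)]  # second span is off by one token: no credit
accuracy = exact_span_accuracy(pred, gold)
print(accuracy)
```

A span that is off by a single token counts as fully wrong, which is what "no partial credit" means in practice.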