
# Sentencizer

website/docs/api/sentencizer.mdx


A simple pipeline component to allow custom sentence boundary detection logic that doesn't require the dependency parse. By default, sentence segmentation is performed by the `DependencyParser`, so the `Sentencizer` lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded.

## Assigned Attributes {id="assigned-attributes"}

Calculated values will be assigned to `Token.is_sent_start`. The resulting sentences can be accessed using `Doc.sents`.

| Location | Value |
| -------- | ----- |
| `Token.is_sent_start` | A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens. `bool` |
| `Doc.sents` | An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values. `Iterator[Span]` |
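As a minimal sketch of how these attributes behave, a blank English pipeline with only the sentencizer sets the flag on every token:

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("Hello world. This is spaCy.")
# Every token gets an explicit True/False value
assert doc[0].is_sent_start is True   # "Hello" starts the first sentence
assert doc[3].is_sent_start is True   # "This" starts the second sentence
assert len(list(doc.sents)) == 2
```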

## Config and implementation {id="config"}

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on `nlp.add_pipe` or in your `config.cfg` for training.

#### Example

```python
config = {"punct_chars": None}
nlp.add_pipe("sentencizer", config=config)
```

| Setting | Description |
| ------- | ----------- |
| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. `Optional[List[str]]` |
| `overwrite` <Tag variant="new">3.2</Tag> | Whether existing annotation is overwritten. Defaults to `False`. `bool` |
| `scorer` <Tag variant="new">3.2</Tag> | The scoring method. Defaults to `Scorer.score_spans` for the attribute `"sents"`. `Optional[Callable]` |
```python
%%GITHUB_SPACY/spacy/pipeline/sentencizer.pyx
```

## Sentencizer.__init__ {id="init",tag="method"}

Initialize the sentencizer.

#### Example

```python
# Construction via add_pipe
sentencizer = nlp.add_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()
```

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. `Optional[List[str]]` |
| `overwrite` <Tag variant="new">3.2</Tag> | Whether existing annotation is overwritten. Defaults to `False`. `bool` |
| `scorer` <Tag variant="new">3.2</Tag> | The scoring method. Defaults to `Scorer.score_spans` for the attribute `"sents"`. `Optional[Callable]` |
If `punct_chars` is not set, the following default characters are used:

```python
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
 '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
 '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?',
 '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
 '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
 '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
 '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
 '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']
```
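Only characters in `punct_chars` trigger a boundary, so passing a custom list changes which punctuation ends a sentence. A quick sketch with a deliberately unusual list:

```python
from spacy.lang.en import English

nlp = English()
# Treat only semicolons as sentence-final punctuation
nlp.add_pipe("sentencizer", config={"punct_chars": [";"]})
doc = nlp("First clause; second clause. Still the same sentence.")
# The "." no longer ends a sentence, so only the ";" splits the text
assert len(list(doc.sents)) == 2
```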

## Sentencizer.__call__ {id="call",tag="method"}

Apply the sentencizer on a `Doc`. Typically, this happens automatically after the component has been added to the pipeline using `nlp.add_pipe`.

#### Example

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
```

| Name | Description |
| ---- | ----------- |
| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. `Doc` |
| **RETURNS** | The modified `Doc` with added sentence boundaries. `Doc` |

## Sentencizer.pipe {id="pipe",tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order.

#### Example

```python
sentencizer = nlp.add_pipe("sentencizer")
for doc in sentencizer.pipe(docs, batch_size=50):
    pass
```

| Name | Description |
| ---- | ----------- |
| `stream` | A stream of documents. `Iterable[Doc]` |
| _keyword-only_ | |
| `batch_size` | The number of documents to buffer. Defaults to `128`. `int` |
| **YIELDS** | The processed documents in order. `Doc` |

## Sentencizer.to_disk {id="to_disk",tag="method"}

Save the sentencizer settings (punctuation characters) to a directory. Will create a file `sentencizer.json`. This also happens automatically when you save an `nlp` object with a sentencizer added to its pipeline.

#### Example

```python
config = {"punct_chars": [".", "?", "!", "。"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
sentencizer.to_disk("/path/to/sentencizer.json")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. `Union[str, Path]` |

## Sentencizer.from_disk {id="from_disk",tag="method"}

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an `nlp` object or model with a sentencizer added to its pipeline.

#### Example

```python
sentencizer = nlp.add_pipe("sentencizer")
sentencizer.from_disk("/path/to/sentencizer.json")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. `Union[str, Path]` |
| **RETURNS** | The modified `Sentencizer` object. `Sentencizer` |
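Together with `Sentencizer.to_disk`, this lets custom settings survive a save/load cycle. A short sketch (using a temporary directory rather than a real path; the `punct_chars` attribute is assumed to hold the configured characters as a set):

```python
import tempfile
from pathlib import Path
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.add_pipe("sentencizer", config={"punct_chars": ["!"]})
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sentencizer.json"
    sentencizer.to_disk(path)
    # A fresh component picks up the saved punctuation list
    nlp2 = English()
    loaded = nlp2.add_pipe("sentencizer")
    loaded.from_disk(path)
assert loaded.punct_chars == {"!"}
```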

## Sentencizer.to_bytes {id="to_bytes",tag="method"}

Serialize the sentencizer settings to a bytestring.

#### Example

```python
config = {"punct_chars": [".", "?", "!", "。"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
sentencizer_bytes = sentencizer.to_bytes()
```

| Name | Description |
| ---- | ----------- |
| **RETURNS** | The serialized data. `bytes` |

## Sentencizer.from_bytes {id="from_bytes",tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

#### Example

```python
sentencizer_bytes = sentencizer.to_bytes()
sentencizer = nlp.add_pipe("sentencizer")
sentencizer.from_bytes(sentencizer_bytes)
```

| Name | Description |
| ---- | ----------- |
| `bytes_data` | The bytestring to load. `bytes` |
| **RETURNS** | The modified `Sentencizer` object. `Sentencizer` |