
# Tokenizer


Default config

```ini
[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"
```

Segment text, and create `Doc` objects with the discovered segment boundaries. For a deeper understanding, see the docs on how spaCy's tokenizer works. The tokenizer is typically created automatically when a `Language` subclass is initialized, and it reads its settings, such as punctuation and special-case rules, from the `Language.Defaults` provided by the language subclass.

## Tokenizer.\_\_init\_\_ {id="init",tag="method"}

Create a Tokenizer to create Doc objects given unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the usage documentation.

Example

```python
# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer
```
| Name | Description |
| ---- | ----------- |
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
| `rules` | Exceptions and special cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `faster_heuristics` <Tag variant="new">3.3.0</Tag> | Whether to restrict the final `Matcher`-based pass for rules to those containing affixes or space. Defaults to `True`. ~~bool~~ |
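Taken together, the regular-expression arguments allow a fully custom tokenizer. A minimal sketch, using simplified illustrative rules (not spaCy's defaults) that strip surrounding quotes and brackets and split on hyphens:

```python
import re

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
# Simplified, illustrative rules (not spaCy's defaults): strip a leading
# quote or bracket, strip a trailing quote, bracket or punctuation mark,
# and split on hyphens or tildes inside a token.
prefix_re = re.compile(r'''^["'(\[]''')
suffix_re = re.compile(r'''["')\].,!?]$''')
infix_re = re.compile(r'''[-~]''')

tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)
doc = tokenizer('"best-known"')
print([t.text for t in doc])  # ['"', 'best', '-', 'known', '"']
```

The affixes are peeled off first, then the remaining substring is split at each infix match, which is why the quotes come out as separate tokens here.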

## Tokenizer.\_\_call\_\_ {id="call",tag="method"}

Tokenize a string.

Example

```python
tokens = tokenizer("This is a sentence")
assert len(tokens) == 4
```

| Name | Description |
| ---- | ----------- |
| `string` | The string to tokenize. ~~str~~ |
| **RETURNS** | A container for linguistic annotations. ~~Doc~~ |

## Tokenizer.pipe {id="pipe",tag="method"}

Tokenize a stream of texts.

Example

```python
texts = ["One document.", "...", "Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
```

| Name | Description |
| ---- | ----------- |
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |

## Tokenizer.find_infix {id="find_infix",tag="method"}

Find internal split points of the string.

| Name | Description |
| ---- | ----------- |
| `string` | The string to split. ~~str~~ |
| **RETURNS** | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. ~~List[Match]~~ |
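A short sketch of how this might look, assuming a blank English pipeline whose default infix rules include a hyphen between letters:

```python
from spacy.lang.en import English

nlp = English()
# The default English infix rules match a hyphen between letters,
# so find_infix returns a match spanning just the "-" character.
matches = nlp.tokenizer.find_infix("well-being")
print([(m.start(), m.end(), m.group()) for m in matches])
```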

## Tokenizer.find_prefix {id="find_prefix",tag="method"}

Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.

| Name | Description |
| ---- | ----------- |
| `string` | The string to segment. ~~str~~ |
| **RETURNS** | The length of the prefix if present, otherwise `None`. ~~Optional[int]~~ |
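For instance, a sketch assuming a blank English pipeline, whose default punctuation rules treat `(` as a prefix:

```python
from spacy.lang.en import English

nlp = English()
# "(" is a single-character prefix under the default English rules,
# so the matched prefix has length 1.
print(nlp.tokenizer.find_prefix("(hello"))  # 1
```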

## Tokenizer.find_suffix {id="find_suffix",tag="method"}

Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.

| Name | Description |
| ---- | ----------- |
| `string` | The string to segment. ~~str~~ |
| **RETURNS** | The length of the suffix if present, otherwise `None`. ~~Optional[int]~~ |
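Analogously to `find_prefix`, a sketch assuming a blank English pipeline, whose default punctuation rules treat `!` as a suffix:

```python
from spacy.lang.en import English

nlp = English()
# "!" is a single-character suffix under the default English rules,
# so the matched suffix has length 1.
print(nlp.tokenizer.find_suffix("hello!"))  # 1
```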

## Tokenizer.add_special_case {id="add_special_case",tag="method"}

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on the languages data and tokenizer special cases for more details and examples.

Example

```python
from spacy.attrs import ORTH, NORM
case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
tokenizer.add_special_case("don't", case)
```

| Name | Description |
| ---- | ----------- |
| `string` | The string to specially tokenize. ~~str~~ |
| `token_attrs` | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. ~~Iterable[Dict[int, str]]~~ |

## Tokenizer.explain {id="explain",tag="method"}

Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The tokens produced are identical to Tokenizer.__call__ except for whitespace tokens.

Example

```python
tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
```

| Name | Description |
| ---- | ----------- |
| `string` | The string to tokenize with the debugging tokenizer. ~~str~~ |
| **RETURNS** | A list of `(pattern_string, token_string)` tuples. ~~List[Tuple[str, str]]~~ |

## Tokenizer.to_disk {id="to_disk",tag="method"}

Serialize the tokenizer to disk.

Example

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.to_disk("/path/to/tokenizer")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |

## Tokenizer.from_disk {id="from_disk",tag="method"}

Load the tokenizer from disk. Modifies the object in place and returns it.

Example

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_disk("/path/to/tokenizer")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `Tokenizer` object. ~~Tokenizer~~ |

## Tokenizer.to_bytes {id="to_bytes",tag="method"}

Serialize the tokenizer to a bytestring.

Example

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer_bytes = tokenizer.to_bytes()
```

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `Tokenizer` object. ~~bytes~~ |

## Tokenizer.from_bytes {id="from_bytes",tag="method"}

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Example

```python
tokenizer_bytes = tokenizer.to_bytes()
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_bytes(tokenizer_bytes)
```

| Name | Description |
| ---- | ----------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `Tokenizer` object. ~~Tokenizer~~ |

## Attributes {id="attributes"}

| Name | Description |
| ---- | ----------- |
| `vocab` | The vocab object of the parent `Doc`. ~~Vocab~~ |
| `prefix_search` | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `suffix_search` | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |

## Serialization fields {id="serialization-fields"}

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

```python
data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
tokenizer.from_disk("./data", exclude=["token_match"])
```

| Name | Description |
| ---- | ----------- |
| `vocab` | The shared `Vocab`. |
| `prefix_search` | The prefix rules. |
| `suffix_search` | The suffix rules. |
| `infix_finditer` | The infix rules. |
| `token_match` | The token match expression. |
| `exceptions` | The tokenizer exception rules. |