
# Tokenizer


Default config

```ini
[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"
```

Segment text, and create `Doc` objects with the discovered segment boundaries. For a deeper understanding, see the docs on how spaCy's tokenizer works. The tokenizer is typically created automatically when a `Language` subclass is initialized, and it reads its settings, such as punctuation and special-case rules, from the `Language.Defaults` provided by the language subclass.

## Tokenizer.\_\_init\_\_ {id="init",tag="method"}

Create a Tokenizer to create Doc objects given unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the usage documentation.

Example

```python
# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer
```
| Name | Description |
| ---- | ----------- |
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
| `rules` | Exceptions and special cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `faster_heuristics` <Tag variant="new">3.3.0</Tag> | Whether to restrict the final `Matcher`-based pass for rules to those containing affixes or space. Defaults to `True`. ~~bool~~ |
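Taken together, the regular-expression arguments allow a fully custom tokenizer. A minimal sketch, using simplified illustrative rules (not spaCy's defaults) that strip surrounding quotes and brackets and split on hyphens:

```python
import re

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
# Simplified, illustrative rules (not spaCy's defaults): strip a leading
# quote or bracket, strip a trailing quote, bracket or punctuation mark,
# and split on hyphens or tildes inside a token.
prefix_re = re.compile(r'''^["'(\[]''')
suffix_re = re.compile(r'''["')\].,!?]$''')
infix_re = re.compile(r'''[-~]''')

tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)
doc = tokenizer('"best-known"')
print([t.text for t in doc])  # ['"', 'best', '-', 'known', '"']
```

The affixes are peeled off first, then the remaining substring is split at each infix match, which is why the quotes come out as separate tokens here.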

## Tokenizer.\_\_call\_\_ {id="call",tag="method"}

Tokenize a string.

Example

```python
tokens = tokenizer("This is a sentence")
assert len(tokens) == 4
```

| Name | Description |
| ---- | ----------- |
| `string` | The string to tokenize. ~~str~~ |
| **RETURNS** | A container for linguistic annotations. ~~Doc~~ |

## Tokenizer.pipe {id="pipe",tag="method"}

Tokenize a stream of texts.

Example

```python
texts = ["One document.", "...", "Lots of documents"]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass
```

| Name | Description |
| ---- | ----------- |
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |

## Tokenizer.find_infix {id="find_infix",tag="method"}

Find internal split points of the string.

| Name | Description |
| ---- | ----------- |
| `string` | The string to split. ~~str~~ |
| **RETURNS** | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. ~~List[Match]~~ |
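A short sketch of how this might look, assuming a blank English pipeline whose default infix rules include a hyphen between letters:

```python
from spacy.lang.en import English

nlp = English()
# The default English infix rules match a hyphen between letters,
# so find_infix returns a match spanning just the "-" character.
matches = nlp.tokenizer.find_infix("well-being")
print([(m.start(), m.end(), m.group()) for m in matches])
```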

## Tokenizer.find_prefix {id="find_prefix",tag="method"}

Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.

| Name | Description |
| ---- | ----------- |
| `string` | The string to segment. ~~str~~ |
| **RETURNS** | The length of the prefix if present, otherwise `None`. ~~Optional[int]~~ |
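For instance, a sketch assuming a blank English pipeline, whose default punctuation rules treat `(` as a prefix:

```python
from spacy.lang.en import English

nlp = English()
# "(" is a single-character prefix under the default English rules,
# so the matched prefix has length 1.
print(nlp.tokenizer.find_prefix("(hello"))  # 1
```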

## Tokenizer.find_suffix {id="find_suffix",tag="method"}

Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.

| Name | Description |
| ---- | ----------- |
| `string` | The string to segment. ~~str~~ |
| **RETURNS** | The length of the suffix if present, otherwise `None`. ~~Optional[int]~~ |
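Analogously to `find_prefix`, a sketch assuming a blank English pipeline, whose default punctuation rules treat `!` as a suffix:

```python
from spacy.lang.en import English

nlp = English()
# "!" is a single-character suffix under the default English rules,
# so the matched suffix has length 1.
print(nlp.tokenizer.find_suffix("hello!"))  # 1
```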

## Tokenizer.add_special_case {id="add_special_case",tag="method"}

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on the languages data and tokenizer special cases for more details and examples.

Example

```python
from spacy.attrs import ORTH, NORM
case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
tokenizer.add_special_case("don't", case)
```

| Name | Description |
| ---- | ----------- |
| `string` | The string to specially tokenize. ~~str~~ |
| `token_attrs` | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. ~~Iterable[Dict[int, str]]~~ |

## Tokenizer.explain {id="explain",tag="method"}

Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The tokens produced are identical to Tokenizer.__call__ except for whitespace tokens.

Example

```python
tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
```

| Name | Description |
| ---- | ----------- |
| `string` | The string to tokenize with the debugging tokenizer. ~~str~~ |
| **RETURNS** | A list of `(pattern_string, token_string)` tuples. ~~List[Tuple[str, str]]~~ |

## Tokenizer.to_disk {id="to_disk",tag="method"}

Serialize the tokenizer to disk.

Example

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.to_disk("/path/to/tokenizer")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |

## Tokenizer.from_disk {id="from_disk",tag="method"}

Load the tokenizer from disk. Modifies the object in place and returns it.

Example

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_disk("/path/to/tokenizer")
```

| Name | Description |
| ---- | ----------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `Tokenizer` object. ~~Tokenizer~~ |

## Tokenizer.to_bytes {id="to_bytes",tag="method"}

Serialize the tokenizer to a bytestring.

Example

```python
tokenizer = Tokenizer(nlp.vocab)
tokenizer_bytes = tokenizer.to_bytes()
```

| Name | Description |
| ---- | ----------- |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `Tokenizer` object. ~~bytes~~ |

## Tokenizer.from_bytes {id="from_bytes",tag="method"}

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Example

```python
tokenizer_bytes = tokenizer.to_bytes()
tokenizer = Tokenizer(nlp.vocab)
tokenizer.from_bytes(tokenizer_bytes)
```

| Name | Description |
| ---- | ----------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of serialization fields to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `Tokenizer` object. ~~Tokenizer~~ |

## Attributes {id="attributes"}

| Name | Description |
| ---- | ----------- |
| `vocab` | The vocab object of the parent `Doc`. ~~Vocab~~ |
| `prefix_search` | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `suffix_search` | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ |
| `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |

## Serialization fields {id="serialization-fields"}

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

```python
data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
tokenizer.from_disk("./data", exclude=["token_match"])
```

| Name | Description |
| ---- | ----------- |
| `vocab` | The shared `Vocab`. |
| `prefix_search` | The prefix rules. |
| `suffix_search` | The suffix rules. |
| `infix_finditer` | The infix rules. |
| `token_match` | The token match expression. |
| `exceptions` | The tokenizer exception rules. |