Tokenizers
==========

Fast state-of-the-art tokenizers, optimized for both research and production

`🤗 Tokenizers`_ provides an implementation of today's most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in `🤗 Transformers`_.

.. _🤗 Tokenizers: https://github.com/huggingface/tokenizers
.. _🤗 Transformers: https://github.com/huggingface/transformers
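
As a quick illustration of the Python binding, here is a minimal sketch that loads a pretrained tokenizer definition from the Hugging Face Hub and encodes a sentence. The model name ``bert-base-uncased`` is only an example, and the call assumes network access to the Hub.

.. code-block:: python

    from tokenizers import Tokenizer

    # Fetch the tokenizer definition shipped with a Hub model
    # ("bert-base-uncased" is an example identifier; requires network access).
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    # Encode a sentence: tokens, ids, and offsets are all available.
    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)
    print(encoding.ids)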

Main features:

* Train new vocabularies and tokenize, using today's most used tokenizers.
* Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
* Easy to use, but also extremely versatile.
* Designed for both research and production.
* Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
* Does all the pre-processing: truncation, padding, and adding the special tokens your model needs (see the sketch after this list).
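
To make these features concrete, the sketch below uses the Python binding to train a small BPE tokenizer from scratch, enable truncation and padding, and read back the offsets that map each token to the original text. The training file ``data.txt`` is a placeholder for any plain-text corpus.

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Build and train a BPE tokenizer ("data.txt" is a placeholder corpus).
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
    tokenizer.train(files=["data.txt"], trainer=trainer)

    # Pre-processing: truncate long inputs and pad short ones.
    tokenizer.enable_truncation(max_length=512)
    tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))

    # Alignment tracking: each token carries (start, end) offsets into the input.
    sentence = "Hello, y'all!"
    encoding = tokenizer.encode(sentence)
    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        print(token, "->", sentence[start:end])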

.. toctree::
    :maxdepth: 2
    :caption: Getting Started

    quicktour
    installation/main
    pipeline
    components

.. toctree-tags::
    :maxdepth: 3
    :caption: Using 🤗 Tokenizers
    :glob:

    :python:tutorials/python/*

.. toctree::
    :maxdepth: 3
    :caption: API Reference

    api/reference

.. include:: entities.inc