PACKAGE

src/Microsoft.ML.Tokenizers.Data.Cl100kBase/PACKAGE.md

About

The Microsoft.ML.Tokenizers.Data.Cl100kBase package includes the Tiktoken tokenizer data file cl100k_base.tiktoken, which is used by models such as GPT-4.

Key Features

  • This package mainly contains the cl100k_base.tiktoken file, which is used by the Tiktoken tokenizer. This data file is used by the following models:
      1. gpt-4
      2. gpt-3.5-turbo
      3. gpt-3.5-turbo-16k
      4. gpt-35
      5. gpt-35-turbo
      6. gpt-35-turbo-16k
      7. text-embedding-ada-002
      8. text-embedding-3-small
      9. text-embedding-3-large

How to Use

Reference this package in your project to use the Tiktoken tokenizer with the specified models.

csharp

using Microsoft.ML.Tokenizers;

// Create a tokenizer for a specific model; any of the model names listed above works
Tokenizer modelTokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

// Alternatively, create a tokenizer directly from the encoding name
Tokenizer encodingTokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");
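
Once a tokenizer has been created, a typical next step is to encode text, count tokens, and decode the result. The following is a minimal sketch rather than official guidance. It assumes the project references both Microsoft.ML.Tokenizers and this data package, and that the installed Microsoft.ML.Tokenizers version exposes EncodeToIds, CountTokens, and Decode on the Tokenizer base class. The sample string is an arbitrary illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumption: Microsoft.ML.Tokenizers and Microsoft.ML.Tokenizers.Data.Cl100kBase
// are both referenced, so the cl100k_base data can be resolved at run time.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

// Arbitrary sample input, chosen only for illustration.
string text = "Hello, tokenizer!";

// Encode the text into cl100k_base token IDs.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Count tokens without keeping the ID list, for example to check a prompt budget.
int count = tokenizer.CountTokens(text);

// Decode the IDs back into text.
string roundTrip = tokenizer.Decode(ids);

Console.WriteLine($"Token count: {count}");
Console.WriteLine($"IDs: {string.Join(", ", ids)}");
Console.WriteLine($"Decoded: {roundTrip}");
```

CountTokens is usually the better choice when only the length matters, since it avoids building the list of IDs.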

Main Types

Users shouldn't use any types exposed by this package directly. This package is intended to provide tokenizer data files.

Additional Documentation

Microsoft.ML.Tokenizers

Feedback & Contributing

Microsoft.ML.Tokenizers.Data.Cl100kBase is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.