| Title: | A Byte-Pair-Encoding (BPE) Tokenizer for OpenAI's Large Language Models |
|---|---|
| Description: | A thin wrapper around the tiktoken-rs crate that encodes text into Byte-Pair-Encoding (BPE) tokens and decodes tokens back to text. This is useful for understanding how Large Language Models (LLMs) perceive text. |
| Authors: | David Zimmermann-Kollenda [aut, cre], Roger Zurawicki [aut] (tiktoken-rs Rust library), Authors of the dependent Rust crates [aut] (see AUTHORS file) |
| Maintainer: | David Zimmermann-Kollenda <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.11.0-1 |
| Built: | 2026-05-29 19:33:37 UTC |
| Source: | https://github.com/davzim/rtiktoken |
Decodes tokens back to text
decode_tokens(tokens, model)

| tokens | a vector of tokens to decode, or a list of tokens |
|---|---|
| model | a model to use for tokenization, either a model name, e.g., |

a character string of the decoded tokens, or a vector of strings
model_to_tokenizer(), get_tokens()
tokens <- get_tokens("Hello World", "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")
tokens <- get_tokens(c("Hello World", "Alice Bob Charlie"), "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")
Returns the number of tokens in a text
get_token_count(text, model)

| text | a character string to encode to tokens, can be a vector |
|---|---|
| model | a model to use for tokenization, either a model name, e.g., |

the number of tokens in the text, as a vector of integers
model_to_tokenizer(), get_tokens()
get_token_count("Hello World", "gpt-4o")
get_token_count("Hello World", "gpt-5.3")
get_token_count("Hello World", "text-embedding-3-small")
Converts text to tokens
get_tokens(text, model)

| text | a character string to encode to tokens, can be a vector |
|---|---|
| model | a model to use for tokenization, either a model name, e.g., |

an integer vector of tokens for the given text
model_to_tokenizer(), decode_tokens()
get_tokens("Hello World", "gpt-4o")
get_tokens("Hello World", "o200k_base")
get_tokens("Hello World", "gpt-5.")
get_tokens("Hello World", "text-embedding-3-small")
Gets the name of the tokenizer used by a model
model_to_tokenizer(model)

| model | the model to use, e.g., |
|---|---|

the tokenizer used by the model
model_to_tokenizer("gpt-4o")
model_to_tokenizer("gpt-4-1106-preview")
model_to_tokenizer("text-davinci-002")
model_to_tokenizer("text-embedding-ada-002")
model_to_tokenizer("text-embedding-3-small")