| | |
|---|---|
| Title: | A Byte-Pair-Encoding (BPE) Tokenizer for OpenAI's Large Language Models |
| Description: | A thin wrapper around the tiktoken-rs crate for encoding text into Byte-Pair-Encoding (BPE) tokens and decoding tokens back into text. This is useful for understanding how Large Language Models (LLMs) perceive text. |
| Authors: | David Zimmermann-Kollenda [aut, cre], Roger Zurawicki [aut] (tiktoken-rs Rust library), Authors of the dependent Rust crates [aut] (see AUTHORS file) |
| Maintainer: | David Zimmermann-Kollenda <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.6 |
| Built: | 2024-11-07 05:34:51 UTC |
| Source: | https://github.com/davzim/rtiktoken |
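As a quick sketch of the package's round trip (assuming rtiktoken is installed), text is encoded into integer token ids, counted, and decoded back:

```r
library(rtiktoken)

# Encode text into BPE token ids for a given model
tokens <- get_tokens("Hello World", "gpt-4o")

# Count tokens directly, e.g., to estimate the size of a prompt
n_tokens <- get_token_count("Hello World", "gpt-4o")

# Decode the token ids back into text
decode_tokens(tokens, "gpt-4o")
```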
decode_tokens(): Decodes tokens back to text

Usage:

```r
decode_tokens(tokens, model)
```

Arguments:

| Argument | Description |
|---|---|
| `tokens` | a vector of tokens to decode, or a list of tokens |
| `model` | a model to use for tokenization, either a model name, e.g., "gpt-4o", or a tokenizer name, e.g., "o200k_base" |

Value: a character string of the decoded tokens, or a vector of strings

See also: model_to_tokenizer(), get_tokens()

Examples:

```r
tokens <- get_tokens("Hello World", "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")

tokens <- get_tokens(c("Hello World", "Alice Bob Charlie"), "gpt-4o")
tokens
decode_tokens(tokens, "gpt-4o")
```
get_token_count(): Returns the number of tokens in a text

Usage:

```r
get_token_count(text, model)
```

Arguments:

| Argument | Description |
|---|---|
| `text` | a character string to encode to tokens; can be a vector |
| `model` | a model to use for tokenization, either a model name, e.g., "gpt-4o", or a tokenizer name, e.g., "o200k_base" |

Value: the number of tokens in the text, as a vector of integers

See also: model_to_tokenizer(), get_tokens()

Examples:

```r
get_token_count("Hello World", "gpt-4o")
```
get_tokens(): Converts text to tokens

Usage:

```r
get_tokens(text, model)
```

Arguments:

| Argument | Description |
|---|---|
| `text` | a character string to encode to tokens; can be a vector |
| `model` | a model to use for tokenization, either a model name, e.g., "gpt-4o", or a tokenizer name, e.g., "o200k_base" |

Value: a vector of integer tokens for the given text

See also: model_to_tokenizer(), decode_tokens()

Examples:

```r
get_tokens("Hello World", "gpt-4o")
get_tokens("Hello World", "o200k_base")
```
model_to_tokenizer(): Gets the name of the tokenizer used by a model

Usage:

```r
model_to_tokenizer(model)
```

Arguments:

| Argument | Description |
|---|---|
| `model` | the model to use, e.g., "gpt-4o" |

Value: the name of the tokenizer used by the model

Examples:

```r
model_to_tokenizer("gpt-4o")
model_to_tokenizer("gpt-4-1106-preview")
model_to_tokenizer("text-davinci-002")
model_to_tokenizer("text-embedding-ada-002")
model_to_tokenizer("text-embedding-3-small")
```