The problem
torchtune provides its own set of abstractions for tokenization and application of chat templates (details below). Most popular open-source models are uploaded to the Hugging Face hub using the format from `transformers`. This typically includes at least `tokenizer.json`, `tokenizer_config.json`, and `generation_config.json`. Details on each of these files and their contents are given below.
Consider Qwen3-4B as an example. Note that in this case, the hub upload contains `merges.txt` and `vocab.json` files, which can be used to build the torchtune tokenizer directly. But there are other cases, e.g. DeepSeek's Qwen 7B distill, where only the `tokenizer.json` file is provided. This means that in order to enable these models in general, we need an interface allowing us to build a torchtune tokenizer from an arbitrary Hugging Face hub upload.
Background
Tokenization in torchtune
In torchtune, we provide two levels of abstraction for logic related to tokenization and prompt/chat templating. First:
`BaseTokenizer`. Any base tokenizer should define `encode` and `decode` methods with the following signatures:
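Roughly the following (a sketch based on my reading of torchtune's `BaseTokenizer` protocol; exact kwargs and module paths may differ by version):

```python
from typing import Any, Dict, List, Protocol


class BaseTokenizer(Protocol):
    def encode(self, text: str, **kwargs: Dict[str, Any]) -> List[int]:
        """Convert a string into a list of token ids."""
        ...

    def decode(self, token_ids: List[int], **kwargs: Dict[str, Any]) -> str:
        """Convert a list of token ids back into a string."""
        ...
```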
Common examples of `BaseTokenizer`s include `TikTokenBaseTokenizer` and `SentencePieceBaseTokenizer`. Typically a new `BaseTokenizer` only needs to be implemented if the underlying tokenization algorithm itself changes. For changes that are specific to a given model family, prompting method, etc., we can instead rely on...
`ModelTokenizer`. Model tokenizers contain information about special tokens, templating of prompts, handling of media, etc. In most cases, when adding a new model family to torchtune, one needs to define a new `ModelTokenizer` (see e.g. `Phi4Tokenizer`). `ModelTokenizer`s typically contain some kind of `BaseTokenizer` as an instance variable, and any `ModelTokenizer` must define the `tokenize_messages` API. For example:
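A sketch of the contract (hedged; attribute names are per my reading of torchtune's protocol):

```python
from typing import Any, Dict, List, Optional, Protocol, Tuple

from torchtune.data import Message


class ModelTokenizer(Protocol):
    special_tokens: Dict[str, int]
    max_seq_len: Optional[int]

    def tokenize_messages(
        self, messages: List[Message], **kwargs: Dict[str, Any]
    ) -> Tuple[List[int], List[bool]]:
        """Tokenize messages into token ids plus a per-token mask
        (True where a token should not contribute to the loss)."""
        ...
```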
Hugging Face tokenizer specs
Most HF model hub uploads contain:
- `tokenizer.json`: a JSON representation of the actual tokenizer,
- `tokenizer_config.json`: a config file containing
  - special tokens and their properties,
  - configs related to tokenization logic (e.g. `clean_up_tokenization_spaces`, `split_special_tokens`, etc.),
  - `chat_template`, a Jinja template indicating how to format a conversation, including handling of special tokens and tool-calling logic, as used in e.g. `apply_chat_template` (since it is a Jinja template, the formatting can also be conditional),
  - some important standalone tokens like `bos_token`, `eos_token`, `pad_token`,
  - optionally other fields as well,
- `generation_config.json`: typically contains `bos_id`, `eos_id`, `pad_id`, plus some parameters that are only relevant for generation (i.e. we don't care about them here).
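For concreteness, a quick way to peek at these fields in a downloaded checkpoint (paths are illustrative; key names can vary across uploads):

```python
import json

with open("Qwen3-4B/tokenizer_config.json") as f:
    config = json.load(f)

print(config.get("bos_token"), config.get("eos_token"))  # standalone special tokens
print(config.get("chat_template", "")[:200])  # Jinja template; may be conditional

with open("Qwen3-4B/generation_config.json") as f:
    gen_config = json.load(f)

print(gen_config.get("bos_token_id"), gen_config.get("eos_token_id"))
```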
What we have today
Today, we have an initial integration with Hugging Face tokenizers via our `HuggingFaceBaseTokenizer`. This class reads the three files described above, builds a Hugging Face `Tokenizer`, and defines `encode` and `decode` methods that are compatible with torchtune's `BaseTokenizer` specs.
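Usage looks roughly like this (kwarg names mirror the constructor as I understand it; the import path may differ across versions):

```python
from torchtune.modules.transforms.tokenizers import HuggingFaceBaseTokenizer

tokenizer = HuggingFaceBaseTokenizer(
    tokenizer_json_path="Qwen3-4B/tokenizer.json",
    tokenizer_config_json_path="Qwen3-4B/tokenizer_config.json",
    generation_config_path="Qwen3-4B/generation_config.json",
)

ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids)
```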
Proposal
Define an analogue of `HuggingFaceBaseTokenizer` for `ModelTokenizer`s. Something like the following:
```python
from typing import List, Optional, Tuple

from torchtune.data import Message
# Import paths may vary by torchtune version.
from torchtune.modules.transforms.tokenizers import (
    HuggingFaceBaseTokenizer,
    ModelTokenizer,
)


class HuggingFaceModelTokenizer(ModelTokenizer):
    def __init__(
        self,
        tokenizer_json_path: str,
        *,
        tokenizer_config_json_path: Optional[str] = None,
        generation_config_path: Optional[str] = None,
    ):
        self.base_tokenizer = HuggingFaceBaseTokenizer(
            tokenizer_json_path=tokenizer_json_path,
            tokenizer_config_json_path=tokenizer_config_json_path,
            generation_config_path=generation_config_path,
        )
        # This is just the contents of tokenizer_config.json
        config = self.base_tokenizer.config
        # This function needs to be defined
        special_tokens = _infer_special_tokens_from_hf_config(config)
        # This one too
        self.template = _build_torchtune_template_from_hf_config(
            config, self.base_tokenizer, special_tokens
        )

    def tokenize_messages(
        self, messages: List[Message]
    ) -> Tuple[List[int], List[bool]]:
        # Placeholder; depends on the definition of
        # _build_torchtune_template_from_hf_config
        return self.template(messages)
```
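End to end, the flow would then be something like this (hypothetical, since the class doesn't exist yet):

```python
from torchtune.data import Message

tokenizer = HuggingFaceModelTokenizer(
    tokenizer_json_path="ckpt/tokenizer.json",
    tokenizer_config_json_path="ckpt/tokenizer_config.json",
    generation_config_path="ckpt/generation_config.json",
)

messages = [
    Message(role="user", content="What is 2 + 2?"),
    Message(role="assistant", content="4"),
]
tokens, mask = tokenizer.tokenize_messages(messages)
```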
The trickiest part here will be the definition of `_build_torchtune_template_from_hf_config`. This should parse the Jinja template from Hugging Face and convert it into a representation equivalent to the `tokenize_messages` API. This doesn't need to be done in full generality on the first pass (e.g. it is possible to constrain to a subset of templates and use heuristics, then build from there). Admittedly I'm a Jinja noob, so maybe there is actually a pretty easy way to do this.
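One heavily hedged starting point: rather than truly parsing the template, just re-render through `jinja2` and punt on the per-token mask. This only illustrates the intended I/O; `text_content` on `Message` and the `encode` defaults are assumptions here.

```python
import jinja2


def _build_torchtune_template_from_hf_config(config, base_tokenizer, special_tokens):
    # Compile the raw chat template shipped in tokenizer_config.json.
    template = jinja2.Environment().from_string(config["chat_template"])

    def apply(messages):
        rendered = template.render(
            messages=[{"role": m.role, "content": m.text_content} for m in messages],
            add_generation_prompt=False,
            **special_tokens,  # e.g. bos_token, eos_token
        )
        tokens = base_tokenizer.encode(rendered)
        # TODO: a correct mask needs to know which tokens came from which
        # message; rendering the whole conversation at once loses that.
        mask = [False] * len(tokens)
        return tokens, mask

    return apply
```

A real version would need to segment the render per message (or inspect the template AST via `jinja2.Environment().parse`) to recover message boundaries for the mask.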
Constraints and alternatives
One constraint here: we should be able to do this without actually taking a dependency on `transformers`. The `HuggingFaceBaseTokenizer` already takes a dependency on `tokenizers`, which is OK. We may also need to use Jinja to parse the chat template; that's fine as well.
One drawback of the proposed approach is that it requires the base tokenizer to be an instance of `HuggingFaceBaseTokenizer`, which necessitates building via the `tokenizer.json` file. As @joecummings pointed out, there are Hugging Face tokenizers that still don't support this; e.g. `Qwen2Tokenizer` builds from merges and vocab files. An alternative `__init__` signature could look like:
```python
import json
from typing import Optional

# Import paths may vary by torchtune version.
from torchtune.modules.transforms.tokenizers import BaseTokenizer, ModelTokenizer


class HuggingFaceModelTokenizer(ModelTokenizer):
    def __init__(
        self,
        base_tokenizer: BaseTokenizer,
        tokenizer_config_json_path: str,
        generation_config_path: Optional[str] = None,
    ):
        self.base_tokenizer = base_tokenizer
        with open(tokenizer_config_json_path, "rb") as f:
            config = json.load(f)
        # Remaining code the same as in the "Proposal" section
```
This gives us more flexibility, as we can now compose with any arbitrary `BaseTokenizer`. However, we need to take more care to ensure that any calls to the base tokenizer's `encode` and `decode` methods adhere strictly to the APIs we've defined (otherwise this can easily wind up being fake composability).
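For example (hypothetical wiring; `SentencePieceBaseTokenizer`'s `path` kwarg is my assumption of its signature):

```python
from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer

# Any BaseTokenizer can slot in here, not just the HF-backed one.
base = SentencePieceBaseTokenizer(path="ckpt/tokenizer.model")

tokenizer = HuggingFaceModelTokenizer(
    base_tokenizer=base,
    tokenizer_config_json_path="ckpt/tokenizer_config.json",
)
tokens, mask = tokenizer.tokenize_messages(messages)  # messages as above
```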