Integrate Hugging Face tokenizers with torchtune #2706

Open · ebsmothers opened this issue May 8, 2025 · 1 comment
Labels: community help wanted, high-priority

ebsmothers commented May 8, 2025

The problem

torchtune provides its own set of abstractions for tokenization and application of chat templates (details given below). Most popular open-source models are uploaded to the Hugging Face hub using the format from transformers. This typically includes at least: tokenizer.json, tokenizer_config.json, and generation_config.json. Details on each of these files and their contents are given below.

Consider Qwen3-4B as an example. Note that in this case, the hub upload contains merges.txt and vocab.json files, which can be used to build the torchtune tokenizer directly. But there are other cases, e.g. DeepSeek's Qwen 7B distill, where only the tokenizer.json file is provided. This means that in order to enable these models in general, we need an interface allowing us to build a torchtune tokenizer from an arbitrary Hugging Face hub upload.
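
For reference, the tokenizers library can already build a working tokenizer directly from tokenizer.json. A minimal sketch (the file path is a placeholder):

from tokenizers import Tokenizer

hf_tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")
token_ids = hf_tokenizer.encode("Hello, world!").ids  # encode returns an Encoding; .ids is List[int]
text = hf_tokenizer.decode(token_ids)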

Background

Tokenization in torchtune

In torchtune, we provide two levels of abstraction for logic related to tokenization and prompt/chat templating. First:

BaseTokenizer. Any base tokenizer should define encode and decode methods with the following signatures:

from typing import List

class MyBaseTokenizer(BaseTokenizer):

	def encode(self, text: str, **kwargs) -> List[int]:
		...

	def decode(self, token_ids: List[int], **kwargs) -> str:
		...

Common examples of BaseTokenizers include TikTokenBaseTokenizer and SentencePieceBaseTokenizer. Typically a new BaseTokenizer only needs to be implemented if the underlying tokenization algorithm itself changes. For changes that are specific to a given model family, prompting method, etc., we can instead rely on...
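
For illustration, a round trip through any concrete BaseTokenizer looks like the following (tok here is a hypothetical instance, e.g. a TikTokenBaseTokenizer):

token_ids = tok.encode("Hello, world!")  # -> List[int]
text = tok.decode(token_ids)  # -> str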

ModelTokenizer. Model tokenizers contain information about special tokens, templating of prompts, handling of media (e.g. images), etc. In most cases, when adding a new model family to torchtune, one needs to define a new ModelTokenizer (see e.g. Phi4Tokenizer). ModelTokenizers typically contain some kind of BaseTokenizer as an instance variable, and any ModelTokenizer must define the tokenize_messages API. For example:

from typing import List, Tuple

class MyModelTokenizer(ModelTokenizer):

	def __init__(self, base_tokenizer: BaseTokenizer):
		self.base_tokenizer = base_tokenizer
		self.between_messages_token = "|<some_madeup_special_token>|"

	def tokenize_messages(self, messages: List[Message]) -> Tuple[List[int], List[bool]]:
		tokenized_messages, mask = [], []
		for message in messages:
			# Tokenize message role and content
			tokenized_role = self.base_tokenizer.encode(message.role)
			tokenized_message = self.base_tokenizer.encode(message.content)

			# Some arbitrary formatting
			tokenized_separator = self.base_tokenizer.encode(self.between_messages_token)

			# Append tokenized message and extend the loss mask to match
			tokenized_formatted_message = tokenized_role + tokenized_message + tokenized_separator
			tokenized_messages += tokenized_formatted_message
			mask += [message.masked] * len(tokenized_formatted_message)

		return tokenized_messages, mask
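
A hypothetical usage of the above, assuming some concrete BaseTokenizer instance base_tok:

messages = [
	Message(role="user", content="What is torchtune?", masked=True),
	Message(role="assistant", content="A PyTorch library for LLM fine-tuning.", masked=False),
]
tokenizer = MyModelTokenizer(base_tokenizer=base_tok)
token_ids, mask = tokenizer.tokenize_messages(messages)
# mask is True for every token from the user message and False for the assistant tokens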

Hugging Face tokenizer specs

Most HF model hub uploads contain:

  • tokenizer.json: a JSON representation of the actual tokenizer,
  • tokenizer_config.json: a config file containing
    • special tokens and their properties,
    • configs related to tokenization logic (e.g. clean_up_tokenization_spaces, split_special_tokens, etc.)
    • chat_template, a Jinja template specifying how to format a conversation, including handling of special tokens and tool-calling logic, as used in e.g. apply_chat_template (since it is a Jinja template, the formatting can also be conditional; see the sketch after this list),
    • some important standalone tokens like bos_token, eos_token, pad_token,
    • optionally other fields as well,
  • generation_config.json: typically contains bos_id, eos_id, and pad_id, plus some parameters that are only relevant for generation (which we don't care about here)
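
A minimal sketch of pulling the relevant fields out of tokenizer_config.json (field names as described above; the file path is a placeholder):

import json

with open("tokenizer_config.json", "r") as f:
	config = json.load(f)

chat_template = config.get("chat_template")  # Jinja template string (may be absent)
bos_token = config.get("bos_token")  # may be a plain string or a dict of token properties
eos_token = config.get("eos_token")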

What we have today

Today, we have an initial integration with Hugging Face tokenizers via our HuggingFaceBaseTokenizer. This class reads the three files described above, builds a Hugging Face Tokenizer, and defines encode and decode methods that are compatible with torchtune's BaseTokenizer specs.
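
For example (a sketch; the exact import path may vary across torchtune versions, and the file paths are placeholders):

from torchtune.modules.transforms.tokenizers import HuggingFaceBaseTokenizer

base_tokenizer = HuggingFaceBaseTokenizer(
	tokenizer_json_path="/path/to/tokenizer.json",
	tokenizer_config_json_path="/path/to/tokenizer_config.json",
	generation_config_path="/path/to/generation_config.json",
)
token_ids = base_tokenizer.encode("Hello, world!")
text = base_tokenizer.decode(token_ids)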

Proposal

Define an analogue of HuggingFaceBaseTokenizer for ModelTokenizers. Something like the following:

class HuggingFaceModelTokenizer(ModelTokenizer):
	def __init__(
		self, 
		tokenizer_json_path: str,
		*,
		tokenizer_config_json_path: Optional[str] = None,
		generation_config_path: Optional[str] = None,
	):
		self.base_tokenizer = HuggingFaceBaseTokenizer(
			tokenizer_json_path=tokenizer_json_path,
			tokenizer_config_json_path=tokenizer_config_json_path,
			generation_config_path=generation_config_path,
		)

		# this is just the contents of tokenizer_config.json
		config = self.base_tokenizer.config

		# This function needs to be defined
		special_tokens = _infer_special_tokens_from_hf_config(config)

		# This one too
		self.template = _build_torchtune_template_from_hf_config(config, self.base_tokenizer, special_tokens)


	def tokenize_messages(self, messages: List[Message]) -> Tuple[List[int], List[bool]]:
		# This is just a placeholder and depends on the definition of _build_torchtune_template_from_hf_config
		return self.template(messages)


The trickiest part here will be the definition of _build_torchtune_template_from_hf_config. This should parse the Jinja template from Hugging Face and convert it into a representation equivalent to the tokenize_messages API. This doesn't need to be done in full generality on the first pass (e.g. it is possible to constrain to a subset of templates and use heuristics, then build from there). Admittedly I'm a Jinja noob so maybe there is actually a pretty easy way to do this.
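
One possible starting point (a sketch only, not the final design): render the template with jinja2 the same way apply_chat_template does, then encode the rendered string. This throws away per-message masks, so it only illustrates the parsing side; the variable names (messages, bos_token, eos_token, add_generation_prompt) follow the conventions HF chat templates expect:

from typing import Dict, List
from jinja2 import Environment

def render_chat_template(
	chat_template: str,
	messages: List[Dict[str, str]],
	bos_token: str,
	eos_token: str,
) -> str:
	# HF chat templates consume messages as a list of {"role": ..., "content": ...} dicts
	template = Environment().from_string(chat_template)
	return template.render(
		messages=messages,
		bos_token=bos_token,
		eos_token=eos_token,
		add_generation_prompt=False,
	)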

Constraints and alternatives

One constraint here: we should be able to do this without actually taking a dependency on transformers. The HuggingFaceBaseTokenizer already takes a dependency on tokenizers, which is OK, and we may need to use jinja2 to parse the chat template, which is fine as well.

One drawback of the proposed approach is that it requires the base tokenizer to be an instance of HuggingFaceBaseTokenizer, which necessitates building via the tokenizer.json file. As @joecummings pointed out, there are Hugging Face tokenizers that still don't support this, e.g. Qwen2Tokenizer builds with merges + vocab file. An alternative __init__ signature could look like:

class HuggingFaceModelTokenizer(ModelTokenizer):
	def __init__(
		self,
		base_tokenizer: BaseTokenizer,
		tokenizer_config_json_path: str,
		generation_config_path: Optional[str] = None,
	):
		self.base_tokenizer = base_tokenizer
		with open(tokenizer_config_json_path, "rb") as f:
			config = json.load(f)
		# remaining code the same as in the "proposal" section

This gives us more flexibility as we can now compose with any arbitrary BaseTokenizer. However, we need to take more care to ensure that any calls to the base tokenizer's encode and decode methods adhere strictly to the APIs we've defined (otherwise this can easily wind up being fake composability).

ebsmothers added the high-priority and community help wanted labels on May 9, 2025

krammnic commented May 9, 2025

Let me take this
