Integrate Hugging Face tokenizers with torchtune #2706

Open · ebsmothers opened this issue May 8, 2025 · 1 comment
Labels: community help wanted, high-priority

ebsmothers commented May 8, 2025

The problem

torchtune provides its own set of abstractions for tokenization and application of chat templates (details given below). Most popular open-source models are uploaded to the Hugging Face hub using the format from transformers. This typically includes at least: tokenizer.json, tokenizer_config.json, and generation_config.json. Details on each of these files and their contents are given below.

Consider Qwen3-4B as an example. Note that in this case, the hub upload contains merges.txt and vocab.json files, which can be used to build the torchtune tokenizer directly. But there are other cases, e.g. DeepSeek's Qwen 7B distill, where only the tokenizer.json file is provided. This means that in order to enable these models in general, we need an interface allowing us to build a torchtune tokenizer from an arbitrary Hugging Face hub upload.
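
For reference, the tokenizers library can already build a working tokenizer directly from tokenizer.json. A minimal sketch (the file path is a placeholder):

from tokenizers import Tokenizer

hf_tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")
token_ids = hf_tokenizer.encode("Hello, world!").ids  # encode returns an Encoding; .ids is List[int]
text = hf_tokenizer.decode(token_ids)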

Background

Tokenization in torchtune

In torchtune, we provide two levels of abstraction for logic related to tokenization and prompt/chat templating. First:

BaseTokenizer. Any base tokenizer should define encode and decode methods with the following signatures:

from typing import List

class MyBaseTokenizer(BaseTokenizer):

	def encode(self, text: str, **kwargs) -> List[int]:
		...

	def decode(self, token_ids: List[int], **kwargs) -> str:
		...

Common examples of BaseTokenizers include TikTokenBaseTokenizer and SentencePieceBaseTokenizer. Typically a new BaseTokenizer only needs to be implemented if the underlying tokenization algorithm itself changes. For changes that are specific to a given model family, prompting method, etc., we can instead rely on...
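
For illustration, a round trip through any concrete BaseTokenizer looks like the following (tok here is a hypothetical instance, e.g. a TikTokenBaseTokenizer):

token_ids = tok.encode("Hello, world!")  # -> List[int]
text = tok.decode(token_ids)  # -> str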

ModelTokenizer. Model tokenizers contain information about special tokens, templating of prompts, handling of media (e.g. images), etc. In most cases, when adding a new model family to torchtune, one needs to define a new ModelTokenizer (see e.g. Phi4Tokenizer). ModelTokenizers typically contain some kind of BaseTokenizer as an instance variable, and any ModelTokenizer must define the tokenize_messages API. For example:

from typing import List, Tuple

class MyModelTokenizer(ModelTokenizer):

	def __init__(self, base_tokenizer: BaseTokenizer):
		self.base_tokenizer = base_tokenizer
		self.between_messages_token = "|<some_madeup_special_token>|"

	def tokenize_messages(self, messages: List[Message]) -> Tuple[List[int], List[bool]]:
		tokenized_messages, mask = [], []
		for message in messages:
			# Tokenize message role and content
			tokenized_role = self.base_tokenizer.encode(message.role)
			tokenized_message = self.base_tokenizer.encode(message.content)

			# Some arbitrary formatting
			tokenized_separator = self.base_tokenizer.encode(self.between_messages_token)

			# Append tokenized message and extend the loss mask to match
			tokenized_formatted_message = tokenized_role + tokenized_message + tokenized_separator
			tokenized_messages += tokenized_formatted_message
			mask += [message.masked] * len(tokenized_formatted_message)

		return tokenized_messages, mask
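
A hypothetical usage of the above, assuming some concrete BaseTokenizer instance base_tok:

messages = [
	Message(role="user", content="What is torchtune?", masked=True),
	Message(role="assistant", content="A PyTorch library for LLM fine-tuning.", masked=False),
]
tokenizer = MyModelTokenizer(base_tokenizer=base_tok)
token_ids, mask = tokenizer.tokenize_messages(messages)
# mask is True for every token from the user message and False for the assistant tokens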

Hugging Face tokenizer specs

Most HF model hub uploads contain:

  • tokenizer.json: a JSON representation of the actual tokenizer,
  • tokenizer_config.json: a config file containing
    • special tokens and their properties,
    • configs related to tokenization logic (e.g. clean_up_tokenization_spaces, split_special_tokens, etc.)
    • chat_template, a Jinja template specifying how to format a conversation, including handling of special tokens and tool-calling logic, as used in e.g. apply_chat_template (since it is a Jinja template, the formatting can also be conditional; see the sketch after this list),
    • some important standalone tokens like bos_token, eos_token, pad_token,
    • optionally other fields as well,
  • generation_config.json: typically contains bos_id, eos_id, and pad_id, plus some parameters that are only relevant for generation (which we don't care about here)
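
A minimal sketch of pulling the relevant fields out of tokenizer_config.json (field names as described above; the file path is a placeholder):

import json

with open("tokenizer_config.json", "r") as f:
	config = json.load(f)

chat_template = config.get("chat_template")  # Jinja template string (may be absent)
bos_token = config.get("bos_token")  # may be a plain string or a dict of token properties
eos_token = config.get("eos_token")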

What we have today

Today, we have an initial integration with Hugging Face tokenizers via our HuggingFaceBaseTokenizer. This class reads the three files described above, builds a Hugging Face Tokenizer, and defines encode and decode methods that are compatible with torchtune's BaseTokenizer specs.
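
For example (a sketch; the exact import path may vary across torchtune versions, and the file paths are placeholders):

from torchtune.modules.transforms.tokenizers import HuggingFaceBaseTokenizer

base_tokenizer = HuggingFaceBaseTokenizer(
	tokenizer_json_path="/path/to/tokenizer.json",
	tokenizer_config_json_path="/path/to/tokenizer_config.json",
	generation_config_path="/path/to/generation_config.json",
)
token_ids = base_tokenizer.encode("Hello, world!")
text = base_tokenizer.decode(token_ids)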

Proposal

Define an analogue of HuggingFaceBaseTokenizer for ModelTokenizers. Something like the following:

class HuggingFaceModelTokenizer(ModelTokenizer):
	def __init__(
		self, 
		tokenizer_json_path: str,
		*,
		tokenizer_config_json_path: Optional[str] = None,
		generation_config_path: Optional[str] = None,
	):
		self.base_tokenizer = HuggingFaceBaseTokenizer(
			tokenizer_json_path=tokenizer_json_path,
			tokenizer_config_json_path=tokenizer_config_json_path,
			generation_config_path=generation_config_path,
		)

		# this is just the contents of tokenizer_config.json
		config = self.base_tokenizer.config

		# This function needs to be defined
		special_tokens = _infer_special_tokens_from_hf_config(config)

		# This one too
		self.template = _build_torchtune_template_from_hf_config(config, self.base_tokenizer, special_tokens)


	def tokenize_messages(self, messages: List[Message]) -> Tuple[List[int], List[bool]]:
		# This is just a placeholder and depends on the definition of _build_torchtune_template_from_hf_config
		return self.template(messages)


The trickiest part here will be the definition of _build_torchtune_template_from_hf_config. This should parse the Jinja template from Hugging Face and convert it into a representation equivalent to the tokenize_messages API. This doesn't need to be done in full generality on the first pass (e.g. it is possible to constrain to a subset of templates and use heuristics, then build from there). Admittedly I'm a Jinja noob so maybe there is actually a pretty easy way to do this.
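
One possible starting point (a sketch only, not the final design): render the template with jinja2 the same way apply_chat_template does, then encode the rendered string. This throws away per-message masks, so it only illustrates the parsing side; the variable names (messages, bos_token, eos_token, add_generation_prompt) follow the conventions HF chat templates expect:

from typing import Dict, List
from jinja2 import Environment

def render_chat_template(
	chat_template: str,
	messages: List[Dict[str, str]],
	bos_token: str,
	eos_token: str,
) -> str:
	# HF chat templates consume messages as a list of {"role": ..., "content": ...} dicts
	template = Environment().from_string(chat_template)
	return template.render(
		messages=messages,
		bos_token=bos_token,
		eos_token=eos_token,
		add_generation_prompt=False,
	)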

Constraints and alternatives

One constraint here: we should be able to do this without actually taking a dependency on transformers. The HuggingFaceBaseTokenizer already takes a dependency on tokenizers, which is OK, and we may need to use jinja2 to parse the chat template, which is fine as well.

One drawback of the proposed approach is that it requires the base tokenizer to be an instance of HuggingFaceBaseTokenizer, which necessitates building via the tokenizer.json file. As @joecummings pointed out, there are Hugging Face tokenizers that still don't support this, e.g. Qwen2Tokenizer builds with merges + vocab file. An alternative __init__ signature could look like:

class HuggingFaceModelTokenizer(ModelTokenizer):
	def __init__(
		self,
		base_tokenizer: BaseTokenizer,
		tokenizer_config_json_path: str,
		generation_config_path: Optional[str] = None,
	):
		self.base_tokenizer = base_tokenizer
		with open(tokenizer_config_json_path, "rb") as f:
			config = json.load(f)
		# remaining code the same as in the "proposal" section

This gives us more flexibility as we can now compose with any arbitrary BaseTokenizer. However, we need to take more care to ensure that any calls to the base tokenizer's encode and decode methods adhere strictly to the APIs we've defined (otherwise this can easily wind up being fake composability).

ebsmothers added the high-priority and community help wanted labels on May 9, 2025

krammnic commented May 9, 2025

Let me take this
