Pocket TTS Tokenizer (4k vocab, byte-fallback)

A repackaging of the official [kyutai/pocket-tts-without-voice-cloning](https://huggingface.co/kyutai/pocket-tts-without-voice-cloning) tokenizer.model (SentencePiece BPE, vocab=4000) as a complete HuggingFace tokenizer directory so it can be loaded directly by name with the default fast path:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("BreadLoveCarrot-v2/pocket-tts-tokenizer-4k")  # use_fast=True is fine

ids = tok.encode("Hello, world!", add_special_tokens=False)
text = tok.decode(ids)

Why this repo exists

The original kyutai release ships only the raw SentencePiece binary (tokenizer.model), which AutoTokenizer.from_pretrained(...) cannot load directly because the release is missing:

  • tokenizer_config.json (declares which tokenizer class to instantiate)
  • special_tokens_map.json
  • tokenizer.json (the fast HF JSON model spec)
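As a quick illustration of the list above, the following sketch reports which of those three files are absent from a local directory. The helper name missing_tokenizer_files is hypothetical and not part of transformers; it only checks for file presence, not validity.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
HF_TOKENIZER_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
)


def missing_tokenizer_files(directory: str) -> List[str]:
    """Return the expected HF tokenizer files that are not present in `directory`."""
    root = Path(directory)
    return [name for name in HF_TOKENIZER_FILES if not (root / name).is_file()]
```

A directory holding only tokenizer.model would report all three names as missing.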

This repo wraps the original SP model as LlamaTokenizerFast — LLaMA's tokenizer has the same shape (SP-BPE + byte_fallback + <unk>=0, <s>=1, </s>=2, <pad>=3 as the first four pieces).
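For orientation, here is a minimal sketch of what a tokenizer_config.json for such a wrapper could contain. The values come from this README (class name, special tokens, BOS/EOS flags); the key names follow common transformers conventions and are an assumption, and the actual file in this repo may contain more or different keys.

```python
import json


def build_tokenizer_config() -> dict:
    """Assumed minimal tokenizer_config.json contents; not copied from the repo."""
    return {
        "tokenizer_class": "LlamaTokenizerFast",
        "unk_token": "<unk>",
        "bos_token": "<s>",
        "eos_token": "</s>",
        "pad_token": "<pad>",
        # BOS/EOS are not added automatically, per the specification in this README.
        "add_bos_token": False,
        "add_eos_token": False,
    }


print(json.dumps(build_tokenizer_config(), indent=2))
```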

A subtle gotcha (and the fix)

transformers' built-in SP → fast-BPE converter has a known bug for case-preserving SP models whose vocab contains leading-uppercase pieces like ▁Hello, ▁STOP, ▁D, ▁S: it over-splits them into byte-fallback pieces (e.g. "Hello" → ['H','el','lo'] instead of ['▁Hello']). The original LLaMA tokenizer doesn't trip this because its training corpus rarely produced such pieces. Pocket TTS does.

This repo sidesteps the converter entirely: the tokenizer.json is built directly from the SP proto using the Unigram model (with per-piece SP scores) + byte fallback. SP-BPE inference is greedy-by-score and Unigram is Viterbi-by-score; for this model they produce identical segmentation, byte-for-byte (verified on a battery of edge cases including mixed case, Unicode, emoji, and contractions).
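The equivalence claim above can be spot-checked with a small harness like the one below. This is a sketch, not the script used for the verification described here: segmentation_mismatches and load_encoders are hypothetical helper names, and load_encoders assumes sentencepiece and transformers are installed and that a local copy of tokenizer.model is available.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def segmentation_mismatches(
    texts: Iterable[str], reference: Encoder, candidate: Encoder
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, reference_ids, candidate_ids) for every text where the ids differ."""
    mismatches = []
    for text in texts:
        expected, actual = reference(text), candidate(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


def load_encoders(sp_model_path: str, hf_repo: str) -> Tuple[Encoder, Encoder]:
    """Build the two real encoders; third-party imports are kept local to this function."""
    import sentencepiece as spm
    from transformers import AutoTokenizer

    sp = spm.SentencePieceProcessor(model_file=sp_model_path)
    tok = AutoTokenizer.from_pretrained(hf_repo)
    return (
        lambda text: list(sp.encode(text)),  # raw SentencePiece ids
        lambda text: tok.encode(text, add_special_tokens=False),  # HF fast-path ids
    )
```

Calling load_encoders("tokenizer.model", "BreadLoveCarrot-v2/pocket-tts-tokenizer-4k") and passing the result to segmentation_mismatches with a list of mixed-case, Unicode, and emoji strings should return an empty list if the two paths agree on those inputs.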

Specification

| Property | Value |
| --- | --- |
| Source | kyutai/pocket-tts-without-voice-cloning@d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3 |
| Algorithm | SentencePiece BPE (packaged as Unigram for byte-identical fast inference) |
| Vocab size | 4000 (4 specials + 256 byte-fallback slots + 3740 learned pieces) |
| Byte fallback | ✅ Yes; never emits <unk> for unseen characters (emoji, rare Unicode, etc.) |
| Special tokens | <unk>=0, <s>=1, </s>=2, <pad>=3 |
| Normalization | identity (no NFKC, no whitespace squashing); SP add_dummy_prefix=True is replicated as Prepend('▁') + Replace(' ', '▁') |
| Case-preserving | ✅ Yes ("FBI" ≠ "fbi") |
| HF wrapper | LlamaTokenizerFast |
| BOS/EOS | NOT auto-prepended (add_bos_token=False, add_eos_token=False); the model handles those |
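To make the normalization and byte-fallback entries above concrete, here is a pure-Python sketch of both behaviours. The function names are hypothetical, and the code imitates the described rules rather than calling the tokenizer. The <0xNN> spelling follows the usual SentencePiece byte-piece convention, and edge cases such as empty input are not modelled.

```python
from typing import List


def sp_normalize(text: str) -> str:
    """Imitate Prepend('▁') + Replace(' ', '▁'): no case folding, no whitespace squashing."""
    return "▁" + text.replace(" ", "▁")


def byte_fallback_pieces(char: str) -> List[str]:
    """Spell a character as byte pieces, one per UTF-8 byte, as byte fallback would."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


print(sp_normalize("Hello  world"))  # both spaces survive as separate '▁' markers
print(byte_fallback_pieces("😀"))    # four pieces, one per UTF-8 byte of the emoji
```

Because each byte piece maps to one of the 256 byte-fallback slots, any character outside the learned pieces can still be encoded without <unk>.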

Attribution

The underlying tokenizer.model is © kyutai and licensed under the terms of the kyutai/pocket-tts-without-voice-cloning repo. This repository only repackages it for convenience; tokenizer.model itself is not modified in any way.

Verification

You can verify the SP model is byte-identical to the upstream release by comparing the SHA-256 of tokenizer.model to the original kyutai blob at revision d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3.
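A minimal sketch of that comparison using only the standard library is shown below. It assumes both copies of tokenizer.model are already on disk, and the helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a: str, path_b: str) -> bool:
    """True when both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

One way to obtain the upstream copy is huggingface_hub.hf_hub_download with the repo_id, filename, and revision arguments; that step is left out here to keep the sketch dependency-free.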
