Pocket TTS Tokenizer (4k vocab, byte-fallback)

A repackaging of the official [kyutai/pocket-tts-without-voice-cloning](https://huggingface.co/kyutai/pocket-tts-without-voice-cloning) tokenizer.model (SentencePiece BPE, vocab=4000) as a complete HuggingFace tokenizer directory, so it can be loaded directly by name with the default fast path:

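```python
from transformers import AutoTokenizer

# Load by repo name; the default fast tokenizer (use_fast=True) is fine.
tok = AutoTokenizer.from_pretrained("BreadLoveCarrot-v2/pocket-tts-tokenizer-4k")

# Encode without special tokens, then round-trip back to text.
ids = tok.encode("Hello, world!", add_special_tokens=False)
text = tok.decode(ids)
```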

Why this repo exists

The original kyutai release ships only the raw SentencePiece binary (tokenizer.model), which is NOT directly loadable by AutoTokenizer.from_pretrained(...) because it is missing:

tokenizer_config.json (declares which tokenizer class to instantiate)
special_tokens_map.json
tokenizer.json (the fast HF JSON model spec)
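
As a quick illustration of the list above, the following sketch reports which of those three files are absent from a local directory. The function name `missing_tokenizer_files` and the constant `REQUIRED_FILES` are hypothetical helpers introduced here, not part of transformers.

```python
from pathlib import Path

# The three files named in the list above (hypothetical constant for this sketch).
REQUIRED_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
)


def missing_tokenizer_files(directory):
    """Return the names from REQUIRED_FILES that are not present in `directory`."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Run against a directory that holds only tokenizer.model, this would return all three names.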

This repo wraps the original SP model as LlamaTokenizerFast because LLaMA's tokenizer has the same shape: SP-BPE with byte_fallback and leading special pieces. In this model the first four pieces are <unk>=0, <s>=1, </s>=2, <pad>=3.

A subtle gotcha (and the fix)

transformers' built-in SP → fast-BPE converter has a known bug for case-preserving SP models whose vocab contains leading-uppercase pieces like ▁Hello, ▁STOP, ▁D, ▁S: it over-splits them into smaller sub-pieces (e.g. "Hello" → ['H','el','lo'] instead of ['▁Hello']). The original LLaMA tokenizer doesn't trip this because its training corpus rarely produced such pieces. The Pocket TTS vocab does contain them.
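
One way to catch this kind of over-splitting is to run a reference tokenizer and a candidate tokenizer over the same inputs and collect every disagreement. The sketch below is a generic harness, not the check the author ran; `find_mismatches` is a hypothetical helper, and the two dictionary-backed callables merely stand in for real tokenizers using the example from the paragraph above.

```python
def find_mismatches(texts, reference_tokenize, candidate_tokenize):
    """Return (text, expected, actual) for every input where the two disagree."""
    mismatches = []
    for text in texts:
        expected = reference_tokenize(text)
        actual = candidate_tokenize(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


# Stand-in callables (assumptions for illustration only).
reference = {"Hello": ["▁Hello"]}.get
over_split = {"Hello": ["H", "el", "lo"]}.get

for text, expected, actual in find_mismatches(["Hello"], reference, over_split):
    print(f"{text!r}: expected {expected}, got {actual}")
```

In practice the reference callable could wrap the SentencePiece processor (for example `sp.encode(text, out_type=str)`) and the candidate could be the HF tokenizer's `tokenize` method; wiring those in is left as an assumption here.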

This repo sidesteps the converter entirely: tokenizer.json is built directly from the SP proto using the Unigram model (with per-piece SP scores) plus byte fallback. SP-BPE inference merges greedily by score, whereas Unigram runs Viterbi over scores. The two are not equivalent in general, but for this model they produced identical token sequences on every tested input (a battery of edge cases including mixed case, Unicode, emoji, and contractions).
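
To make the two inference strategies concrete, here is a simplified sketch of both on a toy vocabulary. `TOY_SCORES` is a hypothetical score table, not the real Pocket TTS vocab. The sketch assumes every character of the input is itself a piece (byte fallback is omitted), and it ignores other details of the real SentencePiece implementation.

```python
import math

# Hypothetical toy scores (higher is better): merged pieces score well,
# single characters score poorly, as is typical for an SP-BPE model.
TOY_SCORES = {
    "el": -1.0, "lo": -2.0, "ello": -3.0, "Hello": -4.0, "▁Hello": -5.0,
    "▁": -100.0, "H": -101.0, "e": -102.0, "l": -103.0, "o": -104.0,
}


def bpe_greedy(text, scores):
    """Repeatedly merge the adjacent pair whose concatenation scores highest."""
    pieces = list(text)
    while True:
        best_i, best_score = None, -math.inf
        for i in range(len(pieces) - 1):
            s = scores.get(pieces[i] + pieces[i + 1])
            if s is not None and s > best_score:  # leftmost pair wins ties
                best_i, best_score = i, s
        if best_i is None:
            return pieces
        pieces[best_i:best_i + 2] = [pieces[best_i] + pieces[best_i + 1]]


def unigram_viterbi(text, scores):
    """Pick the segmentation that maximizes the sum of piece scores."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in scores and best[start] + scores[piece] > best[end]:
                best[end] = best[start] + scores[piece]
                back[end] = start
    pieces, pos = [], n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


for sample in ("▁Hello", "▁lo▁Hello"):
    print(bpe_greedy(sample, TOY_SCORES), unigram_viterbi(sample, TOY_SCORES))
```

On this toy table both functions agree because single characters are penalized so heavily that the longest available pieces always win. That agreement is a property of the scores, not of the algorithms, which is why it has to be checked empirically for a given model.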

Specification


| Property | Value |
| --- | --- |
| Source | `kyutai/pocket-tts-without-voice-cloning@d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3` |
| Algorithm | SentencePiece BPE (packaged as Unigram for byte-identical fast inference) |
| Vocab size | 4000 (4 specials + 256 byte-fallback slots + 3740 learned pieces) |
| Byte fallback | ✅ Yes: never emits `<unk>` for unseen characters (emoji, rare Unicode, etc.) |
| Special tokens | `<unk>=0`, `<s>=1`, `</s>=2`, `<pad>=3` |
| Normalization | `identity` (no NFKC, no whitespace squashing); SP `add_dummy_prefix=True` is replicated as `Prepend('▁') + Replace(' ', '▁')` |
| Case-preserving | ✅ Yes (`"FBI"` ≠ `"fbi"`) |
| HF wrapper | `LlamaTokenizerFast` |
| BOS/EOS | NOT auto-prepended (`add_bos_token=False, add_eos_token=False`); the model handles those |
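
The normalization and byte-fallback rows can be pictured with a few lines of plain Python. This is only a sketch of the behaviour described in the table, not the tokenizer's actual code path: `sp_style_normalize` and `byte_fallback_pieces` are hypothetical helper names, and leaving empty input unchanged is an assumption of this sketch. The `<0xNN>` spelling follows the usual SentencePiece convention for byte pieces.

```python
from typing import List


def sp_style_normalize(text: str) -> str:
    """Mimic Prepend('▁') + Replace(' ', '▁'); no NFKC, no whitespace squashing."""
    if not text:
        return text  # assumption: empty input stays empty
    return "▁" + text.replace(" ", "▁")


def byte_fallback_pieces(char: str) -> List[str]:
    """Spell an out-of-vocabulary character as one <0xNN> piece per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


print(sp_style_normalize("Hello world"))
print(byte_fallback_pieces("é"))
```

Because every possible byte has its own slot among the 256 byte-fallback pieces, any character can be represented this way, which is why `<unk>` is never needed for unseen characters.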

Attribution

The underlying tokenizer.model is © kyutai and licensed under the terms of the kyutai/pocket-tts-without-voice-cloning repo. This repository only repackages it for convenience; the tokenizer.model file itself is unmodified.

Verification

You can verify the SP model is byte-identical to the upstream release by comparing the SHA-256 of tokenizer.model to the original kyutai blob at revision d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3.
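
A minimal way to compute the digest locally uses only the standard library. `sha256_of_file` is a hypothetical helper name, and obtaining the two files (this repo's copy and the upstream copy at the pinned revision) is left to the reader.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The two copies are byte-identical when `sha256_of_file(local_path) == sha256_of_file(upstream_path)`, where both paths are placeholders for wherever the files were downloaded.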
