Pocket TTS Tokenizer (4k vocab, byte-fallback)
A repackaging of the official [kyutai/pocket-tts-without-voice-cloning](https://huggingface.co/kyutai/pocket-tts-without-voice-cloning) tokenizer.model (SentencePiece BPE,
vocab=4000) as a complete HuggingFace tokenizer directory so it can be loaded
directly by name with the default fast path:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BreadLoveCarrot-v2/pocket-tts-tokenizer-4k")  # use_fast=True is fine
ids = tok.encode("Hello, world!", add_special_tokens=False)
text = tok.decode(ids)
```
Why this repo exists
The original kyutai release ships only the raw SentencePiece binary
(tokenizer.model), which is NOT directly loadable by
AutoTokenizer.from_pretrained(...) because it is missing:
- tokenizer_config.json (declares which tokenizer class to instantiate)
- special_tokens_map.json
- tokenizer.json (the fast HF JSON model spec)
This repo wraps the original SP model as LlamaTokenizerFast, because LLaMA's
tokenizer has the same shape: SP-BPE with byte_fallback and the special tokens
as the first pieces of the vocab. Here the first four pieces are
<unk>=0, <s>=1, </s>=2, <pad>=3.
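To make the missing pieces concrete, here is a minimal sketch of the kind of tokenizer_config.json and special_tokens_map.json such a wrapper needs. The values are taken from the description in this README; the files actually shipped in the repo may contain additional keys, so treat the dictionaries below as an illustration, not a copy.

```python
import json

# Special tokens as listed in this README (ids 0..3 in the SP vocab).
SPECIAL_TOKENS = {
    "unk_token": "<unk>",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "pad_token": "<pad>",
}


def build_tokenizer_config() -> dict:
    """Illustrative tokenizer_config.json contents (assumed, not copied from the repo)."""
    config = {
        "tokenizer_class": "LlamaTokenizerFast",
        # BOS/EOS are not added automatically, as stated in this README.
        "add_bos_token": False,
        "add_eos_token": False,
    }
    config.update(SPECIAL_TOKENS)
    return config


def build_special_tokens_map() -> dict:
    """Illustrative special_tokens_map.json contents (assumed)."""
    return dict(SPECIAL_TOKENS)


print(json.dumps(build_tokenizer_config(), indent=2))
print(json.dumps(build_special_tokens_map(), indent=2))
```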
A subtle gotcha (and the fix)
transformers' built-in SP → fast-BPE converter has a known bug for
case-preserving SP models whose vocab contains leading-uppercase pieces
like ▁Hello, ▁STOP, ▁D, ▁S: it over-splits them into smaller sub-word
pieces (e.g. "Hello" → ['H','el','lo'] instead of ['▁Hello']). The
original LLaMA tokenizer doesn't trip this because its training corpus
rarely produced such pieces. Pocket TTS does.
This repo sidesteps the converter entirely: the tokenizer.json is built
directly from the SP proto using the Unigram model (with per-piece SP
scores) + byte fallback. SP-BPE inference is greedy-by-score and Unigram
is Viterbi-by-score, so the two are not guaranteed to agree in general; for
this model they produced identical segmentations, byte-for-byte, on every
input tested (a battery of edge cases including mixed case, Unicode, emoji,
and contractions).
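A check of this kind can be sketched as a small comparison harness. The helper name `find_mismatches` and the sample inputs are hypothetical, and this is not necessarily how the original verification was run; the commented usage assumes the `sentencepiece` and `transformers` packages plus a local copy of tokenizer.model.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Encoder = Callable[[str], Sequence]


def find_mismatches(
    reference: Encoder, candidate: Encoder, texts: Iterable[str]
) -> List[Tuple[str, list, list]]:
    """Return (text, reference_output, candidate_output) for every disagreement."""
    mismatches = []
    for text in texts:
        ref, cand = list(reference(text)), list(candidate(text))
        if ref != cand:
            mismatches.append((text, ref, cand))
    return mismatches


# Usage sketch (not executed here; requires third-party packages and model files):
#   import sentencepiece as spm
#   from transformers import AutoTokenizer
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   tok = AutoTokenizer.from_pretrained("BreadLoveCarrot-v2/pocket-tts-tokenizer-4k")
#   bad = find_mismatches(
#       lambda t: sp.encode(t),                               # reference SP ids
#       lambda t: tok.encode(t, add_special_tokens=False),    # fast tokenizer ids
#       ["Hello", "STOP", "FBI vs fbi", "don't", "naïve café", "🙂"],
#   )
#   assert not bad, bad
```

Comparing token ids instead of piece strings avoids false alarms from differences in how the two libraries render pieces.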
Specification
| Field | Value |
|---|---|
| Source | kyutai/pocket-tts-without-voice-cloning@d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3 |
| Algorithm | SentencePiece BPE (packaged as Unigram for byte-identical fast inference) |
| Vocab size | 4000 (4 specials + 256 byte-fallback slots + 3740 learned pieces) |
| Byte fallback | ✅ Yes: never emits <unk> for unseen characters (emoji, rare Unicode, etc.) |
| Special tokens | <unk>=0, <s>=1, </s>=2, <pad>=3 |
| Normalization | identity (no NFKC, no whitespace squashing); SP add_dummy_prefix=True is replicated as Prepend('▁') + Replace(' ', '▁') |
| Case-preserving | ✅ Yes ("FBI" ≠ "fbi") |
| HF wrapper | LlamaTokenizerFast |
| BOS/EOS | NOT auto-prepended (add_bos_token=False, add_eos_token=False); the model handles those |
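The normalization and byte-fallback rows can be illustrated in plain Python. This is a sketch of the behaviour described in the table, not the tokenizer's actual implementation: `normalize` and `byte_fallback_pieces` are hypothetical helper names, and the `<0xNN>` spelling follows SentencePiece's usual convention for byte pieces.

```python
from typing import List

META = "\u2581"  # the '▁' marker SentencePiece uses for word boundaries

# Vocab layout stated in the table: specials + byte slots + learned pieces.
assert 4 + 256 + 3740 == 4000


def normalize(text: str) -> str:
    """Mimic Prepend('▁') + Replace(' ', '▁') for non-empty input; nothing else changes."""
    return META + text.replace(" ", META)


def byte_fallback_pieces(char: str) -> List[str]:
    """Byte pieces for a character that has no piece of its own in the vocab."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


print(normalize("Hello world"))    # ▁Hello▁world
print(byte_fallback_pieces("🙂"))  # ['<0xF0>', '<0x9F>', '<0x99>', '<0x82>']
```

Because every possible byte has a slot, any UTF-8 input can be encoded without falling back to <unk>.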
Attribution
The underlying tokenizer.model is © kyutai and licensed under the terms of
the kyutai/pocket-tts-without-voice-cloning
repo. This repository only repackages it for convenience: tokenizer.model is
unmodified, and the accompanying HF files are derived from it.
Verification
You can verify the SP model is byte-identical to the upstream release by
comparing the SHA-256 of tokenizer.model to the original kyutai blob at
revision d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3.
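One way to run that comparison is sketched below, using only the standard library for the hashing. The helper names are hypothetical, and the commented download step assumes the `huggingface_hub` package, network access, and that both repos expose the file as tokenizer.model at the repo root.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b) -> bool:
    """True when both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Usage sketch (not executed here):
#   from huggingface_hub import hf_hub_download
#   upstream = hf_hub_download(
#       "kyutai/pocket-tts-without-voice-cloning", "tokenizer.model",
#       revision="d4fdd22ae8c8e1cb3634e150ebeff1dab2d16df3",
#   )
#   local = hf_hub_download("BreadLoveCarrot-v2/pocket-tts-tokenizer-4k", "tokenizer.model")
#   print(files_identical(upstream, local))
```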