Nanochat Tokenizer

This is the tokenizer from Andrej Karpathy's educational project nanochat. Training it is the first step of the speedrun.sh script.

Training

First, download the first ~2B characters of the pretraining dataset (8 shards) using the dataset script in nanochat:

export NANOCHAT_BASE_DIR=".cache/nanochat"
mkdir -p "$NANOCHAT_BASE_DIR"
python -m nanochat.dataset -n 8
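
The value exported above is a relative path, so it resolves against whatever directory the commands are run from. The sketch below illustrates this, assuming the scripts read NANOCHAT_BASE_DIR from the environment. The helper name resolve_base_dir and its fallback value are hypothetical and are not taken from nanochat.

```python
import os
from pathlib import Path


def resolve_base_dir(default: str = ".cache/nanochat") -> Path:
    """Hypothetical helper: read NANOCHAT_BASE_DIR and return an absolute path.

    The fallback value is an assumption for illustration only.
    """
    raw = os.environ.get("NANOCHAT_BASE_DIR", default)
    # A relative value is resolved against the current working directory.
    base = Path(raw).expanduser().resolve()
    # Same effect as `mkdir -p`: create parents, ignore if it already exists.
    base.mkdir(parents=True, exist_ok=True)
    return base


if __name__ == "__main__":
    print(resolve_base_dir())
```

In practice this means the download, training, and evaluation commands should all be run from the same directory, or the variable should be set to an absolute path.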

Then we can train the tokenizer on ~2B characters of data (the vocab size is left at the script's default):

python -m scripts.tok_train --max_chars=2000000000
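
The --max_chars flag caps how much text the tokenizer sees during training. A minimal sketch of such a character budget is shown below. The function take_chars is hypothetical, and truncating the final document is an assumption; how tok_train applies the cap internally is not described here.

```python
from typing import Iterable, Iterator


def take_chars(docs: Iterable[str], max_chars: int) -> Iterator[str]:
    """Yield documents until max_chars characters have been emitted.

    Hypothetical helper: the last document is truncated so the total
    never exceeds the budget.
    """
    seen = 0
    for doc in docs:
        if seen >= max_chars:
            break
        remaining = max_chars - seen
        piece = doc[:remaining]  # cut the final document to fit the budget
        seen += len(piece)
        yield piece


docs = ["hello world", "tokenizers are fun", "never reached"]
capped = list(take_chars(docs, max_chars=20))
print(capped)  # ['hello world', 'tokenizer']
```

Because the helper is a generator, it can wrap a streaming dataset iterator without loading every shard into memory first.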

And finally, evaluate:

python -m scripts.tok_eval

timestamp: 2025-10-14 10:29:05

timestamp: 2025-10-14 10:29:10

Text Type	Bytes	GPT-2 Tokens	GPT-2 Ratio	Ours Tokens	Ours Ratio	Relative Diff %
news	1819	404	4.50	375	4.85	+7.2%
korean	893	745	1.20	712	1.25	+4.4%
code	1259	576	2.19	492	2.56	+14.6%
math	1834	936	1.96	966	1.90	-3.2%
science	1112	260	4.28	228	4.88	+12.3%
fwe-train	4208518	900364	4.67	856883	4.91	+4.8%
fwe-val	4991242	1075364	4.64	1027241	4.86	+4.5%

Text Type	Bytes	GPT-4 Tokens	GPT-4 Ratio	Ours Tokens	Ours Ratio	Relative Diff %
news	1819	387	4.70	375	4.85	+3.1%
korean	893	364	2.45	712	1.25	-95.6%
code	1259	309	4.07	492	2.56	-59.2%
math	1834	832	2.20	966	1.90	-16.1%
science	1112	249	4.47	228	4.88	+8.4%
fwe-train	4208518	874799	4.81	856883	4.91	+2.0%
fwe-val	4991242	1048837	4.76	1027241	4.86	+2.1%
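
The derived columns are not defined in the report itself, but the numbers are consistent with Ratio being bytes per token and Relative Diff % being the token savings relative to the baseline tokenizer. The sketch below recomputes three rows of the GPT-4 table under that assumption. The function names are illustrative only.

```python
def bytes_per_token(num_bytes: int, num_tokens: int) -> float:
    """Compression ratio: higher means fewer tokens for the same text."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Percent of tokens saved relative to the baseline (negative = worse)."""
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100


# (text type, bytes, GPT-4 tokens, our tokens), copied from the table above
rows = [
    ("news", 1819, 387, 375),
    ("korean", 893, 364, 712),
    ("code", 1259, 309, 492),
]

for name, nbytes, gpt4, ours in rows:
    print(
        f"{name}\t{bytes_per_token(nbytes, gpt4):.2f}"
        f"\t{bytes_per_token(nbytes, ours):.2f}"
        f"\t{relative_diff_pct(gpt4, ours):+.1f}%"
    )
```

A positive value means this tokenizer needs fewer tokens than the baseline. By this measure it is ahead of GPT-2 on every row except math, and ahead of GPT-4 on news, science, and the fwe-train and fwe-val rows, while trailing GPT-4 by a wide margin on Korean and code.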
