
Nanochat Tokenizer

This is the tokenizer from Andrej Karpathy's educational project nanochat. Training it is the first step of the project's speedrun.sh script.

Training

First, we need to download the first ~2B characters of the pretraining dataset using the dataset script in nanochat:

export NANOCHAT_BASE_DIR=".cache/nanochat"
mkdir -p "$NANOCHAT_BASE_DIR"
python -m nanochat.dataset -n 8

Then, we can train the tokenizer on ~2B characters of data:

python -m scripts.tok_train --max_chars=2000000000

And finally, evaluate:

python -m scripts.tok_eval

Tokenizer training

timestamp: 2025-10-14 10:29:05

  • max_chars: 2,000,000,000
  • doc_cap: 10,000
  • vocab_size: 65,536
  • train_time: 52.9085
  • num_special_tokens: 9
  • token_bytes_min: 1
  • token_bytes_max: 32
  • token_bytes_mean: 6.9197
  • token_bytes_std: 2.8748
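
The token_bytes_* fields summarize the byte length of the tokens in the trained vocabulary. A minimal sketch of how such statistics could be computed, assuming the vocabulary is available as a list of byte strings; the function name and the toy vocabulary are hypothetical, and whether nanochat uses the population or sample standard deviation, or excludes special tokens, is not stated here:

```python
import statistics

def token_byte_stats(token_bytes):
    """Summarize token lengths in bytes (hypothetical helper, not nanochat's code)."""
    lengths = [len(t) for t in token_bytes]
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": statistics.fmean(lengths),
        # Population standard deviation is an assumption; the report does not say which is used.
        "token_bytes_std": statistics.pstdev(lengths),
    }

# Toy vocabulary for illustration only; the real one has 65,536 entries.
toy_vocab = [b"a", b"th", b"the", b" token"]
stats = token_byte_stats(toy_vocab)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the report above says the average token in the trained vocabulary spans about 6.9 bytes, and the longest spans 32 bytes.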

Tokenizer evaluation

timestamp: 2025-10-14 10:29:10

Comparison with GPT-2

Text Type  Bytes    GPT-2 Tokens  GPT-2 Ratio  Ours Tokens  Ours Ratio  Relative Diff %
news       1819     404           4.50         375          4.85        +7.2%
korean     893      745           1.20         712          1.25        +4.4%
code       1259     576           2.19         492          2.56        +14.6%
math       1834     936           1.96         966          1.90        -3.2%
science    1112     260           4.28         228          4.88        +12.3%
fwe-train  4208518  900364        4.67         856883       4.91        +4.8%
fwe-val    4991242  1075364       4.64         1027241      4.86        +4.5%

Comparison with GPT-4

Text Type  Bytes    GPT-4 Tokens  GPT-4 Ratio  Ours Tokens  Ours Ratio  Relative Diff %
news       1819     387           4.70         375          4.85        +3.1%
korean     893      364           2.45         712          1.25        -95.6%
code       1259     309           4.07         492          2.56        -59.2%
math       1834     832           2.20         966          1.90        -16.1%
science    1112     249           4.47         228          4.88        +8.4%
fwe-train  4208518  874799        4.81         856883       4.91        +2.0%
fwe-val    4991242  1048837       4.76         1027241      4.86        +2.1%
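
The Ratio and Relative Diff % columns in the comparisons above can be reproduced from the Bytes and Tokens columns. A minimal sketch, assuming Ratio means bytes per token and Relative Diff % means the reduction in token count relative to the baseline tokenizer. Both formulas are inferred from the numbers in the tables rather than taken from scripts.tok_eval, and the function names are hypothetical:

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes of text covered per token (higher means better compression)."""
    return num_bytes / num_tokens

def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Percentage of tokens saved versus the baseline (negative means we use more)."""
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100

# "news" row of the GPT-2 comparison.
news_bytes, gpt2_tokens, ours_tokens = 1819, 404, 375
print(f"{compression_ratio(news_bytes, gpt2_tokens):.2f}")    # 4.50
print(f"{compression_ratio(news_bytes, ours_tokens):.2f}")    # 4.85
print(f"{relative_diff_pct(gpt2_tokens, ours_tokens):+.1f}%")  # +7.2%

# "korean" row of the GPT-4 comparison.
print(f"{relative_diff_pct(364, 712):+.1f}%")                  # -95.6%
```

Under this reading, a positive Relative Diff % means this tokenizer needs fewer tokens than the baseline for the same text, and a negative value means it needs more.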