openai/gsm8k
How to use ssurface/qwen3-4b-cot-compress-l4 with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = PeftModel.from_pretrained(base_model, "ssurface/qwen3-4b-cot-compress-l4")

LoRA adapter for Qwen/Qwen3-4B-Instruct-2507 that produces reasoning in the Level-4 style: about 45 characters, written as a short variable chain with arrows.
Part of a 5-level Pareto study on chain-of-thought compression for math reasoning.
See the collection and the eval artifacts at ssurface/qwen3-4b-cot-compress-eval.
| metric | value |
|---|---|
| accuracy | 0.579985 |
| n_correct / n_total | 765 / 1319 |
| mean think tokens | 39.05 |
| median think tokens | 35.00 |
The baseline (stock Qwen/Qwen3-4B-Instruct-2507) reaches ~0.896 accuracy at ~294 completion tokens.
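The trade-off in the table can be checked with a few lines of arithmetic. The sketch below only reuses the figures reported above; the variable names are illustrative. Note that the adapter figure counts think tokens while the baseline figure counts completion tokens, so the ratio is a rough indication rather than a like-for-like comparison.

```python
# Figures copied from the results table and the baseline sentence.
n_correct, n_total = 765, 1319
mean_think_tokens = 39.05
baseline_accuracy = 0.896          # approximate, as reported
baseline_completion_tokens = 294   # approximate, as reported

# Accuracy is simply correct answers over total problems.
accuracy = n_correct / n_total
accuracy_drop = baseline_accuracy - accuracy

# Rough ratio only: think tokens vs. completion tokens are different
# measurements, so treat this as an order-of-magnitude estimate.
token_ratio = baseline_completion_tokens / mean_think_tokens

print(f"accuracy      = {accuracy:.6f}")
print(f"accuracy drop = {accuracy_drop:.3f}")
print(f"token ratio   = {token_ratio:.1f}x")
```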
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in half precision, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "ssurface/qwen3-4b-cot-compress-l4")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# The user message names the compression level, then states the problem.
prompt = tok.apply_chat_template(
    [{"role": "user",
      "content": "Solve this using Level 4 (Ultra-compact).\n"
                 "Problem: Natalia sold clips to 48 friends in April, "
                 "then half as many in May. How many in total?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Stop at the end-of-turn token of the chat template.
out = model.generate(
    **inputs, max_new_tokens=256,
    eos_token_id=tok.convert_tokens_to_ids("<|im_end|>"),
)

# Prints the prompt and the completion, with special tokens kept visible.
print(tok.decode(out[0], skip_special_tokens=False))
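
When running the adapter over many problems, it can help to wrap the prompt format from the example above in a small helper. The sketch below is a suggestion, not part of the released code: `build_messages` and `last_number` are hypothetical names, and `last_number` assumes the final answer is the last number in the completion, which may not match the adapter's actual output format.

```python
import re
from typing import Dict, List, Optional


def build_messages(problem: str, level: int = 4,
                   label: str = "Ultra-compact") -> List[Dict[str, str]]:
    """Build chat messages using the instruction format from the usage example."""
    content = f"Solve this using Level {level} ({label}).\nProblem: {problem}"
    return [{"role": "user", "content": content}]


def last_number(text: str) -> Optional[str]:
    """Return the last number found in `text`, or None if there is none.

    ASSUMPTION: the final answer is the last number in the completion.
    Check this against real outputs before relying on it for scoring.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    # Strip thousands separators (and any trailing comma the pattern picked up).
    return matches[-1].replace(",", "") if matches else None
```

The list returned by `build_messages` can be passed to `tok.apply_chat_template` in place of the inline message list shown above.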
SFTTrainer

Paper draft in progress.
Base model: Qwen/Qwen3-4B-Instruct-2507