froggeric's picture
Upload chat_template.README.md with huggingface_hub
32939a6 verified

Qwen 3.6 Chat Template

A universally fixed Jinja chat template for Qwen 3.6 that serves as a drop-in upgrade for all inference engines (vLLM, llama.cpp, text-generation-webui, LM Studio, oMLX, etc). The official template continues to crash on C++ tool calls, struggles with the new preserve_thinking feature by spamming empty tags, is vulnerable to model hallucinations, and lacks a way to cleanly toggle thinking inline. This universal template handles all of that.

What's broken in the official template

  1. Tool calls crash on C++ engines. The official template uses Python's |items dictionary filter and |safe, neither of which exist in C++ Jinja runtimes (like those used by LM Studio or MLX). Any tool call triggers an out-of-bounds error. It also crashes if the arguments payload is returned as a raw string instead of an object.
  2. No "developer" role. Modern APIs sometimes send message.role == "developer". The official template raises an exception and dies.
  3. Empty preserve_thinking block spam. Qwen 3.6 introduces a preserve_thinking kwarg. If toggled on, the official template wraps every past turn in a <think></think> block, which means a non-reasoning turn wastes context tokens with <think>\n\n</think>.
  4. The </thinking> hallucination. The Qwen 3.6 LLM sometimes mistakenly generates </thinking> at the end of its reasoning block. The official parser expects strictly </think>, resulting in parsing failure and leaking <thinking> tokens into the chat.
  5. No-user-query exception breaks tool calling. raise_exception crashes agentic loops and resets in OpenClaw and similar runtimes.
  6. Unclosed thinking before tool call. Model starts reasoning then calls a tool without closing the thinking block, producing malformed output.

What this template does

Universal tool arguments compatibility

Replaced |items iteration with direct dictionary key lookups. Swapped is sequence for is iterable (which strict C++ runtimes require). Removed |safe wrappers and safely map raw JSON fallback schemas so that primitive parameters (like booleans) serialize precisely to JSON standard true instead of crashing environments by generating Python-flavored titlecase "True".

"developer" role support

Intercepts "developer" messages and implicitly maps them to "system". No crash, no data loss.

Smarter preserve_thinking historical context

Now ON by default without any required kwargs! Instead of mindlessly generating empty XML tags for past turns, this template checks if the historical context actually contains reasoning (reasoning_content|trim|length > 0). Only then does it emit an active block into the chat cache, keeping context windows hyper-efficient. Furthermore, history is tied to the <|think_off|> override: disabling thinking in the prompt automatically sweeps older thinking blocks from the cache to drastically accelerate processing.

</thinking> Hallucination handling

During the assistant phase, the logic actively looks for boundary hallucinations. If Qwen generates </thinking>, this template dynamically splits on that literal instead of </think>, cleanly isolating tags seamlessly. If generation is interrupted mid-thought (max tokens/aborts) preventing a closing </think> tag from surfacing, the parser actively rescues the incomplete thought-stream instead of injecting invalid raw <think> pairs into the timeline.

Auto-close unclosed thinking before tool calls

The model sometimes starts a thinking block and then immediately calls a tool without emitting the closing tag. The unclosed thinking tag bleeds into the tool call, producing malformed output. This template detects the pattern and auto-injects the closing tag before the tool call boundary.

No-user-query crash fix

The official template scans messages in reverse to find the last real user query. If all user messages are tool results or there are none, it fires raise_exception and hard-crashes. This breaks agentic tool-calling chains and session resets. The fix replaces the exception with a graceful fallback.

Thinking toggle from any message

Drop <|think_on|> or <|think_off|> anywhere in a prompt. The template detects the tag, strips it iteratively without sequential state-bleeding so the model never sees it, and cascades the thinking state down to the generator prompt dynamically.

System: You are a coding assistant. <|think_off|>
User: Check the weather in Paris.

The tag disappears. The model answers fast, generating <think>\n\n</think>\n\n natively.

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

The model gets its <think>\n prompt and reasons deeply before answering.

Comparison

Feature Official This Fixed Template
Tool arguments work Crashes Fixed
|safe removed Crashes Fixed
"developer" role Missing Added
Thinking toggle None <|think_off|> anywhere
preserve_thinking Spams empty blocks Dynamic length checks
Tag extraction Fails on </thinking> Supports </thinking>
No-user-query crash Crashes Graceful fallback
Auto-close thinking before tool Not handled Auto-injects close tag

Installation

This template can be used anywhere standard HuggingFace Jinja templates are supported.

General (vLLM, llama.cpp, TextGen)

Simply replace your model's existing chat_template string in your tokenizer_config.json with the minified contents of this file, or load it as a custom template in your UI.

LM Studio

  1. Open LM Studio
  2. Go to the My Models tab (or the right-side panel in Chat)
  3. Select your Qwen 3.6 model
  4. Scroll to Prompt Template
  5. Delete the default template, paste this one in
  6. Save

oMLX

  1. Unload any chat_template_kwargs arguments you may have forced. It is handled by the template actively.
  2. Make sure you load the --jinja flag so the engine utilizes the custom parsing rules.
  3. Overwrite the chat_template.jinja source file locally.