Instructions for using timerring/magnum-v4-12b-fp8 with libraries and local apps.

Libraries

How to use timerring/magnum-v4-12b-fp8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="timerring/magnum-v4-12b-fp8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("timerring/magnum-v4-12b-fp8")
model = AutoModelForCausalLM.from_pretrained("timerring/magnum-v4-12b-fp8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use timerring/magnum-v4-12b-fp8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "timerring/magnum-v4-12b-fp8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "timerring/magnum-v4-12b-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use timerring/magnum-v4-12b-fp8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "timerring/magnum-v4-12b-fp8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "timerring/magnum-v4-12b-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "timerring/magnum-v4-12b-fp8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "timerring/magnum-v4-12b-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use timerring/magnum-v4-12b-fp8 with Docker Model Runner:
```
docker model run hf.co/timerring/magnum-v4-12b-fp8
```

Magnum v4 12B FP8

Model Overview

Magnum v4 12B FP8 is a large language model fine-tuned from Mistral-Nemo-Instruct-2407 and specialized for creative writing and roleplay. This version stores its weights with FP8 quantization, which roughly halves weight memory relative to the original bfloat16 checkpoint (about 1 byte per parameter instead of 2) while aiming to preserve output quality.

Model Specifications

Parameter	Value
Architecture	MistralForCausalLM
Parameters	12B
Hidden Size	5120
Intermediate Size	14336
Attention Heads	32
KV Heads	8 (GQA)
Layers	40
Max Context Length	1024K tokens (configured limit; the base Mistral-Nemo model is documented for a 128K context window)
Vocabulary Size	131072
Quantization	FP8 (compressed-tensors)
Original Precision	bfloat16
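
The specifications above can be used to estimate how much memory the KV cache needs at a given context length. The following is a rough sketch, not a measurement: it assumes a head dimension of 128 (taken from the base Mistral-Nemo configuration, not listed in the table; note that it differs from hidden size divided by attention heads) and a bfloat16 KV cache at 2 bytes per value. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 40,      # "Layers" in the table above
    num_kv_heads: int = 8,     # "KV Heads" in the table above
    head_dim: int = 128,       # ASSUMPTION: from the base model config, not the table
    bytes_per_value: int = 2,  # ASSUMPTION: bfloat16 KV cache
) -> int:
    """Estimate KV cache size for one sequence of `num_tokens` tokens."""
    # Each token stores one key and one value vector per layer and per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 2**30

print(kv_cache_bytes(1) / 1024)           # KiB per token
print(kv_cache_bytes(16_384) / GIB)       # GiB at a 16K context
print(kv_cache_bytes(1_024_000) / GIB)    # GiB at the configured maximum
```

Under these assumptions the cache costs 160 KiB per token, about 2.5 GiB for a single 16,384-token sequence, and about 156 GiB at the full 1024K limit, which is why serving configurations usually cap the context length well below the configured maximum.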

Key Features

Long context support: The configuration allows a context window of up to 1024K tokens; usable context in practice is bounded by available KV cache memory
FP8 quantization: Lower VRAM usage, faster inference
Tool calling support: Built-in special tokens like [TOOL_CALLS], [AVAILABLE_TOOLS]
Multilingual support: Chinese, English
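
To make the FP8 feature more concrete, here is a minimal pure-Python sketch of how a value can be rounded to the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448) using a single per-tensor scale. This is an illustration only: the scaling scheme actually stored in this checkpoint (per-tensor, per-channel, static or dynamic) is not stated in the card, and the helper names `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical.

```python
import math

E4M3_MAX = 448.0             # largest finite E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6  # smallest normal E4M3 value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    if a < E4M3_MIN_NORMAL:
        # Subnormal range: values are evenly spaced by 2^-9.
        step = 2.0 ** (-6 - MANTISSA_BITS)
    else:
        # a = m * 2**e with 0.5 <= m < 1, so the binade starts at 2**(e-1).
        _, e = math.frexp(a)
        step = 2.0 ** (e - 1 - MANTISSA_BITS)
    return sign * round(a / step) * step


def quantize_per_tensor(weights):
    """ASSUMPTION: one scale per tensor, chosen so the largest weight maps to 448."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate weights from E4M3 values and their scale."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.3]
q, scale = quantize_per_tensor(weights)
print(q)
print(dequantize(q, scale))
```

With 3 mantissa bits, the relative rounding error for values in the normal range is at most 2^-4 (about 6%), which is the precision traded for storing one byte per weight.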

Quick Start

Install Dependencies

pip install huggingface_hub transformers torch accelerate compressed-tensors

Download from Hugging Face

from huggingface_hub import snapshot_download
model_dir = snapshot_download('timerring/magnum-v4-12b-fp8')

Git Download

git clone https://huggingface.co/timerring/magnum-v4-12b-fp8

Inference Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "timerring/magnum-v4-12b-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "Hello, please introduce yourself."}
]

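inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,  # return input_ids together with the attention mask
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)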

Deploy with vLLM

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model timerring/magnum-v4-12b-fp8 \
    --served-model-name magnum-v4-12b-fp8 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --max-num-seqs 32
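
Once the server above is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the host, port, and served model name from the command above; the helper names `build_chat_request` and `chat` are hypothetical, and the sampling defaults are illustrative rather than recommended values.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the command above
MODEL_NAME = "magnum-v4-12b-fp8"       # matches --served-model-name above


def build_chat_request(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a two-sentence story about a lighthouse."))` would print the model's reply. Because the command sets `--served-model-name`, requests must use `magnum-v4-12b-fp8` as the model name rather than the full repository id.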

Hardware Requirements

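FP8 inference: ~12-14GB VRAM to load the model; the KV cache needs additional memory that grows with context length and batch size
Recommended GPUs: NVIDIA RTX 4090, H100, or other GPUs with native FP8 support (compute capability 8.9 or higher). Ampere GPUs such as the A100 have no FP8 tensor cores; vLLM can still run FP8 checkpoints on them as weight-only quantization, which saves memory but not compute.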
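
The VRAM figure above follows from simple arithmetic on the parameter count. As a back-of-the-envelope sketch, assuming exactly 12 billion parameters and ignoring activations, the KV cache, and any layers kept in bfloat16 (the function name `weight_memory_gib` is hypothetical):

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 12e9  # ASSUMPTION: the "12B" figure from the card, taken as exact

for name, width in [("FP8", 1), ("bfloat16", 2)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, width):.1f} GiB")
```

This gives about 11.2 GiB for FP8 weights versus about 22.4 GiB for bfloat16, which is consistent with the ~12-14GB figure once runtime overhead is included.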

Intended Use

This model is designed for:

Creative writing and storytelling
Roleplay and character interactions
General conversational AI applications

Limitations

May generate biased or inappropriate content in certain contexts
Not suitable for factual or safety-critical applications
Performance may vary with different prompt styles

License

Apache License 2.0

Acknowledgments

Base model: Mistral-Nemo-Instruct-2407

Safetensors

Model size: 12B params
Tensor types: BF16, F8_E4M3

Model tree for timerring/magnum-v4-12b-fp8

Base model: mistralai/Mistral-Nemo-Base-2407
Finetuned: mistralai/Mistral-Nemo-Instruct-2407
Quantized (174): this model