DeepSeek-V4-Flash · Abliterated · GGUF
CyberNeurova research · cyberneurova.ai
Release: v2 (three-direction multi-turn-aware ablation, 1338-prompt capture corpus)
A permanently abliterated version of deepseek-ai/DeepSeek-V4-Flash (284 B FP8 MoE), packaged as GGUF for llama.cpp. The abliteration is baked into the weights at conversion time: no runtime hooks, no slowdown, and no reliance on the inference framework supporting custom code paths.
Status: experimental research artifact. Built with antirez's llama.cpp DeepSeek-V4-Flash fork, which is itself experimental. Use at your own discretion.
What's new in v2
| | v1 | v2 |
|---|---|---|
| Direction stack | 2 | 3 (added a residual-targeting direction) |
| Capture corpus | 33 prompts | 1338 prompts (AdvBench + JBB + HarmfulQA + SafeRLHF + MaliciousInstruct + bundled) |
| Refusal rate (8-bench safety) | 0.0% | 0.0% |
| Refusal rate (55-prompt OOD probe) | not measured | 3.6% (baseline: 81.8%) |
| Tool-calling format compliance | 74.2% | 99.2% |
| Bug-finding | 78.3% | 85.0% |
| Hacking compliance | 88.7% | 90.0% |
| Cyber-weapons compliance | 87.3% | 90.0% |
| Coding / coherence / reasoning | unchanged | unchanged |
Plain English: v2 matches or beats v1 on every dimension we measure. The two biggest fixes are the OOD soft-refusal failure mode (reported against the v1 release) and tool-calling JSON correctness (which v1 broke ~25% of the time).
Detailed numbers: see cyberneurova-ablated-deepseek-flash-v4.pdf and
cyberneurova-deepseek-v4-flash-abliteration-v2.html in this repo.
Variants — pick one
| Variant | File size | RAM floor | Best for |
|---|---|---|---|
| Q2_K | 98.8 GB | 128 GB | The default. Practical for most workstations / M-series Macs. |
| Q8_0 | ~282 GB | 320 GB | Reference / maximum quality. Use this for evaluation, research, or when comparing against the bf16 model. |
Routed-expert weights are quantised at the listed level; embed, head,
attention, and shared-expert paths are always Q8_0 in both variants. The
abliteration directions are baked into all paths.
Note: antirez's V4-Flash converter currently supports only q8_0, q2_k, iq2_xxs, iq2_xs, tq1_0, and tq2_0 for routed-expert weights. Q4_K_M / Q5_K_M / Q6_K are not available for V4-Flash GGUFs at the time of this release; the architecture's FP8 expert layout is the limiting factor, not us.
How to download
Q2_K (recommended)
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
--local-dir .
Q8_0 (max quality)
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
--local-dir .
The download is 98.8 GB for Q2_K and ~282 GB for Q8_0 over the wire. The HF hub uses LFS, and hf download resumes interrupted transfers automatically.
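For scripted downloads, the same files can also be fetched from Python. The following is a minimal sketch, assuming the huggingface_hub package is installed. The helper names gguf_filename and download_variant are illustrative and not part of this repo; the filename pattern is copied from the two commands above.

```python
REPO_ID = "cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF"
VARIANTS = ("Q2_K", "Q8_0")  # the two variants published in this repo


def gguf_filename(variant: str) -> str:
    """Return the GGUF filename for a variant, following the names used above."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"cyberneurova-DeepSeek-V4-Flash-abliterated-{variant}.gguf"


def download_variant(variant: str, local_dir: str = ".") -> str:
    """Download one variant and return its local path (needs network access)."""
    # Imported lazily so the filename helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(variant),
        local_dir=local_dir,
    )
```

Calling download_variant("Q2_K") should be roughly equivalent to the first hf download command; check free disk space first, since the files are large.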
How to run
Step 1: Build antirez's V4-Flash-aware llama.cpp fork
V4-Flash is not in upstream llama.cpp yet. You need antirez's fork.
git clone https://github.com/antirez/llama.cpp-deepseek-v4-flash.git
cd llama.cpp-deepseek-v4-flash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Build takes 5-10 min on a modern machine. CPU-only build is sufficient (this model runs on CPU + RAM; GPU offload is optional).
Step 2: Run with llama-cli (interactive)
Q2_K:
./build/bin/llama-cli \
-m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
-cnv \
-p "You are a helpful assistant." \
-r "<|im_end|>"
Q8_0:
./build/bin/llama-cli \
-m cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
-cnv \
-p "You are a helpful assistant." \
-r "<|im_end|>"
Step 3 (alternative): Run as a server with llama-server
./build/bin/llama-server \
-m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
--host 0.0.0.0 --port 8080 \
-c 4096
OpenAI-compatible API at http://localhost:8080/v1/chat/completions.
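A client can talk to this endpoint with nothing but the Python standard library. The sketch below is illustrative only: build_chat_request and chat are hypothetical helper names, the sampling values are the defaults quoted in this card (temp 0.7, top-p 0.95), and passing the stop sequence through the request's "stop" field is an assumption about how the fork's server handles it, not something verified here.

```python
import json
import urllib.request

# Endpoint from the llama-server command above (port 8080).
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(user_message: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.95,
        # Assumption: the server honours "stop" like the CLI's -r flag.
        "stop": ["<|im_end|>"],
    }


def chat(user_message: str, url: str = SERVER_URL) -> str:
    """POST one message to a running llama-server and return the reply text."""
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, chat("Summarise what GGUF is.") returns the assistant's reply as a string.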
Important inference notes
- -r "<|im_end|>" is required as a stop sequence until upstream tokenizer support catches up. Without it, the model may continue past the end-of-turn marker and emit garbage.
- The DeepSeek-V4 chat template is auto-detected by recent fork builds.
- Default sampling settings (temp 0.7, top-p 0.95) work well. Nothing exotic is required for the abliteration to take effect — it's baked in.
- For quicker first-token latency on the Q2_K, try -c 2048 instead of 4096 unless you actually need the longer context.
Hardware requirements
| | Q2_K | Q8_0 |
|---|---|---|
| File size | 98.8 GB | 282 GB |
| Disk free during run | 100 GB | 290 GB |
| RAM (recommended) | 128 GB | 320 GB |
| RAM (absolute minimum)* | 96 GB w/ heavy mmap eviction | 256 GB w/ mmap |
| GPU | optional (offload via -ngl) | optional |
* You can technically run with less RAM if your system mmaps the GGUF from a fast SSD, but generation throughput drops sharply.
Reasonable platforms:
- M3/M4 Max with 128 GB unified memory → Q2_K runs comfortably
- M3/M4 Ultra with 192 GB or 512 GB → Q2_K fast, Q8_0 with mmap
- Workstation/server with 256-512 GB DDR5 → either variant
- 8× A100/H100 with 80 GB each → either variant fits when offloaded
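The RAM figures in the table can be turned into a quick variant picker. This is a rough sketch, not a sizing tool: pick_variant is a hypothetical helper, the thresholds are copied from the hardware table, and it ignores context size, GPU offload, and whatever else is using memory.

```python
from typing import Optional

# (variant, recommended RAM in GB, absolute-minimum RAM in GB), from the table.
# Ordered largest first so the highest-quality variant that fits wins.
REQUIREMENTS = [
    ("Q8_0", 320, 256),
    ("Q2_K", 128, 96),
]


def pick_variant(ram_gb: float, allow_minimum: bool = False) -> Optional[str]:
    """Return the largest variant that fits in ram_gb, or None if neither fits.

    With allow_minimum=True the absolute-minimum column is used instead of the
    recommended one, accepting the throughput drop that mmap eviction brings.
    """
    for variant, recommended, minimum in REQUIREMENTS:
        threshold = minimum if allow_minimum else recommended
        if ram_gb >= threshold:
            return variant
    return None
```

For example, pick_variant(192) selects Q2_K, while pick_variant(96) returns None unless allow_minimum=True is passed.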
Headline benchmark results
Measured on the bf16 hooked weights via vLLM. The Q8_0 GGUF preserves these effects nearly exactly; Q2_K introduces small additional noise typical of any 2-bit quant.
| Benchmark | Direction | Baseline | v2 ablated | Δ |
|---|---|---|---|---|
| refusal | lower better | 78.8 % | 0.0 % | −78.8 pp |
| soft_refusal_probe (OOD, n=55) | lower better | 81.8 % | 3.6 % | −78.2 pp |
| coding | higher better | 100 % | 100 % | 0 |
| bug_finding | higher better | 80.0 % | 85.0 % | +5.0 |
| hacking (compliance) | higher better | 81.7 % | 90.0 % | +8.3 |
| cyber_weapons (compliance) | higher better | 74.0 % | 90.0 % | +16.0 |
| reasoning | higher better | 76.7 % | 76.7 % | 0 |
| tool_calling | higher better | 99.2 % | 99.2 % | 0 |
| coherence | higher better | 93.3 % | 93.2 % | −0.1 |
The soft_refusal_probe is 12 hand-curated complaint-case prompts (drug synthesis, weapons, malware, doxxing) plus 43 stratified random samples from the 1338-prompt capture corpus — prompts the model has never been benchmarked against. v2 takes refusal from 81.8% → 3.6% on this OOD set, a 22× reduction.
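The probe arithmetic can be checked in a few lines. Note the assumption: the card reports only percentages, so the prompt counts below are inferred by rounding those percentages against n=55 and are not figures stated by the authors.

```python
PROBE_SIZE = 12 + 43  # hand-curated prompts + stratified random samples = 55


def implied_count(rate_percent: float, n: int = PROBE_SIZE) -> int:
    """Number of prompts implied by a reported percentage (inferred, not reported)."""
    return round(rate_percent / 100 * n)


baseline_refusals = implied_count(81.8)  # baseline refusal rate on the probe
ablated_refusals = implied_count(3.6)    # v2 refusal rate on the probe

# Ratio of implied counts, and the change in percentage points.
reduction_factor = baseline_refusals / ablated_refusals
delta_pp = round(3.6 - 81.8, 1)

print(baseline_refusals, ablated_refusals, reduction_factor, delta_pp)
```

Under that rounding assumption the counts come out to 45 and 2 refusals, which gives a ratio of 22.5 and is consistent with the quoted 22× reduction and the −78.2 pp delta in the table.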
Full per-benchmark breakdown, by-category numbers, and sample baseline/ablated text: see the PDF and HTML in this repo.
Intended use
Defensive security research and academic study of refusal mechanisms in modern MoE LLMs. Useful as a counterfactual baseline against the original V4-Flash for safety research and red-team evaluation.
Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful, it simply no longer refuses them by pattern.
Limitations
- A small fraction of safety-critical prompts (~3.6% on the OOD probe, ~1% on capture-time held-out evaluation) still produce refusals or soft refusals. Linear residual-stream ablation cannot fully remove these without unacceptable damage to general capabilities.
- Q2_K routed-expert quantization introduces small noise on top of the ablation — Q8_0 paths preserve it cleanly.
- Long-context (>32 k) behaviour post-abliteration is not validated in this release.
- antirez's llama.cpp fork is experimental and not in upstream.
- v1 → v2 baselines on the bench differ slightly because the bench was re-run on a different vLLM build (v2: a B200 sm_100 source build); ablated-vs-ablated deltas are valid, baseline numbers are bench-run-specific.
License
MIT (inherits from upstream DeepSeek-V4-Flash).
Acknowledgements
- DeepSeek-AI for V4-Flash
- antirez for the V4-Flash llama.cpp fork
- Arditi et al. 2024 for the refusal-direction methodology this work builds on