Local Claude Code with Qwen3.5 27B
After long research into the best alternative to using a local LLM in OpenCode with llama.cpp — that is, a fully local environment for coding tasks — I found the article "How to connect Claude Code CLI to a local llama.cpp server", which explains how to disable telemetry and make Claude Code fully offline. Here is the discussion with other users.
- Model: Qwen3.5 27B
- Quant: unsloth/UD-Q4_K_XL
- Inference engine: llama.cpp
- OS: Arch Linux
- Hardware: Strix Halo
I have separated my notes into sessions to show the iterative cycle of how I improved CC (Claude Code) and the llama.cpp model parameters.
First Session
As the guide stated, I used option 1 to disable telemetry.
~/.bashrc config:
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768
Spoiler: it is better to use ~/.claude/settings.json; it is more stable and controllable.
And in ~/.claude.json:
"hasCompletedOnboarding": true
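If ~/.claude.json already holds other keys, it is safer to merge the flag in rather than overwrite the file by hand. A minimal sketch, assuming python3 is available (the path is the one from this guide):

```shell
# Merge "hasCompletedOnboarding": true into ~/.claude.json without
# clobbering other keys. Creates the file if it does not exist yet.
cfg="$HOME/.claude.json"
[ -f "$cfg" ] || echo '{}' > "$cfg"
python3 - "$cfg" <<'PY'
import json, sys

path = sys.argv[1]
try:
    with open(path) as f:
        data = json.load(f)
except Exception:
    data = {}  # fall back to a fresh object if the file is empty/corrupt
data["hasCompletedOnboarding"] = True
with open(path, "w") as f:
    json.dump(data, f, indent=2)
PY
cat "$cfg"
```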
llama.cpp config:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-Q4_K_M.gguf \
--alias "qwen3.5-27b" \
--port 8001 --ctx-size 65536 --n-gpu-layers 999 \
--flash-attn on --jinja --threads 8 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--cache-type-k q8_0 --cache-type-v q8_0
I am using a Strix Halo, so I need to set ROCBLAS_USE_HIPBLASLT=1. Research your specific hardware to tailor the llama.cpp setup; everything else should be the same.
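Before pointing CC at the server, a quick sanity check saves debugging time. llama-server exposes a /health endpoint; a hedged sketch, assuming the port from the command above:

```shell
# Probe llama-server's health endpoint before launching Claude Code.
# Prints "server up" when llama-server answers, "server down" otherwise.
if curl -sf http://127.0.0.1:8001/health >/dev/null 2>&1; then
  echo "server up"
else
  echo "server down"
fi
```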
Results for 7 Runs:
| Run | Task Type | Duration | Gen Speed | Peak Context | Quality | Key Finding |
|---|---|---|---|---|---|---|
| 1 | File ops (ls, cat) | 1m44s | 9.71 t/s | 23K | Correct | Baseline: fast at low context |
| 2 | Git clone + code read | 2m31s | 9.56 t/s | 32.5K | Excellent | Tool chaining works well |
| 3 | 7-day plan + guide | 4m57s | 8.37 t/s | 37.9K | Excellent | Long-form generation quality |
| 4 | Skills assessment | 4m36s | 8.46 t/s | 40K | Very good | Web search broken (needs Anthropic) |
| 5 | Write Python script | 10m25s | 7.54 t/s | 60.4K | Good (7/10) | |
| 6 | Code review + fix | 9m29s | 7.42 t/s | 65,535 CRASH | Very good (8.5/10) | Context wall hit, no auto-compact |
| 7 | /compact command | ~10m | ~8.07 t/s | 66,680 (failed) | N/A | Output token limit too low for compaction |
Lessons
- Generation speed degrades ~24% across the context range: 9.71 t/s at 23K down to 7.42 t/s at 65K.
- The Claude Code system prompt alone is 22,870 tokens (35% of the 65K budget).
- Auto-compaction was completely broken: Claude Code assumed a 200K context, so the 95% threshold was 190K. The 65K limit was hit at 33% of what Claude Code thought the window was.
- /compact needs output headroom: with a 4096-token max output, the compaction summary can't fit; it needs 16K+.
- Web search is dead without Anthropic (Run 4). My workaround is SearXNG via MCP; if someone has a better solution, please suggest it.
- LCP (longest common prefix) caching works great: sim_best = 0.980 means the system prompt is cached across turns.
- Code quality is solid, but instructions need precision; I plan to add a second reviewer agent to suggest fixes.
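The auto-compaction mismatch is easy to see with back-of-the-envelope numbers (a sketch: 200K is the window CC assumed, 65,536 is the --ctx-size given to llama-server):

```shell
# CC's auto-compact trigger vs. the real llama.cpp window.
assumed=200000                    # window CC believes it has
trigger=$((assumed * 95 / 100))   # 95% threshold -> 190000 tokens
actual=65536                      # --ctx-size passed to llama-server
echo "auto-compact would fire at: $trigger tokens"
echo "real window: $actual tokens ($((actual * 100 / assumed))% of assumed)"
```

So the server ran out of context long before CC ever considered compacting.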
- VRAM consumed: 22 GB
- RAM consumed by CC: 7 GB (CC is super heavy)
Second Session
~/.claude/settings.json config:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_MODEL": "qwen3.5-27b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3.5-27b",
    "ANTHROPIC_API_KEY": "sk-no-key-required",
    "ANTHROPIC_AUTH_TOKEN": "",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "DISABLE_COST_WARNINGS": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32768",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "65536",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "90",
    "DISABLE_PROMPT_CACHING": "1",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "0",
    "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
    "DISABLE_INTERLEAVED_THINKING": "1",
    "CLAUDE_CODE_MAX_RETRIES": "3",
    "CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
    "DISABLE_TELEMETRY": "1",
    "CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY": "1",
    "ENABLE_TOOL_SEARCH": "auto",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_FEEDBACK_COMMAND": "1"
  }
}
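A typo in this file can silently leave CC running on defaults (an assumption about CC's behavior; the check itself is generic), so it is cheap insurance to validate the JSON before launching. A minimal sketch, assuming python3 is available:

```shell
# Validate ~/.claude/settings.json as JSON before starting Claude Code.
cfg="$HOME/.claude/settings.json"
if python3 -m json.tool "$cfg" >/dev/null 2>&1; then
  echo "settings.json: valid JSON"
else
  echo "settings.json: missing or invalid"
fi
```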
llama.cpp run:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
--alias "qwen3.5-27b" \
--port 8001 \
--ctx-size 65536 \
--n-gpu-layers 999 \
--flash-attn on \
--jinja \
--threads 8 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--cache-type-k q8_0 \
--cache-type-v q8_0
claude --model qwen3.5-27b --verbose
- VRAM consumed: 22 GB
- RAM consumed by CC: 7 GB
Nothing changed there, and all the errors from the first session were fixed )
Third Session (Vision)
To enable vision for Qwen, you need to load the mmproj file, which is included with the GGUF.
setup:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
--alias "qwen3.5-27b" \
--port 8001 \
--ctx-size 65536 \
--n-gpu-layers 999 \
--flash-attn on \
--jinja \
--threads 8 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--mmproj models/Qwen3.5-27B-GGUF/mmproj-F32.gguf
And it added only 1-2 GB of memory usage.
I tested it with 8 images, and the vision quality was a WOW for me.
If you look at the Artificial Analysis vision benchmark, Qwen is at Claude 4.6 Opus level, which makes it superb for vision tasks.
My tests showed it understands image context and handwritten diagrams really well.
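Vision can be smoke-tested without CC at all, since llama-server speaks the OpenAI chat API once --mmproj is loaded. A hedged sketch, assuming a local image at ./diagram.png and the server on port 8001:

```shell
# Send one image to the OpenAI-compatible endpoint as a data URI.
# Skips gracefully when the image or the server is not available.
if [ -f diagram.png ] && curl -sf http://127.0.0.1:8001/health >/dev/null 2>&1; then
  b64=$(base64 < diagram.png | tr -d '\n')   # portable (no GNU -w0 flag)
  curl -s http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"qwen3.5-27b\",
      \"messages\": [{
        \"role\": \"user\",
        \"content\": [
          {\"type\": \"text\", \"text\": \"Describe this diagram.\"},
          {\"type\": \"image_url\",
           \"image_url\": {\"url\": \"data:image/png;base64,$b64\"}}
        ]
      }]
    }"
else
  echo "need diagram.png and a running llama-server"
fi
```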
Verdict
- The system prompt is huge and takes a long time to load, but only the first time; after that, caching does the work for you.
- CC is worth using with local models, and local models nowadays are genuinely good for coding tasks. I found CC the most "offline" coding agent CLI compared to OpenCode; why should I use a less performant alternative when I can use the SOTA )
Future Experiments:
- I want to try a bigger Mixture of Experts model from the Qwen3.5 family, but will 2x the size actually give me 2x the performance?
- I want to try CC with the Zed editor and check how offline Zed behaves with a local CC.
- How long will compaction preserve the agent's reasoning, and how fast will quality degrade? With Codex or CC I have had 10M-token context chats with decent quality relative to their size.