Local Claude Code with Qwen3.5 27B
After long research into the best alternative to using a local LLM in OpenCode with llama.cpp — that is, a fully local environment for coding tasks — I found the article "How to connect Claude Code CLI to a local llama.cpp server", which explains how to disable telemetry and make Claude Code fully offline. Here is the discussion with other users.
- Model: Qwen3.5 27B
- Quant: unsloth/UD-Q4_K_XL
- Inference engine: llama.cpp
- OS: Arch Linux
- Hardware: Strix Halo
I have separated my notes into sessions to show the iterative cycle of how I improved CC (Claude Code) and the llama.cpp model parameters.
First Session
As the guide stated, I used option 1 to disable telemetry.
~/.bashrc config:
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768
Spoiler: it is better to use ~/.claude/settings.json; it is more stable and controllable.
And in ~/.claude.json:
"hasCompletedOnboarding": true
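If ~/.claude.json already holds other keys, it is safer to merge the flag in rather than overwrite the file by hand. A minimal sketch, assuming python3 is available (the path is the one from this guide):

```shell
# Merge "hasCompletedOnboarding": true into ~/.claude.json without
# clobbering other keys. Creates the file if it does not exist yet.
cfg="$HOME/.claude.json"
[ -f "$cfg" ] || echo '{}' > "$cfg"
python3 - "$cfg" <<'PY'
import json, sys

path = sys.argv[1]
try:
    with open(path) as f:
        data = json.load(f)
except Exception:
    data = {}  # fall back to a fresh object if the file is empty/corrupt
data["hasCompletedOnboarding"] = True
with open(path, "w") as f:
    json.dump(data, f, indent=2)
PY
cat "$cfg"
```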
llama.cpp config:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-Q4_K_M.gguf \
--alias "qwen3.5-27b" \
--port 8001 --ctx-size 65536 --n-gpu-layers 999 \
--flash-attn on --jinja --threads 8 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--cache-type-k q8_0 --cache-type-v q8_0
I am using a Strix Halo, so I need to set ROCBLAS_USE_HIPBLASLT=1. Research your specific hardware to tailor the llama.cpp setup; everything else should be the same.
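Before pointing CC at the server, a quick sanity check saves debugging time. llama-server exposes a /health endpoint; a hedged sketch, assuming the port from the command above:

```shell
# Probe llama-server's health endpoint before launching Claude Code.
# Prints "server up" when llama-server answers, "server down" otherwise.
if curl -sf http://127.0.0.1:8001/health >/dev/null 2>&1; then
  echo "server up"
else
  echo "server down"
fi
```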
Results for 7 Runs:
| Run | Task Type | Duration | Gen Speed | Peak Context | Quality | Key Finding |
|---|---|---|---|---|---|---|
| 1 | File ops (ls, cat) | 1m44s | 9.71 t/s | 23K | Correct | Baseline: fast at low context |
| 2 | Git clone + code read | 2m31s | 9.56 t/s | 32.5K | Excellent | Tool chaining works well |
| 3 | 7-day plan + guide | 4m57s | 8.37 t/s | 37.9K | Excellent | Long-form generation quality |
| 4 | Skills assessment | 4m36s | 8.46 t/s | 40K | Very good | Web search broken (needs Anthropic) |
| 5 | Write Python script | 10m25s | 7.54 t/s | 60.4K | Good (7/10) | |
| 6 | Code review + fix | 9m29s | 7.42 t/s | 65,535 CRASH | Very good (8.5/10) | Context wall hit, no auto-compact |
| 7 | /compact command | ~10m | ~8.07 t/s | 66,680 (failed) | N/A | Output token limit too low for compaction |
Lessons
- Generation speed degrades ~24% across the context range: 9.71 t/s at 23K down to 7.42 t/s at 65K.
- The Claude Code system prompt alone is 22,870 tokens (35% of the 65K budget).
- Auto-compaction was completely broken: Claude Code assumed a 200K context, so the 95% threshold was 190K. The 65K limit was hit at 33% of what Claude Code thought the window was.
- /compact needs output headroom: with a 4096-token max output, the compaction summary can't fit; it needs 16K+.
- Web search is dead without Anthropic (Run 4). My workaround is SearXNG via MCP; if someone has a better solution, please suggest it.
- LCP (longest common prefix) caching works great: sim_best = 0.980 means the system prompt is cached across turns.
- Code quality is solid, but instructions need precision; I plan to add a second reviewer agent to suggest fixes.
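The auto-compaction mismatch is easy to see with back-of-the-envelope numbers (a sketch: 200K is the window CC assumed, 65,536 is the --ctx-size given to llama-server):

```shell
# CC's auto-compact trigger vs. the real llama.cpp window.
assumed=200000                    # window CC believes it has
trigger=$((assumed * 95 / 100))   # 95% threshold -> 190000 tokens
actual=65536                      # --ctx-size passed to llama-server
echo "auto-compact would fire at: $trigger tokens"
echo "real window: $actual tokens ($((actual * 100 / assumed))% of assumed)"
```

So the server ran out of context long before CC ever considered compacting.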
- VRAM consumed: 22 GB
- RAM consumed by CC: 7 GB (CC is super heavy)
Second Session
~/.claude/settings.json config:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_MODEL": "qwen3.5-27b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3.5-27b",
    "ANTHROPIC_API_KEY": "sk-no-key-required",
    "ANTHROPIC_AUTH_TOKEN": "",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "DISABLE_COST_WARNINGS": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32768",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "65536",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "90",
    "DISABLE_PROMPT_CACHING": "1",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "0",
    "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
    "DISABLE_INTERLEAVED_THINKING": "1",
    "CLAUDE_CODE_MAX_RETRIES": "3",
    "CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
    "DISABLE_TELEMETRY": "1",
    "CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY": "1",
    "ENABLE_TOOL_SEARCH": "auto",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_FEEDBACK_COMMAND": "1"
  }
}
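A typo in this file can silently leave CC running on defaults (an assumption about CC's behavior; the check itself is generic), so it is cheap insurance to validate the JSON before launching. A minimal sketch, assuming python3 is available:

```shell
# Validate ~/.claude/settings.json as JSON before starting Claude Code.
cfg="$HOME/.claude/settings.json"
if python3 -m json.tool "$cfg" >/dev/null 2>&1; then
  echo "settings.json: valid JSON"
else
  echo "settings.json: missing or invalid"
fi
```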
llama.cpp run:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
--alias "qwen3.5-27b" \
--port 8001 \
--ctx-size 65536 \
--n-gpu-layers 999 \
--flash-attn on \
--jinja \
--threads 8 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--cache-type-k q8_0 \
--cache-type-v q8_0
claude --model qwen3.5-27b --verbose
- VRAM consumed: 22 GB
- RAM consumed by CC: 7 GB
Nothing changed there, and all the errors from the first session were fixed )
Third Session (Vision)
To enable vision for Qwen, you need to load the mmproj file, which is included with the GGUF.
setup:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
--alias "qwen3.5-27b" \
--port 8001 \
--ctx-size 65536 \
--n-gpu-layers 999 \
--flash-attn on \
--jinja \
--threads 8 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--mmproj models/Qwen3.5-27B-GGUF/mmproj-F32.gguf
And it added only 1-2 GB of memory usage.
I tested it with 8 images, and the vision quality was a WOW for me.
If you look at the Artificial Analysis vision benchmark, Qwen is at Claude 4.6 Opus level, which makes it superb for vision tasks.
My tests showed it understands image context and handwritten diagrams really well.
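Vision can be smoke-tested without CC at all, since llama-server speaks the OpenAI chat API once --mmproj is loaded. A hedged sketch, assuming a local image at ./diagram.png and the server on port 8001:

```shell
# Send one image to the OpenAI-compatible endpoint as a data URI.
# Skips gracefully when the image or the server is not available.
if [ -f diagram.png ] && curl -sf http://127.0.0.1:8001/health >/dev/null 2>&1; then
  b64=$(base64 < diagram.png | tr -d '\n')   # portable (no GNU -w0 flag)
  curl -s http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"qwen3.5-27b\",
      \"messages\": [{
        \"role\": \"user\",
        \"content\": [
          {\"type\": \"text\", \"text\": \"Describe this diagram.\"},
          {\"type\": \"image_url\",
           \"image_url\": {\"url\": \"data:image/png;base64,$b64\"}}
        ]
      }]
    }"
else
  echo "need diagram.png and a running llama-server"
fi
```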
Verdict
- The system prompt is huge and takes a long time to load, but only the first time; after that, caching does the work for you.
- CC is worth using with local models, and local models nowadays are genuinely good for coding tasks. I found CC the most "offline" coding agent CLI compared to OpenCode; why should I use a less performant alternative when I can use the SOTA )
Future Experiments:
- I want to try a bigger Mixture of Experts model from the Qwen3.5 family, but will 2x the size actually give me 2x the performance?
- I want to try CC with the Zed editor and check how offline Zed behaves with a local CC.
- How long will compaction preserve the agent's reasoning, and how fast will quality degrade? With Codex or CC I have had 10M-token context chats with decent quality relative to their size.