Capsule

Local Claude Code with Qwen3.5 27B

After long research into the best alternative to "Using a local LLM in OpenCode with llama.cpp", wanting a totally local environment for coding tasks, I found the article "How to connect Claude Code CLI to a local llama.cpp server", which explains how to disable telemetry and make Claude Code totally offline.

Here is the discussion with other users.

My setup:

- Model: Qwen3.5 27B
- Quant: unsloth/UD-Q4_K_XL
- Inference engine: llama.cpp
- Operating system: Arch Linux
- Hardware: Strix Halo

I have split my setups into sessions to show the iterative cycle of how I improved the Claude Code (CC) and llama.cpp parameters.

First Session

As the guide stated, I used option 1 to disable telemetry.

~/.bashrc config:

    export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"  
    export ANTHROPIC_API_KEY="not-set"  
    export ANTHROPIC_AUTH_TOKEN="not-set"  
    export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1  
    export CLAUDE_CODE_ENABLE_TELEMETRY=0  
    export DISABLE_AUTOUPDATER=1  
    export DISABLE_TELEMETRY=1  
    export CLAUDE_CODE_DISABLE_1M_CONTEXT=1  
    export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096  
    export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768

Spoiler: it is better to use ~/.claude/settings.json; it is more stable and controllable.

and in ~/.claude.json:

    "hasCompletedOnboarding": true
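If you prefer not to hand-edit the JSON, the flag can be set with jq (an assumption on my part that jq is installed; the snippet also creates the file if it does not exist yet):

```shell
# Set the onboarding flag in ~/.claude.json without opening an editor.
# Creates an empty config first if the file is missing.
CFG="$HOME/.claude.json"
[ -f "$CFG" ] || echo '{}' > "$CFG"
jq '.hasCompletedOnboarding = true' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"
```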

llama.cpp config:

    ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
        --model models/Qwen3.5-27B-Q4_K_M.gguf \
        --alias "qwen3.5-27b" \
        --port 8001 --ctx-size 65536 --n-gpu-layers 999 \
        --flash-attn on --jinja --threads 8 \
        --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
        --cache-type-k q8_0 --cache-type-v q8_0

I am using Strix Halo, so I need to set ROCBLAS_USE_HIPBLASLT=1. Research your concrete hardware to specialize the llama.cpp setup; everything else should be the same.

Results for 7 Runs:

| Run | Task type | Duration | Gen speed | Peak context | Quality | Key finding |
|-----|-----------|----------|-----------|--------------|---------|-------------|
| 1 | File ops (ls, cat) | 1m44s | 9.71 t/s | 23K | Correct | Baseline: fast at low context |
| 2 | Git clone + code read | 2m31s | 9.56 t/s | 32.5K | Excellent | Tool chaining works well |
| 3 | 7-day plan + guide | 4m57s | 8.37 t/s | 37.9K | Excellent | Long-form generation quality |
| 4 | Skills assessment | 4m36s | 8.46 t/s | 40K | Very good | Web search broken (needs Anthropic) |
| 5 | Write Python script | 10m25s | 7.54 t/s | 60.4K | Good (7/10) | |
| 6 | Code review + fix | 9m29s | 7.42 t/s | 65,535 (CRASH) | Very good (8.5/10) | Context wall hit, no auto-compact |
| 7 | /compact command | ~10m | ~8.07 t/s | 66,680 (failed) | N/A | Output token limit too low for compaction |

Lessons

  1. Generation speed degrades ~24% across the context range: from 9.71 t/s (23K) down to 7.42 t/s (65K).
  2. The Claude Code system prompt alone is 22,870 tokens (35% of the 65K budget).
  3. Auto-compaction was completely broken: Claude Code assumed a 200K context, so the 95% threshold = 190K. The 65K limit was hit at 33% of what Claude Code thought the window was.
  4. /compact needs output headroom: at 4096 max output tokens, the compaction summary can't fit. It needs 16K+.
  5. Web search is dead without Anthropic (Run 4). My solution is SearXNG via MCP; if someone has a better one, please suggest it.
  6. LCP prefix caching works great: sim_best = 0.980 means the system prompt is cached across turns.
  7. Code quality is solid, but instructions need precision: I plan to add a second reviewer agent to suggest fixes.
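The auto-compaction failure in lesson 3 is just arithmetic; a quick sketch of the mismatch:

```shell
# Lesson 3 as arithmetic: Claude Code assumed a 200K window, so its
# 95% auto-compact threshold sat far beyond the real server limit.
ASSUMED_WINDOW=200000
REAL_WINDOW=65536
THRESHOLD=$((ASSUMED_WINDOW * 95 / 100))
echo "auto-compact would fire at $THRESHOLD tokens"
echo "llama.cpp crashes at $REAL_WINDOW tokens"
# The real window is only ~33% of the assumed one, so the crash
# always comes first:
awk "BEGIN { printf \"%.0f%%\n\", $REAL_WINDOW / $ASSUMED_WINDOW * 100 }"
```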

VRAM Consumed - 22GB
RAM Consumed (by CC) - 7GB (CC is super heavy)

Second Session

~/.claude/settings.json config:

    {
      "env": {
        "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
        "ANTHROPIC_MODEL": "qwen3.5-27b",
        "ANTHROPIC_SMALL_FAST_MODEL": "qwen3.5-27b",
        "ANTHROPIC_API_KEY": "sk-no-key-required",
        "ANTHROPIC_AUTH_TOKEN": "",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
        "DISABLE_COST_WARNINGS": "1",
        "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
        "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
        "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32768",
        "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "65536",
        "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "90",
        "DISABLE_PROMPT_CACHING": "1",
        "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
        "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
        "MAX_THINKING_TOKENS": "0",
        "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
        "DISABLE_INTERLEAVED_THINKING": "1",
        "CLAUDE_CODE_MAX_RETRIES": "3",
        "CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
        "DISABLE_TELEMETRY": "1",
        "CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY": "1",
        "ENABLE_TOOL_SEARCH": "auto",
        "DISABLE_AUTOUPDATER": "1",
        "DISABLE_ERROR_REPORTING": "1",
        "DISABLE_FEEDBACK_COMMAND": "1"
      }
    }
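After editing, it is worth a quick sanity check that the file is still valid JSON; a stray trailing comma is easy to miss, and a sketch with jq (assuming it is installed) catches it:

```shell
# Sanity-check the settings file; jq exits non-zero on any JSON
# syntax error (e.g. a trailing comma).
CFG="$HOME/.claude/settings.json"
if jq empty "$CFG" 2>/dev/null; then
  echo "settings.json is valid JSON"
else
  echo "settings.json is missing or malformed"
fi
```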

llama.cpp run:

    ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
        --model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
        --alias "qwen3.5-27b" \
        --port 8001 \
        --ctx-size 65536 \
        --n-gpu-layers 999 \
        --flash-attn on \
        --jinja \
        --threads 8 \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        --min-p 0.00 \
        --cache-type-k q8_0 \
        --cache-type-v q8_0

    claude --model qwen3.5-27b --verbose
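Before launching CC, I find it useful to confirm the server is actually ready; llama-server's /health endpoint only reports OK once the model has finished loading:

```shell
# Check that llama-server is up before starting Claude Code.
# /health returns success only once the model has finished loading.
if curl -sf http://127.0.0.1:8001/health > /dev/null; then
  STATUS=up
else
  STATUS=down
fi
echo "llama-server is $STATUS"
```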

VRAM Consumed - 22GB
RAM Consumed (by CC) - 7GB
Nothing changed from the first session.

All the errors from the first session were fixed :)

Third Session (Vision)

To turn on vision for Qwen, you need the mmproj projector file, which is included alongside the GGUF.

setup:

    ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
        --model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
        --alias "qwen3.5-27b" \
        --port 8001 \
        --ctx-size 65536 \
        --n-gpu-layers 999 \
        --flash-attn on \
        --jinja \
        --threads 8 \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        --min-p 0.00 \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --mmproj models/Qwen3.5-27B-GGUF/mmproj-F32.gguf

and it only added 1-2 GB of RAM usage.

I tested with 8 images, and the vision quality wowed me. If you look at the Artificial Analysis Vision Benchmark, Qwen is at Claude 4.6 Opus level, which makes it a strong pick for vision tasks.

My tests showed that it understands image context and handwritten diagrams really well.
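For a quick smoke test outside Claude Code, an image can be posted straight to llama-server's OpenAI-compatible chat endpoint. A sketch, assuming the server above is running on port 8001; `diagram.png` is a placeholder name for your image:

```shell
# Base64-encode the image into a data URL (diagram.png is a placeholder).
IMG_B64=$(base64 -w0 diagram.png 2>/dev/null)

# Build an OpenAI-style vision request with jq.
PAYLOAD=$(jq -n --arg url "data:image/png;base64,$IMG_B64" '{
  model: "qwen3.5-27b",
  messages: [{
    role: "user",
    content: [
      {type: "text", text: "Describe this diagram."},
      {type: "image_url", image_url: {url: $url}}
    ]
  }]
}')

# POST to the running llama-server and print the reply text.
curl -s http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | jq -r '.choices[0].message.content'
```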

Verdict

Future Experiments:
- I want to try a bigger Mixture-of-Experts model from the Qwen3.5 family, but will 2x the size actually give me 2x the performance?
- I want to try CC with the Zed editor and check how an offline Zed behaves with a local CC.
- How long will compaction hold an agent's reasoning, and how will quality degrade? With Codex or CC I have had 10M-token context chats with decent quality relative to their size.