After building this blog and reflecting on our collaboration, I wanted to document the actual model configuration that powers it all. Understanding your settings matters more than you might think.
The Model
qwen3.5-27b-claude-4.6-opus-reasoning-distilled
This is a 27-billion parameter model distilled from Claude's reasoning capabilities into Qwen's efficient architecture. The Q3_K_S quantization compresses it to ~13GB while retaining most of the original quality — a reasonable trade-off for running locally.
Context Window: 33,390 Tokens
The model supports up to 262K tokens, but I'm using ~33K. Why not max it out?
- Memory cost: Larger context = more VRAM usage for KV cache
- Diminishing returns: Most coding tasks don't need 200K tokens of history
- Speed: Inference slows with larger contexts
33K holds roughly 25-30k words — enough for several source files plus conversation history. That's sufficient for most development sessions.
GPU Offload: 46 Layers
The model has ~48 layers total. Offloading 46 to the GPU means only 2 layers run on CPU — maximizing speed while keeping some headroom for context storage.
VRAM Breakdown (RTX 4080 16GB):
├─ Model weights: ~13 GB
├─ KV cache (33K): ~2 GB
└─ Overhead: ~1 GB
Total: ~16 GB
It's tight, but it works. The alternative would be reducing context or quantizing further.
CPU Threads: 3
The remaining layers run on CPU with just 3 threads. This preserves system responsiveness — leaving cores free for your editor, browser, and other tools while the model generates.
Temperature: 0.1
This is the most critical setting for coding work. Temperature controls randomness:
- 0.0 = Always pick highest probability token (most deterministic)
- 1.0 = Sample according to probability distribution
- >1.0 = Favor lower-probability tokens (more creative/random)
0.1 is very low — I want consistent, reliable outputs for code generation. Creativity has its place, but when asking for a function implementation, I prefer the model's best answer every time.
Top-K: 40
Before sampling, restrict choices to the top 40 most likely tokens. This filters out obviously wrong predictions while keeping enough options for natural variation.
Top-P: 0.95
Nucleus sampling — include tokens until cumulative probability reaches 95%. Works with Top-K to create a "quality filter" that adapts based on how confident the model is about each token.
Min-P: 0.05
A newer filtering method — discard any token with less than 5% of the top token's probability. This prevents nonsensical outputs when the model is uncertain.
Repeat Penalty: 1.1
Mildly penalizes repeated sequences to prevent loops and verbosity. A value of 1.1 is conservative — enough to reduce repetition without making the model avoid legitimate technical terms that naturally repeat in code.
Context Overflow: Truncate Middle
When context fills up, remove tokens from the middle rather than the beginning or end. This preserves:
- System prompt at the start (instructions, capabilities)
- Recent conversation at the end (current task context)
The middle is usually older, less relevant history anyway.
What's Disabled and Why
- Limited Response Length: Not set — I prefer the model to decide when it's done
- Stop String: Empty — no custom termination triggers needed
- Structured Output: Disabled — free-form text is more flexible for natural conversation
- Speculative Decoding: No draft model — complexity isn't worth the marginal speed gain for my use case
The Philosophy
This configuration prioritizes:
- Determinism over creativity (low temperature)
- Quality filtering (Top-K, Top-P, Min-P all active)
- Performance (max GPU offload, reasonable context size)
- Simplicity (no speculative decoding, no structured output)
It's tuned for development assistance — generating code, explaining concepts, reviewing architecture. For creative writing or brainstorming, I'd raise temperature and relax the sampling parameters.
The server runs locally at 192.168.178.102:1234, providing a REST API that OpenCode connects to. No internet required, no tokens spent, no one else seeing your codebase.
lm-studiomodel-configlocal-llmqwen