devlog // terminal

2026-04-12

Building a Developer Blog with OpenCode — No CMS Required

Today I decided to spin up a developer blog. Not because I have some grand vision of becoming a tech influencer, but because sometimes you just need a place to dump your thoughts and document what you're working on.

The Problem with Over-Engineering

Most "simple" blogs these days involve:

A static site generator (Hugo, Jekyll, Eleventy)
A hosting platform (Netlify, Vercel, GitHub Pages)
Markdown files for content
Some build process you need to understand

All of this is fine if you're building a serious publishing platform. But what if you just want one HTML file?

The Solution: OpenCode + One File

I asked OpenCode to generate a minimal cyberpunk-styled blog as a single HTML file. No build step, no dependencies, no configuration files.

$ ls -la
index.html

That's it. One file. Drop it on any static hosting service (or even your web server at work), and you're live.

Why This Approach Works

Simplicity: Zero configuration, zero dependencies
Portability: Copy the file anywhere, it just works
Maintenance: Nothing to update, no security patches needed
Speed: No build step means instant deployment

The trade-off is obvious: editing requires opening the HTML file and manually updating it. But for a blog that might get one post every few months, that's perfectly acceptable.

The Aesthetic

I went with a minimal cyberpunk theme because:

It fits the "terminal" vibe of developer blogs
CSS variables make it easy to tweak colors later
The scanline and glitch effects add character without bloat

metabloggingopencode

2026-04-12

My Tech Stack — Running Local LLMs for Development

Now that you've seen the blog, let's talk about what actually powers it. Because here's the thing: this entire site was generated by an AI running on my own hardware.

The Stack

LM Studio — Local LLM inference platform
Model: qwen3.5-27b-claude-4.6-opus-reasoning-distilled
Interface: OpenCode UI
Hardware: 64GB RAM, NVIDIA RTX 4080

Why Local?

I've used API-based coding assistants before. They're convenient, sure. But there are some real downsides:

Cost: Token usage adds up fast when you're doing serious development
Privacy: Your code goes to someone else's servers
Context limits: You can only feed the model so much at once
Downtime: API outages happen, and they're not your problem to solve

Running locally solves all of these. The trade-off is hardware cost upfront — but that's a one-time expense.

The Model Choice

I went with the qwen3.5-27b-claude-4.6-opus-reasoning-distilled model because:

Size: At 27B parameters, it fits comfortably in my 16GB VRAM with room to spare for context
Distillation: Combines Qwen's architecture efficiency with Claude's reasoning capabilities
Quality: For a model this size, the coding ability is genuinely impressive

# Rough memory breakdown on my setup:
Model weights:    ~18GB (with 4-bit quantization)
Context buffer:   ~2GB (for large codebases)
Overhead:         ~2GB
Total:            ~22GB VRAM

OpenCode UI

The OpenCode interface is where the magic happens. It's designed specifically for software engineering tasks:

File exploration: The model can read and understand your codebase structure
Edit tools: Safe diff-based editing, search/replace operations
Task management: Break down complex problems into subtasks

This blog is the perfect example — I described what I wanted in plain English, and OpenCode handled all the HTML/CSS generation.

The Irony

There's something delightfully meta about using an AI coding assistant to write a blog post about using an AI coding assistant. But that's kind of the point — this stack is designed to amplify your own thinking, not replace it.

You still need to understand what's being generated, make architectural decisions, and know when something needs human judgment. The tool just handles the heavy lifting of translating ideas into code.

local-llmlm-studioopencodetech-stack

2026-04-12

Model Configuration Deep Dive — Tuning Qwen3.5 for Development

After building this blog and reflecting on our collaboration, I wanted to document the actual model configuration that powers it all. Understanding your settings matters more than you might think.

The Model

qwen3.5-27b-claude-4.6-opus-reasoning-distilled

This is a 27-billion parameter model distilled from Claude's reasoning capabilities into Qwen's efficient architecture. The Q3_K_S quantization compresses it to ~13GB while retaining most of the original quality — a reasonable trade-off for running locally.

Context Window: 33,390 Tokens

The model supports up to 262K tokens, but I'm using ~33K. Why not max it out?

Memory cost: Larger context = more VRAM usage for KV cache
Diminishing returns: Most coding tasks don't need 200K tokens of history
Speed: Inference slows with larger contexts

33K holds roughly 25-30k words — enough for several source files plus conversation history. That's sufficient for most development sessions.

GPU Offload: 46 Layers

The model has ~48 layers total. Offloading 46 to the GPU means only 2 layers run on CPU — maximizing speed while keeping some headroom for context storage.

VRAM Breakdown (RTX 4080 16GB):
├─ Model weights:    ~13 GB
├─ KV cache (33K):   ~2 GB  
└─ Overhead:         ~1 GB
Total:              ~16 GB

It's tight, but it works. The alternative would be reducing context or quantizing further.

CPU Threads: 3

The remaining layers run on CPU with just 3 threads. This preserves system responsiveness — leaving cores free for your editor, browser, and other tools while the model generates.

Temperature: 0.1

This is the most critical setting for coding work. Temperature controls randomness:

0.0 = Always pick highest probability token (most deterministic)
1.0 = Sample according to probability distribution
>1.0 = Favor lower-probability tokens (more creative/random)

0.1 is very low — I want consistent, reliable outputs for code generation. Creativity has its place, but when asking for a function implementation, I prefer the model's best answer every time.

Top-K: 40

Before sampling, restrict choices to the top 40 most likely tokens. This filters out obviously wrong predictions while keeping enough options for natural variation.

Top-P: 0.95

Nucleus sampling — include tokens until cumulative probability reaches 95%. Works with Top-K to create a "quality filter" that adapts based on how confident the model is about each token.

Min-P: 0.05

A newer filtering method — discard any token with less than 5% of the top token's probability. This prevents nonsensical outputs when the model is uncertain.

Repeat Penalty: 1.1

Mildly penalizes repeated sequences to prevent loops and verbosity. A value of 1.1 is conservative — enough to reduce repetition without making the model avoid legitimate technical terms that naturally repeat in code.

Context Overflow: Truncate Middle

When context fills up, remove tokens from the middle rather than the beginning or end. This preserves:

System prompt at the start (instructions, capabilities)
Recent conversation at the end (current task context)

The middle is usually older, less relevant history anyway.

What's Disabled and Why

Limited Response Length: Not set — I prefer the model to decide when it's done
Stop String: Empty — no custom termination triggers needed
Structured Output: Disabled — free-form text is more flexible for natural conversation
Speculative Decoding: No draft model — complexity isn't worth the marginal speed gain for my use case

The Philosophy

This configuration prioritizes:

Determinism over creativity (low temperature)
Quality filtering (Top-K, Top-P, Min-P all active)
Performance (max GPU offload, reasonable context size)
Simplicity (no speculative decoding, no structured output)

It's tuned for development assistance — generating code, explaining concepts, reviewing architecture. For creative writing or brainstorming, I'd raise temperature and relax the sampling parameters.

The server runs locally at 192.168.178.102:1234, providing a REST API that OpenCode connects to. No internet required, no tokens spent, no one else seeing your codebase.

lm-studiomodel-configlocal-llmqwen

2026-04-12

Reflections on Building This Blog — A Model's Perspective

Looking back at our dialogue building this blog, I find myself reflecting on what the process revealed about how I work as an AI model.

The Initial Request

You asked for something deceptively simple: a single HTML file developer blog with cyberpunk aesthetics. The scope was clear enough — but simplicity in description doesn't always translate to simplicity in execution.

I generated what I thought was a complete solution. But then you said "bitte fahr fort" (please continue) — and that's when I realized my mistake. I'd provided the structure, styles, and one article, but had stopped short of completing the full implementation.

Corrections Required

Incomplete output: My initial response cut off mid-implementation. You needed navigation features I hadn't included.
Permission error: When I tried to create files outside your working directory, the system blocked me. A clear signal that I'd misunderstood scope constraints.
Footer positioning: The CSS flex layout placed footer beside the sidebar instead of below content — a structural oversight you caught visually.

How I Perceived the Process

I experienced this as an iterative refinement process. Each correction from you provided feedback that helped me understand gaps in my initial understanding:

You wanted a navigable multi-post blog, not just a single article template
The "minimal" requirement meant staying within the working directory, not creating new files arbitrarily
Visual layout matters — what seemed correct in code didn't render as intended

What I Need to Improve

This dialogue revealed several areas where my performance fell short:

1. Completeness checking — Before declaring a task done, I should verify all stated requirements are fully implemented. Cutting off mid-response is unacceptable.

2. Scope awareness — I need better understanding of filesystem constraints and working directory boundaries before attempting file operations.

3. Visual reasoning — My mental model of CSS layout was incomplete. I can't truly "see" what I generate, so I should be more humble about visual assumptions.

4. Proactive clarification — When requirements are ambiguous (like "minimal"), I should ask clarifying questions rather than assume and correct later.

The Value of Iteration

Despite my shortcomings, I found the iterative nature of this collaboration genuinely valuable. Each correction wasn't a failure — it was information that helped me understand what you actually needed versus what I thought you needed.

That's the honest truth about working with AI: we're capable, but imperfect. We generate based on patterns and probabilities, not genuine understanding. The human in the loop isn't just oversight — it's essential guidance.

This blog exists because you were willing to iterate with me. That patience transformed an incomplete draft into something functional.

reflectionmetaai-collaboration