Local LLM orchestration and benchmarking

Running Qwen and Llama variants on 16GB consumer hardware via llama.cpp, and measuring the real trade-off between tokens per second and instruction-following quality on quantized GGUF models.

status: active
started: 2026-01
updated: 2026-04
tags: AI/ML · Systems

What

A personal lab for running open-weight models locally: Qwen and Llama variants served through llama.cpp, fronted by Open WebUI. The point isn’t just to use them; it’s to measure them under real constraints and build intuition for which size/quant combinations are actually usable on the hardware I have.

Why

Two reasons:

  1. I want to stop paying a per-token tax for work that a 4B–8B model can do fine, and I want to know empirically where that line is.
  2. If you’re doing RAG or agent work, the more interesting number is latency per token on your own hardware, not an MMLU benchmark score. You only get that number by running things yourself.

How

  • Runtime: llama.cpp with GGUF quantized weights.
  • Front-end: Open WebUI for chat-style sanity checks; llama-cli run directly for benchmarking.
  • What I measure: tokens per second (prompt eval + generation), context window behavior as it fills up, memory footprint, and — separately — correctness on a small private instruction-following set.
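The throughput part of that list reduces to simple arithmetic once the token counts and wall-clock durations of a run are known. A minimal sketch, assuming those four values have already been collected from a run by whatever means; the `RunTiming` type, its field names, and the example numbers are all hypothetical and not taken from the actual benchmark harness:

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Raw counts and durations for one benchmark run (hypothetical structure)."""
    prompt_tokens: int
    prompt_seconds: float
    generated_tokens: int
    generation_seconds: float


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput for one phase; rejects non-positive durations."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def summarize(run: RunTiming) -> dict:
    """Report prompt eval and generation separately, plus an end-to-end figure."""
    total_seconds = run.prompt_seconds + run.generation_seconds
    return {
        "prompt_tps": tokens_per_second(run.prompt_tokens, run.prompt_seconds),
        "generation_tps": tokens_per_second(run.generated_tokens, run.generation_seconds),
        # What a user actually waits for: generated tokens over total time.
        "end_to_end_tps": tokens_per_second(run.generated_tokens, total_seconds),
    }


# Illustrative numbers only, not measured results.
example = RunTiming(prompt_tokens=512, prompt_seconds=2.0,
                    generated_tokens=256, generation_seconds=8.0)
for name, value in summarize(example).items():
    print(f"{name}: {value:.1f}")
```

Keeping prompt eval and generation apart matters for RAG-style workloads, where prompts are long and the prompt phase can dominate the wait even when generation throughput looks healthy.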

Findings so far

Qwen 4B: fast, but brittle

~50 TPS on my setup. Fine for summarization, classification, and short-form generation. Falls over on multi-step instructions and anything that requires holding structure across a longer response — it drifts or drops constraints. Useful as a component, not as a general model.

8B models: ran into a hardware wall

On 16GB RAM, the default quant for 8B-class models pushes the machine into heavy swap: I saw ~5GB swapped out during longer-context generation, and the context ran out when I tried to stretch the window. The machine doesn’t crash; it slows to a crawl, and output quality looked less consistent from run to run as memory pressure rose.
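A back-of-the-envelope estimate helps show why an 8B model gets tight on 16GB. The sketch below uses the standard approximations: weight memory is parameters times bits per weight, and KV cache memory is two tensors (K and V) per layer, scaled by context length. The model shape in the example (32 layers, 8 KV heads, head dimension 128) is an assumption matching a Llama-3-8B-style architecture, the 16-bit cache element size is assumed, and the effective bits per weight of any given quant is left as an input because it varies by quant type and model.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed shape for an 8B-class model; check the real model card before trusting this.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
print(f"KV cache at 8192 context: {kv / GIB:.2f} GiB")  # 1.00 GiB

# Unquantized 16-bit weights for 8e9 parameters, before any runtime overhead.
fp16 = weight_bytes(8e9, 16)
print(f"16-bit weights: {fp16 / GIB:.2f} GiB")
```

The estimate ignores compute buffers, the OS, and anything else resident in memory, so it is a lower bound on what the process actually needs.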

Takeaway

On 16GB, 4B is the speed king for simple tasks. 8B is usable but only if you drop to a more aggressive quant — Q4_K_M is the target I’m testing next — and keep the context reasonable. The “just run an 8B locally” recommendation you see online assumes more headroom than 16GB provides once you account for the OS, the embedding model (if RAG), and the browser you forgot to close.

What’s next

  • Full sweep: Qwen 4B and 8B, Llama 8B, at Q4_K_M / Q5_K_M / Q6_K, measuring TPS and accuracy on a small correctness set.
  • Publish the numbers as a notes post — including the runs where the “better” quant wasn’t worth it.
  • Wire the best-performing local model into aegis-rag as the default generation backend.
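The sweep and the final selection can be sketched as a small grid plus a filter. This is only an illustration of the plan, not the actual harness: the model labels, the result fields, and the rule "fastest run that clears an accuracy floor" are all assumptions about how "best-performing" might be defined.

```python
from itertools import product

# Hypothetical labels for the models in the sweep; the quant names are from the plan above.
MODELS = ["qwen-4b", "qwen-8b", "llama-8b"]
QUANTS = ["Q4_K_M", "Q5_K_M", "Q6_K"]


def sweep_plan(models, quants):
    """One empty result record per model/quant combination."""
    return [
        {"model": m, "quant": q, "tps": None, "accuracy": None}
        for m, q in product(models, quants)
    ]


def pick_best(results, min_accuracy):
    """Fastest completed run whose accuracy clears the floor, or None if none does."""
    eligible = [
        r for r in results
        if r["tps"] is not None
        and r["accuracy"] is not None
        and r["accuracy"] >= min_accuracy
    ]
    return max(eligible, key=lambda r: r["tps"]) if eligible else None


plan = sweep_plan(MODELS, QUANTS)
print(len(plan))  # 9 combinations to run
```

Filtering on accuracy first and ranking on speed second keeps a fast but brittle configuration from winning by default.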

Limitations

  • n=1 hardware. All numbers are 16GB-specific and don’t generalize to a 3090 or to Apple Silicon with unified memory.
  • Correctness set is small and hand-written — not a standard eval. Good for catching obvious regressions, not for publishing leaderboard claims.