Open source · MIT · pip install llm-autotune

Your local AI, actually fast.

autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: 300+ MB freed per request, first word up to 53% faster, and your computer stays responsive. No config changes. Your code stays exactly the same.

bash
pip install llm-autotune
autotune chat --model qwen3:8b
  • 381 MB RAM freed per request (qwen3:8b — back to your browser)
  • 53% faster first word (qwen3:8b; 39% avg across 3 models)
  • 67% less KV cache (Ollama reserves 3× less memory)
  • 0 swap events (across all 45 benchmark runs)

How it works

Every request, sized exactly right.

autotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.

1. Precise KV cache allocation — every single request

Every time Ollama runs your prompt, it first allocates a block of RAM called the KV cache, which stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that's roughly 12× more RAM than the request actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. That freed RAM goes back to your browser, your apps, your system.

Formula autotune uses:
ctx = input_tokens + max_reply + 256 (buffer) → rounded to nearest bucket

Raw Ollama (qwen3:8b): 576 MB always allocated, every request, for 4,096 tokens
autotune (qwen3:8b): 195 MB, returning 381 MB to your system per typical chat request

Buckets (512, 768, 1024, 1536, 2048…) prevent Ollama from reallocating the Metal buffer on every call — requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.
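
For illustration, here's a minimal Python sketch of that sizing rule, rounding up to the bucket ladder listed above. The names and any bucket values past 2048 are assumptions for the example, not autotune's internals.

python
# Minimal sketch of the per-request sizing rule described above.
# Bucket values past 2048 are assumed; function and constant names are illustrative.
BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096]

def right_size_ctx(input_tokens: int, max_reply: int, buffer: int = 256) -> int:
    """Return the smallest bucket that fits the prompt, the reply cap, and headroom."""
    needed = input_tokens + max_reply + buffer
    return next((b for b in BUCKETS if b >= needed), BUCKETS[-1])

# A ~70-token message with a 512-token reply cap fits in the 1,024-token bucket
print(right_size_ctx(input_tokens=70, max_reply=512))  # -> 1024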

2. Live pressure management — proactive RAM tier system

Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads the OS's RAM utilization percentage before every single request and applies two independent levers — context window size and KV precision — across four fixed tiers, maintaining headroom well before any swap risk develops.

Live RAM thresholds — checked before every request:
  • RAM < 80% · full context window · KV at profile default (F16 or Q8)
  • RAM 80–88% · context trimmed −10% · KV precision unchanged
  • RAM 88–93% · context −25% · KV switches F16 → Q8 (halves KV memory)
  • RAM > 93% · context halved · KV forced Q8 · prevents disk swap

KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half instantly — with no meaningful quality impact. Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically — you see a brief note in the chat UI when one fires. This is a heuristic tier system based on RAM percentage. autotune also runs a separate exact-math pre-flight check (NoSwapGuard) that computes precise KV bytes using your model's architecture — that system only fires when swap is mathematically certain.
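
A rough sketch of that tier logic follows. It is illustrative only: psutil as the source of the RAM reading and the f16/q8_0 labels are assumptions for the example, not autotune's code.

python
# Illustrative sketch of the four pressure tiers above; not autotune's actual implementation.
import psutil  # assumed here as the source of the live RAM utilization percentage

def pressure_adjust(ctx: int, kv_default: str = "f16") -> tuple[int, str]:
    """Scale the context window and choose KV precision from live RAM pressure."""
    ram = psutil.virtual_memory().percent
    if ram < 80:
        return ctx, kv_default              # full window, profile default precision
    if ram < 88:
        return int(ctx * 0.90), kv_default  # trim 10%, precision unchanged
    if ram < 93:
        return int(ctx * 0.75), "q8_0"      # trim 25%, drop KV from F16 to Q8
    return ctx // 2, "q8_0"                 # halve window, force Q8, avoid swap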

3. System prompt prefix caching

In any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they're only ever evaluated once — at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.

Turn 1: system prompt + message evaluated
Turn 2+: system prompt skipped — new tokens only
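
The only thing your code needs to do to benefit is keep the system prompt identical across turns, as in this sketch (using the Python setup shown later on this page):

python
# Sketch: a multi-turn chat where the system prompt stays byte-identical every turn,
# so its tokens can stay pinned in the KV cache after turn 1.
import autotune
from openai import OpenAI

autotune.start()
client = OpenAI(**autotune.client_kwargs())

SYSTEM = "You are a concise coding assistant."  # kept identical on every turn
history = [{"role": "system", "content": SYSTEM}]

for question in ["What does `git rebase -i` do?", "How do I abort it?"]:
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="qwen3:8b", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
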
4. Model keep-alive

Ollama unloads the model after 5 minutes idle — a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.

Raw Ollama: 1–4 s reload after 5 min idle
autotune: stays loaded · instant first token
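
autotune manages this for you. Under the hood, Ollama exposes a documented keep_alive request option that controls the idle unload timer; here's a standalone sketch of the idea, with illustrative values:

python
# Sketch of the keep-alive idea using Ollama's documented `keep_alive` request option.
# autotune handles this automatically; this only shows what the setting controls.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:8b",
        "prompt": "",       # an empty prompt just loads the model into memory
        "keep_alive": -1,   # -1 keeps the weights resident instead of the 5-minute default
    },
    timeout=120,
)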

Benchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. KV savings scale with model size — larger models free more RAM in absolute terms. Generation speed and output quality are unchanged: autotune touches only buffer sizes, precision, and scheduling — never model weights or sampling.

All 14 optimizations explained →

Measured results

Real numbers. Real hardware.

Benchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers — not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with autotune proof.

Model          KV before   KV after   RAM freed   First word
qwen3:8b       576 MB      195 MB     381 MB      −53%
llama3.2:3b    448 MB      155 MB     293 MB      −35%
gemma4:e2b     96 MB       30 MB      66 MB       −29%

TTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply every single request regardless of hardware.

300+ MB freed

On every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt — returning hundreds of MB to your system on every single call, automatically.

Up to 53% faster

The KV buffer must be initialized before token 1. A smaller buffer initializes faster. On qwen3:8b, autotune cuts first-word time from the raw baseline by 53% — every new session, every cold request.

Zero trade-offs

autotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. prompt_eval_count is unchanged — no tokens are dropped or skipped.

Multi-turn & agentic workloads

Where it matters most.

Single-prompt benchmarks miss the real problem: context accumulates. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1 — and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session.

autotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads. And because the system prompt is pinned via prefix caching, TTFT actually falls as the session grows — not climbs.

Code debugger task — 10 turns, llama3.2:3b

Metric                   Raw Ollama      autotune
Session wall time        74 s            40 s
Model reloads            0.5             0.5
TTFT trend per turn      −101 ms/turn    −435 ms/turn
Swap events              0               0
Context at session end   3,043 tokens    1,946 tokens
TTFT falls as the session grows

The system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens — not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.

Session window sized for the task

autotune computes a KV window for the full session ceiling before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces a model reload (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads.
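
One way to size such a single window up front is sketched below; it is illustrative only (the helper name and the bucket ladder past 2048 are assumptions), not autotune's internals.

python
# Illustrative only: sizing one KV window for the whole session so the buffer
# never grows, and the model never reloads, mid-task.
BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192]  # ladder past 2048 assumed

def session_ceiling_ctx(system_tokens: int, expected_turns: int,
                        tokens_per_turn: int, max_reply: int, buffer: int = 256) -> int:
    needed = system_tokens + expected_turns * (tokens_per_turn + max_reply) + buffer
    return next((b for b in BUCKETS if b >= needed), BUCKETS[-1])

# A 10-turn task with short tool outputs and replies fits a 4,096-token window
print(session_ceiling_ctx(300, 10, 100, 200))  # -> 4096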

Honest caveat: Turn 1 is ~80% slower — autotune pre-allocates a larger KV window for the whole session. From turn 2 onward prefix-cache savings compound and wall time comes out 46% lower. For single-turn usage, the per-request benchmark numbers apply.

Benchmark: code_debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b balanced profile. Timings from Ollama's internal Go nanosecond timers. Full methodology in AGENT_BENCHMARK.md.

Verify it yourself

Don't trust the numbers. Run the proof.

autotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers — nothing estimated, nothing made up.

  • KV cache size: raw Ollama vs autotune — exact MB
  • Time to first word: two conditions from the same neutral state
  • Saves a JSON file you can inspect or share
  • Generation speed reported honestly (usually unchanged)

Works with any model you have installed in Ollama. Picks the smallest installed model automatically if you don't specify one.

bash
autotune proof -m qwen3:8b
# Runs in ~30 seconds. Uses Ollama's own timers.
# Saves a proof_qwen3_8b.json you can share.
Tip: Run autotune proof --list-models to see which Ollama models are available on your machine.

Quickstart

Up in 60 seconds

No Ollama commands needed — autotune handles everything for you.

  1. Install autotune
     One pip install. Ollama is started automatically — no separate setup.
     pip install llm-autotune
  2. Find the best model for your hardware
     Scans your CPU, RAM, and GPU. Recommends the optimal model and settings — no guessing.
     autotune recommend
  3. Download the recommended model
     autotune pulls the model and starts Ollama if it isn't already running.
     autotune pull qwen3:8b
  4. Start chatting with optimization
     Every request is automatically right-sized. No flags, no config.
     autotune chat --model qwen3:8b
  5. Prove it on your own hardware
     30-second benchmark using Ollama's own nanosecond timers. Saves a JSON you can share.
     autotune proof -m qwen3:8b
Apple Silicon (M1/M2/M3/M4)
Native Metal GPU kernels via MLX — 10–40% faster generation throughput.
pip install "llm-autotune[mlx]"
Use from Python — drop-in for any OpenAI client
python
import autotune
from openai import OpenAI

autotune.start()  # start the optimizing proxy

client = OpenAI(**autotune.client_kwargs())

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Every optimization is automatic.
Or run as an API server
bash
autotune serve
# → http://localhost:8765/v1
# Any OpenAI client works automatically.
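
For example, any OpenAI SDK can point straight at that endpoint (the api_key value below is a placeholder; a local server typically ignores it):

python
# Point any OpenAI SDK at the server started by `autotune serve`.
# The api_key is a placeholder; assumed the local endpoint doesn't validate it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Summarize this repo in one line."}],
)
print(resp.choices[0].message.content)
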
Profiles
  • fast: 2k ctx · Q8 KV · quick lookups & completions
  • balanced: 8k ctx · F16 KV · general chat (default)
  • quality: 32k ctx · F16 KV · long-form writing & analysis

Docker

Ollama + autotune, bundled.

The Docker image bundles Ollama and autotune in a single container. No local install needed — just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.

bash
# Build once
docker build -t autotune .

# Run — autotune on :8765, models cached in a volume
docker run -p 8765:8765 \
  -v ollama_models:/root/.ollama \
  -e OLLAMA_MODEL=qwen3:8b \
  autotune

OLLAMA_MODEL auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.

docker-compose — two options
--profile single: Ollama + autotune in one container. Simplest setup.
--profile multi: Separate services. Lighter autotune image (~200 MB). Set AUTOTUNE_OLLAMA_URL=http://ollama:11434.
Environment variables
  • OLLAMA_MODEL=qwen3:8b · auto-pull on first boot
  • AUTOTUNE_PORT=8765 · autotune bind port
  • AUTOTUNE_OLLAMA_URL=http://ollama:11434 · remote/multi-container Ollama
GPU support: Built on ollama/ollama:latest — includes CUDA and ROCm layers. Add --gpus all for NVIDIA, or mount /dev/kfd for AMD.

What autotune does

Every optimization, automatic

Dynamic KV sizing
Computes the exact context window each request needs. Typical chat message: 4,096 → 1,400 tokens. Frees 60–70% of the KV buffer back to your system.

🔒 System prompt caching
Pins your system prompt in Ollama's KV so it's never re-evaluated on follow-up messages. Pure latency win with no quality cost.

🧠 Adaptive KV precision
Switches from F16 to Q8 KV under memory pressure — halves KV memory at three automatic thresholds with no quality impact.

♾️ Keep-alive management
Holds the model in memory between messages. Eliminates the 1–3s cold-reload cost you'd otherwise pay every time a session goes idle.

🔌 OpenAI-compatible API
Drop-in server at localhost:8765/v1. Works with any OpenAI SDK, LangChain, LlamaIndex, or agent framework without code changes.

🍎 MLX backend (Apple Silicon)
Routes inference to MLX-LM on M-series Macs for native Metal GPU kernels — 10–40% faster generation throughput over Ollama.

🐳 Docker — Ollama bundled
Single container with Ollama + autotune. Mount a volume for models, set OLLAMA_MODEL to auto-pull on first boot, and your OpenAI-compatible API is ready on :8765.

What to run

Best models for your hardware

autotune works with any Ollama model. These are the best options as of April 2026. Run autotune recommend to get a hardware-specific recommendation.

RAM         Model               Size
8 GB        qwen3:4b            ~2.6 GB
16 GB       qwen3:8b            ~5.2 GB
16 GB       gemma4:e2b          ~5.8 GB
24 GB       qwen3:14b           ~9.0 GB
32 GB       qwen3:30b-a3b       ~17 GB
Coding      qwen2.5-coder:14b   ~9.0 GB
Reasoning   deepseek-r1:14b     ~9.0 GB

Try it in 60 seconds.

Open source, MIT licensed. Works with whatever Ollama models you already have. The autotune proof command will show you the exact improvement on your own hardware.

bash
pip install llm-autotune