Open source · MIT · pip install llm-autotune

Your local AI, actually fast.

autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: 300+ MB freed per request, first word up to 53% faster, and your computer stays responsive. No config changes. Your code stays exactly the same.

bash
pip install llm-autotune
autotune chat --model qwen3:8b
  • 381 MB RAM freed per request (qwen3:8b — back to your browser)
  • 53% faster first word (qwen3:8b; 39% avg across 3 models)
  • 67% less KV cache (Ollama reserves 3× less memory)
  • 0 swap events (across all 45 benchmark runs)

How it works

Every request, sized exactly right.

autotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.

1. Precise KV cache allocation — every single request

Every time Ollama runs your prompt, it first allocates a block of RAM called the KV cache, which stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that's roughly 12× more RAM than the request actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. That freed RAM goes back to your browser, your apps, your system.

Formula autotune uses:
ctx = input_tokens + max_reply + 256 (buffer) → rounded to nearest bucket

Raw Ollama (qwen3:8b): 576 MB always allocated, every request, for 4,096 tokens
autotune (qwen3:8b): 195 MB, returning 381 MB to your system per typical chat request

Buckets (512, 768, 1024, 1536, 2048…) prevent Ollama from reallocating the Metal buffer on every call — requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.
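
For illustration, here's a minimal Python sketch of that sizing rule, rounding up to the bucket ladder listed above. The names and any bucket values past 2048 are assumptions for the example, not autotune's internals.

python
# Minimal sketch of the per-request sizing rule described above.
# Bucket values past 2048 are assumed; function and constant names are illustrative.
BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096]

def right_size_ctx(input_tokens: int, max_reply: int, buffer: int = 256) -> int:
    """Return the smallest bucket that fits the prompt, the reply cap, and headroom."""
    needed = input_tokens + max_reply + buffer
    return next((b for b in BUCKETS if b >= needed), BUCKETS[-1])

# A ~70-token message with a 512-token reply cap fits in the 1,024-token bucket
print(right_size_ctx(input_tokens=70, max_reply=512))  # -> 1024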

2. Live pressure management — proactive RAM tier system

Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads the OS's RAM utilization percentage before every single request and applies two independent levers — context window size and KV precision — across four fixed tiers, maintaining headroom well before any swap risk develops.

Live RAM thresholds — checked before every request:
  • RAM < 80% · full context window · KV at profile default (F16 or Q8)
  • RAM 80–88% · context trimmed −10% · KV precision unchanged
  • RAM 88–93% · context −25% · KV switches F16 → Q8 (halves KV memory)
  • RAM > 93% · context halved · KV forced Q8 · prevents disk swap

KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half instantly — with no meaningful quality impact. Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically — you see a brief note in the chat UI when one fires. This is a heuristic tier system based on RAM percentage. autotune also runs a separate exact-math pre-flight check (NoSwapGuard) that computes precise KV bytes using your model's architecture — that system only fires when swap is mathematically certain.
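
A rough sketch of that tier logic follows. It is illustrative only: psutil as the source of the RAM reading and the f16/q8_0 labels are assumptions for the example, not autotune's code.

python
# Illustrative sketch of the four pressure tiers above; not autotune's actual implementation.
import psutil  # assumed here as the source of the live RAM utilization percentage

def pressure_adjust(ctx: int, kv_default: str = "f16") -> tuple[int, str]:
    """Scale the context window and choose KV precision from live RAM pressure."""
    ram = psutil.virtual_memory().percent
    if ram < 80:
        return ctx, kv_default              # full window, profile default precision
    if ram < 88:
        return int(ctx * 0.90), kv_default  # trim 10%, precision unchanged
    if ram < 93:
        return int(ctx * 0.75), "q8_0"      # trim 25%, drop KV from F16 to Q8
    return ctx // 2, "q8_0"                 # halve window, force Q8, avoid swap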

3. System prompt prefix caching

In any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they're only ever evaluated once — at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.

Turn 1: system prompt + message evaluated
Turn 2+: system prompt skipped — new tokens only
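
The only thing your code needs to do to benefit is keep the system prompt identical across turns, as in this sketch (using the Python setup shown later on this page):

python
# Sketch: a multi-turn chat where the system prompt stays byte-identical every turn,
# so its tokens can stay pinned in the KV cache after turn 1.
import autotune
from openai import OpenAI

autotune.start()
client = OpenAI(**autotune.client_kwargs())

SYSTEM = "You are a concise coding assistant."  # kept identical on every turn
history = [{"role": "system", "content": SYSTEM}]

for question in ["What does `git rebase -i` do?", "How do I abort it?"]:
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="qwen3:8b", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
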
4. Model keep-alive

Ollama unloads the model after 5 minutes idle — a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.

Raw Ollama: 1–4 s reload after 5 min idle
autotune: stays loaded · instant first token
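
autotune manages this for you. Under the hood, Ollama exposes a documented keep_alive request option that controls the idle unload timer; here's a standalone sketch of the idea, with illustrative values:

python
# Sketch of the keep-alive idea using Ollama's documented `keep_alive` request option.
# autotune handles this automatically; this only shows what the setting controls.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:8b",
        "prompt": "",       # an empty prompt just loads the model into memory
        "keep_alive": -1,   # -1 keeps the weights resident instead of the 5-minute default
    },
    timeout=120,
)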

Benchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. KV savings scale with model size — larger models free more RAM in absolute terms. Generation speed and output quality are unchanged: autotune touches only buffer sizes, precision, and scheduling — never model weights or sampling.

All 14 optimizations explained →

Measured results

Real numbers. Real hardware.

Benchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers — not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with autotune proof.

Model          KV before   KV after   RAM freed   First word
qwen3:8b       576 MB      195 MB     381 MB      −53%
llama3.2:3b    448 MB      155 MB     293 MB      −35%
gemma4:e2b     96 MB       30 MB      66 MB       −29%

TTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply every single request regardless of hardware.

300+ MB freed

On every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt — returning hundreds of MB to your system on every single call, automatically.

Up to 53% faster

The KV buffer must be initialized before token 1. A smaller buffer initializes faster. On qwen3:8b, autotune cuts first-word time from the raw baseline by 53% — every new session, every cold request.

Zero trade-offs

autotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. prompt_eval_count is unchanged — no tokens are dropped or skipped.

Multi-turn & agentic workloads

Where it matters most.

Single-prompt benchmarks miss the real problem: context accumulates. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1 — and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session.

autotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads. And because the system prompt is pinned via prefix caching, TTFT actually falls as the session grows — not climbs.

Code debugger task — 10 turns, llama3.2:3b

Metric                   Raw Ollama      autotune
Session wall time        74 s            40 s
Model reloads            0.5             0.5
TTFT trend per turn      −101 ms/turn    −435 ms/turn
Swap events              0               0
Context at session end   3,043 tokens    1,946 tokens
TTFT falls as the session grows

The system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens — not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.

Session window sized for the task

autotune computes a KV window for the full session ceiling before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces a model reload (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads.
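
One way to size such a single window up front is sketched below; it is illustrative only (the helper name and the bucket ladder past 2048 are assumptions), not autotune's internals.

python
# Illustrative only: sizing one KV window for the whole session so the buffer
# never grows, and the model never reloads, mid-task.
BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192]  # ladder past 2048 assumed

def session_ceiling_ctx(system_tokens: int, expected_turns: int,
                        tokens_per_turn: int, max_reply: int, buffer: int = 256) -> int:
    needed = system_tokens + expected_turns * (tokens_per_turn + max_reply) + buffer
    return next((b for b in BUCKETS if b >= needed), BUCKETS[-1])

# A 10-turn task with short tool outputs and replies fits a 4,096-token window
print(session_ceiling_ctx(300, 10, 100, 200))  # -> 4096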

Honest caveat: Turn 1 is ~80% slower — autotune pre-allocates a larger KV window for the whole session. From turn 2 onward prefix-cache savings compound and wall time comes out 46% lower. For single-turn usage, the per-request benchmark numbers apply.

Benchmark: code_debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b balanced profile. Timings from Ollama's internal Go nanosecond timers. Full methodology in AGENT_BENCHMARK.md.

Verify it yourself

Don't trust the numbers. Run the proof.

autotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers — nothing estimated, nothing made up.

  • KV cache size: raw Ollama vs autotune — exact MB
  • Time to first word: two conditions from the same neutral state
  • Saves a JSON file you can inspect or share
  • Generation speed reported honestly (usually unchanged)

Works with any model you have installed in Ollama. Picks the smallest installed model automatically if you don't specify one.

bash
autotune proof -m qwen3:8b
# Runs in ~30 seconds. Uses Ollama's own timers.
# Saves a proof_qwen3_8b.json you can share.
Tip: Run autotune proof --list-models to see which Ollama models are available on your machine.

Quickstart

Up in 60 seconds

No Ollama commands needed — autotune handles everything for you.

  1. Install autotune
     One pip install. Ollama is started automatically — no separate setup.
     pip install llm-autotune
  2. Find the best model for your hardware
     Scans your CPU, RAM, and GPU. Recommends the optimal model and settings — no guessing.
     autotune recommend
  3. Download the recommended model
     autotune pulls the model and starts Ollama if it isn't already running.
     autotune pull qwen3:8b
  4. Start chatting with optimization
     Every request is automatically right-sized. No flags, no config.
     autotune chat --model qwen3:8b
  5. Prove it on your own hardware
     30-second benchmark using Ollama's own nanosecond timers. Saves a JSON you can share.
     autotune proof -m qwen3:8b
Apple Silicon (M1/M2/M3/M4)
Native Metal GPU kernels via MLX — 10–40% faster generation throughput.
pip install "llm-autotune[mlx]"
Use from Python — drop-in for any OpenAI client
python
import autotune
from openai import OpenAI

autotune.start()  # start the optimizing proxy

client = OpenAI(**autotune.client_kwargs())

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Every optimization is automatic.
Or run as an API server
bash
autotune serve
# → http://localhost:8765/v1
# Any OpenAI client works automatically.
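
For example, any OpenAI SDK can point straight at that endpoint (the api_key value below is a placeholder; a local server typically ignores it):

python
# Point any OpenAI SDK at the server started by `autotune serve`.
# The api_key is a placeholder; assumed the local endpoint doesn't validate it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Summarize this repo in one line."}],
)
print(resp.choices[0].message.content)
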
Profiles
  • fast: 2k ctx · Q8 KV · quick lookups & completions
  • balanced: 8k ctx · F16 KV · general chat (default)
  • quality: 32k ctx · F16 KV · long-form writing & analysis

Docker

Ollama + autotune, bundled.

The Docker image bundles Ollama and autotune in a single container. No local install needed — just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.

bash
# Build once
docker build -t autotune .

# Run — autotune on :8765, models cached in a volume
docker run -p 8765:8765 \
  -v ollama_models:/root/.ollama \
  -e OLLAMA_MODEL=qwen3:8b \
  autotune

OLLAMA_MODEL auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.

docker-compose — two options
--profile single: Ollama + autotune in one container. Simplest setup.
--profile multi: Separate services. Lighter autotune image (~200 MB). Set AUTOTUNE_OLLAMA_URL=http://ollama:11434.
Environment variables
  • OLLAMA_MODEL=qwen3:8b · auto-pull on first boot
  • AUTOTUNE_PORT=8765 · autotune bind port
  • AUTOTUNE_OLLAMA_URL=http://ollama:11434 · remote/multi-container Ollama
GPU support: Built on ollama/ollama:latest — includes CUDA and ROCm layers. Add --gpus all for NVIDIA, or mount /dev/kfd for AMD.

What autotune does

Every optimization, automatic

Dynamic KV sizing
Computes the exact context window each request needs. Typical chat message: 4,096 → 1,400 tokens. Frees 60–70% of the KV buffer back to your system.

🔒 System prompt caching
Pins your system prompt in Ollama's KV so it's never re-evaluated on follow-up messages. Pure latency win with no quality cost.

🧠 Adaptive KV precision
Switches from F16 to Q8 KV under memory pressure — halves KV memory at three automatic thresholds with no quality impact.

♾️ Keep-alive management
Holds the model in memory between messages. Eliminates the 1–3s cold-reload cost you'd otherwise pay every time a session goes idle.

🔌 OpenAI-compatible API
Drop-in server at localhost:8765/v1. Works with any OpenAI SDK, LangChain, LlamaIndex, or agent framework without code changes.

🍎 MLX backend (Apple Silicon)
Routes inference to MLX-LM on M-series Macs for native Metal GPU kernels — 10–40% faster generation throughput over Ollama.

🐳 Docker — Ollama bundled
Single container with Ollama + autotune. Mount a volume for models, set OLLAMA_MODEL to auto-pull on first boot, and your OpenAI-compatible API is ready on :8765.

What to run

Best models for your hardware

autotune works with any Ollama model. These are the best options as of April 2026. Run autotune recommend to get a hardware-specific recommendation.

RAM         Model               Size
8 GB        qwen3:4b            ~2.6 GB
16 GB       qwen3:8b            ~5.2 GB
16 GB       gemma4:e2b          ~5.8 GB
24 GB       qwen3:14b           ~9.0 GB
32 GB       qwen3:30b-a3b       ~17 GB
Coding      qwen2.5-coder:14b   ~9.0 GB
Reasoning   deepseek-r1:14b     ~9.0 GB

Try it in 60 seconds.

Open source, MIT licensed. Works with whatever Ollama models you already have. The autotune proof command will show you the exact improvement on your own hardware.

bash
pip install llm-autotune