Privacy, cost, and control. Cloud LLM APIs are convenient, but they send your data to a third party, charge per token, and leave you dependent on an external service.
Local inference runs models on your hardware. Zero API cost, complete privacy, works offline.
Simplest path from zero to working LLM. No Python, no virtual environments, no CUDA drivers (unless you want GPU acceleration). Download binary, pull model, run.
An alternative (LocalAI) exists, but Ollama's UX is unmatched for getting started.
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download: https://ollama.com/download
```
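To confirm the install worked, a quick version check is enough (the exact output varies by release):

```bash
# Prints the installed Ollama version
ollama --version
```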
```bash
# Recommended: Fast, general-purpose (7B params, 4GB)
ollama pull mistral

# Alternative: Smaller, faster (3B params, 2GB)
ollama pull llama3.2:3b

# For code: Optimized for technical content (7B, 4.5GB)
ollama pull qwen2.5-coder:7b
```
Why these models? Balance of size/speed/quality. Smaller models (3B) are fast on CPU. Larger models (7B) give better results but need more RAM.
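Before committing RAM to a model, you can check what is already downloaded and how large each one is; on recent Ollama releases, `ollama show` also prints parameter count, quantization, and context length:

```bash
# List downloaded models and their on-disk size
ollama list

# Inspect a model's details (parameters, quantization, context length)
ollama show mistral
```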
Start the Ollama server:

```bash
ollama serve
```
Runs on http://localhost:11434 by default.
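If that port is already taken, or you want other machines to reach the server, Ollama honors the OLLAMA_HOST environment variable; remember to point base_url (configured below) at whatever address you choose:

```bash
# Example: listen on all interfaces on a non-default port
OLLAMA_HOST=0.0.0.0:11500 ollama serve
```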
Edit ~/.qntx/am.toml (or project am.toml):
```toml
[local_inference]
enabled = true
base_url = "http://localhost:11434"
model = "mistral"
timeout_seconds = 120
```
Done. All QNTX LLM operations now use local inference.
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Test inference
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
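Because the endpoint is OpenAI-compatible, the reply lands in choices[0].message.content; with jq installed you can strip the envelope and see only the text:

```bash
# Same request, printing only the assistant's reply (requires jq)
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'
```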
CPU inference works but is slow (5-10 tokens/sec). Fine for development, frustrating for production.
GPU acceleration makes local inference viable for everyday use: Ollama automatically uses the GPU if one is available, with no extra configuration.
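To confirm the GPU is actually being used, check where the loaded model is running (recent Ollama releases report this):

```bash
# The PROCESSOR column shows e.g. "100% GPU" or "100% CPU"
ollama ps
```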
| Model | Size | CPU Speed | GPU Speed | Quality | Best For |
|---|---|---|---|---|---|
| llama3.2:3b | 2GB | Very Slow | Fast | Good | Testing, quick tasks |
| mistral | 4GB | Slow | Very Fast | Excellent | General purpose |
| qwen2.5-coder:7b | 4.5GB | Slow | Very Fast | Excellent | Code/technical |
Why not 13B or 70B models? Possible but require 8GB+ VRAM and are slower. Diminishing returns for most QNTX operations.
Local inference has zero API cost but uses local resources (GPU/CPU time, electricity).
The QNTX budget system currently tracks cloud API spend only. Future versions may also track GPU time:
```toml
[pulse]
# Current: Only tracks cloud API spend
daily_budget_usd = 5.00

# Future: Track GPU resource usage
# max_gpu_minutes_per_day = 30.0
```
See pulse/budget/ package TODOs for GPU resource tracking plans.
Why switch? Local for bulk operations (save money), cloud for occasional high-quality needs (GPT-4, Claude).
```bash
# Enable local inference
qntx config set local_inference.enabled true

# Disable (use cloud APIs)
qntx config set local_inference.enabled false
```
Configuration reloads automatically. No restart required.
Common issues and fixes:

- Cause: Ollama server not running. Fix: `ollama serve`
- Cause: Model not downloaded. Fix: `ollama pull mistral`
- Cause: CPU inference is inherently slow. Options: switch to a smaller model (`ollama pull llama3.2:3b`) or raise the timeout (`timeout_seconds = 300`).
- Cause: GPU not being used. Check with `nvidia-smi`. Fix: install the CUDA toolkit (Ollama will use it automatically). Apple Silicon works automatically, no setup needed.
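If none of these match your symptom, the server log usually names the real problem; the paths below assume the standard install script on Linux (systemd service) and the default location on macOS:

```bash
# Linux (systemd service created by the install script)
journalctl -u ollama -f

# macOS
tail -f ~/.ollama/logs/server.log
```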
Why custom models? Specialized system prompts, custom parameters, fine-tuned weights.
Create a Modelfile:

```
FROM mistral
SYSTEM You are a code review assistant. Focus on security and correctness.
PARAMETER temperature 0.7
PARAMETER num_predict 2048
```
Build and use:
```bash
ollama create qntx-reviewer -f Modelfile
```
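Before pointing QNTX at it, a one-off prompt is a quick sanity check that the system prompt and parameters took effect (the prompt here is just an example):

```bash
# Ask the custom model a single question; Ollama exits after responding
ollama run qntx-reviewer "Review this function for security issues: func add(a, b int) int { return a + b }"
```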
Configure in am.toml:
```toml
[local_inference]
model = "qntx-reviewer"
```
Use local if:
- You need privacy or offline operation
- You run bulk or high-volume operations where per-token API costs add up
- A 3B-7B model is good enough for the task

Use cloud if:
- You occasionally need the highest-quality output (GPT-4, Claude)
- Your hardware lacks the RAM or GPU for a capable local model
Why consider LocalAI? Written in Go (like QNTX), supports many model formats, self-contained binary.
Why Ollama instead? Better UX, faster iteration, larger community, simpler setup.
LocalAI is viable if you need specific model formats Ollama doesn't support.
```bash
docker run -p 8080:8080 localai/localai:latest
```
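LocalAI also speaks the OpenAI API, so the same kind of smoke test applies; which models appear depends on what the container has loaded:

```bash
# List the models LocalAI is serving
curl http://localhost:8080/v1/models
```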
Configure:
```toml
[local_inference]
enabled = true
base_url = "http://localhost:8080"
model = "your-model-name"
```