QNTX supports multiple LLM providers that can be enabled/disabled independently:
- OpenRouter (cloud provider, openrouter.* settings)
- Local inference (Ollama/LocalAI, local_inference.* settings)
When local_inference.enabled = true, QNTX uses your local Ollama/LocalAI server. When false, it falls back to OpenRouter or other configured providers.
This guide covers setting up and enabling local inference providers.
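The selection rule above can be sketched in a few lines. This is an illustration only, not QNTX's actual code: `choose_provider` and the dict layout are hypothetical, and the "other configured providers" fallback is collapsed into OpenRouter for brevity.

```python
def choose_provider(config: dict) -> str:
    """Return the provider implied by a parsed am.toml-style dict.

    Hypothetical helper: local inference wins when
    local_inference.enabled is true, otherwise OpenRouter is used.
    """
    local = config.get("local_inference", {})
    # `enabled` is documented as off by default.
    if local.get("enabled", False):
        return "local_inference"
    return "openrouter"


print(choose_provider({"local_inference": {"enabled": True}}))
print(choose_provider({}))
```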
Privacy, cost, and control. Cloud LLM APIs are convenient, but they bill per request, send your data to a third party, and need a network connection.
Local inference runs models on your hardware. Zero API cost, complete privacy, works offline.
Simplest path from zero to working LLM. No Python, no virtual environments, no CUDA drivers (unless you want GPU acceleration). Download binary, pull model, run.
An alternative (LocalAI) exists, but Ollama's UX is unmatched for getting started.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download: https://ollama.com/download
# Default: small and fast (3B params, 2GB)
ollama pull llama3.2:3b
# Popular alternatives:
ollama pull mistral # General-purpose (7B, 4GB) - better quality
ollama pull qwen2.5-coder:7b # Code-optimized (7B, 4.5GB) - best for code
ollama pull phi3:mini # Microsoft's model (3.8B, 2.3GB) - good balance
ollama pull gemma2:2b # Google's smallest (2B, 1.6GB) - very fast
ollama pull deepseek-coder:6.7b # Code specialist (6.7B, 3.8GB)
Why these models? Balance of size/speed/quality. Smaller models (2-3B) are fast on CPU. Larger models (7B) give better results but need more RAM.
ollama serve
Runs on http://localhost:11434 by default.
Edit ~/.qntx/am.toml (or project am.toml):
[local_inference]
enabled = true # Disabled by default, set to true to use local models
base_url = "http://localhost:11434"
model = "llama3.2:3b" # Default model
timeout_seconds = 360 # 6 minutes timeout for slow inference
context_size = 16384 # Context window (0 = use model default)
Note: Local inference is disabled by default. Set enabled = true to activate.
Done. All QNTX LLM operations now use local inference.
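Before pointing QNTX at the file, it can help to sanity-check the values. The sketch below is a hypothetical validator, not part of QNTX: the field names follow the `[local_inference]` keys above, the defaults copy the example values, and the validation rules are assumptions about what a sensible config looks like.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class LocalInferenceSettings:
    # Values mirror the example config; `enabled` is off unless set.
    enabled: bool = False
    base_url: str = "http://localhost:11434"
    model: str = "llama3.2:3b"
    timeout_seconds: int = 360
    context_size: int = 16384  # 0 = use the model's own default


def validate(settings: LocalInferenceSettings) -> List[str]:
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    parsed = urlparse(settings.base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"base_url is not an http(s) URL: {settings.base_url!r}")
    if not settings.model:
        problems.append("model must not be empty")
    if settings.timeout_seconds <= 0:
        problems.append("timeout_seconds must be positive")
    if settings.context_size < 0:
        problems.append("context_size must be 0 (model default) or positive")
    return problems
```

A forgotten scheme is a typical slip: `validate(LocalInferenceSettings(base_url="localhost:11434"))` reports the URL as invalid.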
# Check Ollama is running
curl http://localhost:11434/api/tags
# Test inference
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
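The same inference test can be run from Python using only the standard library. This is a minimal sketch, assuming the server answers in the OpenAI-style chat completion shape (`choices[0].message.content`); the helper names `build_chat_request`, `extract_reply`, and `ask` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default Ollama address


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST the curl command sends, without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    payload = json.loads(response_body)
    return payload["choices"][0]["message"]["content"]


def ask(model: str, prompt: str, timeout: float = 360.0) -> str:
    """Send the request to a running server and return the reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())
```

With `ollama serve` running and the model pulled, `ask("llama3.2:3b", "Hello!")` should return the model's reply as a string.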
CPU inference works but is slow (5-10 tokens/sec). Fine for development, frustrating for production.
GPU acceleration makes local inference viable. Ollama uses a GPU automatically if one is available. No configuration needed.
| Model | Size | CPU Speed | GPU Speed | Quality | Best For |
|---|---|---|---|---|---|
| gemma2:2b | 1.6GB | Slow | Fast | Fair | Minimal resources |
| llama3.2:3b | 2GB | Slow | Fast | Good | Default, balanced |
| phi3:mini | 2.3GB | Slow | Fast | Good | General purpose |
| mistral | 4GB | Very Slow | Very Fast | Excellent | Quality output |
| qwen2.5-coder:7b | 4.5GB | Very Slow | Very Fast | Excellent | Code/technical |
Why not 13B or 70B models? They work, but they need 8GB+ of VRAM and run slower. Diminishing returns for most QNTX operations.
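One rough way to read the table is "take the largest model that fits". The sketch below encodes that heuristic; `largest_model_within` and the idea of a size budget are hypothetical, and download size is only a loose proxy for the memory a model needs at runtime.

```python
# (name, download size in GB), copied from the table above
MODELS = [
    ("gemma2:2b", 1.6),
    ("llama3.2:3b", 2.0),
    ("phi3:mini", 2.3),
    ("mistral", 4.0),
    ("qwen2.5-coder:7b", 4.5),
]


def largest_model_within(budget_gb: float):
    """Return the largest listed model whose size fits the budget, or None."""
    fitting = [(size, name) for name, size in MODELS if size <= budget_gb]
    if not fitting:
        return None
    # max() compares by size first, so this picks the biggest fitting model.
    return max(fitting)[1]
```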
Quick override without editing config files:
# Switch to local provider (Ollama/LocalAI) instead of OpenRouter
export QNTX_LOCAL_INFERENCE_ENABLED=true
# Point to different Ollama server
export QNTX_LOCAL_INFERENCE_BASE_URL=http://gpu-server:11434
# Use different model
export QNTX_LOCAL_INFERENCE_MODEL=mistral
# Start QNTX with overrides
make dev
Note: Environment variables take precedence over all config files. When QNTX_LOCAL_INFERENCE_ENABLED=true, QNTX uses Ollama/LocalAI instead of OpenRouter (cloud provider).
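The precedence rule can be pictured as a merge where environment values are layered over file values. This is a sketch, not QNTX's loader: `apply_env_overrides` is a hypothetical name, and treating only the literal `true` (any case) as enabled is an assumption.

```python
import os

# Environment variable -> local_inference key, as listed above
ENV_TO_KEY = {
    "QNTX_LOCAL_INFERENCE_ENABLED": "enabled",
    "QNTX_LOCAL_INFERENCE_BASE_URL": "base_url",
    "QNTX_LOCAL_INFERENCE_MODEL": "model",
}


def apply_env_overrides(file_settings: dict, environ=os.environ) -> dict:
    """Return file settings with any set environment variables on top."""
    merged = dict(file_settings)
    for env_name, key in ENV_TO_KEY.items():
        if env_name not in environ:
            continue  # unset variables leave the file value alone
        value = environ[env_name]
        if key == "enabled":
            # Assumption: only "true" (case-insensitive) enables the provider.
            merged[key] = value.strip().lower() == "true"
        else:
            merged[key] = value
    return merged
```

Passing a plain dict as `environ` makes the behaviour easy to try without touching the real environment.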
Local inference has zero API cost but uses local resources (GPU/CPU time, electricity).
The QNTX budget system only tracks API cost. Future versions may track GPU time:
[pulse]
# Current: Only tracks cloud API spend
daily_budget_usd = 5.00
# Future: Track GPU resource usage
# max_gpu_minutes_per_day = 30.0
See pulse/budget/ package TODOs for GPU resource tracking plans.
Why switch? Local for bulk operations (save money), cloud for occasional high-quality needs (GPT-4, Claude).
Option 1: Edit config file
# ~/.qntx/am.toml
[local_inference]
enabled = true # or false
Option 2: Environment variable
QNTX_LOCAL_INFERENCE_ENABLED=false make dev
Changes to the config file reload automatically. No restart required.
Cause: Ollama server not running.
Fix: ollama serve
Cause: Model not downloaded.
Fix: ollama pull mistral
Cause: CPU inference is inherently slow.
Options:
- Use a smaller model: ollama pull llama3.2:3b
- Adjust the timeout: timeout_seconds = 300
Check: nvidia-smi
Fix: Install CUDA toolkit (Ollama will use it automatically)
Apple Silicon: Works automatically, no setup needed
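When calling the server from a script, the causes above can be told apart by the kind of error raised. The sketch below assumes a client built on Python's `urllib`; `diagnose` is a hypothetical helper, and mapping HTTP 404 to a missing model is an assumption about how the server reports it.

```python
import socket
import urllib.error


def diagnose(error: Exception) -> str:
    """Map a request failure to the matching fix from this section."""
    # HTTPError is a subclass of URLError, so check it first.
    if isinstance(error, urllib.error.HTTPError):
        if error.code == 404:
            # Assumption: the server answers 404 when the model is not pulled.
            return "Model not downloaded. Fix: ollama pull <model>"
        return f"Server returned HTTP {error.code}"
    timeouts = (TimeoutError, socket.timeout)
    if isinstance(error, timeouts):
        return "Inference too slow. Try a smaller model or adjust timeout_seconds"
    if isinstance(error, urllib.error.URLError):
        if isinstance(error.reason, timeouts):
            return "Inference too slow. Try a smaller model or adjust timeout_seconds"
        return "Ollama server not running. Fix: ollama serve"
    return f"Unrecognized error: {error}"
```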
Why custom models? Specialized system prompts, custom parameters, fine-tuned weights.
Create Modelfile:
FROM llama3.2:3b
SYSTEM You are a code review assistant. Focus on security and correctness.
PARAMETER temperature 0.7
PARAMETER num_predict 2048
Build and use:
ollama create qntx-reviewer -f Modelfile
Configure in am.toml:
[local_inference]
model = "qntx-reviewer"
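If you maintain several custom models, generating the Modelfile text from a script keeps them consistent. This is a small sketch: `render_modelfile` is a hypothetical helper that emits only the FROM, SYSTEM, and PARAMETER instructions used above, and it assumes a single-line system prompt.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", f"SYSTEM {system}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2:3b",
    "You are a code review assistant. Focus on security and correctness.",
    {"temperature": 0.7, "num_predict": 2048},
)
print(text, end="")
```

Write the result to a file named Modelfile and build it with the `ollama create` command shown above.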
Use local if:
Use cloud if:
Why consider LocalAI? Written in Go (like QNTX), supports many model formats, self-contained binary.
Why Ollama instead? Better UX, faster iteration, larger community, simpler setup.
LocalAI is viable if you need specific model formats Ollama doesn't support.
docker run -p 8080:8080 localai/localai:latest
Configure:
[local_inference]
enabled = true
base_url = "http://localhost:8080"
model = "your-model-name"