Local Inference Setup

QNTX supports multiple LLM providers that can be enabled/disabled independently:

When local_inference.enabled = true, QNTX uses your local Ollama/LocalAI server. When false, it falls back to OpenRouter or other configured providers.

This guide covers setting up and enabling local inference providers.

Why Local Inference?

Privacy, cost, and control. Cloud LLM APIs are convenient, but they charge per token, send your prompts to a third party, and stop working without an internet connection.

Local inference runs models on your hardware. Zero API cost, complete privacy, works offline.

Why Ollama?

Simplest path from zero to working LLM. No Python, no virtual environments, no CUDA drivers (unless you want GPU acceleration). Download binary, pull model, run.

Alternative (LocalAI) exists but Ollama's UX is unmatched for getting started.

Quick Start

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download: https://ollama.com/download
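To confirm the install, check that the binary is on your PATH (recent Ollama releases support a version flag):

# Verify the CLI is installed and runnable
ollama --version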

2. Download a Model

# Default: small and fast (3B params, 2GB)
ollama pull llama3.2:3b

# Popular alternatives:
ollama pull mistral          # General-purpose (7B, 4GB) - better quality
ollama pull qwen2.5-coder:7b # Code-optimized (7B, 4.5GB) - best for code
ollama pull phi3:mini        # Microsoft's model (3.8B, 2.3GB) - good balance
ollama pull gemma2:2b        # Google's smallest (2B, 1.6GB) - very fast
ollama pull deepseek-coder:6.7b # Code specialist (6.7B, 3.8GB)

Why these models? Balance of size/speed/quality. Smaller models (2-3B) are fast on CPU. Larger models (7B) give better results but need more RAM.
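To see exactly what's downloaded, and the exact tag names to use in am.toml, list your local models:

# Show downloaded models, tags, and sizes
ollama list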

3. Start Ollama

ollama serve

Runs on http://localhost:11434 by default.
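If you need a different bind address or port (for example, serving other machines on your network), Ollama reads the OLLAMA_HOST environment variable; keep base_url in am.toml pointing at the same address:

# Listen on all interfaces on a custom port
OLLAMA_HOST=0.0.0.0:11500 ollama serve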

4. Configure QNTX

Edit ~/.qntx/am.toml (or project am.toml):

[local_inference]
enabled = true  # Disabled by default, set to true to use local models
base_url = "http://localhost:11434"
model = "llama3.2:3b"  # Default model
timeout_seconds = 360  # 6-minute timeout for slow inference
context_size = 16384   # Context window (0 = use model default)

Note: Local inference is disabled by default. Set enabled = true to activate.

Done. All QNTX LLM operations now use local inference.
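Before raising context_size, it's worth checking the model's native context window; in recent Ollama versions, ollama show prints it alongside parameter count and quantization:

# Inspect model metadata, including context length
ollama show llama3.2:3b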

Verify Setup

# Check Ollama is running
curl http://localhost:11434/api/tags

# Test inference
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
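The endpoint is OpenAI-compatible, so the reply lands in choices[0].message.content. With jq installed, you can pull out just the text:

# Same request, printing only the model's reply (requires jq)
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'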

Performance: CPU vs GPU

CPU inference works but is slow (5-10 tokens/sec). Fine for development, frustrating for production.

GPU acceleration makes local inference viable: NVIDIA (CUDA), Apple Silicon (Metal), and AMD (ROCm) GPUs all work, typically lifting throughput from single digits to tens of tokens per second.

Ollama automatically uses GPU if available. No configuration needed.
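To confirm a model is actually on the GPU, load it and check the processor column that ollama ps reports (e.g. "100% GPU"):

# Show loaded models and whether they run on CPU, GPU, or a split
ollama ps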

Model Size Tradeoffs

Model              Size    CPU Speed   GPU Speed   Quality     Best For
gemma2:2b          1.6GB   Slow        Fast        Fair        Minimal resources
llama3.2:3b        2GB     Slow        Fast        Good        Default, balanced
phi3:mini          2.3GB   Slow        Fast        Good        General purpose
mistral            4GB     Very Slow   Very Fast   Excellent   Quality output
qwen2.5-coder:7b   4.5GB   Very Slow   Very Fast   Excellent   Code/technical

Why not 13B or 70B models? Possible, but they need far more memory (roughly 8GB+ VRAM for a quantized 13B, 40GB+ for 70B) and run slower. Diminishing returns for most QNTX operations.

Environment Variables

Quick override without editing config files:

# Switch to local provider (Ollama/LocalAI) instead of OpenRouter
export QNTX_LOCAL_INFERENCE_ENABLED=true

# Point to different Ollama server
export QNTX_LOCAL_INFERENCE_BASE_URL=http://gpu-server:11434

# Use different model
export QNTX_LOCAL_INFERENCE_MODEL=mistral

# Start QNTX with overrides
make dev

Note: Environment variables take precedence over all config files. When QNTX_LOCAL_INFERENCE_ENABLED=true, QNTX uses Ollama/LocalAI instead of OpenRouter (cloud provider).

Cost Tracking

Local inference has zero API cost but uses local resources (GPU/CPU time, electricity).

The QNTX budget system is API-cost focused. Future versions may track GPU time:

[pulse]
# Current: Only tracks cloud API spend
daily_budget_usd = 5.00

# Future: Track GPU resource usage
# max_gpu_minutes_per_day = 30.0

See pulse/budget/ package TODOs for GPU resource tracking plans.

Switching Between Local and Cloud

Why switch? Local for bulk operations (save money), cloud for occasional high-quality needs (GPT-4, Claude).

Option 1: Edit config file

# ~/.qntx/am.toml
[local_inference]
enabled = true  # or false

Option 2: Environment variable

QNTX_LOCAL_INFERENCE_ENABLED=false make dev

Config file changes reload automatically, with no restart required; environment-variable overrides are read at startup.

Troubleshooting

Connection Refused

Cause: Ollama server not running.

Fix: ollama serve
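On Linux, the install script typically registers Ollama as a systemd service, so the server may be stopped rather than missing:

# Check and restart the service (Linux installs via install.sh)
systemctl status ollama
sudo systemctl start ollama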

Model Not Found

Cause: The configured model isn't downloaded, or its name doesn't exactly match an installed tag.

Fix: ollama pull mistral (using the exact name set in am.toml)

Slow on CPU

Cause: CPU inference is inherently slow.

Options (a quick benchmark sketch follows this list):

  1. Use a smaller model: ollama pull gemma2:2b
  2. Increase the timeout: timeout_seconds = 600 (the default above is 360)
  3. Get GPU-enabled hardware
  4. Use cloud APIs for now
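To see what your hardware actually delivers before choosing, run a prompt with the verbose flag; Ollama prints timing stats, including the eval rate in tokens per second:

# Benchmark generation speed on this machine
ollama run llama3.2:3b "Explain TCP in one paragraph." --verbose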

GPU Not Detected (NVIDIA)

Check: nvidia-smi

Fix: Install current NVIDIA drivers (plus CUDA libraries if your distro needs them); Ollama detects and uses the GPU automatically

Apple Silicon: Works automatically, no setup needed

Advanced: Custom Models

Why custom models? Specialized system prompts, custom parameters, fine-tuned weights.

Create Modelfile:

FROM llama3.2:3b

SYSTEM You are a code review assistant. Focus on security and correctness.

PARAMETER temperature 0.7
PARAMETER num_predict 2048

Build and use:

ollama create qntx-reviewer -f Modelfile
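Smoke-test the custom model before pointing QNTX at it, to confirm the system prompt took effect:

# Ask for a review; the reply should focus on security and correctness
ollama run qntx-reviewer "Review: func add(a, b int) int { return a + b }"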

Configure in am.toml:

[local_inference]
model = "qntx-reviewer"

When to Use Local vs Cloud

Use local if: privacy matters (your data never leaves the machine), you're doing bulk operations where per-token costs add up, you need offline capability, or you have GPU hardware to spare.

Use cloud if: you need the highest-quality output (GPT-4, Claude), you have no GPU and CPU speed is unacceptable, or your usage is light enough that API costs stay small.

Alternative: LocalAI

Why consider LocalAI? Written in Go (like QNTX), supports many model formats, self-contained binary.

Why Ollama instead? Better UX, faster iteration, larger community, simpler setup.

LocalAI is viable if you need specific model formats Ollama doesn't support.

Quick Start (Docker)

docker run -p 8080:8080 localai/localai:latest

Configure:

[local_inference]
enabled = true
base_url = "http://localhost:8080"
model = "your-model-name"
