metal-llama

What exists

Metal-cpp renderer inside qntx-plugins/llama-cpp/. The renderer lives in the same process as inference — the softmax distribution goes directly from C++ to a Metal compute shader with no serialization.

Originally prototyped as a separate Swift plugin (qntx-plugins/metal-llama/, deleted). Moved into llama-cpp because the full distribution (512KB/token) doesn't need to leave the process.


Vision

This plugin exists to visualise what is happening inside the model as it's happening. The llama-cpp plugin already captures pre-sampler logit signals per token — confidence, entropy, top-gap, top-k candidates — and streams them over gRPC. The stream glyph renders this as a DOM-based confidence heatmap. metal-llama replaces that with GPU-accelerated rendering that can keep up with token generation speed and, critically, support stepping back through tokens and selecting different paths in possibility space.

The token stream is not just a sequence to watch — it's a tree. At each token position, the model considered alternatives. metal-llama should make that tree navigable: see where the model was confident, where it hesitated, branch into the roads not taken.

Performance is the priority. Metal is chosen deliberately to eliminate the abstraction layers between data and pixels. This commits to the Apple ecosystem for this plugin. The same GPU running llama-cpp inference renders the visualization — zero-copy potential between inference output buffers and visualization input buffers.

Platform scope: macOS-only via Metal. If this needs to run on Windows or Linux in the future, the viable paths are:

For now, Metal is the right call — the llama-cpp plugin already uses Metal acceleration, Tauri targets macOS, and the inference-to-visualization data path benefits from staying on the same GPU API.

Build toolchain: CMake, same as llama-cpp. Metal-cpp is header-only, linked via -framework Metal -framework Foundation -framework QuartzCore.


The per-token window

The inference loop in stream_chat() (inference.cpp:278-301) runs this sequence for every token:

llama_decode(ctx_, batch)     ← GPU forward pass (~20-100ms), fills logit buffer
capture_signal(ctx_, vocab)   ← read logits, softmax, top-10, entropy (~0.1ms)
llama_sampler_sample(sampler) ← sampler chain picks a token (~0.01ms)
on_token(text, signal)        ← stream to gRPC → WebSocket → UI
llama_decode(ctx_, next)      ← next token's forward pass begins

Between llama_decode completing and llama_sampler_sample choosing, the full model state is available. capture_signal currently extracts 230 bytes from a dataset that is 130KB+ per token. The rest is discarded.

What exists in that window:

| Data | Size per token | Currently captured | Access cost |
|---|---|---|---|
| Full probability distribution (32k+ floats after softmax) | ~128 KB | Top-10 only (200 bytes) | Zero — already computed as local var probs, discarded after partial sort |
| Hidden state embedding | ~16 KB (4096 floats) | None | Zero — llama_get_embeddings(ctx) is a pointer dereference |
| Sampler chain stage-by-stage transformations | ~128 KB × N stages | None | Custom llama_sampler_i observer between each stage |
| Temperature sensitivity (softmax at 5 temperatures) | ~640 KB (5 × 128 KB) | None | ~0.5ms (5 extra softmax passes) |
| Token metadata (frequency, flags per candidate) | ~40 bytes per candidate | None | Zero — llama_token_get_score / get_attr are O(1) lookups |
| Context window fill level | 12 bytes | None | Zero — llama_get_seq_pos(ctx, -1) |

The natural frame budget for visualization is the next llama_decode call. While the GPU runs the forward pass for token N+1 (20-100ms), metal-llama has that time to render the full signal data from token N. At 10 tokens/sec, that's 100ms per frame, roughly six times the ~16.7ms a 60fps frame requires.

Data throughput at full extraction: ~130 KB/token × 10 tokens/sec = 1.3 MB/s. Trivially streamable over gRPC. A Metal compute shader processes 32k floats in one dispatch (~0.02ms).

The bottleneck is not what data exists or how fast it can be rendered. The bottleneck is that capture_signal was built to feed a DOM-based heatmap that only needs 230 bytes. metal-llama needs the full 130KB+ because it can actually render it.

What requires llama.cpp patches (Tier 3 — defer)

llama_decode() itself is opaque. The forward pass through 32 transformer layers happens inside it. Per-layer predictions (logit lens), attention weights, and intermediate activations are not observable without hooking into ggml's graph execution. These require forking llama.cpp.

| Signal | What it shows | Blocker |
|---|---|---|
| Per-layer logits (logit lens) | What the model "would predict" at each transformer layer. When does it commit — layer 2 or layer 31? | No public API for intermediate hidden states. ~32x compute overhead. |
| Attention weights | Which prior tokens each attention head reads from. | Internal to ggml. 4 GB per token position at full resolution. Must downsample. |

Open questions

Q1: What should the first full-extraction signal look like? → Full distribution as probability nebula

Resolved. The full softmax distribution (32k floats, 128KB per token) rendered as a 3D particle nebula. See docs/research/probability-nebula.md for the design.

C++ work: keep the full probs vector in capture_signal() instead of discarding after top-10 extraction. Add repeated float full_distribution = 5 to TokenSignalProto. Serve projected token positions (from the model's embedding matrix) via a one-shot endpoint at model load.
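The proto change might look like the following. Field 5 is as stated above; the surrounding field names and numbers are assumptions, not the actual TokenSignalProto definition.

```protobuf
// Sketch only: field 5 is the addition; fields 1-4 are assumed placeholders.
message TokenSignalProto {
  float confidence = 1;
  float entropy = 2;
  float top_gap = 3;
  repeated TopKCandidate top_k = 4;
  // Full post-softmax distribution, one float per vocab entry (~128 KB/token).
  // proto3 packs repeated scalars by default, so the wire cost is the raw floats.
  repeated float full_distribution = 5;
}
```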

Also worth exploring later:


Q2: What does "stepping back" feel like? → Scrub + ghost branches, then forking

Resolved. All three, in order:

First: Timeline scrub. Drag backwards through the generation. The nebula rewinds — the cloud re-blooms, the trail un-draws. See how the model's consideration set evolved step by step. Pure replay of captured frames, no re-inference, no C++ work.

Second: Ghost branches. At each step, the top-k candidates are already known. Draw faint ghost trails branching off the main path — where would the trail have gone if the second-place token had been chosen? Not actual continuations, just single-step alternatives shown as dim branches. Data already exists in TokenSignal.top_k.

Later: Active forking. Click a bright particle that wasn't chosen at some step. The model re-infers from that point forward — a new trail branches off through a different region of vocabulary space. Two paths diverge in the nebula. Requires KV cache snapshots and multi-sequence batching on the C++ side (3x compute per branch).


Q3: What can metal-llama show that the DOM never could?

The stream glyph is text. It renders tokens as <span> elements with colored backgrounds — a reading experience with signal overlays. metal-llama is neither a companion to it nor a replacement for it. It is a parallel system that the stream glyph's existence inspired but that operates in a space the DOM cannot enter.

The stream glyph proves that per-token signal data is valuable in real time. metal-llama takes the same TokenSignal data path — bidirectional, llama-cpp ↔ metal-llama via StreamChat gRPC and the llama_sampler_i vtable — and renders what text-in-a-browser fundamentally cannot:

The stream glyph keeps doing what it does — text with heatmap coloring, readable output, follow-up input. metal-llama exists because some things about inference are not text and never will be.

What is the first thing you'd want to see in this space that you currently cannot? The token tree? The probability landscape? The semantic trajectory? Or something that hasn't been named yet?


Q4: What makes a token "interesting"? → Dissolved by the nebula

Resolved. The nebula doesn't highlight individual tokens — the entire field is the visualization. "Interesting" is no longer a per-token threshold. It's a property of the cloud's behavior over time: a bimodal split (the model torn between two directions), a sudden restructuring (the consideration set jumping to a new region), the cloud blooming diffuse then snapping tight. The viewer sees these moments without needing to be told they're interesting. The nebula makes them visible by being them.


Implementation steps — probability nebula

Step 1: Proto generation and gRPC bootstrap (done)

grpc-swift v2 with SPM build plugin. Proto types generated at build time from domain.proto and llm.proto. Plugin compiles, starts, announces port, serves gRPC. Version 0.2.0.

Step 2: Widen the C++ aperture (done)

capture_signal() keeps full softmax distribution (128k floats for Llama 3.2) in signal.full_distribution, streamed via gRPC StreamChat. Stripped from WebSocket broadcast (1.3GB IndexedDB accumulation crashed the browser).

PCA projection of model_->tok_embd to 3D via Accelerate BLAS in vocab_projection.cpp. Accesses private header llama-model.h, dequantizes via ggml_get_type_traits. Covariance via cblas_sgemm, top 3 eigenvectors via power iteration + deflation. Computed once at model load, served as binary float32 at GET /api/llama-cpp/vocab-positions.

Step 3: Particle field — static frame (done)

MSL compute kernel: probability + 3D position → particle (amber color ramp, log-scaled size, visibility threshold 1e-5). Vertex shader: orthographic MVP. Fragment shader: soft circle point sprite. Additive blending on rgba16Float, Reinhard tonemap to sRGB PNG.

Tested with 128,256 particles, real PCA positions, bimodal test distribution. Prototyped in Swift, then ported to Metal-cpp inside llama-cpp.

metal-llama: renderer moves into llama-cpp

Swift plugin was a prototype. Metal-cpp (Apple's header-only C++ wrapper) puts the renderer in the same process as inference — distribution is a float*, no serialization. MSL shaders unchanged.

Step 4: Live streaming

Rendered frames delivered to browser per-token. Each token's distribution is a keyframe. Compute shader lerps between keyframes at 60fps.

Chosen token recorded per step. Line strip connects chosen-token positions — the generation trail. Older segments fade via alpha decay.

Done when: running a prompt shows the nebula updating in real time.

Step 5: Timeline scrub (done)

Hover a token in the stream glyph to scrub the nebula to that token's distribution. The text IS the timeline — no separate scrubber UI. Stream glyph dispatches nebula-scrub CustomEvent with token index, nebula module sends scrub:N over WebSocket, C++ renders the stored keyframe. Trail shader splits at the scrub point: warm path up to the hovered token, cool blue for the future path beyond it. mouseleave resumes live mode.

Per-token keyframe history stored CPU-side (capped at 512 entries). Each store_keyframe in StreamChat captures the full distribution alongside submit_distribution and add_trail_point.

Done when: after a generation completes, dragging the timeline backwards smoothly reverses the nebula animation. Hovering tokens scrubs the nebula and splits the trail. Works.

Step 6: Ghost branches (GHB) — low priority

At each generation step, the top-k candidates are already known. For each step, draw faint trails from the chosen token's position to each runner-up's position — dim branches off the main path showing single-step alternatives. Opacity proportional to the runner-up's probability.

Deprioritized — zero-cost signal extraction and sampler visibility reveal more about inference than single-step visual branches. Ghost branches show where alternatives were in vocabulary space but not why they were rejected. Sampler chain observations (SCO) answer the "why" question.


Known limitations

| Code | Description | Where |
|---|---|---|
| B64 | WebSocket frames are base64-encoded PNG — 33% bandwidth overhead | metal_renderer.h |
| CAM | No camera control — fixed orthographic MVP, no rotation/zoom/pan | metal_renderer.h |
| KFC | Keyframe history capped at 512 (64MB). No disk persistence | metal_renderer.h |
| TRU | Trail positions unbounded while keyframes are capped | metal_renderer.h |
| PVH | PCA accesses private llama-model.h header — version-fragile | vocab_projection.cpp |
| SSL | No signal summary logging for StreamChat (Chat path logs it) | plugin.cpp |

Zero-cost opportunities

These require small C++ additions to capture_signal() or the sampler chain. Data already exists in the per-token window, costs nothing to extract.

| Code | Signal | What it reveals | Where to add |
|---|---|---|---|
| TMD | Token metadata (frequency, special/byte flags) | "Is the model picking a rare token?" | capture_signal() |
| CWU | Context window usage (tokens used / available) | Speed degradation warning | capture_signal() |
| TMP | Temperature sensitivity (softmax at 5 temps) | How much temperature reshapes the distribution | capture_signal() |
| CPX | Cumulative perplexity (running scalar) | Fluency score for comparing prompt strategies | capture_signal() |
| SCO | Sampler chain observations (distribution before/after each stage) | Why specific tokens were rejected | stream_chat() sampler chain |
| ECM | Factor entropy + top_gap into stream glyph color mapping | Richer heatmap, currently confidence-only | stream-glyph.ts |

Moderate-cost opportunities

| Code | Signal | What it reveals | Effort |
|---|---|---|---|
| HSE | Hidden state embeddings (4096 floats via llama_get_embeddings) | Semantic trajectory — where the model is | Low C++, needs UMAP/PCA |
| SUI | Expose top-k/top-p/min-p/penalties in UI | Users can't control the sampler chain | Proto + C++ + Go + TS |
| STR | GPU-accelerated steering — write back to logit buffer from nebula | Direct manipulation of probability landscape | Metal buffer path exists, no input wired |

High-cost opportunities

| Code | Signal | What it reveals | Effort |
|---|---|---|---|
| TTB | Token tree branching (KV cache fork, generate alternatives) | "What if the model had said X instead?" | C++ KV cache management |
| ATS | Write per-generation attestations with signal attributes | Persistent record of inference quality | Go storage layer |

Codes reference

All codes used across the codebase (README limitations, research docs, source TODOs):

| Code | Scope | Status |
|---|---|---|
| STO | Single-turn only | README limitation |
| TAO | Text attachments only | README limitation |
| IBP | Image-based PDFs | README limitation |
| COF | Context overflow | README limitation |
| NEF | No extraction feedback | README limitation |
| SDR | Shutdown race | README limitation |
| B64 | Base64 WebSocket overhead | Metal limitation |
| CAM | No camera control | Metal limitation |
| KFC | Keyframe cap 512 | Metal limitation |
| TRU | Trail unbounded | Metal limitation |
| PVH | Private header access | Metal limitation |
| SSL | No StreamChat signal summary | Missing feature |
| GHB | Ghost branches | Step 6, low priority |
| TMD | Token metadata | Zero-cost opportunity |
| CWU | Context window usage | Zero-cost opportunity |
| TMP | Temperature sensitivity | Zero-cost opportunity |
| CPX | Cumulative perplexity | Zero-cost opportunity |
| SCO | Sampler chain observations | Zero-cost opportunity |
| ECM | Entropy/confidence color mapping | Zero-cost opportunity |
| HSE | Hidden state embeddings | Tier 2 opportunity |
| SUI | Sampling UI controls | Tier 2 opportunity |
| SCW | Sampler chain wiring | Depends on SUI |
| STR | GPU steering | Tier 2 opportunity |
| TTB | Token tree branching | Tier 3 opportunity |
| ATS | Attestation storage | Tier 2 opportunity |
| VDF | Vocabulary dump to frontend | Checklist item |
| LTR | Logit trajectories | Checklist item |
| BIG | Bias glyph integration | Blocked on #718 |
| HSC | Hidden state cluster comparison | Blocked on HSE |
| ESD | Entropy spike detection | Checklist item |
| CPY | Copy button for stream glyph | Missing feature |
| WMS | Window morph support | Missing feature |