# Embeddings

Semantic search over attestations using sentence transformers (`all-MiniLM-L6-v2`) via ONNX Runtime.

## Architecture

## Configuration

Embeddings are configured via `am.toml` or the UI config API:

```toml
[embeddings]
enabled = true
path = "ats/embeddings/models/all-MiniLM-L6-v2/model.onnx"
name = "all-MiniLM-L6-v2"
cluster_threshold = 0.5
recluster_interval_seconds = 3600  # re-cluster every hour (0 = disabled)
min_cluster_size = 5
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `embeddings.enabled` | bool | `false` | Enable the embedding service on startup |
| `embeddings.path` | string | `ats/embeddings/models/all-MiniLM-L6-v2/model.onnx` | Path to ONNX model file |
| `embeddings.name` | string | `all-MiniLM-L6-v2` | Model identifier for metadata |
| `embeddings.cluster_threshold` | float | `0.5` | Minimum cosine similarity for incremental cluster assignment |
| `embeddings.recluster_interval_seconds` | int | `0` | Pulse schedule interval for HDBSCAN re-clustering (0 = disabled) |
| `embeddings.min_cluster_size` | int | `5` | Minimum cluster size for HDBSCAN |

When `enabled = false` (the default), `SetupEmbeddingService` skips initialization even if the binary was built with the `rustembeddings` tag. Enabling requires the model file to exist at the configured path.

When `recluster_interval_seconds > 0`, a Pulse scheduled job is auto-created on startup to periodically re-run HDBSCAN clustering. The schedule is idempotent — restarting the server won't duplicate it. Changing the interval in config updates the existing schedule.

## Cluster Labeling

An LLM labels unlabeled clusters by sampling member attestation texts. Labels are attested by `qntx@embeddings` — the label, evidence (sampled texts), and model become part of the attestation graph.

```toml
[embeddings]
cluster_label_interval_seconds = 300  # label every 5 min (0 = disabled)
cluster_label_min_size = 15           # skip small clusters
cluster_label_sample_size = 5         # random texts sent to LLM
cluster_label_max_per_cycle = 3       # don't label everything at once
cluster_label_cooldown_days = 7       # min days between re-labels per cluster
cluster_label_max_tokens = 2000       # LLM response budget
cluster_label_model = ""              # empty = system default provider/model
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `embeddings.cluster_label_interval_seconds` | int | `0` | Pulse schedule interval (0 = disabled) |
| `embeddings.cluster_label_min_size` | int | `15` | Minimum members for a cluster to be eligible |
| `embeddings.cluster_label_sample_size` | int | `5` | Random texts sampled from the cluster and sent to the LLM |
| `embeddings.cluster_label_max_per_cycle` | int | `3` | Max clusters labeled per scheduled run |
| `embeddings.cluster_label_cooldown_days` | int | `7` | Minimum days before re-labeling a cluster |
| `embeddings.cluster_label_max_tokens` | int | `2000` | LLM max_tokens for the labeling request |
| `embeddings.cluster_label_model` | string | `""` | Model override (empty = system default from OpenRouter/local config) |

Re-labeling is purely time-gated (per-cluster cooldown). Membership-change-ratio triggers are deferred until more usage data informs the heuristic.

Config can also be updated at runtime via the REST API:

```
PATCH /api/config
{"updates": {"embeddings.enabled": true, "embeddings.path": "/path/to/model.onnx"}}
```

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/search/semantic?q=<text>&limit=10&threshold=0.7` | Search stored embeddings by semantic similarity |
| POST | `/api/embeddings/generate` | Generate embedding for `{"text": "..."}` — returns 384-dim vector |
| POST | `/api/embeddings/batch` | Embed attestations by ID: `{"attestation_ids": ["..."]}` |
| POST | `/api/embeddings/cluster` | Run HDBSCAN clustering: `{"min_cluster_size": 5}` |
| POST | `/api/embeddings/project` | Run UMAP projection via reduce plugin, store 2D coords |
| GET | `/api/embeddings/projections` | Get `[{id, source_id, x, y, cluster_id}]` for visualization |
| GET | `/api/embeddings/info` | Embedding service status, counts, and cluster summary |

Without the `rustembeddings` build tag, all endpoints return `503`.

## 2D Projection (UMAP)

Embeddings are 384-dimensional — far too high-dimensional to visualize directly. The `qntx-reduce` plugin projects them to 2D via UMAP for canvas visualization.

See `qntx-reduce/README.md` for setup and API details.

Flow: `POST /api/embeddings/project` reads all embeddings, calls the reduce plugin's `/fit` endpoint, and writes `projection_x`/`projection_y` back to the embeddings table. New attestations are auto-projected via `/transform` if the model is fitted.

## Model Files

Located at `ats/embeddings/models/all-MiniLM-L6-v2/` (not in git). See `ats/embeddings/README.md` for download instructions.

## Completed

## Open Work

## Open Questions

### Design decision: embedding tests are local-only

Embedding tests (`ats/embeddings/embeddings/embeddings_test.go`) require the ONNX model files (~80 MB) and add ~3 s of inference per run. They're gated behind `//go:build cgo && rustembeddings` — CI doesn't pass this tag, so they run only locally.

This avoids burdening every PR with model download/caching and inference time. If the embedding surface area grows, a dedicated `ci-embeddings.yml` workflow (triggered only on changes to `ats/embeddings/`, `ats/storage/embedding_store*`, `server/embeddings_handlers*`) can be added without affecting the main pipeline.
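If that day comes, the path-filtered workflow could look roughly like this. The job steps, action versions, and Go version are placeholder assumptions; only the trigger paths and the `cgo rustembeddings` tags come from this document:

```yaml
# .github/workflows/ci-embeddings.yml (hypothetical sketch)
name: ci-embeddings
on:
  pull_request:
    paths:
      - "ats/embeddings/**"
      - "ats/storage/embedding_store*"
      - "server/embeddings_handlers*"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22" # placeholder
      # The ~80 MB model files would need a download or cache step here.
      - run: go test -tags "cgo rustembeddings" ./ats/embeddings/...
```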

## Technical Debt

### Design decision: unconditional sqlite-vec

`db/connection.go` imports the sqlite-vec CGO bindings unconditionally — every Go build pays the CGO compilation cost, even builds that don't use embeddings. This is coupled to migration 024, which creates a `vec0` virtual table that requires the extension to be loaded. The migration runs unconditionally via `//go:embed sqlite/migrations/*.sql`.

Making this conditional requires solving both sides together: the sqlite-vec import would need to be gated behind a build tag, and migration 024 would need to be skipped or stubbed under the same condition.

Current choice: accept the universal CGO dependency. The `cli-nocgo` target (`CGO_ENABLED=0`) will fail at runtime on migration 024 if it encounters a database that hasn't run that migration yet.