Memory layer

The memory layer is what turns the recording moat into agent-usable signal. It exists between the raw event stream (apps/server/recording) and the prediction surface (apps/server/predictions), and its job is twofold: keep the user's full signal recoverable, and keep the LLM's prompt small enough to be useful.

This page is the design contract.

The four layers

  • L3 MODELS: compiled artifacts (profile, calibrated thresholds, soul card). Recompiled daily; the single source of truth lives in the server. Cites the top L2 facts.
  • L2 FACTS: atomic semantic claims (preference / habit / skill / artifact), each carrying importance, confidence, and tags. Cites the L1 episodes that explain why we believe it.
  • L1 EPISODES: coherent sessions extracted from raw signal, with start/end timestamps, a summary, and source(s). Cites the L0 raw rows it was derived from.
  • L0 RAW: the append-only event stream. Timestamps, source, payload; never rewritten.

Every layer is markdown-shaped (currently rendered from Django ORM rows; we keep the markdown view as the user-facing source of truth). Provenance links point one layer down: a model cites the facts that compiled into it; a fact cites the episodes that produced it; an episode cites the raw rows it was derived from. Following any chain to its root yields a complete audit trail of every prediction.
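
A minimal sketch of the shape of the provenance chain (plain dataclasses here rather than the real Django models; every class and field name is illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RawEvent:            # L0: append-only, never rewritten
    id: str
    timestamp: str
    source: str
    payload: dict

@dataclass
class Episode:             # L1: cites the raw rows it was derived from
    id: str
    start: str
    end: str
    summary: str
    raw_event_ids: List[str] = field(default_factory=list)

@dataclass
class Fact:                # L2: cites the episodes that produced it
    id: str
    claim: str
    kind: str              # preference / habit / skill / artifact
    importance: float
    confidence: float
    tags: List[str] = field(default_factory=list)
    episode_ids: List[str] = field(default_factory=list)

@dataclass
class CompiledModel:       # L3: cites the facts it was compiled from
    id: str
    kind: str              # profile / soul card / calibrated thresholds
    body_markdown: str
    fact_ids: List[str] = field(default_factory=list)

def audit_trail(model: CompiledModel,
                facts: Dict[str, Fact],
                episodes: Dict[str, Episode]) -> List[str]:
    """Follow the citation chain from a compiled model down to the raw rows."""
    trail: List[str] = []
    for fact in (facts[fid] for fid in model.fact_ids):
        trail.append(fact.id)
        for episode in (episodes[eid] for eid in fact.episode_ids):
            trail.append(episode.id)
            trail.extend(episode.raw_event_ids)
    return trail
```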

Why markdown is the substrate

Storage efficiency is not the constraint. Trust is. The user must be able to read, edit, and delete what their Clone knows about them, or every layer above breaks down. Markdown is diffable, versionable, exportable, and an LLM reads it natively. The compiled forms (embeddings, indices, the daily-recompiled soul card) are caches; the markdown is the ground truth from which any cache can be rebuilt.

A consequence: hierarchy is an emergent view, never the storage primitive. A flat namespace plus tags plus bidirectional links carries the structure. Real life is multi-categorical and strict directory hierarchies break inside a year. The directory tree is one of several ways to look at the corpus, not the corpus itself.

Compaction is the system

The point of the four layers is that compaction is reversible. If a fact turns out wrong, deleting it triggers a re-compile from the episodes; if an episode summary was bad, deleting it triggers a re-summary from raw. L0 is never rewritten, so the system can always rewind to ground truth and re-derive. This is what gives the layer the property people usually call "no catastrophic forgetting" — the lower layers always survive.

Three compaction loops, three different cadences:

| Loop | Source | Target | Cadence | Gating |
| --- | --- | --- | --- | --- |
| L0 → L1 | raw events for one closed session | episode summary | near-real-time, on session.stopped | none — auto |
| L1 → L2 | accumulated episodes (last N days) | atomic facts | daily batch | none on the default flow — auto with confidence tiers (see below) |
| L2 → L3 | all facts ranked by importance | profile, soul card, calibrated thresholds | daily | none — auto |
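
A rough sketch of how the three loops could be wired, assuming the summarize / extract / compile steps are injected as callables; none of these function names come from the actual apps/server code:

```python
from typing import Callable, List

def run_l0_to_l1(raw_rows: List[dict],
                 summarize: Callable[[List[dict]], dict]) -> dict:
    """L0 -> L1: runs near-real-time when a session closes; no gate.
    Returns an episode that cites the raw rows it was built from."""
    episode = summarize(raw_rows)
    episode["raw_event_ids"] = [row["id"] for row in raw_rows]
    return episode

def run_l1_to_l2(episodes: List[dict],
                 extract_facts: Callable[[List[dict]], List[dict]],
                 apply_tier: Callable[[dict], dict]) -> List[dict]:
    """L1 -> L2: daily batch over the last N days of episodes.
    Auto-promotes, subject to the confidence tiers described below."""
    return [apply_tier(candidate) for candidate in extract_facts(episodes)]

def run_l2_to_l3(facts: List[dict],
                 compile_models: Callable[[List[dict]], List[dict]]) -> List[dict]:
    """L2 -> L3: daily recompile of profile, soul card, calibrated thresholds; no gate."""
    ranked = sorted(facts, key=lambda f: f["importance"], reverse=True)
    return compile_models(ranked)
```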

No hard approval gate (and why)

The earliest version of this design proposed a weekly user approval step on L1 → L2 promotion. That is wrong. A 7-day delay between observed behavior and prediction-grade signal breaks the "always-on, immediately personalized" promise of the product, and a weekly review ritual is friction, not stickiness. The first hour matters more than the seventh day.

Auto-promotion is the default flow. Trust is preserved by other mechanisms.

Confidence-tiered auto-promotion

Each candidate fact lands in one of three states:

  • Strong — pattern repeats 5+ times across 2+ sessions with no contradicting signal. Promote to L2 immediately, importance 0.9+, full weight in prediction.
  • Tentative — 1–2 session pattern or weak signal. Promote to L2 with importance 0.4–0.6. Still feeds prediction, but at reduced weight; flagged in the user-facing view.
  • Candidate — single occurrence. Held in a staging queue. If more evidence accumulates within 14 days, promote; otherwise discard.

The agent calling the prediction surface always sees the tier (via importance) and can route accordingly.
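
A small sketch of how the tiering decision could look, using the thresholds quoted above; the function name, return shape, and exact importance values are illustrative:

```python
def classify_candidate(occurrences: int, distinct_sessions: int,
                       contradicted: bool) -> dict:
    """Map observed evidence onto the three promotion tiers described above."""
    if occurrences == 1:
        # Single occurrence: hold in the staging queue, up to 14 days, for more evidence.
        return {"tier": "candidate", "promote": False, "importance": None}
    if occurrences >= 5 and distinct_sessions >= 2 and not contradicted:
        # Repeated pattern across multiple sessions with nothing contradicting it.
        return {"tier": "strong", "promote": True, "importance": 0.9}
    # 1-2 session pattern or otherwise weak signal: promote at reduced weight.
    return {"tier": "tentative", "promote": True, "importance": 0.5}
```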

Behavioral decay

A fact's confidence is not static. Whenever a prediction grounded on a fact is contradicted by the user's actual response, the cited fact takes a confidence hit. Whenever it is confirmed, it gains. Below a threshold, facts are auto-archived without user intervention. This is the silent garbage collector that makes auto-promotion safe — the system keeps grooming itself against ground truth.
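
One way the decay update could look; the step sizes and archive threshold below are placeholders, not calibrated values:

```python
def update_confidence(confidence: float, confirmed: bool,
                      gain: float = 0.05, penalty: float = 0.15,
                      archive_below: float = 0.2) -> tuple:
    """Nudge a cited fact's confidence after a prediction outcome and report
    whether it has decayed below the auto-archive threshold."""
    if confirmed:
        confidence = min(1.0, confidence + gain)
    else:
        confidence = max(0.0, confidence - penalty)
    return confidence, confidence < archive_below
```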

Read-and-correct, not review-and-approve

The user can open the markdown view at any time and delete or edit anything. One click to remove a wrong fact triggers a cascade re-compile. The user's role is agency at the edges, not gatekeeper of the flow. They are not required to act for the system to work; they retain full power to correct it when they choose to.

Hard gates only at three edges

Three narrow cases keep their explicit gate:

  1. Sensitive categories — health, relationships, finances, religion. Auto-promotion is disabled; candidates stay candidates until the user explicitly promotes them.
  2. Direct contradictions of high-importance facts. A new candidate that contradicts an existing fact at importance ≥ 0.85 cannot auto-resolve. The user is prompted to pick which is true.
  3. Action-grounding facts. Facts whose primary use is to authorize Clone-mediated action on the user's behalf (e.g. "the user is OK with auto-replying to internal emails") require explicit confirmation regardless of confidence. Prediction-only facts do not.

These three together cover well under 10% of fact volume; the other 90%+ flows automatically.
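
A sketch of the gate check, assuming candidate facts carry tags, a contradicts reference, and a grounds_action flag; all of these field names are illustrative:

```python
SENSITIVE_TAGS = {"health", "relationships", "finances", "religion"}

def requires_explicit_gate(candidate: dict, existing_facts: list) -> bool:
    """True when a candidate fact must wait for the user instead of auto-promoting."""
    # 1. Sensitive categories: auto-promotion is disabled outright.
    if SENSITIVE_TAGS & set(candidate.get("tags", [])):
        return True
    # 2. Direct contradiction of a high-importance fact cannot auto-resolve.
    for fact in existing_facts:
        if fact["importance"] >= 0.85 and candidate.get("contradicts") == fact["id"]:
            return True
    # 3. Action-grounding facts need explicit confirmation regardless of confidence.
    return bool(candidate.get("grounds_action", False))
```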

Retrieval: stratified, not similarity-only

The prediction prompt has a finite token budget. A pure similarity-search retrieval ("nearest 10 facts to this prompt") fails the moment the user asks something off-pattern: the query's single dominant theme crowds everything else out, and the user's identity skeleton vanishes from the prompt. The architect's test for "does this system suffer from catastrophic forgetting?" is really asking "does retrieval silently drop the baseline?"

The answer is stratified retrieval:

  • Always-on baseline — the L3 soul card, plus the top N L2 facts by importance, regardless of the current query. This is the user's identity skeleton.
  • Dynamic block — the rest of the token budget filled by similarity-ranked L1 episodes and L0 raw rows within a recency window.

Token budget allocation goes baseline-first, dynamic-last. The baseline is not flexible; the dynamic block shrinks before the baseline does. This is what gives the layer its no-forgetting property in practice — the user's core is always present in every prediction, no matter what the user asks.
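
A sketch of baseline-first budgeting under these rules; the token counter, the top-N cutoff, and the field names are all stand-ins:

```python
def build_prompt_context(soul_card: str, facts: list, dynamic_chunks: list,
                         budget_tokens: int, baseline_n: int = 20,
                         count_tokens=lambda text: len(text.split())) -> list:
    """Baseline-first budgeting: the soul card and top-N facts always go in;
    similarity-ranked episodes / raw rows fill only whatever budget remains."""
    top_facts = sorted(facts, key=lambda f: f["importance"], reverse=True)[:baseline_n]
    baseline = [soul_card] + [f["claim"] for f in top_facts]
    used = sum(count_tokens(piece) for piece in baseline)

    context = list(baseline)
    for chunk in dynamic_chunks:          # already similarity-ranked within a recency window
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break                         # the dynamic block shrinks; the baseline never does
        context.append(chunk)
        used += cost
    return context
```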