Edit: github-backup/docs/03-memory-layer-evaluation.md

# Memory Layer Evaluation

## Principle

Do not choose the memory layer by vibes. Choose it by benchmark.

The live second brain should remain Markdown/Obsidian-compatible. Retrieval
indexes are rebuildable infrastructure, not the source of truth.

The memory layer must also pass the memory-filter test: it should help agents
retrieve the right facts without encouraging bulk storage of raw transcripts,
Slack history, Google Drive creatives or stale working notes.

## Candidate Layers

### Hermes Built-In Memory

Use for:

- Short, reviewed user and agent preferences.
- Stable operating rules.
- Compact identity context.

Do not use for:

- Client history.
- Relationship intelligence.
- Sales commitments.
- Detailed SOPs.

### Hermes `llm-wiki`

Use for:

- Raw source notes.
- Compiled entity, project, concept and decision pages.
- Provenance and confidence labels.

Strength:

- Official Hermes skill.
- Plain Markdown.
- Good fit for Obsidian.

Risk:

- Retrieval may remain too keyword-oriented for fuzzy founder recall.

### QMD

Use for:

- Local hybrid retrieval over Markdown.
- Fuzzy recall.
- Ranked snippets.

Strength:

- Aligns with Rhys Fisher's quiet-search critique.
- Keeps Markdown as source of truth.

Risk:

- Requires disk/model resources.
- Needs evaluation against Easier's benchmark questions.

### gbrain

Use for:

- Candidate graph/RAG memory layer if it outperforms QMD and Hermes `llm-wiki`.
- Possible later daemon for hybrid search, graph traversal, synthesis, gap
  analysis, skill packs and dream-cycle maintenance.

Strength:

- Mentioned in current Hermes practitioner guidance.
- The public README describes Markdown-backed brain repos, hybrid search,
  graph links, MCP support, access-scoped company-brain use and recurring
  maintenance.

Risk:

- Unknown fit for the current VM.
- Should not be installed on the live n8n host until resource needs and data
  boundaries are clear.
- Its agent install protocol explicitly asks the operator to confirm search
  mode and cost posture; Easier should not silently accept expensive retrieval
  defaults.

## Benchmark Set

Create 40-60 questions from synthetic or approved notes.

Question classes:

- Exact source: "Where did this claim come from?"
- Fuzzy recall: "Who was worried about margin but liked creative testing?"
- Departmental: "What is the relationship process after onboarding?"
- Temporal: "What changed since the last weekly review?"
- Contradiction: "What evidence argues against this idea?"
- Decision: "What did we decide and why?"
- SOP: "What process applies here?"
- Safety: "Should the agent send this message?"
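One way to keep the benchmark set honest is to store each question as a small structured record, so coverage per class can be counted before any candidate is scored. The sketch below is illustrative only: the names `QuestionClass`, `BenchmarkQuestion` and `coverage_by_class`, and the example note paths, are assumptions and not part of Hermes, QMD or gbrain.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionClass(Enum):
    """The eight question classes listed above."""
    EXACT_SOURCE = "exact_source"
    FUZZY_RECALL = "fuzzy_recall"
    DEPARTMENTAL = "departmental"
    TEMPORAL = "temporal"
    CONTRADICTION = "contradiction"
    DECISION = "decision"
    SOP = "sop"
    SAFETY = "safety"


@dataclass(frozen=True)
class BenchmarkQuestion:
    question_id: str
    question_class: QuestionClass
    text: str
    # Hypothetical convention: paths of the notes that should be cited.
    # An empty tuple means the right response is "not enough evidence".
    expected_sources: tuple[str, ...] = ()

    @property
    def answerable(self) -> bool:
        return len(self.expected_sources) > 0


def coverage_by_class(questions):
    """Count questions per class, including classes with zero questions."""
    counts = {question_class: 0 for question_class in QuestionClass}
    for question in questions:
        counts[question.question_class] += 1
    return counts


# Example records built from synthetic notes (paths are made up).
EXAMPLE_QUESTIONS = [
    BenchmarkQuestion(
        "q001",
        QuestionClass.DECISION,
        "What did we decide and why?",
        ("decisions/example-decision.md",),
    ),
    BenchmarkQuestion(
        "q002",
        QuestionClass.EXACT_SOURCE,
        "Where did this claim come from?",
    ),
]
```

Classes that report a count of zero in `coverage_by_class` are the ones still missing from the 40-60 question target.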

## Scoring

For each candidate:

- Top-5 retrieval accuracy.
- Citation quality.
- False confidence rate.
- Latency.
- Disk and memory use.
- Setup complexity.
- Ease of rebuild.
- Sensitivity leakage risk.
- Tendency to over-ingest raw artifacts.
- Quality of gap/staleness reporting.

Pass threshold:

- At least 80 percent of answerable benchmark questions have the right source
  in the top 5.
- No sensitive data appears in an inappropriate answer.
- The system can say "not enough evidence".
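The pass threshold can be checked mechanically once each candidate's answers are recorded. The following is a minimal sketch, not a definitive scorer. It assumes a hypothetical `RetrievalResult` record per question, and it reads "can say not enough evidence" as "abstained on every unanswerable question", which is one possible interpretation. Whether an answer leaked sensitive data is assumed to be judged by a human reviewer and stored as a flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalResult:
    # Note paths that count as correct; empty means the question is unanswerable.
    expected_sources: frozenset
    # Note paths returned by the candidate, best match first.
    ranked_sources: tuple
    # Set by a reviewer if the answer exposed sensitive data inappropriately.
    leaked_sensitive: bool = False
    # True if the system answered "not enough evidence".
    abstained: bool = False


def top_k_accuracy(results, k=5):
    """Share of answerable questions with a correct source in the top k."""
    answerable = [r for r in results if r.expected_sources]
    if not answerable:
        return 0.0
    hits = sum(
        1 for r in answerable
        if r.expected_sources & set(r.ranked_sources[:k])
    )
    return hits / len(answerable)


def passes_threshold(results, k=5, min_accuracy=0.80):
    """Apply the three pass conditions to one candidate's results."""
    unanswerable = [r for r in results if not r.expected_sources]
    no_leak = not any(r.leaked_sensitive for r in results)
    # Assumption: abstaining on all unanswerable questions demonstrates the
    # ability to say "not enough evidence". Vacuously true if there are none,
    # so the benchmark should include some unanswerable questions.
    can_abstain = all(r.abstained for r in unanswerable)
    return top_k_accuracy(results, k) >= min_accuracy and no_leak and can_abstain
```

Latency, resource use and the other scoring dimensions are left out here because they are measured rather than derived from the answers.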

## Initial Recommendation

Start with Hermes `llm-wiki` because it is official and simple. Add QMD or
gbrain only after benchmark notes exist. Do not run local embedding/model
indexing on the current n8n VM until disk and memory headroom improve.
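A headroom check along these lines could gate any indexing job before it starts. This is a sketch under stated assumptions: the 20 percent reserve and the function names are hypothetical placeholders, not measured requirements for QMD or gbrain. It covers disk only, because the Python standard library has no portable call for free memory.

```python
import shutil


def has_headroom(free_bytes, needed_bytes, reserve_fraction, total_bytes):
    """True if needed_bytes fits while reserve_fraction of the total stays free."""
    return free_bytes - needed_bytes >= reserve_fraction * total_bytes


def disk_has_headroom(path, needed_bytes, reserve_fraction=0.2):
    """Check the filesystem holding `path`; reserve_fraction is an assumed default."""
    usage = shutil.disk_usage(path)
    return has_headroom(usage.free, needed_bytes, reserve_fraction, usage.total)
```

The `needed_bytes` figure should come from a trial index build on a non-production machine rather than from a guess.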

Use gbrain's design as inspiration immediately, but treat installation as a
separate benchmarked decision.