Memory Layer Evaluation

Principle

Do not choose the memory layer by vibes. Choose it by benchmark.

The live second brain should remain Markdown/Obsidian-compatible. Retrieval indexes are rebuildable infrastructure, not the source of truth.

The memory layer must also pass the memory-filter test: it should help agents retrieve the right facts without encouraging bulk storage of raw transcripts, Slack history, Google Drive creatives or stale working notes.

Candidate Layers

Hermes Built-In Memory

Use for:

Short, reviewed user and agent preferences.
Stable operating rules.
Compact identity context.

Do not use for:

Client history.
Relationship intelligence.
Sales commitments.
Detailed SOPs.

Hermes `llm-wiki`

Use for:

Raw source notes.
Compiled entity, project, concept and decision pages.
Provenance and confidence labels.

Strength:

Official Hermes skill.
Plain Markdown.
Good fit for Obsidian.

Risk:

Retrieval may remain too keyword-oriented for fuzzy founder recall.

QMD

Use for:

Local hybrid retrieval over Markdown.
Fuzzy recall.
Ranked snippets.

Strength:

Aligns with Rhys Fisher's quiet-search critique.
Keeps Markdown as source of truth.

Risk:

Requires disk/model resources.
Needs evaluation against Easier's own benchmark questions.

gbrain

Use for:

Candidate graph/RAG memory layer if it outperforms QMD and Hermes `llm-wiki` on the benchmark.
Possible later daemon for hybrid search, graph traversal, synthesis, gap analysis, skill packs and dream-cycle maintenance.

Strength:

Mentioned in current Hermes practitioner guidance.
The public README describes Markdown-backed brain repos, hybrid search, graph links, MCP support, access-scoped company-brain use and recurring maintenance.

Risk:

Unknown fit for current VM.
Should not be installed on the live n8n host until resource needs and data boundaries are clear.
Its agent install protocol explicitly asks the operator to confirm search mode and cost posture; Easier should not silently accept expensive retrieval defaults.

Benchmark Set

Create 40-60 questions from synthetic or approved notes.

Question classes:

Exact source: "Where did this claim come from?"
Fuzzy recall: "Who was worried about margin but liked creative testing?"
Departmental: "What is the relationship process after onboarding?"
Temporal: "What changed since the last weekly review?"
Contradiction: "What evidence argues against this idea?"
Decision: "What did we decide and why?"
SOP: "What process applies here?"
Safety: "Should the agent send this message?"
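One way to keep the benchmark set honest is to store each question as a small record and check class coverage before any candidate is scored. The sketch below is illustrative only: `QuestionClass`, `BenchmarkQuestion` and `check_benchmark_set` are hypothetical names, and treating an empty `expected_sources` as "not answerable from the notes" is an assumption, not something this document defines.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class QuestionClass(Enum):
    # One member per question class listed above.
    EXACT_SOURCE = "exact_source"
    FUZZY_RECALL = "fuzzy_recall"
    DEPARTMENTAL = "departmental"
    TEMPORAL = "temporal"
    CONTRADICTION = "contradiction"
    DECISION = "decision"
    SOP = "sop"
    SAFETY = "safety"


@dataclass(frozen=True)
class BenchmarkQuestion:
    question: str
    question_class: QuestionClass
    # Note paths that should be retrieved. Assumption: an empty tuple
    # marks a question the notes cannot answer.
    expected_sources: tuple[str, ...]

    @property
    def answerable(self) -> bool:
        return bool(self.expected_sources)


def check_benchmark_set(questions, min_size=40, max_size=60):
    """Return a list of problems; an empty list means the set looks usable."""
    problems = []
    if not min_size <= len(questions) <= max_size:
        problems.append(
            f"expected {min_size}-{max_size} questions, got {len(questions)}"
        )
    counts = Counter(q.question_class for q in questions)
    for cls in QuestionClass:
        if counts[cls] == 0:
            problems.append(f"no questions for class {cls.value}")
    return problems
```

Keeping a few deliberately unanswerable questions in the set gives the scoring step something to measure abstention against.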

Scoring

For each candidate:

Top-5 retrieval accuracy.
Citation quality.
False confidence rate.
Latency.
Disk and memory use.
Setup complexity.
Ease of rebuild.
Sensitivity leakage risk.
Tendency to over-ingest raw artifacts.
Quality of gap/staleness reporting.

Pass threshold:

At least 80 percent of answerable benchmark questions have the right source in the top 5.
No sensitive data appears in an inappropriate answer.
The system can say "not enough evidence" when the notes do not support an answer.
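The pass threshold can be checked mechanically once each candidate's answers are recorded. The following is a minimal sketch under stated assumptions: `Result` and `evaluate` are hypothetical names, sensitivity leaks are assumed to be flagged by a human reviewer, "false confidence" is interpreted here as answering an unanswerable question instead of abstaining, and "can say not enough evidence" is interpreted as abstaining on at least one unanswerable question. A stricter reading is equally defensible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    expected_sources: frozenset  # empty means the question is unanswerable
    retrieved: tuple             # ranked source paths, best first
    abstained: bool              # system said "not enough evidence"
    leaked_sensitive: bool       # assumption: flagged during manual review


def top_k_hit(result, k=5):
    """True if any expected source appears in the first k retrieved items."""
    return any(src in result.expected_sources for src in result.retrieved[:k])


def evaluate(results, k=5, threshold=0.80):
    answerable = [r for r in results if r.expected_sources]
    unanswerable = [r for r in results if not r.expected_sources]

    hits = sum(top_k_hit(r, k) for r in answerable)
    accuracy = hits / len(answerable) if answerable else 0.0

    # Share of unanswerable questions the system answered anyway.
    confident_wrong = sum(not r.abstained for r in unanswerable)
    false_confidence = (
        confident_wrong / len(unanswerable) if unanswerable else 0.0
    )

    no_leaks = not any(r.leaked_sensitive for r in results)
    can_abstain = any(r.abstained for r in unanswerable)

    return {
        "top_k_accuracy": accuracy,
        "false_confidence_rate": false_confidence,
        "no_leaks": no_leaks,
        "can_abstain": can_abstain,
        "passed": accuracy >= threshold and no_leaks and can_abstain,
    }
```

Running `evaluate` once per candidate over the same question set gives comparable numbers. Latency, resource use and setup complexity still need to be recorded separately.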

Initial Recommendation

Start with Hermes `llm-wiki` because it is official and simple. Add QMD or gbrain only after benchmark notes exist. Do not run local embedding/model indexing on the current n8n VM until disk and memory headroom improve.

Use gbrain's design as inspiration immediately, but treat installation as a separate benchmarked decision.
