Memory Layer Evaluation
Principle
Do not choose the memory layer by vibes. Choose it by benchmark.
The live second brain should remain Markdown/Obsidian-compatible. Retrieval
indexes are rebuildable infrastructure, not the source of truth.
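One way to read "rebuildable" is that the index can be deleted and regenerated from the Markdown files with nothing lost. A minimal sketch of that property, assuming a hypothetical vault directory of `.md` notes and a plain in-memory keyword index; `build_index` is an illustrative name, not the indexing used by any candidate below:

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(vault: Path) -> dict[str, set[str]]:
    """Rebuild a keyword index from scratch using only the Markdown files."""
    index: dict[str, set[str]] = defaultdict(set)
    for note in sorted(vault.rglob("*.md")):
        text = note.read_text(encoding="utf-8")
        # Lowercased alphanumeric tokens; each maps to the notes containing it.
        for token in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[token].add(note.relative_to(vault).as_posix())
    return dict(index)
```

Because the function reads nothing except the notes, running it twice over an unchanged vault gives the same index, which is the property that lets the index be treated as disposable.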
The memory layer must also pass the memory-filter test: it should help agents
retrieve the right facts without encouraging bulk storage of raw transcripts,
Slack history, Google Drive creatives or stale working notes.
Candidate Layers
Hermes Built-In Memory
Use for:
- Short, reviewed user and agent preferences.
- Stable operating rules.
- Compact identity context.
Do not use for:
- Client history.
- Relationship intelligence.
- Sales commitments.
- Detailed SOPs.
Hermes llm-wiki
Use for:
- Raw source notes.
- Compiled entity, project, concept and decision pages.
- Provenance and confidence labels.
Strength:
- Official Hermes skill.
- Plain Markdown.
- Good fit for Obsidian.
Risk:
- Retrieval may remain too keyword-oriented for fuzzy founder recall.
QMD
Use for:
- Local hybrid retrieval over Markdown.
- Fuzzy recall.
- Ranked snippets.
Strength:
- Aligns with Rhys Fisher's quiet-search critique.
- Keeps Markdown as source of truth.
Risk:
- Requires disk/model resources.
- Needs evaluation on Easier's own benchmark questions.
gbrain
Use for:
- Candidate graph/RAG memory layer if it outperforms QMD and Hermes llm-wiki on the benchmark.
- Possible later daemon for hybrid search, graph traversal, synthesis, gap
analysis, skill packs and dream-cycle maintenance.
Strength:
- Mentioned in current Hermes practitioner guidance.
- The public README describes Markdown-backed brain repos, hybrid search,
graph links, MCP support, access-scoped company-brain use and recurring
maintenance.
Risk:
- Unknown fit for current VM.
- Should not be installed on the live n8n host until resource needs and data
boundaries are clear.
- Its agent install protocol explicitly asks the operator to confirm search
mode and cost posture; Easier should not silently accept expensive retrieval
defaults.
Benchmark Set
Create 40-60 questions from synthetic or approved notes.
Question classes:
- Exact source: "Where did this claim come from?"
- Fuzzy recall: "Who was worried about margin but liked creative testing?"
- Departmental: "What is the relationship process after onboarding?"
- Temporal: "What changed since the last weekly review?"
- Contradiction: "What evidence argues against this idea?"
- Decision: "What did we decide and why?"
- SOP: "What process applies here?"
- Safety: "Should the agent send this message?"
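The question set above could be kept as structured records so that size and class coverage are checked before any candidate is scored. A minimal sketch, assuming hypothetical names (`BenchmarkQuestion`, `validate_benchmark`) and class identifiers derived from the list above:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Identifiers for the eight question classes listed above (names are assumptions).
QUESTION_CLASSES = (
    "exact_source", "fuzzy_recall", "departmental", "temporal",
    "contradiction", "decision", "sop", "safety",
)


@dataclass(frozen=True)
class BenchmarkQuestion:
    question_class: str             # one of QUESTION_CLASSES
    text: str                       # the question put to the memory layer
    expected_source: Optional[str]  # note that should be retrieved; None if unanswerable


def validate_benchmark(questions: list[BenchmarkQuestion]) -> list[str]:
    """Return a list of problems; an empty list means the set looks usable."""
    problems = []
    if not 40 <= len(questions) <= 60:
        problems.append(f"expected 40-60 questions, got {len(questions)}")
    counts = Counter(q.question_class for q in questions)
    for unknown in sorted(set(counts) - set(QUESTION_CLASSES)):
        problems.append(f"unknown question class: {unknown}")
    for missing in (c for c in QUESTION_CLASSES if counts[c] == 0):
        problems.append(f"no questions for class: {missing}")
    return problems
```

Leaving `expected_source` empty for some questions gives the benchmark cases where the correct behaviour is to report that the notes do not contain an answer.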
Scoring
For each candidate:
- Top-5 retrieval accuracy.
- Citation quality.
- False confidence rate.
- Latency.
- Disk and memory use.
- Setup complexity.
- Ease of rebuild.
- Sensitivity leakage risk.
- Tendency to over-ingest raw artifacts.
- Quality of gap/staleness reporting.
Pass threshold:
- At least 80 percent of answerable benchmark questions have the right source in the top 5.
- No sensitive data appears in an inappropriate answer.
- The system can say "not enough evidence" when the notes do not support an answer.
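The pass threshold can be checked mechanically once each candidate's results are recorded. A rough sketch, assuming a hypothetical `Result` record per question; the leakage flag is set by a human reviewer, and treating "can say not enough evidence" as "abstains on every unanswerable question" is an interpretation, not something the threshold states:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    expected_source: Optional[str]  # None marks a question the notes cannot answer
    retrieved: list[str]            # ranked source paths returned by the candidate
    abstained: bool                 # True if the candidate said "not enough evidence"
    leaked_sensitive: bool          # judged by a reviewer, not computed here


def top5_accuracy(results: list[Result]) -> float:
    """Share of answerable questions whose expected source is in the top 5."""
    answerable = [r for r in results if r.expected_source is not None]
    if not answerable:
        return 0.0
    hits = sum(r.expected_source in r.retrieved[:5] for r in answerable)
    return hits / len(answerable)


def passes(results: list[Result], threshold: float = 0.80) -> bool:
    unanswerable = [r for r in results if r.expected_source is None]
    return (
        top5_accuracy(results) >= threshold
        and not any(r.leaked_sensitive for r in results)
        # Assumption: every unanswerable question must produce an abstention.
        and all(r.abstained for r in unanswerable)
    )
```

Only answerable questions enter the accuracy denominator, so a candidate is not penalised on retrieval for questions that have no correct source.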
Initial Recommendation
Start with Hermes llm-wiki because it is official and simple. Add QMD or
gbrain only after benchmark notes exist. Do not run local embedding/model
indexing on the current n8n VM until disk and memory headroom improve.
Use gbrain's design as inspiration immediately, but treat installation as a
separate benchmarked decision.