Grigoriy Dobryakov

Breakdown 03

Retrieval layer for book-as-context: why vector-only breaks

Three signals instead of one: BM25, semantic embeddings, and fuzzy title matching — built as an MCP server under the agentive book-as-context layer.

CTO · Head of AI · Architect · Tech Lead

Problem

When building book-as-context (an agent working with a corpus of books via MCP), the first instinct is vector-only RAG. It works on a demo corpus and fails predictably on real technical books.

Methodology

Retrieval is not written from scratch: it is a fork of obsidian-hybrid-search, embedded as an MCP server under the agentive book-as-context layer.

Three signals instead of one:

  1. BM25: lexical match. Exact terms, codes, file names.
  2. Semantic embeddings: conceptual proximity. Synonyms, paraphrases, related concepts.
  3. Fuzzy title / alias: navigation to a specific note or chapter by name.

Modes for query type: hybrid (default), semantic, fulltext, title.

Fusion via RRF (Reciprocal Rank Fusion): a note ranked highly in any of the agent's reformulated queries surfaces to the top.
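Reciprocal Rank Fusion is small enough to sketch. The version below is a generic illustration, not the fork's actual code: the function name `rrfFuse`, the constant k = 60 (a value commonly used with RRF), and the sample paths are all assumptions.

```typescript
// Generic Reciprocal Rank Fusion sketch (not the fork's implementation).
// Each input list is a ranking of note paths, best first.
function rrfFuse(rankings: string[][], k: number = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((path, index) => {
      // Ranks are 1-based, so the top position contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(path, (scores.get(path) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical rankings: one list per signal or per reformulated query.
const fused = rrfFuse([
  ["ch12.md", "ch03.md", "ch07.md"],
  ["ch07.md", "ch12.md"],
  ["ch12.md"],
]);
```

Because only ranks are summed, scores from different signals never need to share a scale, and a note that appears near the top of several lists accumulates the largest total.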

obsidian-search "event bus durability" --mode hybrid --limit 5
{
  "path": "tanenbaum/12-message-delivery.md",
  "score": 0.87,
  "matchedBy": ["semantic", "bm25"],
  "scores": { "semantic": 0.81, "bm25": 0.93, "fuzzy_title": null }
}

bm25 (0.93) > semantic (0.81): an exact lexical match pulled up a result that vector-only search would have ranked lower. Per-component scores are visible to the agent, so it can decide whether to trust a result instead of receiving a "black box top-k."
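One way an agent could act on per-component scores is sketched below. The `SearchResult` shape mirrors the JSON shown above; the `classifyMatch` function, the verdict labels, and the 0.5 threshold are hypothetical choices for illustration, not part of the tool.

```typescript
// Shape copied from the JSON result above; nothing else about the API is assumed.
interface SearchResult {
  path: string;
  score: number;
  matchedBy: string[];
  scores: { semantic: number | null; bm25: number | null; fuzzy_title: number | null };
}

type Verdict = "lexical-and-semantic" | "lexical-only" | "semantic-only" | "title-only" | "weak";

// Hypothetical policy: 0.5 is an illustrative threshold, not a tuned value.
function classifyMatch(result: SearchResult, threshold: number = 0.5): Verdict {
  const lexical = (result.scores.bm25 ?? 0) >= threshold;
  const semantic = (result.scores.semantic ?? 0) >= threshold;
  const title = (result.scores.fuzzy_title ?? 0) >= threshold;
  if (lexical && semantic) return "lexical-and-semantic";
  if (lexical) return "lexical-only";
  if (semantic) return "semantic-only";
  if (title) return "title-only";
  return "weak";
}

const sampleResult: SearchResult = {
  path: "tanenbaum/12-message-delivery.md",
  score: 0.87,
  matchedBy: ["semantic", "bm25"],
  scores: { semantic: 0.81, bm25: 0.93, fuzzy_title: null },
};
const sampleVerdict: string = classifyMatch(sampleResult);
```

A result backed by both signals can be read with more confidence than one that matched on embeddings alone, which is the kind of decision a single fused score hides.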

Agent layer on top of retrieval

Hybrid search solves ranking. Without a layer on top, you get blind RAG: the right chunks may be in the index, but queries arrive in task language rather than the author's language, and the two do not match. BM25 cannot fix that, because the query lacks the anchors the book was indexed with. For a single-book wiki and traversal from requirement to pattern, see book-as-context (step 5). What follows is how the same retrieval layer connects to a full book library and external sources.

Two corpora: obsidian-books vs single-book wiki

| | Full book vault (obsidian-books) | Single-book wiki (book-as-context) |
| --- | --- | --- |
| Contents | Hundreds of books: raw chunks, summaries, wiki notes | One book → olw concept pages, cross-links |
| Retrieval | Same hybrid-search (MCP) | Same MCP or IDE indexing |
| Use case | Marketing, strategy, cross-author lookup | Engineering lens on a project (Tanenbaum in Jira, API, infra) |
| Risk | Summaries rank above raw; for quotes, scope to raw chunks | Partial coverage, low-confidence drafts |

Book dossiers (books/)

One dossier file per important book: outline, author terms, translation synonyms, a query → chapter map, and RU/EN pairs. The agent reads the dossier before calling books_search instead of echoing the user's wording. Without a dossier, search still runs but returns noisier results: hybrid mode does not replace knowing the author's lexicon.
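A dossier can be pictured as a small typed record that the agent consults before searching. The sketch below is hypothetical: the field names, the `anchorsFor` helper, and the sample entries are assumptions derived from the list above, not the actual dossier file format.

```typescript
// Hypothetical dossier shape; field names are illustrative only.
interface BookDossier {
  book: string;
  outline: string[];
  authorTerms: string[];
  translationSynonyms: Record<string, string[]>; // author term -> synonyms
  queryToChapter: Record<string, string>;        // task phrasing -> chapter path
  ruEnPairs: Array<{ ru: string; en: string }>;
}

// Map a task phrased in the user's words to author terms and a chapter hint.
function anchorsFor(
  task: string,
  dossier: BookDossier,
): { terms: string[]; chapter: string | undefined } {
  const lower = task.toLowerCase();
  const chapterKey = Object.keys(dossier.queryToChapter).find((phrase) =>
    lower.includes(phrase.toLowerCase()),
  );
  const terms = dossier.authorTerms.filter((term) => {
    const synonyms = dossier.translationSynonyms[term] ?? [];
    return [term, ...synonyms].some((s) => lower.includes(s.toLowerCase()));
  });
  return { terms, chapter: chapterKey ? dossier.queryToChapter[chapterKey] : undefined };
}

// Sample data is made up for illustration.
const sampleDossier: BookDossier = {
  book: "tanenbaum",
  outline: ["12-message-delivery"],
  authorTerms: ["persistent communication", "reliable multicast"],
  translationSynonyms: { "persistent communication": ["durable messaging", "delivery guarantees"] },
  queryToChapter: { "delivery guarantees": "tanenbaum/12-message-delivery.md" },
  ruEnPairs: [{ ru: "очередь сообщений", en: "message queue" }],
};
const anchors = anchorsFor("What delivery guarantees does our event bus need?", sampleDossier);
```

The point of the lookup is that the search query is built from `anchors.terms`, the author's vocabulary, rather than from the task sentence the user typed.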

Fan-out and reformulation

Instead of one query, the agent sends 3–5 reformulations in queries[]: dossier anchors plus the plain-language task, translation synonyms, and the original term if results are weak. RRF is built for fan-out. After the top-k comes back, a books_read on 2–3 full notes is mandatory; snippets are not quoted blind.
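Building the queries[] array could look like the sketch below. Everything in it is an assumption for illustration: the `FanOutInput` shape, the `buildQueries` helper, the sample synonyms, and the choice to add the original term only on a retry. Only the idea of sending several reformulations at once comes from the text.

```typescript
// Hypothetical input for one fan-out search call.
interface FanOutInput {
  task: string;          // plain-language task
  anchors: string[];     // author terms taken from the dossier
  synonyms: string[];    // translation synonyms
  originalTerm?: string; // added on a retry when the first pass was weak
}

function buildQueries(input: FanOutInput, max: number = 5): string[] {
  const candidates: string[] = [
    ...input.anchors.map((anchor) => `${anchor} ${input.task}`), // author term + task
    ...input.synonyms,
    ...(input.originalTerm ? [input.originalTerm] : []),
  ];
  // De-duplicate case-insensitively, keep order, then cap the fan-out.
  const seen = new Set<string>();
  const unique = candidates.filter((query) => {
    const key = query.trim().toLowerCase();
    if (key === "" || seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  return unique.slice(0, max);
}

const firstPass = buildQueries({
  task: "how to sync orders",
  anchors: ["persistent vs transient communication"],
  synonyms: ["message queuing", "store-and-forward"],
});
const retryPass = buildQueries({
  task: "how to sync orders",
  anchors: ["persistent vs transient communication"],
  synonyms: ["message queuing", "store-and-forward"],
  originalTerm: "persistent communication",
});
```

The cap keeps the fan-out within the 3–5 range, and de-duplication avoids giving one phrasing a double vote in the fused ranking.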

| Weak query (task language) | Strong query (author term + plain language) |
| --- | --- |
| "find something about Kafka" | reliable multicast + FIFO ordering: delivery order and reliable broadcast |
| "how to sync orders" | persistent vs transient communication: queue holds vs receiver online now |
| "delivery guarantees" | AMQP: unsettled → settled → forgotten (three settlement states) |

Parallel web search (librarian)

When the vault does not cover the task, the agent runs two channels in parallel: obsidian-books with hybrid search (primary sources, dossiers) and the web (books not in the library, fresh context, criticism). Priority: Obsidian is the source of truth, the web is a supplement. This is typical for marketing and strategy work; for book-as-context on one engineering book, vault + wiki is often enough.
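The two-channel loop can be sketched as a parallel call followed by a priority merge. This is a minimal illustration under stated assumptions: the `Hit` shape, the `librarian` and `mergeByPriority` names, and de-duplication by title are all hypothetical, and both search channels are passed in as functions so that no real API is assumed.

```typescript
interface Hit {
  source: "vault" | "web";
  title: string;
  ref: string;
}

// Vault hits come first (source of truth); web hits only add titles the vault lacks.
function mergeByPriority(vault: Hit[], web: Hit[]): Hit[] {
  const known = new Set(vault.map((hit) => hit.title.toLowerCase()));
  return [...vault, ...web.filter((hit) => !known.has(hit.title.toLowerCase()))];
}

// Run both channels in parallel; a failing channel degrades to an empty list.
async function librarian(
  query: string,
  searchVault: (q: string) => Promise<Hit[]>,
  searchWeb: (q: string) => Promise<Hit[]>,
): Promise<Hit[]> {
  const empty: Hit[] = [];
  const [vault, web] = await Promise.all([
    searchVault(query).catch(() => empty),
    searchWeb(query).catch(() => empty),
  ]);
  return mergeByPriority(vault, web);
}

// Sample data is made up for illustration.
const merged = mergeByPriority(
  [{ source: "vault", title: "Distributed Systems", ref: "tanenbaum/12-message-delivery.md" }],
  [
    { source: "web", title: "distributed systems", ref: "https://example.com/review" },
    { source: "web", title: "Critique of AMQP", ref: "https://example.com/critique" },
  ],
);
```

Running the channels concurrently keeps latency close to the slower of the two, while the merge order encodes the stated priority of vault over web.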

Artifact

Fork: github.com/dobryakov/obsidian-hybrid-search (TypeScript), a CLI + MCP server that provides retrieval under obsidian-books and the book-as-context wiki. On top of retrieval: books/ dossiers, fan-out + books_read discipline, and the librarian loop (vault + web). Without that discipline, hybrid-search stays "correct top-k on the wrong query."

Series signature

Where it breaks

For whom and why

This breakdown isn't about how to write hybrid search. It's about an architectural decision: take an existing fork instead of reinventing the wheel, understand the mechanics deeply enough, and embed it as a layer. It resonates with Heads of AI and CTOs building AI-assisted systems on top of their own knowledge corpora: books, wikis, docs.
