Retrieval Layer for Book-as-context: Why Vector-only Breaks

Problem

When building book-as-context (an agent working with a corpus of books via MCP), the first instinct is vector-only RAG. It works on a demo corpus and fails predictably on real technical books:

• Exact terms, codes, and chapter names get lost: embeddings blur lexical matches
• A navigation query ("find the chapter on two-phase commit in Tanenbaum") and an exploration query ("what does the book say about durability") require different mechanics
• A section added to the index yesterday loses to "roughly similar" content from older notes

Methodology

Rather than writing retrieval from scratch, the layer uses a fork of obsidian-hybrid-search, embedded as an MCP server under the agentic book-as-context layer.

Three signals instead of one:

1. BM25 — lexical match. Exact terms, codes, file names.
2. Semantic embeddings — conceptual proximity. Synonyms, paraphrases, related concepts.
3. Fuzzy title / alias — navigation to a specific note or chapter by name.

Modes for query type: hybrid (default), semantic, fulltext, title.
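
One way an agent could choose between these modes is a small router in front of the search call. The sketch below is hypothetical: the four mode names come from the tool, while the `pickMode` function and its regex heuristics are assumptions made for illustration, not part of the fork.

```typescript
// The mode names match the tool; the routing rules are illustrative assumptions.
type SearchMode = "hybrid" | "semantic" | "fulltext" | "title";

function pickMode(query: string): SearchMode {
  const q = query.trim();
  // Navigation: the user names a chapter, section, note, or file.
  if (/\b(chapter|section|note|file)\b/i.test(q)) return "title";
  // Quoted phrases or codes like ERR-42: lexical match matters most.
  if (/"[^"]+"/.test(q) || /\b[A-Z]{2,}-?\d+\b/.test(q)) return "fulltext";
  // Open questions about a concept: lean on embeddings.
  if (/^(what|why|how)\b/i.test(q)) return "semantic";
  // Everything else goes to the default.
  return "hybrid";
}
```

With these rules, the navigation query from the Problem section routes to title and the exploration query routes to semantic; anything ambiguous stays on hybrid.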

Fusion via RRF (reciprocal rank fusion): a note ranked highly in any of the agent's reformulated queries surfaces to the top.
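
The fusion step can be sketched in a few lines. This is textbook reciprocal rank fusion over ranked lists of note paths, not the fork's actual code: the `rrfFuse` name and the constant k = 60 (a commonly used default) are assumptions. Raw RRF scores are small fractions, so the fork evidently normalizes or scores differently to produce values like the 0.87 in the sample output.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per note.
// k = 60 is a conventional default; the fork's real constant is not known here.
function rrfFuse(rankings: string[][], k = 60): Array<{ path: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((path, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(path, (scores.get(path) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([path, score]) => ({ path, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because contributions add up across lists, a note that appears near the top of several reformulated queries beats one that tops a single list.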

obsidian-search "event bus durability" --mode hybrid --limit 5

{
  "path": "tanenbaum/12-message-delivery.md",
  "score": 0.87,
  "matchedBy": ["semantic", "bm25"],
  "scores": { "semantic": 0.81, "bm25": 0.93, "fuzzy_title": null }
}

bm25 (0.93) > semantic (0.81): the exact lexical match pulled up a result that vector-only search would have ranked lower. Per-component scores are visible to the agent, so it can decide whether to trust the result instead of receiving a black-box top-k.
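
A minimal sketch of how an agent might act on those per-component scores. The `SearchHit` shape mirrors the JSON above; the `assess` function, the verdict labels, and the 0.75 threshold are hypothetical and would need calibrating on a real corpus.

```typescript
// Shape taken from the sample output above.
interface SearchHit {
  path: string;
  score: number;
  matchedBy: string[];
  scores: { semantic: number | null; bm25: number | null; fuzzy_title: number | null };
}

type Verdict = "read-first" | "verify-wording" | "low-confidence";

// Hypothetical policy: agreement between a lexical and a semantic signal earns more trust.
function assess(hit: SearchHit, strong = 0.75): Verdict {
  const lexical =
    (hit.scores.bm25 ?? 0) >= strong || (hit.scores.fuzzy_title ?? 0) >= strong;
  const semantic = (hit.scores.semantic ?? 0) >= strong;
  if (lexical && semantic) return "read-first";
  if (lexical || semantic) return "verify-wording";
  return "low-confidence";
}
```

Under this policy the Tanenbaum hit above, with strong bm25 and semantic scores, would be read first, while a semantic-only match would be flagged for a wording check.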

Agent layer on top of retrieval

Hybrid search solves ranking. Without a layer on top you still get blind RAG: the right chunks may be in the index, but queries arrive in task language rather than the author's language, and the two do not match. BM25 cannot fix that, because the query lacks the anchors the book was indexed with. For a single-book wiki and traversal from requirement to pattern, see book-as-context (step 5). What follows is how the same retrieval layer connects to a full book library and external sources.

Two corpora: obsidian-books vs single-book wiki

	Full book vault (obsidian-books)	Single-book wiki (book-as-context)
Contents	Hundreds of books: raw chunks, summaries, wiki notes	One book → olw concept pages, cross-links
Retrieval	Same hybrid-search (MCP)	Same MCP or IDE indexing
Use case	Marketing, strategy, cross-author lookup	Engineering lens on a project (Tanenbaum in Jira, API, infra)
Risk	Summaries rank above raw — for quotes, scope to raw chunks	Partial coverage, low-confidence drafts

Book dossiers (`books/`)

One dossier file per important book: outline, author terms, translation synonyms, query → chapter map, RU/EN pairs. The agent reads the dossier before calling books_search instead of echoing the user's wording. Without a dossier, search still runs but returns noisier results: hybrid mode does not replace knowing the author's lexicon.

Fan-out and reformulation

Instead of one query, the agent sends 3–5 reformulations in queries[]: dossier anchors plus the plain-language task, translation synonyms, and the original term if results are weak. RRF is built for fan-out. After the top-k comes a mandatory books_read on 2–3 full notes, rather than quoting snippets blind.

Weak query (task language)	Strong query (author term + plain language)
"find something about Kafka"	reliable multicast + FIFO ordering — delivery order and reliable broadcast
"how to sync orders"	persistent vs transient communication — queue holds vs receiver online now
"delivery guarantees"	AMQP: unsettled → settled → forgotten — three settlement states
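
The move from a weak to a strong query can be sketched as a small builder that expands the task with dossier anchors. Everything here is illustrative: the `Dossier` shape, its field names, and `buildQueries` are assumptions, not the fork's schema or API, and the sample entries in the usage note are made up for the example.

```typescript
// Hypothetical dossier shape; field names are illustrative only.
interface Dossier {
  book: string;
  anchors: Record<string, string[]>;  // task keyword -> author terms
  synonyms: Record<string, string[]>; // author term -> translation synonyms
}

// Builds up to `max` reformulations for a queries[] fan-out.
function buildQueries(task: string, dossier: Dossier, max = 5): string[] {
  const queries: string[] = [];
  const lower = task.toLowerCase();
  for (const keyword of Object.keys(dossier.anchors)) {
    if (!lower.includes(keyword.toLowerCase())) continue;
    for (const term of dossier.anchors[keyword]) {
      queries.push(`${term} ${task}`); // author term + plain-language task
      for (const syn of dossier.synonyms[term] ?? []) queries.push(syn);
    }
  }
  queries.push(task); // raw task wording as a fallback
  return Array.from(new Set(queries)).slice(0, max); // dedupe, then cap
}
```

Given a dossier that maps "kafka" to "reliable multicast" and "FIFO ordering", the task "find something about Kafka" expands into queries that carry the author's terms alongside the original wording.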

Parallel web search (librarian)

When the vault does not cover the task, two channels run in parallel: obsidian-books + hybrid (primary sources, dossiers) and the web (book not in library, fresh context, criticism). Priority: Obsidian as source of truth, web as supplement. Typical for marketing and strategy work; for book-as-context on one engineering book, vault + wiki is often enough.
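
A minimal sketch of the two-channel loop, assuming each channel is an async function that returns findings. The `Finding` and `Channel` types and the `librarian` function are hypothetical names; the real channels would wrap the MCP search tool and a web search tool.

```typescript
interface Finding {
  source: "vault" | "web";
  ref: string;  // note path or URL
  text: string;
}

// A channel is any async search function; the concrete implementations are assumed.
type Channel = (query: string) => Promise<Finding[]>;

async function librarian(query: string, vault: Channel, web: Channel): Promise<Finding[]> {
  // A failed channel yields an empty list instead of failing the whole lookup.
  const safe = (p: Promise<Finding[]>) => p.catch((): Finding[] => []);
  // Both channels run concurrently.
  const [fromVault, fromWeb] = await Promise.all([safe(vault(query)), safe(web(query))]);
  // Vault results come first: source of truth, with the web as supplement.
  return [...fromVault, ...fromWeb];
}
```

Ordering the merged list vault-first is the simplest way to encode the priority; a stricter variant could drop web findings whenever the vault returns enough.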

Artifact

Fork: github.com/dobryakov/obsidian-hybrid-search (TypeScript), a CLI plus MCP server that provides retrieval under obsidian-books and the book-as-context wiki. On top of retrieval: books/ dossiers, fan-out + books_read discipline, and the librarian loop (vault + web). Without that discipline, hybrid-search stays "correct top-k on the wrong query."

Series signature

Where it breaks

• Fusion weights are not universal. The BM25/semantic balance depends on the corpus; code and technical prose need different calibration.
• Reranking costs latency. Worth it when the agent needs precise ordering; otherwise it is just extra delay.
• The fork is not maintained upstream. Any changes to the index schema are on you. This is a deliberate choice: understand the tool deeply enough to maintain it.
• Query only in task language. Even perfect hybrid search will not find reliable multicast if the query only says "Kafka". You need dossiers and fan-out: see the table under Fan-out and reformulation, and book-as-context.
• Summary beats raw chunk. In the vault, MOC/summary notes often rank above OCR fragments; for verbatim quotes, scope to raw or follow links from the summary note.
• Fan-out without read. Five reformulations without books_read produce many similar snippets and little verifiable text.

For whom and why

This breakdown isn't about how to write hybrid search; it's about an architectural decision: take an existing fork instead of reinventing the wheel, understand the mechanics deeply enough, and embed it as a layer. It should resonate with Heads of AI and CTOs building AI-assisted systems on top of their own knowledge corpora: books, wikis, docs.
