Retrieval layer for book-as-context: why vector-only breaks
Three signals instead of one: BM25, semantic embeddings, and fuzzy title matching — built as an MCP server under the agentive book-as-context layer.
Problem
When building book-as-context (an agent working with a corpus of books via MCP), the first instinct is vector-only RAG. It works on a demo corpus and fails predictably on real technical books:
- Exact terms, codes, and chapter names get lost: embeddings blur lexical matches.
- A navigation query ("find the chapter on two-phase commit in Tanenbaum") and an exploration query ("what does the book say about durability") require different mechanics.
- A section added to the index yesterday loses to "roughly similar" content from older notes.
Methodology
Instead of writing retrieval from scratch, the layer uses a fork of obsidian-hybrid-search, embedded as an MCP server under the agentive book-as-context layer.
Three signals instead of one:
1. BM25: lexical match. Exact terms, codes, file names.
2. Semantic embeddings: conceptual proximity. Synonyms, paraphrases, related concepts.
3. Fuzzy title / alias: navigation to a specific note or chapter by name.
Modes by query type: hybrid (default), semantic, fulltext, title.
Fusion via RRF: a note ranked highly in any of the agent's reformulated queries surfaces to the top.
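A minimal sketch of RRF over fan-out results, assuming each reformulated query returns an ordered list of note paths. The name `rrfFuse` and the constant `k = 60` (a value commonly used with RRF) are illustrative assumptions, not the fork's actual code or setting.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per note.
// Hypothetical helper; k = 60 is a conventional default, not a value taken from the fork.
interface FusedHit {
  path: string;
  score: number;
}

function rrfFuse(rankedLists: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((path, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(path, (scores.get(path) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([path, score]) => ({ path, score }))
    .sort((a, b) => b.score - a.score);
}

// Three reformulations of one task; paths other than the Tanenbaum note are made up.
const fused = rrfFuse([
  ["tanenbaum/12-message-delivery.md", "notes/kafka-overview.md"],
  ["notes/queues.md", "tanenbaum/12-message-delivery.md"],
  ["tanenbaum/12-message-delivery.md", "notes/queues.md"],
]);
console.log(fused.map((hit) => hit.path));
```

A note that appears in several lists accumulates score from each of them, so the Tanenbaum chapter ends up first even though one reformulation ranked it second.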
```
obsidian-search "event bus durability" --mode hybrid --limit 5
{
  "path": "tanenbaum/12-message-delivery.md",
  "score": 0.87,
  "matchedBy": ["semantic", "bm25"],
  "scores": { "semantic": 0.81, "bm25": 0.93, "fuzzy_title": null }
}
```
Here bm25 (0.93) > semantic (0.81): the exact lexical match pulled up a result that vector-only would have ranked lower. Per-component scores are visible to the agent, so it can decide whether to trust the result instead of receiving a black-box top-k.
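One way an agent-side check could use these per-component scores. The `SearchResult` shape mirrors the JSON result shown earlier; the verdict labels, the rule itself, and the use of `fuzzy_title` as a `matchedBy` label are assumptions made for illustration.

```typescript
type Signal = "semantic" | "bm25" | "fuzzy_title";

// Mirrors the JSON result of the CLI example.
interface SearchResult {
  path: string;
  score: number;
  matchedBy: Signal[];
  scores: Record<Signal, number | null>;
}

// Hypothetical verdicts: which signals agreed on this note.
type Verdict = "corroborated" | "lexical-only" | "semantic-only" | "title-only" | "unmatched";

function assess(result: SearchResult): Verdict {
  const hit = (signal: Signal): boolean => result.matchedBy.indexOf(signal) !== -1;
  if (hit("bm25") && hit("semantic")) return "corroborated";
  if (hit("bm25")) return "lexical-only";
  if (hit("semantic")) return "semantic-only";
  if (hit("fuzzy_title")) return "title-only";
  return "unmatched";
}

const example: SearchResult = {
  path: "tanenbaum/12-message-delivery.md",
  score: 0.87,
  matchedBy: ["semantic", "bm25"],
  scores: { semantic: 0.81, bm25: 0.93, fuzzy_title: null },
};
console.log(assess(example)); // "corroborated"
```

A "semantic-only" verdict is the case where the agent might want to read the full note before quoting it, since no exact term confirmed the match.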
Agent layer on top of retrieval
Hybrid search solves ranking. Without a layer on top you get blind RAG: the right chunks may be in the index, but queries arrive in task language rather than author language, and the two do not match. BM25 cannot fix that, because the query lacks the anchors the book was indexed with. For a single-book wiki and traversal from requirement to pattern, see book-as-context (step 5). The rest of this section shows how the same retrieval layer connects to a full book library and external sources.
Two corpora: obsidian-books vs single-book wiki
| | Full book vault (obsidian-books) | Single-book wiki (book-as-context) |
|---|---|---|
| Contents | Hundreds of books: raw chunks, summaries, wiki notes | One book → olw concept pages, cross-links |
| Retrieval | Same hybrid-search (MCP) | Same MCP or IDE indexing |
| Use case | Marketing, strategy, cross-author lookup | Engineering lens on a project (Tanenbaum in Jira, API, infra) |
| Risk | Summaries rank above raw — for quotes, scope to raw chunks | Partial coverage, low-confidence drafts |
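The quote-scoping rule in the Risk row can be sketched as a post-filter over results. The `kind` field and the intent names are hypothetical; the fork may expose scoping differently (for example by path or tag).

```typescript
// Hypothetical result shape: `kind` follows the vault contents listed in the table.
type NoteKind = "raw" | "summary" | "wiki";
type Intent = "quote" | "overview";

interface VaultHit {
  path: string;
  kind: NoteKind;
  score: number;
}

// For verbatim quotes keep only raw chunks; otherwise keep the ranking as returned.
function scopeHits(hits: VaultHit[], intent: Intent): VaultHit[] {
  return intent === "quote" ? hits.filter((hit) => hit.kind === "raw") : hits;
}

// Made-up paths for illustration.
const ranked: VaultHit[] = [
  { path: "summaries/tanenbaum-overview.md", kind: "summary", score: 0.91 },
  { path: "raw/tanenbaum/ch12-chunk-04.md", kind: "raw", score: 0.84 },
];
console.log(scopeHits(ranked, "quote").map((hit) => hit.path));
```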
Book dossiers (books/)
One dossier file per important book: outline, author terms, translation synonyms, query → chapter map, RU/EN pairs. The agent reads the dossier before calling books_search, instead of echoing the user's wording.
Without a dossier, search still runs, but the results are noisier: hybrid mode does not replace knowing the author's lexicon.
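A dossier could be modelled roughly like this. The field names, the sample values, and the `chaptersForTask` helper are illustrative assumptions, not the fork's schema; only the list of dossier parts comes from the description above.

```typescript
// Hypothetical dossier shape; field names are illustrative.
interface BookDossier {
  book: string;
  outline: string[]; // chapter titles in order
  authorTerms: string[]; // the author's own vocabulary
  translationSynonyms: Record<string, string[]>; // author term -> alternative wordings
  queryToChapter: Record<string, string>; // typical task phrase -> chapter path
  ruEnPairs: Array<{ ru: string; en: string }>;
}

// Return chapters whose mapped phrase occurs in the task text (case-insensitive).
function chaptersForTask(task: string, dossier: BookDossier): string[] {
  const text = task.toLowerCase();
  return Object.keys(dossier.queryToChapter)
    .filter((phrase) => text.indexOf(phrase.toLowerCase()) !== -1)
    .map((phrase) => dossier.queryToChapter[phrase]);
}

// Sample values are made up for the sketch.
const tanenbaum: BookDossier = {
  book: "Tanenbaum",
  outline: ["12 Message delivery"],
  authorTerms: ["reliable multicast", "persistent communication"],
  translationSynonyms: { "reliable multicast": ["reliable broadcast"] },
  queryToChapter: { "delivery guarantees": "tanenbaum/12-message-delivery.md" },
  ruEnPairs: [{ ru: "долговечность", en: "durability" }],
};
console.log(chaptersForTask("Which delivery guarantees do we need?", tanenbaum));
```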
Fan-out and reformulation
Instead of one query, send 3–5 reformulations in queries[]: dossier anchors plus the plain-language task, translation synonyms, and the original term if results are weak. RRF is built for fan-out.
After top-k, a books_read on 2–3 full notes is mandatory; no blind snippet quotes.
| Weak query (task language) | Strong query (author term + plain language) |
|---|---|
| "find something about Kafka" | reliable multicast + FIFO ordering — delivery order and reliable broadcast |
| "how to sync orders" | persistent vs transient communication — queue holds vs receiver online now |
| "delivery guarantees" | AMQP: unsettled → settled → forgotten — three settlement states |
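The fan-out step might look like the sketch below. `buildQueries`, `pickForRead`, and the `FanOutInput` fields are hypothetical names; only the ingredients (dossier anchors, plain-language task, translation synonyms, original term as a fallback) and the 3–5 and 2–3 limits come from the text.

```typescript
// Hypothetical input: everything the agent took from the dossier plus the task itself.
interface FanOutInput {
  task: string; // plain-language task
  authorTerms: string[]; // dossier anchors
  synonyms: string[]; // translation synonyms
  originalTerm?: string; // original-language term, used only when results are weak
}

// Build up to `max` distinct reformulations for queries[].
function buildQueries(input: FanOutInput, weakResults = false, max = 5): string[] {
  const anchored = input.authorTerms.map((term) => `${term} ${input.task}`);
  // The fallback goes first so the cap never drops it.
  const fallback: string[] = weakResults && input.originalTerm ? [input.originalTerm] : [];
  const candidates = fallback.concat(anchored, input.synonyms);
  const unique = candidates.filter((query, index) => candidates.indexOf(query) === index);
  return unique.slice(0, max);
}

// After fusion, pick the top notes to read in full instead of quoting snippets.
function pickForRead(rankedPaths: string[], count = 3): string[] {
  return rankedPaths.slice(0, count);
}

const queries = buildQueries({
  task: "delivery order and reliable broadcast",
  authorTerms: ["reliable multicast", "FIFO ordering"],
  synonyms: ["reliable broadcast"],
});
console.log(queries);
```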
Parallel web search (librarian)
When the vault does not cover the task, two channels run in parallel: obsidian-books + hybrid (primary sources, dossiers) and the web (book not in library, fresh context, criticism). Priority: Obsidian as the source of truth, the web as a supplement. This is typical for marketing and strategy work; for book-as-context on one engineering book, vault + wiki is often enough.
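A sketch of the two-channel loop, assuming each channel is an async function that returns hits. The `Channel` type, the `librarian` function, and the stub channels are hypothetical; real channels would call the MCP server and whatever web search tool the agent has.

```typescript
interface Hit {
  source: "vault" | "web";
  ref: string;
  snippet: string;
}
type Channel = (query: string) => Promise<Hit[]>;

// Query both channels in parallel; a failing channel yields no hits instead of failing the call.
async function librarian(query: string, vault: Channel, web: Channel): Promise<Hit[]> {
  const [vaultHits, webHits] = await Promise.all([
    vault(query).catch((): Hit[] => []),
    web(query).catch((): Hit[] => []),
  ]);
  // Vault first (source of truth); web hits are appended as a supplement.
  return vaultHits.concat(webHits);
}

// Stub channels for illustration only.
const stubVault: Channel = async (query) => [
  { source: "vault", ref: "tanenbaum/12-message-delivery.md", snippet: `vault match for "${query}"` },
];
const stubWeb: Channel = async (query) => [
  { source: "web", ref: "https://example.com/review", snippet: `web match for "${query}"` },
];

librarian("event bus durability", stubVault, stubWeb).then((hits) => {
  console.log(hits.map((hit) => `${hit.source}: ${hit.ref}`));
});
```

Putting vault hits first is the simplest way to encode the priority rule; a stricter variant could drop web hits entirely whenever the vault returns enough results.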
Artifact
Fork: github.com/dobryakov/obsidian-hybrid-search (TypeScript). CLI + MCP server: the retrieval layer under obsidian-books and the book-as-context wiki.
On top of retrieval: books/ dossiers, fan-out + books_read discipline, and the librarian loop (vault + web). Without that discipline, hybrid-search stays "correct top-k on the wrong query."
Where it breaks
- Fusion weights are not universal. The BM25/semantic balance depends on the corpus; code and technical prose need different calibration.
- Reranking costs latency. Worth it when the agent needs precise ordering; otherwise just extra delay.
- The fork is not maintained upstream. Any changes to the index schema are on you. This is a deliberate choice: understand the tool deeply enough to maintain it.
- Query only in task language. Even perfect hybrid search will not find reliable multicast if the query only says "Kafka". You need dossiers and fan-out: see the table under Fan-out and reformulation, and book-as-context.
- Summary beats raw chunk. In the vault, MOC/summary notes often rank above OCR fragments; for verbatim quotes, scope to raw or follow links from the summary note.
- Fan-out without read. Five reformulations without books_read yield many similar snippets and little verifiable text.
For whom and why
This breakdown isn't about how to write hybrid search. It's about an architectural decision: take an existing fork instead of reinventing the wheel, understand the mechanics deeply enough, and embed it as a layer. It should resonate with Heads of AI and CTOs building AI-assisted systems on top of their own knowledge corpora: books, wikis, docs.