Build a Self-Hosted Knowledge Base with Plain Text and LLMs
New to self-hosting AI? The Self-Hosted AI: Start Here hub walks the hardware-decision tree, inference-engine choice, and the operational gotchas that bite hardest in the first three months. Read it before or after this one, whichever fits your stage.
Quick Take
- Build a private, searchable knowledge base from Markdown files without vector databases
- Tag documents by hand in frontmatter, or with a local-LLM pass (see the update note on the retired auto-tagger)
- Query via CLI or over MCP (the agent wrapper shipped after this guide was first written)
- Index updates in seconds with --no-tag, full re-tag run on a weekly cron
Update 2026-06-15: Two things below have moved on. (1) The Knowledge MCP is now shipped, not roadmap: agents query this index over MCP, and the corpus now also carries the grid’s canonical facts (GRID-FACTS.md) and a set of ops playbooks, so the local model can answer grid questions without a cloud hop. (2) The Mistral auto-tagger was retired when the SGLang engine on :30000 was shut down (the production model is now a vision-capable Qwen on vLLM). Tags are curated by hand in frontmatter now. The trade-off: less automation, but no broken dependency on a second engine and no bag-of-buzzwords drift. The plain-text-plus-JSON core described below is unchanged and is the part worth copying.
Update 2026-06-19: The retriever described under “Query from the CLI” was upgraded from the naive title/tag/summary scorer to full-body Okapi BM25 (still pure standard library, still no vector store, about 20 MB and 0.2 ms per query). And the “no vectors required” claim above is no longer a hunch: I benchmarked keyword scoring, BM25, dense embeddings, hybrid fusion, and reranking against this corpus, and BM25 won or tied at zero memory cost while the vector stack bought nothing. The full path, including the two benchmarks I accidentally rigged before I got an honest one, is in I Rigged My Own RAG Benchmark.
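For readers who have not met Okapi BM25, here is a minimal standard-library sketch of the scoring. It is not the code behind query.py: the tokenizer, the class name BM25, the k1 and b defaults, and the non-negative IDF variant are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Hypothetical in-memory BM25 scorer over {doc_id: body_text}."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        # Term frequencies and lengths per document.
        self.tfs = [Counter(tokenize(docs[i])) for i in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lengths) / max(len(self.lengths), 1)
        # Document frequency: in how many documents each term appears.
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())
        n = len(self.ids)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {
            t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()
        }

    def search(self, query, limit=5):
        terms = tokenize(query)
        scored = []
        for doc_id, tf, length in zip(self.ids, self.tfs, self.lengths):
            score = 0.0
            for t in terms:
                f = tf.get(t, 0)
                if f == 0:
                    continue
                # Length normalisation: long documents need more hits.
                norm = f + self.k1 * (1 - self.b + self.b * length / self.avgdl)
                score += self.idf[t] * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, doc_id))
        # Highest score first; ties broken by doc id for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(doc_id, score) for score, doc_id in scored[:limit]]
```

The whole index is a handful of dicts and counters, which is why this approach needs no vector store and no embedding model.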
The setup is intentionally boring: a folder of Markdown files under /data/projects/, a small Python indexer that walks them and writes a single JSON file, and a CLI query tool. No vector store. No embeddings. No vendor lock-in. The LLM only touches the index when I ask it to auto-tag new files. The rest of the time the index is plain JSON and the search runs in milliseconds against it.
Start with the Indexer
python3 /data/scripts/knowledge/index.py --no-tag
This command scans all *.md files in your configured sources (/data/projects/blog/, /data/projects/docs/, /data/projects/podcast-studio/docs/), reads their frontmatter, and builds a JSON index at /data/knowledge-index.json. The --no-tag flag skips Mistral’s auto-tagging step, which is useful when you’re iterating quickly and don’t need AI-generated tags yet. On 172 Markdown files the index pass without tagging takes about two seconds on the DGX Spark.
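The --no-tag pass can be pictured as something like the sketch below. It is an assumption-laden illustration, not the real index.py: the entry fields (path, title, tags, summary), the function names, and the hand-rolled frontmatter parser are hypothetical, and a real indexer would more likely lean on a YAML library.

```python
import json
from pathlib import Path


def parse_frontmatter(text):
    """Parse simple `key: value` pairs from a leading --- block.

    Deliberately minimal: scalars and inline lists like [a, b] only.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip().strip("'\"") for v in value[1:-1].split(",") if v.strip()]
        else:
            value = value.strip("'\"")
        meta[key.strip()] = value
    return {}  # opening delimiter never closed: treat as no frontmatter


def build_index(roots, out_path):
    """Walk every *.md file under the given roots and write one JSON file."""
    entries = []
    for root in roots:
        for path in sorted(Path(root).rglob("*.md")):
            text = path.read_text(encoding="utf-8", errors="replace")
            meta = parse_frontmatter(text)
            tags = meta.get("tags") or []
            if isinstance(tags, str):
                tags = [tags]
            entries.append({
                "path": str(path),
                "title": meta.get("title") or path.stem,  # filename fallback
                "tags": tags,
                "summary": meta.get("summary", ""),
            })
    Path(out_path).write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries
```

Nothing here calls a model, which is consistent with the no-tag pass finishing in seconds: the work is file reads plus one JSON write.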
Why Auto-Tagging Matters
python3 /data/scripts/knowledge/index.py
When you drop the flag, the indexer calls Mistral Small 4 via SGLang to generate tags for untagged documents. It writes those tags back into the frontmatter of each Markdown file, so your tags persist across index rebuilds. Every new document gets categorized without manual effort, and your tag-based queries stay consistent. The example output for CLAUDE.md from a real run: ['sovereign_ai', 'arm64_hardware', 'nvidia_gb10', 'mcp_services', 'tor_privacy', 'docker_arm64', 'llm_deployment']. That is what a “useful tag” looks like; a bag of buzzwords is what to avoid, and Mistral mostly stays on the right side of that line for technical content.
The auto-tagging pass is the slow part. The full run over 172 files takes long enough that I gate it behind a weekly cron; iterative work uses the --no-tag path and re-tags only on schedule.
Query from the CLI
python3 /data/scripts/knowledge/query.py "voxtral tts" --limit 5
The query tool searches the knowledge base using a simple scoring system: title hits score highest, followed by tags, paths, and summaries. The --limit 5 flag caps results. Use --json if you want machine-readable output to pipe into another tool. The CLI returns titles, tags, and short summaries; that is enough signal to decide whether to open a file.
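A field-weighted scorer of the kind described above might look like the following sketch. Only the ordering (title, then tags, then paths, then summaries) comes from this guide; the numeric weights, the substring matching, the function names, and the entry layout are assumptions.

```python
# Assumed weights; only the relative order is taken from the text.
DEFAULT_WEIGHTS = {"title": 4, "tags": 3, "path": 2, "summary": 1}


def score_entry(entry, terms, weights=DEFAULT_WEIGHTS):
    """Add a field's weight once for every query term found in that field."""
    fields = {
        "title": entry.get("title", "").lower(),
        "tags": " ".join(entry.get("tags", [])).lower(),
        "path": entry.get("path", "").lower(),
        "summary": entry.get("summary", "").lower(),
    }
    return sum(
        weights[name]
        for name, text in fields.items()
        for term in terms
        if term in text
    )


def search(index, query, limit=5):
    """Return up to `limit` entries with a non-zero score, best first."""
    terms = query.lower().split()
    scored = [(score_entry(entry, terms), entry) for entry in index]
    scored = [(s, e) for s, e in scored if s > 0]
    # Sort by score descending, then by path so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1].get("path", "")))
    return [entry for _, entry in scored[:limit]]
```

A scorer like this never reads the document body, which is its main blind spot: a file that mentions a term only in its text cannot match.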
Agent integration via MCP, the honest status
Status as of the 2026-06-15 update: the Knowledge MCP shipped. When this guide was first written it was still on the roadmap, and the honest thing is to say so rather than backdate it. A local FastMCP server now wraps the index and exposes it to agents (the local Qwen, opencode, Claude Code), so they query the knowledge base directly instead of shelling out. The separate Sovereign AI Blog MCP at https://mcp.sovgrid.org/self-hosted-ai still exposes search_blog, list_tags, and get_article over the published-blog corpus; the Knowledge MCP covers the broader corpus under /data/projects/ and /data/scripts/ (including GRID-FACTS.md and the ops playbooks).
It is a FastMCP server (consistent with the rest of the Sovereign AI Grid), not the legacy mcp.server.Server SDK that earlier MCP examples on the web still show. If you are building one yourself, start from the FastMCP docs and the Sovereign AI Blog MCP source as the closer reference; do not copy the legacy-SDK skeletons that were widely shared in late 2024.
The shell-tool path still works as a fallback and is worth knowing for agents without MCP wiring: most coding agents can run python3 /data/scripts/knowledge/query.py "voxtral tts" --json directly and parse the JSON, which is functionally close to the MCP tool with one shell hop of latency.
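From the agent side, the shell fallback can be wrapped in a few lines. This is a sketch under assumptions: the helper name query_knowledge is made up, combining --limit with --json in one call is assumed to work, and the JSON is returned as-is because its exact shape is not documented here.

```python
import json
import subprocess

# Command prefix taken from the examples in this guide.
QUERY_CMD = ["python3", "/data/scripts/knowledge/query.py"]


def query_knowledge(query, limit=5, cmd=QUERY_CMD, timeout=30):
    """Run the query CLI with --json and return the parsed output.

    Raises subprocess.CalledProcessError on a non-zero exit code and
    subprocess.TimeoutExpired if the tool does not finish in time.
    """
    proc = subprocess.run(
        [*cmd, query, "--limit", str(limit), "--json"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout)
```

Passing the arguments as a list rather than a shell string keeps a query such as "voxtral tts" as a single argument without any quoting.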
Keep the Index Fresh
# After adding new docs
python3 /data/scripts/knowledge/index.py --no-tag
# Weekly full rebuild with auto-tagging
python3 /data/scripts/knowledge/index.py
The --no-tag version runs in under two seconds on 172 files; the full rebuild with Mistral Small 4 auto-tagging is slow enough that it makes sense as a weekly scheduled job rather than a per-edit hook. The systemd timer fires Sunday morning so Monday starts with a clean index and fresh tags. Adding new files mid-week means a manual --no-tag rebuild for fast access; the weekly run picks them up for tagging on the next pass.
Multi-source layout, the actual directory shape
The indexer points at several roots (the SCAN_ROOTS list in index.py) that each have different update cadences and signal-to-noise ratios. Knowing which is which matters when you read query results:
- The published blog corpus is the high-signal, high-curation source: every file has a quality block, so tags are usually clean.
- /data/projects/docs/ is the working documentation: ADRs, plans, strategy notes, setup guides not yet blog-public. Denser, noisier, and where most of the indexer’s reading time goes.
- /data/projects/podcast-studio/docs/ and /kb/ are the podcast-pipeline notes. Smaller corpus, tags converge fast on voxtral, tts, expressivity, ffmpeg.
- /data/scripts/ is the ops namespace, added later. It is what makes the operational source of truth (GRID-FACTS.md, SOVEREIGN-CONTEXT.md, the ops playbooks) retrievable by every agent, which is the whole reason the local model can answer “how do I update this machine” without a human in the loop.
Adding a root is one line in SCAN_ROOTS. The price is reindex time, which scales linearly with file count. Fast-forward from the 172 files this guide started with: the corpus has since grown past 340 files and the no-tag pass is still a few seconds on the Spark.
Edge cases the indexer handles, and the ones it does not
Real-world Markdown is not as clean as the example corpus. The current indexer copes with the common cases; a few are explicit non-goals.
- YAML frontmatter present, well-formed, with tags is the happy path. Tags persist as written; the LLM is not consulted unless tags is missing or empty.
- YAML frontmatter present, well-formed, no tags key triggers Mistral on the next full-rebuild pass and the suggested tags get written back into the file.
- No frontmatter at all is treated as untagged: file is included in the index by path and title (filename-derived), but tag-based queries will miss it until you add a frontmatter block.
- Malformed YAML is currently a silent skip with a log line. The file does not appear in the index and the writer is not warned in real time. That is a known sharp edge worth fixing on the next iteration.
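The four cases in the list can be expressed as a small classifier. The sketch below is a standard-library approximation, not the indexer's actual logic: the state names, the function name classify, and the line-based notion of "malformed" are assumptions, and a real YAML parser would catch more errors.

```python
import re

# A top-level mapping line such as "title: Something".
KEY_LINE = re.compile(r"^([A-Za-z_][\w-]*):\s*(.*)$")


def classify(text):
    """Return 'tagged', 'untagged', 'no_frontmatter' or 'malformed'."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "no_frontmatter"  # indexed by path and filename-derived title
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return "malformed"  # opening delimiter is never closed
    has_tags = False
    current_key = None
    for line in lines[1:end]:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if line[0] in " \t-":
            # Indented line or list item: belongs to the previous key.
            if current_key is None:
                return "malformed"
            if current_key == "tags" and stripped.startswith("- ") and stripped[2:].strip():
                has_tags = True
            continue
        match = KEY_LINE.match(line)
        if not match:
            return "malformed"  # top-level line that is not "key: value"
        current_key, value = match.group(1), match.group(2).strip()
        if current_key == "tags" and value not in ("", "[]", "null", "~"):
            has_tags = True
    return "tagged" if has_tags else "untagged"
```

Returning an explicit "malformed" state, instead of skipping silently, is one way to surface the sharp edge noted in the last bullet: the caller can print or collect those paths at index time.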
What the indexer explicitly does not do today: read .bib, .docx, .pdf, or .org files. The architecture is plain-text Markdown only on purpose. If you need binary-format ingest, that is a separate pipeline question and probably belongs in front of the indexer rather than inside it.
Cron, monitoring, and recovery
The weekly auto-tag run is wired through a systemd timer rather than a crontab entry, mostly because systemd gives clean log retention and a systemctl status view that does not require knowing where the cron mailspool ended up:
# /etc/systemd/system/knowledge-index.timer
[Unit]
Description=Weekly knowledge-base re-tagging
[Timer]
OnCalendar=Sun *-*-* 06:00:00
Persistent=true
RandomizedDelaySec=15min
[Install]
WantedBy=timers.target
The matching service unit calls python3 /data/scripts/knowledge/index.py and writes stderr/stdout to the journal. journalctl -u knowledge-index.service --since "7 days ago" is the one command worth remembering: it is how I notice when SGLang is hung and the auto-tag run hangs with it.
Recovery is intentionally boring: delete /data/knowledge-index.json and rerun. The tags written back into the source Markdown frontmatter survive index deletion, so a full rebuild from scratch is closer to a re-index than a re-tag, which is fast.
What I Actually Use
- Tags curated by hand in frontmatter. The Mistral-via-SGLang auto-tagging pass was retired with the SGLang engine; on a well-curated corpus, hand-written tags beat a second engine and its hang risk.
- /data/knowledge-index.json as the single search target. One file, machine-readable, easy to diff between rebuilds to see what changed.
- The Knowledge MCP as the primary integration, with query.py --json from agent shell-tools as the zero-maintenance fallback. One shell hop of latency on the fallback path.
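Diffing the index between rebuilds can be as small as the sketch below. It assumes the index is a JSON list of entries that each carry a unique path field; the real file layout may differ, and the function names are hypothetical.

```python
import json
from pathlib import Path


def load_index(path):
    """Read an index file that is assumed to hold a JSON list of entries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_indexes(old_entries, new_entries, key="path"):
    """Compare two index snapshots by the given key."""
    old = {entry[key]: entry for entry in old_entries}
    new = {entry[key]: entry for entry in new_entries}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Present in both snapshots but with different metadata.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Keeping a copy of the previous index before each rebuild gives the two inputs for free.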
This guide is the plain-text, no-vector half of the story. For the same idea built on a vector store instead, with the retrieval bugs that came with it, see A Second Brain for a Local Model. For the benchmark that put numbers behind the no-vectors choice, see I Rigged My Own RAG Benchmark.