Build a Self-Hosted Knowledge Base with Plain Text and LLMs

May 3, 2026 7 min read

New to self-hosting AI? The Self-Hosted AI: Start Here hub walks the hardware-decision tree, inference-engine choice, and the operational gotchas that bite hardest in the first three months. Read it before or after this one, whichever fits your stage.

Quick Take

Build a private, searchable knowledge base from Markdown files without vector databases

Use Mistral Small 4 to auto-tag documents based on content

Query via CLI today; an MCP wrapper for agent integration is on the roadmap, not yet shipped

Index updates in seconds with --no-tag, full re-tag run on a weekly cron

The setup is intentionally boring: a folder of Markdown files under /data/projects/, a small Python indexer that walks them and writes a single JSON file, and a CLI query tool. No vector store. No embeddings. No vendor lock-in. The LLM only touches the index when I ask it to auto-tag new files. The rest of the time the index is plain JSON and the search runs in milliseconds against it.

Start with the Indexer

python3 /data/scripts/knowledge/index.py --no-tag

This command scans all *.md files in your configured sources (/data/projects/blog/, /data/projects/docs/, /data/projects/podcast-studio/docs/), reads their frontmatter, and builds a JSON index at /data/knowledge-index.json. The --no-tag flag skips Mistral’s auto-tagging step, which is useful when you’re iterating quickly and don’t need AI-generated tags yet. On 172 Markdown files the index pass without tagging takes about two seconds on the DGX Spark.

Why Auto-Tagging Matters

python3 /data/scripts/knowledge/index.py

When you drop the flag, the indexer calls Mistral Small 4 via SGLang to generate tags for untagged documents. It writes those tags back into the frontmatter of each Markdown file, so your tags persist across index rebuilds. Every new document gets categorized without manual effort, and your tag-based queries stay consistent. The example output for CLAUDE.md from a real run: ['sovereign_ai', 'arm64_hardware', 'nvidia_gb10', 'mcp_services', 'tor_privacy', 'docker_arm64', 'llm_deployment']. That is what “useful tag” looks like; bag-of-buzzwords is what to avoid, and Mistral mostly stays on the right side of that line for technical content.

The auto-tagging pass is the slow part. The full run over 172 files takes long enough that I gate it behind a weekly cron; iterative work uses the --no-tag path and re-tags only on schedule.

Query from the CLI

python3 /data/scripts/knowledge/query.py "voxtral tts" --limit 5

The query tool searches the knowledge base using a simple scoring system: title hits score highest, followed by tags, paths, and summaries. The --limit 5 flag caps results. Use --json if you want machine-readable output to pipe into another tool. The CLI returns titles, tags, and short summaries; that is enough signal to decide whether to open a file.

Agent integration via MCP, the honest status

A Knowledge MCP server that wraps query.py and exposes query_knowledge, list_tags, and rebuild_index as tools to agents like Vibe and Claude Code is on the roadmap, not shipped today. The Sovereign AI Blog MCP at https://mcp.sovgrid.org/self-hosted-ai already exposes search_blog, list_tags, and get_article, but those operate on the published-blog corpus only, not on the broader knowledge base under /data/projects/.

When the Knowledge MCP ships it will be a separate FastMCP 1.27 server (consistent with the rest of the Sovereign AI Grid), not the legacy mcp.server.Server SDK that earlier MCP examples on the web still show. If you are building one yourself today, start from the FastMCP docs and the existing Sovereign AI Blog MCP source as the closer reference; do not copy the legacy-SDK skeletons that were widely shared in late 2024.

Until that wrapper exists, the integration is whatever your agent does with shell-tool access: most modern coding agents can run python3 /data/scripts/knowledge/query.py "voxtral tts" --json directly and parse the JSON, which is functionally close to a real MCP tool with one shell hop of latency.

Keep the Index Fresh

# After adding new docs
python3 /data/scripts/knowledge/index.py --no-tag

# Weekly full rebuild with auto-tagging
python3 /data/scripts/knowledge/index.py

The --no-tag version runs in under two seconds on 172 files; the full rebuild with Mistral Small 4 auto-tagging is slow enough that it makes sense as a weekly cron job rather than a per-edit hook. The systemd timer fires Sunday morning so Monday starts with a clean index and fresh tags. Adding new files mid-week means a manual --no-tag rebuild for fast access; the weekly cron picks them up for tagging on the next pass.

Multi-source layout, the actual directory shape

The indexer points at three roots that each have different update cadences and different signal-to-noise ratios. Knowing which is which matters when you read query results:

/data/projects/blog/ is the published blog corpus mirror. High signal, high curation, every file has a quality block. Tag-suggestions from Mistral are usually right because the source is structured.
/data/projects/docs/ is the working documentation: ADRs, plans, strategy notes, setup guides for things that are not yet blog-public. Tag-suggestions are noisier here because the writing style is denser and less consistent. This is where most of the indexer’s actual reading time goes.
/data/projects/podcast-studio/docs/ is the podcast-pipeline working notes. Smaller corpus, tags converge fast on voxtral, tts, expressivity, ffmpeg.

Adding a fourth root is one line in the indexer config. The price is reindex time, which scales linearly with file count. With 172 files the no-tag pass is two seconds; doubling the corpus will roughly double that.

Edge cases the indexer handles, and the ones it does not

Real-world Markdown is not as clean as the example corpus. The current indexer copes with the common cases; a few are explicit non-goals.

YAML frontmatter present, well-formed, with tags is the happy path. Tags persist as written; the LLM is not consulted unless tags is missing or empty.
YAML frontmatter present, well-formed, no tags key triggers Mistral on the next full-rebuild pass and the suggested tags get written back into the file.
No frontmatter at all is treated as untagged: file is included in the index by path and title (filename-derived), but tag-based queries will miss it until you add a frontmatter block.
Malformed YAML is currently a silent skip with a log line. The file does not appear in the index and the writer is not warned in real time. That is a known sharp edge worth fixing on the next iteration.

What the indexer explicitly does not do today: read .bib, .docx, .pdf, or .org files. The architecture is plain-text Markdown only on purpose. If you need binary-format ingest, that is a separate pipeline question and probably belongs in front of the indexer rather than inside it.

Cron, monitoring, and recovery

The weekly auto-tag run is wired through a systemd timer rather than a crontab entry, mostly because systemd gives clean log retention and a systemctl status view that does not require knowing where the cron mailspool ended up:

# /etc/systemd/system/knowledge-index.timer
[Unit]
Description=Weekly knowledge-base re-tagging

[Timer]
OnCalendar=Sun *-*-* 06:00:00
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target

The matching service unit calls python3 /data/scripts/knowledge/index.py and writes stderr/stdout to the journal. journalctl -u knowledge-index.service --since "7 days ago" is the one command worth remembering: it is how I notice when SGLang is hung and the auto-tag run hangs with it.

Recovery is intentionally boring: delete /data/knowledge-index.json and rerun. The tags written back into the source Markdown frontmatter survive index deletion, so a full rebuild from scratch is closer to a re-index than a re-tag, which is fast.

What I Actually Use

Mistral Small 4 via SGLang on DGX Spark for the auto-tagging pass. Local, private, no per-token billing, and good enough at picking technical tags.

/data/knowledge-index.json as the single search target. One file, machine-readable, easy to diff between weekly runs to see what changed.

query.py --json from agent shell-tools as the integration today, until a proper FastMCP wrapper ships. One shell hop of latency, zero MCP-server maintenance burden meanwhile.

Stack

Self-Hosted Knowledge Base

Plain text + LLMs architecture

Integration Vibe via MCP

Query System CLI + MCP tools

LLM Engine Mistral Small 4 for tagging

Indexer JSON index builder

Data Layer Markdown files in /data/projects/