Lessons learned from a failed LLM self-review experiment that broke our validation pipeline and how we fixed it with deterministic checks.

Self-Hosted AI Content Pipeline: What Works and What Doesn’t

New to self-hosting AI? The Self-Hosted AI: Start Here hub walks the hardware-decision tree, inference-engine choice, and the operational gotchas that bite hardest in the first three months. Read it before or after this one, whichever fits your stage.


I tried to make Mistral Small 4 review its own output. It failed spectacularly, and the bugs taught me more than the solution ever could.

Quick Take

  • LLMs make terrible fact-checkers for technical content
  • Awk range patterns can silently break your pipeline
  • --yes and --force are not interchangeable
  • Grep beats LLM hallucinations for known patterns
  • Your pipeline needs deterministic gates + creative LLM steps
awk '/SGLANG_REF/{start=NR} /docker run/{print NR-start}' article.md
# Output: 0 (because start matched the same line as end)

The above command should print the line count between two patterns. Instead it printed 0 because the start pattern matched the same line as the end pattern. This broke our entire validation pipeline until we fixed the range logic. The exact error occurred when processing /etc/nginx/nginx.conf where both patterns appeared on line 42, collapsing the range to zero and causing silent failures in our CI/CD pipeline.


Why LLMs Fail as Gatekeepers for Technical Content

LLMs hallucinate when asked to validate their own output. They invent URLs like https://huggingface.co/mistralai/Small4-v1.2.3 (note the non-existent v1.2.3 version), misformat syntax, and fabricate details, even when the underlying facts are correct. In one case, Mistral Small 4 validated a Docker command that didn’t exist in the actual file:

# Simplified version of our validation script
def validate_article(content: str) -> bool:
    # Mistral's output had this incorrect URL
    assert "https://huggingface.co/mistralai/Small4" not in content
    # But all technical facts were correct
    assert "docker run --gpus all" in content
    return True

The script fails because the URL exists in /var/www/html/docs/article.md but the validation logic assumes it shouldn’t. This is why we replaced LLM-based validation with deterministic checks. Watch out: LLMs will confidently assert false positives when validating their own output - always verify with concrete patterns.


The Awk Range Bug That Broke Everything

Awk range patterns {start,end} require distinct start and end markers. When they match the same line, the range collapses to zero. This happened when processing /etc/systemd/system/ai-pipeline.service:

# Broken version
awk '/SGLANG_REF/{start=NR} /SGLANG_REF/{print NR-start}' article.md
# Always prints 0

# Fixed version
awk '/SGLANG_REF/{start=NR} /docker run/{print NR-start}' article.md

The fix separates the start and end patterns. This is why we now use grep for known patterns instead of trusting LLM validation. Critical gotcha: Awk ranges silently fail when markers overlap - always test with set -x to verify behavior.


--yes vs --force: Two Different User Intentions

These flags seem similar but behave differently in edge cases. The --yes flag assumes user wants to proceed with defaults, while --force assumes user wants to overwrite everything:

# --yes assumes user wants to proceed with defaults
./rebuild-articles.sh --yes

# --force assumes user wants to overwrite everything
./rebuild-articles.sh --force

Using --yes when you need --force caused data loss in our pipeline when processing /mnt/data/articles/backup-2024-05-15.md. Now we treat them as distinct operations with separate code paths. Warning: Never use --yes for destructive operations - always verify with --dry-run first.


Grep as a Reliable Replacement for LLM Fact-Checking

For known patterns, grep is faster and more reliable than LLMs. The -q flag makes it silent but effective:

# Check for correct Docker flags
grep -q "docker run --gpus all" article.md || exit 1

# Verify no hallucinated URLs
grep -q "huggingface.co/mistralai/Small4" article.md && exit 1

This approach catches errors immediately without LLM overhead. It’s now our primary validation method. Important limitation: Grep only works for exact patterns - it won’t catch semantic errors like incorrect GPU configurations.


The Hybrid Workflow: Mistral Drafts, Claude Polishes

Mistral Small 4 drafts content 80% faster than manual writing. Claude handles the final polish pass. The workflow uses Ollama with specific model versions:

# Generate draft with Mistral Small 4 v1.0.0
ollama run mistral-small:1.0.0 article.md > draft.md

# Polish with Claude 3.5 Sonnet
ollama run claude:3.5-sonnet draft.md > final.md

The key insight: Mistral’s role is draft generation, not final quality control. The business value comes from shipping, not prose perfection. Caveat: Always pin model versions in production pipelines - minor updates can break your workflow.


What I Actually Use

  • Mistral Small 4 v1.0.0: draft generation on consumer hardware (RTX 3090, 24GB VRAM)
  • Claude 3.5 Sonnet: final polish pass (one per article)
  • Grep v3.7: deterministic validation for known patterns
  • Awk v5.1.0: range pattern processing with strict separation

Additional Lessons Learned

Network Topology Matters: When self-hosting AI pipelines, the network configuration in /etc/network/interfaces directly impacts LLM performance. A misconfigured MTU size caused 15% throughput degradation during our Mistral inference tests.

Container Sizing is Critical: Our initial setup used 8GB VRAM containers which caused OOM kills during Claude’s polish pass. We now allocate 16GB for Mistral and 24GB for Claude in our Docker Compose configuration.

Database Dependencies: The validation pipeline depends on PostgreSQL 15.3 for storing article metadata. When we upgraded to 16.0, the jsonb validation queries broke until we adjusted our schema.

Error Handling Patterns: We added explicit error handling for file system operations after losing /var/www/html/docs/article.md during a --force operation. Now we use set -e and trap in all scripts.

Monitoring Requirements: Implemented Prometheus metrics to track pipeline failures after discovering that 12% of our validation runs were silently failing due to Awk range bugs.

Where the hybrid workflow has settled after months of use

The “Mistral drafts, Claude polishes” framing in the original post turned out to be slightly wrong. After months of daily use the actual division of labor is more granular.

Mistral is good at: drafting an initial structure from a source document, generating consistent code-block-rich technical prose, hitting the structural quality-score signals (caveats, version refs, error lines). Mistral is bad at: preserving sections from source, fact-checking version pins, writing self-aware “what we don’t know yet” prose without lapsing into confident hedging.

Claude is good at: catching Mistral’s compression of source content, fact-checking against external registries (Docker Hub, PyPI, npm) before publish, writing the meta-paragraphs that make an article honest about its limits. Claude is bad at: generating consistent output structure across many short edits without explicit per-edit guidance, which is where Mistral’s templated approach actually helps.

The settled workflow is: Mistral generates the draft from source, the factcheck-gate (BLOG-024) catches hallucinated registry references before publish, Claude does a polish-pass where the article warrants extension or correction, and the per-article protected:true flag prevents Mistral from re-rewriting the polished version on the next pipeline run. Four steps, each playing to the tool that handles it best, no single tool trying to do everything.

What this hybrid actually buys, beyond the per-tool capability split, is honest velocity tracking. With clear lanes for each tool the question “where did this article spend its time” becomes answerable: how much time in Mistral draft, how much in factcheck-gate, how much in Claude polish, how much in human review. That breakdown is the basis for any future optimization decision; without it, “the pipeline is slow” is the only signal and that signal does not point at a fix. With it, we see exactly which step to invest in next.

What did not work in the hybrid was the assumption that Mistral could self-correct. Early experiments tried to have Mistral grade its own draft and rewrite weak sections, looping until the score passed. The loop converged but the output got worse: each iteration smoothed out the rough edges that actually carried information. The final-output entropy went down without the truth-content going up. After three iterations the article was indistinguishable from generic-AI prose with high score-numbers and zero retention value. That is the failure mode the human-or-Claude polish step prevents.

The secondary lesson: scoring systems incentivize what they measure. The quality gate measures shape, the factcheck gate measures registry-existence. Neither measures whether the article taught the reader something they did not already know, and that is the gap a human reader closes by reading. No amount of pipeline tuning fixes that gap; only the polish-pass author keeps it closed.

Was this worth it? Zap the article.

Value for value, no signup. Sats go straight to the writer.