A practical guide to optimizing a self-hosted AI content pipeline with targeted scripts, grep-based validation, and precise flag handling.

Self-Hosted AI Pipeline


I spent two hours fixing a pipeline that broke because an LLM couldn’t tell its own flags from the truth.

Quick Take

  • Mistral checking Mistral-written articles fails
  • Grep catches bad flags faster than any LLM
  • Small scripts beat big rebuilds for frontmatter tweaks
  • --force means “I know what I’m doing,” not “skip checks”
  • Watch out: LLM-based validation fails when prompts are ambiguous or when the model prioritizes coherence over external truth
  • Gotcha: the old pipeline let --yes alone bypass the hallucination check entirely
  • Limitation: Regex-based flag validation only catches known patterns, not novel misconfigurations

Start with a targeted script, not a hammer

bash /scripts/generate-descriptions.sh [--only slug] [--force] [--dry-run]

This script fills empty description: fields in frontmatter without rebuilding the whole site. For each article it reads the first 60 lines, passes up to 1,000 characters of that text to Mistral Small 4 (v1.2.3) with an 80-token cap, and replaces only the description: line using a Python regex. A 2-second pause between articles keeps GPU RAM from spiking. Watch out: if those first 1,000 characters contain misleading context, the generated description may be inaccurate.

import re, subprocess

def update_description(slug, body):
    # Prompt on the first 1,000 characters only to keep token usage low.
    prompt = f"Write a concise meta description (max 160 chars) for this article:\n\n{body[:1000]}"
    desc = subprocess.check_output(
        ["mistral", "--model", "mistral-small-4", "--max-tokens", "80"],
        input=prompt.encode()
    ).decode().strip()
    # Escape double quotes so the YAML value stays parseable.
    desc = desc.replace('"', '\\"')
    # count=1 touches only the first description: line, not a stray
    # "description:" later in the article body.
    return re.sub(r'description:.*', f'description: "{desc}"', body, count=1)

No full rebuild, no wasted tokens. Small scripts beat generic rebuild hammers every time. Gotcha: If the article body contains malformed frontmatter, the regex substitution may fail silently.
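One way to surface that silent failure is a pre-flight check before the substitution runs. A minimal shell sketch, assuming well-formed frontmatter has exactly one description: line (the temp file stands in for a real article):

```shell
# Hypothetical article with a single empty description: line.
article=$(mktemp)
printf 'title: "Test"\ndescription: ""\n---\nBody text\n' > "$article"

# A well-formed article has exactly one description: line; zero or
# several means the substitution would do nothing or hit the wrong line.
count=$(grep -c '^description:' "$article")
if [[ "$count" -eq 1 ]]; then
  echo "ok: safe to substitute"
else
  echo "skip: found $count description: lines"
fi
rm -f "$article"
```

Running the check costs one grep per article and turns a silent no-op into an explicit skip.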

Flags that crash hardware need exact matching

SGLANG_REF=$(awk '/^### Kritische Flags/,/^### /' "$VIBE")

In awk, a range’s end pattern is also tested against the line that opened it, so “### Kritische Flags” matched /^### / as both start and end and the range collapsed to a single line. Mistral then wrote flags without the VIBE.md reference. The fix uses a negated character class to keep the range open until the next section title that doesn’t start with K:

SGLANG_REF=$(awk '/^### Kritische Flags/,/^### [^K]/' "$VIBE")

Before the fix, the pipeline generated --attention-backend flashinfer, --speculative-eagle-topk 4, and --mem-fraction-static 0.88, all flags that crash GB10 hardware. After the fix, it pulls the correct flags verbatim from VIBE.md. Limitation: This approach requires strict section formatting; if VIBE.md uses inconsistent headers, the extraction will fail.
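The collapse is easy to reproduce outside the pipeline. A minimal sketch with an invented stand-in for the VIBE.md section (the sample content is illustrative, not the real file):

```shell
ref='### Kritische Flags
--attention-backend triton
--speculative-eagle-topk 1
### Hardware
GB10'

# Broken: the start line also matches /^### /, so the range is one line.
broken=$(awk '/^### Kritische Flags/,/^### /' <<< "$ref")

# Fixed: the negated class keeps the range open until a heading that
# does not start with K (the closing heading itself is included).
fixed=$(awk '/^### Kritische Flags/,/^### [^K]/' <<< "$ref")

echo "broken: $(grep -c '' <<< "$broken") line(s)"   # 1: just the heading
echo "fixed: $(grep -c '' <<< "$fixed") line(s)"     # 4: heading, flags, next heading
```

Note that the fixed range still includes the closing header line, so a downstream consumer may want to drop the last line of the capture.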

--yes should not bypass safety gates

if [[ "$AUTO_YES" == true ]]; then
  # "${FORBIDDEN_FLAGS[*]}" would join every pattern into one
  # space-separated regex; test each pattern on its own instead.
  for pattern in "${FORBIDDEN_FLAGS[@]}"; do
    if grep -qE -e "$pattern" <<< "$DRAFT"; then
      echo "Hallucination detected. Use --force to override."
      exit 1
    fi
  done
fi

The old behavior let --yes skip the hallucination check entirely. The new behavior treats the check as a gate: auto-yes skips boring confirmations, but not safety checks. To override, you need the explicit --force flag. Watch out: If FORBIDDEN_FLAGS is empty, the check becomes meaningless, allowing invalid configurations through.

This distinction matters. --yes means “skip boring confirmations,” --force means “I know what I’m doing.” They solve two different problems.
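One way to keep the two apart is a gate that only ever looks for --force. A minimal sketch, with a hypothetical may_save helper standing in for the real save path:

```shell
# Returns 0 (save allowed) or 1 (blocked). --yes is irrelevant here:
# it silences confirmations elsewhere; only --force overrides the gate.
may_save() {
  local hallucinated=$1; shift
  local force=false
  local arg
  for arg in "$@"; do
    [[ "$arg" == "--force" ]] && force=true
  done
  if [[ "$hallucinated" == true && "$force" != true ]]; then
    return 1
  fi
  return 0
}

may_save true --yes   && echo "saved" || echo "blocked"   # blocked
may_save true --force && echo "saved" || echo "blocked"   # saved
may_save false --yes  && echo "saved" || echo "blocked"   # saved
```

Because the gate never inspects --yes, there is no code path where auto-confirmation and safety override can be conflated.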

Grep beats LLM for string matching

We tried using Mistral to check Mistral-written articles:

Mistral writes → Mistral checks → Mistral saves or aborts

The checker got the article draft plus a “stack reference” (correct flags, hardware, script names) and was supposed to flag contradictions.

It failed for four reasons.

First, ambiguous prompt wording made the LLM flag correct flags as wrong:

Wrong SGLang flag: e.g. --attention-backend flashinfer, --speculative-eagle-topk 4

The LLM saw “flashinfer” and “4” in its training data and flagged the correct --attention-backend triton as wrong. Gotcha: LLM-based validation can misclassify valid configurations if the prompt references banned patterns.

Second, self-contradictory outputs, where the checker flagged a value and then cited the identical value as correct:

- Wrong SGLang flag: `--moe-runner-backend flashinfer_cutlass`
  (correct: `--moe-runner-backend flashinfer_cutlass`)

Third, false positives blocked good articles. The checker flagged --attention-backend triton and --speculative-eagle-topk 1 as wrong, so the correct article wouldn’t save without --force, which defeats the safety gate. Limitation: Overly strict validation can block valid configurations, requiring manual overrides.

Fourth, context drift. When the draft contained misinformation, the LLM sometimes trusted the draft over the stack reference. LLMs optimize for coherence, not external truth. Watch out: If the draft contains incorrect but plausible-sounding flags, the LLM may fail to detect them.

The replacement uses grep against a forbidden list:

FORBIDDEN_FLAGS=(
  "--attention-backend flashinfer"
  "--mem-fraction-static 0.88"
  "--speculative-eagle-topk [2-9]"
)

for pattern in "${FORBIDDEN_FLAGS[@]}"; do
  # -e keeps patterns that start with "--" from being read as grep options.
  if grep -qE -e "$pattern" <<< "$DRAFT"; then
    echo "FORBIDDEN: $pattern"
    HALLUCINATED=true
  fi
done

No extra Mistral call, no false positives, deterministic, and easy to extend. It catches known bad patterns but doesn’t claim to judge plausibility. Gotcha: If a new invalid flag emerges, it must be manually added to FORBIDDEN_FLAGS; automated detection requires updates.
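Run against a sample draft, the loop is deterministic. A quick self-contained sketch with an invented draft (note -e, which keeps patterns starting with a dash from being parsed as grep options):

```shell
FORBIDDEN_FLAGS=(
  "--attention-backend flashinfer"
  "--mem-fraction-static 0.88"
  "--speculative-eagle-topk [2-9]"
)

# Invented draft: one correct flag, one forbidden flag.
DRAFT='Start the server with --attention-backend triton
and --speculative-eagle-topk 4 for faster decoding.'

HALLUCINATED=false
for pattern in "${FORBIDDEN_FLAGS[@]}"; do
  if grep -qE -e "$pattern" <<< "$DRAFT"; then
    echo "FORBIDDEN: $pattern"
    HALLUCINATED=true
  fi
done
echo "hallucinated=$HALLUCINATED"
```

The correct --attention-backend triton passes untouched while the [2-9] character class catches --speculative-eagle-topk 4, which is exactly the behavior the LLM checker could not deliver reliably.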

What’s left to fix before the pipeline is production-ready

The plan is to replace the LLM hallucination checker with the grep version, add the full docker run command to VIBE.md so Mistral copies it verbatim, and stop passing --yes from rebuild-articles.sh to new-article.sh.

Content-wise, the plan is to manually review the SGLang article for correct flags, validate EEAT scores, add missing internal links, and expand two short articles. The description: fields for all 12 articles are done. Watch out: If VIBE.md is outdated, the extracted flags may be incorrect, leading to runtime errors.

What I Actually Use

  • Mistral Small 4 (v1.2.3): the model that writes and sometimes misfires, but still beats hand-writing flags
  • GNU awk (v5.1.0): for reliable range patterns without self-closing gotchas
  • grep (v3.8): the unsung hero that catches bad flags before they reach readers
  • Limitation: This pipeline assumes Mistral Small 4 is available locally; cloud-based models may introduce latency or cost issues

Flow

Self-Hosted AI Pipeline: script-based fixes over LLM checks

1. Script Generation: targeted bash script for frontmatter updates
2. LLM Processing: Mistral Small 4 generates descriptions
3. Regex Replacement: Python regex updates only the description field
4. Flag Validation: grep checks for forbidden flags
5. Safety Gate: --force overrides checks, --yes skips confirmations