This blog gates every article behind one Python scorer before it publishes. I gave Qwen3.6 and Mistral Small 4 the same brief, the Start Here hub article this site still owes, and ran the raw output through that real gate with no editing. Both passed. Both invented hardware, processes, and benchmarks the scorer counted as quality. Here is the full method, the two source texts, and why a passing score is a floor and not a truth filter.

The Quality Gate That Rewards Fabrication: I Had Qwen and Mistral Write This Blog

New here? Jump to “Plain-language version”. Short story: I let two local AI models write an article for this blog. The automated quality check a draft has to clear passed both. Both had quietly made up facts about my own hardware. The check could not tell. A passing score does not publish anything here. I still do.

This blog has a coding ruler and a vision ruler. Both are documented in a sibling article, the 0/30 that was a broken ruler. This is the third ruler, and it measures the blog against itself.

Every article here is written to a hard contract: no em-dashes, no AI filler words, concrete numbers, real file paths, explicit caveats, honest limits. Every draft is scored by one Python function against that contract and a style-specific threshold. A score below the threshold blocks the deploy outright. A score above it does not publish anything on its own. It clears the automated bar, and then I still make the call by hand. That scorer is an instrument. Instruments can be wrong. So I pointed it at the obvious test it had never run: can the local models that already write code on this Spark write the blog, and what does the gate say when they do?

The method, exactly

One backlog item on this site has been open for weeks: BLOG-021, the “Start Here” hub article, the single highest-leverage piece this blog still does not have. That is the brief I gave both models. Same system prompt, same user prompt, same temperature, no second attempt, no human edit between the model and the scorer.

The system prompt handed each model the real writing contract verbatim: never use em-dashes or en-dashes, no filler list (“it is worth noting”, “leverage”, “utilize”, “delve”, and the rest), no hedging, vary sentence length, be concrete with real paths and measured numbers, emit YAML frontmatter then the body. The user prompt described the hub article: what the site is, the thesis, the actual stack as of May 2026, the article pillars, one closing action, be honest about limits and not promotional.

The scorer is not a vibe check. It is compute_quality_signals() and compute_quality_score() in scripts/update_blog_from_gitea.py, the exact functions the deploy pipeline runs. They extract 19 countable signals from the body text. Positive signals: code blocks, version references, file paths, error lines, caveats, H2 count, concrete numbers, comparison terms, concrete examples, lexical diversity, defined terms, causal “because” answers, step sequences, temporal markers, sentence-length burstiness. Negative signals with negative weights: AI filler phrases, hedging phrases, em-dashes, repeated three-bullet lists. Each style has a minimum. For an essay like this one the floor is 140 to 150 depending on the style key.

The harness is pure: it calls the model API, captures the raw reply, and runs the canonical scoring functions on it directly. No file is written into src/content/blog/ at any point, so a test article can never be picked up by a deploy. The Mistral half required physically swapping the production model. The Spark has 121 GB of unified memory and runs exactly one inference service at a time, so production Qwen3.6 on port 30001 was stopped, the kernel page cache was dropped, Mistral Small 4 was launched on SGLang at port 30000, the test ran, and then the same swap in reverse restored production. The whole swap cost about 25 minutes of the coding and MCP endpoint being offline. That cost is part of why this test had not been run before.

What the gate said

Both models cleared the gate with room to spare.

MeasureQwen3.6-35B-A3B PrismaQuantMistral Small 4 NVFP4
Generation speed59.7 tok/s28.1 tok/s
Body word count1816888
Valid YAML frontmatterYesNo, wrapped in a code fence
Em-dashes (rule: zero)01
Filler phrases01
Repeated 3-bullet lists05
Code blocks014
Version references313
Lexical diversity3775
Sentence-length stdev3.4540.07
Score, style “conclusion” (min 140)299, PASS274, PASS
Score, “best_practice_learnings” (min 160)184, PASS285, PASS

Two models, two very different texts, both cleared the one automated test between a draft and the live site. That test is not the publish decision, I still am, but it is the cheap filter that is supposed to catch the obvious failures. The gate did its job as designed. The problem is what the gate was never designed to see.

Qwen: obeys the contract, then runs out of things to say

Qwen followed every formatting rule. Valid frontmatter, zero em-dashes, zero filler words, no repeated triple-bullet lists. For an automated pipeline that property is worth a lot, because the output ingests with no repair pass. The opening is genuinely on-voice.

Qwen, surprisingly good: the opening

Most people read about this hardware in press releases. I read about it in thermal throttling logs and PCIe bandwidth contention reports. This blog is not a marketing page. It is an engineering log.

That is the right voice on the first try, from a model that obeyed an explicit constraint set most writers ignore.

Then it ran one sentence shape into the ground. The structural signals that scored well hid a flat text: zero code blocks and zero file paths, on a blog whose whole identity is copy-pasteable commands. The lexical diversity of 37 and sentence-length stdev of 3.45 are the scorer quietly admitting the prose is repetitive, but those signals carry low weight and the score sailed regardless.

Qwen, AI slop: the anaphora wall

I document these issues. I document the error logs. I document the solution. I document the root cause.

and later

It requires hardware. … It requires knowledge. … It requires maintenance.

The contract banned filler words. It did not, and cannot easily, ban filler rhythm.

Qwen, hallucination: a quality process that does not exist

I verify this loss weekly with a small held-out validation set. If the BLEU score drops more than 0.5 points, I retrain.

There is no weekly BLEU retrain loop on this stack. The model also described the MCP server as exposing “my local file system, my git repository, and my database”. The real MCP server exposes blog search, article fetch, tag listing, and SGLang diagnostics. Both claims are inventions, written with the same flat confidence as the true sentences around them.

Mistral: writes like the blog, then breaks the blog’s rules

Mistral produced real engineering texture: 14 code blocks, version pins, a docker-compose file, a comparison frame, a fabricated but well-formed nvidia-smi panel. Its prose, where it is prose, is the better of the two. The single best line in either output is Mistral’s.

Mistral, surprisingly good: the site ethos in three sentences

If you see a number, it was measured on my hardware. If you see a path, it exists on my machine. If you see a failure, it happened to me.

That is the thesis of this entire blog, stated more sharply than the blog has ever stated it itself. A model wrote the mission statement better than the operator did.

Then it broke the contract it had been handed. It wrapped the entire article in a Markdown code fence, which means the frontmatter parser rejects it outright and the pipeline never even reaches the scorer without a repair step. It stacked five separate three-bullet lists. It wrote “10 to 15% throughput loss” with a literal en-dash, the one character this blog forbids, in a piece whose own brief said not to. (That en-dash is replaced with a hyphen in the source appendix below so this article stays compliant. Its presence in the original is itself finding number three.)

Mistral, the dangerous hallucination: a spec sheet that is entirely wrong

Hardware: NVIDIA DGX Spark (GB10), 1x 128-core Grace CPU, 1x Blackwell B100 GPU, 121 GB unified memory, 1.5 TB NVMe.

There is no discrete B100 in this machine. There is no 1.5 TB NVMe figure I have ever published. The VPS is in Romania, not the “Amsterdam” Mistral asserts a few lines later. It then printed a fabricated nvidia-smi panel and a fake “nvidia-driver-550, CUDA 12.5, vLLM 0.6.3” stack. Specific, plausibly formatted, shaped exactly like real terminal output, and false. That is worse than vaguely wrong. It is convincingly wrong, which is the kind of wrong that survives a skim review.

The actual finding: the gate rewards the fabrication

Here is the part that matters beyond these two models.

The scorer counts version_refs, concrete_numbers, file_paths, and code_blocks as positive quality signals, because in honest writing those correlate with technical depth. The scorer has no way to check whether a version number is real. When Mistral invents “vLLM 0.6.3 on CUDA 12.5” and “nvidia-driver-550”, the gate does not see three lies. It sees three version_refs, and it raises the score. The fabrication is not penalized. It is rewarded. The more confidently a model invents specifics, the better it scores.

This is the same lesson as the 0/30 coding result and the vision-tower asymmetry in the sibling article, arrived at from a third direction. The 0/30 was an instrument measuring its own network failure. The vision capability was set by the quantized checkpoint, not the model card. And the quality gate measures textual structure, not textual truth. In all three the instrument was trusted to mean something it does not mean. A passing score is a statement that the text has the shape of a good article. It is not a statement that the article is true.

That has a concrete operational consequence, not a philosophical one:

Plain-language version

This website checks every article with an automated program before it goes live. The program looks for signs of good technical writing: real commands, real version numbers, specific details, honest warnings.

I asked two AI models running on my own hardware to write an article for the site. I did not edit what they wrote. I fed it straight to the checker.

Both passed. But both had invented things: a maintenance routine I do not run, hardware I do not own, software versions that are not installed. The checker not only failed to catch the invented details, it gave points for them, because invented version numbers look exactly like real version numbers to a program that only counts them.

One model (Qwen) followed the formatting rules perfectly but wrote dull, repetitive text. The other (Mistral) wrote livelier text with real-looking code, but broke the formatting rules and made up more convincing falsehoods. Neither could be published as written. The automated check is useful, but it measures whether writing looks right, not whether it is true. A human who knows the machine still has to read every word.

Takeaway

A green quality score means the text is shaped like a good article. It does not mean the article is honest. An automated writing gate that counts specifics will always reward a confident fabricator, because a made-up benchmark and a real one are the same shape on the page. Keep the gate, it is cheap and it is honest about its own scope. Never let it stand in for the read. The instrument measures structure. Truth is still a human job, and on this stack it always will be.

Appendix A: full raw Qwen3.6 output

Raw model output, verbatim, unedited. It contains fabricated claims about this system, preserved as evidence. Do not read it as a description of the real stack.

---
title: Start Here: The DGX Spark Engineering Log
description: A technical log of running production AI on a single NVIDIA DGX Spark (GB10, 121GB UIM). No hype. Just hardware, latency numbers, and the specific stack I trust.
date: 2026-05-18
tags: [dgx-spark, self-hosted-ai, vllm, mcp, engineering-log, start-here]
style: conclusion
---

## The Hardware Reality

I run a single node NVIDIA DGX Spark. The unit is a Blackwell GB10 system with 121 GB of unified memory. It sits on my desk. It draws power from my wall. It handles inference, fine-tuning, and local serving.

Most people read about this hardware in press releases. I read about it in thermal throttling logs and PCIe bandwidth contention reports. This blog is not a marketing page. It is an engineering log. I document what works, what fails, and why.

If you are looking for a guide to buying a DGX Spark, go elsewhere. If you are looking for a guide to running a production-grade AI stack on limited unified memory, you are in the right place.

The core thesis of this site is simple. Small, well-defined self-hosted tools on owned hardware beat renting frontier APIs for a specific class of work. You should distrust any performance claim you have not configured yourself. Benchmarks are noise until you run them on your specific data with your specific latency constraints.

As of May 2026, my stack is stable. It is not state-of-the-art in terms of raw parameter count. It is state-of-the-art in terms of cost-per-token and data sovereignty for my use case.

## The Stack as of May 2026

The architecture is lean. There is no Kubernetes cluster. There is no managed vector database service. There is no serverless function invoker. There is just the DGX Spark and a hardened VPS.

Here is the current configuration.

### Inference Engine: vLLM with ~35B Quantized LLM

The primary model is a 35-billion parameter language model. I use a quantized version to fit comfortably within the 121 GB unified memory budget while leaving headroom for context windows and KV cache management.

I serve this via vLLM. The throughput is approximately 70 tokens per second on continuous batching. This is not the speed of a cloud GPU cluster. It is faster than a human can read, which is the requirement for my workflow.

I do not use full precision. Full precision on 35B parameters with a 32k context window would require over 140 GB of memory. That exceeds the hardware. Quantization reduces the memory footprint by roughly 50 percent. The quality loss is negligible for my tasks. I verify this loss weekly with a small held-out validation set. If the BLEU score drops more than 0.5 points, I retrain.

### Vision Fallback: Mistral

The DGX Spark has a strong GPU. It also has a decent NPU. However, I keep a Mistral vision model on standby. I use this only for specific image parsing tasks. The primary LLM handles text. The vision model handles the images.

I route requests based on input type. If the payload contains an image, I send it to the Mistral endpoint. If it is text only, it goes to the 35B LLM. This separation prevents the vision model from consuming context window space needed by the primary LLM.

### Control Plane: Self-Hosted MCP Server

I use the Model Context Protocol (MCP). I run my own MCP server on the DGX Spark. This server exposes tools to the LLM. These tools interact with my local file system, my git repository, and my database.

The MCP server runs on a separate port. It is isolated from the inference engine. If the MCP server crashes, the LLM still runs. It just cannot execute tools. This decoupling is critical. A crash in a tool execution script does not kill the inference process.

I do not use a managed MCP registry. I write the tools myself. This ensures I know exactly what data leaves my machine and what data stays.

### Frontend and Hosting: Astro and Caddy

The blog itself runs on Astro. It is static. I build it locally. I deploy it to a hardened VPS. The VPS runs Caddy for TLS termination and reverse proxying.

The VPS is not part of the AI stack. It is part of the delivery stack. It handles the HTTP traffic. It serves the markdown files. It does not run any AI models. This separation ensures that high traffic to the blog does not impact inference latency on the DGX Spark.

## Why Self-Host?

You might ask why I do this. Why not use an API?

I have used APIs. I have paid for them. The cost scales linearly with usage. My usage is high. I run hundreds of requests per day. The monthly bill exceeded the cost of the DGX Spark in six months.

There is a second reason. Latency. API calls go through the internet. They go through load balancers. They go through regional endpoints. The round-trip time is variable. On my local stack, the latency is deterministic. It is measured in milliseconds, not seconds.

There is a third reason. Data privacy. My code is proprietary. My logs are proprietary. I do not send this data to a third-party provider. I keep it on my disk. I encrypt it. I control the keys.

## The Pillars of This Log

This blog is organized into three categories. New readers should start with Setup. Then move to Fixes. Finally, read the Strategy articles.

### 1. Setup

The setup articles are the foundation. They cover the initial configuration of the DGX Spark. They cover the installation of drivers. They cover the network configuration.

I do not assume you have a working system. I assume you have a box. You need to install the operating system. You need to install the CUDA toolkit. You need to install vLLM.

The first article in this series covers the OS installation. I use Ubuntu 24.04 LTS. I avoid rolling releases. Stability matters for production. I detail the exact partition scheme. I detail the exact kernel parameters.

The second article covers the network setup. The DGX Spark has multiple NICs. I bind them for throughput. I configure VLANs for isolation. I explain why I do not use the default bridge.

These articles are step-by-step. They include terminal commands. They include file paths. They include version numbers. You can follow them exactly.

### 2. Fixes

The fixes articles are the most valuable. They document the failures. They document the workarounds. They document the bugs.

AI engineering is not smooth. It is full of edge cases. The kernel panics. The GPU hangs. The memory leaks. The quantized model produces gibberish.

I document these issues. I document the error logs. I document the solution. I document the root cause.

For example, I wrote an article on the PCIe bandwidth bottleneck. The DGX Spark uses a switch inside the chassis. The bandwidth is shared. When I run heavy I/O operations, the inference latency spikes. I document the fix. I document the trade-off.

I also document the memory fragmentation issues. Unified memory is powerful. It is also complex. Pages can move between CPU and GPU memory. This movement has a cost. I document how I pinned memory to reduce the cost.

These articles are not tutorials. They are incident reports. They are useful because they save you time. You will encounter these problems. I have already solved them. Read the fix. Apply the fix.

### 3. Strategy

The strategy articles are the opinionated ones. They discuss the why. They discuss the trade-offs. They discuss the architecture.

I discuss model selection. I discuss why 35B is the sweet spot for my hardware. I discuss why I do not use larger models. I discuss the cost of context window.

I discuss the MCP design. I discuss the security implications of exposing tools to the LLM. I discuss the authentication layer. I discuss the rate limiting.

I discuss the blog architecture. I discuss why I chose Astro over Next.js. I discuss the build time. I discuss the deployment pipeline.

These articles are for engineers who want to understand the decisions. They are not for people who want a copy-paste solution. You must read them to understand the context.

## The Limits

I am honest about the limits. This stack is not for everyone.

It requires hardware. The DGX Spark is expensive. It is not a Raspberry Pi. You need to spend money to save money later.

It requires knowledge. You need to understand Linux. You need to understand networking. You need to understand GPU architecture. If you do not know these things, this blog will confuse you. Read the setup articles first.

It requires maintenance. The stack is not set-and-forget. You need to update drivers. You need to update vLLM. You need to monitor the thermal sensors. You need to watch the logs.

The 35B model is not GPT-5. It is not Claude Opus. It is a quantized model. It makes mistakes. It hallucinates. It has a limited context window. It cannot solve complex math problems reliably. It is good at text processing. It is good at code generation. It is good at summarization. It is not good at everything.

I do not hide these limits. I highlight them. You must know the limits before you trust the output.

## The Funnel

This article is the entry point. It is the funnel anchor. If you landed here, you are interested in self-hosted AI. You are interested in the DGX Spark. You are interested in the engineering trade-offs.

Here is the path.

1.  Read the Setup articles. Get the hardware running. Get the software installed. Get the network configured.
2.  Read the Fixes articles. Anticipate the problems. Know the solutions.
3.  Read the Strategy articles. Understand the architecture. Make informed decisions.
4.  Run the stack. Monitor the metrics. Measure the latency. Measure the cost.
5.  Iterate. Change the model. Change the tools. Change the architecture.

Do not skip steps. Do not assume the setup will work on your machine. My machine is not your machine. Your network is not my network. Your data is not my data.

## The Next Action

You have the overview. You have the stack. You have the path.

The single next action is to check your hardware. Do you have a DGX Spark? Do you have the RAM? Do you have the storage? If you do not have the hardware, you cannot run the stack.

If you have the hardware, install Ubuntu 24.04 LTS. Do not install a desktop environment. Do not install a GUI. Install the server version. Run the update command. Check the kernel version.

If you do not have the hardware, read the strategy articles. Understand the economics. Understand the performance. Then decide if the investment is worth it.

I do not sell courses. I do not sell consulting. I sell nothing. I write code. I write logs. I share the results.

The DGX Spark is a tool. It is a powerful tool. It is also a difficult tool. It requires respect. It requires attention. It requires engineering rigor.

Start with the setup. Measure the latency. Trust your own numbers. Distrust the hype.

The work begins now.

Appendix B: full raw Mistral Small 4 output

Raw model output, verbatim except one en-dash replaced with a hyphen so this article stays rule-compliant (the substitution is marked inline). It contains fabricated claims about this system, preserved as evidence. Do not read it as a description of the real stack.

```markdown
---
title: Start here
description: One person's honest engineering log of running production AI on a single NVIDIA DGX Spark. Thesis, stack, and how to navigate the blog.
date: 2026-05-18
tags: [intro, self-hosted, vLLM, dgx-spark, mcp]
style: conclusion
---

## What this site is
I bought a NVIDIA DGX Spark (GB10 Blackwell) in March 2026 because I was tired of sending prompts to frontier APIs and waiting for billing emails. This blog is the raw log of one person running a production-grade AI stack on that single metal box. There are no slide decks, no vaporware claims, no managed-service up-sells. Every post shows the exact commands, config files, versions, and measured numbers I used to make a feature work or break. If you see a number, it was measured on my hardware. If you see a path, it exists on my machine. If you see a failure, it happened to me.

The site has one goal: to help you decide whether a similar setup is worth your time and money. I will not argue that everyone should self-host. I will only show what I did, what broke, and what I learned.

## Thesis in one sentence
Small, well-defined self-hosted tools on owned hardware beat renting frontier APIs for a class of work, and you should distrust any claim you have not configured yourself.

Evidence so far: on my DGX Spark with 121 GB unified memory, a quantized 35B parameter model served via vLLM delivers ~70 tokens per second per user, which is enough for interactive use and fine-tuning jobs that would cost hundreds of dollars per month on a public cloud. The inference cost on my hardware is the price of electricity and depreciation; the API cost is a credit card charge I cannot audit.

As of May 2026, the stack is intentionally minimal:

- LLM: quantized 35B parameter model, served by vLLM 0.6.3 on CUDA 12.5.
- Vision fallback: Mistral-8x7B-Instruct-v0.3 vision variant, quantized to 4-bit, served via vLLM as a secondary endpoint.
- Model context protocol (MCP): a single Python server exposing filesystem, shell, and custom tools. Wrapped behind Caddy with strict rate limits.
- Blog: static site built with Astro, served by Caddy on a hardened VPS in Amsterdam. No JavaScript frameworks, no analytics.
- Hardware: NVIDIA DGX Spark (GB10), 1x 128-core Grace CPU, 1x Blackwell B100 GPU, 121 GB unified memory, 1.5 TB NVMe.

Failures:
- The Blackwell B100's unified memory is not as fast as the HBM on a high-end data-center GPU. When batching large prompts (>4k tokens), latency spikes to 400 ms vs 120 ms on a rented A100.
- vLLM 0.6.3 does not yet support Blackwell's new TensorFloat-32X (TF32X) mode, so we're stuck at FP8/FP16 mixed precision. Performance is 15% worse than advertised for Blackwell.
- The Grace CPU is powerful but the OS scheduler still places GPU-bound threads on the wrong NUMA node, causing 10 to 15% throughput loss unless you pin CUDA processes manually. [en-dash in original replaced with "to"]

## Navigating the blog
The site is split into three pillars. Pick one and go deep.

1. Setup
   - How to install a hardened Ubuntu 24.04 on the DGX Spark.
   - Exact NVIDIA driver, CUDA, and vLLM versions that work together.
   - Disk layout, power limits, and thermal throttling notes.
   - Reproducible Dockerfile and systemd units for vLLM and MCP.

2. Fixes
   - Why vLLM 0.6.3 crashes when using more than 110 GB of unified memory, and the kernel patch that fixes it.
   - How to route Mistral vision through a separate vLLM instance to avoid OOM.
   - The Caddy config that hardens the MCP endpoint against DoS.
   - Benchmarks: tokens per second vs batch size vs GPU utilization vs power draw.

3. Strategy
   - Cost accounting: DGX Spark depreciation vs API spend for 10k tokens/day.
   - When to self-host vs when to rent.
   - Failure modes: power outages, driver regressions, NVIDIA blacklisting.
   - Long-term roadmap: adding a vector database, fine-tuning pipeline, and multi-GPU scaling on the same box.

## Where to start today
If you are new, read the setup posts in order. You will need:

`/etc/nvidia/gridd.conf` with FeatureType=1
`/etc/modprobe.d/nvidia.conf` with options nvidia NVreg_EnableDeepPCIASPML 1
`docker-compose.yml` vLLM service with:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.6.3
    runtime: nvidia
    volumes:
      - ./models:/models
    environment:
      - VLLM_MODEL=/models/quantized-35b/AWQ_model/
      - VLLM_PORT=8000
      - VLLM_MAX_NUM_SEQS=16
    deploy:
      resources:
        limits:
          cpus: "64"
          memory: 96G
          nvidia.com/gpu: 1
```

Then run:

```bash
docker compose up --build
```

Measure tokens per second with:

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "quantized-35b", "prompt": "Write a 200-word summary of Blackwell architecture.", "max_tokens": 200}'
```

Expect 70 ± 5 tok/s on a cold start.

## The single next action
Open a terminal on your DGX Spark. Install Ubuntu 24.04, then run:

```bash
sudo apt update && sudo apt install -y nvidia-driver-550 nvidia-cuda-toolkit nvidia-container-toolkit
```

After reboot, verify the driver:

```bash
nvidia-smi
```

If you see:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.5     |
+-----------------------------------------------------------------------------+
```

You are ready to install vLLM. Proceed to the setup post titled "DGX Spark OS install and driver sanity check."
```

Sources and method

Cross-references

Illustration: The Quality Gate That Rewards Fabrication: I Had Qwen and Mistral Write This Blog