Content Quality in the AI Age: Where Our Scoring System Is Right, Wrong, and Missing
Coming from outside the stack? The Self-Hosted AI: Start Here hub article maps where strategy decisions like this one land in the actual deploy: hardware tree, inference engine, what hurts most. Useful as the operational anchor for the framing here.
I built a quality scoring system for the blog you are reading right now because I needed a mechanical signal for “is this article actually carrying information or just sounding like it does”. Weighted signals, style-aware gates, epistemic markers. Then I ran my own framework through a rigorous philosophical critique to see whether I was measuring what I thought I was measuring.
The honest answer was: three things confirmed, two things exposed, one fix that matters.
What the Framework Gets Right About Us
The critique opens with the same sentence we wrote into our own pipeline six weeks ago:
We do not measure quality. We measure the reduction of conditions that prevent quality.
This is not a coincidence. It reflects the only coherent position available when you accept that quality itself is not directly observable. You cannot measure understanding. You can measure the absence of ambiguity. You cannot measure trust. You can measure terminological consistency.
Our 16-signal architecture already operates on this principle. defined_terms, why_answers, step_sequences are not quality measurements. They are absence-of-failure signals. The framework validates this framing precisely.
Gate architecture is correct. The framework proposes a Structural Integrity Gate (SIG): binary, collapses everything else to zero if violated. Our min_score gate does exactly this: below threshold, the article is rejected and Mistral retries. Not penalized. Rejected. This is the right architecture. An additive score that allows bad structure to be compensated by word count is not a quality gate. It is a performance reward.
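The difference between an additive score and a gated one is easy to see in a few lines. This is a minimal sketch, not the pipeline's code: the weights, the threshold, the `word_count_k` signal, and the choice of `defined_terms` and `step_sequences` as the required structural signals are all assumptions made for illustration.

```python
from typing import Mapping

# Hypothetical weights and threshold, for illustration only.
WEIGHTS = {"defined_terms": 5.0, "step_sequences": 5.0, "word_count_k": 10.0}
REQUIRED = ("defined_terms", "step_sequences")  # assumed structural signals
MIN_SCORE = 20.0


def additive_score(signals: Mapping[str, float]) -> float:
    """Weighted sum: any signal can compensate for any other."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def gated_score(signals: Mapping[str, float]) -> float:
    """Binary gate times score: a missing structural signal zeroes everything."""
    gate = 1.0 if all(signals.get(name, 0.0) > 0 for name in REQUIRED) else 0.0
    return gate * additive_score(signals)


def accept(signals: Mapping[str, float]) -> bool:
    """Reject, rather than penalize, anything below the threshold."""
    return gated_score(signals) >= MIN_SCORE


# A long article with no structure, and a shorter one with structure.
padded = {"defined_terms": 0, "step_sequences": 0, "word_count_k": 3.0}
structured = {"defined_terms": 2, "step_sequences": 1, "word_count_k": 1.2}
```

Under the purely additive score, `padded` clears the threshold on word count alone. With the gate in front, it is rejected, while `structured` passes.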
Anti-KPI protocol is correct. Click-through rate, scroll depth, social engagement, early traffic: explicitly excluded. We never measured these. The framework’s formal exclusion list matches our implicit assumptions. We can now make those assumptions explicit in the docs.
What the Framework Exposes
The Goodhart Problem Is Architectural
Goodhart’s Law: once a metric becomes a target, it ceases to be a good metric. Our system has a specific Goodhart vulnerability that most scoring systems do not. Mistral generates the content AND is graded by the same signals it was told about.
The feedback loop in our pipeline sends Mistral this when it fails the quality gate:
"Add more why_answers: explain WHY, not just WHAT. Use 'because', 'therefore'..."
Mistral responds by inserting therefore and because into the next draft. The signal count goes up. The article passes. The reasoning did not improve. It was papered over with causal connectives.
This is the core tension we cannot fully resolve without changing the generation model. Partial mitigations:
- Use defined_terms and step_sequences as primary gates (harder to fake than connective injection)
- Limit retry prompts to structural requirements rather than keyword hints
- Treat the score as internal pipeline health, not a quality claim
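The connective-injection failure described above can be reproduced with a toy signal. The sketch below assumes `why_answers` is roughly a count of causal connectives. The real signal may be more involved, and the two sample drafts are invented for the demonstration.

```python
import re

# Assumed approximation of a connective-counting why_answers signal.
WHY_PATTERN = re.compile(r"\b(?:because|therefore)\b", re.IGNORECASE)


def why_answers(body: str) -> int:
    """Count causal connectives; says nothing about whether the reasoning holds."""
    return len(WHY_PATTERN.findall(body))


# Invented drafts: the retry adds connectives but no new reasoning.
draft = "The cache is slow. We moved it to NVMe."
retry = "The cache is slow. Therefore, we moved it to NVMe, because it is slow."

print(why_answers(draft), why_answers(retry))
```

The retry scores higher even though its only added "reason" restates the premise. That is the gap between the signal and the thing it was meant to indicate.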
ContentClass Is Missing
The framework introduces a concept our system lacks entirely:
ContentClass { Foundational | Durable | Ephemeral }
This matters because our scoring treats a setup tutorial about Docker Compose v2.24.7 the same as a strategic analysis of AI infrastructure economics. These are not the same content type. Quality for ephemeral content means something different:
- Ephemeral (fixes/, setup/ articles): version markers, explicit dependencies, deprecation signals. Quality is defined as minimal update cost, not longevity.
- Durable (strategy/, services/ articles): argument completeness, defined terms, structural consistency. Quality is defined as resistance to concept drift.
- Foundational (rare): principle-level abstraction that survives version changes. Quality is defined as definitional precision.
Our version_refs signal rewards articles that name specific versions. For an Ephemeral article, this is correct because specificity reduces ambiguity. For a Foundational article, version references indicate the wrong abstraction level entirely. The same signal, opposite semantics, zero differentiation in our system.
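One way to picture "same signal, opposite semantics" is a per-class multiplier on `version_refs`. This is only an illustration of the argument, not a proposed change: the multiplier values are hypothetical, and the neutral value for Durable is a placeholder because the text above does not address that case.

```python
from enum import Enum


class ContentClass(Enum):
    FOUNDATIONAL = "Foundational"
    DURABLE = "Durable"
    EPHEMERAL = "Ephemeral"


# Hypothetical multipliers. Only the sign matters for the argument.
VERSION_REFS_MULTIPLIER = {
    ContentClass.EPHEMERAL: 1.0,      # specificity reduces ambiguity
    ContentClass.DURABLE: 0.0,        # placeholder: not addressed in the text
    ContentClass.FOUNDATIONAL: -1.0,  # version pins signal the wrong abstraction level
}


def version_refs_contribution(content_class: ContentClass, count: int) -> float:
    """Interpret a raw version_refs count in light of the content class."""
    return VERSION_REFS_MULTIPLIER[content_class] * count
```

A class-blind scorer corresponds to using the Ephemeral multiplier for every article, which is the lack of differentiation described above.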
The Temporal Honesty Gap
The framework defines a Temporal Honesty Protocol: every claim is either timeless or explicitly time-bound. Temporal ambiguity is a structural defect.
Our version_refs signal counts version numbers. It does not check whether they are temporally anchored. "nginx 1.27" scores the same as "nginx 1.27 (as of April 2026, current stable)". The second is structurally more honest because it declares its own obsolescence mechanism.
A new signal: temporal_markers. Count of explicit time-binding statements:
import re

# Count explicit time-binding phrases in the article body (case-insensitive).
temporal_markers = len(re.findall(
    r"\b(?:as of|at the time of writing|since version|until version|"
    r"deprecated in|introduced in|updated in|current as of|"
    r"checked on|tested on|last updated)\b",
    body, re.IGNORECASE
))
For Ephemeral content, temporal_markers is a primary signal. Naming a specific version without anchoring it in time is a structural defect, not a quality indicator.
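Wrapped in a function, the same pattern separates the two nginx examples from earlier in this section. The function name `count_temporal_markers` is an assumption for the sketch. The regex is the one given above.

```python
import re

TEMPORAL_PATTERN = re.compile(
    r"\b(?:as of|at the time of writing|since version|until version|"
    r"deprecated in|introduced in|updated in|current as of|"
    r"checked on|tested on|last updated)\b",
    re.IGNORECASE,
)


def count_temporal_markers(body: str) -> int:
    """Number of explicit time-binding phrases in the text."""
    return len(TEMPORAL_PATTERN.findall(body))


bare = "nginx 1.27"
anchored = "nginx 1.27 (as of April 2026, current stable)"

print(count_temporal_markers(bare), count_temporal_markers(anchored))
```

Both strings contain one version reference, but only the second one carries a temporal marker.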
What the Framework Gets Wrong About Scores
The framework argues for no scores, only trend vectors: DeltaCorrections, DeltaRetractions, DeltaConceptDrift over time. This is philosophically correct and operationally impossible for a static content pipeline.
We do not have time-series data per article. We do not track retraction rates. We cannot implement an Epistemic Stability Index without infrastructure that does not exist.
The framework’s Quality Adjacency function:
Quality_Adjacency = SIG x ZCS x CSM x TSD
This requires Zap history (ZCS), change tracking (CSM), and days-on-shelf (TSD). We have none of these at scoring time. We have one pass of the article body at publish time.
This is not a failure of the framework. It is a constraint on what a static single-pass pipeline can compute. The score we generate is a structural integrity proxy, not a quality adjacency function. Naming it correctly matters.
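The constraint can be made concrete. In the sketch below the four factors are modeled as optional numbers. Their actual scales and definitions are not given here, so the types and the `AdjacencyInputs` container are assumptions. The only point is that a product of four factors is undefined when three of them are unavailable.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdjacencyInputs:
    """Hypothetical container; None means 'not available at scoring time'."""
    sig: Optional[float] = None
    zcs: Optional[float] = None
    csm: Optional[float] = None
    tsd: Optional[float] = None


def quality_adjacency(inputs: AdjacencyInputs) -> Optional[float]:
    """SIG x ZCS x CSM x TSD, or None if any factor is missing."""
    factors = (inputs.sig, inputs.zcs, inputs.csm, inputs.tsd)
    if any(f is None for f in factors):
        return None
    return math.prod(factors)


# What a single-pass pipeline has at publish time: the gate result only.
single_pass = AdjacencyInputs(sig=1.0)
```

Calling `quality_adjacency(single_pass)` returns `None`, which is the honest answer for a pipeline that sees the article body once.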
Three Changes That Follow
1. Add ContentClass to frontmatter. Auto-detect by slug prefix:
- fixes/ and setup/ articles: Ephemeral
- strategy/ and services/ articles: Durable
- Override manually for Foundational pieces
2. Add temporal_markers as a signal, weighted by inferred ContentClass:
- Ephemeral articles: weight x5, feedback hint when zero
- Durable articles: weight x2
- Foundational articles: weight x0 (version anchoring is a defect at this abstraction level)
3. Rename the score in UI. It is not a quality score. It is a Structural Integrity Index: a proxy for the probability that the content is not structurally defective. The insights page label changes. The frontmatter field quality.score stays (too many downstream dependencies to refactor), but what it represents is now named correctly.
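Changes 1 and 2 can be sketched together. The weights below are the ones listed above. Everything else is an assumption: the function names, the hint text, accepting both "/" and "-" after the slug prefix, and returning `None` for slugs that match neither rule.

```python
import re
from typing import Optional

# Weights from the list above.
TEMPORAL_WEIGHTS = {"Ephemeral": 5, "Durable": 2, "Foundational": 0}

# Assumption: the class is encoded in the first slug segment.
_EPHEMERAL = re.compile(r"^(?:fixes|setup)[/-]")
_DURABLE = re.compile(r"^(?:strategy|services)[/-]")


def infer_content_class(slug: str, override: Optional[str] = None) -> Optional[str]:
    """Manual override wins; otherwise classify by slug prefix."""
    if override is not None:
        return override
    if _EPHEMERAL.match(slug):
        return "Ephemeral"
    if _DURABLE.match(slug):
        return "Durable"
    return None  # unclassified: left to the caller


def weighted_temporal_markers(content_class: str, count: int) -> tuple[int, Optional[str]]:
    """Return the weighted signal and, for Ephemeral articles with no markers, a hint."""
    weighted = TEMPORAL_WEIGHTS[content_class] * count
    hint = None
    if content_class == "Ephemeral" and count == 0:
        hint = "Anchor version references in time, e.g. 'as of <month year>'."
    return weighted, hint
```

Note that the hint asks for a structural element (a time anchor) rather than a keyword count, in line with the retry-prompt mitigation discussed earlier.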
What Remains Unresolved
The Goodhart problem has no clean solution within a generative pipeline. Mistral will always be told what signals matter and will optimize for them. The only protection is that structural signals require actual structural decisions that connective injection cannot fake. defined_terms requires a definition. step_sequences requires ordering. temporal_markers requires an explicit time anchor.
Trend tracking is not implemented. The framework is correct that a snapshot score is less informative than a trend over revisions. Adding revision history to the pipeline would require git tracking of article changes: possible, not prioritized as of April 2026.
The score as currently computed is honest about what it is: a structural integrity gate for a single-pass generative pipeline. It does not claim to measure quality. It filters for the absence of structural failure conditions. Within that scope, the architecture holds.
Two limitations sit permanently outside the scope of this system. First, the framework can’t evaluate argument validity. A structurally correct article can carry a wrong conclusion. The signals measure shape; they are blind to whether the reasoning holds under scrutiny. Nothing suggests that regex at publish time can solve that. Second, the score is a single-document view. It doesn’t account for the body of work: whether this article repeats ground covered in three earlier pieces, whether it contradicts a claim made last month, whether the terminology is consistent across the whole site. Those gaps require cross-article analysis that the current pipeline doesn’t attempt.
What the May 2026 gate-hardening changed
The original framework evaluation predicted three weak spots. Two of those have since been closed; one remains.
Closed: the “structurally fine but substantively thin” failure mode. The score gate alone passed 485-word Mistral outputs as long as code-block density was high. Adding a 1200-word floor (default) plus per-style overrides (fixes-style at 800, where conciseness is genuinely a virtue) cut the false-pass rate to near zero on the May 2026 audit pass.
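A word floor with per-style overrides needs very little code. The sketch below uses the two numbers given above (1200 by default, 800 for fixes-style). The whitespace-based word count, the dictionary layout, and the `"strategy"` style key used in the usage note are assumptions.

```python
DEFAULT_WORD_FLOOR = 1200
WORD_FLOOR_OVERRIDES = {"fixes": 800}  # styles where conciseness is a virtue


def meets_word_floor(body: str, style: str) -> bool:
    """True if the body has at least as many words as its style requires."""
    floor = WORD_FLOOR_OVERRIDES.get(style, DEFAULT_WORD_FLOOR)
    return len(body.split()) >= floor


# Synthetic bodies of known length, standing in for real drafts.
thin_output = " ".join(["word"] * 485)
concise_fix = " ".join(["word"] * 850)
```

Here `thin_output` fails under either floor, while `concise_fix` passes as a fixes-style article and would fail as a `"strategy"` one.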
Closed: the unverified-version-pin failure mode. The original framework discussion noted that scores treat “Mistral-Small-4 v99.9.9” the same as a real version pin. The factcheck-gate added in deploy.sh now pings Docker Hub, PyPI, and npm registries before publish and refuses to ship articles with references that do not exist. Net result: no live article on this blog contains a hallucinated docker tag or PyPI version, by construction.
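As a rough illustration of the literal-existence idea, here is a Python sketch for the PyPI case only. It is not the gate in deploy.sh: the function names are hypothetical, and the existence check is passed in as a parameter so the logic can be exercised without network access. The URL shape is PyPI's JSON API for a specific release.

```python
import urllib.request
from typing import Callable, Iterable


def pypi_release_url(package: str, version: str) -> str:
    """PyPI JSON API endpoint for one specific release."""
    return f"https://pypi.org/pypi/{package}/{version}/json"


def url_exists(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers 200; network and HTTP errors count as 'not verified'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


def unverified_pins(
    pins: Iterable[tuple[str, str]],
    exists: Callable[[str], bool] = url_exists,
) -> list[tuple[str, str]]:
    """Return the (package, version) pairs that could not be confirmed."""
    return [(pkg, ver) for pkg, ver in pins if not exists(pypi_release_url(pkg, ver))]
```

A publish step built on this would refuse to ship whenever `unverified_pins(...)` returns a non-empty list. Treating network errors as "not verified" fails closed, which matches the refuse-to-ship behavior described above.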
Still open: the “real ingredient, misleading narrative” failure mode. A real Docker image used in a misleading context still passes both gates. The score is shape, the factcheck is literal-existence; neither addresses argument validity. That remains a human-attestation problem and will likely stay that way until a separate rebuttal-pass tool exists.
Implementation status as of mid-May 2026
Of the three recommendations in the original section above, two shipped into the pipeline in full and one in part.
1. ContentClass in frontmatter: implemented. content_class is auto-set during scoring based on slug prefix: fixes-* and setup-* resolve to Ephemeral, strategy-* and services-* to Durable. The field appears in every article’s quality block. No Foundational articles exist yet by design.
2. temporal_markers signal: implemented. Added to the signal stack using the regex from the original draft. The current article’s own quality block records 14 temporal markers. Weight is higher for Ephemeral content where “as of
3. UI renaming: partially done. The /insights/ page still labels the column “Score” for downstream consistency (frontmatter field quality.score is referenced by too many places to refactor cleanly). The score-tier explainer block now frames the number as “a linter for shape, not a quality claim”, and links back to this article as the deep-dive companion. That covers the framing concern; the literal column header stays.
Three other gate changes shipped between then and now, not predicted in the original article:
Stylometry catalogue expanded. sovereign-kb/mistral-overuse-phrases.md grew from 13 phrases to a larger set targeting AI-output tells: the “essentially, fundamentally, ultimately, notably, interestingly, crucially, remarkably” adverb cluster, plus aphorism-couplet patterns like “pick your currency”, “plans are theater”, “is paid in time”, “lazy in the right places”. Curly quotes get normalized to ASCII before phrase-matching to catch the UTF-8 drift that otherwise slipped past the gate.
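The normalization step matters because a catalogue phrase typed with a straight apostrophe will not match the same phrase rendered with a curly one. A minimal sketch, assuming simple case-insensitive substring matching. The phrase "here's the thing" is a hypothetical catalogue entry chosen because it contains an apostrophe.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # single curly quotes
    "\u201c": '"', "\u201d": '"',  # double curly quotes
})


def normalize_quotes(text: str) -> str:
    return text.translate(QUOTE_MAP)


def phrase_hits(body: str, phrases: list[str]) -> list[str]:
    """Catalogue phrases found in the body after quote normalization."""
    haystack = normalize_quotes(body).lower()
    return [p for p in phrases if normalize_quotes(p).lower() in haystack]


catalogue = ["here's the thing", "plans are theater", "pick your currency"]
body = "Here\u2019s the thing: plans are theater."
```

Without `normalize_quotes`, the first phrase would slip through because the body uses U+2019 where the catalogue uses an ASCII apostrophe.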
Burstiness check added. A _check_burstiness() signal warns when sentence-length standard deviation drops below 4.0, because uniformly-paced sentences read flat in TTS rendering. Surfaced originally for the podcast pipeline, the warning now runs on the blog gate too.
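A burstiness check of this kind can be sketched as follows. The 4.0 threshold comes from the description above. Measuring sentence length in words, splitting sentences on terminal punctuation, using the sample standard deviation, and the function name are all assumptions, not a description of `_check_burstiness()` itself.

```python
import re
import statistics
from typing import Optional

BURSTINESS_FLOOR = 4.0  # threshold from the text; unit assumed to be words


def sentence_lengths(body: str) -> list[int]:
    """Word counts per sentence, using a naive split on ., ! and ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness_warning(body: str, floor: float = BURSTINESS_FLOOR) -> Optional[str]:
    """Return a warning string when sentence lengths are too uniform, else None."""
    lengths = sentence_lengths(body)
    if len(lengths) < 2:
        return None  # not enough sentences to measure spread
    spread = statistics.stdev(lengths)
    if spread < floor:
        return f"low burstiness: sentence-length stdev {spread:.2f} < {floor}"
    return None


flat = "One two three four. Five six seven eight. Nine ten eleven twelve."
varied = "Short. " + " ".join(["word"] * 20) + ". Mid length sentence here now."
```

The `flat` sample has three four-word sentences and triggers the warning. The `varied` sample mixes a one-word sentence with a twenty-word one and does not.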
Quality-signals self-heal in deploy. The deploy pipeline runs update_blog_from_gitea.py --rescore-all as Phase 2 before rsync to Floki. Regex-only, no Mistral round-trip, idempotent across 77 articles in about two seconds. Hand-written articles that landed without a quality: frontmatter block, and thus appeared “gated” on /insights/, now get scored automatically on every deploy.
The score-as-shape-not-quality framing was the load-bearing correction this article made. Everything since has been incremental: more phrase patterns, more self-healing, more honest UI framing. The three open questions identified above (Goodhart vulnerability, no trend tracking, real-ingredient-misleading-narrative failure) all remain open.