Content Quality in the AI Age: Where Our Scoring System Is Right, Wrong, and Missing
We built a quality scoring system for AI-generated blog content. Weighted signals, style-aware gates, epistemic markers. Then we ran that system through a rigorous philosophical critique.
Three things confirmed, three things exposed, one fix that matters.
What the Framework Gets Right About Us
The critique opens with the same sentence we wrote into our own pipeline six weeks ago:
We do not measure quality. We measure the reduction of conditions that prevent quality.
This is not a coincidence. It reflects the only coherent position available when you accept that quality itself is not directly observable. You cannot measure understanding. You can measure the absence of ambiguity. You cannot measure trust. You can measure terminological consistency.
Our 16-signal architecture already operates on this principle. defined_terms, why_answers, step_sequences are not quality measurements. They are absence-of-failure signals. The framework validates this framing precisely.
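A minimal sketch of what such absence-of-failure counters can look like in practice; the regexes below are illustrative, not our production signal definitions:

import re

def absence_of_failure_signals(body: str) -> dict:
    # Illustrative counters only: each detects the presence of a structure
    # whose absence would be a failure condition.
    return {
        # Explicit definitions reduce terminological ambiguity.
        "defined_terms": len(re.findall(r"\b(?:is defined as|refers to|means that)\b", body, re.IGNORECASE)),
        # Causal connectives approximate answered "why" questions.
        "why_answers": len(re.findall(r"\b(?:because|therefore|which means)\b", body, re.IGNORECASE)),
        # Ordinal markers approximate an explicit step sequence.
        "step_sequences": len(re.findall(r"\b(?:first|then|next|finally|step \d)\b", body, re.IGNORECASE)),
    }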
Gate architecture is correct. The framework proposes a Structural Integrity Gate (SIG): binary, collapses everything else to zero if violated. Our min_score gate does exactly this: below threshold, the article is rejected and Mistral retries. Not penalized. Rejected. This is the right architecture. An additive score that allows bad structure to be compensated by word count is not a quality gate. It is a performance reward.
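The difference between a gate and an additive reward fits in a few lines. A sketch, with a hypothetical threshold standing in for our configured min_score:

MIN_SCORE = 0.6  # hypothetical value; the real threshold lives in pipeline config

def evaluate(score: float) -> str:
    # Gate semantics: below the threshold the article is rejected outright
    # and regenerated. No secondary signal can buy it back above the line.
    if score < MIN_SCORE:
        return "reject_and_retry"
    return "accept"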
Anti-KPI protocol is correct. Click-through rate, scroll depth, social engagement, early traffic: explicitly excluded. We never measured these. The framework’s formal exclusion list matches our implicit assumptions. We can now make those assumptions explicit in the docs.
What the Framework Exposes
The Goodhart Problem Is Architectural
Goodhart’s Law: once a metric becomes a target, it ceases to be a good metric. Our system has a specific Goodhart vulnerability that most scoring systems do not. Mistral generates the content AND is graded by the same signals it was told about.
The feedback loop in our pipeline sends Mistral this instruction when a draft fails the quality gate:
"Add more why_answers: explain WHY, not just WHAT. Use 'because', 'therefore'..."
Mistral responds by inserting 'therefore' and 'because' into the next draft. The signal count goes up. The article passes. The reasoning did not improve. It was papered over with causal connectives.
This is the core tension we cannot fully resolve without changing the generation model. Partial mitigations:
- Use defined_terms and step_sequences as primary gates (harder to fake than connective injection)
- Limit retry prompts to structural requirements rather than keyword hints (sketched after this list)
- Treat the score as internal pipeline health, not a quality claim
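One way to implement the second mitigation: assemble the retry prompt from structural requirements only, never naming the keywords the scorer counts. A sketch with hypothetical signal names and messages:

# Hypothetical mapping from failed structural signals to retry instructions.
# Deliberately contains no keyword hints like "use 'because' or 'therefore'".
STRUCTURAL_HINTS = {
    "defined_terms": "Define every domain term the first time it appears.",
    "step_sequences": "Present the procedure as explicitly ordered steps.",
    "why_answers": "Explain the reasoning behind each recommendation, not only what to do.",
}

def build_retry_prompt(failed_signals: list[str]) -> str:
    hints = [STRUCTURAL_HINTS[s] for s in failed_signals if s in STRUCTURAL_HINTS]
    return "Revise the draft. Requirements:\n- " + "\n- ".join(hints)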
ContentClass Is Missing
The framework introduces a concept our system lacks entirely:
ContentClass { Foundational | Durable | Ephemeral }
This matters because our scoring treats a setup tutorial about Docker Compose v2.24.7 the same as a strategic analysis of AI infrastructure economics. These are not the same content type. Quality for ephemeral content means something different:
- Ephemeral (fixes/, setup/ articles): version markers, explicit dependencies, deprecation signals. Quality is defined as minimal update cost, not longevity.
- Durable (strategy/, services/ articles): argument completeness, defined terms, structural consistency. Quality is defined as resistance to concept drift.
- Foundational (rare): principle-level abstraction that survives version changes. Quality is defined as definitional precision.
Our version_refs signal rewards articles that name specific versions. For an Ephemeral article, this is correct because specificity reduces ambiguity. For a Foundational article, version references indicate the wrong abstraction level entirely. The same signal, opposite semantics, zero differentiation in our system.
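Expressed as data, the missing concept is small. A sketch of a ContentClass and a class-dependent weight for version_refs; the weight values are illustrative, chosen only to show the sign flip:

from enum import Enum

class ContentClass(Enum):
    FOUNDATIONAL = "foundational"
    DURABLE = "durable"
    EPHEMERAL = "ephemeral"

# Illustrative weights: the same version_refs count helps Ephemeral content,
# is roughly neutral for Durable content, and works against Foundational content.
VERSION_REFS_WEIGHT = {
    ContentClass.EPHEMERAL: 3.0,
    ContentClass.DURABLE: 1.0,
    ContentClass.FOUNDATIONAL: -1.0,
}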
The Temporal Honesty Gap
The framework defines a Temporal Honesty Protocol: every claim is either timeless or explicitly time-bound. Temporal ambiguity is a structural defect.
Our version_refs signal counts version numbers. It does not check whether they are temporally anchored. "nginx 1.27" scores the same as "nginx 1.27 (as of April 2026, current stable)". The second is structurally more honest because it declares its own obsolescence mechanism.
A new signal: temporal_markers. Count of explicit time-binding statements:
import re

# Count explicit time-binding statements in the article body.
temporal_markers = len(re.findall(
    r"\b(?:as of|at the time of writing|since version|until version|"
    r"deprecated in|introduced in|updated in|current as of|"
    r"checked on|tested on|last updated)\b",
    body, re.IGNORECASE
))
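A quick sanity check of the pattern against the two nginx phrasings above; the assertions are illustrative, not part of the pipeline:

import re

PATTERN = (r"\b(?:as of|at the time of writing|since version|until version|"
           r"deprecated in|introduced in|updated in|current as of|"
           r"checked on|tested on|last updated)\b")

assert len(re.findall(PATTERN, "nginx 1.27", re.IGNORECASE)) == 0
assert len(re.findall(PATTERN, "nginx 1.27 (as of April 2026, current stable)", re.IGNORECASE)) == 1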
For Ephemeral content, temporal_markers is a primary signal. Naming a specific version without anchoring it in time is a structural defect, not a quality indicator.
What the Framework Gets Wrong About Scores
The framework argues for no scores, only trend vectors: ΔCorrections, ΔRetractions, ΔConceptDrift over time. This is philosophically correct and operationally impossible for a static content pipeline.
We do not have time-series data per article. We do not track retraction rates. We cannot implement an Epistemic Stability Index without infrastructure that does not exist.
The framework’s Quality Adjacency function:
Quality_Adjacency = SIG × ZCS × CSM × TSD
This requires Zap history (ZCS), change tracking (CSM), and days-on-shelf (TSD). We have none of these at scoring time. We have one pass of the article body at publish time.
This is not a failure of the framework. It is a constraint on what a static single-pass pipeline can compute. The score we generate is a structural integrity proxy, not a quality adjacency function. Naming it correctly matters.
Three Changes That Follow
1. Add ContentClass to frontmatter. Auto-detect by slug prefix (detection and weighting sketched after this list):
- fixes/ and setup/ articles: Ephemeral
- strategy/ and services/ articles: Durable
- Override manually for Foundational pieces
2. Add temporal_markers as a signal, weighted by inferred ContentClass:
- Ephemeral articles: weight x5, feedback hint when zero
- Durable articles: weight x2
- Foundational articles: weight x0 (version anchoring is a defect at this abstraction level)
3. Rename the score in UI. It is not a quality score. It is a Structural Integrity Index: a proxy for the probability that the content is not structurally defective. The insights page label changes. The frontmatter field quality.score stays (too many downstream dependencies to refactor), but what it represents is now named correctly.
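A sketch of changes 1 and 2 together, assuming slugs like fixes/docker-compose-healthcheck; the function and variable names are placeholders, and the weights come from the list above:

def infer_content_class(slug: str) -> str:
    # Auto-detect ContentClass from the slug prefix; Foundational stays a manual override.
    if slug.startswith(("fixes/", "setup/")):
        return "ephemeral"
    if slug.startswith(("strategy/", "services/")):
        return "durable"
    return "durable"  # conservative default until classified by hand

# temporal_markers weight per ContentClass, as proposed in change 2.
TEMPORAL_MARKERS_WEIGHT = {"ephemeral": 5, "durable": 2, "foundational": 0}

def weighted_temporal_markers(raw_count: int, slug: str, override: str | None = None) -> int:
    content_class = override or infer_content_class(slug)
    return raw_count * TEMPORAL_MARKERS_WEIGHT[content_class]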
What Remains Unresolved
The Goodhart problem has no clean solution within a generative pipeline. Mistral will always be told what signals matter and will optimize for them. The only protection is that structural signals require actual structural decisions that connective injection cannot fake. defined_terms requires a definition. step_sequences requires ordering. temporal_markers requires an explicit time anchor.
Trend tracking is not implemented. The framework is correct that a snapshot score is less informative than a trend over revisions. Adding revision history to the pipeline would require git tracking of article changes: possible, not prioritized as of April 2026.
The score as currently computed is honest about what it is: a structural integrity gate for a single-pass generative pipeline. It does not claim to measure quality. It filters for the absence of structural failure conditions. Within that scope, the architecture holds.