How we fixed loudness pumping, markup stripping, and dialogue rhythm in a self-hosted podcast pipeline

Voxtral Podcast Audio: Mono 24 kHz Baseline and Three Compression Pitfalls

Last week, the 20-minute episode came back from Voxtral sounding like a broken radio with a volume knob stuck between stations.

Quick Take

  • Loudness pumping ruined 19 seconds of dialogue at 5:46
  • Backticks in scripts were read aloud as literal text
  • Dialogue had zero reactive turns and repeated phrases
  • Single-pass loudnorm caused dynamic gain swings
  • ARM64 MP3 encoding silently dropped to 75 kbps instead of the intended 192 kbps

The Backtick Bug That Made Voxtral Read Aloud

The script for turn 26 (007_hexabella.wav) contained the phrase `config.json`. Voxtral read it as “backtick config dot json backtick,” which triggered the quality gate at 5:46 because the region was marked as invalid.

The root cause is simple: clean_markup() stripped asterisks, underscores, and parentheses, but left backticks untouched. The quality gate had no rule for backticks, so the malformed text slipped through.

Fixing it required two changes:

  1. Add a regex to strip backticks in generate_script.py:
_CLEAN_BACKTICK = re.compile(r'`([^`]+)`')
cleaned = _CLEAN_BACKTICK.sub(r'\1', text)
  2. Add a new check in quality_gate.py:
_BACKTICK_RE = re.compile(r'`[^`]+`')
if _BACKTICK_RE.search(text):
    raise QualityGateError("backtick markup detected")

Three new tests (test_clean_markup_backtick_*) and two gate tests (test_forbidden_markup_backtick_*) now catch this before Voxtral ever sees it.
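
Put together, the two changes can be exercised in isolation. The following is a minimal sketch, assuming the two regexes shown above; the function names strip_backticks and check_no_backticks and the QualityGateError definition are hypothetical stand-ins, not the actual code in generate_script.py or quality_gate.py.

```python
import re

# Same patterns as in the fix above: a pair of backticks around non-backtick text.
_CLEAN_BACKTICK = re.compile(r"`([^`]+)`")
_BACKTICK_RE = re.compile(r"`[^`]+`")


class QualityGateError(Exception):
    """Hypothetical definition: raised when forbidden markup survives cleaning."""


def strip_backticks(text: str) -> str:
    """Remove paired backticks but keep the text between them."""
    return _CLEAN_BACKTICK.sub(r"\1", text)


def check_no_backticks(text: str) -> None:
    """Fail loudly if any backtick-wrapped span is still present."""
    if _BACKTICK_RE.search(text):
        raise QualityGateError("backtick markup detected")


line = "Open `config.json` and set the port."
cleaned = strip_backticks(line)
check_no_backticks(cleaned)  # passes silently once the backticks are gone
```

Both patterns only match paired backticks, so a lone, unpaired backtick would pass the cleaner and the gate unchanged. Whether that case needs its own rule depends on what the script generator actually emits.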


Why Single-Pass Loudnorm Pumps Audio

Last week's episode failed because the TTS turns alternated between reactive (2 to 15 words) and substantive (40 to 100 words). The single-pass loudnorm measured loudness and applied dynamic gain in real time, which meant the gain chased the changing loudness and produced audible pumping.

The fix is a two-pass loudnorm:

Pass 1 measures the target values:

ffmpeg -i raw.wav \
  -af "loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json" \
  -f null -

It prints JSON like:

{
  "input_i": "-23.90",
  "input_tp": "-1.30",
  "input_lra": "16.60",
  "input_thresh": "-33.90",
  "target_offset": "-0.00"
}

Pass 2 applies the measured values with a post-limiter:

ffmpeg -i raw.wav \
  -af "highpass=f=80,
       loudnorm=I=-16:TP=-1.5:LRA=11
         :measured_I=-23.90:measured_TP=-1.30:measured_LRA=16.60
         :measured_thresh=-33.90:offset=-0.00,
       alimiter=limit=0.891:level=false" \
  processed.wav

The alimiter sits after loudnorm to cap any overshoot without interfering with the gain calculation.
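
Copying the measured values by hand gets old quickly. Here is a minimal sketch of how the hand-off between the two passes could be automated, assuming the pass-1 stderr has already been captured as text; parse_loudnorm_json and build_pass2_filter are hypothetical helper names, not functions from the pipeline.

```python
import json


def parse_loudnorm_json(stderr_text: str) -> dict:
    """Extract the JSON object that loudnorm prints at the end of stderr.

    Assumes the object is flat (no nested braces), as in the sample above.
    """
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])


def build_pass2_filter(m: dict, target_i=-16.0, target_tp=-1.5, target_lra=11.0) -> str:
    """Build the -af value for pass 2 from the pass-1 measurements."""
    loudnorm = (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}"
    )
    # Same chain as above: high-pass first, limiter last.
    return f"highpass=f=80,{loudnorm},alimiter=limit=0.891:level=false"
```

The returned string can be passed as the single argument after -af, for example through subprocess.run with an argument list, which avoids shell quoting issues entirely.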


Dialogue Rhythm: Zero Reactive Turns

The episode contained 44 turns, all between 31 and 104 words, and no reactive turns. The root cause was an old system prompt that did not enforce reactive dialogue patterns.

The fix introduces three new rules:

  1. System prompt now includes:
_dialog_rhythm_block():
  reaktive_turns >= 35% of total
  content_driven = True
  2. Turn-count pressure in the prompt builder:
build_part{1,2}_prompt():
  turns_per_half >= 28
  reaktive_turns_per_half >= 9
  3. Cross-reaction patterns for speaker personas:
HEXABELLA:
  prefix = "wait, "
  max_words = 10
CIPHERFOX:
  prefix = "pushback"
  max_words = 10

A new naturalizer can insert reactive turns before substantive turns longer than 80 words. The style file styles/deep_dive.yaml now enforces:

avg_words_per_turn: 22
min_reactive_ratio: 0.35
min_turns_per_half: 28

These changes apply starting with the next episode.
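
The thresholds are easy to check mechanically before a script goes to TTS. The sketch below shows one way to do it, assuming a turn counts as reactive when it has at most 15 words (the upper end of the reactive range mentioned earlier); rhythm_stats and rhythm_violations are hypothetical names, and the input is one half of the episode as a list of turn strings.

```python
def rhythm_stats(turns, reactive_max_words=15):
    """Return turn count, average words per turn, and reactive-turn ratio."""
    counts = [len(turn.split()) for turn in turns]
    if not counts:
        return {"turns": 0, "avg_words_per_turn": 0.0, "reactive_ratio": 0.0}
    reactive = sum(1 for c in counts if c <= reactive_max_words)
    return {
        "turns": len(counts),
        "avg_words_per_turn": sum(counts) / len(counts),
        "reactive_ratio": reactive / len(counts),
    }


def rhythm_violations(turns, min_turns=28, min_reactive_ratio=0.35):
    """List human-readable problems for one half of the episode."""
    stats = rhythm_stats(turns)
    problems = []
    if stats["turns"] < min_turns:
        problems.append(f"only {stats['turns']} turns, need at least {min_turns}")
    if stats["reactive_ratio"] < min_reactive_ratio:
        problems.append(
            f"reactive ratio {stats['reactive_ratio']:.2f} is below {min_reactive_ratio}"
        )
    return problems
```

Run against a list of 44 long turns like the broken episode, this reports a reactive ratio of 0.00, which is exactly the failure described above.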


Studio Pipeline Refactor: mix_audio.py

The symptom was an episode LRA of 16.6 dB, which ruled out linear loudnorm: the +7.9 dB of gain needed to go from -23.9 to -16 LUFS would push the measured true peak of -1.30 dBTP far past the -1.5 dBTP ceiling. The ARM64 build also produced 75 kbps MP3 instead of the intended 192 kbps.
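
The arithmetic behind that symptom fits in a few lines. This is a rough sketch of the two conditions as I understand them from the ffmpeg loudnorm documentation (source LRA must not exceed the target LRA, and the gained-up true peak must not exceed the target TP); linear_loudnorm_feasible is a hypothetical helper, not part of mix_audio.py.

```python
def linear_loudnorm_feasible(measured_i, measured_tp, measured_lra,
                             target_i=-16.0, target_tp=-1.5, target_lra=11.0):
    """Return (gain_db, projected_tp, feasible) for a linear loudnorm pass."""
    gain_db = target_i - measured_i        # static gain needed to hit the target
    projected_tp = measured_tp + gain_db   # where the true peak would land
    lra_ok = measured_lra <= target_lra
    tp_ok = projected_tp <= target_tp
    return gain_db, projected_tp, lra_ok and tp_ok


# Values from the pass-1 measurement of the broken episode.
gain_db, projected_tp, feasible = linear_loudnorm_feasible(-23.90, -1.30, 16.60)
print(f"gain {gain_db:+.1f} dB, projected true peak {projected_tp:+.1f} dBTP, linear ok: {feasible}")
```

For the broken episode both conditions fail, so loudnorm has no choice but to fall back to dynamic mode.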

The four root causes were:

  1. No per-block normalization → high LRA blocked linear loudnorm
  2. amix normalize=1 halved both inputs → -6 dB voice loss
  3. -q:a 2 -b:a 192k conflict on ARM64 → actual bitrate 75 kbps
  4. Static volume duck for intro music → music overrode voice starts

The refactor splits the episode into voice blocks and transition pass-throughs, then normalizes each block independently:

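import subprocess

def normalize_block(block_path, out_path="block_normalized.wav"):
    # High-pass at 80 Hz, normalize the block to -16 LUFS, then limit peaks.
    filters = ("highpass=f=80,"
               "loudnorm=I=-16:TP=-1.5:LRA=11,"
               "alimiter=limit=0.891:level=false")
    subprocess.run(["ffmpeg", "-i", block_path, "-af", filters, out_path],
                   check=True)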

Sidechain ducking for the intro uses:

[1:a]atrim=start=0:end=3,afade=t=0:d=0.5,volume=1[m_base];
[0:a]asplit=2[v_hear][v_trig];
[m_base][v_trig]sidechaincompress=threshold=0.02:ratio=4:attack=200:release=800[m_duck];
[v_hear][m_duck]amix=inputs=2:duration=longest:normalize=0,alimiter[out]

Setting normalize=0 prevents the ffmpeg default -6 dB summing loss, and the sidechain automatically ducks the music when voice is active.
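
The linear values in these filter strings are easier to reason about in decibels. A small sketch of the conversion follows; amplitude_to_db is a hypothetical helper name, and the three inputs are the factors that appear in this post.

```python
import math


def amplitude_to_db(ratio: float) -> float:
    """Convert a linear amplitude ratio to decibels."""
    return 20.0 * math.log10(ratio)


# amix with normalize=1 and two inputs scales each input by 1/2.
amix_loss_db = amplitude_to_db(0.5)
# alimiter limit used after loudnorm.
limiter_ceiling_db = amplitude_to_db(0.891)
# sidechaincompress threshold at which the music starts to duck.
duck_threshold_db = amplitude_to_db(0.02)

print(round(amix_loss_db, 2), round(limiter_ceiling_db, 2), round(duck_threshold_db, 2))
```

That works out to roughly -6 dB for the amix loss, about -1 dB for the limiter ceiling, and about -34 dB for the ducking threshold, so even quiet speech is enough to trigger the duck.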

The global two-pass step then selects linear or dynamic loudnorm automatically, based on the measured LRA. CBR encoding is enforced with:

ffmpeg -i processed.wav \
  -codec:a libmp3lame -b:a 192k \
  -ar 44100 \
  final.mp3
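
To keep the -q:a / -b:a conflict from creeping back in, the encode arguments can be built in one place and checked before ffmpeg runs. A minimal sketch, assuming the command is assembled in Python; build_mp3_command and has_quality_bitrate_conflict are hypothetical names.

```python
def build_mp3_command(src, dst, bitrate_kbps=192, sample_rate=44100):
    """Build the ffmpeg argument list for a constant-bitrate MP3 encode."""
    return [
        "ffmpeg", "-i", src,
        "-codec:a", "libmp3lame",
        "-b:a", f"{bitrate_kbps}k",  # bitrate only, no VBR quality flag
        "-ar", str(sample_rate),
        dst,
    ]


def has_quality_bitrate_conflict(args):
    """True if both a VBR quality flag and a fixed bitrate are requested."""
    return "-q:a" in args and "-b:a" in args


cmd = build_mp3_command("processed.wav", "final.mp3")
assert not has_quality_bitrate_conflict(cmd)
```

The list can be handed to subprocess.run(cmd, check=True) as is. The guard is deliberately simple and only catches the exact flag spellings used in this post.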

Resulting metrics for the fixed episode:


What I Actually Use

  • Mistral Small 4: the model that reads the cleaned scripts without backticks
  • ffmpeg 6.1: the only tool that handles sidechain ducking and loudnorm in one pipeline
  • DGX Spark ARM64: the hardware that finally encodes MP3 at the promised bitrate

Why Mono 24 kHz Is the Right Baseline for Voxtral Output

Two formats kept appearing in the early debugging output: 16-bit PCM mono at 24 kHz, and 32-bit float stereo at 48 kHz. The first is what Voxtral actually emits; the second is what FFmpeg upsampled to before the pipeline was tightened. The upsample happened silently, was lossless on the first hop, and added ~3x the file size with zero perceptual gain. After pinning the output format to mono 24 kHz, the per-episode storage dropped from ~12 MB to ~4 MB, and the upload step over a slow connection stopped being the bottleneck.

The expressivity fixes from the v1-v3 prompt-rule series compound on top of this: cleaner audio, stricter prompt discipline, and the audience-pivot persona (HEXABELLA listener-proxy block) together mean that each minute of generated audio sits at roughly the same quality threshold a human podcaster would hit on a USB condenser mic in a quiet room. Not studio-grade, not embarrassing.