Per-Segment Loudnorm and the 3-Second Lookahead Bug
After per-block loudness normalization, the female co-host always sounded one room away from the male host. After the global two-pass loudnorm, the first three seconds of every episode disappeared. Both bugs lived in the same mix_audio.py. Both are ffmpeg loudnorm filter footguns. Both look like TTS quality problems until you measure.
Quick Take
- Per-block `loudnorm` averages multiple speakers in the block to one gain, flattening per-speaker imbalance the wrong way
- Dynamic-mode `loudnorm` (the fallback when linear-mode conditions fail) leaves a 3-second leading-silence artifact
- Per-segment normalization (measure `input_i`, apply `volume={gain}dB`) fixes the first
- Dropping the global `loudnorm` pass entirely and letting per-segment converge fixes the second
- Verification: RMS measurement at sample timestamps catches both before they ship
Bug 1: per-block average masks per-speaker imbalance
The original mixer ran one loudnorm pass over an entire voice block (the run of TTS segments between music transitions). Each block had alternating speakers, CIPHERFOX → HEXABELLA → CIPHERFOX → HEXABELLA → …, 50 to 250 segments per block in a typical episode.
ffmpeg measured the integrated loudness of the block as a whole and applied one gain offset. The averaging behavior is exactly what loudnorm advertises in single-pass mode: read the input, compute an integrated LUFS measurement, return a gain that brings the measurement to the target. The problem is that “the block as a whole” mixes the two speakers’ levels into a single average. Whichever speaker was louder in the source pulled the average up; whichever was quieter stayed relatively quiet after the gain was applied.
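The averaging is easy to sketch numerically. Treating integrated loudness as a power-domain mean (ignoring EBU R128 gating, so this is an illustration, not a measurement), one block-level gain moves both speakers by the same offset and leaves their gap intact. The segment values here are hypothetical:

```python
import math

def block_lufs(segment_lufs: list[float]) -> float:
    """Approximate integrated loudness of equal-length concatenated segments:
    mean in the power domain (ignores R128 gating; illustration only)."""
    mean_power = sum(10 ** (l / 10) for l in segment_lufs) / len(segment_lufs)
    return 10 * math.log10(mean_power)

# Hypothetical block: CIPHERFOX at -14 LUFS, HEXABELLA at -20 LUFS, alternating.
block = [-14.0, -20.0, -14.0, -20.0]
measured = block_lufs(block)       # power mean sits above the -17 midpoint,
                                   # pulled toward the louder speaker
gain = -16.0 - measured            # the single per-block gain
after = [l + gain for l in block]
gap_before = block[0] - block[1]   # 6 dB
gap_after = after[0] - after[1]    # still 6 dB: the gain cannot close it
```

A uniform gain shifts every segment equally, so the inter-speaker gap survives any per-block target.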
Listener verdict on the V6 episode: “weder bei cipherfox noch bei hexabella signifikant verbessert … lautstärken-unterschiede”. Translation: neither voice improved over baseline, the audible volume gap between speakers persisted. Engineering instinct said “the TTS is producing inconsistent volume.” Diagnosis: the mixer was flattening the wrong axis.
You cannot fix per-speaker imbalance with a per-block normalizer. The averaging is the bug.
Bug 1, the fix: per-segment normalization
Replace the per-block loudnorm with per-segment loudness measurement and a clamped gain offset. Each segment goes through loudnorm pass-1 to read its integrated LUFS, then a volume filter applies the delta to bring it to target. The cap at ±9 dB prevents runaway on outlier short segments where the integrated measurement is unstable.
import json
import re
import subprocess
import tempfile
from pathlib import Path

def _measure_loudness(wav: Path, target_lufs: int) -> float | None:
    """Pass-1 measurement only, no audio output."""
    result = subprocess.run([
        "ffmpeg", "-y", "-i", str(wav),
        "-af", f"loudnorm=I={target_lufs}:TP=-2.0:LRA=8:print_format=json",
        "-f", "null", "-",
    ], capture_output=True)
    # loudnorm prints its JSON measurement block to stderr.
    m = re.search(r'\{[^{}]+\}', result.stderr.decode(), re.DOTALL)
    if not m:
        return None
    return float(json.loads(m.group())["input_i"])

def normalize_block(segs: list[Path], out_wav: Path, target_lufs: int = -16) -> None:
    """Per-segment measure + gain + denoise + concat."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp = Path(tmp_dir)
        processed: list[Path] = []
        for i, seg in enumerate(segs):
            measured = _measure_loudness(seg, target_lufs)
            # "is not None", not truthiness: a measured 0.0 LUFS must not
            # silently fall through to zero gain.
            gain_db = (
                max(-9.0, min(9.0, target_lufs - measured))
                if measured is not None else 0.0
            )
            seg_out = tmp / f"seg_{i:04d}.wav"
            af = f"highpass=f=80,afftdn=nr=10:nf=-25,volume={gain_db:+.2f}dB,{_LIMITER}"
            _run([
                "ffmpeg", "-y", "-i", str(seg),
                "-af", af, "-ar", str(_SR), "-ac", "2",
                "-c:a", "pcm_s16le", str(seg_out),
            ], f"per-segment normalize ({seg.name})")
            processed.append(seg_out)
        # Concat all processed segments...
Cost: one extra ffmpeg call per segment for the measurement, about one second of wall-clock time each on a DGX Spark. Worth it: after this change, CIPHERFOX and HEXABELLA volumes stayed within ~1 dB of each other across the full episode.
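The measurement calls are independent of each other, so if that per-segment second matters, they parallelize trivially. A sketch with a thread pool; `measure` is injected (for instance the `_measure_loudness` helper above, with the target bound), and the pool size is a guess to tune against your machine:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Optional

def measure_all(
    segs: list[Path],
    measure: Callable[[Path], Optional[float]],
    workers: int = 8,
) -> list[Optional[float]]:
    """Run the per-file pass-1 loudness measurements concurrently.
    ffmpeg does the CPU work in its own subprocess, so Python threads
    are enough; order of results matches the order of `segs`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(measure, segs))
```

Usage would be `measure_all(segs, lambda p: _measure_loudness(p, -16))`, after which the gain-and-filter loop proceeds sequentially as before.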
Bug 2: loudnorm dynamic mode eats leading audio
The pipeline ran a global two-pass loudnorm after intro and outro mixing, intended to catch any residual loudness drift in the assembled track. In dynamic mode (the fallback when linear=true cannot be applied because the measured LRA is too high or the target true-peak constraint cannot be satisfied), loudnorm introduces a leading-silence artifact at the start of the output.
The behavior is consistent with how loudnorm’s implementation handles its lookahead buffer: it processes audio with a roughly 3-second window to compute true-peak limits, and the buffer’s leading state writes silence before any input audio reaches the output. The man page (man ffmpeg-filters, search for loudnorm) describes the upsampling-to-192-kHz behavior in dynamic mode for true-peak detection but does not explicitly call out the leading-silence side effect.
I measured it on the V6 episode:
$ for t in 0 1 2 3 4 5; do
rms=$(ffmpeg -ss $t -t 0.5 -i episode.v6.mp3 -af astats -f null - 2>&1 \
| grep "RMS level dB" | head -1 | awk -F: '{print $2}')
echo "t=${t}s: RMS=$rms"
done
t=0s: RMS= -inf
t=1s: RMS= -inf
t=2s: RMS= -inf
t=3s: RMS= -inf
t=4s: RMS= -19.755297
t=5s: RMS= -14.619059
3.5 seconds of silence at file start. The intro-music solo phase that the sidechain mix had carefully placed (4 seconds of music before voice cuts in) was eaten by the loudnorm lookahead.
Bug 2, the fix: drop the global pass
Once per-segment normalization is in place (Bug 1’s fix), the global loudnorm pass is redundant. Each segment is already at target. The global pass was added to “catch peaks”, but a simple highpass-and-limiter chain achieves that without the lookahead artifact.
def process_voice(raw_wav: Path, processed_wav: Path, target_lufs: int = -16) -> None:
    """Final mastering: gentle highpass + limiter, no loudnorm.

    Loudnorm's 3-second lookahead in dynamic mode creates silent leading
    audio. Per-segment normalization in stage 2 already converged each
    segment, so a final loudnorm pass is redundant.
    """
    af = f"highpass=f=80,{_LIMITER}"
    _run(["ffmpeg", "-y", "-i", str(raw_wav), "-af", af, str(processed_wav)],
         "final master")
After the change, the same RMS measurement at t=0..3 returns real audio (around -22 dB, the intro music at full volume). Loudness across the episode stayed at -19.5 LUFS integrated, slightly below the -16 target but within podcast publishing norms. Streaming platforms re-normalize anyway; per-speaker consistency matters more than precise integrated loudness for listening comfort.
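The leading-silence check can live in the pipeline itself rather than a shell loop. A minimal stdlib sketch that reads the rendered WAV directly, so it runs in CI without ffmpeg; it assumes 16-bit PCM, and the threshold and window size are assumptions to tune:

```python
import math
import struct
import wave

def leading_silence_seconds(path: str, threshold_db: float = -60.0,
                            window_s: float = 0.1) -> float:
    """Seconds of near-silence at file start, via windowed RMS.
    Assumes 16-bit PCM WAV (sampwidth == 2); returns total duration
    if the whole file is below the threshold."""
    with wave.open(path, "rb") as w:
        frames_per_win = int(w.getframerate() * window_s)
        t = 0.0
        while True:
            raw = w.readframes(frames_per_win)
            if not raw:
                return t
            samples = struct.unpack(f"<{len(raw) // 2}h", raw)
            rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
            db = 20 * math.log10(rms) if rms > 0 else float("-inf")
            if db > threshold_db:
                return t
            t += window_s
```

Asserting `leading_silence_seconds(out) < 0.5` after the final master would have caught the V6 regression before a listener did.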
Why both bugs share a root cause
loudnorm is a single-input filter that assumes its input represents one homogeneous stream. Multi-speaker TTS output is not homogeneous; voices alternate at the segment level, with different acoustic characteristics. Mixed audio (voice plus music) is even less homogeneous. The filter is doing exactly what it advertises, the input simply does not match its assumptions.
The right model for multi-speaker TTS is per-segment normalization at the source level, before any mixing. Music tracks should be pre-mastered to target loudness offline (a separate master_music_assets.py script handles this). The final mixer then only assembles pre-loudness-correct components and applies a peak limiter to catch any residual clipping. No single loudnorm pass at any stage of the pipeline.
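For the offline music pre-mastering step, two-pass loudnorm with `linear=true` is the right tool: a music track is one homogeneous stream, and feeding pass-1's measured values back into pass 2 turns the filter into a constant gain with no dynamic processing. A sketch of the pass-2 filter-string construction from the pass-1 JSON (the `input_*` keys are the ones loudnorm actually prints; the target values here mirror the pipeline's):

```python
import json

def linear_loudnorm_filter(pass1_json: str, I: float = -16.0,
                           TP: float = -2.0, LRA: float = 8.0) -> str:
    """Build the pass-2 loudnorm filter string from pass-1 JSON output.
    With linear=true and all measured_* values supplied, pass 2 applies
    a single static gain instead of dynamic-mode processing."""
    m = json.loads(pass1_json)
    return (
        f"loudnorm=I={I}:TP={TP}:LRA={LRA}:linear=true"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
    )
```

The resulting string goes into the second ffmpeg invocation's `-af`; if the measured LRA exceeds the target, loudnorm still falls back to dynamic mode, which is exactly why this belongs on pre-mastered music and not on the assembled episode.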
What to verify in your own pipeline
If you are running multi-speaker TTS through ffmpeg’s loudnorm, two diagnostic commands answer most questions:
# Check per-speaker imbalance: measure each speaker's segments separately.
for spk in cipherfox hexabella; do
ffmpeg -i "concat:$(ls segments/*${spk}.wav | head -20 | paste -sd '|')" \
-af "loudnorm=I=-16:TP=-2.0:print_format=json" -f null - 2>&1 \
| grep '"input_i"'
done
# Check leading-silence: RMS at t=0 to 5.
for t in 0 1 2 3 4 5; do
ffmpeg -ss $t -t 0.5 -i episode.mp3 -af astats -f null - 2>&1 \
| grep "RMS level dB" | head -1
done
If the per-speaker measurements differ by more than ~1 dB, you have Bug 1. If t=0..3 returns -inf or values significantly lower than t=4+, you have Bug 2. Both reproduce reliably; both have one-line fixes once you find them.
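Both checks reduce to threshold assertions, so they can gate a render automatically. A sketch of the decision logic only; the per-speaker LUFS and per-second RMS values would come from the two ffmpeg commands above, and the thresholds are the ones from this article:

```python
def diagnose(per_speaker_lufs: dict[str, float],
             rms_by_second: dict[int, float],
             gap_db: float = 1.0, silence_db: float = -60.0) -> list[str]:
    """Return a list of detected loudnorm bugs; empty means the render is clean.
    Missing RMS entries are treated as silence (-inf)."""
    bugs = []
    vals = list(per_speaker_lufs.values())
    if max(vals) - min(vals) > gap_db:
        bugs.append("bug1: per-speaker imbalance "
                    f"({max(vals) - min(vals):.1f} dB gap)")
    if any(rms_by_second.get(t, float("-inf")) < silence_db for t in range(4)):
        bugs.append("bug2: leading silence in t=0..3")
    return bugs
```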
What I Actually Use
- Per-segment `loudnorm` pass-1 measurement, `volume={gain}dB` application, ±9 dB cap
- `afftdn=nr=10:nf=-25` for Voxtral hiss, before the limiter
- `alimiter=limit=0.841:level=false` (-1.5 dBFS true-peak ceiling) at the end
- No global `loudnorm` pass, just `highpass=f=80,alimiter=...` for the final master
- RMS verification at `t=0..5` on every render
External references
- ffmpeg `loudnorm` filter docs: https://ffmpeg.org/ffmpeg-filters.html#loudnorm
- EBU R128 standard (the algorithm `loudnorm` implements): https://tech.ebu.ch/publications/r128
- ITU-R BS.1770 (true-peak measurement, why dynamic mode upsamples to 192 kHz): https://www.itu.int/rec/R-REC-BS.1770
Related in this series
This article is Part 4 of Voxtral Pipeline Discoveries (May 2026):
- Part 1: Voxtral 4B Open-Checkpoint: The Encoder is Gated. The architectural constraint behind this pipeline.
- Part 2: Voxtral Chunk Strategy. 30 to 38 percent render-time savings with whole-turn rendering.
- Part 3: FFmpeg `volume` Filter `eval=frame`. A 4-second silent intro bug, companion piece to this one.
- Part 4 (this article): Per-segment loudness for multi-speaker TTS, and the 3-second loudnorm lookahead.