
Per-Segment Loudnorm and the 3-Second Lookahead Bug

After per-block loudness normalization, the female co-host always sounded one room away from the male host. After the global two-pass loudnorm, the first three seconds of every episode disappeared. Both bugs lived in the same mix_audio.py. Both are ffmpeg loudnorm filter footguns. Both look like TTS quality problems until you measure.

Quick Take

  • Per-block loudnorm averages multiple speakers in the block to one gain, flattening per-speaker imbalance the wrong way
  • Dynamic-mode loudnorm (the fallback when linear-mode conditions fail) leaves a 3-second leading-silence artifact
  • Per-segment normalization (measure input_i, apply volume={gain}dB) fixes the first
  • Dropping the global loudnorm pass entirely and letting per-segment converge fixes the second
  • Verification: RMS measurement at sample timestamps catches both before they ship

Bug 1: per-block average masks per-speaker imbalance

The original mixer ran one loudnorm pass over an entire voice block (the run of TTS segments between music transitions). Each block had alternating speakers, CIPHERFOX → HEXABELLA → CIPHERFOX → HEXABELLA → …, 50 to 250 segments per block in a typical episode.

ffmpeg measured the integrated loudness of the block as a whole and applied one gain offset. The averaging behavior is exactly what loudnorm advertises in single-pass mode: read the input, compute an integrated LUFS measurement, return a gain that brings the measurement to the target. The problem is that “the block as a whole” mixes the two speakers’ levels into a single average. Whichever speaker was louder in the source pulled the average up; whichever was quieter stayed relatively quiet after the gain was applied.
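The arithmetic makes the failure concrete. A minimal sketch, with hypothetical speaker levels and ignoring loudnorm's gating: with equal airtime, the block's integrated loudness is roughly the power average of the two voices, and the single corrective gain shifts both voices equally, so the gap survives untouched.

```python
import math

def power(lufs_val: float) -> float:
    """LUFS (a dB-style scale) to linear power."""
    return 10 ** (lufs_val / 10)

def to_lufs(p: float) -> float:
    """Linear power back to LUFS."""
    return 10 * math.log10(p)

# Hypothetical levels: host 6 dB louder than co-host, equal airtime.
host, cohost = -14.0, -20.0

# Integrated loudness of the whole block ~ power average of the two voices.
block = to_lufs((power(host) + power(cohost)) / 2)

target = -16.0
gain = target - block  # the ONE gain a per-block loudnorm applies

# The uniform gain moves both voices equally: the 6 dB gap survives.
gap_after = (host + gain) - (cohost + gain)
print(f"block={block:.2f} LUFS, gain={gain:+.2f} dB, gap after={gap_after:.1f} dB")
```

Because the block average sits between the two speakers, the computed gain is tiny, and the inter-speaker gap after normalization is exactly the gap before it.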

Listener verdict on the V6 episode: “weder bei cipherfox noch bei hexabella signifikant verbessert … lautstärken-unterschiede”. Translation: neither voice improved over baseline, the audible volume gap between speakers persisted. Engineering instinct said “the TTS is producing inconsistent volume.” Diagnosis: the mixer was flattening the wrong axis.

You cannot fix per-speaker imbalance with a per-block normalizer. The averaging is the bug.

Bug 1, the fix: per-segment normalization

Replace the per-block loudnorm with per-segment loudness measurement and a clamped gain offset. Each segment goes through loudnorm pass-1 to read its integrated LUFS, then a volume filter applies the delta to bring it to target. The cap at ±9 dB prevents runaway on outlier short segments where the integrated measurement is unstable.

import json
import re
import subprocess
import tempfile
from pathlib import Path

_SR = 44100  # pipeline output sample rate (value illustrative)
_LIMITER = "alimiter=limit=0.841:level=false"  # 0.841 ≈ -1.5 dBFS ceiling
# _run(cmd, label) is a thin subprocess wrapper defined elsewhere in mix_audio.py

def _measure_loudness(wav: Path, target_lufs: int) -> float | None:
    """Pass-1 measurement only, no audio output."""
    result = subprocess.run([
        "ffmpeg", "-y", "-i", str(wav),
        "-af", f"loudnorm=I={target_lufs}:TP=-2.0:LRA=8:print_format=json",
        "-f", "null", "-",
    ], capture_output=True)
    # loudnorm prints its JSON summary to stderr
    m = re.search(r'\{[^{}]+\}', result.stderr.decode(), re.DOTALL)
    if not m:
        return None
    return float(json.loads(m.group())["input_i"])

def normalize_block(segs: list[Path], out_wav: Path, target_lufs: int = -16) -> None:
    """Per-segment measure + gain + denoise + concat."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp = Path(tmp_dir)
        processed: list[Path] = []
        for i, seg in enumerate(segs):
            measured = _measure_loudness(seg, target_lufs)
            # `is not None`, not truthiness: a valid measurement must never be dropped
            gain_db = max(-9.0, min(9.0, target_lufs - measured)) if measured is not None else 0.0
            seg_out = tmp / f"seg_{i:04d}.wav"
            af = f"highpass=f=80,afftdn=nr=10:nf=-25,volume={gain_db:+.2f}dB,{_LIMITER}"
            _run([
                "ffmpeg", "-y", "-i", str(seg),
                "-af", af, "-ar", str(_SR), "-ac", "2",
                "-c:a", "pcm_s16le", str(seg_out),
            ], f"per-segment normalize ({seg.name})")
            processed.append(seg_out)
        # Concat all processed segments...

Cost: one extra ffmpeg call per segment for the measurement (~1 second wall each on a DGX Spark). Worth it. After this change, CIPHERFOX and HEXABELLA volumes stayed within ~1 dB of each other across the full episode.
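If that per-segment cost ever matters, the measurement calls are independent ffmpeg processes and parallelize trivially. A sketch, assuming a `measure` callable standing in for `_measure_loudness`:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Optional

def measure_all(segs: list[Path],
                measure: Callable[[Path], Optional[float]],
                workers: int = 8) -> list[Optional[float]]:
    """Run the pass-1 loudness measurements concurrently.

    Threads suffice here: each call blocks on an external ffmpeg
    process, so the GIL is not the bottleneck.
    """
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # map preserves input order, so results line up with segs
        return list(ex.map(measure, segs))
```

The per-segment gain application still runs sequentially afterwards; only the read-only measurement pass is worth fanning out.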

Bug 2: loudnorm dynamic mode eats leading audio

The pipeline ran a global two-pass loudnorm after intro and outro mixing, intended to catch any residual loudness drift in the assembled track. In dynamic mode (the fallback when linear=true cannot be applied because the measured LRA is too high or the target true-peak constraint cannot be satisfied), loudnorm introduces a leading-silence artifact at the start of the output.

The behavior is consistent with how loudnorm’s implementation handles its lookahead buffer: it processes audio with a roughly 3-second window to compute true-peak limits, and the buffer’s leading state writes silence before any input audio reaches the output. The man page (man ffmpeg-filters, search for loudnorm) describes the upsampling-to-192-kHz behavior in dynamic mode for true-peak detection but does not explicitly call out the leading-silence side effect.
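A toy model shows where leading silence comes from in any lookahead design. This is an illustration of delay-line behavior, not loudnorm's actual internals:

```python
from collections import deque

def lookahead_process(samples: list[float], sr: int,
                      lookahead_s: float = 3.0) -> list[float]:
    """Toy delay line: a lookahead buffer primed with zeros.

    The first sr * lookahead_s output samples are the buffer's initial
    zeros -- heard as leading silence -- and if the output is truncated
    to the input length, the last lookahead_s seconds of input are lost.
    """
    delay = int(sr * lookahead_s)
    buf = deque([0.0] * delay)  # initial buffer state: pure silence
    out = []
    for s in samples:
        buf.append(s)
        out.append(buf.popleft())  # zeros drain out before any real audio
    return out
```

With a 3-second buffer, the output leads with exactly 3 seconds of zeros, which matches the measured symptom below.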

I measured it on the V6 episode:

$ for t in 0 1 2 3 4 5; do
    rms=$(ffmpeg -ss $t -t 0.5 -i episode.v6.mp3 -af astats -f null - 2>&1 \
          | grep "RMS level dB" | head -1 | awk -F: '{print $2}')
    echo "t=${t}s: RMS=$rms"
  done
t=0s: RMS= -inf
t=1s: RMS= -inf
t=2s: RMS= -inf
t=3s: RMS= -inf
t=4s: RMS= -19.755297
t=5s: RMS= -14.619059

At least 3.5 seconds of silence at file start (the t=3 window, covering 3.0 to 3.5 s, is still silent; audio is present by t=4). The intro-music solo phase that the sidechain mix had carefully placed (4 seconds of music before voice cuts in) was eaten by the loudnorm lookahead.

Bug 2, the fix: drop the global pass

Once per-segment normalization is in place (Bug 1’s fix), the global loudnorm pass is redundant. Each segment is already at target. The global pass was added to “catch peaks”, but a simple highpass-and-limiter chain achieves that without the lookahead artifact.

def process_voice(raw_wav: Path, processed_wav: Path, target_lufs: int = -16) -> None:
    """Final mastering: gentle highpass + limiter, no loudnorm.

    Loudnorm's 3-second lookahead in dynamic mode creates silent leading
    audio. Per-segment normalization in stage 2 already converged each
    segment, so a final loudnorm pass is redundant.
    """
    af = f"highpass=f=80,{_LIMITER}"
    _run(["ffmpeg", "-y", "-i", str(raw_wav), "-af", af, str(processed_wav)],
         "final master")

After the change, the same RMS measurement at t=0..3 returns real audio (around -22 dB, the intro music at full volume). Loudness across the episode stayed at -19.5 LUFS integrated, slightly below the -16 target but within podcast publishing norms. Streaming platforms re-normalize anyway; per-speaker consistency matters more than precise integrated loudness for listening comfort.

Why both bugs share a root cause

loudnorm is a single-input filter that assumes its input represents one homogeneous stream. Multi-speaker TTS output is not homogeneous; voices alternate at the segment level, with different acoustic characteristics. Mixed audio (voice plus music) is even less homogeneous. The filter is doing exactly what it advertises; the input simply does not match its assumptions.

The right model for multi-speaker TTS is per-segment normalization at the source level, before any mixing. Music tracks should be pre-mastered to target loudness offline (a separate master_music_assets.py script handles this). The final mixer then only assembles pre-loudness-correct components and applies a peak limiter to catch any residual clipping. No single loudnorm pass at any stage of the pipeline.

What to verify in your own pipeline

If you are running multi-speaker TTS through ffmpeg’s loudnorm, two diagnostic commands answer most questions:

# Check per-speaker imbalance: measure each speaker's segments separately.
# (Use the concat demuxer; the concat: protocol byte-splices files and
# does not handle WAV headers correctly. Paths must be absolute because
# the demuxer resolves them relative to the list file.)
for spk in cipherfox hexabella; do
  ls "$PWD"/segments/*${spk}.wav | head -20 | sed "s/.*/file '&'/" > /tmp/${spk}.txt
  ffmpeg -f concat -safe 0 -i /tmp/${spk}.txt \
    -af "loudnorm=I=-16:TP=-2.0:print_format=json" -f null - 2>&1 \
    | grep '"input_i"'
done

# Check leading-silence: RMS at t=0 to 5.
for t in 0 1 2 3 4 5; do
  ffmpeg -ss $t -t 0.5 -i episode.mp3 -af astats -f null - 2>&1 \
    | grep "RMS level dB" | head -1
done

If the per-speaker measurements differ by more than ~1 dB, you have Bug 1. If t=0..3 returns -inf or values significantly lower than t=4+, you have Bug 2. Both reproduce reliably; both have one-line fixes once you find them.
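Both checks fold into one pure helper that can run in CI against pre-computed measurements. The function name and thresholds below are hypothetical, mirroring the heuristics just described:

```python
import math

def diagnose(speaker_lufs: dict[str, float],
             rms_head_db: list[float],
             balance_tol_db: float = 1.0,
             silence_floor_db: float = -60.0) -> list[str]:
    """Flag both loudnorm bugs from pre-computed measurements.

    speaker_lufs: integrated LUFS per speaker (pass-1 input_i values).
    rms_head_db:  RMS in dB for 0.5 s windows at t = 0, 1, 2, ... seconds.
    """
    bugs = []
    levels = list(speaker_lufs.values())
    # Bug 1: per-speaker spread beyond the tolerance
    if max(levels) - min(levels) > balance_tol_db:
        bugs.append("bug1: per-speaker imbalance")
    # Bug 2: -inf or near-silent windows in the first ~4 seconds
    if any(v == -math.inf or v < silence_floor_db for v in rms_head_db[:4]):
        bugs.append("bug2: leading silence")
    return bugs
```

Feeding it the article's V6 numbers (speakers ~3 dB apart, -inf at t=0..3) flags both bugs; a healthy render returns an empty list.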

What I Actually Use

  • Per-segment loudnorm pass-1 measurement, volume={gain}dB application, ±9 dB cap
  • afftdn=nr=10:nf=-25 for Voxtral hiss before the limiter
  • alimiter=limit=0.841:level=false (-1.5 dBFS true-peak ceiling) at the end
  • No global loudnorm pass, just highpass=f=80,alimiter=... for final master
  • RMS verification at t=0..5 on every render

This article is Part 4 of Voxtral Pipeline Discoveries (May 2026).

Flow (illustration): two compound loudnorm bugs in a multi-speaker TTS pipeline (ffmpeg loudnorm filter)

  1. Symptom 1: female co-host always quieter than the male host
  2. Root cause 1: per-block average pulls toward the louder speaker
  3. Symptom 2: first 3 seconds of every episode silent
  4. Root cause 2: dynamic-mode lookahead eats leading audio
  5. Fix for both: per-segment normalize, drop the global loudnorm pass