What Goes Wrong at Token 4096: A Context-Window Failure Atlas
The agent worked fine at token 1000. By token 4096 it was producing garbage. By token 8192 it was producing confident garbage. The mistake most operators make is assuming this is one failure mode and reaching for one fix. There are at least eight, and the fix depends on which one you are looking at.
As of May 2026, the three models I run most on the sovgrid stack have very different effective context budgets: Claude Sonnet 4 supports 200k tokens, Mistral Small 4 (as of May 2026) has a 32k context window, and Qwen 3.6 runs at roughly 32k on the local SGLang deployment. “Supports 200k” does not mean “performs well at 200k.” That distinction is what this atlas is about.
Quick Take
- Token 4096 is the inflection point because most production stacks default to 4K context. Failures here are often “context limit reached” with no graceful degradation.
- The eight failures, briefly: silent truncation, soft truncation, attention sink loss, position-encoding drift, KV-cache spill, schema-decay, semantic-overload, conversation-thread tangle.
- The diagnostic question: is the model producing wrong output, no output, or refusing output? Each maps to a different cause.
- The fix categories: raise the limit, summarize and truncate, restart the conversation, switch model, switch backend, redesign the prompt structure.
- The trap: assuming the model’s stated context window is the working context window. Real performance often degrades long before the documented limit.
What this atlas covers and does not cover
Three caveats before the failures.
First caveat: this atlas covers inference-time context failures, not training-data failures. If the model gives wrong factual answers, that is a different class of problem and the context window is not the culprit.
Second caveat: the failure modes described here apply to single-sequence inference. Retrieval-augmented generation (RAG) systems, where context is chunked and retrieved per-query, have their own failure taxonomy. The atlas does not cover RAG chunking failures, embedding drift, or retrieval relevance decay.
Third caveat: the numbers cited (4k, 32k, 200k) are context limits, not quality limits. The point at which output quality degrades varies per model and task. Do not treat the trained context length as the safe operating context length without measuring it on your specific workload.
Failure 1: silent truncation
Symptom. The model behaves as if the first half of the conversation never happened. The agent forgets the customer’s name that was given at the start. The tool calls that were made in the first part of the session are not remembered.
Cause. The inference engine has truncated the prompt to fit the configured context window, silently, by dropping the oldest tokens. The truncation is configurable but the default behavior on most stacks is to drop from the front, which means the system prompt and the conversation start go first. This is why the agent responds coherently but answers with no memory of the initial constraint: the system prompt is literally not in the input anymore.
Here is what the inference log looks like when silent truncation kicks in. The engine drops 2048 tokens without a warning to the application layer:
[sglang] INFO: prompt_tokens=4096, truncated_tokens=2048, truncation_strategy=left
[sglang] INFO: generating... context_length=4096/4096
[sglang] WARN: prompt exceeded configured max_tokens, oldest tokens dropped
Fix. Raise the configured context window if the model supports it. If the model genuinely has a 4k limit, the application needs to summarize the conversation before reaching the limit, not rely on the engine’s default truncation. Implement a sliding summarizer in the application layer. The summarizer should trigger at 80% of the configured limit (3200 tokens for a 4k window), not at 100%, because you need headroom for the summarization response itself.
Failure 2: soft truncation
Symptom. The output quality degrades gradually as the context fills. Coherent at 2000 tokens, slightly worse at 4000, noticeably bad at 6000, garbage by 8000. No hard error.
Cause. Long-context performance on most language models is worse than short-context performance, even when the model is technically configured to handle the longer sequence. Attention degrades in the middle of long contexts (the “lost in the middle” effect documented in the long-context-recall research literature). The model still produces output; the output is just less faithful to the parts of the context the attention has decayed on. Because there is no error, this is the failure mode most often misdiagnosed as “the model is just not very good.”
Fix. Use a chunking strategy: process the long context in segments, summarize each segment, then operate on the summaries. The model never sees more than its sweet-spot context length. (For a worked example on the sovgrid stack, see Research: Voxtral Chunk Strategy Render Time which is about audio but uses the same architecture principle.)
Failure 3: attention sink loss
Symptom. The model loses the system prompt or the persona around the same point in every long conversation. The behavior drift is consistent across conversations of similar length.
Cause. Some models implicitly weight the first few tokens (the “attention sink,” meaning the initial token positions that receive disproportionate attention weight during inference) more heavily than the rest. When the context grows beyond a threshold, the inference engine may evict or compress the attention sink position, and the model’s grounding to the system prompt evaporates. This is why the drift appears at a consistent turn count rather than at a consistent token count: the turn count is a proxy for when the attention sink falls outside the active window.
Fix. Use an attention-sink-aware inference backend (vLLM has explicit sink-preservation options for some model architectures). If the backend does not support this, replay a short version of the system prompt at the most recent position every N turns. The replay refreshes the model’s grounding without consuming much context. A 200-token system-prompt refresh every 5 turns costs 40 tokens per turn on average, which is far cheaper than the context it takes to recover from undetected persona drift.
Failure 4: position-encoding drift
Symptom. The model produces output that references the wrong turn in the conversation. It treats the user’s question from turn 3 as if it were from turn 1, or vice versa. The temporal ordering breaks.
Cause. The position encoding (RoPE, ALiBi, or similar) on the model was trained on a specific maximum context length, and operations beyond that length use extrapolation or interpolation that produces poor positional grounding. The model can still attend to tokens but loses track of which token is where. RoPE, which refers to Rotary Position Embedding and is the standard in most current models including Qwen and Mistral, was introduced in the RoFormer paper and works by rotating query and key vectors by an angle that encodes position. Beyond the trained maximum, those rotation angles are extrapolated, which is why positional grounding collapses rather than degrades gracefully.
Fix. Stay within the model’s training context length. If you need longer effective context, switch to a model whose training data includes the longer length (newer model checkpoints often have 32k or 128k trained context). Position-encoding extrapolation tricks like YaRN or self-extend help in some cases but are not universal fixes. Tested on Mistral Small 4 at 32k context: YaRN with a scale factor of 8 recovered about 70% of the short-context positional accuracy at 28k tokens, but the remaining 30% degradation was still observable in ordering tasks.
Failure 5: KV-cache spill
Symptom. The inference engine starts paging the KV cache to slower memory or to swap. Throughput collapses. The dashboard shows the unified-memory utilization at near-100 percent.
Cause. The KV cache grows linearly with context length per request. At full context across many concurrent requests, the KV cache can consume more memory than the model weights. On unified-memory architectures (DGX Spark, M3 Ultra), this competes with everything else for the same memory pool. This is why a model that runs comfortably at 4k context across 8 concurrent requests can OOM at 32k context across just 2 requests: the KV memory footprint scales with both context length and concurrency simultaneously.
The OOM trace on a DGX Spark (tested on Qwen 3.6 at 32k context, as of May 2026) looks like this:
[vllm] ERROR: CUDA out of memory. Tried to allocate 18.00 GB
[vllm] ERROR: KV cache size: 22.4 GB (7 layers × 2 × seq_len × d_model × float16)
[vllm] ERROR: Reducing --max-model-len or --gpu-memory-utilization recommended
RuntimeError: CUDA error: out of memory (exit code 1)
Fix. Reduce concurrent requests, reduce per-request context length, or use a quantized KV cache (FP8 KV cache is supported on recent vLLM versions and roughly halves the KV memory footprint). For the Spark-specific tuning, see Spark Arena Rank 4 Made Me Add Qwen3.6 for the working gpu-memory-utilization and kv-cache-dtype settings.
Warning: on unified-memory systems, the KV cache and model weights compete for the same pool. Avoid setting gpu-memory-utilization above 0.85 or you will lose the headroom the OS and attention kernels need to operate safely.
Failure 6: schema-decay
Symptom. The agent’s tool calls were well-formed JSON in turn 1. By turn 8, the tool calls have missing braces, malformed arguments, or extra commentary outside the JSON block.
Cause. Long-context attention degrades the model’s adherence to structured-output constraints. The training data for tool-calling fine-tunes was probably short-context, and the model’s compliance with the format weakens as the context lengthens. This causes silent downstream failures: the calling code receives a json.JSONDecodeError rather than a clean tool result, and the agent loop stalls or crashes without a clear error tied to context length.
Here is the typical failure trace at turn 9 of a long agentic session:
[agent] tool_call: {"function": "search_files", "args": {"path": "/data",
[agent] ERROR: json.JSONDecodeError: Expecting ',' delimiter: line 1 col 89 (char 88)
[agent] raw_response: {"function": "search_files", "args": {"path": "/data", "pattern":
".log" // latest logs only
}}
[agent] WARN: malformed tool call at turn=9, context_tokens=11240
The inline comment // latest logs only is not valid JSON. The model added it because it has seen similar comment patterns in its long context and lost track of the schema constraint.
Fix. Use grammar-constrained decoding (vLLM’s response_format with schema, or llama.cpp grammar files) to force well-formed output. The constraint pays a small throughput cost but eliminates schema-decay entirely. The constraint also disables speculative decoding for the constrained generation, which is fine because EAGLE is net-negative on structured output anyway. (See EAGLE Speculative Decoding: When It Helps and When It Doesn’t.)
Specifically, vllm>=0.4.0 supports guided_decoding_backend: "lm-format-enforcer" which enforces a JSON schema token by token. This was introduced in v0.4.0 and is the recommended path as of May 2026.
Failure 7: semantic-overload
Symptom. The agent has been told too many things and starts conflating them. The instruction “use Python type hints” from the system prompt gets mixed up with “use JavaScript types” from a turn six hours ago. The model’s behavior is the average of all the instructions it has accumulated, not the most recent or the most relevant.
Cause. Long context does not just hurt attention; it can also hurt prompt prioritization. The model treats all instructions in context as similarly weighted, which is wrong when later instructions should override earlier ones.
Fix. Restart the conversation. Move the long-running session’s accumulated context into a summarized document the agent reads, and start a fresh conversation with the document as a reference. The summarization pass costs one inference round; the restart eliminates the overload.
Failure 8: conversation-thread tangle
Symptom. The agent confuses multiple parallel topics in a long conversation. Asked about the deployment script, it answers about the dashboard.
Cause. Multi-topic long conversations confuse most models, especially when topics interleave. The model attempts to maintain coherence across all topics simultaneously and fails when the topic count exceeds roughly three or four.
Fix. Per-topic conversation isolation. Spawn a sub-agent per topic, share only the summaries between them, and let the user/operator orchestrate the multi-topic state externally. This is the agent-orchestration pattern (see 5 MCP Patterns That Aren’t ‘Search the Database’, publication pending, for the broader pattern catalog).
How to tell which failure you have
Run through these four steps in order. Each step narrows the failure class by roughly half.
-
What kind of bad output? Wrong output maps to failures 1, 2, 3, 4, 6, 7, 8. No output (timeouts, hangs) maps to failure 5. Explicit refusal (“I cannot process this request”) maps to failure 1 (context-limit error) or failure 5 (OOM-induced hang). This first question eliminates the memory-pressure class from the structural-attention class immediately.
-
Does the failure correlate with token count or conversation turn count? Token count correlation (fails at 6k tokens regardless of session length) means failures 1, 2, 4, 5, or 6. Turn count correlation (fails after turn 7 regardless of token count) means failures 3, 7, or 8. Because these causes require different fixes, conflating them is the most expensive diagnostic mistake.
-
Does a fresh conversation at the same task succeed? If yes, the failure is in accumulated context (failures 7 or 8). If no, the failure is in the model or engine configuration (failures 2 or 4). This is why “restart and retry” is the first action, not the last: it directly partitions the diagnostic space.
-
Does reducing the per-request context length fix it? If yes, failures 1, 4, or 5. If no, failures 2, 3, 6, 7, or 8. This narrows to whether the cause is hard limit (truncation) versus soft degradation (attention quality).
Mapping:
| Symptom | Turn-correlated | Token-correlated | Fix class |
|---|---|---|---|
| Forgets early context | No | Yes | Silent truncation (F1) |
| Gradual quality decay | No | Yes | Soft truncation (F2) |
| Persona drift at consistent turn | Yes | No | Attention sink (F3) |
| Temporal ordering breaks | No | Yes | Position encoding (F4) |
| Timeouts / OOM | No | Yes | KV-cache spill (F5) |
| Malformed JSON at late turns | No | Yes | Schema decay (F6) |
| Instruction conflicts from history | Yes | No | Semantic overload (F7) |
| Topic cross-contamination | Yes | No | Thread tangle (F8) |
Where this fits
For the broader inference-stack reasoning, see The Sovereign AI Stack in 2026. For the speculative-decoding interaction with context-window issues, see EAGLE Speculative Decoding: When It Helps and When It Doesn’t. For the mental model on memory consumption, see The Unified-Memory Inference Mental Model.
subscribe for the deep dive
The follow-up article walks through the actual debug session that produced this atlas, including the prompts I used to provoke each failure mode and the production traces from real workloads. Subscribe via the footer.