#technical

5 articles

All articles tagged "technical": self-hosted AI fixes, setups, and architecture notes.

Jun 25, 2026

A No-Vector RAG That Works: The Architecture, Decision by Decision

The complete design of the retrieval system my local models run on: Markdown files, one JSON index, full-body BM25 chunked per section, served to agents over MCP. No vector database, no embeddings. Here is every decision and the reason behind it, with the external evidence that backs each one.

May 19, 2026

mistral, fix

What Goes Wrong at Token 4096: A Context-Window Failure Atlas

Eight specific failure modes that surface as the context window fills. They are not the same failure mode; the fixes are different. The atlas helps you tell which failure you are looking at when the agent starts producing garbage.

May 19, 2026

mistral

EAGLE Speculative Decoding: When It Helps and When It Doesn't

EAGLE accelerates conversational and long-prose workloads by 2-3x on Mistral Small 4. On structured-JSON output, the same configuration is net-negative. The decision rule is the workload class, not the inference engine.

May 19, 2026

dgx-spark

NVFP4 Quantization Explained (For Engineers Who Skipped the Paper)

NVFP4 is a 4-bit floating-point format with a tiny exponent field, designed for inference-time activation and weight quantization on Blackwell-class hardware. Here is what the format actually does, why it is faster than INT4 on Spark for some workloads, and where it loses to other quantization choices.

May 19, 2026

dgx-spark

The Unified-Memory Inference Mental Model

Unified memory is not a bigger pile of RAM. It is a different architecture, and the mental model for what makes inference fast on it is different from the model for discrete-GPU inference. Here is the working model after two months on a DGX Spark.