Learn

The terms behind the articles

Self-hosted AI is an acronym swamp. When a term like OOM shows up in an article, it links here, to a short, honest entry: what it means, why it matters, and where to go deeper. No marketing, no hand-waving. Each entry is the evergreen explanation; the articles are the dated war stories it links out to.

117 terms


KV cache deeper Memory The key-value (KV) cache stores a model's view of every token so far, so it does not recompute the whole prompt each step. It speeds generation up and is the memory that grows with context length.
OOM basics Memory Out of memory (OOM) is the most common wall you hit running local models. What it means, why unified-memory boxes hit it differently, and the calm way past it.
RAM basics Memory RAM (random-access memory) is the fast, temporary memory a computer uses while it runs. How it relates to the GPU's memory, and why a DGX Spark merges the two into one pool.
unified memory basics Memory Unified memory is a single pool of system memory shared by the processor and the graphics processor, with no separate video memory. On a DGX Spark all 128 GB is shared, which is why a desk machine can hold a very large model.
VRAM basics Memory VRAM is the memory a GPU uses to hold a model and its working data. How much you have caps the model size and context length you can run, and on a DGX Spark the rules change.
CPU basics Hardware A CPU (central processing unit) is the main processor that runs the operating system and coordinates everything. Why it matters less than the GPU for running models, and where it still bites.
CUDA deeper Hardware CUDA is NVIDIA's platform for running general-purpose software on its graphics processors. Almost every local large-language-model engine is built on it, which makes it the one part of an otherwise sovereign stack you cannot self-host your way out of.
GB10 deepest Hardware GB10 is the Grace Blackwell system-on-chip that powers the DGX Spark, pairing an Arm-based processor and a Blackwell-class GPU around one shared pool of memory.
GPU basics Hardware A GPU (graphics processing unit) is the parallel processor that does the heavy maths of running a model. Why it matters for local AI, and how a DGX Spark blurs the usual lines.
HBM deepest Hardware HBM (High-Bandwidth Memory) is the stacked, very fast memory soldered onto data-centre GPUs. It is why those cards stream weights so quickly, and it is not what a DGX Spark uses.
memory bandwidth deeper Hardware Memory bandwidth is how fast data moves between the accelerator and its memory. For one-at-a-time generation it usually sets your tokens per second, and it is a different thing from how much memory you have.
NVLink deepest Hardware NVLink is NVIDIA's high-speed interconnect that lets two or more GPUs share data far faster than the regular bus. It matters for multi-GPU boxes, and not at all for a single-chip DGX Spark.
PCIe deepest Hardware PCIe (Peripheral Component Interconnect Express) is the high-speed bus that links a discrete graphics card to a computer's processor and memory. On a unified-memory box like the DGX Spark, the GPU is not on the far end of a PCIe link.
SM deepest Hardware A streaming multiprocessor (SM) is the building block of an NVIDIA GPU, the unit that actually runs the threads of a kernel. How many a GPU has, and which architecture they are, sets what it can do.
TFLOPS deepest Hardware TFLOPS (tera floating-point operations per second) is a headline number for how much raw maths a chip can do. It rarely predicts the tokens per second you actually get from a model.
x86 deepest Hardware x86 is the processor instruction set behind most desktop and server chips from Intel and AMD. The DGX Spark uses an Arm-based Grace processor instead, which is why some prebuilt software does not just run.
AWQ deepest Quantization AWQ (Activation-aware Weight Quantization) shrinks a model to lower-precision weights while protecting the ones that matter most, judged by how the model is actually used. It is one of the common quantized formats for local serving.
BF16 deeper Quantization BF16, or 16-bit Brain Floating Point, is a half-size number format that trades precision for a wide value range. It is the common working format for model weights before any heavier quantization.
FP16 deeper Quantization FP16, or 16-bit Floating Point, is a half-size number format that favours precision over range. It is a common native format on consumer hardware and the sibling of BF16.
FP32 deeper Quantization FP32 (32-bit Floating Point) is the wide, accurate number format that models are often trained in. It is the most precise and the most expensive way to store a weight, which is why almost nobody serves a large model in it.
FP8 deeper Quantization FP8 (8-bit floating point) is a numeric format that stores each model weight in eight bits instead of sixteen. Half the memory, a small accuracy cost, and two layouts you have to pick between.
GGUF deeper Quantization GGUF is the single-file format that llama.cpp and its relatives load. It packs a model's weights, at whatever numeric precision they were quantized to, and the metadata needed to run it into one portable file.
GPTQ deepest Quantization GPTQ is a method for shrinking a trained model to lower-precision weights after the fact, with a calibration step that limits the accuracy loss. It is one of the common formats you will meet when running quantized models locally.
INT4 deeper Quantization INT4 (4-bit integer) stores each model weight in four bits, about a quarter of 16-bit. The smallest practical format for most local models, and the one with the most to lose.
mixed precision deeper Quantization Mixed precision means a model uses more than one numeric format at once, keeping the sensitive parts accurate while running the bulk in a smaller, faster format. It is how you buy speed without losing the plot.
MXFP4 deepest Quantization MXFP4 is a four-bit floating-point format where each small block of weights shares one scale factor, an open industry standard that keeps most of the accuracy a flat four-bit format would lose.
NF4 deepest Quantization NF4 (4-bit NormalFloat) is a 4-bit format for storing model weights, shaped so its levels match how weights are actually distributed. It is best known from quantized fine-tuning, where the base model stays in 4 bits while a small adapter trains.
NVFP4 deeper Quantization NVFP4 is NVIDIA's 4-bit floating-point number format for quantizing model weights so a large model fits in limited memory.
quantization basics Quantization Quantization stores a model's weights in fewer bits so it takes less memory and runs faster. The trade is a small, usually manageable, loss of precision.
A3B deepest Models & inference A3B is a model-name suffix meaning three billion active parameters: a mixture-of-experts model that holds far more weights in total but routes only about 3B of them per token.
attention deepest Models & inference Attention is the mechanism that lets a model, while reading one word, weigh how much every other word matters to it. It is the core idea that makes a transformer good at language.
beam search deepest Models & inference Beam search is a decoding strategy that keeps several candidate sequences alive at each step and extends the most promising ones, instead of committing to one token at a time. It contrasts with sampling.
chain of thought deeper Models & inference Chain of thought means prompting a model to work through a problem step by step before it answers, instead of jumping straight to a conclusion. It often helps on harder reasoning, at the cost of more tokens.
context window basics Models & inference The context window is the most text a model can take into account at once, measured in tokens. What fits inside it, why a bigger one costs more, and why a full window often reads worse than a short curated one.
DFlash deeper Models & inference DFlash is a speculative-decoding draft configuration, a tuned draft at a small speculation depth used in this stack's production Qwen serving.
distillation deepest Models & inference Distillation trains a smaller model to copy the behaviour of a larger one, so you get much of the quality at a fraction of the size. It is how many of the small models you can actually run came to be.
EAGLE deeper Models & inference EAGLE is a speculative-decoding method where a lightweight head predicts the base model's own features so its drafts agree more often.
few-shot deeper Models & inference Few-shot means putting a few worked examples in the prompt so the model copies the pattern, the format and tone you want, without any extra training.
fine-tuning deeper Models & inference Fine-tuning takes a trained model and trains it a little more on your own examples to shift its behaviour. It is powerful, often overkill, and a different tool from retrieval.
hallucination basics Models & inference A hallucination is when a model states something false with the same fluency and confidence it uses for the truth. It is not a bug you patch; it is how the model works, so you design around it.
inference basics Models & inference Inference is what happens every time you actually use a model: it reads your input and produces a result. It is the part you run on your own hardware, separate from training.
LoRA deeper Models & inference LoRA (Low-Rank Adaptation) teaches a model new behaviour by training a small pair of adapter matrices instead of the whole model. It is the cheap, hardware-friendly way to fine-tune on your own box.
MoE deeper Models & inference A mixture of experts (MoE) model holds many specialist sub-networks but activates only a few per token. Why that lets a huge model run cheaply, and the catch hiding in the memory bill.
MTP deepest Models & inference Multi-token prediction (MTP) is a speculative-decoding method where the model proposes more than one token per step and verifies them together, aiming for more tokens per pass.
parameter basics Models & inference A parameter is one of the numbers a model learned during training. When a model is called 7B, that means seven billion of them, and the count hints at both capability and the memory it needs.
perplexity deepest Models & inference Perplexity is a measure of how well a model predicts text: roughly, how surprised it is by what comes next. Lower is better, and it is a common way to compare models or check a quantized one.
prompt basics Models & inference A prompt is the text you send a language model to get a response. Its wording, examples, and context shape the answer, which is why prompting is a skill, not a formality.
prompt engineering basics Models & inference Prompt engineering is the practice of shaping a model's input, the wording, the context, the examples, to get better output, without changing the model itself.
RLHF deepest Models & inference RLHF (Reinforcement Learning from Human Feedback) shapes a model's behaviour using human judgements of which answers are better. It is a large part of why a chat model feels helpful rather than merely fluent.
sampling deeper Models & inference Sampling is the step that turns a model's list of possible next words into one actual choice. The rules you pick decide whether output is steady and predictable or varied and surprising.
speculative decoding basics Models & inference Speculative decoding speeds up generation by letting a small draft model propose several tokens that the big model checks all at once.
system prompt basics Models & inference A system prompt is the standing instruction that tells a model who it is and how to behave, kept separate from the user's message so it frames every turn of the conversation.
temperature deeper Models & inference Temperature is the setting that decides how adventurous a model's word choices are. Low keeps it focused and repeatable, high makes it creative and erratic.
token basics Models & inference A token is the small chunk of text a language model actually processes. Counting tokens, not words, is how you reason about context length, cost, and speed.
tokenizer deeper Models & inference A tokenizer is the component that splits text into tokens, the small units a model actually reads and writes. Its rules decide how your words map to the numbers the model sees.
top-p deeper Models & inference Top-p, also called nucleus sampling, keeps only the smallest set of next tokens whose probabilities add up to at least p, then draws from that set. It works with temperature to shape how varied a model's output is.
transformer deeper Models & inference The transformer is the neural-network design nearly every modern language model is built on. Its core trick is letting each word look at every other word to decide what it means in context.
zero-shot deeper Models & inference Zero-shot means asking a model to do a task with no examples in the prompt, relying entirely on what it learned in training. The cleanest prompt, when the model already knows the job.
EmergentTTS-Eval deepest Speech & audio EmergentTTS-Eval is a benchmark suite that stress-tests text-to-speech on difficult cases like emotions, questions, and tricky punctuation, then uses an automatic grader to score how well a model handles them.
prosody deeper Speech & audio Prosody is the rhythm, stress, intonation and pacing of speech. It is the layer that makes a voice sound alive and human instead of flat or read aloud, the part of TTS that is hardest to measure and easiest to feel.
speaker similarity deepest Speech & audio Speaker similarity (SIM) measures how close a cloned voice is to the reference speaker, usually as the cosine similarity between speaker embeddings, where higher means a closer match.
TTS basics Speech & audio TTS, or text to speech, is the task of synthesising spoken audio from written text using a neural model. It is the umbrella term for the whole speech-synthesis topic, covering everything from intelligibility to how natural the voice sounds.
voice cloning deeper Speech & audio Voice cloning reproduces a target speaker's timbre from a short reference clip, so the TTS model speaks new text in that voice. It captures the character of a voice from seconds of audio rather than using a fixed preset.
WER deeper Speech & audio WER, or word error rate, feeds generated speech back into a speech recognizer and counts the fraction of words it gets wrong. Lower is better. It measures intelligibility and stability, not how natural or alive the voice sounds.
batch size deeper Performance Batch size is how many requests or sequences a model processes at once. It trades memory for throughput, and on a shared-memory box it is one of the first knobs that decides whether you OOM.
continuous batching deepest Performance Continuous batching is a serving trick that adds and drops requests from a running batch token by token, so the GPU rarely sits idle waiting for the slowest request to finish.
Elo rating deeper Performance Elo rating is a relative skill score computed from head-to-head comparisons, borrowed from chess, used in blind A/B arenas to rank models by human preference.
FlashAttention deepest Performance FlashAttention is a way to compute the attention step without writing the huge intermediate score matrix out to slow memory. It makes long contexts cheaper in both memory and time.
latency deeper Performance Latency is the delay you feel before a model starts answering, often measured as time to first token. It is a different number from throughput, and they can pull in opposite directions.
PagedAttention deepest Performance PagedAttention is the trick vLLM uses to store the key-value cache in fixed-size pages instead of one contiguous block. Why that cuts waste and lets more requests share the same memory.
prefill and decode deepest Performance Inference runs in two phases: prefill reads the whole prompt at once, then decode generates the answer one token at a time. They stress the hardware differently, which is why one number never describes speed.
tensor parallelism deepest Performance Tensor parallelism splits each layer of a model across several GPUs so they compute one forward pass together, letting a model that is too big for one card run on many.
throughput deeper Performance Throughput is how many tokens a model generates per second once it is running. It tells you how fast a long answer finishes, and it is not the same number as latency.
BM25 deeper Retrieval BM25 (Best Matching 25) scores how well a document matches a search query from how often the query words appear and how rare they are. It is the strong lexical baseline that fancier dense retrieval often fails to beat.
chunking deeper Retrieval Chunking is cutting documents into smaller passages before turning each into an embedding. How you cut decides what retrieval can find, which makes it one of the quietest, most consequential choices in a RAG stack.
embedding deeper Retrieval An embedding is a list of numbers that represents a piece of text by its meaning, so that similar texts land close together. It is the maths under semantic search and retrieval.
RAG deeper Retrieval RAG (retrieval-augmented generation) searches your own documents for relevant passages and feeds them to the model, so a general model can answer about your private or current material without retraining.
reranking deeper Retrieval Reranking is a second pass that reorders retrieved passages by how relevant they actually are to the question. It catches the near-misses that a fast first search lets through.
vector database deeper Retrieval A vector database stores embeddings and finds the nearest ones to a query. It is the search engine that lets a retrieval system pull the right passages out of a pile of documents.
llama.cpp deeper Serving llama.cpp is an open-source C++ inference engine that runs language models from GGUF files, with light dependencies, so it works on plain CPUs and small GPUs as well as big ones.
Ollama basics Serving Ollama is a friendly local-model runner: one command pulls a model and starts it behind an API. It wraps llama.cpp and handles the file management for you.
OpenAI-compatible API deeper Serving An OpenAI-compatible API is the request and response format most local inference servers expose. Why it became the lingua franca, and what it lets you swap without rewriting clients.
SGLang deeper Serving SGLang is an open-source serving engine that runs a large language model as an API: it schedules requests, reuses cached prefixes, and keeps the GPU busy under load.
vLLM deeper Serving vLLM is an open-source inference server that runs a large language model behind an API, batching many requests together and managing the key-value cache so the GPU stays busy.
CGNAT deepest Networking Carrier-Grade Network Address Translation has your internet provider share one public address across many customers. It is why port forwarding stops working and home hosting needs a different approach.
DNS deeper Networking DNS (the Domain Name System) turns a name people can remember into the address machines actually use. It is the first thing that resolves when you put a self-hosted service online, and a common place for outages to hide.
firewall basics Networking A firewall decides which network traffic is allowed in and out of a machine. It is the first line of defence on any box you put online, and a default-deny stance is the sane way to run one.
L402 deepest Networking L402 is an open protocol for paying for a web resource with Bitcoin over the Lightning Network. It uses the HTTP 402 Payment Required status and small signed tokens, so a program can pay per request with no account and no card.
NAT deeper Networking Network Address Translation lets many devices on a home network share a single public internet address. It is why your machines can reach out freely but cannot be reached from outside by default.
port forwarding deeper Networking Port forwarding tells your router to send inbound traffic on a given port to a specific device on your network. It is the classic way to host a service from home, and the classic way to expose one.
reverse proxy deeper Networking A reverse proxy sits in front of your services, takes every incoming request, and routes it to the right backend. It is where TLS, names, and access usually get handled, so the services behind it stay simple.
SSH basics Networking SSH (Secure Shell) is how you log in to a remote machine and run commands over an encrypted connection. It is the everyday front door to any server you self-host, and key-based login is its safest form.
TLS deeper Networking TLS (Transport Layer Security) is the encryption that puts the lock icon on a web connection. It encrypts traffic and proves a server is who it claims to be, and it is the difference between HTTP and HTTPS.
WireGuard deeper Networking WireGuard is a modern VPN protocol built to be simple and fast. It links your machines into one private network over the public internet, which is how home hardware and a remote box can talk as if they were on the same LAN.
2FA basics Self-hosting Two-factor authentication (2FA) asks for a second proof on login, so a stolen password alone is not enough. What the factors are, why TOTP codes are the common one, and where it helps.
ACL deepest Self-hosting An access control list (ACL) is a set of rules that says which identities may reach which services. On a private network it is the line that scopes a shared node to exactly one port.
age deeper Self-hosting age is a small, modern tool for encrypting files. One recipient key, one command, no certificate ceremony. Why a backup pipeline reaches for it instead of older tooling.
CLI basics Self-hosting A CLI is a command-line interface: you type a command, the program runs it, you read the text it prints back. It is the default way to operate a server and run AI tooling.
container basics Self-hosting A container is a packaged, isolated process: one running service with its own files but the host's kernel. It is the unit Docker starts, and the building block of most self-hosted stacks.
Docker basics Self-hosting Docker packages a service and everything it needs into one image, then runs it as an isolated container. It is how most self-hosted AI stacks ship and start their pieces.
LUKS deeper Self-hosting LUKS (Linux Unified Key Setup) is the standard way to encrypt a disk on Linux. What it protects, what it does not, and why a self-hosted AI box wants its disks encrypted at rest.
MCP deeper Self-hosting The Model Context Protocol (MCP) is an open standard that lets an AI agent call external tools and data through one interface, instead of every app hardcoding every integration. This blog runs its own MCP server you can query.
systemd deeper Self-hosting systemd is the Linux service manager that starts long-running processes at boot, restarts them when they crash, and collects their logs. It is how a self-hosted service survives a reboot.
VPS basics Self-hosting A VPS is a virtual private server: a rented, isolated machine in someone's data centre. It is the cheap, public-facing front end many self-hosted setups put in front of hardware at home.
custodial basics Sovereignty Custodial means another party holds your keys or funds for you. Non-custodial means you hold them yourself. The difference decides who can freeze, lose, or close your account, and it runs through every sovereign setup.
hardware wallet basics Sovereignty A hardware wallet stores your Bitcoin private keys offline on a dedicated device and signs transactions inside it, so the keys never reach an internet-connected computer. It is the standard way to hold your own coins.
KYC basics Sovereignty KYC is know your customer: the identity checks a service makes you pass before it will deal with you. Avoiding it is a recurring theme in sovereign, self-hosted setups.
Lightning channel deeper Sovereignty A Lightning channel is a funded, two-party link between Lightning nodes. It is opened with an on-chain transaction, then carries many instant payments off-chain. Channels are what make Lightning fast, and what you manage when you run your own node.
Lightning node deeper Sovereignty A Lightning node is your own connection to the Lightning Network, the layer on top of Bitcoin for instant, cheap payments. Running your own means you hold the funds and route your own payments instead of trusting a custodian.
multisig deeper Sovereignty Multisig (multi-signature) is a Bitcoin setup where spending needs more than one key to sign. It removes the single point of failure that a one-key wallet carries.
NIP deepest Sovereignty A NIP, a Nostr Implementation Possibility, is one of the numbered spec documents that define how Nostr features work. They are why two different clients can talk to each other at all.
Nostr deeper Sovereignty Nostr is an open protocol for publishing notes and other data across independent relays, with no company in the middle. Your identity is a key you hold, not an account someone can close.
npub and nsec deeper Sovereignty On Nostr your identity is a key pair. The npub is the public half you hand out so people can find and verify you. The nsec is the secret half you must never reveal, because whoever holds it is you.
on-chain deeper Sovereignty On-chain means a transaction is recorded directly on the Bitcoin blockchain, settled by the whole network. It is final and fully yours, but slower and costlier than Lightning, which is why the two get used for different jobs.
seed phrase basics Sovereignty A seed phrase is the short list of words that can restore an entire wallet. It is the single backup that matters: anyone who has it controls the funds, and anyone who loses it loses them.
UTXO deepest Sovereignty A UTXO (Unspent Transaction Output) is a discrete chunk of Bitcoin a wallet can spend. Your balance is just the sum of your UTXOs, and understanding them explains fees and privacy.
zap basics Sovereignty A zap is a small Bitcoin payment sent over the Lightning Network as a tip on Nostr. It is value-for-value support, attached to a note or a person, with no platform in the middle.
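
Several of the Memory and Quantization entries above lean on the same arithmetic: a model's weight footprint is roughly its parameter count times the bits per weight. A minimal sketch of that sum, assuming a hypothetical 7B-parameter model and counting weights only. The function name and the format table are illustrative, not taken from any library, and real quantized files also carry scale factors and metadata, while the KV cache and activations add to the bill at run time.

```python
# Rough weight-memory estimate: parameters x bits per weight.
# Assumption: weights only. Scale factors, metadata, KV cache and
# activations are deliberately left out, so real usage will be higher.
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "FP16": 16, "FP8": 8, "INT4": 4}


def weight_gb(params: float, fmt: str) -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical 7B model, one line per format.
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {weight_gb(7e9, fmt):.1f} GB")
```

On these assumptions a 7B model comes to about 28 GB in FP32, 14 GB in BF16 or FP16, 7 GB in FP8 and 3.5 GB in INT4, which is the sense in which INT4 is "about a quarter of 16-bit".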
