The Model Is the Cheap Part

June 22, 2026 13 min read

There is a way to read the AI market that makes self-hosting sound like a hobby and renting sound like the only adult choice. The model on someone else’s servers is bigger, newer, and faster than anything I can fit on a desk, and it gets cheaper to rent every quarter. Against that, my owned machine looks like nostalgia with a power bill. I have made this case against myself in detail, in the essay about my idle machine and in the full cost model, and I am not going to pretend the dollars come out my way for most people at most volumes.

But the dollar argument quietly assumes the model is the valuable thing. That is the assumption I want to take apart. Read the stack the way an economist reads any market, by asking which layer is commoditizing and which layer is scarce, and the conclusion flips. The model is the cheap part. The expensive part, the part with the moat, is your data, your process, and your judgment. And renting a frontier API does something strange when you look at it through that lens: it pays a premium for the commodity layer while exporting the scarce layer into someone else’s weights. Owning versus renting stops being a values question and becomes an arithmetic one.

The program is the weights

Andrej Karpathy gave the cleanest statement of what an AI model actually is back in his “Software 2.0” essay. In Software 1.0, a human writes the program in explicit code. In Software 2.0, the program is not written, it is grown. The weights of a neural network are the program, and they are produced by pointing data and compute at an architecture until the behavior you want falls out. As he puts it, the dataset and the architecture are the source code, and “it is significantly easier to collect the data than to explicitly write the program.”

Sit with what that means for value. If the program is the weights, and the weights come from data plus compute rather than from some rare piece of algorithmic genius, then the scarce input was never the cleverness. It was the data and the compute. And both of those are commoditizing fast. Compute is a commodity by definition, you rent it by the hour from a dozen vendors. The training recipes are increasingly public. Open-weight models that match last year’s frontier ship every few months. The thing that made a model good, the data and the compute poured into it, is exactly the thing the whole industry is racing to make abundant.

So when you rent a model, what are you actually buying? You are buying the output of a process that is getting cheaper to run every year. The research nonprofit Epoch AI has tracked the cost of a given level of model capability falling by roughly an order of magnitude a year (their inference-price data). That is not the price curve of a scarce asset. That is the price curve of a commodity in free fall. Karpathy was describing how capability is made. The economics that follow are unavoidable: a thing made from commoditizing inputs becomes a commodity itself.

The model is the commodity

The business press reached the same conclusion from the opposite end, looking not at how models are built but at where enterprises actually capture value. The running theme across MIT Sloan Management Review’s coverage is that AI value in an organization does not come from the model. It comes from the data you feed it, the processes you wrap around it, and the human judgment that decides what to do with its output. The model is the interchangeable part. The organization is the moat.

This is worth stating bluntly because it inverts the marketing. The pitch for a frontier API is that the model is the magic and you are buying access to the magic. Sloan’s research keeps finding the opposite: the magic is a commodity, and the durable advantage is everything you bring to it. Their work even names a failure mode that should frighten anyone leaning fully on a rented brain. When organizations cut the entry-level roles where critical thinking is learned and route the work through AI instead, they report a kind of atrophy, the human judgment that was supposed to supervise the model thinning out underneath it. The judgment is the asset, and renting your way out of building it is how you lose the asset.

Put Karpathy and Sloan together and you get a stereo signal from two unrelated sources. From the builder’s side, the program is the weights and the weights are made from commoditizing inputs. From the buyer’s side, the model is the commodity and the value is your data and process. Neither is making a sovereignty argument. Both are describing the same market structure. The scarce layer is not the model. It never was.

None of this is new. The razor and blades model, the one Gillette is famous for, sells the razor cheap or gives it away and makes its money on the blades you keep buying. The cheap, ubiquitous part is never where the money is. The personal computer ran the same play in slow motion. The hardware commoditized into interchangeable, low-margin parts you could buy from anyone, while the durable value migrated to software, brands, and data. The box became a commodity and the moat moved off the box. So when someone tells you the model is the moat, hear what is actually being said: they are selling razors and have misread where the value went. Renting the model because it is “the magic” is paying a premium for the blade-holder while handing over the blades. Every generation of this story has had a crowd convinced the commoditizing component was the prize. They were wrong every time, and they always sounded reasonable while being wrong.

What a commodity looks like from the inside

I want to ground this in something I actually did, because the abstract claim that the model is a commodity sounds different once you have lived it.

A few weeks ago I swapped the model running my entire daily stack. The production weights went from one quantization to another, an AutoRound build that scored 12.7% higher on my coding gate than the version it replaced. Here is the part that matters for this essay: the swap was invisible to every client. Same served name, same port, same API surface. The editor talking to it, the agents calling it, the dashboard polling it, none of them knew anything had changed except that the answers got better. I changed the engine under the hood and nothing downstream had to be touched.

That transparency is not a nice operational detail. It is the definition of a commodity. A commodity is precisely the thing you can swap for an equivalent without the rest of the system noticing, the way you can put any brand of gasoline in a car. The fact that I could replace the model behind a stable interface and have the whole stack carry on is direct, lived proof that the model is the interchangeable layer. The interface I built, the process I wrapped around it, the data flowing through it, that all stayed. The model, the supposedly precious frontier asset, turned out to be the one piece I could quietly hot-swap.

The same lesson arrived from the other direction when I tested bigger models against my smaller daily driver. By the capability story, a larger, newer model should win. It did not. The 120-billion-parameter Nemotron build ran at roughly a third of my daily model’s throughput, around 23.7 tok/s, and despite passing a coding gate it made zero tool calls, which made it useless as an agent. The newer GPT-OSS-120B ran faster, around 59.5 tok/s, and still lost on the actual agent benchmark. Bigger and newer did not translate into better-for-my-work. If raw model capability were the scarce, decisive thing, the largest model would have won every time. It kept losing to the smaller model that fit my process. More evidence, from my own logs, that the model is not where the value lives.

Retrieval is where the moat does the work

The cleanest place to watch “the model is the cheap part” stop being a slogan is retrieval. Retrieval-augmented generation, RAG, is the plumbing that feeds your own data to a model at the moment it answers, instead of hoping the answer was baked into the weights during training. Once you have that plumbing, two consequences follow that run straight into the rest of this series.

The first is about tokens and size. If the knowledge an answer needs is supplied at query time from your corpus, the model no longer has to be the one that memorized the whole internet. It only has to be good enough to read what you handed it and write a clean answer, which is a far lower bar, and a small model on a desk clears it on your own domain. The frontier’s advantage is the breadth of knowledge baked into its weights; retrieval routes around the exact part you were going to pay the most to rent. The scarce input, your curated data, is the thing you already hold, and a cheap swappable model is enough to put it to work. I run precisely this: a local model answers questions about my own writing by retrieving from a curated index, not by being large enough to have swallowed it.

The second is about ethics, and it connects to the stolen-goods problem head on. An answer grounded in retrieval is grounded in sources you hold, can cite, and have the right to use. It is the opposite of an answer dredged from the opaque, partly-pirated training set folded into the weights. RAG does not remediate what is already congealed in the model, but it shifts the load-bearing knowledge from the stolen layer to the attributable one. The answer can show its work, and that is an ethical difference, not just a technical one.

There is an honest engineering footnote, because the honesty is the product. When I measured my own retrieval, the expensive part, dense vector embeddings, did not beat plain lexical search on my curated, well-tagged corpus, and I rolled the embeddings back. That is not a local quirk. The standard retrieval benchmark, BEIR, found that strong lexical baselines like BM25 are hard for dense retrievers to beat out of domain, and a frontier lab keeps BM25 as a first-class half of its own contextual retrieval precisely because it nails the exact technical terms a curated corpus is full of. On a small, well-tagged corpus the cheap retrieval method is also the better one. The whole stack, model and retrieval both, turned out to be a place where the expensive layer was not the valuable layer.

Then just rent the commodity

Here is the objection that should be bothering you, because it bothered me, and a sovereignty essay that ducks its strongest counter is just marketing with a Latin word in it.

If the model is the cheap, commoditizing part, the conclusion seems obvious, and it is the opposite of mine. Do not own the commodity. Why would you? You do not run your own oil refinery to avoid paying for gasoline. Rent the cheap thing from whoever makes it cheapest, ride that decline down as the price drops an order of magnitude a year, and keep the scarce part, your data and process, safe at home behind the privacy controls the provider offers. Zero-retention API tiers exist. Data-processing agreements exist. You can, on paper, consume the commodity model while keeping your moat local. The commoditization of the model, on this reading, is the reason renting is the smart move, not a reason to own. Let someone else eat the depreciation on the cheap part.

That is a genuinely good argument and I held it for a while. The flaw is in the phrase “keep your data safe at home behind the provider’s controls,” because the commodity you are renting is the same system that logs your data, and the terms of that logging are set by the party you are renting from, not by you. You cannot send a model your data without sending a model your data. The privacy controls are promises made by the counterparty, revocable by the counterparty, on a timeline the counterparty chooses. A zero-retention tier is a setting on someone else’s server. The moat you were trying to protect, your data and your process, flows through the exact layer you do not control, and every call teaches that layer a little more about how you work.

So the commoditization argument does not point where it first seemed to. If the model were the scarce, expensive thing, renting it would be straightforwardly rational, pay the specialist, skip the capital cost. But the model is cheap and your data is the asset, which means the rental relationship has you paying a premium for the worthless layer while piping the valuable layer through infrastructure that bills you, logs you, and can change its terms whenever it likes. The right response to a commoditizing model is not to rent it more comfortably. It is to bring the cheap, swappable part in-house so that the scarce part never has to leave the building. Commoditization is the reason to own the model, not the reason to rent it. The cheap part is cheap enough to own. It is the only part you safely can.

What I am actually claiming

I am not claiming my desk machine beats the frontier on capability. It does not, the largest models live in datacenters I do not control, and that gap is real. I am not claiming self-hosting is cheaper per token at low volume, because the cost model says it is not until you are running somewhere around 800 to 1,000 calls a day, and most people live far below that line. Those concessions stand, and an honest argument keeps them in view.

What I am claiming is narrower and, I think, harder to dismiss. The owning-versus-renting choice is not an ideological preference dressed up as engineering. It is the correct economic read of a stack whose layers are moving in opposite directions. The model layer is commoditizing, falling roughly an order of magnitude a year in cost, swappable behind a stable interface, beatable by smaller models that fit your process. Your data, your process, and your judgment are not commoditizing at all. They are the scarce, appreciating moat. A rental relationship gets the trade exactly backwards: you pay full price for the part that is becoming worthless and settle the bill with the only part that was ever worth anything, one logged call at a time. Owning the cheap part is how you stop paying a premium for a commodity and stop exporting the only thing that was ever worth anything.

This is part of a longer argument that runs through the whole series. The spine of it, and why each essay concedes its strongest objection before answering it, is on the philosophy page. The structured, complete version is the forthcoming book, for which these essays are the public workshop. If you read only one more thing, read why I keep a machine idling at 22% on purpose, because the answer is the same as the one here: availability of the scarce thing, not occupancy of the cheap thing, is what you are actually buying. The rest of the case is on the philosophy page.

Comparison

What the bill is for, scored two ways

You pay a premium for the first column and export the second into someone else's weights.

What you rent

What is actually scarce

The model weights

Premium price for a commoditizing layer.

Cheap, swappable, falling in cost yearly.

Your data

Streamed out, logged, used for training.

The asset only you hold. Keep it home.

Your process

Reshaped to fit the provider's defaults.

The workflow you built and can repair.

Your judgment

Eroded by deferring every call to the API.

The durable skill no model replaces.