Giving a local 8B model persistent memory and retrieval good enough to replace a cloud assistant for daily coding. The architecture is mem0 plus a RAG knowledge base over ChromaDB. The honest part is the two bugs that made the first version forget you and answer the wrong question with full confidence.

A Second Brain for a Local Model, and the Two Bugs That Made It Useless First

The goal was specific: let a local 8-billion-parameter model do the daily coding-assistant work that a cloud model had been doing, on the operator’s own hardware, with no data leaving the box. The blocker was never raw capability. An 8B model on current hardware is good enough at code. The blocker was memory. A stateless local model meets you fresh every session. It does not know who you are, what you are building, or what you decided last week. A cloud assistant papers over this with a large context window and server-side history. The local equivalent has to be built.

This is the build. It is two stores over one vector database, bridged into both the chat interface and the coding tools so they share a single memory. It is also the two bugs that made the first version worse than nothing, because those are the part worth reading.

The naive version forgets you

The first cut was the obvious one. Run mem0 for facts about the operator, run a retrieval-augmented knowledge base for notes, point both at a local embedding model, expose them as tools. On paper it works. In practice the first version failed in two different ways within the first hour, and both failures are the kind that a demo never catches because a demo is one happy-path query.

The honest framing for the rest of this post: I shipped a memory system that forgot people when the GPU was busy, and a retrieval system that answered “what is RAG” with a note about keyboard lighting. Both are fixed now. The fixes are more interesting than the architecture.

Two stores, one database

The split matters, so it is worth being precise about it. There are two different things a model needs to remember, and they have different shapes.

The first is facts about the operator. This person codes in the evenings, prefers Python, is learning Rust, decided last week to use a particular library. These are short, they accumulate slowly, and they are personal. That is what mem0 holds. It runs a small extraction pass to turn a conversational sentence into a stored fact, and it keeps those facts in a ChromaDB collection.

The second is knowledge. Setup notes, how a service is wired, what a term means, the post-mortem from a bug three weeks ago. These are longer, they are written deliberately, and they are not about the person. That is the RAG knowledge base: Markdown notes, one concept per file, embedded into a separate ChromaDB collection and retrieved on demand.

Both stores sit in the same ChromaDB instance with different collections. The embeddings come from a local model, nomic-embed-text through Ollama on the machine with a GPU to spare, and a CPU sentence-transformer on the machine without one. Nothing in the retrieval path touches the network. That is the whole point of the exercise.

Bug one: the memory depended on a model that is not always up

mem0’s default behavior is to run an extraction model over every write. You hand it “I prefer to code in the evenings” and it calls an LLM to distill that into a clean stored fact. This is a reasonable default and it produces tidy memories.

It also means every memory write depends on an LLM being available at write time. On a single-GPU box the inference server is a shared resource under a mutex. When the model is loaded for something else, or swapped out, or simply busy, the extraction call has nowhere to go. The first time this happened the write did not error in a way anyone would notice. It just did not store anything. The operator told the assistant a preference, the assistant acknowledged it, and the fact evaporated because the extraction step failed silently behind the acknowledgment.

A second brain that forgets whenever the GPU is doing something else is not a second brain. It is a brain with a dependency on the exact resource that is most often contended.

The fix is a fallback, and it is short. Attempt the normal inferred write. If the result comes back empty, which is what a failed extraction looks like, store the raw text directly with extraction turned off. mem0 supports a raw-store mode for exactly this. The stored fact is slightly less tidy, since it keeps the operator’s own phrasing instead of a distilled version, but it is stored. The memory layer now degrades to “remember it verbatim” instead of “drop it” when the model is unavailable. Writes no longer stall or vanish because the GPU is busy.
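A minimal sketch of that fallback, under stated assumptions: the `add` callable is assumed to behave like mem0's `Memory.add`, accepting an `infer` keyword and returning either a list or a dict with a `results` list. The function name `remember` and the status strings are hypothetical, and the real write path may differ.

```python
from typing import Any, Callable


def remember(add: Callable[..., Any], text: str, user_id: str) -> str:
    """Store `text`, falling back to a verbatim write if extraction yields nothing.

    Assumption: `add` mimics mem0's Memory.add(text, user_id=..., infer=...).
    Returns "inferred", "raw", or "failed" so the caller can report honestly.
    """

    def stored(result: Any) -> list:
        # Accept both a bare list and a {"results": [...]} dict.
        if isinstance(result, dict):
            return result.get("results") or []
        return result or []

    try:
        if stored(add(text, user_id=user_id, infer=True)):
            return "inferred"
    except Exception:
        # An unreachable extraction model is treated like an empty result.
        pass

    # Extraction produced nothing: keep the operator's own phrasing instead.
    if stored(add(text, user_id=user_id, infer=False)):
        return "raw"
    return "failed"
```

Returning a status rather than a bare boolean lets the assistant say which kind of write happened, which matters when the acknowledgment is the only signal the operator sees.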

Bug two: the retrieval returned keyboard lighting for “what is RAG”

The retrieval bug was funnier and more instructive. The knowledge base had a clean glossary note defining RAG. Ask the system “what is RAG” and it returned a note about the laptop’s keyboard RGB backlight. Confidently. With a relevance score that looked fine.

Two things were going wrong at once. The first is that the query expansion step was non-deterministic. The retrieval uses a hypothetical-answer technique: before searching, it asks the local model to draft a plausible answer, then embeds that draft to find similar notes. That draft was being generated at a normal sampling temperature, so the same question produced a different hypothetical answer every time, which produced different search results every time. A retrieval system that returns different results for the identical question is not a retrieval system you can trust.

The second is that short queries in German, which is the operator’s language, do not carry enough semantic signal for pure vector similarity. “Was ist RAG” is three words and only a handful of tokens. An embedding of that little text sits in a fuzzy region of the vector space, and the nearest neighbor was a note that happened to share surface tokens. The acronyms RAG and RGB share two of their three letters, and the German text around both looked similar to the model.

The fixes that made retrieval deterministic

Three changes, in order of how much they mattered.

The query expansion now runs at temperature zero. The same question produces the same hypothetical answer, which produces the same search, which returns the same notes. Determinism is not a nice-to-have for retrieval. It is the difference between a tool the operator learns to trust and one that feels haunted.
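As a rough sketch of what a deterministic expansion call could look like against Ollama's `/api/generate` endpoint: the model tag `llama3.1:8b`, the prompt wording, the fixed seed, and the function names are all assumptions for illustration, not the setup described here.

```python
import json
import urllib.request


def hyde_payload(question: str, model: str = "llama3.1:8b") -> dict:
    """Build the request body for the hypothetical-answer draft.

    The model tag and prompt text are placeholders. Temperature 0 (plus a
    fixed seed) is what makes the draft repeatable for the same question.
    """
    return {
        "model": model,
        "prompt": f"Write a short, plausible answer to: {question}",
        "stream": False,
        "options": {"temperature": 0, "seed": 0},
    }


def draft_answer(question: str,
                 url: str = "http://localhost:11434/api/generate") -> str:
    """Send the payload to a local Ollama server and return the draft text."""
    body = json.dumps(hyde_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["response"]
```

The draft returned by `draft_answer` is what gets embedded and searched, so pinning the sampling options pins everything downstream of it.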

Pronoun normalization rewrites first-person queries before search. The operator asks “what do you know about me”, and the profile note is indexed under a name, not under the word “me”. So the query layer rewrites first-person pronouns to the operator’s handle before the search runs. This sounds trivial and it roughly doubled the hit rate on personal queries, because the most common questions are exactly the ones phrased in the first person.
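The rewrite itself can be very small. The sketch below assumes English pronouns only and a hypothetical handle `alex`; a German-language setup would need its own word list, and contractions such as "I'm" are not handled.

```python
import re


def normalize_pronouns(query: str, handle: str) -> str:
    """Replace first-person pronouns with the operator's handle before search."""
    replacements = {
        "i": handle,
        "me": handle,
        "myself": handle,
        "my": f"{handle}'s",
        "mine": f"{handle}'s",
    }
    # Word boundaries keep words like "time" or "mimic" untouched.
    pattern = re.compile(r"\b(" + "|".join(replacements) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m.group(1).lower()], query)


print(normalize_pronouns("what do you know about me", "alex"))
# what do you know about alex
```

The rewritten query is only used for retrieval, so slightly ungrammatical output is acceptable as long as the handle appears where the profile note expects it.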

A deterministic glossary boost pins exact matches. For a short definition query, vector similarity is the wrong tool. So before trusting the vector search, the retrieval checks whether any query word maps to a glossary note by exact slug. If “RAG” maps to the glossary note glossary-rag, that note is pinned to the top of the results regardless of what the vector search thought. The expensive fuzzy method handles the long-tail questions. A cheap exact-match shortcut handles the definition questions where fuzzy was failing. The “what is RAG” query now returns the RAG note, every time.
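A minimal sketch of the pinning step, assuming notes are identified by slug and glossary notes share a `glossary-` prefix as in the example above. The function name and the `keyboard-rgb-backlight` slug are hypothetical.

```python
import re


def pin_glossary(query: str, vector_hits: list[str], note_ids: set[str],
                 prefix: str = "glossary-") -> list[str]:
    """Put exact-slug glossary matches first, then the vector results."""
    pinned: list[str] = []
    for word in re.findall(r"\w+", query.lower()):
        slug = prefix + word
        if slug in note_ids and slug not in pinned:
            pinned.append(slug)
    # Keep the vector ranking for everything that was not pinned.
    return pinned + [hit for hit in vector_hits if hit not in pinned]


hits = pin_glossary(
    "Was ist RAG",
    vector_hits=["keyboard-rgb-backlight"],
    note_ids={"glossary-rag", "keyboard-rgb-backlight"},
)
print(hits)
# ['glossary-rag', 'keyboard-rgb-backlight']
```

The lookup is a set membership test per query word, so it costs almost nothing and never depends on the embedding model.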

There was a fourth, smaller fix in the indexer. Standalone Markdown headings were being embedded as their own chunks, which produced a pile of useless 34-character entries that matched nothing well. The indexer now merges a lone heading into the paragraph that follows it, so the heading’s words travel with the content they introduce.
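One way to sketch the heading merge, assuming the indexer splits notes on blank lines and uses ATX-style `#` headings. Both are assumptions about the chunker, and the function name is hypothetical.

```python
import re


def chunk_markdown(text: str) -> list[str]:
    """Split on blank lines, folding a lone heading into the block after it."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    pending: list[str] = []  # headings waiting for a body paragraph
    for block in blocks:
        is_lone_heading = "\n" not in block and re.match(r"#{1,6}\s+\S", block)
        if is_lone_heading:
            pending.append(block)
            continue
        chunks.append("\n\n".join(pending + [block]))
        pending = []
    if pending:
        # Trailing headings with no body are kept together rather than dropped.
        chunks.append("\n\n".join(pending))
    return chunks


note = "# Setup\n\nInstall the service.\n\n## Ports\n\nIt listens on 8080."
print(chunk_markdown(note))
# ['# Setup\n\nInstall the service.', '## Ports\n\nIt listens on 8080.']
```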

One brain for chat and for coding

The architecture would be half as useful if the chat interface and the coding tools each kept their own memory. The operator tells the chat assistant a preference in the evening, then opens a coding session the next morning and the coding tool has never heard of it. That split is the normal state of affairs with most setups, and it is why memory feels unreliable even when it technically works.

The bridge is an MCP layer. One process exposes the memory and knowledge-base tools under sub-paths, and three different surfaces bind to the same servers: the chat interface calls them as native function calls, and the two coding tools bind the identical servers directly over stdio. Because all three point at the same ChromaDB, there is one brain. A fact saved from a chat message is visible to the coding tool, and a note written during a coding session is visible in chat. The operator stops experiencing memory as a per-app feature and starts experiencing it as a property of the machine.

Small models need positive instructions

One behavior caught me out and is worth passing on, because it is the kind of thing that does not show up until a small model is in the loop. The system prompt for the memory-writing persona originally said, in effect, “never claim you saved something if you did not call the save tool.” A larger model respects that. The 8B model read it, skipped the tool call, and cheerfully replied “noted, I have saved that” anyway. The negative instruction was simply not weighted heavily enough to change the behavior.

Rewriting the same rule in the positive fixed it: “if you cannot save, say plainly that you cannot save this right now.” Small models follow positive instructions about what to do far more reliably than negative instructions about what not to do. After any prompt change like this, the test is to provoke the exact failure on purpose and confirm the new phrasing holds. A rule you did not adversarially test is a rule you are guessing about.

Verifying it end to end

The verification was deliberately not a single happy-path query, because a single query is what hid both bugs. The check exercises the real tool path the agents use: recall the operator’s facts, run a knowledge query in the first person to confirm pronoun normalization fires, run a definition query to confirm the glossary boost fires, and then a full round trip of writing a fact, reading it back, and deleting it.

The round trip surfaced one more honest detail. Writing a marker fact, then searching for the marker word, returned nothing, which looked like a failure until I read the stored record. The extraction model had reworded “marker xyzzy” into a clean sentence that no longer contained the marker word. The write had worked perfectly. The search was looking for a string the model had paraphrased away. The lesson is to delete test data by its record identifier, not by searching for text the extraction step may have rewritten. Small thing, but it is exactly the kind of thing that makes you distrust a system that is actually fine.
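The round trip can be sketched so that cleanup never depends on the stored wording. The `FakeStore` below is a hypothetical in-memory stand-in that rewords text on write, the way an extraction model might. Its `add`, `get`, and `delete` methods are assumed to mirror the shape of a mem0-style store, not to reproduce it.

```python
import uuid


class FakeStore:
    """In-memory stand-in that paraphrases on write, like an extraction pass."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def add(self, text: str, user_id: str) -> dict:
        record_id = str(uuid.uuid4())
        # The stored sentence deliberately no longer contains the marker word.
        record = {"id": record_id, "memory": "The user created a test entry.",
                  "user_id": user_id}
        self.records[record_id] = record
        return {"results": [record]}

    def get(self, record_id: str):
        return self.records.get(record_id)

    def delete(self, record_id: str) -> None:
        self.records.pop(record_id, None)


def round_trip(store, text: str, user_id: str) -> bool:
    """Write a fact, read it back by id, and always delete it by id."""
    ids = [r["id"] for r in store.add(text, user_id=user_id)["results"]]
    try:
        return bool(ids) and all(store.get(i) is not None for i in ids)
    finally:
        for record_id in ids:
            store.delete(record_id)


store = FakeStore()
print(round_trip(store, "marker xyzzy", "operator"), len(store.records))
# True 0
```

A text search for "xyzzy" would find nothing in this store, yet the round trip passes and leaves no test data behind, because every step after the write works from the returned identifiers.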

What it cost and what it bought

The pieces are all open and local: ChromaDB for vectors, mem0 for the fact store, a local embedding model, a knowledge base of Markdown files, and one MCP process to bridge it into chat and coding. The fixes that made it usable were not new components. They were a fallback write path, a temperature of zero, a pronoun rewrite, an exact-match shortcut, and a heading merge. None of those are in a tutorial. All of them came from watching the thing fail on a real question.

What it bought is the thing the project set out to buy. A local 8B model now does the daily coding-assistant work with persistent, per-operator memory and a retrieval layer that returns the right note for a short question, on hardware the operator owns, with nothing leaving the box. The capability was always there. The memory layer is what turned a capable stateless model into an assistant that knows who it is working for.
