We Were Wrong About Local 8B Tool-Use (2026 Reality Check)

June 1, 2026 15 min read

In mid-May I wrote a memo to my own future self that said local 8B models are too weak for MCP tool-use. The memo lived in my agent’s persistent memory under the unfortunate filename local-llm-tool-use-limitations.md. Every time a new session asked the question, it found that file and proceeded under the assumption that local tool-use was broken.

Two weeks later I tested again, with the same models, on different hardware, and got perfect results. The memo was right about the symptom and wrong about the cause. The bridge layer was the broken part, not the model.

This post is the receipt. It includes the exact curl commands you can run to verify on your own setup. If any of the numbers are wrong, please correct me in public.

The original memo

The triggering context was a Tuesday in May when I tried to use opencode (a CLI coding agent) against a local Ollama instance. opencode would send a user message to the model, then immediately inject its own internal “title generator” turn before the model could respond. The model saw two consecutive user-role messages, which violated strict role alternation, and either the chat template crashed or the model hallucinated an answer with no tool call.

I noted the symptom honestly: tool calls were not happening, hallucinations were happening. I wrote the cause incorrectly: “8B models cannot reliably handle function-calling protocols at scale.” That generalization is what got persisted into agent memory, and it bit me for two weeks.

The retest

I had a fresh laptop on the bench: Lenovo Legion Pro 7 Gen 10, RTX 5080 Mobile, Ubuntu 26.04, Ollama with four models loaded. I needed to validate that the friend who would own this laptop could actually use MCP tools from OpenWebUI. So I shelled in and ran the protocol directly.

The test prompt was deliberately German to exercise non-English retrieval: “Suche in meiner KB nach luks passphrase.” (Search my KB for luks passphrase.) The tool definition followed standard OpenAI function-calling format with a single kb_search tool that takes a query string.

Here is the raw curl I used:

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Suche in meiner KB nach luks passphrase"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "kb_search",
        "description": "Sucht in der Knowledge Base",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }]
  }'

I ran the same payload, swapping the model field, against four targets: qwen3:8b, mistral:7b, llama3.1:8b (all three local Ollama on the Legion), plus qwen3.6-35b via Tailscale on the DGX Spark vLLM endpoint.

All four returned a clean tool_calls block with function.name = "kb_search" and function.arguments = {"query": "luks passphrase"}. No model wrote prose. No model dropped the tool. The German parsed fine, the JSON was structurally valid, the argument string was the right value.
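
For readers who prefer a script to curl, here is a minimal Python sketch of the same request. The helper names (build_payload, extract_tool_calls, ask) are mine, not part of any library, and the parser accepts both the OpenAI-style reply (choices[0].message, arguments as a JSON string) and the Ollama-native reply (message at the top level, arguments as an object). Treat it as a sketch under those assumptions, not a reference client.

```python
import json
import urllib.request

# Same tool definition as in the curl command above.
KB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "kb_search",
        "description": "Sucht in der Knowledge Base",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def build_payload(model, prompt):
    """Request body from the curl command, with the model field swapped in."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [KB_SEARCH_TOOL],
    }


def extract_tool_calls(response):
    """Return [(name, arguments_dict)] from an OpenAI-style or Ollama-native reply."""
    if "choices" in response:  # OpenAI-compatible shape
        message = response["choices"][0]["message"]
    else:  # Ollama-native shape
        message = response.get("message", {})
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # OpenAI-style replies carry a JSON string
            args = json.loads(args)
        calls.append((fn["name"], args))
    return calls


def ask(base_url, model, prompt):
    """POST the payload to <base_url>/v1/chat/completions and return the tool calls."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return extract_tool_calls(json.load(reply))
```

Calling ask("http://localhost:11434", "qwen3:8b", "Suche in meiner KB nach luks passphrase") against a running Ollama should return a list with one kb_search entry if the stack behaves as described; an empty list means the model answered in prose.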

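The shape of a qwen3:8b response, redacted for length. The fields below follow Ollama’s native response layout; the OpenAI-compatible /v1/chat/completions route nests the same message under choices[0] and returns arguments as a JSON-encoded string: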

{
  "model": "qwen3:8b",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
      "function": {
        "name": "kb_search",
        "arguments": {"query": "luks passphrase"}
      }
    }]
  },
  "done_reason": "stop",
  "total_duration": 1842318917,
  "eval_count": 31,
  "eval_duration": 521000000
}

The content field is empty because the model correctly chose tool-call over prose. The eval_count of 31 tokens over an eval_duration of 0.52 seconds (about 59.5 tokens per second) means the tool decision took roughly half a second of generation on cold cache, inside a total_duration of 1.84 seconds. On warm cache the same call comes back in 180-220 ms, which is well inside the latency budget for an interactive tool-use loop.
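
The timing fields in that response are reported in nanoseconds, so the throughput can be recomputed directly from them. A small sketch of the arithmetic, with a helper name (generation_stats) that is mine:

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def generation_stats(eval_count, eval_duration_ns, total_duration_ns):
    """Convert Ollama's nanosecond timing fields into seconds and tokens per second."""
    generation_seconds = eval_duration_ns / NANOSECONDS_PER_SECOND
    total_seconds = total_duration_ns / NANOSECONDS_PER_SECOND
    return {
        "generation_seconds": generation_seconds,
        "tokens_per_second": eval_count / generation_seconds,
        # Everything that is not token generation: model load, prompt processing, etc.
        "other_seconds": total_seconds - generation_seconds,
    }


# Values taken from the qwen3:8b response shown above.
stats = generation_stats(31, 521_000_000, 1_842_318_917)
print(f"{stats['tokens_per_second']:.1f} tok/s over {stats['generation_seconds']:.3f} s")
# prints: 59.5 tok/s over 0.521 s
```

The remaining 1.3 seconds of total_duration is time spent outside token generation, which is why the cold-cache call feels slower than the generation rate alone suggests.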

I ran the same test ten times in a row to check stability. Ten clean tool calls, no hallucinations, no malformed JSON, no extra prose. Then I varied the prompt ten ways, for example “Search for luks”, “Find the LUKS notes”, and “What does the KB say about disk encryption”. All ten variations also produced clean tool calls with sensibly transformed query strings. The model is not memorizing the exact prompt-to-call mapping. It is doing the work.

The throughput numbers from the retest, measured against a 200-token completion warmup prompt:

qwen3:8b: 59.5 tokens per second
mistral:7b: 65.5 tokens per second
llama3.1:8b: 63.5 tokens per second

These are 8 to 16 percent below the roughly 71 tok/s that the DFlash-tuned Qwen 3.6 PrismaQuant, a much larger model, reaches on the DGX Spark. A laptop RTX 5080 Mobile running stock-quantized 8B models with no speculative decoding can produce tool calls nearly as fast as a server-tuned 35B-parameter model. That is a useful baseline to internalize.

To make this concrete with a single end-to-end measurement, I ran a real-world request from the friend’s laptop against the DGX Spark Qwen 3.6 35B PrismaQuant via the Tailscale-shared vLLM endpoint on 2026-05-30. Prompt was 27 tokens. Completion was 332 tokens. Total 359 tokens. Wall-clock time was 7.7 seconds, including Tailscale latency and HTTP overhead. That works out to about 43 completion tokens per second effective. The pure-inference number measured at the Spark itself sits at 50 to 57 tok/s, which puts the Tailscale tax for this request shape at roughly 14 to 24 percent.

For comparison, cloud-hosted ChatGPT-4 runs around 30 to 50 tok/s in typical use and Claude Opus around 50 to 80 tok/s. The DGX-Spark-over-Tailscale experience is at parity with current cloud providers from the friend’s seat, with the difference that nothing crosses the two-tailnet boundary.

The practical implication for this post’s thesis: the larger remote model is in the same interactive-latency ballpark as the local 8B models. The friend can absolutely run a 35-billion-parameter model from his laptop. The infrastructure is already there, the latency is already acceptable, and the quality bump for hard queries is what he wanted the escape hatch for in the first place.
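
The effective-throughput figure and the network overhead can be recomputed from the numbers in that measurement. This is only the arithmetic, written out so you can plug in your own request; the function names are mine, and the overhead is expressed relative to whichever server-side baseline you choose.

```python
def effective_tok_per_s(completion_tokens, wall_seconds):
    """Completion tokens delivered per second of wall-clock time, as seen by the client."""
    return completion_tokens / wall_seconds


def overhead_fraction(effective, baseline):
    """Share of the server-side throughput lost between server and client."""
    return 1.0 - effective / baseline


# 332 completion tokens in 7.7 s wall-clock, from the 2026-05-30 request.
effective = effective_tok_per_s(332, 7.7)
print(f"effective: {effective:.1f} tok/s")

# Compare against both ends of the 50 to 57 tok/s range measured at the Spark.
for baseline in (50.0, 57.0):
    share = overhead_fraction(effective, baseline)
    print(f"baseline {baseline:.0f} tok/s -> overhead {share:.0%}")
```

The overhead depends heavily on which baseline you pick, so it is worth measuring the server-side rate for the same request shape rather than reusing a number from a different prompt.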

Why the bridge breaks what the model handles

I went back and reread my old memo with the new evidence. The story it should have told is this:

The model is fine. The protocol Ollama exposes via /v1/chat/completions is fine. What broke was the layer between them.

opencode-TUI does helpful things like generating chat titles, summarizing context, and injecting reminders. Each of these is injected as a user-role message before the model has answered the actual user message. The Qwen 3 chat template assumes strict role alternation: system, then user, then assistant, then user, then assistant. Two user-role messages in a row break that assumption. The model produces nonsense not because it cannot reason, but because the input it sees is malformed at the protocol layer.

OpenWebUI takes a different approach. It treats the model as an OpenAI-compatible endpoint and respects role boundaries. When you enable a tool server via mcpo, OpenWebUI builds the messages list correctly and the model returns sensible tool calls. There is no Title-Generator-User-Inject step, because OpenWebUI uses a separate metadata channel for chat titles.

The lesson, in retrospect, is one I would not have learned without the second test: when a model misbehaves, examine the exact bytes being sent to it before drawing conclusions about model capability.

To make this concrete, the failing trace I should have captured back in May looked like this (reconstructed from opencode --debug logs):

POST /v1/chat/completions
{
  "messages": [
    {"role": "system", "content": "<long system prompt>"},
    {"role": "user", "content": "Suche meine notes für luks"},
    {"role": "user", "content": "Generate a short title (max 5 words) for the above conversation."}
  ]
}

That second user-role message is what kills it. The model receives the title-generation request as if it were the actual user task. Some chat templates throw, some silently overwrite the previous user turn, some concatenate. Qwen 3’s template falls into the third category and produces a confused merge that the model interprets as nonsense input. The “8B is too weak” framing was completely wrong. A 70B model fed the same malformed sequence would also produce garbage. Garbage in, garbage out is older than transformers.
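
Checking for this before blaming the model takes a few lines. The sketch below scans a messages list for a repeated user or assistant role; the function name is mine, and the rule it encodes (no two consecutive user turns, no two consecutive assistant turns) is a simplification of what a strict-alternation chat template expects.

```python
def find_alternation_violations(messages):
    """Return the indexes of messages that repeat the previous message's role."""
    violations = []
    previous_role = None
    for index, message in enumerate(messages):
        role = message["role"]
        # Tool results can legitimately follow one another, so this sketch
        # only flags repeated user or assistant turns.
        if role == previous_role and role in ("user", "assistant"):
            violations.append(index)
        previous_role = role
    return violations


# The reconstructed failing trace from above.
trace = [
    {"role": "system", "content": "<long system prompt>"},
    {"role": "user", "content": "Suche meine notes für luks"},
    {"role": "user", "content": "Generate a short title (max 5 words) for the above conversation."},
]
print(find_alternation_violations(trace))  # [2]
```

Running a check like this on the logged request body is the "examine the exact bytes" step in practice: an empty list means the bridge layer is at least sending a well-formed thread.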

The fix on the opencode side was eventually committed as a side-car proxy that buffers the title-generation request, holds it back until after the main response lands, and then submits it as a separate completion request. For OpenWebUI users, the fix is the absence of the bug: OpenWebUI never inserts a title-generation user-turn into the live message thread to begin with.
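
I have not reproduced the side-car's code here, but the buffering idea can be sketched in a few lines. Everything below is hypothetical: the TITLE_MARKER heuristic, the function names, and the choice to append the deferred turn after the assistant reply are my assumptions about how such a proxy could work, not a description of the actual fix.

```python
# Assumption: injected title turns can be recognized by how their text begins.
TITLE_MARKER = "Generate a short title"


def split_title_requests(messages):
    """Separate injected title-generation turns from the live message thread."""
    live, deferred = [], []
    for message in messages:
        is_title = (
            message["role"] == "user"
            and message["content"].startswith(TITLE_MARKER)
        )
        (deferred if is_title else live).append(message)
    return live, deferred


def title_request_for(live, deferred_message, assistant_reply):
    """Build the separate request to send once the main reply has landed."""
    return live + [
        {"role": "assistant", "content": assistant_reply},
        deferred_message,
    ]
```

With this split, the main request goes out as system, user, and the title request follows later as system, user, assistant, user, so both threads respect role alternation.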

The mcpo architecture, briefly

For readers who have not run it: mcpo is a small Python proxy that takes MCP servers (which speak stdio JSON-RPC) and exposes them as OpenAPI HTTP endpoints. OpenWebUI then registers each /openapi.json URL as a tool server, fetches the spec at registration time, and presents the tools to the model in OpenAI function-calling format.

The chain looks like this:

Model in OpenWebUI
    -> tool_calls in response
    -> OpenWebUI router
    -> HTTP POST to mcpo
    -> mcpo dispatches to stdio MCP server
    -> result returns up the chain
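
The registration step, where an OpenAPI spec becomes a list of function-calling tools, can be illustrated with a simplified conversion. This is not OpenWebUI's code: openapi_to_tools is a hypothetical helper, it only looks at POST operations with a JSON request body, and it does not resolve $ref pointers, which real specs often use.

```python
def openapi_to_tools(spec):
    """Map each POST operation in an OpenAPI spec to an OpenAI-style tool definition."""
    tools = []
    for path, operations in spec.get("paths", {}).items():
        post = operations.get("post")
        if post is None:
            continue  # this sketch treats only POST operations as tools
        schema = (
            post.get("requestBody", {})
            .get("content", {})
            .get("application/json", {})
            .get("schema", {"type": "object", "properties": {}})
        )
        tools.append({
            "type": "function",
            "function": {
                # Fall back to the path when the operation has no operationId.
                "name": post.get("operationId", path.strip("/")),
                "description": post.get("summary", ""),
                "parameters": schema,
            },
        })
    return tools
```

The point of the sketch is that the model never sees MCP or OpenAPI; it only sees the resulting tools list, in the same format as the kb_search definition in the curl command.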

On the Legion laptop I have four MCP servers behind one mcpo process:

kb (Knowledge Base search and write)
mem0 (persistent personal memory)
sovgrid-ai (search and read sovgrid.org articles)
context7 (current library documentation)

Total of 16 tools exposed. The mcpo config file is 30 lines. There is no other glue. The model picks the tool, OpenWebUI executes it, the result lands back in the conversation. End to end test: I asked qwen3:8b to “search my KB for luks passphrase” through OpenWebUI, and it returned the actual glossary-embeddings note from the local ChromaDB, with the matching snippet about semantic search finding luks-passphrase-aendern.md even when the word “Passwort” is not in the query. The whole loop works.

What still does not work cleanly

I am not going to claim local 8B tool-use is solved. Five honest caveats:

Multi-step chains lose thread. “Search the KB for X, then summarize the top three matches, then update my memory with the summary” requires the model to chain three tool calls and integrate three results. qwen3:8b handles two-step chains reliably. Three steps work about 70 percent of the time. The DGX-Spark-side Qwen 3.6 35B handles four-step chains fine. Size still matters for chain depth.

Argument formatting varies across models. Qwen returns {"query":"luks passphrase"} with no space after the colon. Mistral returns {"query": "luks passphrase"} with one. Llama wraps the entire tool call in a code fence sometimes. OpenWebUI handles all three. Naive parsers do not.
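
If you write your own parser, a tolerant version is short. The sketch below is an assumption about what "handles all three" needs to cover: it accepts an already-decoded object, a JSON string with any whitespace, and JSON wrapped in a code fence. The function name is mine; when the model fences the whole tool call rather than just the arguments, the caller gets the whole decoded object back and has to pick out the arguments itself.

```python
import json
import re

TICKS = "`" * 3  # a code-fence delimiter, built this way to keep this listing readable
FENCE = re.compile(
    r"^" + TICKS + r"[a-zA-Z]*\s*\n(.*?)\n?" + TICKS + r"\s*$",
    re.DOTALL,
)


def parse_tool_arguments(raw):
    """Decode tool arguments given as a dict, a JSON string, or fenced JSON."""
    if isinstance(raw, dict):
        return raw  # already decoded, as in Ollama-native replies
    text = raw.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1).strip()  # drop the surrounding code fence
    return json.loads(text)  # raises json.JSONDecodeError on anything else
```

Letting json.loads raise on unparseable input is deliberate in this sketch: a loud failure is easier to debug than a silently empty arguments object.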

Embedding models refuse function-calling correctly. When I tested nomic-embed-text against the same prompt+tools payload, Ollama rejected it with “model does not support generate API.” That is the right behavior. Embedding-only models should not pretend to be chat models. I mention this because it is the only “failure” I observed in the retest, and it is a feature, not a bug.

Ambiguous tool names cause silent mis-selection. I set up a deliberate trap: two tools called kb_search and kb_query with overlapping descriptions. qwen3:8b picked the wrong one 30 percent of the time. Mistral picked the wrong one 45 percent of the time. The fix is on the tool-author side: do not ship overlapping tools to the model. The DGX-Spark-side larger models are not magically better here either; they just have a longer attention window, so the disambiguation hint at the bottom of the prompt actually lands. On a laptop-class context window the lesson is to deduplicate aggressively before exposing tools.

Tool-call refusal is not implemented. None of the three local 8B models refuse a tool call when the user prompt is clearly asking for something dangerous. The 35B remote model also does not refuse. The OpenAI hosted models would refuse with a policy message. If you ship a tool that can delete files, the model will call it without hesitation. That is your job to gate, not the model’s. The dashboard’s Doktor-tab encodes this: any action that touches the system is allowlisted and requires an explicit confirmation step in the UI, not just in the prompt.
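
The gating logic itself is small. The following is a sketch of the allowlist-plus-confirmation idea, not the dashboard's actual code: the two sets, the tool name kb_write, and the confirm callback are assumptions I introduce for illustration.

```python
SAFE_TOOLS = {"kb_search"}    # assumption: read-only tools that may run unattended
CONFIRM_TOOLS = {"kb_write"}  # assumption: allowlisted, but a human must approve each call


def gate_tool_call(name, arguments, confirm):
    """Decide whether a model-requested tool call may run.

    confirm is a callback (name, arguments) -> bool that asks the user in the UI.
    Returns "run", "declined", or "blocked".
    """
    if name in SAFE_TOOLS:
        return "run"
    if name in CONFIRM_TOOLS:
        return "run" if confirm(name, arguments) else "declined"
    return "blocked"  # anything not allowlisted never reaches the confirmation step
```

The important property is that the default is "blocked": a tool the model invents, or one you forgot to classify, cannot run just because the prompt asked nicely.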

Concrete next steps if you have shelved local tool-use

If you are sitting on an old “local tool-use does not work” conclusion the way I was, here is the minimum-effort retest:

Install Ollama if you have not. curl -fsSL https://ollama.com/install.sh | sh. Pull one of the three models above.
Run the curl from earlier in this post. If you get a tool_calls block, your local stack works.
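If you want the full OpenWebUI experience with persistent KB and memory, the simplest path is a single docker compose up with OLLAMA_BASE_URL pointing at the host’s Ollama on port 11434 (for example http://host.docker.internal:11434, with an extra_hosts entry mapping host.docker.internal to host-gateway), plus a side-car mcpo container running the MCP servers you actually want to expose.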
Test against your real use case before believing my numbers. Mine were measured on a laptop with one specific GPU and one specific model quantization. Yours may differ. The honest test is your prompt, not my prompt.

The memo file gets a successor

I have updated my agent’s persistent memory with a corrective entry pointing at this post and the underlying memory file feedback_local_llm_tools_2026.md. The old local-llm-tool-use-limitations.md is not deleted. Both files exist. The cross-reference between them is the cheapest way to make sure future sessions of my agents do not silently re-inherit the wrong conclusion.

The general lesson sits below the specific one. Persistent agent memory is useful exactly when it is correct, and slightly worse than no memory when it has stale wrong conclusions. The mitigation is not “be smarter when you write memos.” The mitigation is to schedule retest passes for any memo that has shaped agent behavior for more than a month, and update or contradict the original entry rather than letting it stand.

I am going to apply that retest discipline to other entries in agent memory next, starting with any conclusions about Voxtral and TTS expressivity, which were also drawn under specific failure conditions and may not generalize either. If the next retest produces another “we were wrong” post, I will write it the same way.

The receipt for this post is the curl command at the top. If it does not produce the result I claimed on your setup, that is news, and I want to know.

The friend test

The reason I bothered to retest in the first place was a friend’s laptop sitting on my bench. He was about to receive it. He does not know what a tool call is. He will not run curl. He will open OpenWebUI, ask the local model a question in German, and expect the answer to come back with relevant snippets from his KB.

If the model fails the protocol, he will think the system is broken. He will not blame the bridge layer, because he does not know there is a bridge layer. He will blame the whole thing and stop using it. The friend test is therefore stricter than the developer test: anything intermittent reads as broken.

Across the first 48 hours of his ownership, the model handled 31 of his 33 KB queries cleanly. The two failures were both multi-step (“find the LUKS note, then send me a summary” where he wanted the summary via the chat) and both fell back to the model writing prose with the snippet inlined, which was an acceptable degraded output. Zero hard crashes. Zero “the model said nothing useful” complaints. The friend test is the one that counts. The retest validated the protocol; the friend validated the experience.

This is also why I am not chasing 100 percent multi-step reliability on the 8B locals. The realistic envelope for a sovereign-AI laptop in 2026 is single-step tool calls with high reliability, with multi-step chains either degrading gracefully to prose-with-citation or escalating transparently to the DGX-Spark-side larger model via Tailscale. For the routine case, the laptop is enough. For the rare deep query, the laptop knows it can phone home.

What I would change in agent-memory hygiene

The mistake here was not the original memo. The original memo was an accurate trace of a specific failure. The mistake was the filename and the lack of a freshness flag.

The filename local-llm-tool-use-limitations.md implies a general claim. A truthful filename would have been opencode-strict-alternation-bug-may-2026.md, which would have signaled to future sessions that the conclusion was scoped to a specific tool, a specific protocol bug, and a specific time. The general framing leaked through the filename more than through the content. Future agent sessions read the filename first when deciding which memory to consult.

The freshness flag is the second piece. Any memo that has changed agent behavior over more than 30 days should carry a one-line “last verified” date at the top, and a retest plan. The corrective entry I wrote today includes that field. The two-line retest plan is: “rerun the curl from this article against the current local model lineup; if tool calls succeed, mark verified; if not, write a corrective entry like this one.” Future me, or future agent of mine, does not have to invent the retest. It is on file.
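
A freshness flag is only useful if something reads it. Here is a minimal sketch of that check, assuming the memo carries a line of the form "last verified: YYYY-MM-DD"; that exact format, the function name, and the decision to treat a missing flag as stale are my assumptions, not an existing convention of any agent framework.

```python
import re
from datetime import date, timedelta

# Assumption: memos carry a line such as "last verified: 2026-06-01".
LAST_VERIFIED = re.compile(
    r"^last verified:\s*(\d{4}-\d{2}-\d{2})\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def needs_retest(memo_text, today, max_age_days=30):
    """Return True when a memo is older than max_age_days or has no freshness flag."""
    match = LAST_VERIFIED.search(memo_text)
    if match is None:
        return True  # no flag at all: treat the conclusion as unverified
    verified = date.fromisoformat(match.group(1))
    return today - verified > timedelta(days=max_age_days)
```

An agent session could run this over every memory file at startup and surface the stale ones, which turns the retest plan from a good intention into a queue.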

This is operationally cheap and epistemically large. Memos that hard-code state-of-the-art conclusions decay fast. Memos that document a specific failure with a specific retest plan decay slowly. It is the same discipline that good test suites apply to code, applied to the conclusions an AI agent persists about its own world.

Sibling posts on this thread

24 Hours Setting Up a Lenovo Legion Pro 7 Gen 10 is the full day-of mechanics post that includes this retest as one milestone.

Sovereign Friend-Setup is the concept post that explains why the friend test matters more than the developer test.

Dashboard As Learning-Cockpit shows the UI that exposes these tool calls to a non-admin user.

Two Tailnets, One Shared Node is the network primitive that lets the laptop escalate to the DGX-Spark-side larger model when it needs to.
