How NVIDIA's tested playbooks transform DGX Spark into a reproducible AI development environment with pre-configured stacks, MCP integration, and battle-tested configurations.

NVIDIA Playbook Stack


You’re running a DGX Spark in your basement and wondering why the NVIDIA playbooks exist when you could just grab the models from Hugging Face. The playbooks aren’t documentation. They’re a tested, versioned stack that turns a DGX Spark from a fancy GPU into a reproducible AI development environment. Without them, you’re debugging dependencies like this:

# Example of manual dependency hell
pip install torch==2.3.1+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.41.2 --no-deps
pip install accelerate==0.30.1
# ... then realize CUDA 12.1 needs a driver build your 5.15.0-105-generic kernel doesn't have

The playbooks do the heavy lifting so you don’t have to. They’re not just READMEs with commands—they’re Ansible roles, Docker Compose files, and environment variables tuned for DGX Spark’s hardware.
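
As a concrete taste, a GPU passthrough role boils down to tasks like these. This is a simplified sketch, not the repo’s actual tasks/main.yml; the apt module and the nvidia-ctk command are standard, the handler name is illustrative:

# Hypothetical sketch of a gpu_passthrough role's tasks/main.yml
- name: Install the NVIDIA Container Toolkit
  ansible.builtin.apt:
    name: nvidia-container-toolkit
    state: present
    update_cache: true

- name: Register the nvidia runtime with Docker
  ansible.builtin.command: nvidia-ctk runtime configure --runtime=docker
  notify: restart docker  # assumes a matching handler in the role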

Quick Take

  • NVIDIA’s playbooks provide pre-configured stacks for LLM serving, web UIs, and agent architectures on DGX Spark.
  • MCP integration lets you expose local services like portfolio optimization as tools for agents like Claude or OpenHands.
  • The playbooks are opinionated but not prescriptive—you still need to adjust networking, storage, and security for your environment.

The NVIDIA playbooks live in /data/playbooks/nvidia and are cloned directly from their GitHub repository. Here’s the directory structure after cloning:

/data/playbooks/nvidia
├── nvidia/ollama
│   ├── ansible/
│   │   └── roles/
│   │       ├── gpu_passthrough/
│   │       │   └── tasks/main.yml
│   │       └── persistent_volume/
│   │           └── tasks/main.yml
│   ├── docker-compose.yml
│   ├── .env
│   └── systemd/
│       └── ollama.service
├── nvidia/open-webui
│   ├── Dockerfile.arm64
│   ├── docker-compose.yml
│   └── patches/
│       └── webui.patch
└── nvidia/sglang
    ├── ansible/
    │   └── roles/
    │       └── memory_settings/
    │           └── tasks/main.yml
    └── docker-compose.yml
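
Getting to that state is a single clone. The URL below assumes NVIDIA’s published dgx-spark-playbooks repository; substitute your fork or mirror if you use one:

$ sudo mkdir -p /data/playbooks
$ git clone https://github.com/NVIDIA/dgx-spark-playbooks.git /data/playbooks/nvidia  # repo URL assumed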

The nvidia/ollama playbook, for example, isn’t just a wrapper around ollama pull. It sets up a persistent volume for model storage, configures the NVIDIA Container Toolkit for GPU passthrough, and includes a systemd service to keep the Ollama server running across reboots:

# /data/playbooks/nvidia/nvidia/ollama/docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia
    volumes:
      - /mnt/models:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    ports:
      - "11434:11434"
    restart: unless-stopped
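
The systemd unit from the tree above is what survives reboots. A minimal sketch of what ollama.service amounts to, assuming the compose file stays where the playbook puts it (the repo’s actual unit may differ):

# Sketch of /data/playbooks/nvidia/nvidia/ollama/systemd/ollama.service (illustrative)
[Unit]
Description=Ollama server via Docker Compose
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/data/playbooks/nvidia/nvidia/ollama
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target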

The nvidia/open-webui playbook does the same for the web interface, but it also patches the Dockerfile to use the DGX Spark’s ARM64 base image instead of the default x86_64 one:

# /data/playbooks/nvidia/nvidia/open-webui/Dockerfile.arm64
FROM --platform=linux/arm64 ghcr.io/open-webui/open-webui:main
# ... additional ARM64-specific patches

Skipping these patches means your UI won’t start, or worse, it will start but silently fail to load models. Here’s the error you’ll see if you try to run the x86_64 image on ARM64:

$ docker compose up
[+] Running 1/1
 Container open-webui  Creating
Error response from daemon: image with reference open-webui:main was found but does not match the specified platform linux/arm64
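
A quick way to confirm what you actually pulled, using nothing but the standard Docker CLI:

$ docker image inspect ghcr.io/open-webui/open-webui:main \
    --format '{{.Os}}/{{.Architecture}}'
linux/amd64   # wrong architecture; rebuild from Dockerfile.arm64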

The playbooks aren’t magic. They assume you’ve already partitioned your NVMe drives for ZFS, set up user namespaces for unprivileged containers, and disabled swap to avoid OOM kills during large model loads. If you haven’t, the playbooks will fail in ways that look like model incompatibilities but are actually storage or permissions issues. The nvidia/sglang playbook is particularly brutal here. SGLang’s RadixAttention engine requires pinned memory, and the playbook’s default settings assume you’ve allocated 80% of your RAM to the GPU:

# /data/playbooks/nvidia/nvidia/sglang/docker-compose.yml
environment:
  - SGLANG_ALLOCATE_80_PERCENT_RAM=1
  - NVIDIA_VISIBLE_DEVICES=all
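
Before deploying, confirm that containers can actually see the GPU. This is the standard NVIDIA Container Toolkit smoke test, nothing playbook-specific:

$ docker run --rm --gpus all ubuntu:22.04 nvidia-smi -L
GPU 0: ...   # output depends on your hardware; an error here means passthrough is broken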

If you skimped on RAM or didn’t set the NVIDIA_VISIBLE_DEVICES environment variable correctly, the playbook will deploy but the service will crash with a segfault. Here’s the error message you’ll see in the logs:

$ journalctl -u sglang -f
-- Logs begin at Mon 2024-06-10 12:00:00 UTC. --
Jun 10 12:01:42 dgx-spark sglang[12345]: [2024-06-10 12:01:42,789] ERROR engine.cpp:123] failed to initialize engine
Jun 10 12:01:42 dgx-spark sglang[12345]: [2024-06-10 12:01:42.789] [FATAL] CUDA error: out of memory

The headline error won’t mention memory. It says “failed to initialize engine”; the actual out-of-memory line is buried below it, and it’s easy to lose hours debugging CUDA contexts before you spot it.
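
When you hit this, check memory pressure before you touch CUDA. Plain shell, nothing playbook-specific:

$ free -h              # how much RAM is actually available
$ swapon --show        # no output means swap is off, as the playbooks assume
$ nvidia-smi           # leftover processes still holding GPU memory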

The repository also includes playbooks for services you might not need yet but will regret skipping later. The nvidia/portfolio-optimization playbook, for instance, sets up a FastAPI service that wraps NVIDIA’s cuQuantum portfolio optimizer:

# /data/playbooks/nvidia/nvidia/portfolio-optimization/app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import cuquantum as cq
import numpy as np

app = FastAPI()

class PortfolioRequest(BaseModel):
    weights: list[float]
    covariance_matrix: list[list[float]]

@app.post("/optimize")
async def optimize_portfolio(request: PortfolioRequest):
    try:
        # Pre-quantized model loaded here
        model = cq.PortfolioOptimizer.load("model.npz")
        result = model.optimize(request.weights, request.covariance_matrix)
        return {"status": "success", "result": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
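
Once the service is up, a well-formed request looks like this (the numbers are placeholders):

$ curl -s -X POST http://localhost:8000/optimize \
    -H "Content-Type: application/json" \
    -d '{"weights": [0.6, 0.4], "covariance_matrix": [[0.04, 0.01], [0.01, 0.09]]}'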

It’s not just a Python script. It includes a systemd socket activation unit so the service starts only when a request arrives, reducing idle memory usage:

# /data/playbooks/nvidia/nvidia/portfolio-optimization/systemd/portfolio-optimization.socket
[Socket]
ListenStream=8000
Accept=true

[Install]
WantedBy=sockets.target
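
Activating it is the usual systemd sequence. Note that with Accept=true, systemd spawns a per-connection instance of a template unit, so it expects a portfolio-optimization@.service alongside the socket:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now portfolio-optimization.socket
$ systemctl status portfolio-optimization.socket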

It also ships with a pre-quantized model so you’re not waiting for a 4-hour quantization job on your first run. Skip this playbook and you’ll spend a weekend wrestling with cuQuantum’s Python bindings and CUDA version mismatches.

MCP integration is where the playbooks start to feel like part of a larger system. The Model Context Protocol isn’t just another API standard. It’s a way to expose local services as tools that agents can call without knowing the underlying implementation. The example MCP server in the playbook isn’t a toy. It’s a template for turning any DGX Spark service into an agent-executable tool:

# /data/playbooks/nvidia/nvidia/mcp-server/optimize_portfolio.py
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel, Field
import httpx

class OptimizeRequest(BaseModel):
    weights: list[float] = Field(..., description="Portfolio weights")
    covariance_matrix: list[list[float]] = Field(..., description="Covariance matrix")

server = FastMCP("portfolio_optimizer")

@server.tool()
async def optimize_portfolio(request: OptimizeRequest) -> dict:
    """Forward a validated optimization request to the local FastAPI service."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/optimize",
            json=request.model_dump(),
            timeout=60.0,
        )
        response.raise_for_status()
        return response.json()

if __name__ == "__main__":
    server.run(transport="stdio")  # blocking; FastMCP manages its own event loop
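
Wiring this into an agent is one stanza of config, since the stdio transport means the agent simply spawns the script. For Claude Desktop, the entry follows the standard claude_desktop_config.json format; the paths here are assumptions:

# Added to claude_desktop_config.json (paths illustrative)
{
  "mcpServers": {
    "portfolio_optimizer": {
      "command": "python",
      "args": ["/data/playbooks/nvidia/nvidia/mcp-server/optimize_portfolio.py"]
    }
  }
}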

The optimize_portfolio tool, for example, doesn’t just forward requests to the FastAPI endpoint. It validates the input schema, checks the API key against a local file, and sets a 60-second timeout to prevent hanging requests:

# Example of malformed payload causing issues
curl -X POST http://localhost:8000/optimize \
  -H "Content-Type: application/json" \
  -d '{"weights": [0.5, 0.5]}'  # Missing covariance_matrix
# Response (HTTP 422): {"detail":[{"type":"missing","loc":["body","covariance_matrix"],"msg":"Field required",...}]}

If you skip the schema validation, agents will send malformed payloads and crash your service. If you skip the API key check, you’re one misconfigured firewall away from exposing your portfolio optimizer to the internet.

The MCP server runs as a separate process, listening on a Unix socket. The playbook includes a systemd service to start it on boot:

# /data/playbooks/nvidia/nvidia/mcp-server/systemd/mcp-server.service
[Unit]
Description=MCP Portfolio Optimizer Server
After=network.target

[Service]
ExecStart=/usr/local/bin/mcp-server optimize_portfolio
Restart=always
User=nvidia
Group=nvidia

[Install]
WantedBy=multi-user.target

But you still need to configure the socket path and permissions. The default path is /run/mcp/sovereign-grid.sock, but if you’re running multiple MCP servers, you’ll want to change it:

# Example of socket path configuration
mkdir -p /etc/systemd/system/mcp-server.service.d
cat > /etc/systemd/system/mcp-server.service.d/override.conf <<EOF
[Service]
Environment="MCP_SOCKET_PATH=/run/mcp/portfolio.sock"
EOF
systemctl daemon-reload

The playbook’s example simply calls server.run(transport="stdio") from __main__, which works, but it isn’t production-ready. For real use, you’ll want to add logging, metrics, and graceful shutdown. The playbook doesn’t include these because NVIDIA assumes you’ll extend it for your needs. If you don’t, your MCP server will die silently when the system runs out of file descriptors:

# Example of adding logging to the MCP server
import logging
from mcp.server.fastmcp import FastMCP

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.FileHandler("/var/log/mcp-server.log")],
)
server = FastMCP("portfolio_optimizer")
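
And since the silent-death mode above is file-descriptor exhaustion, raise the limit in the unit while you’re at it. LimitNOFILE is a standard systemd directive; it isn’t in the playbook’s unit:

# Drop-in override for mcp-server.service
[Service]
LimitNOFILE=65536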

The playbook matrix includes options for fine-tuning, quantization, and multi-node setups, but these are where the playbooks start to show their limits. The nvidia/unsloth playbook, for example, assumes you’re using PyTorch 2.3.1 and CUDA 12.1:

# /data/playbooks/nvidia/nvidia/unsloth/docker-compose.yml
environment:
  - PYTORCH_VERSION=2.3.1
  - CUDA_VERSION=12.1

If you’re running a newer version, the playbook will fail to build the Unsloth wheels, and you’ll be left debugging pip’s dependency resolver. Here’s the error you’ll see:

$ docker compose build unsloth
# ... build output ...
ERROR: Could not find a version that satisfies the requirement torch==2.3.1+cu121 (from unsloth==0.8.3)

The playbook includes a note about this in the README, but it’s buried:

> **Note**: If you're using PyTorch 2.4.0 or later, you'll need to manually adjust the `PYTORCH_VERSION` and `CUDA_VERSION` in the `.env` file.
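
In practice that means editing the playbook’s .env before building. The version pair below is illustrative; match it to whatever CUDA your driver stack actually supports:

# /data/playbooks/nvidia/nvidia/unsloth/.env (values illustrative)
PYTORCH_VERSION=2.4.0
CUDA_VERSION=12.4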

The nvidia/connect-two-sparks playbook is even riskier. It’s labeled as a future feature, but the playbook’s README assumes you’ve already set up InfiniBand or RoCE networking:

# Example of checking InfiniBand setup
ibstat
# Expected output:
# CA 'mlx5_0'
#     CA type: MT4115
#     Port 1:
#         State: Active
#         Physical state: LinkUp

If ibstat shows the port down, or the command isn’t installed at all, sort out the fabric before you run the playbook.

NVIDIA Playbook Stack: the pre-configured AI dev environment for DGX Spark, from the hardware up:

  • Layer 1: Hardware (DGX Spark, NVMe/ZFS)
  • Layer 2: OS (Linux with user namespaces)
  • Layer 3: Runtime (Docker + NVIDIA Container Toolkit)
  • Layer 4: Services (Ollama, SGLang, FastAPI)
  • Layer 5: Agents (Claude, OpenHands via MCP)