We Built a Local Model Arena in 30 Minutes — Infrastructure Mattered More Than the App

How LLMesh turns your existing machines into a distributed inference mesh, and why we stopped caring about which model won.

Every developer building with local LLMs hits the same wall.

You run Ollama on your laptop. It works. You have a colleague with an NVIDIA box running Mistral. That works too. But your app points at localhost:11434, and when you push to staging, it breaks. So you start hardcoding IPs, writing wrapper scripts, and wishing someone had solved this already.

We built LLMesh to solve this. It’s a distributed inference broker — think of it as nginx for LLM inference. One endpoint, any backend, any machine. Your app never knows or cares where the compute lives. Your data never leaves your infrastructure — privacy by architecture, not by policy.

LLMesh has full streaming support for Ollama out of the box, with vLLM and MLX backends in beta. If your team is running models through Ollama — and many local LLM teams are — this is ready for you to use today. It exposes OpenAI and Anthropic compatible APIs, so any existing SDK integration works without code changes. Point your base_url at the hub and go.
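
The compatibility claim is easy to check with nothing but the standard library. Here is a minimal sketch, assuming the hub address and API key configured later in this guide (`http://localhost:8000`, `arena-demo-key`); it builds the request without sending it, since it needs a running hub to actually execute:

```python
import json
import urllib.request

# Assumed from this guide's setup: hub at localhost:8000, key "arena-demo-key".
HUB_URL = "http://localhost:8000"
API_KEY = "arena-demo-key"

# Standard OpenAI chat-completions wire format -- nothing LLMesh-specific.
payload = {
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
}
req = urllib.request.Request(
    f"{HUB_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# With a running hub: body = urllib.request.urlopen(req).read()
print(req.full_url)
```

With the official SDK the change is a single constructor argument, e.g. `OpenAI(base_url=f"{HUB_URL}/v1", api_key=API_KEY)`, and the rest of the integration stays untouched.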

To show what this actually feels like in practice, we built a small FastAPI app called Model Arena: a side-by-side comparison tool that sends the same prompt to two different models through LLMesh and streams both responses in real time. It took about 30 minutes. The interesting part wasn’t the app — it was what happened when we started adding machines.

What We’re Building

Model Arena is a single-page FastAPI app. You pick two models, type a prompt, and watch both responses stream side by side with latency and token counts. It runs through LLMesh, which means:

  • The app talks to one endpoint (the LLMesh hub)
  • The hub routes each request to whichever node has that model
  • If you add a second machine, the app doesn’t change
  • If you push to staging, the app doesn’t change

Here’s the architecture:

Browser → Model Arena (FastAPI :5000) → LLMesh Hub (:8000) → Node A (Ollama on laptop)
                                                           → Node B (Ollama on GPU box)

Part 1: Single Machine Setup

Start with everything on one machine. You need LLMesh and at least two models in Ollama.

Start the LLMesh hub

```bash
# Clone LLMesh
git clone https://github.com/qcoda-ai/llmesh.git
cd llmesh

# Configure an API key
cp server_config.example.json server_config.json
# Edit server_config.json:
# {
#   "api_keys": {
#     "arena-demo-key": "arena_user"
#   }
# }

# Start the hub
uvicorn lib.hub.server:app --port 8000
```

Start an agent on the same machine

```bash
# In a new terminal
LLMESH_API_KEY="arena-demo-key" \
HUB_URL="http://127.0.0.1:8000" \
python -m lib.agent.client
```

The agent auto-detects Ollama, discovers your models, and registers with the hub. You should see something like:

```
[INFO] Detected Ollama at localhost:11434
[INFO] Found models: llama3.2:3b, mistral:7b
[INFO] Registered as node abc123
```

Pull two models (if you haven’t already)

```bash
ollama pull llama3.2:3b
ollama pull mistral:7b
```

Start Model Arena

```bash
cd examples/model-arena

pip install -r requirements.txt

export LLMESH_HUB_URL="http://localhost:8000"
export LLMESH_API_KEY="arena-demo-key"

uvicorn app:app --port 5000
```

Open http://localhost:5000. You’ll see both models in the dropdowns. Type a prompt, hit Run, and watch them race.

At this point, everything is running on one machine. The hub receives your request, queues it, and the local agent picks it up. Both models run through Ollama on your laptop. This works, but it doesn’t demonstrate why LLMesh exists.

Docker alternative

If you prefer containers, the Model Arena includes a docker-compose.yml that starts the hub, Postgres (for session persistence), and the Arena UI in one command:

```bash
cd examples/model-arena
docker compose up
```

Three services come up: Postgres on 5432, the LLMesh hub on 8000, and the Arena on 5000. The hub and Arena run in containers. Agents still run on bare metal — they need access to your GPUs and local model servers.

This is where the nginx analogy becomes concrete. Look at the compose file:

```yaml
services:
  postgres:    # data layer
  hub:         # LLMesh — the routing layer
  arena:       # your app
```

The hub sits between your application and your compute, exactly where nginx sits between your users and your application servers. It routes, load-balances, and observes. Your app talks to the hub. The hub talks to agents. Agents talk to Ollama. Nobody skips a layer.

Start an agent on any machine with Ollama installed:

bash

LLMESH_API_KEY="arena-demo-key" 
HUB_URL="http://<hub-machine-ip>:8000" 
python -m lib.agent.client

The Arena at http://localhost:5000 immediately picks up the new node’s models.

Part 2: Add a Second Machine

This is where it gets interesting.

Grab any other machine on your network — a colleague’s laptop, an old workstation, a Raspberry Pi with enough RAM. Install Ollama, pull a model, and start an LLMesh agent pointing at your hub:

```bash
# On the second machine
ollama pull mistral:7b

LLMESH_API_KEY="arena-demo-key" \
HUB_URL="http://192.168.1.100:8000" \
python -m lib.agent.client
```

Replace 192.168.1.100 with your hub machine’s IP.

Now go back to Model Arena and run the same prompt again.

Nothing changed in the Arena app. Same code. Same config. Same endpoint. But the hub now has two nodes registered, and it routes mistral:7b to whichever node has capacity. If the second machine has a GPU, the latency numbers will tell you immediately — Mistral on the GPU box will finish before Llama on your laptop CPU.

This is the architectural insight: your app doesn’t know how many machines exist. It doesn’t know which machine runs which model. It sends a request to the hub, and the hub figures out the rest. Add a third machine, a fourth — the Arena still works, unchanged.
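
To make that insight concrete, here is a toy version of the routing decision. This is not LLMesh's actual scheduler, just the shape of the idea: the client names a model, and the broker picks among whatever nodes happen to be registered. Node names and fields below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    models: set
    in_flight: int = 0  # requests currently running on this node

def route(nodes, model):
    """Pick the least-loaded node that serves the requested model."""
    candidates = [n for n in nodes if model in n.models]
    if not candidates:
        raise LookupError(f"no node serves {model}")
    return min(candidates, key=lambda n: n.in_flight)

nodes = [
    Node("laptop", {"llama3.2:3b", "mistral:7b"}, in_flight=2),
    Node("gpu-box", {"mistral:7b"}, in_flight=0),
]
print(route(nodes, "mistral:7b").name)   # the idle GPU box wins
print(route(nodes, "llama3.2:3b").name)  # only the laptop has it
```

Adding a third node means appending to the list on the broker side; the caller's `route(nodes, model)` call — like your app's request to the hub — never changes.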

Part 3: Push to Staging

Here’s the portability test.

Take your Model Arena and deploy it somewhere — a VM, a Docker container, a different machine entirely. Change one environment variable:

```bash
export LLMESH_HUB_URL="http://staging-hub.yourcompany.local:8000"
```

That’s it. The Arena code is identical. The hub at the staging URL has its own agents, its own nodes, its own models. Your app doesn’t care. The endpoint is the abstraction layer.

This is the problem LLMesh was built to solve: the gap between “works on my machine” and “works in staging” for local LLM inference. Every other solution we found either assumed single-machine (Ollama), required Kubernetes (KServe, Seldon), or routed to cloud APIs (LiteLLM, Portkey) — which defeats the purpose of running local models.

Beyond Staging: Production and the Office

The same architecture scales to production. The difference isn’t the code — it’s the nodes behind the hub. In dev, you run one agent on your laptop. In staging, maybe two or three machines. In production, you provision dedicated hardware: a rack of GPU nodes, a set of Mac Studios, whatever your inference workload demands. The hub endpoint stays the same. Your application doesn’t know or care that the infrastructure behind it changed from a single laptop to a well-provisioned cluster.

But production doesn’t have to mean a data center. This is where LLMesh fits a use case that most inference tools ignore entirely.

Small office / home office (SOHO). A design studio with three Mac Minis running Ollama. A law firm with a few workstations that sit idle after hours. A research lab where everyone has their own GPU box. LLMesh turns all of that existing hardware into a shared inference layer — no cloud account, no DevOps team, no Kubernetes cluster. Install an agent on each machine, point them at a hub running on any one of them, and the entire office has a single API endpoint for local AI.

Desktop apps and internal tools. Any application that speaks the OpenAI or Anthropic API format can use the hub as its backend. That includes chat interfaces, coding assistants, document processors, summarization tools — anything your team builds or buys that supports a configurable API endpoint. Instead of each person running their own Ollama instance on their own machine, the whole office shares a pool. Models load once, across whichever machine has the most capacity.

Interoffice and multi-site. If your hub is reachable across locations — a VPN, a tunnel, a public IP with auth — remote offices and distributed teams share the same inference mesh. A developer in one city and a researcher in another both hit the same endpoint. The hub routes to whichever node is closest and least loaded.

The pattern is always the same: one endpoint, many nodes, zero config changes in the application layer.

What the Arena Reveals

After running a few dozen comparisons, the numbers tell a story:

Latency varies by hardware, not by model quality. A 3B model on an M1 MacBook can finish before a 7B model on an older Intel machine, not because the model is better but because the hardware is faster. The Arena makes this visible instantly.

Token counts are consistent across machines. The same prompt produces roughly the same token count regardless of which node handles it. This matters for cost estimation — you can predict inference costs based on prompt complexity, not deployment topology.

The interesting metric isn’t speed — it’s cost per quality. The Arena’s side-by-side view lets you see when the smaller model produces an answer that’s good enough. If llama3.2:3b gives you 90% of the quality at 40% of the latency, that’s a real infrastructure decision you can make with data.
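
That decision reduces to arithmetic. A sketch with illustrative numbers (the quality scores and latencies below are hypothetical, not Arena measurements):

```python
# Hypothetical per-model stats: quality on a 0-1 scale, wall-clock latency.
candidates = {
    "llama3.2:3b": {"quality": 0.90, "latency_s": 2.0},  # 90% quality, 40% latency
    "mistral:7b":  {"quality": 1.00, "latency_s": 5.0},
}

def value(stats):
    # Quality delivered per second of waiting -- one crude way to rank.
    return stats["quality"] / stats["latency_s"]

best = max(candidates, key=lambda m: value(candidates[m]))
print(best)
```

Any real scoring function will be messier, but the point stands: once the Arena gives you comparable numbers, "which model should serve this workload" becomes a calculation instead of a vibe.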

And all of this is already tracked in the LLMesh dashboard. Every Arena request generates real metrics: tokens per model, latency per node, and success rates over time. You get observability for free because the broker sees everything.

The 50-Line Core

Here’s the entire Arena backend — the part that talks to LLMesh:

```python
import asyncio
import json
import httpx
from fastapi import Request
from fastapi.responses import StreamingResponse

@app.post("/api/arena")
async def arena(request: Request):
    body = await request.json()
    prompt = body["prompt"]
    model_a = body["model_a"]
    model_b = body["model_b"]

    async def run_model(side, model):
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream(
                "POST",
                f"{HUB_URL}/v1/chat/completions",
                json=payload,
                headers={"Authorization": f"Bearer {API_KEY}"},
            ) as resp:
                async for line in resp.aiter_lines():
                    if line.startswith("data: ") and line[6:] != "[DONE]":
                        chunk = json.loads(line[6:])
                        content = chunk["choices"][0]["delta"].get("content", "")
                        if content:
                            yield f"data: {json.dumps({'side': side, 'content': content})}\n\n"

    async def merged():
        # Run both sides concurrently; interleave events as they arrive.
        queue = asyncio.Queue()

        async def pump(side, model):
            try:
                async for event in run_model(side, model):
                    await queue.put(event)
            finally:
                await queue.put(None)  # sentinel: this side is finished

        asyncio.create_task(pump("a", model_a))
        asyncio.create_task(pump("b", model_b))
        finished = 0
        while finished < 2:
            event = await queue.get()
            if event is None:
                finished += 1
            else:
                yield event

    return StreamingResponse(merged(), media_type="text/event-stream")
```

That’s it. The app uses the standard OpenAI chat completions endpoint with streaming — the same API shape you’d use with gpt-4o or any other hosted model. No LLMesh-specific SDK, no custom protocol, no vendor lock-in. If you ripped out LLMesh tomorrow and pointed this at OpenAI’s API, it would still work. The same applies to Anthropic’s message format — LLMesh speaks both natively.
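
On the receiving end, a client folds the merged stream back into two transcripts by switching on the `side` field. A minimal offline sketch, using hand-written sample lines in the same `{"side", "content"}` shape the Arena backend emits (a real client would read these from the HTTP response instead):

```python
import json

# Sample SSE lines standing in for a live Arena stream.
sample = [
    'data: {"side": "a", "content": "Hello"}',
    'data: {"side": "b", "content": "Hi"}',
    'data: {"side": "a", "content": " there"}',
]

transcripts = {"a": "", "b": ""}
for line in sample:
    if line.startswith("data: "):
        event = json.loads(line[len("data: "):])
        transcripts[event["side"]] += event["content"]

print(transcripts)
```

In the browser this is the same few lines of JavaScript against `EventSource` or a streamed `fetch`; the interleaving means each pane updates the moment its model produces a token, regardless of what the other side is doing.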

Why This Matters Beyond the Demo

The Arena is a toy. The pattern isn’t.

Any application that uses the OpenAI or Anthropic SDK can point at an LLMesh hub instead of a cloud endpoint. That gives you:

Environment portability. Your app works in dev, staging, and production without config changes. The hub endpoint is the same everywhere — only the nodes behind it change.

Hardware pooling. Every machine on your network becomes available compute. No Kubernetes, no orchestration platform, no cloud bill. Install an agent, point it at the hub, done.

Token visibility. Every request is logged with token counts, latency, model, and owner. You know what inference costs before you get the cloud bill — or you use this data to decide whether you need a cloud bill at all.
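
The rollup this enables is simple. A sketch with invented log rows; the field names are assumptions for illustration, not LLMesh's actual schema:

```python
from collections import defaultdict

# Hypothetical request-log rows of the kind a broker can record.
log = [
    {"model": "llama3.2:3b", "owner": "arena_user", "tokens": 180, "latency_s": 1.9},
    {"model": "mistral:7b",  "owner": "arena_user", "tokens": 210, "latency_s": 4.8},
    {"model": "mistral:7b",  "owner": "arena_user", "tokens": 195, "latency_s": 5.1},
]

# Total tokens per model -- the number you'd multiply by a per-token price
# to see what this traffic would have cost on a hosted API.
tokens_by_model = defaultdict(int)
for row in log:
    tokens_by_model[row["model"]] += row["tokens"]

print(dict(tokens_by_model))
```

Because every request flows through the hub, this data exists without any instrumentation in the application itself.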

LLMesh is open source under MIT license. If your team is running local models and hitting the localhost-to-staging wall, this is the missing layer.

llmesh.net | GitHub


:::info
LLMesh is built by Andrew Schwabe for Qcoda, an AI orchestration platform for product teams. We extracted LLMesh and open-sourced it because the inference routing layer shouldn’t be something every team reinvents. 25 years of building distributed systems led to this — the simplest infrastructure problem nobody had solved yet.

:::
