We’ll benchmark an open-weight LLM on any GPU you choose: drop your model + hardware and we’ll run it. [D]

digitado ⋅ July 4, 2026

We run HexGrid Cloud, a platform for deploying open-source models on GPUs, and we’re heads-down optimizing our serving/deployment layer.

To pressure-test it, we’re benchmarking real models under real concurrency, and instead of guessing which setups matter, we’d rather run what you actually want to see.

—

Models available for benchmarking:

Nemotron-3 Super 120B-A12B (only NVFP4)
Nemotron-3 Nano 30B A3B
Qwen-3.6 27B
Llama 3.3 70B Instruct
Gemma-4 31B
Devstral-Small-2-24B-Instruct-2512
?? (suggest a model)

We’re focused on chat/instruct models for now (that’s what most of our users deploy), so pick one from the list above — or suggest another open-weight chat model that fits on a single H200 (141GB).
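For a quick sanity check on whether a suggestion fits, here is a minimal sketch of a weights-only memory estimate. The bytes-per-parameter values are nominal (real quantized checkpoints carry extra scale metadata), the `headroom` fraction is a hypothetical allowance for KV cache and runtime overhead, and none of this is HexGrid's actual sizing logic.

```python
# Rough, weights-only memory estimate. Ignores KV cache, activations,
# and framework overhead, so treat the result as a lower bound.

# Nominal bytes per parameter (assumption: AWQ and NVFP4 treated as 4-bit).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "AWQ": 0.5, "NVFP4": 0.5}

H200_GB = 141  # single-H200 memory figure quoted in the post


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[quant]


def fits(params_billion: float, quant: str,
         gpu_gb: float = H200_GB, headroom: float = 0.8) -> bool:
    """True if weights fit within `headroom` of GPU memory.

    `headroom` is a hypothetical knob: the fraction of memory allowed
    for weights, leaving the rest for KV cache and overhead.
    """
    return weight_gb(params_billion, quant) <= gpu_gb * headroom


if __name__ == "__main__":
    for name, size_b, quant in [
        ("Llama 3.3 70B Instruct", 70, "BF16"),
        ("Llama 3.3 70B Instruct", 70, "FP8"),
        ("Nemotron-3 Super 120B-A12B", 120, "NVFP4"),
    ]:
        print(f"{name} @ {quant}: ~{weight_gb(size_b, quant):.0f} GB, "
              f"fits with headroom: {fits(size_b, quant)}")
```

By this estimate a 70B model in BF16 needs about 140 GB for weights alone, which leaves almost nothing for KV cache on a 141GB card, so FP8 or a 4-bit quant is the more realistic request at long context.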

—

Hardware & quant choices:

GPU (up to H200 for this round): RTX PRO 6000 · L40S · H100 · H200
Quant: FP8 / AWQ / BF16
Context length: 8K · 32K · 64K · 128K
What you want measured: max throughput? single-stream speed? long-context prefill?

—

We’ll run the top picks and post full results — tokens/sec, TTFT, TPOT, throughput under concurrency, and cost-per-million-tokens — config and flags included so it’s reproducible.
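To make those metrics concrete, here is a minimal sketch of how they are commonly derived from per-request timestamps. Definitions vary between benchmarking tools, so this is one reasonable reading rather than the exact method behind the posted results; `RequestTrace`, its field names, and the GPU hourly price are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (seconds) and token count for one streamed request."""
    start: float         # request sent
    first_token: float   # first output token received
    end: float           # last output token received
    output_tokens: int


def ttft(r: RequestTrace) -> float:
    """Time to first token."""
    return r.first_token - r.start


def tpot(r: RequestTrace) -> float:
    """Time per output token, averaged over tokens after the first."""
    if r.output_tokens < 2:
        return 0.0
    return (r.end - r.first_token) / (r.output_tokens - 1)


def aggregate_throughput(traces: List[RequestTrace]) -> float:
    """Output tokens/sec across all concurrent requests (wall-clock)."""
    wall = max(r.end for r in traces) - min(r.start for r in traces)
    return sum(r.output_tokens for r in traces) / wall


def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_sec: float) -> float:
    """USD per 1M output tokens at a sustained throughput."""
    return gpu_hourly_usd / (tokens_per_sec * 3600) * 1_000_000


if __name__ == "__main__":
    traces = [
        RequestTrace(start=0.0, first_token=0.5, end=2.5, output_tokens=101),
        RequestTrace(start=0.0, first_token=1.0, end=5.0, output_tokens=201),
    ]
    tps = aggregate_throughput(traces)
    print(f"TTFT: {ttft(traces[0]):.2f}s  TPOT: {tpot(traces[0]) * 1000:.0f}ms")
    print(f"Throughput: {tps:.1f} tok/s")
    # 3.60 USD/hour is a made-up price for illustration only.
    print(f"Cost: ${cost_per_million_tokens(3.60, tps):.2f} per 1M tokens")
```

Note that this cost figure counts output tokens only; a benchmark that also bills prompt tokens would report a different number for the same run.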

Let us know in the comments.

submitted by /u/Temporary-Owl1725