LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads

Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference engines serving those requests are under increasing strain. Researchers at the LightSeek Foundation have released TokenSpeed, an open-source, MIT-licensed LLM inference engine designed specifically for the demands of agentic workloads. TokenSpeed is currently in preview.

Why Agentic Inference Is a Different Problem

To see why TokenSpeed’s design choices matter, it helps to understand what makes agentic inference hard. Coding agents do not behave like typical chatbot turns: contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most public benchmarks do not fully capture this behavior.

TokenSpeed is designed to optimize both at once: maximize per-GPU TPM while holding a per-user TPS floor, typically 70 TPS and sometimes 200 TPS or higher.
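
For a rough sense of the trade-off, the arithmetic below (assumed numbers only, not measured TokenSpeed figures) shows how a per-user TPS floor caps the number of concurrent users a GPU can admit, and therefore the per-GPU TPM it can claim. The aggregate decode rate is a made-up constant; in practice throughput varies with batch size.

```python
# Illustrative arithmetic only (assumed numbers, not measured TokenSpeed figures):
# how a per-user TPS floor bounds the per-GPU TPM a scheduler can claim.
def max_tpm(per_user_tps_floor: float, aggregate_decode_tps: float) -> tuple[int, float]:
    """Given an aggregate decode rate for one GPU and a per-user TPS floor,
    return the largest admissible batch size and a lower bound on per-GPU TPM
    (assuming every admitted user runs exactly at the floor)."""
    max_batch = int(aggregate_decode_tps // per_user_tps_floor)  # users sharing the GPU
    tpm_floor = max_batch * per_user_tps_floor * 60              # tokens per minute at the floor
    return max_batch, tpm_floor

# e.g. a hypothetical 6,000 tok/s aggregate decode rate and a 70 TPS floor
print(max_tpm(70.0, 6_000.0))   # -> (85, 357000.0)
```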

Architecture: Five Interlocking Subsystems

TokenSpeed’s architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, safe restrictions on KV resource reuse, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.

The modeling layer uses a local SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which every process runs the same program on a different subset of the data, a common pattern in distributed deep learning. Rather than requiring developers to hand-write the communication logic between processes, TokenSpeed lets them attach I/O placement annotations at module boundaries; a lightweight static compiler then generates the required collective operations automatically during model construction.
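
The announcement does not publish the annotation API, so the following is a hypothetical sketch of the idea only: each module declares the placement of its inputs and outputs, and a small static pass inspects each module boundary and decides which collective, if any, to insert. The names Placement, ModuleIO, and plan_collectives are invented for this illustration, and the placement vocabulary is deliberately simplified (real tensor parallelism also needs partial-sum placements and all-reduce).

```python
# Hypothetical sketch: declare I/O placement at module boundaries and let a
# small static pass decide which collectives to insert. All names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    kind: str          # "replicated" or "sharded"
    dim: int = -1      # shard dimension when kind == "sharded"

@dataclass
class ModuleIO:
    name: str
    input: Placement
    output: Placement

def plan_collectives(pipeline: list[ModuleIO]) -> list[str]:
    """Walk consecutive module boundaries and emit the collective needed to
    reconcile the producer's output placement with the consumer's input placement."""
    ops = []
    for prev, nxt in zip(pipeline, pipeline[1:]):
        if prev.output == nxt.input:
            ops.append(f"{prev.name} -> {nxt.name}: no-op")
        elif prev.output.kind == "sharded" and nxt.input.kind == "replicated":
            ops.append(f"{prev.name} -> {nxt.name}: all_gather(dim={prev.output.dim})")
        elif prev.output.kind == "replicated" and nxt.input.kind == "sharded":
            ops.append(f"{prev.name} -> {nxt.name}: split(dim={nxt.input.dim})")
        else:
            ops.append(f"{prev.name} -> {nxt.name}: all_to_all")
    return ops

# Attention emits head-sharded activations; the next module expects replicated input,
# so the pass inserts an all_gather at that boundary.
pipeline = [
    ModuleIO("attention", Placement("replicated"), Placement("sharded", dim=1)),
    ModuleIO("moe",       Placement("replicated"), Placement("replicated")),
]
for op in plan_collectives(pipeline):
    print(op)   # -> attention -> moe: all_gather(dim=1)
```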

The scheduler makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that leans on the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is guaranteed by a verifiable control system rather than by convention; errors in KV cache management, one of the most error-prone areas in LLM serving, are caught earlier. The execution plane is implemented in Python to preserve development velocity, enabling faster feature iteration and lower cognitive load for developers.
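
The control plane itself is C++ and is not reproduced in the announcement. The Python sketch below is hypothetical and uses runtime checks to stand in for the compile-time guarantees the post describes; it only mirrors the shape of a request-lifecycle FSM and the rule that KV cache blocks are released exactly once, after the request reaches its terminal state.

```python
# Hypothetical sketch of a request-lifecycle FSM with single-owner KV blocks.
# The real control plane is C++ and enforces transitions via the type system;
# here runtime checks stand in for those compile-time guarantees.
from enum import Enum, auto
from typing import Optional

class RequestState(Enum):
    QUEUED = auto()
    PREFILLING = auto()
    DECODING = auto()
    FINISHED = auto()

# Legal transitions; anything else is a scheduling bug.
ALLOWED = {
    (RequestState.QUEUED, RequestState.PREFILLING),
    (RequestState.PREFILLING, RequestState.DECODING),
    (RequestState.DECODING, RequestState.DECODING),   # one decode step per tick
    (RequestState.DECODING, RequestState.FINISHED),
}

class Request:
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.state = RequestState.QUEUED
        self.kv_blocks: Optional[list[int]] = None   # owned KV cache block ids

    def transition(self, new_state: RequestState) -> None:
        if (self.state, new_state) not in ALLOWED:
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def release_kv(self) -> list[int]:
        """KV blocks may be handed back to the allocator only once the request
        is FINISHED; afterwards this request no longer owns them."""
        if self.state is not RequestState.FINISHED:
            raise RuntimeError("cannot release KV cache of a live request")
        blocks, self.kv_blocks = self.kv_blocks or [], None
        return blocks

req = Request("r0")
req.kv_blocks = [3, 4, 5]
req.transition(RequestState.PREFILLING)
req.transition(RequestState.DECODING)
req.transition(RequestState.FINISHED)
print(req.release_kv())   # -> [3, 4, 5]
```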

The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism to support heterogeneous accelerators, meaning it is not locked to NVIDIA hardware. The dev team has also developed one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, because num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. Notably, the TokenSpeed MLA kernel has been adopted by vLLM.
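
The registry and plugin mechanism is not documented in detail, so the sketch below is a generic illustration of the pattern rather than TokenSpeed’s actual API: kernels register themselves under an (operation, backend) key, and the engine core selects an implementation along a preference order instead of hard-coding a specific accelerator. The names register_kernel and pick_kernel are invented.

```python
# Hypothetical sketch of a centralized kernel registry with backend selection;
# the names and the selection rule are invented for illustration.
from typing import Callable, Dict, Tuple

_REGISTRY: Dict[Tuple[str, str], Callable] = {}   # (op_name, backend) -> kernel fn

def register_kernel(op_name: str, backend: str):
    """Decorator used by in-tree kernels and out-of-tree plugins alike."""
    def deco(fn: Callable) -> Callable:
        _REGISTRY[(op_name, backend)] = fn
        return fn
    return deco

def pick_kernel(op_name: str, available_backends: list[str]) -> Callable:
    """Return the first registered implementation along a preference order,
    so the engine core never hard-codes a specific accelerator."""
    for backend in available_backends:
        if (op_name, backend) in _REGISTRY:
            return _REGISTRY[(op_name, backend)]
    raise KeyError(f"no kernel registered for {op_name}")

@register_kernel("mla_decode", "cuda")
def mla_decode_cuda(*args): ...

@register_kernel("mla_decode", "cpu_reference")
def mla_decode_ref(*args): ...

kernel = pick_kernel("mla_decode", ["cuda", "cpu_reference"])
print(kernel.__name__)   # -> mla_decode_cuda
```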


Finally, TokenSpeed integrates SMG — a PyTorch-native component — for a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.

Benchmark Results Against TensorRT-LLM on NVIDIA B200

It is worth noting upfront that these benchmarks cover single-instance (non-disaggregated) deployment only. Prefill/decode (PD) disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the TokenSpeed team.

Working with the EvalScope team, the TokenSpeed developers evaluated the engine on SWE-smith traces, which closely mirror production coding-agent traffic, and benchmarked it against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.

For coding agents running above 70 TPS/User, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS/User. TP4 here refers to tensor parallelism across 4 GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.
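
As a concrete illustration of what TP4 sharding buys, the snippet below splits a single projection weight column-wise across four devices and compares memory footprints. The layer dimensions are assumptions chosen for readability, not Kimi K2.5’s actual shapes, and the example ignores the activation communication that tensor parallelism also introduces.

```python
# Simple arithmetic (illustrative layer sizes, not the real model's dimensions)
# showing why TP4 cuts per-device weight memory: each GPU holds one column shard
# of the projection matrix, so memory per device drops by the TP degree.
import numpy as np

hidden, ffn, tp = 4096, 16384, 4          # assumed layer sizes
w = np.zeros((hidden, ffn), dtype=np.float16)
shards = np.split(w, tp, axis=1)          # column-parallel sharding across 4 GPUs

full_mb = w.nbytes / 2**20
per_gpu_mb = shards[0].nbytes / 2**20
print(f"full weight: {full_mb:.0f} MiB, per-GPU shard: {per_gpu_mb:.0f} MiB")
# -> full weight: 128 MiB, per-GPU shard: 32 MiB
```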

On the MLA kernel, the gains are more pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 M tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM’s MLA across all five typical prefill workloads for coding agents with long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with long prefix KV cache.
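
The folding trick can be pictured as a simple reshape: with speculative decoding the per-step query length is small and MLA’s head count is modest, so merging the two axes produces a taller M dimension for BMM1 and better Tensor Core tile occupancy. The shapes below are assumptions chosen to match that description, not published kernel parameters.

```python
# Illustrative reshape (assumed shapes): fold the query-sequence axis into the
# head axis so the first batched matmul (BMM1) sees a taller M dimension.
import numpy as np

batch, q_len, num_heads, head_dim = 8, 4, 16, 128   # e.g. 4 speculative draft tokens

q = np.zeros((batch, q_len, num_heads, head_dim), dtype=np.float16)
# M grows from num_heads (16) to q_len * num_heads (64), filling a 64-row
# Tensor Core tile instead of leaving most of it idle.
q_folded = q.reshape(batch, q_len * num_heads, head_dim)
print(q.shape, "->", q_folded.shape)   # (8, 4, 16, 128) -> (8, 64, 128)
```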

Key Takeaways

  • TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads and currently available in preview.
  • Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
  • On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS/User on Kimi K2.5.
  • The TokenSpeed MLA kernel nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.

Check out the technical details at https://lightseek.org/blog/lightseek-tokenspeed.html and the GitHub repo.
