February 2026

PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference

digitado ⋅ February 9, 2026

arXiv:2602.06072v1 Announce Type: new Abstract: Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths for high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into […]


technocracy

FlashSketch: Sketch-Kernel Co-Design for Fast Sparse Sketching on GPUs

digitado ⋅ February 9, 2026

arXiv:2602.06071v1 Announce Type: new Abstract: Sparse sketches such as the sparse Johnson-Lindenstrauss transform are a core primitive in randomized numerical linear algebra because they leverage random sparsity to reduce the arithmetic cost of sketching, while still offering strong approximation guarantees. Their random sparsity, however, is at odds with efficient implementations on modern GPUs, since it leads to irregular memory access patterns that degrade memory bandwidth utilization. Motivated by this tension, we pursue a sketch-kernel co-design approach: we design […]


Computationally Efficient Laplacian CL-colME

digitado ⋅ February 9, 2026

arXiv:2602.06070v1 Announce Type: new Abstract: Decentralized collaborative mean estimation (colME) is a fundamental task in heterogeneous networks. Its graph-based variants B-colME and C-colME achieve high scalability of the problem. This paper evaluates the consensus-based C-colME framework, which relies on doubly stochastic averaging matrices to ensure convergence to the oracle solution. We propose CL-colME, a novel variant utilizing Laplacian-based consensus to avoid the computationally expensive normalization processes. Simulation results show that the proposed CL-colME maintains the convergence behavior and […]


HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference

digitado ⋅ February 9, 2026

arXiv:2602.06069v1 Announce Type: new Abstract: The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, a novel, integrated methodology designed to achieve synergistic model acceleration while adhering to strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight sensitivity metric, derived from a highly efficient approximation of the Fisher Information Matrix […]


iScheduler: Reinforcement Learning-Driven Continual Optimization for Large-Scale Resource Investment Problems

digitado ⋅ February 9, 2026

arXiv:2602.06064v1 Announce Type: new Abstract: Scheduling precedence-constrained tasks under shared renewable resources is central to modern computing platforms. The Resource Investment Problem (RIP) models this setting by minimizing the cost of provisioned renewable resources under precedence and timing constraints. Exact mixed-integer programming and constraint programming become impractically slow on large instances, and dynamic updates require schedule revisions under tight latency budgets. We present iScheduler, a reinforcement-learning-driven iterative scheduling framework that formulates RIP solving as a Markov decision process […]


Mapping Gemma3 onto an Edge Dataflow Architecture

digitado ⋅ February 9, 2026

arXiv:2602.06063v1 Announce Type: new Abstract: We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Our work introduces a set of hardware-aware techniques. For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism. For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which re-structures […]


Deep Unfolded Fractional Optimization for Maximizing Robust Throughput in 6G Networks

digitado ⋅ February 9, 2026

arXiv:2602.06062v1 Announce Type: new Abstract: The sixth-generation (6G) of wireless communication networks aims to leverage artificial intelligence tools for efficient and robust network optimization. This is especially the case since traditional optimization methods often face high computational complexity, motivating the use of deep learning (DL)-based optimization frameworks. In this context, this paper considers a multi-antenna base station (BS) serving multiple users simultaneously through transmit beamforming in downlink mode. To account for robustness, this work proposes an uncertainty-injected deep […]


UAV-Mounted Aerial Relays in Military Communications: A Comprehensive Survey

digitado ⋅ February 9, 2026

arXiv:2602.06061v1 Announce Type: new Abstract: Relays are pivotal in military communication networks, expanding coverage and ensuring reliable connectivity in challenging operational environments. While traditional terrestrial relays (TR) are constrained by fixed locations and vulnerability to physical obstructions, unmanned aerial vehicle (UAV)-mounted aerial relays (AR) offer a dynamic and flexible alternative by operating above obstacles and adapting to changing battlefield conditions. This paper provides a comprehensive survey of AR systems in military communications, presenting a detailed comparison between AR […]


Quantifying Energy-Efficient Edge Intelligence: Inference-time Scaling Laws for Heterogeneous Computing

digitado ⋅ February 9, 2026

arXiv:2602.06057v1 Announce Type: new Abstract: Large language model inference on resource constrained edge devices remains a major challenge for low latency intelligent systems, as existing solutions depend heavily on cloud or datacenter infrastructure. This work introduces QEIL, Quantifying Edge Intelligence via Inference time Scaling Laws, a unified framework for efficient local LLM inference using principled scaling laws and heterogeneous orchestration across CPU, GPU, and NPU accelerators. We derive five architecture agnostic theorems that characterize how inference efficiency scales […]


Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space

digitado ⋅ February 9, 2026

arXiv:2602.06056v1 Announce Type: new Abstract: Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. […]
