March 2026

Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression

digitado ⋅ March 26, 2026

arXiv:2603.23527v1 Announce Type: new Abstract: Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results […]

technocracy

Plato’s Cave: A Human-Centered Research Verification System

digitado ⋅ March 26, 2026

arXiv:2603.23526v1 Announce Type: new Abstract: The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato’s Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the […]

Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

digitado ⋅ March 26, 2026

arXiv:2603.23525v1 Announce Type: new Abstract: The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates […]

Navigating the Concept Space of Language Models

digitado ⋅ March 26, 2026

arXiv:2603.23524v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations […]

Do 3D Large Language Models Really Understand 3D Spatial Relationships?

digitado ⋅ March 26, 2026

arXiv:2603.23523v1 Announce Type: new Abstract: Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably to or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether the model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, […]

Qworld: Question-Specific Evaluation Criteria for LLMs

digitado ⋅ March 26, 2026

arXiv:2603.23522v1 Announce Type: new Abstract: Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question’s context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. […]

Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

digitado ⋅ March 26, 2026

arXiv:2603.23521v1 Announce Type: new Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining […]

From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians’ Medical Expertise with Lightweight LLM

digitado ⋅ March 26, 2026

arXiv:2603.23520v1 Announce Type: new Abstract: Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians’ knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large […]

MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

digitado ⋅ March 26, 2026

arXiv:2603.23519v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual […]

Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

digitado ⋅ March 26, 2026

arXiv:2603.23518v1 Announce Type: new Abstract: General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs […]
