Carnegie Mellon at NeurIPS 2025

CMU researchers are presenting 156 papers at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), held December 2–7 at the San Diego Convention Center.


Oral Papers

Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain

Authors: Trinity Chung (Carnegie Mellon University), Yuchen Shen (Carnegie Mellon University), Nathan Kong (MIT), Aran Nayebi (School of Computer Science, Carnegie Mellon University)

This paper introduces an Encoder–Attender–Decoder (EAD) framework to study task-optimized neural networks for tactile processing using realistic whisker-based simulations. Convolutional recurrent neural networks (ConvRNNs) emerge as the most effective encoders, both for tactile categorization and for producing representations that closely match activity in rodent somatosensory cortex, revealing a linear link between task performance and neural alignment. Notably, self-supervised contrastive ConvRNN models achieve neural fits comparable to supervised training, indicating that label-free learning can capture biologically relevant tactile representations. These findings highlight the importance of recurrent processing for understanding cortical tactile computation and for building robust embodied AI systems.

MaxSup: Overcoming Representation Collapse in Label Smoothing

Authors: Yuxuan Zhou (CISPA Helmholtz Center for Information Security), Heng Li (Carnegie Mellon University), Zhi-Qi Cheng (University of Washington), Xudong Yan (City University of Macao), Yifei Dong (Carnegie Mellon University), Mario Fritz (CISPA Helmholtz Center for Information Security), Margret Keuper (University of Mannheim)

Label Smoothing (LS) is commonly used to reduce overconfidence and improve generalization, but it can paradoxically increase confidence in misclassified samples and collapse feature representations. This work analytically decomposes the LS loss, revealing an error-amplification term that strengthens incorrect predictions and drives representation collapse. To overcome this, the authors propose Max Suppression (MaxSup), which regularizes predictions uniformly by penalizing the top-1 logit instead of the ground-truth logit. Experiments show that MaxSup preserves intra-class diversity, improves class separation, and consistently outperforms LS across large-scale classification and downstream tasks.
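The contrast between the two regularizers can be illustrated with a toy, dependency-free sketch (not the authors' code): label smoothing decomposes into cross-entropy plus a term pulling the ground-truth logit toward the mean logit, while MaxSup applies the same pull to the top-1 logit. Exact coefficients in the paper may differ.

```python
import math

def logsumexp(zs):
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

def ce_loss(logits, y):
    # Plain cross-entropy: -log softmax(logits)[y].
    return logsumexp(logits) - logits[y]

def label_smoothing_loss(logits, y, alpha=0.1):
    # LS decomposes as CE + alpha * (z_y - mean(z)): the regularizer acts on
    # the *ground-truth* logit, which amplifies error on misclassified samples.
    zbar = sum(logits) / len(logits)
    return ce_loss(logits, y) + alpha * (logits[y] - zbar)

def maxsup_loss(logits, y, alpha=0.1):
    # MaxSup swaps in the *top-1* logit, so the penalty is applied uniformly
    # whether or not the prediction is correct.
    zbar = sum(logits) / len(logits)
    return ce_loss(logits, y) + alpha * (max(logits) - zbar)

# On a correctly classified sample the two losses coincide;
# on a misclassified one, MaxSup penalizes the (wrong) top prediction harder.
correct = [3.0, 0.5, -1.0]   # top-1 == ground truth (class 0)
wrong   = [0.5, 3.0, -1.0]   # top-1 != ground truth (class 0)
assert abs(label_smoothing_loss(correct, 0) - maxsup_loss(correct, 0)) < 1e-9
assert maxsup_loss(wrong, 0) > label_smoothing_loss(wrong, 0)
```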

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Authors: Liwei Jiang (University of Washington), Yuanjun Chai (University of Washington), Margaret Li (University of Washington), Mickel Liu (University of Washington), Raymond Fok (University of Washington), Nouha Dziri (Allen Institute for AI), Yulia Tsvetkov (University of Washington), Maarten Sap (Carnegie Mellon University), Yejin Choi (Stanford University / NVIDIA)

This paper introduces INFINITY-CHAT, a large-scale dataset of 26,000 diverse open-ended user queries and a comprehensive taxonomy of prompt types to evaluate creativity and diversity in language model outputs. Using this resource, the authors identify a pronounced “Artificial Hivemind” effect marked by both repetitive responses within a single model and striking similarities across different models. The dataset also includes over 31,000 human annotations enabling analysis of collective and individual preferences. Results show that existing models and evaluation methods are poorly calibrated to idiosyncratic human judgments, highlighting risks of homogenized AI outputs.

Mean Flows for One-step Generative Modeling

Authors: Zhengyang Geng (CMU), Mingyang Deng (Massachusetts Institute of Technology), Xingjian Bai (Massachusetts Institute of Technology), Zico Kolter (Carnegie Mellon University), Kaiming He (MIT)

The authors introduce MeanFlow, a principled one-step generative modeling framework based on the concept of average velocity rather than the instantaneous velocity used in prior flow-matching methods. The authors derive a formal identity linking average and instantaneous velocities to guide neural network training in a self-contained approach requiring no pretraining, distillation, or curriculum learning. MeanFlow achieves strong results, including a 3.43 FID on ImageNet 256×256 with a single function evaluation, outperforming previous one-step models. These results substantially narrow the performance gap between one-step and multi-step diffusion and flow-based methods.
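The core relation can be stated compactly. Writing $v(z_t, t)$ for the instantaneous velocity, the average velocity over $[r, t]$ and the identity obtained by differentiating its defining integral are (a sketch of the relation described above; see the paper for the exact conditioning and loss):

```latex
u(z_t, r, t) \;=\; \frac{1}{t - r}\int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau
% Differentiating (t - r)\,u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau in t:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t)
```

The right-hand side gives a training target for a network predicting $u$, which is what allows one-step sampling without distillation.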

Spotlight Papers

OpenCUA: Open Foundations for Computer-Use Agents

Authors: Xinyuan Wang (University of Hong Kong), Bowen Wang (University of Hong Kong), Dunjie Lu (Sun Yat-sen University), Junlin Yang (Tsinghua University), Tianbao Xie (University of Hong Kong), Junli Wang (Alibaba Group), Jiaqi Deng (University of Hong Kong), Xiaole Guo (University of Hong Kong), Yiheng Xu (University of Hong Kong), Chen Wu (Carnegie Mellon University), Zhennan Shen (Shanghai Jiao Tong University), Zhuokai Li (University of Hong Kong), Ryan Li (Stanford University), Xiaochuan Li (Tsinghua University), Junda Chen (Harbin Institute of Technology), Boyuan Zheng (University of Hong Kong), Li Peihang (University of Hong Kong), Fangyu Lei (Institute of Automation, Chinese Academy of Sciences), Ruisheng Cao (Shanghai Jiao Tong University), Yeqiao Fu (University of Hong Kong), Dongchan Shin (University of Hong Kong), Martin Shin (University of Hong Kong), Hu Jiarui (University of Hong Kong), Yuyan Wang (Johns Hopkins University), Jixuan Chen (University of California, San Diego), Yuxiao Ye (The Hong Kong University of Science and Technology), Danyang Zhang (Shanghai Jiao Tong University), Yipu Wang (Institute of Automation, Chinese Academy of Sciences), Heng Wang (University of Illinois Urbana-Champaign), Diyi Yang (Stanford University), Victor Zhong (University of Waterloo), Y. Charles (Moonshot AI), Zhilin Yang (Tsinghua University), Tao Yu (University of Hong Kong)

This paper introduces OpenCUA, an open-source framework designed to enable transparent research into computer-use agents built with vision–language models. The framework includes an annotation system for collecting human demonstrations, AgentNet, a large-scale dataset spanning three operating systems and 200+ applications, and a scalable pipeline that converts demonstrations into state–action data with reflective chain-of-thought reasoning. End-to-end agent models trained with OpenCUA show strong benchmark performance, with OpenCUA-72B achieving a 45.0% success rate on OSWorld-Verified, setting a new open-source state of the art.

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

Authors: Jiatong Shi (Carnegie Mellon University), Yifan Cheng (Huazhong University of Science and Technology), Bo-Hao Su (Carnegie Mellon University), Hye-jin Shim (Carnegie Mellon University), Jinchuan Tian (Carnegie Mellon University), Samuele Cornell (Università Politecnica delle Marche), Yiwen Zhao (School of Computer Science, Carnegie Mellon University), Siddhant Arora (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University)

This work presents ARECHO, an autoregressive chain-based framework for jointly evaluating multiple speech quality metrics such as PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score), which traditionally differ in scale and assumptions. ARECHO introduces a comprehensive tokenization pipeline, a dynamic classifier chain to model inter-metric dependencies, and a confidence-oriented two-step decoding scheme to improve inference reliability. Experiments show that ARECHO consistently outperforms baseline methods across speech enhancement, generation evaluation, and noisy-speech scenarios. The approach also improves interpretability and flexibility by enabling reference-free evaluation and subset metric queries.

UMA: A Family of Universal Models for Atoms

Authors: Brandon Wood (FAIR at Meta), Misko Dzamba (Facebook), Xiang Fu (Periodic Labs), Meng Gao (Facebook), Muhammed Shuaibi (FAIR, Meta), Luis Barroso-Luque (Facebook), Kareem Abdelmaqsoud (Carnegie Mellon University), Vahe Gharakhanyan (Meta), John Kitchin (Carnegie Mellon University), Daniel Levine (Meta FAIR), Kyle Michel (Meta), Anuroop Sriram (Meta FAIR), Taco Cohen (Meta / FAIR), Abhishek Das (FAIR, Meta AI), Sushree Sahoo (Facebook), Ammar Rizvi (Meta), Zachary Ulissi (FAIR, Meta AI), Larry Zitnick (Fundamental AI Research at Meta AI)

This paper introduces Universal Models for Atoms (UMA), a family of large-scale models designed to rapidly and accurately predict properties from atomic simulations across chemistry and materials science. Trained on over 500 million unique 3D atomic structures spanning molecules, materials, and catalysts, UMA leverages empirical scaling laws and a novel mixture-of-linear-experts architecture to increase capacity without sacrificing speed. Evaluations show that a single UMA model, without fine-tuning, matches or outperforms specialized models across diverse applications.

A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

Authors: Arnav Kumar Jain (Université de Montréal), Vibhakar Mohta (Nuro Inc.), Subin Kim (Korea Advanced Institute of Science & Technology), Atiksh Bhardwaj (Cornell University), Juntao Ren (Stanford University), Yunhai Feng (Cornell University), Sanjiban Choudhury (Cornell University), Gokul Swamy (Carnegie Mellon University)

This work addresses a key limitation of behavioral cloning (BC) in imitation learning: BC only teaches an agent to mimic expert actions at states the expert visited, leaving it unable to recover from mistakes. To overcome this, the authors propose SAILOR, which leverages learning to search (L2S) by training a world model and a reward model to plan and recover toward expert outcomes even after errors. SAILOR achieves stable and sample-efficient learning without additional human corrections and consistently outperforms state-of-the-art diffusion-policy BC methods across visual manipulation benchmarks. It also demonstrates robustness to nuanced failures and reward hacking, and the performance gap persists even when BC is trained with 5–10x more demonstrations.

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Authors: Jiajun Shi (Beijing University of Aeronautics and Astronautics), Jian Yang (Alibaba Group), Jiaheng Liu (Nanjing University), Xingyuan Bu (Alibaba Group), Jiangjie Chen (ByteDance Seed), Junting Zhou (Peking University), Kaijing Ma (Tongji University), Zhoufutu Wen (ByteDance Inc.), Bingli Wang (Sichuan Agricultural University), Yancheng He (Alibaba Group), Liang Song (M-A-P), Hualei Zhu (Beijing University of Aeronautics and Astronautics), Shilong Li (Beijing University of Posts and Telecommunications), Xingjian Wang (Shanghai University of Electric Power), Wei Zhang (Beijing University of Aeronautics and Astronautics), Ruibin Yuan (Carnegie Mellon University), Yifan Yao (Beijing University of Posts and Telecommunications), Wenjun Yang (University College London), Yunli Wang (Kuaishou Technology), Siyuan Fang (Beijing University of Posts and Telecommunications), Siyu Yuan (Fudan University), Qianyu He (Fudan University), Robert Tang (Yale University), Yingshui Tan (Alibaba Group), Wangchunshu Zhou (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Zhao-Xiang Zhang (Chinese Academy of Sciences), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)

The authors introduce KORGym, a dynamic evaluation platform designed to comprehensively assess the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). Unlike existing domain-specific benchmarks, KORGym offers over 50 interactive games in textual and visual formats, including multi-turn and reinforcement learning scenarios. Experiments on 19 LLMs and 8 VLMs reveal consistent reasoning patterns within model families and highlight the superior performance of closed-source models. The platform also enables analysis of factors such as modality, reasoning strategies, reinforcement learning approaches, and response length, providing a robust tool for advancing reasoning evaluation in complex environments.

Towards Understanding Camera Motions in Any Video

Authors: Zhiqiu Lin (Carnegie Mellon University), Siyuan Cen (University of Massachusetts at Amherst), Daniel Jiang (Carnegie Mellon University), Jay Karhade (Carnegie Mellon University), Hewei Wang (Carnegie Mellon University), Chancharik Mitra (Carnegie Mellon University), Yu Tong Tiffany Ling (Carnegie Mellon University), Yuhan Huang (Carnegie Mellon University), Rushikesh Zawar (Carnegie Mellon University), Xue Bai (Adobe Systems), Yilun Du (Google DeepMind / Harvard), Chuang Gan (IBM), Deva Ramanan (Carnegie Mellon University)

This work presents CameraBench, a large-scale dataset and benchmark for evaluating camera motion understanding, comprising roughly 3,000 diverse videos annotated through a rigorous expert-driven process. A key contribution is a taxonomy of camera motion primitives, developed with cinematographers, which captures motions that require both geometric and semantic understanding. Human studies show that domain expertise and targeted training significantly improve motion recognition, such as distinguishing zoom from forward translation. Evaluations reveal that Structure-from-Motion models struggle with semantic motions, while generative video-language models struggle with geometric ones, and fine-tuning a generative VLM on CameraBench enables strong performance across motion-augmented captioning, video QA, and video-text retrieval tasks.

Enhancing Training Data Attribution with Representational Optimization

Authors: Weiwei Sun (Carnegie Mellon University), Haokun Liu (Department of Computer Science, University of Toronto), Nikhil Kandpal (Department of Computer Science), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Yiming Yang (CMU)

This paper presents AirRep, a scalable representation-based method for training data attribution (TDA) that learns task-specific, model-aligned representations optimized for measuring how training data affects model predictions. AirRep features a trainable encoder for attribution quality and an attention-based pooling mechanism to estimate group-wise influence accurately. Trained using a ranking objective over subsets labeled by their empirical effect, AirRep matches the performance of gradient-based methods like influence functions while being nearly 100× more efficient at inference.

Checklists Are Better Than Reward Models For Aligning Language Models

Authors: Vijay Viswanathan (Carnegie Mellon University), Yanchao Sun (University of Maryland, College Park), Xiang Kong (Apple), Meng Cao (Apple), Graham Neubig (Carnegie Mellon University), Sherry Wu (Carnegie Mellon University)

This work introduces Reinforcement Learning from Checklist Feedback (RLCF), a method for improving instruction-following in language models using flexible, instruction-specific criteria rather than fixed metrics like helpfulness or harmfulness. RLCF extracts checklists from instructions and evaluates responses against each item using AI judges and verifier programs to compute rewards for reinforcement learning. Applied to models like Qwen2.5-7B-Instruct, RLCF improves performance across five benchmarks, achieving notable gains in hard satisfaction rates and win rates, and can also enhance other models off-policy, such as Llama 3.1 8B Instruct and OLMo 2 7B Instruct. The authors release their WildChecklists dataset, models, and code to support further research in flexible instruction alignment.

Extrapolation by Association: Length Generalization Transfer In Transformers

Authors: Ziyang Cai (Princeton University), Nayoung Lee (University of Wisconsin-Madison), Avi Schwarzschild (Carnegie Mellon University), Samet Oymak (University of Michigan – Ann Arbor), Dimitris Papailiopoulos (University of Wisconsin-Madison)

This paper studies length generalization in transformer language models—the ability to handle longer inputs than seen during training—through the concept of task association. The authors show that training on a longer, related auxiliary task can improve generalization to longer inputs on a target task across algorithmic domains like arithmetic, string manipulation, and maze navigation. They find similar transfer effects in pretrained language models, suggesting pretraining provides reusable computational scaffolding. Mechanistic analysis indicates that this length generalization transfer is linked to the reuse of attention heads between tasks, highlighting how transformers leverage compositional inductive structures.

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

Authors: Xinyu Yang (CMU), Yuwei An (Carnegie Mellon University), Hongyi Liu (Carnegie Mellon University), Tianqi Chen (Carnegie Mellon University), Beidi Chen (CMU / Amazon)

This work introduces Multiverse, a generative model that enables natively parallel generation by internalizing a MapReduce paradigm with Map, Process, and Reduce stages. The approach includes Multiverse Curator for automated data creation, Multiverse Attention for separating parallel reasoning steps, and Multiverse Engine for dynamic sequential-parallel inference. After minimal fine-tuning, Multiverse-32B matches leading autoregressive LLMs in performance while achieving up to 2× speedup and better scaling efficiency. The authors have open-sourced the full Multiverse ecosystem, including models, data, serving systems, and training pipelines.
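The Map–Process–Reduce control flow can be sketched as a generic parallel-inference loop. The callable names below (`decompose`, `solve_branch`, `merge`) are illustrative stand-ins for structure the Multiverse model emits itself; this shows the paradigm, not the Multiverse Engine.

```python
from concurrent.futures import ThreadPoolExecutor

def solve(question, decompose, solve_branch, merge):
    """MapReduce-style inference sketch:
    Map: split the problem into independent branches;
    Process: solve branches concurrently;
    Reduce: merge the partial results into one answer."""
    branches = decompose(question)                         # Map
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(solve_branch, branches))  # Process (parallel)
    return merge(partials)                                 # Reduce

# Toy usage: "branches" are numbers, "solving" squares them, "merging" sums.
assert solve(None, lambda q: [1, 2, 3], lambda x: x * x, sum) == 14
```

The speedup in the paper comes from the Process stage running branches in parallel rather than decoding them as one long autoregressive sequence.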

Thought Communication in Multiagent Collaboration

Authors: Yujia Zheng (Carnegie Mellon University), Zhuokai Zhao (Meta), Zijian Li (Mohamed bin Zayed University of Artificial Intelligence), Yaqi Xie (CMU), Mingze Gao (Meta Inc.), Lizhu Zhang (Meta), Kun Zhang (CMU & MBZUAI)

This work introduces thought communication, a paradigm for multi-agent interaction that goes beyond natural language by enabling agents to share latent, mind-like representations directly. The authors formalize this process as a latent variable model, proving that both shared and private thoughts, as well as the global structure of thought sharing among agents, can be identified and recovered with theoretical guarantees. They develop a framework that extracts and distributes relevant latent thoughts to agents, enhancing collaboration across modalities. Experiments on synthetic and real-world benchmarks validate the approach, showing that thought communication can unlock collaborative advantages beyond what is possible with surface-level language-based exchanges.

Cost-aware LLM-based Online Dataset Annotation

Authors: Eray Can Elumar (Carnegie Mellon University), Cem Tekin (Bilkent University), Osman Yagan (Carnegie Mellon University)

This paper introduces CaMVo, a method for labeling datasets with large language models (LLMs) while keeping costs low. Instead of querying many LLMs for every example, CaMVo adaptively chooses only a few models based on how confident they are likely to be. It uses ideas from contextual bandits (LinUCB) and a Bayesian confidence estimator to decide which models to query and how to weight their votes—without needing any ground-truth labels. Experiments on MMLU and IMDB show that CaMVo matches or beats full majority voting but with far fewer LLM calls, making it a practical approach for efficient large-scale annotation.
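CaMVo's selection rule builds on LinUCB. Below is a generic, dependency-free LinUCB sketch (the standard algorithm, not the authors' implementation), where each arm would correspond to one candidate LLM annotator:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

class LinUCBArm:
    """One bandit arm (e.g. one candidate LLM) with d-dimensional contexts."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        # Store A^{-1} directly (A starts as the ridge identity); b starts at 0.
        self.Ainv = [[float(i == j) for j in range(d)] for i in range(d)]
        self.b = [0.0] * d

    def ucb(self, x):
        # Upper confidence bound: x^T theta + alpha * sqrt(x^T A^{-1} x).
        theta = matvec(self.Ainv, self.b)
        return dot(theta, x) + self.alpha * math.sqrt(max(dot(x, matvec(self.Ainv, x)), 0.0))

    def update(self, x, reward):
        # Sherman-Morrison rank-1 update keeps A^{-1} exact without re-inverting.
        Ax = matvec(self.Ainv, x)
        denom = 1.0 + dot(x, Ax)
        d = len(x)
        self.Ainv = [[self.Ainv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)]
                     for i in range(d)]
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

In CaMVo the contexts would encode instance features and rewards would reflect agreement with the aggregated label; those specifics (and the Bayesian confidence estimator) are the paper's, not shown here.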

Conformal Mixed-Integer Constraint Learning with Feasibility Guarantees

Authors: Daniel Ovalle (Carnegie Mellon University), Lorenz Biegler (Carnegie Mellon University), Ignacio Grossmann (Carnegie Mellon University), Carl Laird (Carnegie Mellon University), Mateo Dulce Rubio (Carnegie Mellon University)

The authors introduce C-MICL, a framework for learning constraints in optimization problems while guaranteeing that the resulting solutions remain feasible with high probability. Traditional learned constraints can fail due to model error or limited data, but C-MICL uses conformal prediction to add uncertainty-aware adjustments that ensure feasibility at a user-specified confidence level. The method works for both regression- and classification-based constraint learning and avoids the heavy computational overhead of ensemble approaches. Experiments show that C-MICL reliably meets feasibility targets, preserves strong optimization performance, and is significantly more efficient, offering a principled way to blend machine learning with safe decision-making.
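The uncertainty-aware adjustment follows the standard split-conformal recipe: compute a calibration quantile of the residuals between the true constraint value and the model's prediction, then tighten the learned constraint by that margin. A generic sketch under that assumption (the paper's exact scores and construction may differ):

```python
import math

def conformal_margin(residuals, alpha=0.05):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration residual upper-bounds a fresh residual with probability
    >= 1 - alpha (assuming exchangeability). Residuals here are
    g_true(x) - g_hat(x) on held-out calibration points."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    return float("inf") if k > n else sorted(residuals)[k - 1]

def tighten(g_hat, q):
    # The learned constraint g_hat(x) <= 0 becomes g_hat(x) + q <= 0, so the
    # *true* constraint holds at the target confidence level.
    return lambda x: g_hat(x) + q <= 0.0
```

The tightened predicate is what would enter the mixed-integer program, which is how feasibility guarantees carry over to the optimizer's solutions.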

SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Authors: Gabriele Oliaro (Carnegie Mellon University), Zhihao Jia (School of Computer Science, Carnegie Mellon University), Daniel Campos (Zipf AI), Aurick Qiao (Snowflake)

The authors present SuffixDecoding, a new speculative decoding method tailored for emerging AI workloads like LLM-based agents, which generate long, repetitive, and predictable sequences. Unlike existing speculative decoding approaches designed for diverse, independent requests, SuffixDecoding uses suffix trees to efficiently cache and reuse long stretches of past tokens from prompts and model outputs. It adaptively adjusts how many tokens to speculate—expanding aggressively when predictions are likely to be accepted and backing off when uncertainty is higher. Experiments on agent-style tasks such as SWE-Bench and Text-to-SQL show that SuffixDecoding can deliver up to 3.9× speedups, making it well suited for fast, iterative agentic inference.
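The reuse idea can be shown with a brute-force stand-in for the suffix tree: find where the longest suffix of the tokens generated so far previously occurred, and propose whatever followed it as draft tokens. Same matching semantics as described above, none of the suffix tree's efficiency; function and parameter names are illustrative, not the paper's API.

```python
def speculate(history, generated, max_draft=8, min_match=2):
    """Propose draft tokens by matching the longest suffix of `generated`
    against previously seen sequences and copying the continuation."""
    best_cont = []
    best_len = min_match - 1
    for doc in history:
        for i in range(len(doc)):
            # Length of the longest suffix of `generated` ending at doc[i].
            m = 0
            while (m < i + 1 and m < len(generated)
                   and doc[i - m] == generated[len(generated) - 1 - m]):
                m += 1
            if m > best_len:
                best_len = m
                best_cont = doc[i + 1 : i + 1 + max_draft]
    return best_cont

# After seeing "a b c d e", the suffix "b c" predicts the draft "d e".
assert speculate([["a", "b", "c", "d", "e"]], ["x", "b", "c"]) == ["d", "e"]
```

The adaptive behavior in the paper corresponds to growing `max_draft` when drafts keep being accepted and shrinking it otherwise.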

Horizon Reduction Makes RL Scalable

Authors: Seohong Park (UC Berkeley), Kevin Frans (UC Berkeley), Deepinder Mann (UC Berkeley), Benjamin Eysenbach (Princeton), Aviral Kumar (Carnegie Mellon University), Sergey Levine (UC Berkeley)

This paper examines why offline reinforcement learning (RL) often fails to scale, even when given massive datasets, large models, and ample compute. The authors find that long decision horizons—the number of steps required to propagate rewards—are a key bottleneck that prevents standard offline RL algorithms from improving with more data. Through extensive experiments, they show that reducing the effective horizon dramatically improves scalability and performance on challenging tasks. Building on this insight, they introduce SHARSA, a simple horizon-reduction method that achieves the strongest scaling behavior and best asymptotic performance across their benchmarks.

To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL

Authors: Yuda Song (Carnegie Mellon University), Dhruv Rohatgi (Massachusetts Institute of Technology), Aarti Singh (CMU), J. Bagnell (Carnegie Mellon University)

This paper studies when it’s better to distill privileged expert policies—which have access to latent state information during training—versus directly learning from partial observations in reinforcement learning. Using a simple theoretical model (the perturbed Block MDP) and controlled locomotion experiments, the authors show that the trade-off depends strongly on how stochastic the underlying latent dynamics are. When the latent state is easy to infer, distillation works well, but when it is highly stochastic, imitating the latent optimal policy can actually hurt performance. The results provide practical guidance: the best latent policy isn’t always the best one to distill, and deciding when to distill versus directly learning depends on the underlying uncertainty structure of the task.

A Principled Approach to Randomized Selection under Uncertainty: Applications to Peer Review and Grant Funding

Authors: Alexander Goldberg (School of Computer Science, Carnegie Mellon University), Giulia Fanti (CMU), Nihar Shah (CMU)

MERIT is a principled framework for using randomized selection in settings like peer review or grant funding, where evaluations are noisy and uncertainty can make deterministic rankings unreliable. Instead of relying on ad-hoc randomization, MERIT uses interval estimates (e.g., confidence intervals) to model uncertainty and then optimizes for the worst-case expected number of true top-k items selected. The authors develop a polynomial-time algorithm that scales to large datasets and show that MERIT satisfies desirable fairness and robustness properties that existing methods lack. Experiments on synthetic peer-review data show that MERIT matches prior probabilistic methods in expected performance while providing stronger guarantees in worst-case scenarios.

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Authors: Thomas Kuntz (EPFL), Agatha Duzan (EPFL), Hao Zhao (EPFL), Francesco Croce (University of Tübingen), Zico Kolter (Carnegie Mellon University), Nicolas Flammarion (EPFL), Maksym Andriushchenko (ELLIS Institute Tübingen and MPI-IS)

OS-Harm is a benchmark for evaluating the safety of LLM-based computer use agents that interact directly with operating system interfaces. OS-Harm tests agents across three harm categories—deliberate misuse, prompt injection attacks, and model misbehavior—using 150 tasks spanning applications like email, browsers, and code editors. An automated judge evaluates both task performance and safety, achieving strong agreement with human annotations. Evaluations of leading agents reveal that models often comply with unsafe commands, are vulnerable to prompt injections, and sometimes take unsafe actions, highlighting the need for robust safety measures in these systems.

Can We Infer Confidential Properties of Training Data from LLMs?

Authors: Pengrun Huang (University of California, San Diego), Chhavi Yadav (CMU), Kamalika Chaudhuri (FAIR, Meta and UCSD), Ruihan Wu (University of California, San Diego)

PropInfer is a benchmark designed to evaluate whether large language models (LLMs) can leak sensitive properties of the datasets used for fine-tuning, particularly in domains like healthcare. It tests property inference under both question-answering and chat-completion setups. Two tailored attacks—a prompt-based generation attack and a shadow-model attack leveraging word frequency—are proposed to extract dataset-level information. Empirical results show that these attacks can succeed across multiple pretrained LLMs, revealing an important and previously underexplored privacy risk.

Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

Authors: Hyeong Kyu Choi (University of Wisconsin-Madison, Computer Sciences), Jerry Zhu (Carnegie Mellon University), Sharon Li (University of Wisconsin-Madison)

Multi-Agent Debate (MAD) improves large language model performance by having multiple agents reason collaboratively, but its key drivers were unclear. By separating Majority Voting from inter-agent debate, experiments across seven NLP benchmarks show that most gains come from majority voting rather than the debate itself. A theoretical analysis models debate as a stochastic process, revealing that debate alone doesn’t improve expected correctness, though targeted interventions that bias belief updates can enhance its impact. These results suggest that while MAD has potential, simple ensembling methods often remain a more reliable and effective approach.
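The ensembling effect the paper isolates can be illustrated with the classic jury calculation: independent agents that are each right with probability p > 0.5 become far more reliable under majority vote alone, with no debate. This is a toy independence model for intuition, not the paper's stochastic-process analysis.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent agents, each correct
    with probability p, answers a binary question correctly
    (ties on even n broken uniformly at random)."""
    win = sum(comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        win += 0.5 * comb(n, n // 2) * p**(n // 2) * (1 - p)**(n // 2)
    return win

# Five 60%-accurate voters already reach ~68% as an ensemble.
assert abs(majority_vote_accuracy(5, 0.6) - 0.68256) < 1e-9
```

Under independence the ensemble accuracy rises monotonically with n, which matches the finding that most of MAD's gains are attributable to voting rather than debate.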

The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games

Authors: Ioannis Anagnostides (Carnegie Mellon University), Ioannis Panageas (UC Irvine), Tuomas Sandholm (CMU, Strategy Robot, Optimized Markets, Strategic Machine), Jingming Yan (University of California, Irvine)

The study analyzes the complexity of computing equilibria in team-based zero-sum games and symmetric min-max optimization. It shows that finding ε-Nash equilibria in 3-player adversarial team games (2 vs. 1) is CLS-complete, resolving an open question about such games. Additionally, computing symmetric equilibria in symmetric min-max problems is PPAD-complete, even for quadratic objectives, and this extends to 6-player team games (3 vs. 3), implying that common symmetric dynamics cannot reliably converge. Finally, computing non-symmetric equilibria with polynomial precision is FNP-hard, highlighting the fundamental difficulty of equilibrium computation in these settings.

Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Authors: Emile Anand (Georgia Institute of Technology and Cognition Labs), Ishani Karmarkar (Stanford University), Guannan Qu (Carnegie Mellon University)

Scaling multi-agent reinforcement learning (MARL) is difficult due to the exponential growth of joint state and action spaces as agents increase. SUBSAMPLE-MFQ introduces a method that combines subsampling agents with mean-field Q-learning and a decentralized randomized policy, allowing efficient learning for any subset of k agents. The algorithm’s runtime scales polynomially in k, not the total number of agents n, making it practical for large systems. Theoretical guarantees show that the learned policy converges to the optimal policy at a rate of roughly O(1/√k), independent of the total agent count.

On the Hardness of Conditional Independence Testing In Practice

Authors: Zheng He (University of British Columbia), Roman Pogodin (Google), Yazhe Li (Microsoft), Namrata Deka (Carnegie Mellon University), Arthur Gretton (Google Deepmind / UCL), Danica J. Sutherland (University of British Columbia + Amii)

Conditional independence (CI) tests are central to tasks like causal discovery and fairness evaluation, but they often fail in practice despite theoretical guarantees. Focusing on the Kernel-based Conditional Independence (KCI) test, the work shows that many recent CI tests are special cases of a Generalized Covariance Measure. Practical performance is largely driven by errors in estimating the conditional mean, which affect Type I error, and by the choice of conditioning kernel, which influences test power but can also inflate false positives. These insights clarify why popular CI tests often underperform and highlight how careful kernel and estimation choices are crucial for reliable results.

Projection-based Lyapunov method for fully heterogeneous weakly-coupled MDPs

Authors: Xiangcheng Zhang (Tsinghua University), Yige Hong (Carnegie Mellon University), Weina Wang (Computer Science Department, Carnegie Mellon University)

Heterogeneity creates major challenges in large-scale decision-making, especially in weakly-coupled Markov decision processes (WCMDPs) where each subproblem has distinct dynamics. In the fully heterogeneous setting, the authors show that an efficiently computable policy can achieve an O(1/√N) optimality gap in long-run average reward per subproblem as the number of subproblems N grows. This work provides the first asymptotic optimality guarantee for fully heterogeneous average-reward WCMDPs. Key to this result is a novel use of projection-based Lyapunov functions that ensure convergence of rewards and costs even under complete heterogeneity.

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Authors: Hyungjoo Chae (Georgia Institute of Technology), Seonghwan Kim (Yonsei University), Junhee Cho (Yonsei University), Seungone Kim (Carnegie Mellon University), Seungjun Moon (Yonsei University), Gyeom Hwangbo (University of Seoul), Dongha Lim (Korea Advanced Institute of Science & Technology), Minjin Kim (Yonsei University), Yeonjun Hwang (Yonsei University), Minju Gwak (Yonsei University), Dongwook Choi (Chung-Ang University), Minseok Kang (Yonsei University), Gwanhoon Im (Yonsei University), ByeongUng Cho (Yonsei University), Hyojun Kim (Yonsei University), Jun Han (Yonsei University), Taeyoon Kwon (Yonsei University), Minju Kim (Yonsei University), Beong-woo Kwak (Yonsei University), Dongjin Kang (Yonsei University), Jinyoung Yeo (Yonsei University)

Web navigation poses a long-horizon sequential decision-making challenge that goes beyond typical multimodal LLM tasks, but step-level reward models have been lacking. Web-Shepherd, the first process reward model (PRM) for web navigation, evaluates trajectories at each step, enabling both training and test-time assessment. The approach is supported by the WebPRM Collection, a 40K step-level dataset with annotated preference pairs, and WebRewardBench, a benchmark for evaluating PRMs. Experiments show Web-Shepherd outperforms GPT-4o by ~30 points on WebRewardBench and improves policy performance on WebArena-lite by 10.9 points while reducing verification cost by 10×, demonstrating a practical and efficient solution for web navigation tasks.
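
Step-level verification of this kind can be sketched generically. In the snippet below, `prm_score` is a hypothetical stand-in for a process reward model such as Web-Shepherd: rather than judging only final outcomes, the agent reranks candidate next actions at every step of the trajectory.

```python
from typing import Callable, Sequence

def rerank_step(history: Sequence[str],
                candidates: Sequence[str],
                prm_score: Callable[[Sequence[str], str], float]) -> str:
    """Pick the candidate next action that the process reward model
    scores highest, given the trajectory so far."""
    return max(candidates, key=lambda action: prm_score(history, action))
```

This is the kind of per-step reranking the paper evaluates both for training web agents and for test-time assessment, with a small PRM standing in for an expensive judge model.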

Fair Cooperation in Mixed-Motive Games via Conflict-Aware Gradient Adjustment

Authors: Woojun Kim (Carnegie Mellon University), Katia Sycara (Carnegie Mellon University)

Mixed-motive multi-agent reinforcement learning requires balancing individual incentives with collective goals, which are often in conflict. The proposed adaptive conflict-aware gradient adjustment method dynamically balances policy gradients from individual and collective objectives, promoting cooperation while preserving fairness in task-specific rewards. Theoretical analysis guarantees monotonic improvement in both collective and individual outcomes, ensuring fairness across agents. Experiments in sequential social dilemma environments show that this approach outperforms baselines in social welfare while maintaining equitable outcomes for all agents.
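
The core idea of conflict-aware adjustment can be sketched with a PCGrad-style projection (a simplification; the paper's adjustment is adaptive rather than this fixed rule): when the individual and collective gradients point in conflicting directions, each is projected onto the normal plane of the other before summing, so neither objective is pushed backwards.

```python
import numpy as np

def adjust_gradients(g_ind, g_col):
    """Combine individual and collective policy gradients.

    Simplified sketch: if the two gradients conflict (negative inner
    product), remove the conflicting component of each before summing,
    so the update does not degrade either objective.
    """
    g_ind = np.asarray(g_ind, dtype=float)
    g_col = np.asarray(g_col, dtype=float)
    if g_ind @ g_col >= 0:                  # no conflict: plain sum
        return g_ind + g_col
    # project each gradient onto the normal plane of the other
    gi = g_ind - (g_ind @ g_col) / (g_col @ g_col) * g_col
    gc = g_col - (g_col @ g_ind) / (g_ind @ g_ind) * g_ind
    return gi + gc
```

The adjusted update has a non-negative inner product with both original gradients, which is the property behind the paper's monotonic-improvement guarantee.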

Poster Papers

Applications

VLMLight: Safety-Critical Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning Architecture

Authors: Maonan Wang (The Chinese University of Hong Kong), Yirong Chen (Shanghai Artificial Intelligence Laboratory), Aoyu Pang (The Chinese University of Hong Kong), Yuxin Cai (Carnegie Mellon University), Chung Shue Chen (Nokia Bell Labs), Yuheng Kan (Embodied AI Research Center, Fourier), Man On Pun (The Chinese University of Hong Kong, Shenzhen)

Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders

Authors: Gongxu Luo (Mohamed bin Zayed University of Artificial Intelligence), Haoyue Dai (Carnegie Mellon University), Longkang Li (MBZUAI), Chengqian Gao (Mohamed bin Zayed University of Artificial Intelligence), Boyang Sun (Mohamed bin Zayed University of Artificial Intelligence), Kun Zhang (CMU & MBZUAI)

Model-Based Policy Adaptation for Closed-Loop End-to-end Autonomous Driving

Authors: Haohong Lin (CMU), Yunzhi Zhang (Stanford University), Wenhao Ding (NVIDIA), Jiajun Wu (Stanford University), Ding Zhao (Carnegie Mellon University)

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Authors: Xianzhe Fan (The University of Hong Kong), Xuhui Zhou (CMU, Carnegie Mellon University), Chuanyang Jin (New York University), Kolby Nottingham (AI Dungeon / Voyage), Hao Zhu (Stanford University), Maarten Sap (Carnegie Mellon University)

MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Authors: Haoyang Fang (AWS), Boran Han (AWS), Nick Erickson (Amazon Web Services), Xiyuan Zhang (AWS AI), Su Zhou (Carnegie Mellon University), Anirudh Dagar (AWS), Jiani Zhang (Google), Caner Turkmen (Amazon Web Services), Tony Hu (AWS AI), Huzefa Rangwala (George Mason University), Ying Nian Wu (University of California, Los Angeles), Yuyang (Bernie) Wang (AWS AI), George Karypis (University of Minnesota, Minneapolis)

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Authors: Tonghe Zhang (Carnegie Mellon University), Chao Yu (Tsinghua University, Tsinghua University), Sichang Su (The University of Texas at Austin), Yu Wang (Tsinghua University)

Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex

Authors: Muquan Yu (Chinese University of Hong Kong), Mu Nan (University of Hong Kong), Hossein Adeli (Columbia University), Jacob Prince (Harvard University), John A. Pyles (University of Washington), Leila Wehbe (Carnegie Mellon University), Maggie Henderson (Carnegie Mellon University), Michael Tarr (Carnegie Mellon University), Andrew Luo (University of Hong Kong)

Probabilistic Reasoning with LLMs for Privacy Risk Estimation

Authors: Jonathan Zheng (Georgia Institute of Technology), Alan Ritter (Georgia Institute of Technology), Sauvik Das (Carnegie Mellon University), Wei “Coco” Xu (Georgia Institute of Technology)

Geometry Aware Operator Transformer as an efficient and accurate neural surrogate for PDEs on arbitrary domains

Authors: Shizheng Wen (ETHZ – ETH Zurich), Arsh Kumbhat, Levi Lingsch (ETH Zurich), Sepehr Mousavi (ETHZ – ETH Zurich), Yizhou Zhao (Carnegie Mellon University), Praveen Chandrashekar (Tata Institute of Fundamental Research), Siddhartha Mishra (Swiss Federal Institute of Technology)

Topology-Aware Conformal Prediction for Stream Networks

Authors: Jifan Zhang (Northwestern University), Fangxin Wang (University of Illinois at Chicago), Zihe Song (University of Illinois at Chicago), Philip S Yu (UIC), Kaize Ding (Northwestern University), Shixiang Zhu (Carnegie Mellon University)

On Evaluating LLM Alignment by Evaluating LLMs as Judges

Authors: Yixin Liu (Yale University), Pengfei Liu (Carnegie Mellon University), Arman Cohan (Yale University)

OMiSO: Adaptive optimization of state-dependent brain stimulation to shape neural population states

Authors: Yuki Minai (CMU, Carnegie Mellon University), Joana Soldado-Magraner (Carnegie Mellon University), Byron M Yu (Carnegie Mellon University), Matthew Smith (Carnegie Mellon University)

Building 3D Representations and Generating Motions From a Single Image via Video-Generation

Authors: Weiming Zhi (Vanderbilt University), Ziyong Ma (Carnegie Mellon University), Tianyi Zhang (Carnegie Mellon University), Matthew Johnson-Roberson

ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions

Authors: Yue Huang (University of Notre Dame), Zhengzhe Jiang (Sichuan University), Xiaonan Luo (University of Notre Dame), Kehan Guo (University of Notre Dame), Haomin Zhuang (University of Notre Dame), Yujun Zhou (University of Notre Dame), Zhengqing Yuan (University of Notre Dame), Xiaoqi Sun (Massachusetts Institute of Technology), Jules Schleinitz (California Institute of Technology), Yanbo Wang (Mohamed bin Zayed University of Artificial Intelligence), Shuhao Zhang (Carnegie Mellon University), Mihir Surve (University of Notre Dame), Nitesh Chawla (University of Notre Dame), Olaf Wiest (University of Notre Dame), Xiangliang Zhang (University of Notre Dame)

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Authors: Yang Xiao (Hong Kong Polytechnic University), Jiashuo Wang (Hong Kong Polytechnic University), Ruifeng Yuan (Hong Kong Polytechnic University), Chunpu Xu (Hong Kong Polytechnic University), Kaishuai Xu (Hong Kong Polytechnic University), Wenjie Li (The Hong Kong Polytechnic University), Pengfei Liu (Carnegie Mellon University)

Hamiltonian Neural PDE Solvers through Functional Approximation

Authors: Anthony Zhou (Carnegie Mellon University), Amir Barati Farimani (Carnegie Mellon University)

Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization

Authors: Jiaqi Wei (Zhejiang University), Hao Zhou (South China University of Technology), Xiang Zhang (University of British Columbia), Di Zhang (Shanghai Artificial Intelligence Laboratory), Zijie Qiu (Fudan University), Noah Wei (Carnegie Mellon University), Jinzhe Li (Fudan University), Wanli Ouyang (Shanghai AI Lab), Siqi Sun (Fudan University)

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Authors: Ziyang Ma (Shanghai Jiao Tong University), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Yanqiao Zhu (Shanghai Jiaotong University), Chen Yang (Shanghai Jiaotong University), Yi-Wen Chao (Nanyang Technological University), Ruiyang Xu (Shanghai Jiaotong University), Wenxi Chen (Shanghai Jiaotong University), Yuanzhe Chen (ByteDance Inc.), Zhuo Chen (ByteDance Inc.), Jian Cong (ByteDance Inc.), Kai Li (Tsinghua University, Tsinghua University), Keliang Li (, Chinese Academy of Sciences), Siyou Li (Queen Mary University of London), Xinfeng Li (Nanyang Technological University), Xiquan Li (Shanghai Jiaotong University), Zheng Lian (Institute of automation, Chinese academy of science, Chinese Academy of Sciences), Yuzhe Liang (Shanghai Jiaotong University), Minghao Liu (2077AI), Zhikang Niu (Shanghai Jiaotong University), Tianrui Wang (Tianjin University), Wang Yuping (University of Science and Technology of China), Yuxuan Wang (ByteDance), Yihao Wu (Nanyang Technological University), Guanrou Yang (Shanghai Jiaotong University), Jianwei Yu (Microsoft), Ruibin Yuan (Carnegie Mellon University), Zhisheng Zheng (University of Texas at Austin), Ziya Zhou (Hong Kong University of Science and Technology), Haina Zhu (Shanghai Jiaotong University), Wei Xue (Hong Kong University of Science and Technology), Emmanouil Benetos (Queen Mary University of London), Kai Yu (Shanghai Jiao Tong University), Eng-Siong Chng (Nanyang Technological University), Xie Chen (Shanghai Jiaotong University)

Intrinsic Goals for Autonomous Agents: Model-Based Exploration in Virtual Zebrafish Predicts Ethological Behavior and Whole-Brain Dynamics

Authors: Reece Keller (School of Computer Science, Carnegie Mellon University), Alyn Kirsch (Carnegie Mellon University), Felix Pei (Carnegie Mellon University), Xaq Pitkow (Carnegie Mellon University), Leo Kozachkov (Brown University), Aran Nayebi (School of Computer Science, Carnegie Mellon University)

Mellow: a small audio language model for reasoning

Authors: Soham Deshmukh, Satvik Dixit (Carnegie Mellon University), Rita Singh (Carnegie Mellon University), Bhiksha Raj (Carnegie Mellon University)

A Generalist Intracortical Motor Decoder

Authors: Joel Ye (Carnegie Mellon University), Fabio Rizzoglio (Northwestern University), Xuan Ma (Northwestern University), Adam Smoulder (Carnegie Mellon University), Hongwei Mao (University of Pittsburgh), Gary Blumenthal (University of Pittsburgh), William Hockeimer (University of Pittsburgh), Nicolas Kunigk (University of Pittsburgh), Dalton Moore (University of Chicago), Patrick Marino (Phantom Neuro), Raeed Chowdhury, J. Patrick Mayo (University of Pittsburgh), Aaron Batista (University of Pittsburgh), Steven Chase, Michael Boninger (University of Pittsburgh), Charles Greenspon (University of Chicago), Andrew B Schwartz (University of Pittsburgh), Nicholas Hatsopoulos (University of Chicago), Lee Miller (Northwestern University at Chicago), Kristofer Bouchard (Lawrence Berkeley National Laboratory), Jennifer Collinger (University of Pittsburgh), Leila Wehbe (Carnegie Mellon University), Robert Gaunt (University of Pittsburgh)

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Authors: Rui Pan (Princeton University), Yinwei Dai (Princeton University), Zhihao Zhang (Carnegie Mellon University), Gabriele Oliaro (Carnegie Mellon University), Zhihao Jia (School of Computer Science, Carnegie Mellon University), Ravi Netravali (Department of Computer Science, Princeton University)

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Authors: Chandler Smith (Oxford University), Marwa Abdulhai (University of California, Berkeley), Manfred Díaz (Mila, Quebec), Marko Tesic (University of Cambridge), Rakshit Trivedi (Massachusetts Institute of Technology), Sasha Vezhnevets (DeepMind), Lewis Hammond (University of Oxford / Cooperative AI Foundation), Jesse Clifton (Center on Long-Term Risk), Minsuk Chang (Google Deepmind), Edgar Duenez-Guzman (Google DeepMind), John Agapiou (Google DeepMind), Jayd Matyas (DeepMind), Danny Karmon (Google DeepMind), Beining Zhang (University of Southampton ), Jim Dilkes (University of Southampton), Akash Kundu (Heritage Institute of Technology), Hieu Minh Nguyen (Apart Research), Emanuel Tewolde (Carnegie Mellon University), Jebish Purbey (Tribhuvan University), Ram Mohan Rao Kadiyala (), Siddhant Gupta (Indian Institute of Technology, Roorkee), Aliaksei Korshuk (Coframe), Buyantuev Alexander (Higher School of Economics), Ilya Makarov (AIRI & ISP RAS), Gang Zhao (Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University), Rolando Fernandez (University of Texas at Austin), Zhihan Wang (University of Texas at Austin), Caroline Wang (The University of Texas at Austin | Google DeepMind), Jiaxun Cui (Meta), Lingyun Xiao (University of Texas at Austin), Di Shi (University of Texas at Austin), Yoonchang Sung (Nanyang Technological University), Muhammad Arrasy Rahman (The University of Texas at Austin), Peter Stone (The University of Texas at Austin, Sony AI), Yipeng Kang (National Key Laboratory of General Artificial Intelligence), Hyeonggeun Yun (Companoid Labs), Ananya Ananya (Stanford University), Taehun Cha (Korea University), Zhiqiang Wu (Tongji University), Elizaveta Tennant (University College London), Olivia Macmillan-Scott (UCL), Marta Segura (University College London, University of London), Diana Riazi (Department of Computer Science, University College London, University of London), Fuyang Cui (University of Toronto), Sriram Ganapathi 
(University of Waterloo), Toryn Klassen (University of Toronto), Nico Schiavone (University of Toronto), Mogtaba Alim (University of Toronto), Sheila McIlraith (University of Toronto and Vector Institute), Manuel Rios (Universidad de los Andes), Oswaldo Peña (Universidad Nacional de Colombia), Carlos Rojas (Grupo Bancolombia), Manuela Chacon-Chamorro (Universidad de los Andes), Rubén Manrique (Universidad de Los Andes), Luis Felipe Giraldo (Universidad de Los Andes), Nicanor Quijano (Universidad de Los Andes), Yiding Wang (Peking University), Yuxuan Chen (the University of Hong Kong, University of Hong Kong), Fangwei Zhong (Beijing Normal University), Mengmeng Wang (State Key Laboratory of General Artificial Intelligence), Wenming Tu (Shanghai Jiaotong University), Zhaowei Zhang (Peking University), Ziang Chen (Tsinghua University, Tsinghua University), Zixia Jia (BigAI), Xue Feng (BIGAI), Zilong Zheng (Beijing Institute for General Artificial Intelligence), Chichen Lin (), Weijian Fan (Communication University of China), Chenao Liu (Communication University of China), Sneheel Sarangi (New York University Abu Dhabi), Ziyan Wang (King’s College London; Microsoft Research), shuqing shi (Kings College London), Yali Du (King‘s College London), Avinaash Anand Kulandaivel (None), Yang Liu (BIGAI), Wu Ruiyang (Communication University of China), Chetan Talele (None), 陆孙嘉 (Communication University of China), Gema Parreno (–), Shamika Dhuri (Carnegie Mellon University), Bain McHale (CMU, Carnegie Mellon University), Tim Baarslag (Centrum Wiskunde & Informatica / Eindhoven University of Technology), Dylan Hadfield-Menell (MIT), Natasha Jaques (University of Washington, Google DeepMind), José Hernández-Orallo (Universitat Politècnica de València), Joel Leibo (DeepMind)

Computer Vision

PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Authors: Yuchen Lin (Peking University), Chenguo Lin (Peking University), Panwang Pan (ByteDance), Honglei Yan (ByteDance Inc.), Feng Yiqiang (ByteDance Inc.), Yadong Mu (Peking University), Katerina Fragkiadaki (Carnegie Mellon University)

Grounded Reinforcement Learning for Visual Reasoning

Authors: Gabriel Sarch (Princeton University), Snigdha Saha (Google), Naitik Khandelwal (Carnegie Mellon University), Ayush Jain (CMU, Carnegie Mellon University), Michael Tarr (Carnegie Mellon University), Aviral Kumar (Carnegie Mellon University), Katerina Fragkiadaki (Carnegie Mellon University)

GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Authors: Chun Wang (Zhejiang University), Xiaojun Ye (Zhejiang University), Xiaoran Pan (Zhejiang University), Zihao Pan, Haofan Wang (Carnegie Mellon University), Yiren Song (National University of Singapore)

COS3D: Collaborative Open-Vocabulary 3D Segmentation

Authors: Runsong Zhu (The Chinese University of Hong Kong), Ka-Hei Hui (Autodesk), Zhengzhe Liu (Carnegie Mellon University), Qianyi Wu (Monash University), Weiliang Tang (The Chinese University of Hong Kong), Shi Qiu (The Chinese University of Hong Kong), Pheng-Ann Heng (The Chinese University of Hong Kong), Chi-Wing Fu (The Chinese University of Hong Kong)

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Authors: Matvei Popov (Trinity University), Peter Robicheaux (Roboflow), Anish Madan (Carnegie Mellon University), Isaac Robinson (Roboflow), Joseph Nelson (Roboflow), Deva Ramanan (Carnegie Mellon University), Neehar Peri (Carnegie Mellon University)

Instant4D: 4D Gaussian Splatting in Minutes

Authors: Zhanpeng Luo (University of Pittsburgh), Haoxi Ran (Carnegie Mellon University), Li Lu (Sichuan University)

Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding

Authors: Yixiong Fang (Carnegie Mellon University), Ziran Yang (Princeton University), Zhaorun Chen (University of Chicago), Zhuokai Zhao (Meta), Jiawei Zhou (Stony Brook University)

FreeInv: Free Lunch for Improving DDIM Inversion

Authors: Yuxiang Bao (Alibaba Group), Huijie Liu (Beihang University), Xun Gao (Alibaba Group), Huan Fu (Futurise), Guoliang Kang (Carnegie Mellon University)

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Authors: Philip Schroeder (Massachusetts Institute of Technology), Ondrej Biza (Robotics and AI Institute), Thomas Weng (Carnegie Mellon University), Hongyin Luo (Massachusetts Institute of Technology), Jim Glass (Massachusetts Institute of Technology)

RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion

Authors: Bardienus Duisterhof (Carnegie Mellon University), Jan Oberst (CMU, Carnegie Mellon University), Bowen Wen (NVIDIA), Stan Birchfield (NVIDIA), Deva Ramanan (Carnegie Mellon University), Jeffrey Ichnowski (Carnegie Mellon University)

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Authors: Hua Ye (Nanjing University), Hang Ding (Shanghai Jiao Tong University), Siyuan Chen (University of Bristol), Yiyang Jiang (Hong Kong Polytechnic University), Changyuan Zhang (University of Hong Kong), Xuan Zhang (Carnegie Mellon University)

Towards Self-Refinement of Vision-Language Models with Triangular Consistency

Authors: Yunlong Deng (Mohamed bin Zayed University of Artificial Intelligence), Guangyi Chen (MBZUAI&CMU), Tianpei Gu (ByteDance Inc.), Lingjing Kong (Carnegie Mellon University), Yan Li (Mohamed bin Zayed University of Artificial Intelligence), Zeyu Tang (Stanford University), Kun Zhang (CMU & MBZUAI)

OmniBench: Towards The Future of Universal Omni-Language Models

Authors: Yizhi Li (The University of Manchester), Ge Zhang (University of Michigan – Ann Arbor), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Ruibin Yuan (Carnegie Mellon University), Zhu (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Hangyu Guo (Alibaba Group), Yiming Liang (University of the Chinese Academy of Sciences), Jiaheng Liu (Nanjing University), Noah Wang, Jian Yang (Alibaba Group), Siwei Wu (Nanjing University of Science and Technology), Xingwei Qu (University of Manchester), Jinjie Shi (Queen Mary, University of London), Xinyue Zhang (National University of Singapore), Zhenzhu Yang (China University of Geoscience Beijing), Yidan Wen (Northwest Polytechnical University Xi'an), Yanghai Wang (Nanjing University), Shihao Li (Nanjing University), Zhao-Xiang Zhang (Chinese Academy of Sciences, China), Ruibo Liu (Google DeepMind), Emmanouil Benetos (Queen Mary University of London), Wenhao Huang (Key Laboratory of Machine Perception), Chenghua Lin (University of Manchester)

UFM: A Simple Path towards Unified Dense Correspondence with Flow

Authors: Yuchen Zhang (Carnegie Mellon University), Nikhil Keetha (Carnegie Mellon University), Chenwei Lyu (TikTok Inc.), Bhuvan Jhamb (CMU, Carnegie Mellon University), Yutian Chen (Carnegie Mellon University), Yuheng Qiu (Carnegie Mellon University), Jay Karhade (CMU, Carnegie Mellon University), Shreyas Jha (Nissan Advanced Technology Center), Yaoyu Hu (Carnegie Mellon University), Deva Ramanan (Carnegie Mellon University), Sebastian Scherer (Carnegie Mellon University), Wenshan Wang (School of Computer Science, Carnegie Mellon University)

HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

Authors: Xiaoyuan Wang (Carnegie Mellon University), Yizhou Zhao (Carnegie Mellon University), Botao Ye (ETH Zurich), Shan Xiaojun, Weijie Lyu (University of California, Merced), Lu Qi (University of California, Merced), Kelvin Chan (Nanyang Technological University), Yinxiao Li (Google Deepmind), Ming-Hsuan Yang (Google / UC Merced)

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Authors: Yunlong Tang (University of Rochester), Pinxin Liu (University of Rochester), Mingqian Feng (University of Rochester), Zhangyun Tan (University of Rochester), Rui Mao (University of Rochester), Chao Huang (Department of Computer Science, University of Rochester), Jing Bi (University of Rochester), Yunzhong Xiao (Carnegie Mellon University), Susan Liang (University of Rochester), Hang Hua (University of Rochester), Ali Vosoughi (University of Rochester), Luchuan Song (University of Rochester), Zeliang Zhang (University of Rochester), Chenliang Xu (University of Rochester)

TAPIP3D: Tracking Any Point in Persistent 3D Geometry

Authors: Bowei Zhang (Peking University), Lei Ke (Tencent AI Lab Seattle), Adam Harley (Computer Science Department, Stanford University), Katerina Fragkiadaki (Carnegie Mellon University)

Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos

Authors: Kaihua Chen (CMU, Carnegie Mellon University), Tarasha Khurana (Carnegie Mellon University), Deva Ramanan (Carnegie Mellon University)

CAT: Content-Adaptive Image Tokenization

Authors: Junhong Shen (Carnegie Mellon University), Kushal Tirumala (Meta AI Research, FAIR), Michihiro Yasunaga (Stanford University), Ishan Misra (Facebook AI Research), Luke Zettlemoyer (University of Washington; Meta), LILI YU (Meta), Chunting Zhou (FAIR)

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Authors: Thanh-Dat Truong (University of Arkansas), Huu-Thien Tran (University of Arkansas), Tran Son (Ho Chi Minh city University of Science, Vietnam National University), Bhiksha Raj (Carnegie Mellon University), Khoa Luu (University of Arkansas)

OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

Authors: Jinpei Guo (Shanghai Jiaotong University), Yifei Ji (Shanghai Jiaotong University), Zheng Chen (Shanghai Jiao Tong University), Kai Liu (Shanghai Jiaotong University), Min Liu (Skild AI), Wang Rao (Carnegie Mellon University), Wenbo Li (JD Joy Future Academy), Yong Guo (Max Planck Institute for Informatics), Yulun Zhang (Shanghai Jiao Tong University)

Salient Concept-Aware Generative Data Augmentation

Authors: Tianchen Zhao (Amazon), Xuanbai Chen (Carnegie Mellon University), Zhihua Li (Amazon), Jun Fang (Amazon AGI), Dongsheng An (State University of New York, Stony Brook), Xiang Xu (Amazon), Zhuowen Tu (University of California, San Diego), Yifan Xing (Amazon)

Data-centric AI

ORBIT – Open Recommendation Benchmark for Reproducible Research with Hidden Tests

Authors: Jingyuan He (School of Computer Science, Carnegie Mellon University), Jiongnan Liu, Vishan Oberoi (Carnegie Mellon University), Bolin Wu (Carnegie Mellon University), Mahima Jagadeesh Patel (Carnegie Mellon University), Kangrui Mao (Carnegie Mellon University), Chuning Shi (Carnegie Mellon University), I-Ta Lee (Meta Platform Inc.), Arnold Overwijk (Meta), Chenyan Xiong (School of Computer Science, Carnegie Mellon University)

R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

Authors: Yuante Li (Carnegie Mellon University), Xu Yang (Microsoft), Xiao Yang (Research, Microsoft), Xisen Wang (University of Oxford), Weiqing Liu (Microsoft), Jiang Bian (Microsoft Research)

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Authors: Nikhil Kandpal (Department of Computer Science), Brian Lester (Google DeepMind/University of Toronto), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Sebastian Majstorovic (EleutherAI), Stella Biderman (The Eleutherai Institute), Baber Abbasi (EleutherAI), Luca Soldaini (Allen Institute for AI), Enrico Shippole (Teraflop AI), A. Feder Cooper (Stanford University), Aviya Skowron (EleutherAI), Shayne Longpre (Massachusetts Institute of Technology), Lintang Sutawika (Carnegie Mellon University), Alon Albalak (Lila Sciences), Zhenlin Xu (Boson AI), Guilherme Penedo (HuggingFace), Loubna Ben allal (Hugging Face), Elie Bakouch (Hugging Face), John Pressman (EleutherAI Institute), Honglu Fan (Google DeepMind), Dashiell Stander (EleutherAI), Guangyu Song (EleutherAI), Aaron Gokaslan (MBZUAI Institute of Foundation Models), John Kirchenbauer (University of Maryland, College Park), Tom Goldstein (University of Maryland), Brian Bartoldson (Lawrence Livermore National Laboratory), Bhavya Kailkhura (Lawrence Livermore National Laboratory), Tyler Murray (Allen Institute for Artificial Intelligence)

Whose Instructions Count? Resolving Preference Bias in Instruction Fine-Tuning

Authors: Jiayu Zhang (Westlake University), Changbang Li (University of Pennsylvania), Yinan Peng (Hengxin Technology Ltd.), Weihao Luo (Donghua University, Shanghai), Peilai Yu (University of Munich, Ludwig-Maximilians-Universität München), Xuan Zhang (Carnegie Mellon University)

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Authors: Cathy Jiao (Carnegie Mellon University), Yijun Pan (Yale University), Emily Xiao (Carnegie Mellon University), Daisy Sheng (Carnegie Mellon University), Niket Jain (Carnegie Mellon University), Hanzhang Zhao (CMU, Carnegie Mellon University), Ishita Dasgupta (School of Computer Science, Carnegie Mellon University), Jiaqi Ma (University of Illinois Urbana-Champaign), Chenyan Xiong (School of Computer Science, Carnegie Mellon University)

Jury-and-Judge Chain-of-Thought for Uncovering Toxic Data in 3D Visual Grounding

Authors: Kaixiang Huang (Zhejiang University), Qifeng Zhang (Zhejiang University), Jin Wang (Zhejiang University), Jingru Yang (Carnegie Mellon University), Yang Zhou (Zhejiang University), Huan Yu (Zhejiang University), Guodong Lu (Zhejiang University), Shengfeng He (Singapore Management University)

Faithful Group Shapley Value

Authors: Kiljae Lee (The Ohio State University), Ziqi Liu (Carnegie Mellon University), Weijing Tang (Carnegie Mellon University), Yuan Zhang (Ohio State University, Columbus)

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

Authors: Sang Choe (Anthropic), Hwijeen Ahn (Carnegie Mellon University), Juhan Bae (Anthropic), Kewen Zhao (School of Computer Science, Carnegie Mellon University), Youngseog Chung (CMU, Carnegie Mellon University), Adithya Pratapa (Carnegie Mellon University, Amazon), Willie Neiswanger (USC), Emma Strubell (Carnegie Mellon University), Teruko Mitamura (Carnegie Mellon University), Jeff Schneider (CMU), Eduard Hovy (Carnegie Mellon University), Roger Grosse (University of Toronto), Eric Xing (CMU/MBZUAI/GenBio)

Deep Learning

HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Authors: Haoran Li (Carnegie Mellon University), Yingjie Qin (Fudan University), Baoyuan Ou (Engineer), Lai Xu (Xiaohongshu), Ruiwen Xu (Xiaohongshu)

Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning

Authors: Yiqun Chen (Renmin University of China), Lingyong Yan (Baidu Online Network Technology (Beijing) Co., Ltd.), Weiwei Sun (Carnegie Mellon University), Xinyu Ma (Baidu), Yi Zhang (ByteDance Inc.), Shuaiqiang Wang (Baidu Inc.), Dawei Yin (Baidu), Yiming Yang (CMU), Jiaxin Mao (Renmin University of China)

Training Language Models to Reason Efficiently

Authors: Daman Arora, Andrea Zanette (Carnegie Mellon University)

Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability

Authors: Kaiqi Jiang (Princeton University), Jeremy Cohen (Flatiron Institute), Yuanzhi Li (CMU)

Diffusion Beats Autoregressive in Data-Constrained Settings

Authors: Mihir Prabhudesai (Carnegie Mellon University), Mengning Wu (Carnegie Mellon University), Amir Zadeh (Lambda), Katerina Fragkiadaki (Carnegie Mellon University), Deepak Pathak (Carnegie Mellon University)

Results of the Big ANN: NeurIPS’23 competition

Authors: Harsha Vardhan Simhadri (Microsoft), Martin Aumüller (IT University of Copenhagen), Matthijs Douze (Facebook AI Research), Dmitry Baranchuk (Yandex), Amir Ingber (Pinecone), Edo Liberty (Yale University), George Williams (Ansible AI), Ben Landrum (Cornell University), Magdalen Manohar (Carnegie Mellon University), Mazin Karjikar (University of Maryland, College Park), Laxman Dhulipala (UMD), Meng Chen (Fudan University), Yue Chen (Fudan University), Rui Ma (Fudan University), Kai Zhang (Fudan University), Yuzheng Cai (Fudan University), Jiayang Shi (Fudan University), Weiguo Zheng (Fudan University), Yizhuo Chen (Fudan University), Jie Yin (Tencent), Ben Huang (Baidu)

Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning

Authors: Arian Raje (CMU, Carnegie Mellon University), Baris Askin (Carnegie Mellon University), Divyansh Jhunjhunwala (Carnegie Mellon University), Gauri Joshi (Carnegie Mellon University)

GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

Authors: Xin Gao (Fudan University), Jiyao Liu (Fudan University), Guanghao Li (Fudan University), Yueming Lyu (Nanjing University), Jianxiong Gao, Weichen Yu (Carnegie Mellon University), Ningsheng Xu (Fudan University), Liang Wang (NLPR, China), Caifeng Shan (Nanjing University), Ziwei Liu (Nanyang Technological University), Chenyang Si (Sea AI Lab)

Reasoning Models Better Express Their Confidence

Authors: Dongkeun Yoon (KAIST), Seungone Kim (Carnegie Mellon University), Sohee Yang (University College London, University of London), Sunkyoung Kim (LG AI Research), Soyeon Kim (LG Corporation), Yongil Kim (LG Corporation), Eunbi Choi (LG AI Research), Yireun Kim (LG AI Research), Minjoon Seo (KAIST)

General Machine Learning

Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning

Authors: Hua Ye (Nanjing University), Siyuan Chen (University of Bristol), Haoliang Zhang (The University of Oklahoma), Weihao Luo (Donghua University, Shanghai), Yanbin Li (The University of Tokyo), Xuan Zhang (Carnegie Mellon University)

Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning

Authors: Zhonghao He (University of Cambridge), Tianyi (Alex) Qiu (Peking University / UC Berkeley), Hirokazu Shirado (Carnegie Mellon University), Maarten Sap (Carnegie Mellon University)

Handling Missing Responses under Cluster Dependence with Applications to Language Model Evaluation

Authors: Zhenghao Zeng (Stanford University), David Arbour (Adobe Research), Avi Feller (University of California, Berkeley), Ishita Dasgupta (Adobe Systems), Atanu Sinha (Adobe Systems), Edward Kennedy (Carnegie Mellon University)

LLM Interpretability with Identifiable Temporal-Instantaneous Representation

Authors: Xiangchen Song (Carnegie Mellon University), Jiaqi Sun (Carnegie Mellon University), Zijian Li (Mohamed bin Zayed University of Artificial Intelligence), Yujia Zheng (Carnegie Mellon University), Kun Zhang (CMU & MBZUAI)

Inference-Time Personalized Alignment with a Few User Preference Queries

Authors: Victor-Alexandru Pădurean (Max Planck Institute for Software Systems), Parameswaran Kamalaruban (EPFL), Nachiket Kotalwar (Carnegie Mellon University), Alkis Gotovos (MPI-SWS), Adish Singla (MPI-SWS)

Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models

Authors: Xiyuan Zhang (AWS AI), Danielle Maddix Robinson (AWS AI Labs), Junming Yin (Amazon), Nick Erickson (Amazon Web Services), Abdul Fatir Ansari (Amazon), Boran Han (AWS), Shuai Zhang (AWS AI), Leman Akoglu (CMU), Christos Faloutsos (CMU), Michael Mahoney (UC Berkeley), Tony Hu (AWS AI), Huzefa Rangwala (George Mason University), George Karypis (University of Minnesota, Minneapolis), Yuyang (Bernie) Wang (AWS AI)

Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data

Authors: Harsh Poonia (Carnegie Mellon University), Felix Divo (Technische Universität Darmstadt), Kristian Kersting (TU Darmstadt), Devendra Singh Dhami (Eindhoven University of Technology)

Optimization

PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries

Authors: Steven Kolawole (Carnegie Mellon University), Keshav Santhanam (Stanford University), Virginia Smith (Carnegie Mellon University), Pratiksha Thaker (CMU)

A Beyond-Worst-Case Analysis of Greedy k-means++

Authors: Qingyun Chen (University of California, Santa Cruz), Sungjin Im (University of California, Santa Cruz), Ben Moseley (Carnegie Mellon University), Ryan Milstrey (University of California, Merced), Chenyang Xu (Zhejiang University), Ruilong Zhang (Technische Universität München)

Improved Algorithms for Fair Matroid Submodular Maximization

Authors: Sepideh Mahabadi (Microsoft Research Redmond), Sherry Sarkar (Carnegie Mellon University), Jakub Tarnawski (Microsoft Research)

Probabilistic Methods

Sampling from multi-modal distributions with polynomial query complexity in fixed dimension via reverse diffusion

Authors: Adrien Vacher (ENSAE), Omar Chehab (Carnegie Mellon University), Anna Korba (GENES-CREST/ENSAE)

Efficient Bayesian Experiment Design with Equivariant Networks

Authors: Conor Igoe (FutureHouse), Tejus Gupta (Carnegie Mellon University), Jeff Schneider (CMU)

Reinforcement Learning

MyoChallenge 2024: A New Benchmark for Physiological Dexterity and Agility in Bionic Humans

Authors: Huiyi Wang (McGill University), Chun Kwang Tan (Northeastern University), Balint Hodossy (Imperial College London), Shirui Lyu (King’s College London, University of London), Pierre Schumacher (Max Planck Institute for Intelligent Systems), James Heald (University College London, University of London), Kai Biegun (University College London, University of London), Samo Hromadka (Gatsby Computational Neuroscience Unit), Maneesh Sahani (Gatsby Unit, UCL), Gunwoo Park (KAIST), Beomsoo Shin (KAIST), JongHyeon Park, Seungbum Koo (KAIST), Chenhui Zuo (Tsinghua University), Chengtian Ma (Tsinghua University), Yanan Sui (Tsinghua University), Nick Hansen (UC San Diego), Stone Tao (University of California – San Diego), Yuan Gao (Carnegie Mellon University), Hao Su (UCSD), Seungmoon Song (Stanford University), Letizia Gionfrida (King’s College London, University of London), Massimo Sartori (University of Twente), Guillaume Durandau (McGill University), Vikash Kumar (CMU / MyoLab), Vittorio Caggiano (MyoSuite)

Reasoning as an Adaptive Defense for Safety

Authors: Taeyoun Kim (Carnegie Mellon University), Fahim Tajwar (Carnegie Mellon University), Aditi Raghunathan (Carnegie Mellon University), Aviral Kumar (Carnegie Mellon University)

Compute-Optimal Scaling for Value-Based Deep RL

Authors: Preston Fu (University of California, Berkeley), Oleh Rybkin (University of California, Berkeley), Zhiyuan (Paul) Zhou (UC Berkeley, PI), Michal Nauman (University of Warsaw), Pieter Abbeel (UC Berkeley & Amazon), Sergey Levine (UC Berkeley), Aviral Kumar (Carnegie Mellon University)

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Authors: Frank (Fangzheng) Xu (Microsoft AI), Yufan Song (Carnegie Mellon University), Boxuan Li (Microsoft), Yuxuan Tang (Oracle), Kritanjali Jain (School of Computer Science, Carnegie Mellon University), Mengxue Bao (TikTok), Zora Wang (Carnegie Mellon University), Xuhui Zhou (Carnegie Mellon University), Zhitong Guo (Meta), Murong Cao (University of Hong Kong), Mingyang Yang (Carnegie Mellon University), Hao Yang Lu (Carnegie Mellon University), Amaad Martin (School of Computer Science, Carnegie Mellon University), Zhe Su (Carnegie Mellon University), Leander Maben (Carnegie Mellon University), Raj Mehta (Carnegie Mellon University), Wayne Chi (Carnegie Mellon University), Lawrence Jang (Carnegie Mellon University), Yiqing Xie (Carnegie Mellon University), Shuyan Zhou (Facebook), Graham Neubig (Carnegie Mellon University)

Adaptively Coordinating with Novel Partners via Learned Latent Strategies

Authors: Benjamin Li (Carnegie Mellon University), Shuyang Shi (School of Computer Science, Carnegie Mellon University), Lucia Romero (University of Pittsburgh), Huao Li (Massachusetts Institute of Technology), Yaqi Xie (CMU), Woojun Kim (Carnegie Mellon University), Stefanos Nikolaidis (University of Southern California), Charles Lewis (University of Pittsburgh), Katia Sycara (Carnegie Mellon University), Simon Stepputtis (Virginia Polytechnic Institute and State University)

Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Authors: Wen-Tse Chen (School of Computer Science, Carnegie Mellon University), Jiayu Chen (The University of Hong Kong), Fahim Tajwar (Carnegie Mellon University), Hao Zhu (Stanford University), Xintong Duan (Carnegie Mellon University), Ruslan Salakhutdinov (Carnegie Mellon University), Jeff Schneider (CMU)

Behavior Injection: Preparing Language Models for Reinforcement Learning

Authors: Zhepeng Cen (Carnegie Mellon University), Yihang Yao (Carnegie Mellon University), William Han (Carnegie Mellon University), Zuxin Liu (OpenAI), Ding Zhao (Carnegie Mellon University)

Scaling Offline RL via Efficient and Expressive Shortcut Models

Authors: Nicolas Espinosa-Dice (Cornell University), Yiyi Zhang (Cornell University), Yiding Chen (Cornell University), Bradley Guo (Cornell University), Owen Oertell (Cornell University), Gokul Swamy (Carnegie Mellon University), Kianté Brantley (Kempner and SEAS at Harvard University), Wen Sun (Cornell University and Databricks)

Thinking vs. Doing: Improving Agent Reasoning by Scaling Test-Time Interaction

Authors: Junhong Shen (Carnegie Mellon University), Hao Bai (University of Illinois at Urbana-Champaign), Lunjun Zhang (University of Toronto), Yifei Zhou (University of California, Berkeley), Amrith Setlur (Carnegie Mellon University), Peter Tong (New York University), Diego Caples (AGI, Inc.), Nan Jiang (University of Illinois at Urbana-Champaign), Tong Zhang (UIUC), Ameet Talwalkar (CMU, Datadog), Aviral Kumar (Carnegie Mellon University)

Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

Authors: Haizhong Zheng (Carnegie Mellon University), Yang Zhou (Carnegie Mellon University), Brian Bartoldson (Lawrence Livermore National Laboratory), Bhavya Kailkhura (Lawrence Livermore National Laboratory), Fan Lai (University of Illinois Urbana-Champaign), Jiawei Zhao (Meta FAIR), Beidi Chen (CMU / Amazon)

Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners

Authors: Michal Nauman (University of Warsaw), Marek Cygan (University of Warsaw, NoMagic.AI), Carmelo Sferrazza (UC Berkeley / Amazon FAR / UT Austin), Aviral Kumar (Carnegie Mellon University), Pieter Abbeel (UC Berkeley & Amazon)

Social Aspects

Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph

Authors: Gautam Kamath (University of Waterloo), Alireza F. Pour (University of Waterloo), Matthew Regehr (University of Waterloo), David Woodruff (Carnegie Mellon University)

Fast Data Attribution for Text-to-Image Models

Authors: Sheng-Yu Wang (CMU), Aaron Hertzmann (Adobe), Alexei Efros (UC Berkeley), Richard Zhang (Adobe), Jun-Yan Zhu (Carnegie Mellon University)

On Fairness of Unified Multimodal Large Language Model for Image Generation

Authors: Ming Liu (Iowa State University), Hao Chen (Carnegie Mellon University), Jindong Wang (William & Mary), Liwen Wang (Iowa State University), Bhiksha Raj (Carnegie Mellon University), Wensheng Zhang (Iowa State University)

Discretization-free Multicalibration through Loss Minimization over Tree Ensembles

Authors: Hongyi Jin (UCLA Computer Science Department, University of California, Los Angeles), Zijun Ding (Carnegie Mellon University), Dung Daniel Ngo (J.P. Morgan Chase), Steven Wu (Carnegie Mellon University)

Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

Authors: Shuaiqi Wang (Carnegie Mellon University), Vikas Raunak (Google DeepMind), Arturs Backurs (TTIC), Victor Reis (Microsoft), Pei Zhou (University of Southern California), Sihao Chen (Microsoft), Longqi Yang (Microsoft), Zinan Lin (Microsoft Research), Sergey Yekhanin (Microsoft), Giulia Fanti (CMU)

Distributionally Robust Feature Selection

Authors: Maitreyi Swaroop (Carnegie Mellon University), Tamar Krishnamurti (University of Pittsburgh), Bryan Wilder (Carnegie Mellon University)

Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM

Authors: Xiaoyu Wu (Rice University), Yifei Pang (Carnegie Mellon University), Terrance Liu (Carnegie Mellon University), Steven Wu (Carnegie Mellon University)

OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Authors: Vineeth Dorna (University of Massachusetts at Amherst), Anmol Mekala (University of Massachusetts at Amherst), Wenlong Zhao (University of Massachusetts Amherst), Andrew McCallum (UMass Amherst), Zico Kolter (Carnegie Mellon University), Zachary Lipton (Carnegie Mellon University / Abridge), Pratyush Maini (Carnegie Mellon University/ DatologyAI)

Predicting the Performance of Black-box Language Models with Follow-up Queries

Authors: Dylan Sam (OpenAI, Carnegie Mellon University), Marc Finzi (Carnegie Mellon University), Zico Kolter (Carnegie Mellon University)

Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Authors: Luke Guerdan (Carnegie Mellon University), Solon Barocas (Microsoft Research; Cornell University), Kenneth Holstein (Carnegie Mellon University), Hanna Wallach (Microsoft), Steven Wu (Carnegie Mellon University), Alex Chouldechova (Microsoft)

Modeling the Economic Impacts of AI Openness Regulation

Authors: Tori Qiu (Carnegie Mellon University), Benjamin Laufer (Cornell Tech), Jon Kleinberg (Cornell University), Hoda Heidari (Carnegie Mellon University)

Valid Inference with Imperfect Synthetic Data

Authors: Yewon Byun (Carnegie Mellon University), Shantanu Gupta (Carnegie Mellon University), Zachary Lipton (Carnegie Mellon University / Abridge), Rachel Childers (University of Zurich), Bryan Wilder (Carnegie Mellon University)

GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection

Authors: Pingbang Hu (University of Illinois Urbana-Champaign), Joseph Melkonian (Washington University, Saint Louis), Weijing Tang (Carnegie Mellon University), Han Zhao (University of Illinois, Urbana Champaign), Jiaqi Ma (University of Illinois Urbana-Champaign)

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Authors: Arman Zharmagambetov (FAIR @ Meta), Chuan Guo (Meta FAIR), Ivan Evtimov (Meta FAIR), Maya Pavlova (University of Waterloo), Ruslan Salakhutdinov (Carnegie Mellon University), Kamalika Chaudhuri (FAIR, Meta and UCSD)

Sequentially Auditing Differential Privacy

Authors: Tomás González Lara (Carnegie Mellon University), Mateo Dulce Rubio (CMU), Aaditya Ramdas (Carnegie Mellon University), Mónica Ribero (Google)

Fairshare Data Pricing via Data Valuation for Large Language Models

Authors: Luyang Zhang (Carnegie Mellon University), Cathy Jiao (Carnegie Mellon University), Beibei Li (Carnegie Mellon University), Chenyan Xiong (School of Computer Science, Carnegie Mellon University)

Private Evolution Converges

Authors: Tomás González Lara (Carnegie Mellon University), Giulia Fanti (CMU), Aaditya Ramdas (Carnegie Mellon University)

Theory

A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing

Authors: Alex Clinton (University of Wisconsin – Madison), Thomas Zeng (University of Wisconsin – Madison), Yiding Chen (Cornell University), Jerry Zhu (Carnegie Mellon University), Kirthevasan Kandasamy (University of Wisconsin – Madison)

On Learning Verifiers and Implications to Chain-of-Thought Reasoning

Authors: Maria-Florina Balcan (Carnegie Mellon University), Avrim Blum (Toyota Technological Institute at Chicago), Zhiyuan Li (Toyota Technological Institute at Chicago), Dravyansh Sharma (Toyota Technological Institute at Chicago)

Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL

Authors: Tong Yang (Carnegie Mellon University), Bo Dai (Google DeepMind & Georgia Tech), Lin Xiao (Facebook), Yuejie Chi (Yale University)

Nearly-Linear Time and Massively Parallel Algorithms for $k$-anonymity

Authors: Kevin Aydin (Google), Honghao Lin (Carnegie Mellon University), David Woodruff (Carnegie Mellon University), Peilin Zhong (Columbia University)

Stabilizing LTI Systems under Partial Observability: Sample Complexity and Fundamental Limits

Authors: Ziyi Zhang (Carnegie Mellon University), Yorie Nakahira (Carnegie Mellon University), Guannan Qu (Carnegie Mellon University)

Sharp Matrix Empirical Bernstein Inequalities

Authors: Hongjian Wang (Carnegie Mellon University), Aaditya Ramdas (Carnegie Mellon University)

Learning from Interval Targets

Authors: Rattana Pukdee (Carnegie Mellon University), Ziqi Ke (Bloomberg), Chirag Gupta (Bloomberg AI)

Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Authors: Yu Huang (University of Pennsylvania), Zixin Wen (Carnegie Mellon University), Aarti Singh (CMU), Yuejie Chi (Yale University), Yuxin Chen (University of Pennsylvania)

Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function

Authors: Maria-Florina Balcan (Carnegie Mellon University), Anh Nguyen (Carnegie Mellon University), Dravyansh Sharma (Toyota Technological Institute at Chicago)

Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

Authors: Tong Yang (Carnegie Mellon University), Yu Huang (University of Pennsylvania), Yingbin Liang (The Ohio State University), Yuejie Chi (Yale University)

Uncategorized

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Authors: Xeron Du (01.AI), Yifan Yao (Beijing University of Posts and Telecommunications), Kaijing Ma (Tongji University), Bingli Wang (Sichuan Agricultural University), Tianyu Zheng (Beijing University of Posts and Telecommunications), Zhu (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Minghao Liu (2077AI), Yiming Liang (University of the Chinese Academy of Sciences), Xiaolong Jin (Purdue University), Zhenlin Wei (Harbin Engineering University), Chujie Zheng (Tsinghua University), Kaixin Deng (Hokkaido University), Shuyue Guo (Beijing University of Posts and Telecommunications), Shian Jia (Zhejiang University), Sichao Jiang (Zhejiang University), Yiyan Liao (Peking University), Rui Li (Peking University), Qinrui Li (Cornell University), Sirun Li (Peking University), Yizhi Li (The University of Manchester), Yunwen Li (Chinese University of Hong Kong (Shenzhen)), Dehua Ma (Beijing University of Posts and Telecommunications), Yuansheng Ni (University of Waterloo), Haoran Que (Beijing University of Aeronautics and Astronautics), Qiyao Wang (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences), Zhoufutu Wen (ByteDance Inc.), Siwei Wu (Nanjing University of Science and Technology), Tianshun Xing (Beijing University of Posts and Telecommunications), Ming Xu (01.AI), Zhenzhu Yang (China University of Geosciences Beijing), Noah Wang, Junting Zhou (Peking University), Yuelin Bai (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences), Xingyuan Bu (Alibaba Group), Chenglin Cai (Huawei Technologies Ltd.), Liang Chen (Peking University), Yifan Chen (ByteDance Inc.), Cheng Chengtuo (Zhejiang University), Tianhao Cheng (Fudan University), Keyi Ding (2077AI), Siming Huang (University of Melbourne), Yun Huang (National University of Singapore), Yaoru Li (Zhejiang University), Yizhe Li (Zhejiang University), Zhaoqun Li (Zhejiang University), Tianhao Liang (Zhejiang University), Chengdong Lin (Hangzhou Dianzi University), Hongquan Lin (University of Science and Technology of China), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Zhongyuan Peng (Fudan University), Zifan Peng (The Hong Kong University of Science and Technology (Guangzhou)), Qige Qi (ByteDance Inc.), Shi Qiu (Peking University), Xingwei Qu (University of Manchester), Shanghaoran Quan (Alibaba Group), Yizhou Tan (Harvard University), Zili Wang (StepFun), Chenqing Wang (Abaka AI), Hao Wang (Beijing University of Aeronautics and Astronautics), Yiya Wang (Peking University), Yubo Wang (University of Waterloo), Jiajun Xu (Facebook), Kexin Yang (Alibaba Group), Ruibin Yuan (Carnegie Mellon University), Yuanhao Yue (Fudan University), Tianyang Zhan (ByteDance Inc.), Chun Zhang (ByteDance Inc.), Jinyang Zhang (Peking University), Xiyue Zhang (Peking University), Owen Zhang (Department of Computer Science, Princeton University), Yue Zhang (Suzhou University), Yongchi Zhao (Alibaba Group), Xiangyu Zheng (Fudan University), Chenghua Zhong (University of Science and Technology Beijing), Yang Gao (Nanjing University), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Dayiheng Liu (Alibaba Group), Qian Liu (TikTok (Singapore)), Tianyu Liu (Alibaba), Shiwen Ni (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences), Junran Peng (Institute of Automation, Chinese Academy of Sciences), Yujia Qin (ByteDance), Wenbo Su (Alibaba Group), Guoyin Wang (Alibaba Qwen Pilot), Shi Wang (Institute of Computing Science, Chinese Academy of Sciences), Jian Yang (Alibaba Group), Min Yang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences), Meng Cao (Mohamed bin Zayed University of Artificial Intelligence), Xiang Yue (Carnegie Mellon University), Zhao-Xiang Zhang (Chinese Academy of Sciences, China), Wangchunshu Zhou (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Jiaheng Liu (Nanjing University), Qunshu Lin (Abaka AI), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)

Safety Pretraining: Toward the Next Generation of Safe AI

Authors: Pratyush Maini (Carnegie Mellon University / DatologyAI), Sachin Goyal (Carnegie Mellon University), Dylan Sam (OpenAI, Carnegie Mellon University), Alexander Robey (Carnegie Mellon University), Yash Savani (Carnegie Mellon University), Yiding Jiang (Google DeepMind), Andy Zou (CMU, Gray Swan AI), Matt Fredrikson (CMU), Zachary Lipton (Carnegie Mellon University / Abridge), Zico Kolter (Carnegie Mellon University)

A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks

Authors: Mucong Ding (Department of Computer Science, University of Maryland, College Park), Bang An (University of Maryland, College Park), Tahseen Rabbani (University of Chicago), Chenghao Deng (University of Maryland), Anirudh Satheesh (University of Maryland, College Park), Souradip Chakraborty (University of Maryland, College Park), Mehrdad Saberi (Department of Computer Science, University of Maryland, College Park), Yuxin Wen (University of Maryland), Kyle Sang (University of Maryland), Aakriti Agrawal (University of Maryland, College Park), Xuandong Zhao (UC Berkeley), Mo Zhou (Johns Hopkins University), Mary-Anne Hartley (EPFL), Lei Li (Carnegie Mellon University), Yu-Xiang Wang (UCSD), Vishal Patel (Johns Hopkins University), Soheil Feizi (University of Maryland), Tom Goldstein (University of Maryland), Furong Huang (University of Maryland)

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Authors: Andy Zou (CMU, Gray Swan AI), Maxwell Lin (University of California, Berkeley), Eliot Jones (Gray Swan), Micha Nowak (Bayerische Julius-Maximilians-Universität Würzburg), Mateusz Dziemian (Independent), Nick Winter (Gray Swan AI), Valent Nathanael (Gray Swan AI), Ayla Croft (Gray Swan AI), Xander Davies (University of Oxford), Jai Patel (UK AI Security Institute), Robert Kirk (University College London), Yarin Gal (University of Oxford), Dan Hendrycks (Center for AI Safety), Zico Kolter (Carnegie Mellon University), Matt Fredrikson (CMU)

Antidistillation Sampling

Authors: Yash Savani (Carnegie Mellon University), Asher Trockman (CMU), Zhili Feng (OpenAI), Yixuan Xu (Carnegie Mellon University), Avi Schwarzschild (Carnegie Mellon University), Alexander Robey (Carnegie Mellon University), Marc Finzi (Carnegie Mellon University), Zico Kolter (Carnegie Mellon University)

Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination

Authors: Ilias Diakonikolas (University of Wisconsin-Madison), Chao Gao (University of Chicago), Daniel Kane (UCSD), John Lafferty (Carnegie Mellon University), Ankit Pensia (IBM Research)

Is Your Diffusion Model Actually Denoising?

Authors: Daniel Pfrommer (Massachusetts Institute of Technology), Zehao Dou (OpenAI), Christopher Scarvelis (MIT), Max Simchowitz (Carnegie Mellon University), Ali Jadbabaie (MIT)

CSGO: Content-Style Composition in Text-to-Image Generation

Authors: Peng Xing (Nanjing University of Science and Technology), Haofan Wang (Carnegie Mellon University), Yanpeng Sun (Nanjing University of Science and Technology), Qixun Wang (Tencent Hunyuan), Xu Bai (ByteDance Inc.), Hao Ai (Beijing University of Aeronautics and Astronautics), Jen-Yuan Huang (Peking University), Zechao Li (Nanjing University of Science and Technology)

RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs

Authors: Meng-Hao Guo (Tsinghua University), Xuanyu Chu (Tsinghua University), Qianrui Yang (Tsinghua University), Zhe-Han Mo (Tsinghua University), Yiqing Shen (Tsinghua University), Pei-Lin Li (Tsinghua University), Xinjie Lin (Tsinghua University), Jinnian Zhang (University of Wisconsin, Madison), Xin-Sheng Chen (Tsinghua University), Yi Zhang (Beihang University), Kiyohiro Nakayama (Stanford University), Zhengyang Geng (CMU), Houwen Peng (Microsoft Research), Han Hu (Microsoft Research Asia), Shi-Min Hu (Tsinghua University)

Kinetics: Rethinking Test-Time Scaling Law

Authors: Ranajoy Sadhukhan (Carnegie Mellon University), Zhuoming Chen (Carnegie Mellon University), Haizhong Zheng (Carnegie Mellon University), Beidi Chen (CMU / Amazon)

AHa-Bench: Benchmarking Audio Hallucinations in Large Audio-Language Models

Authors: Xize Cheng (Zhejiang University), Dongjie Fu (Zhejiang University), Chenyuhao Wen (University of Electronic Science and Technology of China), Shannon Yu (Tianjin University), Zehan Wang (Zhejiang University), Shengpeng Ji (Zhejiang University), Siddhant Arora (Carnegie Mellon University), Tao Jin (Zhejiang University), Shinji Watanabe (Carnegie Mellon University), Zhou Zhao (Zhejiang University)

Improving Model-Based Reinforcement Learning by Converging to Flatter Minima

Authors: Shrinivas Ramasubramanian (Carnegie Mellon University), Benjamin Freed (Carnegie Mellon University), Alexandre Capone (Carnegie Mellon University), Jeff Schneider (CMU)

Tutorials

New Frontiers of Hyperparameter Optimization: Recent advances and open challenges in theory and practice

Authors: Dravyansh Sharma (Toyota Technological Institute at Chicago), Colin White (Meta), Maria-Florina Balcan (Carnegie Mellon University)

Machine learning performance depends strongly on the data and on the choice of algorithms and hyperparameters, making hyperparameter tuning and algorithm selection essential. We survey widely used practical methods, including Bayesian optimization, bandit-based approaches, and recent techniques for large language models such as scaling laws and parameterization-aware methods, noting their limited theoretical guarantees. We then review recent theory-driven advances that characterize how performance varies with hyperparameters for core algorithms—including decision trees, linear models, and deep learning—enabling structure-aware tuning methods with PAC generalization guarantees. We conclude with open challenges in combining principled and practical approaches, optimizing over high-dimensional or discrete spaces, and scaling to distributed settings.
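To make the bandit-based approaches mentioned above concrete, here is a minimal sketch of successive halving, a standard bandit-style tuning scheme: all candidate configurations are evaluated on a small budget, the bottom half is discarded, and the survivors are re-evaluated with more budget. The `evaluate` function and the toy noisy objective are purely illustrative, not from the tutorial.

```python
import random

def successive_halving(configs, evaluate, budget=1, eta=2, rounds=3):
    """Bandit-style hyperparameter selection: score every candidate on a
    small budget, keep the top 1/eta, then repeat with eta-times the budget
    so later rounds measure the survivors more precisely."""
    survivors = list(configs)
    for _ in range(rounds):
        scores = [(evaluate(c, budget), c) for c in survivors]
        scores.sort(key=lambda sc: sc[0], reverse=True)
        survivors = [c for _, c in scores[:max(1, len(scores) // eta)]]
        budget *= eta
        if len(survivors) == 1:
            break
    return survivors[0]

# Toy objective standing in for validation accuracy: it peaks at lr = 0.1,
# and spending more budget shrinks the evaluation noise.
random.seed(0)
def evaluate(lr, budget):
    noise = random.gauss(0, 0.05 / budget)
    return 1.0 - abs(lr - 0.1) + noise

best = successive_halving([0.001, 0.01, 0.1, 0.5, 1.0], evaluate)
print(best)
```

The key design point is that cheap, noisy early rounds only need to separate clearly bad configurations from plausible ones; expensive, precise comparisons are reserved for the few survivors.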

Data Privacy, Memorization, & Legal Implications in Generative AI: A Practical Guide

Authors: Pratyush Maini (Carnegie Mellon University/ DatologyAI), Joseph C. Gratz (Partner, Morrison Foerster LLP), A. Feder Cooper (Yale/Stanford)

Generative models are trained on vast datasets that often contain personal data and copyrighted content. As lawsuits, regulations, and standards emerge, practitioners increasingly need concrete, technically grounded guidance on how privacy and copyright law interact with the realities of modern model development. This tutorial connects data privacy, memorization, and copyright. We will alternate between technical material (attacks, defenses, measurement, and system design) and legal analysis (doctrines, active cases, and regulatory futures), with a focus on practical workflows that ML researchers, engineers, and policy teams can adopt today.

Foundations of Imitation Learning

Authors: Adam Block (Columbia University), Dylan Foster (Microsoft Research), Max Simchowitz (Carnegie Mellon University)

This tutorial frames imitation learning (IL) as a unifying way to understand supervised training of foundation models—learning by imitating large corpora of domain-specific demonstrations—across areas like large language model pre-training, robotics, and chemistry/life sciences. It surveys recent theory on when and why IL works with powerful generative models, explains the interventions and best practices the field has converged on, and points to opportunities to better connect theory and practice. A central theme is how domain-specific settings shape solutions, contrasting discrete problems like language modeling with continuous-control challenges in robotics. It also links techniques across domains, casting next-token prediction as behavior cloning with log-loss and relating exposure bias in generation to compounding error in control, while motivating tools like action chunking, score matching, and interactive data collection.
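The "next-token prediction as behavior cloning with log-loss" framing above can be written in a few lines: the corpus plays the role of the expert, and the loss is the average negative log-probability the policy assigns to the expert's actions (tokens). This toy sketch, with a hypothetical three-token vocabulary, is only illustrative.

```python
import math

def bc_log_loss(policy_probs, expert_tokens):
    """Behavior cloning with log-loss: average negative log-likelihood that
    the policy (one next-token distribution per position) assigns to the
    expert's actual next tokens -- exactly the language-modeling loss."""
    return -sum(math.log(p[t]) for p, t in zip(policy_probs, expert_tokens)) / len(expert_tokens)

# Hypothetical 3-token vocabulary; the "expert demonstration" is the token
# sequence a pre-training corpus would supply.
probs = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
]
expert = ["a", "b"]
loss = bc_log_loss(probs, expert)
print(round(loss, 4))  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.2899
```

Under this view, exposure bias in generation is the familiar compounding-error problem of behavior cloning: the loss only ever scores the policy on states (prefixes) the expert visited.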

Scale Test-Time Compute on Modern Hardware

Authors: Zhuoming Chen (Carnegie Mellon University), Beidi Chen (Carnegie Mellon University), Azalia Mirhoseini (Stanford / Recursive Intelligence)

Large language models have made major gains on reasoning tasks by scaling test-time compute using methods like chain-of-thought and sampling, which can boost performance beyond what pretraining alone delivers. However, deploying more test-time compute is hard because inference workloads tend to have low parallelism, irregular execution, heavy memory I/O, and dynamic control flow—creating bottlenecks like attention memory overhead and poor compute utilization. The tutorial surveys both systems advances (e.g., more efficient KV-cache management, optimized attention kernels, smarter scheduling) and algorithmic directions (e.g., architectures and parallel generation better suited to hardware). Its goal is to connect scaling theory with real deployment constraints and motivate practical, scalable LLM agent systems.
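As a minimal illustration of the sampling-based test-time scaling the abstract refers to, here is a self-consistency-style voter: draw many independent answers and return the most common one. The `sample_answer` stub standing in for an LLM sampler (and its 60% accuracy) is an assumption for the sketch, not part of the tutorial; note the n samples are independent, which is why low per-request parallelism on real hardware becomes the bottleneck.

```python
from collections import Counter
import random

def majority_vote(sample_answer, prompt, n=32):
    """Test-time scaling via repeated sampling: draw n candidate answers
    and return the most frequent one (self-consistency-style voting)."""
    votes = Counter(sample_answer(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

# Stand-in for an LLM sampler: returns the right answer 60% of the time
# and one of two distractors otherwise.
random.seed(0)
def sample_answer(prompt):
    return random.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2])[0]

ans = majority_vote(sample_answer, "What is 6 * 7?")
print(ans)
```

Even a modestly-better-than-chance sampler becomes reliable under voting, which is the statistical intuition behind spending more inference compute per query.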

The Science of Benchmarking

Authors: Ziqiao Ma (University of Michigan), Michael Saxon (University of Washington), Xiang Yue (Carnegie Mellon University/Meta)

This tutorial argues that modern AI evaluation needs a more principled view of what benchmarks actually measure—and what they systematically miss—as models and use cases evolve. It maps out key pitfalls in today’s benchmarking practice (especially static metrics that fail to track changing model behavior) and frames evaluation as an epistemic design problem rather than just a leaderboard exercise. The tutorial then surveys emerging paradigms—including adversarial and dynamic benchmarks, model arenas, scaled human evaluation, simulators/sandboxes, and applied interpretability—plus a panel to compare perspectives across the community.
