CMU researchers are presenting 156 papers at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), held December 2–7 at the San Diego Convention Center. Here is a quick overview of the areas our researchers are working on:
Here are our most frequent collaborator institutions:
This paper introduces an Encoder–Attender–Decoder (EAD) framework to study task-optimized neural networks for tactile processing using realistic whisker-based simulations. Convolutional recurrent neural networks (ConvRNNs) emerge as the most effective encoders, both for tactile categorization and for producing representations that closely match activity in rodent somatosensory cortex, revealing a linear link between task performance and neural alignment. Notably, self-supervised contrastive ConvRNN models achieve neural fits comparable to supervised training, indicating that label-free learning can capture biologically relevant tactile representations. These findings highlight the importance of recurrent processing for understanding cortical tactile computation and for building robust embodied AI systems.
Authors: Yuxuan Zhou (CISPA Helmholtz Center for Information Security), Heng Li (Carnegie Mellon University), Zhi-Qi Cheng (University of Washington), Xudong Yan (City University of Macao), Yifei Dong (Carnegie Mellon University), Mario Fritz (CISPA Helmholtz Center for Information Security), Margret Keuper (University of Mannheim)
Label Smoothing (LS) is commonly used to reduce overconfidence and improve generalization, but it can paradoxically increase confidence in misclassified samples and collapse feature representations. This work analytically decomposes the LS loss, revealing an error-amplification term that strengthens incorrect predictions and drives representation collapse. To overcome this, the authors propose Max Suppression (MaxSup), which regularizes predictions uniformly by penalizing the top-1 logit instead of the ground-truth logit. Experiments show that MaxSup preserves intra-class diversity, improves class separation, and consistently outperforms LS across large-scale classification and downstream tasks.
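As a rough sketch of the contrast described above (not the paper's exact formulation), label smoothing's implicit regularizer penalizes the ground-truth logit, while a MaxSup-style penalty targets the top-1 logit regardless of whether the prediction is correct:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def label_smoothing_loss(logits, target, alpha=0.1):
    # Standard LS: cross-entropy against a smoothed target distribution.
    K = len(logits)
    p = softmax(logits)
    q = np.full(K, alpha / K)
    q[target] += 1.0 - alpha
    return -(q * np.log(p)).sum()

def maxsup_loss(logits, target, alpha=0.1):
    # MaxSup sketch: cross-entropy plus a uniform penalty on the largest
    # logit (rather than the ground-truth logit), so misclassified samples
    # have their (wrong) top prediction suppressed instead of amplified.
    p = softmax(logits)
    ce = -np.log(p[target])
    penalty = logits.max() - logits.mean()
    return ce + alpha * penalty
```

On a misclassified sample (argmax ≠ target), the LS term still pushes mass away from the true class, while the MaxSup penalty pushes down whatever class is currently winning.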
Authors: Liwei Jiang (University of Washington), Yuanjun Chai (University of Washington), Margaret Li (University of Washington), Mickel Liu (University of Washington), Raymond Fok (University of Washington), Nouha Dziri (Allen Institute for AI), Yulia Tsvetkov (University of Washington), Maarten Sap (Carnegie Mellon University), Yejin Choi (Stanford University / NVIDIA)
This paper introduces INFINITY-CHAT, a large-scale dataset of 26,000 diverse open-ended user queries and a comprehensive taxonomy of prompt types to evaluate creativity and diversity in language model outputs. Using this resource, the authors identify a pronounced “Artificial Hivemind” effect marked by both repetitive responses within a single model and striking similarities across different models. The dataset also includes over 31,000 human annotations enabling analysis of collective and individual preferences. Results show that existing models and evaluation methods are poorly calibrated to idiosyncratic human judgments, highlighting risks of homogenized AI outputs.
Authors: Zhengyang Geng (CMU), Mingyang Deng (Massachusetts Institute of Technology), Xingjian Bai (Massachusetts Institute of Technology), Zico Kolter (Carnegie Mellon University), Kaiming He (MIT)
The authors introduce MeanFlow, a principled one-step generative modeling framework based on the concept of average velocity rather than the instantaneous velocity used in prior flow-matching methods. They derive a formal identity linking average and instantaneous velocities to guide neural network training in a self-contained approach requiring no pretraining, distillation, or curriculum learning. MeanFlow achieves strong results, including a 3.43 FID on ImageNet 256×256 with a single function evaluation, outperforming previous one-step models. These results substantially narrow the performance gap between one-step and multi-step diffusion and flow-based methods.
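In flow-matching notation (symbols assumed here, not quoted from the paper), the average velocity u over an interval [r, t] and the identity that makes it trainable can be written as:

```latex
u(z_t, r, t) \;=\; \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau
\qquad\Longrightarrow\qquad
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t-r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t)
```

The right-hand identity follows from differentiating (t−r)·u with respect to t, so a network predicting u can be supervised using only the instantaneous velocity and its own time derivative, with no trajectory integration.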
Authors: Xinyuan Wang (University of Hong Kong), Bowen Wang (University of Hong Kong), Dunjie Lu (Sun Yat-sen University), Junlin Yang (Tsinghua University), Tianbao Xie (University of Hong Kong), Junli Wang (Alibaba Group), Jiaqi Deng (The University of Hong Kong), Xiaole Guo (University of Hong Kong), Yiheng Xu (University of Hong Kong), Chen Wu (Carnegie Mellon University), Zhennan Shen (Shanghai Jiao Tong University), Zhuokai Li (University of Hong Kong), Ryan Li (Stanford University), Xiaochuan Li (Tsinghua University), Junda Chen (Harbin Institute of Technology), Boyuan Zheng (The University of Hong Kong), Li Peihang (University of Hong Kong), Fangyu Lei (Institute of Automation, Chinese Academy of Sciences), Ruisheng Cao (Shanghai Jiao Tong University), Yeqiao Fu (University of Hong Kong), Dongchan Shin (University of Hong Kong), Martin Shin (University of Hong Kong), Hu Jiarui (University of Hong Kong), Yuyan Wang (Johns Hopkins University), Jixuan Chen (University of California, San Diego), Yuxiao Ye (The Hong Kong University of Science and Technology), Danyang Zhang (Shanghai Jiao Tong University), Yipu Wang (Institute of Automation, Chinese Academy of Sciences), Heng Wang (University of Illinois Urbana-Champaign), Diyi Yang (Stanford University), Victor Zhong (University of Waterloo), Y.Charles (Moonshot AI), Zhilin Yang (Tsinghua University), Tao Yu (University of Hong Kong)
This paper introduces OpenCUA, an open-source framework designed to enable transparent research into computer-use agents built with vision–language models. The framework includes an annotation system for collecting human demonstrations, AgentNet, a large-scale dataset spanning three operating systems and 200+ applications, and a scalable pipeline that converts demonstrations into state–action data with reflective chain-of-thought reasoning. End-to-end agent models trained with OpenCUA show strong benchmark performance, with OpenCUA-72B achieving a 45.0% success rate on OSWorld-Verified, setting a new open-source state of the art.
Authors: Jiatong Shi (Carnegie Mellon University), Yifan Cheng (Huazhong University of Science and Technology), Bo-Hao Su (Carnegie Mellon University), Hye-jin Shim (Carnegie Mellon University), Jinchuan Tian (Carnegie Mellon University), Samuele Cornell (Università Politecnica delle Marche), Yiwen Zhao (School of Computer Science, Carnegie Mellon University), Siddhant Arora (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University)
This work presents ARECHO, an autoregressive chain-based framework for jointly evaluating multiple speech quality metrics such as PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score), which traditionally differ in scale and assumptions. ARECHO introduces a comprehensive tokenization pipeline, a dynamic classifier chain to model inter-metric dependencies, and a confidence-oriented two-step decoding scheme to improve inference reliability. Experiments show that ARECHO consistently outperforms baseline methods across speech enhancement, generation evaluation, and noisy-speech scenarios. The approach also improves interpretability and flexibility by enabling reference-free evaluation and subset metric queries.
Authors: Brandon Wood (FAIR at Meta), Misko Dzamba (Facebook), Xiang Fu (Periodic Labs), Meng Gao (Facebook), Muhammed Shuaibi (FAIR, Meta), Luis Barroso-Luque (Facebook), Kareem Abdelmaqsoud (Carnegie Mellon University), Vahe Gharakhanyan (Meta), John Kitchin (Carnegie Mellon University), Daniel Levine (Meta FAIR), Kyle Michel (Meta), Anuroop Sriram (Meta FAIR), Taco Cohen (Meta / FAIR), Abhishek Das (FAIR, Meta AI), Sushree Sahoo (Facebook), Ammar Rizvi (Meta), Zachary Ulissi (FAIR, Meta AI), Larry Zitnick (Fundamental AI Research at Meta AI)
This paper introduces Universal Models for Atoms (UMA), a family of large-scale models designed to rapidly and accurately predict properties from atomic simulations across chemistry and materials science. Trained on over 500 million unique 3D atomic structures spanning molecules, materials, and catalysts, UMA leverages empirical scaling laws and a novel mixture-of-linear-experts architecture to increase capacity without sacrificing speed. Evaluations show that a single UMA model, without fine-tuning, matches or outperforms specialized models across diverse applications.
This work addresses a key limitation of behavioral cloning (BC) in imitation learning: BC only teaches an agent to mimic expert actions at states the expert visited, leaving it unable to recover from mistakes. To overcome this, the authors propose SAILOR, which leverages learning to search (L2S) by training a world model and a reward model to plan and recover toward expert outcomes even after errors. SAILOR achieves stable and sample-efficient learning without additional human corrections and consistently outperforms state-of-the-art diffusion-policy BC methods across visual manipulation benchmarks. It also demonstrates robustness to nuanced failures and reward hacking, and the performance gap persists even when BC is trained with 5–10x more demonstrations.
Authors: Jiajun Shi (Beijing University of Aeronautics and Astronautics), Jian Yang (Alibaba Group), Jiaheng Liu (Nanjing University), Xingyuan Bu (Alibaba Group), Jiangjie Chen (ByteDance Seed), Junting Zhou (Peking University), Kaijing Ma (Tongji University), Zhoufutu Wen (ByteDance Inc.), Bingli Wang (Sichuan Agricultural University), Yancheng He (Alibaba Group), Liang Song (M-A-P), Hualei Zhu (Beijing University of Aeronautics and Astronautics), Shilong Li (Beijing University of Posts and Telecommunications), Xingjian Wang (Shanghai University of Electric Power), Wei Zhang (Beijing University of Aeronautics and Astronautics), Ruibin Yuan (Carnegie Mellon University), Yifan Yao (Beijing University of Posts and Telecommunications), Wenjun Yang (University College London), Yunli Wang (Kuaishou Technology), Siyuan Fang (Beijing University of Posts and Telecommunications), Siyu Yuan (Fudan University), Qianyu He (Fudan University), Robert Tang (Yale University), Yingshui Tan (Alibaba Group), Wangchunshu Zhou (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Zhao-Xiang Zhang (Chinese Academy of Sciences), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)
The authors introduce KORGym, a dynamic evaluation platform designed to comprehensively assess the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). Unlike existing domain-specific benchmarks, KORGym offers over 50 interactive games in textual and visual formats, including multi-turn and reinforcement learning scenarios. Experiments on 19 LLMs and 8 VLMs reveal consistent reasoning patterns within model families and highlight the superior performance of closed-source models. The platform also enables analysis of factors such as modality, reasoning strategies, reinforcement learning approaches, and response length, providing a robust tool for advancing reasoning evaluation in complex environments.
Authors: Zhiqiu Lin (Carnegie Mellon University), Siyuan Cen (University of Massachusetts at Amherst), Daniel Jiang (Carnegie Mellon University), Jay Karhade (Carnegie Mellon University), Hewei Wang (Carnegie Mellon University), Chancharik Mitra (Carnegie Mellon University), Yu Tong Tiffany Ling (Carnegie Mellon University), Yuhan Huang (Carnegie Mellon University), Rushikesh Zawar (Carnegie Mellon University), Xue Bai (Adobe Systems), Yilun Du (Google DeepMind / Harvard), Chuang Gan (IBM), Deva Ramanan (Carnegie Mellon University)
This work presents CameraBench, a large-scale dataset and benchmark for evaluating camera motion understanding, comprising roughly 3,000 diverse videos annotated through a rigorous expert-driven process. A key contribution is a taxonomy of camera motion primitives, developed with cinematographers, which captures motions that require both geometric and semantic understanding. Human studies show that domain expertise and targeted training significantly improve motion recognition, such as distinguishing zoom from forward translation. Evaluations reveal that Structure-from-Motion models struggle with semantic motions, while generative video-language models struggle with geometric ones, and fine-tuning a generative VLM on CameraBench enables strong performance across motion-augmented captioning, video QA, and video-text retrieval tasks.
Authors: Weiwei Sun (Carnegie Mellon University), Haokun Liu (Department of Computer Science, University of Toronto), Nikhil Kandpal (University of Toronto), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Yiming Yang (CMU)
This paper presents AirRep, a scalable representation-based method for training data attribution (TDA) that learns task-specific, model-aligned representations optimized for measuring how training data affects model predictions. AirRep features a trainable encoder for attribution quality and an attention-based pooling mechanism to estimate group-wise influence accurately. Trained using a ranking objective over subsets labeled by their empirical effect, AirRep matches the performance of gradient-based methods like influence functions while being nearly 100× more efficient at inference.
Authors: Vijay Viswanathan (Carnegie Mellon University), Yanchao Sun (University of Maryland, College Park), Xiang Kong (Apple), Meng Cao (Apple), Graham Neubig (Carnegie Mellon University), Sherry Wu (Carnegie Mellon University)
This work introduces Reinforcement Learning from Checklist Feedback (RLCF), a method for improving instruction-following in language models using flexible, instruction-specific criteria rather than fixed metrics like helpfulness or harmfulness. RLCF extracts checklists from instructions and evaluates responses against each item using AI judges and verifier programs to compute rewards for reinforcement learning. Applied to models like Qwen2.5-7B-Instruct, RLCF improves performance across five benchmarks, achieving notable gains in hard satisfaction rates and win rates, and can also enhance other models off-policy, such as Llama 3.1 8B Instruct and OLMo 2 7B Instruct. The authors release their WildChecklists dataset, models, and code to support further research in flexible instruction alignment.
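The reward computation can be sketched generically as a pass rate over extracted checklist items (the `judge` callable below is a hypothetical stand-in for the paper's AI judges and verifier programs):

```python
def checklist_reward(response, checklist, judge):
    """Score a response as the fraction of checklist items it satisfies.

    `judge(response, item)` is a placeholder for an AI judge or verifier
    program that returns True/False for a single checklist item.
    """
    if not checklist:
        return 0.0
    passed = sum(bool(judge(response, item)) for item in checklist)
    return passed / len(checklist)
```

Because the checklist is derived from the instruction itself, the same scoring function adapts to each prompt rather than relying on one fixed notion of quality.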
Authors: Ziyang Cai (Princeton University), Nayoung Lee (University of Wisconsin-Madison), Avi Schwarzschild (Carnegie Mellon University), Samet Oymak (University of Michigan – Ann Arbor), Dimitris Papailiopoulos (University of Wisconsin-Madison)
This paper studies length generalization in transformer language models—the ability to handle longer inputs than seen during training—through the concept of task association. The authors show that training on a longer, related auxiliary task can improve generalization to longer inputs on a target task across algorithmic domains like arithmetic, string manipulation, and maze navigation. They find similar transfer effects in pretrained language models, suggesting pretraining provides reusable computational scaffolding. Mechanistic analysis indicates that this length generalization transfer is linked to the reuse of attention heads between tasks, highlighting how transformers leverage compositional inductive structures.
Authors: Xinyu Yang (CMU), Yuwei An (Carnegie Mellon University), Hongyi Liu (Carnegie Mellon University), Tianqi Chen (Carnegie Mellon University), Beidi Chen (CMU / Amazon)
This work introduces Multiverse, a generative model that enables natively parallel generation by internalizing a MapReduce paradigm with Map, Process, and Reduce stages. The approach includes Multiverse Curator for automated data creation, Multiverse Attention for separating parallel reasoning steps, and Multiverse Engine for dynamic sequential-parallel inference. After minimal fine-tuning, Multiverse-32B matches leading autoregressive LLMs in performance while achieving up to 2× speedup and better scaling efficiency. The authors have open-sourced the full Multiverse ecosystem, including models, data, serving systems, and training pipelines.
Authors: Yujia Zheng (Carnegie Mellon University), Zhuokai Zhao (Meta), Zijian Li (Mohamed bin Zayed University of Artificial Intelligence), Yaqi Xie (CMU), Mingze Gao (Meta Inc.), Lizhu Zhang (Meta), Kun Zhang (CMU & MBZUAI)
This work introduces thought communication, a paradigm for multi-agent interaction that goes beyond natural language by enabling agents to share latent, mind-like representations directly. The authors formalize this process as a latent variable model, proving that both shared and private thoughts, as well as the global structure of thought sharing among agents, can be identified and recovered with theoretical guarantees. They develop a framework that extracts and distributes relevant latent thoughts to agents, enhancing collaboration across modalities. Experiments on synthetic and real-world benchmarks validate the approach, showing that thought communication can unlock collaborative advantages beyond what is possible with surface-level language-based exchanges.
Authors: Eray Can Elumar (Carnegie Mellon University), Cem Tekin (Bilkent University), Osman Yagan (Carnegie Mellon University)
This paper introduces CaMVo, a method for labeling datasets with large language models (LLMs) while keeping costs low. Instead of querying many LLMs for every example, CaMVo adaptively chooses only a few models based on how confident they are likely to be. It uses ideas from contextual bandits (LinUCB) and a Bayesian confidence estimator to decide which models to query and how to weight their votes—without needing any ground-truth labels. Experiments on MMLU and IMDB show that CaMVo matches or beats full majority voting but with far fewer LLM calls, making it a practical approach for efficient large-scale annotation.
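The aggregation step can be illustrated with a minimal confidence-weighted vote (the weights here are placeholders for CaMVo's Bayesian confidence estimates; the LinUCB model-selection step is omitted):

```python
from collections import defaultdict

def weighted_vote(votes):
    """Aggregate (label, weight) pairs from the subset of LLMs queried.

    In CaMVo the weights would come from an estimated per-model confidence;
    here they are just given, to show how a few weighted votes can replace
    a full majority vote over every model.
    """
    scores = defaultdict(float)
    for label, weight in votes:
        scores[label] += weight
    return max(scores, key=scores.get)
```

A high-confidence model can then outvote several low-confidence ones, which is why querying only a few well-chosen models can match full majority voting.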
The authors introduce C-MICL, a framework for learning constraints in optimization problems while guaranteeing that the resulting solutions remain feasible with high probability. Traditional learned constraints can fail due to model error or limited data, but C-MICL uses conformal prediction to add uncertainty-aware adjustments that ensure feasibility at a user-specified confidence level. The method works for both regression- and classification-based constraint learning and avoids the heavy computational overhead of ensemble approaches. Experiments show that C-MICL reliably meets feasibility targets, preserves strong optimization performance, and is significantly more efficient, offering a principled way to blend machine learning with safe decision-making.
The authors present SuffixDecoding, a new speculative decoding method tailored for emerging AI workloads like LLM-based agents, which generate long, repetitive, and predictable sequences. Unlike existing speculative decoding approaches designed for diverse, independent requests, SuffixDecoding uses suffix trees to efficiently cache and reuse long stretches of past tokens from prompts and model outputs. It adaptively adjusts how many tokens to speculate—expanding aggressively when predictions are likely to be accepted and backing off when uncertainty is higher. Experiments on agent-style tasks such as SWE-Bench and Text-to-SQL show that SuffixDecoding can deliver up to 3.9× speedups, making it well suited for fast, iterative agentic inference.
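A toy illustration of the caching idea, using plain token lists rather than a real suffix tree (the tree is what makes this lookup fast in the actual system):

```python
def speculate(history, output, max_spec=8, min_match=2):
    """Propose draft tokens by matching the longest recent suffix of
    `output` inside previously seen `history` (a flat token list).

    A stand-in for SuffixDecoding's suffix tree: a tree makes this
    lookup proportional to the match length instead of this O(n^2) scan.
    """
    for match in range(min(len(output), 32), min_match - 1, -1):
        suffix = output[-match:]
        for i in range(len(history) - match):
            if history[i:i + match] == suffix:
                # Speculate the tokens that followed this suffix last time.
                return history[i + match:i + match + max_spec]
    return []
```

When agent outputs repeat earlier prompt or response fragments, long matches are common, so many draft tokens are accepted per model call.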
Authors: Seohong Park (UC Berkeley), Kevin Frans (UC Berkeley), Deepinder Mann (UC Berkeley), Benjamin Eysenbach (Princeton), Aviral Kumar (Carnegie Mellon University), Sergey Levine (UC Berkeley)
This paper examines why offline reinforcement learning (RL) often fails to scale, even when given massive datasets, large models, and ample compute. The authors find that long decision horizons—the number of steps required to propagate rewards—are a key bottleneck that prevents standard offline RL algorithms from improving with more data. Through extensive experiments, they show that reducing the effective horizon dramatically improves scalability and performance on challenging tasks. Building on this insight, they introduce SHARSA, a simple horizon-reduction method that achieves the strongest scaling behavior and best asymptotic performance across their benchmarks.
Authors: Yuda Song (Carnegie Mellon University), Dhruv Rohatgi (Massachusetts Institute of Technology), Aarti Singh (CMU), J. Bagnell (Carnegie Mellon University)
This paper studies when it’s better to distill privileged expert policies—which have access to latent state information during training—versus directly learning from partial observations in reinforcement learning. Using a simple theoretical model (the perturbed Block MDP) and controlled locomotion experiments, the authors show that the trade-off depends strongly on how stochastic the underlying latent dynamics are. When the latent state is easy to infer, distillation works well, but when it is highly stochastic, imitating the latent optimal policy can actually hurt performance. The results provide practical guidance: the best latent policy isn’t always the best one to distill, and deciding when to distill versus directly learning depends on the underlying uncertainty structure of the task.
Authors: Alexander Goldberg (School of Computer Science, Carnegie Mellon University), Giulia Fanti (CMU), Nihar Shah (CMU)
MERIT is a principled framework for using randomized selection in settings like peer review or grant funding, where evaluations are noisy and uncertainty can make deterministic rankings unreliable. Instead of relying on ad-hoc randomization, MERIT uses interval estimates (e.g., confidence intervals) to model uncertainty and then optimizes for the worst-case expected number of true top-k items selected. The authors develop a polynomial-time algorithm that scales to large datasets and show that MERIT satisfies desirable fairness and robustness properties that existing methods lack. Experiments on synthetic peer-review data show that MERIT matches prior probabilistic methods in expected performance while providing stronger guarantees in worst-case scenarios.
Authors: Thomas Kuntz (EPFL), Agatha Duzan (EPFL), Hao Zhao (EPFL), Francesco Croce (University of Tübingen), Zico Kolter (Carnegie Mellon University), Nicolas Flammarion (EPFL), Maksym Andriushchenko (ELLIS Institute Tübingen and MPI-IS)
OS-Harm is a benchmark for evaluating the safety of LLM-based computer use agents that interact directly with operating system interfaces. OS-Harm tests agents across three harm categories—deliberate misuse, prompt injection attacks, and model misbehavior—using 150 tasks spanning applications like email, browsers, and code editors. An automated judge evaluates both task performance and safety, achieving strong agreement with human annotations. Evaluations of leading agents reveal that models often comply with unsafe commands, are vulnerable to prompt injections, and sometimes take unsafe actions, highlighting the need for robust safety measures in these systems.
Authors: Pengrun Huang (University of California, San Diego), Chhavi Yadav (CMU), Kamalika Chaudhuri (FAIR, Meta and UCSD), Ruihan Wu (University of California, San Diego)
PropInfer is a benchmark designed to evaluate whether large language models (LLMs) can leak sensitive properties of the datasets used for fine-tuning, particularly in domains like healthcare. It tests property inference under both question-answering and chat-completion setups. Two tailored attacks—a prompt-based generation attack and a shadow-model attack leveraging word frequency—are proposed to extract dataset-level information. Empirical results show that these attacks can succeed across multiple pretrained LLMs, revealing an important and previously underexplored privacy risk.
Authors: Hyeong Kyu Choi (University of Wisconsin-Madison, Computer Sciences), Jerry Zhu (Carnegie Mellon University), Sharon Li (University of Wisconsin-Madison)
Multi-Agent Debate (MAD) improves large language model performance by having multiple agents reason collaboratively, but its key drivers were unclear. By separating Majority Voting from inter-agent debate, experiments across seven NLP benchmarks show that most gains come from majority voting rather than the debate itself. A theoretical analysis models debate as a stochastic process, revealing that debate alone doesn’t improve expected correctness, though targeted interventions that bias belief updates can enhance its impact. These results suggest that while MAD has potential, simple ensembling methods often remain a more reliable and effective approach.
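The ensembling component that the experiments isolate as the main driver of MAD's gains is, in its simplest form, just majority voting over independent samples:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among independent model samples.

    This is the baseline the paper separates from inter-agent debate:
    sampling several answers and keeping the plurality winner.
    """
    return Counter(answers).most_common(1)[0][0]
```

If each sample is correct with probability above chance and errors are not perfectly correlated, the plurality answer is more reliable than any single sample, with no debate rounds required.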
Authors: Ioannis Anagnostides (Carnegie Mellon University), Ioannis Panageas (UC Irvine), Tuomas Sandholm (CMU, Strategy Robot, Optimized Markets, Strategic Machine), Jingming Yan (University of California, Irvine)
The study analyzes the complexity of computing equilibria in team-based zero-sum games and symmetric min-max optimization. It shows that finding epsilon-Nash equilibria in 3-player adversarial team games (2 vs. 1) is CLS-complete, resolving an open question about such games. Additionally, computing symmetric equilibria in symmetric min-max problems is PPAD-complete, even for quadratic objectives, and this extends to 6-player team games (3 vs. 3), implying that common symmetric dynamics cannot reliably converge. Finally, computing non-symmetric equilibria with polynomial precision is FNP-hard, highlighting the fundamental difficulty of equilibrium computation in these settings.
Authors: Emile Anand (Georgia Institute of Technology and Cognition Labs), Ishani Karmarkar (Stanford University), Guannan Qu (Carnegie Mellon University)
Scaling multi-agent reinforcement learning (MARL) is difficult due to the exponential growth of joint state and action spaces as agents increase. SUBSAMPLE-MFQ introduces a method that combines subsampling agents with mean-field Q-learning and a decentralized randomized policy, allowing efficient learning for any subset of k agents. The algorithm’s runtime scales polynomially in k, not the total number of agents n, making it practical for large systems. Theoretical guarantees show that the learned policy converges to the optimal policy at a rate of roughly O(1/√k), independent of the total agent count.
Authors: Zheng He (University of British Columbia), Roman Pogodin (Google), Yazhe Li (Microsoft), Namrata Deka (Carnegie Mellon University), Arthur Gretton (Google Deepmind / UCL), Danica J. Sutherland (University of British Columbia + Amii)
Conditional independence (CI) tests are central to tasks like causal discovery and fairness evaluation, but they often fail in practice despite theoretical guarantees. Focusing on the Kernel-based Conditional Independence (KCI) test, the work shows that many recent CI tests are special cases of a Generalized Covariance Measure. Practical performance is largely driven by errors in estimating the conditional mean, which affect Type I error, and by the choice of conditioning kernel, which influences test power but can also inflate false positives. These insights clarify why popular CI tests often underperform and highlight how careful kernel and estimation choices are crucial for reliable results.
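The Generalized Covariance Measure mentioned above reduces, in its simplest scalar form, to a normalized sum of residual products (a sketch, with the conditional-mean estimates passed in as precomputed arrays):

```python
import numpy as np

def gcm_statistic(x, y, x_hat, y_hat):
    """Generalized Covariance Measure test statistic (scalar case).

    `x_hat` and `y_hat` are estimates of E[X|Z] and E[Y|Z] from any
    regression method. Under conditional independence the statistic is
    approximately N(0, 1), which is why errors in these conditional-mean
    estimates directly inflate Type I error.
    """
    r = (x - x_hat) * (y - y_hat)          # products of residuals
    n = len(r)
    return np.sqrt(n) * r.mean() / r.std(ddof=0)
```

Comparing |statistic| against a standard-normal quantile gives the test; the kernel choices discussed in the paper enter through how the conditional means (and richer feature maps) are estimated.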
Authors: Xiangcheng Zhang (Tsinghua), Yige Hong (Carnegie Mellon University), Weina Wang (Computer Science Department, Carnegie Mellon University)
Heterogeneity creates major challenges in large-scale decision-making, especially in weakly-coupled Markov decision processes (WCMDPs) where each subproblem has distinct dynamics. In the fully heterogeneous setting, the authors show that an efficiently computable policy can achieve an O(1/√N) optimality gap in long-run average reward per subproblem as the number of subproblems N grows. This work provides the first asymptotic optimality guarantee for fully heterogeneous average-reward WCMDPs. Key to this result is a novel use of projection-based Lyapunov functions that ensure convergence of rewards and costs even under complete heterogeneity.
Authors: Hyungjoo Chae (Georgia Institute of Technology), Seonghwan Kim (Yonsei University), Junhee Cho (Yonsei University), Seungone Kim (Carnegie Mellon University), Seungjun Moon (Yonsei University), Gyeom Hwangbo (University of Seoul), Dongha Lim (Korea Advanced Institute of Science & Technology), Minjin Kim (Yonsei University), Yeonjun Hwang (Yonsei University), Minju Gwak (Yonsei University), Dongwook Choi (Chung-Ang University), Minseok Kang (Yonsei University), Gwanhoon Im (Yonsei University), ByeongUng Cho (Yonsei University), Hyojun Kim (Yonsei University), Jun Han (Yonsei University), Taeyoon Kwon (Yonsei University), Minju Kim (Yonsei University), Beong-woo Kwak (Yonsei University), Dongjin Kang (Yonsei University), Jinyoung Yeo (Yonsei University)
Web navigation poses a long-horizon sequential decision-making challenge that goes beyond typical multimodal LLM tasks, but step-level reward models have been lacking. Web-Shepherd, the first process reward model (PRM) for web navigation, evaluates trajectories at each step, enabling both training and test-time assessment. The approach is supported by the WebPRM Collection, a 40K step-level dataset with annotated preference pairs, and WebRewardBench, a benchmark for evaluating PRMs. Experiments show Web-Shepherd outperforms GPT-4o by ~30 points on WebRewardBench and improves policy performance on WebArena-lite by 10.9 points while reducing verification cost by 10×, demonstrating a practical and efficient solution for web navigation tasks.
Mixed-motive multi-agent reinforcement learning requires balancing individual incentives with collective goals, which are often in conflict. The proposed adaptive conflict-aware gradient adjustment method dynamically balances policy gradients from individual and collective objectives, promoting cooperation while preserving fairness in task-specific rewards. Theoretical analysis guarantees monotonic improvement in both collective and individual outcomes, ensuring fairness across agents. Experiments in sequential social dilemma environments show that this approach outperforms baselines in social welfare while maintaining equitable outcomes for all agents.
Authors: Maonan Wang (The Chinese University of Hong Kong), Yirong Chen (Shanghai Artificial Intelligence Laboratory), Aoyu Pang (The Chinese University of Hong Kong), Yuxin Cai (Carnegie Mellon University), Chung Shue Chen (Nokia Bell Labs), Yuheng Kan (Embodied AI Research Center, Fourier), Man On Pun (The Chinese University of Hong Kong, Shenzhen)
Authors: Gongxu Luo (Mohamed bin Zayed University of Artificial Intelligence), Haoyue Dai (Carnegie Mellon University), Longkang Li (MBZUAI), Chengqian Gao (Mohamed bin Zayed University of Artificial Intelligence), Boyang Sun (Mohamed bin Zayed University of Artificial Intelligence), Kun Zhang (CMU & MBZUAI)
Authors: Xianzhe Fan (The University of Hong Kong), Xuhui Zhou (Carnegie Mellon University), Chuanyang Jin (New York University), Kolby Nottingham (AI Dungeon / Voyage), Hao Zhu (Stanford University), Maarten Sap (Carnegie Mellon University)
Authors: Haoyang Fang (AWS), Boran Han (AWS), Nick Erickson (Amazon Web Services), Xiyuan Zhang (AWS AI), Su Zhou (Carnegie Mellon University), Anirudh Dagar (AWS), Jiani Zhang (Google), Caner Turkmen (Amazon Web Services), Tony Hu (AWS AI), Huzefa Rangwala (George Mason University), Ying Nian Wu (University of California, Los Angeles), Yuyang (Bernie) Wang (AWS AI), George Karypis (University of Minnesota, Minneapolis)
Authors: Tonghe Zhang (Carnegie Mellon University), Chao Yu (Tsinghua University), Sichang Su (The University of Texas at Austin), Yu Wang (Tsinghua University)
Authors: Muquan Yu (Chinese University of Hong Kong), Mu Nan (University of Hong Kong), Hossein Adeli (Columbia University), Jacob Prince (Harvard University), John A. Pyles (University of Washington), Leila Wehbe (Carnegie Mellon University), Maggie Henderson (Carnegie Mellon University), Michael Tarr (Carnegie Mellon University), Andrew Luo (University of Hong Kong)
Authors: Jonathan Zheng (Georgia Institute of Technology), Alan Ritter (Georgia Institute of Technology), Sauvik Das (Carnegie Mellon University), Wei “Coco” Xu (Georgia Institute of Technology)
Authors: Shizheng Wen (ETH Zurich), Arsh Kumbhat (None), Levi Lingsch (ETH Zurich), Sepehr Mousavi (ETH Zurich), Yizhou Zhao (Carnegie Mellon University), Praveen Chandrashekar (Tata Institute of Fundamental Research), Siddhartha Mishra (Swiss Federal Institute of Technology)
Authors: Jifan Zhang (Northwestern University), Fangxin Wang (University of Illinois at Chicago), Zihe Song (University of Illinois at Chicago), Philip S Yu (UIC), Kaize Ding (Northwestern University), Shixiang Zhu (Carnegie Mellon University)
Authors: Yue Huang (University of Notre Dame), Zhengzhe Jiang (Sichuan University), Xiaonan Luo (University of Notre Dame), Kehan Guo (University of Notre Dame), Haomin Zhuang (University of Notre Dame), Yujun Zhou (University of Notre Dame), Zhengqing Yuan (University of Notre Dame), Xiaoqi Sun (Massachusetts Institute of Technology), Jules Schleinitz (California Institute of Technology), Yanbo Wang (Mohamed bin Zayed University of Artificial Intelligence), Shuhao Zhang (Carnegie Mellon University), Mihir Surve (University of Notre Dame), Nitesh Chawla (University of Notre Dame), Olaf Wiest (University of Notre Dame), Xiangliang Zhang (University of Notre Dame)
Authors: Yang Xiao (Hong Kong Polytechnic University), Jiashuo Wang (Hong Kong Polytechnic University), Ruifeng Yuan (Hong Kong Polytechnic University), Chunpu Xu (Hong Kong Polytechnic University), Kaishuai Xu (Hong Kong Polytechnic University), Wenjie Li (The Hong Kong Polytechnic University), Pengfei Liu (Carnegie Mellon University)
Authors: Jiaqi Wei (Zhejiang University), Hao Zhou (South China University of Technology), Xiang Zhang (University of British Columbia), Di Zhang (Shanghai Artificial Intelligence Laboratory), Zijie Qiu (Fudan University), Noah Wei (Carnegie Mellon University), Jinzhe Li (Fudan University), Wanli Ouyang (Shanghai AI Lab), Siqi Sun (Fudan University)
Authors: Ziyang Ma (Shanghai Jiao Tong University), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Yanqiao Zhu (Shanghai Jiaotong University), Chen Yang (Shanghai Jiaotong University), Yi-Wen Chao (Nanyang Technological University), Ruiyang Xu (Shanghai Jiaotong University), Wenxi Chen (Shanghai Jiaotong University), Yuanzhe Chen (ByteDance Inc.), Zhuo Chen (ByteDance Inc.), Jian Cong (ByteDance Inc.), Kai Li (Tsinghua University), Keliang Li (Chinese Academy of Sciences), Siyou Li (Queen Mary University of London), Xinfeng Li (Nanyang Technological University), Xiquan Li (Shanghai Jiaotong University), Zheng Lian (Institute of Automation, Chinese Academy of Sciences), Yuzhe Liang (Shanghai Jiaotong University), Minghao Liu (2077AI), Zhikang Niu (Shanghai Jiaotong University), Tianrui Wang (Tianjin University), Wang Yuping (University of Science and Technology of China), Yuxuan Wang (ByteDance), Yihao Wu (Nanyang Technological University), Guanrou Yang (Shanghai Jiaotong University), Jianwei Yu (Microsoft), Ruibin Yuan (Carnegie Mellon University), Zhisheng Zheng (University of Texas at Austin), Ziya Zhou (Hong Kong University of Science and Technology), Haina Zhu (Shanghai Jiaotong University), Wei Xue (Hong Kong University of Science and Technology), Emmanouil Benetos (Queen Mary University of London), Kai Yu (Shanghai Jiao Tong University), Eng-Siong Chng (Nanyang Technological University), Xie Chen (Shanghai Jiaotong University)
Authors: Joel Ye (Carnegie Mellon University), Fabio Rizzoglio (Northwestern University), Xuan Ma (Northwestern University), Adam Smoulder (Carnegie Mellon University), Hongwei Mao (University of Pittsburgh), Gary Blumenthal (University of Pittsburgh), William Hockeimer (University of Pittsburgh), Nicolas Kunigk (University of Pittsburgh), Dalton Moore (University of Chicago), Patrick Marino (Phantom Neuro), Raeed Chowdhury (None), J. Patrick Mayo (University of Pittsburgh), Aaron Batista (University of Pittsburgh), Steven Chase (None), Michael Boninger (University of Pittsburgh), Charles Greenspon (University of Chicago), Andrew B Schwartz (University of Pittsburgh), Nicholas Hatsopoulos (University of Chicago), Lee Miller (Northwestern University at Chicago), Kristofer Bouchard (Lawrence Berkeley National Laboratory), Jennifer Collinger (University of Pittsburgh), Leila Wehbe (Carnegie Mellon University), Robert Gaunt (University of Pittsburgh)
Authors: Chandler Smith (Oxford University), Marwa Abdulhai (University of California, Berkeley), Manfred Díaz (Mila, Quebec), Marko Tesic (University of Cambridge), Rakshit Trivedi (Massachusetts Institute of Technology), Sasha Vezhnevets (DeepMind), Lewis Hammond (University of Oxford / Cooperative AI Foundation), Jesse Clifton (Center on Long-Term Risk), Minsuk Chang (Google Deepmind), Edgar Duenez-Guzman (Google DeepMind), John Agapiou (Google DeepMind), Jayd Matyas (DeepMind), Danny Karmon (Google DeepMind), Beining Zhang (University of Southampton ), Jim Dilkes (University of Southampton), Akash Kundu (Heritage Institute of Technology), Hieu Minh Nguyen (Apart Research), Emanuel Tewolde (Carnegie Mellon University), Jebish Purbey (Tribhuvan University), Ram Mohan Rao Kadiyala (), Siddhant Gupta (Indian Institute of Technology, Roorkee), Aliaksei Korshuk (Coframe), Buyantuev Alexander (Higher School of Economics), Ilya Makarov (AIRI & ISP RAS), Gang Zhao (Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University), Rolando Fernandez (University of Texas at Austin), Zhihan Wang (University of Texas at Austin), Caroline Wang (The University of Texas at Austin | Google DeepMind), Jiaxun Cui (Meta), Lingyun Xiao (University of Texas at Austin), Di Shi (University of Texas at Austin), Yoonchang Sung (Nanyang Technological University), Muhammad Arrasy Rahman (The University of Texas at Austin), Peter Stone (The University of Texas at Austin, Sony AI), Yipeng Kang (National Key Laboratory of General Artificial Intelligence), Hyeonggeun Yun (Companoid Labs), Ananya Ananya (Stanford University), Taehun Cha (Korea University), Zhiqiang Wu (Tongji University), Elizaveta Tennant (University College London), Olivia Macmillan-Scott (UCL), Marta Segura (University College London, University of London), Diana Riazi (Department of Computer Science, University College London, University of London), Fuyang Cui (University of Toronto), Sriram Ganapathi 
(University of Waterloo), Toryn Klassen (University of Toronto), Nico Schiavone (University of Toronto), Mogtaba Alim (University of Toronto), Sheila McIlraith (University of Toronto and Vector Institute), Manuel Rios (Universidad de los Andes), Oswaldo Peña (Universidad Nacional de Colombia), Carlos Rojas (Grupo Bancolombia), Manuela Chacon-Chamorro (Universidad de los Andes), Rubén Manrique (Universidad de Los Andes), Luis Felipe Giraldo (Universidad de Los Andes), Nicanor Quijano (Universidad de Los Andes), Yiding Wang (Peking University), Yuxuan Chen (University of Hong Kong), Fangwei Zhong (Beijing Normal University), Mengmeng Wang (State Key Laboratory of General Artificial Intelligence), Wenming Tu (Shanghai Jiaotong University), Zhaowei Zhang (Peking University), Ziang Chen (Tsinghua University), Zixia Jia (BigAI), Xue Feng (BIGAI), Zilong Zheng (Beijing Institute for General Artificial Intelligence), Chichen Lin (), Weijian Fan (Communication University of China), Chenao Liu (Communication University of China), Sneheel Sarangi (New York University Abu Dhabi), Ziyan Wang (King’s College London; Microsoft Research), Shuqing Shi (King’s College London), Yali Du (King’s College London), Avinaash Anand Kulandaivel (None), Yang Liu (BIGAI), Wu Ruiyang (Communication University of China), Chetan Talele (None), 陆孙嘉 (Communication University of China), Gema Parreno (–), Shamika Dhuri (Carnegie Mellon University), Bain McHale (Carnegie Mellon University), Tim Baarslag (Centrum Wiskunde & Informatica / Eindhoven University of Technology), Dylan Hadfield-Menell (MIT), Natasha Jaques (University of Washington, Google DeepMind), José Hernández-Orallo (Universitat Politècnica de València), Joel Leibo (DeepMind)
Authors: Chun Wang (Zhejiang University), Xiaojun Ye (Zhejiang University), Xiaoran Pan (Zhejiang University), Zihao Pan (None), Haofan Wang (Carnegie Mellon University), Yiren Song (National University of Singapore)
Authors: Runsong Zhu (The Chinese University of Hong Kong), Ka-Hei Hui (Autodesk), Zhengzhe Liu (Carnegie Mellon University), Qianyi Wu (Monash University), Weiliang Tang (The Chinese University of Hong Kong), Shi Qiu (The Chinese University of Hong Kong), Pheng-Ann Heng (The Chinese University of Hong Kong), Chi-Wing Fu (The Chinese University of Hong Kong)
Authors: Matvei Popov (Trinity University), Peter Robicheaux (Roboflow), Anish Madan (Carnegie Mellon University), Isaac Robinson (Roboflow), Joseph Nelson (Roboflow), Deva Ramanan (Carnegie Mellon University), Neehar Peri (Carnegie Mellon University)
Authors: Philip Schroeder (Massachusetts Institute of Technology), Ondrej Biza (Robotics and AI Institute), Thomas Weng (Carnegie Mellon University), Hongyin Luo (Massachusetts Institute of Technology), Jim Glass (Massachusetts Institute of Technology)
Authors: Hua Ye (Nanjing University), Hang Ding (Shanghai Jiao Tong University), Siyuan Chen (University of Bristol), Yiyang Jiang (Hong Kong Polytechnic University), Changyuan Zhang (University of Hong Kong), Xuan Zhang (Carnegie Mellon University)
Authors: Yunlong Deng (Mohamed bin Zayed University of Artificial Intelligence), Guangyi Chen (MBZUAI&CMU), Tianpei Gu (ByteDance Inc.), Lingjing Kong (Carnegie Mellon University), Yan Li (Mohamed bin Zayed University of Artificial Intelligence), Zeyu Tang (Stanford University), Kun Zhang (CMU & MBZUAI)
Authors: Yizhi Li (The University of Manchester), Ge Zhang (University of Michigan – Ann Arbor), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Ruibin Yuan (Carnegie Mellon University), Zhu (Guangdong OPPO Mobile Telecommunications Corp., Ltd.), Hangyu Guo (Alibaba Group), Yiming Liang (University of the Chinese Academy of Sciences), Jiaheng Liu (Nanjing University), Noah Wang (), Jian Yang (Alibaba Group), Siwei Wu (Nanjing University of Science and Technology), Xingwei Qu (University of Manchester), Jinjie Shi (Queen Mary, University of London), Xinyue Zhang (National University of Singapore), Zhenzhu Yang (China University of Geoscience Beijing), Yidan Wen (Northwest Polytechnical University Xi’an), Yanghai Wang (Nanjing University), Shihao Li (Nanjing University), Zhao-Xiang Zhang (Chinese Academy of Sciences, China), Ruibo Liu (Google DeepMind), Emmanouil Benetos (Queen Mary University of London), Wenhao Huang (Key Laboratory of Machine Perception), Chenghua Lin (University of Manchester)
Authors: Yunlong Tang (University of Rochester), Pinxin Liu (University of Rochester), Mingqian Feng (University of Rochester), Zhangyun Tan (University of Rochester), Rui Mao (University of Rochester), Chao Huang (Department of Computer Science, University of Rochester), Jing Bi (University of Rochester), Yunzhong Xiao (Carnegie Mellon University), Susan Liang (University of Rochester), Hang Hua (University of Rochester), Ali Vosoughi (University of Rochester), Luchuan Song (University of Rochester), Zeliang Zhang (University of Rochester), Chenliang Xu (University of Rochester)
Authors: Thanh-Dat Truong (University of Arkansas), Huu-Thien Tran (University of Arkansas), Tran Son (Ho Chi Minh City University of Science, Vietnam National University), Bhiksha Raj (Carnegie Mellon University), Khoa Luu (University of Arkansas)
Authors: Jinpei Guo (Shanghai Jiaotong University), Yifei Ji (Shanghai Jiaotong University), Zheng Chen (Shanghai Jiao Tong University), Kai Liu (Shanghai Jiaotong University), Min Liu (Skild AI), Wang Rao (Carnegie Mellon University), Wenbo Li (JD Joy Future Academy), Yong Guo (Max Planck Institute for Informatics), Yulun Zhang (Shanghai Jiao Tong University)
Authors: Tianchen Zhao (Amazon), Xuanbai Chen (Carnegie Mellon University), Zhihua Li (Amazon), Jun Fang (Amazon AGI), Dongsheng An (State University of New York, Stony Brook), Xiang Xu (Amazon), Zhuowen Tu (University of California, San Diego), Yifan Xing (Amazon)
Authors: Yuante Li (Carnegie Mellon University), Xu Yang (Microsoft), Xiao Yang (Research, Microsoft), Xisen Wang (University of Oxford), Weiqing Liu (Microsoft), Jiang Bian (Microsoft Research)
Authors: Nikhil Kandpal (Department of Computer Science), Brian Lester (Google DeepMind/University of Toronto), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Sebastian Majstorovic (EleutherAI), Stella Biderman (The Eleutherai Institute), Baber Abbasi (EleutherAI), Luca Soldaini (Allen Institute for AI), Enrico Shippole (Teraflop AI), A. Feder Cooper (Stanford University), Aviya Skowron (EleutherAI), Shayne Longpre (Massachusetts Institute of Technology), Lintang Sutawika (Carnegie Mellon University), Alon Albalak (Lila Sciences), Zhenlin Xu (Boson AI), Guilherme Penedo (HuggingFace), Loubna Ben allal (Hugging Face), Elie Bakouch (Hugging Face), John Pressman (EleutherAI Institute), Honglu Fan (Google DeepMind), Dashiell Stander (EleutherAI), Guangyu Song (EleutherAI), Aaron Gokaslan (MBZUAI Institute of Foundation Models), John Kirchenbauer (University of Maryland, College Park), Tom Goldstein (University of Maryland), Brian Bartoldson (Lawrence Livermore National Laboratory), Bhavya Kailkhura (Lawrence Livermore National Laboratory), Tyler Murray (Allen Institute for Artificial Intelligence)
Authors: Kiljae Lee (The Ohio State University), Ziqi Liu (Carnegie Mellon University), Weijing Tang (Carnegie Mellon University), Yuan Zhang (Ohio State University, Columbus)
Authors: Yiqun Chen (Renmin University of China), Lingyong Yan (Baidu Online Network Technology (Beijing) Co., Ltd.), Weiwei Sun (Carnegie Mellon University), Xinyu Ma (Baidu), Yi Zhang (ByteDance Inc.), Shuaiqiang Wang (Baidu Inc.), Dawei Yin (Baidu), Yiming Yang (CMU), Jiaxin Mao (Renmin University of China)
Authors: Harsha Vardhan Simhadri (Microsoft), Martin Aumüller (IT University of Copenhagen), Matthijs Douze (Facebook AI Research), Dmitry Baranchuk (Yandex), Amir Ingber (Pinecone), Edo Liberty (Yale University), George Williams (Ansible AI), Ben Landrum (Cornell University), Magdalen Manohar (Carnegie Mellon University), Mazin Karjikar (University of Maryland, College Park), Laxman Dhulipala (UMD), Meng Chen (Fudan University), Yue Chen (Fudan University), Rui Ma (Fudan University), Kai Zhang (Fudan University), Yuzheng Cai (Fudan University), Jiayang Shi (Fudan University), Weiguo Zheng (Fudan University), Yizhuo Chen (Fudan University), Jie Yin (Tencent), Ben Huang (Baidu)
Authors: Dongkeun Yoon (KAIST), Seungone Kim (Carnegie Mellon University), Sohee Yang (University College London, University of London), Sunkyoung Kim (LG AI Research), Soyeon Kim (LG Corporation), Yongil Kim (LG Corporation), Eunbi Choi (LG AI Research), Yireun Kim (LG AI Research), Minjoon Seo (KAIST)
Authors: Hua Ye (Nanjing University), Siyuan Chen (University of Bristol), Haoliang Zhang (The University of Oklahoma), Weihao Luo (Donghua University, Shanghai), Yanbin Li (The University of Tokyo), Xuan Zhang (Carnegie Mellon University)
Authors: Xiangchen Song (Carnegie Mellon University), Jiaqi Sun (Carnegie Mellon University), Zijian Li (Mohamed bin Zayed University of Artificial Intelligence), Yujia Zheng (Carnegie Mellon University), Kun Zhang (CMU & MBZUAI)
Authors: Harsh Poonia (Carnegie Mellon University), Felix Divo (Technische Universität Darmstadt), Kristian Kersting (TU Darmstadt), Devendra Singh Dhami (Eindhoven University of Technology)
Authors: Qingyun Chen (University of California, Santa Cruz), Sungjin Im (University of California, Santa Cruz), Ben Moseley (Carnegie Mellon University), Ryan Milstrey (University of California, Merced), Chenyang Xu (Zhejiang University), Ruilong Zhang (Technische Universität München)
Authors: Huiyi Wang (McGill University), Chun Kwang Tan (Northeastern University), Balint Hodossy (Imperial College London), Shirui Lyu (King’s College London, University of London), Pierre Schumacher (Max Planck Institute for Intelligent Systems), James Heald (University College London, University of London), Kai Biegun (University College London, University of London), Samo Hromadka (Gatsby Computational Neuroscience Unit), Maneesh Sahani (Gatsby Unit, UCL), Gunwoo Park (KAIST), Beomsoo Shin (KAIST), JongHyeon Park (None), Seungbum Koo (KAIST), Chenhui Zuo (Tsinghua University), Chengtian Ma (Tsinghua University), Yanan Sui (Tsinghua University), Nick Hansen (UC San Diego), Stone Tao (University of California – San Diego), Yuan Gao (Carnegie Mellon University), Hao Su (UCSD), Seungmoon Song (Stanford University), Letizia Gionfrida (King’s College London, University of London), Massimo Sartori (University of Twente), Guillaume Durandau (McGill University), Vikash Kumar (CMU / MyoLab), Vittorio Caggiano (MyoSuite)
Authors: Benjamin Li (Carnegie Mellon University), Shuyang Shi (School of Computer Science, Carnegie Mellon University), Lucia Romero (University of Pittsburgh), Huao Li (Massachusetts Institute of Technology), Yaqi Xie (CMU), Woojun Kim (Carnegie Mellon University), Stefanos Nikolaidis (University of Southern California), Charles Lewis (University of Pittsburgh), Katia Sycara (Carnegie Mellon University), Simon Stepputtis (Virginia Polytechnic Institute and State University)
Authors: Junhong Shen (Carnegie Mellon University), Hao Bai (University of Illinois at Urbana-Champaign), Lunjun Zhang (University of Toronto), Yifei Zhou (University of California, Berkeley), Amrith Setlur (Carnegie Mellon University), Peter Tong (New York University), Diego Caples (AGI, Inc.), Nan Jiang (University of Illinois at Urbana-Champaign), Tong Zhang (UIUC), Ameet Talwalkar (CMU, Datadog), Aviral Kumar (Carnegie Mellon University)
Authors: Gautam Kamath (University of Waterloo), Alireza F. Pour (University of Waterloo), Matthew Regehr (University of Waterloo), David Woodruff (Carnegie Mellon University)
Authors: Ming Liu (Iowa State University), Hao Chen (Carnegie Mellon University), Jindong Wang (William & Mary), Liwen Wang (Iowa State University), Bhiksha Raj (Carnegie Mellon University), Wensheng Zhang (Iowa State University)
Authors: Hongyi Jin (UCLA Computer Science Department, University of California, Los Angeles), Zijun Ding (Carnegie Mellon University), Dung Daniel Ngo (J.P. Morgan Chase), Steven Wu (Carnegie Mellon University)
Authors: Pingbang Hu (University of Illinois Urbana-Champaign), Joseph Melkonian (Washington University, Saint Louis), Weijing Tang (Carnegie Mellon University), Han Zhao (University of Illinois, Urbana Champaign), Jiaqi Ma (University of Illinois Urbana-Champaign)
Authors: Alex Clinton (University of Wisconsin – Madison), Thomas Zeng (University of Wisconsin – Madison), Yiding Chen (Cornell University), Jerry Zhu (Carnegie Mellon University), Kirthevasan Kandasamy (University of Wisconsin – Madison)
Authors: Maria-Florina Balcan (Carnegie Mellon University), Avrim Blum (Toyota Technological Institute at Chicago), Zhiyuan Li (Toyota Technological Institute at Chicago), Dravyansh Sharma (Toyota Technological Institute at Chicago)
Authors: Tong Yang (Carnegie Mellon University), Yu Huang (University of Pennsylvania), Yingbin Liang (The Ohio State University), Yuejie Chi (Yale University)
Authors: Xeron Du (01.AI), Yifan Yao (Beijing University of Posts and Telecommunications), Kaijing Ma (Tongji University), Bingli Wang (Sichuan Agricultural University), Tianyu Zheng (Beijing University of Posts and Telecommunications), Zhu (Guangdong OPPO Mobile Telecommunications Corp.,Ltd.), Minghao Liu (2077AI), Yiming Liang (University of the Chinese Academy of Sciences), Xiaolong Jin (Purdue University), Zhenlin Wei (Harbin Engineering University), Chujie Zheng (Tsinghua University), Kaixin Deng (Hokkaido University), Shuyue Guo (Beijing University of Posts and Telecommunications), Shian Jia (Zhejiang University), Sichao Jiang (zhejiang university), Yiyan Liao (Peking University), Rui Li (Peking University), Qinrui Li (Cornell University), Sirun Li (Peking University), Yizhi Li (The University of Manchester), Yunwen Li (Chinese University of Hong Kong(shenzhen)), Dehua Ma (Beijing University of Posts and Telecommunications), Yuansheng Ni (University of Waterloo), Haoran Que (Beijing University of Aeronautics and Astronautics), Qiyao Wang (henzhen Institute of Advanced Technology, Chinese Academy of Sciences), Zhoufutu Wen (ByteDance Inc.), Siwei Wu (Nanjing University of Science and Technology), Tianshun Xing (Beijing University of Posts and Telecommunications), 明 许 (01.AI), Zhenzhu Yang (China University of Geoscience Beijing), Noah Wang (), Junting Zhou (Peking University), yuelin bai (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences), Xingyuan Bu (Alibaba Group), chenglin cai (Huawei Technologies Ltd.), Liang Chen (Peking University), Yifan Chen (ByteDance Inc.), Cheng Chengtuo (Zhejiang University), Tianhao Cheng (Fudan University), Keyi Ding (2077AI), Siming Huang (University of Melbourne), HUANG YUN (national university of singaore, National University of Singapore), Yaoru Li (Zhejiang University), Yizhe Li (Zhejiang University), Zhaoqun Li (Zhejiang University), Tianhao Liang (Zhejiang University), 
Chengdong Lin (Hangzhou Dianzi University), Hongquan Lin (University of Science and Technology of China), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Zhongyuan Peng (Fudan University), Zifan Peng (The Hong Kong University of Science and Technology (Guangzhou)), Qige Qi (ByteDance Inc.), Shi Qiu (Peking University), Xingwei Qu (University of Manchester), Shanghaoran Quan (Alibaba Group), Yizhou Tan (Harvard University), Zili Wang (stepfun), 王晨清 (abaka), Hao Wang (Beijing University of Aeronautics and Astronautics), Yiya Wang (Peking University), Yubo Wang (University of Waterloo), Jiajun Xu (Facebook), Kexin Yang (Alibaba Group), Ruibin Yuan (Carnegie Mellon University), Yuanhao Yue (Fudan University), Tianyang Zhan (ByteDance Inc.), Chun Zhang (ByteDance Inc.), Jinyang Zhang (Peking University), Xiyue Zhang (Peking University), Owen Zhang (Department of Computer Science, Princeton University), Yue Zhang (Suzhou University), Yongchi Zhao (Alibaba Group), Xiangyu Zheng (Fudan University), ChenghuaZhong (University of Science and Technology Beijing), Yang Gao (Nanjing University), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Dayiheng Liu (Alibaba Group), Qian Liu (TikTok (Singapore)), Tianyu Liu (Alibaba), Shiwen Ni (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences), Junran Peng (Institute of automation, Chinese academy of science), Yujia Qin (Bytedance), Wenbo Su (Alibaba Group), Guoyin Wang (Alibaba Qwen Pilot), Shi Wang (Institute of Computing Science, Chinese Academy of Sciences), Jian Yang (Alibaba Group), Min Yang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences), Meng Cao (Mohamed bin Zayed University of Artificial Intelligence), Xiang Yue (Carnegie Mellon University), ZHAO-XIANG ZHANG (Chinese Academy of Sciences, China), Wangchunshu Zhou (Guangdong OPPO Mobile Telecommunications Corp.,Ltd.), Jiaheng Liu (Nanjing University), Qunshu Lin 
(Abaka AI), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)
Authors: Mucong Ding (Department of Computer Science, University of Maryland, College Park), Bang An (University of Maryland, College Park), Tahseen Rabbani (University of Chicago), Chenghao Deng (University of Maryland), Anirudh Satheesh (University of Maryland, College Park), Souradip Chakraborty (University of Maryland, College Park), Mehrdad Saberi (Department of Computer Science, University of Maryland, College Park), Yuxin Wen (University of Maryland), Kyle Sang (University of Maryland), Aakriti Agrawal (University of Maryland, College Park), Xuandong Zhao (UC Berkeley), Mo Zhou (Johns Hopkins University), Mary-Anne Hartley (EPFL), Lei Li (Carnegie Mellon University), Yu-Xiang Wang (UCSD), Vishal Patel (Johns Hopkins University), Soheil Feizi (University of Maryland), Tom Goldstein (University of Maryland), Furong Huang (University of Maryland)
Authors: Andy Zou (CMU, Gray Swan AI), Maxwell Lin (University of California, Berkeley), Eliot Jones (Gray Swan), Micha Nowak (Bayerische Julius-Maximilians-Universität Würzburg), Mateusz Dziemian (Independent), Nick Winter (Gray Swan AI), Valent Nathanael (Gray Swan AI), Ayla Croft (Gray Swan AI), Xander Davies (University of Oxford), Jai Patel (UK AI Security Institute), Robert Kirk (University College London), Yarin Gal (University of Oxford), Dan Hendrycks (Center for AI Safety), Zico Kolter (Carnegie Mellon University), Matt Fredrikson (CMU)
Authors: Ilias Diakonikolas (University of Wisconsin-Madison), Chao Gao (University of Chicago), Daniel Kane (UCSD), John Lafferty (Carnegie Mellon University), Ankit Pensia (IBM Research)
Authors: Daniel Pfrommer (Massachusetts Institute of Technology), Zehao Dou (OpenAI), Christopher Scarvelis (MIT), Max Simchowitz (Carnegie Mellon University), Ali Jadbabaie (MIT)
Authors: Peng Xing (Nanjing University of Science and Technology), Haofan Wang (Carnegie Mellon University), Yanpeng Sun (Nanjing University of Science and Technology), wangqixun (Tencent Hunyuan), Baixu (ByteDance Inc.), Hao Ai (Beijing University of Aeronautics and Astronautics), Jen-Yuan Huang (Peking University), Zechao Li (Nanjing University of Science and Technology)
Authors: Dravyansh Sharma (Toyota Technological Institute at Chicago), Colin White (Meta), Maria-Florina Balcan (Carnegie Mellon University)
Machine learning performance depends strongly on the data and on the choice of algorithms and hyperparameters, making hyperparameter tuning and algorithm selection essential. We survey widely used practical methods, including Bayesian optimization, bandit-based approaches, and recent techniques for large language models such as scaling laws and parameterization-aware methods, noting their limited theoretical guarantees. We then review recent theory-driven advances that characterize how performance varies with hyperparameters for core algorithms—including decision trees, linear models, and deep learning—enabling structure-aware tuning methods with PAC generalization guarantees. We conclude with open challenges in combining principled and practical approaches, optimizing over high-dimensional or discrete spaces, and scaling to distributed settings.
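As a concrete instance of the bandit-based approaches mentioned above, here is a minimal sketch of successive halving: evaluate all candidate hyperparameter configurations on a small budget, keep the best fraction, and repeat with more budget. The toy `evaluate` function below is a hypothetical stand-in for a real training-and-validation run.

```python
import math

def successive_halving(configs, evaluate, init_budget=1, eta=2):
    """Bandit-style tuning: score every config at the current budget,
    keep the top 1/eta fraction, multiply the budget by eta, repeat."""
    configs, budget = list(configs), init_budget
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy "validation score": peaks at learning rate 0.1 (budget is ignored here,
# but a real evaluate() would train longer as the budget grows).
best = successive_halving([0.001, 0.01, 0.1, 1.0],
                          lambda lr, b: -abs(math.log10(lr) + 1))
print(best)  # 0.1
```

Methods like Hyperband wrap this loop over several budget schedules; the theory-driven work surveyed here instead asks how such scores vary with the hyperparameters, enabling structure-aware search with PAC guarantees.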
Authors: Pratyush Maini (Carnegie Mellon University/ DatologyAI), Joseph C. Gratz (Partner, Morrison Foerster LLP), A. Feder Cooper (Yale/Stanford)
Generative models are trained on vast datasets that often contain personal data and copyrighted content. As lawsuits, regulations, and standards emerge, practitioners increasingly need concrete, technically grounded guidance on how privacy and copyright law interact with the realities of modern model development. This tutorial connects data privacy, memorization, and copyright. We will alternate between technical material (attacks, defenses, measurement, and system design) and legal analysis (doctrines, active cases, and regulatory futures), with a focus on practical workflows that ML researchers, engineers, and policy teams can adopt today.
Authors: Adam Block (Columbia University), Dylan Foster (Microsoft Research), Max Simchowitz (Carnegie Mellon University)
This tutorial frames imitation learning (IL) as a unifying way to understand supervised training of foundation models—learning by imitating large corpora of domain-specific demonstrations—across areas like large language model pre-training, robotics, and chemistry/life sciences. It surveys recent theory on when and why IL works with powerful generative models, explains the interventions and best practices the field has converged on, and points to opportunities to better connect theory and practice. A central theme is how domain-specific settings shape solutions, contrasting discrete problems like language modeling with continuous-control challenges in robotics. It also links techniques across domains, casting next-token prediction as behavior cloning with log-loss and relating exposure bias in generation to compounding error in control, while motivating tools like action chunking, score matching, and interactive data collection.
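The "next-token prediction as behavior cloning with log-loss" view can be written down in a few lines: treat each (context, next token) pair as an expert demonstration and penalize the negative log-probability the policy assigns to the expert's action. The tabular toy policy below is an illustrative assumption, not any particular model.

```python
import math

def behavior_cloning_log_loss(demos, policy_probs):
    """Log-loss behavior cloning, the imitation objective behind next-token
    prediction: average negative log-probability of the expert's action
    given the context."""
    total = 0.0
    for context, action in demos:
        total -= math.log(policy_probs(context)[action])
    return total / len(demos)

# Toy tabular "policy": uniform over a two-token vocabulary.
uniform = lambda ctx: {"a": 0.5, "b": 0.5}
demos = [("", "a"), ("a", "b")]
print(behavior_cloning_log_loss(demos, uniform))  # ln 2 ≈ 0.693
```

Exposure bias then has a natural reading: the policy is trained only on expert-visited contexts, so its own generated prefixes take it off-distribution, mirroring compounding error in control.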
Large language models have made major gains on reasoning tasks by scaling test-time compute using methods like chain-of-thought and sampling, which can boost performance beyond what pretraining alone delivers. However, deploying more test-time compute is hard because inference workloads tend to have low parallelism, irregular execution, heavy memory I/O, and dynamic control flow—creating bottlenecks like attention memory overhead and poor compute utilization. The tutorial surveys both systems advances (e.g., more efficient KV-cache management, optimized attention kernels, smarter scheduling) and algorithmic directions (e.g., architectures and parallel generation better suited to hardware). Its goal is to connect scaling theory with real deployment constraints and motivate practical, scalable LLM agent systems.
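The KV-cache bottleneck mentioned above can be illustrated with a toy sketch (an assumption-laden simplification, not a real serving system): each decode step appends one new key/value pair, so attention over the growing prefix reuses cached entries rather than recomputing them, trading compute for memory that scales with sequence length.

```python
class KVCache:
    """Toy per-sequence key/value cache: one append per generated token,
    so prefix K/V entries are reused rather than recomputed each step."""
    def __init__(self):
        self.keys, self.values = [], []
        self.computed = 0  # how many K/V "projections" we actually ran

    def step(self, token):
        # Hypothetical K/V projection for the newest token only.
        self.keys.append(("K", token))
        self.values.append(("V", token))
        self.computed += 1
        # Attention would read the full cached prefix here.
        return list(zip(self.keys, self.values))

cache = KVCache()
for t in ["The", "cat", "sat", "."]:
    ctx = cache.step(t)

# 4 decode steps: K/V computed once per token (4 times), not once per token
# per step (4 + 3 + 2 + 1 = 10 times) -- but the cache itself now holds the
# whole prefix, which is the memory-I/O cost that systems work targets.
print(cache.computed)  # 4
```

Paged and quantized cache layouts, optimized attention kernels, and scheduling all attack the memory side of this trade-off.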
Authors: Ziqiao Ma (University of Michigan), Michael Saxon (University of Washington), Xiang Yue (Carnegie Mellon University/Meta)
This tutorial argues that modern AI evaluation needs a more principled view of what benchmarks actually measure—and what they systematically miss—as models and use cases evolve. It maps out key pitfalls in today’s benchmarking practice (especially static metrics that fail to track changing model behavior) and frames evaluation as an epistemic design problem rather than just a leaderboard exercise. The tutorial then surveys emerging paradigms—including adversarial and dynamic benchmarks, model arenas, scaled human evaluation, simulators/sandboxes, and applied interpretability—plus a panel to compare perspectives across the community.