ICLR 2026 Paper Announcements
We are excited to announce that REALM will be presenting seven papers at ICLR 2026! Below you can find a brief description and dedicated project website for each paper.
Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning
Oswin So1, Eric Yu1, Songyuan Zhang1, Matthew Cleaveland2, Mitchell Black2, Chuchu Fan1
1MIT 2MIT Lincoln Laboratory
- This paper addresses a fundamental mismatch between reinforcement learning (RL) and optimal safe controller synthesis that results in poor performance on low-probability states within the safe set.
- We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies safe initial conditions and solves the optimal control problem over these conditions using reinforcement learning (a schematic of this loop is sketched after this list).
- FGE achieves over 50% more coverage than the existing state of the art on challenging initial conditions across tasks in the MuJoCo simulator.
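Since the bullets above describe FGE only at a high level, the schematic below is one way to picture a loop that jointly estimates feasible initial conditions and learns from them; every component name is a placeholder for illustration, not part of the paper's method.

```python
# Schematic only: placeholder components standing in for FGE's actual machinery.
from typing import Any, Callable

def feasibility_guided_training(
    sample_state: Callable[[], Any],              # propose a candidate initial condition
    estimate_feasible: Callable[[Any], bool],     # current guess: can this state be kept safe?
    rl_update: Callable[[Any], bool],             # roll out and update the policy; returns whether the rollout stayed safe
    update_feasibility: Callable[[Any, bool], None],
    iterations: int = 10_000,
) -> None:
    for _ in range(iterations):
        x0 = sample_state()
        if estimate_feasible(x0):
            stayed_safe = rl_update(x0)           # learn the controller from currently-feasible states
            update_feasibility(x0, stayed_safe)   # refine the feasible-set estimate from the outcome
```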

ReFORM: Reflected Flows for On-Support Offline RL via Noise Manipulation
Songyuan Zhang1, Oswin So1, H M Sabbir Ahmad3, Eric Yu1, Matthew Cleaveland2, Mitchell Black2,
Chuchu Fan1
1MIT 2MIT Lincoln Laboratory 3Boston University
- ReFORM is an offline RL algorithm that uses a two-stage flow-based policy to enforce the support constraint by construction, avoiding out-of-distribution (OOD) errors without constraining policy improvement.
- We propose applying a reflected flow to generate constrained multimodal noise for the behavior-cloning (BC) flow policy, thereby mitigating OOD errors while preserving a multimodal policy (a toy illustration of boundary reflection follows after this list).
- Extensive experiments on 40 challenging tasks with datasets of varying quality demonstrate that, with a single fixed set of hyperparameters, ReFORM dominates all flow-policy baselines on the performance profile curve, even when those baselines use their best hand-tuned hyperparameters.
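As a toy illustration of the reflection idea only, and not ReFORM's actual noise-manipulation procedure, the snippet below folds unconstrained Gaussian noise into a bounded box by reflecting it at the boundaries; the bounds and function name are assumptions made for illustration.

```python
# Toy boundary reflection: fold unconstrained noise into [low, high]; not ReFORM itself.
import numpy as np

def reflect_into_box(x: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Map arbitrary real values into [low, high] by repeated reflection at the boundaries."""
    span = high - low
    y = np.mod(x - low, 2.0 * span)              # position within one back-and-forth period
    y = np.where(y > span, 2.0 * span - y, y)    # reflect the "returning" half of the period
    return y + low

noise = 2.0 * np.random.randn(5, 3)              # unconstrained Gaussian noise
print(reflect_into_box(noise))                   # every entry now lies in [-1, 1]
```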

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing
Anjali Parashar1, Yingke Li1, Eric Yu1, Fei Chen1, James Neidhoefer1, Devesh Upadhyay2,
Chuchu Fan1
1MIT 2SAAB
- As autonomous systems are increasingly deployed in high-stakes, human-centric domains such as drone emergency rescue, it becomes critical to evaluate their ethical alignment.
- We propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders.
- We validate our approach on two applications and find that our method provides an interpretable and efficient trade-off between exploration and exploitation (a toy acquisition-function sketch follows after this list), generating up to 2x as many optimal test candidates as baselines, with a 1.25x improvement in coverage of high-dimensional search spaces.
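To make the exploration-exploitation trade-off concrete, here is a generic Bayesian-experimental-design selection step using a UCB-style acquisition over a surrogate posterior. The acquisition, the surrogate, and all numbers are illustrative assumptions, not SEED-SET's actual design, which additionally incorporates stakeholder value judgments.

```python
# Generic UCB-style candidate selection, used only to illustrate exploration vs. exploitation.
import numpy as np

def ucb_acquisition(post_mean: np.ndarray, post_std: np.ndarray, beta: float = 2.0) -> np.ndarray:
    # Exploitation: high predicted value of a test; exploration: high posterior uncertainty.
    return post_mean + beta * post_std

def pick_next_test(candidates: np.ndarray, post_mean: np.ndarray, post_std: np.ndarray) -> np.ndarray:
    return candidates[int(np.argmax(ucb_acquisition(post_mean, post_std)))]

# Made-up surrogate posterior over four candidate test scenarios.
cands = np.array([[0.1, 0.2], [0.5, 0.9], [0.7, 0.3], [0.2, 0.8]])
mean = np.array([0.30, 0.60, 0.50, 0.40])
std = np.array([0.05, 0.10, 0.40, 0.20])
print(pick_next_test(cands, mean, std))          # picks the candidate with the best mean-plus-uncertainty score
```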

TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Yongchao Chen1,2, Jiefeng Chen3, Rui Meng3, Ji Yin1, Na Li2, Chuchu Fan1, Chi Wang4,
Tomas Pfister3, Jinsung Yoon3
1MIT 2Harvard 3Google Cloud AI Research 4Google DeepMind
- Method: Proposes TUMIX (Tool-Use Mixture), a multi-agent test-time scaling framework that runs diverse tool-augmented agents (text, code, search) in parallel and iteratively refines answers via cross-agent sharing and majority voting (the refinement loop is sketched after this list).
- Key Insights: Agent diversity and quality matter more than scale alone; combining Code Interpreter and Search increases coverage and accuracy. An LLM-based termination strategy stops refinement adaptively, preserving performance at roughly 49% of the inference cost.
- Results: TUMIX outperforms strong test-time scaling baselines on HLE, GPQA, and AIME, achieving up to +3.55% average improvement over the best baseline at similar cost, with further gains under additional scaling.
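A minimal sketch of the refine-then-vote loop described above: the agents and the LLM-based termination check are passed in as opaque callables, and all names below are illustrative rather than the paper's actual API.

```python
# Illustrative refine-then-vote loop; agent and termination logic are opaque callables.
from collections import Counter
from typing import Callable, Sequence

def answer_with_agent_mixture(
    question: str,
    agents: Sequence[Callable[[str, list[str]], str]],   # each agent: (question, shared answers) -> new answer
    should_stop: Callable[[str, list[str]], bool],        # e.g. an LLM judge deciding further refinement won't help
    max_rounds: int = 3,
) -> str:
    shared: list[str] = []                                 # answers shared across agents between rounds
    for _ in range(max_rounds):
        # Run the diverse tool-augmented agents (text, code, search) on the question,
        # each seeing the previous round's answers from all agents.
        shared = [agent(question, shared) for agent in agents]
        if should_stop(question, shared):                  # adaptive early termination
            break
    # Final answer by majority vote over the last round's candidates.
    return Counter(shared).most_common(1)[0][0]
```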

R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-Stage Reinforcement Learning
Yongchao Chen1,2, Yueying Liu3, Junwei Zhou4, Yilun Hao1, Jingquan Wang5, Yang Zhang6, Na Li2,
Chuchu Fan1
1MIT 2Harvard 3University of Illinois Urbana-Champaign 4University of Michigan 5University of Wisconsin–Madison 6MIT-IBM Watson AI Lab
- Problem & Approach: Vanilla RL struggles to train LLMs to use a Code Interpreter across diverse tasks due to task heterogeneity and sparse rewards. The paper proposes R1-Code-Interpreter (R1-CI), combining supervised fine-tuning with multi-stage GRPO reinforcement learning guided by improvement potential to effectively integrate multi-turn text–code reasoning.
- Key Innovation: Introduces a potential-based curriculum learning strategy that prioritizes samples with the highest learning signal (success rate near 50%), substantially improving RL gains over standard GRPO (a toy prioritization sketch follows after this list).
- Results: The best model, R1-CI-14B, achieves 72.4% test accuracy, outperforming GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%), while showing emergent self-checking behavior and improved generalization.
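As a toy illustration of prioritizing samples near 50% success rate, the sketch below samples tasks in proportion to an assumed weighting that peaks at 0.5; the paper's exact improvement-potential criterion may differ.

```python
# Toy curriculum sampler: weight tasks by an assumed "learning signal" that peaks at 50% success.
import random

def improvement_potential(success_rate: float) -> float:
    # Peaks at 0.5 (maximal learning signal), vanishes for already-solved or hopeless samples.
    return success_rate * (1.0 - success_rate)

def sample_curriculum(success_rates: dict[str, float], k: int) -> list[str]:
    tasks = list(success_rates)
    weights = [improvement_potential(success_rates[t]) for t in tasks]
    return random.choices(tasks, weights=weights, k=k)

# Tasks sitting near 50% success are sampled far more often than solved or unsolved ones.
rates = {"sudoku": 0.48, "maze": 0.95, "crossword": 0.05}
print(sample_curriculum(rates, k=5))
```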

Discrete Adjoint Matching
Oswin So1, Brian Karrer2, Chuchu Fan1, Ricky T. Q. Chen2, Guan-Horng Liu2
1MIT 2FAIR at Meta
- Bridging a Critical Gap in Discrete Generative Modeling: Diffusion-based LLMs are a promising alternative to autoregressive models, but no principled fine-tuning methods existed for their discrete state spaces — where gradients don’t exist and continuous methods break down. DAM introduces the discrete adjoint, which recovers gradient-like information in discrete domains, enabling the same elegant regression-based fine-tuning that made Adjoint Matching so successful in continuous settings.
- A Theoretically Grounded Framework for Reward Optimization: DAM solves entropy-regularized reward optimization for Continuous-Time Markov Chains (CTMCs), providing the first discrete variant of Adjoint Matching with a well-defined fixed-point guarantee: the unique fixed point of the matching objective is the optimal transition rate (a schematic of the underlying control problem appears after this list).
- State-of-the-Art Results on Mathematical Reasoning: Applied to LLaDA-8B-Instruct, DAM achieves 87.1% accuracy on Sudoku, dramatically outperforming both the base LLaDA-8B model and the D1 fine-tuning method, demonstrating that diffusion-based LLMs can be effectively steered toward complex reasoning tasks through principled reward optimization.
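For orientation, the schematic below states the generic form that KL-regularized terminal-reward maximization over a CTMC takes (a Doob h-transform of the base rates). This standard result is given here only as context and under the assumption of a terminal reward; it is not DAM's specific matching objective or its fixed-point argument.

```latex
% Generic KL-regularized terminal-reward control of a CTMC with base rates Q_t(x,y);
% schematic context only, not DAM's matching objective.
\[
\max_{Q^u}\ \mathbb{E}^{Q^u}\!\left[r(X_T)\right]
  - \lambda\,\mathrm{KL}\!\left(\mathbb{P}^{Q^u}\,\middle\|\,\mathbb{P}^{Q}\right)
\quad\Longrightarrow\quad
Q^{\star}_t(x,y) = Q_t(x,y)\,\frac{h_t(y)}{h_t(x)},
\qquad
h_t(x) = \mathbb{E}^{Q}\!\left[e^{\,r(X_T)/\lambda}\,\middle|\,X_t = x\right].
\]
```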

Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
Yilun Hao1, Yongchao Chen1,2, Chuchu Fan1, Yang Zhang3
1MIT 2Harvard 3MIT-IBM Watson AI Lab
- Current VLMs lack precise spatial understanding and long-horizon reasoning, while formal PDDL planners depend on structured domain and problem files and cannot directly interpret visual inputs. This paper addresses the challenge of visual long-horizon planning by combining the strengths of VLMs and PDDL planners.
- We propose VLMFP, a Dual-VLM-guided framework that autonomously generates PDDL domain and problem files for visual planning; to our knowledge, it is the first framework to leverage visual inputs to generate both PDDL files without human feedback or direct environment access (a rough sketch of a generate-and-verify loop follows after this list).
- With a fine-tuned Qwen2-VL-7B model as the SimVLM and GPT-4o as the GenVLM, VLMFP achieves 70.0% and 54.1% success rates on unseen instances with seen and unseen appearances, respectively, outperforming the best baseline by 39.3% and 21.8%. Furthermore, VLMFP achieves multiple levels of generalizability to unseen instances, appearances, and game rules.
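The division of labor below is inferred from the component names and is only a guess at the pipeline: a GenVLM drafts the PDDL files, a classical planner searches for a plan, and a SimVLM checks the plan against the image. All functions are hypothetical placeholders, not the paper's actual interfaces.

```python
# Assumed generate-and-verify loop; `gen_vlm`, `run_planner`, and `sim_vlm` are placeholders.
from typing import Callable, Optional

def dual_vlm_planning(
    image: bytes,
    gen_vlm: Callable[[bytes, str], tuple[str, str]],        # (image, feedback) -> (domain.pddl, problem.pddl) text
    run_planner: Callable[[str, str], Optional[list[str]]],  # classical PDDL planner -> action list, or None
    sim_vlm: Callable[[bytes, list[str]], str],              # check plan against the image -> "" if consistent, else feedback
    max_rounds: int = 3,
) -> Optional[list[str]]:
    feedback = ""
    for _ in range(max_rounds):
        domain, problem = gen_vlm(image, feedback)            # draft or revise the PDDL files
        plan = run_planner(domain, problem)
        if plan is None:
            feedback = "planner found no plan; revise the PDDL files"
            continue
        feedback = sim_vlm(image, plan)                       # step the plan through the scene
        if not feedback:                                      # consistent with the image: accept
            return plan
    return None
```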
