MILE: Model-based Intervention Learning

University of Southern California

Abstract

Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding errors but also require human experts to provide complete trajectories. Although interactive methods exist in which an expert oversees the autonomous agent and intervenes when needed, these extensions only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention time steps. In this work, we model how and when such interventions occur, and we show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that expert feedback provides crucial information about the quality of the current state and the optimality of the chosen action, regardless of whether an intervention occurs. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, and a human subject study.

Video



Method

Intervention Model

We propose an intervention model based on the probit model from discrete decision theory. Let \(\nu\) be a binary random variable indicating whether the human intervenes (\(\nu=1\)) or not (\(\nu=0\)), and let \(\bar{a}_h\) denote the nominal human action, i.e., the action the human would take if they decided to intervene. Mathematically, \(a_h=\bar{a}_h\) if and only if \(\nu=1\); otherwise, \(a_h\) is undefined in that state. Finally, let \(\hat{\pi}\) denote the human's mental model of the robot, i.e., what the human believes the robot will do in a given state. This prediction is needed because, in our problem setting, the human has to intervene before seeing the robot's action. The probability of an intervention in state \(s\) can then be decomposed as:

\[ \begin{align}\label{eq:intervention} p(\nu=1\mid s) &= \sum_{a\in A}p(\bar{a}_h=a,\nu=1\mid s) = \sum_{a\in A}p(\bar{a}_h=a\mid s)p(\nu=1\mid \bar{a}_h=a,s) \end{align} \]

We assume the human is a (noisy) expert, represented by a Boltzmann policy over the Q-values \(Q(s,a)\). We use \(\sigma\) for the softmax operation that maps a vector of scores to a probability vector summing to 1, \(\Phi\) for the CDF of the standard normal distribution, and \(c\) for a constant that captures the human's cost of intervening.

\[ \begin{align} p(\bar{a}_h=a\mid s) = \pi_h(a \mid s) := \sigma({Q(s,a)}) = \frac{\exp(Q(s,a))}{\sum_{a'\in A} \exp(Q(s,a'))}\: \end{align} \] \[ \begin{align} p(\nu=1\mid s) &= \sum_{a\in A}\pi_h(a \mid s)\Phi\left(\mathbb{E}_{a'\sim \hat{\pi}(\cdot\mid s)}[\ln \pi_h(a \mid s) - \ln\pi_h(a' \mid s)]-c\right) \nonumber\\ &= \mathbb{E}_{a\sim\pi_h(\cdot \mid s)}\left[\Phi\left(\mathbb{E}_{a'\sim \hat{\pi}(\cdot \mid s)}[\ln \pi_h(a \mid s) - \ln\pi_h(a' \mid s)]-c\right)\right] \label{eq:when_final}\\ p(a_h=\bar{a}_h\mid s) &= \pi_h(\bar{a}_h\mid s)p(\nu=1\mid s)\label{eq:how_final} \end{align} \]
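For illustration, the snippet below is a minimal NumPy sketch of these equations for a single state with a discrete action set. The variable names (`q`, `pi_hat`, `c`) and the standalone function are assumptions made for this example, not a definitive implementation.

```python
import numpy as np
from scipy.stats import norm

def intervention_model(q, pi_hat, c):
    """Evaluate the intervention model for one state with discrete actions.

    q      : (|A|,) array of human Q-values Q(s, a)
    pi_hat : (|A|,) array, the human's mental model of the robot, pi_hat(a | s)
    c      : scalar intervention cost
    Returns (p_intervene, p_action), where p_action[a] = p(a_h = a | s).
    """
    # Boltzmann human policy: pi_h(a | s) = softmax(Q(s, .))
    z = q - q.max()                              # shift for numerical stability
    log_pi_h = z - np.log(np.exp(z).sum())
    pi_h = np.exp(log_pi_h)

    # E_{a' ~ pi_hat}[ln pi_h(a' | s)]: expected log-prob of the predicted robot action
    expected_log_robot = pi_hat @ log_pi_h

    # Probit term Phi(E_{a'}[ln pi_h(a | s) - ln pi_h(a' | s)] - c) for every action a
    phi = norm.cdf(log_pi_h - expected_log_robot - c)

    # Probability that the human intervenes at all in this state
    p_intervene = pi_h @ phi

    # Probability that the human intervenes with each particular action
    p_action = pi_h * p_intervene
    return p_intervene, p_action

# Example: three actions, human strongly prefers action 0, robot predicted to act uniformly
p_nu, p_a = intervention_model(np.array([2.0, 0.0, 0.0]), np.ones(3) / 3, c=0.5)
```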

Training Framework

In our learning algorithm, we model both the human's mental model and the robot policy with neural networks, \(\hat{\pi}_\xi\) and \(\pi_\theta\), respectively. Since the intervention model is differentiable, we can backpropagate through it to jointly train these networks on the dataset of \((s,a_r,a_h,s')\) tuples. At inference time, we only deploy the trained policy \(\pi_{\theta}\).
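As one way to picture the joint training, here is a hedged PyTorch sketch of a per-batch loss for a discrete action space. It treats the policy network's outputs as the Q-values of the Boltzmann expert inside the intervention model and maximizes the likelihood of both intervention and non-intervention steps; the function name `mile_nll`, the batch layout, and the fixed scalar `cost_c` are assumptions for this example rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def mile_nll(policy_logits, mental_logits, a_h, intervened, cost_c):
    """Negative log-likelihood of one batch under the intervention model.

    policy_logits : (B, |A|) outputs of pi_theta, used as Q(s, .) for the
                    Boltzmann expert in the intervention model
    mental_logits : (B, |A|) outputs of the mental model pi_hat_xi
    a_h           : (B,) human action indices; entries at non-intervention
                    steps are ignored (fill with any valid index)
    intervened    : (B,) float mask, 1.0 where the human intervened
    cost_c        : scalar intervention cost (a hyperparameter here)
    """
    log_pi_h = F.log_softmax(policy_logits, dim=-1)      # ln pi_h(a | s)
    pi_h = log_pi_h.exp()
    pi_hat = F.softmax(mental_logits, dim=-1)            # mental model pi_hat(a | s)

    # Phi(E_{a'~pi_hat}[ln pi_h(a | s) - ln pi_h(a' | s)] - c) for each action a
    expected_log_robot = (pi_hat * log_pi_h).sum(-1, keepdim=True)
    phi = torch.distributions.Normal(0.0, 1.0).cdf(
        log_pi_h - expected_log_robot - cost_c)

    p_intervene = (pi_h * phi).sum(-1)                    # p(nu = 1 | s)
    p_action = pi_h.gather(-1, a_h.unsqueeze(-1)).squeeze(-1) * p_intervene

    # Intervention steps contribute log p(a_h, nu=1 | s);
    # non-intervention steps contribute log p(nu=0 | s).
    eps = 1e-8
    log_lik = intervened * torch.log(p_action + eps) \
            + (1.0 - intervened) * torch.log(1.0 - p_intervene + eps)
    return -log_lik.mean()
```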

Computational intervention model.



Simulation Experiments

We evaluated our method across four diverse simulation tasks. One task has a discrete action space: the LunarLander environment from Gymnasium. The remaining three tasks (Drawer-Open, Peg-Insertion, and Button-Press) are from the MetaWorld suite and have continuous action spaces. MILE achieves the best results across all environments, demonstrating its sample efficiency. For additional experiment details, click here.


Iterative Training

Within just 10 iterations, MILE surpasses the success rate that the initial policy achieves with expert interventions, and continues to improve further, while other baselines struggle.


Offline Demo Ablation

We ran an ablation study in Drawer-Open, comparing our method against the baselines when they have access to a small set of expert demonstrations. MILE performs at least as well as the baselines even when they are given up to 5 offline demonstrations.


Real-Robot Experiment

We also evaluated our method in a real-world setting using a 6-DoF WidowX robot arm. The task is to insert the octagonal block into the wooden box through the correct hole. For additional experiment details, click here.


User Study

We conducted a user study to analyze how accurately our model estimates the interventions made by different users, since the success of our method relies on how well the intervention model captures when and how humans intervene. We used the same real-robot task setting.


Evaluation Rollouts

MILE successfully completes the real-robot task after just 4 iterations, using 12 intervention trajectories in total.