MILE: Model-based Intervention Learning

University of Southern California

Abstract

Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding errors but also require human experts to provide complete trajectories. Although interactive methods exist in which an expert oversees the autonomous agent and intervenes when needed, these extensions only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention time steps. In this work, we model how and when such interventions occur, and we show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that expert feedback provides crucial information about the quality of the current state and the optimality of the chosen action, regardless of whether an intervention occurs. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, and a human subject study.

Video



Method

Intervention Model

We propose an intervention model based on the probit model from discrete decision theory. Let \(\nu\) be a binary random variable indicating whether the human intervenes (\(\nu=1\)) or not (\(\nu=0\)), and let \(\bar{a}_h\) denote the nominal human action, i.e., the action the human would take if they decided to intervene. Mathematically, \(a_h=\bar{a}_h\) if and only if \(\nu=1\); otherwise, \(a_h\) is undefined in that state. Finally, let \(\hat{\pi}\) denote the human's mental model of the robot, i.e., what the human believes the robot will do in a given state. This prediction is needed because, in our problem setting, the human has to intervene before seeing the robot's action. The probability of an intervention in state \(s\) can then be decomposed as:

\[ \begin{align}\label{eq:intervention} p(\nu=1\mid s) &= \sum_{a\in A}p(\bar{a}_h=a,\nu=1\mid s) = \sum_{a\in A}p(\bar{a}_h=a\mid s)p(\nu=1\mid \bar{a}_h=a,s) \end{align} \]

We assume the human is a (noisy) expert, represented by a Boltzmann policy over the Q-values \(Q(s,a)\). We use \(\sigma\) for the softmax operation that maps a vector of scores to a probability vector summing to 1, \(\Phi\) for the CDF of the standard normal distribution, and \(c\) for a constant that captures the human's cost of intervening.

\[ \begin{align} p(\bar{a}_h=a\mid s) = \pi_h(a \mid s) := \sigma({Q(s,a)}) = \frac{\exp(Q(s,a))}{\sum_{a'\in A} \exp(Q(s,a'))}\: \end{align} \] \[ \begin{align} p(\nu=1\mid s) &= \sum_{a\in A}\pi_h(a \mid s)\Phi\left(\mathbb{E}_{a'\sim \hat{\pi}(\cdot\mid s)}[\ln \pi_h(a \mid s) - \ln\pi_h(a' \mid s)]-c\right) \nonumber\\ &= \mathbb{E}_{a\sim\pi_h(\cdot \mid s)}\left[\Phi\left(\mathbb{E}_{a'\sim \hat{\pi}(\cdot \mid s)}[\ln \pi_h(a \mid s) - \ln\pi_h(a' \mid s)]-c\right)\right] \label{eq:when_final}\\ p(a_h=\bar{a}_h\mid s) &= \pi_h(\bar{a}_h\mid s)p(\nu=1\mid s)\label{eq:how_final} \end{align} \]
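For illustration, the snippet below is a minimal NumPy sketch of these equations for a single state with a discrete action set. The variable names (`q`, `pi_hat`, `c`) and the standalone function are assumptions made for this example, not a definitive implementation.

```python
import numpy as np
from scipy.stats import norm

def intervention_model(q, pi_hat, c):
    """Evaluate the intervention model for one state with discrete actions.

    q      : (|A|,) array of human Q-values Q(s, a)
    pi_hat : (|A|,) array, the human's mental model of the robot, pi_hat(a | s)
    c      : scalar intervention cost
    Returns (p_intervene, p_action), where p_action[a] = p(a_h = a | s).
    """
    # Boltzmann human policy: pi_h(a | s) = softmax(Q(s, .))
    z = q - q.max()                              # shift for numerical stability
    log_pi_h = z - np.log(np.exp(z).sum())
    pi_h = np.exp(log_pi_h)

    # E_{a' ~ pi_hat}[ln pi_h(a' | s)]: expected log-prob of the predicted robot action
    expected_log_robot = pi_hat @ log_pi_h

    # Probit term Phi(E_{a'}[ln pi_h(a | s) - ln pi_h(a' | s)] - c) for every action a
    phi = norm.cdf(log_pi_h - expected_log_robot - c)

    # Probability that the human intervenes at all in this state
    p_intervene = pi_h @ phi

    # Probability that the human intervenes with each particular action
    p_action = pi_h * p_intervene
    return p_intervene, p_action

# Example: three actions, human strongly prefers action 0, robot predicted to act uniformly
p_nu, p_a = intervention_model(np.array([2.0, 0.0, 0.0]), np.ones(3) / 3, c=0.5)
```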

Training Framework

In our learning algorithm, we model both the human's mental model and the robot policy with neural networks, \(\hat{\pi}_\xi\) and \(\pi_\theta\), respectively. Since the intervention model is differentiable, we can backpropagate through it to jointly train these networks on the dataset of \((s,a_r,a_h,s')\) tuples. At inference time, we only deploy the trained policy \(\pi_{\theta}\).
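As one way to picture the joint training, here is a hedged PyTorch sketch of a per-batch loss for a discrete action space. It treats the policy network's outputs as the Q-values of the Boltzmann expert inside the intervention model and maximizes the likelihood of both intervention and non-intervention steps; the function name `mile_nll`, the batch layout, and the fixed scalar `cost_c` are assumptions for this example rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def mile_nll(policy_logits, mental_logits, a_h, intervened, cost_c):
    """Negative log-likelihood of one batch under the intervention model.

    policy_logits : (B, |A|) outputs of pi_theta, used as Q(s, .) for the
                    Boltzmann expert in the intervention model
    mental_logits : (B, |A|) outputs of the mental model pi_hat_xi
    a_h           : (B,) human action indices; entries at non-intervention
                    steps are ignored (fill with any valid index)
    intervened    : (B,) float mask, 1.0 where the human intervened
    cost_c        : scalar intervention cost (a hyperparameter here)
    """
    log_pi_h = F.log_softmax(policy_logits, dim=-1)      # ln pi_h(a | s)
    pi_h = log_pi_h.exp()
    pi_hat = F.softmax(mental_logits, dim=-1)            # mental model pi_hat(a | s)

    # Phi(E_{a'~pi_hat}[ln pi_h(a | s) - ln pi_h(a' | s)] - c) for each action a
    expected_log_robot = (pi_hat * log_pi_h).sum(-1, keepdim=True)
    phi = torch.distributions.Normal(0.0, 1.0).cdf(
        log_pi_h - expected_log_robot - cost_c)

    p_intervene = (pi_h * phi).sum(-1)                    # p(nu = 1 | s)
    p_action = pi_h.gather(-1, a_h.unsqueeze(-1)).squeeze(-1) * p_intervene

    # Intervention steps contribute log p(a_h, nu=1 | s);
    # non-intervention steps contribute log p(nu=0 | s).
    eps = 1e-8
    log_lik = intervened * torch.log(p_action + eps) \
            + (1.0 - intervened) * torch.log(1.0 - p_intervene + eps)
    return -log_lik.mean()
```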

Computational intervention model.



Simulation Experiments

We evaluated our method across four diverse simulation tasks. One task has a discrete action space: the LunarLander environment from Gymnasium. The remaining three tasks (Drawer-Open, Peg-Insertion, and Button-Press) are from the MetaWorld suite and have continuous action spaces. MILE achieves the best results across all environments, demonstrating its sample efficiency. For additional experiment details, click here.


Iterative Training

Within just 10 iterations, MILE surpasses the success rate that the initial policy achieves with expert interventions, and continues to improve further, while other baselines struggle.


Offline Demo Ablation

We ran an ablation study in Drawer-Open, comparing our method against the baselines when they have access to a small set of expert demonstrations. MILE performs at least as well as the baselines even when they are given up to 5 offline demonstrations.


Real-Robot Experiment

We also evaluated our method in a real-world setting using a 6-DoF WidowX robot arm. The task is to insert the octagonal block into the wooden box through the correct hole. For additional experiment details, click here.


User Study

We conducted a user study to analyze how accurately our model estimates the interventions made by different users, since the success of our method relies on how well the intervention model captures when and how humans intervene. We used the same real-robot task setting.


Evaluation Rollouts

MILE successfully completes the real-robot task after just 4 iterations, using 12 intervention trajectories in total.