GABRIL: Gaze-Based Regularization for
Mitigating Causal Confusion
in Imitation Learning

University of Southern California
* Equal Contribution
Intro figure

A self-driving agent encounters many misleading factors in the environment, but human gaze can provide supervisory signals to guide the agent's learning process.

Abstract

Imitation Learning (IL) is a widely adopted approach that enables agents to learn from human expert demonstrations by framing the task as a supervised learning problem. However, IL often suffers from causal confusion, where agents misinterpret spurious correlations as causal relationships, leading to poor performance in test environments under distribution shift. To address this issue, we introduce GAze-Based Regularization in Imitation Learning (GABRIL), a novel method that leverages human gaze data gathered during the data collection phase to guide representation learning in IL. GABRIL utilizes a regularization loss that encourages the model to focus on causally relevant features identified through expert gaze, thereby mitigating the effects of confounding variables. We validate our approach in Atari environments and on the Bench2Drive benchmark in CARLA by collecting human gaze datasets and applying our method in both domains. Experimental results show that GABRIL's improvement over behavior cloning is around 179% larger than that of the other baselines in the Atari setup and 76% larger in the CARLA setup. Finally, we show that our method provides extra explainability compared to regular IL agents.

Method figure

GABRIL incorporates two loss terms: first, a standard behavioral cloning loss that reduces action-prediction error; second, a gaze-based regularization loss that encourages the model to focus on causally relevant features identified through expert gaze.
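To make the two-term objective concrete, here is a minimal PyTorch sketch. It assumes the gaze regularizer is a KL divergence between a spatial attention map derived from an intermediate feature layer and the recorded human gaze mask; the function names, the attention construction, and the weighting factor `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gabril_loss(policy, obs, actions, gaze_mask, lam=1.0):
    """Sketch of a combined BC + gaze-regularization objective.

    Assumes `policy(obs)` returns (action_logits, feature_map), where
    feature_map is a [B, C, H, W] activation from an intermediate layer,
    and `gaze_mask` is a [B, H, W] human gaze heatmap.
    """
    logits, feats = policy(obs)

    # Standard behavioral cloning loss on discrete actions.
    bc_loss = F.cross_entropy(logits, actions)

    # Spatial attention map: channel-wise mean of squared activations,
    # normalized to a distribution over spatial locations.
    attn = feats.pow(2).mean(dim=1)            # [B, H, W]
    attn = attn.flatten(1).softmax(dim=-1)     # [B, H*W]
    gaze = gaze_mask.flatten(1)
    gaze = gaze / gaze.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Gaze regularizer: KL(gaze || attention) pushes the model's
    # attention toward regions the human expert looked at.
    gaze_loss = F.kl_div(attn.clamp_min(1e-8).log(), gaze,
                         reduction="batchmean")

    return bc_loss + lam * gaze_loss
```

For the continuous-action CARLA setup, the cross-entropy term would be replaced by a regression loss (e.g., MSE) on brake, steering, and throttle.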

Datasets

To demonstrate the effectiveness of our method, we conduct experiments in Atari environments as well as in a more realistic benchmark, Bench2Drive, built on CARLA. We collect a dataset consisting of 1,160 minutes of human experts playing Atari games and another dataset of 71 minutes of expert driving in CARLA, both with recorded gaze data.

Atari

Our Atari dataset consists of 15 Atari games played for 1,160 minutes in total. Each game was rendered at a frame rate convenient for the player, ranging from 10 to 20 FPS. While the player viewed the game at full-screen resolution, we recorded observations as grayscale images downscaled to 84×84. The recordings also contain the corresponding gaze data and discrete controller actions for each observation. All games were played with a frame skip of 4 and a sticky-action probability of 0.25. An environment configuration reproducing this preprocessing is sketched below, followed by some samples from our dataset.
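The following Gymnasium-based snippet reproduces the stated preprocessing (frame skip 4, sticky actions with probability 0.25, grayscale 84×84 observations). It is illustrative, not necessarily the exact pipeline used during data collection, and the choice of game is arbitrary.

```python
import gymnasium as gym
import ale_py
from gymnasium.wrappers import ResizeObservation

gym.register_envs(ale_py)  # make ALE/* ids available (gymnasium >= 1.0)

env = gym.make(
    "ALE/MsPacman-v5",
    frameskip=4,                     # frame skip of 4
    repeat_action_probability=0.25,  # sticky actions
    obs_type="grayscale",            # grayscale observations (210x160)
)
env = ResizeObservation(env, (84, 84))  # downscale to 84x84

obs, info = env.reset()
assert obs.shape == (84, 84)
```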


[Three samples, each shown as a Collected Episode clip alongside its Overlaid Gaze Mask]

CARLA

We used the open-source CARLA 0.9.15 simulator for urban driving. Specifically, we used the Leaderboard 2.0 framework to execute scenarios and record data. We leveraged the recently proposed Bench2Drive benchmark, comprising 44 driving tasks across different towns and weather conditions. From these, we selected a diverse subset of 10 driving tasks with the highest potential for causal confusion and collected 20 expert demonstrations with continuous actions for each task. The recordings contain 320×180 RGB images from a front-view camera with a 60° field of view, together with the collected gaze coordinates and the continuous actions (brake, steering angle, and throttle). A sketch of the resulting record layout is shown below, followed by some examples from our CARLA dataset.
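For concreteness, a hypothetical per-frame record matching the description above might look as follows; the class and field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CarlaFrame:
    image: np.ndarray             # 320x180 RGB front-camera image, shape (180, 320, 3)
    gaze_xy: tuple[float, float]  # expert gaze coordinates in image space
    throttle: float               # continuous action components
    steer: float                  # steering angle
    brake: float
```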


[Two samples, each shown as a Collected Episode clip alongside its Overlaid Gaze Mask]


Results

Atari

We used two variants of the Atari environments: normal and confounded. For every environment-baseline pair, we trained 8 separate models with different seeds and evaluated each with 100 evaluation seeds. We report the mean scores across all trials in the following tables. GABRIL outperforms prior regularization methods on the ABC metric, as reported in the last two rows of the tables. Moreover, when combined with dropout methods, our method considerably boosts their performance. For instance, combining our method with GMD achieves a mean ABC of 22.7% in normal and 32.7% in confounded environments, nearly double the improvement of the best baseline.


Normal Environment

[Results table]


Confounded Environment

[Results table]

Explainability in CARLA

Since our model is equipped with a built-in gaze predictor, the resulting activation map can be used in real time to visualize what the agent is attending to when making predictions. The following GIFs provide instances of such visualization. Compared to the regular BC method, our model provides more meaningful interpretations. This capability is particularly valuable for real-world applications where explainability is crucial.
As an example, consider the first row, in which the agent is about to take a left turn at an intersection. GABRIL clearly attends to the cars in the opposite lane as well as the traffic light ahead. A regular BC agent, however, attends to the lateral traffic lights, a clear sign of causal confusion. A sketch of how such an overlay can be rendered follows.
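The helper below shows one way to render such an overlay: it upsamples a low-resolution attention map and blends it over the observation as a heatmap. The upsampling scheme and function name are illustrative assumptions, not the exact visualization code used for the GIFs.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(frame, attn, alpha=0.5):
    """Overlay a model attention map on an RGB frame (illustrative helper).

    `frame`: uint8 RGB image, shape (H, W, 3).
    `attn`:  low-resolution attention map, e.g. from the gaze predictor.
    """
    h, w = frame.shape[:2]
    # Upsample by pixel repetition (assumes H, W are multiples of attn's size).
    attn_up = np.kron(attn, np.ones((h // attn.shape[0], w // attn.shape[1])))
    # Normalize to [0, 1] for display.
    attn_up = (attn_up - attn_up.min()) / (np.ptp(attn_up) + 1e-8)

    plt.imshow(frame)
    plt.imshow(attn_up, cmap="jet", alpha=alpha)  # heatmap on top of the frame
    plt.axis("off")
    plt.show()
```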


[Columns: Image Observation · Overlaid Human Gaze Mask · GABRIL Attention Map · BC Attention Map]