Imitation Learning (IL) is a widely adopted approach that enables agents to learn from human expert demonstrations by framing the task as a supervised learning problem. However, IL often suffers from causal confusion, where agents misinterpret spurious correlations as causal relationships, leading to poor performance in test environments with distribution shift. To address this issue, we introduce GAze-Based Regularization in Imitation Learning (GABRIL), a novel method that leverages human gaze data gathered during the data collection phase to guide representation learning in IL. GABRIL uses a regularization loss that encourages the model to focus on causally relevant features identified through expert gaze, thereby mitigating the effects of confounding variables. We validate our approach in Atari environments and on the Bench2Drive benchmark in CARLA by collecting human gaze datasets and applying our method in both domains. Experimental results show that GABRIL's improvement over behavior cloning is around 179% larger than that of the other baselines in the Atari setup and 76% larger in the CARLA setup. Finally, we show that our method provides extra explainability when compared to regular IL agents.
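To make the idea concrete, below is a minimal sketch of one way such a gaze-based regularizer can be implemented: recorded fixations are rendered as a 2D probability mask and compared, via KL divergence, against a spatial attention map produced by the policy's encoder. The function name, tensor shapes, and the choice of KL divergence are illustrative assumptions, not the exact formulation used by GABRIL.

```python
import torch
import torch.nn.functional as F

def gaze_regularization_loss(attention_logits, gaze_mask, eps=1e-8):
    """Sketch of a gaze-based regularizer (hypothetical formulation).

    attention_logits: (B, H, W) unnormalized spatial scores from the encoder.
    gaze_mask:        (B, H, W) non-negative gaze heatmap, e.g., Gaussians
                      rendered at the recorded fixation coordinates.
    """
    b = attention_logits.shape[0]
    # Normalize the attention map into a log-distribution over H*W locations.
    attn_log = F.log_softmax(attention_logits.view(b, -1), dim=-1)
    # Normalize the gaze heatmap into a probability distribution.
    gaze = gaze_mask.view(b, -1)
    gaze = gaze / (gaze.sum(dim=-1, keepdim=True) + eps)
    # KL(gaze || attention): penalizes attention mass far from expert gaze.
    return F.kl_div(attn_log, gaze, reduction="batchmean")

# Hypothetical total objective: behavior cloning loss plus the gaze
# regularizer, weighted by a hyperparameter lambda_gaze:
#   loss = bc_loss + lambda_gaze * gaze_regularization_loss(logits, mask)
```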
To show the effectiveness of our method, we conduct experiments in Atari environments as well as a more realistic benchmark, Bench2Drive, developed in CARLA. We collect a dataset of 1,160 minutes of human experts playing Atari games and another dataset of 71 minutes of expert driving in CARLA, both with recorded gaze data.
Our Atari dataset consists of 15 Atari games played for 1,160 minutes in total. Each game was rendered at a frame rate convenient for the player, ranging from 10 to 20 FPS. While the player viewed the game at full-screen resolution, we recorded observations as grayscale images downscaled to 84×84. The recordings also contain the corresponding gaze data and discrete controller actions for each observation. All games were played with a frame skip of 4 and a sticky-action probability of 0.25. Here are some samples from our dataset ...
[Figure: sample episodes from the Atari dataset; each sample shows the collected episode alongside the same frames with the overlaid gaze mask.]
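For reference, the frame-skip and sticky-action settings above match the standard ALE v5 configuration. The sketch below reproduces these environment settings through Gymnasium's ALE interface; the game choice is just an example, and our actual capture pipeline is not shown.

```python
import gymnasium as gym
import ale_py
from gymnasium.wrappers import ResizeObservation

gym.register_envs(ale_py)  # required for gymnasium >= 1.0

# ALE v5 settings matching the dataset: frame skip 4, sticky actions
# with probability 0.25, grayscale observations downscaled to 84x84.
env = gym.make(
    "ALE/MsPacman-v5",               # example game; the dataset covers 15 games
    frameskip=4,
    repeat_action_probability=0.25,
    obs_type="grayscale",
)
env = ResizeObservation(env, (84, 84))

obs, info = env.reset(seed=0)
print(obs.shape)  # (84, 84)
```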
We used the open-source CARLA 0.9.15 simulator for urban driving. Specifically, we used the Leaderboard 2.0 framework to execute scenarios and record data. We leveraged the recently proposed Bench2Drive benchmark, comprising 44 driving tasks across different towns and weather conditions. From these, we selected a diverse subset of 10 driving tasks with the highest potential for causal confusion and collected 20 expert demonstrations with continuous actions for each task. The recordings contain 320×180 RGB images from a front-view camera with a 60° FOV, together with the collected gaze coordinates and the continuous actions (brake, steering angle, and throttle). Here are some examples of our CARLA dataset ...
[Figure: sample episodes from the CARLA dataset; each sample shows the collected episode alongside the same frames with the overlaid gaze mask.]
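As a reference for the sensor setup, a front camera with these parameters can be declared through the Leaderboard 2.0 agent interface, as sketched below; the class name and mounting pose are illustrative assumptions, while the resolution and FOV match the dataset.

```python
from leaderboard.autoagents.autonomous_agent import AutonomousAgent

class GazeRecordingAgent(AutonomousAgent):
    """Sketch of a Leaderboard 2.0 agent declaring the front camera
    used during data collection (only the sensor spec is shown)."""

    def sensors(self):
        return [
            {
                "type": "sensor.camera.rgb",
                # Mounting pose relative to the ego vehicle (hypothetical values)
                "x": 1.5, "y": 0.0, "z": 1.7,
                "roll": 0.0, "pitch": 0.0, "yaw": 0.0,
                # Resolution and field of view as recorded in the dataset
                "width": 320, "height": 180, "fov": 60,
                "id": "front_camera",
            }
        ]
```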