
Trajectory Improvement and Reward Learning
from Comparative Language Feedback

1 University of Southern California 2 University of California, Berkeley
CoRL 2024

Abstract

Learning from human feedback has gained traction in fields like robotics and natural language processing in recent years. While prior works mostly rely on human feedback in the form of comparisons, language is a preferable modality that provides more informative insights into user preferences. In this work, we aim to incorporate comparative language feedback to iteratively improve robot trajectories and to learn reward functions that encode human preferences. To achieve this goal, we learn a shared latent space that integrates trajectory data and language feedback, and subsequently leverage the learned latent space to improve trajectories and learn human preferences. To the best of our knowledge, we are the first to incorporate comparative language feedback into reward learning. Our simulation experiments demonstrate the effectiveness of the learned latent space and the success of our learning algorithms. We also conduct human subject studies that show our reward learning algorithm achieves a 23.9% higher subjective score on average and is 11.3% more time-efficient compared to preference-based reward learning, underscoring the superior performance of our method.

Approach


Our approach is composed of two stages. First, we learn a shared latent space in which robot trajectories and human language feedback are aligned. To align the trajectory embeddings with the language-feedback embeddings, we initially freeze the T5 model and train only the trajectory encoder, and subsequently co-finetune both components. Second, we leverage the learned latent space to improve robot trajectories or to learn human preferences.
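
As a rough sketch of the alignment stage, the code below trains a trajectory encoder against a frozen T5 text encoder with a cosine-similarity loss between the language embedding and the difference of the two compared trajectories' embeddings. The MLP encoder, the 64-dimensional latent size, the mean pooling, and the loss choice are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, T5EncoderModel

    class TrajectoryEncoder(nn.Module):
        """Maps a flattened trajectory (T timesteps x state dim) into the shared latent space."""
        def __init__(self, traj_dim: int, latent_dim: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(traj_dim, 256), nn.ReLU(),
                nn.Linear(256, latent_dim),
            )

        def forward(self, traj: torch.Tensor) -> torch.Tensor:
            return self.net(traj)

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    t5 = T5EncoderModel.from_pretrained("t5-small")

    def embed_language(sentences, project: nn.Linear) -> torch.Tensor:
        """Mean-pool the T5 encoder states, then project into the shared latent space."""
        toks = tokenizer(sentences, return_tensors="pt", padding=True)
        hidden = t5(**toks).last_hidden_state             # (B, L, 512) for t5-small
        mask = toks.attention_mask.unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
        return project(pooled)

    traj_dim, latent_dim = 500, 64                         # illustrative dimensions
    traj_enc = TrajectoryEncoder(traj_dim, latent_dim)
    lang_proj = nn.Linear(512, latent_dim)

    # Stage 1a: freeze T5 and train only the trajectory encoder (and projection head).
    for p in t5.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(
        list(traj_enc.parameters()) + list(lang_proj.parameters()), lr=1e-4
    )

    # One illustrative batch: (traj_a, traj_b, feedback describing how traj_b compares to traj_a).
    traj_a = torch.randn(8, traj_dim)
    traj_b = torch.randn(8, traj_dim)
    feedback = ["Move the end effector closer to the cup."] * 8

    z_diff = traj_enc(traj_b) - traj_enc(traj_a)           # direction of the trajectory change
    z_lang = embed_language(feedback, lang_proj)           # direction indicated by the language
    loss = -nn.functional.cosine_similarity(z_diff, z_lang).mean()
    loss.backward()
    optimizer.step()
    # Stage 1b (co-finetuning) would unfreeze the T5 parameters and repeat the same loop.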

Simulation Results

We explored two primary ways to leverage the learned latent space: first, to iteratively improve the robot's trajectory, and second, to accurately learn user preferences. The experiments are conducted in the Robosuite and Metaworld environments.

Improve Trajectories


We leveraged the latent space to iteratively improve an initial suboptimal robot trajectory using simulated human language feedback. In both environments, we consistently improve the trajectories, which showcases the effectiveness of the learned latent space and the improvement algorithm. However, our algorithm does not reach the performance of the optimal trajectory. This is because every improvement iteration is completely independent of the previous iterations, so the robot may get stuck in a loop between good but non-optimal trajectories.
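
A single improvement iteration could look like the following sketch, which assumes the candidate trajectories and the language feedback have already been embedded into the shared latent space (e.g., by encoders like those above). Selecting the candidate whose latent offset best matches the feedback direction is our illustrative reading of the procedure, not a verbatim reproduction of the paper's code.

    import torch

    def improve_once(z_current: torch.Tensor,
                     z_feedback: torch.Tensor,
                     z_candidates: torch.Tensor) -> int:
        """Return the index of the candidate whose latent offset from the current
        trajectory best matches the direction indicated by the language feedback."""
        offsets = z_candidates - z_current                 # (N, d)
        scores = torch.nn.functional.cosine_similarity(
            offsets, z_feedback.expand_as(offsets), dim=-1
        )
        return int(scores.argmax())

    # Toy usage: 32 candidate trajectories in a 64-dimensional latent space.
    latent_dim = 64
    z_candidates = torch.randn(32, latent_dim)
    z_current = z_candidates[0]
    z_feedback = torch.randn(latent_dim)                   # e.g., "move closer to the cup"
    best = improve_once(z_current, z_feedback, z_candidates)
    print(f"Switch to candidate trajectory {best}")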

Preference Learning


We also utilized the learned latent space to learn the user's preference, i.e., their reward function. The reward weights were randomly initialized to simulate human feedback. As evaluation metrics, we adopt the cross-entropy between the learned and true rewards, and the true reward value of the trajectory that is optimal under the learned reward. The results demonstrate that our approach converges more quickly than the baseline.
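
A minimal sketch of such a simulated preference-learning loop is given below, assuming a reward that is linear in the shared latent embedding and a logistic likelihood for the feedback direction; the variable names, the update rule, and the synthetic data are illustrative assumptions rather than the paper's exact algorithm.

    import torch

    latent_dim = 64
    w_true = torch.randn(latent_dim)            # randomly initialized reward weights of the simulated user
    w_learned = torch.zeros(latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([w_learned], lr=0.05)

    def simulated_feedback_direction(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        """Simulated user: the language feedback points toward the higher-reward trajectory."""
        return (z_b - z_a) if (z_b - z_a) @ w_true > 0 else (z_a - z_b)

    for step in range(200):
        z_a, z_b = torch.randn(latent_dim), torch.randn(latent_dim)   # latent embeddings of two trajectories
        z_lang = simulated_feedback_direction(z_a, z_b)
        # The feedback says reward increases along z_lang; maximize its log-likelihood.
        loss = -torch.nn.functional.logsigmoid(w_learned @ z_lang)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluation in the spirit of the text: cross-entropy between preference
    # probabilities under the learned and the true reward on held-out pairs.
    with torch.no_grad():
        z_a, z_b = torch.randn(100, latent_dim), torch.randn(100, latent_dim)
        p_true = torch.sigmoid((z_b - z_a) @ w_true)
        p_learned = torch.sigmoid((z_b - z_a) @ w_learned)
        ce = torch.nn.functional.binary_cross_entropy(p_learned, p_true)
        print(f"cross-entropy between learned and true preferences: {ce:.3f}")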

User Studies

To further verify the effectiveness of our approach, we conducted human subject studies by recruiting 10 subjects (4 female, 6 male) from varying backgrounds and observing them interact with our real robot. The subjects participated in trajectory improvement and preference learning studies similar to those in the simulation experiments.

Improve Trajectories


The subjects consistently gave positive responses regarding user experience and speed of adaptation, but the number of iterations to satisfaction shows two peaks. We conjecture that the dataset of 32 trajectories may not have contained trajectories that the users desired.

Preference Learning


Our language-based method scored better than the comparison-based method on all attributes. Averaged over all attributes, the language-based method's score is 23.9% higher than that of the comparison-based method.

In terms of the user ratings of the optimal trajectories (rightmost figure), we quantitatively assessed learning efficiency by computing the area under the curve (AUC) for both lines. We found that the AUC for the comparative language line is statistically significantly higher than that of the preference comparison line (p < 0.05). This indicates that our language-based approach captures human preferences more efficiently than the comparison-based method.
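
The AUC comparison can be reproduced in spirit with a few lines of NumPy/SciPy; the trapezoidal integration, the paired t-test, and the synthetic per-user rating curves below are illustrative assumptions, not the exact analysis used in the study.

    import numpy as np
    from scipy import stats
    from scipy.integrate import trapezoid

    rng = np.random.default_rng(0)
    n_users, n_iters = 10, 8
    # Hypothetical per-user rating curves (one row per user, one column per learning iteration).
    ratings_language = rng.uniform(4, 7, size=(n_users, n_iters))
    ratings_comparison = rng.uniform(3, 6, size=(n_users, n_iters))

    auc_language = trapezoid(ratings_language, axis=1)     # one AUC per user
    auc_comparison = trapezoid(ratings_comparison, axis=1)

    t_stat, p_value = stats.ttest_rel(auc_language, auc_comparison)
    print(f"mean AUC (language)   = {auc_language.mean():.2f}")
    print(f"mean AUC (comparison) = {auc_comparison.mean():.2f}")
    print(f"paired t-test p-value = {p_value:.4f}")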

Videos

BibTeX


        @inproceedings{yang2024trajectory,
          title={Trajectory Improvement and Reward Learning from Comparative Language Feedback},
          author={Yang, Zhaojing and Jun, Miru and Tien, Jeremy and Russell, Stuart J. and Dragan, Anca and Biyik, Erdem},
          booktitle={8th Annual Conference on Robot Learning},
          year={2024}
        }