Human-in-the-loop learning has gained traction in fields like robotics and natural language processing in recent years. Prior work mostly relies on human feedback in the form of preference comparisons because of its ease of use, but this feedback type has significant limitations: it does not let users explain the reasons behind their preferences and provides only a binary signal for learning, resulting in severe data inefficiency. Consequently, training robots requires a substantial amount of human feedback, which takes considerable time and burdens the user. To overcome these challenges, we build on the insight that language is a richer medium than comparisons, conveying more information about user preferences. In this work, we therefore incorporate comparative language feedback to iteratively improve robot trajectories and learn reward functions that encode human preferences. We learn a shared latent space that aligns trajectory data and language feedback, and subsequently leverage this latent space to improve trajectories and learn human preferences. As a result, our method extracts more information about user preferences while remaining intuitive and easy to use. Finally, we introduce an active learning method that integrates comparative language feedback to further boost data efficiency. Results from simulation experiments and user studies demonstrate the effectiveness of the learned latent space and the success of our learning algorithms. In the user studies, our reward learning algorithm achieves a 23.9% higher subjective score on average and 11.3% higher time efficiency than the preference comparison method, underscoring the advantage of our approach. Lastly, our active querying method further improves the user experience, with an 8.31% average improvement in subjective scores compared to random querying.
Our approach is composed of three stages: (i) we learn a shared latent space where robot trajectories and comparative language feedback are aligned, (ii) we leverage this learned latent space for trajectory improvement and reward learning, and finally (iii) we further enhance preference learning through active query selection to achieve higher data efficiency.
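As an illustration of stage (i), the sketch below shows one way such a shared latent space could be trained, assuming a small trajectory encoder and pre-computed sentence embeddings for the comparative language feedback. The encoder architecture, the InfoNCE-style objective, and the temperature are our assumptions for illustration, not the exact formulation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Maps a flattened trajectory to the shared latent space (hypothetical architecture)."""
    def __init__(self, traj_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, traj):
        return self.net(traj)

def alignment_loss(traj_enc, traj_a, traj_b, lang_emb, temperature=0.1):
    """Encourage phi(traj_b) - phi(traj_a) to point in the direction of the
    language embedding describing how traj_b improves on traj_a.
    Matching (offset, language) pairs in the batch are positives; all other
    pairs serve as negatives (InfoNCE-style objective, assumed here)."""
    offset = F.normalize(traj_enc(traj_b) - traj_enc(traj_a), dim=-1)  # (B, d)
    lang = F.normalize(lang_emb, dim=-1)                               # (B, d)
    logits = offset @ lang.T / temperature                             # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```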
We explored two primary ways to leverage the learned latent space: first, to iteratively improve the robot's trajectory, and second, to accurately learn user preferences. The experiments are conducted in the Robosuite and Metaworld environments.
We leveraged the latent space to iteratively improve an initially suboptimal robot trajectory using simulated human language feedback. In both environments, our method consistently improves the trajectories, which showcases the effectiveness of the learned latent space and the improvement algorithm. However, our algorithm does not reach the performance of the optimal trajectory: every improvement iteration is independent of the previous ones, so the robot may get stuck in a loop between good but non-optimal trajectories.
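A minimal sketch of one improvement iteration is shown below, assuming the robot selects its next trajectory from a fixed candidate set by cosine similarity in the learned latent space; the function name and selection rule are ours for illustration. Because each call re-selects from the same candidate set without memory of past iterations, the process can oscillate between good but non-optimal trajectories, which matches the limitation noted above.

```python
import torch
import torch.nn.functional as F

def improve_trajectory(current_traj, feedback_emb, candidate_trajs, traj_enc):
    """One improvement iteration (hypothetical rule): among a fixed set of
    candidate trajectories, pick the one whose latent offset from the current
    trajectory best aligns with the user's language feedback embedding."""
    with torch.no_grad():
        z_cur = traj_enc(current_traj.unsqueeze(0))            # (1, d)
        z_cand = traj_enc(candidate_trajs)                     # (N, d)
        offsets = F.normalize(z_cand - z_cur, dim=-1)          # (N, d)
        scores = offsets @ F.normalize(feedback_emb, dim=0)    # (N,)
    return candidate_trajs[scores.argmax()]
```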
We also utilized the learned latent space to learn the user's preferences, i.e., their reward function. To simulate human feedback, the reward weights were randomly initialized. As evaluation metrics, we use the cross-entropy between the learned and true rewards, and the true reward value of the trajectory that is optimal under the learned reward. The results demonstrate that our approach converges more quickly than the baseline.
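As a rough sketch of the reward-learning step, the code below assumes the reward is linear in a trajectory's latent features and that each piece of language feedback enters through a logistic likelihood over the weight vector; the likelihood form, learning rate, and unit-norm projection are our assumptions, not the paper's exact update.

```python
import torch
import torch.nn.functional as F

def update_reward_weights(w, feedback_emb, lr=0.1):
    """One reward-learning step under an assumed logistic feedback model,
    P(feedback | w) = sigmoid(w . l), where l is the normalized language
    embedding: take a gradient step on the negative log-likelihood and
    re-project the weights onto the unit sphere."""
    l = F.normalize(feedback_emb, dim=0)
    w = w.detach().clone().requires_grad_(True)
    nll = F.softplus(-(w * l).sum())          # equals -log sigmoid(w . l)
    nll.backward()
    with torch.no_grad():
        w_new = w - lr * w.grad
        return w_new / w_new.norm()

def best_trajectory_index(w, traj_latents):
    """Index of the trajectory maximizing the learned linear reward w . phi(traj)."""
    return int((traj_latents @ w).argmax())
```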
To further verify the effectiveness of our approach, we conducted human subject studies, recruiting 10 subjects (4 female, 6 male) from varying backgrounds and observing them interact with our real robot. The subjects participated in trajectory improvement and preference learning studies similar to those in the simulation experiments.
The subjects consistently gave positive responses for user experience and speed of adaptation, but the number of iterations until satisfaction showed two peaks. We conjecture that the dataset of 32 trajectories may not have contained the trajectories the users desired.
Our language-based method scored better than the comparison-based method on all attributes. Averaged over all attributes, the language-based method's score is 23.9% higher than that of the comparison-based method.
To quantitatively assess learning efficiency in terms of the user rating of the optimal trajectories (rightmost figure), we computed the area under the curve (AUC) for both lines. The AUC for the comparative language line is statistically significantly higher than that of the preference comparison line (p < 0.05), indicating that our language-based approach captures human preferences more efficiently than the comparison-based method.
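For reference, the AUC of such a learning curve can be computed with the trapezoidal rule; the helper below is an illustrative sketch (the function name and use of np.trapz are ours), not the exact evaluation script.

```python
import numpy as np

def learning_curve_auc(ratings, queries=None):
    """Area under a learning curve (e.g., user rating of the current best
    trajectory vs. number of queries), via the trapezoidal rule."""
    x = np.arange(len(ratings)) if queries is None else np.asarray(queries)
    return np.trapz(np.asarray(ratings, dtype=float), x)
```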
@inproceedings{yang2024trajectory,
title={Trajectory Improvement and Reward Learning from Comparative Language Feedback},
author={Yang, Zhaojing and Jun, Miru and Tien, Jeremy and Russell, Stuart J. and Dragan, Anca and Biyik, Erdem},
booktitle={8th Annual Conference on Robot Learning},
year={2024}
}