Learning from demonstrations, particularly from biological experts such as humans and animals, often faces significant data acquisition challenges. While recent approaches leverage internet videos for learning, they require complex, task-specific pipelines to extract and retarget motion data for the agent. In this work, we introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to learn its reward directly from internet videos, bypassing dedicated data preparation. The framework has two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. We validate the method on reward learning from YouTube videos; the results show that it enables efficient reward design from expert videos of biological agents for complex behavior synthesis.
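The "chain rule" can be written out explicitly. Denoting by L the mismatch between the robot's behavior and the expert video, by R the reward, and by ξ(π_R) the behavior of the policy trained under R, the two feedback terms described in the next paragraph compose as follows; the notation below is a reconstruction from that description, not copied verbatim from the paper:

```latex
\frac{\partial L}{\partial R}
  \;\approx\;
  \underbrace{\frac{\partial L}{\partial \xi(\pi_R)}}_{\text{visual feedback (VLM)}}
  \cdot
  \underbrace{\frac{\partial \xi(\pi_R)}{\partial R}}_{\text{reward feedback (LLM)}}
```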
Our bi-level framework combines a vision-language model (VLM) and a large language model (LLM) to enable robots to learn behaviors from expert demonstration videos. At the upper level, the VLM (Gemini-1.5 Pro) compares a biological demonstration video with a video of the robot's behavior and generates "visual feedback" (playing the role of ∂L/∂ξ(π_R)) that suggests how the robot should adjust its motion to better replicate the expert's. This visual feedback is then passed to the lower-level LLM (GPT-4o), which translates it into "reward feedback" (playing the role of ∂ξ(π_R)/∂R) by directly modifying the robot's reward function, encoded as Python code. Using contextual information from the robot's environment code, the LLM adapts the reward code to reinforce behaviors aligned with the VLM's guidance. The updated reward code is then integrated into a reinforcement learning (RL) loop in the Isaac Gym environment, where the robot's policy is refined under the new reward. This framework combines visual and textual feedback into a robust, adaptive reward-learning system that continuously improves the robot's behavior toward the expert demonstrations.
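To make the loop concrete, here is a minimal Python sketch of the bi-level update under stated assumptions: the helper callables (query_vlm, query_llm, train_policy, render_rollout), the prompt wording, and the iteration count are illustrative placeholders, not the actual interface used in the paper.

```python
# Minimal sketch of the bi-level reward-learning loop. The helper callables are
# illustrative assumptions, passed in as arguments so the sketch stays self-contained.
def bilevel_reward_learning(expert_video, reward_code, env_code,
                            query_vlm, query_llm, train_policy, render_rollout,
                            num_rounds=5):
    """Iteratively refine the reward code so the trained policy matches the expert video."""
    for _ in range(num_rounds):
        # Inner RL loop: train a policy under the current reward (e.g., PPO in Isaac Gym)
        # and render a video of the resulting behavior.
        policy = train_policy(env_code, reward_code)
        robot_video = render_rollout(policy, env_code)

        # Upper level (VLM): compare expert and robot videos and return textual
        # "visual feedback" -- how the robot's motion should change (the dL/d xi term).
        visual_feedback = query_vlm(
            model="gemini-1.5-pro",
            videos=[expert_video, robot_video],
            prompt="Compare the two videos and describe how the robot's motion "
                   "should change to better match the expert.",
        )

        # Lower level (LLM): translate the visual feedback into an updated reward
        # function written as Python code (the d xi/dR term), using the environment
        # code as context.
        reward_code = query_llm(
            model="gpt-4o",
            prompt=(f"Environment code:\n{env_code}\n\n"
                    f"Current reward code:\n{reward_code}\n\n"
                    f"Visual feedback:\n{visual_feedback}\n\n"
                    "Rewrite the reward function so training reinforces the suggested changes."),
        )
    return reward_code
```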
We evaluate our approach on three robots (Ant, Humanoid, and ANYmal) in Isaac Gym environments, learning rewards from video demonstrations of their biological counterparts: a spider, a human athlete, and a dog. The biological motion videos, obtained directly from YouTube, are used to train the robots on a range of skillful motion tasks: Spider Walking, Spider Jumping, Human Running, Human Split Landing, and Dog Hopping.
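As an illustration, this setup can be summarized as a small task configuration; the robot-to-video pairing follows the "biological counterpart" mapping above, while the dictionary keys and placeholder video paths below are assumptions for illustration only.

```python
# Illustrative task configuration (keys and video paths are placeholders, not from the paper).
TASKS = {
    "spider_walking":      {"robot": "Ant",      "expert_video": "videos/spider_walking.mp4"},
    "spider_jumping":      {"robot": "Ant",      "expert_video": "videos/spider_jumping.mp4"},
    "human_running":       {"robot": "Humanoid", "expert_video": "videos/human_running.mp4"},
    "human_split_landing": {"robot": "Humanoid", "expert_video": "videos/human_split_landing.mp4"},
    "dog_hopping":         {"robot": "ANYmal",   "expert_video": "videos/dog_hopping.mp4"},
}
```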
To demonstrate the benefits of our bi-level design, we conduct an ablation study comparing the proposed VLM-LLM bi-level method with a single-level VLM that directly processes expert videos to generate reward updates; a sketch of this baseline is given after this paragraph. We run five experiments on the Spider Walking and Human Running tasks, and the findings clearly show the advantage of the VLM-LLM bi-level design, which we attribute to its hierarchical structure: the VLM focuses on high-level behavioral comparison and reward guidance, while the LLM specializes in environment-specific code generation and refinement.
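For contrast, the single-level ablation collapses both steps into a single VLM call. The sketch below is an assumed illustration reusing the hypothetical helpers from the earlier sketch, not the paper's actual baseline implementation.

```python
# Illustrative single-level ablation: the VLM watches both videos and rewrites the
# reward code directly, with no intermediate visual-feedback / LLM translation step.
def single_level_reward_update(expert_video, robot_video, reward_code, env_code, query_vlm):
    return query_vlm(
        model="gemini-1.5-pro",
        videos=[expert_video, robot_video],
        prompt=(f"Environment code:\n{env_code}\n\n"
                f"Current reward code:\n{reward_code}\n\n"
                "Rewrite the reward function so the trained robot better matches the expert video."),
    )
```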
@misc{mahesheka2024languagemodelassistedbilevelprogrammingreward,
  title={Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos},
  author={Harsh Mahesheka and Zhixian Xie and Zhaoran Wang and Wanxin Jin},
  year={2024},
  eprint={2410.09286},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2410.09286},
}