Tidbits from Tedrake, Zeng, Xiao, Koltun, and Wang et al.

by Owen Trueblood
2024 March 28

I spent the whole day just following my curiosity, watching lectures and reading papers. If you take a look, it’s probably best to skim through and see if anything catches your eye. I saw a few interesting connections that I tried to highlight. The first three lectures are great overviews of the big recent trends in robot intelligence. The drone lecture felt like a refreshing detour into a concrete application of DL techniques. Then GLiDE hurt my brain at first, but it eventually resolved into what looks like a really neat answer to some questions I had from ALOHA/ACT about why it is bad at recovering from big failures and how that could be improved.

Video: Princeton Robotics - Russ Tedrake - Dexterous Manipulation with Diffusion Policies

Video: Andy Zeng: From words to actions (slides)

[^2]: [Transporter Networks: Rearranging the Visual World for Robotic Manipulation](https://transporternets.github.io/)
[^3]: Implicit Behavioral Cloning
[^4]: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, a.k.a. SayCan
[^5]: Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
[^6]: Visual Language Maps for Robot Navigation - This reminded me of OK-Robot, where you can tell a robot to move things around in a space, and yes, they cite this in that paper.
[^7]: Language to Rewards for Robotic Skill Synthesis
[^8]: Code as Policies: Language Model Programs for Embodied Control
[^9]: Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners
[^10]: PaLM-E: An Embodied Multimodal Language Model
[^11]: Large Language Models as General Pattern Machines


Video: Stanford CS25: V2 I Robotics and Imitation Learning

[^12]: [Inner Monologue: Embodied Reasoning through Planning with Language Models](https://innermonologue.github.io/)
[^13]: Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models, a.k.a. DIAL
[^14]: Can Wikipedia Help Offline Reinforcement Learning?


Video: MIT Robotics – Vladlen Koltun – A Quiet Revolution in Robotics Continued (paper)

This is about building a deep learning system that trains a policy able to run onboard a racing drone and beat the best human pilots. I thought it was interesting relative to the above videos in the following ways:

[^15]: [Driving Policy Transfer via Modularity and Abstraction](https://arxiv.org/abs/1804.09364)
[^16]: Learning robust perceptive locomotion for quadrupedal robots in the wild

Grounding Language Plans in Demonstrations Through Counterfactual Perturbations (GLiDE)

The way I understand it, they are trying to address the problem where an RL system doesn’t learn that only certain paths to a goal are valid when the goal is reached via a long sequence of steps. For example, if a robot is given the task of taking a soda can and dropping it in the garbage, it might successfully go pick up the can, but if the can is knocked out of its gripper along the way it might “recover” by just continuing on to the garbage with an empty gripper and then making the motion to drop something there. Instead it should realize that the goal won’t be fulfilled if the can doesn’t end up in the trash, so it should pick the can back up. This issue was discussed a bit in Stanford CS25: V2 I Robotics and Imitation Learning above. It’s also a problem I saw exhibited in the results from ACT [5].

The “modes” discussed in GLiDE are like the states in a state machine. If the policy doesn’t follow the transitions appropriate to the task (e.g. empty gripper -> holding can -> drop in trash can) then the task won’t be completed successfully.
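To make that concrete, here is a tiny toy sketch (my own illustration, not code from the paper) of the soda-can task’s modes as a state machine, where skipping a mode means the task can no longer succeed:

```python
# Toy illustration of task "modes" as a state machine (mode names are made up
# for the soda-can example, not taken from GLiDE).

VALID_TRANSITIONS = {
    "empty_gripper": {"holding_can"},    # must pick the can up first
    "holding_can":   {"over_trash"},     # then carry it to the trash
    "over_trash":    {"task_complete"},  # then drop it in
    "task_complete": set(),
}

def transition_is_valid(current_mode: str, next_mode: str) -> bool:
    """A transition that skips a mode (e.g. empty_gripper -> over_trash)
    means the task can no longer succeed and the robot should recover."""
    return next_mode in VALID_TRANSITIONS.get(current_mode, set())

# Dropping the can puts the robot back in "empty_gripper"; going straight to
# the trash from there is invalid, so the right recovery is to re-grasp.
assert transition_is_valid("empty_gripper", "holding_can")
assert not transition_is_valid("empty_gripper", "over_trash")
```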

They construct a prompt that causes the LLM to generate a state machine for the task, which is expressed by the user in natural language. The high-level task plan from the LLM will incorporate “common sense”. But then the question is: how do you connect that to trajectories on the robot? The answer: a human runs demonstrations where they complete the task successfully. Then the system automatically generates “counterfactual” demonstrations that it thinks would fail the task, supervised by another component that has been trained to guess whether the overall goal of the task has been completed successfully (e.g. is the soda can in the trash - Inner Monologue [3] also had such a success detector). Taking the good and bad demonstrations together with the “modes” guessed by the LLM (the states of the state machine), the system then learns how the different parts of the demonstration trajectories map onto those modes. That’s the “grounding classifier”. The end result is a system that can look at what the robot is doing in real time, guess where it is in the state machine for the task, and use that knowledge to restart the task at an appropriate place if it gets messed up.
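As a heavily simplified sketch of the counterfactual idea, here is a runnable toy in Python. Everything in it - the trajectory format, the perturbation, and the success check - is my own stand-in for illustration, not the authors’ code; GLiDE would train its grounding classifier on positive/negative pairs like the ones generated here.

```python
import random

# A demo is a list of (mode, holding_can) snapshots ending at the trash can.
good_demo = [
    ("empty_gripper", False),
    ("holding_can", True),
    ("holding_can", True),
    ("over_trash", True),
    ("task_complete", False),  # can has been released inside the trash
]

def task_succeeded(traj):
    """Stand-in success detector: the can must still be held at the trash."""
    return any(mode == "over_trash" and holding for mode, holding in traj)

def perturb(traj):
    """Counterfactual perturbation: the can gets knocked out of the gripper
    at a random step, but the arm keeps executing the rest of the motion."""
    t = random.randrange(1, len(traj) - 1)
    return [(mode, holding and i < t) for i, (mode, holding) in enumerate(traj)]

counterfactuals = [perturb(good_demo) for _ in range(5)]
labels = [task_succeeded(t) for t in counterfactuals]
print(task_succeeded(good_demo), labels)  # True for the demo, False for the perturbed runs
```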

This is super cool, because it provides a way to take advantage of the common sense that an LLM has about the structure of tasks that involve many high level steps (long horizon planning) without tinkering with the structure of the LLM (so it works out of the box with, say, ChatGPT). It kind of learns to translate that common sense into the real-time input->action space that the robot is operating in. But on another level this still feels kind of clunky because of the LLM prompting. How could this kind of thing be implemented down in a continuous space where you don’t have to pull the info out as text first? Like more in the spirit of Large Language Models as General Pattern Machines[^11]?

  1. TossingBot

  2. [Transporter Networks: Rearranging the Visual World for Robotic Manipulation](https://transporternets.github.io/)

  3. [Inner Monologue: Embodied Reasoning through Planning with Language Models](https://innermonologue.github.io/)

  4. [Driving Policy Transfer via Modularity and Abstraction](https://arxiv.org/abs/1804.09364)

  5. [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://tonyzhaozh.github.io/aloha/)