Presenter: Mac Schwager, Associate Professor
How general are generalist robot policies? Data scaling, diagnostic tools, and
memorization in VLAs
Abstract:
Vision-Language-Action (VLA) policies have recently emerged as a
promising paradigm for generalist robot autonomy. However, VLAs have
several challenges that must be overcome before they can achieve their
potential. First, these models require fine-tuning with
human-teleoperation demonstrations, which can be tedious, expensive,
and time-consuming to collect. Second, policy performance is
limited by teleop demonstration quality, which can vary widely with
the human teleoperator's skill and the dexterity barrier of
the teleop interface. Third, VLA models, with the current state of
practice, appear to suffer from strong overfitting to the fine-tuning
data. Together, these issues lead to "generalist" policies that do not
generalize very well. In this talk, I will describe recent work in my
lab to address each of these problems. I will describe techniques we
have developed to scale up demonstration data by leveraging 3D
Gaussian Splatting models and optimization-based planning experts to
generate arbitrary volumes of high-quality visual demonstrations to
augment or replace human teleop data. I will also describe our work on
multi-task progress models that track the progress of a demonstration
from visual inputs and text prompts. These models can be used to
filter human teleop data for high-quality training examples, and as
an online performance monitor during policy execution for fault
detection, recovery guidance, and diagnostics.
describe our work on memorization vs. generalization in visuo-motor
policies, where we find that current fine-tuning practices cause
overfitting to the training data, limiting a VLA's generalization
capabilities. I will explore some remedies for this problem. The
talk will include experimental results for drone navigation policies,
drone aerial manipulation policies, and table-top manipulation
policies.
Mac Schwager
Associate Professor
Aeronautics and Astronautics
Computer Science (by courtesy)
Stanford University
Bio:
Mac Schwager is an Associate Professor of Aeronautics and Astronautics
and Computer Science (by courtesy) at Stanford University. He directs
the Multi-Robot Systems Lab (MSL), where he studies robot autonomy. He
is interested in learning-based autonomy for UAVs, manipulators, and
robotic vehicles; 3D mapping and SLAM; algorithmic and analytical
tools for verifiable safety in learning-based autonomy; and
collaborative intelligence in groups of robots and animals. He
obtained his BS degree from Stanford, and his MS and PhD degrees from
MIT. He was a postdoctoral researcher at the University of
Pennsylvania and MIT. He received the NSF CAREER Award in 2014 and
the DARPA Young Faculty Award in 2018, and has received numerous best
paper awards in conferences and journals, including the IEEE
Transactions on Robotics Best Paper Award in 2016, the Best Paper
Award in Robot Manipulation at ICRA 2018, and the Best Paper Award in
Multi-Robot Systems at ICRA 2020.