Here, we show unconditional rollouts from BeT models trained on multi-modal demonstrations in the CARLA, Block push, and Franka Kitchen environments. Because BeT's architecture captures multi-modality, successive rollouts in the same environment can achieve different goals, or the same goals in different ways.
BeT is built on three key insights:
We use k-means clustering to group continuous actions into discrete bins. The bin centers, learned from the offline data, are used to decompose each continuous action into a discrete and a continuous component. These components can be recombined into a full, continuous action at any time.
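The decomposition above can be sketched as follows. This is a minimal, illustrative implementation, not the paper's code: `fit_action_bins` and `split_action` are hypothetical names, and a toy k-means loop stands in for whatever clustering implementation the authors used.

```python
import numpy as np

def fit_action_bins(actions, k=8, iters=50, seed=0):
    """Toy k-means over a dataset of continuous actions (N, action_dim).
    Returns (k, action_dim) bin centers learned from the offline data."""
    rng = np.random.default_rng(seed)
    centers = actions[rng.choice(len(actions), size=k, replace=False)]
    for _ in range(iters):
        # Assign each action to its nearest center, then recompute centers.
        dists = np.linalg.norm(actions[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = actions[labels == j].mean(axis=0)
    return centers

def split_action(action, centers):
    """Decompose a continuous action into (discrete bin index, residual offset).
    Recombining is exact: centers[j] + residual == action."""
    j = int(np.linalg.norm(centers - action, axis=-1).argmin())
    return j, action - centers[j]
```

Because the residual is defined as the exact offset from the nearest center, the split is lossless by construction.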
Our MinGPT model learns to predict a categorical distribution over the bins, as well as a residual continuous component of an action given its bin. We train the bin predictor using a negative log-likelihood based Focal loss, and the residual action predictor using a multi-task loss.
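As a reference for the bin-prediction objective, here is the standard Focal loss, -(1 - p_t)^γ log(p_t), written out for a single example. The function name and signature are illustrative, not the paper's API; with γ = 0 it reduces to the ordinary negative log-likelihood.

```python
import numpy as np

def focal_loss(logits, target, gamma=2.0):
    """Focal loss over bin logits for one example.
    Down-weights easy (high-probability) bins by the factor (1 - p_t)**gamma."""
    z = logits - logits.max()          # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum()
    pt = probs[target]                 # probability of the ground-truth bin
    return -((1.0 - pt) ** gamma) * np.log(pt)
```

Since (1 - p_t)^γ ≤ 1, the Focal loss never exceeds the plain NLL; the γ exponent simply shifts training emphasis toward rarely-predicted bins, which helps keep minority action modes alive.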
At test time, our model predicts a bin, then adds the associated residual continuous action to the bin center to reconstruct a full continuous action to execute in the environment.
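The test-time decoding step can be sketched as below. `decode_action` is a hypothetical helper: it assumes the model has already produced a categorical distribution over bins and one residual offset per bin, which is one plausible reading of the setup described above.

```python
import numpy as np

def decode_action(bin_probs, offsets, centers, rng):
    """Sample a bin from the predicted categorical distribution, then add
    that bin's predicted residual offset to its center to form the full
    continuous action executed in the environment."""
    j = rng.choice(len(bin_probs), p=bin_probs)
    return j, centers[j] + offsets[j]
```

Sampling the bin (rather than taking the argmax) is what lets successive rollouts commit to different modes of the demonstrated behavior.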
Performance of BeT compared with different baselines in learning from demonstrations. For CARLA, we measure the probability of the car reaching the goal successfully. For Block push, we measure the probabilities of reaching one and two blocks, and of pushing one and two blocks to their respective squares. For Kitchen, we measure the probability of completing tasks.
Distribution of the most frequent task sequences completed in the Kitchen environment. Each task is colored differently, and frequency is shown out of 1,000 unconditional rollouts from the models.
In this work, we introduce Behavior Transformers (BeT), which uses a transformer-decoder backbone with a discrete action mode predictor coupled with a continuous action offset corrector to model continuous action sequences from open-ended, multi-modal demonstrations. While BeT shows promise, its truly exciting use would be learning diverse behavior from human demonstrations or interactions in the real world. In parallel, extracting a particular, unimodal behavior policy from BeT during online interactions, either by distilling the model or by generating the right "prompts", would make BeT tremendously useful as a prior for online Reinforcement Learning.