Hello, it’s time for my yearly post! The goal of this one is to be a handy reference for behavior cloning as applied to manipulation – something I’ve been spending quite a bit of time on over the past 6-9 months. The field has been progressing at an absurdly rapid pace of late (somehow we’re at a point where robots can do laundry), so I thought it’d be a good exercise to summarize where we’re at right now and how we got here (at least from my perspective and based on what I’ve read so far). I’m hoping this becomes a useful resource for any manipulation enthusiasts stumbling upon it, and of course, for future me as well!
The topic of interest here is going to be behavior cloning (BC). Why behavior cloning, you ask? Because it is currently the most promising way of training policies that can perform dexterous manipulation. I highly recommend reading this post by Vincent Vanhoucke, which delves into how/why the field made this transition in the first place. And while you’re at it, I would also recommend reading another one of his relevant posts on robot learning. None of this is to say that other methods like RL aren’t useful; they just aren’t the protagonist of this particular post.
Now that we’ve got the motivation part out of the way, let’s talk about what we’re actually gonna talk about. Sidelining our protagonist for now, we highlight some of the supporting cast that will make this a compelling story. This is going to include some conceptual groundwork, some history lessons (starting from the pre-BC era), teleoperation, simulation and some miscellaneous topics. Analogies aside, there are certain things we need to know about how such systems are set up, trained and evaluated, and I’m hoping to give a broad, general overview of these items.
This isn’t going to be an all-encompassing, self-sufficient post on the subject matter. Neither is it going to explain any of the theory/concepts being introduced. Each item will be briefly summarized, with references for deeper reading. The hope is that the references listed here, along with the references in those references, paint a well-rounded introduction (or refresher) to the field.
Groundwork
For brevity, I’m going to avoid listing core ML concepts such as LayerNorm, Residual Networks, etc. Information on these topics can be obtained by tracing references from the listed papers and tend to be more accessible in general.
The first idea here is going to be the original paper on transformers [1] (No prizes for guessing this one). A transformer is a neural network architecture designed to learn how elements of a sequence relate to each other. They’re also the major driving force behind the current boom in AI. Instead of attempting to explain this any further, I’m just going to list out a few resources that I’ve found to be extremely helpful (but please do read the original paper first):
- This series by 3B1B provides some good intuition on the topic
- This paper from Anthropic dives deeper into the topic by trying to reverse engineer the model
- This video by Andrej Karpathy implements a transformer model from scratch
And countless others. There definitely isn’t any shortage of information/tutorials on transformers at this point.
We now move over to BERT [2]. BERT introduced a bidirectional encoder-only transformer model that was able to achieve state of the art performance on a number of NLP tasks. It followed the now popular pre-training -> fine-tuning procedure.
Up next we have the GPT-3 paper [3]. This was also where I had my “Oh crap” moment with LLMs. A few realizations were had after this one:
- We can just scale up the size of the model and the amount of data the model is trained on, and we get better performance
- These models are actually generating text that is difficult to distinguish from human written text
- These models have the potential to be incredibly useful
The only things that changed from GPT-2 were the model size and the amount of training data. But apparently that’s all it takes, because transformers scale unreasonably well with data.
Time to go beyond language models and see how these ideas made their way into robotics. A lot of it had to do with being able to work with multi-modal data (language + images, etc.). TL;DR: Researchers quickly realized that transformers weren’t just useful for modeling language, they’re extremely adept at modeling sequences in general. So basically, if you can transform (pun intended) your problem into a sequence of representations/embeddings, chances are that some sort of transformer model will solve it.
We start off this section with ViTs [4]. This paper is the answer to the question: Can we use transformers for image recognition tasks? Each image is split into 16x16 patches that are flattened, mapped through a linear layer and then passed as input to a transformer model. The output of the self-attention encoder can then be used for classification tasks.
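To make the patchify step concrete, here’s a minimal PyTorch sketch of how an image gets turned into a sequence of patch tokens. The sizes (224px images, 16x16 patches, 768-dim embeddings) are just common ViT-Base defaults, not anything specific to the papers discussed here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patch embedding: split the image into fixed-size
    patches, flatten each patch and project it linearly so the result can be
    fed to a transformer encoder as a sequence of tokens."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch + a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, D)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```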
Moving on to some multi-modal stuff, we have CLIP [5]. CLIP combines text and vision encoders to train a joint embedding representation that can perform zero-shot tasks like image classification. The core idea here is that once we express both text and images in the same multi-modal embedding space, we can use the cosine similarity between the embedding vectors to learn a relationship between text and the contents of images. CLIP was pre-trained on 400 million image-text pairs, and the resulting model was able to perform zero-shot image classification, image retrieval, etc.
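As a rough illustration of how the shared embedding space gets used for zero-shot classification, here’s a small sketch. The embeddings are stand-ins for the outputs of CLIP’s image and text encoders, and the temperature is just a typical choice, not the exact value from the paper.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot classification sketch: L2-normalize the image
    embedding and one text embedding per candidate label, then treat the
    scaled cosine similarities as class logits."""
    image_emb = F.normalize(image_emb, dim=-1)     # (D,)
    text_embs = F.normalize(text_embs, dim=-1)     # (num_labels, D)
    logits = text_embs @ image_emb / temperature   # cosine similarity per label
    return logits.softmax(dim=-1)                  # probability over the labels

# Hypothetical 512-dim embeddings for an image and three candidate captions.
probs = zero_shot_classify(torch.randn(512), torch.randn(3, 512))
```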
I wasn’t too sure about adding this one as it might be getting too specific, but I thought it was worth mentioning as it gets used in Pi-0 (stay tuned). And by this one, I mean the PaLI series of models [6], [7], [8], [9]. PaLI [6] is a multilingual vision-language model that was (similar to CLIP) trained on a large amount of image-text data. Images are encoded using a pre-trained ViT and text is encoded using a pre-trained language model. They are then jointly trained on a variety of tasks such as image captioning, visual question answering (VQA), OCR, etc. This enables the model to then zero-shot these vision-language tasks in multiple languages. PaLI-X [7] scales up both the model size and training data to achieve better performance on these tasks. PaLI-3 [8] changes the architecture a bit, making it a smaller and faster model while still achieving performance comparable to SOTA on vision-language tasks. Finally, PaliGemma [9] is a 3B parameter VLM that is basically the open-source equivalent of PaLI-3. It uses a 400M SigLIP [10] model for encoding images and a 2B Gemma [11] model for encoding text, which makes it a pretty good candidate for a lightweight VLM backbone.
Stepping away from LLM/VLM architectures, let’s look at a few ideas that get used in building and training the actual models. First in line is FiLM [12]. FiLM is a technique for conditioning the features of a neural network on some additional information. It applies a linear (affine) transformation to the features where the parameters of the transformation are learned from the additional information. As we’ll see later, this is pretty useful when we have encoded image/text and we want to condition the output of the model (robot actions in our case) on the encoded features.
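Here’s a minimal sketch of a FiLM layer, assuming the conditioning input is something like an encoded language instruction and the features are convolutional feature maps; the shapes are arbitrary.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """FiLM sketch: predict a per-channel scale (gamma) and shift (beta) from
    a conditioning vector and apply them to the feature maps."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features, cond):   # features: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * features + beta[:, :, None, None]

film = FiLM(cond_dim=64, num_channels=32)
out = film(torch.randn(2, 32, 16, 16), torch.randn(2, 64))
```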
We’ve talked a lot about architectures, so here’s an important reference on the training aspect of things: LoRA [13]. LoRA is a technique that allows us to fine-tune LLMs and VLMs with a small number of parameters. It does this by adding a weight delta matrix to the original weight matrices of the model, and then updating only the delta matrices during training (keeping the original weights frozen). The delta matrices are represented as low rank matrices, formed by the product of two smaller matrices. This allows us to fine-tune large models with a small number of parameters, and is the key idea that makes the fine-tuning of some of the recent robotics foundational models possible without requiring beefy compute.
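To make that concrete, here’s a rough sketch of a LoRA-wrapped linear layer. The rank and scaling values are illustrative choices; real fine-tuning setups pick these per model and typically apply the wrapper only to specific attention/MLP projections.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA sketch: freeze the pre-trained weight W and learn a low-rank delta
    (B @ A), so the layer effectively computes x @ (W + BA)^T + b."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        delta = (x @ self.lora_A.T) @ self.lora_B.T  # low-rank update path
        return self.base(x) + self.scale * delta

layer = LoRALinear(nn.Linear(1024, 1024))
y = layer(torch.randn(4, 1024))
```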
VLMs are what have made manipulation policies “smart”. They allow us to use natural language to describe tasks, and allow models to reason about how to go about performing the given tasks. They’re able to encode common sense reasoning and knowledge about the world, which was almost impossible to do previously. But there’s still a missing piece here: robots still need to be able to convert these high level descriptions and reasoning capabilities into low level controls/actions, which brings us to the final part of this section.
Diffusion [14] models have been the state of the art for image/video generation for a while now (think DALL-E, Stable Diffusion, Midjourney, Sora, etc.). Notably, they have certain properties that make them suitable for generating robot actions as well (primarily their ability to model multi-modal trajectories). In general, diffusion models are generative models that learn to generate data in a particular distribution through an iterative denoising process. Given a noisy input, the model learns how to denoise it step by step until the final output is close to the original data distribution. In the case of images, we start with an image that is just noise, and then iteratively denoise it to get a final image that resembles the images in the training set. The same idea can be applied to robot actions, where we start with a noisy action sequence and iteratively denoise it to get a final action sequence that resembles the actions in the training set. It would be remiss of me to not mention the Denoising Diffusion Implicit Models (DDIM) [15] paper here, which is a variant of diffusion models that allows for generation with fewer denoising steps. Most diffusion implementations in BC use DDIM as it allows for faster generation of actions while still maintaining their quality. The training process is the same, but during sampling, instead of adding noise at each step, we treat the denoising as a deterministic process. A resource that I’ve found to be extremely helpful for diffusion is this tutorial [16] by Stanley Chan.
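For intuition, here’s a schematic DDIM-style sampling loop for an action chunk. `eps_model` is an assumed noise-prediction network conditioned on encoded observations, and the linear noise schedule and step count are placeholder choices rather than what any specific paper uses.

```python
import torch

@torch.no_grad()
def ddim_sample_actions(eps_model, obs_emb, horizon=16, action_dim=7, steps=10):
    """Schematic DDIM sampling: start from Gaussian noise and deterministically
    denoise it into an action chunk, conditioning the noise predictor on the
    observations at every step."""
    betas = torch.linspace(1e-4, 0.02, 1000)            # placeholder noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    ts = torch.linspace(999, 0, steps).long()

    x = torch.randn(1, horizon, action_dim)             # pure-noise action chunk
    for i, t in enumerate(ts):
        eps = eps_model(x, t, obs_emb)                   # predicted noise in x
        a_t = alpha_bar[t]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # implied clean action chunk
        a_prev = alpha_bar[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x
```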
And finally we have Flow Matching [17], [18], something that has been gaining more popularity recently. Flow matching can be thought of as a generalization of diffusion models, where instead of learning to denoise a noisy input, the model learns a vector field that describes how to move from the noise distribution to a target distribution. The sampling process is similar to DDIM: the learned vector field is integrated starting from a noise sample. When the flow is defined in certain ways, the process learns something similar to the score function in diffusion. Flow matching also allows for different kinds of smooth paths that define how we move from the noise distribution to the target distribution, instead of being limited to a noise schedule. Like DDIM, flow matching models are easy to train and generate samples from, while allowing for greater flexibility. Here’s a very good post from DeepMind that compares the two generative modeling approaches.
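And the flow matching counterpart of the loop above, again with an assumed `vector_field(x, t, obs_emb)` network and plain Euler integration; real implementations may use different step counts or integrators.

```python
import torch

@torch.no_grad()
def flow_matching_sample(vector_field, obs_emb, horizon=16, action_dim=7, steps=10):
    """Flow matching sampling sketch: draw a sample from the noise distribution
    and integrate the learned vector field from t=0 to t=1 to transport it to
    the action distribution."""
    x = torch.randn(1, horizon, action_dim)          # start at the noise distribution
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * vector_field(x, t, obs_emb)     # Euler step along the learned flow
    return x
```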
And with that we have laid all the groundwork we need for the rest of the post. Next up, we will look at some of the history of manipulation policies and try to trace a line from the days of yore to the current state of the art.
Pit stop: Behavior Cloning
Before we get to the meat of the post, let’s talk about behavior cloning (BC) itself. Behavior cloning falls under the umbrella of imitation learning and is a form of supervised learning where we train a model to mimic the behavior of an expert. In the context of manipulation, this means collecting expert data (usually from human teleoperators) and using this to train a policy that mimics the expert’s actions. Unlike reinforcement learning (RL), the algorithm itself does not explore and learn from rewards. This makes it a lot more sample efficient and practical as we can treat it as a regression problem, with the model learning the distribution of the expert actions.
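In its simplest form, that amounts to the training step sketched below. This is a bare-bones sketch: `policy` and the batch layout stand in for whatever encoder and action head you actually use, and real pipelines use fancier losses and action representations, as we’ll see later.

```python
import torch.nn.functional as F

def bc_training_step(policy, optimizer, batch):
    """One behavior cloning step: plain supervised regression from
    observations to the expert's recorded actions."""
    obs, expert_actions = batch["obs"], batch["actions"]
    pred_actions = policy(obs)                          # (B, action_dim)
    loss = F.mse_loss(pred_actions, expert_actions)     # regress onto expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```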
As with anything, there are some caveats to this approach:
- BC algorithms can suffer from compounding errors, taking the robot’s state away from the data distribution that the model has seen (also known as distribution shift)
- There can be an inherent lack of understanding of the task and lack of generalization to unseen scenarios
- It is difficult to train a model to correct for mistakes as the model is only trained on positive examples from the expert
- Expert data must be of high quality and the performance of the model is usually bounded by the performance of the expert
In the next section, we’ll take a look at how some of the recent advances in the field have tried to address these issues.
Evolution of BC
I wasn’t kidding when I said that this would be a history lesson. The earliest attempt at behavior cloning and a “pixels to actions” approach was ALVINN [19], which was developed in 1988! It was a simple neural network (multilayer perceptron) that was trained to predict the steering direction of a van given an input image from a camera and a lidar.
We now jump forward to 2016, to one of the earliest attempts at using deep learning for manipulation by training an end-to-end visuomotor policy [20]. While not exactly behavior cloning, this paper is a bit too important to be left out, as it demonstrated that we could train practical end-to-end policies to perform non-trivial manipulation tasks. The model is CNN based, takes in an RGB image and the robot state as input, and outputs motor torques directly. It was able to perform manipulation tasks such as shape sorting, screwing a cap onto a bottle, etc.
Moving on to some of the earlier BC methods for manipulation, we have [21] and [22]. [21] employed VR based teleoperation to collect expert data for training a CNN based network that output the required linear and angular velocities of the end-effector (along with the gripper state). [22] used Leap Motion/PlayStation controllers for teleoperation and a VAE-GAN autoencoder to learn a latent representation, which was then fed into an LSTM to produce the robot actions (joint commands in this case). A one-hot task selector vector was also used to condition the model on different tasks.
There was then some research done on trying to scale this process up. Transformer models for manipulation weren’t popular yet, but people were thinking about:
- How to scale up the amount of data collected through teleoperation
- Generating large open source datasets for training policies
RoboTurk [23] developed a teleoperation method that anyone around the world could use to collect data. The method involved using phones as the medium for teleoperation, while streaming image data to the client’s web browser. It was possible to remotely collect around 130 hours of manipulation data (picking, assembly, etc) in a day. This was then used to train a demonstration guided PPO policy to perform these tasks.
RoboNet [24] came later, with the intention of creating a large scale diverse dataset for manipulation. It was a collection of ~160,000 trajectories along with video data from 7 different robots performing a variety of different tasks. Foundational models for robotics weren’t quite a thing at the time, but this paper had the right idea: use a large amount of diverse data to train models that can generalize zero shot to new tasks. The models could also be fine-tuned on smaller datasets to improve performance on specific tasks. Sound familiar?
Next in line is BC-Z [25], which takes this idea of creating a large scale generalist policy a step further. BC-Z was trained on imitation learning data consisting of 25,000+ episodes over 100+ tasks and was able to generalize to some unseen tasks with a decent success rate. The model’s inputs are RGB images and a task command. The task command takes the form of a language instruction or a video of the task to be performed. CNNs (ResNet-18 based) are used to encode the input images and the task command videos. Language commands were encoded using a separate multilingual language encoder (even though this was published in 2022, transformers for robotics weren’t as common yet). The encoded language/video task commands condition the vision encoder through FiLM [12] layers. This final conditioned encoding is then passed through an action head (linear layers + ReLU) to compute the final actions. The actions in this case are end-effector deltas in terms of XYZ and axis-angle, along with the gripper state. A VR based teleoperation strategy was used to collect data, with each episode/task being labelled with a task command (a text description or a human demonstration video of the same task). Some of the data was also collected using a human-in-the-loop approach, similar to HG-DAgger [26]. The result of all this was a model that could not only perform single task BC with a high success rate, but also generalize (zero and few shot) to unseen tasks that were held out from the training set (with a respectable success rate of ~40%). This work made it clear (if it wasn’t already) that scaling up the models and the amount of training data was the way to go for generalist policies. It also laid some of the groundwork for the robotics foundational models that would come later.
Once roboticists got a taste of what was possible with large scale data and models, it was time to go all in. SayCan [27] combined the power of LLMs and BC policies, not by building a large scale model operating on language instructions, but by using an existing LLM to generate sub-tasks. The sub-tasks were then fed to a BC/RL policy (like BC-Z) to perform the task. The robot was already trained to be able to perform a set of atomic tasks (picking items, etc.), and the LLM was used to generate a sequence of these atomic tasks given a potentially complex user instruction. The thing that integrates these two portions is an affordance function (value function) that evaluates the sub-tasks generated by the LLM and selects the ones that are most likely to succeed given the current state of the robot. This part is basically learning a Q function to pick the best sub-task conditioned on the current robot state and a language description. This helps ground the LLM generated sub-tasks in the real world, ensuring that the robot tries to execute sub-tasks that are actually feasible. The affordance function was trained using a temporal difference (TD) method on trial and error data collected from the robot attempting to perform various sub-tasks. The function was conditioned on the language description of the sub-task through a frozen language encoder. The robot was able to execute tasks specified through complex natural language instructions in a zero-shot manner.
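The selection step can be summarized in a few lines. `llm_score` and `affordance` are stand-ins for the LLM’s likelihood of a skill being a useful next step and the learned value function, respectively.

```python
def select_next_skill(instruction, robot_state, skills, llm_score, affordance):
    """SayCan-style skill selection sketch: combine what the LLM "says" is a
    useful next step with what the affordance function says the robot "can"
    actually do, and pick the skill with the highest combined score."""
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        score = llm_score(instruction, skill) * affordance(robot_state, skill)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```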
And now we reach another major checkpoint: the RT (Robotics Transformer) series of papers. RT-1 [28] came out in 2022 and was one of the first successful attempts at using transformers trained on large scale data for manipulation. The ~35M model takes in a history of RGB images and a language instruction and outputs robot actions, all in one model. The image input is encoded using a pre-trained EfficientNet [29] model conditioned on the encoded language instruction through FiLM [12]. The output tokens (after computing a more compact representation) are then passed through a transformer (self-attention) network to produce the final output actions at a frequency of 3 Hz. The actions themselves are tokens that represent the end-effector deltas (xyz, rpy), gripper state, pose deltas of the mobile base and an additional token that indicates whether the arm or the base should be moved or if the given task is done. To learn the output actions as tokens, the actions from the training set are discretized into 256 bins per dimension. Almost 130k episodes of robot data were fed into this model during training, making full use of the transformer architecture’s scaling capabilities. What do we get for all this effort? A policy that boasts a 97% success rate on tasks seen during training, beating existing methods like BC-Z. The model was also capable of generalizing to unseen tasks, distractors and more realistic instructions. It was also capable of performing long horizon tasks, achieving SOTA performance here as well (over SayCan).
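The action tokenization is essentially uniform binning per action dimension; here’s a rough sketch of what that looks like (the action ranges here are made-up placeholders).

```python
import torch

def actions_to_tokens(actions, low=-1.0, high=1.0, num_bins=256):
    """RT-1-style action discretization sketch: clip each continuous action
    dimension to a known range and map it to one of 256 uniform bins so it can
    be predicted as a discrete token."""
    actions = actions.clamp(low, high)
    return ((actions - low) / (high - low) * (num_bins - 1)).round().long()

def tokens_to_actions(tokens, low=-1.0, high=1.0, num_bins=256):
    """Inverse mapping used at inference time: bin index -> bin center."""
    return low + tokens.float() / (num_bins - 1) * (high - low)

tokens = actions_to_tokens(torch.tensor([0.03, -0.12, 0.4]))
```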
So now we have a large scale dataset, a transformer model, and proof that it is possible to train a large scale model for manipulation. What next? Scale everything up, obviously. Enter RT-2 [30], a much larger model trained not just on the robot data from RT-1, but also on vision-language internet data. The paper also coined the term VLA (Vision-Language-Action model), which we’ll see being used a lot from this point onwards. The idea is that by treating the discretized actions as separate tokens, a transformer model can be trained to output language tokens as well as action tokens. By training the model to predict tokens for tasks such as image captioning, VQA, etc. along with robot actions, the model is able to learn a joint representation allowing it to perform tasks that require deep visual and language understanding. This also enables it to perform chain-of-thought reasoning in a planning step, which lets it output better actions for more complex tasks. The model is built on top of existing pre-trained VLMs such as PaLI-X [7]. The PaLI-X version was built on both a 5B and a 55B model, marking a significant increase in model size compared to RT-1. The size of these models also means that they had to be run as a cloud service, at a rate of 1-3 Hz for the 55B model and 5 Hz for the 5B model. RT-2 was shown to produce state of the art results, outclassing other models (including RT-1) on a variety of tasks. It also performed a lot better in unseen environments/tasks, in no small part due to the large scale internet data it was trained on.
LLMs pre-trained on internet scale data are known to outperform models trained on smaller task specific datasets. Open X-Embodiment [31] is the embodiment of this idea for robotic manipulation. It introduces truly generalist policies, not by proposing any novel architecture changes, but by taking the existing models (RT-1 and RT-2) and training them on a more diverse dataset. How diverse, you ask? The paper introduces the Open X-Embodiment dataset (OXE): an open source large scale dataset (currently still the largest open source dataset for robot learning), whose size is only surpassed by the number of authors on the paper. It features ~1M trajectories from 22 different robot embodiments (single arm, dual arm, quadrupeds, etc.), all packaged into a TensorFlow dataset format. For training the RT models, a subset of the dataset containing 9 manipulator embodiments is used. The inputs and outputs are made homogeneous to an extent – a single RGB image is chosen as input, along with a language instruction. The output is a set of 8 tokens (7 for the arm and 1 for the end of the episode) that represent actions discretized into bins. The inputs and outputs could still mean different things, as the camera views for each dataset within OXE are different, along with different interpretations of the actions (position vs velocity, etc.). But what would come as a surprise to most people reading the paper for the first time is the fact that the models trained on this dataset perform and generalize better than the models trained on just the RT data. They exhibit positive transfer across different embodiments, and the RT models trained on the OXE dataset (referred to as RT-1-X and RT-2-X) outperform the original models. Note that RT-1-X did not outperform RT-1 on domains that already had large task specific datasets. But RT-2-X was able to outperform both RT-1 and RT-2, showing the need for model capacity when dealing with data of this magnitude.
If the efficacy of cross embodiment training wasn’t surprising enough, it turns out that we can also train models with data from non-manipulator embodiments (navigation robots, quadrupeds, etc.). This results in more robust policies that match the performance of specialist policies. Without spending too much time here, a couple of the papers that introduce this idea are extreme cross embodiment [32] and CrossFormer [33].
That brings us to the end of the scaling section. Time to focus more on some of the model architectures that power today’s SOTA BC policies. We start off with the Action Chunking Transformer (ACT) [34]. The paper has two major contributions: an economical open source platform for bimanual manipulation and teleoperation (ALOHA) and an 80M transformer architecture for learning end-to-end policies for fine-grained manipulation tasks. The paper also introduced the “action chunking” concept, which is fancy talk for saying that the model predicts a horizon of actions instead of a single action. Before we get to the specifics of the model, let’s address the elephant in the room: the $20k economical platform. Yes, it is economical given how absurdly expensive manipulation hardware is (Don’t believe me? How about after seeing grippers that cost $6k?). Hardware in the space is getting a lot cheaper, but it wasn’t too long ago that research labs were spending upwards of $100k on arms. Now that the elephant has been eviscerated, let’s talk about the model. A transformer encoder takes as input 4 RGB images, all encoded through ResNet-18 feature extraction, along with the current joint positions. A transformer decoder then takes the encoded features and outputs a tensor of size k x 14, where k is the action horizon and 14 is the dimension of the action space. Unlike some of the previous papers, the model predicts joint angles instead of end-effector deltas. Also note that the model doesn’t do action binning; it directly outputs the action tensor instead of tokens, and is trained using an L1 loss (which worked better than the more common L2 loss. Another interesting read on this topic is this experiment). The model was able to achieve pretty good performance (80-90%) on some tasks (opening the cap of a small cup, etc.) that require precise control. This was possible with a dataset of just 50 episodes for the individual tasks. It also improved on other methods such as BeT, which we’ll get to in a bit.
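Stripped of the CVAE machinery, the core training objective is just an L1 regression over a whole chunk of future joint targets, roughly like the sketch below (shapes and names are illustrative).

```python
import torch.nn.functional as F

def action_chunk_loss(policy, images, joint_pos, expert_chunk):
    """ACT-style chunked objective sketch: the policy predicts k future
    joint-position targets at once and is trained with an L1 loss against the
    expert's chunk (the CVAE/KL term from the paper is omitted here)."""
    pred_chunk = policy(images, joint_pos)          # (B, k, 14)
    return F.l1_loss(pred_chunk, expert_chunk)
```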
The downside of a field moving this rapidly is that some good ideas tend to get overshadowed by the next shiny thing pretty quickly. Octo [35] found itself at the receiving end of this phenomenon. Octo is a small (27M and 93M) transformer model that was designed with the primary goal of being modular. It includes encoders for instructions, observations and goals, and a transformer network that operates on the encoded features as tokens to output action chunks (through a diffusion process). The model processes and outputs tokens in a way that makes it very easy to add/remove inputs and action heads, making it easy to adapt it to different use cases. It also makes it easy to fine tune on custom data. All of this is enabled by some clever masking on the transformer side of things. The model was trained on ~800k trajectories from the OXE dataset and is published as an open source model. Despite how interesting and altruistic the paper is, it unfortunately got trumped by some of the other models that came out right after. These days you’ll find Octo in the comparison tables of policies that perform better :(
That thing about Octo being overshadowed? Meet OpenVLA [36], which came out a few weeks after Octo did. Seeing the model respond to natural language instructions and manipulate objects, even for objects/tasks that weren’t entirely in distribution, was quite a sight to behold. And I’m talking about the out of the box model here, not even a fine-tuned version. OpenVLA is a 7B parameter model trained on ~970k episodes from OXE. The model takes in an input RGB image and a language instruction and outputs a single action (not a chunk) in the form of an end-effector delta. To encode the input image, the patches that make up the image are encoded using a combination of DinoV2 [37] and SigLIP [10]. They mention that the DinoV2 encoder helps improve spatial reasoning capabilities. It uses Llama 2 [38] as the LLM backbone, helping it reason about its inputs (something that Octo was lacking). The instruction is encoded using the Llama tokenizer and all of the tokens are processed through the Llama 2 7B model. Similar to what we’ve seen previously, the model outputs tokens corresponding to binned actions. The trained model is capable of inference at a frequency of ~6 Hz on an RTX 4090 GPU. OpenVLA was able to achieve better performance than RT-1-X, Octo and even RT-2-X, a model that is ~8x larger. The best part is that all of the code and the model weights are open source, making it easy to use and fine-tune.
One of the challenges with behavior cloning is being able to model multi-modality in the expert data. A very good example of this is the Push-T task. BeT [39] was one of the earlier works that tried to address this issue. It explicitly captured modes in the data by learning a separate k-means encoder/decoder that is capable of decomposing a given action vector into a discrete action bin and an error correction offset. During training, the transformer model learns to predict the action bin (from among the k action classes) and the offset given a history of observations (images). During inference, the model outputs the action offsets and a probability vector for the action bins, which is then used to sample an action bin. This sampled bin and its predicted offset are then decoded into the final continuous action vector. BeT was able to achieve good performance on some multi-modal simulation environments. VQ-BeT [40], which came a couple of years after BeT, is its successor. It employs a residual Vector Quantization (VQ) [41] based approach to encode the actions. Similar to BeT, VQ-BeT has an action discretization phase and a learning phase. In the discretization phase, an encoder and decoder network are learned to discretize the action chunks (not a single action vector anymore) into vectors from hierarchical codebooks. The paper used 2 VQ residual layers for all the experiments. The training phase is similar to BeT: a sequence of images is encoded and passed through a transformer model that outputs probabilities for the action chunk classes. The difference is that instead of outputting a single set of probabilities, each VQ layer has its own set of probabilities that need to be sampled to compute the final continuous action chunk. VQ-BeT addresses one of the main pain points of BeT, which was having to pre-determine the number of action classes k, while also dealing with the model’s sensitivity to this value. VQ-BeT was shown to be more effective than BeT and even Diffusion Policy (which we’ll get to next) on multiple simulation and real world environments.
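To make the bin-plus-offset idea concrete, here’s a sketch of how a BeT-style head could decode a continuous action at inference time; the tensor shapes and names are assumptions, not the paper’s actual interface.

```python
import torch

def decode_bet_action(bin_logits, offsets, bin_centers):
    """BeT-style decoding sketch: sample one of the k discrete action bins from
    the predicted logits, then add the predicted offset for that bin to its
    k-means center to recover a continuous action."""
    probs = bin_logits.softmax(dim=-1)                 # (B, k)
    idx = torch.multinomial(probs, num_samples=1)      # (B, 1) sampled bin per item
    center = bin_centers[idx.squeeze(-1)]              # (B, action_dim)
    offset = torch.gather(                             # offset predicted for the sampled bin
        offsets, 1, idx[..., None].expand(-1, -1, offsets.size(-1))
    ).squeeze(1)
    return center + offset
```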
And now we finally get to one of my favorites: Diffusion Policy (DP) [42]. It does what you would expect, which is generate robot actions using a diffusion [14] process. Diffusion is able to naturally model multi-modality in the data and is another way to tackle this problem in the context of BC. The inputs to the model are a history of observations (RGB images and robot poses). The model outputs an action trajectory that is generated through a denoising process. More specifically, the model outputs the predicted noise for the given noisy action input. This process is repeated for a number of steps until the noisy action (which started as a Gaussian noise sample) has been completely denoised into an action trajectory that makes sense. FiLM [12] layers are used to condition the diffusion model on the observations. The paper proposes both a CNN based and a transformer based model. The CNN based model uses a ResNet-18 to encode the images that feed into FiLM. The transformer model uses a multi-head cross attention scheme to process the observation encodings (images encoded through a ViT). This was evaluated on a multitude of simulation and real world tasks (some involving non-trivial contact dynamics) and was able to achieve SOTA performance on most of them, surpassing models like BeT (VQ-BeT wasn’t out yet). The paper mentions that the CNN based model was sufficient for most tasks, with the transformer model only being necessary for tasks that required finer high frequency control (e.g., velocity control). While the transformer based model achieved better performance in most of the tasks, it was also sensitive to model parameters and was difficult to train. Now’s also a good time to quickly mention that papers like the DiT-Block policy [43] came out later and were able to address some of these issues with using transformers in diffusion based BC policies.
Remember ACT and ALOHA? ALOHA Unleashed [44] was also unleashed towards the end of 2024, featuring a diffusion based policy for training bi-manual policies on the ALOHA platform. The model follows a formula similar to what we’ve been seeing: a ResNet to encode the images and a transformer architecture to process the encodings. In this case, a transformer encoder block generates the tokens and a transformer decoder block denoises the noisy action chunks. Naturally, the decoder block needs to run multiple times during inference to generate the final denoised action chunk. This isn’t that much of a problem as the policy is run open loop, enabling its execution at frequencies >50 Hz on an RTX 4090 GPU (well, not a problem unless you’re trying to run it on the edge).
All of this finally brings us to Pi-0 [45]. By no means is this the end state of BC, but I thought it’d be a good place to end the storyline of the post given that it’s probably the state of the art imitation learning policy for manipulation (although DeepMind might disagree [46]). Pi-0 is a generalist policy built on top of a pre-trained VLM. It tries to make the best use of a large scale model and data to train a policy that is capable of some astonishing feats of manipulation. I would highly recommend watching the policy in action before reading the paper. What powers the model is the absolutely monstrous amount of data that was collected – ~10,000 hours of teleoperation data, amounting to ~900M timesteps! It follows a pre-train -> fine-tune scheme where the ungodly amount of data is first used to train a base model, after which more curated datasets are used to fine-tune for specific tasks. The VLM backbone used in the model is the 3B PaliGemma [9] model, which ingests the language instructions. RGB image inputs are encoded using ViTs. A separate 300M action expert is used to output action chunks. The action expert also takes in robot proprioception data, and it doesn’t use any causal masking, allowing all the action tokens to attend to each other. Instead of using diffusion, Pi-0 uses flow matching [17] to generate the action chunks. The model is capable of executing complex long horizon manipulation tasks such as table bussing, folding clothes, laundry picking and sorting, etc. Pi-0 outclasses DP, ACT, OpenVLA and poor Octo, while also being adept at following language instructions, making it a truly generalist policy.
The storyline of the main quest for this post ends here, but I need to do justice to a few more papers that didn’t quite fit the collection above.
Datasets
I’ll keep this section short and sweet. Datasets are the lifeline of VLAs. More data = good. More open source data = even better. We’ve already talked about the OXE dataset [31], which is the largest open source collection of datasets for imitation learning (especially for manipulation). It is composed of individual open source datasets, the full list of which can be found in the paper. But I specifically wanted to highlight the Bridge [47] and DROID [48] datasets. Bridge is a collection of ~7,000 episodes of teleoperated manipulation data collected on the WidowX robot. DROID is a collection of ~70,000 episodes (~350 hours) of teleoperated data collected on the Franka Panda arm.
Hardware
We’ve talked a lot about the algorithmic side of things, but now it’s time to talk about hardware. I’m not going to talk about robotic arms/grippers, nor am I going to talk about teleoperation systems (need to save these for another post). Instead, I’ll just mention a few papers that introduced interesting hardware platforms.
We’ve already seen ALOHA [34] before, but I thought I’d give it another mention. The platform has gotten pretty popular in the research community, with similarly designed platforms being more widely adopted. In a world where dexterous manipulators designed with revolute joints are the norm, the Stretch [49] platform from Hello Robot takes a different approach. I’m not entirely sold on its practicality, but it is interesting nonetheless. You can see it featured in papers like this [50] one.
Moving over to hardware for teleoperation/data collection, we start with one of my favorite ideas from last year. UMI [51] is a hardware setup that allows people to collect data themselves without needing teleoperation. This is done by using a gripper-like device with a camera setup that users can hold. Robots can then be fitted with a similar setup and voila, whatever policies were trained with data collected through human manipulation can now be transferred to the robot. There are some limitations, like being restricted to using wrist cameras, but don’t let that distract you from the novelty of the idea. On the topic of effective methods of data collection, I’ll end the section with Meta’s Project Aria glasses [52], a wearable device that can be used to collect egocentric data. An example of how this can be used for imitation learning is the EgoMimic [53] paper, which enables users to wear the glasses and collect data naturally without the need for teleoperation, or having to cosplay as a crab.
Simulation and Benchmarking
Simulation offers two primary benefits: it can allow us to collect data at scale, and it can be used to evaluate and compare policies in a structured manner. Neither of these is straightforward, as both depend on the simulator’s ability to produce realistic sensor data and realistically model the physics of the environment. You’ll find researchers on both sides of the fence, some who swear by simulation and others who would rather put in the effort to collect real world data. This obviously isn’t a black and white issue, but it’s unclear to me what the balance here looks like. What is clear though, is the fact that simulators keep getting better every day, and even though most of the best models today are trained on real world data, this trend is likely to shift in the near future (I’m sure Nvidia would agree).
I can’t not start this section off with MuJoCo [54]. It started off as a project at the University of Washington, made its way to being used as the de-facto physics engine for RL (OpenAI Gym [55], etc.) and was eventually acquired and open-sourced by DeepMind, who currently maintain it. A lot of the other simulators and environments that we’ll see below are built on top of MuJoCo. Its impact on the field cannot be overstated, and it is still the go-to physics engine for learning based approaches. On the topic of physics engines, I can’t stop myself from mentioning Drake [56], being the Drake (not the rapper) fanboy that I am. Drake caters to an audience that is looking for more accurate and high fidelity modeling of contact and dynamics. Drake is much more than just a physics engine, but we don’t need to get into that now.
Moving away from physics engines and onto simulators, we have AI2-THOR [57], a simulator that uses a Unity backend and provides photo-realistic environments for training agents in home-like environments. SAPIEN [58] is a versatile simulator that uses the PhysX physics engine and is capable of simulating a variety of robotic tasks. RLBench [59] is a simulation benchmark of 100 unique manipulation tasks, each with proprioception and RGBD observations, along with trajectories from motion planners. Robosuite [60] uses MuJoCo as its physics engine and offers a set of standardized environments and tasks for benchmarking RL and imitation learning policies. It also provides ways to procedurally generate new environments and tasks, making it easy to scale up the data collection process. ManiSkill [61] builds on top of SAPIEN and provides a set of benchmarking tasks, while also providing ~36,000 expert trajectories for training imitation learning policies. This [62] paper introduced the Franka kitchen environment, which is a MuJoCo based kitchen environment designed for training long horizon RL and imitation learning tasks. This next one reminds me of this xkcd comic, but more simulators and environments are always going to be good for the field (probably). RoboHive [63] encompasses some other simulators and environments, but also provides its own set of MuJoCo powered benchmarking environments and tasks. It does this for a variety of different embodiments, while also providing methods of evaluation and baselines. SIMPLER [64] is a collection of SAPIEN based simulated environments specifically created for evaluating policies trained on real world data. Why do this? Because real world policy evaluation is an extremely annoying and expensive process. The paper delves deeper into the sim-to-real gap and some of the metrics used to quantify it, so do give it a read. MimicGen [65] and DexMimicGen [66] aren’t simulators, but the concept they propose is way too intriguing to be left out. They introduce a method to generate a large number of synthetic trajectories (more than 200x the original demonstrations) from a limited number of human demonstrations, which helps in accelerating BC training.
I can’t really end this section without talking about the work Nvidia has been doing. Isaac Sim and Isaac Lab provide high fidelity simulation environments for a variety of tasks, including robot learning. Cosmos is their foundation model for generating synthetic data for different embodiments. Check out the GR00T [67] paper for how various sources of data (including synthetic data) can be incorporated into the training process to create foundation models for robotics.
Miscellaneous
There are a few papers that I couldn’t really fit into the flow of the rest of the post, so I’ll just mention them here. I’ll start with this paper [68], which identifies the most challenging aspects of learning from demonstrations and analyzes which model design and data choices matter most. Evaluations were done on both simulated and real world tasks, using 6 different learning algorithms (not restricted to BC algorithms). This paper came out before the VLA era, so you won’t find the popular methods of today featured here. Some of the results/findings might also be less relevant today given how much the field has changed, but it is a very informative read nonetheless.
Implicit Behavior Cloning [69] is the next one on this list. The gist of the paper is that policies learned through implicit energy based models tend to outperform explicit policies. It features some nice visualizations and elucidates everything in a nice and intuitive manner. It also details how these energy based models can be trained for BC, while also evaluating them on real and simulated manipulation tasks.
To close things out, we have data scaling laws [70], which is a bit more recent compared to the previous two. Scaling laws for LLMs [71] are well known at this point, but how do they translate to robotics? The number of environments, embodiments and tasks makes this learning space a lot more complex. We probably need more research in this area, but the paper [70] does a good job of laying some of the groundwork by evaluating model performance against object/environment diversity and the number of demonstrations.
My big wall of text ends here. Honestly, I wasn’t sure if I’d see this all the way through, so I’m gonna give myself a pat on the back for this one.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. [Link]
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp. 4171-4186). [Link]
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901. [Link]
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. [Link]
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PmLR. [Link]
- Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., … & Soricut, R. (2022). Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794. [Link]
- Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., Changpinyo, S., Wu, J., … & Soricut, R. (2023). Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565. [Link]
- Chen, X., Wang, X., Beyer, L., Kolesnikov, A., Wu, J., Voigtlaender, P., … & Soricut, R. (2023). Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199. [Link]
- Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., … & Zhai, X. (2024). Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726. [Link]
- Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 11975-11986). [Link]
- Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., … & Kenealy, K. (2024). Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295. [Link]
- Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2018, April). Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1). [Link]
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … & Chen, W. (2022). Lora: Low-rank adaptation of large language models. ICLR, 1(2), 3. [Link]
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851. [Link]
- Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502. [Link]
- Chan, S. (2024). Tutorial on diffusion models for imaging and vision. Foundations and Trends® in Computer Graphics and Vision, 16(4), 322-471. [Link]
- Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747. [Link]
- Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., … & Gat, I. (2024). Flow matching guide and code. arXiv preprint arXiv:2412.06264. [Link]
- Pomerleau, D. A. (1988). Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1. [Link]
- Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39), 1-40. [Link]
- Zhang, T., McCarthy, Z., Jow, O., Lee, D., Chen, X., Goldberg, K., & Abbeel, P. (2018, May). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 5628-5635). Ieee. [Link]
- Rahmatizadeh, R., Abolghasemi, P., Bölöni, L., & Levine, S. (2018, May). Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 3758-3765). IEEE. [Link]
- Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., … & Fei-Fei, L. (2018, October). Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning (pp. 879-893). PMLR. [Link]
- Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., … & Finn, C. (2019). Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215. [Link]
- Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., … & Finn, C. (2022, January). Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (pp. 991-1002). PMLR. [Link]
- Kelly, M., Sidrane, C., Driggs-Campbell, K., & Kochenderfer, M. J. (2019, May). Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA) (pp. 8077-8083). IEEE. [Link]
- Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., … & Zeng, A. (2022). Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691. [Link]
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., … & Zitkovich, B. (2022). Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817. [Link]
- Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR. [Link]
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., … & Zitkovich, B. (2023). Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818. [Link]
- O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., … & Chen, M. (2024, May). Open x-embodiment: Robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (pp. 6892-6903). IEEE. [Link]
- Yang, J., Glossop, C., Bhorkar, A., Shah, D., Vuong, Q., Finn, C., … & Levine, S. (2024). Pushing the limits of cross-embodiment learning for manipulation and navigation. arXiv preprint arXiv:2402.19432. [Link]
- Doshi, R., Walke, H., Mees, O., Dasari, S., & Levine, S. (2024). Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. arXiv preprint arXiv:2408.11812. [Link]
- Zhao, T. Z., Kumar, V., Levine, S., & Finn, C. (2023). Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705. [Link]
- Team, O. M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., … & Levine, S. (2024). Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213. [Link]
- Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., … & Finn, C. (2024). Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246. [Link]
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., … & Bojanowski, P. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193. [Link]
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. [Link]
- Shafiullah, N. M., Cui, Z., Altanzaya, A. A., & Pinto, L. (2022). Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35, 22955-22968. [Link]
- Lee, S., Wang, Y., Etukuru, H., Kim, H. J., Shafiullah, N. M. M., & Pinto, L. (2024). Behavior generation with latent actions. arXiv preprint arXiv:2403.03181. [Link]
- Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., & Tagliasacchi, M. (2021). Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 495-507. [Link]
- Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., … & Song, S. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 02783649241273668. [Link]
- Dasari, S., Mees, O., Zhao, S., Srirama, M. K., & Levine, S. (2024). The ingredients for robotic diffusion transformers. arXiv preprint arXiv:2410.10088. [Link]
- Zhao, T. Z., Tompson, J., Driess, D., Florence, P., Ghasemipour, K., Finn, C., & Wahid, A. (2024). Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126. [Link]
- Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., … & Zhilinsky, U. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164. [Link]
- Team, G. R., Abeyruwan, S., Ainslie, J., Alayrac, J. B., Arenas, M. G., Armstrong, T., … & Zhou, Y. (2025). Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020. [Link]
- Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., … & Levine, S. (2021). Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396. [Link]
- Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., … & Finn, C. (2024). Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945. [Link]
- Kemp, C. C., Edsinger, A., Clever, H. M., & Matulevich, B. (2022, May). The design of stretch: A compact, lightweight mobile manipulator for indoor human environments. In 2022 International Conference on Robotics and Automation (ICRA) (pp. 3150-3157). IEEE. [Link]
- Etukuru, H., Naka, N., Hu, Z., Lee, S., Mehu, J., Edsinger, A., … & Shafiullah, N. M. M. (2024). Robot utility models: General policies for zero-shot deployment in new environments. arXiv preprint arXiv:2409.05865. [Link]
- Chi, C., Xu, Z., Pan, C., Cousineau, E., Burchfiel, B., Feng, S., … & Song, S. (2024). Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329. [Link]
- Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., … & Newcombe, R. (2023). Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561. [Link]
- Kareer, S., Patel, D., Punamiya, R., Mathur, P., Cheng, S., Wang, C., … & Xu, D. (2024). Egomimic: Scaling imitation learning via egocentric video. arXiv preprint arXiv:2410.24221. [Link]
- Todorov, E., Erez, T., & Tassa, Y. (2012, October). Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems (pp. 5026-5033). IEEE. [Link]
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540. [Link]
- Tedrake, R., & Team, T. D. D. (2019). Drake: Model-based design and verification for robotics. Retrieved from https://drake.mit.edu [Link]
- Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., … & Farhadi, A. (2017). Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474. [Link]
- Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., … & Su, H. (2020). Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11097-11107). [Link]
- James, S., Ma, Z., Arrojo, D. R., & Davison, A. J. (2020). Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2), 3019-3026. [Link]
- Zhu, Y., Wong, J., Mandlekar, A., Martín-Martín, R., Joshi, A., Nasiriany, S., & Zhu, Y. (2020). robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293. [Link]
- Mu, T., Ling, Z., Xiang, F., Yang, D., Li, X., Tao, S., … & Su, H. (2021). Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483. [Link]
- Gupta, A., Kumar, V., Lynch, C., Levine, S., & Hausman, K. (2019). Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956. [Link]
- Kumar, V., Shah, R., Zhou, G., Moens, V., Caggiano, V., Gupta, A., & Rajeswaran, A. (2023). Robohive: A unified framework for robot learning. Advances in Neural Information Processing Systems, 36, 44323-44340. [Link]
- Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H. R., … & Xiao, T. (2024). Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941. [Link]
- Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., … & Fox, D. (2023). Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596. [Link]
- Jiang, Z., Xie, Y., Lin, K., Xu, Z., Wan, W., Mandlekar, A., … & Zhu, Y. (2024). Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. arXiv preprint arXiv:2410.24185. [Link]
- Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., … & Zhu, Y. (2025). Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734. [Link]
- Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., … & Martín-Martín, R. (2021). What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298. [Link]
- Florence, P., Lynch, C., Zeng, A., Ramirez, O. A., Wahid, A., Downs, L., … & Tompson, J. (2022, January). Implicit behavioral cloning. In Conference on robot learning (pp. 158-168). PMLR. [Link]
- Lin, F., Hu, Y., Sheng, P., Wen, C., You, J., & Gao, Y. (2024). Data scaling laws in imitation learning for robotic manipulation. arXiv preprint arXiv:2410.18647. [Link]
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. [Link]