In robot learning, the standard practice is to train policies on datasets collected for the specific robot and task at hand. Training from scratch in this manner requires substantial data collection for every new task, and the resulting policies typically generalize poorly. In principle, data gathered from other robots and tasks offers a way out: training models on a diverse range of control problems could improve their ability to generalize and perform better on downstream tasks. Yet, in contrast to the pervasiveness of general-purpose models in computer vision and natural language processing, building a “general-purpose robot model” capable of controlling many different robots has proven to be a formidable challenge. Training a unified control policy in robotics raises unique issues: handling different robot embodiments, sensor configurations, action spaces, task specifications, environments, and compute budgets.
Several publications have proposed robotic foundation models that do exactly this: directly map robot observations to actions while offering zero-shot or few-shot generalization to new domains and robots. Because of their versatility in low-level visuomotor control across tasks, environments, and robotic systems, these models are generally called “generalist robot policies” (GRPs). While GRPs mark progress toward a “general-purpose robot model,” they still have notable limitations: they do not support effective finetuning to new domains, and the largest ones are not even publicly available. They also restrict downstream users to a pre-defined, often limited set of input observations, such as a single camera stream.
To better accommodate the variety of user interfaces found in downstream robotic applications, researchers from UC Berkeley, Stanford, Carnegie Mellon University, and Google DeepMind propose a method for pretraining generalist robot policies.
Octo is a transformer-based policy pretrained on 800k robot demonstrations from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. Octo is the first generalist robot manipulation policy to be fully open source, including the data, model checkpoints, and training pipeline. It is also the first GRP that can be effectively fine-tuned to new observation and action spaces.
The model is a transformer architecture that maps an arbitrary number of input tokens, produced from observations and task specifications, to actions, and it is trained on a diverse dataset of robots and tasks. The same policy can be trained once and reused across multiple robots, different camera setups (e.g., wrist or workspace cameras), and different input modalities (e.g., language commands, goal images) simply by changing which tokens are fed into the model. The model can also be adapted to new robot configurations, sensory inputs, action spaces, or morphologies by adding the appropriate adapters and fine-tuning on a small dataset from the target domain with a modest compute budget. A rough sketch of this token-based design appears below.
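As a loose illustration of this idea (not the actual Octo implementation, which is released in JAX; all class and variable names here are hypothetical), the sketch below encodes observation and task tokens, concatenates them with a learned readout token, and decodes the readout embedding into an action. Swapping cameras or task modalities only changes which tokenizers produce the input tokens, not the shared transformer backbone.

```python
# Minimal sketch of a token-based transformer policy (hypothetical names,
# not the actual Octo implementation). Observations and task specifications
# are each encoded into tokens; changing input modalities only changes which
# tokenizers are used, while the transformer backbone stays the same.
import torch
import torch.nn as nn

class TokenPolicy(nn.Module):
    def __init__(self, token_dim=256, action_dim=7, n_layers=4, n_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=n_heads, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # A learned "readout" token whose output embedding is decoded into an action.
        self.readout = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.action_head = nn.Linear(token_dim, action_dim)

    def forward(self, obs_tokens, task_tokens):
        # obs_tokens:  (batch, n_obs_tokens, token_dim), e.g. image patches per camera
        # task_tokens: (batch, n_task_tokens, token_dim), e.g. language or goal-image tokens
        batch = obs_tokens.shape[0]
        readout = self.readout.expand(batch, -1, -1)
        tokens = torch.cat([task_tokens, obs_tokens, readout], dim=1)
        features = self.backbone(tokens)
        return self.action_head(features[:, -1])  # decode the readout token

# Example: the same backbone handles tokens from one or more cameras plus a language command.
policy = TokenPolicy()
obs = torch.randn(2, 32, 256)    # e.g. tokens from workspace + wrist cameras
task = torch.randn(2, 16, 256)   # e.g. tokens from a language-instruction encoder
action = policy(obs, task)       # (2, 7) continuous action
```

In the real system, it is the observation and task tokenizers (image and language encoders) that vary per robot or interface, while the transformer weights are shared across all of them.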
Individual components of Octo, such as a transformer backbone, support for goal-image task specification, and a diffusion head for modeling expressive action distributions, have been explored in prior work; the novelty lies in combining them into a single generalist robot policy. The researchers conducted extensive experiments on nine robots across four institutions, showing that the integrated system achieves state-of-the-art out-of-the-box multi-robot control for single- and dual-arm manipulation tasks. They also showed that Octo can serve as an effective initialization for fine-tuning to new observation and action spaces in unseen setups. Throughout these experiments, they analyzed how design choices such as data distribution, model architecture, and policy formulation affect the quality of the pretrained GRP, and the evaluation underscored the importance of scale and flexibility for strong performance.
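For intuition about the diffusion action head mentioned above, here is a minimal, simplified sketch (hypothetical names; the released head differs in its details): a small MLP predicts the noise added to an action, conditioned on the policy's embedding, and actions are sampled by iterative DDPM-style denoising.

```python
# Simplified diffusion action head sketch (hypothetical, not the released Octo head).
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    def __init__(self, embed_dim=256, action_dim=7, n_steps=20):
        super().__init__()
        self.n_steps = n_steps
        self.action_dim = action_dim
        # Linear beta schedule and derived alpha terms (standard DDPM quantities).
        betas = torch.linspace(1e-4, 0.02, n_steps)
        alphas = 1.0 - betas
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", alphas)
        self.register_buffer("alpha_bars", torch.cumprod(alphas, dim=0))
        # MLP that predicts the injected noise from (noisy action, timestep, embedding).
        self.denoiser = nn.Sequential(
            nn.Linear(action_dim + 1 + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def predict_noise(self, noisy_action, t, embedding):
        t_feat = t.float().unsqueeze(-1) / self.n_steps  # scalar timestep conditioning
        return self.denoiser(torch.cat([noisy_action, t_feat, embedding], dim=-1))

    @torch.no_grad()
    def sample(self, embedding):
        # Start from Gaussian noise and denoise step by step.
        action = torch.randn(embedding.shape[0], self.action_dim, device=embedding.device)
        for t in reversed(range(self.n_steps)):
            t_batch = torch.full((embedding.shape[0],), t, device=embedding.device)
            eps = self.predict_noise(action, t_batch, embedding)
            alpha, alpha_bar, beta = self.alphas[t], self.alpha_bars[t], self.betas[t]
            action = (action - beta / torch.sqrt(1.0 - alpha_bar) * eps) / torch.sqrt(alpha)
            if t > 0:
                action = action + torch.sqrt(beta) * torch.randn_like(action)
        return action

# Usage: decode a 7-dimensional action from a transformer embedding.
head = DiffusionActionHead()
actions = head.sample(torch.randn(2, 256))  # (2, 7)
```

At training time such a head would be supervised with the standard noise-prediction loss; only sampling is shown here.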
Alongside the paper, the team is releasing all the resources needed to train, use, reproduce, and fine-tune an Octo model. Their pretrained Octo checkpoints, with 27M and 93M parameters respectively, support language and goal-image task specification out of the box as well as multiple RGB camera inputs. They also release the full pretraining pipeline, including optimized data loaders, transformer implementations for multimodal inputs, and tools for monitoring training progress, along with scripts for fine-tuning these models on new domains. A toy illustration of this fine-tuning workflow follows.
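As a rough sketch of the kind of adaptation the release is meant to support (this reuses the hypothetical TokenPolicy from the earlier sketch and is not the released fine-tuning script), one can freeze a pretrained backbone, attach a fresh head for a new action space, and train it on a small target-domain dataset:

```python
# Hypothetical fine-tuning sketch (placeholder names, not the released Octo scripts):
# freeze the pretrained backbone, swap in a new action head, and train it on a
# small target-domain dataset.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def finetune(policy, new_action_dim, obs_tokens, actions, epochs=10):
    in_features = policy.action_head.in_features
    for p in policy.parameters():            # freeze pretrained weights
        p.requires_grad_(False)
    policy.action_head = nn.Linear(in_features, new_action_dim)  # new trainable head
    optimizer = torch.optim.AdamW(policy.action_head.parameters(), lr=3e-4)
    loader = DataLoader(TensorDataset(obs_tokens, actions), batch_size=64, shuffle=True)
    for _ in range(epochs):
        for obs, act in loader:
            # No task conditioning in this toy example; pass empty task tokens.
            task = torch.zeros(obs.shape[0], 1, obs.shape[-1])
            loss = nn.functional.mse_loss(policy(obs, task), act)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```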
While the team acknowledges that there is still room for improvement, for example in language conditioning, wrist-camera support, and incorporating data beyond near-ideal demonstrations, Octo represents a significant step toward generalist robot policies that work across a variety of robot setups. Octo aims to provide a practical platform through which researchers and practitioners can take advantage of larger robotics datasets. The team envisions their work enabling the use of pretrained models for rapid task learning and generalization, thereby advancing the fields of robotics and machine learning.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a computer science engineer with experience in FinTech companies across the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyday life easier.