Researchers at the University of Tokyo and Alternative Machine have developed a humanoid robot system that can directly map natural language commands to robot actions. Named Alter3, the robot has been designed to take advantage of the vast knowledge contained in large language models (LLMs) such as GPT-4 to perform complicated tasks such as taking a selfie or pretending to be a ghost.
This is the latest in a growing body of research that brings together the power of foundation models and robotics systems. While such systems have yet to yield scalable commercial products, they have propelled robotics research forward in recent years and show much promise.
How LLMs control robots
Alter3 uses GPT-4 as the backend model. The model receives a natural language instruction that either describes an action or a situation to which the robot must respond.
The LLM uses an “agentic framework” to plan a series of actions that the robot must take to achieve its goal. In the first stage, the model acts as a planner that must determine the steps required to perform the desired action.
Next, the action plan is passed on to a coding agent, which generates the commands required for the robot to perform each of the steps. Since GPT-4 has not been trained on the programming commands of Alter3, the researchers use its in-context learning ability to adapt its behavior to the API of the robot. This means that the prompt includes a list of commands and a set of examples that show how each command can be used. The model then maps each of the steps to one or more API commands, which are sent to the robot for execution.
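The two-stage pipeline described above can be sketched in a few dozen lines. Everything here is an illustrative assumption: the command names (`move_axis`, `wait`), the axis numbers, and the `stub_llm` function, which stands in for a real GPT-4 call so the sketch is self-contained. The article does not publish Alter3's actual API.

```python
# Hypothetical sketch of the planner/coder agent pipeline. The in-context
# API reference below is the kind of material the researchers put in the
# coding agent's prompt; the specific commands are invented for the demo.

API_REFERENCE = """\
Available commands (axis values 0-255):
  move_axis(axis_id, value)   # set one of the 43 motor axes
  wait(seconds)               # pause between motions
Example -- "wave the right hand":
  move_axis(18, 200)
  wait(0.5)
"""

def stub_llm(prompt: str) -> str:
    """Stand-in for a GPT-4 call; returns canned output for the demo."""
    if "Break the action" in prompt:
        return "1. Raise right arm\n2. Turn head toward the hand\n3. Smile"
    return "move_axis(18, 200)\nwait(0.5)"

def plan_steps(instruction: str) -> list[str]:
    """Stage 1: the planner agent decomposes the instruction into steps."""
    prompt = f"Break the action '{instruction}' into numbered steps."
    return [line.split(". ", 1)[1] for line in stub_llm(prompt).splitlines()]

def steps_to_commands(steps: list[str]) -> list[str]:
    """Stage 2: the coding agent maps each step to robot API commands,
    relying on the in-context API reference and examples in the prompt."""
    commands = []
    for step in steps:
        prompt = f"{API_REFERENCE}\nTranslate this step into commands: {step}"
        commands.extend(stub_llm(prompt).splitlines())
    return commands

steps = plan_steps("take a selfie")
commands = steps_to_commands(steps)
print(steps)
print(commands)
```

Swapping `stub_llm` for a real model call would turn this into the kind of system the paper describes: the planner and coder are the same LLM prompted two different ways, and the robot only ever sees the flat command list.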
“Before the LLM appeared, we had to control all the 43 axes in certain order to mimic a person’s pose or to pretend a behavior such as serving a tea or playing a chess,” the researchers write. “Thanks to LLM, we are now free from the iterative labors.”
Learning from human feedback
Language is not the most fine-grained medium for describing physical poses. Therefore, the action sequence generated by the model might not exactly produce the desired behavior in the robot.
To support corrections, the researchers have added functionality that allows humans to provide feedback such as “Raise your arm a bit more.” These instructions are sent to another GPT-4 agent that reasons over the code, makes the necessary corrections and returns the action sequence to the robot. The refined action recipe and code are stored in a database for future use.
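A minimal sketch of that feedback loop follows. The `refine_llm` function is a stand-in for the GPT-4 agent that reasons over the code, and the dictionary "database" and command names are assumptions; the paper's actual storage format is not described in the article.

```python
# Hypothetical sketch of the human-feedback refinement loop: a correction
# like "Raise your arm a bit more" is sent to a refinement agent, and the
# corrected motion code is memoized for future reuse.

motion_memory: dict[str, list[str]] = {}  # instruction -> refined commands

def refine_llm(code: list[str], feedback: str) -> list[str]:
    """Stand-in for the GPT-4 agent; here we just nudge an (invented)
    arm-axis value upward when asked to raise the arm more."""
    refined = []
    for cmd in code:
        if feedback.startswith("Raise") and cmd.startswith("move_axis(18"):
            refined.append("move_axis(18, 230)")  # a bit higher than before
        else:
            refined.append(cmd)
    return refined

def apply_feedback(instruction: str, code: list[str], feedback: str) -> list[str]:
    """Send the correction to the refinement agent and store the result."""
    refined = refine_llm(code, feedback)
    motion_memory[instruction] = refined  # reuse next time, skip replanning
    return refined

code = ["move_axis(18, 200)", "wait(0.5)"]
refined = apply_feedback("wave", code, "Raise your arm a bit more")
print(refined)  # ['move_axis(18, 230)', 'wait(0.5)']
```

The memoization step is the key design choice: once a motion has been corrected by a human, the robot can replay the refined recipe directly instead of regenerating it from scratch.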
The researchers tested Alter3 on several different tasks, including everyday actions such as taking a selfie and drinking tea as well as mimicry motions such as pretending to be a ghost or a snake. They also tested the model’s ability to respond to scenarios that require elaborate planning of actions.
“The training of the LLM encompasses a wide array of linguistic representations of movements. GPT-4 can map these representations onto the body of Alter3 accurately,” the researchers write.
GPT-4’s extensive knowledge about human behaviors and actions makes it possible to create more realistic behavior plans for humanoid robots such as Alter3. In the researchers’ experiments, the robot was also able to mimic emotions such as embarrassment and joy.
“Even from texts where emotional expressions are not explicitly stated, the LLM can infer adequate emotions and reflect them in Alter3’s physical responses,” the researchers write.
More advanced models
The use of foundation models is becoming increasingly popular in robotics research. For example, Figure, which is valued at $2.6 billion, uses OpenAI models behind the scenes to understand human instructions and carry out actions in the real world. As multi-modality becomes the norm in foundation models, robotics systems will become better equipped to reason about their environment and choose their actions.
Alter3 is part of a category of projects that use off-the-shelf foundation models as reasoning and planning modules in robotics control systems. Alter3 does not use a fine-tuned version of GPT-4, and the researchers point out that the code can be used for other humanoid robots.
Other projects such as RT-2-X and OpenVLA use specialized foundation models designed to directly produce robotics commands. These models tend to produce more stable results and generalize to more tasks and environments. But they also require more technical expertise and are more expensive to create.
One thing that is often overlooked in these projects is the more basic challenge of creating robots that can perform primitive tasks such as grasping objects, maintaining their balance, and moving around. “There’s a lot of other work that goes on at the level below that those models aren’t handling,” AI and robotics research scientist Chris Paxton told VentureBeat in an interview earlier this year. “And that’s the kind of stuff that is hard to do. And in a lot of ways, it’s because the data doesn’t exist.”