Revolutionizing Robot Training: Google’s New Methods Using Video and Large Language Models


2024: The Year of AI and Robotics

In 2024, the convergence of generative AI and large foundational models with robotics is expected to bring significant advances across a range of fields, sparking excitement about potential applications in areas such as learning and product design. A key player in this research is Google’s DeepMind Robotics team, which is exploring the space’s potential.

“Our goal is to give robotics a better understanding of what humans want.” – DeepMind Robotics Team

Traditionally, robots have been designed to perform a single task repeatedly throughout their lifespan. While they excel at that one task, even slight changes or errors in the environment can prove challenging for them.

However, DeepMind’s latest innovation, AutoRT, is set to change that. The new system harnesses large foundational models to a number of different ends. In a standard example given by the team, AutoRT uses a Visual Language Model (VLM) to improve its situational awareness: with the help of cameras, the system can manage a fleet of robots working together while understanding their environment and the objects within it.

Additionally, the use of a large language model (LLM) allows the system to suggest tasks that can be completed by the robots, including the use of their “end effector”. LLMs are considered to be the key to enabling robots to understand natural language more effectively, minimizing the need for manual coding of skills.
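The division of labor described above — a VLM describing the scene, then an LLM proposing feasible tasks — can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: `describe_scene`, `propose_tasks`, and `run_step` are placeholder names standing in for model calls, not DeepMind's actual API.

```python
def describe_scene(camera_frame):
    # Stand-in for a VLM call: return text describing the objects
    # visible to the robot's camera.
    return "a sponge, a soda can, and an open drawer on the countertop"

def propose_tasks(scene_description):
    # Stand-in for an LLM call: given the scene description, suggest
    # tasks achievable by the robot's hardware (e.g. a simple gripper
    # as its end effector).
    return [
        "pick up the sponge",
        "place the soda can in the drawer",
        "close the drawer",
    ]

def run_step(camera_frame):
    # One cycle of the pipeline: perceive, propose, then pick a task
    # for the robot to attempt.
    scene = describe_scene(camera_frame)
    tasks = propose_tasks(scene)
    return tasks[0]
```

The key design point is that natural language is the interface between the two models, so new tasks can be proposed without hard-coding new skills.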

AutoRT has already undergone seven months of testing, during which it coordinated up to 20 robots at once and a total of 52 different devices. Over that period, DeepMind gathered a substantial dataset: more than 77,000 trials spanning 6,000 tasks.

Another exciting development from the team is the RT-Trajectory, which leverages video input for robotic learning. While many teams are exploring the use of online videos to train robots, RT-Trajectory adds an intriguing element. It overlays a two-dimensional sketch of the robotic arm in action over the video, providing valuable visual cues for the model as it learns its control policies.
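The overlay idea can be illustrated with a small, self-contained sketch: paint a 2-D path of waypoints onto a frame so the learner sees the intended arm motion as a visual hint. Here the "frame" is a character grid rather than real video pixels, and `overlay_trajectory` and the waypoint list are hypothetical, not RT-Trajectory's actual implementation.

```python
def overlay_trajectory(frame, points, mark="*"):
    """Mark a 2-D trajectory (list of (col, row) waypoints) onto a frame,
    represented here as a list of character rows. A real system would
    draw connected strokes on video frames; single marks keep it minimal."""
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for col, row in points:
        if 0 <= row < len(out) and 0 <= col < len(out[0]):
            out[row][col] = mark
    return out

# A dummy 16x8 "frame" and a hypothetical gripper path in frame coordinates.
frame = [["." for _ in range(16)] for _ in range(8)]
trajectory = [(2, 2), (5, 3), (9, 5), (13, 6)]
annotated = overlay_trajectory(frame, trajectory)
print("\n".join("".join(row) for row in annotated))
```

The annotated frames, rather than raw video alone, become the input from which the control policy is learned.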

“RT-Trajectory not only represents a step towards building more efficient robots in uncharted situations, but also unlocks knowledge from existing datasets.” – DeepMind Robotics Team

In testing, RT-Trajectory more than doubled the success rate of the previous RT-2 method, achieving 63% compared to 29%. It was evaluated on 41 different tasks, showing promising results in both training and testing.

The team further adds, “RT-Trajectory utilizes detailed robotic motion information present in all datasets, yet often underutilized. This not only brings us closer to creating robots that can move accurately in new environments, but also unlocks crucial insights from existing datasets.”

Kira Kim

Kira Kim is a science journalist with a background in biology and a passion for environmental issues. She is known for her clear and concise writing, as well as her ability to bring complex scientific concepts to life for a general audience.

