
Robots With Foundation Models Start Leaving the Lab
A new generation of foundation models built for physical tasks is turning robots into adaptable agents that can learn, plan and act in the real world—raising profound economic, safety and security questions.
Robots are getting a “ChatGPT moment” of their own. Foundation models once confined to chatbots are now being wired into machines that can see, plan and move, compressing a decade of robotics progress into a few product cycles.
At the center of this shift is a class of models built explicitly for the physical world. Vision-language-action (VLA) systems like Google DeepMind’s RT‑2 and Gemini Robotics, NVIDIA’s Isaac GR00T family, and a stream of new “large behavior models” (LBMs) are trained not just on text and images, but on human videos and robot trajectories — so they can output actions rather than sentences. Google reports that RT‑2 nearly doubled performance on unseen robotic tasks compared with its predecessor, while its newer Gemini Robotics models can handle dexterous jobs like folding paper or removing bottle caps with minimal task-specific training. (blog.google)
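The key mechanical idea behind these models is that robot actions are represented as tokens, just like words. RT‑2‑style VLAs discretize each continuous action dimension into a fixed number of bins so that a language model can emit motor commands the same way it emits text. As a rough illustration (the bin count and action layout here are simplified assumptions, not DeepMind's published configuration):

```python
# Simplified sketch of VLA action de-tokenization: each action
# dimension is binned into 256 discrete tokens, which the model
# emits like vocabulary items and the robot decodes back to floats.
def detokenize_action(tokens, low=-1.0, high=1.0, bins=256):
    """Map integer action tokens back to continuous joint deltas."""
    step = (high - low) / (bins - 1)
    return [low + t * step for t in tokens]

# A hypothetical 7-token "action": six arm deltas plus a gripper command.
action = detokenize_action([128, 0, 255, 64, 192, 128, 255])
```

Because actions live in the same token space as language, the same transformer that parses "pick up the bottle" can produce the motion that does it.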
NVIDIA has gone further, treating robots as first‑class citizens of the foundation‑model era. In March 2025 it released Isaac GR00T N1, an open foundation model for general‑purpose humanoids, alongside tools and datasets on Hugging Face and GitHub; CEO Jensen Huang told GTC attendees that “the age of generalist robotics is here.” (theverge.com) In December 2025 the company followed with GR00T N1.6, an improved open VLA model that controls multiple humanoid platforms — from Agibot Genie‑1 to Unitree’s G1 — and shows stronger real‑world bimanual manipulation and locomotion. (research.nvidia.com) These models are explicitly designed to sit on edge hardware like NVIDIA’s Jetson Thor, a Blackwell‑based robotics computer that can run multiple generative models onboard, enabling robots to plan and react in real time without constant cloud connectivity. (nvidianews.nvidia.com)
The combination of physical foundation models, edge compute and autonomous agents marks a break from the last decade of industrial automation. Instead of hard‑coded, single‑purpose machines, companies can now adapt generalist robots to new tasks with a fraction of the data and engineering. DeepMind’s Gemini Robotics is being piloted with partners including Boston Dynamics and Apptronik, while NVIDIA cites early adopters such as Boston Dynamics, Caterpillar and NEURA Robotics building “physical AI” products on its stack. (theverge.com) If these pilots scale, the impact could stretch from warehouses and factories to hospitals, retail and even domestic work — putting both blue‑ and white‑collar tasks within reach of adaptable machines.
At the same time, a quieter ecosystem of open and academic efforts is lowering the barrier to experimentation. Stanford’s OpenVLA, a 7‑billion‑parameter open‑source VLA released in 2024 and trained on the massive Open X‑Embodiment dataset, outperformed RT‑2 on a suite of manipulation benchmarks while remaining small enough for resource‑constrained deployment and fine‑tuning. (en.wikipedia.org) Open models like OpenVLA and GR00T N1.6 give startups and independent labs the tools to build robots that learn from their own fleets — without the capital requirements of a hyperscaler.
But moving from chatbots to machines that can act raises a new tier of safety and security concerns. Agent platforms that let large models call tools and run code are already popular for software automation; the same pattern applied to physical robots means an AI system can plan and execute sequences of real‑world actions with limited human oversight. The stakes are high enough that some developers are designing containment from day one. NanoClaw, a recently released open‑source personal AI assistant, advertises “agent swarms” built on Anthropic’s Claude models, but runs them inside hardened containers with sharply restricted file‑system access and sandboxed shell commands. (rywalker.com) It is an early example of a security‑first approach to agentic AI — the kind of architecture that may become mandatory once those agents control forklifts instead of spreadsheets.
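NanoClaw's internals are not reproduced here, but the containment pattern it advertises is straightforward to sketch: agent-issued shell commands pass through an allowlist and are executed without shell interpretation, so the model cannot chain pipes, redirects or arbitrary binaries. A minimal Python illustration of that idea (the allowlist and working directory are hypothetical):

```python
import shlex
import subprocess

# Hypothetical allowlist of binaries the agent may invoke.
ALLOWED = {"ls", "cat", "echo"}

def run_sandboxed(command_line, workdir=".", timeout=5):
    """Run an agent-issued command only if its binary is allowlisted."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"command not allowed: {argv[:1]}")
    # shell=False means the model cannot inject pipes, &&, or globs;
    # the timeout bounds how long any single tool call can run.
    return subprocess.run(argv, cwd=workdir, timeout=timeout,
                          capture_output=True, text=True)
```

Production systems layer containers, filesystem namespaces and syscall filters on top of this, but the principle is the same: the agent proposes, a hardened boundary disposes.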
Regulators are only beginning to grapple with this next frontier. Most current AI rules, from the EU’s AI Act to emerging guidance in the U.S. and Asia, were drafted with software systems in mind. They say little about hybrid entities where a foundation model, an edge computer and a mobile robot form a single operational system, or about liability when a self‑directed agent takes an unsafe physical action on its own initiative. Policymakers will have to decide whether to treat physical foundation models and robotic agents as high‑risk by default — and how to verify that safety mechanisms, like NanoClaw‑style sandboxing or NVIDIA’s functional‑safety processors, actually work under stress.
For now, the momentum is clearly with the builders. The same scaling dynamics that reshaped language and image models are arriving in robotics: more data, larger models, more capable simulators, and powerful edge chips designed for multimodal, real‑time inference. The difference is that this time, what’s being automated is not just cognition on a screen, but physical labor and coordination in messy human environments. As robots with foundation models leave the cloud and start moving things, the line between digital agents and embodied workers is beginning to blur — and societies have far less time than they think to decide how comfortable they are with that.