HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory

Xiaolin Zhou¹*, Liu Liu¹*†, Tingyang Xiao¹*, Wei Feng¹*, Fa Fu², Xinrui Meng²,
Xinjie Wang¹, Jialiang Han¹, Boyang Yu¹, Yun Du¹, Wei Sui², Zhizhong Su¹

¹Horizon Robotics ²D-Robotics
*Equal contribution †Corresponding author: Liu Liu (nemo.liu@horizon.auto)

A unified embodied agent framework that converts language instructions into executable skill graphs, grounds them in persistent 3D memory, and monitors real-world robot execution for feedback-driven recovery.

Abstract

From Digital Agents to Physical Robots

HoloAgent-0 narrows the embodiment gap between digital LLM agents and real-world robots, where physical execution is continuous, embodiment-dependent, uncertain, and safety-constrained. The system organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for planning, scheduling, monitoring, and re-planning; 3D spatial memory for physical-world grounding; and embodied skills for executable robot action. Together, these layers make long-horizon physical skills composable, observable, and verifiable.

Real-Robot Demonstrations

Closed-Loop Execution in Physical Scenes

These demonstrations show HoloAgent-0 deployed on real hardware across motion generation, object search, cross-robot coordination, and mobile manipulation. The videos highlight how monitored skills, spatial memory, and runtime feedback support closed-loop execution in physical scenes.

Navigation and Dance Coordination

Heterogeneous robots share task context through the framework while combining mobile navigation, execution monitoring, and expressive humanoid motion.

Long-Horizon Mobile Manipulation

The framework decomposes long-horizon mobile manipulation into navigation, grasping, placement, verification, and recovery steps.

Active Exploration in a New Environment

The robot enters an unseen environment, actively explores reachable space, expands 3D memory online, and updates the scene graph for later navigation and search.

Interactive Humanoid Command Execution

A humanoid robot follows open-ended human commands, freely navigating the environment and executing corresponding embodied actions.

A Day with a Robot Companion

Powered by a multimodal agent brain and a full-stack robot skill library, the system brings large models into the physical world. The robot understands language, reasons over 3D space, navigates autonomously, communicates naturally, and turns human intent into long-horizon physical action.

A Day in the Life of a Robot Guide

A robot guide provides natural-language workspace assistance, leading users to destinations, answering intent-driven requests, and adapting its route through spatial memory.

Prompt Motion Control: Execute and verify short-horizon whole-body commands.

Active Object Search: Explore, build the map, and verify the target coffee machine.

Cross-Robot Coordination: Route one robot while another performs a dance skill.

Long-Horizon Mobile Manipulation: Decompose laundry folding into navigation, pick-and-place, motion, and manipulation steps.

Embodied Agent Framework

Framework for Closed-Loop Robot Execution

Embodied AgentOS converts natural-language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. Through a ROS2 command/status bus, it closes the loop between spatial retrieval, skill-graph planning, execution monitoring, memory updates, and feedback-driven recovery.
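
The plan, monitor, and re-plan loop described above can be illustrated with a small sketch. It is not the AgentOS implementation: the names SkillNode and execute_skill_graph, the retry policy, and the event strings are all assumptions made for illustration, and plain Python callables stand in for the ROS2 command/status bus.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class SkillNode:
    """One node of a hypothetical skill graph."""
    name: str
    run: Callable[[], bool]                  # returns True on verified success
    depends_on: list = field(default_factory=list)
    max_retries: int = 1


def execute_skill_graph(nodes):
    """Run skills in dependency order; retry failures, then request re-planning."""
    by_name = {node.name: node for node in nodes}
    # TopologicalSorter takes a mapping from each node to its predecessors.
    order = TopologicalSorter({n.name: n.depends_on for n in nodes}).static_order()
    events = []
    for name in order:
        node = by_name[name]
        for _attempt in range(node.max_retries + 1):
            if node.run():
                events.append((name, "succeeded"))
                break
            events.append((name, "failed"))
        else:
            # Retries exhausted: stop executing and hand control back to the planner.
            events.append((name, "replan_requested"))
            break
    return events
```

In this sketch the returned event list plays the role of the status stream: a planner could read a "replan_requested" entry and build a new graph from the current world state.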

Skill Layer

Embodied Skills as Executable Robot Interfaces

The Skill Layer exposes executable robot capabilities as typed, monitored interfaces rather than low-level controls. Navigation, perception, manipulation, whole-body motion, interaction, and cross-embodiment coordination backends bind task intent to specific robot embodiments while reporting progress, failure modes, safety state, and recoverability for verification and online recovery.
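
A typed, monitored skill interface of this kind might look like the following sketch. The field names (progress, failure_mode, safe, recoverable) mirror the quantities listed above, but the classes SkillStatus and Skill and the decision rule in next_action are hypothetical and not taken from the HoloAgent-0 codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol


class SkillState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class SkillStatus:
    """Status report a skill backend returns on every monitoring tick."""
    state: SkillState
    progress: float                      # fraction complete, 0.0 to 1.0
    failure_mode: Optional[str] = None   # e.g. "grasp_slip"; None if no failure
    safe: bool = True
    recoverable: bool = True


class Skill(Protocol):
    """Interface every embodiment-specific backend would implement."""
    name: str

    def step(self) -> SkillStatus: ...


def next_action(status: SkillStatus) -> str:
    """Map a status report to a runtime decision (assumed policy)."""
    if not status.safe:
        return "abort"
    if status.state is SkillState.SUCCEEDED:
        return "verify"
    if status.state is SkillState.FAILED:
        return "retry" if status.recoverable else "replan"
    return "continue"
```

Keeping the status type immutable and explicit lets a monitor reason about any backend the same way, regardless of which robot executes the skill.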

Interaction

Speech and Perception

Speech, perception, detection, localization, and verification connect human intent with runtime visual evidence.

HoloNavi

Spatial Navigation

Memory-grounded navigation retrieves candidate places and objects, selects viewpoints, verifies progress online, and updates task state from feedback.
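
As a rough illustration of retrieve, navigate, and verify, the sketch below ranks candidate places by a semantic score minus a travel-distance penalty and falls back to the next candidate when verification fails. The scoring formula, the distance_weight value, and the function names are assumptions for this example only; a real system would call a monitored navigation skill where the sketch merely records a visit.

```python
import math


def rank_candidates(candidates, robot_xy, distance_weight=0.05):
    """Sort (label, (x, y), semantic_score) tuples, best first."""
    def score(candidate):
        _, xy, semantic_score = candidate
        return semantic_score - distance_weight * math.dist(robot_xy, xy)
    return sorted(candidates, key=score, reverse=True)


def search_with_verification(candidates, robot_xy, verify):
    """Visit ranked candidates until verify(label, xy) confirms the target."""
    visited = []
    for label, xy, _ in rank_candidates(candidates, robot_xy):
        visited.append(label)        # stand-in for a monitored navigation call
        if verify(label, xy):
            return label, visited
    return None, visited
```

The list of visited places is the kind of execution evidence that could be written back to memory so later queries avoid places already ruled out.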

HoloBrain

Manipulation

Pick, place, open, hand-over, push, fold, and grasp-anything skills combine task intent, visual observations, embodiment priors, and semantic memory.

HoloMotion

Whole-Body Control

Reference tracking, velocity control, posture adjustment, expressive motion, and recovery provide monitored whole-body control for humanoids.

Collaboration

Cross-Embodiment

Heterogeneous robots share memory records, typed skill calls, and status events while preserving platform-specific sensing, actuation, safety, and recovery constraints.

Memory Layer

Spatial and Temporal Memory for Embodied Agents

The Memory Layer provides spatial and temporal memory for HoloAgent-0. Spatial memory converts sensor streams into a persistent 3D world representation for grounding, localization, navigation, and manipulation, including geometry, topology, occupancy, robot pose, open-vocabulary semantic instances, and a hierarchical multimodal scene graph. Temporal memory records goals, plan state, execution and recovery traces, and outcome summaries so AgentOS can retrieve current world state, task history, and recent execution evidence.

Geometry

Metric Geometry and Localization

Maintains coordinate frames, robot pose, dense geometry, topology, occupancy, traversability evidence, and localization indices for embodied skills.
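
To make the occupancy and traversability part concrete, here is a minimal 2D occupancy-grid sketch. The OccupancyGrid class, its set-based storage, and the chosen resolution are illustrative assumptions; the page does not specify how HoloAgent-0 stores occupancy.

```python
import math
from dataclasses import dataclass, field


@dataclass
class OccupancyGrid:
    """Hypothetical 2D grid in a fixed map frame."""
    origin: tuple          # world (x, y) of the corner of cell (0, 0), in metres
    resolution: float      # cell edge length in metres
    width: int             # number of columns
    height: int            # number of rows
    occupied: set = field(default_factory=set)   # set of (col, row) cells

    def world_to_cell(self, x, y):
        col = math.floor((x - self.origin[0]) / self.resolution)
        row = math.floor((y - self.origin[1]) / self.resolution)
        return col, row

    def is_free(self, x, y):
        """True if the point lies inside the grid and its cell is unoccupied."""
        col, row = self.world_to_cell(x, y)
        in_bounds = 0 <= col < self.width and 0 <= row < self.height
        return in_bounds and (col, row) not in self.occupied
```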

Semantics

Open-Vocabulary 3D Semantic Mapping

Lifts 2D foundation-model features onto geometry memory and associates observations with persistent language-queryable 3D instances.
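
One common way to make 3D instances language-queryable is to keep a fused feature vector per instance and rank instances by cosine similarity to a text embedding. The sketch below assumes that scheme, with a running mean as the fusion rule; the dictionary layout and the function names are hypothetical, and the actual HoloAgent-0 association method may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse_feature(instance, view_feature):
    """Fold one new per-view feature into the instance's running mean."""
    n = instance["num_views"]
    instance["feature"] = [
        (f * n + v) / (n + 1) for f, v in zip(instance["feature"], view_feature)
    ]
    instance["num_views"] = n + 1


def query_instances(text_embedding, instances, top_k=3):
    """Return the ids of the top_k instances most similar to the query."""
    ranked = sorted(
        instances,
        key=lambda inst: cosine(text_embedding, inst["feature"]),
        reverse=True,
    )
    return [inst["id"] for inst in ranked[:top_k]]
```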

HMSG

Hierarchical Multimodal Scene Graph

Organizes floors, rooms, views, and objects for room-, view-, and object-level grounding, verification, and reasoning.
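
The floor, room, view, and object hierarchy can be pictured as nested records. The following sketch uses plain dataclasses and an exact-label lookup; the class names and the ground function are assumptions for illustration, and the real scene graph also carries multimodal features that are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str
    position: tuple          # (x, y, z) in the map frame


@dataclass
class ViewNode:
    view_id: str
    objects: list = field(default_factory=list)


@dataclass
class RoomNode:
    name: str
    views: list = field(default_factory=list)


@dataclass
class FloorNode:
    level: int
    rooms: list = field(default_factory=list)


def ground(floors, label, room_name=None):
    """Return (floor, room, view, position) for each object matching label.

    Passing room_name restricts the search to one room (room-level grounding).
    """
    matches = []
    for floor in floors:
        for room in floor.rooms:
            if room_name is not None and room.name != room_name:
                continue
            for view in room.views:
                for obj in view.objects:
                    if obj.label == label:
                        matches.append((floor.level, room.name, view.view_id, obj.position))
    return matches
```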

Temporal

Task State and Memory Update

Records goals, plan state, status events, verification results, and recovery decisions, then updates the affected spatial records after execution.
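
A minimal sketch of this bookkeeping is an append-only event log plus a guarded write-back to spatial records. The TemporalMemory class, the "verified" status string, and the apply_outcome helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TemporalMemory:
    """Append-only trace of one task's execution."""
    goal: str
    events: list = field(default_factory=list)

    def record(self, skill, status, detail=""):
        event = {"step": len(self.events), "skill": skill, "status": status, "detail": detail}
        self.events.append(event)
        return event

    def recent(self, n=3):
        """Return the last n events as recent execution evidence."""
        return self.events[-n:]


def apply_outcome(spatial_records, event, object_id, new_position):
    """Update an object's stored position only after a verified outcome."""
    if event["status"] == "verified":
        spatial_records[object_id] = new_position
        return True
    return False
```

Gating the spatial update on verification keeps unconfirmed actions, such as a failed placement, from corrupting the persistent map.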

Experiments

Navigation, Memory, and Real-Robot Execution

The evaluation covers spatial memory, long-horizon navigation, and closed-loop execution. Quantitative experiments test semantic mapping and the HoloAgent-Nav navigation stack, while real-robot deployments show the same AgentOS runtime composing humanoid motion, object search, cross-robot coordination, and mobile manipulation through monitored skills.

82.6% HM3D-ObjNav SR

The AgentOS-wrapped navigation stack reaches the strongest simulated success rate under the HM3D-ObjNav protocol.

42.8% HM3D-ObjNav SPL

Feedback-driven execution improves path efficiency over the slow-reasoning FSR-VLN reference baseline.
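
For readers unfamiliar with the two metrics, SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) scales each success by the ratio of the shortest-path length to the length of the path actually taken. The sketch below computes both with the standard ObjectNav definitions; the episode tuples in the usage note are made-up numbers, not results from this work.

```python
def success_rate(episodes):
    """episodes: list of (success, shortest_path_len, taken_path_len)."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A path can never earn more than 1.0, even if shorter than the reference.
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

With three toy episodes, (True, 10.0, 10.0), (True, 10.0, 20.0), and (False, 10.0, 5.0), SR is 2/3 while SPL is 0.5, which shows how a successful but indirect route lowers SPL without affecting SR.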

97.70% Real Top-1@1m

Real-robot goal reaching succeeds under the strict 1.0 m physical-apartment threshold.

98.90% Real Top-5@1m

Top-5 candidate retrieval and monitored execution remain strong under the same strict threshold.
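
The page does not define Top-k@1m precisely. One plausible reading, assumed here purely for illustration, is that a query counts as a hit when any of the top k ranked goal positions lies within 1.0 m of the ground-truth goal. The function names and the toy coordinates in the sketch are hypothetical.

```python
import math


def top_k_hit(ranked_goals, true_goal, k, threshold_m=1.0):
    """True if any of the first k ranked goals is within threshold_m of the truth."""
    return any(math.dist(goal, true_goal) <= threshold_m for goal in ranked_goals[:k])


def top_k_accuracy(queries, k, threshold_m=1.0):
    """queries: list of (ranked_goals, true_goal) pairs; returns the hit fraction."""
    hits = sum(top_k_hit(ranked, truth, k, threshold_m) for ranked, truth in queries)
    return hits / len(queries)
```

Under this reading Top-5 can only match or exceed Top-1 on the same queries, which is consistent with the two figures reported above.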

Future Work

Embodiment-Aware Robot-Agent Stack

HoloAgent-0 is an early step toward full-stack robot-agent systems that connect language-level reasoning, embodied execution, persistent memory, and safe evaluation. Its current limitations point toward instruction-aligned robot foundation models, broader embodiment support with full-stack humanoid skills, and code generation for robot evolution.

Action Models

Instruction-Aligned Robot Foundation Models

Future robot foundation models should expose language-aligned, composable action spaces that AgentOS can schedule, monitor, and verify.

Embodiments

Broader Embodiment and Humanoid Skills

AgentOS should coordinate sensing, actuation, safety, mobility, object-centric navigation, manipulation, interaction, and recovery across platforms.

Code Generation

Robot Evolution with Digital-Twin Validation

Coding agents can generate robot actions and execution policies from task intent, robot APIs, and environment context, while EmbodiedGen-style digital twins validate them before real-world deployment.

Citation

BibTeX

@article{holoagent2026,
  title   = {HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory},
  author  = {Zhou, Xiaolin and Liu, Liu and Xiao, Tingyang and Feng, Wei and Fu, Fa and Meng, Xinrui and Wang, Xinjie and Han, Jialiang and Yu, Boyang and Du, Yun and Lin, Tianwei and Sui, Wei and Su, Zhizhong},
  journal = {Technical Report},
  year    = {2026},
  url     = {https://github.com/HorizonRobotics/HoloAgent}
}