.. _model_zoo_sem:

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation
=======================================================================

.. image:: ../../_static/sem/sem_structure.png
   :alt: System Architecture Diagram
   :align: center
   :width: 100%

Introduction
-----------------------

Robust robot manipulation in diverse real-world environments hinges on a core capability: spatial understanding, the ability to perceive and reason about 3D geometry, object relations, and the robot's own embodiment. While recent diffusion-based policy models have demonstrated strong generative capacity for action modeling, their effectiveness remains limited by perceptual bottlenecks:

1. a reliance on 2D visual encoders with limited spatial reasoning,
2. insufficient modeling of the robot's internal structure beyond end-effector trajectories.

In this work, we propose **SEM (Spatial Enhanced Manipulation model)**, a novel diffusion-based policy framework that explicitly enhances spatial understanding through two complementary perceptual modules.

First, we introduce a 3D spatial enhancer that lifts multi-view 2D image features into a unified 3D representation by leveraging camera geometry and depth observations. This module enables fine-grained spatial reasoning while preserving the semantic richness of 2D vision backbones, effectively resolving the spatial ambiguities inherent in purely 2D encodings.

Second, we propose a joint-centric robot state encoder that models the robot's kinematic structure as a graph of interconnected joints. By predicting distributions over full joint trajectories, not just end-effector poses, and applying graph attention to capture joint dependencies, this module enhances embodiment-aware understanding and enables finer-grained control.

By integrating these two modules into a unified diffusion transformer, **SEM** enhances action generation capabilities, improving both spatial understanding of the external environment and internal robot embodiment, resulting in more robust, generalizable, and precise manipulation.

Key Features
------------------------

1. **Spatial Enhancer**:

   * Elevates multi-view 2D image features into a unified 3D representation using camera geometry and depth observations (illustrated in the first sketch below).
   * Enables fine-grained spatial reasoning while preserving the semantic richness of 2D vision backbones.
   * Resolves spatial ambiguities inherent in purely 2D encodings.

   .. image:: ../../_static/sem/sem_feature_enhancer.png
      :alt: System Architecture Diagram
      :align: center
      :width: 100%

2. **Joint-Centric Robot State Encoder**:

   * Models the robot's kinematic structure as a graph of interconnected joints.
   * Predicts distributions over full joint trajectories, not just end-effector poses.
   * Applies graph attention to capture joint dependencies, enhancing embodiment-aware understanding and enabling finer-grained control (see the graph-attention sketch below).

3. **Feature Enhancer**:

   * Facilitates cross-modal fusion and alignment of text and image features, emphasizing task-relevant regions (see the cross-attention sketch below).
   * Offers two implementations: GroundingDINO-style enhancer and language model-style enhancer.

4. **Action Decoder**:

   * Employs a diffusion transformer to predict future trajectories from aggregated features.
   * Includes additional cross-attention modules for robot state, image, and text features.
   * Temporal cross-attention operates in a causal manner to ensure unidirectional flow along the time axis (see the causal-mask sketch below).

.. image:: ../../_static/sem/sem_encoder_decoders.png
   :alt: System Architecture Diagram
   :align: center
   :width: 100%
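
The spatial enhancer's 2D-to-3D lifting can be pictured with a short, schematic sketch. Everything below is an illustrative assumption rather than the released SEM code: pinhole cameras with known intrinsics ``K`` and camera-to-world extrinsics ``T``, per-view depth aligned with the feature maps, and the hypothetical helper name ``lift_features_to_3d``.

.. code-block:: python

    import torch


    def lift_features_to_3d(feats, depth, K, T):
        """Back-project multi-view 2D features into a shared world frame.

        feats: (V, C, H, W) multi-view 2D feature maps
        depth: (V, H, W)    depth aligned with the feature maps
        K:     (V, 3, 3)    camera intrinsics
        T:     (V, 4, 4)    camera-to-world extrinsics
        Returns world-frame points (V, H*W, 3) and flattened features (V, H*W, C).
        """
        V, C, H, W = feats.shape
        device = feats.device
        # Pixel grid (u, v, 1) for every feature location.
        v, u = torch.meshgrid(
            torch.arange(H, device=device), torch.arange(W, device=device), indexing="ij"
        )
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3)
        pix = pix.reshape(1, H * W, 3).expand(V, -1, -1)                # (V, HW, 3)
        # Back-project to each camera frame: X_cam = depth * K^-1 [u, v, 1]^T
        rays = torch.einsum("vij,vnj->vni", torch.linalg.inv(K), pix)   # (V, HW, 3)
        pts_cam = rays * depth.reshape(V, H * W, 1)
        # Move every view into one world frame using homogeneous coordinates.
        ones = torch.ones(V, H * W, 1, device=device)
        pts_hom = torch.cat([pts_cam, ones], dim=-1)                    # (V, HW, 4)
        pts_world = torch.einsum("vij,vnj->vni", T, pts_hom)[..., :3]
        feats_flat = feats.flatten(2).transpose(1, 2)                   # (V, HW, C)
        return pts_world, feats_flat

Once all views are expressed in one 3D frame, the point-anchored features can be fused (for example with attention over nearby points or voxel pooling) into the unified spatial representation the policy consumes, which is the role the spatial enhancer plays in the figure above.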
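
For the joint-centric robot state encoder, the sketch below shows one way graph attention over the kinematic structure could look, assuming one embedded state token per joint and a boolean adjacency matrix derived from the kinematic tree. The class name ``JointGraphAttention`` and the use of a masked ``nn.MultiheadAttention`` are illustrative assumptions.

.. code-block:: python

    import torch
    import torch.nn as nn


    class JointGraphAttention(nn.Module):
        """Joints attend only to themselves and their neighbours in the kinematic graph."""

        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, joint_tokens: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
            # joint_tokens: (B, J, D) one embedded state token per joint.
            # adjacency:    (J, J) boolean kinematic graph (True = joints are connected).
            J = adjacency.shape[0]
            allowed = adjacency | torch.eye(J, dtype=torch.bool, device=adjacency.device)
            # attn_mask entries set to True are blocked from attending.
            out, _ = self.attn(joint_tokens, joint_tokens, joint_tokens,
                               attn_mask=~allowed, need_weights=False)
            return joint_tokens + out

Stacking a few such layers lets information propagate along the kinematic chain, which is one way to realise the embodiment-aware reasoning described above, with denoising targets covering full joint trajectories rather than end-effector poses alone.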
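
The feature enhancer's cross-modal fusion can be sketched as bi-directional cross-attention between image and text tokens, in the spirit of a GroundingDINO-style enhancer. The class name ``CrossModalEnhancer`` and the assumption that both modalities are already projected to a shared width are illustrative.

.. code-block:: python

    import torch
    import torch.nn as nn


    class CrossModalEnhancer(nn.Module):
        """Bi-directional cross-attention between image and text tokens."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
            # img_tokens: (B, N_img, D), txt_tokens: (B, N_txt, D)
            # Image tokens query the instruction to highlight task-relevant regions.
            img_out, _ = self.img2txt(img_tokens, txt_tokens, txt_tokens, need_weights=False)
            # Text tokens query the image to ground the objects they refer to.
            txt_out, _ = self.txt2img(txt_tokens, img_tokens, img_tokens, need_weights=False)
            return img_tokens + img_out, txt_tokens + txt_out

The language model-style variant mentioned above could instead process a concatenated token stream with shared self-attention; either way, the goal is aligned, task-focused features for the decoder.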
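
Finally, the action decoder's causal temporal cross-attention can be made concrete with the mask construction below, assuming one action token and one context token per timestep. The module name and tensor shapes are illustrative, not the released decoder.

.. code-block:: python

    import torch
    import torch.nn as nn


    class CausalTemporalCrossAttention(nn.Module):
        """Action tokens attend to context only at the same or earlier timesteps."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, action_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
            # action_tokens, context_tokens: (B, T, D), one token per timestep.
            T = action_tokens.shape[1]
            # True entries are blocked: step t never sees context from steps > t,
            # keeping the information flow unidirectional along the time axis.
            causal_mask = torch.triu(
                torch.ones(T, T, dtype=torch.bool, device=action_tokens.device), diagonal=1
            )
            out, _ = self.attn(action_tokens, context_tokens, context_tokens,
                               attn_mask=causal_mask, need_weights=False)
            return action_tokens + out


    # Example: 16 noisy action tokens attending causally to 16 per-step state tokens.
    x, ctx = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
    print(CausalTemporalCrossAttention(256)(x, ctx).shape)  # torch.Size([2, 16, 256])

In the diffusion transformer this kind of masked attention sits alongside the cross-attention to robot state, image, and text features described above.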

Resources
---------------------

* **Paper (arXiv):** `SEM: Enhancing Spatial Understanding for Robust Robot Manipulation <https://arxiv.org/abs/2505.16196>`_

Citation
--------------------

If you find **SEM** useful in your research, please consider citing the following paper:

.. code-block:: bibtex

    @article{lin2025sem,
      title={SEM: Enhancing Spatial Understanding for Robust Robot Manipulation},
      author={Lin, Xuewu and Lin, Tianwei and Huang, Lichao and Xie, Hongyu and Jin, Yiwei and Li, Keyu and Su, Zhizhong},
      journal={arXiv preprint arXiv:2505.16196},
      year={2025}
    }