
RoboTransfer: A Diffusion-Based Framework for Photo-Realistic, Geometry-Consistent, and Controllable Robotic Data Synthesis
Imitation learning has become a cornerstone in robotic manipulation. However, collecting large-scale real-world robot demonstrations remains prohibitively expensive. While simulators offer a more cost-effective alternative, the significant sim-to-real gap poses substantial challenges to scalability.
To address this, we present RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike prior approaches, RoboTransfer integrates multi-view geometry with explicit, fine-grained control over scene components, including background textures and object-level attributes. Through cross-view feature interaction and the incorporation of global depth-normal priors, RoboTransfer ensures geometric consistency across views.
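As a rough illustration of these two ingredients, the sketch below shows one way cross-view feature interaction and depth-normal conditioning could be wired up. The module names, shapes, and wiring are simplified assumptions for exposition, not the released RoboTransfer implementation.

# Minimal PyTorch-style sketch (assumed names and shapes) of cross-view
# feature interaction plus depth-normal conditioning.
import torch
import torch.nn as nn


class CrossViewBlock(nn.Module):
    """Attention over the tokens of all camera views jointly, so features in
    one view can attend to the others and stay consistent across views."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) -> merge views into one attention sequence
        b, v, n, d = x.shape
        seq = self.norm(x).reshape(b, v * n, d)
        out, _ = self.attn(seq, seq, seq)
        return x + out.reshape(b, v, n, d)


class GeometryConditioner(nn.Module):
    """Projects per-view depth (1 channel) and normal (3 channels) maps into
    the feature space and adds them as a global geometric prior."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(4, dim, kernel_size=3, padding=1)

    def forward(self, feats, depth, normal):
        # feats: (B*V, dim, H, W); depth: (B*V, 1, H, W); normal: (B*V, 3, H, W)
        return feats + self.proj(torch.cat([depth, normal], dim=1))


if __name__ == "__main__":
    block = CrossViewBlock(dim=256)
    tokens = torch.randn(2, 3, 64, 256)   # batch=2, views=3, tokens=64
    print(block(tokens).shape)            # torch.Size([2, 3, 64, 256])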
Our framework supports highly controllable generation, such as background replacement and object swapping, facilitating diverse and photorealistic multi-view video synthesis. Experimental results demonstrate that RoboTransfer significantly enhances both geometric fidelity and visual realism. Furthermore, policies trained on RoboTransfer-generated data achieve a 33.3% relative improvement in success rate under the Diff-Obj setting, and a remarkable 251% relative improvement under the more challenging Diff-All scenario.
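For clarity, relative improvement is read here as the standard ratio of the gain over the baseline success rate; the snippet below only illustrates that arithmetic, and the success rates in it are made up rather than the paper's numbers.

# Relative-improvement arithmetic only; the example success rates are illustrative.
def relative_improvement(baseline_sr: float, new_sr: float) -> float:
    """(new - baseline) / baseline, expressed in percent."""
    return (new_sr - baseline_sr) / baseline_sr * 100.0


print(f"{relative_improvement(0.30, 0.40):.1f}%")  # 33.3% (illustrative numbers)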
Given the same structured input, the model enables flexible editing of background attributes, such as texture and color. This real-to-real generation framework enhances the diversity of training data, improving the generalization of the policy model for downstream tasks.
Likewise, the appearance of foreground objects, including their color, can be modified.
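At the interface level, such appearance editing amounts to keeping the structured (geometric) conditions fixed while swapping the background and object appearance references. The sketch below is a hypothetical illustration of that idea; all field and function names are assumptions, not part of a released API.

# Hypothetical conditioning interface: geometry stays fixed, appearance is swapped.
from dataclasses import dataclass
from typing import Dict, Optional

import torch


@dataclass
class SceneCondition:
    depth: torch.Tensor                   # (T, V, 1, H, W) per-frame, per-view depth
    normal: torch.Tensor                  # (T, V, 3, H, W) per-frame, per-view normals
    background_ref: torch.Tensor          # (3, H, W) background appearance reference
    object_refs: Dict[str, torch.Tensor]  # per-object appearance references


def swap_appearance(cond: SceneCondition,
                    new_background: Optional[torch.Tensor] = None,
                    new_objects: Optional[Dict[str, torch.Tensor]] = None) -> SceneCondition:
    """Return a condition with edited appearance but unchanged geometry, so the
    generated video keeps the same layout and motion."""
    return SceneCondition(
        depth=cond.depth,
        normal=cond.normal,
        background_ref=new_background if new_background is not None else cond.background_ref,
        object_refs={**cond.object_refs, **(new_objects or {})},
    )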
RoboTransfer generates photorealistic videos from simulated structural inputs, including out-of-distribution cases. This sim-to-real paradigm minimizes the dependence on structured annotations from real-world datasets, making robotic learning more scalable and flexible. To showcase the effectiveness of our approach, we evaluate it on two tasks: Bowls Stack and Cup Place, sourced from the CVPR 2025 RoboTwin Benchmark.
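The sim-to-real workflow can be summarized by the hypothetical driver below: structural maps (depth and surface normals) are rendered from simulator states and passed, together with appearance references, to the trained video generator. The names render_geometry and video_diffusion are placeholders, not part of any released API.

# Hypothetical sim-to-real driver; `render_geometry` and `video_diffusion`
# stand in for the simulator's renderer and the trained generator.
import torch


def generate_photoreal_episode(video_diffusion, render_geometry, episode, refs):
    """episode: sequence of simulator states; refs: appearance references."""
    depths, normals = [], []
    for state in episode:
        d, n = render_geometry(state)   # per-view depth (V, 1, H, W) and normals (V, 3, H, W)
        depths.append(d)
        normals.append(n)
    depth = torch.stack(depths)         # (T, V, 1, H, W)
    normal = torch.stack(normals)       # (T, V, 3, H, W)
    # The diffusion model denoises a multi-view video latent conditioned on the
    # rendered geometry and the appearance references (background and objects).
    return video_diffusion(depth=depth, normal=normal, **refs)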
This section presents a side-by-side comparison of results produced by different methods. The video below highlights the qualitative differences in generation quality across approaches.
This section presents visual results from real-world experiments conducted to evaluate our data generation approach for visual policy models. We carried out extensive trials in physical environments to assess the effectiveness of the proposed method. Full statistical results and analysis are provided in the accompanying paper.
@misc{2025robotransfer,
  title={RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer},
  author={Liu Liu and Xiaofeng Wang and Guosheng Zhao and Keyu Li and Wenkang Qin and Jiaxiong Qiu and Zheng Zhu and Guan Huang and Zhizhong Su},
  year={2025},
  eprint={2505.23171},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23171},
}