
RoboTransfer: A Diffusion-Based Framework for Photo-Realistic, Geometry-Consistent, and Controllable Robotic Data Synthesis
Imitation learning has become a cornerstone in robotic manipulation. However, collecting large-scale real-world robot demonstrations remains prohibitively expensive. While simulators offer a more cost-effective alternative, the significant sim-to-real gap poses substantial challenges to scalability.
To address this, we present RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike prior approaches, RoboTransfer integrates multi-view geometry with explicit, fine-grained control over scene components, including background textures and object-level attributes. Through cross-view feature interaction and the incorporation of global depth-normal priors, RoboTransfer ensures geometric consistency across views.
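As a rough illustration of these two ingredients, the sketch below shows one way cross-view feature interaction and depth-normal conditioning could be wired up. The module names, shapes, and wiring are simplified assumptions for exposition, not the released RoboTransfer implementation.

# Minimal PyTorch-style sketch (assumed names and shapes) of cross-view
# feature interaction plus depth-normal conditioning.
import torch
import torch.nn as nn


class CrossViewBlock(nn.Module):
    """Attention over the tokens of all camera views jointly, so features in
    one view can attend to the others and stay consistent across views."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) -> merge views into one attention sequence
        b, v, n, d = x.shape
        seq = self.norm(x).reshape(b, v * n, d)
        out, _ = self.attn(seq, seq, seq)
        return x + out.reshape(b, v, n, d)


class GeometryConditioner(nn.Module):
    """Projects per-view depth (1 channel) and normal (3 channels) maps into
    the feature space and adds them as a global geometric prior."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(4, dim, kernel_size=3, padding=1)

    def forward(self, feats, depth, normal):
        # feats: (B*V, dim, H, W); depth: (B*V, 1, H, W); normal: (B*V, 3, H, W)
        return feats + self.proj(torch.cat([depth, normal], dim=1))


if __name__ == "__main__":
    block = CrossViewBlock(dim=256)
    tokens = torch.randn(2, 3, 64, 256)   # batch=2, views=3, tokens=64
    print(block(tokens).shape)            # torch.Size([2, 3, 64, 256])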
Our framework supports highly controllable generation, such as background replacement and object swapping, facilitating diverse and photorealistic multi-view video synthesis. Experimental results demonstrate that RoboTransfer significantly enhances both geometric fidelity and visual realism. Furthermore, policies trained on RoboTransfer-generated data achieve a 33.3% relative improvement in success rate under the Diff-Obj setting, and a remarkable 251% relative improvement under the more challenging Diff-All scenario.
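For clarity, relative improvement is read here as the standard ratio of the gain over the baseline success rate; the snippet below only illustrates that arithmetic, and the success rates in it are made up rather than the paper's numbers.

# Relative-improvement arithmetic only; the example success rates are illustrative.
def relative_improvement(baseline_sr: float, new_sr: float) -> float:
    """(new - baseline) / baseline, expressed in percent."""
    return (new_sr - baseline_sr) / baseline_sr * 100.0


print(f"{relative_improvement(0.30, 0.40):.1f}%")  # 33.3% (illustrative numbers)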
Given the same structured input, the model enables flexible editing of background attributes, such as texture and color. This real-to-real generation framework enhances the diversity of training data, improving the generalization of the policy model for downstream tasks.
Likewise, the appearance of foreground objects, including their color, can be modified.
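At the interface level, such appearance editing amounts to keeping the structured (geometric) conditions fixed while swapping the background and object appearance references. The sketch below is a hypothetical illustration of that idea; all field and function names are assumptions, not part of a released API.

# Hypothetical conditioning interface: geometry stays fixed, appearance is swapped.
from dataclasses import dataclass
from typing import Dict, Optional

import torch


@dataclass
class SceneCondition:
    depth: torch.Tensor                   # (T, V, 1, H, W) per-frame, per-view depth
    normal: torch.Tensor                  # (T, V, 3, H, W) per-frame, per-view normals
    background_ref: torch.Tensor          # (3, H, W) background appearance reference
    object_refs: Dict[str, torch.Tensor]  # per-object appearance references


def swap_appearance(cond: SceneCondition,
                    new_background: Optional[torch.Tensor] = None,
                    new_objects: Optional[Dict[str, torch.Tensor]] = None) -> SceneCondition:
    """Return a condition with edited appearance but unchanged geometry, so the
    generated video keeps the same layout and motion."""
    return SceneCondition(
        depth=cond.depth,
        normal=cond.normal,
        background_ref=new_background if new_background is not None else cond.background_ref,
        object_refs={**cond.object_refs, **(new_objects or {})},
    )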
RoboTransfer generates photorealistic videos from simulated structural inputs, including out-of-distribution cases. This sim-to-real paradigm minimizes the dependence on structured annotations from real-world datasets, making robotic learning more scalable and flexible. To showcase the effectiveness of our approach, we evaluate it on two tasks: Bowls Stack and Cup Place, sourced from the CVPR 2025 RoboTwin Benchmark.
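The sim-to-real workflow can be summarized by the hypothetical driver below: structural maps (depth and surface normals) are rendered from simulator states and passed, together with appearance references, to the trained video generator. The names render_geometry and video_diffusion are placeholders, not part of any released API.

# Hypothetical sim-to-real driver; `render_geometry` and `video_diffusion`
# stand in for the simulator's renderer and the trained generator.
import torch


def generate_photoreal_episode(video_diffusion, render_geometry, episode, refs):
    """episode: sequence of simulator states; refs: appearance references."""
    depths, normals = [], []
    for state in episode:
        d, n = render_geometry(state)   # per-view depth (V, 1, H, W) and normals (V, 3, H, W)
        depths.append(d)
        normals.append(n)
    depth = torch.stack(depths)         # (T, V, 1, H, W)
    normal = torch.stack(normals)       # (T, V, 3, H, W)
    # The diffusion model denoises a multi-view video latent conditioned on the
    # rendered geometry and the appearance references (background and objects).
    return video_diffusion(depth=depth, normal=normal, **refs)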
This section presents a side-by-side comparison of results produced by different methods. The video below highlights the qualitative differences in generation quality across approaches.
This section presents visual results from real-world experiments conducted to evaluate our data generation approach for visual policy models. We carried out extensive trials in physical environments to assess the effectiveness of the proposed method. Full statistical results and analysis are provided in the accompanying paper.
@misc{2025robotransfer,
  title={RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer},
  author={Liu Liu and Xiaofeng Wang and Guosheng Zhao and Keyu Li and Wenkang Qin and Jiaxiong Qiu and Zheng Zhu and Guan Huang and Zhizhong Su},
  year={2025},
  eprint={2505.23171},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23171},
}