In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to their inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. BIP3D demonstrates strong 3D perception capabilities, achieving state-of-the-art performance in both 3D detection and grounding tasks on public datasets.
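To make the idea of attaching an explicit 3D position to image features more concrete, the following is a minimal sketch rather than the BIP3D implementation. It assumes a pinhole camera model and a single hypothetical depth value per feature cell, and it lifts the center of each feature-map cell into world coordinates using the camera intrinsics and a camera-to-world extrinsic matrix. All function names and parameters here are illustrative assumptions; how BIP3D actually turns such positions into an encoding is not described in this text.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_pixel(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Point3D:
    """Lift pixel (u, v) at a given depth into camera coordinates (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point: Point3D, cam_to_world: Sequence[Sequence[float]]) -> Point3D:
    """Apply a 4x4 camera-to-world (extrinsic) matrix to a 3D point."""
    x, y, z = point
    # Only the first three rows are needed for an affine rigid transform.
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )


def pixel_positions_3d(
    width: int, height: int, stride: int, depth: float,
    intrinsics: Tuple[float, float, float, float],
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point3D]:
    """World-space position of each feature-cell center at one assumed depth.

    `stride` is the (hypothetical) downsampling factor between the image and
    its feature map; `intrinsics` is (fx, fy, cx, cy).
    """
    fx, fy, cx, cy = intrinsics
    positions = []
    for v in range(stride // 2, height, stride):
        for u in range(stride // 2, width, stride):
            p_cam = backproject_pixel(u, v, depth, fx, fy, cx, cy)
            positions.append(camera_to_world(p_cam, cam_to_world))
    return positions
```

In a full model, positions like these would typically be passed through a small learned network and combined with the image features. Because the intrinsics and extrinsics enter the computation explicitly, the same code path applies to cameras with different parameters.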
 
Architecture diagram of BIP3D. Red stars mark the parts that have been modified or added relative to the base model, GroundingDINO, and dashed lines indicate optional elements.
BIP3D handles a variety of scenarios, delivering robust perception in dense environments as well as for small and oversized targets. Moreover, BIP3D generalizes well across both intrinsic and extrinsic camera parameters, supporting multiple camera types. The following videos showcase its capabilities in diverse settings.