Developing general robotic systems capable of manipulation in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel at high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action models (VLAs) is a potential solution, but it is hindered by high data-collection costs and limited generalization. To address these challenges, we propose a novel object-centric representation that bridges the gap between the VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. Building on this, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
Given an instruction and RGB-D observations, OmniManip uses a VLM and vision foundation models (VFMs) to identify task-relevant objects and decompose the task into distinct stages. In each stage, OmniManip extracts object-centric canonical interaction primitives as spatial constraints and employs the Resampling-Rendering-Checking (RRC) mechanism for closed-loop planning. During execution, the trajectory is optimized under these constraints and continuously updated via a 6D pose tracker, achieving closed-loop execution.
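To make the representation concrete, the sketch below shows one way a canonical-space interaction primitive (a point and a direction in the object's canonical frame) could be turned into a world-frame spatial constraint once the object's 6D pose is known. This is a minimal illustration; the class and function names are our own assumptions, not the actual OmniManip interface.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class InteractionPrimitive:
    """An interaction point and direction defined in the object's canonical frame."""
    point: np.ndarray      # (3,) e.g. the center of a mug handle
    direction: np.ndarray  # (3,) unit vector, e.g. the pouring axis

def primitive_to_world(primitive: InteractionPrimitive, pose: np.ndarray):
    """Map a canonical-frame primitive into the world frame using a 4x4 6D pose."""
    R, t = pose[:3, :3], pose[:3, 3]
    point_w = R @ primitive.point + t    # points transform affinely
    dir_w = R @ primitive.direction      # directions only rotate
    return point_w, dir_w / np.linalg.norm(dir_w)

# Example constraint: require the teapot's spout direction to point at the
# cup's opening point, with both expressed in the world frame at planning time.
```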
Closed-loop Planning.
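Following the overview above, the planning loop resamples candidate primitives, renders the resulting interaction, and lets the VLM check the rendering against the stage description. Below is a minimal sketch of how such a loop might be wired up; the helper callables (`sample_primitives`, `render_interaction`, `vlm_check`) are hypothetical placeholders for the corresponding components.

```python
def closed_loop_plan(stage, scene, sample_primitives, render_interaction,
                     vlm_check, max_rounds=5):
    """Resampling-Rendering-Checking (RRC) planning loop, sketched.

    Keep proposing interaction primitives until the VLM accepts one
    (or the sampling budget is exhausted).
    """
    for _ in range(max_rounds):
        for primitive in sample_primitives(stage, scene):      # 1. resample candidates
            rendering = render_interaction(scene, primitive)   # 2. render the interaction
            if vlm_check(rendering, stage.description):        # 3. VLM checking
                return primitive
    raise RuntimeError(f"No valid primitive found for stage: {stage.description}")
```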
Closed-loop Execution.
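Closed-loop execution, as summarized in the overview, re-optimizes the trajectory whenever the 6D pose tracker updates the object's pose. A minimal sketch is given below, with `track_pose`, `optimize_trajectory`, and `send_waypoint` as assumed placeholders supplied by the caller rather than real OmniManip functions.

```python
def closed_loop_execute(constraints, initial_pose, track_pose,
                        optimize_trajectory, send_waypoint, steps=200):
    """Constraint-based execution kept closed-loop by a 6D pose tracker.

    Each control step: refresh the object pose, re-solve the trajectory
    under the spatial constraints, and execute only the next waypoint.
    """
    pose = initial_pose
    for _ in range(steps):
        pose = track_pose(pose)                              # 6D pose tracker update
        trajectory = optimize_trajectory(constraints, pose)  # constraint optimization
        send_waypoint(trajectory[0])                         # execute, then re-plan
```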
By integrating a VLM-based high-level planner, OmniManip can accomplish long-horizon tasks: the planner decomposes the task into subtasks, and OmniManip executes each one. A minimal sketch of this loop is shown below, followed by two examples of long-horizon tasks.
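The sketch assumes a hypothetical `vlm_decompose` that returns an ordered list of subtask descriptions and an `execute_subtask` that runs one subtask with OmniManip; neither name comes from the actual system.

```python
def run_long_horizon_task(instruction, observation, vlm_decompose, execute_subtask):
    """VLM planner decomposes the task; OmniManip executes each subtask in order."""
    subtasks = vlm_decompose(instruction, observation)
    # e.g. "tidy the desk" -> ["open the drawer", "put the pen into the drawer", "close the drawer"]
    for subtask in subtasks:
        execute_subtask(subtask)   # closed-loop planning + execution for one subtask
```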
OmniManip is a hardware-agnostic approach that can be easily deployed on various robotic embodiments. It leverages the commonsense understanding of Vision-Language Models (VLMs) to achieve open-vocabulary manipulation. We have deployed this framework on AgiBot's dual-arm humanoid robot.
OmniManip can be seamlessly applied to large-scale simulation data generation. Our follow-up work will be released soon; please stay tuned.
We are seeking highly self-motivated interns and offer ample hardware and computing resources. If you're interested, please contact us at hao.dong@pku.edu.cn or pmj@stu.pku.edu.cn.