OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

Mingjie Pan1,2*, Jiyao Zhang1,2*, Tianshu Wu1, Yinghao Zhao3, Wenlong Gao3, Hao Dong1,2
1CFCS, School of Computer Science, Peking University.
2PKU-AgiBot Lab. 3AgiBot.
*The first two authors contributed equally.


Bridging high-level reasoning and precise 3D manipulation, OmniManip uses object-centric representations to translate VLM outputs into actionable 3D constraints. A dual closed-loop system combines VLM-guided planning with 6D pose tracking for execution, achieving generalization across diverse robotic tasks in a zero-training manner.


OmniManip is capable of handling diverse open-vocabulary instructions and objects in a zero-training manner.

Abstract

The development of general robotic systems capable of manipulation in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action Models (VLAs) is a potential solution, but it is hindered by high data-collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between a VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.

Method

Given instructions and RGB-D observations, OmniManip uses a VLM and a vision foundation model (VFM) to identify task-relevant objects and decompose the task into distinct stages. During each stage, OmniManip extracts object-centric canonical interaction primitives as spatial constraints and employs the resampling-rendering-checking (RRC) mechanism for closed-loop planning. For execution, the trajectory is optimized under these constraints and updated via a 6D pose tracker, achieving closed-loop execution.
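A minimal sketch of this pipeline is given below. The interfaces (decompose, sample_primitives, rrc_plan, execute) are hypothetical placeholders standing in for the VLM/VFM components described above, not the released API.

```python
"""Minimal sketch of the OmniManip pipeline; all interfaces are assumed placeholders."""
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Primitive:
    """Object-centric interaction primitive in the object's canonical space."""
    point: Sequence[float]      # 3D interaction point
    direction: Sequence[float]  # 3D interaction direction


@dataclass
class Stage:
    """One stage of the decomposed task, e.g. grasping a lid handle."""
    description: str
    target_object: str


def run_omnimanip(
    instruction: str,
    decompose: Callable[[str], List[Stage]],                        # VLM + VFM task decomposition
    sample_primitives: Callable[[Stage], List[Primitive]],          # canonical-space primitive proposal
    rrc_plan: Callable[[Stage, List[Primitive]], List[Primitive]],  # closed-loop planning (RRC)
    execute: Callable[[Stage, List[Primitive]], None],              # closed-loop, pose-tracked execution
) -> None:
    """Drive the dual closed-loop system stage by stage."""
    for stage in decompose(instruction):
        candidates = sample_primitives(stage)       # propose interaction points/directions
        constraints = rrc_plan(stage, candidates)   # keep only VLM-validated spatial constraints
        execute(stage, constraints)                 # optimize and track the trajectory
```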

Dual Closed-loop System Design

Figure: Closed-loop Planning.
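As a rough illustration of the planning loop, the sketch below resamples candidate primitives, renders each candidate interaction, and asks the VLM to check it. `render_interaction`, `vlm_approves`, and `resample` are assumed callables; the real system may score or refine candidates differently.

```python
from typing import Callable, List, Optional


def rrc_planning(
    stage_description: str,
    candidates: List["Primitive"],
    resample: Callable[[List["Primitive"]], List["Primitive"]],  # draw a fresh batch of candidates
    render_interaction: Callable[["Primitive"], bytes],          # image of the candidate interaction
    vlm_approves: Callable[[str, bytes], bool],                  # VLM check on the rendered image
    max_rounds: int = 5,
) -> Optional["Primitive"]:
    """Return the first primitive the VLM accepts, or None if the budget runs out."""
    for _ in range(max_rounds):
        for primitive in candidates:
            image = render_interaction(primitive)        # render the planned interaction
            if vlm_approves(stage_description, image):
                return primitive                         # VLM-validated spatial constraint
        candidates = resample(candidates)                # resample and try again
    return None
```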



Figure: Closed-loop Execution.
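The execution loop can be pictured as receding-horizon re-planning against the tracked object pose. `track_pose`, `optimize_trajectory`, `robot`, and `stage_done` below are assumptions for illustration, not the actual implementation.

```python
def execute_with_pose_tracking(robot, constraints, track_pose, optimize_trajectory, stage_done):
    """Re-optimize the trajectory whenever the 6D pose tracker updates the object pose."""
    while not stage_done():
        pose = track_pose()                                   # latest 6D object pose
        trajectory = optimize_trajectory(constraints, pose)   # satisfy the spatial constraints
        robot.step(trajectory[0])                             # execute one step, then re-plan
```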

Application to Long-horizon Task

With the integration of a VLM-based high-level planner, OmniManip can accomplish long-horizon tasks. The high-level planner is responsible for task decomposition, while OmniManip executes each subtask. Two examples of long-horizon tasks are shown below, followed by a minimal sketch of the decomposition loop.

User: "Hi robot, cook rice for me."

Subtasks:
  1. “Open the lid”
  2. “Pour the rice”
  3. “Add the water”
  4. “Close the lid”
  5. “Click start button (top left corner)”
  6. “Wait 20 minutes”
  7. “Open the lid”
User: "The table is too messy, organize it."

Subtasks:
  1. “Insert pen into holder”
  2. “Throw paper ball into bin”
  3. “Open drawer”
  4. “Place toy into drawer”
  5. “Close drawer”
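The sketch below illustrates the decomposition loop referenced above, assuming a generic `chat` function that returns the planner's text reply and an `execute_subtask` hook into OmniManip; the prompt and parsing are illustrative only.

```python
def plan_and_execute(instruction: str, chat, execute_subtask) -> None:
    """Ask the high-level planner for subtasks, then hand each one to OmniManip."""
    prompt = (
        "Decompose the following instruction into short, ordered manipulation "
        f"subtasks, one per line:\n{instruction}"
    )
    reply = chat(prompt)
    subtasks = [line.lstrip("-0123456789. ").strip() for line in reply.splitlines() if line.strip()]
    for subtask in subtasks:          # e.g. "Open the lid", "Pour the rice", ...
        execute_subtask(subtask)      # each subtask is executed by OmniManip


# Example: plan_and_execute("Hi robot, cook rice for me.", chat=vlm_chat, execute_subtask=omnimanip_run)
```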

Cross-embodiment Capabilities

OmniManip is a hardware-agnostic approach that can be easily deployed on various robotic embodiments. It leverages the commonsense understanding capabilities of Vision-Language Models (VLMs) to achieve open-vocabulary manipulation. We have deployed this framework on AgiBot's dual-arm humanoid robot.

Simulation Data Collection

OmniManip can be seamlessly applied to large-scale simulation data generation. Our follow-up work will be released soon; please stay tuned.

Join Our Team

We are seeking highly self-motivated interns and offer ample hardware and computing resources. If you're interested, please contact us at hao.dong@pku.edu.cn or pmj@stu.pku.edu.cn.