SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation

  • 🌲Department of Computer Science, Stanford University

  • 🏮CFCS, School of Computer Science, Peking University

  • 🤖Institute for AI, Peking University

  • 🌸PKU-WUHAN Institute for Artificial Intelligence


  • We propose a novel framework for one-shot learning of dexterous manipulation that uses semantic scene understanding distilled into 3D feature fields.
  • We develop an efficient methodology for extracting view-consistent 3D features from 2D image models, incorporating a lightweight feature refinement network and a point pruning mechanism. This enables the direct application of our network to novel scenes to predict consistent features without any modifications or fine-tuning.
  • Extensive real-world experiments with a dexterous hand validate our method, demonstrating robustness and superior generalization capabilities in diverse scenarios.

Method Overview

3D Feature Distillation and Point Pruning

Given a pair of RGBD scans of a 3D scene, we first extract DINO features from the images and back-project them onto the partial point clouds. Although DINO features do exhibit correspondences across views, they are not strictly multiview-invariant, which leads to local feature discrepancies where views overlap. To resolve these discrepancies, we propose a lightweight feature refinement network: a two-layer per-point MLP that can be trained efficiently on a single source scene and, when applied to novel scenes, produces consistent, high-quality features without any modification.

The core idea is to make features within the same neighborhood similar and features from distant neighborhoods distinct. We achieve this by optimizing the network's weights with a contrastive learning objective. During training, the refined features are passed through a projection head to obtain projected features; after training on the source scene, the projection head is discarded and only the feature refinement network is used on novel scenes. In addition, we apply a point-pruning mechanism to improve feature continuity within each local neighborhood.
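The back-projection and refinement steps can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the pinhole intrinsics `fx, fy, cx, cy`, the ReLU activation, the L2 normalisation, and the InfoNCE form of the contrastive loss. The projection head used during training is omitted.

```python
import numpy as np

def back_project(depth, feats, fx, fy, cx, cy):
    """Lift per-pixel features to 3D points in the camera frame.

    depth: (H, W) depth map, 0 where the sensor returned nothing.
    feats: (H, W, C) per-pixel features (e.g. upsampled DINO features).
    Returns points (N, 3) and features (N, C) for the valid pixels.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx  # pinhole camera model (assumed)
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1), feats[valid]

def refine(feats, w1, b1, w2, b2):
    """Two-layer per-point MLP; ReLU and unit-norm output are assumptions."""
    hidden = np.maximum(feats @ w1 + b1, 0.0)
    out = hidden @ w2 + b2
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-8)

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss: row i of `positive` is the match for row i of `anchor`;
    all other rows act as negatives."""
    logits = anchor @ positive.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, positives would be refined features of nearby points seen from the other view, so minimizing the loss pulls features from the same neighborhood together and pushes features from distant neighborhoods apart.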

End-Effector Optimization

With coherent feature fields established on both the source and target scenes, we define an energy function over the end-effector parameters; minimizing it reduces the feature discrepancy between the demonstration and the target manipulation. In addition, to prevent extreme hand poses that could damage the robot, we impose a pose constraint $E_{\text{pose}}(\beta)$ that penalizes out-of-limit hand poses.
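One way such an energy could be written is sketched below. The choices are illustrative assumptions and not the paper's exact formulation: inverse-distance interpolation over the `k` nearest points to query the feature field, an L1 feature difference, a quadratic joint-limit penalty, and the weight `lam`. The query points `hand_queries` are assumed to be already placed by the candidate end-effector pose; forward kinematics is left out.

```python
import numpy as np

def query_field(points, feats, queries, k=4, eps=1e-8):
    """Interpolate features at `queries` from the k nearest cloud points
    using inverse-distance weights (an assumed interpolation scheme)."""
    d = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)  # (Q, N)
    idx = np.argsort(d, axis=1)[:, :k]
    nearest = np.take_along_axis(d, idx, axis=1)
    w = 1.0 / (nearest + eps)
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * feats[idx]).sum(axis=1)

def pose_penalty(beta, lower, upper):
    """Quadratic penalty on joint values outside [lower, upper]."""
    below = np.minimum(beta - lower, 0.0)
    above = np.maximum(beta - upper, 0.0)
    return float(np.sum(below ** 2 + above ** 2))

def energy(demo_feats, tgt_points, tgt_feats, hand_queries, beta, lower, upper, lam=1.0):
    """Feature discrepancy at the hand query points plus the pose penalty."""
    tgt = query_field(tgt_points, tgt_feats, hand_queries)
    e_feat = float(np.sum(np.abs(demo_feats - tgt)))
    return e_feat + lam * pose_penalty(beta, lower, upper)
```

In practice the energy would be minimized over the end-effector parameters with a gradient-based optimizer in an autodiff framework; the NumPy version only shows how the two terms combine.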


We evaluate our model by deploying it to a robot hand and conducting experiments under various setups. Given the superior stability of large image models like DINO on real photos compared to synthetic images, we opt to assess our method directly in real-world settings, bypassing simulations.

This figure visualizes the energy field derived from the differences in features between the source and target scenes. Areas marked in yellow represent regions with smaller feature differences, indicating higher similarity, while purple denotes larger differences, highlighting dissimilarity between features.
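A per-point similarity score of the kind shown in such a plot could be computed as in the sketch below. The Euclidean distance and the min-max normalisation are assumptions; the source does not specify how the differences are scaled for display.

```python
import numpy as np

def similarity_field(target_feats, reference_feat):
    """Map per-point feature distance to a similarity score in [0, 1].

    target_feats: (N, C) features on the target point cloud.
    reference_feat: (C,) feature taken from the source scene.
    A score of 1 marks the smallest feature difference, 0 the largest.
    """
    diff = np.linalg.norm(target_feats - reference_feat, axis=1)
    span = diff.max() - diff.min()
    if span == 0.0:
        return np.ones_like(diff)  # all points are equally similar
    return 1.0 - (diff - diff.min()) / span
```

Passing these scores to a perceptual colormap such as matplotlib's `viridis` gives yellow for high similarity and purple for low similarity, matching the color convention described above.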

Features on the point clouds are visualized with RGB values after PCA to 3 channels. The hand positions obtained through optimization are shown in blue.
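The PCA coloring can be reproduced with a few lines of NumPy. This is a minimal sketch: the function name `pca_rgb` and the per-channel min-max scaling are assumptions, and it expects at least three points and three feature channels.

```python
import numpy as np

def pca_rgb(feats):
    """Project (N, C) features onto their top 3 principal components
    and rescale each component to [0, 1] for use as RGB colors."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.where(hi > lo, hi - lo, 1.0)
```

To compare source and target scenes with consistent colors, the principal directions would need to be fitted once and shared across both point clouds; this sketch fits them per call.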

If you have any questions, please contact Qianxu Wang (