3D Feature Distillation and Point Pruning
Given a pair of RGBD scans of a 3D scene, we first extract DINO features from the images and back-project them onto the partial point clouds. However, while DINO does exhibit correspondences across views, it is not strictly multiview-invariant, which leads to local feature discrepancies in overlapping regions. To address this, we propose a lightweight feature refinement network: a two-layer per-point MLP that can be efficiently trained on a single source scene and, when applied to novel scenes, produces consistent, high-quality features without any modification. The core idea is to make features within the same neighborhood similar while keeping features from distant neighborhoods distinct, which we achieve by optimizing the network's weights under a contrastive learning framework. During training, the refined features produced by the network are passed through a projection head to obtain projected features; after training on the source scene, the projection head is discarded, and only the feature refinement network is applied to novel scenes. Additionally, we implement a point-pruning mechanism to improve feature continuity within each local neighborhood.
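The refinement step above can be sketched as follows. This is a minimal NumPy illustration, not the trained network: the feature dimensions, the ReLU activation, the InfoNCE-style loss, and the temperature `tau` are all illustrative assumptions, and the projection head is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_HID, D_OUT = 384, 256, 128  # assumed feature dimensions

# Two-layer per-point MLP: applied independently to each point's
# back-projected DINO feature.
W1 = rng.normal(0, 0.02, (D_IN, D_HID)); b1 = np.zeros(D_HID)
W2 = rng.normal(0, 0.02, (D_HID, D_OUT)); b2 = np.zeros(D_OUT)

def refine(feats):
    """feats: (N, D_IN) per-point features -> (N, D_OUT) refined features."""
    h = np.maximum(feats @ W1 + b1, 0.0)  # ReLU
    return h @ W2 + b2

def info_nce(anchor, positive, negatives, tau=0.07):
    """Contrastive objective: pull the anchor toward a feature from the same
    neighborhood (positive) and push it away from features sampled from
    distant neighborhoods (negatives)."""
    a = anchor / (np.linalg.norm(anchor) + 1e-8)
    p = positive / (np.linalg.norm(positive) + 1e-8)
    n = negatives / (np.linalg.norm(negatives, axis=1, keepdims=True) + 1e-8)
    logits = np.concatenate([[a @ p], n @ a]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Toy usage: refine random "DINO" features and score one anchor point
# against a neighbor and ten distant points.
feats = rng.normal(size=(1000, D_IN))
refined = refine(feats)
loss = info_nce(refined[0], refined[1], refined[10:20])
```

In the actual method, gradients of such a loss would be used to optimize the MLP weights on the single source scene; here only the forward computation is shown.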
By establishing coherent feature fields on both the source and target scenes, we devise an energy function that minimizes feature discrepancies between the demonstration and the target manipulation with respect to the end-effector parameters. Additionally, to prevent extreme hand poses that could damage the robot, we impose a pose constraint E_pose(β) penalizing out-of-limit hand poses.
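The structure of this objective can be sketched as below. Everything here is a simplified assumption for illustration: the nearest-neighbor `field_lookup`, the translation-only "kinematics", the squared-difference feature term, the quadratic out-of-limit penalty, and the weight `lam` are hypothetical stand-ins for the actual feature fields, hand kinematics, and pose constraint E_pose(β).

```python
import numpy as np

def field_lookup(query_xyz, scene_xyz, scene_feat):
    """Toy feature field: return the feature of the nearest scene point
    for each query location."""
    d = np.linalg.norm(query_xyz[:, None, :] - scene_xyz[None, :, :], axis=-1)
    return scene_feat[d.argmin(axis=1)]

def energy(beta, ee_points, src_feat_at_demo, tgt_xyz, tgt_feat,
           beta_lo, beta_hi, lam=1.0):
    """Feature-discrepancy energy plus a pose penalty on out-of-limit joints.
    beta: end-effector parameters; here assumed (for illustration only) to be
    a rigid translation followed by joint angles."""
    t, joints = beta[:3], beta[3:]
    moved = ee_points + t  # toy kinematics: translation only
    f = field_lookup(moved, tgt_xyz, tgt_feat)
    e_feat = np.sum((f - src_feat_at_demo) ** 2)   # match demo features
    e_pose = np.sum(np.maximum(joints - beta_hi, 0.0) ** 2 +
                    np.maximum(beta_lo - joints, 0.0) ** 2)
    return e_feat + lam * e_pose

# Toy usage: random target scene, eight query points on the end effector.
rng = np.random.default_rng(1)
tgt_xyz = rng.normal(size=(200, 3))
tgt_feat = rng.normal(size=(200, 16))
ee_points = rng.normal(size=(8, 3))
src_feat = rng.normal(size=(8, 16))
beta = np.zeros(3 + 4)  # translation + 4 joint angles, all within limits
e = energy(beta, ee_points, src_feat, tgt_xyz, tgt_feat,
           beta_lo=-np.ones(4), beta_hi=np.ones(4))
```

In practice, such an energy would be minimized over β with a gradient-based optimizer; with all joints inside their limits, only the feature term contributes.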
We evaluate our model by deploying it on a robot hand and conducting experiments under various setups. Given the superior stability of large image models like DINO on real photographs compared to synthetic images, we assess our method directly in real-world settings, bypassing simulation.
This figure visualizes the energy field derived from the differences in features between the source and target scenes. Areas marked in yellow represent regions with smaller feature differences, indicating higher similarity, while purple denotes larger differences, highlighting dissimilarity between features.
Features on the point clouds are visualized with RGB values after PCA to 3 channels. The hand positions obtained through optimization are shown in blue.
If you have any questions, please contact Qianxu Wang (firstname.lastname@example.org).