Conference on Robot Learning (CoRL) 2024
The T2SQNet-based method sequentially grasps the objects without re-recognition, while avoiding collisions with the surrounding objects and the environment.
The T2SQNet-based method also successfully rearranges the surrounding objects and finally retrieves an initially non-graspable target object (e.g., a wine glass).
Recognizing and manipulating transparent tableware from partial-view RGB image observations is made challenging by the difficulty of obtaining reliable depth measurements of transparent objects. In this paper, we present the Transparent Tableware SuperQuadric Network (T2SQNet), a neural network model that leverages a family of newly extended deformable superquadrics to produce low-dimensional, instance-wise, and accurate 3D geometric representations of transparent objects from partial views. As a byproduct and contribution of independent interest, we also present TablewareNet, a publicly available toolset of seven parametrized shapes based on our extended deformable superquadrics, which can be used to generate new datasets of tableware objects of diverse shapes and sizes. Experiments with T2SQNet trained on TablewareNet show that T2SQNet outperforms existing methods in recognizing transparent objects, in some cases by significant margins, and can be effectively used in robotic applications like decluttering and target retrieval.
Coming soon
Superquadrics, parametrized by only a few parameters, can represent a relatively wide range of geometric shapes. We employ two kinds of superquadrics: superellipsoids, which have previously been used for object manipulation, and superparaboloids, which we newly introduce. Both are implicit surfaces defined by the following functions, with size parameters \((a_1, a_2, a_3) \in \mathbb{R}_+^3\) and shape parameters \((e_1, e_2) \in \mathbb{R}_+^2\): for \(\textbf{x} = (x, y, z)\), $$ \begin{equation*} \overbrace{f_{se}(\textbf{x})=\left(\left|\frac{x}{a_1}\right|^{\frac{2}{e_2}} + \left|\frac{y}{a_2}\right|^{\frac{2}{e_2}}\right)^{\frac{e_2}{e_1}} + \left|\frac{z}{a_3}\right|^{\frac{2}{e_1}} = 1 }^{\text{Superellipsoid}}, \:\:\:\:\:\: \overbrace{f_{sp}(\textbf{x})= \left(\left|\frac{x}{a_1}\right|^{\frac{2}{e_2}} + \left|\frac{y}{a_2}\right|^{\frac{2}{e_2}}\right)^{\frac{e_2}{e_1}} - \left(\frac{z}{a_3}\right) = 1}^{\text{Superparaboloid}} \label{eq:sq} \end{equation*} $$ Deformable superquadrics extend superquadrics by incorporating global deformations, including tapering, bending, and shearing transformations. By adjusting the parameters, a wide variety of surfaces can be represented, as shown below.
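As an illustrative sketch (not part of the released code), the two implicit functions above can be evaluated directly with NumPy; the function names are ours, and the parameters follow the equation's notation.

```python
import numpy as np

def superellipsoid(x, y, z, a1, a2, a3, e1, e2):
    """Implicit value f_se(x); the surface is the level set f_se = 1."""
    r = (np.abs(x / a1) ** (2 / e2) + np.abs(y / a2) ** (2 / e2)) ** (e2 / e1)
    return r + np.abs(z / a3) ** (2 / e1)

def superparaboloid(x, y, z, a1, a2, a3, e1, e2):
    """Implicit value f_sp(x); the surface is the level set f_sp = 1."""
    r = (np.abs(x / a1) ** (2 / e2) + np.abs(y / a2) ** (2 / e2)) ** (e2 / e1)
    return r - z / a3

# With e1 = e2 = 1 and a1 = a2 = a3 = 1, the superellipsoid reduces to a
# unit sphere, so the point (1, 0, 0) satisfies f_se = 1.
```

Points with implicit value below 1 lie inside the surface and points above 1 lie outside, which is what makes these representations convenient for the collision checks described later on this page.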
We combine deformable superquadrics to define templates representing seven types of tableware: wine glasses, bottles, beer bottles, bowls, dishes, handleless cups, and mugs. By adjusting parameters, we can generate diverse 3D tableware meshes. Spawning these meshes in a user-defined environment (e.g., table or shelf) within a physics simulator allows us to generate cluttered scenes. Using Blender, a photorealistic renderer, with transparent textures, we obtain RGB images of the scenes from arbitrary camera poses.
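The per-template parameter sampling can be sketched as follows; the parameter names and ranges below are purely illustrative and do not reflect the actual TablewareNet values.

```python
import random

# Illustrative parameter ranges for a "bowl" template; the actual
# TablewareNet parameterization and ranges may differ.
BOWL_RANGES = {
    "a1": (0.04, 0.10),    # x half-extent (m)
    "a2": (0.04, 0.10),    # y half-extent (m)
    "a3": (0.02, 0.06),    # height-related size (m)
    "e1": (0.5, 1.5),      # shape exponent
    "e2": (0.5, 1.5),      # shape exponent
    "taper": (-0.3, 0.3),  # global tapering deformation
}

def sample_template_params(ranges, rng=random):
    """Draw one random parameter set uniformly from the given ranges."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}

params = sample_template_params(BOWL_RANGES)
```

Each sampled parameter set yields one mesh; repeating this per object and spawning the meshes in the simulator produces the cluttered scenes described above.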
Overall, our method consists of four steps: (1) mask prediction in 2D images, (2) prediction of 3D bounding boxes, (3) computation of a smoothed visual hull through voxel carving, and (4) prediction of tableware parameters (i.e., a set of superquadric parameters). We apply these modules sequentially during inference, which can lead to the accumulation of prediction errors. To address this, we develop techniques to train each module accurately and robustly against noise and sim-to-real gaps.
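Step (3) can be illustrated with a minimal voxel-carving sketch: a voxel is kept only if it projects inside the predicted mask in every view. All names here are hypothetical, and the actual pipeline additionally smooths the resulting hull.

```python
import numpy as np

def carve_visual_hull(masks, projections, grid_points):
    """Keep only voxels whose projection falls inside every view's mask.

    masks:        list of HxW boolean arrays (predicted instance masks)
    projections:  list of functions mapping (N, 3) points to (N, 2) pixel (u, v)
    grid_points:  (N, 3) voxel-center coordinates
    Returns an (N,) boolean array marking voxels inside the visual hull.
    """
    keep = np.ones(len(grid_points), dtype=bool)
    for mask, project in zip(masks, projections):
        uv = np.round(project(grid_points)).astype(int)
        h, w = mask.shape
        # Voxels projecting outside the image cannot be confirmed: carve them.
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        keep &= inside
        # Carve voxels that land on background pixels (mask is indexed [row, col]).
        keep[inside] &= mask[uv[inside, 1], uv[inside, 0]]
    return keep
```

Running this over the masks from step (1), restricted to the 3D bounding box from step (2), yields the visual hull that the final parameter-prediction module consumes.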
The figure below shows the ground-truth shapes of the transparent TRansPose objects alongside the implicit surfaces inferred by T2SQNet. Although the nature of superquadric surfaces makes it challenging to capture fine surface details, such as the curvature of a water bottle, T2SQNet recovers the overall shapes to a considerable extent.
T2SQNet offers several practical advantages for downstream object manipulation tasks. For example, it allows for the easy design of an effective 6-DoF grasp sampler based on deformable superquadric representations, enables rapid collision checks through implicit function representations of deformable superquadric surfaces, and facilitates target-driven manipulation with instance-wise object recognition.
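For instance, a collision check against a (non-deformed) superellipsoid reduces to evaluating its implicit function at the gripper's query points: any point with implicit value below 1 lies inside the surface. This sketch is our own illustration, not the paper's implementation.

```python
import numpy as np

def in_collision(points_obj_frame, a, e, margin=0.0):
    """Check whether any query point lies inside a superellipsoid.

    points_obj_frame: (N, 3) points expressed in the object frame
    a: size parameters (a1, a2, a3); e: shape parameters (e1, e2)
    A point is inside when the implicit value falls below 1 - margin.
    """
    x, y, z = points_obj_frame.T
    a1, a2, a3 = a
    e1, e2 = e
    f = (np.abs(x / a1) ** (2 / e2) + np.abs(y / a2) ** (2 / e2)) ** (e2 / e1) \
        + np.abs(z / a3) ** (2 / e1)
    return bool(np.any(f < 1.0 - margin))
```

Because the check is a closed-form, vectorized evaluation rather than a mesh-mesh intersection test, it can be run on many candidate grasps per object at little cost.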
We demonstrate the effectiveness of our model, T2SQNet, on two object manipulation tasks: (i) sequential decluttering, which involves sequential grasping in a cluttered environment, and (ii) target retrieval, which involves object rearrangement planning to retrieve an initially non-graspable target object. The target object is specified by a tableware class name (e.g., wine glass). Real-world manipulation videos can be found below.
@inproceedings{kim2024t2sqnet,
title={T$^2$SQNet: A Recognition Model for Manipulating Partially Observed Transparent Tableware Objects},
author={Kim, Young Hun and Kim, Seungyeon and Lee, Yonghyeon and Park, Frank C},
booktitle={8th Annual Conference on Robot Learning},
year={2024}
}