Vegetable Peeling: A Case Study in Constrained Dexterous Manipulation

Tao Chen1, Eric Cousineau2, Naveen Kuppuswamy2, Pulkit Agrawal1
1Massachusetts Institute of Technology, 2Toyota Research Institute
{taochen, pulkitag}@mit.edu
Abstract

Recent studies have made significant progress in addressing dexterous manipulation problems, particularly in-hand object reorientation. However, few existing works explore how the resulting dexterous manipulation controllers can be used for downstream tasks. In this study, we focus on constrained dexterous manipulation for food peeling. Food peeling imposes various constraints on the reorientation controller, such as the requirement that the hand securely hold the object after reorientation so that it can be peeled. We propose a simple system for learning a reorientation controller that facilitates the subsequent peeling task. Videos are available at: https://taochenshh.github.io/projects/veg-peeling.

Figure 1: We present a dexterous manipulation system that utilizes an Allegro hand mounted on a Franka robot arm to reorient food items for downstream peeling. The other Franka robot arm (the right arm in the figure) uses its gripper to grasp a peeler for peeling. The reorientation controller for the Allegro hand is learned through reinforcement learning, while the peeling is performed via teleoperation. In the figure, we demonstrate the process of reorienting and peeling a melon, a sweet potato, and a squash from top to bottom row.

Keywords: In-hand object reorientation, vegetable peeling

1 Introduction

Having robots perform food preparation tasks has been of great interest in robotics. Imagine the scenario of making mashed potatoes, where a critical step is to peel potatoes. Humans peel potatoes by grasping the potato in one hand and using the second hand to actuate a peeler to remove the potato’s skin. After a part of the potato is peeled, it is rotated while being held in the hand (i.e., in-hand manipulation) and peeled again. The sequence of rotating and peeling continues until all of the potato’s skin is removed. In this work, we present a robotic system that can re-orient different vegetables using an Allegro hand in a way that their skin can be peeled using another manipulator. Our setup is shown in Figure 1 and Figure 2.

In-hand rotation of vegetables is an instance of the dexterous manipulation problem [1], a family of tasks that involves continuously controlling the force on an object while it is moving with respect to the fingertips [2, 3]. The challenges in dexterous manipulation stem from the frequent making and breaking of contact, issues in contact modeling, a high-dimensional control space, perception challenges due to severe occlusions, etc. One body of work makes simplifying assumptions, such as manipulating convex objects [4, 5, 1, 6], small finger motions [7, 8, 9], slow or quasi-static motion, or manipulating a few specific objects [10, 7, 8], in order to leverage trajectory optimization or planning-based methods for in-hand object re-orientation [1, 7, 8, 9, 6, 4, 5, 10]. Another line of work has used reinforcement learning for in-hand re-orientation [11, 12, 13, 14, 15], and recent works have leveraged simulation to train policies capable of dynamically re-orienting a diverse set of new objects in real time and in the real world [11, 12].

Figure 2: Robot setup for reorientation and peeling.

There are several challenges in adapting re-orientation controllers for a downstream task such as peeling vegetables. These challenges stem from the fact that controllers optimized for re-orientation [16, 13, 14, 15, 12] are only optimized to continuously reorient the object and not to satisfy the numerous constraints arising from task-specific requirements. First, peeling vegetables requires the hand to stop re-orienting the object before the peeler can peel the vegetable. Many prior works solve a version of the re-orientation problem where the object is continuously rotated [17, 16, 13] or otherwise perform quasi-static re-orientation [8]. Stopping and re-starting dynamic re-orientation is difficult due to the challenge of dealing with the object’s inertia. Second, the hand needs to hold the object firmly enough to resist forces applied by the peeler. The closest work that attempts to hold the object at a target configuration [12] is only able to hold the object loosely, which is insufficient for resisting these forces. Third, the hand needs to reorient the vegetable in place along a specific axis. Here, the specific axis refers to the rotational axis on the object that is parallel to the peeling direction. Similar to how humans reorient vegetables for peeling, it is desirable for the hand to reorient the object in place so that multiple consecutive cycles of reorientation and peeling can be performed. If the object substantially shifts its position during reorientation, the controller will struggle to reorient and hold the object at future time steps. Fourth, when the vegetable is held stationary, the fingers should not obstruct the top surface of the vegetable, so that the peeler can peel it.

While in-hand object reorientation has been widely studied [11, 12, 16, 18, 13, 17], no prior work satisfies all of the constraints mentioned above. Yet these constraints become critical for downstream dexterous manipulation beyond object re-orientation. We use vegetable peeling as a case study to investigate the challenges and solutions for building a dexterous manipulation system that can operate under constraints. We develop a framework where we leverage reinforcement learning in simulation to train a policy that can perform object re-orientation under constraints. For the peeling task, we explored two approaches: a teleoperation-based method leveraging human guidance and an autonomous vision-based technique. Our contributions are as follows:

  1. We develop a framework for solving dexterous manipulation problems under the aforementioned constraints.

  2. We propose a method that makes an RL policy learn to stop its motion and hold objects firmly in hand, a critical behavior for many downstream dexterous manipulation problems.

  3. We present a step towards a robotic system capable of peeling diverse vegetables with different shapes, masses, and material properties while holding and manipulating the vegetables in hand.

2 Related Work

In-hand Object Reorientation: Dexterous manipulation involves the use of high degrees-of-freedom (DoF) manipulators for object manipulation [19]. Its requirement for high-dimensional real-time control and its frequent making and breaking of contact present grand challenges to roboticists. Recently, there has been growing interest in a particular instance of dexterous manipulation problems: in-hand object reorientation. This problem is of particular interest as it is a necessary step in many tool-use scenarios. For example, to use a screwdriver for tightening a screw, one has to reorient the screwdriver to align it with the screw. Work on in-hand object reorientation can be grouped along several dimensions. For example, from the perspective of sensory information, [20] studies open-loop cube reorientation without using any sensors, [21, 5, 16, 10, 22] use motion capture systems or special tracking markers for object reorientation, [17] uses proprioceptive sensors such as joint encoders, [23, 24, 15, 14] use tactile sensors, and [25, 16, 12, 18] utilize vision sensors. In terms of the dynamics of the system, [7, 8, 9] achieve object reorientation under the assumption of quasi-static motion, where the object moves slowly and its inertial effects can be ignored, while [15, 16, 12, 14, 26] focus on dynamic object reorientation, where the object is manipulated in a fast and dynamic way. To make in-hand object manipulation useful for downstream tool-use tasks, one important aspect of the skill is the ability to hold the object stably and firmly at the end of the policy rollout. While many prior works on dynamic manipulation such as [16, 10, 14, 15, 17] only consider endlessly rotating the object in hand and cannot stop the object stably when it reaches the goal orientation, some works such as [12, 26] try to develop controllers that can reorient objects in hand and also hold them in the goal orientation.
Our work studies dynamic in-hand object manipulation with the capability of stopping objects stably in hand.

Reinforcement Learning for Contact-rich Tasks: Contact-rich tasks are particularly challenging due to the difficulty in modeling the system dynamics, especially when the tasks are performed in the wild, outside of a constrained and controlled setting. Examples of such tasks include quadruped robots hiking in mountains and robot hands reorienting various everyday objects. There have been many works using reinforcement learning to learn controllers for solving contact-rich tasks [27, 16, 13, 28, 29, 30, 31]. In the real world, robots typically only have access to a limited amount of state information of the system due to the lack of sensors or the challenges in setting up the sensors. Using reinforcement learning to learn controllers from scratch with limited sensory information tends to be data-inefficient. One way to speed up policy learning is to provide asymmetric information to the policy and value function, where the value function observes much more privileged information [16, 13, 27, 32]. Another method is to decouple policy learning into two stages: a reinforcement learning stage where agents (teacher) observe privileged fully-observable state information, and an imitation learning stage where the policy with limited sensory observation input (student) learns to imitate the policy with fully-observable state information. This approach has been successfully applied to various contact-rich problems such as locomotion [33, 34, 30, 35, 36] and dexterous manipulation [11, 12, 17]. Our pipeline is built upon the idea of teacher-student policy learning and has made several key improvements, which we will detail below.

3 Method

Peeling requires a reorientation controller that can stop its motion and firmly hold objects after reorientation. The first step in stopping is to decide when re-orientation should be stopped. One possibility is to have a perception system predict the desired rotation angle after which the next round of peeling would be performed. To accomplish the goal, the robot would need to track changes in object pose and compare it with the target rotation angle. However, accurately estimating object pose is challenging, especially when generalization to new objects is necessary [37, 16, 13, 31].

One of our insights is that instead of training a predictor for desired rotation angle and object pose estimation, it can be easier and sufficient to train a binary vision classifier that detects in real-time when the peeled part has been turned over. With such a classifier, the reorientation controller’s job is simply to keep reorienting the object until it receives a stop signal. In this formulation, unlike prior works [11, 12], the reorientation controller is not conditioned on target orientation but rather on a stop signal. Formally, the policy takes as input a binary variable $I^{stop}_{t} \in \{0, 1\}$ representing the stop signal. If $I^{stop}_{t} = 1$, the policy should stop immediately and ensure the fingers stably and firmly hold the object. Otherwise, the policy should continue reorienting the object. Note that in this work, we focus on learning the reorientation controller, leaving integration of a vision classifier to future work.

The next question is how to train such a policy. Using RL to train the policy from scratch can be challenging and requires extensive reward shaping because $I^{stop}_{t} = 1$ is a rare event in an episode, and when $I^{stop}_{t}$ flips from zero to one, the policy needs to stop the motion quickly, which poses a hard-exploration challenge.

Prior works [11, 12] show success in training a goal-conditioned object reorientation controller. Can we leverage a goal-conditioned reorientation controller to train a controller that reacts to a stop signal? It turns out we can formulate this using the teacher-student learning framework [11, 12, 38, 35, 34]. Specifically, we can use RL to train a goal-conditioned controller that reorients an object by random goal angles along its rotational axis. This acts as the teacher. Next, we can use imitation learning (specifically DAGGER [39]) to train a controller conditioned on the stop signal to imitate the teacher. The stop signal can be generated during training by checking if the orientation distance to the goal is below a threshold. Using imitation learning bypasses the hard exploration challenge.

3.1 Teacher Policy Learning: Reorient and Stop

We train the teacher policy to re-orient the object along a pre-defined axis and stop (see Figure 3(a)). The teacher is formulated as a goal-conditioned policy $\bm{a}^{\mathcal{E}}_{t} = \pi^{\mathcal{E}}(\bm{o}^{\mathcal{E}}_{t}, \bm{a}_{t-1}, g)$, where $\mathcal{E}$ represents variables for the teacher policy, $\bm{o}_{t}$ is the observation, $\bm{a}_{t}$ is the action command, and $g$ is the goal representing the amount by which the object needs to be re-oriented. $g$ is randomly and uniformly sampled from $[1.57, 4.0]$ rad during training.

While the teacher policy’s formulation is similar to that in prior works [11, 12], we propose (i) a much simpler reward function, (ii) new success criteria that effectively encourage the policy to stop the object and firmly hold it, and (iii) an interpolation scheme that enables smoother policy actions in the real world.

3.1.1 Reward Function

A common approach to designing the reward function is to create multiple terms that make it easier for the manipulator to discover the desired behavior (i.e., reward shaping). For instance, to facilitate exploration, we can devise a reward term that reduces the distance between the fingertips and the center of mass (CoM) of the object. To discourage excessive translational motion of the object during rotation, we can create a reward term that penalizes the displacement of the CoM. To discourage the object from rotating with undesired motion along other axes, we can add another reward term that reduces the distance between the tip of the thumb and the centerline of the palm. This ensures that the thumb applies force close to the object’s CoM, rather than to one side of the object. Additionally, we need to design a reward term that discourages the fingers from covering the top surface of the object, which affects peeling. Hence, designing multiple reward terms is necessary to regulate the behavior under specific constraints. Balancing these terms requires extensive hyper-parameter tuning.

For the task of in-hand re-orientation, we found that the reward function can be substantially simplified by using a task demonstration. However, unlike prior works that rely on trajectory-level demonstrations [40, 41], our method only requires a one-step demonstration (a keyframe), which is much easier to collect. Specifically, we manually move the real Allegro hand to a good pose where the constraints mentioned above are satisfied (e.g., the fingers do not cover the food item), and the fingers touch the object and are ready to reorient it. We record the joint positions as $\bm{q}^{demo}$. During training in simulation, we encourage the joint positions at any time step to be close to $\bm{q}^{demo}$.

Overall, our reward function is as follows:

$$r_{t} = c_{1} \mathds{1}(\text{Task successful}) + c_{2} \frac{1}{|\Delta\theta_{t}| + \epsilon_{\theta}} + c_{3} \left\| \bm{q}_{t} - \bm{q}^{demo} \right\|_{2}^{2} \tag{1}$$

where $c_{1} = 800$, $c_{2} = 1.5$, and $c_{3} = -0.6$ are coefficients. $\mathds{1}(\text{Task successful})$ is $1$ when the task is successfully completed, and $0$ otherwise. $\Delta\theta_{t}$ is the distance between the object’s current and goal orientation. The first two terms are task rewards for object reorientation. The last term regulates hand behavior.
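As a rough illustration, Equation (1) can be written as a short function. This is a minimal sketch, not the authors' implementation: the coefficients are those reported above, but the value of `EPS_THETA` is not given in the text and is an assumed placeholder, and the function and argument names are hypothetical.

```python
# Coefficients reported in the text for Equation (1).
C1, C2, C3 = 800.0, 1.5, -0.6
# ASSUMPTION: epsilon_theta is not specified in the text; 0.1 is a placeholder.
EPS_THETA = 0.1


def reward(task_successful, delta_theta, q, q_demo):
    """Evaluate Equation (1) for one time step.

    task_successful: bool, the indicator in the first term.
    delta_theta: orientation distance to the goal, in radians.
    q, q_demo: sequences of joint positions (current and keyframe).
    """
    success_term = C1 * (1.0 if task_successful else 0.0)
    orientation_term = C2 / (abs(delta_theta) + EPS_THETA)
    # Squared L2 distance to the keyframe pose; C3 is negative, so this
    # term penalizes deviation from the demonstrated hand pose.
    pose_term = C3 * sum((a - b) ** 2 for a, b in zip(q, q_demo))
    return success_term + orientation_term + pose_term
```

Because $c_{3}$ is negative, the last term acts as a penalty that grows with the squared distance from the keyframe pose, while the second term grows as the object approaches the goal orientation.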

3.1.2 Success Criteria

In goal-conditioned object reorientation, a common way to declare the task successful is to check whether the distance between the object’s current and goal orientation is smaller than a threshold value (orientation criterion $C_{ori} = \Delta\theta < \bar{\theta}$) [16, 13]. Another criterion is that all the fingertips should make contact with the object (contact criterion $C_{contact}$), a pre-requisite for firmly holding the object after reorientation. However, checking only these two criteria is insufficient to ensure the policy learns to stop the motion and hold the object firmly around the goal orientation, as discussed in [12]. The policy can oscillate around the goal state due to observation and control delay and noise.

To further encourage the policy to stop robot motion when the goal is reached and firmly hold the object, we propose adding time constraints to the success criteria: both $C_{ori}$ and $C_{contact}$ should be continuously satisfied for $\bar{T}^{succ}$ time steps. Adding this criterion makes the MDP partially observable since the policy’s observation lacks the knowledge of time. Therefore, to facilitate policy learning, we augment the observation space with a scalar indicator variable $I^{succ} = t^{succ} / \bar{T}^{succ} \in [0, 1]$, where $t^{succ}$ is the number of consecutive steps satisfying $C_{ori}$ and $C_{contact}$.
The observation space becomes $\bm{o}^{\mathcal{E}} := \bm{o}^{\mathcal{E}} \oplus I^{succ}$. In this work, $\bar{\theta} = 0.2$ rad and $\bar{T}^{succ} = 8$.
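The time-constrained success check can be sketched as a small counter. This is an illustrative sketch under assumptions: the class and method names are hypothetical, the contact flag is assumed to come from the simulator, and capping the counter at $\bar{T}^{succ}$ is a choice made here so that $I^{succ}$ stays in $[0, 1]$.

```python
from dataclasses import dataclass
from typing import Tuple

THETA_BAR = 0.2  # rad, orientation threshold from the text
T_SUCC = 8       # required number of consecutive steps from the text


@dataclass
class SuccessTracker:
    """Counts consecutive steps for which C_ori and C_contact both hold."""

    t_succ: int = 0

    def update(self, delta_theta: float, all_fingertips_in_contact: bool) -> Tuple[float, bool]:
        c_ori = abs(delta_theta) < THETA_BAR
        if c_ori and all_fingertips_in_contact:
            # Cap the counter so the indicator never exceeds 1.
            self.t_succ = min(self.t_succ + 1, T_SUCC)
        else:
            # Any violation restarts the count.
            self.t_succ = 0
        i_succ = self.t_succ / T_SUCC
        return i_succ, self.t_succ >= T_SUCC
```

The returned `i_succ` is the scalar that would be appended to the teacher observation, and the boolean is the success flag used in the reward.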

3.1.3 Reset Constraints

As mentioned earlier, a reorientation policy for peeling needs to meet several constraints, such as in-place and fixed-axis reorientation (Figure 3(b)). While one could design individual reward terms to satisfy these constraints, tuning these reward terms to achieve the desired result can be difficult. Instead, it is much simpler to formulate the constraints as reset conditions. In other words, if the constraints are violated, the episode is reset immediately. This incentivizes the policy to explore only in space where the constraints are satisfied. Similar techniques were also used in some prior works [11, 12, 14].
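A reset condition of this kind might look like the following sketch. The text does not report the thresholds, so `MAX_AXIS_DEVIATION` and `MAX_COM_SHIFT` are hypothetical values chosen only for illustration, as are the function names and the choice of measuring in-place motion by the displacement of the object's center of mass.

```python
import math

# ASSUMPTION: both thresholds are illustrative; the text does not report them.
MAX_AXIS_DEVIATION = 0.4  # rad, allowed angle between object axis and desired axis
MAX_COM_SHIFT = 0.03      # m, allowed displacement of the object's center of mass


def angle_between(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_angle)


def should_reset(object_axis, desired_axis, com, initial_com):
    """Return True if a fixed-axis or in-place constraint is violated."""
    axis_violated = angle_between(object_axis, desired_axis) > MAX_AXIS_DEVIATION
    shift_violated = math.dist(com, initial_com) > MAX_COM_SHIFT
    return axis_violated or shift_violated
```

Terminating the episode on violation replaces several hand-tuned penalty terms with two thresholds, which is the simplification the paragraph above argues for.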

Figure 3: (a) shows an example of the rotational axis of a melon. (b) shows an example where the object’s orientation (the blue line) has a large deviation from the desired rotational axis (the green line). We reset the episode when this occurs. (c) shows the policy architecture for the teacher and the student. In this figure, we use $\bm{o}_{t}$ to represent all the policy input at each time step.

3.1.4 Interpolation and Reference for Action Commands

Our neural network controller operates at a relatively low control frequency of $12$ Hz. To track the joint position command, a low-level PD controller runs at $300$ Hz. To ensure smoother joint motion, we interpolate the low-frequency joint position commands. While more complex interpolation schemes such as spline interpolation are possible, we found that simple linear interpolation is sufficient to generate smooth higher-frequency ($60$ Hz) joint position commands. To do this, we linearly interpolate between the current reference joint positions ($\bm{q}_{t}^{ref}$) and the desired joint positions ($\bm{q}_{t+1}^{cmd}$) for the next policy control time step. We then send the interpolated joint position commands to the PD controllers.
Mathematically, $\bm{q}_{t+1}^{cmd,n} = \bm{q}_{t}^{ref} + \frac{n}{N} \bm{a}_{t}$, where $n \in [1, N]$ ($N = 5$) and $\bm{q}_{t+1}^{cmd,n}$ represents the $n^{th}$ interpolated joint position command for the next policy control time step.

When the action space is chosen as the change in joint position, the target joint position for the PD controller is calculated as follows: $\bm{q}_{t+1}^{cmd} = \bm{q}_{t} + \bm{a}_{t}$ [12, 11, 16]. Here, $\bm{q}_{t}$ is the current joint position, and $\bm{a}_{t} = \Delta\bm{q}_{t}$ is the desired change in joint positions, as described earlier. In this case, the reference is chosen to be the current joint positions, i.e., $\bm{q}_{t}^{ref} = \bm{q}_{t}$. However, we found that this scheme results in significant jerky motion when combined with action interpolation. To illustrate this, consider a simplified example of one joint, as shown in Figure 4(a). Since we are using a PD controller only to control the joint position, there is usually an error in tracking the joint position command, as shown by the difference between $q_{t}^{cmd}$ and $q_{t}$.
If we set $q^{ref}_{t} = q_{t}$, when we interpolate between $q^{ref}_{t}$ and $q^{cmd}_{t+1}$, it tends to cause a sudden change in the PD controller’s set point, as shown in Figure 4(a). A sudden change in the set point can cause a sudden change in the joint torque command and hence cause jerky motion. To resolve this issue, we use the previous joint position command as the reference, as shown in Figure 4(b). In other words, $\bm{q}^{ref}_{t} = \bm{q}_{t}^{cmd}$, and $\bm{q}_{t+1}^{cmd} = \bm{q}_{t}^{cmd} + \bm{a}_{t}$.
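The effect of the reference choice can be checked numerically with a one-joint sketch. The interpolation follows the formula $\bm{q}_{t+1}^{cmd,n} = \bm{q}_{t}^{ref} + \frac{n}{N}\bm{a}_{t}$ with $N = 5$; the function names and the numbers in the usage note are hypothetical and chosen only to illustrate the set-point jump.

```python
N_INTERP = 5  # 12 Hz policy actions interpolated to 60 Hz commands


def interpolate(q_ref, action, n_steps=N_INTERP):
    """Return the n_steps interpolated joint commands q_ref + (n / N) * action."""
    return [
        [r + (n / n_steps) * a for r, a in zip(q_ref, action)]
        for n in range(1, n_steps + 1)
    ]


def first_setpoint_jump(q_cmd_prev, q_ref, action):
    """Change in the PD set point between the last command of the previous
    policy step (q_cmd_prev) and the first interpolated command of this step."""
    first_cmd = interpolate(q_ref, action)[0]
    return [f - p for f, p in zip(first_cmd, q_cmd_prev)]
```

Suppose the previous command was 1.0 rad, the joint only reached 0.8 rad because of tracking error, and the new action is 0.1 rad. Calling `first_setpoint_jump([1.0], [0.8], [0.1])` (reference set to the measured position) gives a set-point change of about -0.18 rad, whereas `first_setpoint_jump([1.0], [1.0], [0.1])` (reference set to the previous command) gives about 0.02 rad, which is exactly one interpolation increment.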

Figure 4: Examples of joint position commands after interpolation sent to a low-level PD controller. In the plots, one marker denotes the actual joint position of the motor, another denotes the computed desired joint position, and the markers on the green line show the interpolated joint position commands that are sent to the low-level PD controller. (a) shows the case of $\bm{q}_{t+1}^{cmd} = \bm{q}_{t} + \bm{a}_{t}$, while (b) shows the case of $\bm{q}_{t+1}^{cmd} = \bm{q}_{t}^{cmd} + \bm{a}_{t}$. We can see that (b) generates much smoother joint commands.

3.2 Student Policy Learning: Imitate and Stop

After learning a goal-conditioned teacher policy $\bm{a}^{\mathcal{E}}_{t} = \pi^{\mathcal{E}}(\bm{o}^{\mathcal{E}}_{t}, \bm{a}_{t-1}, g)$, the next question is how to train a real-world deployable student policy that can rotate the object in hand and hold it stably after reorientation. We propose conditioning the student policy on a stop signal $I^{stop}_{t} \in \{0, 1\}$: $\bm{a}^{\mathcal{S}}_{t} = \pi^{\mathcal{S}}(\bm{o}^{\mathcal{S}}_{t}, \bm{a}_{t-1}, I^{stop}_{t})$.
In other words, the student policy should continue reorienting the object when $I^{stop}_{t} = 0$, but stably hold the object when $I^{stop}_{t} = 1$. This design choice provides flexibility in how we control the policy to stop the reorientation. For example, the policy could rotate the object for a pre-specified amount of time (i.e., set $I^{stop}_{t} = 1$ after $t$ seconds). Alternatively, an external perception module could detect when the peeled part has fully turned over and set $I^{stop}_{t} = 1$, triggering the policy to stop the motion and hold the object immediately.

How can we use the learned goal-conditioned teacher policy to train a student policy that is conditioned on the stop signal? We can set the value of $I^{stop}_{t}$ automatically during policy rollout based on the orientation distance $\Delta\theta_{t}$:

$$I_{t}^{stop}=\begin{cases}0&\text{if }\Delta\theta_{t}>\bar{\theta}\\ 1&\text{otherwise}\end{cases}$$
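
A minimal sketch of this labeling rule is given below. It assumes that the orientation distance is the rotation angle between the current and goal orientations, represented as unit quaternions in (w, x, y, z) order; the helper names and the threshold value `THETA_BAR` are hypothetical and chosen only for illustration.

```python
import numpy as np


def orientation_distance(q_current, q_goal):
    """Rotation angle (rad) between two unit quaternions (w, x, y, z)."""
    dot = abs(float(np.dot(q_current, q_goal)))   # abs: q and -q are the same rotation
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))


def stop_signal(q_current, q_goal, theta_bar):
    """I_stop = 0 while the orientation distance exceeds theta_bar, else 1."""
    return 0 if orientation_distance(q_current, q_goal) > theta_bar else 1


def quat_about_z(angle):
    """Unit quaternion for a rotation of `angle` radians about the z-axis."""
    return np.array([np.cos(angle / 2), 0.0, 0.0, np.sin(angle / 2)])


goal = quat_about_z(np.pi / 2)
THETA_BAR = 0.2   # hypothetical threshold in radians
labels = [stop_signal(quat_about_z(a), goal, THETA_BAR)
          for a in np.linspace(0.0, np.pi / 2, 6)]
```

Along this synthetic rotation the label stays at 0 until the object is within the threshold of the goal, at which point it switches to 1; the student sees only this binary signal rather than the goal itself.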

Details about the observation space and the policy architecture are in Section A.3 in the appendix.

3.3 Peeling

In this section, we demonstrate that our reorientation controller can be used for downstream peeling tasks. We use the dexterous robot hand to do the reorientation and then control another Franka Panda robot arm to do the peeling, as shown in Figure 2. To control the robot arm, we experimented with both a teleoperation system and an automatic vision-based peeling system.

3.3.1 Teleoperation-based peeling

We used a leader-follower teleoperation system in which a human operator controls a leader system, and the Franka arm follows the motion of the leader in real-time. A 200 Hz operational space impedance controller [42] runs on the Panda arm, controlling for pose via torque, and an operator interacts with a Haption Virtuose 6D HF TAO device (https://www.haption.com/en/products-en/virtuose-6d-tao-en.html#fa-download-downloads). Bilateral position-position haptic coupling is done between the two devices. The controllers and haptic coupling are implemented using Drake [43].
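
The idea behind position-position coupling can be illustrated with a one-dimensional toy model in which the leader and follower are point masses joined by a virtual spring-damper. This is a sketch under stated assumptions and not the Drake implementation used in this work: the masses, gains, and operator force are hypothetical, and only the 200 Hz rate is taken from the text.

```python
def simulate_coupling(steps=400, dt=1 / 200, k=200.0, d=20.0,
                      m_leader=0.5, m_follower=2.0):
    """1-D leader/follower pair joined by a virtual spring-damper."""
    x_l = v_l = x_f = v_f = 0.0
    for i in range(steps):
        f_operator = 2.0 if i < 100 else 0.0          # hypothetical operator push (N)
        # Position-position coupling: equal and opposite forces on the two
        # devices, proportional to their position and velocity mismatch.
        f_couple = k * (x_l - x_f) + d * (v_l - v_f)
        v_l += dt * (f_operator - f_couple) / m_leader
        v_f += dt * f_couple / m_follower
        x_l += dt * v_l
        x_f += dt * v_f
    return x_l, x_f


leader_pos, follower_pos = simulate_coupling()
```

The follower is pulled toward the leader, while the same force with opposite sign is rendered on the leader, which is what lets the operator feel resistance when the follower is held back.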

Figure 5: (a): the Allegro hand holds a papaya to be peeled. (b): we utilize Grounded SAM to segment the papaya. (c): the 3D point cloud representing the segmented papaya’s exposed surface. (d): we take a slice of this point cloud at the center region along the papaya’s longest axis. (e): the points within this center slice are projected onto the central plane aligned with the axis. (f): we fit a spline curve to the projected points to obtain the desired trajectory for the peeler tip to follow.

3.3.2 Vision-based peeling

While teleoperation provides effective peeling commands for the Franka arm and demonstrates that our reorientation controller can firmly grasp objects after reorientation, automating the peeling process would be ideal. One approach to achieve this is to compute the peeler’s motion trajectory from RGB and depth data. The trajectory can be determined through the following steps (see Figure 5): (1) We utilize Grounded SAM [44] to segment the target vegetable given an image and the vegetable’s name as input. (2) Using the segmentation mask and depth data, we reconstruct the 3D point cloud representing the vegetable’s top surface. (3) We identify the vegetable’s longest axis (the peeling direction) by applying principal component analysis. (4) We slice the point cloud into a 2 cm thick segment along the central plane that crosses the center point and aligns with the longest axis. We then project all the points within the slice onto the plane. (5) We fit a spline curve to the projected points to obtain a smooth trajectory for the peeler tip. Finally, Cartesian-space position control moves the peeler along this trajectory while keeping the peeler orientation fixed.
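
Steps (3) to (5) can be sketched with NumPy and SciPy as follows. This is an illustrative sketch, not the pipeline used in this work: it assumes the segmented point cloud is already available as an (N, 3) array in a frame whose z-axis points up, that the central plane is the vertical plane containing the longest axis, and that the spline smoothing is set from a hypothetical depth-noise level. The function name and its parameters are likewise hypothetical.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def peel_trajectory(points, slice_thickness=0.02, depth_noise=0.001, n_waypoints=50):
    """Return (n_waypoints, 3) peeler-tip waypoints from an (N, 3) surface cloud."""
    center = points.mean(axis=0)
    centered = points - center
    # Step 3: PCA; the eigenvector with the largest eigenvalue is the longest axis.
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    long_axis = eigvecs[:, np.argmax(eigvals)]
    # Assumed central plane: contains the long axis and the vertical direction.
    up = np.array([0.0, 0.0, 1.0])
    normal = np.cross(long_axis, up)
    normal /= np.linalg.norm(normal)
    in_plane_up = np.cross(normal, long_axis)
    # Step 4: keep points within half the slice thickness of the plane, then
    # project them onto the plane (drop the component along the normal).
    near = np.abs(centered @ normal) <= slice_thickness / 2
    s = centered[near] @ long_axis        # coordinate along the peeling direction
    h = centered[near] @ in_plane_up      # height within the plane
    # Step 5: smoothing cubic spline h(s); x values must be increasing.
    order = np.argsort(s)
    spline = UnivariateSpline(s[order], h[order], k=3, s=len(s) * depth_noise ** 2)
    s_grid = np.linspace(s.min(), s.max(), n_waypoints)
    return center + np.outer(s_grid, long_axis) + np.outer(spline(s_grid), in_plane_up)


# Synthetic elongated bump standing in for a vegetable surface (metres).
rng = np.random.default_rng(0)
xy = rng.uniform([-0.10, -0.03], [0.10, 0.03], size=(3000, 2))
z = 0.04 - 4.0 * xy[:, 0] ** 2 - 10.0 * xy[:, 1] ** 2
cloud = np.column_stack([xy, z])
waypoints = peel_trajectory(cloud)
```

The returned waypoints run along the longest axis and follow the surface profile of the central slice; a Cartesian position controller could then track them with a fixed tool orientation.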

4 Results

To quantitatively evaluate the real-world policy transfer performance, we tested the controller on four vegetables (Figure B.2(a)): a pumpkin (mass: 827 g), a melon (623 g), a radish (727 g), and a papaya (848 g).

4.1 Traveling distance for a fixed amount of commanded motion time

The first question we want to answer is whether the learned policy can successfully reorient vegetables in the real world. In peeling, the width of the peeled part depends on the peeler’s width. Thus, it is more informative to measure how much the reorientation controller rotates an object by the traveling distance of a surface point, rather than the absolute rotation angle. Specifically, we mark a reference point $P^{ref}$ on the object surface near the mid-point of its rotational axis. At the start, we ensure $P^{ref}$ is centered and facing upward when held. After reorientation, we record the new point $P^{new}$ that is now centered and facing upward. We then measure the contour length from $P^{new}$ to $P^{ref}$ along the surface (Figure B.2(b)).
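
As a rough illustration of why surface travel is the more useful measure, consider an idealized object with a circular cross-section of radius $r$ (an assumption; real vegetables are irregular). A rotation by an angle $\theta$ moves a surface point by

$$s = r\,\theta.$$

For the same $\theta = 45^{\circ} \approx 0.785$ rad, a hypothetical object with $r = 4$ cm travels $s \approx 3.1$ cm, while one with $r = 8$ cm travels $s \approx 6.3$ cm, so a fixed rotation angle does not correspond to a fixed strip width for the peeler.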

To demonstrate the capability of our controller to reorient real objects, we conducted two rounds of testing. Our controller is trained to stop motion when it receives a stop signal. In the first round, we sent the stop signal 3.5 s after the controller started rotating. In the second round, we sent the stop signal 7 s after the start. We repeated each test 10 times. As shown in Figure 6(a), the controller successfully reoriented all four food items by a sufficient amount for peeling. When commanded to reorient for 3.5 s, 90% of tests reoriented the objects by at least 4 cm. With 7 s, 90% of tests reoriented the objects by at least 7.3 cm. Given more time, the controller reoriented objects by a larger amount.

Figure 6: (a): Violin plots showing the distribution of the traveling distance of a point on the object surface after the controller is commanded to rotate the object for 3.5 s and 7 s, respectively. (b): Violin plot showing the distribution of time taken by the controller to transition from rotating the object in hand to firmly holding the object after receiving the stop signal. The $x$-axis represents the timing of the stop signal sent to the controller after it starts.

4.2 How well does the controller track the commanded motion time?

As discussed in Section 3, if our controller can quickly respond to a stop signal at any time step, it can be combined with a perception system that tracks peeling progress. Hence, we measured how long it takes to stop the hand and object motion after receiving the stop signal. As shown in Figure 6(b), the motion stops on average 0.4 s after the controller receives the stop signal.

4.3 Firm grasp after reorientation

To enable downstream peeling, the reorientation controller must learn to firmly grasp the object after stopping finger motion. We tested this by checking whether the Allegro hand and object could be lifted in the air for 3 s when a human lifted only the object with a single hand. Table B.1 in the appendix shows that across objects and commanded times, the controller firmly grasped objects in 90% of tests. Moreover, our controller can perform consecutive reorientations: it can repeatedly execute the sequence of peeling and reorientation multiple times in succession.

4.4 Real-world Peeling

We evaluated whether the reorientation controller could reorient food items to facilitate peeling (Figure 1). We tested using an Allegro hand and a Leap hand [45]. Testing showed that peeling applied substantial pulling forces on objects. However, in most cases, both hands maintained a firm enough grasp to enable successful peeling. Failures often occur when holding small objects, as some fingertips may fail to establish secure contact with the surface.

5 Discussions

The reorientation controller presented in this study is a blind controller that relies solely on proprioceptive sensory information. While it has demonstrated the ability to reorient heavy objects and securely hold them in place, its performance could potentially be enhanced by incorporating visual and tactile feedback. The current system has a few failure modes. First, the object might slip out of the hand, since the controller does not use any vision information. Second, the controller might fail if the vegetables are small, as the fingers cannot effectively make contact with the object. Third, when using the vision-based peeling approach, the segmentation network (Grounded SAM) might fail to correctly identify and segment the target vegetable in the image; sometimes, the segmentation mask incorrectly includes the robot hand. Some fine-tuning of the pre-trained Grounded SAM model would be necessary to mitigate such issues. Future work could involve learning a peeling policy via behavior cloning on data collected via teleoperation to achieve better autonomy of the system.

Acknowledgments

We thank the anonymous reviewers for their helpful comments in revising the paper. We also extend our appreciation to the members of Toyota Research Institute for their valuable feedback on the formulation of our research idea and their engaging discussions about related research problems.

References

  • Rus [1999] D. Rus. In-hand dexterous manipulation of piecewise-smooth 3-d objects. The International Journal of Robotics Research, 18(4):355–381, 1999.
  • Mason et al. [1989] M. T. Mason, J. K. Salisbury, and J. K. Parker. Robot hands and the mechanics of manipulation. The MIT Press, 1989.
  • Dafle et al. [2014] N. C. Dafle, A. Rodriguez, R. Paolini, B. Tang, S. S. Srinivasa, M. Erdmann, M. T. Mason, I. Lundberg, H. Staab, and T. Fuhlbrigge. Extrinsic dexterity: In-hand manipulation with external forces. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1578–1585. IEEE, 2014.
  • Bai and Liu [2014] Y. Bai and C. K. Liu. Dexterous manipulation using both palm and fingers. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1560–1565. IEEE, 2014.
  • Sundaralingam and Hermans [2019] B. Sundaralingam and T. Hermans. Relaxed-rigidity constraints: kinematic trajectory optimization and collision avoidance for in-grasp manipulation. Autonomous Robots, 43(2):469–483, 2019.
  • Mordatch et al. [2012] I. Mordatch, Z. Popović, and E. Todorov. Contact-invariant optimization for hand manipulation. In Proceedings of the ACM SIGGRAPH/Eurographics symposium on computer animation, pages 137–144, 2012.
  • Pang et al. [2022] T. Pang, H. Suh, L. Yang, and R. Tedrake. Global planning for contact-rich manipulation via local smoothing of quasi-dynamic contact models. arXiv preprint arXiv:2206.10787, 2022.
  • Morgan et al. [2022] A. S. Morgan, K. Hang, B. Wen, K. Bekris, and A. M. Dollar. Complex in-hand manipulation via compliance-enabled finger gaiting and multi-modal planning. IEEE Robotics and Automation Letters, 7(2):4821–4828, 2022.
  • Abondance et al. [2020] S. Abondance, C. B. Teeple, and R. J. Wood. A dexterous soft robotic hand for delicate in-hand manipulation. IEEE Robotics and Automation Letters, 5(4):5502–5509, 2020.
  • Nagabandi et al. [2020] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar. Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pages 1101–1112. PMLR, 2020.
  • Chen et al. [2022a] T. Chen, J. Xu, and P. Agrawal. A system for general in-hand object re-orientation. In Conference on Robot Learning, pages 297–307. PMLR, 2022a.
  • Chen et al. [2022b] T. Chen, M. Tippur, S. Wu, V. Kumar, E. Adelson, and P. Agrawal. Visual dexterity: In-hand dexterous manipulation from depth. arXiv e-prints, pages arXiv–2211, 2022b.
  • Handa et al. [2022] A. Handa, A. Allshire, V. Makoviychuk, A. Petrenko, R. Singh, J. Liu, D. Makoviichuk, K. Van Wyk, A. Zhurkevich, B. Sundaralingam, et al. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. arXiv preprint arXiv:2210.13702, 2022.
  • Yin et al. [2023] Z.-H. Yin, B. Huang, Y. Qin, Q. Chen, and X. Wang. Rotating without seeing: Towards in-hand dexterity through touch. arXiv preprint arXiv:2303.10880, 2023.
  • Khandate et al. [2022] G. Khandate, M. Haas-Heger, and M. Ciocarlie. On the feasibility of learning finger-gaiting in-hand manipulation with intrinsic sensing. In 2022 International Conference on Robotics and Automation (ICRA), pages 2752–2758. IEEE, 2022.
  • Andrychowicz et al. [2020] O. M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  • Qi et al. [2023] H. Qi, A. Kumar, R. Calandra, Y. Ma, and J. Malik. In-hand object rotation via rapid motor adaptation. In Conference on Robot Learning, pages 1722–1732. PMLR, 2023.
  • Allshire et al. [2022] A. Allshire, M. Mittal, V. Lodaya, V. Makoviychuk, D. Makoviichuk, F. Widmaier, M. Wüthrich, S. Bauer, A. Handa, and A. Garg. Transferring dexterous manipulation from gpu simulation to a remote real-world trifinger. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11802–11809. IEEE, 2022.
  • Okamura et al. [2000] A. M. Okamura, N. Smaby, and M. R. Cutkosky. An overview of dexterous manipulation. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 1, pages 255–262. IEEE, 2000.
  • Bhatt et al. [2021] A. Bhatt, A. Sieler, S. Puhlmann, and O. Brock. Surprisingly robust in-hand manipulation: An empirical study. Robotics: Science and Systems (RSS), 2021.
  • Kumar et al. [2016] V. Kumar, A. Gupta, E. Todorov, and S. Levine. Learning dexterous manipulation policies from experience and imitation. arXiv preprint arXiv:1611.05095, 2016.
  • Calli et al. [2018] B. Calli, K. Srinivasan, A. Morgan, and A. M. Dollar. Learning modes of within-hand manipulation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3145–3151. IEEE, 2018.
  • Ishihara et al. [2006] T. Ishihara, A. Namiki, M. Ishikawa, and M. Shimojo. Dynamic pen spinning using a high-speed multifingered hand with high-speed tactile sensor. In 6th IEEE-RAS International Conference on Humanoid Robots, pages 258–263. IEEE, 2006.
  • Van Hoof et al. [2015] H. Van Hoof, T. Hermans, G. Neumann, and J. Peters. Learning robot in-hand manipulation with tactile features. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 121–127. IEEE, 2015.
  • Calli and Dollar [2017] B. Calli and A. M. Dollar. Vision-based model predictive control for within-hand precision manipulation with underactuated grippers. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2839–2845. IEEE, 2017.
  • Furukawa et al. [2006] N. Furukawa, A. Namiki, S. Taku, and M. Ishikawa. Dynamic regrasping using a high-speed multifingered hand and a high-speed vision system. In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pages 181–187. IEEE, 2006.
  • OpenAI et al. [2019] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • Tan et al. [2018] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. Robotics: Science and Systems (RSS), 2018.
  • Da et al. [2021] X. Da, Z. Xie, D. Hoeller, B. Boots, A. Anandkumar, Y. Zhu, B. Babich, and A. Garg. Learning a contact-adaptive controller for robust, efficient legged locomotion. In Conference on Robot Learning, pages 883–894. PMLR, 2021.
  • Li et al. [2021] Z. Li, X. Cheng, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath. Reinforcement learning for robust parameterized locomotion control of bipedal robots. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 2811–2817. IEEE, 2021.
  • Pitz et al. [2023] J. Pitz, L. Röstel, L. Sievers, and B. Bäuml. Dextrous tactile in-hand manipulation using a modular reinforcement learning architecture. arXiv preprint arXiv:2303.04705, 2023.
  • Pinto et al. [2017] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017.
  • Margolis et al. [2022a] G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. Robotics: Science and Systems (RSS), 2022a.
  • Margolis et al. [2022b] G. B. Margolis, T. Chen, K. Paigwar, X. Fu, D. Kim, S. Kim, and P. Agrawal. Learning to jump from pixels. In Conference on Robot Learning, pages 1025–1034. PMLR, 2022b.
  • Lee et al. [2020] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning quadrupedal locomotion over challenging terrain. Science robotics, 5(47):eabc5986, 2020.
  • Kumar et al. [2021] A. Kumar, Z. Fu, D. Pathak, and J. Malik. Rma: Rapid motor adaptation for legged robots. Robotics: Science and Systems (RSS), 2021.
  • Tremblay et al. [2018] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
  • Chen et al. [2020] D. Chen, B. Zhou, V. Koltun, and P. Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75. PMLR, 2020.
  • Ross et al. [2011] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
  • Ho and Ermon [2016] J. Ho and S. Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
  • Peng et al. [2021] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021.
  • Khatib [1987] O. Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, Feb. 1987. ISSN 2374-8710. doi:10.1109/JRA.1987.1087068.
  • Tedrake and the Drake Development Team [2019] R. Tedrake and the Drake Development Team. Drake: Model-based design and verification for robotics, 2019. URL https://drake.mit.edu.
  • Ren et al. [2024] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
  • Shaw et al. [2023] K. Shaw, A. Agarwal, and D. Pathak. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning. arXiv preprint arXiv:2309.06440, 2023.
  • Makoviychuk et al. [2021] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State. Isaac gym: High performance GPU based physics simulation for robot learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
  • Deitke et al. [2023] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Appendix A Training

A.1 Training setup

Robot: We use an Allegro Hand that is controlled via a PD controller at 300 Hz. Our control policy sets joint position commands and runs at a lower frequency of 12 Hz.

Simulation: We trained the policies in Isaac Gym simulation [46]. To set dynamics-related robot parameters in the simulation, we followed a prior approach [12], which uses a gradient-free search method to find the dynamics parameters for each joint (joint friction, damping, maximum joint velocity, and maximum effort) in simulation that generates the motor response that is closest to the real motors.

Object Dataset: We collected 23 object meshes (potatoes, squash, cucumber, etc.) from Objaverse [47]. 10 variants for each mesh were created by varying the size. The mass of the object was randomly sampled in the range of $[80, 960]$ g. Note that we aim to reorient much heavier objects than prior works [16, 12, 11, 13].

Figure A.1: Object dataset used in this work. We collected meshes of carrot, sweet potato, potato, squash, pumpkin, etc.

A.2 Teacher Policy Learning

A.2.1 Observation and Action Space

The teacher observation $\bm{o}^{\mathcal{E}}_{t}$ includes joint positions and velocities, the fingertip poses and velocities, object pose and velocity, the distance between the current object orientation and the goal orientation, and whether any of the fingertips touch the object. $\bm{a}_{t}$ is the delta joint position command. The neural network policy runs at 12 Hz.

A.2.2 Domain randomization and Perturbation during training

During training, we apply domain randomization on the joint stiffness and damping, friction, and restitution. Additionally, we randomly apply a perturbation force on the object’s CoM. We randomly sample the direction of the perturbation force and set its magnitude to $10m_{o}$, where $m_{o}$ is the object mass.
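
The perturbation can be sketched as below. The units are an assumption (mass in kg, with the factor 10 read as an acceleration in m/s² so that the force is in newtons), as are the function name and the choice of drawing the direction uniformly on the unit sphere.

```python
import numpy as np


def sample_perturbation_force(object_mass_kg, rng, scale=10.0):
    """Random-direction force of magnitude scale * m_o, to be applied at the CoM."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)   # normalized Gaussian: uniform on the sphere
    return scale * object_mass_kg * direction


force = sample_perturbation_force(0.5, np.random.default_rng(0))   # hypothetical 0.5 kg object
```

Because the magnitude scales with the object mass, the disturbance is equally demanding, relative to the object's weight, for light and heavy objects.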

A.3 Student Policy Learning

A.3.1 Observation Space

In this work, we only use proprioceptive sensory information (joint positions $\bm{q}_{t}$ and velocities $\dot{\bm{q}}_{t}$) as the observation input ($\bm{o}^{\mathcal{S}}_{t}$). Our findings indicate that relying solely on proprioceptive sensory information results in strong performance. Future research could investigate incorporating visual data to further enhance the system’s capabilities, such as preventing objects from slipping out of the grasp.

A.3.2 Policy Architecture

As the student policy only has access to a limited amount of sensory information (a POMDP setting), it is important to incorporate history information, as has been done in previous works [16, 13, 12]. While [16, 13, 12] utilized RNNs to process history information, Transformers [48] have gained significant attention due to their improved performance and faster training in domains such as natural language processing. Therefore, in this work, we employ a Transformer-based policy architecture: $\bm{a}^{\mathcal{S}}_{t}=\pi^{\mathcal{S}}(\bm{o}^{\mathcal{S}}_{1},\bm{a}_{0},I_{1}^{stop},\ldots,\bm{o}^{\mathcal{S}}_{t},\bm{a}_{t-1},I^{stop}_{t})$. The policy is a decoder-only attention network (Figure 3(c)) with three self-attention layers. The hidden size is 256, the intermediate size is 512, and the number of attention heads is 8. The policy is trained using DAGGER [39].
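
To make the shape of such a policy concrete, the following NumPy sketch implements a causal (decoder-only) attention stack with the sizes listed above. It is a forward pass with random weights, not the trained policy: the token layout (16 joint positions, 16 joint velocities, 16 previous actions, and one stop flag, assuming a 16-DoF hand), the learned positional embedding, the maximum history length `MAX_T`, the pre-norm residual layout, and the ReLU feed-forward block are all assumptions for illustration.

```python
import numpy as np

HIDDEN, INTERMEDIATE, HEADS, LAYERS = 256, 512, 8, 3   # sizes from the text
TOKEN_DIM = 16 + 16 + 16 + 1   # q, q_dot, previous action, stop flag (assumed layout)
MAX_T = 64                     # hypothetical maximum history length
rng = np.random.default_rng(0)


def init(shape):
    return rng.normal(scale=0.02, size=shape)


def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)


def make_layer():
    return {"qkv": init((HIDDEN, 3 * HIDDEN)), "out": init((HIDDEN, HIDDEN)),
            "ff1": init((HIDDEN, INTERMEDIATE)), "ff2": init((INTERMEDIATE, HIDDEN))}


def split_heads(m):
    """(T, HIDDEN) -> (HEADS, T, head_dim)."""
    return m.reshape(m.shape[0], HEADS, HIDDEN // HEADS).transpose(1, 0, 2)


def causal_self_attention(x, layer):
    T = x.shape[0]
    q, k, v = (split_heads(m) for m in np.split(x @ layer["qkv"], 3, axis=-1))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HIDDEN // HEADS)
    # Causal mask: step t may only attend to steps 1..t.
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    mixed = (weights @ v).transpose(1, 0, 2).reshape(T, HIDDEN)
    return mixed @ layer["out"]


params = {"embed": init((TOKEN_DIM, HIDDEN)), "pos": init((MAX_T, HIDDEN)),
          "layers": [make_layer() for _ in range(LAYERS)],
          "head": init((HIDDEN, 16))}


def student_policy(tokens):
    """tokens: (T, TOKEN_DIM) history; returns (T, 16) actions, one per step."""
    x = tokens @ params["embed"] + params["pos"][:tokens.shape[0]]
    for layer in params["layers"]:
        x = x + causal_self_attention(layer_norm(x), layer)
        x = x + np.maximum(layer_norm(x) @ layer["ff1"], 0.0) @ layer["ff2"]
    return layer_norm(x) @ params["head"]
```

Because of the causal mask, the action at step $t$ depends only on tokens up to step $t$, so the same network can be run online by feeding it the growing history.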

Appendix B Testing

B.1 Testing setup

Figure B.2(a) shows the objects used for evaluation. Figure B.2(b) illustrates how we measure the traveling distance of the rotation motion.

Figure B.2: (a) shows the objects for evaluation: melon, radish, pumpkin, papaya. (b) shows the traveling distance. Before reorientation begins, we ensure a reference point (point A) is facing upward. After reorientation, we identify the point (point B) now facing upward. We then measure the distance from point A to point B along the contour.

B.2 Firm grasp after reorientation

Table B.1 shows the success rate of the lifting action after the reorientation. It shows that our reorientation controller can control the fingers to firmly hold the object after the reorientation.

Table B.1: Successful lifting rate (10 tests each)
Commanded motion time   Pumpkin   Melon   Papaya   Radish
3.5 s                   80%       90%     80%      90%
7 s                     100%      90%     100%     90%
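
The overall 90% success rate quoted in Section 4.3 follows directly from the entries of Table B.1, as the short calculation below shows. The variable names are ours; the numbers are those of the table.

```python
# Success rates (%) from Table B.1; each cell corresponds to 10 tests.
success_pct = {
    "3.5 s": {"Pumpkin": 80, "Melon": 90, "Papaya": 80, "Radish": 90},
    "7 s":   {"Pumpkin": 100, "Melon": 90, "Papaya": 100, "Radish": 90},
}
TESTS_PER_CELL = 10

# Convert percentages back to success counts and pool them.
successes = {t: sum(p * TESTS_PER_CELL // 100 for p in row.values())
             for t, row in success_pct.items()}
total_trials = TESTS_PER_CELL * sum(len(row) for row in success_pct.values())
overall = sum(successes.values()) / total_trials   # 72 successes out of 80 trials
```

Per commanded time, this gives 34 of 40 successful lifts for 3.5 s and 38 of 40 for 7 s.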

B.3 Ablation study

Demo term in Reward function

We proposed using a keyframe demonstration to ease reward shaping. To evaluate its effectiveness, we compared learning curves of the teacher policies trained with and without the $c_{3}\left\|\bm{q}_{t}-\bm{q}^{demo}\right\|_{2}^{2}$ reward term. As shown in Figure B.3(a), adding the keyframe substantially improved learning. Additionally, it demonstrates that mimicking the keyframe pose via a single reward term effectively reduces the reward-shaping burden.
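
For concreteness, the ablated term can be computed as in the sketch below. The function name, the example joint vectors, and the coefficient value are hypothetical; in particular, the negative sign of `c3` (so that deviating from the keyframe is penalized) is our assumption.

```python
import numpy as np


def keyframe_term(q_t, q_demo, c3):
    """c3 * ||q_t - q_demo||_2^2 for a single time step."""
    diff = np.asarray(q_t, dtype=float) - np.asarray(q_demo, dtype=float)
    return c3 * float(diff @ diff)


# Two-joint toy example with a hypothetical coefficient c3 = -1.
term = keyframe_term([0.1, 0.2], [0.0, 0.0], c3=-1.0)
```

The term is zero when the joint configuration matches the keyframe pose and grows quadratically in magnitude with the joint-space distance from it.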

Figure B.3: (a) shows learning curves of the teacher policies with or without $c_{3}\left\|\bm{q}_{t}-\bm{q}^{demo}\right\|_{2}^{2}$ in the reward function. (b) shows the differences between student policies trained with different sensory information (joint positions and velocities vs. joint positions only).

Necessity of having joint velocity information in $\pi^{\mathcal{S}}$

The student policy’s sensory input included joint positions and velocities. We investigated whether including joint velocity information in the input is beneficial. Figure B.3(b) shows that adding joint velocities to the input improved performance.

Transformer vs RNN

Unlike prior works [16, 13, 11, 12], our student policy uses a Transformer architecture instead of an RNN architecture. We compared the learning performance of a Transformer-based policy and an RNN-based policy. Figure B.4(a) and Figure B.4(b) show that a Transformer-based policy learns much faster and achieves better performance at convergence than an RNN-based policy.

Figure B.4: Learning curves of student policies with a Transformer or RNN architecture with respect to (a) the number of samples and (b) wall-clock time.