AdaptiGraph: Material-Adaptive Graph-Based
Neural Dynamics for Robotic Manipulation
Abstract
Predictive models are a crucial component of many robotic systems. Yet, constructing accurate predictive models for a variety of deformable objects, especially those with unknown physical properties, remains a significant challenge. This paper introduces AdaptiGraph, a learning-based dynamics modeling approach that enables robots to predict, adapt to, and control a wide array of challenging deformable materials with unknown physical properties. AdaptiGraph leverages the highly flexible graph-based neural dynamics (GBND) framework, which represents material bits as particles and employs a graph neural network (GNN) to predict particle motion. Its key innovation is a unified physical property-conditioned GBND model capable of predicting the motions of diverse materials with varying physical properties without retraining. Upon encountering new materials during online deployment, AdaptiGraph utilizes a physical property optimization process for a few-shot adaptation of the model, enhancing its fit to the observed interaction data. The adapted models can precisely simulate the dynamics and predict the motion of various deformable materials, such as ropes, granular media, rigid boxes, and cloth, while adapting to different physical properties, including stiffness, granular size, and center of pressure. On prediction and manipulation tasks involving a diverse set of real-world deformable objects, our method exhibits superior prediction accuracy and task proficiency over non-material-conditioned and non-adaptive models. The project page is available at https://robopil.github.io/adaptigraph/.
![[Uncaptioned image]](https://cdn.statically.io/img/arxiv.org/x1.png)
I Introduction
Learning predictive models, also known as system identification, is a crucial component of many robotic tasks. Whereas classical methods rely on the explicit parameterization of the system state and struggle with systems that have high degrees of freedom, a significant body of work over the last decade has attempted to learn models directly from visual observations. Prior approaches have learned predictive models based on pixels [11, 18] or latent representations of images [15, 16]. However, such representations often overlook the structure of the environment and do not generalize well across different camera poses, object poses, robots, object sizes, and object shapes. Recently, a series of studies have employed Graph Neural Networks (GNNs) to model environments as 3D particles and their pairwise interactions [23, 39, 40, 47]. A graph representation has proven effective in capturing relational bias and predicting complex motions of deformable objects, but prior works typically focus on a single material and would require extensive training to model an object of a new material or with unknown physical properties. Hence, an important challenge is to enable such graph-based models to adapt to objects and tasks involving diverse materials and varying physical properties, such as manipulating ropes with different stiffness and granular media with different granularity (Fig. 1).
In this work, we present a unified framework for modeling the dynamics of objects with different materials and physical properties. In addition to classifying objects into discrete material types such as rigid objects, ropes, etc., we further consider a range of intra-class physical property variations in each material type. We propose to encode this variation using a continuous variable which we call the physical property variable, and integrate the variable into a Graph-Based Neural Dynamics (GBND) framework (Fig. 1). The physical property variable indicates the important intrinsic properties of each material category, including stiffness for deformable objects and the center of pressure position for rigid objects. By encoding the material type and physical property variables into particles in the graph, the model learns material-specific dynamic functions that predict different physical behaviors for objects with different physical properties. We then employ a test-time adaptation method to reason about the physical properties of novel objects. Specifically, the robot actively interacts with the novel object, observes its response, and estimates its physical properties to optimize the model’s fit to the observed reactions. The estimation is performed in a few-shot manner and can be directly applied to planning and trajectory optimization for downstream manipulation tasks.
In our experiments, we verify this framework on four types of objects: rigid objects, granular objects, rope-like objects, and cloth-like objects. Experiments show that our framework can distinguish and model the dynamics of objects across a broad range of physical properties, for instance, from very soft ropes like yarn and shoelaces to very stiff ropes like cables, and from very fine-grained granular matter like coffee beans to very coarse-grained ones like toy blocks (Fig. 1). The model is trained on diverse data collected with a simulator and tested with online adaptation on real objects. The results demonstrate that (1) our adaptation module provides consistent and interpretable estimates of the objects’ physical property variables, and (2) by conditioning on the estimated physical property variable, the model can carry out more accurate dynamics estimation and more efficient manipulation, especially for objects with extreme or out-of-distribution physical properties.
II Related Work
II-A Model Learning for Robotic Manipulation
Analytical physics-based models facilitate a wide span of robotic manipulation tasks [17, 34, 53]. However, building accurate physics models is often infeasible in the real world due to unobservable physical properties such as mass, friction, and stiffness, occluded surfaces of geometry, sensitivity to parameter estimates, and the high computational expense of simulating deformable objects. To mitigate these issues, recent approaches apply learning-based techniques to obtain dynamics models directly from sensory inputs [7, 33, 51, 18, 11, 2]. Graph-based representations and GNNs have proven effective in modeling the complex behaviors of non-rigid objects due to their ability to capture spatial relational bias [3, 35, 23, 27, 37, 47]. Prior work has explored the application of graph-based dynamics models on a variety of material types, including rigid bodies [23, 19, 29], plasticine [39, 40], clothes [36, 27, 35, 30], fluids [22, 37], and granular matter [47]. However, nearly all of these approaches focus on a single type of material and fail to consider variation in physical properties, thus limiting their generalization and adaptation capabilities. In contrast, our method considers a wider range of materials and variations in physical properties in a single property-conditioned graph-based neural dynamics model, which enables our approach to adaptively estimate the unknown physical properties of unseen objects through interaction.
II-B Physical Property Estimation and Few-Shot Adaptation
Estimating and adapting to the physical properties of unseen objects is an inherent challenge in various robotic applications. Previous works have attempted to infer physical properties by tweaking parameters in physics-based simulations [8, 44, 25, 12, 42, 49] or utilizing extra modalities, e.g., tactile signals [38, 45, 52], but these approaches either demand the object's full state information or require extra sensors. In comparison, adaptively learning explicit physical property variables [1, 22, 50, 6, 24, 30] or low-dimensional latent representations that implicitly encode physical properties [21, 10] in a neural network only requires partial observations and few-shot exploratory interactions as input. A line of work goes further by using large vision-language models to infer physical properties solely from static observation [13, 46], but these estimates are rough and do not involve actual interactions. For estimation and adaptation from interactions, previous efforts were still limited to simulation or focused only on a single type of material, e.g., rigid objects. There is also a dilemma in choosing the form of the representation: explicit variables suffer from domain gaps such as the sim-to-real gap, whereas latent representations are less interpretable. In contrast, our approach incorporates a graph-structured model within an inverse optimization framework, offering interpretability, generalizability to objects beyond the training distribution, and applicability to a broader array of material types, including rigid boxes, ropes, cloths, and granular substances, in real-world scenarios.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x2.png)
III Methods
We first introduce the problem formulation in Sec. III-A. Then, we introduce the perception module and the structure of our physical property-conditioned graph-based dynamics model in Sec. III-B. We discuss the test-time adaptation algorithm for physics property estimation in Sec. III-C. Finally, in Sec. III-D, we introduce how we perform closed-loop control for downstream manipulation tasks.
III-A Problem Formulation
Our aim is to learn a dynamics model, $f$, that is conditioned on the material type $M$ and continuous physical property variable $\phi$, and to develop a test-time few-shot adaptation scheme to infer the physical property variable for unseen objects. Specifically, the dynamics model predicts how the environment will change if the robot applies a given action:

$$z_{t+1} = f(z_t, u_t;\, M, \phi), \tag{1}$$

where $M$ indicates the material type (e.g., rigid, granular, rope, cloth), $\phi$ indicates material-specific physical property variables, and $u_t$, $z_t$, $z_{t+1}$ are the robot action, current environment state at time $t$, and the next state at time $t+1$, respectively. In our approach, we train the dynamics model to minimize the accumulated future prediction loss.

By conditioning on $M$ and $\phi$, the model learns to predict material-dependent physical behaviors, based on which we can perform physical property estimation through the following optimization problem:

$$\phi^{*} = \arg\min_{\phi} \sum_{i=1}^{k} c\big(\hat{z}^{(i)}_{t+1},\, z^{(i)}_{t+1}\big), \qquad \hat{z}^{(i)}_{t+1} = f\big(z^{(i)}_{t}, u^{(i)}_{t};\, M, \phi\big), \tag{2}$$

where $k$ is the iteration number indicating the number of interactions with the unseen object, the superscript $(i)$ indexes those interactions, and $c$ is the cost function measuring the discrepancy between the predicted future state $\hat{z}_{t+1}$ and the observed state $z_{t+1}$.
III-B Material-Conditioned Graph-based Neural Dynamics Model
We propose to instantiate the dynamics model with a graph neural network. Following prior work on graph-based neural dynamics (GBND) [22, 47, 39, 40], we define the environment state as a graph: $z_t = (V_t, E_t)$, where $V_t$ is the vertex set representing object particles, and $E_t$ denotes the edge set representing interacting particle pairs at time step $t$. Given the point cloud input, the object particle positions are determined by the farthest point sampling method [32], which ensures sufficient coverage of the object's geometry. We construct edges between particles based on a spatial distance threshold $d$. We also sample particles on the robot end-effector and construct relations between robot particles and object particles.
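The graph construction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the random stand-in point cloud, the number of particles, and the radius value are all assumptions made for the example.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    points = np.asarray(points, dtype=float)
    selected = [start_index]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[start_index], axis=1)
    for _ in range(num_samples - 1):
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        dist_to_new = np.linalg.norm(points - points[next_index], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)

def build_edges(particles, radius):
    """Return directed (receiver, sender) index pairs for particles closer than radius."""
    diff = particles[:, None, :] - particles[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep pairs within the threshold and drop self-loops.
    mask = (dist < radius) & ~np.eye(len(particles), dtype=bool)
    receivers, senders = np.nonzero(mask)
    return np.stack([receivers, senders], axis=1)

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 1.0, size=(500, 3))  # hypothetical stand-in for a fused point cloud
idx = farthest_point_sampling(cloud, num_samples=20)
particles = cloud[idx]
edges = build_edges(particles, radius=0.5)    # radius plays the role of the distance threshold
```

Each undirected neighbor pair appears twice, once per direction, so that every particle can act as a receiver during message passing.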
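The main improvement of our model over previous works based on GBND is our material- and physical property-conditioning module (Fig. 2a). Suppose a vertex $v_{i,t} \in V_t$ has material $M_i$ and physical property variable $\phi_i$. We incorporate this material information into the vertex features along with the 3D position information $x_i$ over $n$ history timesteps and the vertex attribute $c^{v}_{i,t}$, which indicates whether the particle belongs to an object or the robot end-effector. Formally, $v_{i,t} = (x_{i,t-n+1:t},\, c^{v}_{i,t},\, M_i,\, \phi_i)$. The history positions implicitly encode the velocity information. The value of $n$ is chosen empirically. The relation features between a pair of particles are denoted as $e_{k,t} = (r_k, s_k, c^{e}_{k,t})$, where $r_k$ and $s_k$ are the receiver particle index and the sender particle index of the $k$-th edge, respectively. The edge attribute $c^{e}_{k,t}$ contains information such as whether the sender and receiver belong to the same or different objects.

The constructed vertex and edge features are first fed into the vertex encoder $f_V^{\mathrm{enc}}$ and the edge encoder $f_E^{\mathrm{enc}}$, respectively, to get the latent vertex and edge embeddings $h^{0}_{v_i,t}$ and $h^{0}_{e_k,t}$:

$$h^{0}_{v_i,t} = f_V^{\mathrm{enc}}(v_{i,t}), \qquad h^{0}_{e_k,t} = f_E^{\mathrm{enc}}(e_{k,t}). \tag{3}$$

Then, an edge propagation network $f_E^{\mathrm{prop}}$ and a vertex propagation network $f_V^{\mathrm{prop}}$ iteratively update the vertex and edge embeddings to perform multi-step message passing. Specifically, for $l = 0, 1, \dots, L-1$, a single message passing step is as follows:

$$h^{l+1}_{e_k,t} = f_E^{\mathrm{prop}}\big(h^{l}_{e_k,t},\, h^{l}_{v_{r_k},t},\, h^{l}_{v_{s_k},t}\big), \tag{4}$$

$$h^{l+1}_{v_i,t} = f_V^{\mathrm{prop}}\Big(h^{l}_{v_i,t},\, \sum_{k \in \mathcal{N}(i,t)} h^{l+1}_{e_k,t}\Big), \tag{5}$$

where $\mathcal{N}(i,t)$ indicates the index set of edges in which vertex $i$ is the receiver at time $t$, and $L$ is the total number of message passing steps. Finally, one vertex decoder $f_V^{\mathrm{dec}}$ predicts the system's state at the next time step: $\hat{V}_{t+1} = \{ f_V^{\mathrm{dec}}(h^{L}_{v_i,t}) \}_{i}$.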
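As a rough illustration of the message passing described above, the following NumPy sketch alternates an edge update and a vertex update, summing incoming edge messages at each receiver. It is a sketch under stated assumptions: the one-layer random networks stand in for the learned propagation networks, the choice of inputs to each update is one common variant rather than a confirmed detail of the model, and all names and dimensions are hypothetical.

```python
import numpy as np

def mlp(rng, in_dim, out_dim):
    """One-layer stand-in for a learned network (random weights, ReLU)."""
    weight = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: np.maximum(x @ weight, 0.0)

def message_passing(h_v, h_e, receivers, senders, f_edge, f_vertex, num_steps):
    """Run num_steps rounds of an edge update followed by a vertex update."""
    for _ in range(num_steps):
        # Edge update: combine each edge embedding with its two endpoint embeddings.
        h_e = f_edge(np.concatenate([h_e, h_v[receivers], h_v[senders]], axis=1))
        # Aggregate messages by summing over all edges that share a receiver.
        agg = np.zeros_like(h_v)
        np.add.at(agg, receivers, h_e)
        # Vertex update: combine each vertex embedding with its aggregated messages.
        h_v = f_vertex(np.concatenate([h_v, agg], axis=1))
    return h_v, h_e

rng = np.random.default_rng(0)
dim = 8
receivers = np.array([0, 1, 1, 2])   # a 3-particle chain with edges in both directions
senders = np.array([1, 0, 2, 1])
h_v0 = rng.normal(size=(3, dim))     # stand-in for encoded vertex features
h_e0 = rng.normal(size=(4, dim))     # stand-in for encoded edge features
f_edge = mlp(rng, 3 * dim, dim)
f_vertex = mlp(rng, 2 * dim, dim)
h_v, h_e = message_passing(h_v0, h_e0, receivers, senders, f_edge, f_vertex, num_steps=3)
```

Because aggregation is a sum over incoming edges, the resulting vertex embeddings do not depend on the order in which edges are listed.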
Translation equivariance. Translation equivariance is a desired property for dynamics models. Formally, for any global 3D translation added to the particle locations, the predictions should also be translated identically. We enforce translation equivariance by passing the position difference of receiver and sender particles to the edge encoder, instead of passing absolute particle positions to the vertex encoder.
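The property can be checked numerically on a toy model. The sketch below is not the paper's network: `toy_predict` is a hypothetical stand-in that, like the design described above, only ever sees relative offsets between sender and receiver particles and adds the predicted motion to the current positions.

```python
import numpy as np

def toy_predict(positions, receivers, senders, weight):
    """Predict next positions from relative offsets only (toy stand-in model)."""
    rel = positions[senders] - positions[receivers]   # unchanged by a global translation
    messages = np.tanh(rel @ weight)                  # per-edge message
    delta = np.zeros_like(positions)
    np.add.at(delta, receivers, messages)             # sum messages at each receiver
    return positions + delta                          # predicted motion added to current positions

rng = np.random.default_rng(0)
positions = rng.normal(size=(5, 3))
receivers = np.array([0, 1, 2, 3, 4, 0])
senders = np.array([1, 2, 3, 4, 0, 2])
weight = rng.normal(size=(3, 3))
shift = np.array([0.3, -1.2, 2.0])                    # arbitrary global translation

pred = toy_predict(positions, receivers, senders, weight)
pred_shifted = toy_predict(positions + shift, receivers, senders, weight)
equivariant = np.allclose(pred_shifted, pred + shift)
```

Shifting the input by a constant vector shifts the prediction by the same vector, which is the behavior the paragraph asks of the dynamics model.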
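Training. To regulate the cumulative dynamics prediction error, we supervise the model's prediction results over $T$ prediction steps and perform backpropagation through time to optimize model parameters. In practice, we use the same value of $T$ for all tasks to balance efficiency and performance. We use the MSE loss on predicted object particle positions as the loss function:

$$\mathcal{L} = \frac{1}{T} \sum_{\tau=1}^{T} \frac{1}{N} \sum_{i=1}^{N} \big\| \hat{x}_{i,t+\tau} - x_{i,t+\tau} \big\|_2^2, \tag{6}$$

where $N$ is the number of object particles, and $\hat{x}_{i,t+\tau}$ and $x_{i,t+\tau}$ are the predicted and ground-truth positions of particle $i$ after $\tau$ prediction steps.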
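To make the multi-step supervision concrete, the sketch below computes a rollout loss by feeding the model's own predictions back in at every step. It shows only the forward computation (no backpropagation), and the linear toy dynamics, the gain value, and the function names are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def rollout_mse(step_fn, x0, actions, targets):
    """Roll the model forward on its own predictions and average the squared position error."""
    x = x0
    errors = []
    for action, target in zip(actions, targets):
        x = step_fn(x, action)  # the prediction becomes the next input
        errors.append(np.mean(np.sum((x - target) ** 2, axis=-1)))
    return float(np.mean(errors))

# Toy dynamics: every particle moves by the action; the biased model under-predicts the motion.
true_step = lambda x, u: x + u
model_step = lambda x, u: x + 0.9 * u

x0 = np.zeros((4, 3))
actions = [np.array([1.0, 0.0, 0.0])] * 3
targets = []
x = x0
for u in actions:
    x = true_step(x, u)
    targets.append(x)

loss_perfect = rollout_mse(true_step, x0, actions, targets)
loss_biased = rollout_mse(model_step, x0, actions, targets)
```

In this toy case the per-step squared errors of the biased model are 0.01, 0.04, and 0.09, which illustrates why a small one-step error compounds over a rollout and why supervising several steps is useful.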
To obtain training data at scale, we generate diverse object trajectories by randomizing robot actions and object configurations using physics-based simulators. Most importantly, we randomize the material configuration for each instance in the dataset. To achieve this, we identify the physics property and randomize the property over a wide range of feasible values.
III-C Few-Shot Physical Property Adaptation
After learning the material-conditioned GBND model, we deploy the model to objects with unknown physical properties in the real world. Inspired by humans' ability to reason about objects' physical properties by interacting with them, we design an inverse optimization pipeline through few-shot curiosity-driven interaction.
Specifically, to estimate the physical property variable, the robot actively interacts with the object. In each iteration, it selects the action that maximizes the predicted displacement of the object. Intuitively, the action that maximizes displacement is likely to reveal more information about the object’s physical properties than random actions would.
After each interaction, the robot updates its estimate of the object by minimizing the dynamics prediction error from previous interactions. As the robot undergoes several interactions, the estimation of physical property tends to stabilize, reaching the final optimized value.
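In our experiments, we adopt a fixed number of iterations for adaptation. We measure the displacement of the object by computing the Chamfer Distance (CD) between the current state $z_t$ and the predicted state $\hat{z}_{t+1}$:

$$\mathrm{CD}(z_t, \hat{z}_{t+1}) = \frac{1}{|V_t|} \sum_{x \in V_t} \min_{y \in \hat{V}_{t+1}} \| x - y \|_2 + \frac{1}{|\hat{V}_{t+1}|} \sum_{y \in \hat{V}_{t+1}} \min_{x \in V_t} \| x - y \|_2, \tag{7}$$

where $V_t$ and $\hat{V}_{t+1}$ denote the vertex sets at state $z_t$ and $\hat{z}_{t+1}$, respectively. The actions for curiosity-driven interactions are optimized using the Model-Predictive Path Integral (MPPI) [48] trajectory optimization algorithm to maximize the above Chamfer Distance.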
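A Chamfer Distance between two particle sets can be computed with a dense pairwise distance matrix, as in the sketch below. The normalization used here (unsquared distances, averaged in each direction and then summed) is an assumption; other conventions differ only by squaring or scaling.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    # Mean nearest-neighbor distance from a to b, plus the same from b to a.
    return float(dist.min(axis=1).mean() + dist.min(axis=0).mean())

current = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
predicted = current + np.array([0.1, 0.0, 0.0])   # every particle shifted by 0.1 along x
cd = chamfer_distance(current, predicted)          # 0.1 in each direction, so about 0.2 in total
```

Because the measure matches each point to its nearest neighbor in the other set, it does not require particle correspondences between the two states.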
For inverse optimization at the $k$-th interaction step, we adopt gradient-free optimizers including Bayesian Optimization (BO) for single-dimensional physical property variables and CMA-ES for multi-dimensional variables. We instantiate the optimization problem described in Eq. 2 by specifying the cost function as the Chamfer Distance between the dynamics prediction and the true outcome after each interaction:

$$c(\hat{z}_{t+1}, z_{t+1}) = \mathrm{CD}(\hat{z}_{t+1}, z_{t+1}), \tag{8}$$

where $\hat{z}_{t+1} = f(z_t, u_t;\, M, \phi)$ is the prediction of the dynamics model $f$ for action $u_t$ applied at state $z_t$, given material type $M$ and physical property variable $\phi$.

For some materials whose physical properties span a large range (e.g., stiffness for ropes), the test object can potentially fall outside the training distribution of the model. Our material-conditioned model allows for generalization beyond the training domain by directly setting the domain of $\phi$ at the adaptation stage to be an extension of the maximal range of $\phi$ in the training data. Specifically, the lower and upper bounds for $\phi$ during adaptation are obtained by extending the interval $[\phi_{\min}, \phi_{\max}]$, where $\phi_{\min}$ and $\phi_{\max}$ are the minimum and maximum values of $\phi$ in the training dataset.
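The inverse optimization can be illustrated on a toy problem. The sketch below is not the paper's pipeline: it swaps in a plain grid search for BO or CMA-ES, uses a hypothetical one-parameter dynamics model (`toy_dynamics`), and extends the search range beyond an assumed training range of [0, 1] by an arbitrary margin of 0.2.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance (unsquared, averaged per direction; an assumed convention)."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(dist.min(axis=1).mean() + dist.min(axis=0).mean())

def toy_dynamics(x, u, phi):
    """Hypothetical property-conditioned model: phi scales how far particles follow the push."""
    return x + phi * u

def estimate_property(interactions, candidates):
    """Pick the candidate phi with the lowest summed prediction cost over all interactions."""
    costs = np.array([
        sum(chamfer_distance(toy_dynamics(x, u, phi), x_next) for x, u, x_next in interactions)
        for phi in candidates
    ])
    return float(candidates[int(np.argmin(costs))]), costs

rng = np.random.default_rng(0)
true_phi = 1.15                        # lies outside the assumed training range [0, 1]
interactions = []
for _ in range(5):                     # a handful of observed (state, action, next state) triples
    x = rng.uniform(size=(30, 3))
    u = rng.normal(scale=0.05, size=3)
    interactions.append((x, u, toy_dynamics(x, u, true_phi)))

train_min, train_max, margin = 0.0, 1.0, 0.2   # margin is a hypothetical extension
candidates = np.linspace(train_min - margin, train_max + margin, 141)
phi_hat, costs = estimate_property(interactions, candidates)
```

Restricting the candidates to the training range would cap the estimate at 1.0 here, which is the motivation for searching over an extended domain.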
III-D Closed-Loop Model-Based Planning
Using the estimated physical property variable $\phi^{*}$, the learned model adapts to new objects, yielding lower dynamics prediction errors on the few-shot, curiosity-driven online interaction dataset. Thus, we can also use the adapted model to perform closed-loop planning for material-specific manipulation tasks within a Model Predictive Control (MPC) framework [5]. The improved dynamics prediction accuracy after few-shot adaptation helps the robot manipulate the object more efficiently and effectively towards goal configurations.
Concretely, the model-based control pipeline is defined as follows: given the state space $\mathcal{Z}$ and the action space $\mathcal{U}$, the cost function is a mapping from $\mathcal{Z} \times \mathcal{U}$ to $\mathbb{R}$. For each starting state $z_0$, we iteratively sample actions in the action space, apply the learned dynamics model to predict the outcome, and apply the MPPI trajectory optimization algorithm to find the action sequence that minimizes the cost function. In our experiments, the cost function includes a task-related term that measures the distance from the current state to the desired target, along with other penalty terms for infeasible actions and collision avoidance. Please refer to Sec. B.2 of the supplementary material for details.
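A minimal MPPI-style planner can be sketched as follows. This is a simplified illustration rather than the planner used in the experiments: the point-mass dynamics, the distance-to-goal cost, and every hyperparameter (sample count, noise scale, temperature, iterations) are assumptions, and the penalty terms mentioned above are omitted.

```python
import numpy as np

def mppi_plan(dynamics, cost_fn, state, horizon, action_dim, rng,
              num_samples=256, noise_std=0.1, temperature=0.05, num_iters=5):
    """Refine an action sequence by sampling perturbations and averaging them with softmin weights."""
    mean_actions = np.zeros((horizon, action_dim))
    for _ in range(num_iters):
        noise = rng.normal(scale=noise_std, size=(num_samples, horizon, action_dim))
        candidates = mean_actions[None] + noise
        costs = np.empty(num_samples)
        for i in range(num_samples):
            s, total = state, 0.0
            for u in candidates[i]:
                s = dynamics(s, u)        # roll the dynamics model forward
                total += cost_fn(s, u)
            costs[i] = total
        # Lower-cost rollouts receive exponentially larger weights.
        weights = np.exp(-(costs - costs.min()) / temperature)
        weights /= weights.sum()
        mean_actions = np.einsum("n,nha->ha", weights, candidates)
    return mean_actions

def rollout_cost(dynamics, cost_fn, state, actions):
    """Total cost of executing an action sequence from a given state."""
    total = 0.0
    for u in actions:
        state = dynamics(state, u)
        total += cost_fn(state, u)
    return total, state

goal = np.array([0.5, -0.3])
dynamics = lambda s, u: s + u                              # toy point-mass model
cost_fn = lambda s, u: float(np.linalg.norm(s - goal))     # distance to the target

rng = np.random.default_rng(0)
start = np.zeros(2)
plan = mppi_plan(dynamics, cost_fn, start, horizon=5, action_dim=2, rng=rng)
planned_cost, final_state = rollout_cost(dynamics, cost_fn, start, plan)
idle_cost, _ = rollout_cost(dynamics, cost_fn, start, np.zeros((5, 2)))
```

In closed-loop use, only the first action of the plan would be executed before re-planning from the newly observed state. The same routine can serve the curiosity-driven interaction step by using the negative predicted displacement as the cost.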
IV Experiments
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x3.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x4.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x5.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x6.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x7.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x8.png)
In this section, we evaluate the proposed framework across a diverse range of object manipulation tasks. Our experiments are designed to answer the following questions: (1) Is the GBND model capable of accurately predicting the movements of objects with varied physical properties? (2) Can the test-time adaptation module effectively estimate real-world physical properties of objects through few-shot interactions? (3) To what extent does the integration of the adaptation module enhance the model’s ability for model-based planning in the downstream manipulation tasks?
IV-A Evaluation Materials and Corresponding Tasks
To demonstrate the modeling power of our framework for diverse materials, we implement one task for each of the 4 material categories: rigid box pushing, rope straightening, granular pile gathering, and cloth relocating.
Rigid Box Pushing. The task is to use a point contact to push a box to a target position and orientation, which demands precise control over the translation and rotation motions in the presence of uncertainty of the center of pressure [54]. The physical property variable is defined to be the normalized 2D position of the center of mass from the top view. As illustrated in Fig. 2a, it is a 2-dimensional variable with range . We use the mean squared error as the cost function.
Rope Straightening. The task is to rearrange the rope to a target configuration on the tabletop. We consider the stiffness of the rope as the physical property variable and define it as a normalized continuous variable $\phi \in [0, 1]$, where $0$ and $1$ correspond to the minimal and maximal stiffness in the simulator, respectively.
Granular Pile Gathering. The target is defined as a region on the tabletop, and the task is to gather the granular piles in an arbitrary initial distribution into the target region. We consider the granular size/granularity as the physical property variable and use a normalized variable to represent the size of a single grain in the pile.
Cloth Relocating. The task is to use grippers to grasp the cloth and drag it on the table to place the cloth in the target configuration. We use a continuous variable to represent the stiffness of the cloth, which affects whether a piece of cloth will wrinkle or fold during a drag.
IV-B Environment and Evaluation Setup
Simulation. Simulations of deformable and granular materials are conducted using NVIDIA FleX [22, 31], a position-based simulation framework designed to model interactions between objects of varying materials across multiple tasks, including pushing granular objects [47], straightening ropes [26], and unfolding clothing [14]. Additionally, Pymunk [4] is utilized for simulating boxes that vary in shape and center of pressure.
For each material type, a dataset consisting of 1000 episodes is generated, with each episode featuring 5 random robot-object interactions. Within each episode, an object is assigned random physical properties (such as stiffness and granule size) that fall within a pre-defined range. To simulate interactions between the robot and the object, five random trajectories, involving either pushing or pulling actions, are created for every object. Throughout these interactions, data on the positions of particles and the robot’s end-effector are gathered, which are then utilized for model training. More details on the simulation environment and data collection can be found in Sec. B.1 of the supplementary material.
Real World. Fig. 3 presents the general setup in both the simulator and the real world. In the real-world experiments, we use a UFACTORY xArm 6 robot with 6 DoF and xArm's parallel gripper. For the rigid box pushing and rope straightening tasks, we replace the original gripper with a cylindrical stick, while we use a flat pusher for the granular pile manipulation task. These tools are 3D-printed and identical to those in the simulation setup to mitigate the sim-to-real gap. We fix four calibrated RealSense D455 RGBD cameras at four locations surrounding the workspace to capture RGBD images at 15 Hz and 1280×720 resolution. The robot manipulates objects within a 70 cm × 45 cm planar workspace.
Implementation Details. In all experiments, we assume the material type to be known, and the particles of the same object share the same material type $M$ and physical property variable $\phi$. To extract object point clouds from raw RGB-D inputs, we deploy the GroundingDINO [28] and Segment Anything [20] models to detect and segment the table surface and objects. For the target object, we fuse the segmented partial point clouds from the 4 views and downsample the result with farthest point sampling to a fixed pointwise distance threshold. For the cylinder stick and gripper, we use one particle to represent the end-effector position, and for the flat granular pusher, we use 5 points to represent the end-effector position and geometry.
Baselines. To demonstrate the importance of parameter conditioning, we consider two baseline methods in our main experiments: (1) GNN uses a graph neural network with the same architecture as our model, trained separately for each material category but not conditioned on the physical property variable $\phi$. (2) Ours w/o Adaptation is an ablated version of our material-adaptive model that uses only the mean physical property variable as input in deployment. In Sec. A of the appendix, we include additional comparisons by finetuning the GNN baseline and adapting a physics-based simulator to demonstrate the effectiveness of our proposed conditioning method and the benefits of learned dynamics models.
IV-C Forward Dynamics Prediction
Fig. 4 shows the qualitative comparisons between our material-conditioned GBND model and the baseline model Ours w/o Adaptation. The comparisons reveal that, with estimated physical property, the model’s prediction matches the interaction outcome more accurately. For instance, in the rope scenario, the baseline model’s prediction fails to capture both the below-average stiffness of the yarn object and the above-average stiffness of a polymer rope. In contrast, our method successfully accounts for variations in their motions, exhibiting more precise forecasts of unusual behaviors. Likewise, our model surpasses the baseline in scenarios involving materials with extreme physical properties, such as rigid boxes that differ in center of pressure, granular materials of various sizes, and clothes of differing stiffness.
Fig. 5 further validates our model’s effectiveness on a simulated test set of 200 objects each with distinct physical properties. Our approach surpasses both baselines, GNN and Ours w/o Adaptation, demonstrating superior accuracy and stability for all the material types addressed in our study. Particularly, for rigid boxes, our model significantly outperforms the baselines with a near-perfect prediction accuracy.
IV-D Physical Property Estimation
For physical property estimation, we randomly initialize the object location on the tabletop and perform 10 interactions.
Rigid Box. We use two boxes with different sizes: the sugar box (175 mm × 89 mm) and the cracker box (210 mm × 158 mm). We initialize the center of pressure (CoP) at 4 different locations for each box by putting weights at different locations inside the box. A visualization of all CoPs' normalized positions and our predicted CoP positions is shown in Fig. 6a. From the figure, we can observe that for all 8 data points, the predicted CoP positions are close to the ground truth CoP positions. Moreover, the error heatmap shows that the low-error region for the CoP location forms a single global minimum, and the predicted CoP positions converge to the neighborhood of that minimum after around 5 interaction steps.
Rope. We test our model on 9 different types of ropes. As shown in Fig. 6b, the model can extrapolate beyond the training data range [0.0, 1.0] and estimate out-of-range values for ropes with extreme stiffness/softness. The mean CD on the interaction observations gives clear and unique minimum points, and the stiffness ranking of the different types of ropes is consistent with the actual stiffness from human perception.
Granular. As shown in Fig. 6c, we test our model on 9 different types of granular objects by selecting representative objects of each granularity level, ranging from approximately 1cm to 3cm. Results show that the predicted granularity ranking is consistent with the actual granular size. The model correctly predicts granola as the smallest grains and the toy blocks as the largest grains.
Cloth. As shown in Fig. 6d, we test our model on 5 different cloth instances, each with a different fabric material. The model correctly identifies the modal as the softest cloth (lowest stiffness). As another soft material, the flannel cloth is also estimated to be softer than cotton and microfiber cloths. While the training dataset does not contain any plastic-like materials, the model generalizes to a piece of plastic sheet and correctly predicts that it is very stiff.
Furthermore, in Sec. C.1 of the supplementary material, we present additional experiments that consider multiple parameters, namely the stiffness and friction of ropes. The results demonstrate that our method can be extended to recover more than one type of physical property simultaneously and yield better accuracy in dynamics prediction.
IV-E Model-Based Planning
We further demonstrate that our material-conditioned GBND model and physical property adaptation can be integrated into an MPC framework to achieve a series of robotic manipulation tasks. Our experiments cover 4 distinct tasks outlined in Sec. IV-A, with a maximum limit of 10 planning steps imposed. Across all material types, our approach consistently meets the objectives within the allotted planning steps, unlike the baseline approach Ours w/o Adaptation, which fails to achieve the goals due to its disregard for physical properties. For instance, in the rigid box pushing task, the baseline method incorrectly assumes the geometric center as the center of pressure, leading to inaccurate predictions of the box’s straightforward movement post-push. Conversely, our method dynamically adjusts the center of pressure estimations during the interactions, thereby reaching the desired configuration in just three steps. Furthermore, as depicted in Fig. 1, the dynamics of pushing granular objects of different sizes vary significantly - larger granules push forward while smaller ones tend to stack and leave a trail. The baseline method, treating the motion of toy blocks and average granular piles similarly, fails to accumulate them in the target zone. Our method, however, identifies and adapts to the varied dynamics of granular materials, successfully completing the task.
Fig. 8 offers quantitative results comparing the performance of our method against the baseline method Ours w/o Adaptation, focusing on efficiency and error tolerance. Across four distinct tasks, our approach demonstrates superior performance, achieving lower errors within a constrained number of planning steps and attaining a higher success rate under a stringent error margin.
V Conclusion and Future Work
We present AdaptiGraph, a unified graph-based neural dynamics framework for modeling multiple materials with unknown physical properties. We propose to condition the dynamics model on physical property variables and perform online few-shot physical property estimation. Experiments show that AdaptiGraph can precisely simulate the dynamics of multiple deformable materials, and adapt to objects with varying physical properties during deployment. We demonstrate the effectiveness of our framework across a wide range of objects in manipulation tasks.
AdaptiGraph is a flexible framework. Currently, we train our model on four material types (ropes, granular objects, rigid boxes, and cloth) and a single type of physical property for each material. A future direction of our work is to extend our method to include more object materials and a more comprehensive set of physical properties that determine object dynamics. It is also possible to model heterogeneous object interactions using our framework by learning the dynamics model on a material-conditioned heterogeneous graph.
Acknowledgment
This work is supported, in part, by NIFA Award 2021-67021-34418. We thank Mingtong Zhang, Binghao Huang, Yixuan Wang, and Hanxiao Jiang for the helpful discussions.
References
- Agrawal et al. [2016] Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems, 29, 2016.
- Babaeizadeh et al. [2017] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
- Battaglia et al. [2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks, 2018.
- Blomqvist [2022] Victor Blomqvist. Pymunk. https://pymunk.org, November 2022.
- Camacho and Bordons Alba [2013] Eduardo F. Camacho and Carlos Bordons Alba. Model Predictive Control. Springer Science & Business Media, 2013.
- Chen et al. [2022] Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. Comphy: Compositional physical reasoning of objects and events from videos, 2022.
- Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018.
- Ding et al. [2021] Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, and Chuang Gan. Dynamic visual reasoning by learning differentiable physics models from video and language, 2021.
- Edelsbrunner and Mücke [1994] Herbert Edelsbrunner and Ernst P Mücke. Three-dimensional alpha shapes. ACM Transactions On Graphics (TOG), 13(1):43–72, 1994.
- Evans et al. [2022] Ben Evans, Abitha Thankaraj, and Lerrel Pinto. Context is everything: Implicit identification for dynamics adaptation. In 2022 International Conference on Robotics and Automation (ICRA), pages 2642–2648. IEEE, 2022.
- Finn and Levine [2017] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
- Frank et al. [2010] Barbara Frank, Rüdiger Schmedding, Cyrill Stachniss, Matthias Teschner, and Wolfram Burgard. Learning the elasticity parameters of deformable objects with a manipulation robot. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1877–1883. IEEE, 2010.
- Gao et al. [2023] Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, and Dorsa Sadigh. Physically grounded vision-language models for robotic manipulation. arXiv preprint arXiv:2309.02561, 2023.
- Ha and Song [2021] Huy Ha and Shuran Song. Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. In Conference on Robot Learning (CoRL), 2021.
- Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
- Hafner et al. [2020] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
- Hogan and Rodriguez [2016] François Robert Hogan and Alberto Rodriguez. Feedback control of the pusher-slider system: A story of hybrid and underactuated contact dynamics. arXiv preprint arXiv:1611.08268, 2016.
- Hoque et al. [2020] Ryan Hoque, Daniel Seita, Ashwin Balakrishna, Aditya Ganapathi, Ajay Kumar Tanwani, Nawid Jamali, Katsu Yamane, Soshi Iba, and Ken Goldberg. Visuospatial foresight for multi-step, multi-task fabric manipulation. arXiv preprint arXiv:2003.09044, 2020.
- Huang et al. [2023] Isabella Huang, Yashraj Narang, Ruzena Bajcsy, Fabio Ramos, Tucker Hermans, and Dieter Fox. Defgraspnets: Grasp planning on 3d fields with graph neural nets, 2023.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Kumar et al. [2021] Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. 2021.
- Li et al. [2019a] Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B Tenenbaum, and Antonio Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In ICLR, 2019a.
- Li et al. [2019b] Yunzhu Li, Jiajun Wu, Jun-Yan Zhu, Joshua B Tenenbaum, Antonio Torralba, and Russ Tedrake. Propagation networks for model-based control under partial observation. In ICRA, 2019b.
- Li et al. [2020] Yunzhu Li, Toru Lin, Kexin Yi, Daniel Bear, Daniel Yamins, Jiajun Wu, Joshua Tenenbaum, and Antonio Torralba. Visual grounding of learned physical models. In International conference on machine learning, pages 5927–5936. PMLR, 2020.
- Liang et al. [2019] Junbang Liang, Ming Lin, and Vladlen Koltun. Differentiable cloth simulation for inverse problems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/28f0b864598a1291557bed248a998d4e-Paper.pdf.
- Lin et al. [2020] Xingyu Lin, Yufei Wang, Jake Olkin, and David Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. In Conference on Robot Learning, 2020.
- Lin et al. [2022] Xingyu Lin, Yufei Wang, Zixuan Huang, and David Held. Learning visible connectivity dynamics for cloth smoothing. In Conference on Robot Learning, pages 256–266. PMLR, 2022.
- Liu et al. [2023a] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023a.
- Liu et al. [2023b] Ziang Liu, Genggeng Zhou, Jeff He, Tobia Marcucci, Li Fei-Fei, Jiajun Wu, and Yunzhu Li. Model-based control with sparse neural dynamics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=ymBG2xs9Zf.
- Longhini et al. [2023] Alberta Longhini, Marco Moletta, Alfredo Reichlin, Michael C Welle, David Held, Zackory Erickson, and Danica Kragic. Edo-net: Learning elastic properties of deformable objects from graph dynamics. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3875–3881. IEEE, 2023.
- Macklin et al. [2014] Miles Macklin, Matthias Müller, Nuttapong Chentanez, and Tae-Yong Kim. Unified particle physics for real-time applications. ACM Transactions on Graphics (TOG), 33(4):1–12, 2014.
- Moenning and Dodgson [2003] Carsten Moenning and Neil A Dodgson. Fast marching farthest point sampling. Technical report, University of Cambridge, Computer Laboratory, 2003.
- Nagabandi et al. [2019] Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep Dynamics Models for Learning Dexterous Manipulation. In Conference on Robot Learning (CoRL), 2019.
- Pang et al. [2023] Tao Pang, HJ Terry Suh, Lujie Yang, and Russ Tedrake. Global planning for contact-rich manipulation via local smoothing of quasi-dynamic contact models. IEEE Transactions on Robotics, 2023.
- Pfaff et al. [2020] Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W Battaglia. Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409, 2020.
- Puthuveetil et al. [2023] Kavya Puthuveetil, Sasha Wald, Atharva Pusalkar, Pratyusha Karnati, and Zackory Erickson. Robust body exposure (robe): A graph-based dynamics modeling approach to manipulating blankets over people. IEEE Robotics and Automation Letters, 2023.
- Sanchez-Gonzalez et al. [2020] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In International conference on machine learning, pages 8459–8468. PMLR, 2020.
- She et al. [2020] Yu She, Shaoxiong Wang, Siyuan Dong, Neha Sunil, Alberto Rodriguez, and Edward Adelson. Cable manipulation with a tactile-reactive gripper, 2020.
- Shi et al. [2022] Haochen Shi, Huazhe Xu, Zhiao Huang, Yunzhu Li, and Jiajun Wu. Robocraft: Learning to see, simulate, and shape elasto-plastic objects with graph networks. arXiv preprint arXiv:2205.02909, 2022.
- Shi et al. [2023] Haochen Shi, Huazhe Xu, Samuel Clarke, Yunzhu Li, and Jiajun Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. arXiv preprint arXiv:2306.14447, 2023.
- Shreiner and Group [2009] Dave Shreiner and The Khronos OpenGL ARB Working Group. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Versions 3.0 and 3.1. Addison-Wesley Professional, 7th edition, 2009. ISBN 0321552628.
- Sundaresan et al. [2022] Priya Sundaresan, Rika Antonova, and Jeannette Bohg. Diffcloud: Real-to-sim from point clouds with differentiable simulation and rendering of deformable objects. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10828–10835. IEEE, 2022.
- Tremblay et al. [2023] Jean-francois Tremblay, Francois Robert Hogan, David Paul Meger, and Gregory Lewis Dudek. Learning active tactile perception through belief-space control, November 2 2023. US Patent App. 18/141,031.
- Tung et al. [2023] Fish Tung, Mingyu Ding, Zhenfang Chen, Daniel M. Bear, Chuang Gan, Joshua B. Tenenbaum, Daniel L. K. Yamins, Judith Fan, and Kevin A. Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. arXiv, 2023.
- Wang et al. [2020] Chen Wang, Shaoxiong Wang, Branden Romero, Filipe Veiga, and Edward H Adelson. Swingbot: Learning physical features from in-hand tactile exploration for dynamic swing-up manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
- Wang et al. [2023a] Yi Ru Wang, Jiafei Duan, Dieter Fox, and Siddhartha Srinivasa. Newton: Are large language models capable of physical reasoning? arXiv preprint arXiv:2310.07018, 2023a.
- Wang et al. [2023b] Yixuan Wang, Yunzhu Li, Katherine Driggs-Campbell, Li Fei-Fei, and Jiajun Wu. Dynamic-Resolution Model Learning for Object Pile Manipulation. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023b. doi: 10.15607/RSS.2023.XIX.047.
- Williams et al. [2017] Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.
- Wu et al. [2015] Jiajun Wu, Ilker Yildirim, Joseph J Lim, William T Freeman, and Joshua B Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In Advances in Neural Information Processing Systems, pages 127–135, 2015.
- Xu et al. [2019] Zhenjia Xu, Jiajun Wu, Andy Zeng, Joshua B Tenenbaum, and Shuran Song. Densephysnet: Learning dense physical object representations via multi-step dynamic interactions. In Robotics: Science and Systems (RSS), 2019.
- Yang et al. [2023] Linhan Yang, Bidan Huang, Qingbiao Li, Ya-Yen Tsai, Wang Wei Lee, Chaoyang Song, and Jia Pan. Tacgnn: Learning tactile-based in-hand manipulation with a blind robot, 2023.
- Yao and Hauser [2023] Shaoxiong Yao and Kris Hauser. Estimating tactile models of heterogeneous deformable objects in real time. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12583–12589, 2023. doi: 10.1109/ICRA48891.2023.10160731.
- Yu et al. [2016] Kuan-Ting Yu, Maria Bauza, Nima Fazeli, and Alberto Rodriguez. More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 30–37. IEEE, 2016.
- Zhou et al. [2019] Jiaji Zhou, Yifan Hou, and Matthew T Mason. Pushing revisited: Differential flatness, trajectory planning, and stabilization. The International Journal of Robotics Research, 38(12-13):1477–1489, 2019.
Appendix
Appendix A Comparison with Additional Baselines
A.1 Ablation on Material Conditioning
Expanding on Fig. 5 from the main paper, we introduce an additional baseline, Unified GNN, to study the importance of material type conditioning. As outlined in Tab. I, we establish the following baselines: (1) Unified GNN, a single GNN model trained on a combined dataset of rope, cloth, and granular materials, without any conditioning on material types or physical properties; (2) Separate GNN, which employs a graph neural network with the same architecture as our model but lacks conditioning on physical parameters, and is independently trained for each material category; (3) Ours w/o Adaptation, an ablated version of our material-adaptive model conditioned on the mean physical property variable.
The quantitative findings are displayed in Fig. 9. In this assessment, the Unified GNN has the lowest performance and the highest variance, showing that it fails to model the complex dynamics arising from distinct material types and physical properties. The Separate GNN baseline performs better on individual material types than the Unified GNN, reaching performance comparable to Ours w/o Adaptation. However, its lack of physical property adaptability leads to inaccuracies. Overall, Ours achieves the best performance on all material types. These results demonstrate the relative importance of material conditioning, physical property conditioning, and online adaptation for dynamics prediction performance.
A.2 Adaptation Using Different Base Models
In the main paper, we showcased the superior performance of our model in terms of dynamics prediction error compared to two baseline models: GNN, which does not use physical property conditioning, and Ours w/o Adaptation, where the online adaptation module is removed. In this section, we further compare our model to (1) simulators incorporating physical property adaptation and (2) fine-tuning an unconditional GNN. We evaluate their dynamics prediction error after adaptation from few-shot real-world interactions.
Settings. We employ the same simulators used for generating our training data: FleX [22, 31] for deformable objects and Pymunk [4] for rigid boxes. Given the observed point cloud, we map the point cloud to object states in simulation with a perception model and then use the simulators to perform dynamics rollout and optimization-based physical property estimation.
We design category-specific perception models to mitigate the sim-to-real gap. For boxes, we extract the 4 corners of the box from the top view and create an identical 2D box in Pymunk. For ropes, we apply mesh reconstruction based on alpha shapes [9] to derive the rope mesh in the FleX simulator. For clothes and granular objects, we extract the contour of the point cloud’s projection on the table surface, and construct object instances, i.e., a piece of cloth or granular pieces, that exactly cover the contour region.
Method | Cond. on material type? | Cond. on physical property? | Online adaptation? |
Unified GNN | ✗ | ✗ | ✗ |
Separate GNN | ✓ | ✗ | ✗ |
Ours w/o Adaptation | ✓ | ✓ | ✗ |
Ours | ✓ | ✓ | ✓ |
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x10.png)
Results. As Fig. 10 shows, our method with online adaptation exhibits the lowest dynamics prediction error. It reduces the error for rigid boxes, ropes, clothes, and granular objects compared to our model without adaptation. Notably, the error reduction ratio surpasses that achieved by fine-tuning an unconditional GNN-based dynamics model. Our model also demonstrates lower dynamics prediction error than simulator-based physical property adaptation for rigid boxes, ropes, clothes, and granular objects. We attribute the simulators' higher error to their inherent system identification error and instability. In contrast, using a learning-based dynamics model directly on point clouds enhances our model’s robustness to noisy visual inputs.
Moreover, our model is significantly faster than simulators. Running the Bayesian optimization algorithm for 50 iterations takes approximately 7 seconds for our model on a desktop computer equipped with an i9-13900K CPU and an NVIDIA GeForce RTX 4090 GPU, whereas it takes approximately 900 seconds for the FleX simulator.
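The optimization-based property estimation discussed in this section can be illustrated with a small sketch. It is not the Bayesian optimization used in our experiments: for brevity it substitutes plain random search with the same budget of 50 evaluations. The toy `toy_predict` model, the normalized property range [0, 1], and all function names are assumptions made for illustration only.

```python
import random

def estimate_property(predict, interactions, rng, num_iters=50, low=0.0, high=1.0):
    """Search for the property value that best explains observed interactions.

    Random search is a stand-in for Bayesian optimization (assumption).
    `interactions` is a list of (state, action, next_state) tuples.
    """
    best_theta, best_err = None, float("inf")
    for _ in range(num_iters):
        theta = rng.uniform(low, high)  # candidate physical property
        # Mean absolute one-step prediction error under this candidate.
        err = sum(abs(predict(s, a, theta) - s_next)
                  for s, a, s_next in interactions) / len(interactions)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err

# Toy 1D model (hypothetical): the property scales how far a push moves the object.
def toy_predict(state, action, theta):
    return state + theta * action

# Interaction data generated with a "true" property of 0.7.
observed = [(0.0, 1.0, 0.7), (1.0, 2.0, 2.4), (0.5, -1.0, -0.2)]
theta_hat, err = estimate_property(toy_predict, observed, random.Random(0))
```

Each evaluation here is a cheap forward pass of the model, which is the reason a learned model can run such a loop much faster than a full simulator rollout.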
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x11.png)
Appendix B Additional Implementation Details
B.1 Simulation and Data Collection
For training our GBND model, we generate datasets encompassing variable physical properties in simulators. In FleX [22, 31], we render the robot workspace and the xArm6 robot mesh through OpenGL [41] from four camera angles to closely mirror our real-world configuration. In Pymunk [4], the robot pusher is represented as a circular shape with a radius of 1 cm. Fig. 11 shows the simulation setup for our data collection. The subsequent paragraphs detail the data generation process for each material type in the simulation.
Rigid Box. As shown in Fig. 11a, a circular rigid pusher interacts with a rigid box from random positions and angles. The length of the rigid box is uniformly sampled from 150 to 300 mm and the width is uniformly sampled from 50 to 200 mm. The center of pressure (CoP), represented by a 2-dimensional normalized coordinate, is sampled uniformly over the box surface. The friction coefficient between the box and the table is fixed as a control variable. We generate 1000 data episodes, each containing 1 pushing action on 1 box with random size and CoP.
Rope. As shown in Fig. 11b, we use a cylinder pusher with a radius of 1 cm to randomly interact with a simulated rope. The workspace in the simulation measures 90 cm × 70 cm. We uniformly randomize the length, thickness, and stiffness of the rope. We collect 1000 episodes of data, each comprising 5 continuous random pushes on one rope.
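The rigid-box sampling described above can be sketched as follows. This is a minimal illustration that reads the ranges as 150 to 300 mm for the length and 50 to 200 mm for the width. The unit-square normalization of the CoP and the helper name `sample_box_episode` are assumptions, not details taken from our data generation code.

```python
import random

def sample_box_episode(rng):
    """Sample one rigid-box episode configuration (hypothetical helper)."""
    length_mm = rng.uniform(150.0, 300.0)  # box length in mm
    width_mm = rng.uniform(50.0, 200.0)    # box width in mm
    # Assumption: the CoP is normalized to the unit square over the box surface.
    cop = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
    return {"length_mm": length_mm, "width_mm": width_mm, "cop": cop}

rng = random.Random(0)
# One configuration per episode; each episode contains a single push.
episodes = [sample_box_episode(rng) for _ in range(1000)]
```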
Cloth. The cloth simulation environment is created following the approach used in SoftGym [26], as shown in Fig. 11c. For the cloth properties, we vary the stretch stiffness (which determines resistance to elongation), bend stiffness (resistance to bending), and shear stiffness (resistance to sliding or twisting deformations). We randomly create rectangular clothes with lengths and widths uniformly distributed from 19 to 21 cm and 31 to 34 cm, respectively. We collect data across 1000 episodes, with each episode involving 5 continuous random interactions on one piece of cloth.
Granular Object. Adhering to the setup in [47], we use irregular polygonal meshes to represent granular objects, as shown in Fig. 11d. The scale of the granules is uniformly sampled from 1 to 3 cm. We also randomize the number of granular objects and the initial coverage area of the pile. The collected data consists of 1000 episodes, each comprising 5 continuous random interactions on one granular pile.
B.2 Model-Based Planning
We apply the MPPI trajectory optimization algorithm for model-based planning. Given the dynamics model (here we omit the material and physical property conditions for convenience), the cost function we minimize is:
(9) |
where the task term measures the distance from the current state to the target, and the penalty term assigns a high cost to infeasible actions.
Task term. For rope straightening and cloth relocating, the task term is defined as the Chamfer Distance between the current state and the target state:
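For readers unfamiliar with MPPI, the sampling-based update can be sketched as below. This is a generic, simplified version of the algorithm on a toy 1D problem, not our planner: the scalar state and action, the noise scale, the temperature, and all function names are assumptions chosen for illustration.

```python
import math
import random

def mppi_step(dynamics, cost, state, mean_actions, rng,
              num_samples=64, noise_std=0.5, temperature=1.0):
    """One MPPI update of a nominal action sequence (illustrative only)."""
    horizon = len(mean_actions)
    samples, costs = [], []
    for _ in range(num_samples):
        # Perturb the nominal action sequence with Gaussian noise.
        actions = [a + rng.gauss(0.0, noise_std) for a in mean_actions]
        # Roll out the dynamics model and accumulate the cost.
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)
            total += cost(s, a)
        samples.append(actions)
        costs.append(total)
    # Exponentially weight samples; subtracting the minimum cost avoids underflow.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    z = sum(weights)
    # The new nominal sequence is the weighted average of the sampled sequences.
    return [sum(w * smp[t] for w, smp in zip(weights, samples)) / z
            for t in range(horizon)]

# Toy usage (hypothetical): push a 1D point from 0.0 toward a target at 1.0.
def toy_dynamics(s, a):
    return s + a

def toy_cost(s, a):
    return (s - 1.0) ** 2

rng = random.Random(0)
plan = [0.0] * 5
for _ in range(20):
    plan = mppi_step(toy_dynamics, toy_cost, 0.0, plan, rng)
```

In the setting of this section, `dynamics` would be the learned model and `cost` the sum of the task and penalty terms.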
(10) |
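As a concrete reference, one common form of the Chamfer Distance is sketched below. Conventions differ between implementations (squared or unsquared distances, sums or means), so the unsquared, mean-based, symmetric variant here is an assumption and may not match the exact normalization in our cost.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (lists of tuples)."""
    def one_sided(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_sided(points_a, points_b) + one_sided(points_b, points_a)

# Toy usage: a two-particle "rope" and a target shifted by 1 along the y axis.
current = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
cost = chamfer_distance(current, target)
```

Because each point is matched to its nearest neighbor, this cost needs no particle correspondence between the current state and the target.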
For granular pile gathering, we use the nearest distance from object particles to the target rectangle:
(11) |
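The particle-to-rectangle distance can be sketched as follows. The sketch assumes an axis-aligned target rectangle in the table plane and averages over particles; both choices, along with the function names, are illustrative assumptions.

```python
import math

def point_to_rect_distance(point, rect_min, rect_max):
    """Distance from a 2D point to an axis-aligned rectangle (0 if inside)."""
    dx = max(rect_min[0] - point[0], 0.0, point[0] - rect_max[0])
    dy = max(rect_min[1] - point[1], 0.0, point[1] - rect_max[1])
    return math.hypot(dx, dy)

def gathering_cost(particles, rect_min, rect_max):
    """Mean particle-to-rectangle distance (averaging is an assumption)."""
    return sum(point_to_rect_distance(p, rect_min, rect_max)
               for p in particles) / len(particles)

# Toy usage: one particle inside the target region, one 1 unit to its right.
pile = [(0.5, 0.5), (2.0, 0.5)]
cost = gathering_cost(pile, (0.0, 0.0), (1.0, 1.0))
```

The cost is zero once every particle lies inside the target rectangle.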
For rigid box pushing, since we have the correspondence between the observed box corners and the target corners, we use the Mean Squared Error (MSE):
(12) |
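A minimal sketch of the corner-based error is given below. It assumes the corners are listed in corresponding order and averages the squared distance per corner; the normalization and the function name are assumptions.

```python
def corner_mse(current_corners, target_corners):
    """Mean squared error between corresponding corner points."""
    total = 0.0
    for c, t in zip(current_corners, target_corners):
        # Squared Euclidean distance between a corner and its target.
        total += sum((ci - ti) ** 2 for ci, ti in zip(c, t))
    return total / len(current_corners)

# Toy usage: a unit square and the same square shifted by 1 along x.
box = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
goal = [(x + 1.0, y) for x, y in box]
cost = corner_mse(box, goal)
```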
Penalty term. For all tasks, the penalty cost is defined as
(13) |
where is the robot workspace; is the particle set in state ; and represent end-effector and object particles, respectively. Thus, the penalty term penalizes actions that make the object particles move out of the workspace and the actions that will make the end-effector contact the object in . We set except for clothes where as the grasping action allows contact.
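The two penalties described above can be sketched as follows. The axis-aligned workspace bounds, the contact radius, the indicator-style costs, and the weights `w_out` and `w_contact` are all hypothetical choices made for illustration; they are not the values or the exact form used in our planner.

```python
import math

def penalty_cost(object_particles, eef_particles, ws_min, ws_max,
                 contact_radius=0.01, w_out=1.0, w_contact=1.0):
    """Penalize out-of-workspace object particles and end-effector contact."""
    # Count object particles outside the axis-aligned workspace bounds.
    out_count = sum(
        1 for p in object_particles
        if not all(lo <= x <= hi for x, lo, hi in zip(p, ws_min, ws_max)))
    # Flag contact if any end-effector particle is too close to the object.
    in_contact = any(math.dist(e, p) < contact_radius
                     for e in eef_particles for p in object_particles)
    return w_out * out_count + w_contact * float(in_contact)

# Toy usage in a unit-square workspace.
safe = penalty_cost([(0.5, 0.5)], [(0.9, 0.9)], (0.0, 0.0), (1.0, 1.0))
outside = penalty_cost([(1.5, 0.5)], [(0.9, 0.9)], (0.0, 0.0), (1.0, 1.0))
touching = penalty_cost([(0.5, 0.5)], [(0.5, 0.505)], (0.0, 0.0), (1.0, 1.0))
```

In this sketch, a caller could set `w_contact` to zero for tasks where contact is permitted, such as grasping.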
Appendix C Discussion and Potential Extensions
Object | Heuristics (Ours): Estimation | Heuristics (Ours): Variance | Uncertainty-Driven: Estimation | Uncertainty-Driven: Variance |
Rope 1 | 1.09 | 0.029 | 1.20 | 0.024 |
Rope 2 | -0.02 | 0.025 | -0.01 | 0.035 |
Rope 3 | -0.05 | 0.030 | 0.00 | 0.029 |
Rope 4 | 0.88 | 0.020 | 0.83 | 0.016 |
Rope 5 | 0.67 | 0.018 | 0.63 | 0.019 |
C.1 Multiple Properties Recovering
Expanding on one-dimensional physical parameter conditioning for deformable objects, we designed an experiment to show that our method can also be applied to more than one physical property.
We consider ropes with 2-dimensional physical properties: stiffness and friction coefficient. We generate the same amount of data as in our previous setting in the NVIDIA FleX [22, 31] simulator with varying stiffness and friction and train a model conditioned on both properties. Then, we apply the model to ropes in the real world and perform property estimation and forward dynamics prediction. Results are shown in Fig. 12. As we can observe, the model gives reasonable estimates by predicting high friction in the w/ sheet cases and low friction in the w/o sheet cases. The stiffness estimates for all three ropes with and without sheets are also consistent. The dynamics prediction error of the 2D model (conditioned on both stiffness and friction) is generally lower than that of the 1D model (conditioned on stiffness only), showing that incorporating more relevant properties makes dynamics prediction more accurate.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x12.png)
C.2 Identification with Uncertainty
The uncertainty of the physics parameters could be an important indicator measuring the estimation’s confidence. Minimizing the uncertainty can also be used as an objective when selecting interactions [43]. In comparison, our method selects actions that produce maximum displacement on object particles. In this experiment, we compare an uncertainty-driven interaction selection scheme with our heuristics-driven scheme.
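The heuristics-driven scheme can be sketched as choosing, from a set of candidate actions, the one with the largest predicted particle displacement. The candidate-set enumeration, the mean-displacement score, and the toy `toy_predict` model are assumptions for illustration.

```python
import math

def select_max_displacement_action(predict, particles, candidate_actions):
    """Pick the action with the largest mean predicted particle displacement."""
    def mean_displacement(action):
        moved = predict(particles, action)
        return sum(math.dist(p, q) for p, q in zip(particles, moved)) / len(particles)
    return max(candidate_actions, key=mean_displacement)

# Toy model (hypothetical): every particle translates by the action vector.
def toy_predict(particles, action):
    return [tuple(x + a for x, a in zip(p, action)) for p in particles]

particles = [(0.0, 0.0), (1.0, 0.0)]
candidates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.3)]
best_action = select_max_displacement_action(toy_predict, particles, candidates)
```

The score needs only one forward prediction per candidate, which keeps this scheme cheap compared with objectives that require predictions under many sampled parameters.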
We can define the belief state over the parameter space based on the dynamics prediction error:
(14) |
where is the probability density of physics parameter under belief , is the set of interaction data, CD represents Chamfer Distance, is a temperature hyper-parameter which we set to , and is a normalizing factor. Parameters that give a lower dynamics prediction error will have higher probability density . With this definition, a natural way to measure the uncertainty of a belief is by the uncertainty in the dynamics prediction outcomes, given current state and control action :
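A discrete version of such a belief can be sketched as below: each candidate parameter receives a weight that decays exponentially with its dynamics prediction error, and the weights are normalized to sum to one. The exponential form follows the description above (temperature, normalizing factor, lower error giving higher density), but the exact expression, the temperature value, and the function name are assumptions.

```python
import math

def belief_weights(errors, temperature):
    """Normalized weights proportional to exp(-error / temperature)."""
    # Shifting by the minimum error improves numerical stability and
    # cancels out after normalization.
    e_min = min(errors)
    unnorm = [math.exp(-(e - e_min) / temperature) for e in errors]
    z = sum(unnorm)  # normalizing factor
    return [u / z for u in unnorm]

# Toy usage: prediction errors (e.g., Chamfer Distance) of three candidates.
weights = belief_weights([0.1, 0.2, 0.4], temperature=0.1)
```

A smaller temperature concentrates the belief on the best-fitting candidates, while a larger one spreads it more evenly.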
(15) |
Intuitively, by selecting an action that maximizes the uncertainty in the above equation, the interaction result will most effectively discriminate between parameters sampled from the belief and thus be more effective in reducing the post-adaptation variance. In practice, we sample parameters from the belief and calculate the above equation as an MPPI objective.
Results of using this uncertainty-driven interaction selection approach are provided in Tab. II. We compare it with our heuristics-based approach and test on 5 different ropes. The estimated parameters are consistent between the two approaches, and neither shows a consistent advantage in post-adaptation variance. Given that the heuristics-based approach is computationally faster, it is more suitable for our identification tasks.
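The uncertainty-driven objective can be sketched as the variance of predicted outcomes across parameters sampled from the belief. For brevity, the sketch scores a small set of candidate actions by enumeration rather than running MPPI. The variance definition (summed over coordinates, averaged over particles) and the toy model are assumptions.

```python
def prediction_variance(predictions):
    """Variance of predicted particle positions across sampled parameters.

    predictions[k][i] is the predicted position (tuple) of particle i under
    the k-th sampled parameter. Variances are summed over coordinates and
    averaged over particles.
    """
    n = len(predictions)
    num_particles = len(predictions[0])
    total = 0.0
    for i in range(num_particles):
        for d in range(len(predictions[0][i])):
            values = [pred[i][d] for pred in predictions]
            mean = sum(values) / n
            total += sum((v - mean) ** 2 for v in values) / n
    return total / num_particles

def select_most_informative_action(predict, particles, thetas, candidate_actions):
    """Pick the candidate action whose outcomes disagree most across thetas."""
    def score(action):
        return prediction_variance([predict(particles, action, th) for th in thetas])
    return max(candidate_actions, key=score)

# Toy model (hypothetical): particles translate by theta * action.
def toy_predict(particles, action, theta):
    return [tuple(x + theta * a for x, a in zip(p, action)) for p in particles]

particles = [(0.0, 0.0), (1.0, 0.0)]
thetas = [0.2, 0.5, 0.9]  # parameters sampled from the belief
best = select_most_informative_action(
    toy_predict, particles, thetas, [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0)])
```

In this toy model a larger push separates the predictions more, so the largest candidate action is selected. Each score costs one forward prediction per sampled parameter.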