Showing 1–13 of 13 results for author: Shridhar, M

  1. arXiv:2407.07875  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    Generative Image as Action Models

    Authors: Mohit Shridhar, Yat Long Lo, Stephen James

    Abstract: Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets int…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Project website, code, checkpoints: https://genima-robot.github.io/

  2. arXiv:2407.00278  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    PerAct2: A Perceiver Actor Framework for Bimanual Manipulation Tasks

    Authors: Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox

    Abstract: Bimanual manipulation is challenging due to precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation.…

    Submitted 28 June, 2024; originally announced July 2024.

  3. arXiv:2310.01361  [pdf, other]

    cs.LG cs.CL cs.CV cs.RO

    GenSim: Generating Robotic Simulation Tasks via Large Language Models

    Authors: Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, Xiaolong Wang

    Abstract: Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tas…

    Submitted 21 January, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: See our project website (https://liruiw.github.io/gensim), demo and datasets (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code (https://github.com/liruiw/GenSim) for more details

    Journal ref: International Conference on Learning Representations (ICLR), 2024

  4. arXiv:2306.13818  [pdf, other]

    cs.RO cs.CV

    AR2-D2: Training a Robot Without a Robot

    Authors: Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, Ranjay Krishna

    Abstract: Diligently gathered human demonstrations serve as the unsung heroes empowering the progression of robot learning. Today, demonstrations are collected by training people to use specialized controllers, which (tele-)operate robots to manipulate a small number of objects. By contrast, we introduce AR2-D2: a system for collecting demonstrations which (1) does not require people with specialized traini…

    Submitted 23 June, 2023; originally announced June 2023.

    Comments: Project website: www.ar2d2.site

  5. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  6. arXiv:2209.05451  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

    Authors: Mohit Shridhar, Lucas Manuelli, Dieter Fox

    Abstract: Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct…

    Submitted 11 November, 2022; v1 submitted 12 September, 2022; originally announced September 2022.

    Comments: CoRL 2022. Project Website: https://peract.github.io/

  7. arXiv:2109.12098  [pdf, other]

    cs.RO cs.CL cs.CV cs.LG

    CLIPort: What and Where Pathways for Robotic Manipulation

    Authors: Mohit Shridhar, Lucas Manuelli, Dieter Fox

    Abstract: How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has be…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: CoRL 2021. Project Website: https://cliport.github.io/

  8. arXiv:2107.12514  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.RO

    Language Grounding with 3D Objects

    Authors: Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer

    Abstract: Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless." The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or…

    Submitted 15 September, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

    Comments: Conference on Robot Learning (CoRL) 2021

  9. arXiv:2010.11453  [pdf, other]

    cs.CR cs.LG cs.NI

    Machine Learning-Based Early Detection of IoT Botnets Using Network-Edge Traffic

    Authors: Ayush Kumar, Mrinalini Shridhar, Sahithya Swaminathan, Teng Joon Lim

    Abstract: In this work, we present a lightweight IoT botnet detection solution, EDIMA, which is designed to be deployed at the edge gateway installed in home networks and targets early detection of botnets prior to the launch of an attack. EDIMA includes a novel two-stage Machine Learning (ML)-based detector developed specifically for IoT bot detection at the edge gateway. The ML-based bot detector first em…

    Submitted 22 October, 2020; originally announced October 2020.

  10. arXiv:2010.03768  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.RO

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Authors: Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht

    Abstract: Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does…

    Submitted 14 March, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

    Comments: ICLR 2021; Data, code, and videos are available at alfworld.github.io

  11. arXiv:1912.01734  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

    Authors: Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox

    Abstract: We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demons…

    Submitted 30 March, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Computer Vision and Pattern Recognition (CVPR) 2020; https://askforalfred.com/

  12. arXiv:1806.03831  [pdf, other]

    cs.RO cs.CL cs.CV

    Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction

    Authors: Mohit Shridhar, David Hsu

    Abstract: This paper presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is the grounding of referring expressions: infer objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disam…

    Submitted 11 June, 2018; originally announced June 2018.

    Comments: In Robotics: Science & Systems (RSS) 2018

  13. arXiv:1707.05720  [pdf, other]

    cs.RO cs.AI cs.CL

    Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction

    Authors: Mohit Shridhar, David Hsu

    Abstract: The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is semantic and spatial grounding, which is to infer objects and their spatial relationships from images and natural language expressions. We introduce a two-stag…

    Submitted 18 July, 2017; originally announced July 2017.

    Comments: 8 pages, 4 figures, Accepted at RSS 2017 Workshop on Spatial-Semantic Representations in Robotics