-
Controlling Space and Time with Diffusion Models
Authors:
Daniel Watson,
Saurabh Saxena,
Lala Li,
Andrea Tagliasacchi,
David J. Fleet
Abstract:
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene, and a set of camera poses and timestamps. To overcome challenges due to limited availability of 4D training data, we advocate joint training on 3D (with camera pose), 4D (pose+time) and video (time but no pose) data and propose a new architecture that enables the same. We further advocate the calibration of SfM posed data using monocular metric depth estimators for metric scale camera control. For model evaluation, we introduce new metrics to enrich and overcome shortcomings of current evaluation schemes, demonstrating state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while at the same time adding the ability to handle temporal dynamics. 4DiM is also used for improved panorama stitching, pose-conditioned video to video translation, and several other tasks. For an overview see https://4d-diffusion.github.io
Submitted 10 July, 2024;
originally announced July 2024.
-
Multi-person eye tracking for real-world scene perception in social settings
Authors:
Shreshth Saxena,
Areez Visram,
Neil Lobo,
Zahid Mirza,
Mehak Rafi Khan,
Biranugan Pirabaharan,
Alexander Nguyen,
Lauren K. Fink
Abstract:
Eye movements provide a window into human behaviour, attention, and interaction dynamics. Previous research suggests that eye movements are highly influenced by task, setting, and social others; however, most eye tracking research is conducted in single-person, in-lab settings and is yet to be validated in multi-person, naturalistic contexts. One such prevalent real-world context is the collective viewing of a shared scene in social settings, for example, viewing a concert, film, lecture, sports, etc. Here, we apply mobile eye tracking in a real-world multi-person setup and develop a system to stream, record, and analyse synchronised data. We tested our proposed, open-source system while participants (N=60) watched a live concert and a documentary film screening during a public event. We tackled challenges related to networking bandwidth requirements, real-time monitoring, and gaze projection from individual egocentric perspectives to a common coordinate space for shared gaze analysis. Our system achieves precise time synchronisation and accurate gaze projection in challenging dynamic scenes. Further, to illustrate the potential of collective eye-tracking data, we introduce and evaluate novel analysis metrics and visualisations. Overall, our approach contributes to the development and application of versatile multi-person eye tracking systems in real-world social settings. This advancement enables insight into collaborative behaviour, group dynamics, and social interaction, with high ecological validity. Moreover, it paves the path for innovative, interactive tools that promote collaboration and coordination in social contexts.
Submitted 8 July, 2024;
originally announced July 2024.
-
DPHGNN: A Dual Perspective Hypergraph Neural Networks
Authors:
Siddhant Saxena,
Shounak Ghatak,
Raghu Kolla,
Debashis Mukherjee,
Tanmoy Chakraborty
Abstract:
Message passing on hypergraphs has been a standard framework for learning higher-order correlations between hypernodes. Recently-proposed hypergraph neural networks (HGNNs) can be categorized into spatial and spectral methods based on their design choices. In this work, we analyze the impact of change in hypergraph topology on the suboptimal performance of HGNNs and propose DPHGNN, a novel dual-perspective HGNN that introduces equivariant operator learning to capture lower-order semantics by inducing topology-aware spatial and spectral inductive biases. DPHGNN employs a unified framework to dynamically fuse lower-order explicit feature representations from the underlying graph into the super-imposed hypergraph structure. We benchmark DPHGNN over eight benchmark hypergraph datasets for the semi-supervised hypernode classification task and obtain superior performance compared to seven state-of-the-art baselines. We also provide a theoretical framework and a synthetic hypergraph isomorphism test to express the power of spatial HGNNs and quantify the expressivity of DPHGNN beyond the Generalized Weisfeiler Leman (1-GWL) test. Finally, DPHGNN was deployed by our partner e-commerce company for the Return-to-Origin (RTO) prediction task, which shows ~7% higher macro F1-Score than the best baseline.
Submitted 26 May, 2024;
originally announced May 2024.
-
Maximizing Weighted Dominance in the Plane
Authors:
Waseem Akram,
Sanjeev Saxena
Abstract:
Let P be a set of n weighted points, Q be a set of m unweighted points in the plane, and k a non-negative integer. We consider the problem of computing a subset $Q'\subseteq Q$ with size at most k such that the sum of the weights of the points of P dominated by at least one point in the set Q' is maximized. A point q in the plane dominates another point p if and only if $x(q)\ge x(p)$ and $y(q)\ge y(p)$, and at least one inequality is strict.
We present a solution to the problem that takes $O(n + m)$-space and $O(k \min\{n+m, \frac{n}{k}+m^2\}\log m)$-time. We (conditionally) improve upon the existing result (the bounds of our solution are interesting when $m = o(\sqrt{n})$).
Moreover, we also present a simple algorithm solving the problem in $O(km^2+n\log m)$-time and $O(n+m)$-space. The bounds of the algorithm are interesting when $m= o(\sqrt{n})$.
Submitted 21 May, 2024;
originally announced May 2024.
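The dominance relation and the objective defined in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm; it only evaluates the objective for one fixed subset Q'. The names `dominates` and `dominated_weight` and the sample points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(q: Point, p: Point) -> bool:
    """q dominates p iff x(q) >= x(p) and y(q) >= y(p), with at least one strict."""
    return q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])


def dominated_weight(chosen: List[Point], weighted: List[Tuple[Point, float]]) -> float:
    """Sum the weights of the points of P dominated by at least one chosen point."""
    return sum(w for p, w in weighted if any(dominates(q, p) for q in chosen))


# Hypothetical data: P as (point, weight) pairs, and one candidate subset Q' of Q.
P = [((1, 1), 2.0), ((2, 3), 5.0), ((4, 1), 1.0)]
Q_sub = [(3, 3)]
total = dominated_weight(Q_sub, P)
```

The problem in the abstract asks for the subset of size at most k that maximizes this value; the sketch only scores a given subset.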
-
A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI
Authors:
Hannah Chafetz,
Sampriti Saxena,
Stefaan G. Verhulst
Abstract:
Since late 2022, generative AI has taken the world by storm, with widespread use of tools including ChatGPT, Gemini, and Claude. Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge. However, the intricate relationship between open data and generative AI, and the vast potential it holds for driving innovation in this field remain underexplored areas. This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data: Is open data becoming AI ready? Is open data moving towards a data commons approach? Is generative AI making open data more conversational? Will generative AI improve open data quality and provenance? Towards this end, we provide a new Spectrum of Scenarios framework. This framework outlines a range of scenarios in which open data and generative AI could intersect and what is required from a data quality and provenance perspective to make open data ready for those specific scenarios. These scenarios include: pretraining, adaptation, inference and insight generation, data augmentation, and open-ended exploration. Through this process, we found that in order for data holders to embrace generative AI to improve open data access and develop greater insights from open data, they first must make progress around five key areas: enhance transparency and documentation, uphold quality and integrity, promote interoperability and standards, improve accessibility and useability, and address ethical considerations.
Submitted 7 May, 2024;
originally announced May 2024.
-
Towards quantum computing for clinical trial design and optimization: A perspective on new opportunities and challenges
Authors:
Hakan Doga,
M. Emre Sahin,
Joao Bettencourt-Silva,
Anh Pham,
Eunyoung Kim,
Alan Andress,
Sudhir Saxena,
Aritra Bose,
Laxmi Parida,
Jan Lukas Robertus,
Hideaki Kawaguchi,
Radwa Soliman,
Daniel Blankenberg
Abstract:
Clinical trials are pivotal in the drug discovery process to determine the safety and efficacy of a drug candidate. The high failure rates of these trials are attributed to deficiencies in clinical model development and protocol design. Improvements in the clinical drug design process could therefore yield significant benefits for all stakeholders involved. This paper examines the current challenges faced in clinical trial design and optimization, reviews established classical computational approaches, and introduces quantum algorithms aimed at enhancing these processes. Specifically, the focus is on three critical aspects: clinical trial simulations, site selection, and cohort identification. This study aims to provide a comprehensive framework that leverages quantum computing to innovate and refine the efficiency and effectiveness of clinical trials.
Submitted 19 April, 2024;
originally announced April 2024.
-
MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
Authors:
Vithursan Thangarasa,
Mahmoud Salem,
Shreyas Saxena,
Kevin Leong,
Joel Hestness,
Sean Lie
Abstract:
Large language models (LLMs) are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine). Although domain-specific pre-training enhances efficiency and leads to smaller models, the computational costs of training these LLMs remain high, posing budgeting challenges. We introduce MediSwift, a suite of biomedical LMs that leverage sparse pre-training on domain-specific biomedical text data. By inducing up to 75% weight sparsity during the pre-training phase, MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse pre-training was performed on the Cerebras CS-2 system, which is specifically designed to realize the acceleration benefits from unstructured weight sparsity, thereby significantly enhancing the efficiency of the MediSwift models. Through subsequent dense fine-tuning and strategic soft prompting, MediSwift models outperform existing LLMs up to 7B parameters on biomedical tasks, setting new benchmarks w.r.t efficiency-accuracy on tasks such as PubMedQA. Our results show that sparse pre-training, along with dense fine-tuning and soft prompting, offers an effective method for creating high-performing, computationally efficient models in specialized domains.
Submitted 1 March, 2024;
originally announced March 2024.
-
Improving Deep Generative Models on Many-To-One Image-to-Image Translation
Authors:
Sagar Saxena,
Mohammad Nayeem Teli
Abstract:
Deep generative models have been applied to multiple applications in image-to-image translation. Generative Adversarial Networks and Diffusion Models have presented impressive results, setting new state-of-the-art results on these tasks. Most methods have symmetric setups across the different domains in a dataset. These methods assume that all domains have either multiple modalities or only one modality. However, there are many datasets that have a many-to-one relationship between two domains. In this work, we first introduce a Colorized MNIST dataset and a Color-Recall score that can provide a simple benchmark for evaluating models on many-to-one translation. We then introduce a new asymmetric framework to improve existing deep generative models on many-to-one image-to-image translation. We apply this framework to StarGAN V2 and show that in both unsupervised and semi-supervised settings, the performance of this new model improves on many-to-one image-to-image translation.
Submitted 22 February, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Authors:
Saumya Saxena,
Mohit Sharma,
Oliver Kroemer
Abstract:
Leveraging sensing modalities across diverse spatial and temporal resolutions can improve performance of robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
Submitted 25 January, 2024;
originally announced January 2024.
-
Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model
Authors:
Saurabh Saxena,
Junhwa Hur,
Charles Herrmann,
Deqing Sun,
David J. Fleet
Abstract:
While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity due to unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field-of-view (FOV) to handle scale ambiguity and synthetically augmenting FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common, and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth) achieves a 25\% reduction in relative error (REL) on zero-shot indoor and 33\% reduction on zero-shot outdoor datasets over the current SOTA using only a small number of denoising steps. For an overview see https://diffusion-vision.github.io/dmd
Submitted 20 December, 2023;
originally announced December 2023.
-
NeRFiller: Completing Scenes via Generative 3D Inpainting
Authors:
Ethan Weber,
Aleksander Hołyński,
Varun Jampani,
Saurabh Saxena,
Noah Snavely,
Abhishek Kar,
Angjoo Kanazawa
Abstract:
We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g., contact regions, such as the bottom of objects, or hard-to-reach areas). We approach this challenging 3D inpainting problem by leveraging a 2D inpainting diffusion model. We identify a surprising behavior of these models, where they generate more 3D consistent inpaints when images form a 2$\times$2 grid, and show how to generalize this behavior to more than four images. We then present an iterative framework to distill these inpainted regions into a single consistent 3D scene. In contrast to related works, we focus on completing scenes rather than deleting foreground objects, and our approach does not require tight 2D object masks or text. We compare our approach to relevant baselines adapted to our setting on a variety of scenes, where NeRFiller creates the most 3D consistent and plausible scene completions. Our project page is at https://ethanweber.me/nerfiller.
Submitted 7 December, 2023;
originally announced December 2023.
-
Minimizing Factual Inconsistency and Hallucination in Large Language Models
Authors:
Muneeswaran I,
Shreya Saxena,
Siva Prasad,
M V Sai Prakash,
Advaith Shankar,
Varun V,
Vishal Vaddina,
Saisubramaniam Gopalakrishnan
Abstract:
Large Language Models (LLMs) are widely used in critical fields such as healthcare, education, and finance due to their remarkable proficiency in various language-related tasks. However, LLMs are prone to generating factually incorrect responses or "hallucinations," which can lead to a loss of credibility and trust among users. To address this issue, we propose a multi-stage framework that generates the rationale first, verifies and refines incorrect ones, and uses them as supporting references to generate the answer. The generated rationale enhances the transparency of the answer and our framework provides insights into how the model arrived at this answer, by using this rationale and the references to the context. In this paper, we demonstrate its effectiveness in improving the quality of responses to drug-related inquiries in the life sciences industry. Our framework improves traditional Retrieval Augmented Generation (RAG) by enabling OpenAI GPT-3.5-turbo to be 14-25% more faithful and 16-22% more accurate on two datasets. Furthermore, fine-tuning samples based on our framework improves the accuracy of smaller open-access LLMs by 33-42% and competes with RAG on commercial models.
Submitted 23 November, 2023;
originally announced November 2023.
-
The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation
Authors:
Saurabh Saxena,
Charles Herrmann,
Junhwa Hur,
Abhishek Kar,
Mohammad Norouzi,
Deqing Sun,
David J. Fleet
Abstract:
Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, and technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26\% on the KITTI optical flow benchmark, about 25\% better than the best published method. For an overview see https://diffusion-vision.github.io.
Submitted 5 December, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Efficient Neural Network based Classification and Outlier Detection for Image Moderation using Compressed Sensing and Group Testing
Authors:
Sabyasachi Ghosh,
Sanyam Saxena,
Ajit Rajwade
Abstract:
Popular social media platforms employ neural network based image moderation engines to classify images uploaded on them as having potentially objectionable content. Such moderation engines must answer a large number of queries with heavy computational cost, even though the actual number of images with objectionable content is usually a tiny fraction. Inspired by recent work on Neural Group Testing, we propose an approach which exploits this fact to reduce the overall computational cost of such engines using the technique of Compressed Sensing (CS). We present the quantitative matrix-pooled neural network (QMPNN), which takes as input $n$ images, and a $m \times n$ binary pooling matrix with $m < n$, whose rows indicate $m$ pools of images i.e. selections of $r$ images out of $n$. The QMPNN efficiently outputs the product of this matrix with the unknown sparse binary vector indicating whether each image is objectionable or not, i.e. it outputs the number of objectionable images in each pool. For suitable matrices, this is decoded using CS decoding algorithms to predict which images were objectionable. The computational cost of running the QMPNN and the CS algorithms is significantly lower than the cost of using a neural network with the same number of parameters separately on each image to classify the images, which we demonstrate via extensive experiments. Our technique is inherently resilient to moderate levels of errors in the prediction from the QMPNN. Furthermore, we present pooled deep outlier detection, which brings CS and group testing techniques to deep outlier detection, to provide for the case when the objectionable images do not belong to a set of pre-defined classes. This technique enables efficient automated moderation of off-topic images shared on topical forums dedicated to sharing images of a certain single class, many of which are currently human-moderated.
Submitted 12 May, 2023;
originally announced May 2023.
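The quantity that the abstract above says the QMPNN outputs, namely the product of the binary pooling matrix with the binary indicator vector, can be sketched as follows. This is only an illustration: in the paper a neural network predicts these counts from the images, whereas this sketch computes them directly from known labels. The function name `pool_counts` and the example matrix are hypothetical.

```python
from typing import List


def pool_counts(pooling: List[List[int]], labels: List[int]) -> List[int]:
    """Multiply an m x n binary pooling matrix by a length-n binary label vector.

    Entry i of the result is the number of objectionable images in pool i.
    """
    return [sum(a * x for a, x in zip(row, labels)) for row in pooling]


# Hypothetical example with m = 3 pools over n = 6 images (m < n),
# each pool selecting r = 3 images.
pooling = [
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
]
# Sparse ground-truth vector: only the image at index 2 is objectionable.
labels = [0, 0, 1, 0, 0, 0]
counts = pool_counts(pooling, labels)
```

A compressed-sensing decoder would then try to recover `labels` from `counts` and `pooling`; that decoding step is not shown here.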
-
GNN-Assisted Phase Space Integration with Application to Atomistics
Authors:
Shashank Saxena,
Jan-Hendrik Bastek,
Miguel Spinola,
Prateek Gupta,
Dennis M. Kochmann
Abstract:
Overcoming the time scale limitations of atomistics can be achieved by switching from the state-space representation of Molecular Dynamics (MD) to a statistical-mechanics-based representation in phase space, where approximations such as maximum-entropy or Gaussian phase packets (GPP) evolve the atomistic ensemble in a time-coarsened fashion. In practice, this requires the computation of expensive high-dimensional integrals over all of phase space of an atomistic ensemble. This, in turn, is commonly accomplished efficiently by low-order numerical quadrature. We show that numerical quadrature in this context, unfortunately, comes with a set of inherent problems, which corrupt the accuracy of simulations -- especially when dealing with crystal lattices with imperfections. As a remedy, we demonstrate that Graph Neural Networks, trained on Monte-Carlo data, can serve as a replacement for commonly used numerical quadrature rules, overcoming their deficiencies and significantly improving the accuracy. This is showcased by three benchmarks: the thermal expansion of copper, the martensitic phase transition of iron, and the energy of grain boundaries. We illustrate the benefits of the proposed technique over classically used third- and fifth-order Gaussian quadrature, we highlight the impact on time-coarsened atomistic predictions, and we discuss the computational efficiency. The latter is of general importance when performing frequent evaluation of phase space or other high-dimensional integrals, which is why the proposed framework promises applications beyond the scope of atomistics.
Submitted 20 March, 2023;
originally announced March 2023.
-
Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
Authors:
Vithursan Thangarasa,
Shreyas Saxena,
Abhay Gupta,
Sean Lie
Abstract:
Recent research has focused on weight sparsity in deep neural network training to reduce FLOPs, aiming for improved efficiency (test accuracy w.r.t. training FLOPs). However, sparse weight training often compromises accuracy, requiring extended training schedules to attain the accuracy of dense models. In contrast, our approach, Sparse Iso-FLOP Transformations (Sparse-IFT), uses sparsity to improve accuracy while maintaining dense model FLOPs. Using a single hyperparameter (i.e., the sparsity level), Sparse-IFTs efficiently replace dense layers, expanding the search space for optimal sparse masks. In addition, dynamic sparse training (DST) with Sparse-IFT models effectively navigates this larger sparse mask-weight space, which is evidenced by a spectral analysis using Ramanujan graph properties. Our study reveals a robust correlation among mask topology, weights, and final performance. Notably, without adjusting any training hyperparameters, replacing dense layers with Sparse-IFT yields significant improvements, such as a +3.5% boost for ResNet-18 on ImageNet and +0.9% for GPT-3 Small on the Open LLM leaderboard. To the best of our knowledge, this is the first work to demonstrate the use of sparsity for improving the accuracy of dense models through a set of simple-to-use sparse transformations. Code is available at: https://github.com/CerebrasResearch/Sparse-IFT.
Submitted 17 July, 2024; v1 submitted 20 March, 2023;
originally announced March 2023.
-
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
Authors:
Vithursan Thangarasa,
Abhay Gupta,
William Marshall,
Tianda Li,
Kevin Leong,
Dennis DeCoste,
Sean Lie,
Shreyas Saxena
Abstract:
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this has also led to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.
△ Less
Submitted 29 July, 2023; v1 submitted 18 March, 2023;
originally announced March 2023.
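The two phases described in this abstract can be illustrated with a toy sketch: a fixed mask keeps a subset of weights at zero during pre-training, and the mask is dropped for fine-tuning so the zeroed weights can learn. The function names, the random choice of mask, and the stand-in gradients are all assumptions made for illustration; this is not the authors' training code.

```python
import random


def make_sparsity_mask(num_weights, sparsity, seed=0):
    """Return a 0/1 mask with round(sparsity * num_weights) entries set to 0."""
    rng = random.Random(seed)
    num_zero = round(sparsity * num_weights)
    pruned = set(rng.sample(range(num_weights), num_zero))
    return [0 if i in pruned else 1 for i in range(num_weights)]


def sgd_step(weights, grads, lr, mask=None):
    """One SGD update; with a mask, pruned weights receive no update."""
    if mask is None:
        mask = [1] * len(weights)  # dense phase: every weight is trainable
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Sparse pre-training: pruned weights start at zero and stay there.
mask = make_sparsity_mask(num_weights=8, sparsity=0.75)
weights = [0.5 * m for m in mask]
grads = [1.0] * 8  # stand-in gradients, not taken from a real model
weights = sgd_step(weights, grads, lr=0.1, mask=mask)
sparse_phase_nonzeros = sum(1 for w in weights if w != 0.0)

# Dense fine-tuning: drop the mask so the zeroed weights can learn.
weights = sgd_step(weights, grads, lr=0.1)
dense_phase_nonzeros = sum(1 for w in weights if w != 0.0)
```

With 75% sparsity on 8 weights, only 2 weights are updated in the sparse phase, while all 8 are updated once the mask is removed.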
-
Monocular Depth Estimation using Diffusion Models
Authors:
Saurabh Saxena,
Abhishek Kar,
Mohammad Norouzi,
David J. Fleet
Abstract:
We formulate monocular depth estimation using denoising diffusion models, inspired by their recent successes in high fidelity image generation. To that end, we introduce innovations to address problems arising due to noisy, incomplete depth maps in training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves SOTA performance on the indoor NYU dataset, and near SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance, combined with depth imputation, enables a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io
Submitted 28 February, 2023;
originally announced February 2023.
-
Storage in Computational Geometry
Authors:
Yijie Han,
Sanjeev Saxena
Abstract:
We show that $n$ real numbers can be stored in a constant number of real numbers such that each original real number can be fetched in $O(\log n)$ time.
Although our result has implications for many computational geometry problems, we show here that, combined with Han's $O(n\sqrt{\log n})$ time real number sorting algorithm [3, arXiv:1801.00776], it improves the complexity of Kirkpatrick's point location algorithm [8] to $O(n\sqrt{\log n})$ preprocessing time, a constant number of real numbers for storage and $O(\log n)$ point location time. Kirkpatrick's algorithm uses $O(n\log n)$ preprocessing time, $O(n)$ storage and $O(\log n)$ point location time. The complexity of Kirkpatrick's algorithm was the previous best result. Although Lipton and Tarjan's algorithm [10] predates Kirkpatrick's algorithm and has the same complexity, Kirkpatrick's algorithm is simpler and has a better structure.
This paper can be viewed as a companion to paper [3, arXiv:1801.00776].
Submitted 23 February, 2023;
originally announced February 2023.
-
Large-Scale Knowledge Synthesis and Complex Information Retrieval from Biomedical Documents
Authors:
Shreya Saxena,
Raj Sangani,
Siva Prasad,
Shubham Kumar,
Mihir Athale,
Rohan Awhad,
Vishal Vaddina
Abstract:
Recent advances in the healthcare industry have led to an abundance of unstructured data, making it challenging to perform tasks such as efficient and accurate information retrieval at scale. Our work offers an all-in-one scalable solution for extracting and exploring complex information from large-scale research documents, which would otherwise be tedious. First, we briefly explain our knowledge synthesis process to extract helpful information from unstructured text data of research documents. Then, on top of the knowledge extracted from the documents, we perform complex information retrieval using three major components: Paragraph Retrieval, Triplet Retrieval from Knowledge Graphs, and Complex Question Answering (QA). These components combine lexical and semantic-based methods to retrieve paragraphs and triplets and perform faceted refinement for filtering these search results. The complexity of biomedical queries and documents necessitates using a QA system capable of handling queries more complex than factoid queries, which we evaluate qualitatively on the COVID-19 Open Research Dataset (CORD-19) to demonstrate the effectiveness and value-add.
Submitted 14 February, 2023;
originally announced February 2023.
-
Dominance for Containment Problems
Authors:
Waseem Akram,
Sanjeev Saxena
Abstract:
In a containment problem, the goal is to preprocess a set of geometric objects so that, given a geometric query object, we can report all the objects containing the query object. We consider the containment problem where input objects are homothetic triangles and the query objects considered are line segments, circles, and trapezoids with bases parallel to either axis. We show that this problem can be solved using the 3-d query dominance problem. The solutions presented can also be extended for higher dimensions.
Submitted 20 December, 2022;
originally announced December 2022.
-
A Generalist Framework for Panoptic Segmentation of Images and Videos
Authors:
Ting Chen,
Lala Li,
Saurabh Saxena,
Geoffrey Hinton,
David J. Fleet
Abstract:
Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning of high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively to state-of-the-art specialist methods in similar settings.
Submitted 12 October, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Dynamic Inference on Graphs using Structured Transition Models
Authors:
Saumya Saxena,
Oliver Kroemer
Abstract:
Enabling robots to perform complex dynamic tasks such as picking up an object in one sweeping motion or pushing off a wall to quickly turn a corner is a challenging problem. The dynamic interactions implicit in these tasks are critical towards the successful execution of such tasks. Graph neural networks (GNNs) provide a principled way of learning the dynamics of interactive systems but can suffer from scaling issues as the number of interactions increases. Furthermore, the problem of using learned GNN-based models for optimal control is insufficiently explored. In this work, we present a method for efficiently learning the dynamics of interacting systems by simultaneously learning a dynamic graph structure and a stable and locally linear forward model of the system. The dynamic graph structure encodes evolving contact modes along a trajectory by making probabilistic predictions over the edges of the graph. Additionally, we introduce a temporal dependence in the learned graph structure which allows us to incorporate contact measurement updates during execution thus enabling more accurate forward predictions. The learned stable and locally linear dynamics enable the use of optimal control algorithms such as iLQR for long-horizon planning and control for complex interactive tasks. Through experiments in simulation and in the real world, we evaluate the performance of our method by using the learned interaction dynamics for control and demonstrate generalization to more objects and interactions not seen during training. We introduce a control scheme that takes advantage of contact measurement updates and hence is robust to prediction inaccuracies during execution.
Submitted 29 September, 2022;
originally announced September 2022.
-
On Brooks' Theorem
Authors:
Gopalan Sajith,
Sanjeev Saxena
Abstract:
In this note we give two proofs of Brooks' Theorem. The first is obtained by modifying an earlier proof and the second by combining two earlier proofs. We believe these proofs are easier to teach in Computer Science courses.
Submitted 3 August, 2022;
originally announced August 2022.
-
Simpler O(1) Query Algorithm for Level Ancestors
Authors:
Sanjeev Saxena
Abstract:
This note describes a very simple O(1) query time algorithm for finding level ancestors. This is basically a serial (re)-implementation of the parallel algorithm of Berkman and Vishkin (O.Berkman and U.Vishkin, Finding level-ancestors in trees, JCSS, 48, 214--230, 1994).
Although the basic algorithm has preprocessing time of O(n log n), by having additional levels or using table lookup, the preprocessing time can be reduced to almost linear or linear.
The table lookup algorithm can be built in O(1) parallel time with $n$ processors and can also be used to simplify the parallel algorithm of Berkman and Vishkin and make it optimal.
Submitted 29 July, 2024; v1 submitted 25 July, 2022;
originally announced July 2022.
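For readers unfamiliar with the level-ancestor problem, the sketch below answers "which node is k edges above v?" using binary lifting (jump pointers). This is a swapped-in baseline, not the algorithm of the note: it shares the O(n log n) preprocessing bound mentioned in the abstract but takes O(log n) time per query rather than O(1). The function names and the parent-array input format are assumptions.

```python
def build_jump_table(parent):
    """parent[v] is the parent of v, or -1 for the root.

    Returns up, where up[j][v] is the 2**j-th ancestor of v,
    or -1 if that ancestor does not exist.
    """
    n = len(parent)
    levels = max(1, n.bit_length())
    up = [list(parent)]
    for j in range(1, levels):
        prev = up[j - 1]
        up.append([prev[prev[v]] if prev[v] != -1 else -1 for v in range(n)])
    return up


def level_ancestor(up, v, k):
    """Return the ancestor k edges above v, or -1 if v is not that deep."""
    j = 0
    while k > 0 and v != -1:
        if j >= len(up):
            return -1  # k exceeds any possible depth in this tree
        if k & 1:
            v = up[j][v]  # take the 2**j jump for this bit of k
        k >>= 1
        j += 1
    return v


# A path 0-1-2-3-4 with an extra leaf 5 attached to node 1.
parent = [-1, 0, 1, 2, 3, 1]
up = build_jump_table(parent)
answers = [level_ancestor(up, 4, 3), level_ancestor(up, 4, 4),
           level_ancestor(up, 4, 5), level_ancestor(up, 5, 2)]
```

Each query decomposes k into powers of two and follows one precomputed jump per set bit, which is where the logarithmic query time comes from.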
-
BioTABQA: Instruction Learning for Biomedical Table Question Answering
Authors:
Man Luo,
Sharad Saxena,
Swaroop Mishra,
Mihir Parmar,
Chitta Baral
Abstract:
Table Question Answering (TQA) is an important but under-explored task. Most existing QA datasets are in unstructured text format, and only a few of them use tables as the context. To the best of our knowledge, no TQA datasets exist in the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTABQA, using 22 templates and the context from a biomedical textbook on differential diagnosis. BioTABQA can be used not only to teach a model how to answer questions from tables but also to evaluate how a model generalizes to unseen questions, an important scenario for biomedical applications. To achieve the generalization evaluation, we divide the templates into 17 training and 5 cross-task evaluations. Then, we develop two baselines using single- and multi-task learning on BioTABQA. Furthermore, we explore instructional learning, a recent technique showing impressive generalizing performance. Experimental results show that our instruction-tuned model outperforms single- and multi-task baselines on average by ~23% and ~6% across various evaluation settings, and more importantly, the instruction-tuned model outperforms baselines by ~5% on cross-tasks.
Submitted 5 July, 2022;
originally announced July 2022.
-
A Unified Sequence Interface for Vision Tasks
Authors:
Ting Chen,
Saurabh Saxena,
Lala Li,
Tsung-Yi Lin,
David J. Fleet,
Geoffrey Hinton
Abstract:
While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
Submitted 15 October, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
A Model for Predicting Ignition Potential of Complex Fuel in Diurnally Variable Environment
Authors:
Saurabh Saxena,
Ritambhara Dubey,
Neda Yaghoobian
Abstract:
Fuel ignition potential is one of the primary drivers influencing the extent of damage in wildland and wildland-urban interface fires. Determining fire and ember exposure of fuels that vary spatially and temporally will help to recognize necessary defensive actions and reduce damages. In this paper, the development of a new computational model, Temperature And Moisture Evolution predictor for complex Fuel in Open Environment (TAMEFOE), is presented. TAMEFOE predicts the diurnal temperature and moisture content evolution and vulnerability to flame ignition of objects/fuels with complex shapes or settings and materials under variable environmental conditions. The model is applicable to complex fuel scenarios (e.g., interface or intermix communities) composed of natural and manmade random-shaped objects in open atmosphere under the influence of local weather and diurnal solar radiation. The vulnerability of fuel to ember or fire ignition is determined by predicting the transient temperature and dryness of fuel in connection with the surrounding, local environment, and flame heat if any exists. In this regard, a detailed surface energy balance analysis, coupled with a water budget analysis, is performed in high spatiotemporal resolution. The model performance was validated against several existing analytical and measured data. The discrete, high-resolution surface temperature and moisture content information obtained from the model can also provide unsteady boundary conditions for computational fluid dynamics simulations when coupled physics is desired.
Submitted 16 January, 2023; v1 submitted 8 May, 2022;
originally announced June 2022.
-
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Authors:
Chitwan Saharia,
William Chan,
Saurabh Saxena,
Lala Li,
Jay Whang,
Emily Denton,
Seyed Kamyar Seyed Ghasemipour,
Burcu Karagol Ayan,
S. Sara Mahdavi,
Rapha Gontijo Lopes,
Tim Salimans,
Jonathan Ho,
David J Fleet,
Mohammad Norouzi
Abstract:
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.
Submitted 23 May, 2022;
originally announced May 2022.
-
Comparison and Analysis of Image-to-Image Generative Adversarial Networks: A Survey
Authors:
Sagar Saxena,
Mohammad Nayeem Teli
Abstract:
Generative Adversarial Networks (GANs) have recently introduced effective methods of performing Image-to-Image translations. These models can be applied and generalized to a variety of domains in Image-to-Image translation without changing any parameters. In this paper, we survey and analyze eight Image-to-Image Generative Adversarial Networks: Pix2Pix, CycleGAN, CoGAN, StarGAN, MUNIT, StarGAN2, DA-GAN, and Self Attention GAN. Each of these models presented state-of-the-art results and introduced new techniques to build Image-to-Image GANs. In addition to a survey of the models, we also survey the 18 datasets they were trained on and the 9 metrics they were evaluated on. Finally, we present results of a controlled experiment for 6 of these models on a common set of metrics and datasets. The results were mixed and showed that, on certain datasets, tasks, and metrics, some models outperformed others. The last section of this paper discusses those results and establishes areas of future research. As researchers continue to innovate new Image-to-Image GANs, it is important to gain a good understanding of the existing methods, datasets, and metrics. This paper provides a comprehensive overview and discussion to help build this foundation.
Submitted 26 August, 2022; v1 submitted 23 December, 2021;
originally announced December 2021.
-
Exploring and Mitigating Gender Bias in Recommender Systems with Explicit Feedback
Authors:
Shrikant Saxena,
Shweta Jain
Abstract:
Recommender systems are indispensable because they influence our day-to-day behavior and decisions by giving us personalized suggestions. Services like Kindle, Youtube, and Netflix depend heavily on the performance of their recommender systems to ensure that their users have a good experience and to increase revenues. Despite their popularity, it has been shown that recommender systems reproduce and amplify the bias present in the real world. The resulting feedback creates a self-perpetuating loop that deteriorates the user experience and results in homogenizing recommendations over time. Further, biased recommendations can also reinforce stereotypes based on gender or ethnicity, thus reinforcing the filter bubbles that we live in. In this paper, we address the problem of gender bias in recommender systems with explicit feedback. We propose a model to quantify the gender bias present in book rating datasets and in the recommendations produced by the recommender systems. Our main contribution is to provide a principled approach to mitigate the bias being produced in the recommendations. We theoretically show that the proposed approach provides unbiased recommendations despite biased data. Through empirical evaluation on publicly available book rating datasets, we further show that the proposed model can significantly reduce bias without significant impact on accuracy. Our method is model agnostic and can be applied to any recommender system. To demonstrate the performance of our model, we present the results on four recommender algorithms, two from the K-nearest neighbors family, UserKNN and ItemKNN, and the other two from the matrix factorization family, Alternating least square and Singular value decomposition.
Submitted 5 December, 2021;
originally announced December 2021.
-
Point Enclosure Problem for Homothetic Polygons
Authors:
Waseem Akram,
Sanjeev Saxena
Abstract:
In this paper, we investigate the homothetic point enclosure problem: given a set $S$ of $n$ triangles with sides parallel to three fixed directions, find a data structure for $S$ that can efficiently report all the triangles of $S$ that contain a query point. The problem is the "inverse" of the homothetic range search problem. We present an $O(n\log n)$ space solution that supports the queries in $O(\log n + k)$ time, where $k$ is the output size. The preprocessing time is $O(n\log n)$. The same results also hold for homothetic polygons.
Submitted 3 December, 2021;
originally announced December 2021.
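To make the query in this abstract concrete, the sketch below reports every triangle containing a query point by brute force, testing each triangle in turn. This is only a linear-time-per-query baseline that states the problem in code; it is not the paper's data structure, which answers queries in $O(\log n + k)$ time. The function names and the tuple representation of triangles are assumptions.

```python
def cross(o, a, b):
    """Signed area term: positive if o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def contains(triangle, q):
    """True if q lies inside the triangle or on its boundary."""
    a, b, c = triangle
    d1, d2, d3 = cross(a, b, q), cross(b, c, q), cross(c, a, q)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def report_enclosing(triangles, q):
    """Indices of all triangles containing q, by checking every triangle."""
    return [i for i, t in enumerate(triangles) if contains(t, q)]


# Three homothetic right triangles (same shape, scaled and translated).
triangles = [
    ((0, 0), (4, 0), (0, 4)),
    ((1, 1), (3, 1), (1, 3)),
    ((5, 5), (7, 5), (5, 7)),
]
hits_inner = report_enclosing(triangles, (1.5, 1.5))
hits_outer = report_enclosing(triangles, (3, 0.5))
hits_none = report_enclosing(triangles, (10, 10))
```

The point (1.5, 1.5) lies in the first two triangles, (3, 0.5) only in the first, and (10, 10) in none.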
-
Pix2seq: A Language Modeling Framework for Object Detection
Authors:
Ting Chen,
Saurabh Saxena,
Lala Li,
David J. Fleet,
Geoffrey Hinton
Abstract:
We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
Submitted 27 March, 2022; v1 submitted 22 September, 2021;
originally announced September 2021.
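The idea of expressing a bounding box and class label as a sequence of discrete tokens can be sketched as follows. The specific scheme here (uniform quantization of pixel coordinates into `num_bins` bins, coordinate order `ymin, xmin, ymax, xmax`, and class tokens offset past the coordinate vocabulary) is an assumption chosen for illustration and should not be read as the exact vocabulary used by the authors.

```python
def box_to_tokens(box, label, image_size, num_bins=1000):
    """Quantize (ymin, xmin, ymax, xmax) in pixels into bin indices and
    append a class token offset past the coordinate vocabulary."""
    height, width = image_size
    scales = (height, width, height, width)
    coords = [min(num_bins - 1, int(v * num_bins / s))
              for v, s in zip(box, scales)]
    return coords + [num_bins + label]


def tokens_to_box(tokens, image_size, num_bins=1000):
    """Map tokens back to bin-center pixel coordinates and a class label."""
    height, width = image_size
    scales = (height, width, height, width)
    box = [(t + 0.5) / num_bins * s for t, s in zip(tokens[:4], scales)]
    return box, tokens[4] - num_bins


image_size = (480, 640)  # (height, width), hypothetical image
tokens = box_to_tokens((48, 64, 240, 320), label=17, image_size=image_size)
decoded_box, decoded_label = tokens_to_box(tokens, image_size)
```

Here `tokens` is `[100, 100, 500, 500, 1017]`, and decoding recovers each coordinate to within one bin width, so a sequence model only ever has to emit integers from a fixed vocabulary.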
-
Search-Based Task Planning with Learned Skill Effect Models for Lifelong Robotic Manipulation
Authors:
Jacky Liang,
Mohit Sharma,
Alex LaGrassa,
Shivam Vats,
Saumya Saxena,
Oliver Kroemer
Abstract:
Robots deployed in many real-world settings need to be able to acquire new skills and solve new tasks over time. Prior works on planning with skills often make assumptions on the structure of skills and tasks, such as subgoal skills, shared skill implementations, or task-specific plan skeletons, which limit adaptation to new skills and tasks. By contrast, we propose doing task planning by jointly searching in the space of parameterized skills using high-level skill effect models learned in simulation. We use an iterative training procedure to efficiently generate relevant data to train such models. Our approach allows flexible skill parameterizations and task specifications to facilitate lifelong learning in general-purpose domains. Experiments demonstrate the ability of our planner to integrate new skills in a lifelong manner, finding new task strategies with lower costs in both train and test tasks. We additionally show that our method can transfer to the real world without further fine-tuning.
Submitted 13 April, 2022; v1 submitted 17 September, 2021;
originally announced September 2021.
-
XCI-Sketch: Extraction of Color Information from Images for Generation of Colored Outlines and Sketches
Authors:
V Manushree,
Sameer Saxena,
Parna Chowdhury,
Manisimha Varma,
Harsh Rathod,
Ankita Ghosh,
Sahil Khose
Abstract:
Sketches are a medium to convey a visual scene from an individual's creative perspective. The addition of color substantially enhances the overall expressivity of a sketch. This paper proposes two methods to mimic human-drawn colored sketches by utilizing the Contour Drawing Dataset. Our first approach renders colored outline sketches by applying image processing techniques aided by k-means color clustering. The second method uses a generative adversarial network to develop a model that can generate colored sketches from previously unobserved images. We assess the results obtained through quantitative and qualitative evaluations.
Submitted 7 January, 2022; v1 submitted 25 August, 2021;
originally announced August 2021.
-
Instance-Level Task Parameters: A Robust Multi-task Weighting Framework
Authors:
Pavan Kumar Anasosalu Vasu,
Shreyas Saxena,
Oncel Tuzel
Abstract:
Recent works have shown that deep neural networks benefit from multi-task learning by learning a shared representation across several related tasks. However, the performance of such systems depends on the relative weighting between the various losses involved during training. Prior works on loss weighting schemes assume that instances are equally easy or hard for all tasks. In order to break this assumption, we let the training process dictate the optimal weighting of tasks for every instance in the dataset. More specifically, we equip every instance in the dataset with a set of learnable parameters (instance-level task parameters) where the cardinality is equal to the number of tasks learned by the model. These parameters model the weighting of each task for an instance. They are updated by gradient descent and do not require hand-crafted rules. We conduct extensive experiments on SURREAL and CityScapes datasets, for human shape and pose estimation, depth estimation and semantic segmentation tasks. In these tasks, our approach outperforms recent dynamic loss weighting approaches, e.g. reducing surface estimation errors by 8.97% on SURREAL. When applied to datasets where one or more tasks can have noisy annotations, the proposed method learns to prioritize learning from clean labels for a given task, e.g. reducing surface estimation errors by up to 60%. We also show that we can reliably detect corrupt labels for a given task as a by-product from learned instance-level task parameters.
Submitted 10 June, 2021;
originally announced June 2021.
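A toy sketch can show what "one learnable parameter per instance and task, updated by gradient descent" might look like. The weighting form below, `exp(-s) * loss + s` with a learnable `s`, is borrowed from uncertainty-style loss weighting and is an assumption; the abstract does not state the objective the authors use. The per-task losses are made-up constants, whereas in real training they would change as the model learns.

```python
import math


def weighted_loss(task_losses, log_weights):
    """Sum over tasks of exp(-s) * loss + s, for a single instance."""
    return sum(math.exp(-s) * l + s for l, s in zip(task_losses, log_weights))


def update_instance_params(task_losses, log_weights, lr=0.1):
    """Gradient step on one instance's parameters.

    The derivative of exp(-s) * loss + s with respect to s
    is 1 - exp(-s) * loss.
    """
    return [s - lr * (1.0 - math.exp(-s) * l)
            for l, s in zip(task_losses, log_weights)]


# One parameter per (instance, task); two instances and two tasks here.
params = {"clean": [0.0, 0.0], "noisy": [0.0, 0.0]}
losses = {"clean": [1.0, 1.0], "noisy": [1.0, 20.0]}  # made-up task losses
for _ in range(500):
    for key in params:
        params[key] = update_instance_params(losses[key], params[key])
task_weights = {k: [math.exp(-s) for s in v] for k, v in params.items()}
```

In this toy setting the instance whose second task has a persistently large loss ends up with a small weight (about 1/20) on that task, which mirrors the abstract's point that per-instance parameters can down-weight tasks with noisy labels.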
-
Training With Data Dependent Dynamic Learning Rates
Authors:
Shreyas Saxena,
Nidhi Vyas,
Dennis DeCoste
Abstract:
Recently, many first and second order variants of SGD have been proposed to facilitate training of Deep Neural Networks (DNNs). A common limitation of these works stems from the fact that they use the same learning rate across all instances present in the dataset. This setting is widely adopted under the assumption that loss functions for each instance are similar in nature, and hence, a common learning rate can be used. In this work, we relax this assumption and propose an optimization framework which accounts for differences in loss function characteristics across instances. More specifically, our optimizer learns a dynamic learning rate for each instance present in the dataset. Learning a dynamic learning rate for each instance allows our optimization framework to focus on different modes of training data during optimization. When applied to an image classification task, across different CNN architectures, learning dynamic learning rates leads to consistent gains over standard optimizers. When applied to a dataset containing corrupt instances, our framework reduces the learning rates on noisy instances, and improves over the state-of-the-art. Finally, we show that our optimization framework can be used for personalization of a machine learning model towards a known targeted data distribution.
Submitted 27 May, 2021;
originally announced May 2021.
-
Sorted Range Reporting and Range Minima Queries
Authors:
Waseem Akram,
Sanjeev Saxena
Abstract:
Given an array A[1: n] of n elements drawn from an ordered set, the sorted range selection problem is to build a data structure that can be used to answer the following type of queries efficiently: Given a pair of indices i, j $(1\le i\le j \le n)$, and a positive integer k, report the k smallest elements from the sub-array A[i: j] in order. Brodal et al. (Brodal, G.S., Fagerberg, R., Greve, M., and López-Ortiz, A., Online sorted range reporting. Algorithms and Computation (2009) pp. 173-182) introduced the problem and gave an optimal solution. After O(n log n) time for preprocessing, the query time is O(k). The space used is O(n).
In this paper, we propose the only other possible optimal trade-off for the problem. We present a linear space solution to the problem that takes O(k log k) time to answer a range selection query. The preprocessing time is O(n). Moreover, the proposed algorithm reports the output elements one by one in non-decreasing order. Our solution is simple and practical.
We also describe an extremely simple method for range minima queries (most of whose parts are known) which takes almost (but not exactly) linear time. We believe that this method may be, in practice, faster and easier to implement in most cases.
Submitted 19 September, 2023; v1 submitted 6 April, 2021;
originally announced April 2021.
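The query described in this abstract can be illustrated with a minimal Python sketch of one well-known approach: a range-minimum structure combined with a priority queue of sub-ranges. This is not the authors' algorithm. The sparse table used here needs O(n log n) space and preprocessing rather than the linear bounds stated in the abstract, so only the O(k log k) query pattern and the in-order reporting are illustrated. All function names are hypothetical, and indices are 0-based and inclusive, unlike the 1-based A[i: j] of the abstract.

```python
import heapq


def build_sparse_table(a):
    """table[p][i] holds the index of the minimum of a[i : i + 2**p]."""
    n = len(a)
    table = [list(range(n))]
    p = 1
    while (1 << p) <= n:
        prev = table[-1]
        half = 1 << (p - 1)
        row = []
        for i in range(n - (1 << p) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        p += 1
    return table


def argmin(a, table, i, j):
    """Index of the minimum of a[i..j] (inclusive), using two overlapping blocks."""
    p = (j - i + 1).bit_length() - 1
    left, right = table[p][i], table[p][j - (1 << p) + 1]
    return left if a[left] <= a[right] else right


def k_smallest_sorted(a, table, i, j, k):
    """Return up to k smallest elements of a[i..j] in non-decreasing order."""
    m = argmin(a, table, i, j)
    heap = [(a[m], m, i, j)]  # (minimum value, its index, range start, range end)
    out = []
    while heap and len(out) < k:
        value, m, lo, hi = heapq.heappop(heap)
        out.append(value)
        # Split the range around the reported minimum and queue both halves.
        if lo < m:
            left = argmin(a, table, lo, m - 1)
            heapq.heappush(heap, (a[left], left, lo, m - 1))
        if m < hi:
            right = argmin(a, table, m + 1, hi)
            heapq.heappush(heap, (a[right], right, m + 1, hi))
    return out
```

Each reported element adds at most two sub-ranges to the heap, so the heap holds O(k) entries and a query costs O(k log k) on top of constant-time range-minimum lookups.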
-
Learning Reactive and Predictive Differentiable Controllers for Switching Linear Dynamical Models
Authors:
Saumya Saxena,
Alex LaGrassa,
Oliver Kroemer
Abstract:
Humans leverage the dynamics of the environment and their own bodies to accomplish challenging tasks such as grasping an object while walking past it or pushing off a wall to turn a corner. Such tasks often involve switching dynamics as the robot makes and breaks contact. Learning these dynamics is a challenging problem and prone to model inaccuracies, especially near contact regions. In this work, we present a framework for learning composite dynamical behaviors from expert demonstrations. We learn a switching linear dynamical model with contacts encoded in switching conditions as a close approximation of our system dynamics. We then use discrete-time LQR as the differentiable policy class for data-efficient learning of control to develop a control strategy that operates over multiple dynamical modes and takes into account discontinuities due to contact. In addition to predicting interactions with the environment, our policy effectively reacts to inaccurate predictions such as unanticipated contacts. Through simulation and real world experiments, we demonstrate generalization of learned behaviors to different scenarios and robustness to model inaccuracies during execution.
Submitted 26 March, 2021;
originally announced March 2021.
-
Monolingual and Parallel Corpora for Kangri Low Resource Language
Authors:
Shweta Chauhan,
Shefali Saxena,
Philemon Daniel
Abstract:
In this paper, we present a dataset for Kangri (ISO 639-3: xnr), a low-resource, endangered Himachali language listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO). Compiling the Kangri corpus has been a challenging task due to the non-availability of digitized resources. The corpus contains 1,81,552 Monolingual and 27,362 Hindi-Kangri Parallel corpora. We share pre-trained Kangri word embeddings. We also report the Bilingual Evaluation Understudy (BLEU) score and Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) results for the corpus. The corpus is freely available for non-commercial use and research. To the best of our knowledge, this is the first corpus for a low-resource, endangered Himachali language. The resources are available at https://github.com/chauhanshweta/Kangri_corpus
Submitted 22 March, 2021;
originally announced March 2021.
-
Dynamic curriculum learning via data parameters for noise robust keyword spotting
Authors:
Takuya Higuchi,
Shreyas Saxena,
Mehrez Souden,
Tien Dung Tran,
Masood Delfarah,
Chandra Dhir
Abstract:
We propose dynamic curriculum learning via data parameters for noise robust keyword spotting. Data parameter learning has recently been introduced for image processing, where weight parameters, so-called data parameters, for target classes and instances are introduced and optimized along with model parameters. The data parameters scale logits and control importance over classes and instances during training, which enables automatic curriculum learning without additional annotations for training data. Similarly, in this paper, we propose using this curriculum learning approach for acoustic modeling, and train an acoustic model on clean and noisy utterances with the data parameters. The proposed approach automatically learns the difficulty of the classes and instances, e.g. due to low signal-to-noise ratio (SNR), in the gradient descent optimization and performs curriculum learning. This curriculum learning leads to overall improvement of the accuracy of the acoustic model. We evaluate the effectiveness of the proposed approach on a keyword spotting task. Experimental results show a 7.7% relative reduction in false reject ratio with the data parameters compared to a baseline model which is simply trained on the multiconditioned dataset.
Submitted 18 February, 2021;
originally announced February 2021.
-
Machine learning pipeline for battery state of health estimation
Authors:
Darius Roman,
Saurabh Saxena,
Valentin Robu,
Michael Pecht,
David Flynn
Abstract:
Lithium-ion batteries are ubiquitous in modern-day applications ranging from portable electronics to electric vehicles. Irrespective of the application, reliable real-time estimation of battery state of health (SOH) by on-board computers is crucial to the safe operation of the battery, ultimately safeguarding asset integrity. In this paper, we design and evaluate a machine learning pipeline for estimation of battery capacity fade - a metric of battery health - on 179 cells cycled under various conditions. The pipeline estimates battery SOH with an associated confidence interval by using two parametric and two non-parametric algorithms. Using segments of charge voltage and current curves, the pipeline engineers 30 features, performs automatic feature selection and calibrates the algorithms. When deployed on cells operated under the fast-charging protocol, the best model achieves a root mean squared percent error of 0.45%. This work provides insights into the design of scalable data-driven models for battery SOH estimation, emphasising the value of confidence bounds around the prediction. The pipeline methodology combines experimental data with machine learning modelling and can be generalized to other critical components that require real-time estimation of SOH.
Submitted 1 February, 2021;
originally announced February 2021.
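The error metric quoted in this abstract can be sketched in Python under the commonly used definition of root mean squared percent error, RMSPE = 100 * sqrt(mean(((y - y_hat) / y)^2)). The abstract does not spell out the formula, so this definition is an assumption, and the function name and sample values below are purely illustrative.

```python
import math


def rmspe(actual, predicted):
    """Root mean squared percent error, in percent (assumed standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative error of each prediction; every actual value must be non-zero.
    squared = [((y - y_hat) / y) ** 2 for y, y_hat in zip(actual, predicted)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))


# Illustrative capacities, not data from the paper.
print(rmspe([100.0, 200.0], [99.0, 202.0]))
```

Because each error is divided by the true value, the metric is scale-free, which makes it comparable across cells with different nominal capacities.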
-
Learning Soft Labels via Meta Learning
Authors:
Nidhi Vyas,
Shreyas Saxena,
Thomas Voice
Abstract:
One-hot labels do not represent soft decision boundaries among concepts, and hence, models trained on them are prone to overfitting. Using soft labels as targets provides regularization, but different soft labels might be optimal at different stages of optimization. Also, training with fixed labels in the presence of noisy annotations leads to worse generalization. To address these limitations, we propose a framework where we treat the labels as learnable parameters, and optimize them along with model parameters. The learned labels continuously adapt themselves to the model's state, thereby providing dynamic regularization. When applied to the task of supervised image-classification, our method leads to consistent gains across different datasets and architectures. For instance, dynamically learned labels improve ResNet18 by 2.1% on CIFAR100. When applied to a dataset containing noisy labels, the learned labels correct the annotation mistakes, and improve over the state-of-the-art by a significant margin. Finally, we show that learned labels capture semantic relationships between classes, and thereby improve teacher models for the downstream task of distillation.
Submitted 20 September, 2020;
originally announced September 2020.
-
TextDecepter: Hard Label Black Box Attack on Text Classifiers
Authors:
Sachin Saxena
Abstract:
Machine learning has been proven to be susceptible to carefully crafted samples, known as adversarial examples. The generation of these adversarial examples helps to make the models more robust and gives us an insight into the underlying decision-making of these models. Over the years, researchers have successfully attacked image classifiers in both white-box and black-box settings. However, these methods are not directly applicable to texts as text data is discrete. In recent years, research on crafting adversarial examples against textual applications has been on the rise. In this paper, we present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed, and an attacker can only query the model to get a final decision of the classifier, without confidence scores of the classes involved. Such an attack scenario applies to real-world black-box models being used for security-sensitive applications such as sentiment analysis and toxic content detection.
Submitted 27 December, 2020; v1 submitted 16 August, 2020;
originally announced August 2020.
-
On seat allocation problem with multiple merit lists
Authors:
Rahul Kumar Singh,
Sanjeev Saxena
Abstract:
In this note, we present a simpler algorithm for the joint seat allocation problem in case there are two or more merit lists. In the case of two lists (the current situation for Engineering seats in India), the running time of the algorithm is proportional to the sum of the running times of two separate (delinked) allocations. The algorithm is straightforward and natural and is not (at least directly) based on the deferred acceptance algorithm of Gale and Shapley. Each person can only move higher in his or her preference list. Thus, all steps of the algorithm can be made public. This will improve transparency and trust in the system.
Submitted 13 August, 2020;
originally announced August 2020.
-
Novel Perception Algorithmic Framework For Object Identification and Tracking In Autonomous Navigation
Authors:
Suryansh Saxena,
Isaac K Isukapati
Abstract:
This paper introduces a novel perception framework that has the ability to identify and track objects in an autonomous vehicle's field of view. The proposed algorithms don't require any training for achieving this goal. The framework makes use of the ego-vehicle's pose estimation and a KD-Tree-based segmentation algorithm to generate object clusters. In turn, using a VFH technique, the geometry of each identified object cluster is translated into a multi-modal PDF and a motion model is initiated with every new object cluster for the purpose of robust spatio-temporal tracking. The methodology further uses statistical properties of high-dimensional probability density functions and Bayesian motion model estimates to identify and track objects from frame to frame. The effectiveness of the methodology is tested on the KITTI dataset. The results show that the median tracking accuracy is around 91% with an end-to-end computational time of 153 milliseconds.
Submitted 8 June, 2020;
originally announced June 2020.
-
Learning Active Task-Oriented Exploration Policies for Bridging the Sim-to-Real Gap
Authors:
Jacky Liang,
Saumya Saxena,
Oliver Kroemer
Abstract:
Training robotic policies in simulation suffers from the sim-to-real gap, as simulated dynamics can be different from real-world dynamics. Past works tackled this problem through domain randomization and online system-identification. The former is sensitive to the manually-specified training distribution of dynamics parameters and can result in behaviors that are overly conservative. The latter requires learning policies that concurrently perform the task and generate useful trajectories for system identification. In this work, we propose and analyze a framework for learning exploration policies that explicitly perform task-oriented exploration actions to identify task-relevant system parameters. These parameters are then used by model-based trajectory optimization algorithms to perform the task in the real world. We instantiate the framework in simulation with the Linear Quadratic Regulator as well as in the real world with pouring and object dragging tasks. Experiments show that task-oriented exploration helps model-based policies adapt to systems with initially unknown parameters, and it leads to better task performance than task-agnostic exploration.
Submitted 5 November, 2020; v1 submitted 2 June, 2020;
originally announced June 2020.
-
Zone Theorem for Arrangements in three dimensions
Authors:
Sanjeev Saxena
Abstract:
In this note, a simple description of zone theorem in three dimensions is given.
Submitted 2 June, 2020;
originally announced June 2020.
-
Non-Autoregressive Machine Translation with Latent Alignments
Authors:
Chitwan Saharia,
William Chan,
Saurabh Saxena,
Mohammad Norouzi
Abstract:
This paper presents two strong methods, CTC and Imputer, for non-autoregressive machine translation that model latent alignments with dynamic programming. We revisit CTC for machine translation and demonstrate that a simple CTC model can achieve state-of-the-art for single-step non-autoregressive machine translation, contrary to what prior work indicates. In addition, we adapt the Imputer model for non-autoregressive machine translation and demonstrate that Imputer with just 4 generation steps can match the performance of an autoregressive Transformer baseline. Our latent alignment models are simpler than many existing non-autoregressive translation baselines; for example, we do not require target length prediction or re-scoring with an autoregressive model. On the competitive WMT'14 En$\rightarrow$De task, our CTC model achieves 25.7 BLEU with a single generation step, while Imputer achieves 27.5 BLEU with 2 generation steps, and 28.0 BLEU with 4 generation steps. This compares favourably to the autoregressive Transformer baseline at 27.8 BLEU.
Submitted 16 November, 2020; v1 submitted 15 April, 2020;
originally announced April 2020.
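For readers unfamiliar with the latent alignments mentioned in this abstract, the standard CTC collapsing rule can be sketched in a few lines of Python: an alignment is mapped to an output sequence by first merging consecutive repeated tokens and then dropping the blank symbol. This shows only that generic rule, not the paper's models or training procedure. The blank symbol "_" and the function name are hypothetical choices for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(alignment, blank=BLANK):
    """Map a CTC alignment to its output sequence."""
    # Step 1: merge runs of identical consecutive tokens.
    merged = [token for token, _ in groupby(alignment)]
    # Step 2: remove blanks, which separate genuine repeats in the output.
    return [token for token in merged if token != blank]


# Several alignments collapse to the same output, which is why
# CTC sums over them with dynamic programming during training.
print(ctc_collapse(list("aa_ab__b")))
print(ctc_collapse(list("a_a_b_b_")))
```

Both calls above yield the same four-token output, illustrating that the alignment is a latent variable: many alignments of a fixed length correspond to one target sequence.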
-
Source Printer Identification from Document Images Acquired using Smartphone
Authors:
Sharad Joshi,
Suraj Saxena,
Nitin Khanna
Abstract:
Vast volumes of printed documents continue to be used for various important as well as trivial applications. Such applications often rely on the information provided in the form of printed text documents whose integrity verification poses a challenge due to time constraints and lack of resources. Source printer identification provides essential information about the origin and integrity of a printed document in a fast and cost-effective manner. Even when fraudulent documents are identified, information about their origin can help stop future frauds. If a smartphone camera replaces the scanner in the document acquisition process, document forensics would be more economical, user-friendly, and even faster in many applications where remote and distributed analysis is beneficial. Building on existing methods, we propose to learn a single CNN model from the fusion of letter images and their printer-specific noise residuals. In the absence of any publicly available dataset, we created a new dataset consisting of 2250 document images of text documents printed by eighteen printers and acquired by a smartphone camera at five acquisition settings. The proposed method achieves 98.42% document classification accuracy using images of the letter 'e' under a 5x2 cross-validation approach. Further, when tested using about half a million letters of all types, it achieves 90.33% and 98.01% letter and document classification accuracies, respectively, thus highlighting the ability to learn a discriminative model without dependence on a single letter type. Also, classification accuracies are encouraging under various acquisition settings, including low illumination and change in angle between the document and camera planes.
Submitted 27 March, 2020;
originally announced March 2020.