subscribe to arXiv mailings

arXiv:2407.19773 [pdf, other]

Unmasking unlearnable models: a classification challenge for biomedical images without visible cues

Authors: Shivam Kumar, Samrat Chatterjee

Abstract: Predicting traits from images lacking visual cues is challenging, as algorithms are designed to capture visually correlated ground truth. This problem is critical in biomedical sciences, and their solution can improve the efficacy of non-invasive methods. For example, a recent challenge of predicting MGMT methylation status from MRI images is critical for treatment decisions of glioma patients. Us… ▽ More Predicting traits from images lacking visual cues is challenging, as algorithms are designed to capture visually correlated ground truth. This problem is critical in biomedical sciences, and their solution can improve the efficacy of non-invasive methods. For example, a recent challenge of predicting MGMT methylation status from MRI images is critical for treatment decisions of glioma patients. Using less robust models poses a significant risk in these critical scenarios and underscores the urgency of addressing this issue. Despite numerous efforts, contemporary models exhibit suboptimal performance, and underlying reasons for this limitation remain elusive. In this study, we demystify the complexity of MGMT status prediction through a comprehensive exploration by performing benchmarks of existing models adjoining transfer learning. Their architectures were further dissected by observing gradient flow across layers. Additionally, a feature selection strategy was applied to improve model interpretability. Our finding highlighted that current models are unlearnable and may require new architectures to explore applications in the real world. We believe our study will draw immediate attention and catalyse advancements in predictive modelling with non-visible cues. △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.17508 [pdf, other]

Artificial Intelligence Based Navigation in Quasi Structured Environment

Authors: Hariram Sampath Kumar, Archana Singh, Manish Kumar Ojha

Abstract: The proper planning of different types of public transportation such as metro, highway, waterways, and so on, can increase the efficiency, reduce the congestion and improve the safety of the country. There are certain challenges associated with route planning, such as high cost of implementation, need for adequate resource & infrastructure and resistance to change. The goal of this research is to… ▽ More The proper planning of different types of public transportation such as metro, highway, waterways, and so on, can increase the efficiency, reduce the congestion and improve the safety of the country. There are certain challenges associated with route planning, such as high cost of implementation, need for adequate resource & infrastructure and resistance to change. The goal of this research is to examine the working, applications, complexity factors, advantages & disadvantages of Floyd- Warshall, Bellman-Ford, Johnson, Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), & Grey Wolf Optimizer (GWO), to find the best choice for the above application. In this paper, comparative analysis of above-mentioned algorithms is presented. The Floyd-Warshall method and ACO algorithm are chosen based on the comparisons. Also, a combination of modified Floyd-Warshall with ACO algorithm is proposed. The proposed algorithm showed better results with less time complexity, when applied on randomly structured points within a boundary called quasi-structured points. In addition, this paper also discusses the future works of integrating Floyd-Warshall with ACO to develop a real-time model for overcoming above mentioned-challenges during transportation route planning. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: 10 pages, 8 figures

arXiv:2407.16905 [pdf, other]

Assessing the role of clinical summarization and patient chart review within communications, medical management, and diagnostics

Authors: Chanseo Lee, Kimon-Aristotelis Vogt, Sonu Kumar

Abstract: Effective summarization of unstructured patient data in electronic health records (EHRs) is crucial for accurate diagnosis and efficient patient care, yet clinicians often struggle with information overload and time constraints. This review dives into recent literature and case studies on both the significant impacts and outstanding issues of patient chart review on communications, diagnostics, an… ▽ More Effective summarization of unstructured patient data in electronic health records (EHRs) is crucial for accurate diagnosis and efficient patient care, yet clinicians often struggle with information overload and time constraints. This review dives into recent literature and case studies on both the significant impacts and outstanding issues of patient chart review on communications, diagnostics, and management. It also discusses recent efforts to integrate artificial intelligence (AI) into clinical summarization tasks, and its transformative impact on the clinician's potential, including but not limited to reductions of administrative burden and improved patient-centered care. △ Less

Submitted 24 June, 2024; originally announced July 2024.

arXiv:2407.15716 [pdf, other]

CrashEventLLM: Predicting System Crashes with Large Language Models

Authors: Priyanka Mudgal, Bijan Arbab, Swaathi Sampath Kumar

Abstract: As the dependence on computer systems expands across various domains, focusing on personal, industrial, and large-scale applications, there arises a compelling need to enhance their reliability to sustain business operations seamlessly and ensure optimal user satisfaction. System logs generated by these devices serve as valuable repositories of historical trends and past failures. The use of machi… ▽ More As the dependence on computer systems expands across various domains, focusing on personal, industrial, and large-scale applications, there arises a compelling need to enhance their reliability to sustain business operations seamlessly and ensure optimal user satisfaction. System logs generated by these devices serve as valuable repositories of historical trends and past failures. The use of machine learning techniques for failure prediction has become commonplace, enabling the extraction of insights from past data to anticipate future behavior patterns. Recently, large language models have demonstrated remarkable capabilities in tasks including summarization, reasoning, and event prediction. Therefore, in this paper, we endeavor to investigate the potential of large language models in predicting system failures, leveraging insights learned from past failure behavior to inform reasoning and decision-making processes effectively. Our approach involves leveraging data from the Intel Computing Improvement Program (ICIP) system crash logs to identify significant events and develop CrashEventLLM. This model, built upon a large language model framework, serves as our foundation for crash event prediction. Specifically, our model utilizes historical data to forecast future crash events, informed by expert annotations. Additionally, it goes beyond mere prediction, offering insights into potential causes for each crash event. This work provides the preliminary insights into prompt-based large language models for the log-based event prediction task. △ Less

Submitted 28 July, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

Comments: Accepted in ICITCOM'24. Copyrights will be with IEEE

arXiv:2407.15324 [pdf, other]

Cooperative Salvo Guidance over Leader-Follower Network with Free-Will Arbitrary Time Convergence

Authors: Rajib Shekhar Pal, Shashi Ranjan Kumar, Dwaipayan Mukherjee

Abstract: A cooperative salvo strategy is proposed in this paper which achieves consensus among the interceptors within a pre-defined arbitrary settling time. Considering non-linear engagement kinematics and a system lag to capture the effect of interceptor autopilot as present in realistic interception scenarios, the guidance schemes use the time-to-go estimates of the interceptors in order to achieve simu… ▽ More A cooperative salvo strategy is proposed in this paper which achieves consensus among the interceptors within a pre-defined arbitrary settling time. Considering non-linear engagement kinematics and a system lag to capture the effect of interceptor autopilot as present in realistic interception scenarios, the guidance schemes use the time-to-go estimates of the interceptors in order to achieve simultaneous interception of a stationary target at a pre-determined impact time. The guidance scheme ensures that consensus among the time-to-go estimates of the interceptors is achieved within a settling time whose upper bound can be pre-specified arbitrarily independent of the initial conditions or design parameters. The efficacy of the proposed guidance strategy is demonstrated using numerical simulations with varied conditions of initial position, velocities and heading angle errors of the interceptors as well as different desired impact times. △ Less

Submitted 21 July, 2024; originally announced July 2024.

arXiv:2407.15227 [pdf, other]

A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech

Authors: Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar

Abstract: Violence-provoking speech -- speech that implicitly or explicitly promotes violence against the members of the targeted community, contributed to a massive surge in anti-Asian crimes during the pandemic. While previous works have characterized and built tools for detecting other forms of harmful speech, like fear speech and hate speech, our work takes a community-centric approach to studying anti-… ▽ More Violence-provoking speech -- speech that implicitly or explicitly promotes violence against the members of the targeted community, contributed to a massive surge in anti-Asian crimes during the pandemic. While previous works have characterized and built tools for detecting other forms of harmful speech, like fear speech and hate speech, our work takes a community-centric approach to studying anti-Asian violence-provoking speech. Using data from ~420k Twitter posts spanning a 3-year duration (January 1, 2020 to February 1, 2023), we develop a codebook to characterize anti-Asian violence-provoking speech and collect a community-crowdsourced dataset to facilitate its large-scale detection using state-of-the-art classifiers. We contrast the capabilities of natural language processing classifiers, ranging from BERT-based to LLM-based classifiers, in detecting violence-provoking speech with their capabilities to detect anti-Asian hateful speech. In contrast to prior work that has demonstrated the effectiveness of such classifiers in detecting hateful speech ($F_1 = 0.89$), our work shows that accurate and reliable detection of violence-provoking speech is a challenging task ($F_1 = 0.69$). We discuss the implications of our findings, particularly the need for proactive interventions to support Asian communities during public health crises. The resources related to the study are available at https://claws-lab.github.io/violence-provoking-speech/. △ Less

Submitted 21 July, 2024; originally announced July 2024.

Comments: Accepted to ACL 2024 Main

arXiv:2407.15022 [pdf]

Encouraging Responsible Use of Generative AI in Education: A Reward-Based Learning Approach

Authors: Aditi Singh, Abul Ehtesham, Saket Kumar, Gaurav Kumar Gupta, Tala Talaei Khoei

Abstract: This research introduces an innovative mathematical learning approach that integrates generative AI to cultivate a structured learning rather than quick solution. Our method combines chatbot capabilities and generative AI to offer interactive problem-solving exercises, enhancing learning through a stepby-step approach for varied problems, advocating for the responsible use of AI in education. Our… ▽ More This research introduces an innovative mathematical learning approach that integrates generative AI to cultivate a structured learning rather than quick solution. Our method combines chatbot capabilities and generative AI to offer interactive problem-solving exercises, enhancing learning through a stepby-step approach for varied problems, advocating for the responsible use of AI in education. Our approach emphasizes that immediate answers from ChatGPT can impede real learning. We introduce a reward-based system that requires students to solve mathematical problems effectively to receive the final answer. This encourages a progressive learning path from basic to complex problems, rewarding mastery with final solutions. The goal is to transition students from seeking quick fixes to engaging actively in a comprehensive learning experience. △ Less

Submitted 26 June, 2024; originally announced July 2024.

Comments: 9 pages, 4 figures

arXiv:2407.13833 [pdf, other]

Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Authors: Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang, Hiteshi Sharma, Blake Bullwinkel, Martin Pouliot, Amanda Minnich, Shiven Chawla, Solianna Herrera, Shahed Warreth, Maggie Engler, Gary Lopez, Nina Chikanov, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj, Roman Lutz, Richard Lundeen, Tori Westerhoff , et al. (5 additional authors not shown)

Abstract: Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3… ▽ More Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.13318 [pdf, other]

A new approach to delegate signing rights to proxy signers using isogeny-based cryptography

Authors: Kunal Dey, Somnath Kumar, Vikas Srivastava, Sumit Kumar Debnath

Abstract: E-governance is a two-way protocol through which one can use government services, share data and request information. It refers to the use of communication and information technologies to provide government services to public in an efficient and fast manner. In addition, any document submitted to the e-Government system must be authenticated by a government officer using a digital signature scheme… ▽ More E-governance is a two-way protocol through which one can use government services, share data and request information. It refers to the use of communication and information technologies to provide government services to public in an efficient and fast manner. In addition, any document submitted to the e-Government system must be authenticated by a government officer using a digital signature scheme. In the context of digital signatures, the proxy signature is an important cryptographic primitive that allows the original signer to delegate signing authority to another signer (proxy signer). The proxy signature has a number of important applications in the e-government system. There are now a large amount of proxy signature schemes. The security of most of them relies on the following hard problems: the discrete logarithm problem and the factorization of integers problem. However, a large-scale quantum computer can solve them in polynomial time due to Shor's algorithm. As a consequence, there is a need for a quantum computer-resistant proxy signature to secure e-governance system from quantum adversaries. In this work, we propose the first post-quantum isogeny based proxy signature scheme CSI-PS (commutative supersingular isogeny proxy signature). Our construction is proven to be uf-cma secure under the hardness of the group action inverse problem (GAIP) based on isogeny. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.12185 [pdf, other]

Satisficing Exploration for Deep Reinforcement Learning

Authors: Dilip Arumugam, Saurabh Kumar, Ramki Gummadi, Benjamin Van Roy

Abstract: A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor and an agent may seldom find itself in a position to complete the requisi… ▽ More A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor and an agent may seldom find itself in a position to complete the requisite exploration for identifying an optimal policy. Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently-satisfying or satisficing solutions, obtained through lossy compression. Notably, such agents may employ fundamentally different exploratory decisions to learn satisficing behaviors more efficiently than optimal ones that are more data intensive. While supported by a rigorous corroborating theory, the underlying algorithm relies on model-based planning, drastically limiting the compatibility of these ideas with function approximation and high-dimensional observations. In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function allowing it to both bypass the need for model-based planning and to learn satisficing policies. We provide simple yet illustrative experiments that demonstrate how our algorithm enables deep reinforcement-learning agents to achieve satisficing behaviors. In keeping with previous work on this setting for multi-armed bandits, we additionally find that our algorithm is capable of synthesizing optimal behaviors, when feasible, more efficiently than its non-information-theoretic counterpart. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted to the Finding the Frame Workshop at RLC 2024

arXiv:2407.12043 [pdf, other]

The Art of Saying No: Contextual Noncompliance in Language Models

Authors: Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

Abstract: Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a… ▽ More Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.10837 [pdf, other]

Trajectory Tracking for Unmanned Aerial Vehicles in 3D Spaces under Motion Constraints

Authors: Saurabh Kumar, Shashi Ranjan Kumar, Abhinav Sinha

Abstract: This article presents a three-dimensional nonlinear trajectory tracking control strategy for unmanned aerial vehicles (UAVs) in the presence of spatial constraints. As opposed to many existing control strategies, which do not consider spatial constraints, the proposed strategy considers spatial constraints on each degree of freedom movement of the UAV. Such consideration makes the design appealing… ▽ More This article presents a three-dimensional nonlinear trajectory tracking control strategy for unmanned aerial vehicles (UAVs) in the presence of spatial constraints. As opposed to many existing control strategies, which do not consider spatial constraints, the proposed strategy considers spatial constraints on each degree of freedom movement of the UAV. Such consideration makes the design appealing for many practical applications, such as pipeline inspection, boundary tracking, etc. The proposed design accounts for the limited information about the inertia matrix, thereby affirming its inherent robustness against unmodeled dynamics and other imperfections. We rigorously show that the UAV will converge to its desired path by maintaining bounded position, orientation, and linear and angular speeds. Finally, we demonstrate the effectiveness of the proposed strategy through various numerical simulations. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.09466 [pdf, other]

TRAVERSE: Traffic-Responsive Autonomous Vehicle Experience & Rare-event Simulation for Enhanced safety

Authors: Sandeep Thalapanane, Sandip Sharan Senthil Kumar, Guru Nandhan Appiya Dilipkumar Peethambari, Sourang SriHari, Laura Zheng, Julio Poveda, Ming C. Lin

Abstract: Data for training learning-enabled self-driving cars in the physical world are typically collected in a safe, normal environment. Such data distribution often engenders a strong bias towards safe driving, making self-driving cars unprepared when encountering adversarial scenarios like unexpected accidents. Due to a dearth of such adverse data that is unrealistic for drivers to collect, autonomous… ▽ More Data for training learning-enabled self-driving cars in the physical world are typically collected in a safe, normal environment. Such data distribution often engenders a strong bias towards safe driving, making self-driving cars unprepared when encountering adversarial scenarios like unexpected accidents. Due to a dearth of such adverse data that is unrealistic for drivers to collect, autonomous vehicles can perform poorly when experiencing such rare events. This work addresses much-needed research by having participants drive a VR vehicle simulator going through simulated traffic with various types of accidental scenarios. It aims to understand human responses and behaviors in simulated accidents, contributing to our understanding of driving dynamics and safety. The simulation framework adopts a robust traffic simulation and is rendered using the Unity Game Engine. Furthermore, the simulation framework is built with portable, light-weight immersive driving simulator hardware, lowering the resource barrier for studies in autonomous driving research. Keywords: Rare Events, Traffic Simulation, Autonomous Driving, Virtual Reality, User Studies △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.08818 [pdf]

MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hoffman, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith

Abstract: In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET; multilingual adaptiv… ▽ More In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET; multilingual adaptive gradient-based tokenization to reduce over-segmentation via adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization alongside the next token prediction objective. However, this approach still results in over-segmentation for non-Latin script languages in multilingual settings. In contrast, MAGNET offers a customizable architecture where byte-level sequences are routed through language-script-specific predictors, each optimized for its respective language script. This modularity enforces equitable segmentation granularity across different language scripts compared to previous methods. Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08726 [pdf, other]

Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data

Authors: Cherie Ho, Jiaye Zou, Omar Alama, Sai Mitheran Jagadesh Kumar, Benjamin Chiang, Taneesh Gupta, Chen Wang, Nikhil Keetha, Katia Sycara, Sebastian Scherer

Abstract: Top-down Bird's Eye View (BEV) maps are a popular representation for ground robot navigation due to their richness and flexibility for downstream tasks. While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to small regions captured by current autonomous vehicle-based datasets. In this context, we show that a more sca… ▽ More Top-down Bird's Eye View (BEV) maps are a popular representation for ground robot navigation due to their richness and flexibility for downstream tasks. While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to small regions captured by current autonomous vehicle-based datasets. In this context, we show that a more scalable approach towards generalizable map prediction can be enabled by using two large-scale crowd-sourced mapping platforms, Mapillary for FPV images and OpenStreetMap for BEV semantic maps. We introduce Map It Anywhere (MIA), a data engine that enables seamless curation and modeling of labeled map prediction data from existing open-source map platforms. Using our MIA data engine, we display the ease of automatically collecting a dataset of 1.2 million pairs of FPV images & BEV maps encompassing diverse geographies, landscapes, environmental factors, camera models & capture scenarios. We further train a simple camera model-agnostic model on this data for BEV map prediction. Extensive evaluations using established benchmarks and our dataset show that the data curated by MIA enables effective pretraining for generalizable BEV map prediction, with zero-shot performance far exceeding baselines trained on existing datasets by 35%. Our analysis highlights the promise of using large-scale public maps for developing & testing generalizable BEV perception, paving the way for more robust autonomous navigation. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08655 [pdf, other]

SPOCKMIP: Segmentation of Vessels in MRAs with Enhanced Continuity using Maximum Intensity Projection as Loss

Authors: Chethan Radhakrishna, Karthikesh Varma Chintalapati, Sri Chandana Hudukula Ram Kumar, Raviteja Sutrave, Hendrik Mattern, Oliver Speck, Andreas Nürnberger, Soumick Chatterjee

Abstract: Identification of vessel structures of different sizes in biomedical images is crucial in the diagnosis of many neurodegenerative diseases. However, the sparsity of good-quality annotations of such images makes the task of vessel segmentation challenging. Deep learning offers an efficient way to segment vessels of different sizes by learning their high-level feature representations and the spatial… ▽ More Identification of vessel structures of different sizes in biomedical images is crucial in the diagnosis of many neurodegenerative diseases. However, the sparsity of good-quality annotations of such images makes the task of vessel segmentation challenging. Deep learning offers an efficient way to segment vessels of different sizes by learning their high-level feature representations and the spatial continuity of such features across dimensions. Semi-supervised patch-based approaches have been effective in identifying small vessels of one to two voxels in diameter. This study focuses on improving the segmentation quality by considering the spatial correlation of the features using the Maximum Intensity Projection~(MIP) as an additional loss criterion. Two methods are proposed with the incorporation of MIPs of label segmentation on the single~(z-axis) and multiple perceivable axes of the 3D volume. The proposed MIP-based methods produce segmentations with improved vessel continuity, which is evident in visual examinations of ROIs. Patch-based training is improved by introducing an additional loss term, MIP loss, to penalise the predicted discontinuity of vessels. A training set of 14 volumes is selected from the StudyForrest dataset comprising of 18 7-Tesla 3D Time-of-Flight~(ToF) Magnetic Resonance Angiography (MRA) images. The generalisation performance of the method is evaluated using the other unseen volumes in the dataset. It is observed that the proposed method with multi-axes MIP loss produces better quality segmentations with a median Dice of $80.245 \pm 0.129$. Also, the method with single-axis MIP loss produces segmentations with a median Dice of $79.749 \pm 0.109$. Furthermore, a visual comparison of the ROIs in the predicted segmentation reveals a significant improvement in the continuity of the vessels when MIP loss is incorporated into training. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.07786 [pdf, ps, other]

The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing

Authors: Alice Qian Zhang, Ryland Shaw, Jacy Reese Anthis, Ashlee Milton, Emily Tseng, Jina Suh, Lama Ahmad, Ram Shankar Siva Kumar, Julian Posada, Benjamin Shestakofsky, Sarah T. Roberts, Mary L. Gray

Abstract: Rapid progress in general-purpose AI has sparked significant interest in "red teaming," a practice of adversarial testing originating in military and cybersecurity applications. AI red teaming raises many questions about the human factor, such as how red teamers are selected, biases and blindspots in how tests are conducted, and harmful content's psychological effects on red teamers. A growing bod… ▽ More Rapid progress in general-purpose AI has sparked significant interest in "red teaming," a practice of adversarial testing originating in military and cybersecurity applications. AI red teaming raises many questions about the human factor, such as how red teamers are selected, biases and blindspots in how tests are conducted, and harmful content's psychological effects on red teamers. A growing body of HCI and CSCW literature examines related practices-including data labeling, content moderation, and algorithmic auditing. However, few, if any, have investigated red teaming itself. This workshop seeks to consider the conceptual and empirical challenges associated with this practice, often rendered opaque by non-disclosure agreements. Future studies may explore topics ranging from fairness to mental health and other areas of potential harm. We aim to facilitate a community of researchers and practitioners who can begin to meet these challenges with creativity, innovation, and thoughtful reflection. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Workshop proposal accepted to CSCW 2024

arXiv:2407.07128 [pdf, other]

Modularity aided consistent attributed graph clustering via coarsening

Authors: Samarth Bhatia, Yukti Makhija, Manoj Kumar, Sandeep Kumar

Abstract: Graph clustering is an important unsupervised learning technique for partitioning graphs with attributes and detecting communities. However, current methods struggle to accurately capture true community structures and intra-cluster relations, be computationally efficient, and identify smaller communities. We address these challenges by integrating coarsening and modularity maximization, effectivel… ▽ More Graph clustering is an important unsupervised learning technique for partitioning graphs with attributes and detecting communities. However, current methods struggle to accurately capture true community structures and intra-cluster relations, be computationally efficient, and identify smaller communities. We address these challenges by integrating coarsening and modularity maximization, effectively leveraging both adjacency and node features to enhance clustering accuracy. We propose a loss function incorporating log-determinant, smoothness, and modularity components using a block majorization-minimization technique, resulting in superior clustering outcomes. The method is theoretically consistent under the Degree-Corrected Stochastic Block Model (DC-SBM), ensuring asymptotic error-free performance and complete label recovery. Our provably convergent and time-efficient algorithm seamlessly integrates with graph neural networks (GNNs) and variational graph autoencoders (VGAEs) to learn enhanced node features and deliver exceptional clustering performance. Extensive experiments on benchmark datasets demonstrate its superiority over existing state-of-the-art methods for both attributed and non-attributed graphs. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: The first two authors contributed equally to this work

arXiv:2407.06868 [pdf, other]

Energy Efficient Fair STAR-RIS for Mobile Users

Authors: Ashok S. Kumar, Nancy Nayak, Sheetal Kalyani, Himal A. Suraweera

Abstract: In this work, we propose a method to improve the energy efficiency and fairness of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) for mobile users, ensuring reduced power consumption while maintaining reliable communication. To achieve this, we introduce a new parameter known as the subsurface assignment variable, which determines the number of STAR-RIS e… ▽ More In this work, we propose a method to improve the energy efficiency and fairness of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) for mobile users, ensuring reduced power consumption while maintaining reliable communication. To achieve this, we introduce a new parameter known as the subsurface assignment variable, which determines the number of STAR-RIS elements allocated to each user. We then formulate a novel optimization problem by concurrently optimizing the phase shifts of the STAR-RIS and subsurface assignment variable. We leverage the deep reinforcement learning (DRL) technique to address this optimization problem. The DRL model predicts the phase shifts of the STAR-RIS and efficiently allocates elements of STAR-RIS to the users. Additionally, we incorporate a penalty term in the DRL model to facilitate intelligent deactivation of STAR-RIS elements when not in use to enhance energy efficiency. Through extensive experiments, we show that the proposed method can achieve fairly high and nearly equal data rates for all users in both the transmission and reflection spaces in an energy-efficient manner. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.04444 [pdf, other]

TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR

Authors: Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Iuliia Nigmatulina, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

Abstract: In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achie… ▽ More In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Additionally, we present task transfer learning to a new task within an existing TokenVerse. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 5 pages, double column

arXiv:2407.04332 [pdf]

Energy Efficient Knapsack Optimization Using Probabilistic Memristor Crossbars

Authors: Jinzhan Li, Suhas Kumar, Su-in Yi

Abstract: Constrained optimization underlies crucial societal problems (for instance, stock trading and bandwidth allocation), but is often computationally hard (complexity grows exponentially with problem size). The big-data era urgently demands low-latency and low-energy optimization at the edge, which cannot be handled by digital processors due to their non-parallel von Neumann architecture. Recent effor… ▽ More Constrained optimization underlies crucial societal problems (for instance, stock trading and bandwidth allocation), but is often computationally hard (complexity grows exponentially with problem size). The big-data era urgently demands low-latency and low-energy optimization at the edge, which cannot be handled by digital processors due to their non-parallel von Neumann architecture. Recent efforts using massively parallel hardware (such as memristor crossbars and quantum processors) employing annealing algorithms, while promising, have handled relatively easy and stable problems with sparse or binary representations (such as the max-cut or traveling salesman problems).However, most real-world applications embody three features, which are encoded in the knapsack problem, and cannot be handled by annealing algorithms - dense and non-binary representations, with destabilizing self-feedback. Here we demonstrate a post-digital-hardware-friendly randomized competitive Ising-inspired (RaCI) algorithm performing knapsack optimization, experimentally implemented on a foundry-manufactured CMOS-integrated probabilistic analog memristor crossbar. Our solution outperforms digital and quantum approaches by over 4 orders of magnitude in energy efficiency. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 16 pages, 8 figures

arXiv:2407.04087 [pdf, other]

Advanced Artificial Intelligence Strategy for Optimizing Urban Rail Network Design using Nature-Inspired Algorithms

Authors: Hariram Sampath Kumar, Archana Singh, Manish Kumar Ojha

Abstract: This study introduces an innovative methodology for the planning of metro network routes within the urban environment of Chennai, Tamil Nadu, India. A comparative analysis of the modified Ant Colony Optimization (ACO) method (previously developed) with recent breakthroughs in nature-inspired algorithms demonstrates the modified ACO's superiority over modern techniques. By utilizing the modified AC… ▽ More This study introduces an innovative methodology for the planning of metro network routes within the urban environment of Chennai, Tamil Nadu, India. A comparative analysis of the modified Ant Colony Optimization (ACO) method (previously developed) with recent breakthroughs in nature-inspired algorithms demonstrates the modified ACO's superiority over modern techniques. By utilizing the modified ACO algorithm, the most efficient routes connecting the origin and destination of the metro route are generated. Additionally, the model is applied to the existing metro network to highlight variations between the model's results and the current network. The Google Maps platform, integrated with Python, handles real-time data, including land utilization, Geographical Information Systems (GIS) data, census information, and points of interest. This processing enables the identification of stops within the city and along the chosen routes. The resulting metro network showcases substantial benefits compared to conventional route planning methods, with noteworthy enhancements in workforce productivity, decreased planning time, and cost-efficiency. This study significantly enhances the efficiency of urban transport systems, specifically in rapidly changing metropolitan settings such as chennai. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: 10 pages, 17 figures

arXiv:2407.04053 [pdf, other]

Edge AI: A Taxonomy, Systematic Review and Future Directions

Authors: Sukhpal Singh Gill, Muhammed Golec, Jianmin Hu, Minxian Xu, Junhui Du, Huaming Wu, Guneet Kaur Walia, Subramaniam Subramanian Murugesan, Babar Ali, Mohit Kumar, Kejiang Ye, Prabal Verma, Surendra Kumar, Felix Cuadrado, Steve Uhlig

Abstract: Edge Artificial Intelligence (AI) incorporates a network of interconnected systems and devices that receive, cache, process, and analyse data in close communication with the location where the data is captured with AI technology. Recent advancements in AI efficiency, the widespread use of Internet of Things (IoT) devices, and the emergence of edge computing have unlocked the enormous scope of Edge… ▽ More Edge Artificial Intelligence (AI) incorporates a network of interconnected systems and devices that receive, cache, process, and analyse data in close communication with the location where the data is captured with AI technology. Recent advancements in AI efficiency, the widespread use of Internet of Things (IoT) devices, and the emergence of edge computing have unlocked the enormous scope of Edge AI. The goal of Edge AI is to optimize data processing efficiency and velocity while ensuring data confidentiality and integrity. Despite being a relatively new field of research, spanning from 2014 to the present, it has shown significant and rapid development over the last five years. In this article, we present a systematic literature review for Edge AI to discuss the existing research, recent advancements, and future research directions. We created a collaborative edge AI learning system for cloud and edge computing analysis, including an in-depth study of the architectures that facilitate this mechanism. The taxonomy for Edge AI facilitates the classification and configuration of Edge AI systems while also examining its potential influence across many fields through compassing infrastructure, cloud computing, fog computing, services, use cases, ML and deep learning, and resource management. This study highlights the significance of Edge AI in processing real-time data at the edge of the network. Additionally, it emphasizes the research challenges encountered by Edge AI systems, including constraints on resources, vulnerabilities to security threats, and problems with scalability. Finally, this study highlights the potential future research directions that aim to address the current limitations of Edge AI by providing innovative solutions. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: Preprint Version, 18 Figures

arXiv:2407.03152 [pdf, other]

Stereo Risk: A Continuous Modeling Approach to Stereo Matching

Authors: Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, Yao Yao, Luc Van Gool

Abstract: We introduce Stereo Risk, a new deep-learning approach to solve the classical stereo-matching problem in computer vision. As it is well-known that stereo matching boils down to a per-pixel disparity estimation problem, the popular state-of-the-art stereo-matching approaches widely rely on regressing the scene disparity values, yet via discretization of scene disparity values. Such discretization o… ▽ More We introduce Stereo Risk, a new deep-learning approach to solve the classical stereo-matching problem in computer vision. As it is well-known that stereo matching boils down to a per-pixel disparity estimation problem, the popular state-of-the-art stereo-matching approaches widely rely on regressing the scene disparity values, yet via discretization of scene disparity values. Such discretization often fails to capture the nuanced, continuous nature of scene depth. Stereo Risk departs from the conventional discretization approach by formulating the scene disparity as an optimal solution to a continuous risk minimization problem, hence the name "stereo risk". We demonstrate that $L^1$ minimization of the proposed continuous risk function enhances stereo-matching performance for deep networks, particularly for disparities with multi-modal probability distributions. Furthermore, to enable the end-to-end network training of the non-differentiable $L^1$ risk optimization, we exploited the implicit function theorem, ensuring a fully differentiable network. A comprehensive analysis demonstrates our method's theoretical soundness and superior performance over the state-of-the-art methods across various benchmark datasets, including KITTI 2012, KITTI 2015, ETH3D, SceneFlow, and Middlebury 2014. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: Accepted as an Oral Paper at ICML 2024. Draft info: 18 pages, 6 Figure, 16 Tables

arXiv:2407.02929 [pdf, other]

A Hybrid Reactive Routing Protocol for Decentralized UAV Networks

Authors: Shivam Garg, Alexander Ihler, Elizabeth Serena Bentley, Sunil Kumar

Abstract: Wireless networks consisting of low SWaP, FW-UAVs are used in many applications, such as monitoring, search and surveillance of inaccessible areas. A decentralized and autonomous approach ensures robustness to failures; the UAVs explore and sense within the area and forward their information, in a multihop manner, to nearby aerial gateway nodes. However, the unpredictable nature of the events, rel… ▽ More Wireless networks consisting of low SWaP, FW-UAVs are used in many applications, such as monitoring, search and surveillance of inaccessible areas. A decentralized and autonomous approach ensures robustness to failures; the UAVs explore and sense within the area and forward their information, in a multihop manner, to nearby aerial gateway nodes. However, the unpredictable nature of the events, relatively high speed of UAVs, and dynamic UAV trajectories cause the network topology to change significantly over time, resulting in frequent route breaks. A holistic routing approach is needed to support multiple traffic flows in these networks to provide mobility- and congestion-aware, high-quality routes when needed, with low control and computational overheads, using the information collected in a distributed manner. Existing routing schemes do not address all the mentioned issues. We present a hybrid reactive routing protocol for decentralized UAV networks. Our scheme searches routes on-demand, monitors a region around the selected route (the pipe), and proactively switches to an alternative route before the current route's quality degrades below a threshold. We empirically evaluate the impact of pipe width and node density on our ability to find alternate high-quality routes within the pipe and the overhead required to maintain the pipe. Compared to existing reactive routing schemes, our approach achieves higher throughput and reduces the number of route discoveries, overhead, and resulting flow interruptions at different traffic loads, node densities and speeds. Despite having limited network topology information, and low overhead and route computation complexity, our proposed scheme achieves superior throughput to proactive optimized link state routing scheme at different network and traffic settings. We also evaluate the relative performance of reactive and proactive routing schemes. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02662 [pdf, other]

Supporters and Skeptics: LLM-based Analysis of Engagement with Mental Health (Mis)Information Content on Video-sharing Platforms

Authors: Viet Cuong Nguyen, Mini Jain, Abhijat Chauhan, Heather Jaime Soled, Santiago Alvarez Lesmes, Zihang Li, Michael L. Birnbaum, Sunny X. Tang, Srijan Kumar, Munmun De Choudhury

Abstract: Over one in five adults in the US lives with a mental illness. In the face of a shortage of mental health professionals and offline resources, online short-form video content has grown to serve as a crucial conduit for disseminating mental health help and resources. However, the ease of content creation and access also contributes to the spread of misinformation, posing risks to accurate diagnosis… ▽ More Over one in five adults in the US lives with a mental illness. In the face of a shortage of mental health professionals and offline resources, online short-form video content has grown to serve as a crucial conduit for disseminating mental health help and resources. However, the ease of content creation and access also contributes to the spread of misinformation, posing risks to accurate diagnosis and treatment. Detecting and understanding engagement with such content is crucial to mitigating their harmful effects on public health. We perform the first quantitative study of the phenomenon using YouTube Shorts and Bitchute as the sites of study. We contribute MentalMisinfo, a novel labeled mental health misinformation (MHMisinfo) dataset of 739 videos (639 from Youtube and 100 from Bitchute) and 135372 comments in total, using an expert-driven annotation schema. We first found that few-shot in-context learning with large language models (LLMs) are effective in detecting MHMisinfo videos. Next, we discover distinct and potentially alarming linguistic patterns in how audiences engage with MHMisinfo videos through commentary on both video-sharing platforms. Across the two platforms, comments could exacerbate prevailing stigma with some groups showing heightened susceptibility to and alignment with MHMisinfo. We discuss technical and public health-driven adaptive solutions to tackling the "epidemic" of mental health misinformation online. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: 12 pages, in submission to ICWSM

arXiv:2406.18848 [pdf, other]

Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation

Authors: Hui Wei, Maxwell A. Xu, Colin Samplawski, James M. Rehg, Santosh Kumar, Benjamin M. Marlin

Abstract: Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputation of missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a n… ▽ More Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputation of missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a novel and large scale data set consisting of a training set with over 3 million hourly step count observations and a test set with over 2.5 million hourly step count observations. We propose a domain knowledge-informed sparse self-attention model for this task that captures the temporal multi-scale nature of step-count data. We assess the performance of the model relative to baselines and conduct ablation studies to verify our specific model designs. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted by Conference on Health, Inference, and Learning (CHIL) 2024

arXiv:2406.18510 [pdf, other]

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

Abstract: We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with… ▽ More We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbeak contribute to achieving balanced safety behaviors of models. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17968 [pdf, other]

Efficient Document Ranking with Learnable Late Interactions

Authors: Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar

Abstract: Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been p… ▽ More Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17963 [pdf, other]

Empowering Interdisciplinary Insights with Dynamic Graph Embedding Trajectories

Authors: Yiqiao Jin, Andrew Zhao, Yeon-Chang Lee, Meng Ye, Ajay Divakaran, Srijan Kumar

Abstract: We developed DyGETViz, a novel framework for effectively visualizing dynamic graphs (DGs) that are ubiquitous across diverse real-world systems. This framework leverages recent advancements in discrete-time dynamic graph (DTDG) models to adeptly handle the temporal dynamics inherent in dynamic graphs. DyGETViz effectively captures both micro- and macro-level structural shifts within these graphs,… ▽ More We developed DyGETViz, a novel framework for effectively visualizing dynamic graphs (DGs) that are ubiquitous across diverse real-world systems. This framework leverages recent advancements in discrete-time dynamic graph (DTDG) models to adeptly handle the temporal dynamics inherent in dynamic graphs. DyGETViz effectively captures both micro- and macro-level structural shifts within these graphs, offering a robust method for representing complex and massive dynamic graphs. The application of DyGETViz extends to a diverse array of domains, including ethology, epidemiology, finance, genetics, linguistics, communication studies, social studies, and international relations. Through its implementation, DyGETViz has revealed or confirmed various critical insights. These include the diversity of content sharing patterns and the degree of specialization within online communities, the chronological evolution of lexicons across decades, and the distinct trajectories exhibited by aging-related and non-related genes. Importantly, DyGETViz enhances the accessibility of scientific findings to non-domain experts by simplifying the complexities of dynamic graphs. Our framework is released as an open-source Python package for use across diverse disciplines. Our work not only addresses the ongoing challenges in visualizing and analyzing DTDG models but also establishes a foundational framework for future investigations into dynamic graph representation and analysis across various disciplines. △ Less

Submitted 28 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

Comments: 27 pages, 11 figures

arXiv:2406.17641 [pdf]

The experience of humans' and robots' mutual (im)politeness in enacted service scenarios: An empirical study

Authors: Victor Kaptelinin, Suna Bensch, Thomas Hellström, Patrik Björnfot, Shikhar Kumar

Abstract: The paper reports an empirical study of the effect of human treatment of a robot on the social perception of the robot's behavior. The study employed an enacted interaction between an anthropomorphic "waiter" robot and two customers. The robot and one of the customers (acted out by a researcher) were following four different interaction scripts, representing all combinations of mutual politeness a… ▽ More The paper reports an empirical study of the effect of human treatment of a robot on the social perception of the robot's behavior. The study employed an enacted interaction between an anthropomorphic "waiter" robot and two customers. The robot and one of the customers (acted out by a researcher) were following four different interaction scripts, representing all combinations of mutual politeness and impoliteness of the robot and the customer. The participants (N=24, within-subject design) were assigned the role of an "included observer", that is, a fellow customer who was present in the situation without being actively involved in the interactions. The participants assessed how they experienced the interaction scenarios by providing Likert scale scores and free-text responses. The results indicate that while impolite robots' behavior was generally assessed negatively, it was commonly perceived as more justifiable and fairer if the robot was treated impolitely by the human. Politeness reciprocity expectations in the context of the social perception of robots are discussed. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: 19 pages, 5 figures, 7 tables

arXiv:2406.13715 [pdf, other]

Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual Fusion

Authors: Pranav Janjani, Mayank Palan, Sarvesh Shirude, Ninad Shegokar, Sunny Kumar, Faruk Kazi

Abstract: Recent advances in large language models (LLMs) have led to new summarization strategies, offering an extensive toolkit for extracting important information. However, these approaches are frequently limited by their reliance on isolated sources of data. The amount of information that can be gathered is limited and covers a smaller range of themes, which introduces the possibility of falsified cont… ▽ More Recent advances in large language models (LLMs) have led to new summarization strategies, offering an extensive toolkit for extracting important information. However, these approaches are frequently limited by their reliance on isolated sources of data. The amount of information that can be gathered is limited and covers a smaller range of themes, which introduces the possibility of falsified content and limited support for multilingual and multimodal data. The paper proposes a novel approach to summarization that tackles such challenges by utilizing the strength of multiple sources to deliver a more exhaustive and informative understanding of intricate topics. The research progresses beyond conventional, unimodal sources such as text documents and integrates a more diverse range of data, including YouTube playlists, pre-prints, and Wikipedia pages. The aforementioned varied sources are then converted into a unified textual representation, enabling a more holistic analysis. This multifaceted approach to summary generation empowers us to extract pertinent information from a wider array of sources. The primary tenet of this approach is to maximize information gain while minimizing information overlap and maintaining a high level of informativeness, which encourages the generation of highly coherent summaries. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 11 pages, 3 figures

arXiv:2406.12405 [pdf]

On The Effective Rate and Error Rate Analysis over Fluctuating Nakagami-m Fading Channel

Authors: Manpreet Kaur, Puspraj Singh Chauhan, Sandeep Kumar, Pappu Kumar Verma

Abstract: This paper provides a detailed analysis of the important performance metrics like effective capacity and symbol error rate over fluctuating Nakagami-m fading channel. This distribution is obtained from the ratio of two random variables, following the Nakagami-m distribution and the uniform distribution. Our study derives exact analytical expressions for the EC and SER under different modulation sc… ▽ More This paper provides a detailed analysis of the important performance metrics like effective capacity and symbol error rate over fluctuating Nakagami-m fading channel. This distribution is obtained from the ratio of two random variables, following the Nakagami-m distribution and the uniform distribution. Our study derives exact analytical expressions for the EC and SER under different modulation schemes, considering the effect of channel parameters. Recognising the importance of additive Laplacian noise in today scenario, it has been considered for the error performance analysis of the system. This work may be utilised for the design and optimization of the systems operating in environments characterized by fluctuating Nakagami-m fading. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 18 pages

arXiv:2406.12274 [pdf, other]

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Authors: Somnath Banerjee, Soham Tripathy, Sayan Layek, Shanu Kumar, Animesh Mukherjee, Rima Hazra

Abstract: Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, incorporating new knowledge through editing techniques to language models can further compromise safety. To address these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to… ▽ More Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, incorporating new knowledge through editing techniques to language models can further compromise safety. To address these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer comprises two phases: the safety amplification phase, which employs safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the safety-guided decoding phase, which influences token selection based on safety-optimized distributions, ensuring the generated content complies with ethical guidelines. Further, we present HarmEval, a novel benchmark for extensive safety evaluations, designed to address potential misuse scenarios in accordance with the policies of leading AI tech giants. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2406.11768 [pdf, other]

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

Abstract: Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including feat… ▽ More Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input with high-level semantic evidence by leveraging event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA IT-ed on CompA-R proves to be superior in its complex reasoning and instruction following capabilities. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Project Website: https://sreyan88.github.io/gamaaudio/

arXiv:2406.09443 [pdf, other]

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Authors: Satyam Kumar, Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen, Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

Abstract: Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection… ▽ More Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.09167 [pdf, other]

Vision Transformer Segmentation for Visual Bird Sound Denoising

Authors: Sahil Kumar, Jialu Li, Youshan Zhang

Abstract: Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean… ▽ More Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean audio from complex signal mixtures. Our key contributions encompass the development of ViTVS, introducing comprehensive, long-range, and multi-scale representations. These contributions directly tackle the limitations inherent in conventional approaches. Extensive experiments demonstrate that ViTVS outperforms state-of-the-art methods, positioning it as a benchmark solution for real-world bird sound denoising applications. Source code is available at: https://github.com/aiai-4/ViVTS. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: INTERSPEECH 2024

arXiv:2406.07910 [pdf, other]

Demonstration of Safe Electromagnetic Radiation Emitted by 5G Active Antenna Systems

Authors: Sumit Kumar, Chandan Kumar Sheemar, Abdelrahman Astro, Jorge Querol, Symeon Chatzinotas

Abstract: The careful planning and safe deployment of 5G technologies will bring enormous benefits to society and the economy. Higher frequency, beamforming, and small-cells are key technologies that will provide unmatched throughput and seamless connectivity to 5G users. Superficial knowledge of these technologies has raised concerns among the general public about the harmful effects of radiation. Several… ▽ More The careful planning and safe deployment of 5G technologies will bring enormous benefits to society and the economy. Higher frequency, beamforming, and small-cells are key technologies that will provide unmatched throughput and seamless connectivity to 5G users. Superficial knowledge of these technologies has raised concerns among the general public about the harmful effects of radiation. Several standardization bodies are active to put limits on the emissions which are based on a defined set of radiation measurement methodologies. However, due to the peculiarity of 5G such as dynamicity of the beams, network densification, Time Division Duplexing mode of operation, etc, using existing EMF measurement methods may provide inaccurate results. In this context, we discuss our experimental studies aimed towards the measurement of radiation caused by beam-based transmissions from a 5G base station equipped with an Active Antenna System(AAS). We elaborate on the shortcomings of current measurement methodologies and address several open questions. Next, we demonstrate that using user-specific downlink beamforming, not only better performance is achieved compared to non-beamformed downlink, but also the radiation in the vicinity of the intended user is significantly decreased. Further, we show that under weak reception conditions, an uplink transmission can cause significantly high radiation in the vicinity of the user equipment. We believe that our work will help in clearing several misleading concepts about the 5G EMF radiation effects. We conclude the work by providing guidelines to improve the methodology of EMF measurement by considering the spatiotemporal dynamicity of the 5G transmission. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.06811 [pdf, other]

Learning Continually by Spectral Regularization

Authors: Alex Lewandowski, Saurabh Kumar, Dale Schuurmans, András György, Marlos C. Machado

Abstract: Loss of plasticity is a phenomenon where neural networks become more difficult to train during the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good predictive performance while maintaining network trainability. We develop new techniques for improving continual learning by first reconsidering how initialization can ensure trainability during early ph… ▽ More Loss of plasticity is a phenomenon where neural networks become more difficult to train during the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good predictive performance while maintaining network trainability. We develop new techniques for improving continual learning by first reconsidering how initialization can ensure trainability during early phases of learning. From this perspective, we derive new regularization strategies for continual learning that ensure beneficial initialization properties are better maintained throughout training. In particular, we investigate two new regularization techniques for continual learning: (i) Wasserstein regularization toward the initial weight distribution, which is less restrictive than regularizing toward initial weights; and (ii) regularizing weight matrix singular values, which directly ensures gradient diversity is maintained throughout training. We present an experimental analysis that shows these alternative regularizers can improve continual learning performance across a range of supervised learning tasks and model architectures. The alternative regularizers prove to be less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.04432 [pdf, other]

LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha

Abstract: Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of vis… ▽ More Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs that is additionally equipped with lip motion cues to promote further research in this space △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: InterSpeech 2024. Code and Data: https://github.com/Sreyan88/LipGER

arXiv:2406.04286 [pdf, other]

ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions

Authors: Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, C. K. Evuru, S Ramaneswaran, S Sakshi, Dinesh Manocha

Abstract: We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document -- we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To lea… ▽ More We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document -- we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To learn the task of expanding abstract descriptions, we first train BART on a large-scale synthetic dataset with abstract-document pairs. Next, to generate abstract descriptions for a document, we propose a simple, controllable, and training-free method based on editing AMR graphs. ABEX brings the best of both worlds: by expanding from abstract representations, it preserves the original semantic properties of the documents, like style and meaning, thereby maintaining alignment with the original label and data distribution. At the same time, the fundamental process of elaborating on abstract descriptions facilitates diverse generations. We demonstrate the effectiveness of ABEX on 4 NLU tasks spanning 12 datasets and 4 low-resource settings. ABEX outperforms all our baselines qualitatively with improvements of 0.04% - 38.8%. Qualitatively, ABEX outperforms all prior methods from literature in terms of context and length diversity. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: ACL 2024 Main Conference. Code and data: https://github.com/Sreyan88/ABEX

arXiv:2406.02469 [pdf, other]

Landscape-Aware Growing: The Power of a Little LAG

Authors: Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar

Abstract: Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have e… ▽ More Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.19420 [pdf, other]

Using Contrastive Learning with Generative Similarity to Learn Spaces that Capture Human Inductive Biases

Authors: Raja Marjieh, Sreejan Kumar, Declan Campbell, Liyi Zhang, Gianluca Bencomo, Jake Snell, Thomas L. Griffiths

Abstract: Humans rely on strong inductive biases to learn from few examples and abstract useful information from sensory data. Instilling such biases in machine learning models has been shown to improve their performance on various benchmarks including few-shot learning, robustness, and alignment. However, finding effective training procedures to achieve that goal can be challenging as psychologically-rich… ▽ More Humans rely on strong inductive biases to learn from few examples and abstract useful information from sensory data. Instilling such biases in machine learning models has been shown to improve their performance on various benchmarks including few-shot learning, robustness, and alignment. However, finding effective training procedures to achieve that goal can be challenging as psychologically-rich training data such as human similarity judgments are expensive to scale, and Bayesian models of human inductive biases are often intractable for complex, realistic domains. Here, we address this challenge by introducing a Bayesian notion of generative similarity whereby two datapoints are considered similar if they are likely to have been sampled from the same distribution. This measure can be applied to complex generative processes, including probabilistic programs. We show that generative similarity can be used to define a contrastive learning objective even when its exact form is intractable, enabling learning of spatial embeddings that express specific inductive biases. We demonstrate the utility of our approach by showing how it can be used to capture human inductive biases for geometric shapes, and to better distinguish different abstract drawing styles that are parameterized by probabilistic programs. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.19261 [pdf, other]

Faster Cascades via Speculative Decoding

Authors: Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar

Abstract: Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in p… ▽ More Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades are often capable of yielding better quality than even the larger model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Through experiments with T5 models on benchmark language tasks, we show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18359 [pdf, other]

Bridging the Gap: Dynamic Learning Strategies for Improving Multilingual Performance in LLMs

Authors: Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Tanuja Ganu, Sunayana Sitaram, Kalika Bali, Akshay Nambi

Abstract: Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs without extensive training or fine-tuning. Through systematic investigation and evaluation of diverse l… ▽ More Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs without extensive training or fine-tuning. Through systematic investigation and evaluation of diverse languages using popular question-answering (QA) datasets, we present novel techniques that unlock the true potential of LLMs in a polyglot landscape. Our approach encompasses three key strategies that yield significant improvements in multilingual proficiency. First, by meticulously optimizing prompts tailored for polyglot LLMs, we unlock their latent capabilities, resulting in substantial performance boosts across languages. Second, we introduce a new hybrid approach that synergizes LLM Retrieval Augmented Generation (RAG) with multilingual embeddings and achieves improved multilingual task performance. Finally, we introduce a novel learning approach that dynamically selects the optimal prompt strategy, LLM model, and embedding model per query at run-time. This dynamic adaptation maximizes the efficacy of LLMs across languages, outperforming best static and random strategies. Additionally, our approach adapts configurations in both offline and online settings, and can seamlessly adapt to new languages and datasets, leading to substantial advancements in multilingual understanding and generation across diverse languages. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Report number: MSR-TR-VeLLM-01

arXiv:2405.18358 [pdf, other]

MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning

Authors: Somnath Kumar, Yash Gadhia, Tanuja Ganu, Akshay Nambi

Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have significantly improved their performance in tasks combining vision and language. However, challenges persist in detailed multi-modal understanding, comprehension of complex tasks, and reasoning over multi-modal information. This paper introduces MMCTAgent, a novel multi-modal critical thinking agent framework designed to address… ▽ More Recent advancements in Multi-modal Large Language Models (MLLMs) have significantly improved their performance in tasks combining vision and language. However, challenges persist in detailed multi-modal understanding, comprehension of complex tasks, and reasoning over multi-modal information. This paper introduces MMCTAgent, a novel multi-modal critical thinking agent framework designed to address the inherent limitations of current MLLMs in complex visual reasoning tasks. Inspired by human cognitive processes and critical thinking, MMCTAgent iteratively analyzes multi-modal information, decomposes queries, plans strategies, and dynamically evolves its reasoning. Additionally, MMCTAgent incorporates critical thinking elements such as verification of final answers and self-reflection through a novel approach that defines a vision-based critic and identifies task-specific evaluation criteria, thereby enhancing its decision-making abilities. Through rigorous evaluations across various image and video understanding benchmarks, we demonstrate that MMCTAgent (with and without the critic) outperforms both foundational MLLMs and other tool-augmented pipelines. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Report number: MSR-TR-VeLLM-03

arXiv:2405.16401 [pdf, other]

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Authors: Neha Kalibhat, Priyatham Kattakinda, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi

Abstract: Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training fram… ▽ More Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%). △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.15683 [pdf, other]

VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap

Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha

Abstract: Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1)… ▽ More Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1) The community's efforts have been primarily targeted towards reducing hallucinations related to visual recognition (VR) prompts (e.g., prompts that only require describing the image), thereby ignoring hallucinations for cognitive prompts (e.g., prompts that require additional skills like reasoning on contents of the image). (2) LVLMs lack visual perception, i.e., they can see but not necessarily understand or perceive the input image. We analyze responses to cognitive prompts and show that LVLMs hallucinate due to a perception gap: although LVLMs accurately recognize visual elements in the input image and possess sufficient cognitive skills, they struggle to respond accurately and hallucinate. To overcome this shortcoming, we propose Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method for alleviating hallucinations. Specifically, we first describe the image and add it as a prefix to the instruction. Next, during auto-regressive decoding, we sample from the plausible candidates according to their KL-Divergence (KLD) to the description, where lower KLD is given higher preference. Experimental results on several benchmarks and LVLMs show that VDGD improves significantly over other baselines in reducing hallucinations. We also propose VaLLu, a benchmark for the comprehensive evaluation of the cognitive capabilities of LVLMs. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: Preprint. Under review. Code will be released on paper acceptance

arXiv:2405.11596 [pdf]

Bioinspired Nested-Isotropic Lattices with Tunable Anisotropy for Additive Manufacturing

Authors: R. Boda, B. Panda, S. Kumar

Abstract: This study presents innovative nested-isotropic lattices for additive manufacturing, drawing inspiration from bio-architectures found in cortical bone osteons, golden spirals, and fractals. These lattices provide tunable anisotropy by integrating architectural elements like ``nesting orders (NOs)'' and corresponding ``nesting orientations (NORs),'' along with repetitive self-similar X-cross struts… ▽ More This study presents innovative nested-isotropic lattices for additive manufacturing, drawing inspiration from bio-architectures found in cortical bone osteons, golden spirals, and fractals. These lattices provide tunable anisotropy by integrating architectural elements like ``nesting orders (NOs)'' and corresponding ``nesting orientations (NORs),'' along with repetitive self-similar X-cross struts and three four-fold axes of symmetry, resulting in a wide spectrum of lattice designs. Nine mono-nest and twenty multi-nest lattices, along with 252 parametric variations, are realized. The relative density $ \barρ $ and surface area density $ \bar{S} $ are calculated. Employing finite element-based numerical homogenization, elastic stiffness tensors are estimated to evaluate the anisotropic measure - Zener ratio $ Z $ and elastic modulus $ \bar{E} $ for all lattice designs. The mono-nest lattices generated considering higher NOs and respective NORs exhibit a transition from shear dominant to tensile/compression dominant (TCD) anisotropic behavior and their strut size variations show a strong influence on $ \barρ $, $ \bar{S} $, and $ \bar{E} $. In contrast, multi-nest lattices exhibit isotropic and neo-isotropic characteristics, with strut size mismatch exerting more influence on $ Z $. Increasing NOs and NORs result in isotropic or TCD behavior for most multi-nest lattices, with strut size mismatch leading to many isotropic lattices. These bio-inspired nested lattices, coupled with advancements in additive manufacturing, hold potential for diverse applications. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 50 pages, 16 figures in the main text

arXiv:2405.11346 [pdf]

Decision support system for Forest fire management using Ontology with Big Data and LLMs

Authors: Ritesh Chandra, Shashi Shekhar Kumar, Rushil Patra, Sonali Agarwal

Abstract: Forests are crucial for ecological balance, but wildfires, a major cause of forest loss, pose significant risks. Fire weather indices, which assess wildfire risk and predict resource demands, are vital. With the rise of sensor networks in fields like healthcare and environmental monitoring, semantic sensor networks are increasingly used to gather climatic data such as wind speed, temperature, and… ▽ More Forests are crucial for ecological balance, but wildfires, a major cause of forest loss, pose significant risks. Fire weather indices, which assess wildfire risk and predict resource demands, are vital. With the rise of sensor networks in fields like healthcare and environmental monitoring, semantic sensor networks are increasingly used to gather climatic data such as wind speed, temperature, and humidity. However, processing these data streams to determine fire weather indices presents challenges, underscoring the growing importance of effective forest fire detection. This paper discusses using Apache Spark for early forest fire detection, enhancing fire risk prediction with meteorological and geographical data. Building on our previous development of Semantic Sensor Network (SSN) ontologies and Semantic Web Rules Language (SWRL) for managing forest fires in Monesterial Natural Park, we expanded SWRL to improve a Decision Support System (DSS) using a Large Language Models (LLMs) and Spark framework. We implemented real-time alerts with Spark streaming, tailored to various fire scenarios, and validated our approach using ontology metrics, query-based evaluations, LLMs score precision, F1 score, and recall measures. △ Less

Submitted 18 May, 2024; originally announced May 2024.

Showing 1–50 of 1,064 results for author: Kumar, S