subscribe to arXiv mailings

Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Authors: Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen, Xuguang Lan

Abstract: This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating position relation prior as attention bias to augment o… ▽ More This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating position relation prior as attention bias to augment object detection, following the verification of its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement, which further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1x and 52.1% AP for 2x settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-in-and-play component, bringing clear improvements for theoretically any DETR-like methods. Furthermore, we introduce a class-agnostic detection dataset, SA-Det-100k. The experimental results on the dataset illustrate that the proposed explicit position relation achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection. The code and dataset are available at https://github.com/xiuqhou/Relation-DETR. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.11497 [pdf, other]

"I Came Across a Junk": Understanding Design Flaws of Data Visualization from the Public's Perspective

Authors: Xingyu Lan, Yu Liu

Abstract: The visualization community has a rich history of reflecting upon flaws of visualization design, and research in this direction has remained lively until now. However, three main gaps still exist. First, most existing work characterizes design flaws from the perspective of researchers rather than the perspective of general users. Second, little work has been done to infer why these design flaws oc… ▽ More The visualization community has a rich history of reflecting upon flaws of visualization design, and research in this direction has remained lively until now. However, three main gaps still exist. First, most existing work characterizes design flaws from the perspective of researchers rather than the perspective of general users. Second, little work has been done to infer why these design flaws occur. Third, due to problems such as unclear terminology and ambiguous research scope, a better framework that systematically outlines various design flaws and helps distinguish different types of flaws is desired. To address the above gaps, this work investigated visualization design flaws through the lens of the public, constructed a framework to summarize and categorize the identified flaws, and explored why these flaws occur. Specifically, we analyzed 2227 flawed data visualizations collected from an online gallery and derived a design task-associated taxonomy containing 76 specific design flaws. These flaws were further classified into three high-level categories (i.e., misinformation, uninformativeness, unsociableness) and ten subcategories (e.g., inaccuracy, unfairness, ambiguity). Next, we organized five focus groups to explore why these design flaws occur and identified seven causes of the flaws. Finally, we proposed a set of reflections and implications arising from the research. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.07844 [pdf, other]

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Authors: Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

Abstract: Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data… ▽ More Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO. △ Less

Submitted 21 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

Comments: Technical Report

arXiv:2404.01622 [pdf, ps, other]

Gen4DS: Workshop on Data Storytelling in an Era of Generative AI

Authors: Xingyu Lan, Leni Yang, Zezhong Wang, Yun Wang, Danqing Shi, Sheelagh Carpendale

Abstract: Storytelling is an ancient and precious human ability that has been rejuvenated in the digital age. Over the last decade, there has been a notable surge in the recognition and application of data storytelling, both in academia and industry. Recently, the rapid development of generative AI has brought new opportunities and challenges to this field, sparking numerous new questions. These questions m… ▽ More Storytelling is an ancient and precious human ability that has been rejuvenated in the digital age. Over the last decade, there has been a notable surge in the recognition and application of data storytelling, both in academia and industry. Recently, the rapid development of generative AI has brought new opportunities and challenges to this field, sparking numerous new questions. These questions may not necessarily be quickly transformed into papers, but we believe it is necessary to promptly discuss them to help the community better clarify important issues and research agendas for the future. We thus invite you to join our workshop (Gen4DS) to discuss questions such as: How can generative AI facilitate the creation of data stories? How might generative AI alter the workflow of data storytellers? What are the pitfalls and risks of incorporating AI in storytelling? We have designed both paper presentations and interactive activities (including hands-on creation, group discussion pods, and debates on controversial issues) for the workshop. We hope that participants will learn about the latest advances and pioneering work in data storytelling, engage in critical conversations with each other, and have an enjoyable, unforgettable, and meaningful experience at the event. △ Less

Submitted 5 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.10750 [pdf, other]

Depression Detection on Social Media with Large Language Models

Authors: Xiaochong Lan, Yiming Cheng, Li Sheng, Chen Gao, Yong Li

Abstract: Depression harms. However, due to a lack of mental health awareness and fear of stigma, many patients do not actively seek diagnosis and treatment, leading to detrimental outcomes. Depression detection aims to determine whether an individual suffers from depression by analyzing their history of posts on social media, which can significantly aid in early detection and intervention. It mainly faces… ▽ More Depression harms. However, due to a lack of mental health awareness and fear of stigma, many patients do not actively seek diagnosis and treatment, leading to detrimental outcomes. Depression detection aims to determine whether an individual suffers from depression by analyzing their history of posts on social media, which can significantly aid in early detection and intervention. It mainly faces two key challenges: 1) it requires professional medical knowledge, and 2) it necessitates both high accuracy and explainability. To address it, we propose a novel depression detection system called DORIS, combining medical knowledge and the recent advances in large language models (LLMs). Specifically, to tackle the first challenge, we proposed an LLM-based solution to first annotate whether high-risk texts meet medical diagnostic criteria. Further, we retrieve texts with high emotional intensity and summarize critical information from the historical mood records of users, so-called mood courses. To tackle the second challenge, we combine LLM and traditional classifiers to integrate medical knowledge-guided features, for which the model can also explain its prediction results, achieving both high accuracy and explainability. Extensive experimental results on benchmarking datasets show that, compared to the current best baseline, our approach improves by 0.036 in AUPRC, which can be considered significant, demonstrating the effectiveness of our approach and its high value as an NLP application. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.19231 [pdf, other]

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Authors: Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, Yaowei Wang, Chun Yuan

Abstract: Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust glob… ▽ More Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints, or even captured from different places. Therefore, our method can utilize the cross-image variations as a cue to guide the representation learning, which ensures more robust features are produced. To further facilitate the robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces the multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR. △ Less

Submitted 1 April, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted by CVPR2024

arXiv:2402.17978 [pdf, other]

Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning

Authors: Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, Xuguang Lan

Abstract: Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often… ▽ More Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models. △ Less

Submitted 1 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence

arXiv:2402.16086 [pdf, other]

doi 10.1609/aaai.v38i9.28901

Deep Homography Estimation for Visual Place Recognition

Authors: Feng Lu, Shuting Dong, Lijun Zhang, Bingxi Liu, Xiangyuan Lan, Dongmei Jiang, Chun Yuan

Abstract: Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ran… ▽ More Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. And it is more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR. △ Less

Submitted 18 March, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI2024

Journal ref: AAAI 2024

arXiv:2402.14505 [pdf, other]

Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

Authors: Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, Chun Yuan

Abstract: Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model… ▽ More Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR is still a key issue to address. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method to achieve both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. Besides, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses about only 3% retrieval runtime of the two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR. △ Less

Submitted 3 April, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: ICLR2024

arXiv:2402.11816 [pdf, other]

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Authors: Jihai Zhang, Xiang Lan, Xiaoye Qu, Yu Cheng, Mengling Feng, Bryan Hooi

Abstract: Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information from the input data while overlooking other potentially valuable content. This… ▽ More Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information from the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream task performance, particularly those requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning which inherently captures one single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are exclusively selected from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves the previously well-learned features by cross-stage representation integration, integrating features across all stages to form final representations. Our comprehensive evaluation demonstrates MCL's effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning a range of model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements up to threefold on specific attributes in the recently proposed MMVP benchmark. △ Less

Submitted 15 July, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

Comments: ECCV 2024 Camera-Ready

arXiv:2402.11792 [pdf, other]

SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction

Authors: Jie Xu, Hanbo Zhang, Xinghang Li, Huaping Liu, Xuguang Lan, Tao Kong

Abstract: Linguistic ambiguity is ubiquitous in our daily lives. Previous works adopted interaction between robots and humans for language disambiguation. Nevertheless, when interactive robots are deployed in daily environments, there are significant challenges for natural human-robot interaction, stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands. In thi… ▽ More Linguistic ambiguity is ubiquitous in our daily lives. Previous works adopted interaction between robots and humans for language disambiguation. Nevertheless, when interactive robots are deployed in daily environments, there are significant challenges for natural human-robot interaction, stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands. In this paper, we present SInViG, which is a self-evolving interactive visual agent for human-robot interaction based on natural languages, aiming to resolve language ambiguity, if any, through multi-turn visual-language dialogues. It continuously and automatically learns from unlabeled images and large language models, without human intervention, to be more robust against visual and linguistic complexity. Benefiting from self-evolving, it sets new state-of-the-art on several interactive visual grounding benchmarks. Moreover, our human-robot interaction experiments show that the evolved models consistently acquire more and more preferences from human users. Besides, we also deployed our model on a Franka robot for interactive manipulation tasks. Results demonstrate that our model can follow diverse user instructions and interact naturally with humans in natural language, despite the complexity and disturbance of the environment. △ Less

Submitted 19 February, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.03699 [pdf]

Automatic Robotic Development through Collaborative Framework by Large Language Models

Authors: Zhirong Luan, Yujun Lai, Rundong Huang, Xiaruiqi Lan, Liangjun Chen, Badong Chen

Abstract: Despite the remarkable code generation abilities of large language models LLMs, they still face challenges in complex task handling. Robot development, a highly intricate field, inherently demands human involvement in task allocation and collaborative teamwork . To enhance robot development, we propose an innovative automated collaboration framework inspired by real-world robot developers. This fr… ▽ More Despite the remarkable code generation abilities of large language models LLMs, they still face challenges in complex task handling. Robot development, a highly intricate field, inherently demands human involvement in task allocation and collaborative teamwork . To enhance robot development, we propose an innovative automated collaboration framework inspired by real-world robot developers. This framework employs multiple LLMs in distinct roles analysts, programmers, and testers. Analysts delve deep into user requirements, enabling programmers to produce precise code, while testers fine-tune the parameters based on user feedback for practical robot application. Each LLM tackles diverse, critical tasks within the development process. Clear collaboration rules emulate real world teamwork among LLMs. Analysts, programmers, and testers form a cohesive team overseeing strategy, code, and parameter adjustments . Through this framework, we achieve complex robot development without requiring specialized knowledge, relying solely on non experts participation. △ Less

Submitted 16 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.16699 [pdf, other]

Towards Unified Interactive Visual Grounding in The Wild

Authors: Jie Xu, Hanbo Zhang, Qingyi Si, Yifeng Li, Xuguang Lan, Tao Kong

Abstract: Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural languages. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper… ▽ More Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural languages. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained on a joint of extensive public data, and show superior generality to diversified and challenging open-world scenarios. In the experiments, we validate TiO on GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on the carefully selected 150 challenging scenes as well as real-robot platforms. Results show that our method demonstrates superior generality to diversified visual and language inputs with a high success rate. Codes and demos are available at https://github.com/jxu124/TiO. △ Less

Submitted 18 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted to ICRA 2024

arXiv:2401.16355 [pdf, other]

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

Authors: Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, Jingxiong Li, Xinheng Lyu, Tao Lin, Lin Yang

Abstract: The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmark impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-cho… ▽ More The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmark impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. Significantly, to maximize PathMMU's authority, we invite seven pathologists to scrutinize each question under strict standards in PathMMU's validation and test sets, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and 4 closed-sourced LMMs and their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only a 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-sourced LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that the PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology. △ Less

Submitted 20 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 27 pages, 12 figures

arXiv:2312.11970 [pdf, other]

Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives

Authors: Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, Yong Li

Abstract: Agent-based modeling and simulation has evolved as a powerful tool for modeling complex systems, offering insights into emergent behaviors and interactions among diverse agents. Integrating large language models into agent-based modeling and simulation presents a promising avenue for enhancing simulation capabilities. This paper surveys the landscape of utilizing large language models in agent-bas… ▽ More Agent-based modeling and simulation has evolved as a powerful tool for modeling complex systems, offering insights into emergent behaviors and interactions among diverse agents. Integrating large language models into agent-based modeling and simulation presents a promising avenue for enhancing simulation capabilities. This paper surveys the landscape of utilizing large language models in agent-based modeling and simulation, examining their challenges and promising future directions. In this survey, since this is an interdisciplinary field, we first introduce the background of agent-based modeling and simulation and large language model-empowered agents. We then discuss the motivation for applying large language models to agent-based simulation and systematically analyze the challenges in environment perception, human alignment, action generation, and evaluation. Most importantly, we provide a comprehensive overview of the recent works of large language model-empowered agent-based modeling and simulation in multiple scenarios, which can be divided into four domains: cyber, physical, social, and hybrid, covering simulation of both real-world and virtual environments. Finally, since this area is new and quickly evolving, we discuss the open problems and promising future directions. △ Less

Submitted 19 December, 2023; originally announced December 2023.

Comments: 37 pages

arXiv:2310.10467 [pdf, other]

Stance Detection with Collaborative Role-Infused LLM-Based Agents

Authors: Xiaochong Lan, Chen Gao, Depeng Jin, Yong Li

Abstract: Stance detection automatically detects the stance in a text towards a target, vital for content analysis in web and social media research. Despite their promising capabilities, LLMs encounter challenges when directly applied to stance detection. First, stance detection demands multi-aspect knowledge, from deciphering event-related terminologies to understanding the expression styles in social medi… ▽ More Stance detection automatically detects the stance in a text towards a target, vital for content analysis in web and social media research. Despite their promising capabilities, LLMs encounter challenges when directly applied to stance detection. First, stance detection demands multi-aspect knowledge, from deciphering event-related terminologies to understanding the expression styles in social media platforms. Second, stance detection requires advanced reasoning to infer authors' implicit viewpoints, as stance are often subtly embedded rather than overtly stated in the text. To address these challenges, we design a three-stage framework COLA (short for Collaborative rOle-infused LLM-based Agents) in which LLMs are designated distinct roles, creating a collaborative system where each role contributes uniquely. Initially, in the multidimensional text analysis stage, we configure the LLMs to act as a linguistic expert, a domain specialist, and a social media veteran to get a multifaceted analysis of texts, thus overcoming the first challenge. Next, in the reasoning-enhanced debating stage, for each potential stance, we designate a specific LLM-based agent to advocate for it, guiding the LLM to detect logical connections between text features and stance, tackling the second challenge. Finally, in the stance conclusion stage, a final decision maker agent consolidates prior insights to determine the stance. Our approach avoids extra annotated data and model training and is highly usable. We achieve state-of-the-art performance across multiple datasets. Ablation studies validate the effectiveness of each design role in handling stance detection. Further experiments have demonstrated the explainability and the versatility of our approach. Our approach excels in usability, accuracy, effectiveness, explainability and versatility, highlighting its value. △ Less

Submitted 16 April, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.05694 [pdf, other]

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Authors: Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

Abstract: The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development… ▽ More The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacle of using LLMs in Healthcare are fairness, accountability, transparency and ethics. △ Less

Submitted 11 June, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2308.05938 [pdf, other]

doi 10.1109/TMM.2023.3330047

FoodSAM: Any Food Segmentation

Authors: Xing Lan, Jiayi Lyu, Hanyu Jiang, Kun Dong, Zehai Niu, Yi Zhang, Jian Xue

Abstract: In this paper, we explore the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation. To address the lack of class-specific information in SAM-generated masks, we propose a novel framework, called FoodSAM. This innovative approach integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. Besides, we recognize that the ingre… ▽ More In this paper, we explore the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation. To address the lack of class-specific information in SAM-generated masks, we propose a novel framework, called FoodSAM. This innovative approach integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. Besides, we recognize that the ingredients in food can be supposed as independent individuals, which motivated us to perform instance segmentation on food images. Furthermore, FoodSAM extends its zero-shot capability to encompass panoptic segmentation by incorporating an object detector, which renders FoodSAM to effectively capture non-food object information. Drawing inspiration from the recent success of promptable segmentation, we also extend FoodSAM to promptable segmentation, supporting various prompt variants. Consequently, FoodSAM emerges as an all-encompassing solution capable of segmenting food items at multiple levels of granularity. Remarkably, this pioneering framework stands as the first-ever work to achieve instance, panoptic, and promptable segmentation on food images. Extensive experiments demonstrate the feasibility and impressing performance of FoodSAM, validating SAM's potential as a prominent and influential tool within the domain of food image segmentation. We release our code at https://github.com/jamesjg/FoodSAM. △ Less

Submitted 11 August, 2023; originally announced August 2023.

Comments: Code is available at https://github.com/jamesjg/FoodSAM

arXiv:2308.02831 [pdf, other]

Affective Visualization Design: Leveraging the Emotional Impact of Data

Authors: Xingyu Lan, Yanqiu Wu, Nan Cao

Abstract: In recent years, more and more researchers have reflected on the undervaluation of emotion in data visualization and highlighted the importance of considering human emotion in visualization design. Meanwhile, an increasing number of studies have been conducted to explore emotion-related factors. However, so far, this research area is still in its early stages and faces a set of challenges, such as… ▽ More In recent years, more and more researchers have reflected on the undervaluation of emotion in data visualization and highlighted the importance of considering human emotion in visualization design. Meanwhile, an increasing number of studies have been conducted to explore emotion-related factors. However, so far, this research area is still in its early stages and faces a set of challenges, such as the unclear definition of key concepts, the insufficient justification of why emotion is important in visualization design, and the lack of characterization of the design space of affective visualization design. To address these challenges, first, we conducted a literature review and identified three research lines that examined both emotion and data visualization. We clarified the differences between these research lines and kept 109 papers that studied or discussed how data visualization communicates and influences emotion. Then, we coded the 109 papers in terms of how they justified the legitimacy of considering emotion in visualization design (i.e., why emotion is important) and identified five argumentative perspectives. Based on these papers, we also identified 61 projects that practiced affective visualization design. We coded these design projects in three dimensions, including design fields (where), design tasks (what), and design methods (how), to explore the design space of affective visualization design. △ Less

Submitted 5 August, 2023; originally announced August 2023.

Comments: to appear at IEEE VIS 2023

arXiv:2307.16644 [pdf, other]

doi 10.1145/3580305.3599874

NEON: Living Needs Prediction System in Meituan

Authors: Xiaochong Lan, Chen Gao, Shiqi Wen, Xiuqi Chen, Yingge Che, Han Zhang, Huazhou Wei, Hengliang Luo, Yong Li

Abstract: Living needs refer to the various needs in human's daily lives for survival and well-being, including food, housing, entertainment, etc. On life service platforms that connect users to service providers, such as Meituan, the problem of living needs prediction is fundamental as it helps understand users and boost various downstream applications such as personalized recommendation. However, the prob… ▽ More Living needs refer to the various needs in human's daily lives for survival and well-being, including food, housing, entertainment, etc. On life service platforms that connect users to service providers, such as Meituan, the problem of living needs prediction is fundamental as it helps understand users and boost various downstream applications such as personalized recommendation. However, the problem has not been well explored and is faced with two critical challenges. First, the needs are naturally connected to specific locations and times, suffering from complex impacts from the spatiotemporal context. Second, there is a significant gap between users' actual living needs and their historical records on the platform. To address these two challenges, we design a system of living NEeds predictiON named NEON, consisting of three phases: feature mining, feature fusion, and multi-task prediction. In the feature mining phase, we carefully extract individual-level user features for spatiotemporal modeling, and aggregated-level behavioral features for enriching data, which serve as the basis for addressing two challenges, respectively. Further, in the feature fusion phase, we propose a neural network that effectively fuses two parts of features into the user representation. Moreover, we design a multi-task prediction phase, where the auxiliary task of needs-meeting way prediction can enhance the modeling of spatiotemporal context. Extensive offline evaluations verify that our NEON system can effectively predict users' living needs. Furthermore, we deploy NEON into Meituan's algorithm engine and evaluate how it enhances the three downstream prediction applications, via large-scale online A/B testing. △ Less

Submitted 31 July, 2023; originally announced July 2023.

arXiv:2307.14984 [pdf, other]

S3: Social-network Simulation System with Large Language Model-Empowered Agents

Authors: Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, Yong Li

Abstract: Social network simulation plays a crucial role in addressing various challenges within social science. It offers extensive applications such as state prediction, phenomena explanation, and policy-making support, among others. In this work, we harness the formidable human-like capabilities exhibited by large language models (LLMs) in sensing, reasoning, and behaving, and utilize these qualities to… ▽ More Social network simulation plays a crucial role in addressing various challenges within social science. It offers extensive applications such as state prediction, phenomena explanation, and policy-making support, among others. In this work, we harness the formidable human-like capabilities exhibited by large language models (LLMs) in sensing, reasoning, and behaving, and utilize these qualities to construct the S$^3$ system (short for $\textbf{S}$ocial network $\textbf{S}$imulation $\textbf{S}$ystem). Adhering to the widely employed agent-based simulation paradigm, we employ prompt engineering and prompt tuning techniques to ensure that the agent's behavior closely emulates that of a genuine human within the social network. Specifically, we simulate three pivotal aspects: emotion, attitude, and interaction behaviors. By endowing the agent in the system with the ability to perceive the informational environment and emulate human actions, we observe the emergence of population-level phenomena, including the propagation of information, attitudes, and emotions. We conduct an evaluation encompassing two levels of simulation, employing real-world social network data. Encouragingly, the results demonstrate promising accuracy. This work represents an initial step in the realm of social network simulation empowered by LLM-based agents. We anticipate that our endeavors will serve as a source of inspiration for the development of simulation systems within, but not limited to, social science. △ Less

Submitted 19 October, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

arXiv:2307.11458 [pdf, other]

Strip-MLP: Efficient Token Interaction for Vision MLP

Authors: Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang

Abstract: Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the feature are down-sampled to a small s… ▽ More Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the feature are down-sampled to a small spatial size. To address this issue, we present a novel method called \textbf{Strip-MLP} to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that allows the token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a \textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in the manners of within-patch and cross-patch, which is independent to the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel \textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on CIFAR-100. The source codes will be available at~\href{https://github.com/Med-Process/Strip_MLP{https://github.com/Med-Process/Strip\_MLP}. △ Less

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2307.09193 [pdf, other]

ESMC: Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint

Authors: Zhenhao Jiang, Biao Zeng, Hao Feng, Jin Liu, Jicong Fan, Jie Zhang, Jia Jia, Ning Hu, Xingyu Chen, Xuguang Lan

Abstract: Large-scale online recommender system spreads all over the Internet being in charge of two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimations. However, traditional CVR estimators suffer from well-known Sample Selection Bias and Data Sparsity issues. Entire space models were proposed to address the two issues via tracing the decision-making path of "exposure_clic… ▽ More Large-scale online recommender system spreads all over the Internet being in charge of two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimations. However, traditional CVR estimators suffer from well-known Sample Selection Bias and Data Sparsity issues. Entire space models were proposed to address the two issues via tracing the decision-making path of "exposure_click_purchase". Further, some researchers observed that there are purchase-related behaviors between click and purchase, which can better draw the user's decision-making intention and improve the recommendation performance. Thus, the decision-making path has been extended to "exposure_click_in-shop action_purchase" and can be modeled with conditional probability approach. Nevertheless, we observe that the chain rule of conditional probability does not always hold. We report Probability Space Confusion (PSC) issue and give a derivation of difference between ground-truth and estimation mathematically. We propose a novel Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint (ESMC) and two alternatives: Entire Space Multi-Task Model with Siamese Network (ESMS) and Entire Space Multi-Task Model in Global Domain (ESMG) to address the PSC issue. Specifically, we handle "exposure_click_in-shop action" and "in-shop action_purchase" separately in the light of characteristics of in-shop action. The first path is still treated with conditional probability while the second one is treated with parameter constraint strategy. Experiments on both offline and online environments in a large-scale recommendation system illustrate the superiority of our proposed methods over state-of-the-art models. The real-world datasets will be released. △ Less

Submitted 29 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

arXiv:2304.12592 [pdf, other]

MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes

Authors: Han Wang, Jiayuan Zhang, Lipeng Wan, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract: Manipulation relationship detection (MRD) aims to guide the robot to grasp objects in the right order, which is important to ensure the safety and reliability of grasping in object stacked scenes. Previous works infer manipulation relationship by deep neural network trained with data collected from a predefined view, which has limitation in visual dislocation in unstructured environments. Multi-vi… ▽ More Manipulation relationship detection (MRD) aims to guide the robot to grasp objects in the right order, which is important to ensure the safety and reliability of grasping in object stacked scenes. Previous works infer manipulation relationship by deep neural network trained with data collected from a predefined view, which has limitation in visual dislocation in unstructured environments. Multi-view data provide more comprehensive information in space, while a challenge of multi-view MRD is domain shift. In this paper, we propose a novel multi-view fusion framework, namely multi-view MRD network (MMRDN), which is trained by 2D and 3D multi-view data. We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions to learn the consistent representations. Besides, taking advantage of position information within the 3D data, we select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects. Finally, the features of multi-view 2D and 3D data are concatenated to predict the pairwise relationship of objects. Experimental results on the challenging REGRAD dataset show that MMRDN outperforms the state-of-the-art methods in multi-view MRD tasks. The results also demonstrate that our model trained by synthetic data is capable to transfer to real-world scenarios. △ Less

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.01171 [pdf, other]

Revisiting Context Aggregation for Image Matting

Authors: Qinglin Liu, Xiaoqian Lv, Quanling Meng, Zonglin Li, Xiangyuan Lan, Shuo Yang, Shengping Zhang, Liqiang Nie

Abstract: Traditional studies emphasize the significance of context information in improving matting performance. Consequently, deep learning-based matting methods delve into designing pooling or affinity-based context aggregation modules to achieve superior results. However, these modules cannot well handle the context scale shift caused by the difference in image size during training and inference, result… ▽ More Traditional studies emphasize the significance of context information in improving matting performance. Consequently, deep learning-based matting methods delve into designing pooling or affinity-based context aggregation modules to achieve superior results. However, these modules cannot well handle the context scale shift caused by the difference in image size during training and inference, resulting in matting performance degradation. In this paper, we revisit the context aggregation mechanisms of matting networks and find that a basic encoder-decoder network without any context aggregation modules can actually learn more universal context aggregation, thereby achieving higher matting performance compared to existing methods. Building on this insight, we present AEMatter, a matting network that is straightforward yet very effective. AEMatter adopts a Hybrid-Transformer backbone with appearance-enhanced axis-wise learning (AEAL) blocks to build a basic network with strong context aggregation learning capability. Furthermore, AEMatter leverages a large image training strategy to assist the network in learning context aggregation from data. Extensive experiments on five popular matting datasets demonstrate that the proposed AEMatter outperforms state-of-the-art matting methods by a large margin. △ Less

Submitted 14 May, 2024; v1 submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.17408 [pdf, other]

P-Transformer: A Prompt-based Multimodal Transformer Architecture For Medical Tabular Data

Authors: Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng

Abstract: Medical tabular data, abundant in Electronic Health Records (EHRs), is a valuable resource for diverse medical tasks such as risk prediction. While deep learning approaches, particularly transformer-based models, have shown remarkable performance in tabular data prediction, there are still problems remained for existing work to be effectively adapted into medical domain, such as under-utilization… ▽ More Medical tabular data, abundant in Electronic Health Records (EHRs), is a valuable resource for diverse medical tasks such as risk prediction. While deep learning approaches, particularly transformer-based models, have shown remarkable performance in tabular data prediction, there are still problems remained for existing work to be effectively adapted into medical domain, such as under-utilization of unstructured free-texts, limited exploration of textual information in structured data, and data corruption. To address these issues, we propose P-Transformer, a Prompt-based multimodal Transformer architecture designed specifically for medical tabular data. This framework consists two critical components: a tabular cell embedding generator and a tabular transformer. The former efficiently encodes diverse modalities from both structured and unstructured tabular data into a harmonized language semantic space with the help of pre-trained sentence encoder and medical prompts. The latter integrates cell representations to generate patient embeddings for various medical tasks. In comprehensive experiments on two real-world datasets for three medical tasks, P-Transformer demonstrated the improvements with 10.9%/11.0% on RMSE/MAE, 0.5%/2.2% on RMSE/MAE, and 1.6%/0.8% on BACC/AUROC compared to state-of-the-art (SOTA) baselines in predictability. Notably, the model exhibited strong resilience to data corruption in the structured data, particularly when the corruption rates are high. △ Less

Submitted 9 January, 2024; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2303.07828 [pdf, other]

Prioritized Planning for Target-Oriented Manipulation via Hierarchical Stacking Relationship Prediction

Authors: Zewen Wu, Jian Tang, Xingyu Chen, Chengzhong Ma, Xuguang Lan, Nanning Zheng

Abstract: In scenarios involving the grasping of multiple targets, the learning of stacking relationships between objects is fundamental for robots to execute safely and efficiently. However, current methods lack subdivision for the hierarchy of stacking relationship types. In scenes where objects are mostly stacked in an orderly manner, they are incapable of performing human-like and high-efficient graspin… ▽ More In scenarios involving the grasping of multiple targets, the learning of stacking relationships between objects is fundamental for robots to execute safely and efficiently. However, current methods lack subdivision for the hierarchy of stacking relationship types. In scenes where objects are mostly stacked in an orderly manner, they are incapable of performing human-like and high-efficient grasping decisions. This paper proposes a perception-planning method to distinguish different stacking types between objects and generate prioritized manipulation order decisions based on given target designations. We utilize a Hierarchical Stacking Relationship Network (HSRN) to discriminate the hierarchy of stacking and generate a refined Stacking Relationship Tree (SRT) for relationship description. Considering that objects with high stacking stability can be grasped together if necessary, we introduce an elaborate decision-making planner based on the Partially Observable Markov Decision Process (POMDP), which leverages observations and generates the least grasp-consuming decision chain with robustness and is suitable for simultaneously specifying multiple targets. To verify our work, we set the scene to the dining table and augment the REGRAD dataset with a set of common tableware models for network training. Experiments show that our method effectively generates grasping decisions that conform to human requirements, and improves the implementation efficiency compared with existing methods on the basis of guaranteeing the success rate. △ Less

Submitted 25 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: 8 pages, 8 figures. Accepted by 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

arXiv:2302.03357 [pdf, other]

Towards Enhancing Time Series Contrastive Learning: A Dynamic Bad Pair Mining Approach

Authors: Xiang Lan, Hanshu Yan, Shenda Hong, Mengling Feng

Abstract: Not all positive pairs are beneficial to time series contrastive learning. In this paper, we study two types of bad positive pairs that can impair the quality of time series representation learned through contrastive learning: the noisy positive pair and the faulty positive pair. We observe that, with the presence of noisy positive pairs, the model tends to simply learn the pattern of noise (Noisy… ▽ More Not all positive pairs are beneficial to time series contrastive learning. In this paper, we study two types of bad positive pairs that can impair the quality of time series representation learned through contrastive learning: the noisy positive pair and the faulty positive pair. We observe that, with the presence of noisy positive pairs, the model tends to simply learn the pattern of noise (Noisy Alignment). Meanwhile, when faulty positive pairs arise, the model wastes considerable amount of effort aligning non-representative patterns (Faulty Alignment). To address this problem, we propose a Dynamic Bad Pair Mining (DBPM) algorithm, which reliably identifies and suppresses bad positive pairs in time series contrastive learning. Specifically, DBPM utilizes a memory module to dynamically track the training behavior of each positive pair along training process. This allows us to identify potential bad positive pairs at each epoch based on their historical training behaviors. The identified bad pairs are subsequently down-weighted through a transformation module, thereby mitigating their negative impact on the representation learning process. DBPM is a simple algorithm designed as a lightweight plug-in without learnable parameters to enhance the performance of existing state-of-the-art methods. Through extensive experiments conducted on four large-scale, real-world time series datasets, we demonstrate DBPM's efficacy in mitigating the adverse effects of bad positive pairs. △ Less

Submitted 28 March, 2024; v1 submitted 7 February, 2023; originally announced February 2023.

Comments: ICLR 2024 Camera Ready (https://openreview.net/pdf?id=K2c04ulKXn)

arXiv:2211.12075 [pdf, other]

Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Authors: Lipeng Wan, Zeyang Liu, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract: Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they can not ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive th… ▽ More Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they can not ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration. △ Less

Submitted 22 November, 2022; originally announced November 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2112.04454

arXiv:2211.03296 [pdf, other]

The Chart Excites Me! Exploring How Data Visualization Design Influences Affective Arousal

Authors: Xingyu Lan, Yanqiu Wu, Qing Chen, Nan Cao

Abstract: As data visualizations have been increasingly applied in mass communication, designers often seek to grasp viewers immediately and motivate them to read more. Such goals, as suggested by previous research, are closely associated with the activation of emotion, namely affective arousal. Given this motivation, this work takes initial steps toward understanding the arousal-related factors in data vis… ▽ More As data visualizations have been increasingly applied in mass communication, designers often seek to grasp viewers immediately and motivate them to read more. Such goals, as suggested by previous research, are closely associated with the activation of emotion, namely affective arousal. Given this motivation, this work takes initial steps toward understanding the arousal-related factors in data visualization design. We collected a corpus of 265 data visualizations and conducted a crowdsourcing study with 184 participants during which the participants were asked to rate the affective arousal elicited by data visualization design (all texts were blurred to exclude the influence of semantics) and provide their reasons. Based on the collected data, first, we identified a set of arousal-related design features by analyzing user comments qualitatively. Then, we mapped these features to computable variables and constructed regression models to infer which features are significant contributors to affective arousal quantitatively. Through this exploratory study, we finally identified four design features (e.g., colorfulness, the number of different visual channels) cross-validated as important features correlated with affective arousal. △ Less

Submitted 6 November, 2022; originally announced November 2022.

arXiv:2209.03642 [pdf, other]

VizBelle: A Design Space of Embellishments for Data Visualization

Authors: Qing Chen, Ziyan Liu, Chengwei Wang, Xingyu Lan, Ying Chen, Siming Chen, Nan Cao

Abstract: Visual embellishments, as a form of non-linguistic rhetorical figures, are used to help convey abstract concepts or attract readers' attention. Creating data visualizations with appropriate and visually pleasing embellishments is challenging since this process largely depends on the experience and the aesthetic taste of designers. To help facilitate designers in the ideation and creation process,… ▽ More Visual embellishments, as a form of non-linguistic rhetorical figures, are used to help convey abstract concepts or attract readers' attention. Creating data visualizations with appropriate and visually pleasing embellishments is challenging since this process largely depends on the experience and the aesthetic taste of designers. To help facilitate designers in the ideation and creation process, we propose a design space, VizBelle, based on the analysis of 361 classified visualizations from online sources. VizBelle consists of four dimensions, namely, communication goal to fit user intention, object to select the target area, strategy and technique to offer potential approaches. We further provide a website to present detailed explanations and examples of various techniques. We conducted a within-subject study with 20 professional and amateur design enthusiasts to evaluate the effectiveness of our design space. Results show that our design space is illuminating and useful for designers to create data visualizations with embellishments. △ Less

Submitted 8 September, 2022; originally announced September 2022.

arXiv:2208.07156 [pdf, other]

Cooperative guidance of multiple missiles: a hybrid co-evolutionary approach

Authors: Xuejing Lan, Junda Chen, Zhijia Zhao, Tao Zou

Abstract: Cooperative guidance of multiple missiles is a challenging task with rigorous constraints of time and space consensus, especially when attacking dynamic targets. In this paper, the cooperative guidance task is described as a distributed multi-objective cooperative optimization problem. To address the issues of non-stationarity and continuous control faced by cooperative guidance, the natural evolu… ▽ More Cooperative guidance of multiple missiles is a challenging task with rigorous constraints of time and space consensus, especially when attacking dynamic targets. In this paper, the cooperative guidance task is described as a distributed multi-objective cooperative optimization problem. To address the issues of non-stationarity and continuous control faced by cooperative guidance, the natural evolutionary strategy (NES) is improved along with an elitist adaptive learning technique to develop a novel natural co-evolutionary strategy (NCES). The gradients of original evolutionary strategy are rescaled to reduce the estimation bias caused by the interaction between the multiple missiles. Then, a hybrid co-evolutionary cooperative guidance law (HCCGL) is proposed by integrating the highly scalable co-evolutionary mechanism and the traditional guidance strategy. Finally, three simulations under different conditions demonstrate the effectiveness and superiority of this guidance law in solving cooperative guidance tasks with high accuracy. The proposed co-evolutionary approach has great prospects not only in cooperative guidance, but also in other application scenarios of multi-objective optimization, dynamic optimization and distributed control. △ Less

Submitted 14 April, 2023; v1 submitted 15 August, 2022; originally announced August 2022.

ACM Class: F.2.2; J.2

arXiv:2208.04518 [pdf]

doi 10.1145/3555585

A Mixed-Methods Analysis of the Algorithm-Mediated Labor of Online Food Deliverers in China

Authors: Zhilong Chen, Xiaochong Lan, Jinghua Piao, Yunke Zhang, Yong Li

Abstract: In recent years, China has witnessed the proliferation and success of the online food delivery industry, an emerging type of the gig economy. Online food deliverers who deliver the food from restaurants to customers play a critical role in enabling this industry. Mediated by algorithms and coupled with interactions with multiple stakeholders, this emerging kind of labor has been taken by millions… ▽ More In recent years, China has witnessed the proliferation and success of the online food delivery industry, an emerging type of the gig economy. Online food deliverers who deliver the food from restaurants to customers play a critical role in enabling this industry. Mediated by algorithms and coupled with interactions with multiple stakeholders, this emerging kind of labor has been taken by millions of people. In this paper, we present a mixed-methods analysis to investigate this labor of online food deliverers and uncover how the mediation of algorithms shapes it. Combining large-scale quantitative data-driven investigations of 100,000 deliverers' behavioral data with in-depth qualitative interviews with 15 online food deliverers, we demonstrate their working activities, identify how algorithms mediate their delivery procedures, and reveal how they perceive their relationships with different stakeholders as a result of their algorithm-mediated labor. Our findings provide important implications for enabling better experiences and more humanized labor of deliverers as well as workers in gig economies of similar kinds. △ Less

Submitted 26 August, 2022; v1 submitted 8 August, 2022; originally announced August 2022.

Comments: Accepted to CSCW 2022

arXiv:2208.04122 [pdf]

doi 10.1145/3555646

Practitioners Versus Users: A Value-Sensitive Evaluation of Current Industrial Recommender System Design

Authors: Zhilong Chen, Jinghua Piao, Xiaochong Lan, Hancheng Cao, Chen Gao, Zhicong Lu, Yong Li

Abstract: Recommender systems are playing an increasingly important role in alleviating information overload and supporting users' various needs, e.g., consumption, socialization, and entertainment. However, limited research focuses on how values should be extensively considered in industrial deployments of recommender systems, the ignorance of which can be problematic. To fill this gap, in this paper, we a… ▽ More Recommender systems are playing an increasingly important role in alleviating information overload and supporting users' various needs, e.g., consumption, socialization, and entertainment. However, limited research focuses on how values should be extensively considered in industrial deployments of recommender systems, the ignorance of which can be problematic. To fill this gap, in this paper, we adopt Value Sensitive Design to comprehensively explore how practitioners and users recognize different values of current industrial recommender systems. Based on conceptual and empirical investigations, we focus on five values: recommendation quality, privacy, transparency, fairness, and trustworthiness. We further conduct in-depth qualitative interviews with 20 users and 10 practitioners to delve into their opinions about these values. Our results reveal the existence and sources of tensions between practitioners and users in terms of value interpretation, evaluation, and practice, which provide novel implications for designing more human-centric and value-sensitive recommender systems. △ Less

Submitted 26 August, 2022; v1 submitted 8 August, 2022; originally announced August 2022.

Comments: Zhilong Chen and Jinghua Piao contribute equally to this work; Accepted to CSCW 2022

arXiv:2207.08794 [pdf, other]

DeFlowSLAM: Self-Supervised Scene Motion Decomposition for Dynamic Dense SLAM

Authors: Weicai Ye, Xingyuan Yu, Xinyue Lan, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Abstract: We present a novel dual-flow representation of scene motion that decomposes the optical flow into a static flow field caused by the camera motion and another dynamic flow field caused by the objects' movements in the scene. Based on this representation, we present a dynamic SLAM, dubbed DeFlowSLAM, that exploits both static and dynamic pixels in the images to solve the camera poses, rather than si… ▽ More We present a novel dual-flow representation of scene motion that decomposes the optical flow into a static flow field caused by the camera motion and another dynamic flow field caused by the objects' movements in the scene. Based on this representation, we present a dynamic SLAM, dubbed DeFlowSLAM, that exploits both static and dynamic pixels in the images to solve the camera poses, rather than simply using static background pixels as other dynamic SLAM systems do. We propose a dynamic update module to train our DeFlowSLAM in a self-supervised manner, where a dense bundle adjustment layer takes in estimated static flow fields and the weights controlled by the dynamic mask and outputs the residual of the optimized static flow fields, camera poses, and inverse depths. The static and dynamic flow fields are estimated by warping the current image to the neighboring images, and the optical flow can be obtained by summing the two fields. Extensive experiments demonstrate that DeFlowSLAM generalizes well to both static and dynamic scenes as it exhibits comparable performance to the state-of-the-art DROID-SLAM in static and less dynamic scenes while significantly outperforming DROID-SLAM in highly dynamic environments. The code and pre-trained model will be available on the project webpage: \urlstyle{tt} \textcolor{url_color}{\url{https://zju3dv.github.io/deflowslam/}}. △ Less

Submitted 13 January, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

Comments: Homepage: https://zju3dv.github.io/deflowslam

arXiv:2207.02705 [pdf, other]

Incentivizing Proof-of-Stake Blockchain for Secured Data Collection in UAV-Assisted IoT: A Multi-Agent Reinforcement Learning Approach

Authors: Xiao Tang, Xunqiang Lan, Lixin Li, Yan Zhang, Zhu Han

Abstract: The Internet of Things (IoT) can be conveniently deployed while empowering various applications, where the IoT nodes can form clusters to finish certain missions collectively. In this paper, we propose to employ unmanned aerial vehicles (UAVs) to assist the clustered IoT data collection with blockchain-based security provisioning. In particular, the UAVs generate candidate blocks based on the coll… ▽ More The Internet of Things (IoT) can be conveniently deployed while empowering various applications, where the IoT nodes can form clusters to finish certain missions collectively. In this paper, we propose to employ unmanned aerial vehicles (UAVs) to assist the clustered IoT data collection with blockchain-based security provisioning. In particular, the UAVs generate candidate blocks based on the collected data, which are then audited through a lightweight proof-of-stake consensus mechanism within the UAV-based blockchain network. To motivate efficient blockchain while reducing the operational cost, a stake pool is constructed at the active UAV while encouraging stake investment from other UAVs with profit sharing. The problem is formulated to maximize the overall profit through the blockchain system in unit time by jointly investigating the IoT transmission, incentives through investment and profit sharing, and UAV deployment strategies. Then, the problem is solved in a distributed manner while being decoupled into two layers. The inner layer incorporates IoT transmission and incentive design, which are tackled with large-system approximation and one-leader-multi-follower Stackelberg game analysis, respectively. The outer layer for UAV deployment is undertaken with a multi-agent deep deterministic policy gradient approach. Results show the convergence of the proposed learning process and the UAV deployment, and also demonstrated is the performance superiority of our proposal as compared with the baselines. △ Less

Submitted 6 July, 2022; originally announced July 2022.

Comments: 14 pages, 10 figures, submitted to IEEE Journal

arXiv:2207.01610 [pdf, other]

PVO: Panoptic Visual Odometry

Authors: Weicai Ye, Xinyue Lan, Shuo Chen, Yuhang Ming, Xingyuan Yu, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Abstract: We present PVO, a novel panoptic visual odometry framework to achieve more comprehensive modeling of the scene motion, geometry, and panoptic segmentation information. Our PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which makes the two tasks mutually beneficial. Specifically, we introduce a panoptic update module into the VO Module with the guidance of… ▽ More We present PVO, a novel panoptic visual odometry framework to achieve more comprehensive modeling of the scene motion, geometry, and panoptic segmentation information. Our PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which makes the two tasks mutually beneficial. Specifically, we introduce a panoptic update module into the VO Module with the guidance of image panoptic segmentation. This Panoptic-Enhanced VO Module can alleviate the impact of dynamic objects in the camera pose estimation with a panoptic-aware dynamic mask. On the other hand, the VO-Enhanced VPS Module also improves the segmentation accuracy by fusing the panoptic segmentation result of the current frame on the fly to the adjacent frames, using geometric information such as camera pose, depth, and optical flow obtained from the VO Module. These two modules contribute to each other through recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks. △ Less

Submitted 26 March, 2023; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: CVPR2023 Project page: https://zju3dv.github.io/pvo/ code: https://github.com/zju3dv/PVO

arXiv:2203.05243 [pdf, other]

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach

Authors: Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu

Abstract: Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve SOTA performance. In this paper, we take a closer… ▽ More Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve SOTA performance. In this paper, we take a closer look at existing evaluation protocols, and find both the prevailing dataset and evaluation metrics are the devils that lead to untrustworthy benchmarking. Therefore, we propose to re-organize the two widely-used datasets, making the ground-truth moment distributions different in the training and test splits, i.e., out-of-distribution (OOD) test. Meanwhile, we introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets. New benchmarking results indicate that our proposed evaluation protocols can better monitor the research progress. Furthermore, we propose a novel causality-based Multi-branch Deconfounding Debiasing (MDD) framework for unbiased moment prediction. Specifically, we design a multi-branch deconfounder to eliminate the effects caused by multiple confounders with causal intervention. In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding. Specifically, for textual information, the query is parsed into several verb-centered phrases to obtain a more fine-grained textual feature. For visual information, the positional information has been decomposed from moment features to enhance representations of moments with diverse locations. Extensive experiments demonstrate that our proposed approach can achieve competitive results among existing SOTA approaches and outperform the base model with great gains. △ Less

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2203.01217 [pdf, other]

Hybrid Tracker with Pixel and Instance for Video Panoptic Segmentation

Authors: Weicai Ye, Xinyue Lan, Ge Su, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Abstract: Video Panoptic Segmentation (VPS) aims to generate coherent panoptic segmentation and track the identities of all pixels across video frames. Existing methods predominantly utilize the trained instance embedding to keep the consistency of panoptic segmentation. However, they inevitably struggle to cope with the challenges of small objects, similar appearance but inconsistent identities, occlusion,… ▽ More Video Panoptic Segmentation (VPS) aims to generate coherent panoptic segmentation and track the identities of all pixels across video frames. Existing methods predominantly utilize the trained instance embedding to keep the consistency of panoptic segmentation. However, they inevitably struggle to cope with the challenges of small objects, similar appearance but inconsistent identities, occlusion, and strong instance contour deformations. To address these problems, we present HybridTracker, a lightweight and joint tracking model attempting to eliminate the limitations of the single tracker. HybridTracker performs pixel tracker and instance tracker in parallel to obtain the association matrices, which are fused into a matching matrix. In the instance tracker, we design a differentiable matching layer, ensuring the stability of inter-frame matching. In the pixel tracker, we compute the dice coefficient of the same instance of different frames given the estimated optical flow, forming the Intersection Over Union (IoU) matrix. We additionally propose mutual check and temporal consistency constraints during inference to settle the occlusion and contour deformation challenges. Comprehensive experiments show that HybridTracker achieves superior performance than state-of-the-art methods on Cityscapes-VPS and VIPER datasets. △ Less

Submitted 11 December, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2203.00865 [pdf]

Beyond Virtual Bazaar: How Social Commerce Promotes Inclusivity for the Traditionally Underserved Community in Chinese Developing Regions

Authors: Zhilong Chen, Hancheng Cao, Xiaochong Lan, Zhicong Lu, Yong Li

Abstract: The disadvantaged population is often underserved and marginalized in technology engagement: prior works show they are generally more reluctant and experience more barriers in adopting and engaging with mainstream technology. Here, we contribute to the HCI4D and ICTD literature through a novel "counter" case study on Chinese social commerce (e.g., Pinduoduo), which 1) first prospers among the trad… ▽ More The disadvantaged population is often underserved and marginalized in technology engagement: prior works show they are generally more reluctant and experience more barriers in adopting and engaging with mainstream technology. Here, we contribute to the HCI4D and ICTD literature through a novel "counter" case study on Chinese social commerce (e.g., Pinduoduo), which 1) first prospers among the traditionally underserved community from developing regions ahead of the more technologically advantaged communities, and 2) has been heavily engaged by this community. Through 12 in-depth interviews with social commerce users from the traditionally underserved community in Chinese developing regions, we demonstrate how social commerce, acting as a "counter", brings online the traditional offline socioeconomic lives the community has lived for ages, fits into the community's social, cultural, and economic context, and thus effectively promotes technology inclusivity. Our work provides novel insights and implications for building inclusive technology for the "next billion" population. △ Less

Submitted 1 March, 2022; originally announced March 2022.

Comments: Zhilong Chen and Hancheng Cao contribute equally to this work; Accepted to CHI 2022

arXiv:2202.03631 [pdf, ps, other]

Robotic Grasping from Classical to Modern: A Survey

Authors: Hanbo Zhang, Jian Tang, Shiguang Sun, Xuguang Lan

Abstract: Robotic Grasping has always been an active topic in robotics since grasping is one of the fundamental but most challenging skills of robots. It demands the coordination of robotic perception, planning, and control for robustness and intelligence. However, current solutions are still far behind humans, especially when confronting unstructured scenarios. In this paper, we survey the advances of robo… ▽ More Robotic Grasping has always been an active topic in robotics since grasping is one of the fundamental but most challenging skills of robots. It demands the coordination of robotic perception, planning, and control for robustness and intelligence. However, current solutions are still far behind humans, especially when confronting unstructured scenarios. In this paper, we survey the advances of robotic grasping, starting from the classical formulations and solutions to the modern ones. By reviewing the history of robotic grasping, we want to provide a complete view of this community, and perhaps inspire the combination and fusion of different ideas, which we think would be helpful to touch and explore the essence of robotic grasping problems. In detail, we firstly give an overview of the analytic methods for robotic grasping. After that, we provide a discussion on the recent state-of-the-art data-driven grasping approaches rising in recent years. With the development of computer vision, semantic grasping is being widely investigated and can be the basis of intelligent manipulation and skill learning for autonomous robotic systems in the future. Therefore, in our survey, we also briefly review the recent progress in this topic. Finally, we discuss the open problems and the future research directions that may be important for the human-level robustness, autonomy, and intelligence of robots. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2112.04454 [pdf, other]

Greedy-based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Authors: Lipeng Wan, Zeyang Liu, Xingyu Chen, Han Wang, Xuguang Lan

Abstract: Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they can not ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive th… ▽ More Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they can not ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration. △ Less

Submitted 3 July, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

arXiv:2112.01078 [pdf, other]

Multi-Agent Intention Sharing via Leader-Follower Forest

Authors: Zeyang Liu, Lipeng Wan, Xue sui, Kewu Sun, Xuguang Lan

Abstract: Intention sharing is crucial for efficient cooperation under partially observable environments in multi-agent reinforcement learning (MARL). However, message deceiving, i.e., a mismatch between the propagated intentions and the final decisions, may happen when agents change strategies simultaneously according to received intentions. Message deceiving leads to potential miscoordination and difficul… ▽ More Intention sharing is crucial for efficient cooperation under partially observable environments in multi-agent reinforcement learning (MARL). However, message deceiving, i.e., a mismatch between the propagated intentions and the final decisions, may happen when agents change strategies simultaneously according to received intentions. Message deceiving leads to potential miscoordination and difficulty for policy learning. This paper proposes the leader-follower forest (LFF) to learn the hierarchical relationship between agents based on interdependencies, achieving one-sided intention sharing in multi-agent communication. By limiting the flowings of intentions through directed edges, intention sharing via LFF (IS-LFF) can eliminate message deceiving effectively and achieve better coordination. In addition, a twostage learning algorithm is proposed to train the forest and the agent network. We evaluate IS-LFF on multiple partially observable MARL benchmarks, and the experimental results show that our method outperforms state-of-the-art communication algorithms. △ Less

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2109.08908 [pdf, other]

Intra-Inter Subject Self-supervised Learning for Multivariate Cardiac Signals

Authors: Xiang Lan, Dianwen Ng, Shenda Hong, Mengling Feng

Abstract: Learning information-rich and generalizable representations effectively from unlabeled multivariate cardiac signals to identify abnormal heart rhythms (cardiac arrhythmias) is valuable in real-world clinical settings but often challenging due to its complex temporal dynamics. Cardiac arrhythmias can vary significantly in temporal patterns even for the same patient ($i.e.$, intra subject difference… ▽ More Learning information-rich and generalizable representations effectively from unlabeled multivariate cardiac signals to identify abnormal heart rhythms (cardiac arrhythmias) is valuable in real-world clinical settings but often challenging due to its complex temporal dynamics. Cardiac arrhythmias can vary significantly in temporal patterns even for the same patient ($i.e.$, intra subject difference). Meanwhile, the same type of cardiac arrhythmia can show different temporal patterns among different patients due to different cardiac structures ($i.e.$, inter subject difference). In this paper, we address the challenges by proposing an Intra-inter Subject self-supervised Learning (ISL) model that is customized for multivariate cardiac signals. Our proposed ISL model integrates medical knowledge into self-supervision to effectively learn from intra-inter subject differences. In intra subject self-supervision, ISL model first extracts heartbeat-level features from each subject using a channel-wise attentional CNN-RNN encoder. Then a stationarity test module is employed to capture the temporal dependencies between heartbeats. In inter subject self-supervision, we design a set of data augmentations according to the clinical characteristics of cardiac signals and perform contrastive learning among subjects to learn distinctive representations for various types of patients. Extensive experiments on three real-world datasets were conducted. In a semi-supervised transfer learning scenario, our pre-trained ISL model leads about 10% improvement over supervised training when only 1% labeled data is available, suggesting strong generalizability and robustness of the model. △ Less

Submitted 18 September, 2021; originally announced September 2021.

Comments: preliminary version

arXiv:2109.08903 [pdf, other]

Density-based Curriculum for Multi-goal Reinforcement Learning with Sparse Rewards

Authors: Deyu Yang, Hanbo Zhang, Xuguang Lan, Jishiyu Ding

Abstract: Multi-goal reinforcement learning (RL) aims to qualify the agent to accomplish multi-goal tasks, which is of great importance in learning scalable robotic manipulation skills. However, reward engineering always requires strenuous efforts in multi-goal RL. Moreover, it will introduce inevitable bias causing the suboptimality of the final policy. The sparse reward provides a simple yet efficient way… ▽ More Multi-goal reinforcement learning (RL) aims to qualify the agent to accomplish multi-goal tasks, which is of great importance in learning scalable robotic manipulation skills. However, reward engineering always requires strenuous efforts in multi-goal RL. Moreover, it will introduce inevitable bias causing the suboptimality of the final policy. The sparse reward provides a simple yet efficient way to overcome such limits. Nevertheless, it harms the exploration efficiency and even hinders the policy from convergence. In this paper, we propose a density-based curriculum learning method for efficient exploration with sparse rewards and better generalization to desired goal distribution. Intuitively, our method encourages the robot to gradually broaden the frontier of its ability along the directions to cover the entire desired goal space as much and quickly as possible. To further improve data efficiency and generality, we augment the goals and transitions within the allowed region during training. Finally, We evaluate our method on diversified variants of benchmark manipulation tasks that are challenging for existing methods. Empirical results show that our method outperforms the state-of-the-art baselines in terms of both data efficiency and success rate. △ Less

Submitted 24 September, 2021; v1 submitted 18 September, 2021; originally announced September 2021.

Comments: 8 pages, 7 figures

arXiv:2109.08039 [pdf, other]

A Survey on Temporal Sentence Grounding in Videos

Authors: Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu

Abstract: Temporal sentence grounding in videos(TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attentions in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural languages, without restrictions… ▽ More Temporal sentence grounding in videos(TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attentions in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural languages, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between two modalities(i.e., text and video). In this survey, we give a comprehensive overview for TSGV, which i) summarizes the taxonomy of existing methods, ii) provides a detailed description of the evaluation protocols(i.e., datasets and metrics) to be used in TSGV, and iii) in-depth discusses potential problems of current benchmarking designs and research directions for further investigations. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, end-to-end methods, reinforcement learning-based methods, and weakly supervised methods. Then we present the benchmark datasets and evaluation metrics to assess current research progress. Finally, we discuss some limitations in TSGV through pointing out potential problems improperly resolved in the current evaluation protocols, which may push forwards more cutting edge research in TSGV. Besides, we also share our insights on several promising directions, including three typical tasks with new and practical settings based on TSGV. △ Less

Submitted 16 September, 2021; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: 32 pages with 19 figures

arXiv:2108.12863 [pdf, other]

MBDF-Net: Multi-Branch Deep Fusion Network for 3D Object Detection

Authors: Xun Tan, Xingyu Chen, Guowei Zhang, Jishiyu Ding, Xuguang Lan

Abstract: Point clouds and images could provide complementary information when representing 3D objects. Fusing the two kinds of data usually helps to improve the detection results. However, it is challenging to fuse the two data modalities, due to their different characteristics and the interference from the non-interest areas. To solve this problem, we propose a Multi-Branch Deep Fusion Network (MBDF-Net)… ▽ More Point clouds and images could provide complementary information when representing 3D objects. Fusing the two kinds of data usually helps to improve the detection results. However, it is challenging to fuse the two data modalities, due to their different characteristics and the interference from the non-interest areas. To solve this problem, we propose a Multi-Branch Deep Fusion Network (MBDF-Net) for 3D object detection. The proposed detector has two stages. In the first stage, our multi-branch feature extraction network utilizes Adaptive Attention Fusion (AAF) modules to produce cross-modal fusion features from single-modal semantic features. In the second stage, we use a region of interest (RoI) -pooled fusion module to generate enhanced local features for refinement. A novel attention-based hybrid sampling strategy is also proposed for selecting key points in the downsampling process. We evaluate our approach on two widely used benchmark datasets including KITTI and SUN-RGBD. The experimental results demonstrate the advantages of our method over state-of-the-art approaches. △ Less

Submitted 29 August, 2021; originally announced August 2021.

arXiv:2108.11092 [pdf, other]

INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

Authors: Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, Nanning Zheng

Abstract: This paper presents INVIGORATE, a robot system that interacts with human through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) infer the target object among other occluding objects, from input language expressions and RGB images, (ii) infer object blocking relationships… ▽ More This paper presents INVIGORATE, a robot system that interacts with human through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) infer the target object among other occluding objects, from input language expressions and RGB images, (ii) infer object blocking relationships (OBRs) from the images, and (iii) synthesize a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human languages are inevitable and negatively impact the robot's performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU. △ Less

Submitted 7 January, 2024; v1 submitted 25 August, 2021; originally announced August 2021.

Comments: 12 pages, full version

arXiv:2107.06564 [pdf, other]

Probabilistic Human Motion Prediction via A Bayesian Neural Network

Authors: Jie Xu, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract: Human motion prediction is an important and challenging topic that has promising prospects in efficient and safe human-robot-interaction systems. Currently, the majority of the human motion prediction algorithms are based on deterministic models, which may lead to risky decisions for robots. To solve this problem, we propose a probabilistic model for human motion prediction in this paper. The key… ▽ More Human motion prediction is an important and challenging topic that has promising prospects in efficient and safe human-robot-interaction systems. Currently, the majority of the human motion prediction algorithms are based on deterministic models, which may lead to risky decisions for robots. To solve this problem, we propose a probabilistic model for human motion prediction in this paper. The key idea of our approach is to extend the conventional deterministic motion prediction neural network to a Bayesian one. On one hand, our model could generate several future motions when given an observed motion sequence. On the other hand, by calculating the Epistemic Uncertainty and the Heteroscedastic Aleatoric Uncertainty, our model could tell the robot if the observation has been seen before and also give the optimal result among all possible predictions. We extensively validate our approach on a large scale benchmark dataset Human3.6m. The experiments show that our approach performs better than deterministic methods. We further evaluate our approach in a Human-Robot-Interaction (HRI) scenario. The experimental results show that our approach makes the interaction more efficient and safer. △ Less

Submitted 14 July, 2021; originally announced July 2021.

Comments: Accepted at ICRA 2021

arXiv:2106.10262 [pdf, other]

A Probabilistic Representation of DNNs: Bridging Mutual Information and Generalization

Authors: Xinjie Lan, Kenneth Barner

Abstract: Recently, Mutual Information (MI) has attracted attention in bounding the generalization error of Deep Neural Networks (DNNs). However, it is intractable to accurately estimate the MI in DNNs, thus most previous works have to relax the MI bound, which in turn weakens the information theoretic explanation for generalization. To address the limitation, this paper introduces a probabilistic represent… ▽ More Recently, Mutual Information (MI) has attracted attention in bounding the generalization error of Deep Neural Networks (DNNs). However, it is intractable to accurately estimate the MI in DNNs, thus most previous works have to relax the MI bound, which in turn weakens the information theoretic explanation for generalization. To address the limitation, this paper introduces a probabilistic representation of DNNs for accurately estimating the MI. Leveraging the proposed MI estimator, we validate the information theoretic explanation for generalization, and derive a tighter generalization bound than the state-of-the-art relaxations. △ Less

Submitted 18 June, 2021; originally announced June 2021.

Comments: To appear in the ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI

Showing 1–50 of 85 results for author: Lan, X