-
Towards Localized Fine-Grained Control for Facial Expression Generation
Authors:
Tuomas Varanka,
Huai-Qian Khor,
Yante Li,
Mengting Wei,
Hanwei Kung,
Nicu Sebe,
Guoying Zhao
Abstract:
Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate…
▽ More
Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity. Other basic expressions like anger are possible, but are limited to the stereotypical expression, while other unconventional facial expressions like doubtful are difficult to reliably generate.
In this work, we propose the use of AUs (action units) for facial expression control in face generation. AUs describe individual facial muscle movements based on facial anatomy, allowing precise and localized control over the intensity of facial movements. By combining different action units, we unlock the ability to create unconventional facial expressions that go beyond typical emotional models, enabling nuanced and authentic reactions reflective of real-world expressions. The proposed method can be seamlessly integrated with both text and image prompts using adapters, offering precise and intuitive control of the generated results. Code and dataset are available in {https://github.com/tvaranka/fineface}.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
Before and After Blockchain: Development and Principle of Distributed Fault Tolerance Consensus
Authors:
Huanyu Wu,
Chentao Yue,
Yixuan Fan,
Yonghui Li,
Lei Zhang
Abstract:
The concept of distributed consensus gained widespread attention following the publication of ``Byzantine Generals Problem'' by Leslie Lamport in the 1980s. This research topic has been active and extensively studied over the last four decades, particularly since the advent of blockchain technology in 2009. Blockchain technology employs Proof-of-X (PoX) or Byzantine-fault-tolerant (BFT) systems, w…
▽ More
The concept of distributed consensus gained widespread attention following the publication of ``Byzantine Generals Problem'' by Leslie Lamport in the 1980s. This research topic has been active and extensively studied over the last four decades, particularly since the advent of blockchain technology in 2009. Blockchain technology employs Proof-of-X (PoX) or Byzantine-fault-tolerant (BFT) systems, where all participants follow a protocol to achieve a common state (i.e., consistency) eventually. However, because PoX consensus such as Proof-of-Work is is resource-intensive with high power consumption, most permissioned blockchains employ BFT to achieve consistency. In this article, we provide an introduction to the fundamental principles and history of distributed consensus. We then explore the well-known fault-tolerant state machine replication (SMR) in partially synchronous networks, as well as consensus protocols in asynchronous models and recently proposed DAG-based consensus. Additionally, we examine the relationship between BFT consensus and blockchain technology and discuss the following questions: What is the history and evolution of BFT? Why are BFT protocols designed in the way they are and what core components do they use? What is the connection between BFT and blockchain technology, and what are the driving needs for future BFT research?
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
The Second Joint Workshop on Cross Reality
Authors:
Nanjia Wang,
Yue Li,
Francesco Chiossi,
Fabian Pointecker,
Lixiang Zhao,
Daniel Zielasko
Abstract:
The 2nd Joint Workshop on Cross Reality (JWCR'24), organized as part of ISMAR 2024, seeks to explore the burgeoning field of Cross Reality (CR), which encompasses the seamless integration and transition between various points on the reality-virtuality continuum (RVC) such as Virtual Reality (VR), Augmented Virtuality (AV), and Augmented Reality (AR). This hybrid workshop aims to build upon the fou…
▽ More
The 2nd Joint Workshop on Cross Reality (JWCR'24), organized as part of ISMAR 2024, seeks to explore the burgeoning field of Cross Reality (CR), which encompasses the seamless integration and transition between various points on the reality-virtuality continuum (RVC) such as Virtual Reality (VR), Augmented Virtuality (AV), and Augmented Reality (AR). This hybrid workshop aims to build upon the foundation laid by the inaugural JWCR at ISMAR 2023, which successfully unified diverse CR research communities. The workshop will address key themes including CR visualization, interaction, user behavior, design, development, engineering, and collaboration. CR Visualization focuses on creating and displaying spatial data across the RVC, enabling users to navigate and interpret information fluidly. CR Interaction delves into natural user engagements using gestures, voice commands, and other advanced techniques to enhance immersion. The study of CR User Behavior and Experience investigates how users perceive and interact within these hybrid environments. Furthermore, CR Design and Development emphasizes creating effective CR applications using innovative processes and tools, while CR Collaboration examines methods for fostering teamwork in mixed reality settings.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Survey and Taxonomy: The Role of Data-Centric AI in Transformer-Based Time Series Forecasting
Authors:
Jingjing Xu,
Caesar Wu,
Yuan-Fang Li,
Gregoire Danoy,
Pascal Bouvry
Abstract:
Alongside the continuous process of improving AI performance through the development of more sophisticated models, researchers have also focused their attention to the emerging concept of data-centric AI, which emphasizes the important role of data in a systematic machine learning training process. Nonetheless, the development of models has also continued apace. One result of this progress is the…
▽ More
Alongside the continuous process of improving AI performance through the development of more sophisticated models, researchers have also focused their attention to the emerging concept of data-centric AI, which emphasizes the important role of data in a systematic machine learning training process. Nonetheless, the development of models has also continued apace. One result of this progress is the development of the Transformer Architecture, which possesses a high level of capability in multiple domains such as Natural Language Processing (NLP), Computer Vision (CV) and Time Series Forecasting (TSF). Its performance is, however, heavily dependent on input data preprocessing and output data evaluation, justifying a data-centric approach to future research. We argue that data-centric AI is essential for training AI models, particularly for transformer-based TSF models efficiently. However, there is a gap regarding the integration of transformer-based TSF and data-centric AI. This survey aims to pin down this gap via the extensive literature review based on the proposed taxonomy. We review the previous research works from a data-centric AI perspective and we intend to lay the foundation work for the future development of transformer-based architecture and data-centric AI.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Causal Interventional Prediction System for Robust and Explainable Effect Forecasting
Authors:
Zhixuan Chu,
Hui Ding,
Guang Zeng,
Shiyu Wang,
Yiming Li
Abstract:
Although the widespread use of AI systems in today's world is growing, many current AI systems are found vulnerable due to hidden bias and missing information, especially in the most commonly used forecasting system. In this work, we explore the robustness and explainability of AI-based forecasting systems. We provide an in-depth analysis of the underlying causality involved in the effect predicti…
▽ More
Although the widespread use of AI systems in today's world is growing, many current AI systems are found vulnerable due to hidden bias and missing information, especially in the most commonly used forecasting system. In this work, we explore the robustness and explainability of AI-based forecasting systems. We provide an in-depth analysis of the underlying causality involved in the effect prediction task and further establish a causal graph based on treatment, adjustment variable, confounder, and outcome. Correspondingly, we design a causal interventional prediction system (CIPS) based on a variational autoencoder and fully conditional specification of multiple imputations. Extensive results demonstrate the superiority of our system over state-of-the-art methods and show remarkable versatility and extensibility in practice.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Take A Step Back: Rethinking the Two Stages in Visual Reasoning
Authors:
Mingyu Zhang,
Jiting Cai,
Mingyu Liu,
Yue Xu,
Cewu Lu,
Yong-Lu Li
Abstract:
Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning…
▽ More
Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to data bias fitting. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage is better at generalization than symbolization. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks following the separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
Temporal Feature Matters: A Framework for Diffusion Model Quantization
Authors:
Yushi Huang,
Ruihao Gong,
Xianglong Liu,
Jing Liu,
Yuhang Li,
Jiwen Lu,
Dacheng Tao
Abstract:
The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typicall…
▽ More
The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typically, $t$ from the finite set $\{1, \ldots, T\}$ is encoded into a hypersensitive temporal feature by several modules, entirely independent of the sampling data. However, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory. To address these challenges, we introduce a novel quantization framework: 1)~TIB-based Maintenance: Based on our innovative Temporal Information Block~(TIB) definition, Temporal Information-aware Reconstruction~(TIAR) and Finite Set Calibration~(FSC) are developed to efficiently align full precision temporal features. 2)~Cache-based Maintenance: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3)~Disturbance-aware Selection: Employ temporal feature errors to guide a fine-grained selection for superior maintenance. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets and diffusion models confirms our superior results. Notably, our approach closely matches the performance of the full-precision model under 4-bit quantization. Furthermore, the quantized SD-XL model achieves hardware acceleration of 2.20$\times$ on CPU and 5.76$\times$ on GPU demonstrating its efficiency.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph
Authors:
Zhengcen Li,
Xinle Chang,
Yueran Li,
Jingyong Su
Abstract:
Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate indi…
▽ More
Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate individual annotations and specialized interaction reasoning modules. To address these limitations, we design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity, offering an effective alternative to RGB video. This panoramic graph enables Graph Convolutional Network (GCN) to unify intra-person, inter-person, and person-object interactive modeling through spatial-temporal graph convolutions. In practice, we develop a novel pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and employ Multi-person Panoramic GCN (MP-GCN) to predict group activities. Extensive experiments on Volleyball and NBA datasets demonstrate that the MP-GCN achieves state-of-the-art performance in both accuracy and efficiency. Notably, our method outperforms RGB-based approaches by using only estimated 2D keypoints as input. Code is available at https://github.com/mgiant/MP-GCN
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models
Authors:
Changgu Chen,
Libing Yang,
Xiaoyan Yang,
Lianggangxu Chen,
Gaoqi He,
CHangbo Wang,
Yang Li
Abstract:
In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific i…
▽ More
In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific intervals of the initial noise distribution corresponding to the prompt. Moreover, it is challenging to directly optimize the initial distribution, given that the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the powerful potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated contents with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, a dynamic reward calibration module is proposed to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm to utilize historical data for network training and prevent the optimized distribution from deviating too far from the original policy to restrain excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in achieving consistency between prompts and the generated content. Our method achieves 10 times faster than the SOTA approach. Our homepage is available at \url{https://github.com/vpx-ecnu/FIND-website}.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
Parameter-Efficient Fine-Tuning via Circular Convolution
Authors:
Aochuan Chen,
Ziqi Gao,
Zijing Liu,
Yu Li,
Jia Li
Abstract:
Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (\textit{i.e.,} $Δ\mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and…
▽ More
Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (\textit{i.e.,} $Δ\mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose \underline{C}ir\underline{c}ular \underline{C}onvolution \underline{A}daptation (C$^3$A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C$^3$A consistently outperforms LoRA and its variants across various fine-tuning tasks.
△ Less
Submitted 27 July, 2024;
originally announced July 2024.
-
Large Language Models for Human-like Autonomous Driving: A Survey
Authors:
Yun Li,
Kai Katsumata,
Ehsan Javanmardi,
Manabu Tsukada
Abstract:
Large Language Models (LLMs), AI models trained on massive text corpora with remarkable language understanding and generation capabilities, are transforming the field of Autonomous Driving (AD). As AD systems evolve from rule-based and optimization-based methods to learning-based techniques like deep reinforcement learning, they are now poised to embrace a third and more advanced category: knowled…
▽ More
Large Language Models (LLMs), AI models trained on massive text corpora with remarkable language understanding and generation capabilities, are transforming the field of Autonomous Driving (AD). As AD systems evolve from rule-based and optimization-based methods to learning-based techniques like deep reinforcement learning, they are now poised to embrace a third and more advanced category: knowledge-based AD empowered by LLMs. This shift promises to bring AD closer to human-like AD. However, integrating LLMs into AD systems poses challenges in real-time inference, safety assurance, and deployment costs. This survey provides a comprehensive and critical review of recent progress in leveraging LLMs for AD, focusing on their applications in modular AD pipelines and end-to-end AD systems. We highlight key advancements, identify pressing challenges, and propose promising research directions to bridge the gap between LLMs and AD, thereby facilitating the development of more human-like AD systems. The survey first introduces LLMs' key features and common training schemes, then delves into their applications in modular AD pipelines and end-to-end AD, respectively, followed by discussions on open challenges and future directions. Through this in-depth analysis, we aim to provide insights and inspiration for researchers and practitioners working at the intersection of AI and autonomous vehicles, ultimately contributing to safer, smarter, and more human-centric AD technologies.
△ Less
Submitted 27 July, 2024;
originally announced July 2024.
-
Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
Authors:
Yansheng Li,
Tingzhu Wang,
Kang Wu,
Linlin Wang,
Xin Guo,
Wenbin Wang
Abstract:
Scene Graph Generation (SGG) aims to explore the relationships between objects in images and obtain scene summary graphs, thereby better serving downstream tasks. However, the long-tailed problem has adversely affected the scene graph's quality. The predictions are dominated by coarse-grained relationships, lacking more informative fine-grained ones. The union region of one object pair (i.e., one…
▽ More
Scene Graph Generation (SGG) aims to explore the relationships between objects in images and obtain scene summary graphs, thereby better serving downstream tasks. However, the long-tailed problem has adversely affected the scene graph's quality. The predictions are dominated by coarse-grained relationships, lacking more informative fine-grained ones. The union region of one object pair (i.e., one sample) contains rich and dedicated contextual information, enabling the prediction of the sample-specific bias for refining the original relationship prediction. Therefore, we propose a novel Sample-Level Bias Prediction (SBP) method for fine-grained SGG (SBG). Firstly, we train a classic SGG model and construct a correction bias set by calculating the margin between the ground truth label and the predicted label with one classic SGG model. Then, we devise a Bias-Oriented Generative Adversarial Network (BGAN) that learns to predict the constructed correction biases, which can be utilized to correct the original predictions from coarse-grained relationships to fine-grained ones. The extensive experimental results on VG, GQA, and VG-1800 datasets demonstrate that our SBG outperforms the state-of-the-art methods in terms of Average@K across three mainstream SGG models: Motif, VCtree, and Transformer. Compared to dataset-level correction methods on VG, SBG shows a significant average improvement of 5.6%, 3.9%, and 3.2% on Average@K for tasks PredCls, SGCls, and SGDet, respectively. The code will be available at https://github.com/Zhuzi24/SBG.
△ Less
Submitted 27 July, 2024;
originally announced July 2024.
-
Stochastic Parrots or ICU Experts? Large Language Models in Critical Care Medicine: A Scoping Review
Authors:
Tongyue Shi,
Jun Ma,
Zihan Yu,
Haowei Xu,
Minqi Xiong,
Meirong Xiao,
Yilin Li,
Huiying Zhao,
Guilan Kong
Abstract:
With the rapid development of artificial intelligence (AI), large language models (LLMs) have shown strong capabilities in natural language understanding, reasoning, and generation, attracting amounts of research interest in applying LLMs to health and medicine. Critical care medicine (CCM) provides diagnosis and treatment for critically ill patients who often require intensive monitoring and inte…
▽ More
With the rapid development of artificial intelligence (AI), large language models (LLMs) have shown strong capabilities in natural language understanding, reasoning, and generation, attracting amounts of research interest in applying LLMs to health and medicine. Critical care medicine (CCM) provides diagnosis and treatment for critically ill patients who often require intensive monitoring and interventions in intensive care units (ICUs). Can LLMs be applied to CCM? Are LLMs just like stochastic parrots or ICU experts in assisting clinical decision-making? This scoping review aims to provide a panoramic portrait of the application of LLMs in CCM. Literature in seven databases, including PubMed, Embase, Scopus, Web of Science, CINAHL, IEEE Xplore, and ACM Digital Library, were searched from January 1, 2019, to June 10, 2024. Peer-reviewed journal and conference articles that discussed the application of LLMs in critical care settings were included. From an initial 619 articles, 24 were selected for final review. This review grouped applications of LLMs in CCM into three categories: clinical decision support, medical documentation and reporting, and medical education and doctor-patient communication. LLMs have advantages in handling unstructured data and do not require manual feature engineering. Meanwhile, applying LLMs to CCM faces challenges, including hallucinations, poor interpretability, bias and alignment challenges, and privacy and ethics issues. Future research should enhance model reliability and interpretability, integrate up-to-date medical knowledge, and strengthen privacy and ethical guidelines. As LLMs evolve, they could become key tools in CCM to help improve patient outcomes and optimize healthcare delivery. This study is the first review of LLMs in CCM, aiding researchers, clinicians, and policymakers to understand the current status and future potentials of LLMs in CCM.
△ Less
Submitted 27 July, 2024;
originally announced July 2024.
-
Data Processing Techniques for Modern Multimodal Models
Authors:
Yinheng Li,
Han Ding,
Hang Chen
Abstract:
Data processing plays an significant role in current multimodal model training. In this paper. we provide an comprehensive review of common data processing techniques used in modern multimodal model training with a focus on diffusion models and multimodal large language models (MLLMs). We summarized all techniques into four categories: data quality, data quantity, data distribution and data safety…
▽ More
Data processing plays an significant role in current multimodal model training. In this paper. we provide an comprehensive review of common data processing techniques used in modern multimodal model training with a focus on diffusion models and multimodal large language models (MLLMs). We summarized all techniques into four categories: data quality, data quantity, data distribution and data safety. We further present our findings in the choice of data process methods in different type of models. This study aims to provide guidance to multimodal models developers with effective data processing techniques.
△ Less
Submitted 27 July, 2024;
originally announced July 2024.
-
Reducing Spurious Correlation for Federated Domain Generalization
Authors:
Shuran Ma,
Weiying Xie,
Daixun Li,
Haowei Li,
Yunsong Li
Abstract:
The rapid development of multimedia has provided a large amount of data with different distributions for visual tasks, forming different domains. Federated Learning (FL) can efficiently use this diverse data distributed on different client media in a decentralized manner through model sharing. However, in open-world scenarios, there is a challenge: global models may struggle to predict well on ent…
▽ More
The rapid development of multimedia has provided a large amount of data with different distributions for visual tasks, forming different domains. Federated Learning (FL) can efficiently use this diverse data distributed on different client media in a decentralized manner through model sharing. However, in open-world scenarios, there is a challenge: global models may struggle to predict well on entirely new domain data captured by certain media, which were not encountered during training. Existing methods still rely on strong statistical correlations between samples and labels to address this issue, which can be misleading, as some features may establish spurious short-cut correlations with the predictions. To comprehensively address this challenge, we introduce FedCD (Cross-Domain Invariant Federated Learning), an overall optimization framework at both the local and global levels. We introduce the Spurious Correlation Intervener (SCI), which employs invariance theory to locally generate interventers for features in a self-supervised manner to reduce the model's susceptibility to spurious correlated features. Our approach requires no sharing of data or features, only the gradients related to the model. Additionally, we develop the simple yet effective Risk Extrapolation Aggregation strategy (REA), determining aggregation coefficients through mathematical optimization to facilitate global causal invariant predictions. Extensive experiments and ablation studies highlight the effectiveness of our approach. In both classification and object detection generalization tasks, our method outperforms the baselines by an average of at least 1.45% in Acc, 4.8% and 1.27% in mAP50.
△ Less
Submitted 27 July, 2024;
originally announced July 2024.
-
Repairing Networks of $\mathcal{EL_\perp}$ Ontologies using Weakening and Completing -- Extended version
Authors:
Ying Li,
Patrick Lambrix
Abstract:
The quality of ontologies and their alignments is crucial for developing high-quality semantics-based applications. Traditional debugging techniques repair ontology networks by removing unwanted axioms and mappings, but may thereby remove consequences that are correct in the domain of the ontology network. In this paper we propose a framework for repairing ontology networks that deals with this is…
▽ More
The quality of ontologies and their alignments is crucial for developing high-quality semantics-based applications. Traditional debugging techniques repair ontology networks by removing unwanted axioms and mappings, but may thereby remove consequences that are correct in the domain of the ontology network. In this paper we propose a framework for repairing ontology networks that deals with this issue. It defines basic operations such as debugging, weakening and completing. Further, it defines combination operators that reflect choices in how and when to use the basic operators, as well as choices regarding the autonomy level of the ontologies and alignments in the ontology network. We show the influence of the combination operators on the quality of the repaired network and present an implemented tool. By using our framework together with existing algorithms for debugging, weakening and completing, we essentially provide a blueprint for extending previous work and systems.
△ Less
Submitted 26 July, 2024;
originally announced July 2024.
-
Topology Optimization of Random Memristors for Input-Aware Dynamic SNN
Authors:
Bo Wang,
Shaocong Wang,
Ning Lin,
Yi Li,
Yifei Yu,
Yue Zhang,
Jichang Yang,
Xiaoshan Wu,
Yangu He,
Songqi Wang,
Rui Chen,
Guoqi Li,
Xiaojuan Qi,
Zhongrui Wang,
Dashan Shang
Abstract:
There is unprecedented development in machine learning, exemplified by recent large language models and world simulators, which are artificial neural networks running on digital computers. However, they still cannot parallel human brains in terms of energy efficiency and the streamlined adaptability to inputs of different difficulties, due to differences in signal representation, optimization, run…
▽ More
There is unprecedented development in machine learning, exemplified by recent large language models and world simulators, which are artificial neural networks running on digital computers. However, they still cannot parallel human brains in terms of energy efficiency and the streamlined adaptability to inputs of different difficulties, due to differences in signal representation, optimization, run-time reconfigurability, and hardware architecture. To address these fundamental challenges, we introduce pruning optimization for input-aware dynamic memristive spiking neural network (PRIME). Signal representation-wise, PRIME employs leaky integrate-and-fire neurons to emulate the brain's inherent spiking mechanism. Drawing inspiration from the brain's structural plasticity, PRIME optimizes the topology of a random memristive spiking neural network without expensive memristor conductance fine-tuning. For runtime reconfigurability, inspired by the brain's dynamic adjustment of computational depth, PRIME employs an input-aware dynamic early stop policy to minimize latency during inference, thereby boosting energy efficiency without compromising performance. Architecture-wise, PRIME leverages memristive in-memory computing, mirroring the brain and mitigating the von Neumann bottleneck. We validated our system using a 40 nm 256 Kb memristor-based in-memory computing macro on neuromorphic image classification and image inpainting. Our results demonstrate the classification accuracy and Inception Score are comparable to the software baseline, while achieving maximal 62.50-fold improvements in energy efficiency, and maximal 77.0% computational load savings. The system also exhibits robustness against stochastic synaptic noise of analogue memristors. Our software-hardware co-designed model paves the way to future brain-inspired neuromorphic computing with brain-like energy efficiency and adaptivity.
△ Less
Submitted 26 July, 2024;
originally announced July 2024.
-
Learning Spectral-Decomposed Tokens for Domain Generalized Semantic Segmentation
Authors:
Jingjun Yi,
Qi Bi,
Hao Zheng,
Haolan Zhan,
Wei Ji,
Yawen Huang,
Yuexiang Li,
Yefeng Zheng
Abstract:
The rapid development of Vision Foundation Model (VFM) brings inherent out-domain generalization for a variety of down-stream tasks. Among them, domain generalized semantic segmentation (DGSS) holds unique challenges as the cross-domain images share common pixel-wise content information but vary greatly in terms of the style. In this paper, we present a novel Spectral-dEcomposed Token (SET) learni…
▽ More
The rapid development of Vision Foundation Model (VFM) brings inherent out-domain generalization for a variety of down-stream tasks. Among them, domain generalized semantic segmentation (DGSS) holds unique challenges as the cross-domain images share common pixel-wise content information but vary greatly in terms of the style. In this paper, we present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier. Delving into further than existing fine-tuning token & frozen backbone paradigm, the proposed SET especially focuses on the way learning style-invariant features from these learnable tokens. Particularly, the frozen VFM features are first decomposed into the phase and amplitude components in the frequency space, which mainly contain the information of content and style, respectively, and then separately processed by learnable tokens for task-specific information extraction. After the decomposition, style variation primarily impacts the token-based feature enhancement within the amplitude branch. To address this issue, we further develop an attention optimization method to bridge the gap between style-affected representation and static tokens during inference. Extensive cross-domain experiments show its state-of-the-art performance.
△ Less
Submitted 28 July, 2024; v1 submitted 26 July, 2024;
originally announced July 2024.
-
AI Safety in Generative AI Large Language Models: A Survey
Authors:
Jaymari Chua,
Yun Li,
Shiyi Yang,
Chen Wang,
Lina Yao
Abstract:
Large Language Model (LLMs) such as ChatGPT that exhibit generative AI capabilities are facing accelerated adoption and innovation. The increased presence of Generative AI (GAI) inevitably raises concerns about the risks and safety associated with these models. This article provides an up-to-date survey of recent trends in AI safety research of GAI-LLMs from a computer scientist's perspective: spe…
▽ More
Large Language Model (LLMs) such as ChatGPT that exhibit generative AI capabilities are facing accelerated adoption and innovation. The increased presence of Generative AI (GAI) inevitably raises concerns about the risks and safety associated with these models. This article provides an up-to-date survey of recent trends in AI safety research of GAI-LLMs from a computer scientist's perspective: specific and technical. In this survey, we explore the background and motivation for the identified harms and risks in the context of LLMs being generative language models; our survey differentiates by emphasising the need for unified theories of the distinct safety challenges in the research development and applications of LLMs. We start our discussion with a concise introduction to the workings of LLMs, supported by relevant literature. Then we discuss earlier research that has pointed out the fundamental constraints of generative models, or lack of understanding thereof (e.g., performance and safety trade-offs as LLMs scale in number of parameters). We provide a sufficient coverage of LLM alignment -- delving into various approaches, contending methods and present challenges associated with aligning LLMs with human preferences. By highlighting the gaps in the literature and possible implementation oversights, our aim is to create a comprehensive analysis that provides insights for addressing AI safety in LLMs and encourages the development of aligned and secure models. We conclude our survey by discussing future directions of LLMs for AI safety, offering insights into ongoing research in this critical area.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
MapTune: Advancing ASIC Technology Mapping via Reinforcement Learning Guided Library Tuning
Authors:
Mingju Liu,
Daniel Robinson,
Yingjie Li,
Cunxi Yu
Abstract:
Technology mapping involves mapping logical circuits to a library of cells. Traditionally, the full technology library is used, leading to a large search space and potential overhead. Motivated by randomly sampled technology mapping case studies, we propose MapTune framework that addresses this challenge by utilizing reinforcement learning to make design-specific choices during cell selection. By…
▽ More
Technology mapping involves mapping logical circuits to a library of cells. Traditionally, the full technology library is used, leading to a large search space and potential overhead. Motivated by randomly sampled technology mapping case studies, we propose MapTune framework that addresses this challenge by utilizing reinforcement learning to make design-specific choices during cell selection. By learning from the environment, MapTune refines the cell selection process, resulting in a reduced search space and potentially improved mapping quality.
The effectiveness of MapTune is evaluated on a wide range of benchmarks, different technology libraries and technology mappers. The experimental results demonstrate that MapTune achieves higher mapping accuracy and reducing delay/area across diverse circuit designs, technology libraries and mappers. The paper also discusses the Pareto-Optimal exploration and confirms the perpetual delay-area trade-off. Conducted on benchmark suites ISCAS 85/89, ITC/ISCAS 99, VTR8.0 and EPFL benchmarks, the post-technology mapping and post-sizing quality-of-results (QoR) have been significantly improved, with average Area-Delay Product (ADP) improvement of 22.54\% among all different exploration settings in MapTune. The improvements are consistently remained for four different technologies (7nm, 45nm, 130nm, and 180 nm) and two different mappers.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
Exploring the Effect of Dataset Diversity in Self-Supervised Learning for Surgical Computer Vision
Authors:
Tim J. M. Jaspers,
Ronald L. P. D. de Jong,
Yasmina Al Khalil,
Tijn Zeelenberg,
Carolus H. J. Kusters,
Yiping Li,
Romy C. van Jaarsveld,
Franciscus H. A. Bakker,
Jelle P. Ruurda,
Willem M. Brinkman,
Peter H. N. De With,
Fons van der Sommen
Abstract:
Over the past decade, computer vision applications in minimally invasive surgery have rapidly increased. Despite this growth, the impact of surgical computer vision remains limited compared to other medical fields like pathology and radiology, primarily due to the scarcity of representative annotated data. Whereas transfer learning from large annotated datasets such as ImageNet has been convention…
▽ More
Over the past decade, computer vision applications in minimally invasive surgery have rapidly increased. Despite this growth, the impact of surgical computer vision remains limited compared to other medical fields like pathology and radiology, primarily due to the scarcity of representative annotated data. Whereas transfer learning from large annotated datasets such as ImageNet has been conventionally the norm to achieve high-performing models, recent advancements in self-supervised learning (SSL) have demonstrated superior performance. In medical image analysis, in-domain SSL pretraining has already been shown to outperform ImageNet-based initialization. Although unlabeled data in the field of surgical computer vision is abundant, the diversity within this data is limited. This study investigates the role of dataset diversity in SSL for surgical computer vision, comparing procedure-specific datasets against a more heterogeneous general surgical dataset across three different downstream surgical applications. The obtained results show that using solely procedure-specific data can lead to substantial improvements of 13.8%, 9.5%, and 36.8% compared to ImageNet pretraining. However, extending this data with more heterogeneous surgical data further increases performance by an additional 5.0%, 5.2%, and 2.5%, suggesting that increasing diversity within SSL data is beneficial for model performance. The code and pretrained model weights are made publicly available at https://github.com/TimJaspers0801/SurgeNet.
△ Less
Submitted 26 July, 2024; v1 submitted 25 July, 2024;
originally announced July 2024.
-
Unified Lexical Representation for Interpretable Visual-Language Alignment
Authors:
Yifan Li,
Yikai Wang,
Yanwei Fu,
Dongyu Ru,
Zheng Zhang,
Tong He
Abstract:
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse represent…
▽ More
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words. However, lexical representations is difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
Very Large-Scale Multi-Agent Simulation in AgentScope
Authors:
Xuchen Pan,
Dawei Gao,
Yuexiang Xie,
Zhewei Wei,
Yaliang Li,
Bolin Ding,
Ji-Rong Wen,
Jingren Zhou
Abstract:
Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, there remain several challenges when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfied agent diversity, and effort-intensive management processes. To address these challenges, we deve…
▽ More
Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, there remain several challenges when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfied agent diversity, and effort-intensive management processes. To address these challenges, we develop several new features and components for AgentScope, a user-friendly multi-agent platform, enhancing its convenience and flexibility for supporting very large-scale multi-agent simulations. Specifically, we propose an actor-based distributed mechanism as the underlying technological infrastructure towards great scalability and high efficiency, and provide flexible environment support for simulating various real-world scenarios, which enables parallel execution of multiple agents, centralized workflow orchestration, and both inter-agent and agent-environment interactions among agents. Moreover, we integrate an easy-to-use configurable tool and an automatic background generation pipeline in AgentScope, simplifying the process of creating agents with diverse yet detailed background settings. Last but not least, we provide a web-based interface for conveniently monitoring and managing a large number of agents that might deploy across multiple devices. We conduct a comprehensive simulation to demonstrate the effectiveness of the proposed enhancements in AgentScope, and provide detailed observations and discussions to highlight the great potential of applying multi-agent systems in large-scale simulations. The source code is released on GitHub at https://github.com/modelscope/agentscope to inspire further research and development in large-scale multi-agent simulations.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction
Authors:
Chaofan Gan,
Yuanpeng Tu,
Yuxi Li,
Weiyao Lin
Abstract:
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention recently. However, manual labeling by non-experts will inevitably introduce corrupted annotations given ambiguous 2D/3D content. Though previous works have addressed this issue by designing a naive division strategy with hand-crafted thresholds, their performance generally exhibits great sensitivity t…
▽ More
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention recently. However, manual labeling by non-experts will inevitably introduce corrupted annotations given ambiguous 2D/3D content. Though previous works have addressed this issue by designing a naive division strategy with hand-crafted thresholds, their performance generally exhibits great sensitivity to the threshold value. Besides, they fail to fully utilize the valuable supervisory signals within each divided subset. To tackle this problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC). Specifically, the former performs accurate sample division by adaptive credibility modeling for each sample based on the compensation information within multimodal loss distribution. Then in AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance the semantic compactness and meanwhile alleviate over-fitting to noisy labels, where a self-correction strategy is introduced to improve the quality of representation. Moreover. To evaluate the effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely Objaverse-N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels. Extensive experiments on both traditional and the newly proposed benchmarks demonstrate the generality and superiority of our DAC, where DAC outperforms state-of-the-art models by a large margin. (i.e., with +5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
CityX: Controllable Procedural Content Generation for Unbounded 3D Cities
Authors:
Shougao Zhang,
Mengqi Zhou,
Yuxi Wang,
Chuanchen Luo,
Rongyu Wang,
Yiwei Li,
Xucheng Yin,
Zhaoxiang Zhang,
Junran Peng
Abstract:
Generating a realistic, large-scale 3D virtual city remains a complex challenge due to the involvement of numerous 3D assets, various city styles, and strict layout constraints. Existing approaches provide promising attempts at procedural content generation to create large-scale scenes using Blender agents. However, they face crucial issues such as difficulties in scaling up generation capability…
▽ More
Generating a realistic, large-scale 3D virtual city remains a complex challenge due to the involvement of numerous 3D assets, various city styles, and strict layout constraints. Existing approaches provide promising attempts at procedural content generation to create large-scale scenes using Blender agents. However, they face crucial issues such as difficulties in scaling up generation capability and achieving fine-grained control at the semantic layout level. To address these problems, we propose a novel multi-modal controllable procedural content generation method, named CityX, which enhances realistic, unbounded 3D city generation guided by multiple layout conditions, including OSM, semantic maps, and satellite images. Specifically, the proposed method contains a general protocol for integrating various PCG plugins and a multi-agent framework for transforming instructions into executable Blender actions. Through this effective framework, CityX shows the potential to build an innovative ecosystem for 3D scene generation by bridging the gap between the quality of generated assets and industrial requirements. Extensive experiments have demonstrated the effectiveness of our method in creating high-quality, diverse, and unbounded cities guided by multi-modal conditions. Our project page: https://cityx-lab.github.io.
△ Less
Submitted 29 July, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
Authors:
Zhenzhi Wang,
Yixuan Li,
Yanhong Zeng,
Youqing Fang,
Yuwei Guo,
Wenran Liu,
Jing Tan,
Kai Chen,
Tianfan Xue,
Bo Dai,
Dahua Lin
Abstract:
Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance o…
▽ More
Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.
△ Less
Submitted 28 July, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities
Authors:
Yanqi Bao,
Tianyu Ding,
Jing Huo,
Yaoli Liu,
Yuxin Li,
Wenbin Li,
Yang Gao,
Jiebo Luo
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussian representations through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives,…
▽ More
3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussian representations through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
MMRA: A Benchmark for Multi-granularity Multi-image Relational Association
Authors:
Siwei Wu,
Kang Zhu,
Yu Bai,
Yiming Liang,
Yizhi Li,
Haoning Wu,
Jiaheng Liu,
Ruibo Liu,
Xingwei Qu,
Xuxin Cheng,
Ge Zhang,
Wenhao Huang,
Chenghua Lin
Abstract:
Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVMLs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks mainly focus on the objective fact or certain topic related potential knowledge within a image, but overlook the associative relations between multiple images. Theref…
▽ More
Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVMLs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks mainly focus on the objective fact or certain topic related potential knowledge within a image, but overlook the associative relations between multiple images. Therefore, we define a multi-image relation association task, and meticulously curate \textbf{MMRA} benchmark, a \textbf{M}ulti-granularity \textbf{M}ulti-image \textbf{R}elational \textbf{A}ssociation benchmark, consisted of \textbf{1026} samples. In order to systematically and comprehensively evaluate mainstream LVLMs, we establish an associational relation system among images that contain \textbf{11 subtasks} (e.g, UsageSimilarity, SubEvent, etc.) at two granularity levels (i.e., "\textbf{image}" and "\textbf{entity}") according to the relations in ConceptNet. Our experiments demonstrate that, on our MMRA benchmark, current mainstream LVLMs all have their own advantages and disadvantages across different subtasks. It is worth noting that, at the entity level, the performance of all models is worse than that of them at the image level, indicating that the fine-grained multi-image perception task is still challenging for LVLMs. The tasks related to spatial perception are relatively difficult for LVLMs to handle. Furthermore, we find that LVMLs exhibit a good ability to perceive image details, and the key to enhancing their multi-image association capability is to strengthen the reasoning ability of their language model component. All our codes and data are released at htt\url{https://github.com/Wusiwei0410/MMRA}.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Physical Adversarial Attack on Monocular Depth Estimation via Shape-Varying Patches
Authors:
Chenxing Zhao,
Yang Li,
Shihao Wu,
Wenyi Tan,
Shuangju Zhou,
Quan Pan
Abstract:
Adversarial attacks against monocular depth estimation (MDE) systems pose significant challenges, particularly in safety-critical applications such as autonomous driving. Existing patch-based adversarial attacks for MDE are confined to the vicinity of the patch, making it difficult to affect the entire target. To address this limitation, we propose a physics-based adversarial attack on monocular d…
▽ More
Adversarial attacks against monocular depth estimation (MDE) systems pose significant challenges, particularly in safety-critical applications such as autonomous driving. Existing patch-based adversarial attacks for MDE are confined to the vicinity of the patch, making it difficult to affect the entire target. To address this limitation, we propose a physics-based adversarial attack on monocular depth estimation, employing a framework called Attack with Shape-Varying Patches (ASP), aiming to optimize patch content, shape, and position to maximize effectiveness. We introduce various mask shapes, including quadrilateral, rectangular, and circular masks, to enhance the flexibility and efficiency of the attack. Furthermore, we propose a new loss function to extend the influence of the patch beyond the overlapping regions. Experimental results demonstrate that our attack method generates an average depth error of 18 meters on the target car with a patch area of 1/9, affecting over 98\% of the target area.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation
Authors:
Yongqi Li,
Hongru Cai,
Wenjie Wang,
Leigang Qu,
Yinwei Wei,
Wenjie Li,
Liqiang Nie,
Tat-Seng Chua
Abstract:
Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval…
▽ More
Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns images with unique string identifiers and generates the target identifier as the retrieval target. Despite its great potential, existing generative approaches are limited due to the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and learning gap towards the retrieval target. To address the above issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens as the identifier of the image, while maintaining the alignment with both the visual information and high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to modify the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Region-based Content Enhancement for Efficient Video Analytics at the Edge
Authors:
Weijun Wang,
Liang Mi,
Shaowei Cen,
Haipeng Dai,
Yuanchun Li,
Xiaoming Fu,
Yunxin Liu
Abstract:
Video analytics is widespread in various applications serving our society. Recent advances of content enhancement in video analytics offer significant benefits for the bandwidth saving and accuracy improvement. However, existing content-enhanced video analytics systems are excessively computationally expensive and provide extremely low throughput. In this paper, we present region-based content enh…
▽ More
Video analytics is widespread in various applications serving our society. Recent advances of content enhancement in video analytics offer significant benefits for the bandwidth saving and accuracy improvement. However, existing content-enhanced video analytics systems are excessively computationally expensive and provide extremely low throughput. In this paper, we present region-based content enhancement, that enhances only the important regions in videos, to improve analytical accuracy. Our system, RegenHance, enables high-accuracy and high-throughput video analytics at the edge by 1) a macroblock-based region importance predictor that identifies the important regions fast and precisely, 2) a region-aware enhancer that stitches sparsely distributed regions into dense tensors and enhances them efficiently, and 3) a profile-based execution planer that allocates appropriate resources for enhancement and analytics components. We prototype RegenHance on five heterogeneous edge devices. Experiments on two analytical tasks reveal that region-based enhancement improves the overall accuracy of 10-19% and achieves 2-3x throughput compared to the state-of-the-art frame-based enhancement methods.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
PateGail: A Privacy-Preserving Mobility Trajectory Generator with Imitation Learning
Authors:
Huandong Wang,
Changzheng Gao,
Yuchen Wu,
Depeng Jin,
Lina Yao,
Yong Li
Abstract:
Generating human mobility trajectories is of great importance to solve the lack of large-scale trajectory data in numerous applications, which is caused by privacy concerns. However, existing mobility trajectory generation methods still require real-world human trajectories centrally collected as the training data, where there exists an inescapable risk of privacy leakage. To overcome this limitat…
▽ More
Generating human mobility trajectories is of great importance to solve the lack of large-scale trajectory data in numerous applications, which is caused by privacy concerns. However, existing mobility trajectory generation methods still require real-world human trajectories centrally collected as the training data, where there exists an inescapable risk of privacy leakage. To overcome this limitation, in this paper, we propose PateGail, a privacy-preserving imitation learning model to generate mobility trajectories, which utilizes the powerful generative adversary imitation learning model to simulate the decision-making process of humans. Further, in order to protect user privacy, we train this model collectively based on decentralized mobility data stored in user devices, where personal discriminators are trained locally to distinguish and reward the real and generated human trajectories. In the training process, only the generated trajectories and their rewards obtained based on personal discriminators are shared between the server and devices, whose privacy is further preserved by our proposed perturbation mechanisms with theoretical proof to satisfy differential privacy. Further, to better model the human decision-making process, we propose a novel aggregation mechanism of the rewards obtained from personal discriminators. We theoretically prove that under the reward obtained based on the aggregation mechanism, our proposed model maximizes the lower bound of the discounted total rewards of users. Extensive experiments show that the trajectories generated by our model are able to resemble real-world trajectories in terms of five key statistical metrics, outperforming state-of-the-art algorithms by over 48.03%. Furthermore, we demonstrate that the synthetic trajectories are able to efficiently support practical applications, including mobility prediction and location recommendation.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
A Brief Discussion on the Philosophical Principles and Development Directions of Data Circulation
Authors:
Zhi Li,
Lei Zhang,
Junyi Xin,
Jianfei He,
Yan Li,
Zhenjun Ma,
Qi Sun
Abstract:
The data circulation is a complex scenario involving a large number of participants and different types of requirements, which not only has to comply with the laws and regulations, but also faces multiple challenges in technical and business areas. In order to systematically and comprehensively address these issues, it is essential to have a comprehensive and profound understanding of 'data circul…
▽ More
The data circulation is a complex scenario involving a large number of participants and different types of requirements, which not only has to comply with the laws and regulations, but also faces multiple challenges in technical and business areas. In order to systematically and comprehensively address these issues, it is essential to have a comprehensive and profound understanding of 'data circulation'.
The traditional analysis method tends to proceed based on the traditional circulation model of commodities, that is, tangible objects, which has some defects and shortcomings, and tends to be a formalized approach, which is faced numerous challenges in practice. This paper analyzes the circulation of data with a philosophical approach, obtains the new explication of data and executing entity, and provides a new definition of the concepts of data utilization and data key stakeholders (objects). At the same time, it puts forward the idea of ``data alienation'', and constructs a new interpretive framework of ``data circulation''.
Based on the framework of this interpretation, it is clearly proposed that ``data alienation'' is the core of ``data circulation'', benefit distribution is the driving force, and legal compliance is the foundation, and further discussed the three modes of ``data circulation''. It further discusses the three modes of ``data circulation''. It is pointed out that ``data circulation'' is different from traditional ``commodity circulation''. To achieve ``data circulation'',a comprehensive information infrastructure needs to be established. from a theoretical point of view, it lays a solid foundation for the development of ``data circulation''.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Research on Adverse Drug Reaction Prediction Model Combining Knowledge Graph Embedding and Deep Learning
Authors:
Yufeng Li,
Wenchao Zhao,
Bo Dang,
Xu Yan,
Weimin Wang,
Min Gao,
Mingxuan Xiao
Abstract:
In clinical treatment, identifying potential adverse reactions of drugs can help assist doctors in making medication decisions. In response to the problems in previous studies that features are high-dimensional and sparse, independent prediction models need to be constructed for each adverse reaction of drugs, and the prediction accuracy is low, this paper develops an adverse drug reaction predict…
▽ More
In clinical treatment, identifying potential adverse reactions of drugs can help assist doctors in making medication decisions. In response to the problems in previous studies that features are high-dimensional and sparse, independent prediction models need to be constructed for each adverse reaction of drugs, and the prediction accuracy is low, this paper develops an adverse drug reaction prediction model based on knowledge graph embedding and deep learning, which can predict experimental results. Unified prediction of adverse drug reactions covered. Knowledge graph embedding technology can fuse the associated information between drugs and alleviate the shortcomings of high-dimensional sparsity in feature matrices, and the efficient training capabilities of deep learning can improve the prediction accuracy of the model. This article builds an adverse drug reaction knowledge graph based on drug feature data; by analyzing the embedding effect of the knowledge graph under different embedding strategies, the best embedding strategy is selected to obtain sample vectors; and then a convolutional neural network model is constructed to predict adverse reactions. The results show that under the DistMult embedding model and 400-dimensional embedding strategy, the convolutional neural network model has the best prediction effect; the average accuracy, F_1 score, recall rate and area under the curve of repeated experiments are better than the methods reported in the literature. The obtained prediction model has good prediction accuracy and stability, and can provide an effective reference for later safe medication guidance.
△ Less
Submitted 27 July, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
AutoRG-Brain: Grounded Report Generation for Brain MRI
Authors:
Jiayu Lei,
Xiaoman Zhang,
Chaoyi Wu,
Lisong Dai,
Ya Zhang,
Yanyong Zhang,
Yanfeng Wang,
Weidi Xie,
Yuehua Li
Abstract:
Radiologists are tasked with interpreting a large number of images in a daily base, with the responsibility of generating corresponding reports. This demanding workload elevates the risk of human error, potentially leading to treatment delays, increased healthcare costs, revenue loss, and operational inefficiencies. To address these challenges, we initiate a series of work on grounded Automatic Re…
▽ More
Radiologists are tasked with interpreting a large number of images in a daily base, with the responsibility of generating corresponding reports. This demanding workload elevates the risk of human error, potentially leading to treatment delays, increased healthcare costs, revenue loss, and operational inefficiencies. To address these challenges, we initiate a series of work on grounded Automatic Report Generation (AutoRG), starting from the brain MRI interpretation system, which supports the delineation of brain structures, the localization of anomalies, and the generation of well-organized findings. We make contributions from the following aspects, first, on dataset construction, we release a comprehensive dataset encompassing segmentation masks of anomaly regions and manually authored reports, termed as RadGenome-Brain MRI. This data resource is intended to catalyze ongoing research and development in the field of AI-assisted report generation systems. Second, on system design, we propose AutoRG-Brain, the first brain MRI report generation system with pixel-level grounded visual clues. Third, for evaluation, we conduct quantitative assessments and human evaluations of brain structure segmentation, anomaly localization, and report generation tasks to provide evidence of its reliability and accuracy. This system has been integrated into real clinical scenarios, where radiologists were instructed to write reports based on our generated findings and anomaly segmentation masks. The results demonstrate that our system enhances the report-writing skills of junior doctors, aligning their performance more closely with senior doctors, thereby boosting overall productivity.
△ Less
Submitted 26 July, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Knowledge-driven AI-generated data for accurate and interpretable breast ultrasound diagnoses
Authors:
Haojun Yu,
Youcheng Li,
Nan Zhang,
Zihan Niu,
Xuantong Gong,
Yanwen Luo,
Quanlin Wu,
Wangyan Qin,
Mengyuan Zhou,
Jie Han,
Jia Tao,
Ziwei Zhao,
Di Dai,
Di He,
Dong Wang,
Binghui Tang,
Ling Huo,
Qingli Zhu,
Yong Wang,
Liwei Wang
Abstract:
Data-driven deep learning models have shown great capabilities to assist radiologists in breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address a long-standing challenge of improving the diagnostic model performance on rare cases using long-tailed data. Specifical…
▽ More
Data-driven deep learning models have shown great capabilities to assist radiologists in breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address a long-standing challenge of improving the diagnostic model performance on rare cases using long-tailed data. Specifically, we introduce a pipeline, TAILOR, that builds a knowledge-driven generative model to produce tailored synthetic data. The generative model, using 3,749 lesions as source data, can generate millions of breast-US images, especially for error-prone rare cases. The generated data can be further used to build a diagnostic model for accurate and interpretable diagnoses. In the prospective external evaluation, our diagnostic model outperforms the average performance of nine radiologists by 33.5% in specificity with the same sensitivity, improving their performance by providing predictions with an interpretable decision-making process. Moreover, on ductal carcinoma in situ (DCIS), our diagnostic model outperforms all radiologists by a large margin, with only 34 DCIS lesions in the source data. We believe that TAILOR can potentially be extended to various diseases and imaging modalities.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
When, Where, and What? A Novel Benchmark for Accident Anticipation and Localization with Large Language Models
Authors:
Haicheng Liao,
Yongkang Li,
Chengyue Wang,
Yanchen Guan,
KaHou Tam,
Chunlin Tian,
Li Li,
Chengzhong Xu,
Zhenning Li
Abstract:
As autonomous driving systems increasingly become part of daily transportation, the ability to accurately anticipate and mitigate potential traffic accidents is paramount. Traditional accident anticipation models primarily utilizing dashcam videos are adept at predicting when an accident may occur but fall short in localizing the incident and identifying involved entities. Addressing this gap, thi…
▽ More
As autonomous driving systems increasingly become part of daily transportation, the ability to accurately anticipate and mitigate potential traffic accidents is paramount. Traditional accident anticipation models primarily utilizing dashcam videos are adept at predicting when an accident may occur but fall short in localizing the incident and identifying involved entities. Addressing this gap, this study introduces a novel framework that integrates Large Language Models (LLMs) to enhance predictive capabilities across multiple dimensions--what, when, and where accidents might occur. We develop an innovative chain-based attention mechanism that dynamically adjusts to prioritize high-risk elements within complex driving scenes. This mechanism is complemented by a three-stage model that processes outputs from smaller models into detailed multimodal inputs for LLMs, thus enabling a more nuanced understanding of traffic dynamics. Empirical validation on the DAD, CCD, and A3D datasets demonstrates superior performance in Average Precision (AP) and Mean Time-To-Accident (mTTA), establishing new benchmarks for accident prediction technology. Our approach not only advances the technological framework for autonomous driving safety but also enhances human-AI interaction, making predictive insights generated by autonomous systems more intuitive and actionable.
△ Less
Submitted 26 July, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
LawLuo: A Chinese Law Firm Co-run by LLM Agents
Authors:
Jingyun Sun,
Chengxiao Dai,
Zhongze Luo,
Yangbo Chang,
Yang Li
Abstract:
Large Language Models (LLMs) demonstrate substantial potential in delivering legal consultation services to users without a legal background, attributed to their superior text comprehension and generation capabilities. Nonetheless, existing Chinese legal LLMs limit interaction to a single model-user dialogue, unlike the collaborative consultations typical of law firms, where multiple staff members…
▽ More
Large Language Models (LLMs) demonstrate substantial potential in delivering legal consultation services to users without a legal background, attributed to their superior text comprehension and generation capabilities. Nonetheless, existing Chinese legal LLMs limit interaction to a single model-user dialogue, unlike the collaborative consultations typical of law firms, where multiple staff members contribute to a single consultation. This limitation prevents an authentic consultation experience. Additionally, extant Chinese legal LLMs suffer from critical limitations: (1) insufficient control over the quality of instruction fine-tuning data; (2) increased model hallucination resulting from users' ambiguous queries; and (3) a reduction in the model's ability to follow instructions over multiple dialogue turns. In response to these challenges, we propose a novel legal dialogue framework that leverages the collaborative capabilities of multiple LLM agents, termed LawLuo. This framework encompasses four agents: a receptionist, a lawyer, a secretary, and a boss, each responsible for different functionalities, collaboratively providing a comprehensive legal consultation to users. Additionally, we constructed two high-quality legal dialogue datasets, KINLED and MURLED, and fine-tuned ChatGLM-3-6b using these datasets. We propose a legal query clarification algorithm called ToLC. Experimental results demonstrate that LawLuo outperforms baseline LLMs, including GPT-4, across three dimensions: lawyer-like language style, the usefulness of legal advice, and the accuracy of legal knowledge. Our code and datasets are available at https://github.com/NEFUJing/LawLuo.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Authors:
Xiaowan Hu,
Yiyi Chen,
Yan Li,
Minquan Wang,
Haoqian Wang,
Quan Chen,
Han Li,
Peng Jiang
Abstract:
With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor pr…
▽ More
With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor products present in the background; 2) the video-image heterogeneity that the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) there are numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus toward intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching, solving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining, assisting the model in distinguishing highly similar products with fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.
△ Less
Submitted 24 July, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network
Authors:
Weiying Xie,
Yusi Zhang,
Tianlin Hui,
Jiaqing Zhang,
Jie Lei,
Yunsong Li
Abstract:
Multimodal object detection offers a promising prospect to facilitate robust detection in various visual conditions. However, existing two-stream backbone networks are challenged by complex fusion and substantial parameter increments. This is primarily due to large data distribution biases of multimodal homogeneous information. In this paper, we propose a novel multimodal object detector, named Lo…
▽ More
Multimodal object detection offers a promising prospect to facilitate robust detection in various visual conditions. However, existing two-stream backbone networks are challenged by complex fusion and substantial parameter increments. This is primarily due to large data distribution biases of multimodal homogeneous information. In this paper, we propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone. The shared parameters enhance the consistency of homogeneous information, while lightweight modal adaptors focus on modality unique features. Furthermore, we design an adaptive rank allocation strategy to adapt to the varying heterogeneity at different feature levels. When applied to two multimodal object detection datasets, experiments validate the effectiveness of our method. Notably, on DroneVehicle, LMA attains a 10.4% accuracy improvement over the state-of-the-art method with a 149M-parameters reduction. The code is available at https://github.com/zyszxhy/FoRA.
Our work was submitted to ACM MM in April 2024, but was rejected. We will continue to refine our work and paper writing next, mainly including proof of theory and multi-task applications of FoRA.
△ Less
Submitted 22 July, 2024;
originally announced July 2024.
-
Shapley Pruning for Neural Network Compression
Authors:
Kamil Adamczewski,
Yawei Li,
Luc van Gool
Abstract:
Neural network pruning is a rich field with a variety of approaches. In this work, we propose to connect the existing pruning concepts such as leave-one-out pruning and oracle pruning and develop them into a more general Shapley value-based framework that targets the compression of convolutional neural networks. To allow for practical applications in utilizing the Shapley value, this work presents…
▽ More
Neural network pruning is a rich field with a variety of approaches. In this work, we propose to connect the existing pruning concepts such as leave-one-out pruning and oracle pruning and develop them into a more general Shapley value-based framework that targets the compression of convolutional neural networks. To allow for practical applications in utilizing the Shapley value, this work presents the Shapley value approximations, and performs the comparative analysis in terms of cost-benefit utility for the neural network compression. The proposed ranks are evaluated against a new benchmark, Oracle rank, constructed based on oracle sets. The broad experiments show that the proposed normative ranking and its approximations show practical results, obtaining state-of-the-art network compression.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
A Large-scale Benchmark Dataset for Commuting Origin-destination Matrix Generation
Authors:
Can Rong,
Jingtao Ding,
Yan Liu,
Yong Li
Abstract:
The commuting origin-destination~(OD) matrix is a critical input for urban planning and transportation, providing crucial information about the population residing in one region and working in another within an interested area. Despite its importance, obtaining and updating the matrix is challenging due to high costs and privacy concerns. This has spurred research into generating commuting OD matr…
▽ More
The commuting origin-destination~(OD) matrix is a critical input for urban planning and transportation, providing crucial information about the population residing in one region and working in another within an interested area. Despite its importance, obtaining and updating the matrix is challenging due to high costs and privacy concerns. This has spurred research into generating commuting OD matrices for areas lacking historical data, utilizing readily available information via computational models. In this regard, existing research is primarily restricted to only a single or few large cities, preventing these models from being applied effectively in other areas with distinct characteristics, particularly in towns and rural areas where such data is urgently needed. To address this, we propose a large-scale dataset comprising commuting OD matrices for 3,233 diverse areas around the U.S. For each area, we provide the commuting OD matrix, combined with regional attributes including demographics and point-of-interests of each region in that area. We believe this comprehensive dataset will facilitate the development of more generalizable commuting OD matrix generation models, which can capture various patterns of distinct areas. Additionally, we use this dataset to benchmark a set of commuting OD generation models, including physical models, element-wise predictive models, and matrix-wise generative models. Surprisingly, we find a new paradigm, which considers the whole area combined with its commuting OD matrix as an attributed directed weighted graph and generates the weighted edges based on the node attributes, can achieve the optimal. This may inspire a new research direction from graph learning in this field.
△ Less
Submitted 23 July, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Authors:
Kaiwen Wang,
Rahul Kidambi,
Ryan Sullivan,
Alekh Agarwal,
Christoph Dann,
Andrea Michi,
Marco Gelmi,
Yunxuan Li,
Raghav Gupta,
Avinava Dubey,
Alexandre Ramé,
Johan Ferret,
Geoffrey Cideron,
Le Hou,
Hongkun Yu,
Amr Ahmed,
Aranyak Mehta,
Léonard Hussenot,
Olivier Bachem,
Edouard Leurent
Abstract:
Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge here is to develop steerable language models that trade-off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditioned Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Bui…
▽ More
Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge here is to develop steerable language models that trade-off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditioned Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP can learn steerable models that effectively trade-off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through an extensive set of experiments and ablations, we show that the CLP framework learns steerable models that outperform and Pareto-dominate the current state-of-the-art approaches for multi-objective finetuning.
△ Less
Submitted 22 July, 2024;
originally announced July 2024.
-
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
Authors:
Xin Ma,
Yaohui Wang,
Gengyun Jia,
Xinyuan Chen,
Yuan-Fang Li,
Cunjian Chen,
Yu Qiao
Abstract:
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this pap…
▽ More
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. Such three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.
△ Less
Submitted 22 July, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
Odyssey: Empowering Agents with Open-World Skills
Authors:
Shunyu Liu,
Yaoru Li,
Kongcheng Zhang,
Zhenyu Cui,
Wenkai Fang,
Yuxuan Zheng,
Tongya Zheng,
Mingli Song
Abstract:
Recent studies have delved into constructing generalist agents for open-world embodied environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set…
▽ More
Recent studies have delved into constructing generalist agents for open-world embodied environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce ODYSSEY, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. ODYSSEY comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new open-world benchmark includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task. Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate the planning and exploration capabilities of agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks
Authors:
Zhe Fei,
Yi Li
Abstract:
Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning a…
▽ More
Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model
Authors:
Yu Li,
Yifan Chen,
Gongye Liu,
Jie Wu,
Yujiu Yang
Abstract:
Layout generation is the foundation task of intelligent design, which requires the integration of visual aesthetics and harmonious expression of content delivery. However, existing methods still face challenges in generating precise and visually appealing layouts, including blocking, overlap, or spatial misalignment between layouts, which are closely related to the spatial structure of graphic lay…
▽ More
Layout generation is the foundation task of intelligent design, which requires the integration of visual aesthetics and harmonious expression of content delivery. However, existing methods still face challenges in generating precise and visually appealing layouts, including blocking, overlap, or spatial misalignment between layouts, which are closely related to the spatial structure of graphic layouts. We find that these methods overly focus on content information and lack constraints on layout spatial structure, resulting in an imbalance of learning content-aware and graphic-aware features. To tackle this issue, we propose Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model (CGB-DM). Specifically, we first design a regulator that balances the predicted content and graphic weight, overcoming the tendency of paying more attention to the content on canvas. Secondly, we introduce a graphic constraint of saliency bounding box to further enhance the alignment of geometric features between layout representations and images. In addition, we adapt a transformer-based diffusion model as the backbone, whose powerful generation capability ensures the quality in layout generation. Extensive experimental results indicate that our method has achieved state-of-the-art performance in both quantitative and qualitative evaluations. Our model framework can also be expanded to other graphic design fields.
△ Less
Submitted 22 July, 2024; v1 submitted 21 July, 2024;
originally announced July 2024.
-
Schedulability Analysis in Time-Sensitive Networking: A Systematic Literature Review
Authors:
Zitong Wang,
Feng Luo,
Yunpeng Li,
Haotian Gan,
Lei Zhu
Abstract:
Time-Sensitive Networking (TSN) is a set of standards that provide low-latency, high-reliability guarantees for the transmission of traffic in networks, and it is becoming an accepted solution for complex time-critical systems such as those in industrial automation and the automotive. In time-critical systems, it is essential to verify the timing predictability of the system, and the application o…
▽ More
Time-Sensitive Networking (TSN) is a set of standards that provide low-latency, high-reliability guarantees for the transmission of traffic in networks, and it is becoming an accepted solution for complex time-critical systems such as those in industrial automation and the automotive. In time-critical systems, it is essential to verify the timing predictability of the system, and the application of scheduling mechanisms in TSN can also bring about changes in system timing. Therefore, schedulability analysis techniques can be used to verify that the system is scheduled according to the scheduling mechanism and meets the timing requirements. In this paper, we provide a clear overview of the state-of-the-art works on the topic of schedulability analysis in TSN in an attempt to clarify the purpose of schedulability analysis, categorize the methods of schedulability analysis and compare their respective strengths and weaknesses, point out the scheduling mechanisms under analyzing and the corresponding traffic classes, clarify the network scenarios constructed during the evaluation and list the challenges and directions still needing to be worked on in schedulability analysis in TSN. To this end, we conducted a systematic literature review and finally identified 123 relevant research papers published in major conferences and journals in the past 15 years. Based on a comprehensive review of the relevant literature, we have identified several key findings and emphasized the future challenges in schedulability analysis for TSN.
△ Less
Submitted 20 July, 2024;
originally announced July 2024.
-
Consent in Crisis: The Rapid Decline of the AI Data Commons
Authors:
Shayne Longpre,
Robert Mahari,
Ariel Lee,
Campbell Lund,
Hamidah Oderinwale,
William Brannon,
Nayan Saxena,
Naana Obeng-Marnu,
Tobin South,
Cole Hunter,
Kevin Klyman,
Christopher Klamm,
Hailey Schoelkopf,
Nikhil Singh,
Manuel Cherep,
Ahmad Anis,
An Dinh,
Caroline Chitongo,
Da Yin,
Damien Sileo,
Deividas Mataciunas,
Diganta Misra,
Emad Alghamdi,
Enrico Shippole,
Jianguo Zhang
, et al. (24 additional authors not shown)
Abstract:
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how co…
▽ More
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
△ Less
Submitted 24 July, 2024; v1 submitted 20 July, 2024;
originally announced July 2024.
-
RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies
Authors:
Xiaomeng Chu,
Jiajun Deng,
Guoliang You,
Yifan Duan,
Yao Li,
Yanyong Zhang
Abstract:
The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded de…
▽ More
The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map to sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS, respectively. Our codes will be made available.
△ Less
Submitted 27 July, 2024; v1 submitted 20 July, 2024;
originally announced July 2024.