subscribe to arXiv mailings

arXiv:2407.20183 [pdf, other]

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao

Abstract: Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retriev… ▽ More Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine once (2) corresponding information to be integrated is spread over multiple web pages along with massive noise, and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process when humans solve these problems, we introduce MindSearch to mimic the human minds in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework. The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search result from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. The multi-agent design of MindSearch enables the whole framework to seek and integrate information parallelly from larger-scale (e.g., more than 300) web pages in 3 minutes, which is worth 3 hours of human effort. MindSearch demonstrates significant improvement in the response quality in terms of depth and breadth, on both close-set and open-set QA problems. Besides, responses from MindSearch based on InternLM2.5-7B are preferable by humans to ChatGPT-Web and Perplexity.ai applications, which implies that MindSearch can already deliver a competitive solution to the proprietary AI search engine. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: Technical Report. Project Page: https://mindsearch.netlify.app Code: https://github.com/InternLM/MindSearch

arXiv:2407.20171 [pdf, other]

Diffusion Feedback Helps CLIP See Better

Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang

Abstract: Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings, such as which can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the… ▽ More Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings, such as which can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to the lack of the distinctiveness of the text and the diversity of images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark which assesses fine-grained visual abilities to a large extent (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code will be available at https://github.com/baaivision/DIVA. △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.19875 [pdf, other]

Exploring Robust Face-Voice Matching in Multilingual Environments

Authors: Jiehui Tang, Xiaofei Wang, Zhen Xiao, Jiayi Liu, Xueliang Liu, Richang Hong

Abstract: This paper presents Team Xaiofei's innovative approach to exploring Face-Voice Association in Multilingual Environments (FAME) at ACM Multimedia 2024. We focus on the impact of different languages in face-voice matching by building upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample pair weighting, robust data augmentation, and scor… ▽ More This paper presents Team Xaiofei's innovative approach to exploring Face-Voice Association in Multilingual Environments (FAME) at ACM Multimedia 2024. We focus on the impact of different languages in face-voice matching by building upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample pair weighting, robust data augmentation, and score polarization strategy. Our dual-branch structure serves as an auxiliary mechanism to better integrate and provide more comprehensive information. We also introduce a dynamic weighting mechanism for various sample pairs to optimize learning. Data augmentation techniques are employed to enhance the model's generalization across diverse conditions. Additionally, score polarization strategy based on age and gender matching confidence clarifies and accentuates the final results. Our methods demonstrate significant effectiveness, achieving an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset. △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.19547 [pdf, other]

Temporal Feature Matters: A Framework for Diffusion Model Quantization

Authors: Yushi Huang, Ruihao Gong, Xianglong Liu, Jing Liu, Yuhang Li, Jiwen Lu, Dacheng Tao

Abstract: The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typicall… ▽ More The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typically, $t$ from the finite set $\{1, \ldots, T\}$ is encoded into a hypersensitive temporal feature by several modules, entirely independent of the sampling data. However, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory. To address these challenges, we introduce a novel quantization framework: 1)~TIB-based Maintenance: Based on our innovative Temporal Information Block~(TIB) definition, Temporal Information-aware Reconstruction~(TIAR) and Finite Set Calibration~(FSC) are developed to efficiently align full precision temporal features. 2)~Cache-based Maintenance: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3)~Disturbance-aware Selection: Employ temporal feature errors to guide a fine-grained selection for superior maintenance. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets and diffusion models confirms our superior results. Notably, our approach closely matches the performance of the full-precision model under 4-bit quantization. Furthermore, the quantized SD-XL model achieves hardware acceleration of 2.20$\times$ on CPU and 5.76$\times$ on GPU demonstrating its efficiency. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2311.16503

arXiv:2407.19542 [pdf, other]

UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation

Authors: Shuang Wu, Songlin Tang, Guangming Lu, Jianzhuang Liu, Wenjie Pei

Abstract: Typical inverse rendering methods focus on learning implicit neural scene representations by modeling the geometry, materials and illumination separately, which entails significant computations for optimization. In this work we design a Unified Voxelization framework for explicit learning of scene representations, dubbed UniVoxel, which allows for efficient modeling of the geometry, materials and… ▽ More Typical inverse rendering methods focus on learning implicit neural scene representations by modeling the geometry, materials and illumination separately, which entails significant computations for optimization. In this work we design a Unified Voxelization framework for explicit learning of scene representations, dubbed UniVoxel, which allows for efficient modeling of the geometry, materials and illumination jointly, thereby accelerating the inverse rendering significantly. To be specific, we propose to encode a scene into a latent volumetric representation, based on which the geometry, materials and illumination can be readily learned via lightweight neural networks in a unified manner. Particularly, an essential design of UniVoxel is that we leverage local Spherical Gaussians to represent the incident light radiance, which enables the seamless integration of modeling illumination into the unified voxelization framework. Such novel design enables our UniVoxel to model the joint effects of direct lighting, indirect lighting and light visibility efficiently without expensive multi-bounce ray tracing. Extensive experiments on multiple benchmarks covering diverse scenes demonstrate that UniVoxel boosts the optimization efficiency significantly compared to other methods, reducing the per-scene training time from hours to 18 minutes, while achieving favorable reconstruction quality. Code is available at https://github.com/freemantom/UniVoxel. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Comments: ECCV2024

arXiv:2407.19514 [pdf, other]

Detached and Interactive Multimodal Learning

Authors: Yunfeng Fan, Wenchao Xu, Haozhao Wang, Junhong Liu, Song Guo

Abstract: Recently, Multimodal Learning (MML) has gained significant interest as it compensates for single-modality limitations through comprehensive complementary information within multimodal data. However, traditional MML methods generally use the joint learning framework with a uniform learning objective that can lead to the modality competition issue, where feedback predominantly comes from certain mod… ▽ More Recently, Multimodal Learning (MML) has gained significant interest as it compensates for single-modality limitations through comprehensive complementary information within multimodal data. However, traditional MML methods generally use the joint learning framework with a uniform learning objective that can lead to the modality competition issue, where feedback predominantly comes from certain modalities, limiting the full potential of others. In response to this challenge, this paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities under the premise of avoiding modality competition. Specifically, DI-MML addresses competition by separately training each modality encoder with isolated learning objectives. It further encourages cross-modal interaction via a shared classifier that defines a common feature space and employing a dimension-decoupled unidirectional contrastive (DUC) loss to facilitate modality-level knowledge transfer. Additionally, to account for varying reliability in sample pairs, we devise a certainty-aware logit weighting strategy to effectively leverage complementary information at the instance level during inference. Extensive experiments conducted on audio-visual, flow-image, and front-rear view datasets show the superior performance of our proposed method. The code is released at https://github.com/fanyunfeng-bit/DI-MML. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Comments: Accepted by ACM MM 24

arXiv:2407.19224 [pdf, other]

doi 10.1145/3664647.3681261

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues

Authors: Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu

Abstract: While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in the multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts ac… ▽ More While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in the multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts across various segments of the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers within a singular process. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers, respectively. Additionally, our model can utilize speakers with complete audio-visual information to mitigate other visual-deficient speakers, thereby enhancing its resilience to missing visual cues. We also conduct experiments where visual information for specific speakers is entirely absent or visual frames are partially missing. The results demonstrate that our model consistently outperforms others, exhibiting the smallest performance drop across all settings involving 2, 3, 4, and 5 speakers. △ Less

Submitted 27 July, 2024; originally announced July 2024.

arXiv:2407.18962 [pdf]

Autonomous Navigation of Unmanned Vehicle Through Deep Reinforcement Learning

Authors: Letian Xu, Jiabei Liu, Haopeng Zhao, Tianyao Zheng, Tongzhou Jiang, Lipeng Liu

Abstract: This paper explores the method of achieving autonomous navigation of unmanned vehicles through Deep Reinforcement Learning (DRL). The focus is on using the Deep Deterministic Policy Gradient (DDPG) algorithm to address issues in high-dimensional continuous action spaces. The paper details the model of a Ackermann robot and the structure and application of the DDPG algorithm. Experiments were condu… ▽ More This paper explores the method of achieving autonomous navigation of unmanned vehicles through Deep Reinforcement Learning (DRL). The focus is on using the Deep Deterministic Policy Gradient (DDPG) algorithm to address issues in high-dimensional continuous action spaces. The paper details the model of a Ackermann robot and the structure and application of the DDPG algorithm. Experiments were conducted in a simulation environment to verify the feasibility of the improved algorithm. The results demonstrate that the DDPG algorithm outperforms traditional Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) algorithms in path planning tasks. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.18854 [pdf, other]

Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment

Authors: Yuze Zheng, Zixuan Li, Xiangxian Li, Jinxing Liu, Yuqing Wang, Xiangxu Meng, Lei Meng

Abstract: Image classification models often demonstrate unstable performance in real-world applications due to variations in image information, driven by differing visual perspectives of subject objects and lighting discrepancies. To mitigate these challenges, existing studies commonly incorporate additional modal information matching the visual data to regularize the model's learning process, enabling the… ▽ More Image classification models often demonstrate unstable performance in real-world applications due to variations in image information, driven by differing visual perspectives of subject objects and lighting discrepancies. To mitigate these challenges, existing studies commonly incorporate additional modal information matching the visual data to regularize the model's learning process, enabling the extraction of high-quality visual features from complex image regions. Specifically, in the realm of multimodal learning, cross-modal alignment is recognized as an effective strategy, harmonizing different modal information by learning a domain-consistent latent feature space for visual and semantic features. However, this approach may face limitations due to the heterogeneity between multimodal information, such as differences in feature distribution and structure. To address this issue, we introduce a Multimodal Alignment and Reconstruction Network (MARNet), designed to enhance the model's resistance to visual noise. Importantly, MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains. Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model. It is a plug-and-play framework that can be rapidly integrated into various image classification frameworks, boosting model performance. △ Less

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.18626 [pdf, other]

Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models

Authors: Xiang Shi, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu

Abstract: This paper tackles a key issue in the interpretation of scientific figures: the fine-grained alignment of text and figures. It advances beyond prior research that primarily dealt with straightforward, data-driven visualizations such as bar and pie charts and only offered a basic understanding of diagrams through captioning and classification. We introduce a novel task, Figure Integrity Verificatio… ▽ More This paper tackles a key issue in the interpretation of scientific figures: the fine-grained alignment of text and figures. It advances beyond prior research that primarily dealt with straightforward, data-driven visualizations such as bar and pie charts and only offered a basic understanding of diagrams through captioning and classification. We introduce a novel task, Figure Integrity Verification, designed to evaluate the precision of technologies in aligning textual knowledge with visual elements in scientific figures. To support this, we develop a semi-automated method for constructing a large-scale dataset, Figure-seg, specifically designed for this task. Additionally, we propose an innovative framework, Every Part Matters (EPM), which leverages Multimodal Large Language Models (MLLMs) to not only incrementally improve the alignment and verification of text-figure integrity but also enhance integrity through analogical reasoning. Our comprehensive experiments show that these innovations substantially improve upon existing methods, allowing for more precise and thorough analysis of complex scientific figures. This progress not only enhances our understanding of multimodal technologies but also stimulates further research and practical applications across fields requiring the accurate interpretation of complex visual data. △ Less

Submitted 26 July, 2024; originally announced July 2024.

Comments: 28 pages, 11 figures, under review

arXiv:2407.17825 [pdf, other]

Blockchain Takeovers in Web 3.0: An Empirical Study on the TRON-Steem Incident

Authors: Chao Li, Runhua Xu, Balaji Palanisamy, Li Duan, Meng Shen, Jiqiang Liu, Wei Wang

Abstract: A fundamental goal of Web 3.0 is to establish a decentralized network and application ecosystem, thereby enabling users to retain control over their data while promoting value exchange. However, the recent Tron-Steem takeover incident poses a significant threat to this vision. In this paper, we present a thorough empirical analysis of the Tron-Steem takeover incident. By conducting a fine-grained… ▽ More A fundamental goal of Web 3.0 is to establish a decentralized network and application ecosystem, thereby enabling users to retain control over their data while promoting value exchange. However, the recent Tron-Steem takeover incident poses a significant threat to this vision. In this paper, we present a thorough empirical analysis of the Tron-Steem takeover incident. By conducting a fine-grained reconstruction of the stake and election snapshots within the Steem blockchain, one of the most prominent social-oriented blockchains, we quantify the marked shifts in decentralization pre and post the takeover incident, highlighting the severe threat that blockchain network takeovers pose to the decentralization principle of Web 3.0. Moreover, by employing heuristic methods to identify anomalous voters and conducting clustering analyses on voter behaviors, we unveil the underlying mechanics of takeover strategies employed in the Tron-Steem incident and suggest potential mitigation strategies, which contribute to the enhanced resistance of Web 3.0 networks against similar threats in the future. We believe the insights gleaned from this research help illuminate the challenges imposed by blockchain network takeovers in the Web 3.0 era, suggest ways to foster the development of decentralized technologies and governance, as well as to enhance the protection of Web 3.0 user rights. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.17816 [pdf, other]

NC-NCD: Novel Class Discovery for Node Classification

Authors: Yue Hou, Xueyuan Chen, He Zhu, Romei Liu, Bowen Shi, Jiaheng Liu, Junran Wu, Ke Xu

Abstract: Novel Class Discovery (NCD) involves identifying new categories within unlabeled data by utilizing knowledge acquired from previously established categories. However, existing NCD methods often struggle to maintain a balance between the performance of old and new categories. Discovering unlabeled new categories in a class-incremental way is more practical but also more challenging, as it is freque… ▽ More Novel Class Discovery (NCD) involves identifying new categories within unlabeled data by utilizing knowledge acquired from previously established categories. However, existing NCD methods often struggle to maintain a balance between the performance of old and new categories. Discovering unlabeled new categories in a class-incremental way is more practical but also more challenging, as it is frequently hindered by either catastrophic forgetting of old categories or an inability to learn new ones. Furthermore, the implementation of NCD on continuously scalable graph-structured data remains an under-explored area. In response to these challenges, we introduce for the first time a more practical NCD scenario for node classification (i.e., NC-NCD), and propose a novel self-training framework with prototype replay and distillation called SWORD, adopted to our NC-NCD setting. Our approach enables the model to cluster unlabeled new category nodes after learning labeled nodes while preserving performance on old categories without reliance on old category nodes. SWORD achieves this by employing a self-training strategy to learn new categories and preventing the forgetting of old categories through the joint use of feature prototypes and knowledge distillation. Extensive experiments on four common benchmarks demonstrate the superiority of SWORD over other state-of-the-art methods. △ Less

Submitted 25 July, 2024; originally announced July 2024.

Comments: Accepted by CIKM'24

arXiv:2407.17416 [pdf, other]

Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification

Authors: Jesin James, Balamurali B. T., Binu Abeysinghe, Junchen Liu

Abstract: This study investigates discriminative patterns learned by neural networks for accurate speech classification, with a specific focus on vowel classification tasks. By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms. Through the use of class activation mapping, we identify the frequencies that contribu… ▽ More This study investigates discriminative patterns learned by neural networks for accurate speech classification, with a specific focus on vowel classification tasks. By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms. Through the use of class activation mapping, we identify the frequencies that contribute to vowel classification and compare these findings with linguistic knowledge. Experiments on a American English dataset of vowels showcases the explainability of neural networks and provides valuable insights into the causes of misclassifications and their characteristics when differentiating them from unvoiced speech. This study not only enhances our understanding of the underlying acoustic cues in vowel classification but also offers opportunities for improving speech recognition by bridging the gap between abstract representations in neural networks and established linguistic knowledge △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: 5th International Conference on Artificial Intelligence and Speech Technology (AIST-2023), New Delhi, India

arXiv:2407.17379 [pdf, other]

MMRA: A Benchmark for Multi-granularity Multi-image Relational Association

Authors: Siwei Wu, Kang Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, Jiaheng Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, Chenghua Lin

Abstract: Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVMLs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks mainly focus on the objective fact or certain topic related potential knowledge within a image, but overlook the associative relations between multiple images. Theref… ▽ More Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVMLs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks mainly focus on the objective fact or certain topic related potential knowledge within a image, but overlook the associative relations between multiple images. Therefore, we define a multi-image relation association task, and meticulously curate \textbf{MMRA} benchmark, a \textbf{M}ulti-granularity \textbf{M}ulti-image \textbf{R}elational \textbf{A}ssociation benchmark, consisted of \textbf{1026} samples. In order to systematically and comprehensively evaluate mainstream LVLMs, we establish an associational relation system among images that contain \textbf{11 subtasks} (e.g, UsageSimilarity, SubEvent, etc.) at two granularity levels (i.e., "\textbf{image}" and "\textbf{entity}") according to the relations in ConceptNet. Our experiments demonstrate that, on our MMRA benchmark, current mainstream LVLMs all have their own advantages and disadvantages across different subtasks. It is worth noting that, at the entity level, the performance of all models is worse than that of them at the image level, indicating that the fine-grained multi-image perception task is still challenging for LVLMs. The tasks related to spatial perception are relatively difficult for LVLMs to handle. Furthermore, we find that LVMLs exhibit a good ability to perceive image details, and the key to enhancing their multi-image association capability is to strengthen the reasoning ability of their language model component. All our codes and data are released at htt\url{https://github.com/Wusiwei0410/MMRA}. △ Less

Submitted 24 July, 2024; originally announced July 2024.

Comments: VLMS, Multi-Image Association

arXiv:2407.17333 [pdf, other]

Global and Local Confidence Based Fraud Detection Graph Neural Network

Authors: Jiaxun Liu, Yue Tian, Guanjun Liu

Abstract: This paper presents the Global and Local Confidence Graph Neural Network (GLC-GNN), an innovative approach to graph-based anomaly detection that addresses the challenges of heterophily and camouflage in fraudulent activities. By introducing a prototype to encapsulate the global features of a graph and calculating a Global Confidence (GC) value for each node, GLC-GNN effectively distinguishes betwe… ▽ More This paper presents the Global and Local Confidence Graph Neural Network (GLC-GNN), an innovative approach to graph-based anomaly detection that addresses the challenges of heterophily and camouflage in fraudulent activities. By introducing a prototype to encapsulate the global features of a graph and calculating a Global Confidence (GC) value for each node, GLC-GNN effectively distinguishes between benign and fraudulent nodes. The model leverages GC to generate attention values for message aggregation, enhancing its ability to capture both homophily and heterophily. Through extensive experiments on four open datasets, GLC-GNN demonstrates superior performance over state-of-the-art models in accuracy and convergence speed, while maintaining a compact model size and expedited training process. The integration of global and local confidence measures in GLC-GNN offers a robust solution for detecting anomalies in graphs, with significant implications for fraud detection across diverse domains. △ Less

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.17120 [pdf, other]

Parameter-Efficient Fine-Tuning for Continual Learning: A Neural Tangent Kernel Perspective

Authors: Jingren Liu, Zhong Ji, YunLong Yu, Jiale Cao, Yanwei Pang, Jungong Han, Xuelong Li

Abstract: Parameter-efficient fine-tuning for continual learning (PEFT-CL) has shown promise in adapting pre-trained models to sequential tasks while mitigating catastrophic forgetting problem. However, understanding the mechanisms that dictate continual performance in this paradigm remains elusive. To tackle this complexity, we undertake a rigorous analysis of PEFT-CL dynamics to derive relevant metrics fo… ▽ More Parameter-efficient fine-tuning for continual learning (PEFT-CL) has shown promise in adapting pre-trained models to sequential tasks while mitigating catastrophic forgetting problem. However, understanding the mechanisms that dictate continual performance in this paradigm remains elusive. To tackle this complexity, we undertake a rigorous analysis of PEFT-CL dynamics to derive relevant metrics for continual scenarios using Neural Tangent Kernel (NTK) theory. With the aid of NTK as a mathematical analysis tool, we recast the challenge of test-time forgetting into the quantifiable generalization gaps during training, identifying three key factors that influence these gaps and the performance of PEFT-CL: training sample size, task-level feature orthogonality, and regularization. To address these challenges, we introduce NTK-CL, a novel framework that eliminates task-specific parameter storage while adaptively generating task-relevant features. Aligning with theoretical guidance, NTK-CL triples the feature representation of each sample, theoretically and empirically reducing the magnitude of both task-interplay and task-specific generalization gaps. Grounded in NTK analysis, our approach imposes an adaptive exponential moving average mechanism and constraints on task-level feature orthogonality, maintaining intra-task NTK forms while attenuating inter-task NTK forms. Ultimately, by fine-tuning optimizable parameters with appropriate regularization, NTK-CL achieves state-of-the-art performance on established PEFT-CL benchmarks. This work provides a theoretical foundation for understanding and improving PEFT-CL models, offering insights into the interplay between feature representation, task orthogonality, and generalization, contributing to the development of more efficient continual learning systems. △ Less

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.17007 [pdf, other]

Pensieve Discuss: Scalable Small-Group CS Tutoring System with AI

Authors: Yoonseok Yang, Jack Liu, J. D. Zamfirescu-Pereira, John DeNero

Abstract: Small-group tutoring in Computer Science (CS) is effective, but presents the challenge of providing a dedicated tutor for each group and encouraging collaboration among group members at scale. We present Pensieve Discuss, a software platform that integrates synchronous editing for scaffolded programming problems with online human and AI tutors, designed to improve student collaboration and experie… ▽ More Small-group tutoring in Computer Science (CS) is effective, but presents the challenge of providing a dedicated tutor for each group and encouraging collaboration among group members at scale. We present Pensieve Discuss, a software platform that integrates synchronous editing for scaffolded programming problems with online human and AI tutors, designed to improve student collaboration and experience during group tutoring sessions. Our semester-long deployment to 800 students in a CS1 course demonstrated consistently high collaboration rates, positive feedback about the AI tutor's helpfulness and correctness, increased satisfaction with the group tutoring experience, and a substantial increase in question volume. The use of our system was preferred over an interface lacking AI tutors and synchronous editing capabilities. Our experiences suggest that small-group tutoring sessions are an important avenue for future research in educational AI. △ Less

Submitted 24 July, 2024; originally announced July 2024.

Comments: 6 pages, 7 figures, 4 tables, 1 page of references

arXiv:2407.16949 [pdf, ps, other]

Profitable Manipulations of Cryptographic Self-Selection are Statistically Detectable

Authors: Linda Cai, Jingyi Liu, S. Matthew Weinberg, Chenghan Zhou

Abstract: Cryptographic Self-Selection is a common primitive underlying leader-selection for Proof-of-Stake blockchain protocols. The concept was first popularized in Algorand [CM19], who also observed that the protocol might be manipulable. [FHWY22] provide a concrete manipulation that is strictly profitable for a staker of any size (and also prove upper bounds on the gains from manipulation). Separately… ▽ More Cryptographic Self-Selection is a common primitive underlying leader-selection for Proof-of-Stake blockchain protocols. The concept was first popularized in Algorand [CM19], who also observed that the protocol might be manipulable. [FHWY22] provide a concrete manipulation that is strictly profitable for a staker of any size (and also prove upper bounds on the gains from manipulation). Separately, [YSZ23, BM24] initiate the study of undetectable profitable manipulations of consensus protocols with a focus on the seminal Selfish Mining strategy [ES14] for Bitcoin's Proof-of-Work longest-chain protocol. They design a Selfish Mining variant that, for sufficiently large miners, is strictly profitable yet also indistinguishable to an onlooker from routine latency (that is, a sufficiently large profit-maximizing miner could use their strategy to strictly profit over being honest in a way that still appears to the rest of the network as though everyone is honest but experiencing mildly higher latency. This avoids any risk of negatively impacting the value of the underlying cryptocurrency due to attack detection). We investigate the detectability of profitable manipulations of the canonical cryptographic self-selection leader selection protocol introduced in [CM19] and studied in [FHWY22], and establish that for any player with $α< \frac{3-\sqrt{5}}{2} \approx 0.38$ fraction of the total stake, every strictly profitable manipulation is statistically detectable. Specifically, we consider an onlooker who sees only the random seed of each round (and does not need to see any other broadcasts by any other players). We show that the distribution of the sequence of random seeds when any player is profitably manipulating the protocol is inconsistent with any distribution that could arise by honest stakers being offline or timing out (for a natural stylized model of honest timeouts). △ Less

Submitted 23 July, 2024; originally announced July 2024.

arXiv:2407.16697 [pdf, other]

AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking

Authors: Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro R. A. S. Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, Yutong Tang, Yining Cao, Haoqi Han, Zheyuan Zhang, Jiawei Liu, Tiezheng Zhang, Yujiu Ma, Jincheng Wang, Guang Zhang, Alan Yuille, Zongwei Zhou

Abstract: We introduce the largest abdominal CT dataset (termed AbdomenAtlas) of 20,460 three-dimensional CT volumes sourced from 112 hospitals across diverse populations, geographies, and facilities. AbdomenAtlas provides 673K high-quality masks of anatomical structures in the abdominal region annotated by a team of 10 radiologists with the help of AI algorithms. We start by having expert radiologists manu… ▽ More We introduce the largest abdominal CT dataset (termed AbdomenAtlas) of 20,460 three-dimensional CT volumes sourced from 112 hospitals across diverse populations, geographies, and facilities. AbdomenAtlas provides 673K high-quality masks of anatomical structures in the abdominal region annotated by a team of 10 radiologists with the help of AI algorithms. We start by having expert radiologists manually annotate 22 anatomical structures in 5,246 CT volumes. Following this, a semi-automatic annotation procedure is performed for the remaining CT volumes, where radiologists revise the annotations predicted by AI, and in turn, AI improves its predictions by learning from revised annotations. Such a large-scale, detailed-annotated, and multi-center dataset is needed for two reasons. Firstly, AbdomenAtlas provides important resources for AI development at scale, branded as large pre-trained models, which can alleviate the annotation workload of expert radiologists to transfer to broader clinical applications. Secondly, AbdomenAtlas establishes a large-scale benchmark for evaluating AI algorithms -- the more data we use to test the algorithms, the better we can guarantee reliable performance in complex clinical scenarios. An ISBI & MICCAI challenge named BodyMaps: Towards 3D Atlas of Human Body was launched using a subset of our AbdomenAtlas, aiming to stimulate AI innovation and to benchmark segmentation accuracy, inference efficiency, and domain generalizability. We hope our AbdomenAtlas can set the stage for larger-scale clinical trials and offer exceptional opportunities to practitioners in the medical imaging community. Codes, models, and datasets are available at https://www.zongweiz.com/dataset △ Less

Submitted 23 July, 2024; originally announced July 2024.

Comments: Published in Medical Image Analysis

arXiv:2407.16207 [pdf, other]

Graph-Structured Speculative Decoding

Authors: Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan

Abstract: Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of d… ▽ More Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of draft tokens that are accepted to the final output by generating multiple hypotheses instead of just one. This allows the LLM more options to choose from and select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure enables us to efficiently predict and merge recurring token sequences, vastly reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion parameter LLaMA-2 model, and observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding. △ Less

Submitted 23 July, 2024; originally announced July 2024.

arXiv:2407.16154 [pdf, other]

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

Abstract: Despite the advanced intelligence abilities of large language models (LLMs) in various applications, they still face significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a high-performing LLM (i.e., the teacher model). Prevailing techniques… ▽ More Despite the advanced intelligence abilities of large language models (LLMs) in various applications, they still face significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a high-performing LLM (i.e., the teacher model). Prevailing techniques in LLM distillation typically use a black-box model API to generate high-quality pretrained and aligned datasets, or utilize white-box distillation by altering the loss function to better transfer knowledge from the teacher LLM. However, these methods ignore the knowledge differences between the student and teacher LLMs across domains. This results in excessive focus on domains with minimal performance gaps and insufficient attention to domains with large gaps, reducing overall performance. In this paper, we introduce a new LLM distillation framework called DDK, which dynamically adjusts the composition of the distillation dataset in a smooth manner according to the domain performance differences between the teacher and student models, making the distillation process more stable and effective. Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin. △ Less

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.16124 [pdf, other]

Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos

Authors: Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, Renjie Liao

Abstract: Significant advancements have been made in video generative models recently. Unlike image generation, video generation presents greater challenges, requiring not only generating high-quality frames but also ensuring temporal consistency across these frames. Despite the impressive progress, research on metrics for evaluating the quality of generated videos, especially concerning temporal and motion… ▽ More Significant advancements have been made in video generative models recently. Unlike image generation, video generation presents greater challenges, requiring not only generating high-quality frames but also ensuring temporal consistency across these frames. Despite the impressive progress, research on metrics for evaluating the quality of generated videos, especially concerning temporal and motion consistency, remains underexplored. To bridge this research gap, we propose Fréchet Video Motion Distance (FVMD) metric, which focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fréchet distance. We conduct sensitivity analysis by injecting noise into real videos to verify the effectiveness of FVMD. Further, we carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics. Additionally, our motion features can consistently improve the performance of Video Quality Assessment (VQA) models, indicating that our approach is also applicable to unary video quality evaluation. Code is available at https://github.com/ljh0v0/FMD-frechet-motion-distance. △ Less

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.15975 [pdf, other]

Multilingual Fine-Grained News Headline Hallucination Detection

Authors: Jiaming Shen, Tianqi Liu, Jialu Liu, Zhen Qin, Jay Pavagadhi, Simon Baumgartner, Michael Bendersky

Abstract: The popularity of automated news headline generation has surged with advancements in pre-trained language models. However, these models often suffer from the ``hallucination'' problem, where the generated headline is not fully supported by its source article. Efforts to address this issue have predominantly focused on English, using over-simplistic classification schemes that overlook nuanced hall… ▽ More The popularity of automated news headline generation has surged with advancements in pre-trained language models. However, these models often suffer from the ``hallucination'' problem, where the generated headline is not fully supported by its source article. Efforts to address this issue have predominantly focused on English, using over-simplistic classification schemes that overlook nuanced hallucination types. In this study, we introduce the first multilingual, fine-grained news headline hallucination detection dataset that contains over 11 thousand pairs in 5 languages, each annotated with detailed hallucination types by experts. We conduct extensive experiments on this dataset under two settings. First, we implement several supervised fine-tuning approaches as preparatory solutions and demonstrate this dataset's challenges and utilities. Second, we test various large language models' in-context learning abilities and propose two novel techniques, language-dependent demonstration selection and coarse-to-fine prompting, to boost the few-shot hallucination detection performance in terms of the example-F1 metric. We release this dataset to foster further research in multilingual, fine-grained headline hallucination detection. △ Less

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.15706 [pdf, other]

Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Authors: Jinfu Liu, Chen Chen, Mengyuan Liu

Abstract: Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and… ▽ More Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action. △ Less

Submitted 24 July, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.15351 [pdf, other]

LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation

Authors: Jiaxing Zhang, Jiayi Liu, Dongsheng Luo, Jennifer Neville, Hua Wei

Abstract: Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mit… ▽ More Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mitigate learning bias. The efficacy of the BI module has been proven both theoretically and experimentally. We conduct experiments on both synthetic and real-world datasets. The innovation of our work lies in two parts: 1. We provide a novel view of the possibility of an LLM functioning as a Bayesian inference to improve the performance of existing algorithms; 2. We are the first to discuss the learning bias issues in the GNN explanation problem. △ Less

Submitted 23 July, 2024; v1 submitted 21 July, 2024; originally announced July 2024.

Comments: Preprint Paper with 13 pages

arXiv:2407.14899 [pdf, other]

Hyperspectral Unmixing Under Endmember Variability: A Variational Inference Framework

Authors: Yuening Li, Xiao Fu, Junbin Liu, Wing-Kin Ma

Abstract: This work proposes a variational inference (VI) framework for hyperspectral unmixing in the presence of endmember variability (HU-EV). An EV-accounted noisy linear mixture model (LMM) is considered, and the presence of outliers is also incorporated into the model. Following the marginalized maximum likelihood (MML) principle, a VI algorithmic structure is designed for probabilistic inference for H… ▽ More This work proposes a variational inference (VI) framework for hyperspectral unmixing in the presence of endmember variability (HU-EV). An EV-accounted noisy linear mixture model (LMM) is considered, and the presence of outliers is also incorporated into the model. Following the marginalized maximum likelihood (MML) principle, a VI algorithmic structure is designed for probabilistic inference for HU-EV. Specifically, a patch-wise static endmember assumption is employed to exploit spatial smoothness and to try to overcome the ill-posed nature of the HU-EV problem. The design facilitates lightweight, continuous optimization-based updates under a variety of endmember priors. Some of the priors, such as the Beta prior, were previously used under computationally heavy, sampling-based probabilistic HU-EV methods. The effectiveness of the proposed framework is demonstrated through synthetic, semi-real, and real-data experiments. △ Less

Submitted 20 July, 2024; originally announced July 2024.

arXiv:2407.14774 [pdf, other]

Intelligent Artistic Typography: A Comprehensive Review of Artistic Text Design and Generation

Authors: Yuhang Bai, Zichuan Huang, Wenshuo Gao, Shuai Yang, Jiaying Liu

Abstract: Artistic text generation aims to amplify the aesthetic qualities of text while maintaining readability. It can make the text more attractive and better convey its expression, thus enjoying a wide range of application scenarios such as social media display, consumer electronics, fashion, and graphic design. Artistic text generation includes artistic text stylization and semantic typography. Artisti… ▽ More Artistic text generation aims to amplify the aesthetic qualities of text while maintaining readability. It can make the text more attractive and better convey its expression, thus enjoying a wide range of application scenarios such as social media display, consumer electronics, fashion, and graphic design. Artistic text generation includes artistic text stylization and semantic typography. Artistic text stylization concentrates on the text effect overlaid upon the text, such as shadows, outlines, colors, glows, and textures. By comparison, semantic typography focuses on the deformation of the characters to strengthen their visual representation by mimicking the semantic understanding within the text. This overview paper provides an introduction to both artistic text stylization and semantic typography, including the taxonomy, the key ideas of representative methods, and the applications in static and dynamic artistic text generation. Furthermore, the dataset and evaluation metrics are introduced, and the future directions of artistic text generation are discussed. A comprehensive list of artistic text generation models studied in this review is available at https://github.com/williamyang1991/Awesome-Artistic-Typography/. △ Less

Submitted 20 July, 2024; originally announced July 2024.

Comments: GitHub Page: https://github.com/williamyang1991/Awesome-Artistic-Typography/

arXiv:2407.14138 [pdf, other]

Visual Text Generation in the Wild

Authors: Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang

Abstract: Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in… ▽ More Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery. △ Less

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.14126 [pdf, other]

Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

Authors: Jinfeng Liu, Lingtong Kong, Bo Li, Zerong Wang, Hong Gu, Jinwei Chen

Abstract: Self-supervised monocular depth estimation has gathered notable interest since it can liberate training from dependency on depth annotations. In monocular video training case, recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance. To tackle this, we try to synthesize more virtual camera views by flow-based video frame interpolation (VFI), termed… ▽ More Self-supervised monocular depth estimation has gathered notable interest since it can liberate training from dependency on depth annotations. In monocular video training case, recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance. To tackle this, we try to synthesize more virtual camera views by flow-based video frame interpolation (VFI), termed as temporal augmentation. For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm and design a VFI-assisted multi-frame fusion module to align and aggregate multi-frame features, using motion and occlusion information obtained by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. In this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, along with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method can bring significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI. △ Less

Submitted 19 July, 2024; originally announced July 2024.

Comments: 27 pages, accepted by ECCV 2024

arXiv:2407.14078 [pdf, other]

Stable-Hair: Real-World Hair Transfer via Diffusion Model

Authors: Yuxuan Zhang, Qing Zhang, Yiren Song, Jiaming Liu

Abstract: Current hair transfer methods struggle to handle diverse and intricate hairstyles, thus limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named \textit{Stable-Hair}, which robustly transfers a wide range of real-world hairstyles onto user-provided faces for virtual hair try-on. To achieve this goal, our Stable-Hair fram… ▽ More Current hair transfer methods struggle to handle diverse and intricate hairstyles, thus limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named \textit{Stable-Hair}, which robustly transfers a wide range of real-world hairstyles onto user-provided faces for virtual hair try-on. To achieve this goal, our Stable-Hair framework is designed as a two-stage pipeline. In the first stage, we train a Bald Converter alongside stable diffusion to remove hair from the user-provided face images, resulting in bald images. In the second stage, we specifically designed three modules: a Hair Extractor, a Latent IdentityNet, and Hair Cross-Attention Layers to transfer the target hairstyle with highly detailed and high-fidelity to the bald image. Specifically, the Hair Extractor is trained to encode reference images with the desired hairstyles. To preserve the consistency of identity content and background between the source images and the transfer results, we employ a Latent IdentityNet to encode the source images. With the assistance of our Hair Cross-Attention Layers in the U-Net, we can accurately and precisely transfer the highly detailed and high-fidelity hairstyle to the bald image. Extensive experiments have demonstrated that our approach delivers state-of-the-art (SOTA) results among existing hair transfer methods. Project page: \textcolor{red}{\url{https://xiaojiu-z.github.io/Stable-Hair.github.io/}} △ Less

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.13757 [pdf, other]

Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models

Authors: Zhuo Chen, Jiawei Liu, Haotan Liu, Qikai Cheng, Fan Zhang, Wei Lu, Xiaozhong Liu

Abstract: Retrieval-Augmented Generation (RAG) is applied to solve hallucination problems and real-time constraints of large language models, but it also induces vulnerabilities against retrieval corruption attacks. Existing research mainly explores the unreliability of RAG in white-box and closed-domain QA tasks. In this paper, we aim to reveal the vulnerabilities of Retrieval-Enhanced Generative (RAG) mod… ▽ More Retrieval-Augmented Generation (RAG) is applied to solve hallucination problems and real-time constraints of large language models, but it also induces vulnerabilities against retrieval corruption attacks. Existing research mainly explores the unreliability of RAG in white-box and closed-domain QA tasks. In this paper, we aim to reveal the vulnerabilities of Retrieval-Enhanced Generative (RAG) models when faced with black-box attacks for opinion manipulation. We explore the impact of such attacks on user cognition and decision-making, providing new insight to enhance the reliability and security of RAG models. We manipulate the ranking results of the retrieval model in RAG with instruction and use these results as data to train a surrogate model. By employing adversarial retrieval attack methods to the surrogate model, black-box transfer attacks on RAG are further realized. Experiments conducted on opinion datasets across multiple topics show that the proposed attack strategy can significantly alter the opinion polarity of the content generated by RAG. This demonstrates the model's vulnerability and, more importantly, reveals the potential negative impact on user cognition and decision-making, making it easier to mislead users into accepting incorrect or biased information. △ Less

Submitted 18 July, 2024; originally announced July 2024.

Comments: 10 pages, 3 figures, under review

arXiv:2407.13609 [pdf, other]

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Authors: Jiaqi Liu, Tao Huang, Chang Xu

Abstract: Recent breakthroughs in text-to-image diffusion models have significantly advanced the generation of high-fidelity, photo-realistic images from textual descriptions. Yet, these models often struggle with interpreting spatial arrangements from text, hindering their ability to produce images with precise spatial configurations. To bridge this gap, layout-to-image generation has emerged as a promisin… ▽ More Recent breakthroughs in text-to-image diffusion models have significantly advanced the generation of high-fidelity, photo-realistic images from textual descriptions. Yet, these models often struggle with interpreting spatial arrangements from text, hindering their ability to produce images with precise spatial configurations. To bridge this gap, layout-to-image generation has emerged as a promising direction. However, training-based approaches are limited by the need for extensively annotated datasets, leading to high data acquisition costs and a constrained conceptual scope. Conversely, training-free methods face challenges in accurately locating and generating semantically similar objects within complex compositions. This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase. By refining intra-token loss with selective sampling and enhancing the diffusion process with attention redistribution, we propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships. Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity. Code is available at https://github.com/Papple-F/csg.git. △ Less

Submitted 18 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.13252 [pdf, other]

Unveiling Structural Memorization: Structural Membership Inference Attack for Text-to-Image Diffusion Models

Authors: Qiao Li, Xiaomeng Fu, Xi Wang, Jin Liu, Xingyu Gao, Jiao Dai, Jizhong Han

Abstract: With the rapid advancements of large-scale text-to-image diffusion models, various practical applications have emerged, bringing significant convenience to society. However, model developers may misuse the unauthorized data to train diffusion models. These data are at risk of being memorized by the models, thus potentially violating citizens' privacy rights. Therefore, in order to judge whether a… ▽ More With the rapid advancements of large-scale text-to-image diffusion models, various practical applications have emerged, bringing significant convenience to society. However, model developers may misuse the unauthorized data to train diffusion models. These data are at risk of being memorized by the models, thus potentially violating citizens' privacy rights. Therefore, in order to judge whether a specific image is utilized as a member of a model's training set, Membership Inference Attack (MIA) is proposed to serve as a tool for privacy protection. Current MIA methods predominantly utilize pixel-wise comparisons as distinguishing clues, considering the pixel-level memorization characteristic of diffusion models. However, it is practically impossible for text-to-image models to memorize all the pixel-level information in massive training sets. Therefore, we move to the more advanced structure-level memorization. Observations on the diffusion process show that the structures of members are better preserved compared to those of nonmembers, indicating that diffusion models possess the capability to remember the structures of member images from training sets. Drawing on these insights, we propose a simple yet effective MIA method tailored for text-to-image diffusion models. Extensive experimental results validate the efficacy of our approach. Compared to current pixel-level baselines, our approach not only achieves state-of-the-art performance but also demonstrates remarkable robustness against various distortions. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.13163 [pdf, other]

ROLeR: Effective Reward Shaping in Offline Reinforcement Learning for Recommender Systems

Authors: Yi Zhang, Ruihong Qiu, Jiajun Liu, Sen Wang

Abstract: Offline reinforcement learning (RL) is an effective tool for real-world recommender systems with its capacity to model the dynamic interest of users and its interactive nature. Most existing offline RL recommender systems focus on model-based RL through learning a world model from offline data and building the recommendation policy by interacting with this model. Although these methods have made p… ▽ More Offline reinforcement learning (RL) is an effective tool for real-world recommender systems with its capacity to model the dynamic interest of users and its interactive nature. Most existing offline RL recommender systems focus on model-based RL through learning a world model from offline data and building the recommendation policy by interacting with this model. Although these methods have made progress in the recommendation performance, the effectiveness of model-based offline RL methods is often constrained by the accuracy of the estimation of the reward model and the model uncertainties, primarily due to the extreme discrepancy between offline logged data and real-world data in user interactions with online platforms. To fill this gap, a more accurate reward model and uncertainty estimation are needed for the model-based RL methods. In this paper, a novel model-based Reward Shaping in Offline Reinforcement Learning for Recommender Systems, ROLeR, is proposed for reward and uncertainty estimation in recommendation systems. Specifically, a non-parametric reward shaping method is designed to refine the reward model. In addition, a flexible and more representative uncertainty penalty is designed to fit the needs of recommendation systems. Extensive experiments conducted on four benchmark datasets showcase that ROLeR achieves state-of-the-art performance compared with existing baselines. The source code can be downloaded at https://github.com/ArronDZhang/ROLeR. △ Less

Submitted 18 July, 2024; originally announced July 2024.

Comments: CIKM 2024

arXiv:2407.12823 [pdf, other]

WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Authors: Kangyun Ning, Yisong Su, Xueqiang Lv, Yuanzhe Zhang, Jian Liu, Kang Liu, Jinan Xu

Abstract: Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their ability. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity for tools is uncertain, and incorrect or unnecessary use of tools can damage the general abilities of LLMs. Therefore, we propose to… ▽ More Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their ability. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity for tools is uncertain, and incorrect or unnecessary use of tools can damage the general abilities of LLMs. Therefore, we propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We then introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets, where six of them are tool-usage datasets, and five are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and LLMs' performance in tool-usage datasets improves when their ability is similar to ChatGPT. In both datasets, incorrect tool usage significantly impairs LLMs' performance. To mitigate this, we also develop the finetuning dataset to enhance tool decision-making. Fine-tuning Llama2-7B results in a 14\% average performance improvement and a 16.8\% decrease in incorrect tool usage. We will release the WTU-Eval benchmark. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.12684 [pdf, other]

4Dynamic: Text-to-4D Generation with Hybrid Priors

Authors: Yu-Jie Yuan, Leif Kobbelt, Jiwen Liu, Yuan Zhang, Pengfei Wan, Yu-Kun Lai, Lin Gao

Abstract: Due to the fascinating generative performance of text-to-image diffusion models, growing text-to-3D generation works explore distilling the 2D generative priors into 3D, using the score distillation sampling (SDS) loss, to bypass the data scarcity problem. The existing text-to-3D methods have achieved promising results in realism and 3D consistency, but text-to-4D generation still faces challenges… ▽ More Due to the fascinating generative performance of text-to-image diffusion models, growing text-to-3D generation works explore distilling the 2D generative priors into 3D, using the score distillation sampling (SDS) loss, to bypass the data scarcity problem. The existing text-to-3D methods have achieved promising results in realism and 3D consistency, but text-to-4D generation still faces challenges, including lack of realism and insufficient dynamic motions. In this paper, we propose a novel method for text-to-4D generation, which ensures the dynamic amplitude and authenticity through direct supervision provided by a video prior. Specifically, we adopt a text-to-video diffusion model to generate a reference video and divide 4D generation into two stages: static generation and dynamic generation. The static 3D generation is achieved under the guidance of the input text and the first frame of the reference video, while in the dynamic generation stage, we introduce a customized SDS loss to ensure multi-view consistency, a video-based SDS loss to improve temporal consistency, and most importantly, direct priors from the reference video to ensure the quality of geometry and texture. Moreover, we design a prior-switching training strategy to avoid conflicts between different priors and fully leverage the benefits of each prior. In addition, to enrich the generated motion, we further introduce a dynamic modeling representation composed of a deformation network and a topology network, which ensures dynamic continuity while modeling topological changes. Our method not only supports text-to-4D generation but also enables 4D generation from monocular videos. The comparison experiments demonstrate the superiority of our method compared to existing methods. △ Less

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.12568 [pdf, other]

LTRL: Boosting Long-tail Recognition via Reflective Learning

Authors: Qihao Zhao, Yalun Dai, Shen Lin, Wei Hu, Fan Zhang, Jun Liu

Abstract: In real-world scenarios, where knowledge distributions exhibit long-tail. Humans manage to master knowledge uniformly across imbalanced distributions, a feat attributed to their diligent practices of reviewing, summarizing, and correcting errors. Motivated by this learning process, we propose a novel learning paradigm, called reflecting learning, in handling long-tail recognition. Our method integ… ▽ More In real-world scenarios, where knowledge distributions exhibit long-tail. Humans manage to master knowledge uniformly across imbalanced distributions, a feat attributed to their diligent practices of reviewing, summarizing, and correcting errors. Motivated by this learning process, we propose a novel learning paradigm, called reflecting learning, in handling long-tail recognition. Our method integrates three processes for reviewing past predictions during training, summarizing and leveraging the feature relation across classes, and correcting gradient conflict for loss functions. These designs are lightweight enough to plug and play with existing long-tail learning methods, achieving state-of-the-art performance in popular long-tail visual benchmarks. The experimental results highlight the great potential of reflecting learning in dealing with long-tail recognition. △ Less

Submitted 17 July, 2024; originally announced July 2024.

Comments: ECCV2024

arXiv:2407.12312 [pdf, other]

Shap-Mix: Shapley Value Guided Mixing for Long-Tailed Skeleton Based Action Recognition

Authors: Jiahang Zhang, Lilang Lin, Jiaying Liu

Abstract: In real-world scenarios, human actions often fall into a long-tailed distribution. It makes the existing skeleton-based action recognition works, which are mostly designed based on balanced datasets, suffer from a sharp performance degradation. Recently, many efforts have been madeto image/video long-tailed learning. However, directly applying them to skeleton data can be sub-optimal due to the la… ▽ More In real-world scenarios, human actions often fall into a long-tailed distribution. It makes the existing skeleton-based action recognition works, which are mostly designed based on balanced datasets, suffer from a sharp performance degradation. Recently, many efforts have been madeto image/video long-tailed learning. However, directly applying them to skeleton data can be sub-optimal due to the lack of consideration of the crucial spatial-temporal motion patterns, especially for some modality-specific methodologies such as data augmentation. To this end, considering the crucial role of the body parts in the spatially concentrated human actions, we attend to the mixing augmentations and propose a novel method, Shap-Mix, which improves long-tailed learning by mining representative motion patterns for tail categories. Specifically, we first develop an effective spatial-temporal mixing strategy for the skeleton to boost representation quality. Then, the employed saliency guidance method is presented, consisting of the saliency estimation based on Shapley value and a tail-aware mixing policy. It preserves the salient motion parts of minority classes in mixed data, explicitly establishing the relationships between crucial body structure cues and high-level semantics. Extensive experiments on three large-scale skeleton datasets show our remarkable performance improvement under both long-tailed and balanced settings. Our project is publicly available at: https://jhang2020.github.io/Projects/Shap-Mix/Shap-Mix.html. △ Less

Submitted 17 July, 2024; originally announced July 2024.

Comments: Accepted by IJCAI 2024. Project Website: https://jhang2020.github.io/Projects/Shap-Mix/Shap-Mix.html

arXiv:2407.11853 [pdf, other]

A Case for Application-Aware Space Radiation Tolerance in Orbital Computing

Authors: Meiqi Wang, Han Qiu, Longnv Xu, Di Wang, Yuanjie Li, Tianwei Zhang, Jun Liu, Hewu Li

Abstract: We are witnessing a surge in the use of commercial off-the-shelf (COTS) hardware for cost-effective in-orbit computing, such as deep neural network (DNN) based on-satellite sensor data processing, Earth object detection, and task decision.However, once exposed to harsh space environments, COTS hardware is vulnerable to cosmic radiation and suffers from exhaustive single-event upsets (SEUs) and mul… ▽ More We are witnessing a surge in the use of commercial off-the-shelf (COTS) hardware for cost-effective in-orbit computing, such as deep neural network (DNN) based on-satellite sensor data processing, Earth object detection, and task decision.However, once exposed to harsh space environments, COTS hardware is vulnerable to cosmic radiation and suffers from exhaustive single-event upsets (SEUs) and multi-unit upsets (MCUs), both threatening the functionality and correctness of in-orbit computing.Existing hardware and system software protections against radiation are expensive for resource-constrained COTS nanosatellites and overwhelming for upper-layer applications due to their requirement for heavy resource redundancy and frequent reboots. Instead, we make a case for cost-effective space radiation tolerance using application domain knowledge. Our solution for the on-satellite DNN tasks, \name, exploits the uneven SEU/MCU sensitivity across DNN layers and MCUs' spatial correlation for lightweight radiation-tolerant in-orbit AI computing. Our extensive experiments using Chaohu-1 SAR satellite payloads and a hardware-in-the-loop, real data-driven space radiation emulator validate that RedNet can suppress the influence of radiation errors to $\approx$ 0 and accelerate the on-satellite DNN inference speed by 8.4%-33.0% at negligible extra costs. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11844 [pdf, other]

Variational Randomized Smoothing for Sample-Wise Adversarial Robustness

Authors: Ryo Hase, Ye Wang, Toshiaki Koike-Akino, Jing Liu, Kieran Parsons

Abstract: Randomized smoothing is a defensive technique to achieve enhanced robustness against adversarial examples which are small input perturbations that degrade the performance of neural network models. Conventional randomized smoothing adds random noise with a fixed noise level for every input sample to smooth out adversarial perturbations. This paper proposes a new variational framework that uses a pe… ▽ More Randomized smoothing is a defensive technique to achieve enhanced robustness against adversarial examples which are small input perturbations that degrade the performance of neural network models. Conventional randomized smoothing adds random noise with a fixed noise level for every input sample to smooth out adversarial perturbations. This paper proposes a new variational framework that uses a per-sample noise level suitable for each input by introducing a noise level selector. Our experimental results demonstrate enhancement of empirical robustness against adversarial attacks. We also provide and analyze the certified robustness for our sample-wise smoothing method. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: 20 pages, under preparation

arXiv:2407.11700 [pdf, other]

Rate-Distortion-Cognition Controllable Versatile Neural Image Compression

Authors: Jinming Liu, Ruoyu Feng, Yunpeng Qi, Qiuyu Chen, Zhibo Chen, Wenjun Zeng, Xin Jin

Abstract: Recently, the field of Image Coding for Machines (ICM) has garnered heightened interest and significant advances thanks to the rapid progress of learning-based techniques for image compression and analysis. Previous studies often require training separate codecs to support various bitrate levels, machine tasks, and networks, thus lacking both flexibility and practicality. To address these challeng… ▽ More Recently, the field of Image Coding for Machines (ICM) has garnered heightened interest and significant advances thanks to the rapid progress of learning-based techniques for image compression and analysis. Previous studies often require training separate codecs to support various bitrate levels, machine tasks, and networks, thus lacking both flexibility and practicality. To address these challenges, we propose a rate-distortion-cognition controllable versatile image compression, which method allows the users to adjust the bitrate (i.e., Rate), image reconstruction quality (i.e., Distortion), and machine task accuracy (i.e., Cognition) with a single neural model, achieving ultra-controllability. Specifically, we first introduce a cognition-oriented loss in the primary compression branch to train a codec for diverse machine tasks. This branch attains variable bitrate by regulating quantization degree through the latent code channels. To further enhance the quality of the reconstructed images, we employ an auxiliary branch to supplement residual information with a scalable bitstream. Ultimately, two branches use a `$βx + (1 - β) y$' interpolation strategy to achieve a balanced cognition-distortion trade-off. Extensive experiments demonstrate that our method yields satisfactory ICM performance and flexible Rate-Distortion-Cognition controlling. △ Less

Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

Comments: ECCV2024

arXiv:2407.11637 [pdf, other]

REMM:Rotation-Equivariant Framework for End-to-End Multimodal Image Matching

Authors: Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Weixing Liu, Xin Su

Abstract: We present REMM, a rotation-equivariant framework for end-to-end multimodal image matching, which fully encodes rotational differences of descriptors in the whole matching pipeline. Previous learning-based methods mainly focus on extracting modal-invariant descriptors, while consistently ignoring the rotational invariance. In this paper, we demonstrate that our REMM is very useful for multimodal i… ▽ More We present REMM, a rotation-equivariant framework for end-to-end multimodal image matching, which fully encodes rotational differences of descriptors in the whole matching pipeline. Previous learning-based methods mainly focus on extracting modal-invariant descriptors, while consistently ignoring the rotational invariance. In this paper, we demonstrate that our REMM is very useful for multimodal image matching, including multimodal feature learning module and cyclic shift module. We first learn modal-invariant features through the multimodal feature learning module. Then, we design the cyclic shift module to rotationally encode the descriptors, greatly improving the performance of rotation-equivariant matching, which makes them robust to any angle. To validate our method, we establish a comprehensive rotation and scale-matching benchmark for evaluating the anti-rotation performance of multimodal images, which contains a combination of multi-angle and multi-scale transformations from four publicly available datasets. Extensive experiments show that our method outperforms existing methods in benchmarking and generalizes well to independent datasets. Additionally, we conducted an in-depth analysis of the key components of the REMM to validate the improvements brought about by the cyclic shift module. Code and dataset at https://github.com/HanNieWHU/REMM. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: 13 pages, 13 figures

arXiv:2407.11486 [pdf, other]

An efficient framework based on large foundation model for cervical cytopathology whole slide image screening

Authors: Jialong Huang, Gaojie Li, Shichao Kan, Jianfeng Liu, Yixiong Liang

Abstract: Current cervical cytopathology whole slide image (WSI) screening primarily relies on detection-based approaches, which are limited in performance due to the expense and time-consuming annotation process. Multiple Instance Learning (MIL), a weakly supervised approach that relies solely on bag-level labels, can effectively alleviate these challenges. Nonetheless, MIL commonly employs frozen pretrain… ▽ More Current cervical cytopathology whole slide image (WSI) screening primarily relies on detection-based approaches, which are limited in performance due to the expense and time-consuming annotation process. Multiple Instance Learning (MIL), a weakly supervised approach that relies solely on bag-level labels, can effectively alleviate these challenges. Nonetheless, MIL commonly employs frozen pretrained models or self-supervised learning for feature extraction, which suffers from low efficacy or inefficiency. In this paper, we propose an efficient framework for cervical cytopathology WSI classification using only WSI-level labels through unsupervised and weakly supervised learning. Given the sparse and dispersed nature of abnormal cells within cytopathological WSIs, we propose a strategy that leverages the pretrained foundation model to filter the top$k$ high-risk patches. Subsequently, we suggest parameter-efficient fine-tuning (PEFT) of a large foundation model using contrastive learning on the filtered patches to enhance its representation ability for task-specific signals. By training only the added linear adapters, we enhance the learning of patch-level features with substantially reduced time and memory consumption. Experiments conducted on the CSD and FNAC 2019 datasets demonstrate that the proposed method enhances the performance of various MIL methods and achieves state-of-the-art (SOTA) performance. The code and trained models are publicly available at https://github.com/CVIU-CSU/TCT-InfoNCE. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11382 [pdf, other]

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

Authors: Jianhao Li, Tianyu Sun, Zhongdao Wang, Enze Xie, Bailan Feng, Hongbo Zhang, Ze Yuan, Ke Xu, Jiaheng Liu, Ping Luo

Abstract: This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quali… ▽ More This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quality instance masks from the prompts using the Segment Anything Model (SAM) and transform the remaining problem into predicting 3D shapes from given 2D masks. Due to the ill-posed nature of this problem, it presents a significant challenge as multiple 3D shapes can project into an identical mask. To tackle this issue, we then lift 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections fit the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do. Thus, the generalization ability across different datasets improves. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90\%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects. △ Less

Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.11299 [pdf, other]

FR-SLAM: A SLAM Improvement Method Based on Floor Plan Registration

Authors: Jiantao Feng, Xinde Li, HyunCheol Park, Juan Liu, Zhentong Zhang

Abstract: Simultaneous Localization and Mapping (SLAM) technology enables the construction of environmental maps and localization, serving as a key technique for indoor autonomous navigation of mobile robots. Traditional SLAM methods typically require exhaustive traversal of all rooms during indoor navigation to obtain a complete map, resulting in lengthy path planning times and prolonged time to reach targ… ▽ More Simultaneous Localization and Mapping (SLAM) technology enables the construction of environmental maps and localization, serving as a key technique for indoor autonomous navigation of mobile robots. Traditional SLAM methods typically require exhaustive traversal of all rooms during indoor navigation to obtain a complete map, resulting in lengthy path planning times and prolonged time to reach target points. Moreover, cumulative errors during motion lead to inaccurate robot localization, impacting navigation efficiency.This paper proposes an improved SLAM method, FR-SLAM, based on floor plan registration, utilizing a morphology-based floor plan registration algorithm to align and transform original floor plans. This approach facilitates the rapid acquisition of comprehensive motion maps and efficient path planning, enabling swift navigation to target positions within a shorter timeframe. To enhance registration and robot motion localization accuracy, a real-time update strategy is employed, comparing the current position's building structure with the map and dynamically updating floor plan registration results for precise localization. Comparative tests conducted on real and simulated datasets demonstrate that, compared to other benchmark algorithms, this method achieves higher floor plan registration accuracy and shorter time consumption to reach target positions. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.11053 [pdf, other]

doi 10.1016/j.ress.2024.110309

Sampling and active learning methods for network reliability estimation using K-terminal spanning tree

Authors: Chen Ding, Pengfei Wei, Yan Shi, Jinxing Liu, Matteo Broggi, Michael Beer

Abstract: Network reliability analysis remains a challenge due to the increasing size and complexity of networks. This paper presents a novel sampling method and an active learning method for efficient and accurate network reliability estimation under node failure and edge failure scenarios. The proposed sampling method adopts Monte Carlo technique to sample component lifetimes and the K-terminal spanning t… ▽ More Network reliability analysis remains a challenge due to the increasing size and complexity of networks. This paper presents a novel sampling method and an active learning method for efficient and accurate network reliability estimation under node failure and edge failure scenarios. The proposed sampling method adopts Monte Carlo technique to sample component lifetimes and the K-terminal spanning tree algorithm to accelerate structure function computation. Unlike existing methods that compute only one structure function value per sample, our method generates multiple component state vectors and corresponding structure function values from each sample. Network reliability is estimated based on survival signatures derived from these values. A transformation technique extends this method to handle both node failure and edge failure. To enhance efficiency of proposed sampling method and achieve adaptability to network topology changes, we introduce an active learning method utilizing a random forest (RF) classifier. This classifier directly predicts structure function values, integrates network behaviors across diverse topologies, and undergoes iterative refinement to enhance predictive accuracy. Importantly, the trained RF classifier can directly predict reliability for variant networks, a capability beyond the sampling method alone. Through investigating several network examples and two practical applications, the effectiveness of both proposed methods is demonstrated. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Journal ref: Reliability Engineering & System Safety (2024) 110309

arXiv:2407.10874 [pdf, other]

Random Channel Ablation for Robust Hand Gesture Classification with Multimodal Biosignals

Authors: Keshav Bimbraw, Jing Liu, Ye Wang, Toshiaki Koike-Akino

Abstract: Biosignal-based hand gesture classification is an important component of effective human-machine interaction. For multimodal biosignal sensing, the modalities often face data loss due to missing channels in the data which can adversely affect the gesture classification performance. To make the classifiers robust to missing channels in the data, this paper proposes using Random Channel Ablation (RC… ▽ More Biosignal-based hand gesture classification is an important component of effective human-machine interaction. For multimodal biosignal sensing, the modalities often face data loss due to missing channels in the data which can adversely affect the gesture classification performance. To make the classifiers robust to missing channels in the data, this paper proposes using Random Channel Ablation (RChA) during the training process. Ultrasound and force myography (FMG) data were acquired from the forearm for 12 hand gestures over 2 subjects. The resulting multimodal data had 16 total channels, 8 for each modality. The proposed method was applied to convolutional neural network architecture, and compared with baseline, imputation, and oracle methods. Using 5-fold cross-validation for the two subjects, on average, 12.2% and 24.5% improvement was observed for gesture classification with up to 4 and 8 missing channels respectively compared to the baseline. Notably, the proposed method is also robust to an increase in the number of missing channels compared to other methods. These results show the efficacy of using random channel ablation to improve classifier robustness for multimodal and multi-channel biosignal-based hand gesture classification. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: 5 pages, 4 figures

arXiv:2407.10870 [pdf, other]

GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM

Authors: Keshav Bimbraw, Ye Wang, Jing Liu, Toshiaki Koike-Akino

Abstract: Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their… ▽ More Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their capability without fine-tuning is often limited in specialized tasks. However, full fine-tuning of large foundation models is challenging due to enormous computation/memory/dataset requirements. We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and improves with few-shot, in-context learning. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: 8 pages, 9 figures

arXiv:2407.10756 [pdf, other]

GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

Authors: Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, Yong Wang

Abstract: In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches face challenges of less applicability in the industrial community due to the large number of parametric quantities and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most c… ▽ More In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches face challenges of less applicability in the industrial community due to the large number of parametric quantities and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation primarily rely on CNNs, we propose the Group-based Token Pruning Transformer (GTPT) that fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by gradually introducing keypoints in a coarse-to-fine manner. It minimizes the computation overhead while ensuring high performance. Besides, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose the Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared to other methods, the experimental results show that GTPT can achieve higher performance with less computation, especially in whole-body with numerous keypoints. △ Less

Submitted 16 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

Comments: ECCV 2024 accepted

arXiv:2407.10737 [pdf, other]

Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

Authors: Rining Wu, Feixiang Zhou, Ziwei Yin, Jian K. Liu

Abstract: Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are entrapped in the neuronal responses of the retina. It is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have paved an advanced way of understanding im… ▽ More Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are entrapped in the neuronal responses of the retina. It is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have paved an advanced way of understanding image pixels. Yet, neuronal coding in the brain largely lacks a deep understanding of its alignment with pixels. Most previous studies employ static images or artificial videos derived from static images for emulating more real and complicated stimuli. Despite these simple scenarios effectively help to separate key factors influencing visual coding, complex temporal relationships receive no consideration. To decompose the temporal features of visual coding in natural scenes, here we propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior, aimed at unraveling the temporal-based encoding patterns of retinal neuronal populations. The model demonstrates robust predictive performance in generalization tests. Furthermore, through detailed ablation experiments, we demonstrate the significance of each temporal module. Furthermore, we introduce a visual coding evaluation metric designed to integrate temporal considerations and compare the impact of different numbers of neuronal populations on complementary coding. In conclusion, our proposed Vi-ST demonstrates a novel modeling framework for neuronal coding of dynamic visual scenes in the brain, effectively aligning our brain representation of video with neuronal activity. The code is available at https://github.com/wurining/Vi-ST. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: This article is accepted by ECCV 2024, which ID is 12149. Accepted papers' id can be found in: https://eccv2024.ecva.net/Conferences/2024/AcceptedPapers

Showing 1–50 of 4,411 results for author: Liu, J