Skip to main content

Showing 1–50 of 2,134 results for author: Li, W

  1. arXiv:2407.20157  [pdf, other

    cs.AI

    rLLM: Relational Table Learning with LLMs

    Authors: Weichen Li, Xiaotong Huang, Jianwu Zheng, Zheng Wang, Chaokun Wang, Li Pan, Jianhua Li

    Abstract: We introduce rLLM (relationLLM), a PyTorch library designed for Relational Table Learning (RTL) with Large Language Models (LLMs). The core idea is to decompose state-of-the-art Graph Neural Networks, LLMs, and Table Neural Networks into standardized modules, to enable the fast construction of novel RTL-type models in a simple "combine, align, and co-train" manner. To illustrate the usage of rLLM,… ▽ More

    Submitted 29 July, 2024; originally announced July 2024.

  2. arXiv:2407.20121  [pdf, other

    cs.IR cs.AI

    EXIT: An EXplicit Interest Transfer Framework for Cross-Domain Recommendation

    Authors: Lei Huang, Weitao Li, Chenrui Zhang, Jinpeng Wang, Xianchun Yi, Sheng Chen

    Abstract: Cross-domain recommendation has attracted substantial interest in industrial apps such as Meituan, which serves multiple business domains via knowledge transfer and meets the diverse interests of users. However, existing methods typically follow an implicit modeling paradigm that blends the knowledge from both the source and target domains, and design intricate network structures to share learned… ▽ More

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Accepted at CIKM 2024

  3. arXiv:2407.19790  [pdf, other

    cs.AI

    Hashing based Contrastive Learning for Virtual Screening

    Authors: Jin Han, Yun Hong, Wu-Jun Li

    Abstract: Virtual screening (VS) is a critical step in computer-aided drug discovery, aiming to identify molecules that bind to a specific target receptor like protein. Traditional VS methods, such as docking, are often too time-consuming for screening large-scale molecular databases. Recent advances in deep learning have demonstrated that learning vector representations for both proteins and molecules usin… ▽ More

    Submitted 29 July, 2024; originally announced July 2024.

  4. arXiv:2407.19768  [pdf, other

    cs.CV

    Efficient Face Super-Resolution via Wavelet-based Feature Enhancement Network

    Authors: Wenjie Li, Heng Guo, Xuannan Liu, Kongming Liang, Jiani Hu, Zhanyu Ma, Jun Guo

    Abstract: Face super-resolution aims to reconstruct a high-resolution face image from a low-resolution face image. Previous methods typically employ an encoder-decoder structure to extract facial structural features, where the direct downsampling inevitably introduces distortions, especially to high-frequency features such as edges. To address this issue, we propose a wavelet-based feature enhancement netwo… ▽ More

    Submitted 29 July, 2024; originally announced July 2024.

  5. arXiv:2407.19669  [pdf, other

    cs.CL cs.IR

    mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

    Authors: Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, Min Zhang

    Abstract: We present systematic efforts in building long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than 512 of previous multilingual encoders). Then we construct a hybrid TRM and a cross-encoder reranker by contrastive lea… ▽ More

    Submitted 28 July, 2024; originally announced July 2024.

    Comments: 20 pages, 5 figures

  6. arXiv:2407.19234  [pdf, other

    cs.LG cs.DC

    Ordered Momentum for Asynchronous SGD

    Authors: Chang-Wei Shi, Yi-Rui Yang, Wu-Jun Li

    Abstract: Distributed learning is indispensable for training large-scale deep models. Asynchronous SGD~(ASGD) and its variants are commonly used distributed learning methods in many scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have… ▽ More

    Submitted 27 July, 2024; originally announced July 2024.

  7. arXiv:2407.18035  [pdf, other

    cs.CV cs.AI cs.CL

    RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

    Authors: Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu

    Abstract: Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited r… ▽ More

    Submitted 25 July, 2024; originally announced July 2024.

  8. arXiv:2407.17956  [pdf, other

    cs.CV

    SaccadeDet: A Novel Dual-Stage Architecture for Rapid and Accurate Detection in Gigapixel Images

    Authors: Wenxi Li, Ruxin Zhang, Haozhe Lin, Yuchen Guo, Chao Ma, Xiaokang Yang

    Abstract: The advancement of deep learning in object detection has predominantly focused on megapixel images, leaving a critical gap in the efficient processing of gigapixel images. These super high-resolution images present unique challenges due to their immense size and computational demands. To address this, we introduce 'SaccadeDet', an innovative architecture for gigapixel-level object detection, inspi… ▽ More

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: This paper is accepted to ECML-PKDD 2024

    Journal ref: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2024

  9. arXiv:2407.17797  [pdf, other

    cs.CV cs.AI

    A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

    Authors: Haonan Zheng, Xinyang Deng, Wen Jiang, Wenrui Li

    Abstract: With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain challenging. We note th… ▽ More

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: 14 pages, 9 figures, published in ACMMM2024(oral)

  10. arXiv:2407.17418  [pdf, other

    cs.CV

    3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities

    Authors: Yanqi Bao, Tianyu Ding, Jing Huo, Yaoli Liu, Yuxin Li, Wenbin Li, Yang Gao, Jiebo Luo

    Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussian representations through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives,… ▽ More

    Submitted 24 July, 2024; originally announced July 2024.

  11. arXiv:2407.17274  [pdf, other

    cs.MM cs.AI cs.CV

    Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

    Authors: Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua

    Abstract: Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval… ▽ More

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: Work in progress

  12. arXiv:2407.17183  [pdf

    cs.RO

    Robust Point Cloud Registration in Robotic Inspection with Locally Consistent Gaussian Mixture Model

    Authors: Lingjie Su, Wei Xu, Wenlong Li

    Abstract: In robotic inspection of aviation parts, achieving accurate pairwise point cloud registration between scanned and model data is essential. However, noise and outliers generated in robotic scanned data can compromise registration accuracy. To mitigate this challenge, this article proposes a probability-based registration method utilizing Gaussian Mixture Model (GMM) with local consistency constrain… ▽ More

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: 12 pages, 14 figures

  13. arXiv:2407.16697  [pdf, other

    cs.CV

    AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking

    Authors: Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro R. A. S. Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, Yutong Tang, Yining Cao, Haoqi Han, Zheyuan Zhang, Jiawei Liu, Tiezheng Zhang, Yujiu Ma, Jincheng Wang, Guang Zhang, Alan Yuille, Zongwei Zhou

    Abstract: We introduce the largest abdominal CT dataset (termed AbdomenAtlas) of 20,460 three-dimensional CT volumes sourced from 112 hospitals across diverse populations, geographies, and facilities. AbdomenAtlas provides 673K high-quality masks of anatomical structures in the abdominal region annotated by a team of 10 radiologists with the help of AI algorithms. We start by having expert radiologists manu… ▽ More

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: Published in Medical Image Analysis

  14. arXiv:2407.16082  [pdf

    cs.RO

    Gel-OPTOFORT Sensor: Multi-axis Force/Torque Measurement and Geometry Observation Using GelSight and Optoelectronic Sensor Technology

    Authors: Yohan Noh, Harshal Upare, Dalia Osman, Wanlin Li

    Abstract: Although conventional GelSight-based tactile and force/torque sensors excel in detecting objects' geometry and texture information while simultaneously sensing multi-axis forces, their performance is limited by the camera's lower frame rates and the inherent properties of the elastomer. These limitations restrict their ability to measure higher force ranges at high sampling frequencies. Besides, d… ▽ More

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 4 pages, 8 figures, conference

  15. arXiv:2407.15861  [pdf, other

    cs.CR cs.AI cs.CV

    Adversarial Attacks and Defenses on Text-to-Image Diffusion Models: A Survey

    Authors: Chenyu Zhang, Mingwang Hu, Wenhui Li, Lanjun Wang

    Abstract: Recently, the text-to-image diffusion model has gained considerable attention from the community due to its exceptional image generation capability. A representative model, Stable Diffusion, amassed more than 10 million users within just two months of its release. This surge in popularity has facilitated studies on the robustness and safety of the model, leading to the proposal of various adversar… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

  16. arXiv:2407.15399  [pdf, other

    cs.CL cs.AI cs.CR

    Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

    Authors: Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa Garcia

    Abstract: With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies… ▽ More

    Submitted 22 July, 2024; originally announced July 2024.

  17. arXiv:2407.14419  [pdf, other

    cs.CV

    HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation

    Authors: Zezeng Li, Weimin Wang, WenHai Li, Na Lei, Xianfeng Gu

    Abstract: Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in… ▽ More

    Submitted 19 July, 2024; originally announced July 2024.

  18. arXiv:2407.14242  [pdf, other

    cs.CV cs.MM

    Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images

    Authors: Bo Yuan, Danpei Zhao, Zhuoran Liu, Wentao Li, Tian Li

    Abstract: Continual learning (CL) breaks off the one-way training manner and enables a model to adapt to new data, semantics and tasks continuously. However, current CL methods mainly focus on single tasks. Besides, CL models are plagued by catastrophic forgetting and semantic drift since the lack of old data, which often occurs in remote-sensing interpretation due to the intricate fine-grained semantics. I… ▽ More

    Submitted 25 July, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: Accepted in ACMMM 2024

  19. arXiv:2407.13773  [pdf, other

    cs.DL cs.AI

    OpenDataLab: Empowering General Artificial Intelligence with Open Datasets

    Authors: Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, Dahua Lin

    Abstract: The advancement of artificial intelligence (AI) hinges on the quality and accessibility of data, yet the current fragmentation and variability of data sources hinder efficient data utilization. The dispersion of data sources and diversity of data formats often lead to inefficiencies in data retrieval and processing, significantly impeding the progress of AI research and applications. To address th… ▽ More

    Submitted 4 June, 2024; originally announced July 2024.

  20. arXiv:2407.13771  [pdf, other

    cs.CV

    Training-Free Model Merging for Multi-target Domain Adaptation

    Authors: Wenyi Li, Huan-ang Gao, Mingju Gao, Beiwen Tian, Rong Zhi, Hao Zhao

    Abstract: In this paper, we study multi-target domain adaptation of scene understanding models. While previous methods achieved commendable results through inter-domain consistency losses, they often assumed unrealistic simultaneous access to images from all target domains, overlooking constraints such as data transfer bandwidth limitations and data privacy concerns. Given these challenges, we pose the ques… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  21. arXiv:2407.13719  [pdf, other

    cs.CV

    HazeCLIP: Towards Language Guided Real-World Image Dehazing

    Authors: Ruiyi Wang, Wenhao Li, Xiaohong Liu, Chunyi Li, Zicheng Zhang, Xiongkuo Min, Guangtao Zhai

    Abstract: Existing methods have achieved remarkable performance in single image dehazing, particularly on synthetic datasets. However, they often struggle with real-world hazy images due to domain shift, limiting their practical applicability. This paper introduces HazeCLIP, a language-guided adaptation framework designed to enhance the real-world performance of pre-trained dehazing networks. Inspired by th… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 5 pages, 4 figures

  22. arXiv:2407.13519  [pdf, other

    cs.CV

    GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding

    Authors: Changshuo Wang, Meiqing Wu, Siew-Kei Lam, Xin Ning, Shangshu Yu, Ruiping Wang, Weijun Li, Thambipillai Srikanthan

    Abstract: Despite the significant advancements in pre-training methods for point cloud understanding, directly capturing intricate shape information from irregular point clouds without reliance on external data remains a formidable challenge. To address this problem, we propose GPSFormer, an innovative Global Perception and Local Structure Fitting-based Transformer, which learns detailed shape information f… ▽ More

    Submitted 24 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  23. arXiv:2407.13509  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

    Authors: Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

    Abstract: Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating v… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted by INTERSPEECH 2024

  24. arXiv:2407.12798  [pdf, other

    cs.CV

    Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

    Authors: Wenjun Li, Shudong Wang, Dong Zhao, Shenghui Xu, Zhaoming Pan, Zhimin Zhang

    Abstract: The key of the text-to-video retrieval (TVR) task lies in learning the unique similarity between each pair of text (consisting of words) and video (consisting of audio and image frames) representations. However, some problems exist in the representation alignment of video and text, such as a text, and further each word, are of different importance for video frames. Besides, audio usually carries a… ▽ More

    Submitted 20 June, 2024; originally announced July 2024.

  25. arXiv:2407.12022   

    cs.CL cs.AI

    ITERTL: An Iterative Framework for Fine-tuning LLMs for RTL Code Generation

    Authors: Peiyang Wu, Nan Guo, Xiao Xiao, Wenming Li, Xiaochun Ye, Dongrui Fan

    Abstract: Recently, large language models (LLMs) have demonstrated excellent performance in understanding human instructions and generating code, which has inspired researchers to explore the feasibility of generating RTL code with LLMs. However, the existing approaches to fine-tune LLMs on RTL codes typically are conducted on fixed datasets, which do not fully stimulate the capability of LLMs and require l… ▽ More

    Submitted 23 July, 2024; v1 submitted 27 June, 2024; originally announced July 2024.

    Comments: There is some mistakes about the Experimental Setup in Section4.1

  26. arXiv:2407.11505  [pdf, other

    cs.CV

    Haze-Aware Attention Network for Single-Image Dehazing

    Authors: Lihan Tong, Yun Liu, Weijia Li, Liyuan Chen, Erkang Chen

    Abstract: Single-image dehazing is a pivotal challenge in computer vision that seeks to remove haze from images and restore clean background details. Recognizing the limitations of traditional physical model-based methods and the inefficiencies of current attention-based solutions, we propose a new dehazing network combining an innovative Haze-Aware Attention Module (HAAM) with a Multiscale Frequency Enhanc… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: 13 pages, 6 figures

    Report number: applsci-3022856 MSC Class: 68I1C; 68I8P ACM Class: I.4.3; I.4.9

  27. arXiv:2407.11494  [pdf, other

    cs.CV

    Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction

    Authors: Guowei Xu, Jiale Tao, Wen Li, Lixin Duan

    Abstract: In the realm of stochastic human motion prediction (SHMP), researchers have often turned to generative models like GANS, VAEs and diffusion models. However, most previous approaches have struggled to accurately predict motions that are both realistic and coherent with past motion due to a lack of guidance on the latent distribution. In this paper, we introduce Semantic Latent Directions (SLD) as a… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  28. arXiv:2407.10680  [pdf, ps, other

    cs.SI cs.NI

    Friedkin-Johnsen Model for Opinion Dynamics on Signed Graphs

    Authors: Xiaotian Zhou, Haoxin Sun, Wanyue Xu, Wei Li, Zhongzhi Zhang

    Abstract: A signed graph offers richer information than an unsigned graph, since it describes both collaborative and competitive relationships in social networks. In this paper, we study opinion dynamics on a signed graph, based on the Friedkin-Johnsen model. We first interpret the equilibrium opinion in terms of a defined random walk on an augmented signed graph, by representing the equilibrium opinion of… ▽ More

    Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

  29. arXiv:2407.10068  [pdf, other

    cs.CL

    Multi-Granularity Semantic Revision for Large Language Model Distillation

    Authors: Xiaoyu Liu, Yun Zhang, Wei Li, Simiao Li, Xudong Huang, Hanting Chen, Yehui Tang, Jie Hu, Zhiwei Xiong, Yunhe Wang

    Abstract: Knowledge distillation plays a key role in compressing the Large Language Models (LLMs), which boosts a small-size student model under large teacher models' guidance. However, existing LLM distillation methods overly rely on student-generated outputs, which may introduce generation errors and misguide the distillation process. Moreover, the distillation loss functions introduced in previous art st… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

  30. arXiv:2407.09918  [pdf, other

    eess.IV cs.CV

    DiffRect: Latent Diffusion Label Rectification for Semi-supervised Medical Image Segmentation

    Authors: Xinyu Liu, Wuyang Li, Yixuan Yuan

    Abstract: Semi-supervised medical image segmentation aims to leverage limited annotated data and rich unlabeled data to perform accurate segmentation. However, existing semi-supervised methods are highly dependent on the quality of self-generated pseudo labels, which are prone to incorrect supervision and confirmation bias. Meanwhile, they are insufficient in capturing the label distributions in latent spac… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: MICCAI 2024

  31. arXiv:2407.09904  [pdf, other

    cs.LG

    Learning a Mini-batch Graph Transformer via Two-stage Interaction Augmentation

    Authors: Wenda Li, Kaixuan Chen, Shunyu Liu, Tongya Zheng, Wenjie Huang, Mingli Song

    Abstract: Mini-batch Graph Transformer (MGT), as an emerging graph learning model, has demonstrated significant advantages in semi-supervised node prediction tasks with improved computational efficiency and enhanced model robustness. However, existing methods for processing local information either rely on sampling or simple aggregation, which respectively result in the loss and squashing of critical neighb… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: 8 pages, 4 figures, Accept by ECAI2024

  32. arXiv:2407.08971  [pdf, other

    cs.CV

    Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

    Authors: Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

    Abstract: Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Latest WSTAL methods introduce pseudo label learning framework to bridge the gap between classification-based training and inferencing targets at localization, and achieve cutting-edge results. In these frameworks, a classification-based model is used to generate… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  33. arXiv:2407.08680  [pdf, other

    cs.CV

    Generalizable Implicit Motion Modeling for Video Frame Interpolation

    Authors: Zujin Guo, Wei Li, Chen Change Loy

    Abstract: Motion modeling is critical in flow-based Video Frame Interpolation (VFI). Existing paradigms either consider linear combinations of bidirectional flows or directly predict bilateral flows for given timestamps without exploring favorable motion priors, thus lacking the capability of effectively modeling spatiotemporal dynamics in real-world videos. To address this limitation, in this study, we int… ▽ More

    Submitted 29 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

    Comments: Project Page: https://gseancdat.github.io/projects/GIMMVFI

  34. arXiv:2407.08475  [pdf, other

    cs.CL

    Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective

    Authors: Runyuan Ma, Wei Li, Fukai Shang

    Abstract: With the rapid development of the large model domain, research related to fine-tuning has concurrently seen significant advancement, given that fine-tuning is a constituent part of the training process for large-scale models. Data engineering plays a fundamental role in the training process of models, which includes data infrastructure, data processing, etc. Data during fine-tuning likewise forms… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  35. arXiv:2407.08130  [pdf, other

    cs.MM cs.CV cs.SD eess.AS

    Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

    Authors: Wenrui Li, Penghong Wang, Ruiqin Xiong, Xiaopeng Fan

    Abstract: The spiking neural networks (SNNs) that efficiently encode temporal sequences have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with transformers (float-point sequences) to jointly explore the temporal-semantic information still facing challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted by TIP

  36. arXiv:2407.07895  [pdf, other

    cs.CV cs.CL cs.LG

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Authors: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li

    Abstract: Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capa… ▽ More

    Submitted 28 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: Project Page: https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/

  37. arXiv:2407.07046  [pdf, other

    cs.AI cs.CV

    CorMulT: A Semi-supervised Modality Correlation-aware Multimodal Transformer for Sentiment Analysis

    Authors: Yangmin Li, Ruiqi Zhu, Wengen Li

    Abstract: Multimodal sentiment analysis is an active research area that combines multiple data modalities, e.g., text, image and audio, to analyze human emotions and benefits a variety of applications. Existing multimodal sentiment analysis methods can be classified as modality interaction-based methods, modality transformation-based methods and modality similarity-based methods. However, most of these meth… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  38. arXiv:2407.06642  [pdf, other

    cs.CV

    Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

    Authors: Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, Wen Li

    Abstract: Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation models, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusio… ▽ More

    Submitted 18 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  39. arXiv:2407.06250  [pdf, other

    cs.CV

    FairDiff: Fair Segmentation with Point-Image Diffusion

    Authors: Wenyi Li, Haoran Xu, Guiyu Zhang, Huan-ang Gao, Mingju Gao, Mengyu Wang, Hao Zhao

    Abstract: Fairness is an important topic for medical image analysis, driven by the challenge of unbalanced training data among diverse target groups and the societal demand for equitable medical quality. In response to this issue, our research adopts a data-driven strategy-enhancing data balance by integrating synthetic images. However, in terms of generating synthetic images, previous works either lack pai… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted to MICCAI 2024

  40. arXiv:2407.06109  [pdf, other

    cs.CV

    PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models

    Authors: Jinhua Zhang, Hualian Sheng, Sijia Cai, Bing Deng, Qiao Liang, Wen Li, Ying Fu, Jieping Ye, Shuhang Gu

    Abstract: Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or Contr… ▽ More

    Submitted 16 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  41. arXiv:2407.05869  [pdf, other

    cs.AI

    PORCA: Root Cause Analysis with Partially Observed Data

    Authors: Chang Gong, Di Yao, Jin Wang, Wenbin Li, Lanting Fang, Yongtao Xie, Kaiyu Feng, Peng Han, Jingping Bi

    Abstract: Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems. It has been widely used in many application domains. Reliable diagnostic conclusions are of great importance in mitigating system failures and financial losses. However, previous studies implicitly assume a full observation of the system, which… ▽ More

    Submitted 11 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  42. arXiv:2407.05750  [pdf, other

    cs.CL

    Large Language Models Understand Layout

    Authors: Weiming Li, Manni Duan, Dong An, Yan Shao

    Abstract: Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is o… ▽ More

    Submitted 25 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  43. arXiv:2407.05680  [pdf, other

    cs.CV cs.AI

    Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering

    Authors: Qijun Gan, Wentong Li, Jinwei Ren, Jianke Zhu

    Abstract: Reconstructing high-fidelity hand models with intricate textures plays a crucial role in enhancing human-object interaction and advancing real-world applications. Despite the state-of-the-art methods excelling in texture generation and image rendering, they often face challenges in accurately capturing geometric details. Learning-based approaches usually offer better robustness and faster inferenc… ▽ More

    Submitted 8 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted by AAAI 2024

  44. arXiv:2407.05368  [pdf, other

    cs.SD cs.AI cs.IR eess.AS

    Music Era Recognition Using Supervised Contrastive Learning and Artist Information

    Authors: Qiqi He, Xuchen Song, Weituo Hao, Ju-Chiang Wang, Wei-Tsung Lu, Wei Li

    Abstract: Does popular music from the 60s sound different than that of the 90s? Prior study has shown that there would exist some variations of patterns and regularities related to instrumentation changes and growing loudness across multi-decadal trends. This indicates that perceiving the era of a song from musical features such as audio and artist information is possible. Music era information can be an im… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  45. arXiv:2407.04336  [pdf, ps, other

    eess.SP cs.AI

    AI-Based Beam-Level and Cell-Level Mobility Management for High Speed Railway Communications

    Authors: Wen Li, Wei Chen, Shiyue Wang, Yuanyuan Zhang, Michail Matthaiou, Bo Ai

    Abstract: High-speed railway (HSR) communications are pivotal for ensuring rail safety, operations, maintenance, and delivering passenger information services. The high speed of trains creates rapidly time-varying wireless channels, increases the signaling overhead, and reduces the system throughput, making it difficult to meet the growing and stringent needs of HSR applications. In this article, we explore… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  46. arXiv:2407.04064  [pdf, other

    cs.RO

    Collision Avoidance for Multiple UAVs in Unknown Scenarios with Causal Representation Disentanglement

    Authors: Jiafan Zhuang, Zihao Xia, Gaofei Han, Boxi Wang, Wenji Li, Dongliang Wang, Zhifeng Hao, Ruichu Cai, Zhun Fan

    Abstract: Deep reinforcement learning (DRL) has achieved remarkable progress in online path planning tasks for multi-UAV systems. However, existing DRL-based methods often suffer from performance degradation when tackling unseen scenarios, since the non-causal factors in visual representations adversely affect policy learning. To address this issue, we propose a novel representation learning approach, \ie,… ▽ More

    Submitted 15 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  47. arXiv:2407.04056  [pdf, other

    cs.RO

    Robust Policy Learning for Multi-UAV Collision Avoidance with Causal Feature Selection

    Authors: Jiafan Zhuang, Gaofei Han, Zihao Xia, Boxi Wang, Wenji Li, Dongliang Wang, Zhifeng Hao, Ruichu Cai, Zhun Fan

    Abstract: In unseen and complex outdoor environments, collision avoidance navigation for unmanned aerial vehicle (UAV) swarms presents a challenging problem. It requires UAVs to navigate through various obstacles and complex backgrounds. Existing collision avoidance navigation methods based on deep reinforcement learning show promising performance but suffer from poor generalization abilities, resulting in… ▽ More

    Submitted 15 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  48. arXiv:2407.03978  [pdf, other

    cs.CL cs.AI

    Benchmarking Complex Instruction-Following with Multiple Constraints Composition

    Authors: Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang

    Abstract: Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on m… ▽ More

    Submitted 11 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: 20 pages, 7 figures

  49. arXiv:2407.03842  [pdf, other

    cs.CV

    Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

    Authors: Linlong Fan, Ye Huang, Yanqi Ge, Wen Li, Lixin Duan

    Abstract: Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but their exploration of recognition under arbitrary views is limited. This is a challenging and realistic setting because each object has different viewpoint positions and quantities, and their poses are not aligned. However, most view-based methods, which aggregate multiple view features to obtain a global fe… ▽ More

    Submitted 17 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: ECCV 2024 camera ready

  50. arXiv:2407.03637  [pdf, other

    cs.LG cs.CL

    HERA: High-efficiency Matrix Compression via Element Replacement

    Authors: Yanshu Wang, Wang Li, Tong Yang

    Abstract: Large Language Models (LLMs) have significantly advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. However, their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment, particularly in resource-constrained environments like mobile devices and edge computing platforms. Additiona… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.