-
MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion
Authors:
Chencan Fu,
Yabiao Wang,
Jiangning Zhang,
Zhengkai Jiang,
Xiaofeng Mao,
Jiafu Wu,
Weijian Cao,
Chengjie Wang,
Yanhao Ge,
Yong Liu
Abstract:
Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To addre…
▽ More
Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. The MambaAttn block combines the sequential data processing strengths of the Mamba model with the contextual richness of attention mechanisms, enhancing the temporal coherence of generated gestures. SEAD adeptly fuses audio, text, style, and emotion modalities, employing disentanglement to deepen the fusion process and yield gestures with greater realism and diversity. Our approach, rigorously evaluated on the multi-modal BEAT dataset, demonstrates significant improvements in Fréchet Gesture Distance (FGD), diversity scores, and beat alignment, achieving state-of-the-art performance in co-speech gesture generation.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
FreeShell: A Context-Free 4D Printing Technique for Fabricating Complex 3D Triangle Mesh Shells
Authors:
Chao Yuan,
Nan Cao,
Xuejiao Ma,
Shengqi Dang
Abstract:
Freeform thin-shell surfaces are critical in various fields, but their fabrication is complex and costly. Traditional methods are wasteful and require custom molds, while 3D printing needs extensive support structures and post-processing. Thermoshrinkage actuated 4D printing is an effective method through flat structures fabricating 3D shell. However, existing research faces issues related to prec…
▽ More
Freeform thin-shell surfaces are critical in various fields, but their fabrication is complex and costly. Traditional methods are wasteful and require custom molds, while 3D printing needs extensive support structures and post-processing. Thermoshrinkage actuated 4D printing is an effective method through flat structures fabricating 3D shell. However, existing research faces issues related to precise deformation and limited robustness. Addressing these issues is challenging due to three key factors: (1) Difficulty in finding a universal method to control deformation across different materials; (2) Variability in deformation influenced by factors such as printing speed, layer thickness, and heating temperature; (3) Environmental factors affecting the deformation process. To overcome these challenges, we introduce FreeShell, a robust 4D printing technique that uses thermoshrinkage to create precise 3D shells. This method prints triangular tiles connected by shrinkable connectors from a single material. Upon heating, the connectors shrink, moving the tiles to form the desired 3D shape, simplifying fabrication and reducing material and environment dependency. An optimized algorithm for flattening 3D meshes ensures precision in printing. FreeShell demonstrates its effectiveness through various examples and experiments, showcasing accuracy, robustness, and strength, representing advancement in fabricating complex freeform surfaces.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
Advancements in Recommender Systems: A Comprehensive Analysis Based on Data, Algorithms, and Evaluation
Authors:
Xin Ma,
Mingyue Li,
Xuguang Liu
Abstract:
Using 286 research papers collected from Web of Science, ScienceDirect, SpringerLink, arXiv, and Google Scholar databases, a systematic review methodology was adopted to review and summarize the current challenges and potential future developments in data, algorithms, and evaluation aspects of RSs. It was found that RSs involve five major research topics, namely algorithmic improvement, domain app…
▽ More
Using 286 research papers collected from Web of Science, ScienceDirect, SpringerLink, arXiv, and Google Scholar databases, a systematic review methodology was adopted to review and summarize the current challenges and potential future developments in data, algorithms, and evaluation aspects of RSs. It was found that RSs involve five major research topics, namely algorithmic improvement, domain applications, user behavior & cognition, data processing & modeling, and social impact & ethics. Collaborative filtering and hybrid recommendation techniques are mainstream. The performance of RSs is jointly limited by four types of eight data issues, two types of twelve algorithmic issues, and two evaluation issues. Notably, data-related issues such as cold start, data sparsity, and data poisoning, algorithmic issues like interest drift, device-cloud collaboration, non-causal driven, and multitask conflicts, along with evaluation issues such as offline data leakage and multi-objective balancing, have prominent impacts. Fusing physiological signals for multimodal modeling, defending against data poisoning through user information behavior, evaluating generative recommendations via social experiments, fine-tuning pre-trained large models to schedule device-cloud resource, enhancing causal inference with deep reinforcement learning, training multi-task models based on probability distributions, using cross-temporal dataset partitioning, and evaluating recommendation objectives across the full lifecycle are feasible solutions to address the aforementioned prominent challenges and unlock the power and value of RSs.The collected literature is mainly based on major international databases, and future research will further expand upon it.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
LoFormer: Local Frequency Transformer for Image Deblurring
Authors:
Xintian Mao,
Jiansheng Wang,
Xingran Xie,
Qingli Li,
Yan Wang
Abstract:
Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing f…
▽ More
Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing fine-grained details, we introduce a novel approach termed Local Frequency Transformer (LoFormer). Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows. These operations offer the advantage of (1) ensuring equitable learning opportunities for both coarse-grained structures and fine-grained details, and (2) exploring a broader range of representational properties compared to coarse-grained global SA methods. Additionally, we introduce an MLP Gating mechanism complementary to Freq-LC, which serves to filter out irrelevant features while enhancing global learning capabilities. Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs. https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
Authors:
Zhichao Wang,
Bin Bi,
Shiva Kumar Pentyala,
Kiran Ramnath,
Sougata Chaudhuri,
Shubham Mehrotra,
Zixu,
Zhu,
Xiang-Bo Mao,
Sitaram Asur,
Na,
Cheng
Abstract:
With advancements in self-supervised learning, the availability of trillions tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired re…
▽ More
With advancements in self-supervised learning, the availability of trillions tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectation. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
Authors:
Xin Ma,
Yaohui Wang,
Gengyun Jia,
Xinyuan Chen,
Yuan-Fang Li,
Cunjian Chen,
Yu Qiao
Abstract:
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this pap…
▽ More
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. Such three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.
△ Less
Submitted 22 July, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
Exploring the Design of Collaborative Applications via the Lens of NDN Workspace
Authors:
Tianyuan Yu,
Xinyu Ma,
Varun Patil,
Yekta Kocaogullar,
Lixia Zhang
Abstract:
Metaverse applications desire to communicate with semantically identified objects among a diverse set of cyberspace entities, such as cameras for collecting images from, sensors for sensing environment, and users collaborating with each other, all could be nearby or far away, in a timely and secure way. However, supporting the above function faces networking challenges. Today's metaverse implement…
▽ More
Metaverse applications desire to communicate with semantically identified objects among a diverse set of cyberspace entities, such as cameras for collecting images from, sensors for sensing environment, and users collaborating with each other, all could be nearby or far away, in a timely and secure way. However, supporting the above function faces networking challenges. Today's metaverse implementations are, by and large, use secure transport connections to communicate with cloud servers instead of letting participating entities communicate directly. In this paper, we use the design and implementation of NDN Workspace, a web-based, multi-user collaborative app to showcase a new way to networking that supports many-to-many secure data exchanges among communicating entities directly. NDN Workspace users establish trust relations among each other, exchange URI-identified objects directly, and can collaborate through intermittent connectivity, all in the absence of cloud servers. Its data-centric design offers an exciting new approach to metaverse app development.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
Secure Web Objects: Building Blocks for Metaverse Interoperability and Decentralization
Authors:
Tianyuan Yu,
Xinyu Ma,
Varun Patil,
Yekta Kocaogullar,
Yulong Zhang,
Jeff Burke,
Dirk Kutscher,
Lixia Zhang
Abstract:
This position paper explores how to support the Web's evolution through an underlying data-centric approach that better matches the data-orientedness of modern and emerging applications. We revisit the original vision of the Web as a hypermedia system that supports document composability and application interoperability via name-based data access. We propose the use of secure web objects (SWO), a…
▽ More
This position paper explores how to support the Web's evolution through an underlying data-centric approach that better matches the data-orientedness of modern and emerging applications. We revisit the original vision of the Web as a hypermedia system that supports document composability and application interoperability via name-based data access. We propose the use of secure web objects (SWO), a data-oriented communication approach that can reduce complexity, centrality, and inefficiency, particularly for collaborative and local-first applications, such as the Metaverse and other collaborative applications. SWO are named, signed, application-defined objects that are secured independently of their containers or communications channels, an approach that leverages the results from over a decade-long data-centric networking research. This approach does not require intermediation by aggregators of identity, storage, and other services that are common today. We present a brief design overview, illustrated through prototypes for two editors of shared hypermedia documents: one for 3D and one for LaTeX. We also discuss our findings and suggest a roadmap for future research.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
A Two-Phase Visualization System for Continuous Human-AI Collaboration in Sequelae Analysis and Modeling
Authors:
Yang Ouyang,
Chenyang Zhang,
He Wang,
Tianle Ma,
Chang Jiang,
Yuheng Yan,
Zuoqin Yan,
Xiaojuan Ma,
Chuhan Shi,
Quan Li
Abstract:
In healthcare, AI techniques are widely used for tasks like risk assessment and anomaly detection. Despite AI's potential as a valuable assistant, its role in complex medical data analysis often oversimplifies human-AI collaboration dynamics. To address this, we collaborated with a local hospital, engaging six physicians and one data scientist in a formative study. From this collaboration, we prop…
▽ More
In healthcare, AI techniques are widely used for tasks like risk assessment and anomaly detection. Despite AI's potential as a valuable assistant, its role in complex medical data analysis often oversimplifies human-AI collaboration dynamics. To address this, we collaborated with a local hospital, engaging six physicians and one data scientist in a formative study. From this collaboration, we propose a framework integrating two-phase interactive visualization systems: one for Human-Led, AI-Assisted Retrospective Analysis and another for AI-Mediated, Human-Reviewed Iterative Modeling. This framework aims to enhance understanding and discussion around effective human-AI collaboration in healthcare.
△ Less
Submitted 20 July, 2024;
originally announced July 2024.
-
Enhanced Mortality Prediction in ICU Stroke Patients via Deep Learning
Authors:
Armin Abdollahi,
Xinghong Ma,
Jiahao Zhang,
Daijia Wu,
Tongshou Wu,
Zizheng Ye,
Maryam Pishgar
Abstract:
Background: Stroke is second-leading cause of disability and death among adults. Approximately 17 million people suffer from a stroke annually, with about 85% being ischemic strokes. Predicting mortality of ischemic stroke patients in intensive care unit (ICU) is crucial for optimizing treatment strategies, allocating resources, and improving survival rates. Methods: We acquired data on ICU ischem…
▽ More
Background: Stroke is second-leading cause of disability and death among adults. Approximately 17 million people suffer from a stroke annually, with about 85% being ischemic strokes. Predicting mortality of ischemic stroke patients in intensive care unit (ICU) is crucial for optimizing treatment strategies, allocating resources, and improving survival rates. Methods: We acquired data on ICU ischemic stroke patients from MIMIC-IV database, including diagnoses, vital signs, laboratory tests, medications, procedures, treatments, and clinical notes. Stroke patients were randomly divided into training (70%, n=2441), test (15%, n=523), and validation (15%, n=523) sets. To address data imbalances, we applied Synthetic Minority Over-sampling Technique (SMOTE). We selected 30 features for model development, significantly reducing feature number from 1095 used in the best study. We developed a deep learning model to assess mortality risk and implemented several baseline machine learning models for comparison. Results: XGB-DL model, combining XGBoost for feature selection and deep learning, effectively minimized false positives. Model AUROC improved from 0.865 (95% CI: 0.821 - 0.905) on first day to 0.903 (95% CI: 0.868 - 0.936) by fourth day using data from 3,646 ICU mortality patients in the MIMIC-IV database with 0.945 AUROC (95% CI: 0.944 - 0.947) during training. Although other ML models also performed well in terms of AUROC, we chose Deep Learning for its higher specificity. Conclusions: Through enhanced feature selection and data cleaning, proposed model demonstrates a 13% AUROC improvement compared to existing models while reducing feature number from 1095 in previous studies to 30.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension
Authors:
Wei Zhang,
Miaoxin Cai,
Tong Zhang,
Jun Li,
Yin Zhuang,
Xuerui Mao
Abstract:
Recent advances in visual prompting in the natural image area have allowed users to interact with artificial intelligence (AI) tools through various visual marks such as box, point, and free-form shapes. However, due to the significant difference between the natural and remote sensing (RS) images, existing visual prompting models face challenges in RS scenarios. Moreover, RS MLLMs mainly focus on…
▽ More
Recent advances in visual prompting in the natural image area have allowed users to interact with artificial intelligence (AI) tools through various visual marks such as box, point, and free-form shapes. However, due to the significant difference between the natural and remote sensing (RS) images, existing visual prompting models face challenges in RS scenarios. Moreover, RS MLLMs mainly focus on interpreting image-level RS data and only support interaction with language instruction, restricting flexibility applications in the real world. To address those limitations, the first visual prompting model named EarthMarker is proposed, which excels in image-level, region-level, and point-level RS imagery interpretation. Specifically, the visual prompts alongside images and text instruction input into the large language model (LLM), adapt models toward specific predictions and tasks. Subsequently, a sharing visual encoding method is introduced to refine multi-scale image features and visual prompt information uniformly. Furthermore, to endow the EarthMarker with versatile multi-granularity visual perception abilities, the cross-domain phased learning strategy is developed, and the disjoint parameters are optimized in a lightweight manner by leveraging both the natural and RS domain-specific knowledge. In addition, to tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal fine-grained visual prompting instruction is constructed. Extensive experiments are conducted to demonstrate the proposed EarthMarker's competitive performance, representing a significant advance in multi-granularity RS imagery interpretation under the visual prompting learning framework.
△ Less
Submitted 20 July, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Mixture of Experts based Multi-task Supervise Learning from Crowds
Authors:
Tao Han,
Huaixuan Shi,
Xinyi Ding,
Xiao Ma,
Huamao Gu,
Yili Fang
Abstract:
Existing truth inference methods in crowdsourcing aim to map redundant labels and items to the ground truth. They treat the ground truth as hidden variables and use statistical or deep learning-based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise…
▽ More
Existing truth inference methods in crowdsourcing aim to map redundant labels and items to the ground truth. They treat the ground truth as hidden variables and use statistical or deep learning-based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise characterizations and negatively impacting the quality of truth inference. This paper proposes a new paradigm of multi-task supervised learning from crowds, which eliminates the need for modeling of items's ground truth in worker behavior models. Within this paradigm, we propose a worker behavior model at the item feature level called Mixture of Experts based Multi-task Supervised Learning from Crowds (MMLC). Two truth inference strategies are proposed within MMLC. The first strategy, named MMLC-owf, utilizes clustering methods in the worker spectral space to identify the projection vector of the oracle worker. Subsequently, the labels generated based on this vector are considered as the inferred truth. The second strategy, called MMLC-df, employs the MMLC model to fill the crowdsourced data, which can enhance the effectiveness of existing truth inference methods. Experimental results demonstrate that MMLC-owf outperforms state-of-the-art methods and MMLC-df enhances the quality of existing truth inference methods.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Generating 3D House Wireframes with Semantics
Authors:
Xueqi Ma,
Yilin Liu,
Wenjun Zhou,
Ruowei Wang,
Hui Huang
Abstract:
We present a new approach for generating 3D house wireframes with semantic enrichment using an autoregressive model. Unlike conventional generative models that independently process vertices, edges, and faces, our approach employs a unified wire-based representation for improved coherence in learning 3D wireframe structures. By re-ordering wire sequences based on semantic meanings, we facilitate s…
▽ More
We present a new approach for generating 3D house wireframes with semantic enrichment using an autoregressive model. Unlike conventional generative models that independently process vertices, edges, and faces, our approach employs a unified wire-based representation for improved coherence in learning 3D wireframe structures. By re-ordering wire sequences based on semantic meanings, we facilitate seamless semantic integration during sequence generation. Our two-phase technique merges a graph-based autoencoder with a transformer-based decoder to learn latent geometric tokens and generate semantic-aware wireframes. Through iterative prediction and decoding during inference, our model produces detailed wireframes that can be easily segmented into distinct components, such as walls, roofs, and rooms, reflecting the semantic essence of the shape. Empirical results on a comprehensive house dataset validate the superior accuracy, novelty, and semantic fidelity of our model compared to existing generative models. More results and details can be found on https://vcc.tech/research/2024/3DWire.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model
Authors:
Shezheng Song,
Shasha Li,
Jie Yu,
Shan Zhao,
Xiaopeng Li,
Jun Ma,
Xiaodong Liu,
Zhuo Li,
Xiaoguang Mao
Abstract:
Our study delves into Multimodal Entity Linking, aligning the mention in multimodal information with entities in knowledge base. Existing methods are still facing challenges like ambiguous entity representations and limited image information utilization. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method: Dy…
▽ More
Our study delves into Multimodal Entity Linking, aligning the mention in multimodal information with entities in knowledge base. Existing methods are still facing challenges like ambiguous entity representations and limited image information utilization. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method: Dynamically Integrate Multimodal information with knowledge base (DIM), employing the capability of the Large Language Model (LLM) for visual understanding. The LLM, such as BLIP-2, extracts information relevant to entities in the image, which can facilitate improved extraction of entity features and linking them with the dynamic entity representations provided by ChatGPT. The experiments demonstrate that our proposed DIM method outperforms the majority of existing methods on the three original datasets, and achieves state-of-the-art (SOTA) on the dynamically enhanced datasets (Wiki+, Rich+, Diverse+). For reproducibility, our code and collected datasets are released on \url{https://github.com/season1blue/DIM}.
△ Less
Submitted 27 June, 2024;
originally announced July 2024.
-
Dynamic Dimension Wrapping (DDW) Algorithm: A Novel Approach for Efficient Cross-Dimensional Search in Dynamic Multidimensional Spaces
Authors:
Dongnan Jin,
Yali Liu,
Qiuzhi Song,
Xunju Ma,
Yue Liu,
Dehao Wu
Abstract:
In the real world, as the complexity of optimization problems continues to increase, there is an urgent need to research more efficient optimization methods. Current optimization algorithms excel in solving problems with a fixed number of dimensions. However, their efficiency in searching dynamic multi-dimensional spaces is unsatisfactory. In response to the challenge of cross-dimensional search i…
▽ More
In the real world, as the complexity of optimization problems continues to increase, there is an urgent need to research more efficient optimization methods. Current optimization algorithms excel in solving problems with a fixed number of dimensions. However, their efficiency in searching dynamic multi-dimensional spaces is unsatisfactory. In response to the challenge of cross-dimensional search in multi-dimensional spaces with varying numbers of dimensions, this study proposes a new optimization algorithm-Dynamic Dimension Wrapping (DDW) algorithm. Firstly, by utilizing the Dynamic Time Warping (DTW) algorithm and Euclidean distance, a mapping relationship between different time series across dimensions is established, thus creating a fitness function suitable for dimensionally dynamic multi-dimensional space. Additionally, DDW introduces a novel, more efficient cross-dimensional search mechanism for dynamic multidimensional spaces. Finally, through comparative tests with 31 optimization algorithms in dynamic multidimensional space search, the results demonstrate that DDW exhibits outstanding search efficiency and provides search results closest to the actual optimal solution.
△ Less
Submitted 18 July, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Authors:
Zhenxiong Tan,
Xinyin Ma,
Gongfan Fang,
Xinchao Wang
Abstract:
Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges are due to model's self-attention mechanism and training predominantly on 10-second clips, which complicates the extension to lo…
▽ More
Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges are due to model's self-attention mechanism and training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce a novel approach, LiteFocus that enhances the inference of existing audio latent diffusion models in long audio synthesis. Observed the attention pattern in self-attention, we employ a dual sparse form for attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints, while enhancing audio quality through cross-frequency refillment. LiteFocus demonstrates substantial reduction on inference time with diffusion-based TTA model by 1.99x in synthesizing 80-second audio clips while also obtaining improved audio quality.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
PolyRoom: Room-aware Transformer for Floorplan Reconstruction
Authors:
Yuzhou Liu,
Lingjie Zhu,
Xiaodong Ma,
Hanqiao Ye,
Xiang Gao,
Xianwei Zheng,
Shuhan Shen
Abstract:
Reconstructing geometry and topology structures from raw unstructured data has always been an important research topic in indoor mapping research. In this paper, we aim to reconstruct the floorplan with a vectorized representation from point clouds. Despite significant advancements achieved in recent years, current methods still encounter several challenges, such as missing corners or edges, inacc…
▽ More
Reconstructing geometry and topology structures from raw unstructured data has always been an important research topic in indoor mapping research. In this paper, we aim to reconstruct the floorplan with a vectorized representation from point clouds. Despite significant advancements achieved in recent years, current methods still encounter several challenges, such as missing corners or edges, inaccuracies in corner positions or angles, self-intersecting or overlapping polygons, and potentially implausible topology. To tackle these challenges, we present PolyRoom, a room-aware Transformer that leverages uniform sampling representation, room-aware query initialization, and room-aware self-attention for floorplan reconstruction. Specifically, we adopt a uniform sampling floorplan representation to enable dense supervision during training and effective utilization of angle information. Additionally, we propose a room-aware query initialization scheme to prevent non-polygonal sequences and introduce room-aware self-attention to enhance memory efficiency and model performance. Experimental results on two widely used datasets demonstrate that PolyRoom surpasses current state-of-the-art methods both quantitatively and qualitatively. Our code is available at: https://github.com/3dv-casia/PolyRoom/.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Accessing Vision Foundation Models at ImageNet-level Costs
Authors:
Yitian Zhang,
Xu Ma,
Yue Bai,
Huan Wang,
Yun Fu
Abstract:
Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill…
▽ More
Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M).
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Constrained Intrinsic Motivation for Reinforcement Learning
Authors:
Xiang Zheng,
Xingjun Ma,
Chao Shen,
Cong Wang
Abstract:
This paper investigates two fundamental problems that arise when utilizing Intrinsic Motivation (IM) for reinforcement learning in Reward-Free Pre-Training (RFPT) tasks and Exploration with Intrinsic Motivation (EIM) tasks: 1) how to design an effective intrinsic objective in RFPT tasks, and 2) how to reduce the bias introduced by the intrinsic objective in EIM tasks. Existing IM methods suffer fr…
▽ More
This paper investigates two fundamental problems that arise when utilizing Intrinsic Motivation (IM) for reinforcement learning in Reward-Free Pre-Training (RFPT) tasks and Exploration with Intrinsic Motivation (EIM) tasks: 1) how to design an effective intrinsic objective in RFPT tasks, and 2) how to reduce the bias introduced by the intrinsic objective in EIM tasks. Existing IM methods suffer from static skills, limited state coverage, sample inefficiency in RFPT tasks, and suboptimality in EIM tasks. To tackle these problems, we propose \emph{Constrained Intrinsic Motivation (CIM)} for RFPT and EIM tasks, respectively: 1) CIM for RFPT maximizes the lower bound of the conditional state entropy subject to an alignment constraint on the state encoder network for efficient dynamic and diverse skill discovery and state coverage maximization; 2) CIM for EIM leverages constrained policy optimization to adaptively adjust the coefficient of the intrinsic objective to mitigate the distraction from the intrinsic objective. In various MuJoCo robotics environments, we empirically show that CIM for RFPT greatly surpasses fifteen IM methods for unsupervised skill discovery in terms of skill diversity, state coverage, and fine-tuning performance. Additionally, we showcase the effectiveness of CIM for EIM in redeeming intrinsic rewards when task rewards are exposed from the beginning. Our code is available at https://github.com/x-zheng16/CIM.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
STD-LLM: Understanding Both Spatial and Temporal Properties of Spatial-Temporal Data with LLMs
Authors:
Yiheng Huang,
Xiaowei Mao,
Shengnan Guo,
Yubin Chen,
Youfang Lin,
Huaiyu Wan
Abstract:
Spatial-temporal forecasting and imputation are important for real-world dynamic systems such as intelligent transportation, urban planning, and public health. Most existing methods are tailored for individual forecasting or imputation tasks but are not designed for both. Additionally, they are less effective for zero-shot and few-shot learning. While large language models (LLMs) have exhibited st…
▽ More
Spatial-temporal forecasting and imputation are important for real-world dynamic systems such as intelligent transportation, urban planning, and public health. Most existing methods are tailored for individual forecasting or imputation tasks but are not designed for both. Additionally, they are less effective for zero-shot and few-shot learning. While large language models (LLMs) have exhibited strong pattern recognition and reasoning abilities across various tasks, including few-shot and zero-shot learning, their development in understanding spatial-temporal data has been constrained by insufficient modeling of complex correlations such as the temporal correlations, spatial connectivity, non-pairwise and high-order spatial-temporal correlations within data. In this paper, we propose STD-LLM for understanding both spatial and temporal properties of \underline{S}patial-\underline{T}emporal \underline{D}ata with \underline{LLM}s, which is capable of implementing both spatial-temporal forecasting and imputation tasks. STD-LLM understands spatial-temporal correlations via explicitly designed spatial and temporal tokenizers as well as virtual nodes. Topology-aware node embeddings are designed for LLMs to comprehend and exploit the topology structure of data. Additionally, to capture the non-pairwise and higher-order correlations, we design a hypergraph learning module for LLMs, which can enhance the overall performance and improve efficiency. Extensive experiments demonstrate that STD-LLM exhibits strong performance and generalization capabilities across the forecasting and imputation tasks on various datasets. Moreover, STD-LLM achieves promising results on both few-shot and zero-shot learning tasks.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Procedural Content Generation via Generative Artificial Intelligence
Authors:
Xinyu Mao,
Wanli Yu,
Kazunori D Yamada,
Michael R. Zielewski
Abstract:
The attempt to utilize machine learning in PCG has been made in the past. In this survey paper, we investigate how generative artificial intelligence (AI), which saw a significant increase in interest in the mid-2010s, is being used for PCG. We review applications of generative AI for the creation of various types of content, including terrains, items, and even storylines. While generative AI is e…
▽ More
The attempt to utilize machine learning in PCG has been made in the past. In this survey paper, we investigate how generative artificial intelligence (AI), which saw a significant increase in interest in the mid-2010s, is being used for PCG. We review applications of generative AI for the creation of various types of content, including terrains, items, and even storylines. While generative AI is effective for PCG, one significant issues it faces is that building high-performance generative AI requires vast amounts of training data. Because content generally highly customized, domain-specific training data is scarce, and straightforward approaches to generative AI models may not work well. For PCG research to advance further, issues related to limited training data must be overcome. Thus, we also give special consideration to research that addresses the challenges posed by limited training data.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models
Authors:
Linghao Jin,
Li An,
Xuezhe Ma
Abstract:
Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle in the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on an unrealistic assumption on sentence-level alignments. To mitigate these issues, we first curate a novel dataset of…
▽ More
Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle in the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on an unrealistic assumption on sentence-level alignments. To mitigate these issues, we first curate a novel dataset of Chinese-English literature, which consists of 160 books with intricate discourse structures. Then, we propose a more pragmatic and challenging setting for context-aware translation, termed chapter-to-chapter (Ch2Ch) translation, and investigate the performance of commonly-used machine translation models under this setting. Furthermore, we introduce a potential approach of finetuning large language models (LLMs) within the domain of Ch2Ch literary translation, yielding impressive improvements over baselines. Through our comprehensive analysis, we unveil that literary translation under the Ch2Ch setting is challenging in nature, with respect to both model learning methods and translation decoding algorithms.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Enhancing octree-based context models for point cloud geometry compression with attention-based child node number prediction
Authors:
Chang Sun,
Hui Yuan,
Xiaolong Mao,
Xin Lu,
Raouf Hamzaoui
Abstract:
In point cloud geometry compression, most octreebased context models use the cross-entropy between the onehot encoding of node occupancy and the probability distribution predicted by the context model as the loss. This approach converts the problem of predicting the number (a regression problem) and the position (a classification problem) of occupied child nodes into a 255-dimensional classificati…
▽ More
In point cloud geometry compression, most octreebased context models use the cross-entropy between the onehot encoding of node occupancy and the probability distribution predicted by the context model as the loss. This approach converts the problem of predicting the number (a regression problem) and the position (a classification problem) of occupied child nodes into a 255-dimensional classification problem. As a result, it fails to accurately measure the difference between the one-hot encoding and the predicted probability distribution. We first analyze why the cross-entropy loss function fails to accurately measure the difference between the one-hot encoding and the predicted probability distribution. Then, we propose an attention-based child node number prediction (ACNP) module to enhance the context models. The proposed module can predict the number of occupied child nodes and map it into an 8- dimensional vector to assist the context model in predicting the probability distribution of the occupancy of the current node for efficient entropy coding. Experimental results demonstrate that the proposed module enhances the coding efficiency of octree-based context models.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Green Screen Augmentation Enables Scene Generalisation in Robotic Manipulation
Authors:
Eugene Teoh,
Sumit Patidar,
Xiao Ma,
Stephen James
Abstract:
Generalising vision-based manipulation policies to novel environments remains a challenging area with limited exploration. Current practices involve collecting data in one location, training imitation learning or reinforcement learning policies with this data, and deploying the policy in the same location. However, this approach lacks scalability as it necessitates data collection in multiple loca…
▽ More
Generalising vision-based manipulation policies to novel environments remains a challenging area with limited exploration. Current practices involve collecting data in one location, training imitation learning or reinforcement learning policies with this data, and deploying the policy in the same location. However, this approach lacks scalability as it necessitates data collection in multiple locations for each task. This paper proposes a novel approach where data is collected in a location predominantly featuring green screens. We introduce Green-screen Augmentation (GreenAug), employing a chroma key algorithm to overlay background textures onto a green screen. Through extensive real-world empirical studies with over 850 training demonstrations and 8.2k evaluation episodes, we demonstrate that GreenAug surpasses no augmentation, standard computer vision augmentation, and prior generative augmentation methods in performance. While no algorithmic novelties are claimed, our paper advocates for a fundamental shift in data collection practices. We propose that real-world demonstrations in future research should utilise green screens, followed by the application of GreenAug. We believe GreenAug unlocks policy generalisation to visually distinct novel locations, addressing the current scene generalisation limitations in robot learning.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
Authors:
Tianjie Ju,
Yiting Wang,
Xinbei Ma,
Pengzhou Cheng,
Haodong Zhao,
Yulong Wang,
Lifeng Liu,
Jian Xie,
Zhuosheng Zhang,
Gongshen Liu
Abstract:
The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper,…
▽ More
The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper, we investigate this critical issue by constructing a detailed threat model and a comprehensive simulation environment that mirrors real-world multi-agent deployments in a trusted platform. Subsequently, we propose a novel two-stage attack method involving Persuasiveness Injection and Manipulated Knowledge Injection to systematically explore the potential for manipulated knowledge (i.e., counterfactual and toxic knowledge) spread without explicit prompt manipulation.
Our method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which can be exploited by attackers to unconsciously spread fabricated information. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation frameworks, where several benign agents store and retrieve manipulated chat histories for future interactions. This persistence indicates that even after the interaction has ended, the benign agents may continue to be influenced by manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, emphasizing the imperative need for robust defenses against manipulated knowledge spread, such as introducing ``guardian'' agents and advanced fact-checking tools.
△ Less
Submitted 22 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark
Authors:
Nikita Chernyadev,
Nicholas Backshall,
Xiao Ma,
Yunfan Lu,
Younggyo Seo,
Stephen James
Abstract:
We introduce BiGym, a new benchmark and learning environment for mobile bi-manual demo-driven robotic manipulation. BiGym features 40 diverse tasks set in home environments, ranging from simple target reaching to complex kitchen cleaning. To capture the real-world performance accurately, we provide human-collected demonstrations for each task, reflecting the diverse modalities found in real-world…
▽ More
We introduce BiGym, a new benchmark and learning environment for mobile bi-manual demo-driven robotic manipulation. BiGym features 40 diverse tasks set in home environments, ranging from simple target reaching to complex kitchen cleaning. To capture the real-world performance accurately, we provide human-collected demonstrations for each task, reflecting the diverse modalities found in real-world robot trajectories. BiGym supports a variety of observations, including proprioceptive data and visual inputs such as RGB, and depth from 3 camera views. To validate the usability of BiGym, we thoroughly benchmark the state-of-the-art imitation learning algorithms and demo-driven reinforcement learning algorithms within the environment and discuss the future opportunities.
△ Less
Submitted 11 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
HiLight: Technical Report on the Motern AI Video Language Model
Authors:
Zhiting Wang,
Qiangong Zhou,
Kangjie Yang,
Zongyang Liu,
Xin Mao
Abstract:
This technical report presents the implementation of a state-of-the-art video encoder for video-text modal alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: 1.alignment of video and text modalities; 2.convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in t…
▽ More
This technical report presents the implementation of a state-of-the-art video encoder for video-text modal alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: 1.alignment of video and text modalities; 2.convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in the context of billiards. The report includes a discussion of the concepts and the final solution developed during the task's implementation.
△ Less
Submitted 11 July, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
HTD-Mamba: Efficient Hyperspectral Target Detection with Pyramid State Space Model
Authors:
Dunbin Shen,
Xuanbing Zhu,
Jiacheng Tian,
Jianjun Liu,
Zhenrong Du,
Hongyu Wang,
Xiaorui Ma
Abstract:
Hyperspectral target detection (HTD) identifies objects of interest from complex backgrounds at the pixel level, playing a vital role in Earth observation. However, HTD faces challenges due to limited prior knowledge and spectral variation, leading to underfitting models and unreliable performance. To address these challenges, this paper proposes an efficient self-supervised HTD method with a pyra…
▽ More
Hyperspectral target detection (HTD) identifies objects of interest from complex backgrounds at the pixel level, playing a vital role in Earth observation. However, HTD faces challenges due to limited prior knowledge and spectral variation, leading to underfitting models and unreliable performance. To address these challenges, this paper proposes an efficient self-supervised HTD method with a pyramid state space model (SSM), named HTD-Mamba, which employs spectrally contrastive learning to distinguish between target and background based on the similarity measurement of intrinsic features. Specifically, to obtain sufficient training samples and leverage spatial contextual information, we propose a spatial-encoded spectral augmentation technique that encodes all surrounding pixels within a patch into a transformed view of the center pixel. Additionally, to explore global band correlations, we divide pixels into continuous group-wise spectral embeddings and introduce Mamba to HTD for the first time to model long-range dependencies of the spectral sequence with linear complexity. Furthermore, to alleviate spectral variation and enhance robust representation, we propose a pyramid SSM as a backbone to capture and fuse multiresolution spectral-wise intrinsic features. Extensive experiments conducted on four public datasets demonstrate that the proposed method outperforms state-of-the-art methods in both quantitative and qualitative evaluations. Code is available at \url{https://github.com/shendb2022/HTD-Mamba}.
△ Less
Submitted 17 July, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Authors:
Haozhe Zhao,
Xiaojian Ma,
Liang Chen,
Shuzheng Si,
Rujie Wu,
Kaikai An,
Peiyu Yu,
Minjia Zhang,
Qing Li,
Baobao Chang
Abstract:
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct a…
▽ More
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models can be found in https://ultra-editing.github.io.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
Authors:
Ye Li,
Chen Tang,
Yuan Meng,
Jiajun Fan,
Zenghao Chai,
Xinzhu Ma,
Zhi Wang,
Wenwu Zhu
Abstract:
We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE~ leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses…
▽ More
We introduce PRANCE, a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens, based on the characteristics of inputs. Specifically, PRANCE~ leverages adaptive token optimization strategies for a certain computational budget, aiming to accelerate ViTs' inference from a unified data and architectural perspective. However, the joint framework poses challenges to both architectural and decision-making aspects. Firstly, while ViTs inherently support variable-token inference, they do not facilitate dynamic computations for variable channels. To overcome this limitation, we propose a meta-network using weight-sharing techniques to support arbitrary channels of the Multi-head Self-Attention and Multi-layer Perceptron layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the structure of the meta-network and input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching up to around $10^{14}$, making supervised learning infeasible. To this end, we design a lightweight selector employing Proximal Policy Optimization for efficient decision-making. Furthermore, we introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a Markov decision process, significantly reducing action space and mitigating delayed-reward issues during training. Extensive experiments demonstrate the effectiveness of PRANCE~ in reducing FLOPs by approximately 50\%, retaining only about 10\% of tokens while achieving lossless Top-1 accuracy. Additionally, our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and sequential pruning-merging strategies. The code is available at \href{https://github.com/ChildTang/PRANCE}{https://github.com/ChildTang/PRANCE}.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
Isomorphic Pruning for Vision Models
Authors:
Gongfan Fang,
Xinyin Ma,
Michael Bi Mi,
Xinchao Wang
Abstract:
Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures. However, assessing the relative importance of different sub-structures remains a significant challenge, particularly in advanced vision models featuring novel mechanisms and architectures like self-attention, depth-wise convolutions, or residual connections. These heterogeneous subst…
▽ More
Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures. However, assessing the relative importance of different sub-structures remains a significant challenge, particularly in advanced vision models featuring novel mechanisms and architectures like self-attention, depth-wise convolutions, or residual connections. These heterogeneous substructures usually exhibit diverged parameter scales, weight distributions, and computational topology, introducing considerable difficulty to importance comparison. To overcome this, we present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures such as Vision Transformers and CNNs, and delivers competitive performance across different model sizes. Isomorphic Pruning originates from an observation that, when evaluated under a pre-defined importance criterion, heterogeneous sub-structures demonstrate significant divergence in their importance distribution, as opposed to isomorphic structures that present similar importance patterns. This inspires us to perform isolated ranking and comparison on different types of sub-structures for more reliable pruning. Our empirical results on ImageNet-1K demonstrate that Isomorphic Pruning surpasses several pruning baselines dedicatedly designed for Transformers or CNNs. For instance, we improve the accuracy of DeiT-Tiny from 74.52% to 77.50% by pruning an off-the-shelf DeiT-Base model. And for ConvNext-Tiny, we enhanced performance from 82.06% to 82.18%, while reducing the number of parameters and memory usage. Code is available at \url{https://github.com/VainF/Isomorphic-Pruning}.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
M^3:Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask
Authors:
Xinyu Yang,
Xiaochen Ma,
Xuekang Zhu,
Bo Du,
Lei Su,
Bingkui Tong,
Zeyu Lei,
Jizhe Zhou
Abstract:
In the field of image manipulation localization (IML), the small quantity and poor quality of existing datasets have always been major issues. A dataset containing various types of manipulations will greatly help improve the accuracy of IML models. Images on the internet (such as those on Baidu Tieba's PS Bar) are manipulated using various techniques, and creating a dataset from these images will…
▽ More
In the field of image manipulation localization (IML), the small quantity and poor quality of existing datasets have always been major issues. A dataset containing various types of manipulations will greatly help improve the accuracy of IML models. Images on the internet (such as those on Baidu Tieba's PS Bar) are manipulated using various techniques, and creating a dataset from these images will significantly enrich the types of manipulations in our data. However, images on the internet suffer from resolution and clarity issues, and the masks obtained by simply subtracting the manipulated image from the original contain various noises. These noises are difficult to remove, rendering the masks unusable for IML models. Inspired by the field of change detection, we treat the original and manipulated images as changes over time for the same image and view the data generation task as a change detection task. However, due to clarity issues between images, conventional change detection models perform poorly. Therefore, we introduced a super-resolution module and proposed the Manipulation Mask Manufacturer (MMM) framework. It enhances the resolution of both the original and tampered images, thereby improving image details for better comparison. Simultaneously, the framework converts the original and tampered images into feature embeddings and concatenates them, effectively modeling the context. Additionally, we created the Manipulation Mask Manufacturer Dataset (MMMD), a dataset that covers a wide range of manipulation techniques. We aim to contribute to the fields of image forensics and manipulation detection by providing more realistic manipulation data through MMM and MMMD. Detailed information about MMMD and the download link can be found at: the code and datasets will be made available.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design
Authors:
Gen Li,
Zhihao Shu,
Jie Ji,
Minghai Qin,
Fatemeh Afghah,
Wei Niu,
Xiaolong Ma
Abstract:
Deep neural networks (DNNs) are frequently employed in a variety of computer vision applications. Nowadays, an emerging trend in the current video distribution system is to take advantage of DNN's overfitting properties to perform video resolution upscaling. By splitting videos into chunks and applying a super-resolution (SR) model to overfit each chunk, this scheme of SR models plus video chunks…
▽ More
Deep neural networks (DNNs) are frequently employed in a variety of computer vision applications. Nowadays, an emerging trend in the current video distribution system is to take advantage of DNN's overfitting properties to perform video resolution upscaling. By splitting videos into chunks and applying a super-resolution (SR) model to overfit each chunk, this scheme of SR models plus video chunks is able to replace traditional video transmission to enhance video quality and transmission efficiency. However, many models and chunks are needed to guarantee high performance, which leads to tremendous overhead on model switching and memory footprints at the user end. To resolve such problems, we propose a Dynamic Deep neural network assisted by a Content-Aware data processing pipeline to reduce the model number down to one (Dy-DCA), which helps promote performance while conserving computational resources. Additionally, to achieve real acceleration on the user end, we designed a framework that optimizes dynamic features (e.g., dynamic shapes, sizes, and control flow) in Dy-DCA to enable a series of compilation optimizations, including fused code generation, static execution planning, etc. By employing such techniques, our method achieves better PSNR and real-time performance (33 FPS) on an off-the-shelf mobile phone. Meanwhile, assisted by our compilation optimization, we achieve a 1.7$\times$ speedup while saving up to 1.61$\times$ memory consumption. Code available in https://github.com/coulsonlee/Dy-DCA-ECCV2024.
△ Less
Submitted 11 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models
Authors:
Xu Han,
Linghao Jin,
Xuezhe Ma,
Xiaofeng Liu
Abstract:
Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susce…
▽ More
Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to adversarial attacks. To investigate how VLMs trained on adversarial noisy data perform on downstream medical tasks, we first craft noisy upstream datasets using multi-modal adversarial attacks. Through our comprehensive analysis, we unveil that moderate noise enhances model robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose rectify adversarial noise (RAN) framework, a recipe designed to effectively defend adversarial attacks and rectify the influence of upstream noise during fine-tuning.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Dense Retrieval with Continuous Explicit Feedback for Systematic Review Screening Prioritisation
Authors:
Xinyu Mao,
Shengyao Zhuang,
Bevan Koopman,
Guido Zuccon
Abstract:
The goal of screening prioritisation in systematic reviews is to identify relevant documents with high recall and rank them in early positions for review. This saves reviewing effort if paired with a stopping criterion, and speeds up review completion if performed alongside downstream tasks. Recent studies have shown that neural models have good potential on this task, but their time-consuming fin…
▽ More
The goal of screening prioritisation in systematic reviews is to identify relevant documents with high recall and rank them in early positions for review. This saves reviewing effort if paired with a stopping criterion, and speeds up review completion if performed alongside downstream tasks. Recent studies have shown that neural models have good potential on this task, but their time-consuming fine-tuning and inference discourage their widespread use for screening prioritisation. In this paper, we propose an alternative approach that still relies on neural models, but leverages dense representations and relevance feedback to enhance screening prioritisation, without the need for costly model fine-tuning and inference. This method exploits continuous relevance feedback from reviewers during document screening to efficiently update the dense query representation, which is then applied to rank the remaining documents to be screened. We evaluate this approach across the CLEF TAR datasets for this task. Results suggest that the investigated dense query-driven approach is more efficient than directly using neural models and shows promising effectiveness compared to previous methods developed on the considered datasets. Our code is available at https://github.com/ielab/dense-screening-feedback.
△ Less
Submitted 17 July, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Authors:
Zihao Wang,
Shaofei Cai,
Zhancun Mu,
Haowei Lin,
Ceyao Zhang,
Xuejie Liu,
Qing Li,
Anji Liu,
Xiaojian Ma,
Yitao Liang
Abstract:
We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimod…
▽ More
We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $τ$ = {$o_0$, $a_0$, $\dots$} and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.
△ Less
Submitted 27 June, 2024;
originally announced July 2024.
-
Secure Outsourced Decryption for FHE-based Privacy-preserving Cloud Computing
Authors:
Xirong Ma,
Chuan Li,
Yuchang Hu,
Yunting Tao,
Yali Jiang,
Yanbin Li,
Fanyu Kong,
Chunpeng Ge
Abstract:
The demand for processing vast volumes of data has surged dramatically due to the advancement of machine learning technology. Large-scale data processing necessitates substantial computational resources, prompting individuals and enterprises to turn to cloud services. Accompanying this trend is a growing concern regarding data leakage and misuse. Homomorphic encryption (HE) is one solution for saf…
▽ More
The demand for processing vast volumes of data has surged dramatically due to the advancement of machine learning technology. Large-scale data processing necessitates substantial computational resources, prompting individuals and enterprises to turn to cloud services. Accompanying this trend is a growing concern regarding data leakage and misuse. Homomorphic encryption (HE) is one solution for safeguarding data privacy, enabling encrypted data to be processed securely in the cloud. However, the encryption and decryption routines of some HE schemes require considerable computational resources, presenting non-trivial work for clients. In this paper, we propose an outsourced decryption protocol for the prevailing RLWE-based fully homomorphic encryption schemes. The protocol splits the original decryption into two routines, with the computationally intensive part executed remotely by the cloud. Its security relies on an invariant of the NTRU-search problem with a newly designed blinding key distribution. Cryptographic analyses are conducted to configure protocol parameters across varying security levels. Our experiments demonstrate that the proposed protocol achieves up to a $67\%$ acceleration in the client's local decryption, accompanied by a $50\%$ reduction in space usage.
△ Less
Submitted 9 July, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems
Authors:
Ziming Zhao,
Tiehua Zhang,
Zhishu Shen,
Hai Dong,
Xingjun Ma,
Xianhui Liu,
Yun Yang
Abstract:
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies at enterprise-level microservice systems, it is challenging to locate the anomalies promptly during service invocations, thus causing intractable is…
▽ More
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies at enterprise-level microservice systems, it is challenging to locate the anomalies promptly during service invocations, thus causing intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare with the state-of-the-art methods. The results show that CHASE achieves the average performance gain up to 36.2%(A@1) and 29.4%(Percentage@1), respectively to its best counterpart.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
Authors:
Shiva Kumar Pentyala,
Zhichao Wang,
Bin Bi,
Kiran Ramnath,
Xiang-Bo Mao,
Regunathan Radhakrishnan,
Sitaram Asur,
Na,
Cheng
Abstract:
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to alignment tax that degrades the LLM performance.
This paper introduces PAFT, a new PArallel training pa…
▽ More
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to alignment tax that degrades the LLM performance.
This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO, etc.) with the same pre-trained model on respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals important findings that preference alignment like DPO naturally results in a sparse model while SFT leads to a natural dense model which needs to be sparsified for effective model merging. This paper introduces an effective interference resolution which reduces the redundancy by sparsifying the delta parameters. The LLM resulted from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Test-Time Generative Augmentation for Medical Image Segmentation
Authors:
Xiao Ma,
Yuhui Tao,
Yuhan Zhang,
Zexuan Ji,
Yizhe Zhang,
Qiang Chen
Abstract:
In this paper, we propose a novel approach to enhance medical image segmentation during test time. Instead of employing hand-crafted transforms or functions on the input test image to create multiple views for test-time augmentation, we advocate for the utilization of an advanced domain-fine-tuned generative model (GM), e.g., stable diffusion (SD), for test-time augmentation. Given that the GM has…
▽ More
In this paper, we propose a novel approach to enhance medical image segmentation during test time. Instead of employing hand-crafted transforms or functions on the input test image to create multiple views for test-time augmentation, we advocate for the utilization of an advanced domain-fine-tuned generative model (GM), e.g., stable diffusion (SD), for test-time augmentation. Given that the GM has been trained to comprehend and encapsulate comprehensive domain data knowledge, it is superior than segmentation models in terms of representing the data characteristics and distribution. Hence, by integrating the GM into test-time augmentation, we can effectively generate multiple views of a given test sample, aligning with the content and appearance characteristics of the sample and the related local data distribution. This approach renders the augmentation process more adaptable and resilient compared to conventional handcrafted transforms. Comprehensive experiments conducted across three medical image segmentation tasks (nine datasets) demonstrate the efficacy and versatility of the proposed TTGA in enhancing segmentation outcomes. Moreover, TTGA significantly improves pixel-wise error estimation, thereby facilitating the deployment of a more reliable segmentation system. Code will be released at: https://github.com/maxiao0234/TTGA.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Authors:
Lei Chen,
Yuan Meng,
Chen Tang,
Xinzhu Ma,
Jingyan Jiang,
Xin Wang,
Zhi Wang,
Wenwu Zhu
Abstract:
Recent advancements in diffusion models, particularly the trend of architectural transformation from UNet-based Diffusion to Diffusion Transformer (DiT), have significantly improved the quality and scalability of image synthesis. Despite the incredible generative quality, the large computational requirements of these large-scale models significantly hinder the deployments in real-world scenarios.…
▽ More
Recent advancements in diffusion models, particularly the trend of architectural transformation from UNet-based Diffusion to Diffusion Transformer (DiT), have significantly improved the quality and scalability of image synthesis. Despite the incredible generative quality, the large computational requirements of these large-scale models significantly hinder the deployments in real-world scenarios. Post-training Quantization (PTQ) offers a promising solution by compressing model sizes and speeding up inference for the pretrained models while eliminating model retraining. However, we have observed the existing PTQ frameworks exclusively designed for both ViT and conventional Diffusion models fall into biased quantization and result in remarkable performance degradation. In this paper, we find that the DiTs typically exhibit considerable variance in terms of both weight and activation, which easily runs out of the limited numerical representations. To address this issue, we devise Q-DiT, which seamlessly integrates three techniques: fine-grained quantization to manage substantial variance across input channels of weights and activations, an automatic search strategy to optimize the quantization granularity and mitigate redundancies, and dynamic activation quantization to capture the activation changes across timesteps. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W8A8 on ImageNet 256x256, Q-DiT achieves a remarkable reduction in FID by 1.26 compared to the baseline. Under a W4A8 setting, it maintains high fidelity in image generation, showcasing only a marginal increase in FID and setting a new benchmark for efficient, high-quality quantization in diffusion transformers. Code is available at \href{https://github.com/Juanerx/Q-DiT}{https://github.com/Juanerx/Q-DiT}.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Differentiable Distributionally Robust Optimization Layers
Authors:
Xutao Ma,
Chao Ning,
Wenli Du
Abstract:
In recent years, there has been a growing research interest in decision-focused learning, which embeds optimization problems as a layer in learning pipelines and demonstrates a superior performance than the prediction-focused approach. However, for distributionally robust optimization (DRO), a popular paradigm for decision-making under uncertainty, it is still unknown how to embed it as a layer, i…
▽ More
In recent years, there has been a growing research interest in decision-focused learning, which embeds optimization problems as a layer in learning pipelines and demonstrates a superior performance than the prediction-focused approach. However, for distributionally robust optimization (DRO), a popular paradigm for decision-making under uncertainty, it is still unknown how to embed it as a layer, i.e., how to differentiate decisions with respect to an ambiguity set. In this paper, we develop such differentiable DRO layers for generic mixed-integer DRO problems with parameterized second-order conic ambiguity sets and discuss its extension to Wasserstein ambiguity sets. To differentiate the mixed-integer decisions, we propose a novel dual-view methodology by handling continuous and discrete parts of decisions via different principles. Specifically, we construct a differentiable energy-based surrogate to implement the dual-view methodology and use importance sampling to estimate its gradient. We further prove that such a surrogate enjoys the asymptotic convergency under regularization. As an application of the proposed differentiable DRO layers, we develop a novel decision-focused learning pipeline for contextual distributionally robust decision-making tasks and compare it with the prediction-focused approach in experiments.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
LOGCAN++: Adaptive Local-global class-aware network for semantic segmentation of remote sensing imagery
Authors:
Xiaowen Ma,
Rongrong Lian,
Zhenkai Wu,
Hongbo Guo,
Mengting Ma,
Sensen Wu,
Zhenhong Du,
Siyang Song,
Wei Zhang
Abstract:
Remote sensing images usually characterized by complex backgrounds, scale and orientation variations, and large intra-class variance. General semantic segmentation methods usually fail to fully investigate the above issues, and thus their performances on remote sensing image segmentation are limited. In this paper, we propose our LOGCAN++, a semantic segmentation model customized for remote sensin…
▽ More
Remote sensing images usually characterized by complex backgrounds, scale and orientation variations, and large intra-class variance. General semantic segmentation methods usually fail to fully investigate the above issues, and thus their performances on remote sensing image segmentation are limited. In this paper, we propose our LOGCAN++, a semantic segmentation model customized for remote sensing images, which is made up of a Global Class Awareness (GCA) module and several Local Class Awareness (LCA) modules. The GCA module captures global representations for class-level context modeling to reduce the interference of background noise. The LCA module generates local class representations as intermediate perceptual elements to indirectly associate pixels with the global class representations, targeting at dealing with the large intra-class variance problem. In particular, we introduce affine transformations in the LCA module for adaptive extraction of local class representations to effectively tolerate scale and orientation variations in remotely sensed images. Extensive experiments on three benchmark datasets show that our LOGCAN++ outperforms current mainstream general and remote sensing semantic segmentation methods and achieves a better trade-off between speed and accuracy. Code is available at https://github.com/xwmaxwma/rssegmentation.
△ Less
Submitted 1 July, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Placing Timely Refreshing Services at the Network Edge
Authors:
Xishuo Li,
Shan Zhang,
Hongbin Luo,
Xiao Ma,
Junyi He
Abstract:
Accommodating services at the network edge is favorable for time-sensitive applications. However, maintaining service usability is resource-consuming in terms of pulling service images to the edge, synchronizing databases of service containers, and hot updates of service modules. Accordingly, it is critical to determine which service to place based on the received user requests and service refresh…
▽ More
Accommodating services at the network edge is favorable for time-sensitive applications. However, maintaining service usability is resource-consuming in terms of pulling service images to the edge, synchronizing databases of service containers, and hot updates of service modules. Accordingly, it is critical to determine which service to place based on the received user requests and service refreshing (maintaining) cost, which is usually neglected in existing studies. In this work, we study how to cooperatively place timely refreshing services and offload user requests among edge servers to minimize the backhaul transmission costs. We formulate an integer non-linear programming problem and prove its NP-hardness. This problem is highly non-tractable due to the complex spatial-and-temporal coupling effect among service placement, offloading, and refreshing costs. We first decouple the problem in the temporal domain by transforming it into a Markov shortest-path problem. We then propose a light-weighted Discounted Value Approximation (DVA) method, which further decouples the problem in the spatial domain by estimating the offloading costs among edge servers. The worst performance of DVA is proved to be bounded. 5G service placement testbed experiments and real-trace simulations show that DVA reduces the total transmission cost by up to 59.1% compared with the state-of-the-art baselines.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
The Case for Transport-Level Encryption in Datacenter Networks
Authors:
Tianyi Gao,
Xinshu Ma,
Suhas Narreddy,
Eugenio Luo,
Steven W. D. Chien,
Michio Honda
Abstract:
Cloud applications need network data encryption to isolate from other tenants and protect their data from potential eavesdroppers in the network infrastructure. This paper presents SDP, a protocol design for emerging datacenter transport protocols, such as pHost, NDP, and Homa, to integrate data encryption with the use of existing NIC offloading of cryptographic operations designed for TLS over TC…
▽ More
Cloud applications need network data encryption to isolate from other tenants and protect their data from potential eavesdroppers in the network infrastructure. This paper presents SDP, a protocol design for emerging datacenter transport protocols, such as pHost, NDP, and Homa, to integrate data encryption with the use of existing NIC offloading of cryptographic operations designed for TLS over TCP. Therefore, SDP could enable a deployment path of new transport protocols in datacenters without giving up hardware offloading support, which would otherwise make encryption on those protocols even slower than TLS over TCP. SDP is based on Homa, and outperforms TLS over TCP by up to 29 % in throughput. SDP currently supports two real-world applications, Redis, improving throughput by up to 24 %, and in-kernel NVMe-oF, cutting P99 latency by up to 21 %.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Rethinking Remote Sensing Change Detection With A Mask View
Authors:
Xiaowen Ma,
Zhenkai Wu,
Rongrong Lian,
Wei Zhang,
Siyang Song
Abstract:
Remote sensing change detection aims to compare two or more images recorded for the same area but taken at different time stamps to quantitatively and qualitatively assess changes in geographical entities and environmental factors. Mainstream models usually built on pixel-by-pixel change detection paradigms, which cannot tolerate the diversity of changes due to complex scenes and variation in imag…
▽ More
Remote sensing change detection aims to compare two or more images recorded for the same area but taken at different time stamps to quantitatively and qualitatively assess changes in geographical entities and environmental factors. Mainstream models usually built on pixel-by-pixel change detection paradigms, which cannot tolerate the diversity of changes due to complex scenes and variation in imaging conditions. To address this shortcoming, this paper rethinks the change detection with the mask view, and further proposes the corresponding: 1) meta-architecture CDMask and 2) instance network CDMaskFormer. Components of CDMask include Siamese backbone, change extractor, pixel decoder, transformer decoder and normalized detector, which ensures the proper functioning of the mask detection paradigm. Since the change query can be adaptively updated based on the bi-temporal feature content, the proposed CDMask can adapt to different latent data distributions, thus accurately identifying regions of interest changes in complex scenarios. Consequently, we further propose the instance network CDMaskFormer customized for the change detection task, which includes: (i) a Spatial-temporal convolutional attention-based instantiated change extractor to capture spatio-temporal context simultaneously with lightweight operations; and (ii) a scene-guided axial attention-instantiated transformer decoder to extract more spatial details. State-of-the-art performance of CDMaskFormer is achieved on five benchmark datasets with a satisfactory efficiency-accuracy trade-off. Code is available at https://github.com/xwmaxwma/rschange.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Authors:
Ziyan Jiang,
Xueguang Ma,
Wenhu Chen
Abstract:
In traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced `heavy' retriever and `light' reader design ca…
▽ More
In traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a `long retriever' and a `long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden of retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units ($\approx$ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par of the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
△ Less
Submitted 30 June, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
Towards Timely Video Analytics Services at the Network Edge
Authors:
Xishuo Li,
Shan Zhang,
Yuejiao Huang,
Xiao Ma,
Zhiyuan Wang,
Hongbin Luo
Abstract:
Real-time video analytics services aim to provide users with accurate recognition results timely. However, existing studies usually fall into the dilemma between reducing delay and improving accuracy. The edge computing scenario imposes strict transmission and computation resource constraints, making balancing these conflicting metrics under dynamic network conditions difficult. In this regard, we…
▽ More
Real-time video analytics services aim to provide users with accurate recognition results timely. However, existing studies usually fall into the dilemma between reducing delay and improving accuracy. The edge computing scenario imposes strict transmission and computation resource constraints, making balancing these conflicting metrics under dynamic network conditions difficult. In this regard, we introduce the age of processed information (AoPI) concept, which quantifies the time elapsed since the generation of the latest accurately recognized frame. AoPI depicts the integrated impact of recognition accuracy, transmission, and computation efficiency. We derive closed-form expressions for AoPI under preemptive and non-preemptive computation scheduling policies w.r.t. the transmission/computation rate and recognition accuracy of video frames. We then investigate the joint problem of edge server selection, video configuration adaptation, and bandwidth/computation resource allocation to minimize the long-term average AoPI over all cameras. We propose an online method, i.e., Lyapunov-based block coordinate descent (LBCD), to solve the problem, which decouples the original problem into two subproblems to optimize the video configuration/resource allocation and edge server selection strategy separately. We prove that LBCD achieves asymptotically optimal performance. According to the testbed experiments and simulation results, LBCD reduces the average AoPI by up to 10.94X compared to state-of-the-art baselines.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models
Authors:
Xincheng Shuai,
Henghui Ding,
Xingjun Ma,
Rongcheng Tu,
Yu-Gang Jiang,
Dacheng Tao
Abstract:
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. It is widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. Th…
▽ More
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. It is widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Given that training-based methods learn to directly map the source image to target one under user guidance, we discuss them separately, and introduce injection schemes of source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracing related works at https://github.com/xinchengshuai/Awesome-Image-Editing.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
EXCEEDS: Extracting Complex Events as Connecting the Dots to Graphs in Scientific Domain
Authors:
Yi-Fan Lu,
Xian-Ling Mao,
Bo Wang,
Xiao Liu,
Heyan Huang
Abstract:
It is crucial to utilize events to understand a specific domain. There are lots of research on event extraction in many domains such as news, finance and biology domain. However, scientific domain still lacks event extraction research, including comprehensive datasets and corresponding methods. Compared to other domains, scientific domain presents two characteristics: denser nuggets and more compl…
▽ More
It is crucial to utilize events to understand a specific domain. There are lots of research on event extraction in many domains such as news, finance and biology domain. However, scientific domain still lacks event extraction research, including comprehensive datasets and corresponding methods. Compared to other domains, scientific domain presents two characteristics: denser nuggets and more complex events. To solve the above problem, considering these two characteristics, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for scientific domain. It has 2,508 documents and 24,381 events under refined annotation and quality control. Then, we propose EXCEEDS, a novel end-to-end scientific event extraction framework by storing dense nuggets in a grid matrix and simplifying complex event extraction into a dot construction and connection task. Experimental results demonstrate state-of-the-art performances of EXCEEDS on SciEvents. Additionally, we release SciEvents and EXCEEDS on GitHub.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.