subscribe to arXiv mailings

arXiv:2407.20175 [pdf, other]

Towards Localized Fine-Grained Control for Facial Expression Generation

Authors: Tuomas Varanka, Huai-Qian Khor, Yante Li, Mengting Wei, Hanwei Kung, Nicu Sebe, Guoying Zhao

Abstract: Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate… ▽ More Generative models have surged in popularity recently due to their ability to produce high-quality images and video. However, steering these models to produce images with specific attributes and precise control remains challenging. Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity. Other basic expressions like anger are possible, but are limited to the stereotypical expression, while other unconventional facial expressions like doubtful are difficult to reliably generate. In this work, we propose the use of AUs (action units) for facial expression control in face generation. AUs describe individual facial muscle movements based on facial anatomy, allowing precise and localized control over the intensity of facial movements. By combining different action units, we unlock the ability to create unconventional facial expressions that go beyond typical emotional models, enabling nuanced and authentic reactions reflective of real-world expressions. The proposed method can be seamlessly integrated with both text and image prompts using adapters, offering precise and intuitive control of the generated results. Code and dataset are available in {https://github.com/tvaranka/fineface}. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.19863 [pdf, other]

Before and After Blockchain: Development and Principle of Distributed Fault Tolerance Consensus

Authors: Huanyu Wu, Chentao Yue, Yixuan Fan, Yonghui Li, Lei Zhang

Abstract: The concept of distributed consensus gained widespread attention following the publication of ``Byzantine Generals Problem'' by Leslie Lamport in the 1980s. This research topic has been active and extensively studied over the last four decades, particularly since the advent of blockchain technology in 2009. Blockchain technology employs Proof-of-X (PoX) or Byzantine-fault-tolerant (BFT) systems, w… ▽ More The concept of distributed consensus gained widespread attention following the publication of ``Byzantine Generals Problem'' by Leslie Lamport in the 1980s. This research topic has been active and extensively studied over the last four decades, particularly since the advent of blockchain technology in 2009. Blockchain technology employs Proof-of-X (PoX) or Byzantine-fault-tolerant (BFT) systems, where all participants follow a protocol to achieve a common state (i.e., consistency) eventually. However, because PoX consensus such as Proof-of-Work is is resource-intensive with high power consumption, most permissioned blockchains employ BFT to achieve consistency. In this article, we provide an introduction to the fundamental principles and history of distributed consensus. We then explore the well-known fault-tolerant state machine replication (SMR) in partially synchronous networks, as well as consensus protocols in asynchronous models and recently proposed DAG-based consensus. Additionally, we examine the relationship between BFT consensus and blockchain technology and discuss the following questions: What is the history and evolution of BFT? Why are BFT protocols designed in the way they are and what core components do they use? What is the connection between BFT and blockchain technology, and what are the driving needs for future BFT research? △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.19843 [pdf, other]

The Second Joint Workshop on Cross Reality

Authors: Nanjia Wang, Yue Li, Francesco Chiossi, Fabian Pointecker, Lixiang Zhao, Daniel Zielasko

Abstract: The 2nd Joint Workshop on Cross Reality (JWCR'24), organized as part of ISMAR 2024, seeks to explore the burgeoning field of Cross Reality (CR), which encompasses the seamless integration and transition between various points on the reality-virtuality continuum (RVC) such as Virtual Reality (VR), Augmented Virtuality (AV), and Augmented Reality (AR). This hybrid workshop aims to build upon the fou… ▽ More The 2nd Joint Workshop on Cross Reality (JWCR'24), organized as part of ISMAR 2024, seeks to explore the burgeoning field of Cross Reality (CR), which encompasses the seamless integration and transition between various points on the reality-virtuality continuum (RVC) such as Virtual Reality (VR), Augmented Virtuality (AV), and Augmented Reality (AR). This hybrid workshop aims to build upon the foundation laid by the inaugural JWCR at ISMAR 2023, which successfully unified diverse CR research communities. The workshop will address key themes including CR visualization, interaction, user behavior, design, development, engineering, and collaboration. CR Visualization focuses on creating and displaying spatial data across the RVC, enabling users to navigate and interpret information fluidly. CR Interaction delves into natural user engagements using gestures, voice commands, and other advanced techniques to enhance immersion. The study of CR User Behavior and Experience investigates how users perceive and interact within these hybrid environments. Furthermore, CR Design and Development emphasizes creating effective CR applications using innovative processes and tools, while CR Collaboration examines methods for fostering teamwork in mixed reality settings. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: 5 pages

Journal ref: 2024 IEEE International Symposium on Mixed and Augmented Reality

arXiv:2407.19784 [pdf, other]

Survey and Taxonomy: The Role of Data-Centric AI in Transformer-Based Time Series Forecasting

Authors: Jingjing Xu, Caesar Wu, Yuan-Fang Li, Gregoire Danoy, Pascal Bouvry

Abstract: Alongside the continuous process of improving AI performance through the development of more sophisticated models, researchers have also focused their attention to the emerging concept of data-centric AI, which emphasizes the important role of data in a systematic machine learning training process. Nonetheless, the development of models has also continued apace. One result of this progress is the… ▽ More Alongside the continuous process of improving AI performance through the development of more sophisticated models, researchers have also focused their attention to the emerging concept of data-centric AI, which emphasizes the important role of data in a systematic machine learning training process. Nonetheless, the development of models has also continued apace. One result of this progress is the development of the Transformer Architecture, which possesses a high level of capability in multiple domains such as Natural Language Processing (NLP), Computer Vision (CV) and Time Series Forecasting (TSF). Its performance is, however, heavily dependent on input data preprocessing and output data evaluation, justifying a data-centric approach to future research. We argue that data-centric AI is essential for training AI models, particularly for transformer-based TSF models efficiently. However, there is a gap regarding the integration of transformer-based TSF and data-centric AI. This survey aims to pin down this gap via the extensive literature review based on the proposed taxonomy. We review the previous research works from a data-centric AI perspective and we intend to lay the foundation work for the future development of transformer-based architecture and data-centric AI. △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.19688 [pdf, other]

doi 10.1145/3627673.3680073

Causal Interventional Prediction System for Robust and Explainable Effect Forecasting

Authors: Zhixuan Chu, Hui Ding, Guang Zeng, Shiyu Wang, Yiming Li

Abstract: Although the widespread use of AI systems in today's world is growing, many current AI systems are found vulnerable due to hidden bias and missing information, especially in the most commonly used forecasting system. In this work, we explore the robustness and explainability of AI-based forecasting systems. We provide an in-depth analysis of the underlying causality involved in the effect predicti… ▽ More Although the widespread use of AI systems in today's world is growing, many current AI systems are found vulnerable due to hidden bias and missing information, especially in the most commonly used forecasting system. In this work, we explore the robustness and explainability of AI-based forecasting systems. We provide an in-depth analysis of the underlying causality involved in the effect prediction task and further establish a causal graph based on treatment, adjustment variable, confounder, and outcome. Correspondingly, we design a causal interventional prediction system (CIPS) based on a variational autoencoder and fully conditional specification of multiple imputations. Extensive results demonstrate the superiority of our system over state-of-the-art methods and show remarkable versatility and extensibility in practice. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24), October 21--25, 2024, Boise, ID, USA

arXiv:2407.19666 [pdf, other]

Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Authors: Mingyu Zhang, Jiting Cai, Mingyu Liu, Yue Xu, Cewu Lu, Yong-Lu Li

Abstract: Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning… ▽ More Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to data bias fitting. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage is better at generalization than symbolization. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks following the separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Comments: ECCV 2024, Project page: https://mybearyzhang.github.io/projects/TwoStageReason/

arXiv:2407.19547 [pdf, other]

Temporal Feature Matters: A Framework for Diffusion Model Quantization

Authors: Yushi Huang, Ruihao Gong, Xianglong Liu, Jing Liu, Yuhang Li, Jiwen Lu, Dacheng Tao

Abstract: The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typicall… ▽ More The Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues in traditional models. Unlike those models, diffusion models critically rely on the time-step $t$ for effective multi-round denoising. Typically, $t$ from the finite set $\{1, \ldots, T\}$ is encoded into a hypersensitive temporal feature by several modules, entirely independent of the sampling data. However, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory. To address these challenges, we introduce a novel quantization framework: 1)~TIB-based Maintenance: Based on our innovative Temporal Information Block~(TIB) definition, Temporal Information-aware Reconstruction~(TIAR) and Finite Set Calibration~(FSC) are developed to efficiently align full precision temporal features. 2)~Cache-based Maintenance: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3)~Disturbance-aware Selection: Employ temporal feature errors to guide a fine-grained selection for superior maintenance. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets and diffusion models confirms our superior results. Notably, our approach closely matches the performance of the full-precision model under 4-bit quantization. Furthermore, the quantized SD-XL model achieves hardware acceleration of 2.20$\times$ on CPU and 5.76$\times$ on GPU demonstrating its efficiency. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2311.16503

arXiv:2407.19497 [pdf, other]

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Authors: Zhengcen Li, Xinle Chang, Yueran Li, Jingyong Su

Abstract: Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate indi… ▽ More Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate individual annotations and specialized interaction reasoning modules. To address these limitations, we design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity, offering an effective alternative to RGB video. This panoramic graph enables Graph Convolutional Network (GCN) to unify intra-person, inter-person, and person-object interactive modeling through spatial-temporal graph convolutions. In practice, we develop a novel pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and employ Multi-person Panoramic GCN (MP-GCN) to predict group activities. Extensive experiments on Volleyball and NBA datasets demonstrate that the MP-GCN achieves state-of-the-art performance in both accuracy and efficiency. Notably, our method outperforms RGB-based approaches by using only estimated 2D keypoints as input. Code is available at https://github.com/mgiant/MP-GCN △ Less

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.19453 [pdf, other]

FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models

Authors: Changgu Chen, Libing Yang, Xiaoyan Yang, Lianggangxu Chen, Gaoqi He, CHangbo Wang, Yang Li

Abstract: In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific i… ▽ More In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific intervals of the initial noise distribution corresponding to the prompt. Moreover, it is challenging to directly optimize the initial distribution, given that the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the powerful potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated contents with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, a dynamic reward calibration module is proposed to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm to utilize historical data for network training and prevent the optimized distribution from deviating too far from the original policy to restrain excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in achieving consistency between prompts and the generated content. Our method achieves 10 times faster than the SOTA approach. Our homepage is available at \url{https://github.com/vpx-ecnu/FIND-website}. △ Less

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.19342 [pdf, other]

Parameter-Efficient Fine-Tuning via Circular Convolution

Authors: Aochuan Chen, Ziqi Gao, Zijing Liu, Yu Li, Jia Li

Abstract: Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (\textit{i.e.,} $Δ\mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and… ▽ More Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (\textit{i.e.,} $Δ\mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose \underline{C}ir\underline{c}ular \underline{C}onvolution \underline{A}daptation (C$^3$A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C$^3$A consistently outperforms LoRA and its variants across various fine-tuning tasks. △ Less