Showing 1–50 of 799 results for author: Jiang, Z

  1. arXiv:2407.19976  [pdf, other]

    cs.HC cs.MM

    MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion

    Authors: Chencan Fu, Yabiao Wang, Jiangning Zhang, Zhengkai Jiang, Xiaofeng Mao, Jiafu Wu, Weijian Cao, Chengjie Wang, Yanhao Ge, Yong Liu

    Abstract: Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To addre…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Accepted to ACM MM 2024

  2. arXiv:2407.19484  [pdf, ps, other]

    cs.IT

    Error Correction Decoding Algorithms of RS Codes Based on An Earlier Termination Algorithm to Find The Error Locator Polynomial

    Authors: Zhengyi Jiang, Hao Shi, Zhongyi Huang, Linqi Song, Bo Bai, Gong Zhang, Hanxu Hou

    Abstract: Reed-Solomon (RS) codes are widely used to correct errors in storage systems. Finding the error locator polynomial is one of the key steps in the error correction procedure of RS codes. Modular Approach (MA) is an effective algorithm for solving the Welch-Berlekamp (WB) key-equation problem to find the error locator polynomial that needs $2t$ steps, where $t$ is the error correction capability. In…

    Submitted 28 July, 2024; originally announced July 2024.

  3. arXiv:2407.19035  [pdf, other]

    cs.CV

    ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

    Authors: Shen Chen, Jiale Zhou, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, Lei Li

    Abstract: The creation of high-quality 3D assets is paramount for applications in digital heritage preservation, entertainment, and robotics. Traditionally, this process necessitates skilled professionals and specialized software for the modeling, texturing, and rendering of 3D objects. However, the rising demand for 3D assets in gaming and virtual reality (VR) has led to the creation of accessible image-to…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: 14 pages

  4. arXiv:2407.18357  [pdf, other]

    cs.RO

    Needle Segmentation Using GAN: Restoring Thin Instrument Visibility in Robotic Ultrasound

    Authors: Zhongliang Jiang, Xuesong Li, Xiangyu Chu, Angelos Karlas, Yuan Bi, Yingsheng Cheng, K. W. Samuel Au, Nassir Navab

    Abstract: Ultrasound-guided percutaneous needle insertion is a standard procedure employed in both biopsy and ablation in clinical practices. However, due to the complex interaction between tissue and instrument, the needle may deviate from the in-plane view, resulting in a lack of close monitoring of the percutaneous needle. To address this challenge, we introduce a robot-assisted ultrasound (US) imaging s…

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: accepted by IEEE TIM. code: https://github.com/noseefood/NeedleSegmentation-GAN; video: https://youtu.be/4WuEP9PACs0

  5. arXiv:2407.18271  [pdf, other]

    cs.AR cs.AI

    Large Language Model for Verilog Generation with Golden Code Feedback

    Authors: Ning Wang, Bingkun Yao, Jie Zhou, Xi Wang, Zhe Jiang, Nan Guan

    Abstract: Recent advancements in large language models (LLMs) have catalyzed significant interest in the automatic generation of Register-Transfer Level (RTL) code, particularly Verilog, from natural language instructions. While commercial LLMs like ChatGPT have dominated this domain, open-source alternatives have lagged considerably in performance, limiting the flexibility and data privacy of this emerging…

    Submitted 21 July, 2024; originally announced July 2024.

  6. arXiv:2407.16331  [pdf, other]

    cs.HC

    AutoLegend: A User Feedback-Driven Adaptive Legend Generator for Visualizations

    Authors: Can Liu, Xiyao Mei, Zhibang Jiang, Shaocong Tan, Xiaoru Yuan

    Abstract: We propose AutoLegend to generate interactive visualization legends using online learning with user feedback. AutoLegend accurately extracts symbols and channels from visualizations and then generates quality legends. AutoLegend enables a two-way interaction between legends and interactions, including highlighting, filtering, data retrieval, and retargeting. After analyzing visualization legends f…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: 12 pages, 10 figures

  7. arXiv:2407.15273  [pdf, other]

    cs.LG cs.AI

    Unifying Invariant and Variant Features for Graph Out-of-Distribution via Probability of Necessity and Sufficiency

    Authors: Xuexin Chen, Ruichu Cai, Kaitao Zheng, Zhifan Jiang, Zhengting Huang, Zhifeng Hao, Zijian Li

    Abstract: Graph Out-of-Distribution (OOD), requiring that models trained on biased data generalize to the unseen test data, has considerable real-world applications. One of the most mainstream methods is to extract the invariant subgraph by aligning the original and augmented data with the help of environment augmentation. However, these solutions might lead to the loss or redundancy of semantic subgraphs a…

    Submitted 21 July, 2024; originally announced July 2024.

  8. arXiv:2407.15111  [pdf, other]

    cs.CV

    D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

    Authors: Zhaotong Yang, Zicheng Jiang, Xinzhe Li, Huiyu Zhou, Junyu Dong, Huaidong Zhang, Yong Du

    Abstract: In this paper, we introduce D$^4$-VTON, an innovative solution for image-based virtual try-on. We address challenges from previous studies, such as semantic inconsistencies before and after garment warping, and reliance on static, annotation-driven clothing parsers. Additionally, we tackle the complexities in diffusion-based VTON models when handling simultaneous tasks like inpainting and denoisin…

    Submitted 21 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  9. arXiv:2407.14641  [pdf, other]

    cs.DS cs.CR

    Differential Privacy with Multiple Selections

    Authors: Ashish Goel, Zhihao Jiang, Aleksandra Korolova, Kamesh Munagala, Sahasrajit Sarmasarkar

    Abstract: We consider the setting where a user with sensitive features wishes to obtain a recommendation from a server in a differentially private fashion. We propose a "multi-selection" architecture where the server can send back multiple recommendations and the user chooses one from these that matches best with their private features. When the user feature is one-dimensional -- on an infinite line -- an…

    Submitted 19 July, 2024; originally announced July 2024.

  10. arXiv:2407.14006  [pdf, other]

    eess.AS cs.SD

    MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

    Authors: Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai

    Abstract: We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple Scene Speech Dataset), which is intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted by INTERSPEECH 2024

  11. arXiv:2407.13930  [pdf, other]

    cs.CV cs.AI eess.SP

    RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark

    Authors: Yuan-Hao Ho, Jen-Hao Cheng, Sheng Yao Kuan, Zhongyu Jiang, Wenhao Chai, Hsiang-Wei Huang, Chih-Lung Lin, Jenq-Neng Hwang

    Abstract: Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy-preserving, rendering the method m…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  12. arXiv:2407.13778  [pdf, other]

    cs.CV cs.LG

    Assessing the Potential of PlanetScope Satellite Imagery to Estimate Particulate Matter Oxidative Potential

    Authors: Ian Hough, Loïc Argentier, Ziyang Jiang, Tongshu Zheng, Mike Bergin, David Carlson, Jean-Luc Jaffrezo, Jocelyn Chanussot, Gaëlle Uzu

    Abstract: Oxidative potential (OP), which measures particulate matter's (PM) capacity to induce oxidative stress in the lungs, is increasingly recognized as an indicator of PM toxicity. Since OP is not routinely monitored, it can be challenging to estimate exposure and health impacts. Remote sensing data are commonly used to estimate PM mass concentration, but have never been used to estimate OP. In this st…

    Submitted 1 July, 2024; originally announced July 2024.

  13. arXiv:2407.13632  [pdf, other]

    cs.CV cs.LG eess.IV

    Data Alchemy: Mitigating Cross-Site Model Variability Through Test Time Data Calibration

    Authors: Abhijeet Parida, Antonia Alomar, Zhifan Jiang, Pooneh Roshanitabrizi, Austin Tapp, Maria Ledesma-Carbayo, Ziyue Xu, Syed Muhammed Anwar, Marius George Linguraru, Holger R. Roth

    Abstract: Deploying deep learning-based imaging tools across various clinical sites poses significant challenges due to inherent domain shifts and regulatory hurdles associated with site-specific fine-tuning. For histopathology, stain normalization techniques can mitigate discrepancies, but they often fall short of eliminating inter-site variations. Therefore, we present Data Alchemy, an explainable stain n…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: accepted to Machine Learning in Medical Imaging (MLMI 2024)

  14. arXiv:2407.13101  [pdf, other]

    cs.CL cs.AI

    Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach

    Authors: Zhouyu Jiang, Mengshu Sun, Lei Liang, Zhiqiang Zhang

    Abstract: Multi-hop question answering is a challenging task with distinct industrial relevance, and Retrieval-Augmented Generation (RAG) methods based on large language models (LLMs) have become a popular approach to tackle this task. Owing to the potential inability to retrieve all necessary information in a single iteration, a series of iterative RAG methods has been recently developed, showing significa…

    Submitted 17 July, 2024; originally announced July 2024.

  15. arXiv:2407.13094  [pdf, other]

    cs.CV

    Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

    Authors: Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, Alan Yuille

    Abstract: Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations could be misleading as many questions can be inferred merely from the objects and contexts in a single frame or biases inherent in the datasets. In this pa…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: ECCV 2024. Project page: https://feint6k.github.io

  16. arXiv:2407.12999  [pdf, other]

    cs.CY cs.AI cs.CR

    Securing the Future of GenAI: Policy and Technology

    Authors: Mihai Christodorescu, Ryan Craven, Soheil Feizi, Neil Gong, Mia Hoffmann, Somesh Jha, Zhengyuan Jiang, Mehrdad Saberi Kamarposhti, John Mitchell, Jessica Newman, Emelia Probasco, Yanjun Qi, Khawaja Shams, Matthew Turek

    Abstract: The rise of Generative AI (GenAI) brings about transformative potential across sectors, but its dual-use nature also amplifies risks. Governments globally are grappling with the challenge of regulating GenAI, balancing innovation against safety. China, the United States (US), and the European Union (EU) are at the forefront with initiatives like the Management of Algorithmic Recommendations, the E…

    Submitted 21 May, 2024; originally announced July 2024.

  17. arXiv:2407.12576  [pdf, other]

    cs.AR cs.AI

    IICPilot: An Intelligent Integrated Circuit Backend Design Framework Using Open EDA

    Authors: Zesong Jiang, Qing Zhang, Cheng Liu, Huawei Li, Xiaowei Li

    Abstract: Open-source EDA tools are rapidly advancing, fostering collaboration, innovation, and knowledge sharing within the EDA community. However, the growing complexity of these tools, characterized by numerous design parameters and heuristics, poses a significant barrier to their widespread adoption. This complexity is particularly pronounced in integrated circuit (IC) backend designs, which place subst…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: under review

  18. arXiv:2407.12565  [pdf, other]

    cs.AR

    SigDLA: A Deep Learning Accelerator Extension for Signal Processing

    Authors: Fangfa Fu, Wenyu Zhang, Zesong Jiang, Zhiyu Zhu, Guoyu Li, Bing Yang, Cheng Liu, Liyi Xiao, Jinxiang Wang, Huawei Li, Xiaowei Li

    Abstract: Deep learning and signal processing are closely correlated in many IoT scenarios such as anomaly detection to empower intelligence of things. Many IoT processors utilize digital signal processors (DSPs) for signal processing and build deep learning frameworks on this basis. While deep learning is usually much more computing-intensive than signal processing, the computing efficiency of deep learnin…

    Submitted 17 July, 2024; originally announced July 2024.

  19. arXiv:2407.10135  [pdf, other]

    cs.CV

    FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

    Authors: Zheng Jiang, Jinqing Zhang, Yanan Zhang, Qingjie Liu, Zhenghui Hu, Baohui Wang, Yunhong Wang

    Abstract: Although multi-view 3D object detection based on the Bird's-Eye-View (BEV) paradigm has garnered widespread attention as an economical and deployment-friendly perception solution for autonomous driving, there is still a performance gap compared to LiDAR-based methods. In recent years, several cross-modal distillation methods have been proposed to transfer beneficial information from teacher models…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  20. arXiv:2407.08968  [pdf, other]

    cs.CV

    SlideGCD: Slide-based Graph Collaborative Training with Knowledge Distillation for Whole Slide Image Classification

    Authors: Tong Shu, Jun Shi, Dongdong Sun, Zhiguo Jiang, Yushan Zheng

    Abstract: Existing WSI analysis methods lie on the consensus that histopathological characteristics of tumors are significant guidance for cancer diagnostics. Particularly, as the evolution of cancers is a continuous process, the correlations and differences across various stages, anatomical locations and patients should be taken into account. However, recent research mainly focuses on the inner-contextual…

    Submitted 19 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

    Comments: Accepted for MICCAI 2024

  21. arXiv:2407.08855  [pdf, other]

    eess.IV cs.CV

    BraTS-PEDs: Results of the Multi-Consortium International Pediatric Brain Tumor Segmentation Challenge 2023

    Authors: Anahita Fathi Kazerooni, Nastaran Khalili, Xinyang Liu, Debanjan Haldar, Zhifan Jiang, Anna Zapaishchykova, Julija Pavaine, Lubdha M. Shah, Blaise V. Jones, Nakul Sheth, Sanjay P. Prabhu, Aaron S. McAllister, Wenxin Tu, Khanak K. Nandolia, Andres F. Rodriguez, Ibraheem Salman Shaikh, Mariana Sanchez Montano, Hollie Anne Lai, Maruf Adewole, Jake Albrecht, Udunna Anazodo, Hannah Anderson, Syed Muhammed Anwar, Alejandro Aristizabal, Sina Bagheri, et al. (55 additional authors not shown)

    Abstract: Pediatric central nervous system tumors are the leading cause of cancer-related deaths in children. The five-year survival rate for high-grade glioma in children is less than 20%. The development of new treatments is dependent upon multi-institutional collaborative clinical trials requiring reproducible and accurate centralized response assessment. We present the results of the BraTS-PEDs 2023 cha…

    Submitted 16 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  22. arXiv:2407.08153  [pdf, other]

    cs.CV

    Lifelong Histopathology Whole Slide Image Retrieval via Distance Consistency Rehearsal

    Authors: Xinyu Zhu, Zhiguo Jiang, Kun Wu, Jun Shi, Yushan Zheng

    Abstract: Content-based histopathological image retrieval (CBHIR) has gained attention in recent years, offering the capability to return histopathology images that are content-wise similar to the query one from an established database. However, in clinical practice, the continuously expanding size of WSI databases limits the practical application of the current CBHIR methods. In this paper, we propose a Li…

    Submitted 12 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted for MICCAI 2024

  23. arXiv:2407.07504  [pdf, other]

    cs.CV

    Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

    Authors: Kun Wu, Zhiguo Jiang, Kunming Tang, Jun Shi, Fengying Xie, Wei Wang, Haibo Wu, Yushan Zheng

    Abstract: Large-scale pre-training models have promoted the development of histopathology image analysis. However, existing self-supervised methods for histopathology images focus on learning patch features, while there is still a lack of available pre-training models for WSI-level feature learning. In this paper, we propose a novel self-supervised learning framework for pan-cancer WSI-level representation…

    Submitted 15 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  24. arXiv:2407.06937  [pdf, other]

    cs.CV

    HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

    Authors: Guian Fang, Wenbiao Yan, Yuanfan Guo, Jianhua Han, Zutao Jiang, Hang Xu, Shengcai Liao, Xiaodan Liang

    Abstract: Text-to-image diffusion models have significantly advanced in conditional image generation. However, these models usually struggle with accurately rendering images featuring humans, resulting in distorted limbs and other anomalies. This issue primarily stems from the insufficient recognition and evaluation of limb qualities in diffusion models. To address this issue, we introduce AbHuman, the firs…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  25. arXiv:2407.05413  [pdf, other]

    cs.AI cs.CL cs.LG

    SBoRA: Low-Rank Adaptation with Regional Weight Updates

    Authors: Lai-Man Po, Yuyang Liu, Haoxuan Wu, Tianqi Zhang, Wing-Yin Yu, Zeyu Jiang, Kun Li

    Abstract: This paper introduces Standard Basis LoRA (SBoRA), a novel parameter-efficient fine-tuning approach for Large Language Models that builds upon the pioneering works of Low-Rank Adaptation (LoRA) and Orthogonal Adaptation. SBoRA further reduces the computational and memory requirements of LoRA while enhancing learning performance. By leveraging orthogonal standard basis vectors to initialize one of…

    Submitted 10 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: 15 pages, 2 figures

  26. arXiv:2407.04086  [pdf, other]

    cs.CR cs.CV cs.LG

    Certifiably Robust Image Watermark

    Authors: Zhengyuan Jiang, Moyang Guo, Yuepeng Hu, Jinyuan Jia, Neil Zhenqiang Gong

    Abstract: Generative AI raises many societal concerns such as boosting disinformation and propaganda campaigns. Watermarking AI-generated content is a key technology to address these concerns and has been widely deployed in industry. However, watermarking is vulnerable to removal attacks and forgery attacks. In this work, we propose the first image watermarks with certified robustness guarantees against rem…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  27. arXiv:2407.03572  [pdf, other]

    cs.CL

    Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

    Authors: Zhengping Jiang, Jingyu Zhang, Nathaniel Weir, Seth Ebner, Miriam Wanner, Kate Sanders, Daniel Khashabi, Anqi Liu, Benjamin Van Durme

    Abstract: Hallucinations -- the generation of untrue claims -- pose a challenge to the application of large language models (LLMs) [1] thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore [2], can be manipulated by adding obvious or repetitive claims to artificially inflate scores. We expand…

    Submitted 3 July, 2024; originally announced July 2024.

  28. arXiv:2407.03515  [pdf, other]

    stat.ML cs.LG

    Feature-Specific Coefficients of Determination in Tree Ensembles

    Authors: Zhongli Jiang, Dabao Zhang, Min Zhang

    Abstract: Tree ensemble methods provide promising predictions with models difficult to interpret. Recent introduction of Shapley values for individualized feature contributions, accompanied with several fast computing algorithms for predicted values, shows intriguing results. However, individualizing coefficients of determination, aka $R^2$, for each feature is challenged by the underlying quadratic losses,…

    Submitted 3 July, 2024; originally announced July 2024.

  29. arXiv:2407.02746  [pdf, other]

    cs.RO

    Motion Comparator: Visual Comparison of Robot Motions

    Authors: Yeping Wang, Alexander Peseckis, Zelong Jiang, Michael Gleicher

    Abstract: Roboticists compare robot motions for tasks such as parameter tuning, troubleshooting, and deciding between possible motions. However, most existing visualization tools are designed for individual motions and lack the features necessary to facilitate robot motion comparison. In this paper, we utilize a rigorous design framework to develop Motion Comparator, a web-based tool that facilitates the co…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by IEEE Robotics and Automation Letters (RAL)

  30. arXiv:2407.02604  [pdf, other]

    cs.AI cs.CL cs.LG eess.IV

    D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions

    Authors: Hareem Nisar, Syed Muhammad Anwar, Zhifan Jiang, Abhijeet Parida, Vishwesh Nath, Holger R. Roth, Marius George Linguraru

    Abstract: Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently…

    Submitted 2 July, 2024; originally announced July 2024.

  31. arXiv:2407.01264  [pdf, other]

    cs.CL

    SignCLIP: Connecting Text and Sign Language by Contrastive Learning

    Authors: Zifan Jiang, Gerard Sant, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling

    Abstract: We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing…

    Submitted 1 July, 2024; originally announced July 2024.

  32. arXiv:2406.20098  [pdf, other]

    cs.CV cs.AI cs.CL

    Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

    Authors: Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

    Abstract: Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-t…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Website at https://mbzuai-llm.github.io/webpage2code/

  33. arXiv:2406.15319  [pdf, other]

    cs.CL cs.AI

    LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

    Authors: Ziyan Jiang, Xueguang Ma, Wenhu Chen

    Abstract: In traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the 'needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced 'heavy' retriever and 'light' reader design ca…

    Submitted 30 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

    Comments: Technical Report

  34. arXiv:2406.15252  [pdf, other]

    cs.CV cs.AI

    VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

    Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen

    Abstract: Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores over generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-prov…

    Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

  35. arXiv:2406.14797  [pdf, other]

    cs.CV cs.AI

    Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification

    Authors: Jiangbo Pei, Zhuqing Jiang, Aidong Men, Haiying Wang, Haiyong Luo, Shiping Wen

    Abstract: Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camer…

    Submitted 20 June, 2024; originally announced June 2024.

  36. arXiv:2406.14380  [pdf, other]

    econ.EM cs.LG stat.ME

    Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach

    Authors: Ruohan Zhan, Shichao Han, Yuchen Hu, Zhenling Jiang

    Abstract: Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates to recommender systems targeting content creators, platforms frequently rely on creator-side randomized experiments. The treatment effect measures the change in outcomes when a new algorithm is implemented compared to the status quo. We show that the standard difference-in-means es…

    Submitted 5 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  37. arXiv:2406.13301  [pdf, other]

    cs.CV cs.RO

    ARDuP: Active Region Video Diffusion for Universal Policies

    Authors: Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, Abhinav Shrivastava

    Abstract: Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emp…

    Submitted 19 June, 2024; originally announced June 2024.

  38. arXiv:2406.12736  [pdf, other]

    cs.CV cs.AI

    Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning

    Authors: Zhuohang Jiang, Bingkui Tong, Xia Du, Ahmed Alhammadi, Jizhe Zhou

    Abstract: The Privacy-sensitive Object Identification (POI) task allocates bounding boxes for privacy-sensitive objects in a scene. The key to POI is settling an object's privacy class (privacy-sensitive or non-privacy-sensitive). In contrast to conventional object classes which are determined by the visual appearance of an object, one object's privacy class is derived from the scene contexts and is subject…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 15 pages

  39. arXiv:2406.11551  [pdf, other]

    cs.CV

    Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment

    Authors: Jianan Jiang, Di Wu, Zhilin Jiang, Weiren Yu

    Abstract: Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose a simple yet efficient approach to narrow the gap between the two modes. It mainly facilitate…

    Submitted 22 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: 10 pages, 8 figures, 4 tables

  40. arXiv:2406.10934  [pdf]

    physics.ed-ph cs.HC

    Beyond Answers: Large Language Model-Powered Tutoring System in Physics Education for Deep Learning and Precise Understanding

    Authors: Zhoumingju Jiang, Mengjun Jiang

    Abstract: The integration of artificial intelligence (AI) in education has shown significant promise, yet the effective personalization of learning, particularly in physics education, remains a challenge. This paper proposes Physics-STAR, a framework for large language model (LLM)-powered tutoring system designed to address this gap by providing personalized and adaptive learning experiences for high schoo…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages, 3 figures, CSCW 2024

  41. arXiv:2406.10580  [pdf, other]

    cs.CV

    IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization

    Authors: Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, Jizhe Zhou

    Abstract: A comprehensive benchmark is yet to be established in the Image Manipulation Detection & Localization (IMDL) field. The absence of such a benchmark leads to insufficient and misleading model evaluations, severely undermining the development of this field. However, the scarcity of open-sourced baseline models and inconsistent training and evaluation protocols make conducting rigorous experiments a…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Technical report

  42. arXiv:2406.09317  [pdf, other]

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham, et al. (24 additional authors not shown)

    Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. For RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources…

    Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  43. arXiv:2406.07873  [pdf, other]

    cs.CV

    Robust 3D Face Alignment with Multi-Path Neural Architecture Search

    Authors: Zhichao Jiang, Hongsong Wang, Xi Teng, Baopu Li

    Abstract: 3D face alignment is a very challenging and fundamental problem in computer vision. Existing deep learning-based methods manually design different networks to regress either parameters of a 3D face model or 3D positions of face vertices. However, designing such networks relies on expert knowledge, and these methods often struggle to produce consistent results across various face poses. To address…

    Submitted 12 June, 2024; originally announced June 2024.

  44. arXiv:2406.07174  [pdf, other]

    cs.SE

    ULog: Unsupervised Log Parsing with Large Language Models through Log Contrastive Units

    Authors: Junjie Huang, Zhihan Jiang, Zhuangbin Chen, Michael R. Lyu

    Abstract: Log parsing serves as an essential prerequisite for various log analysis tasks. Recent advancements in this field have improved parsing accuracy by leveraging the semantics in logs through fine-tuning large language models (LLMs) or learning from in-context demonstrations. However, these methods heavily depend on labeled examples to achieve optimal performance. In practice, collecting sufficient l…

    Submitted 11 June, 2024; originally announced June 2024.

  45. arXiv:2406.06979  [pdf, other]

    cs.LG cs.CR cs.SD eess.AS

    AudioMarkBench: Benchmarking Robustness of Audio Watermarking

    Authors: Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, Neil Zhenqiang Gong

    Abstract: The increasing realism of synthetic speech, driven by advancements in text-to-speech models, raises ethical concerns regarding impersonation and disinformation. Audio watermarking offers a promising solution via embedding human-imperceptible watermarks into AI-generated audios. However, the robustness of audio watermarking against common/adversarial perturbations remains understudied. We present A…

    Submitted 11 June, 2024; originally announced June 2024.

  46. arXiv:2406.06975  [pdf, other]

    cs.DC cs.SE

    TraceMesh: Scalable and Streaming Sampling for Distributed Traces

    Authors: Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng

    Abstract: Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by The 2024 IEEE 17th International Conference on Cloud Computing (CLOUD)

  47. arXiv:2406.06858  [pdf, other]

    cs.LG cs.DC

    FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

    Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

    Abstract: Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation…

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  48. arXiv:2406.06279  [pdf, other]

    cs.CL

    Multi-Prompting Decoder Helps Better Language Understanding

    Authors: Zifeng Cheng, Zhaoling Chen, Zhiwei Jiang, Yafeng Yin, Shiping Ge, Yuliang Liu, Qing Gu

    Abstract: Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the outp…

    Submitted 10 June, 2024; originally announced June 2024.

  49. arXiv:2406.04744  [pdf, other]

    cs.CL

    CRAG -- Comprehensive RAG Benchmark

    Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, et al. (2 additional authors not shown)

    Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering bench…

    Submitted 7 June, 2024; originally announced June 2024.

  50. arXiv:2406.04553  [pdf, other]

    cs.IR cs.AI

    Better Late Than Never: Formulating and Benchmarking Recommendation Editing

    Authors: Chengyu Lai, Sheng Zhou, Zhimeng Jiang, Qiaoyu Tan, Yuanchen Bei, Jiawei Chen, Ningyu Zhang, Jiajun Bu

    Abstract: Recommendation systems play a pivotal role in suggesting items to users based on their preferences. However, in online platforms, these systems inevitably offer unsuitable recommendations due to limited model capacity, poor data quality, or evolving user interests. Enhancing user experience necessitates efficiently rectifying such unsuitable recommendation behaviors. This paper introduces a novel and…

    Submitted 6 June, 2024; originally announced June 2024.