OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian,
Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan1, Xiaodan Liang1
Hao Wang is with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China and Pengcheng Lab, Shenzhen 518000 (e-mail: wangh739@mail2.sysu.edu.cn, wanghao9610@gmail.com). Pengzhen Ren is with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China. Xiao Dong is with the School of Artificial Intelligence, Zhuhai Campus, Sun Yat-Sen University, Zhuhai, P.R. China, 519082. Zequn Jie,  Chengjian Feng, Lin Ma, and Yinlong Qian are with Meituan Inc, China. Xiangyuan Lan, Dongmei Jiang, and Yaowei Wang are with Pengcheng Lab, Shenzhen 518000, China. Yaowei Wang is also with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. Xiaodan Liang is with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China, and Pengcheng Lab, Shenzhen 518000 (e-mail: liangxd9@mail.sysu.edu.cn). 1 Xiangyuan Lan, and Xiaodan Liang are the corresponding authors.
Abstract

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage the language-aware capability for region-level cross-modality understanding. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enable the language-aware ability of the model through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmark datasets, achieving state-of-the-art results with an AP of 50.6% on the COCO dataset and 40.0% on the LVIS dataset in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at https://github.com/wanghao9610/OV-DINO.

Index Terms:
Object detection, open-vocabulary, detection transformer.

1 Introduction

Figure 1: Comparison of OV-DINO with Previous Methods. (a) Previous methods (e.g. GLIP [1], GLIPv2 [2], G-DINO [3]) are not primarily detection-centric. They first pre-train on large-scale Detection and Grounding data, then generate pseudo labels on Image-Text data, potentially introducing noise (red circle). (b) OV-DINO is a detection-centric method that integrates various data sources into a unified detection data format through a Unified Data Integration pipeline. It undergoes end-to-end pre-training via region-text alignment within a unified detection framework.

Traditional object detection methods, such as Fast R-CNN [4], Faster R-CNN [5], Mask R-CNN [6], DETR [7], and DINO [8], are typically trained on datasets with closed-set categories. This limits their ability to detect objects outside of the predefined categories, which is a significant constraint for real-world applications. To address this limitation, a new task known as Open-Vocabulary Detection (OVD) has been proposed, attracting significant attention from both academic and industrial communities. Open-vocabulary detection requires the ability to detect any object using class names, including objects that have never been encountered during training. The development of OVD can be traced back to the introduction of Zero-Shot Detection (ZSD) by Bansal et al. [9], where models are trained on a limited set of categories and evaluated on novel categories. Building upon ZSD, Zareian et al. [10] further expanded the concept to OVD by leveraging a visual semantic space derived from image-text data, thereby enhancing the capability of category generalization.

Figure 2: Illustration of Language-Aware Selective Fusion (LASF). We illustrate the processes of typical cross-modality fusion in G-DINO [3] and language-aware selective fusion. LASF entails query selection and query fusion: it selects the object embedding related to the text input and fuses it with the learnable content query to improve prediction accuracy, whereas G-DINO directly fuses the query with the text embedding. OV-DINO with LASF achieves higher accuracy than G-DINO (e.g. 87% vs 63% for “person”, 93% vs 55% for “tennis racket”), highlighting the effectiveness of LASF in enhancing prediction accuracy.

Recent studies [11, 12, 13] have catalyzed the development of open-world vision methods [14, 15, 16, 3, 1] that detect objects outside pre-defined categories. These methods typically pre-train on large-scale grounding datasets and then generate pseudo-labels for image-text data. This introduces two distinct challenges: (i) data noise due to the pseudo-labels introduced from diverse data sources, and (ii) a semantic gap between image regions and textual descriptions. Figure 1 illustrates this phenomenon.

The first challenge is attributed to the limited vocabulary of the training data: models trained on such data generalize poorly, which leads to inaccurate predictions when pseudo-labeling image-text data. Methods such as GLIP [1], GLIPv2 [2], and G-DINO [3] approach detection as a grounding task, pre-training on large-scale detection and grounding datasets and then generating pseudo-labels on image-text data. This paradigm of pre-training followed by pseudo-labeling inevitably introduces noise. Detection-centric data unification, which integrates the various pre-training data sources into a detection-centric data format, can solve the problem. This integration avoids the noise of pseudo-label generation and provides the model with more accurate supervision while allowing for flexible expansion of data sources.

The second challenge is attributed to semantic confusion in region-level modality fusion and alignment. To fuse the region information with the text input, GLIP [1] introduces complex deep fusion in the encoder stage, and G-DINO [3] proposes a lightweight feature enhancer based on bidirectional cross-attention and extends fusion to the decoder. DetCLIP [17], in contrast, discards most cross-modality fusion blocks except for the region-word alignment, which simplifies the model architecture but overlooks crucial category information fusion. OVD models make predictions based on both image and language text input, with the language text input serving as a query for the model and influencing the classification labels. Fusing text and region embeddings allows the representation of the text embedding to be dynamically adjusted according to the object information in the image, enabling better alignment between text and region embeddings. Therefore, it is crucial to accurately fuse and align modality information so that OVD models effectively capture image information and integrate it with the language text input.

TABLE I: Comparison of OVD methods. We compare OV-DINO with previous OVD methods in terms of method type, modality fusion, and pseudo-label generation. OV-DINO is a unified detection-centric method with LASF, eliminating the need for pseudo-label generation.
Method | Type | Modality Fusion | Pseudo-Label
GLIP [1] | Grounding | DeepFusion | Y
GLIPv2 [2] | Grounding | DeepFusion | Y
G-DINO [3] | Grounding | CrossAttnFusion | Y
DetCLIP [17] | Detection | - | Y
YOLO-World [18] | Detection | RepVL-PAN | Y
OV-DINO (Ours) | Detection | LASF | N

To address both key challenges, we introduce a novel method called OV-DINO for open-vocabulary detection. For the first challenge, we propose the Unified Data Integration (UniDI) pipeline, which integrates diverse data sources into a unified detection-centric data format so that the model can be pre-trained on large-scale datasets end-to-end, eliminating the need for pseudo-label generation on image-text data. UniDI treats all text inputs, including detection nouns, grounding phrases, and image captions, as categories for detection-centric unification. For the second challenge, we propose a Language-Aware Selective Fusion (LASF) module for region-level cross-modality fusion and alignment. The LASF module facilitates selective and dynamic fusion of text-related object embedding, as illustrated in Figure 2. The language-aware selection and fusion processes enable the model to make more accurate predictions based on the language text input. Moreover, we propose a simple adaptation of the supervised training procedure used in DINO [8] to standardize the training process for open-vocabulary detection, requiring only minimal modifications to the existing framework. To verify its effectiveness, we conduct extensive experiments on the popular open-vocabulary detection datasets COCO [19] and LVIS [20] under zero-shot and fine-tuning settings. The results demonstrate that OV-DINO achieves state-of-the-art performance on both datasets and in both settings. To highlight the characteristics of our model, we compare OV-DINO with recent methods in terms of method type, modality fusion, and pseudo-label generation in Table I.

In summary, our main contributions are outlined as follows:

  • We present OV-DINO, a novel unified open-vocabulary detection approach that offers superior performance and effectiveness for practical real-world applications.

  • We propose a Unified Data Integration pipeline that integrates diverse data sources for end-to-end pre-training, and a Language-Aware Selective Fusion module to improve the vision-language understanding of the model.

  • The proposed OV-DINO shows significant performance improvements over previous methods on the COCO and LVIS benchmarks, achieving relative improvements of +2.5% AP on COCO and +13.6% AP on LVIS over G-DINO in zero-shot evaluation. The pre-trained model and code will be open-sourced to support open-end vision development.

2 Related Work

Vision-Language Pre-Training. Conventional supervised vision methods [21, 22, 23, 24, 25] often rely on manual human annotation, thereby constraining the model’s capacity for generalization. Additionally, it is challenging to define a comprehensive list of categories and collect sufficient sample data for rare categories [14, 26, 27]. The expensive labeling cost limits the wide application of vision models in open-world scenarios. To overcome the data annotation limitation, Vision-Language Pre-training has been proposed as a natural extension of the successful pre-train-then-fine-tune scheme in natural language processing (NLP) [28, 29] and computer vision [30]. Dual-stream approaches such as CLIP [11] and ALIGN [12] have shown great zero-shot classification ability by pre-training on large-scale image-text pair data (e.g. CC12M [31], YFCC100M [32], LAION-5B [33]) with cross-modal contrastive learning. Single-stream approaches [34, 35, 36] directly model the relation of vision and text embedding by two separate transformer-based encoders, which perform well in tasks like image-text [37, 38, 39] and VQA [40, 41, 42]. Recently, VLMo [43], BLIP [44] and BLIPv2 [45] further explore a hybrid architecture incorporating both single-stream and two-stream architectures to facilitate a more cohesive way of vision-language understanding and generation. However, these models primarily focus on learning whole-image visual representations and cannot be directly applied to more complex core computer vision tasks such as segmentation and detection, which necessitate fine-grained semantic understanding.

Open-Vocabulary Detection. Traditional object detection methods [4, 5, 6] have been successful in supervised scenarios, but face challenges in adapting to open-world scenarios with a large number of classes. It is challenging to explore approaches to acquire more semantic concepts for tasks related to Open-Vocabulary Detection (OVD). Recent approaches such as RegionCLIP [46], Baron [46], and ViLD [47] have concentrated on extracting intricate semantic correspondences and information to improve the coverage of new categories. However, these approaches are based on the pre-trained CLIP model, which restricts their capacity for generalization. Furthermore, recent methods like GLIP [1], G-DINO [3], and GLIPv2 [2] aim to integrate multiple data sources to enrich the model’s concept library. These approaches consider object detection as a grounding task and generate pseudo-labels for image-text data. However, the grounding-oriented unification imposes limitations on the input length of text, and the pseudo-label generation introduces noise to the model. Meanwhile, DetCLIP [17] proposes a dictionary-enriched visual-concept paralleled pre-training scheme to pre-train a model in a parallel way. DetCLIPv2 [48] further endeavors to unify all data sources in a scalable pre-training approach by utilizing different losses for various data sources, at the cost of architectural efficiency. Therefore, this paper proposes a unified framework to integrate all data types into the object detection data format. The proposed approach aims to provide more accurate supervisory information to the model while overcoming text length limitations and the necessity for pseudo-label generation. This unified framework is designed to enhance the generalization of the model and improve the performance of open-vocabulary detection.

Modality Information Fusion and Alignment. A Vision-Language Model (VLM) involves two distinct modalities, vision and language, so it is crucial to effectively fuse and align the modality information. Among image-level VLMs, CLIP [11] and ALIGN [12] directly align the vision and language modalities with a contrastive loss [49], and FILIP [13] further aligns the modality information at a fine-grained scale. To effectively align and fuse the cross-modal information, ALBEF [50] proposes to align before fusing: it utilizes a multi-modal encoder to fuse the image features and text features through cross-modal attention and aligns the modalities with an intermediate image-text contrastive loss. Flamingo [51] bridges the vision-only model and the language-only model via the GATED XATTN-DENSE layers, achieving strong results on numerous benchmarks. However, image-level modality fusion and alignment are insufficient for fine-grained vision-language understanding. Among region-level VLMs, RegionCLIP [46] directly aligns the region representation with the region description via region-text pre-training, and VLDet [52] treats region-text alignment as a bipartite matching problem. DetCLIP [17] and DetCLIPv2 [48] further extend the region-text alignment scheme via large-scale pre-training, achieving outstanding open-vocabulary detection performance. However, these approaches primarily concentrate on aligning modality information while ignoring region-text modality fusion. To fuse the language information with the region representation, GLIP [1] first integrates cross-modal information in the encoder stage using a cross-attention module and then performs alignment using the region-word alignment loss. G-DINO [3] further integrates modalities in the decoder stage. Although previous methods already consider fusion and alignment for cross-modal information interaction, they do not effectively balance the two. This paper aims to balance the fusion and alignment of modality information to enhance the model’s ability to capture precise image details guided by language input.

3 Method

Figure 3: Overall Framework of OV-DINO. The pre-training of OV-DINO draws on three primary data sources (Detection, Grounding, Image-Text). OV-DINO has three components: text embedding extraction, image embedding extraction, and language-aware selective fusion. First, we process the text inputs with the Unified Data Integration pipeline to ensure a consistent embedding representation across these data sources. Then, the unified prompted text inputs go through a Text Encoder to extract the text embedding, and the original image inputs pass through an Image Encoder and several Encoder Layers to produce the multi-scale refined image embedding. Subsequently, we employ the Language-Aware Query Selection to select the image embedding most relevant to the text embedding as the object embedding. The selected object embedding and the learnable content query go through the decoder layers with Language-Aware Query Fusion to fuse the content queries dynamically. Finally, OV-DINO outputs the classification scores by calculating the similarity of the projected query embedding with the text embedding through region-text alignment, and regresses the bounding boxes via an MLP layer.

This paper aims to develop a unified pre-training framework that integrates different data sources into a standardized format suitable for open-vocabulary detection tasks. To accomplish this objective, we propose a novel model called OV-DINO, which leverages diverse data sources to improve the performance of open-vocabulary detectors within a unified pre-training framework (Section 3.1). To facilitate unified pre-training across various data sources, we develop a Unified Data Integration (UniDI) pipeline that is applied to all data sources during embedding extraction (Section 3.2). To align and fuse the fine-grained semantics between text embedding and region-specific visual embedding, we introduce a Language-Aware Selective Fusion (LASF) module to dynamically select and fuse the region-level vision-language information (Section 3.3). To enable detection-centric pre-training, we also develop a simple pre-training framework that features a straightforward design and shares similar training objectives with the closed-set detector DINO [8] (Section 3.4).

3.1 Overview

The overall framework of OV-DINO is depicted in Figure 3, which includes a text encoder, an image encoder, and a detector head with some transformer encoder and decoder layers. Given an image accompanied by a prompt, the detection category nouns or grounding phrases are prompted to captions using specific templates to create a unified representation for the general text embedding. Subsequently, vanilla image and text embeddings are extracted using dedicated image and text encoders. After the embedding extraction, the vanilla image embedding along with the positional embedding are input into the transformer encoder layers to generate refined image embedding. To improve the relevance between the image embedding and the text embedding, a language-aware query selection module is employed to select the object embedding associated with the text embedding. The selected object embedding serves as dynamic context embedding and is merged with static learnable content queries in the decoder using a language-aware query fusion module. The output queries from the final decoder layers are then used for classification projection and box regression to predict the corresponding classification scores and regress the object boxes. The model is pre-trained to align the region-specific image embedding with the related text embedding through the region-text alignment. It is optimized using the classification (alignment) loss and the box regression loss.
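As a rough illustration of the final classification step described above, the sketch below scores each decoder output query against each text embedding with a dot product. The sigmoid squashing, the function name `alignment_scores`, and the toy vectors are assumptions made for illustration; the text only states that scores come from the similarity between the projected query embedding and the text embedding.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def alignment_scores(queries: List[Vector], text_embeds: List[Vector]) -> List[List[float]]:
    """Return a [Q, C] matrix of sigmoid(query . text) scores (sigmoid is an assumption)."""
    scores = []
    for q in queries:
        row = []
        for t in text_embeds:
            logit = sum(a * b for a, b in zip(q, t))  # dot-product similarity
            row.append(1.0 / (1.0 + math.exp(-logit)))
        scores.append(row)
    return scores


# Toy example: 2 queries, 3 text embeddings, embedding dimension 2.
queries = [[2.0, 0.0], [0.0, 0.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
scores = alignment_scores(queries, text_embeds)
# Index of the best-matching text embedding for each query.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best[0])  # 0: the first query aligns best with the first text embedding
```

Because every text input is treated as a category, the number of columns in the score matrix simply follows the number of text inputs supplied with the image.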

Figure 4: Architecture of the Language-Aware Selective Fusion (LASF). The LASF module consists of two main components: language-aware query selection and language-aware query fusion. We illustrate three variants of the LASF module based on the insertion location: (a) Later-LASF, (b) Middle-LASF, and (c) Early-LASF. We also illustrate (d) the Cross Attention Fusion proposed in G-DINO [3] for comparison.

3.2 Unified Data Integration

In the pre-training stage of OV-DINO, there are three primary data sources: detection, grounding, and image-text data, each with its own annotation format. For example, detection data is annotated with class labels and box coordinates, grounding data is annotated with a caption, token positive indices, and box coordinates, and image-text data consists solely of a text description of the image. Typically, each type of data requires distinct processing methods, such as designing diverse loss functions for detection data sources and generating pseudo-labels for image-text data. This increases the complexity of model optimization and prevents the model from reaching its optimal performance. To address this problem, OV-DINO converts all data into a unified detection-centric format during data preparation, thereby harmonizing data from diverse sources for end-to-end training. Integrating detection and grounding data is relatively straightforward, as grounding data can be considered a specific type of detection data in which each image has multiple grounding phrases. The challenge lies in seamlessly transforming large-scale image-text data into the detection data format. Drawing inspiration from Detic [53], we argue that the caption of an image can be treated as a unique category for that image, and that an image-sized bounding box can serve as its annotation box. This approach, called Caption Box, enables the merging of these three types of data into a detection-centric data format.
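The Caption Box idea can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the `DetectionSample` record, the `caption_to_detection` helper, the file name, and the pixel-coordinate box convention `(x_min, y_min, x_max, y_max)` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class DetectionSample:
    """Detection-centric record with one text entry per box (hypothetical schema)."""
    image_path: str
    width: int
    height: int
    boxes: List[Box]
    texts: List[str]


def caption_to_detection(image_path: str, width: int, height: int,
                         caption: str) -> DetectionSample:
    """Caption Box: the caption is the only category and the whole image is its box."""
    full_image_box: Box = (0.0, 0.0, float(width), float(height))
    return DetectionSample(image_path, width, height, [full_image_box], [caption])


sample = caption_to_detection("img_0001.jpg", 640, 480,
                              "a dog chasing a ball on a beach")
print(sample.boxes)  # [(0.0, 0.0, 640.0, 480.0)]
print(sample.texts)  # ['a dog chasing a ball on a beach']
```

An image-text pair converted this way has the same shape as a detection sample with a single instance, so it can be batched with detection and grounding data without a separate code path.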

To handle various data sources, we have established a standardized format for representing triplets of data as $(x, \{b_i\}_{i=1}^{n}, y)$, where $x \in \mathbb{R}^{3 \times H \times W}$ represents the image input, $\{b_i \in \mathbb{R}^{4}\}_{i=1}^{n}$ represents the bounding boxes, and $y \in \mathbb{R}^{C}$ represents the language text inputs. Here, $H$ stands for the image height, $W$ for the image width, and $n$ for the number of object instances. The bounding boxes $b_i$ are the annotated boxes for detection and grounding data, while for image-text data they represent the image-sized box. The language text input $y_i$ varies depending on the data type: for detection data, it consists of pre-defined category names; for grounding data, it represents the grounding nouns or phrases of entities; and for image-text data, it is the entire caption. To ensure a consistent representation in the language text embedding, we employ simple templates to prompt detection and grounding data (e.g., a photo of {category}.), while leaving the text input for image-text data unchanged since it already serves as a caption. This approach is referred to as Unified Prompt, which enables all text inputs to be represented as captions.
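A minimal sketch of the Unified Prompt step follows. The template string is the one quoted in the text; the function name `unify_text_inputs` and the source labels `"detection"`, `"grounding"`, and `"image_text"` are hypothetical names chosen for this example.

```python
from typing import List

PROMPT_TEMPLATE = "a photo of {category}."  # template quoted in the text


def unify_text_inputs(texts: List[str], source: str) -> List[str]:
    """Return caption-style text inputs for one sample.

    `source` is one of "detection", "grounding", "image_text" (hypothetical labels).
    """
    if source in ("detection", "grounding"):
        # Category names and grounding phrases are wrapped in the template.
        return [PROMPT_TEMPLATE.format(category=t) for t in texts]
    if source == "image_text":
        # The caption already is a caption, so it is left unchanged.
        return list(texts)
    raise ValueError(f"unknown data source: {source!r}")


print(unify_text_inputs(["person", "tennis racket"], "detection"))
# ['a photo of person.', 'a photo of tennis racket.']
print(unify_text_inputs(["a man holding a tennis racket"], "image_text"))
# ['a man holding a tennis racket']
```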

With the unified data integration pipeline (Caption Box and Unified Prompt), we can pre-train the model by combining training data from different data sources (detection, grounding, and image-text data), thereby reducing the complexities of model optimization.

3.3 Language-Aware Selective Fusion

The open-vocabulary detection model is designed to identify objects within an image based on the provided text inputs, where the text inputs can serve as a query for the model. To ensure that the query in the decoder aligns with the region context, we propose a Language-Aware Selective Fusion module. This module consists of two key components: language-aware query selection and language-aware query fusion. The detailed architecture of the Language-Aware Query Fusion can be seen in Figure 4 (a).

The language-aware query selection component selects the object embedding by assessing the similarity between the image embedding and the text embedding. It computes the similarity of the multi-scale image embedding $E_{enc}$ and the text embedding $E_t$ and then utilizes $\mathrm{RankTop}$ to choose the most relevant proposal embedding $E_{sp}$ and object embedding $E_{so}$. The selected proposal embedding is utilized to initialize the reference anchors, and the selected object embedding $E_{so}$ is forwarded for subsequent query fusion.

The language-aware query fusion component gradually fuses language-aware object embedding while preserving the original semantics of the content queries. This component, a crucial part of the decoder layers, is iterated $M$ times. Each decoder layer consists of several sub-layers including self-attention, cross-attention, gated-cross-attention, gated-feed-forward, and feed-forward layers. Initially, it takes the multi-scale image embedding $E_{enc}$, the selected object embedding $E_{so}$, and the learnable content query $Q_{lc}$ as input, and then dynamically updates the content query $Q_{lc}$.

The language-aware query selection can be formulated as follows:

$$E_{so}, E_{sp} = \mathrm{RankTop}(E_{enc} \otimes E_t^{T}) \tag{1}$$

where $E_t^{T}$ denotes the transpose of $E_t$, $\otimes$ denotes the Kronecker product [54], and $\mathrm{RankTop}$ is a parameter-free operation that arranges the elements in descending order and then selects the top $Q$ elements.
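To make the selection step concrete, the following is a minimal pure-Python sketch for a single image. It assumes, as in Algorithm 1, that similarity is a plain matrix product and that each image token is scored by its best-matching text embedding; the function name `rank_top` and the toy vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def rank_top(embed_enc: List[Vector], embed_t: List[Vector],
             num_queries: int) -> Tuple[List[int], List[Vector]]:
    """Pick the num_queries image tokens most similar to any text embedding."""
    # Similarity of every image token (P) with every text embedding (C): [P, C].
    sim = [[dot(e, t) for t in embed_t] for e in embed_enc]
    # Score each token by its best-matching text entry (assumption from Algorithm 1).
    scores = [max(row) for row in sim]
    # Sort token indices by descending score and keep the first num_queries.
    order = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    top_idx = order[:num_queries]
    return top_idx, [embed_enc[p] for p in top_idx]


embed_enc = [[1.0, 0.0], [0.2, 0.1], [0.0, 2.0], [0.5, 0.5]]  # P = 4, D = 2
embed_t = [[1.0, 0.0], [0.0, 1.0]]                            # C = 2, D = 2
top_idx, embed_so = rank_top(embed_enc, embed_t, num_queries=2)
print(top_idx)   # [2, 0]
print(embed_so)  # [[0.0, 2.0], [1.0, 0.0]]
```

In the full model the same indices would also gather the predicted box coordinates to initialize the reference anchors; that part is omitted here.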

The language-aware query fusion can be formulated as follows:

\begin{align}
Q^{i}_{lc_0} &= \Phi_{Attn}(qkv = Q^{i-1}_{lc}), \tag{2} \\
Q^{i}_{lc_1} &= \Phi_{Attn}(q = Q^{i}_{lc_0},\; kv = E_{enc}), \tag{3} \\
Q^{i}_{lc_2} &= Q^{i}_{lc_1} + \tanh(\alpha_a) \ast \Phi_{Attn}(q = Q^{i}_{lc_1},\; kv = E_{so}), \tag{4} \\
Q^{i}_{lc_3} &= Q^{i}_{lc_2} + \tanh(\alpha_b) \ast \Phi_{FFW}(Q^{i}_{lc_2}), \tag{5} \\
Q^{i}_{lc} &= \Phi_{FFW}(Q^{i}_{lc_3}), \tag{6}
\end{align}

where the superscript $i$ denotes the module index, $\Phi_{Attn}$ denotes the attention layer, $\Phi_{FFW}$ denotes the feed-forward layer, and $\alpha_a$ and $\alpha_b$ are learnable parameters initialized to zero. Since $\tanh(0) = 0$, both gated branches contribute nothing at the start of training, which keeps training consistent with the original decoder framework while language-aware context is gradually incorporated into the content query.
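The effect of the zero initialization can be checked with a minimal sketch of the gated residual in Eq. (4). The helper name `gated_residual` and the toy vectors are illustrative assumptions, not part of the released code; the attention output is replaced by a fixed stand-in vector.

```python
import math


def gated_residual(query, update, alpha):
    """Return query + tanh(alpha) * update, element-wise over flat lists."""
    gate = math.tanh(alpha)
    return [q + gate * u for q, u in zip(query, update)]


# Toy content query and a stand-in for Phi_Attn(q=Q_lc1, kv=E_so).
q_lc1 = [0.5, -1.0, 2.0]
attn_out = [10.0, 10.0, 10.0]

# At initialization alpha = 0, so the gate is closed and the query is unchanged.
at_init = gated_residual(q_lc1, attn_out, alpha=0.0)

# Once alpha moves away from zero, language-aware context starts to flow in.
later = gated_residual(q_lc1, attn_out, alpha=0.1)
```

With `alpha = 0.0` the output equals the input query, so the decoder initially behaves as if the gated branch were absent.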

To facilitate comprehension, we present the pseudocode for the Language-Aware Selective Fusion (LASF) in Algorithm 1. We also illustrate three variants of LASF in Figure 4, named according to the insertion position: Later-LASF, Middle-LASF, and Early-LASF. The Cross Attention Fusion (CAF) proposed in G-DINO [3] is also considered for comparison.

Algorithm 1 Pseudocode of LASF in a PyTorch-like style.
def laqs(embed_enc, embed_t):
    """
    embed_enc: encoded embedding, shape: [B, P, D].
    embed_t: text embedding, shape: [B, C, D].
    """
    # similarity between every patch and every prompted text
    enc_cls = embed_enc @ embed_t.transpose(-1, -2)  # [B, P, C]
    enc_coord = BoxMLP(embed_enc)  # [B, P, 4]
    # keep the Q patches with the highest best-matching text score
    topk_idx = TopK(enc_cls.max(-1)[0], Q, dim=1)
    # embed_so: [B, Q, D]
    embed_so = Gather(embed_enc, dim=1, index=topk_idx)
    # embed_sp: [B, Q, 4]
    embed_sp = Gather(enc_coord, dim=1, index=topk_idx)
    return embed_so, embed_sp
def laqf(q_lc, embed_enc, embed_so):
    """
    q_lc: learnable content query, shape: [B, Q, D].
    embed_enc: encoded embedding, shape: [B, P, D].
    embed_so: selected object embedding, shape: [B, Q, D].
    a, b: learnable gate parameters, initialized to zero.
    """
    # self-attention
    q_lc = Attn(qkv=q_lc)
    # cross-attention
    q_lc = Attn(q=q_lc, kv=embed_enc)
    # gated-cross-attention
    q_lc = q_lc + Tanh(a) * Attn(q=q_lc, kv=embed_so)
    # gated-ffw
    q_lc = q_lc + Tanh(b) * FFW(q_lc)
    # ffw
    q_lc = FFW(q_lc)
    return q_lc
def lasf(embed_enc, embed_t, q_lc):
    """
    embed_enc: encoded embedding, shape: [B, P, D].
    embed_t: text embedding, shape: [B, C, D].
    q_lc: learnable content query, shape: [B, Q, D].
    NOTE: B is the batch size, P is the patch number, D is the dimension number, C is the caption number, and Q is the query number.
    """
    # 1. Language-aware query selection.
    embed_so, embed_sp = laqs(embed_enc, embed_t)
    # embed_sp is used to initialize the reference points,
    # omitted here for conciseness.
    # 2. Language-aware query fusion.
    # Decoder layers with laqf, iterate M times.
    for _ in range(M):
        q_lc = laqf(q_lc, embed_enc, embed_so)
    q_sf = q_lc
    return q_sf

 TopK: topk selection; Gather: gathers values along index specified by dim; Attn: attention layer; FFW: feed-forward layer; Tanh: tanh activation function.
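The selection step of Algorithm 1 can also be written as a small runnable sketch for a single image, using plain Python lists in place of tensors. The function name `select_queries` and the toy embeddings are assumptions for illustration; the box branch (BoxMLP) is left out, and the batch dimension is dropped.

```python
def select_queries(embed_enc, embed_t, num_queries):
    """Pick the patches whose best text similarity is highest.

    embed_enc: P x D list of patch embeddings.
    embed_t:   C x D list of text embeddings.
    Returns (embed_so, topk_idx) with embed_so of size num_queries x D.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Max over texts of the patch-text similarity, one score per patch.
    scores = [max(dot(patch, text) for text in embed_t) for patch in embed_enc]
    # Indices of the num_queries highest-scoring patches, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    topk_idx = order[:num_queries]
    # Gather the selected patch embeddings as object embeddings.
    embed_so = [embed_enc[i] for i in topk_idx]
    return embed_so, topk_idx


# Four toy patches (P=4, D=2) and one toy text embedding (C=1).
patches = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [0.9, 0.2]]
texts = [[1.0, 0.0]]
selected, idx = select_queries(patches, texts, num_queries=2)
```

Here the two patches most aligned with the text direction are kept, which mirrors how the selected object embedding depends on the prompted text rather than on image content alone.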

3.4 Detection-Centric Pre-Training

The unified data integration introduces a detection-centric data format that transforms different types of data into a format suitable for detection. This approach combines a range of data sources such as detection data, grounding data, and image-text data, allowing for the pre-training of a model within a unified framework with a focus on detection.

Model Forward. OV-DINO takes the triplet-wise data ($x$, $\{b_i\}_{i=1}^{n}$, $\{y_i\}_{i=1}^{n}$) as input. The image encoder $\Phi_I$ is an image backbone that extracts the image embedding $E_i \in \mathbb{R}^{P \times D}$ from the input image $x \in \mathbb{R}^{H \times W \times 3}$, where $P$ represents the spatial size of the flattened image embedding and $D$ represents the embedding dimension. The text encoder $\Phi_T$ takes the language text $\{y_i \in \mathbb{R}^{C}\}_{i=1}^{n}$ as input and obtains the text embedding $E_t \in \mathbb{R}^{C \times D}$.
Based on DINO [8], the OV-DINO detector head comprises a transformer encoder, a language-aware query selection module, and a transformer decoder with a language-aware query fusion module. The transformer encoder $\Phi_{Enc}$ takes the encoded image embedding $E_i$ as input and outputs the refined multi-scale image embedding $E_{enc}$. The language-aware query selection module selects the image embeddings most relevant to the text embedding $E_t$ as the object embedding $E_{so} \in \mathbb{R}^{Q \times D}$. The transformer decoder takes the selected object embedding $E_{so}$ and a learnable content query $Q_{lc} \in \mathbb{R}^{Q \times D}$ as inputs and interacts with the refined image embedding $E_{enc}$, which enables query classification to follow the language text content in the selective fusion module.
After the decoder, a classification projection layer $F_c$ projects the query embedding to the classification query logits $O \in \mathbb{R}^{Q \times D}$, and a regression layer $F_r$ predicts the bounding box coordinates $B \in \mathbb{R}^{Q \times 4}$. $Q$ and $C$ denote the number of queries and prompted captions, respectively. The classification alignment score matrix $S \in \mathbb{R}^{Q \times C}$ is obtained by calculating the similarity of $O$ and $E_t^T$. The overall model forward process can be formulated as follows:

$$E_{i} = \Phi_{I}(x), \; E_{t} = \Phi_{T}(y_i), \; E_{enc} = \Phi_{Enc}(E_{i}), \tag{7}$$
$$E_{so} = \Phi_{QS}(E_{enc}, E_{t}), \; Q_{sf} = \Phi_{QF}(E_{enc}, E_{so}, Q_{lc}), \tag{8}$$
$$O = F_c(Q_{sf}), \; B = F_r(Q_{sf}), \; S = O \otimes E_{t}^{T}, \tag{9}$$

where $E_t^T$ denotes the transpose of $E_t$, $\otimes$ denotes the Kronecker product [54], and $E_{sp}$ is omitted for conciseness.
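As a shape check on the last step of the forward pass, the sketch below computes the alignment score matrix from toy query logits and text embeddings. It reads the product of $O$ and $E_t^T$ as an ordinary matrix product, which is what the stated shapes ($Q \times D$ times $D \times C$ giving $Q \times C$) imply; the function name `alignment_scores` and the toy values are assumptions for illustration.

```python
def alignment_scores(query_logits, text_embed):
    """Return S = O @ E_t^T as a Q x C list of lists.

    query_logits: Q x D list (the matrix O).
    text_embed:   C x D list (the matrix E_t).
    """
    return [
        [sum(o * e for o, e in zip(query, text)) for text in text_embed]
        for query in query_logits
    ]


# Q = 3 queries, C = 2 prompted texts, D = 2 embedding dimensions.
O = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
E_t = [[1.0, 0.0], [0.5, 0.5]]
S = alignment_scores(O, E_t)
```

Each row of `S` scores one query against every prompted text, so the row-wise maximum indicates the text a query aligns with best.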

Model Optimization. The classification ground-truth $\mathrm{GT_{cls}} \in \{0,1\}^{Q \times C}$ is a matrix that indicates the matched relationship between predicted regions and prompted texts. The bounding box ground-truth $\mathrm{GT_{box}} \in \mathbb{R}^{Q \times 4}$ is a matrix that contains the corresponding box coordinates. Both are constructed using the bipartite matching algorithm as described in [8, 7]. The classification loss $\mathcal{L}_{cls}$ is calculated using the predicted alignment score $S$ and the classification ground-truth $\mathrm{GT_{cls}}$. The regression loss $\mathcal{L}_{reg}$ is calculated using the regressed bounding box $B$ and the bounding box ground-truth $\mathrm{GT_{box}}$. The regression loss encompasses both the box loss $\mathcal{L}_{box}$ and the generalized intersection over union (GIoU) loss $\mathcal{L}_{giou}$. In addition to the classification and regression losses, a denoising loss $\mathcal{L}_{dn}$ [55] is introduced to enhance the stability of the training process.
This loss function contributes to improving the robustness of the model during training. To maintain the simplicity of the detection-centric framework, the optimization objective of the pre-training stage is kept consistent with DINO [8]. The whole optimization objective $\mathcal{L}$ is expressed as a combination of different loss components, and can be written as:

$$\mathcal{L} = \alpha \mathcal{L}_{cls} + \beta \mathcal{L}_{box} + \gamma \mathcal{L}_{giou} + \mathcal{L}_{dn}. \tag{10}$$

Here, $\alpha$, $\beta$ and $\gamma$ represent the weight factors of $\mathcal{L}_{cls}$, $\mathcal{L}_{box}$ and $\mathcal{L}_{giou}$, respectively. $\mathcal{L}_{cls}$ is implemented by a sigmoid focal loss [56]. $\mathcal{L}_{box}$ is implemented by an L1 loss. $\mathcal{L}_{giou}$ is implemented by a GIoU loss [57]. $\mathcal{L}_{dn}$ represents the sum of the denoising losses [55] of the box and label.
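To make the composition of Eq. (10) concrete, the following sketch implements the three loss terms for a single matched prediction and combines them with the weights reported in the paper (2, 5, 2). Several details are assumptions rather than the authors' settings: the focal-loss constants (0.25 and 2.0, the common defaults), the corner box format (x1, y1, x2, y2), and the helper names. The denoising term is passed in as a plain number.

```python
import math


def sigmoid_focal_loss(logit, target, alpha=0.25, focusing=2.0):
    """Focal loss for one binary logit; alpha and focusing are assumed defaults."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** focusing * math.log(p_t)


def l1_loss(pred, gt):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def giou_loss(pred, gt):
    """1 - GIoU for two non-degenerate boxes given as (x1, y1, x2, y2)."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    union = area(pred) + area(gt) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(pred[2], gt[2]) - min(pred[0], gt[0])) * (
        max(pred[3], gt[3]) - min(pred[1], gt[1])
    )
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


def total_loss(l_cls, l_box, l_giou, l_dn, alpha=2.0, beta=5.0, gamma=2.0):
    """Weighted sum of Eq. (10); the denoising term is unweighted."""
    return alpha * l_cls + beta * l_box + gamma * l_giou + l_dn


pred_box, gt_box = (0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0)
loss = total_loss(
    sigmoid_focal_loss(0.0, 1),
    l1_loss(pred_box, gt_box),
    giou_loss(pred_box, gt_box),
    l_dn=0.0,
)
```

The GIoU term stays informative even for disjoint boxes, as in the toy pair above, where plain IoU would be zero regardless of how far apart the boxes are.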

4 Experiments

TABLE II: Pre-Training Data. The dataset specifications used for pre-training OV-DINO. # Texts denotes the number of categories for the detection dataset, the number of phrases for the grounding data, and the number of captions for the image-text dataset, respectively. # Images denotes the number of images. # Anno. denotes the number of instance annotations. CC1M refers to our filtered 1M subset without any instance annotations.
| Dataset | Type | # Texts | # Images | # Anno. |
| --- | --- | --- | --- | --- |
| O365[58] | Detection | 365 | 609K | 9621K |
| GQA[59] | Grounding | 387K | 621K | 3681K |
| Flickr30k[60] | Grounding | 94K | 149K | 641K |
| CC1M[31] | Image-Text | 1M | 1M | - |

In this section, we demonstrate the effectiveness of the proposed OV-DINO by conducting extensive experiments on two widely used open-vocabulary detection benchmarks: COCO [19] and LVIS [20]. We provide an overview of the pre-training datasets and the evaluation metrics in Section 4.1, and describe the implementation details in Section 4.2. We pre-train OV-DINO on large-scale diverse datasets and perform a zero-shot evaluation on the COCO and LVIS benchmarks. Following this, we fine-tune the pre-trained model on the COCO dataset and evaluate its closed-set detection performance, as discussed in Section 4.3. To demonstrate the effectiveness of our model design, we conduct ablations in Section 4.4. Additionally, we present qualitative results in comparison with other methods in Section 4.5.

4.1 Pre-Training Data and Evaluation Metric

Figure 5: Illustration of the Noise in the Image-caption Dataset. The upper figure is the image, and the bottom text is the related caption for each sample. The sample on the left shows a high score of image-text similarity, while the sample on the right shows a lower score.
TABLE III: Hyper-Parameters in Pre-Training and Fine-Tuning of OV-DINO. We emphasize the essential hyper-parameters for pre-training, while only addressing the distinct items of fine-tuning that differ from pre-training.
| Item | Value |
| --- | --- |
| Pre-Training Config | |
| batch size | 128 |
| training epochs | 24 |
| optimizer | AdamW[61] |
| weight decay | 1e-4 |
| optimizer momentum | $\beta_1 = 0.9, \beta_2 = 0.999$ |
| warmup iter | 1000 |
| lr of image encoder | 2e-4 |
| lr of text encoder | 2e-5 |
| learning rate schedule | multi-step decay |
| clip max norm | 0.1 |
| input resolution | [800, 1333] |
| hidden dim (D) | 256 |
| # of encoder layers (N) | 6 |
| # of decoder layers (M) | 6 |
| # of heads | 8 |
| # of queries (Q) | 900 |
| # of prompted text (C) | 150 |
| cost of class | 1 |
| cost of bbox | 5 |
| cost of giou | 2 |
| loss of class ($\alpha$) | 2 |
| loss of bbox ($\beta$) | 5 |
| loss of giou ($\gamma$) | 2 |
| Fine-Tuning Config | |
| batch size | 32 |
| lr of image encoder | 1e-5 |
| lr of text encoder | 1e-6 |
| # of prompted text (C) | 80 |

Pre-Training Data. In our experiments, we make use of several datasets as referenced in [1, 3, 59]. These datasets comprise the Objects365 detection dataset [58], the GoldG grounding dataset [59], and the Conceptual Captions image-text dataset [31], as detailed in Table II. Our model is trained using the detection and grounding datasets following the methodology outlined in GLIP [1]. However, the image-text dataset contains a significant number of low-quality image-text pairs, as illustrated in Figure 5. The caption of the left sample effectively describes the image content, whereas the caption of the right sample does not align well with the image content. To mitigate the noise in the image-text dataset, we employ CLIP-Large [11] to select 1 million image-text pairs from the original CC3M dataset. The filtering process computes the image-text similarity of all 3 million pairs, ranks the pairs by this similarity, and keeps the top 1 million.
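The similarity-based filtering can be sketched as a rank-and-truncate operation. In the minimal example below, the embeddings are hand-written stand-ins for what an image-text model such as CLIP would produce, and the function names `cosine` and `filter_top_pairs` are hypothetical; no actual CLIP call is made.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_top_pairs(pairs, keep):
    """Keep the `keep` pairs with the highest image-text similarity.

    pairs: list of (pair_id, image_embedding, text_embedding).
    Returns the retained pair ids, most similar first.
    """
    ranked = sorted(pairs, key=lambda p: cosine(p[1], p[2]), reverse=True)
    return [pair_id for pair_id, _, _ in ranked[:keep]]


# Three toy pairs: well aligned, unrelated, and partially aligned.
toy_pairs = [
    ("a", [1.0, 0.0], [1.0, 0.0]),
    ("b", [1.0, 0.0], [0.0, 1.0]),
    ("c", [1.0, 0.0], [1.0, 1.0]),
]
kept_ids = filter_top_pairs(toy_pairs, keep=2)
```

With 3 million pairs and `keep` set to 1 million, the same ranking would retain the best-aligned third of the data.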

Evaluation Metric. After pre-training, we evaluate the performance of the proposed OV-DINO under a zero-shot setting on the COCO [19] and LVIS [20] benchmarks. In addition, we conduct further analysis by fine-tuning the pre-trained model on the COCO dataset to explore the effectiveness of continual fine-tuning. Following previous methods [1, 3], we use the standard Average Precision (AP) metric to evaluate performance on COCO, and the Fixed AP [62] metric on LVIS for a fair comparison.

4.2 Implementation Details

TABLE IV: Zero-shot Domain Transfer Evaluation on LVIS MiniVal and Val 1.0 Datasets (%). APr, APc, and APf indicate the AP of rare, common and frequent categories, respectively. Gray numbers denote that the model is trained on the LVIS dataset using either supervised or few-shot settings. CC3M denotes the pseudo-labeled CC3M in [18]. CC1M denotes a filtered subset from the CC3M dataset in our setting.
| Model | Image Encoder | Params | Pre-Training Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val AP | Val APr | Val APc | Val APf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DETR[7] | RN101 | - | LVIS | 17.8 | 3.2 | 12.9 | 24.8 | - | - | - | - |
| MDETR[59] | RN101 | 169M | GoldG, LVIS | 24.2 | 20.9 | 24.9 | 24.3 | - | - | - | - |
| MaskRCNN[6] | RN101 | - | LVIS | 33.3 | 26.3 | 34.0 | 33.9 | - | - | - | - |
| GLIP-T(A)[1] | Swin-T | 232M | O365 | 18.5 | 14.2 | 13.9 | 23.4 | 12.3 | 6.0 | 8.0 | 19.4 |
| GLIP-T(B)[1] | Swin-T | 232M | O365 | 17.8 | 13.5 | 12.8 | 22.2 | 11.3 | 4.2 | 7.6 | 18.6 |
| GLIP-T(C)[1] | Swin-T | 232M | O365, GoldG | 24.9 | 17.7 | 19.5 | 31.0 | 16.5 | 7.5 | 11.6 | 26.1 |
| GLIP-T[1] | Swin-T | 232M | O365, GoldG, Cap4M | 26.0 | 20.8 | 21.4 | 31.0 | 17.2 | 10.1 | 12.5 | 25.5 |
| G-DINO-T2[3] | Swin-T | 172M | O365, GoldG | 25.6 | 14.4 | 19.6 | 32.2 | - | - | - | - |
| G-DINO-T3[3] | Swin-T | 172M | O365, GoldG, Cap4M | 27.4 | 20.8 | 21.4 | 31.0 | - | - | - | - |
| DetCLIP-T(A)[17] | Swin-T | 155M | O365 | 28.8 | 26.0 | 28.0 | 30.0 | 22.1 | 18.4 | 20.1 | 19.4 |
| DetCLIP-T(B)[17] | Swin-T | 155M | O365, GoldG | 34.4 | 26.9 | 33.9 | 36.3 | 27.2 | 21.9 | 25.5 | 31.5 |
| DetCLIP-T[17] | Swin-T | 155M | O365, GoldG, YFCC1M | 35.9 | 33.2 | 35.7 | 36.4 | 28.4 | 25.0 | 27.0 | 31.6 |
| YOLO-World-S[18] | YOLOv8-S | 77M | O365, GoldG | 26.2 | 19.1 | 23.6 | 29.8 | 24.2 | 16.4 | 21.7 | 27.8 |
| YOLO-World-M[18] | YOLOv8-M | 92M | O365, GoldG | 31.0 | 23.8 | 29.2 | 33.9 | - | - | - | - |
| YOLO-World-L[18] | YOLOv8-L | 110M | O365, GoldG, CC3M | 35.4 | 27.6 | 34.1 | 38.0 | - | - | - | - |
| OV-DINO1(Ours) | Swin-T | 166M | O365 | 24.4 | 15.5 | 20.2 | 29.7 | 18.7 | 9.3 | 14.5 | 27.4 |
| OV-DINO2(Ours) | Swin-T | 166M | O365, GoldG | 39.4 | 31.5 | 38.9 | 41.3 | 32.2 | 26.2 | 30.1 | 37.3 |
| OV-DINO3(Ours) | Swin-T | 166M | O365, GoldG, CC1M | 40.0 | 34.6 | 39.5 | 41.5 | 32.9 | 29.1 | 30.4 | 37.4 |

Model Architecture. Constrained by the high cost of model training, we pre-train the model specifically using Swin-T [21] as the image encoder, which has shown superior performance compared to other methods. To ensure a fair comparison, we use the BERT-base from HuggingFace [63] as the text encoder, consistent with the approaches used by GLIP [1] and G-DINO [3]. To incorporate category names in detection data and noun phrases in grounding data during pre-training with image-text data, we adopt a unified data integration pipeline that prompts all category names or noun phrases with specific templates from CLIP [11], such as "a photo of {category}.". Following DINO [8], we extract multi-scale features at 4 scales ranging from 8x to 64x. Additionally, we set the maximum number of prompted texts to 150, encompassing the positive categories or phrases present in the image and randomly selected negative texts from all other data sources. For text embedding extraction, we employ the max-length padding mode and use mean pooling to aggregate the text embedding along the length dimension. We integrate a linear projection layer to project the image embedding into the same embedding space as the text embedding. By default, we set the number of queries to 900, with six transformer layers each in the encoder and the decoder.
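The prompt construction described above can be sketched as follows: positives come first, negatives are sampled at random to fill the budget of 150 texts, and every name is wrapped in the template. The function name `build_prompts`, the fixed seed, and the placeholder vocabulary are assumptions for illustration; how the released code orders or samples texts may differ.

```python
import random

TEMPLATE = "a photo of {}."


def build_prompts(positives, vocabulary, max_texts=150, seed=0):
    """Return at most max_texts prompted texts for one image.

    positives:  category names or noun phrases present in the image.
    vocabulary: pool of candidate names from which negatives are drawn.
    """
    rng = random.Random(seed)
    # Deduplicate positives while preserving their order.
    positives = list(dict.fromkeys(positives))[:max_texts]
    positive_set = set(positives)
    negatives = [name for name in vocabulary if name not in positive_set]
    # Fill the remaining budget with randomly chosen negatives.
    n_neg = min(len(negatives), max_texts - len(positives))
    sampled = rng.sample(negatives, n_neg)
    return [TEMPLATE.format(name) for name in positives + sampled]


# Hypothetical vocabulary of 300 class names plus one positive duplicate.
vocab = ["class_{}".format(i) for i in range(300)] + ["cat"]
prompts = build_prompts(["cat", "dog"], vocab, max_texts=150)
```

Keeping the text budget fixed gives every image the same number of classification targets, regardless of which data source it came from.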

Model Training. To maintain simplicity in the model, we adhere to a training procedure similar to the original DINO setting [8]. We adopt the AdamW [61] optimizer with a weight decay of 1e-4. The total batch size is 128, with a base learning rate of 2e-4 for all model parameters except the text encoder, which has a learning rate of 0.1 times the base learning rate (that is, 2e-5, as listed in Table III). During the fine-tuning stage on COCO, the base learning rate is adjusted to 1e-5, while the remaining hyper-parameters, apart from the items listed in Table III, remain the same as in the pre-training stage. Both pre-training and fine-tuning are conducted for 24 epochs (2x schedule), using a step learning rate schedule where the learning rate is reduced to 0.1 and 0.01 of the base learning rate at the 16th and 22nd epochs, respectively. The weights allocated to the classification loss, box loss, and GIoU loss are 2.0, 5.0, and 2.0, respectively. The weights for the matching cost components are identical to the loss weights except for the classification cost, which is given a weight of 1.0. The hyper-parameters used in the pre-training and fine-tuning stages of OV-DINO are detailed in Table III.
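The step schedule can be written as a small function of the epoch index. This is a sketch under one stated assumption: epochs are counted from zero and the drops take effect once the index reaches 16 and 22; the paper does not spell out the indexing convention, and the name `lr_at_epoch` is hypothetical.

```python
def lr_at_epoch(base_lr, epoch, milestones=(16, 22), factors=(0.1, 0.01)):
    """Learning rate for a 0-based epoch under a multi-step decay schedule."""
    scale = 1.0
    # Each milestone that has been reached overrides the previous scale.
    for milestone, factor in zip(milestones, factors):
        if epoch >= milestone:
            scale = factor
    return base_lr * scale


# Learning rate of the base parameter group over a 24-epoch run.
schedule = [lr_at_epoch(2e-4, epoch) for epoch in range(24)]
```

The same function applies to the fine-tuning stage by passing the fine-tuning base learning rate instead.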

4.3 Main Results

TABLE V: Zero-shot Domain Transfer and Fine-tuning Evaluation on COCO (%). OV-DINO outperforms prior methods in zero-shot evaluation. When further fully fine-tuned on COCO, OV-DINO surpasses the previous State-of-the-Art (SoTA) performance under the same setting. Gray numbers denote that the method is trained on the COCO dataset under supervised or few-shot settings. Zero-Shot and Fine-Tuning results are reported on COCO 2017 Val.
| Model | Image Encoder | Pre-Training Data | Data Size | Epochs | Zero-Shot | Fine-Tuning |
| --- | --- | --- | --- | --- | --- | --- |
| Faster RCNN[5] | RN50-FPN | COCO | 118K | 36 | - | 40.3 |
| Faster RCNN[5] | RN101-FPN | COCO | 118K | 36 | - | 41.8 |
| DyHead-T[64] | Swin-T | COCO | 118K | 24 | - | 49.7 |
| DINO-T[8] | Swin-T | COCO | 118K | 24 | - | 51.3 |
| GLIP-T(A)[1] | Swin-T | O365 | 0.66M | 30 | 42.9 | 52.9 |
| GLIP-T(B)[1] | Swin-T | O365 | 0.66M | 30 | 44.9 | 53.8 |
| GLIP-T(C)[1] | Swin-T | O365, GoldG | 1.43M | 30 | 46.7 | 55.1 |
| GLIP-T[1] | Swin-T | O365, GoldG, Cap4M | 5.43M | 30 | 46.3 | 54.9 |
| G-DINO-T1[3] | Swin-T | O365 | 0.61M | 50 | 46.7 | 56.9 |
| G-DINO-T2[3] | Swin-T | O365, GoldG | 1.38M | 50 | 48.1 | 57.1 |
| G-DINO-T3[3] | Swin-T | O365, GoldG, Cap4M | 5.38M | 50 | 48.4 | 57.2 |
| YOLO-World-S[18] | YOLOv8-S | O365, GoldG | 0.61M | 100 | 37.6 | 45.9 |
| YOLO-World-M[18] | YOLOv8-M | O365, GoldG | 0.61M | 100 | 42.8 | 51.2 |
| YOLO-World-L[18] | YOLOv8-L | O365, GoldG, CC3M | 1.63M | 100 | 45.1 | 53.3 |
| OV-DINO1(Ours) | Swin-T | O365 | 0.60M | 24 | 49.5 | 57.5 |
| OV-DINO2(Ours) | Swin-T | O365, GoldG | 1.38M | 24 | 50.6 | 58.4 |
| OV-DINO3(Ours) | Swin-T | O365, GoldG, CC1M | 2.38M | 24 | 50.2 | 58.2 |
TABLE VI: Ablations on Unified Data Integration and Language-Aware Query Fusion. We evaluate the zero-shot performance on LVIS MiniVal of the proposed methods. UniDI, UniPro, and CapBox denote the Unified Data Integration, Unified Prompt, and Caption Box, respectively.
# Pre-Training Data UniDI LASF AP APr APc APf
UniPro CapBox
0 O365-100K 18.3 10.1 14.8 22.8
1 O365-100K 18.9 12.8 15.2 23.4
2 O365-100K 19.2 10.5 16.5 23.1
3 O365-100K 19.5 12.8 16.6 23.4
4 O365-100K, CC-100K 20.6 13.1 17.9 24.4
5 O365-100K, CC-100K 22.0 14.0 20.0 25.2

LVIS Benchmark. In Table IV, we compare the proposed OV-DINO with recent state-of-the-art methods on the LVIS benchmark. The LVIS dataset is specifically designed to address long-tail objects and encompasses over 1000 categories for evaluation. We evaluate OV-DINO on the LVIS MiniVal and LVIS Val datasets under the zero-shot evaluation setting. OV-DINO surpasses previous state-of-the-art methods across various pre-training data settings. Notably, OV-DINO has fewer parameters than GLIP and G-DINO (166M versus 232M and 172M) and is pre-trained for only 24 epochs, a shorter schedule than those of other methods. Despite this, OV-DINO achieves superior results, showing its capability in detecting a wide range of categories. Moreover, when integrated with the image-text dataset, OV-DINO attains the highest AP results using the Swin-T image encoder under fair pre-training settings (40.0% AP on MiniVal and 32.9% AP on Val). These findings underscore the effectiveness of OV-DINO for object detection, particularly in scenarios involving diverse and numerous object categories.

COCO Benchmark. In Table V, we compare the proposed OV-DINO with recent state-of-the-art methods on the COCO benchmark in both zero-shot and fine-tuning settings. In the zero-shot setting, our models are pre-trained on various large-scale datasets and directly evaluated on the COCO dataset. First, we pre-train the model on the Objects365 (O365) dataset [58] and evaluate it in a zero-shot manner, where OV-DINO outperforms all previous models. OV-DINO achieves its best results when combined with the GoldG [59] data. Additionally, we fine-tune the pre-trained model on the COCO dataset, resulting in a new record of 58.4 AP on COCO2017 val using only Swin-T [21] as the image encoder. Interestingly, adding the image-text data slightly reduces performance on COCO (50.2 versus 50.6 AP in the zero-shot setting), potentially due to the limited category names in the COCO dataset. Nevertheless, we find that image-text data is essential for discovering more diverse categories, as demonstrated in the LVIS experiments.

4.4 Ablation Study

We conducted extensive ablation studies to analyze the effectiveness of the proposed OV-DINO. To reduce the cost of training with the full data, we randomly sampled 100,000 images from the original O365v1 [58] dataset and 100,000 images from the filtered CC3M [31] subset for all ablation studies. We set the batch size to 32 and the training schedule to 12 epochs. Unless specified, we pre-train OV-DINO on the sampled O365-100K and CC-100K datasets and evaluate zero-shot performance on the LVIS MiniVal dataset.

UniDI and LASF. In Table VI, we conduct an ablation study on the proposed Unified Data Integration (UniDI) and the Language-Aware Selective Fusion (LASF). The UniDI harmonizes different data sources through Unified Prompt and Caption Box, while the LASF selects and fuses the cross-modality information dynamically. The proposed UniDI and LASF consistently improve performance on LVIS MiniVal.

Variants of LASF. In Table VII, we compare variants of the proposed LASF with the Cross Attention Fusion (CAF) in G-DINO [3]. Figure 4 illustrates three variants of LASF based on the insertion location: Later-LASF, Middle-LASF, and Early-LASF. The architecture of CAF is also provided for comparison. All models in the ablations are pre-trained using a Swin-T as the image encoder on the sampled O365-100K subset. The results show that Later-LASF captures language-aware context more effectively than the CAF module (19.2 versus 18.9 AP), whereas Early-LASF and Middle-LASF do not surpass CAF. Since the Later-LASF variant demonstrates the best zero-shot transfer ability on the LVIS MiniVal benchmark, we adopt it as our default architecture.

Text Embedding Pooling. In Table VIII, we evaluate the impact of different text embedding pooling methods, such as mean-pooling and max-pooling of the text embedding. We pre-train the models on O365-100K and CC-100K with these two pooling methods, and it is observed that mean pooling demonstrates superior performance when applied to combined datasets. The mean-pooling method is effective in capturing the comprehensive representation of a prompted caption, making it suitable for UniDI.
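The two pooling choices differ only in how token embeddings are reduced along the length dimension. The sketch below contrasts them on a toy padded sequence; skipping padded positions via a mask is an assumption made here for clarity, and the function name `pool_text_embedding` is hypothetical.

```python
def pool_text_embedding(tokens, mask, mode="mean"):
    """Reduce an L x D list of token embeddings to one D-dim vector.

    tokens: L x D list of token embeddings (padded to a fixed length L).
    mask:   L flags, 1 for a real token and 0 for padding (assumed).
    """
    kept = [tok for tok, flag in zip(tokens, mask) if flag]
    dims = range(len(tokens[0]))
    if mode == "mean":
        # Every real token contributes equally to the caption embedding.
        return [sum(tok[d] for tok in kept) / len(kept) for d in dims]
    if mode == "max":
        # Only the largest activation per dimension survives.
        return [max(tok[d] for tok in kept) for d in dims]
    raise ValueError("mode must be 'mean' or 'max'")


# Two real tokens followed by one padding position.
token_embed = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]
pad_mask = [1, 1, 0]
mean_vec = pool_text_embedding(token_embed, pad_mask, mode="mean")
max_vec = pool_text_embedding(token_embed, pad_mask, mode="max")
```

Mean pooling blends all words of a caption, which is one plausible reason it suits the longer captions of the image-text data better than max pooling.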

TABLE VII: Ablations on Variants of Language-Aware Selective Fusion and Cross Attention Fusion. We ablate the variants of LASF and CAF through the zero-shot LVIS MiniVal evaluation. All models are pre-trained on the O365-100K dataset.
| # | Model | AP | APr | APc | APf |
| --- | --- | --- | --- | --- | --- |
| 0 | Baseline | 18.3 | 10.1 | 14.8 | 22.8 |
| 1 | Baseline + CAF | 18.9 | 10.4 | 16.0 | 22.9 |
| 2 | Baseline + Early-LASF | 18.8 | 9.5 | 16.1 | 22.9 |
| 3 | Baseline + Middle-LASF | 18.5 | 9.4 | 15.5 | 22.8 |
| 4 | Baseline + Later-LASF | 19.2 | 10.5 | 16.5 | 23.1 |
TABLE VIII: Ablations on Text Embedding Pooling. We ablate the different text embedding pooling methods on O365-100K and CC-100K datasets, then evaluate zero-shot performance on LVIS MiniVal. The pooling methods considered include mean and max, where mean represents mean-pooling of text embedding, and max represents max-pooling of text embedding.
# Pre-Training EmbedPool AP APr APc APf
mean max
0 O365 19.0 11.8 15.7 23.3
1 O365 18.9 10.7 15.1 23.7
2 O365, CC 21.4 13.5 18.3 25.5
3 O365, CC 22.0 14.0 20.0 25.2

Data Source of Caption Box. In Table IX, we compare the performance of different data sources for Caption Box. We conducted the comparison by selecting the bottom and top 100K samples based on the image-text similarity of CLIP, as well as a random 100K sample. The results show that the rank_top data source yields the best performance, while the rank_bottom performs the worst. This highlights the inevitable noise in the image-text dataset and emphasizes the necessity of our filtering operation.

4.5 Qualitative Results

Visualization on COCO. We present visualization results derived from the pre-trained OV-DINO. Figure 6 shows the results of zero-shot inference on the COCO dataset, where only the box predictions with a confidence score exceeding the threshold of 0.5 are displayed. We also compare with the predictions of GLIP [1] and G-DINO [3]. The first column depicts the image with ground truth, the second and third columns show the predictions of GLIP-T(B) and G-DINO-T3, and the last column shows the predictions of OV-DINO2. The visualization shows that OV-DINO produces more precise predictions with higher confidence scores and is adept at detecting small objects. These findings demonstrate the zero-shot transfer capability of OV-DINO in detecting the objects specified by the language text input.

Visualization on LVIS. We also present visualization results derived from the pre-trained OV-DINO3. Figure 7 illustrates the visualization results of zero-shot inference on the LVIS dataset. The LVIS dataset is a long-tail dataset with over 1200 categories, which can lead to numerous predictions in an image. For a clear visualization, we only display the box predictions with scores higher than 0.5. OV-DINO demonstrates exceptional performance in detecting a diverse range of categories, resulting in highly accurate predictions.

TABLE IX: Ablations on the Data Source of Caption Box. We ablate different data sources of the image-text dataset and evaluate the zero-shot performance on LVIS MiniVal. Three data sources are considered: random_select randomly selects 100K samples, while rank_bottom and rank_top retain the bottom 100K and the top 100K samples, respectively, of the image-text pairs sorted in descending order of similarity.
# Data Source AP APr APc APf
0 rank_bottom 19.6 9.5 16.7 24.0
1 random_select 20.8 11.6 18.1 24.8
2 rank_top 22.0 14.0 20.0 25.2
Figure 6: Comparison of Visualization Results for Zero-Shot Inference on COCO. We visualize the predictions of GLIP [1], G-DINO [3] and the proposed OV-DINO. The failures are highlighted with a yellow circle. OV-DINO is capable of detecting all objects defined by COCO, and it can even detect additional objects that have not been labeled in the annotation (red circle).
Figure 7: Visualized Results for Zero-Shot Inference on LVIS. We visualize the predictions of OV-DINO, which shows a diverse range of instances being detected. Best viewed in zoom.

5 Discussions

Conclusions. In this paper, we present OV-DINO, a robust unified open-vocabulary detector that aims to improve the performance of open-vocabulary detection. We propose a unified data integration pipeline to efficiently integrate various data sources, enabling end-to-end training with a unified detection framework for consistency and coherence. Additionally, we introduce a language-aware selective fusion module to selectively fuse cross-modality information, thereby improving the overall performance of OV-DINO through dynamic fusion of multi-modal data. Experimental results demonstrate that OV-DINO outperforms previous state-of-the-art methods when evaluated on the challenging COCO and LVIS benchmarks.

Limitations. Despite the remarkable performance of OV-DINO as a unified open-vocabulary detection method, some challenges and limitations remain to be addressed. One limitation is that OV-DINO has not yet been scaled up with a larger encoder and more extensive datasets. Scaling up is a promising direction for improving the performance and applicability of the open-vocabulary detection model. However, the pre-training stage requires substantial computational resources, which may present a barrier to scalability. Therefore, it is essential to strategically optimize the training process to facilitate the advancement of open-vocabulary tasks.

Broader Impact. In our research, we explore the detection-centric pre-training for open-vocabulary detection (OVD), which differs from the traditional approach of custom-designing for various data sources. Additionally, we introduce the concept of language-aware cross-modality fusion and alignment, marking a departure from the conventional method of simple region-concept alignment. Consequently, our research provides an innovative perspective for OVD. We expect that OV-DINO will encourage further exploration of ways to effectively leverage language-aware cross-modality information for open-vocabulary vision tasks.

References

  • [1] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
  • [2] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. H. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, “Glipv2: Unifying localization and vision-language understanding,” in Advances in Neural Information Processing Systems, 2022.
  • [3] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
  • [4] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [7] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229.
  • [8] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv preprint arXiv:2203.03605, 2022.
  • [9] A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-shot object detection,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 384–400.
  • [10] A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-vocabulary object detection using captions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 393–14 402.
  • [11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
  • [12] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning. PMLR, 2021, pp. 4904–4916.
  • [13] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: fine-grained interactive language-image pre-training,” arXiv preprint arXiv:2111.07783, 2021.
  • [14] Y. Long, Y. Wen, J. Han, H. Xu, P. Ren, W. Zhang, S. Zhao, and X. Liang, “Capdet: Unifying dense captioning and open-world detection pretraining,” 2023.
  • [15] Y. Xu, M. Zhang, C. Fu, P. Chen, X. Yang, K. Li, and C. Xu, “Multi-modal queried object detection in the wild,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [16] C. Feng, Y. Zhong, Z. Jie, X. Chu, H. Ren, X. Wei, W. Xie, and L. Ma, “Promptdet: Towards open-vocabulary detection using uncurated images,” in European Conference on Computer Vision. Springer, 2022, pp. 701–717.
  • [17] L. Yao, J. Han, Y. Wen, X. Liang, D. Xu, W. Zhang, Z. Li, C. Xu, and H. Xu, “Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection,” arXiv preprint arXiv:2209.09407, 2022.
  • [18] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2024.
  • [19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
  • [20] A. Gupta, P. Dollár, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” Computer Vision and Pattern Recognition, 2019.
  • [21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  • [22] P. Ren, C. Li, G. Wang, Y. Xiao, Q. Du, X. Liang, and X. Chang, “Beyond fixation: Dynamic window visual transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 987–11 997.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [24] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
  • [25] H. Wang, W. Wang, and J. Liu, “Temporal memory attention for video semantic segmentation,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 2254–2258.
  • [26] P. Ren, C. Li, H. Xu, Y. Zhu, G. Wang, J. Liu, X. Chang, and X. Liang, “Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency,” arXiv preprint arXiv:2302.10307, 2023.
  • [27] S. Wu, W. Zhang, S. Jin, W. Liu, and C. C. Loy, “Aligning bag of regions for open-vocabulary object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 254–15 264.
  • [28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
  • [30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [31] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
  • [32] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
  • [33] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25 278–25 294, 2022.
  • [34] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
  • [35] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert: A simple and performant baseline for vision and language,” arXiv preprint arXiv:1908.03557, 2019.
  • [36] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic vision linguistic representations for vision-and-language tasks,” Advances in neural information processing systems, vol. 32, 2019.
  • [37] L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu, “Normalized and geometry-aware self-attention network for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 327–10 336.
  • [38] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 684–699.
  • [39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning. PMLR, 2015, pp. 2048–2057.
  • [40] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
  • [41] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li, “Dynamic fusion with intra-and inter-modality attention flow for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6639–6648.
  • [42] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 121–137.
  • [43] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, and F. Wei, “Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,” arXiv preprint arXiv:2111.02358, 2021.
  • [44] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12 888–12 900.
  • [45] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning. PMLR, 2023, pp. 19 730–19 742.
  • [46] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 793–16 803.
  • [47] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” arXiv preprint arXiv:2104.13921, 2021.
  • [48] L. Yao, J. Han, X. Liang, D. Xu, W. Zhang, Z. Li, and H. Xu, “Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 497–23 506.
  • [49] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.
  • [50] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021.
  • [51] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
  • [52] C. Lin, P. Sun, Y. Jiang, P. Luo, L. Qu, G. Haffari, Z. Yuan, and J. Cai, “Learning object-language alignments for open-vocabulary object detection,” arXiv preprint arXiv:2211.14843, 2022.
  • [53] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in European Conference on Computer Vision. Springer, 2022, pp. 350–368.
  • [54] C. F. Van Loan, “The ubiquitous kronecker product,” Journal of computational and applied mathematics, vol. 123, no. 1-2, pp. 85–100, 2000.
  • [55] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 619–13 627.
  • [56] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [57] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 658–666.
  • [58] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8430–8439.
  • [59] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1780–1790.
  • [60] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649.
  • [61] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [62] A. Dave, P. Dollár, D. Ramanan, A. Kirillov, and R. Girshick, “Evaluating large-vocabulary object detectors: The devil is in the details,” arXiv preprint arXiv:2102.01066, 2021.
  • [63] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
  • [64] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7373–7382.